Senior Data Center Infrastructure Engineer
Your Role
Serve as a Subject Matter Expert (SME) for large-scale infrastructure operations, sharing expertise, documenting best practices, and conducting root-cause analyses for high-impact or recurring incidents.
Lead incident management, response coordination, troubleshooting, and proactive customer communication during system outages and production incidents.
Facilitate regular sync-up meetings with stakeholders to communicate updates, clarify issues, and gather customer feedback.
Analyze and report operational metrics to drive informed decision-making and continuous process improvements.
Develop and enhance operational tools and automated solutions to increase efficiency and reduce operational overhead.
Document comprehensive operational procedures, configurations, and environment setups.
Identify and eliminate operational toil by automating repetitive tasks and optimizing processes.
Train junior engineers in your areas of expertise.
Participate in a 24x7 shift rotation.
Your Qualifications
Bachelor's Degree in Information Technology, Engineering, or a related field.
Minimum of 5 years of experience supporting critical, high-availability production environments, focusing on automation and operational enhancements.
Strong verbal and written communication skills.
Required Technical Skills
Minimum of 5 years’ experience with 1-2 tools per domain:
Linux Systems Administration: RHEL, CentOS, Ubuntu, or similar Unix-based OS
Version Control: Git, GitHub, GitLab
Networking: Core networking principles, Load Balancing, Reverse Proxies (Nginx), Software-Defined Networks (SDN), DHCP, DNS
Plus Points
Relevant certifications in any of the key skills (e.g., CKA, CKAD, AWS certifications).
Proven experience working in collaborative, cross-functional teams within structured processes that follow modern DevOps practices and workflows.
Proven ability to improve operational efficiency through the development of automation tools and scripts, leveraging languages such as Bash and Python to streamline workflows, reduce manual toil, and enhance system reliability.
You'll be part of a driven and passionate Site Reliability Engineering team that deploys some of the largest and most diverse data center and distributed computing platforms. Our team is known for excellent operational work. These are people who take data center and platform operations challenges head-on, who are not afraid of doing the dirty work, and who continuously look for ways to make things better and more efficient.
Our team is made up of individuals who are aligned with OpsWerks’ values. In the spirit of building a healthy community, which requires open and honest communication, here are our expectations for every one of us at OpsWerks:
To uphold OpsWerks’ Mission and Methods.
To know, believe, and execute each team’s mission plan.
To grow in the 4 awarenesses (self, others, surroundings, and situation).
To take ownership of your personal growth for the team’s well-being.
To never give up, to never give in… only to give your best.
Apply for the job!
Ready to start your awesome journey and be part of OpsWerks?