Senior Site Reliability Engineer - Infrastructure
Serve as a subject matter expert for infrastructure operations at scale by sharing knowledge among peers, documenting best practices, and performing root cause analysis of recurring or high-impact incidents.
Lead incident response, triaging, and customer communications during system outages and production issues.
Lead and run periodic sync-up meetings with our customers to discuss updates, get clarity, and solicit feedback.
Gather and analyze operations metrics regularly to make informed decisions in driving operational and process improvements.
Develop new tools and automated solutions, or contribute to existing ones, to improve operational efficiency.
Create comprehensive documentation for operational procedures and environment setup.
Eliminate operational toil through automation or process improvements.
Be a member of a 24x7 shift rotation.
Bachelor’s degree in any Information Technology or Engineering field.
Demonstrated ability in supporting critical production services and improving operations through automation and process enhancements.
At least 5 years of experience working with technologies in any of the following domains:
Linux Systems Administration: RHEL, CentOS, Ubuntu or other *Nix systems
Container Orchestration and Scheduling: Docker, Kubernetes or similar
Infrastructure as Code: Puppet, Ansible, Chef, Terraform or similar
Logging and monitoring: Prometheus, Grafana, Splunk, MRTG
Version Management and CI/CD: Git, Jenkins or similar
Networking: Core Networking Concepts, Load Balancing (NetScaler, F5 or similar), Reverse Proxy (Nginx), Software Defined Networks (SDN), DHCP, DNS
Security: SSL/TLS certificates, firewalls, ACLs
Strong experience managing data center lifecycle practices (build, operate, decommission).
Experience leading infrastructure-related projects to success.
Solid understanding of distributed computing principles, platform operations, and best practices in Data Engineering and DevOps workflows.
Relevant certifications in key skill sets: Linux (RHCSA, RHCE, LPIC), Networking (CCNP), Kubernetes (CKA, CKAD), or Cloud Computing (AWS).
You'll be part of a driven and passionate Site Reliability Engineering Team that deploys some of the most diverse and massive Cloud platforms in the industry.
Our team is known for excellent operational work. These are people who take on data center and platform operations challenges head-on, are not afraid of doing the dirty work, and continuously look for ways to make things better and more efficient.
Our team is made up of individuals who are aligned with OpsWerks’ values. In the spirit of building a healthy community, which requires open and honest communication, here are our expectations for every one of us at OpsWerks:
To uphold OpsWerks’ Mission and Methods.
To know, believe, and execute each team’s mission plan.
To grow in the four areas of awareness (self, others, surroundings, and situation).
To take ownership of your personal growth for the team’s well-being.
To never give up, to never give in, and to always give your best.
Ready to start your awesome journey and be part of OpsWerks?