Senior Site Reliability Engineer - Cloud Applications

At OpsWerks, we're looking for a Senior Site Reliability Engineer - Cloud Applications to join our team.

Full-time · Manila

Your Role

Serve as the subject matter expert for distributed application systems hosted on hybrid cloud platforms.
Champion and drive operational improvements using insights from metrics and customer feedback.
Lead incident response and post-incident reviews.
Communicate complex topics to development teams to investigate and document issues, and lead the internal team in developing solutions to mitigate them.
Manage and maintain enterprise applications and cloud-based systems using tools and frameworks designed for secure and scalable in-house deployments.
Monitor and optimize the health and performance of applications and platforms.
Debug problems reported by partners and end-users using in-depth log analysis and stack traces.
Create comprehensive documentation for operational procedures and environment setup.
Eliminate operational toil through automation or process improvements.
Participate in a 24x7 shift rotation.

Your Qualifications

Bachelor’s degree in Information Technology, Engineering, or a related field.
Demonstrated ability to support critical production services and improve operations through automation and process enhancements.
Subject matter expertise in Platform as a Service (PaaS) support, distributed systems, and microservices, particularly hosted services such as content delivery, messaging, API gateways, and proxies.
Strong communication skills, both written and verbal.
At least 5 years’ experience working with the following:
- Linux Administration: RHEL, CentOS, or other Unix-like systems
- Server and Infrastructure Troubleshooting: hardware and OS configuration
- Logging and Monitoring: Splunk, Grafana, Prometheus
- Containers and Orchestration: Docker, Kubernetes
- Incident Management: PagerDuty, ServiceNow
- APIs and Data Serialization Formats: JSON, YAML
At least 3 years’ experience working with the following:
- Distributed Application Support: supporting multiple applications running as microservices
- Version Control and CI/CD: Git, Spinnaker
- Configuration Management: Puppet, Ansible, Salt

Plus Points

Relevant certifications in any of the key skills (e.g., CKA or CKAD).

Your Team

You'll be part of a driven and passionate Site Reliability Engineering team that deploys some of the most diverse and massive cloud platforms in the industry. The team is known for excellent operational work: these are people who take data center and platform operation challenges head-on, aren't afraid of doing the dirty work, and continuously look for ways to make things more efficient and better.

Our team is made up of individuals who are aligned with OpsWerks’ values. In the spirit of building a healthy community, which requires open and honest communication, here are our expectations for every one of us at OpsWerks:

To uphold OpsWerks’ Mission and Methods.
To know, believe, and execute each team’s mission plan.
To grow in the four awarenesses (self, others, surroundings, and situation).
To take ownership of your personal growth for the team’s well-being.
To never give up, to never give in… and to always give your best.

Apply for the job!

Ready to start your awesome journey and be part of OpsWerks?
