Senior Data Platform Reliability Engineer

Full-time · Manila

OpsWerks is a technical consulting company specializing in operational services for the high-tech industry. We help platform and infrastructure teams operate multi-cloud environments, execute complex migrations, and enable seamless app deployments.

Your Role

Run managed services, not just systems. Operate multi-tenant data/AI platforms (Spark, Airflow, Flink, Jupyter) with clear SLAs/SLIs/SLOs, cost guardrails, and capacity plans across AWS/GCP + Kubernetes.
Be the face of reliability. Lead incidents end-to-end, own customer comms and post-incident reviews (RCA with actions customers can see and feel).
Design for customer experience. Help data scientists and customers reduce failed/slow jobs, improve time-to-data, and optimize costs, so customers notice faster pipelines and fewer surprises.
Standardize & scale. Build service runbooks, golden paths, and automation that make onboarding and daily ops predictable across customers.
Automate the toil away. Ship tooling (Bash/Python, GitOps, CI/CD) for backups, DR drills, upgrades, access, and environment bootstrapping.
Make signals meaningful. Instrument platforms with metrics/logs/traces; tune alerting to cut noise and improve detection and response times.
Govern change. Plan and execute upgrades/migrations within change windows; champion safe deploys and rollback strategies.
Partner & mentor. Guide junior engineers; collaborate with customer dev/data teams to unblock delivery and raise the reliability bar.
Participate in on-call. Join a 24x7 rotation with crisp handoffs and playbooks.

Your Qualifications

Hands-on support for ETL/ELT, SQL, and production pipelines/workflows.
Strong experience in at least one of Spark, Airflow, Flink, or Jupyter (plus the ecosystem around it).
Solid working knowledge of at least one language (Python, Java, or Scala) for automation, data manipulation, and orchestration.
Real-world experience with AWS or GCP in production environments, as a user or administrator.
Kubernetes (or Docker) for scheduling/scale.
Incident management, post-incident reviews, change management, and service reporting.
Clear customer-facing comms (status updates, RCAs, runbooks).
5+ years across the domains above, with depth in one or two tools per domain.

Plus Points

Certifications: CKA/CKAD, AWS (Associate/Professional), or equivalent.
IaC & DevOps: Terraform, Helm, Argo CD/GitOps, CI/CD for data platforms.
Observability & ITSM: Prometheus/Grafana/Datadog; Jira Service Management/ServiceNow, StatusPage.
Security & compliance basics (least-privilege access, audit trails).

Ready to start your awesome journey and be part of OpsWerks?

Apply now