Senior Data Platform Reliability Engineer

Full-time · Manila

OpsWerks is a technical consulting company specializing in operational services for the high-tech industry. We help platform and infrastructure teams operate multi-cloud environments, execute complex migrations, and enable seamless app deployments.

Your Role

  • Run managed services, not just systems. Operate multi-tenant data/AI platforms (Spark, Airflow, Flink, Jupyter) with clear SLAs/SLIs/SLOs, cost guardrails, and capacity plans across AWS/GCP + Kubernetes.

  • Be the face of reliability. Lead incidents end-to-end, own customer comms and post-incident reviews (RCA with actions customers can see and feel).

  • Design for customer experience. Help data scientists and customers reduce failed/slow jobs, improve time-to-data, and optimize costs, so customers notice faster pipelines and fewer surprises.

  • Standardize & scale. Build service runbooks, golden paths, and automation that make onboarding and daily ops predictable across customers.

  • Automate the toil away. Ship tooling (Bash/Python, GitOps, CI/CD) for backups, DR drills, upgrades, access, and environment bootstrapping.

  • Make signals meaningful. Instrument platforms with metrics/logs/traces; tune alerting to cut noise and improve detection and response times.

  • Govern change. Plan and execute upgrades/migrations within change windows; champion safe deploys and rollback strategies.

  • Partner & mentor. Guide junior engineers; collaborate with customer dev/data teams to unblock delivery and raise the reliability bar.

  • Participate in on-call. Join a 24x7 rotation with crisp handoffs and playbooks.

Your Qualifications

  • Background: Bachelor’s in IT/Engineering (or equivalent practical experience).

  • Data operations: Hands-on support for ETL/ELT, SQL, and production pipelines/workflows.

  • Platform depth: Strong experience in at least one of Spark, Airflow, Flink, or Jupyter (plus the ecosystem around it).

  • Scripting/Programming: Solid working knowledge of at least one language (Python, Java, or Scala) for automation, data manipulation, and orchestration.

  • Cloud & containers: Real-world production use of AWS or GCP as a user or administrator; Kubernetes (or Docker) for scheduling/scale.

  • Ops craft: Incident management, post-incident reviews, change management, and service reporting.

  • Communication: Clear customer-facing comms (status updates, RCAs, runbooks).

  • Tenure: 5+ years across the domains above, with depth in at least 1–2 tools per domain.

Plus Points

  • Certifications: CKA/CKAD, AWS (Associate/Professional), or equivalent.

  • IaC & DevOps: Terraform, Helm, Argo CD/GitOps, CI/CD for data platforms.

  • Observability & ITSM: Prometheus/Grafana/Datadog; Jira Service Management/ServiceNow, StatusPage.

  • Security & compliance: Basics such as least-privilege access and audit trails.

Ready to start your awesome journey and be part of OpsWerks?