Senior Data Platform Reliability Engineer
OpsWerks is a technical consulting company specializing in operational services for the high-tech industry. We help platform and infrastructure teams operate multi-cloud environments, execute complex migrations, and enable seamless app deployments.
Your Role
Run managed services, not just systems. Operate multi-tenant data/AI platforms (Spark, Airflow, Flink, Jupyter) with clear SLAs/SLIs/SLOs, cost guardrails, and capacity plans across AWS/GCP + Kubernetes.
Be the face of reliability. Lead incidents end-to-end, own customer comms and post-incident reviews (RCA with actions customers can see and feel).
Design for customer experience. Help data scientists and customers reduce failed or slow jobs, improve time-to-data, and optimize costs, so customers notice faster pipelines and fewer surprises.
Standardize & scale. Build service runbooks, golden paths, and automation that make onboarding and daily ops predictable across customers.
Automate the toil away. Ship tooling (Bash/Python, GitOps, CI/CD) for backups, DR drills, upgrades, access, and environment bootstrapping.
Make signals meaningful. Instrument platforms with metrics/logs/traces; tune alerting to cut noise and improve detection and response times.
Govern change. Plan and execute upgrades/migrations within change windows; champion safe deploys and rollback strategies.
Partner & mentor. Guide junior engineers; collaborate with customer dev/data teams to unblock delivery and raise the reliability bar.
Participate in on-call. Join a 24x7 rotation with crisp handoffs and playbooks.
Your Qualifications
Background: Bachelor’s in IT/Engineering (or equivalent practical experience).
Data operations: Hands-on support for ETL/ELT, SQL, and production pipelines/workflows.
Platform depth: Strong experience in at least one of Spark, Airflow, Flink, or Jupyter (plus the ecosystem around it).
Scripting/Programming: Solid working knowledge of at least one language (Python, Java, or Scala) for automation, data manipulation, and orchestration.
Cloud & containers: Real-world production use of AWS or GCP as a user or administrator; Kubernetes (or Docker) for scheduling/scale.
Ops craft: Incident management, post-incident reviews, change management, and service reporting.
Communication: Clear customer-facing comms (status updates, RCAs, runbooks).
Tenure: 5+ years across the domains above, with depth in at least 1–2 tools per domain.
Plus Points
Certifications: CKA/CKAD, AWS (Associate/Professional), or equivalent.
IaC & DevOps: Terraform, Helm, Argo CD/GitOps, CI/CD for data platforms.
Observability & ITSM: Prometheus/Grafana/Datadog; Jira Service Management/ServiceNow, StatusPage.
Security & compliance: Basics such as least-privilege access and audit trails.
Ready to start your awesome journey and be part of OpsWerks?