Service Delivery and Incident Response Lead

Full-time · Manila

OpsWerks is a technical consulting company specializing in operational services for the high-tech industry. We help platform and infrastructure teams operate multi-cloud environments, execute complex migrations, and enable seamless app deployments.

About the job

We’re looking for a Service Delivery & Incident Response Lead who thrives at the intersection of people leadership, operational reliability, and continuous improvement. You’ll lead engineers supporting mission-critical cloud and infrastructure environments, ensuring stability, responsiveness, and operational excellence 24×7.

This role combines real-time incident command with team development, process optimization, and cross-functional collaboration to keep our systems and our team performing at their best.

Your Role

People & Team Leadership

  • Lead, coach, and mentor IT engineers to build strong technical and leadership capabilities.

  • Set clear performance goals aligned with our Beliefs, Vision, Mission, Methods (BVMM).

  • Conduct 1:1s, performance reviews, and career growth discussions.

  • Foster a culture of ownership, collaboration, and continuous learning.

  • Maintain balanced workloads, shift coverage, and clear succession plans to sustain healthy 24×7 operations.

Service Operations & Reliability

  • Oversee daily service health, capacity, and reliability across all supported environments.

  • Ensure compliance with operational KPIs through proactive planning and improvement.

  • Balance demand vs. capacity and manage shift coverage to prevent burnout.

  • Partner with engineering teams to maintain runbooks, knowledge bases, and escalation paths.

  • Drive automation and workflow optimization to reduce manual overhead.

  • Use data insights to guide decisions and improvements.

Incident & Problem Management

  • Lead end-to-end incident response in real time, from triage and communication through to resolution.

  • Act as Incident Commander for high-impact events across a global environment.

  • Track and improve metrics like MTTD, MTTM, and MTTR.

  • Champion blameless Post-Incident Reviews (PIRs) and translate learnings into long-term system and process improvements.

Strategic & Cross-Functional Impact

  • Represent the team in customer reviews, operational syncs, and briefings.

  • Collaborate with SREs, product owners, and partner engineers to align priorities and reliability goals.

  • Contribute to frameworks and governance initiatives.

  • Lead service onboarding/off-boarding and strengthen operational readiness checkpoints.

  • Identify and close systemic operational gaps through process and tool improvements.

Your Qualifications

  • Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related discipline.

  • 3+ years in Service Delivery, Incident Response, or Operations Leadership within enterprise-scale, 24×7 environments.

  • Proven experience managing technical teams, driving performance, and leading through critical situations.

  • Strong grounding in ITSM / ITIL principles (Incident & Problem Management).

  • Familiarity with cloud, distributed systems, or enterprise infrastructure.

  • Skilled in monitoring, alerting, and ticketing tools (e.g., PagerDuty, Datadog, Grafana, Splunk, ServiceNow).

Core Competencies

  • People and Performance Leadership

  • Incident Command and Escalation Management

  • Analytical and Problem-Solving Skills

  • Communication and Decision-Making Under Pressure

  • Root Cause and Post-Incident Analysis

  • Operational Planning and Service Governance

  • Stakeholder and Partner Management

  • IT Service Management (Incident & Problem Management)

  • Observability, Monitoring, and Automation Tools

  • Passion for People Development, Operational Discipline, and Continuous Improvement

Good to Have

  • ITIL v3 or ITIL 4 certification

  • AWS Certified SysOps Administrator

  • SRE Foundation or Crisis/Incident Management certifications

  • Background in SRE practices and operational frameworks that promote reliability and automation


What You’ll Help Us Maintain

  • Enterprise-grade reliability: Ensuring highly available, resilient systems powering global business operations.

  • Customer-grade experience: Seamless, always-on access to applications, cloud workloads, and core services.

We are here to help those next to us and in front of us live better lives.