Service Delivery and Incident Response Lead
OpsWerks is a technical consulting company specializing in operational services for the high-tech industry. We help platform and infrastructure teams operate multi-cloud environments, execute complex migrations, and enable seamless app deployments.
About the job
We’re looking for a Service Delivery & Incident Response Lead who thrives at the intersection of people leadership, operational reliability, and continuous improvement. You’ll lead engineers supporting mission-critical cloud and infrastructure environments, ensuring stability, responsiveness, and operational excellence 24×7.
This role combines real-time incident command with team development, process optimization, and cross-functional collaboration to keep our systems and our team performing at their best.
Your Role
People & Team Leadership
Lead, coach, and mentor IT engineers to build strong technical and leadership capabilities.
Set clear performance goals aligned with our Beliefs, Vision, Mission, and Methods (BVMM).
Conduct 1:1s, performance reviews, and career growth discussions.
Foster a culture of ownership, collaboration, and continuous learning.
Maintain balanced workloads, shift coverage, and clear succession plans to sustain healthy 24×7 operations.
Service Operations & Reliability
Oversee daily service health, capacity, and reliability across all supported environments.
Ensure compliance with operational KPIs through proactive planning and improvement.
Balance demand vs. capacity and manage shift coverage to prevent burnout.
Partner with engineering teams to maintain runbooks, knowledge bases, and escalation paths.
Drive automation and workflow optimization to reduce manual overhead.
Use operational data to guide decisions and prioritize improvements.
Incident & Problem Management
Lead end-to-end incident response in real time, from triage and communication through resolution.
Act as Incident Commander for high-impact events across a global environment.
Track and improve incident metrics such as MTTD, MTTM, and MTTR.
Champion blameless Post-Incident Reviews (PIRs) and translate learnings into long-term system and process improvements.
Strategic & Cross-Functional Impact
Represent the team in customer reviews, operational syncs, and briefings.
Collaborate with SREs, product owners, and partner engineers to align priorities and reliability goals.
Contribute to operational frameworks and service governance initiatives.
Lead service onboarding and offboarding, and strengthen operational readiness checkpoints.
Identify and close systemic operational gaps through process and tool improvements.
Your Qualifications
Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related discipline.
3+ years in Service Delivery, Incident Response, or Operations Leadership within enterprise-scale, 24×7 environments.
Proven experience managing technical teams, driving performance, and leading through critical situations.
Strong grounding in ITSM / ITIL principles (Incident & Problem Management).
Familiarity with cloud, distributed systems, or enterprise infrastructure.
Skilled in monitoring, alerting, and ticketing tools (e.g., PagerDuty, Datadog, Grafana, Splunk, ServiceNow).
Core Competencies
People and Performance Leadership
Incident Command and Escalation Management
Analytical and Problem-Solving Skills
Communication and Decision-Making Under Pressure
Root Cause and Post-Incident Analysis
Operational Planning and Service Governance
Stakeholder and Partner Management
IT Service Management (Incident & Problem Management)
Observability, Monitoring, and Automation Tools
Passion for People Development, Operational Discipline, and Continuous Improvement
Good to Have
ITIL v3 or ITIL 4 certification
AWS Certified SysOps Administrator
SRE Foundation or Crisis/Incident Management certifications
Background in SRE practices and operational frameworks that promote reliability and automation
What You’ll Help Us Maintain
Enterprise-grade reliability: Ensuring highly available, resilient systems powering global business operations.
Customer-grade experience: Seamless, always-on access to applications, cloud workloads, and core services.