Senior AI Platform Engineer
OpsWerks is a technical consulting company specializing in operational services for the high-tech industry. We help platform and infrastructure teams operate multi-cloud environments, execute complex migrations, and enable seamless app deployments.
Your Role
As a Senior AI Platform Engineer, you will be responsible for operating, maintaining, and continuously improving the company’s AI platforms running on Kubernetes (on-premises and/or on AWS/GCP), similar to the AIoEKS (AI on EKS) deployment frameworks and Kubeflow’s Machine Learning Toolkit.
Platform Ownership & Operations
Deploy new releases and configuration changes through GitOps/DevOps workflows
Monitor platform and service health using logs, metrics, and observability tools
Improve platform observability, operational tooling/automations, self-service capabilities and reliability practices to reduce recurring issues
Participate in incident response, root cause analysis and 24x7 operational rotations
User & Developer Experience
Investigate and troubleshoot user concerns by correlating them to system-related issues, broken integrations, and/or user-specific errors or misconfigurations, through to recommending and executing resolutions
Advocate for platform standards, security best practices, and operational excellence
Collaboration and Leadership
Provide structured Python mentorship to junior engineers, focusing on strong fundamentals and bridging foundational Python knowledge toward MLOps competencies
Lead the adoption of MLOps best practices for the team
Influence the team roadmap by identifying gaps in tooling, skills, and processes required to support production-grade AI systems
Your Qualifications
3+ years of experience supporting production workloads/platforms (Ray.IO, Jupyter Notebooks, AWS SageMaker, Kubeflow AI Tools or an AI-related equivalent)
5+ years of hands-on experience with the AI/ML lifecycle (development/deployment, DevOps/MLOps)
5+ years of Python experience developing and supporting AI/ML workflows and data engineering pipelines
Practical skills in Kubernetes environments, including cloud-provider-managed Kubernetes flavors (AWS EKS/GCP GKE)
Knowledge of microservice architectures and service communication patterns
Strong troubleshooting fundamentals for issues such as application crashes, resource contention, service latency, and scaling behavior
Well-rounded competency in analyzing logs, metrics, monitoring systems, and service KPIs
Plus points if you have:
Exposure to other data/AI platforms such as Flyte, Hugging Face, and AI agent platforms (Vertex AI, Claude Code, LangChain, etc.)
Hands-on experience with automation or scripting (Bash, Python)
Kubernetes or cloud certifications (CKAD, AWS)
Ready to start your awesome journey and be part of OpsWerks?