Job Description / Responsibilities:
- Design, build, and operate cloud-based runtime environments for AI/ML workloads.
- Optimize training and inference runtimes for performance, scalability, and cost.
- Develop and maintain containerized AI runtimes for batch and real-time workloads.
- Integrate AI runtimes with MLOps pipelines, CI/CD, and monitoring systems.
- Manage GPU/accelerator scheduling, autoscaling, and resource optimization.
- Ensure security, observability, and high availability of AI runtime platforms.
- Collaborate with ML engineering, platform, and SRE teams.
- Document runtime architectures, deployment standards, and operational playbooks.
Required Skills / Qualifications:
- Strong experience in cloud, platform, or infrastructure engineering.
- Proficiency in Python and/or Go.
- Hands-on experience with Kubernetes, containers, and cloud-native runtimes.
- Experience running AI/ML workloads in production.
- Familiarity with AWS, Azure, or GCP AI services.
- Knowledge of GPU/accelerator management.
- Strong troubleshooting and performance optimization skills.
Nice-to-Have:
- Experience with GenAI/LLM runtimes.
- Knowledge of service mesh, serverless, or distributed compute frameworks.
- Experience with observability tools such as Prometheus and Grafana.
- Exposure to FinOps for AI workloads.