AI Cloud Runtime Engineer

Job Description / Responsibilities:

  • Design, build, and operate cloud-based runtime environments for AI/ML workloads.
  • Optimize training and inference runtimes for performance, scalability, and cost.
  • Develop and maintain containerized AI runtimes for batch and real-time workloads.
  • Integrate AI runtimes with MLOps pipelines, CI/CD, and monitoring systems.
  • Manage GPU/accelerator scheduling, autoscaling, and resource optimization.
  • Ensure security, observability, and high availability of AI runtime platforms.
  • Collaborate with ML engineering, platform, and SRE teams.
  • Document runtime architectures, deployment standards, and operational playbooks.

Required Skills / Qualifications:

  • Strong experience in cloud, platform, or infrastructure engineering.
  • Proficiency in Python and/or Go.
  • Hands-on experience with Kubernetes, containers, and cloud-native runtimes.
  • Experience running AI/ML workloads in production.
  • Familiarity with AWS, Azure, or GCP AI services.
  • Knowledge of GPU/accelerator management.
  • Strong troubleshooting and performance optimization skills.

Nice-to-Have:

  • Experience with GenAI/LLM runtimes.
  • Knowledge of service mesh, serverless, or distributed compute frameworks.
  • Experience with observability tools such as Prometheus and Grafana.
  • Exposure to FinOps for AI workloads.
