Posted Apr 24, 2026

AI Infra Engineer – SRE (Kubernetes)

Job Category: Software Engineering
Job Type: Full Time
Job Location: Hybrid Remote

About The Role

We are a fast-growing AI infrastructure company building cutting-edge GPU cloud platforms and high-performance inference solutions that empower AI developers, startups, and enterprises worldwide. As we scale our global operations, we are looking for a skilled and hands-on AI Infra Engineer – SRE (Kubernetes) to join our Global Infrastructure team.

Role Overview

This is a critical hands-on position focused on the reliability, performance, and operational excellence of large-scale, high-performance AI/ML GPU clusters in our data centers. As an AI Infra Engineer – SRE (Kubernetes), you will design, operate, and optimize Kubernetes-based infrastructure to ensure maximum uptime, efficiency, and scalability for demanding AI workloads. You will bring deep expertise in system-level troubleshooting, GPU cluster management, and automation to keep our platforms running at peak performance.

Key Responsibilities

• Design, build, and maintain scalable, production-grade AI/ML infrastructure using Kubernetes.
• Proactively monitor GPU cluster health, performance, and utilization across compute, accelerators, storage, and networking layers, performing root-cause analysis and resolution.
• Develop and implement automation for infrastructure provisioning, configuration, and ongoing management.
• Own the complete GPU node lifecycle, including provisioning, dynamic scaling, maintenance, decommissioning, and zero-downtime upgrades of GPU-enabled nodes in Kubernetes environments.
• Build and improve CI/CD pipelines for reliable infrastructure deployment and orchestration.
• Enforce security best practices, compliance standards, and operational excellence across the infrastructure stack.
• Lead incident response and post-incident improvements for issues related to GPUs, CPUs, high-speed storage, and networks.
• Manage end-to-end customer GPU resource provisioning, from request intake and configuration to onboarding, troubleshooting, and support, ensuring high levels of customer satisfaction.
• Stay up to date with the latest GPU hardware, software, and orchestration technologies, integrating relevant advancements into our platforms.
• Be available for occasional regional or international travel to data center locations as required.

Requirements

• Bachelor’s degree in Computer Science, Engineering, or a related technical field.
• 3+ years of practical experience in data center operations, infrastructure engineering, or site reliability engineering.
• Strong background in infrastructure automation using tools such as Terraform and Ansible.
• Deep hands-on experience with Kubernetes in large-scale environments, including:
    • NVIDIA GPU Operator for GPU driver management, device plugins, container toolkit, and monitoring (DCGM).
    • NVIDIA Network Operator for high-performance networking, RDMA, and GPUDirect support.
    • CNI (Container Network Interface) and CSI (Container Storage Interface) plugins tailored for AI/ML workloads.
    • Integration with job schedulers such as Slurm in Kubernetes clusters.
• Proficiency in Linux system administration and scripting (Python, Bash).
• Experience with observability stacks including Prometheus, Grafana, and Loki.
• Solid understanding of GPU architecture, NVIDIA CUDA, NCCL, and AI/ML frameworks is a strong plus.
• Excellent troubleshooting skills with the ability to analyze complex system logs and performance metrics.
• Strong communication and collaboration skills to work effectively with engineering and operations teams.