Job Description:
• Architect and maintain our core computing platform using Kubernetes on AWS and on-premises, providing a stable, scalable environment for all applications and services.
• Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.
• Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.
• Provision, manage, and maintain our on-premises bare metal server infrastructure for high-performance GPU computing.
• Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.
• Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.
• Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.
• Automate the lifecycle of single-tenant, managed deployments.
Requirements:
• 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
• Proven, hands-on experience building and managing production infrastructure with Terraform
• Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment
• Experience with high-performance computing (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads
• Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management
• Strong scripting and automation skills (e.g., Python, Go, Bash)
Benefits:
• Medical, dental, and vision benefits
• Annual wellness stipend
• Mental health support
• Life, short-term disability (STD), and long-term disability (LTD) income insurance plans
• Unlimited PTO
• Generous paid parental leave
• Flexible schedule
• 12 paid US company holidays
• Quarterly personal productivity stipend
• One-time stipend for home office upgrades
• 401(k) plan with company match
• Tax Savings Programs
• Learning / Education stipend
• Participation in talks and conferences
• Employee Resource Groups
• AI enablement workshops / sessions