Engineering Manager, HPC Kubernetes Platform
Engineering Manager, AI Compute Platform (CaaS / GPUaaS)
Location: Dallas, TX (Relocation available)
Type: Direct Hire
• Competitive base salary + performance bonus
• 100% company-paid benefits
Overview
We are seeking an Engineering Manager, AI Compute Platform (CaaS / GPUaaS) to lead the design, scaling, and operational excellence of a next-generation compute platform delivering GPU-as-a-Service (GPUaaS) and Container-as-a-Service (CaaS) capabilities.
This platform serves as the backbone for large-scale AI/ML training, LLM workloads, and HPC applications, enabling customers to consume high-performance, GPU-accelerated infrastructure in a flexible, multi-tenant model across distributed data center environments.
You will lead a team responsible for building a bare-metal Kubernetes platform optimized for GPU workloads, ensuring high availability, performance, and efficient resource utilization at scale. This role sits at the intersection of Kubernetes, GPU infrastructure, HPC, and cloud-like service delivery models.
This is a hands-on leadership role focused on platform architecture, performance engineering, and automation, with direct impact on how GPU compute is delivered as a scalable service.
Key Responsibilities
Leadership & Team Development
- Lead, mentor, and grow a team of engineers building next-generation AI compute platforms
- Foster a culture of ownership, reliability, and continuous improvement
- Drive alignment across platform, infrastructure, and product teams
Platform Architecture: GPUaaS / CaaS
- Architect and scale a bare-metal Kubernetes platform delivering GPUaaS and CaaS capabilities
- Design multi-tenant compute environments with strong isolation, scheduling, and resource governance
- Define service models for GPU consumption, including workload orchestration, tenancy, and quota management
GPU Platform & Workload Optimization
- Optimize GPU scheduling, sharing, and utilization across large-scale clusters (MIG, device plugins, scheduler extensions)
- Support AI/ML training, LLM workloads, and HPC use cases with high throughput and low latency
- Ensure efficient workload placement across hybrid orchestration models (Kubernetes + HPC schedulers)
Automation, SRE & Platform Operations
- Drive Infrastructure-as-Code (Terraform, Ansible) and GitOps-based CI/CD practices
- Implement SRE principles across observability, reliability, and incident response
- Build automated workflows for cluster provisioning, scaling, and lifecycle management
Performance, Reliability & Capacity Planning
- Own platform performance across thousands of GPU/CPU nodes
- Define and track KPIs for utilization, latency, throughput, and system health
- Lead capacity planning aligned with rapid AI compute demand growth
Cross-Functional & Ecosystem Collaboration
- Partner with storage and networking teams to integrate distributed filesystems and high-speed interconnects (InfiniBand, RoCE)
- Collaborate with hardware and software vendors (e.g., NVIDIA ecosystem) to optimize platform capabilities
- Align platform architecture with evolving AI infrastructure and GPUaaS service offerings
Required Experience
- 7+ years of experience in platform engineering, infrastructure engineering, or SRE, with 2+ years in leadership
- Proven experience building or operating Kubernetes platforms for GPU-intensive workloads (AI/ML or HPC)
- Experience designing or supporting CaaS, GPUaaS, or multi-tenant compute platforms
- Deep understanding of GPU scheduling, workload orchestration, and resource isolation
- Strong expertise in Linux systems, networking, and performance engineering on bare-metal infrastructure
- Experience managing large-scale, distributed compute environments
- Strong experience with automation tools (Terraform, Ansible) and observability stacks (Prometheus, Grafana, Loki)
- Excellent leadership and communication skills
Preferred Experience
- Experience with NVIDIA ecosystem (GPU Operator, DCGM, MIG, device plugins)
- Familiarity with HPC schedulers (Slurm, Flux, Volcano) and hybrid orchestration models
- Experience with container runtimes (containerd, CRI-O) and Kubernetes internals
- Contributions to open-source Kubernetes, AI infrastructure, or HPC platforms
- Experience operating in hyperscale or AI-first infrastructure environments
Why This Role
- Direct ownership of a GPUaaS / CaaS platform at scale
- Work at the forefront of AI infrastructure and high-performance compute
- Opportunity to define how GPU compute is delivered as a service in next-generation environments
- High-visibility role with impact across platform, product, and customer experience