Engineering Manager, HPC Kubernetes Platform
Engineering Manager, AI Compute Platform (CaaS / GPUaaS)
Location: Dallas, TX (Relocation available)
Type: Direct Hire
- Competitive base salary + performance bonus
- 100% company-paid benefits
Overview
We are seeking an Engineering Manager, AI Compute Platform (CaaS / GPUaaS) to lead the design, scaling, and operational excellence of a next-generation compute platform delivering GPU-as-a-Service (GPUaaS) and Container-as-a-Service (CaaS) capabilities.
This platform serves as the backbone for large-scale AI/ML training, LLM workloads, and HPC applications, enabling customers to consume high-performance, GPU-accelerated infrastructure in a flexible, multi-tenant model across distributed data center environments.
You will lead a team responsible for building a bare-metal Kubernetes platform optimized for GPU workloads, ensuring high availability, performance, and efficient resource utilization at scale. This role sits at the intersection of Kubernetes, GPU infrastructure, HPC, and cloud-like service delivery models.
This is a hands-on leadership role focused on platform architecture, performance engineering, and automation, with direct impact on how GPU compute is delivered as a scalable service.
Key Responsibilities
Leadership & Team Development
- Lead, mentor, and grow a team of engineers building next-generation AI compute platforms
- Foster a culture of ownership, reliability, and continuous improvement
- Drive alignment across platform, infrastructure, and product teams
Platform Architecture - GPUaaS / CaaS
- Architect and scale a bare-metal Kubernetes platform delivering GPUaaS and CaaS capabilities
- Design multi-tenant compute environments with strong isolation, scheduling, and resource governance
- Define service models for GPU consumption, including workload orchestration, tenancy, and quota management
GPU Platform & Workload Optimization
- Optimize GPU scheduling, sharing, and utilization across large-scale clusters (MIG, device plugins, scheduler extensions)
- Support AI/ML training, LLM workloads, and HPC use cases with high throughput and low latency
- Ensure efficient workload placement across hybrid orchestration models (Kubernetes + HPC schedulers)
Automation, SRE & Platform Operations
- Drive Infrastructure-as-Code (Terraform, Ansible) and GitOps-based CI/CD practices
- Implement SRE principles across observability, reliability, and incident response
- Build automated workflows for cluster provisioning, scaling, and lifecycle management
Performance, Reliability & Capacity Planning
- Own platform performance across thousands of GPU/CPU nodes
- Define and track KPIs for utilization, latency, throughput, and system health
- Lead capacity planning aligned with rapid AI compute demand growth
Cross-Functional & Ecosystem Collaboration
- Partner with storage and networking teams to integrate distributed filesystems and high-speed interconnects (InfiniBand, RoCE)
- Collaborate with hardware and software vendors (e.g., NVIDIA ecosystem) to optimize platform capabilities
- Align platform architecture with evolving AI infrastructure and GPUaaS service offerings
Required Experience
- 7+ years of experience in platform engineering, infrastructure engineering, or SRE, with 2+ years in leadership
- Proven experience building or operating Kubernetes platforms for GPU-intensive workloads (AI/ML or HPC)
- Experience designing or supporting CaaS, GPUaaS, or multi-tenant compute platforms
- Deep understanding of GPU scheduling, workload orchestration, and resource isolation
- Strong expertise in Linux systems, networking, and performance engineering on bare-metal infrastructure
- Experience managing large-scale, distributed compute environments
- Strong experience with automation tools (Terraform, Ansible) and observability stacks (Prometheus, Grafana, Loki)
- Excellent leadership and communication skills
Preferred Experience
- Experience with the NVIDIA ecosystem (GPU Operator, DCGM, MIG, device plugins)
- Familiarity with HPC schedulers (Slurm, Flux, Volcano) and hybrid orchestration models
- Experience with container runtimes (containerd, CRI-O) and Kubernetes internals
- Contributions to open-source Kubernetes, AI infrastructure, or HPC platforms
- Experience operating in hyperscale or AI-first infrastructure environments
Why This Role
- Direct ownership of a GPUaaS / CaaS platform at scale
- Work at the forefront of AI infrastructure and high-performance compute
- Opportunity to define how GPU compute is delivered as a service in next-generation environments
- High-visibility role with impact across platform, product, and customer experience