Engineering Manager, HPC Kubernetes Platform
Engineering Manager, AI Compute Platform (CaaS / GPUaaS)
Location: Dallas, TX (Relocation available)
Type: Direct Hire
⢠Competitive base salary + performance bonus
⢠100% company-paid benefits
Overview
We are seeking an Engineering Manager, AI Compute Platform (CaaS / GPUaaS) to lead the design, scaling, and operational excellence of a next-generation compute platform delivering GPU-as-a-Service (GPUaaS) and Container-as-a-Service (CaaS) capabilities.
This platform serves as the backbone for large-scale AI/ML training, LLM workloads, and HPC applications, enabling customers to consume high-performance, GPU-accelerated infrastructure in a flexible, multi-tenant model across distributed data center environments.
You will lead a team responsible for building a bare-metal Kubernetes platform optimized for GPU workloads, ensuring high availability, performance, and efficient resource utilization at scale. This role sits at the intersection of Kubernetes, GPU infrastructure, HPC, and cloud-like service delivery models.
This is a hands-on leadership role focused on platform architecture, performance engineering, and automation, with direct impact on how GPU compute is delivered as a scalable service.
Key Responsibilities
Leadership & Team Development
- Lead, mentor, and grow a team of engineers building next-generation AI compute platforms
- Foster a culture of ownership, reliability, and continuous improvement
- Drive alignment across platform, infrastructure, and product teams
Platform Architecture - GPUaaS / CaaS
- Architect and scale a bare-metal Kubernetes platform delivering GPUaaS and CaaS capabilities
- Design multi-tenant compute environments with strong isolation, scheduling, and resource governance
- Define service models for GPU consumption, including workload orchestration, tenancy, and quota management
GPU Platform & Workload Optimization
- Optimize GPU scheduling, sharing, and utilization across large-scale clusters (MIG, device plugins, scheduler extensions)
- Support AI/ML training, LLM workloads, and HPC use cases with high throughput and low latency
- Ensure efficient workload placement across hybrid orchestration models (Kubernetes + HPC schedulers)
Automation, SRE & Platform Operations
- Drive Infrastructure-as-Code (Terraform, Ansible) and GitOps-based CI/CD practices
- Implement SRE principles across observability, reliability, and incident response
- Build automated workflows for cluster provisioning, scaling, and lifecycle management
Performance, Reliability & Capacity Planning
- Own platform performance across thousands of GPU/CPU nodes
- Define and track KPIs for utilization, latency, throughput, and system health
- Lead capacity planning aligned with rapid AI compute demand growth
Cross-Functional & Ecosystem Collaboration
- Partner with storage and networking teams to integrate distributed filesystems and high-speed interconnects (InfiniBand, RoCE)
- Collaborate with hardware and software vendors (e.g., NVIDIA ecosystem) to optimize platform capabilities
- Align platform architecture with evolving AI infrastructure and GPUaaS service offerings
Required Experience
- 7+ years of experience in platform engineering, infrastructure engineering, or SRE, with 2+ years in leadership
- Proven experience building or operating Kubernetes platforms for GPU-intensive workloads (AI/ML or HPC)
- Experience designing or supporting CaaS, GPUaaS, or multi-tenant compute platforms
- Deep understanding of GPU scheduling, workload orchestration, and resource isolation
- Strong expertise in Linux systems, networking, and performance engineering on bare-metal infrastructure
- Experience managing large-scale, distributed compute environments
- Strong experience with automation tools (Terraform, Ansible) and observability stacks (Prometheus, Grafana, Loki)
- Excellent leadership and communication skills
Preferred Experience
- Experience with the NVIDIA ecosystem (GPU Operator, DCGM, MIG, device plugins)
- Familiarity with HPC schedulers (Slurm, Flux, Volcano) and hybrid orchestration models
- Experience with container runtimes (containerd, CRI-O) and Kubernetes internals
- Contributions to open-source Kubernetes, AI infrastructure, or HPC platforms
- Experience operating in hyperscale or AI-first infrastructure environments
Why This Role
- Direct ownership of a GPUaaS / CaaS platform at scale
- Work at the forefront of AI infrastructure and high-performance compute
- Opportunity to define how GPU compute is delivered as a service in next-generation environments
- High-visibility role with impact across platform, product, and customer experience