Engineering Manager, HPC Kubernetes Platform

GTN Technical Staffing
Dallas, TX

Engineering Manager, AI Compute Platform (CaaS / GPUaaS)

Location: Dallas, TX (Relocation available)

Type: Direct Hire

• Competitive base salary + performance bonus
• 100% company-paid benefits

Overview

We are seeking an Engineering Manager, AI Compute Platform (CaaS / GPUaaS) to lead the design, scaling, and operational excellence of a next-generation compute platform delivering GPU-as-a-Service (GPUaaS) and Container-as-a-Service (CaaS) capabilities.

This platform serves as the backbone for large-scale AI/ML training, LLM workloads, and HPC applications, enabling customers to consume high-performance, GPU-accelerated infrastructure in a flexible, multi-tenant model across distributed data center environments.

You will lead a team responsible for building a bare-metal Kubernetes platform optimized for GPU workloads, ensuring high availability, performance, and efficient resource utilization at scale. This role sits at the intersection of Kubernetes, GPU infrastructure, HPC, and cloud-like service delivery models.

This is a hands-on leadership role focused on platform architecture, performance engineering, and automation, with direct impact on how GPU compute is delivered as a scalable service.

Key Responsibilities

Leadership & Team Development

  • Lead, mentor, and grow a team of engineers building next-generation AI compute platforms
  • Foster a culture of ownership, reliability, and continuous improvement
  • Drive alignment across platform, infrastructure, and product teams

Platform Architecture – GPUaaS / CaaS

  • Architect and scale a bare-metal Kubernetes platform delivering GPUaaS and CaaS capabilities
  • Design multi-tenant compute environments with strong isolation, scheduling, and resource governance
  • Define service models for GPU consumption, including workload orchestration, tenancy, and quota management

GPU Platform & Workload Optimization

  • Optimize GPU scheduling, sharing, and utilization across large-scale clusters (MIG, device plugins, scheduler extensions)
  • Support AI/ML training, LLM workloads, and HPC use cases with high throughput and low latency
  • Ensure efficient workload placement across hybrid orchestration models (Kubernetes + HPC schedulers)

Automation, SRE & Platform Operations

  • Drive Infrastructure-as-Code (Terraform, Ansible) and GitOps-based CI/CD practices
  • Implement SRE principles across observability, reliability, and incident response
  • Build automated workflows for cluster provisioning, scaling, and lifecycle management

Performance, Reliability & Capacity Planning

  • Own platform performance across thousands of GPU/CPU nodes
  • Define and track KPIs for utilization, latency, throughput, and system health
  • Lead capacity planning aligned with rapid AI compute demand growth

Cross-Functional & Ecosystem Collaboration

  • Partner with storage and networking teams to integrate distributed filesystems and high-speed interconnects (InfiniBand, RoCE)
  • Collaborate with hardware and software vendors (e.g., NVIDIA ecosystem) to optimize platform capabilities
  • Align platform architecture with evolving AI infrastructure and GPUaaS service offerings

Required Experience

  • 7+ years of experience in platform engineering, infrastructure engineering, or SRE, with 2+ years in leadership
  • Proven experience building or operating Kubernetes platforms for GPU-intensive workloads (AI/ML or HPC)
  • Experience designing or supporting CaaS, GPUaaS, or multi-tenant compute platforms
  • Deep understanding of GPU scheduling, workload orchestration, and resource isolation
  • Strong expertise in Linux systems, networking, and performance engineering on bare-metal infrastructure
  • Experience managing large-scale, distributed compute environments
  • Strong experience with automation tools (Terraform, Ansible) and observability stacks (Prometheus, Grafana, Loki)
  • Excellent leadership and communication skills

Preferred Experience

  • Experience with NVIDIA ecosystem (GPU Operator, DCGM, MIG, device plugins)
  • Familiarity with HPC schedulers (Slurm, Flux, Volcano) and hybrid orchestration models
  • Experience with container runtimes (containerd, CRI-O) and Kubernetes internals
  • Contributions to open-source Kubernetes, AI infrastructure, or HPC platforms
  • Experience operating in hyperscale or AI-first infrastructure environments

Why This Role

  • Direct ownership of a GPUaaS / CaaS platform at scale
  • Work at the forefront of AI infrastructure and high-performance compute
  • Opportunity to define how GPU compute is delivered as a service in next-generation environments
  • High-visibility role with impact across platform, product, and customer experience
Posted 2026-04-22
