Engineering Manager, HPC Kubernetes Platform

GTN Technical Staffing
Dallas, TX

Engineering Manager, AI Compute Platform (CaaS / GPUaaS)

Location: Dallas, TX (Relocation available)

Type: Direct Hire

• Competitive base salary + performance bonus
• 100% company-paid benefits

Overview

We are seeking an Engineering Manager, AI Compute Platform (CaaS / GPUaaS) to lead the design, scaling, and operational excellence of a next-generation compute platform delivering GPU-as-a-Service (GPUaaS) and Container-as-a-Service (CaaS) capabilities.

This platform serves as the backbone for large-scale AI/ML training, LLM workloads, and HPC applications, enabling customers to consume high-performance, GPU-accelerated infrastructure in a flexible, multi-tenant model across distributed data center environments.

You will lead a team responsible for building a bare-metal Kubernetes platform optimized for GPU workloads, ensuring high availability, performance, and efficient resource utilization at scale. This role sits at the intersection of Kubernetes, GPU infrastructure, HPC, and cloud-like service delivery models.

This is a hands-on leadership role focused on platform architecture, performance engineering, and automation, with direct impact on how GPU compute is delivered as a scalable service.

Key Responsibilities

Leadership & Team Development

  • Lead, mentor, and grow a team of engineers building next-generation AI compute platforms
  • Foster a culture of ownership, reliability, and continuous improvement
  • Drive alignment across platform, infrastructure, and product teams

Platform Architecture – GPUaaS / CaaS

  • Architect and scale a bare-metal Kubernetes platform delivering GPUaaS and CaaS capabilities
  • Design multi-tenant compute environments with strong isolation, scheduling, and resource governance
  • Define service models for GPU consumption, including workload orchestration, tenancy, and quota management

GPU Platform & Workload Optimization

  • Optimize GPU scheduling, sharing, and utilization across large-scale clusters (MIG, device plugins, scheduler extensions)
  • Support AI/ML training, LLM workloads, and HPC use cases with high throughput and low latency
  • Ensure efficient workload placement across hybrid orchestration models (Kubernetes + HPC schedulers)

Automation, SRE & Platform Operations

  • Drive Infrastructure-as-Code (Terraform, Ansible) and GitOps-based CI/CD practices
  • Implement SRE principles across observability, reliability, and incident response
  • Build automated workflows for cluster provisioning, scaling, and lifecycle management

Performance, Reliability & Capacity Planning

  • Own platform performance across thousands of GPU/CPU nodes
  • Define and track KPIs for utilization, latency, throughput, and system health
  • Lead capacity planning aligned with rapid AI compute demand growth

Cross-Functional & Ecosystem Collaboration

  • Partner with storage and networking teams to integrate distributed filesystems and high-speed interconnects (InfiniBand, RoCE)
  • Collaborate with hardware and software vendors (e.g., NVIDIA ecosystem) to optimize platform capabilities
  • Align platform architecture with evolving AI infrastructure and GPUaaS service offerings

Required Experience

  • 7+ years of experience in platform engineering, infrastructure engineering, or SRE, with 2+ years in leadership
  • Proven experience building or operating Kubernetes platforms for GPU-intensive workloads (AI/ML or HPC)
  • Experience designing or supporting CaaS, GPUaaS, or multi-tenant compute platforms
  • Deep understanding of GPU scheduling, workload orchestration, and resource isolation
  • Strong expertise in Linux systems, networking, and performance engineering on bare-metal infrastructure
  • Experience managing large-scale, distributed compute environments
  • Strong experience with automation tools (Terraform, Ansible) and observability stacks (Prometheus, Grafana, Loki)
  • Excellent leadership and communication skills

Preferred Experience

  • Experience with the NVIDIA ecosystem (GPU Operator, DCGM, MIG, device plugins)
  • Familiarity with HPC schedulers (Slurm, Flux, Volcano) and hybrid orchestration models
  • Experience with container runtimes (containerd, CRI-O) and Kubernetes internals
  • Contributions to open-source Kubernetes, AI infrastructure, or HPC platforms
  • Experience operating in hyperscale or AI-first infrastructure environments

Why This Role

  • Direct ownership of a GPUaaS / CaaS platform at scale
  • Work at the forefront of AI infrastructure and high-performance compute
  • Opportunity to define how GPU compute is delivered as a service in next-generation environments
  • High-visibility role with impact across platform, product, and customer experience
Posted 2026-04-22
