Engineering Manager, HPC Kubernetes Platform

GTN Technical Staffing
Dallas, TX

Engineering Manager, AI Compute Platform (CaaS / GPUaaS)

Location: Dallas, TX (Relocation available)

Type: Direct Hire

• Competitive base salary + performance bonus
• 100% company-paid benefits

Overview

We are seeking an Engineering Manager, AI Compute Platform (CaaS / GPUaaS) to lead the design, scaling, and operational excellence of a next-generation compute platform delivering GPU-as-a-Service (GPUaaS) and Container-as-a-Service (CaaS) capabilities.

This platform serves as the backbone for large-scale AI/ML training, LLM workloads, and HPC applications, enabling customers to consume high-performance, GPU-accelerated infrastructure in a flexible, multi-tenant model across distributed data center environments.

You will lead a team responsible for building a bare-metal Kubernetes platform optimized for GPU workloads, ensuring high availability, performance, and efficient resource utilization at scale. This role sits at the intersection of Kubernetes, GPU infrastructure, HPC, and cloud-like service delivery models.

This is a hands-on leadership role focused on platform architecture, performance engineering, and automation, with direct impact on how GPU compute is delivered as a scalable service.

Key Responsibilities

Leadership & Team Development

  • Lead, mentor, and grow a team of engineers building next-generation AI compute platforms
  • Foster a culture of ownership, reliability, and continuous improvement
  • Drive alignment across platform, infrastructure, and product teams

Platform Architecture – GPUaaS / CaaS

  • Architect and scale a bare-metal Kubernetes platform delivering GPUaaS and CaaS capabilities
  • Design multi-tenant compute environments with strong isolation, scheduling, and resource governance
  • Define service models for GPU consumption, including workload orchestration, tenancy, and quota management

GPU Platform & Workload Optimization

  • Optimize GPU scheduling, sharing, and utilization across large-scale clusters (MIG, device plugins, scheduler extensions)
  • Support AI/ML training, LLM workloads, and HPC use cases with high throughput and low latency
  • Ensure efficient workload placement across hybrid orchestration models (Kubernetes + HPC schedulers)

Automation, SRE & Platform Operations

  • Drive Infrastructure-as-Code (Terraform, Ansible) and GitOps-based CI/CD practices
  • Implement SRE principles across observability, reliability, and incident response
  • Build automated workflows for cluster provisioning, scaling, and lifecycle management

Performance, Reliability & Capacity Planning

  • Own platform performance across thousands of GPU/CPU nodes
  • Define and track KPIs for utilization, latency, throughput, and system health
  • Lead capacity planning aligned with rapid AI compute demand growth

Cross-Functional & Ecosystem Collaboration

  • Partner with storage and networking teams to integrate distributed filesystems and high-speed interconnects (InfiniBand, RoCE)
  • Collaborate with hardware and software vendors (e.g., NVIDIA ecosystem) to optimize platform capabilities
  • Align platform architecture with evolving AI infrastructure and GPUaaS service offerings

Required Experience

  • 7+ years of experience in platform engineering, infrastructure engineering, or SRE, with 2+ years in leadership
  • Proven experience building or operating Kubernetes platforms for GPU-intensive workloads (AI/ML or HPC)
  • Experience designing or supporting CaaS, GPUaaS, or multi-tenant compute platforms
  • Deep understanding of GPU scheduling, workload orchestration, and resource isolation
  • Strong expertise in Linux systems, networking, and performance engineering on bare-metal infrastructure
  • Experience managing large-scale, distributed compute environments
  • Strong experience with automation tools (Terraform, Ansible) and observability stacks (Prometheus, Grafana, Loki)
  • Excellent leadership and communication skills

Preferred Experience

  • Experience with the NVIDIA ecosystem (GPU Operator, DCGM, MIG, device plugins)
  • Familiarity with HPC schedulers (Slurm, Flux, Volcano) and hybrid orchestration models
  • Experience with container runtimes (containerd, CRI-O) and Kubernetes internals
  • Contributions to open-source Kubernetes, AI infrastructure, or HPC platforms
  • Experience operating in hyperscale or AI-first infrastructure environments

Why This Role

  • Direct ownership of a GPUaaS / CaaS platform at scale
  • Work at the forefront of AI infrastructure and high-performance compute
  • Opportunity to define how GPU compute is delivered as a service in next-generation environments
  • High-visibility role with impact across platform, product, and customer experience
Posted 2026-04-22
