Engineering Manager, HPC Kubernetes Platform

GTN Technical Staffing

Dallas, TX

Engineering Manager, AI Compute Platform (CaaS / GPUaaS)
Location: Dallas, TX (Relocation available)
Type: Direct Hire
• Competitive base salary + performance bonus
• 100% company-paid benefits

Overview

We are seeking an Engineering Manager, AI Compute Platform (CaaS / GPUaaS) to lead the design, scaling, and operational excellence of a next-generation compute platform delivering GPU-as-a-Service (GPUaaS) and Container-as-a-Service (CaaS) capabilities.

This platform serves as the backbone for large-scale AI/ML training, LLM workloads, and HPC applications, enabling customers to consume high-performance, GPU-accelerated infrastructure in a flexible, multi-tenant model across distributed data center environments.

You will lead a team responsible for building a bare-metal Kubernetes platform optimized for GPU workloads, ensuring high availability, performance, and efficient resource utilization at scale. This role sits at the intersection of Kubernetes, GPU infrastructure, HPC, and cloud-like service delivery models.

This is a hands-on leadership role focused on platform architecture, performance engineering, and automation, with direct impact on how GPU compute is delivered as a scalable service.

Key Responsibilities

Leadership & Team Development

Lead, mentor, and grow a team of engineers building next-generation AI compute platforms
Foster a culture of ownership, reliability, and continuous improvement
Drive alignment across platform, infrastructure, and product teams

Platform Architecture - GPUaaS / CaaS

Architect and scale a bare-metal Kubernetes platform delivering GPUaaS and CaaS capabilities
Design multi-tenant compute environments with strong isolation, scheduling, and resource governance
Define service models for GPU consumption, including workload orchestration, tenancy, and quota management

GPU Platform & Workload Optimization

Optimize GPU scheduling, sharing, and utilization across large-scale clusters (MIG, device plugins, scheduler extensions)
Support AI/ML training, LLM workloads, and HPC use cases with high throughput and low latency
Ensure efficient workload placement across hybrid orchestration models (Kubernetes + HPC schedulers)

Automation, SRE & Platform Operations

Drive Infrastructure-as-Code (Terraform, Ansible) and GitOps-based CI/CD practices
Implement SRE principles across observability, reliability, and incident response
Build automated workflows for cluster provisioning, scaling, and lifecycle management

Performance, Reliability & Capacity Planning

Own platform performance across thousands of GPU/CPU nodes
Define and track KPIs for utilization, latency, throughput, and system health
Lead capacity planning aligned with rapid AI compute demand growth

Cross-Functional & Ecosystem Collaboration

Partner with storage and networking teams to integrate distributed filesystems and high-speed interconnects (InfiniBand, RoCE)
Collaborate with hardware and software vendors (e.g., NVIDIA ecosystem) to optimize platform capabilities
Align platform architecture with evolving AI infrastructure and GPUaaS service offerings

Required Experience

7+ years of experience in platform engineering, infrastructure engineering, or SRE, with 2+ years in leadership
Proven experience building or operating Kubernetes platforms for GPU-intensive workloads (AI/ML or HPC)
Experience designing or supporting CaaS, GPUaaS, or multi-tenant compute platforms
Deep understanding of GPU scheduling, workload orchestration, and resource isolation
Strong expertise in Linux systems, networking, and performance engineering on bare-metal infrastructure
Experience managing large-scale, distributed compute environments
Strong experience with automation tools (Terraform, Ansible) and observability stacks (Prometheus, Grafana, Loki)
Excellent leadership and communication skills

Preferred Experience

Experience with the NVIDIA ecosystem (GPU Operator, DCGM, MIG, device plugins)
Familiarity with HPC schedulers (Slurm, Flux, Volcano) and hybrid orchestration models
Experience with container runtimes (containerd, CRI-O) and Kubernetes internals
Contributions to open-source Kubernetes, AI infrastructure, or HPC platforms
Experience operating in hyperscale or AI-first infrastructure environments

Why This Role

Direct ownership of a GPUaaS / CaaS platform at scale
Work at the forefront of AI infrastructure and high-performance compute
Opportunity to define how GPU compute is delivered as a service in next-generation environments
High-visibility role with impact across platform, product, and customer experience

Posted 2026-04-22

