HPC Kubernetes Solutions Architect (GPU Platforms)

GTN Technical Staffing
Dallas, TX

Location: Dallas, TX (Hybrid)

Type: Direct Hire

• Competitive base salary + performance bonus
• 100% company-paid benefits

Overview

We are seeking an HPC Kubernetes Solutions Architect to lead the design, integration, and adoption of GPU-accelerated Kubernetes platforms supporting HPC, AI/ML, simulation, and scientific workloads.

This is a highly technical, customer-facing architecture role with ownership across the full solution lifecycle—from discovery and requirements gathering through architecture design, proof-of-concept delivery, deployment, and long-term optimization. The role serves as a trusted advisor to customers while also influencing internal product and engineering direction through real-world feedback.

The ideal candidate brings deep expertise across Kubernetes, GPU orchestration, and HPC environments, along with the ability to design scalable, high-performance platforms and guide customers through complex infrastructure transformations.

Key Responsibilities

Customer Engagement & Architecture Leadership

• Serve as the primary architectural point of contact for customers adopting GPU-accelerated Kubernetes platforms
• Capture workload requirements, performance objectives, and scaling needs, translating them into reference architectures and solution designs
• Lead customer workshops, technical design sessions, and architecture reviews

Kubernetes & GPU Platform Engineering

• Architect and operate Kubernetes clusters optimized for GPU workloads using NVIDIA GPU Operator, Network Operator, DCGM, and device plugins
• Integrate Multi-Instance GPU (MIG), GPU sharing, and advanced scheduling (Volcano, Slurm integration, kube-scheduler plugins)
• Design and implement multi-tenant Kubernetes environments with strong isolation and performance guarantees

Automation & Operator Development

• Develop or extend custom Kubernetes operators and controllers using Go or Python
• Automate HPC infrastructure services and platform operations
• Support Infrastructure-as-Code and GitOps practices using Terraform, Helm, Kustomize, ArgoCD, and FluxCD

Performance Optimization & Benchmarking

• Lead proof-of-concept and benchmarking initiatives to validate performance and scalability
• Use profiling tools and workload characterization methods to optimize GPU utilization and cluster performance
• Conduct performance tuning across compute, storage, and networking layers

Integration & Infrastructure Design

• Define integration strategies across compute, storage, networking, and orchestration layers
• Support CNI integrations (NVIDIA CNI, Multus, Cilium), distributed storage (Lustre, GPFS, Ceph, VAST), and container runtimes
• Ensure seamless integration with HPC schedulers and enterprise systems

Observability & Monitoring

• Implement monitoring and telemetry solutions using Prometheus, Grafana, DCGM Exporter, and OpenTelemetry
• Provide visibility into GPU health, cluster utilization, and workload performance

Cross-Functional Collaboration

• Partner with HPC, ML, DevOps, and platform teams to ensure scalability and performance in hybrid and on-prem environments
• Collaborate with product and engineering teams to influence roadmap and platform improvements
• Build relationships with ecosystem vendors including NVIDIA, networking providers, and storage partners

Innovation & Thought Leadership

• Stay current on GPU roadmaps, interconnect technologies (InfiniBand, RoCE, NVLink), and Kubernetes advancements
• Provide forward-looking guidance to customers on scaling and future architecture evolution
• Represent the organization in technical workshops, design sessions, and industry events

Required Experience

• Extensive experience designing and operating Kubernetes platforms in HPC or GPU-intensive environments
• Deep expertise across:

  • NVIDIA GPU ecosystem (GPU Operator, device plugins, MIG, NVML, DCGM)
  • Kubernetes internals (CRDs, RBAC, scheduler extensions, custom operators/controllers)
  • High-performance networking (InfiniBand, RDMA, RoCE)
  • Distributed storage integration for HPC workloads

• Proven ability to design scalable, secure, and resilient Kubernetes-based architectures
• Proficiency in Go or Python for operator development and automation
• Experience with workload profiling, benchmarking, and performance tuning
• Strong customer-facing skills with the ability to translate requirements into actionable architectures
• Experience collaborating across engineering, product, and operations teams

Preferred Experience

• Experience delivering end-to-end HPC or AI/ML solutions from design through deployment and optimization
• Familiarity with containerized HPC environments (e.g., Singularity/Apptainer)
• Experience with GitOps practices and CI/CD pipelines for Kubernetes platforms
• Contributions to open-source projects in Kubernetes or NVIDIA ecosystems
• Experience advising customers on future-state architectures and emerging technologies
• Bachelor’s or Master’s degree in Computer Science, Engineering, Physics, or related field
• Relevant certifications such as CKA, CKAD, CKS, AWS Solutions Architect, or Azure Solutions Architect Expert

Posted 2026-04-16