Senior Kubernetes Engineer
Location: Dallas, TX
Overview
This organization is backed by dedicated leadership and investment, with a clear mission as it operates at the bleeding edge of technology. Its goal is to scale and enhance high-performance computing (HPC) and cloud infrastructure that supports clients' research, production, and delivery, enabling breakthroughs that shape the industries of tomorrow. Its engineers build critical infrastructure to eliminate friction in scientific research, simulations, analysis, and decision-making, accelerating discovery and driving faster innovation.
We are seeking a highly skilled Senior Kubernetes Engineer to join our office in Dallas. In this role, you will design, implement, and optimise GPU-accelerated container platforms at scale, enabling high-performance workloads (AI/ML, HPC, LLM training) across hybrid or on-prem environments. You will have deep expertise with both NVIDIA and Kubernetes ecosystems, including GPU scheduling, device plugins, and custom operators.
Key Responsibilities
- Architect and operate Kubernetes clusters optimised for GPU workloads, leveraging NVIDIA GPU Operator, Network Operator, and DCGM.
- Develop, deploy, and maintain custom Kubernetes operators and controllers to automate infrastructure services.
- Integrate NVIDIA device plugins, Multi-Instance GPU (MIG), and GPU sharing features into the scheduling layer.
- Optimise GPU utilisation and job placement through scheduler extensions and batch schedulers, such as kube-scheduler plugins, Volcano, and Slurm.
- Collaborate with HPC, ML, and DevOps teams to ensure multi-tenant, high-throughput cluster performance.
- Drive observability and telemetry integrations using Prometheus, Grafana, DCGM Exporter, and OpenTelemetry.
- Implement secure multi-user and multi-namespace GPU isolation, with RBAC and policy enforcement, such as OPA or Gatekeeper.
- Maintain CI/CD pipelines for Kubernetes infrastructure using GitOps tooling, such as ArgoCD and FluxCD.
- Contribute to infrastructure-as-code, using Terraform, Helm, and Kustomize.
- Participate in performance tuning, incident response, and production readiness reviews.
Required Experience
- Extensive experience running Kubernetes in production-grade environments and working with the NVIDIA stack on Kubernetes, including GPU Operator, device plugin, NVML, MIG, and DCGM.
- Proficiency in Go or Python for operator development and Kubernetes controller logic.
- Deep understanding of Kubernetes internals, including CRDs, RBAC, custom controllers, and scheduler extensions.
- Experience with GPU-intensive workloads, such as LLMs, training pipelines, and scientific computing.
- Hands-on experience with Helm, Kustomize, and GitOps workflows.
- Familiarity with CNI plugins, especially NVIDIA CNI and Multus.
- Experience with monitoring GPU metrics and cluster health using Prometheus and DCGM Exporter.