Senior Kubernetes Engineer
Location: Dallas, TX
Overview
This organization is backed by dedicated leadership and investment, with a clear mission as it operates at the bleeding edge of technology. Its goal is to scale and enhance high-performance computing (HPC) and cloud infrastructure that supports clients' research, production, and delivery, enabling breakthroughs that shape the industries of tomorrow. Its engineers build critical infrastructure to eliminate friction in scientific research, simulations, analysis, and decision-making, accelerating discovery and driving faster innovation.
We are seeking a highly skilled Senior Kubernetes Engineer to join our office in Dallas. In this role, you will design, implement, and optimise GPU-accelerated container platforms at scale, enabling high-performance workloads (AI/ML, HPC, LLM training) across hybrid or on-prem environments. You will have deep expertise with both NVIDIA and Kubernetes ecosystems, including GPU scheduling, device plugins, and custom operators.
Key Responsibilities
- Architect and operate Kubernetes clusters optimised for GPU workloads, leveraging NVIDIA GPU Operator, Network Operator, and DCGM.
- Develop, deploy, and maintain custom Kubernetes operators and controllers to automate infrastructure services.
- Integrate NVIDIA device plugins, Multi-Instance GPU (MIG), and GPU sharing features into the scheduling layer.
- Optimise GPU utilisation and job placement through scheduler extensions and batch schedulers, such as kube-scheduler plugins, Volcano, and Slurm.
- Collaborate with HPC, ML, and DevOps teams to ensure multi-tenant, high-throughput cluster performance.
- Drive observability and telemetry integrations using Prometheus, Grafana, DCGM Exporter, and OpenTelemetry.
- Implement secure multi-user and multi-namespace GPU isolation, with RBAC and policy enforcement using tools such as OPA or Gatekeeper.
- Maintain CI/CD pipelines for Kubernetes infrastructure using GitOps tooling, including ArgoCD and FluxCD.
- Contribute to infrastructure-as-code, using Terraform, Helm, and Kustomize.
- Participate in performance tuning, incident response, and production readiness reviews.
Required Experience
- Extensive experience running Kubernetes in production-grade environments, including the NVIDIA components for Kubernetes: GPU Operator, device plugin, NVML, MIG, and DCGM.
- Proficiency in Go or Python for operator development and Kubernetes controller logic.
- Deep understanding of Kubernetes internals, including CRDs, RBAC, custom controllers, and scheduler extensions.
- Experience with GPU-intensive workloads, such as LLMs, training pipelines, and scientific computing.
- Hands-on experience with Helm, Kustomize, and GitOps workflows.
- Familiarity with CNI plugins, especially NVIDIA CNI and Multus.
- Experience with monitoring GPU metrics and cluster health using Prometheus and DCGM Exporter.