Senior Site Reliability Engineer (SRE) - Cloud & Distributed Systems
Skills:
SRE, DevOps, AWS, GCP, Kubernetes, Docker, Python, Go, Linux, Distributed Systems, Monitoring, Logging, SLIs, SLOs, CI/CD, Observability
We are seeking an experienced Senior Site Reliability Engineer (SRE) to design, build, and operate highly scalable and reliable cloud-based systems. The ideal candidate will have a strong background in DevOps, distributed systems, and cloud infrastructure, with a focus on automation, observability, and system reliability.
This role involves working in a fast-paced environment to ensure system uptime, performance, and operational excellence.
Key Responsibilities:
- Design, implement, and manage highly available, distributed systems
- Maintain and optimize cloud infrastructure (AWS/GCP)
- Develop automation scripts using Python, Go, Java, or Bash
- Manage containerized environments using Docker and Kubernetes
- Define and monitor SLIs, SLOs, and error budgets
- Implement monitoring, logging, and alerting solutions
- Lead incident management, root cause analysis (RCA), and postmortems
- Ensure system security and compliance within operational workflows
- Improve system reliability through performance tuning and optimization
- Collaborate with engineering teams to enhance deployment and release processes
- Create and maintain runbooks, dashboards, and operational documentation
Required Qualifications:
- 8+ years of experience in SRE, DevOps, or Systems Engineering
- Strong expertise in Linux/Unix systems and system internals
- Proficiency in at least one programming/scripting language (Python, Go, Java, Bash)
- Experience designing and operating distributed systems
- Hands-on experience with cloud platforms (AWS or GCP)
- Experience with Docker and Kubernetes
- Strong understanding of monitoring, alerting, and logging concepts
- Experience managing SLIs, SLOs, and error budgets
- Experience with incident management and RCA processes
Preferred Qualifications:
- Experience with observability tools (Prometheus, Grafana, Datadog, Splunk, Application Insights)
- Experience supporting 24x7 production environments and on-call rotations
- Knowledge of chaos engineering and resiliency testing
- Experience with canary deployments, feature flags, and progressive delivery
- Strong documentation and communication skills