
Lead Site Reliability Engineer
Kontakt.io
Posted 6/10/2025

Job Summary
Kontakt.io is seeking a Lead Site Reliability Engineer to own the reliability, performance, and automation of its cloud-based, real-time healthcare platform. The role involves ensuring 99.99% uptime across the platform, writing high-quality code, designing self-healing systems, and architecting scalable cloud infrastructure. The successful candidate will lead a team, drive technical strategy, and collaborate with Product, Engineering, and Infrastructure teams to align SRE initiatives with business priorities. With a focus on automation, observability, and incident response, the Lead Site Reliability Engineer will help reduce waste, optimize resources, and improve patient care in the healthcare industry. This is an opportunity to join a high-performing team and contribute to the development of a platform that delivers 10X ROI.
Job Description
Responsibilities
- Ensure 99.99% uptime across our cloud platform, meeting strict SLAs for healthcare customers.
- Leverage your software engineering expertise to write high-quality, maintainable code that improves system reliability and operational efficiency.
- Design and implement self-healing, fault-tolerant systems to prevent failures before they happen.
- Define SLIs, SLOs, and SLAs, ensuring proactive performance monitoring and incident resolution.
- Architect and manage scalable cloud infrastructure (AWS) for massive real-time data processing.
- Optimize containerized environments (Kubernetes, Docker) to support multi-region deployments.
- Lead the adoption of infrastructure as code (Terraform) to fully automate infrastructure management.
- Build and refine a world-class monitoring, alerting, and logging system using Prometheus, Grafana, OpenTelemetry, and Datadog.
- Lead incident response and on-call operations, reducing mean time to detection (MTTD) and mean time to resolution (MTTR).
- Conduct blameless postmortems and continuously improve system resilience.
- Reduce manual intervention through automated deployment, scaling, and failover mechanisms.
- Partner with Security & Compliance teams to ensure infrastructure meets HIPAA and SOC 2 standards.
- Lead disaster recovery and business continuity planning to ensure critical healthcare services are always available.
- Drive technical strategy and roadmap for scalability, monitoring, and reliability engineering.
- Collaborate with Product, Engineering, and Infrastructure teams to align SRE initiatives with business priorities.
What You Bring
- 10+ years of experience in Site Reliability Engineering or Cloud Infrastructure.
- 2+ years of experience as a software engineer.
- Proven success scaling high-traffic, mission-critical platforms in SaaS, IoT, or healthcare.
- Deep expertise in cloud platforms (AWS), Kubernetes, and distributed systems.
- Strong background in monitoring, logging, and observability with Prometheus, OpenTelemetry, or similar tools.
- Hands-on experience with incident management, postmortems, and building resilient systems.
- Deep knowledge of CI/CD automation, GitOps, and infrastructure as code (Terraform, etc.).
- A mature leadership approach, with the ability to drive technical strategy while growing and mentoring a high-performance SRE team.
- Strong understanding of network security, access management, and compliance frameworks (HIPAA, SOC 2).
- Experience with healthcare IT, including EHR data, FHIR, and HL7 interoperability.
- Expertise in real-time distributed systems, event-driven architectures, or large-scale data pipelines.
- Prior experience leading on-call rotations and major incident management processes.
Why You'll Love It Here
- Own Mission-Critical Reliability – Ensure hospitals and care facilities always stay online with a 99.99% uptime healthcare platform.
- Scale AI-Powered Infrastructure – Work on real-time automation and self-healing cloud systems that orchestrate care delivery.
- Drive Big Impact in Healthcare – Help reduce waste, optimize resources, and improve patient care with technology that delivers 10X ROI.
- Automation-First Culture – Minimize manual ops with cutting-edge automation, observability, and incident response strategies.
- Join a High-Performing Team – Work with top engineers, AI experts, and healthcare innovators solving real-world challenges.