Cloud Site Reliability Engineer – AWS & Azure

Overall Responsibilities

Oversee the design and improvement of infrastructure using SRE best practices, including Infrastructure as Code (IaC), recovery automation, and systems that detect and resolve issues independently.
Manage and fine-tune critical services across both cloud and on-prem environments: Kubernetes clusters, CI/CD pipelines, artifact registries, and custom workloads.
Enhance observability through intelligent logging, metrics, tracing, and alerting, ensuring systems are transparent and actionable in real time.
Champion automation by eliminating repetitive tasks, from deployment workflows to security audits, through scripting and tooling.
Elevate the developer experience for 80+ engineers and researchers by streamlining secure, reliable workflows across hybrid and cloud-native platforms.
Take ownership of IAM governance across platforms like Azure AD and AWS IAM. Implement lifecycle automation, auditing, and access controls.
Foster a culture of operational excellence with strong practices around security, incident management, and resilience engineering.
Act as a trusted partner to developers and researchers, enabling their speed and innovation without compromising stability.

Experience

Experience in Site Reliability Engineering, DevOps, or Systems Engineering within fast-paced, technically demanding environments.
Strong background in Linux systems and cloud infrastructure, with hands-on experience in AWS (primary) and Azure environments.
Solid command of Kubernetes and container orchestration in production environments.
Expertise in Infrastructure as Code tools such as Ansible; building reproducible, scalable infrastructure is second nature to you.
Deep experience in observability and incident response: you know how to set up effective monitoring, handle incidents, and lead blameless post-mortems.
A security-first mindset, especially when it comes to protecting distributed systems and developer workflows.
Proven ability to support and optimize CI/CD pipelines, container image builds, and artifact lifecycle management.
Strong communication and collaboration skills. You build trust across teams and advocate for thoughtful, scalable solutions.
Bonus if you’ve worked with event-driven architectures using technologies like Kafka.
