Cloud Site Reliability Engineer – AWS & Azure
Overall Responsibilities
- Oversee the design and improvement of infrastructure using SRE best practices, including Infrastructure as Code (IaC), recovery automation, and self-healing systems that detect and resolve issues without manual intervention.
- Manage and fine-tune critical services across both cloud and on-prem environments: Kubernetes clusters, CI/CD pipelines, artifact registries, and custom workloads.
- Enhance observability through intelligent logging, metrics, tracing, and alerting, so that system behavior is transparent and alerts are actionable in real time.
- Champion automation by eliminating repetitive tasks, from deployment workflows to security audits, through scripting and tooling.
- Elevate the developer experience for 80+ engineers and researchers by streamlining secure, reliable workflows across hybrid and cloud-native platforms.
- Take ownership of IAM governance across platforms like Azure AD and AWS IAM. Implement lifecycle automation, auditing, and access controls.
- Foster a culture of operational excellence with strong practices around security, incident management, and resilience engineering.
- Act as a trusted partner to developers and researchers, enabling their speed and innovation without compromising stability.
Experience
- Experience in Site Reliability Engineering, DevOps, or Systems Engineering within fast-paced, technically demanding environments.
- Strong background in Linux systems and cloud infrastructure, with hands-on experience in AWS (primary) and Azure environments.
- Solid command of Kubernetes and container orchestration in production environments.
- Expertise in Infrastructure as Code tools such as Ansible; building reproducible, scalable infrastructure is second nature to you.
- Deep experience in observability and incident response. You know how to set up effective monitoring, handle incidents, and lead blameless post-mortems.
- A security-first mindset, especially when it comes to protecting distributed systems and developer workflows.
- Proven ability to support and optimize CI/CD pipelines, container image builds, and artifact lifecycle management.
- Strong communication and collaboration skills. You build trust across teams and advocate for thoughtful, scalable solutions.
- Bonus if you’ve worked with event-driven architectures using technologies like Kafka.