SRE Observability SME
- Location: Toronto, Ontario
- Remote: Hybrid
- Type: Contract
- Job #34997
Our financial client in Toronto is seeking a hands-on SRE Observability SME to provide day-one expertise in improving system reliability, performance, and incident response across complex distributed environments. This is a hybrid, embedded role working closely with engineering teams to drive observability best practices.
Key Responsibilities:
- Provide hands-on SRE and observability expertise across applications and infrastructure
- Implement and optimize monitoring, alerting, and observability frameworks
- Troubleshoot complex performance and reliability issues using metrics, events, logs, and traces (MELT)
- Design and build advanced dashboards and visualization solutions
- Guide teams on SRE best practices and reliability improvements
- Support incident response, root cause analysis, and remediation
- Develop creative observability solutions for systems with limited visibility
Required Skills & Experience:
- Strong hands-on experience with Dynatrace (DQL, dashboards, Grail, ActiveGate, plugins, workflows, BizEvents)
- Deep expertise in APM and observability tools (Dynatrace or similar)
- Advanced troubleshooting across distributed, multi-tier environments
- Strong understanding of SRE principles (Google SRE framework)
- Experience with AWS observability (CloudWatch, Application Signals, metrics, logs, traces, Lambda, API Gateway)
- Development experience with Python, AWS Lambda, ECS, and Azure Functions
- Knowledge of OpenTelemetry (OTel)
- Experience with AI-based system monitoring concepts
- Strong dashboard design skills (UI/UX for observability)
Nice to Have:
- Experience monitoring complex systems (e.g., IBM DataPower)
- Background in financial services or large-scale enterprise environments
Work Model:
- Hybrid (2–3 days onsite in Toronto)