Senior Site Reliability Engineer, Loblaw Companies Limited
Toronto, Canada
- Architected and implemented an observability platform in Go, defining SRE principles around SLIs, SLOs, and error budgets, and improving issue identification and automated alerting with templated Grafana dashboards.
- Managed and maintained Kubernetes clusters across multiple environments, ensuring 99.99% uptime and supporting the deployment of 100+ applications.
- Enhanced Kubernetes cluster performance through seamless version upgrades, real-time metrics monitoring (VM cluster), and auto-scaling, and developed in-house Operators.
- Replaced the single-instance Prometheus used for time-series data with VictoriaMetrics, gaining scalability, faster data ingestion, and faster querying.
- Automated infrastructure provisioning using Terraform and Ansible, reducing deployment times by 40%.
- Played a key role in migrating a standalone application running on a VM to GKE (Google Kubernetes Engine) using Helm, GitLab pipelines, and Vault.
- Improved service performance with Akamai CDN and built IaC with GitLab pipelines for version rollouts.
- Collaborated with the team on MR reviews and feedback, system design, and coding in Go, Python, and Bash.
- Improved application observability by instrumenting it with OpenTelemetry.
- Managed Linux-based servers (RHEL 6/7/8, CentOS, Ubuntu), ensuring optimal performance and security and troubleshooting related issues.