[Sticky] SRE Duties in your Organization
Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The primary goal of SRE is to create scalable and highly reliable software systems. The roles and responsibilities of SREs can vary across organizations, but generally, they involve the following:
- Service Reliability:
  - Ensure the reliability and availability of critical services and systems.
- Service Level Objectives (SLOs) and Service Level Indicators (SLIs):
  - Define SLIs that measure service behavior, and set SLOs as targets for them, to quantify the reliability of services.
- Automation:
  - Develop and maintain automation tools for deployment, monitoring, and incident response to minimize manual intervention.
- Capacity Planning:
  - Conduct capacity planning to ensure that systems can handle current and future loads.
- Incident Response:
  - Participate in on-call rotations and respond to incidents to minimize downtime and resolve issues promptly.
- Monitoring and Alerting:
  - Implement effective monitoring and alerting systems to detect and respond to anomalies and issues.
- Post-Incident Review:
  - Collaborate with development teams to conduct post-incident reviews and implement improvements to prevent future incidents.
- Performance Optimization:
  - Identify and address performance bottlenecks in systems to improve overall efficiency.
- Risk Management:
  - Assess risks to system reliability and implement mitigations to reduce the impact of potential issues.
- Infrastructure as Code (IaC):
  - Use IaC principles to manage and configure infrastructure, making it more scalable, version-controlled, and reproducible.
- Release Engineering:
  - Collaborate with development teams on the release process, ensuring smooth and reliable deployments.
- Security:
  - Work on security-related tasks, such as implementing secure coding practices, participating in security reviews, and ensuring compliance with security policies.
- Documentation:
  - Create and maintain documentation for operational procedures, configurations, and incident response playbooks.
- Cross-Functional Collaboration:
  - Collaborate with development, product, and other cross-functional teams to align reliability goals with overall business objectives.
- On-Call Responsibilities:
  - Share on-call responsibilities to respond to incidents and ensure 24/7 system reliability.
SREs focus on the intersection of software engineering and systems administration, applying software engineering principles to infrastructure and operations challenges. Their ultimate goal is to create scalable and reliable systems that meet or exceed service level objectives.
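To make "meet or exceed service level objectives" concrete, here is a minimal sketch of how a request-based availability SLI could be compared against an SLO target. The names `SloReport` and `evaluate_slo`, and the request counts in the example, are hypothetical and only for illustration; real SLIs would come from your monitoring platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float               # measured fraction of good requests
    allowed_failures: float  # failures the SLO tolerates in this window
    actual_failures: int     # failures observed in this window
    budget_remaining: float  # share of the error budget left (negative if overspent)
    slo_met: bool            # True when the SLI is at or above the target


def evaluate_slo(good: int, total: int, slo_target: float) -> SloReport:
    """Compare a request-based availability SLI against an SLO target."""
    if total <= 0 or not 0 <= good <= total:
        raise ValueError("need total > 0 and 0 <= good <= total")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good / total
    # The error budget is the unreliability the SLO permits: (1 - target) * total.
    allowed_failures = (1.0 - slo_target) * total
    actual_failures = total - good
    budget_remaining = 1.0 - actual_failures / allowed_failures
    return SloReport(sli, allowed_failures, actual_failures,
                     budget_remaining, sli >= slo_target)


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 500 of them failed, 99.9% target.
    print(evaluate_slo(good=999_500, total=1_000_000, slo_target=0.999))
```

With a 99.9% target over 1,000,000 requests, the SLO tolerates about 1,000 failures, so 500 failures leave roughly half of the error budget unspent.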
- Solve business and technical problems to keep applications and infrastructure highly available and reliable.
- Implement a monitoring solution, or use the existing monitoring platform, to detect issues, and create automated scripts that act to resolve them.
- Work to keep error budget consumption to a minimum.
- Share workloads with the DevOps team to resolve technical debt.
- Work on an operations and on-call basis.
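The "detect issues, then run an automated script to resolve them" duty above can be sketched as a bounded remediation loop. This is an illustrative pattern under stated assumptions, not a prescribed implementation: `remediate` and its `check`, `fix`, and `escalate` callables are hypothetical placeholders for whatever health probe, repair action, and paging hook your platform provides.

```python
import time
from typing import Callable


def remediate(
    check: Callable[[], bool],        # returns True when the service is healthy
    fix: Callable[[], None],          # one automated repair attempt, e.g. a restart
    escalate: Callable[[str], None],  # hand-off to a human, e.g. open a ticket
    max_attempts: int = 3,
    wait_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try an automated fix a bounded number of times, then escalate."""
    if check():
        return "healthy"
    for _ in range(max_attempts):
        fix()
        sleep(wait_seconds)  # give the service time to settle before re-checking
        if check():
            return "recovered"
    # Automation did not help: stop retrying and involve the on-call engineer.
    escalate(f"automated remediation failed after {max_attempts} attempts")
    return "escalated"
```

Capping the number of attempts and escalating afterwards keeps the script from masking a persistent fault, which would otherwise keep consuming error budget without anyone being notified.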
Technology Skills:
- IaC and scripting: ARM templates, Bicep, Terraform, shell scripts
- CI/CD tools: GitHub Actions
- OS: Linux (RHEL), Windows
- Containers and orchestration: Docker, Kubernetes
- Package and artifact management: Helm, Nexus, Azure Artifacts
- Observability: Azure Monitor, Prometheus, ELK, Grafana
- Service management tool: ServiceNow
- Source code version control: GitHub
- Cloud knowledge: Azure IaaS, PaaS, and SaaS solutions