Site Reliability Engineer
Role name: Site Reliability Engineer
Location: Atlanta, GA (On Site)
Contract Role

Role and responsibilities:

• 5+ years of experience in Site/System Reliability, DevOps, or related roles.
• Strong skills in Linux/Unix administration and shell scripting.
• Proficiency with cloud platforms (AWS, Azure, GCP) and container orchestration (Kubernetes, Docker).
• Knowledge of networking fundamentals (TCP/IP, DNS, load balancing).
• Proficiency in Linux/Unix administration and scripting (Python, Bash, or similar).
• Experience with monitoring tools (Prometheus, Grafana, Datadog).
• Familiarity with containerization (Docker, Kubernetes) and cloud services.
• Experience with CI/CD systems (Jenkins, GitHub Actions, GitLab CI).
• Strong analytical and problem-solving skills.
• Knowledge of security practices (IAM, encryption, secrets management).
• Experience with incident management frameworks and SRE principles.
• Knowledge of performance tuning and capacity planning.
• Exposure to observability tools and log aggregation systems.
• Understanding of networking and security fundamentals.

• Design, implement, and maintain monitoring, logging, and alerting systems.
• Define and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs).
• Conduct post-incident reviews and implement preventive measures.
• Automate deployment, scaling, and operational tasks using Infrastructure-as-Code tools (Terraform, Ansible, CloudFormation).
• Implement CI/CD pipelines and release management processes.
• Optimize infrastructure for reliability, performance, and cost efficiency.
• Respond to production incidents, perform root cause analysis, and implement solutions.
• Collaborate with development teams to ensure system robustness.
• Maintain runbooks and operational documentation.
• Partner with software developers, QA, DevOps, and product teams to improve system reliability.
• Promote best practices in coding, testing, and deployment.
• Advocate for proactive measures to prevent outages and reduce operational toil.
• Ensure systems adhere to security, compliance, and governance standards.
• Participate in vulnerability assessments and remediation planning.