Site Reliability Engineer (SRE)

Job Title: Site Reliability Engineer (SRE)
Location: Plano, TX (onsite, 5 days per week)
Duration: Long-term project
Job Summary:

We are seeking a highly skilled Site Reliability Engineer (SRE) to join our Commercial & Investment Banking technology team. In this role, you will be responsible for ensuring the reliability, scalability, and performance of applications and infrastructure. You will work with modern cloud technologies, implement automation, and proactively improve system health while collaborating across engineering and business teams.

Key Responsibilities:

Design, build, and maintain scalable, reliable, and high-performance systems
Collaborate with development teams to implement CI/CD pipelines and deployment strategies
Develop and manage infrastructure using Infrastructure as Code (IaC) practices
Monitor system performance and availability using observability tools
Implement and maintain SLOs/SLAs and proactively resolve potential issues
Troubleshoot complex system and network issues across distributed environments
Drive adoption of SRE best practices including automation, reliability, and performance optimization
Partner with stakeholders and technical teams to solve business-critical problems

Required Skills & Qualifications:

Bachelor's degree in Computer Science or related field (or equivalent experience)
3+ years of experience in Site Reliability Engineering / DevOps / Software Engineering
Strong knowledge of system reliability, scalability, performance, and security principles
Proficiency in at least one programming language (Python, Java, or similar)
Experience with CI/CD and infrastructure automation tools such as Jenkins, GitLab, or Terraform
Hands-on experience with containerization and orchestration tools (Docker, Kubernetes, ECS)
Strong understanding of observability tools (Grafana, Prometheus, Datadog, Dynatrace, Splunk)
Experience with cloud platforms and distributed systems
Solid understanding of networking concepts and troubleshooting

Preferred Qualifications:

Experience implementing SLO/SLA frameworks for critical systems
Knowledge of chaos engineering tools (e.g., Gremlin, Chaos Monkey)
Familiarity with infrastructure components (load balancers, routers, storage systems)
Experience with tools like Jira, Confluence, ServiceNow, Netcool
Strong problem-solving skills and ability to work in a fast-paced environment
Experience with log analysis and monitoring tools

Key Competencies: