Sr Site Reliability Engineer

Ref No.:	26-00101
Location:	Bellevue, Washington
Position Type:	Contract

Role: Sr Site Reliability Engineer
Location: Bellevue, Washington, US
Contract Role

JOB DESCRIPTION:
AI Platform & Gateway Engineering
• Design, deploy, and operate enterprise AI Gateway infrastructure supporting OpenAI and internal LLM-based services.
• Implement and manage regional routing (east/west), failover strategies, and upstream host configurations for AI traffic.
• Develop and maintain Helm charts, Kubernetes manifests, and Jinja templates for multi-environment deployments (dev, plab, qlab).
• Enable per-API configuration for rate limiting, AI feature toggles, security credentials, and regional host overrides.
• Stay current with industry best practices for:
o AI Gateways and MCP servers
o Secure LLM consumption patterns
o Token handling, secrets management, and request isolation
o Observability standards for AI platforms
Vendor & Stakeholder Management
• Lead bi-weekly technical and operational syncs with AI Gateway vendors.
• Translate vendor capabilities, limitations, and roadmaps into actionable platform strategies.
• Communicate clearly in both technical and business terms with:
o Engineering teams
o SRE
o Security & compliance
o Product and leadership stakeholders
Reliability, Observability & Operations
• Build and maintain monitoring and troubleshooting frameworks for AI workloads using Splunk and Grafana.
• Author and evolve SRE support cookbooks for proactive monitoring, incident response, and escalation.
• Analyze failure rates, latency spikes, and request flows across distributed AI systems.
• Support on-call readiness through actionable dashboards, alerts, and operational runbooks.
CI/CD & Automation
• Build CI pipelines to generate and deploy environment-specific configurations at scale.
• Automate service registration, deployment validation, and environment promotion.
• Enforce consistent naming, versioning, and deployment standards across clusters and environments.
Cross-Functional Collaboration
• Act as a technical bridge between application teams, SRE, security, and platform engineering.
• Provide architectural guidance for teams onboarding to AI Gateway and Enterprise GPT platforms.
• Contribute to platform roadmaps, technical design reviews, and operational readiness planning.

Required Qualifications
• Strong experience with Kubernetes, Helm, and cloud-native networking.
• Hands-on experience with Istio / service mesh, routing rules, and traffic management.
• Proficiency in Python, Bash, and Jinja templating for infrastructure automation.
• Experience operating production-grade APIs with high reliability and observability standards.
• Deep understanding of SRE principles, monitoring, alerting, and incident management.
• Experience building observability frameworks using Splunk, Grafana, or similar tools.
• Strong ability to communicate complex technical issues in clear business terms.
• Experience working with AI/LLM APIs (OpenAI or similar) in an enterprise context.

Preferred Qualifications
• Knowledge of MCP servers, AI gateway patterns, and LLM security models.
• Familiarity with security controls for AI platforms (secrets management, token handling, access controls).
• Experience supporting multi-region, multi-environment deployments at scale.
• Strong documentation skills with a focus on operational clarity and enablement.