Sr Site Reliability Engineer
Role: Sr Site Reliability Engineer
Location: Bellevue, US
Contract Role

JOB DESCRIPTION:

AI Platform & Gateway Engineering
• Design, deploy, and operate enterprise AI Gateway infrastructure supporting OpenAI and internal LLM-based services.
• Implement and manage regional routing (east/west), failover strategies, and upstream host configurations for AI traffic.
• Develop and maintain Helm charts, Kubernetes manifests, and Jinja templates for multi-environment deployments (dev, plab, qlab).
• Enable per-API configuration for rate limiting, AI feature toggles, security credentials, and regional host overrides.
• Stay current with industry best practices for:
  o AI Gateways and MCP servers
  o Secure LLM consumption patterns
  o Token handling, secrets management, and request isolation
  o Observability standards for AI platforms

Vendor & Stakeholder Management
• Lead bi-weekly technical and operational syncs with AI Gateway vendors.
• Translate vendor capabilities, limitations, and roadmaps into actionable platform strategies.
• Communicate clearly in both technical and business terms with:
  o Engineering teams
  o SRE
  o Security & compliance
  o Product and leadership stakeholders

Reliability, Observability & Operations
• Build and maintain monitoring and troubleshooting frameworks for AI workloads using Splunk and Grafana.
• Author and evolve SRE support cookbooks for proactive monitoring, incident response, and escalation.
• Analyze failure rates, latency spikes, and request flows across distributed AI systems.
• Support on-call readiness through actionable dashboards, alerts, and operational runbooks.

CI/CD & Automation
• Build CI pipelines to generate and deploy environment-specific configurations at scale.
• Automate service registration, deployment validation, and environment promotion.
• Enforce consistent naming, versioning, and deployment standards across clusters and environments.

Cross-Functional Collaboration
• Act as a technical bridge between application teams, SRE, security, and platform engineering.
• Provide architectural guidance for teams onboarding to AI Gateway and Enterprise GPT platforms.
• Contribute to platform roadmaps, technical design reviews, and operational readiness planning.

Required Qualifications
• Strong experience with Kubernetes, Helm, and cloud-native networking.
• Hands-on experience with Istio / service mesh, routing rules, and traffic management.
• Proficiency in Python, Bash, and Jinja templating for infrastructure automation.
• Experience operating production-grade APIs with high reliability and observability standards.
• Deep understanding of SRE principles, monitoring, alerting, and incident management.
• Experience building observability frameworks using Splunk, Grafana, or similar tools.
• Strong ability to communicate complex technical issues in clear business terms.
• Experience working with AI/LLM APIs (OpenAI or similar) in an enterprise context.

Preferred Qualifications
• Knowledge of MCP servers, AI gateway patterns, and LLM security models.
• Familiarity with security controls for AI platforms (secrets management, token handling, access controls).
• Experience supporting multi-region, multi-environment deployments at scale.
• Strong documentation skills with a focus on operational clarity and enablement.