Cloud Infrastructure / Site Reliability Engineering / Platform Operations
Shanghai, China
I build production infrastructure and operational tooling for Kubernetes and AWS environments, with a focus on reliability, observability, and safe automation. Recently, I have been working on AI-supported incident analysis systems that turn alerts, runbooks, and platform evidence into actionable operational context.
- Kubernetes / EKS production operations: troubleshooting, release safety, cluster access patterns, and platform reliability
- AWS infrastructure: IAM, networking, Bedrock AgentCore, MCP-backed tooling, and environment-aware automation
- AI for operations: runbook retrieval, alert summarization, evidence collection, and human-in-the-loop recommendations
- Incident workflows: Alertmanager intake, triage, evidence collection, escalation context, and post-incident improvements
- Infrastructure systems: Terraform, Jsonnet, CI/CD, documentation, and operator-facing runbooks
- Designing AI-supported incident analysis systems that prepare production context before responders engage
- Connecting Kubernetes, AWS, and runbook evidence through controlled tool access
- Building automation that stays auditable, reversible, and useful under pressure
- Turning repeated operational lessons into durable platform defaults
- Reliability first: optimize for observable, understandable systems over clever automation
- Least privilege by default: production access should be scoped, reviewed, and easy to audit
- Human-in-the-loop operations: automation should explain, recommend, and reduce toil before it remediates
- Incidents compound: each one should leave behind better tooling, better docs, and safer release paths
AWS · Kubernetes · Bedrock AgentCore · MCP · Python · TypeScript · Shell · Docker · Terraform · Jsonnet · PostgreSQL · Redis
yixingyan@gmail.com