# Do Van Duy

Site Reliability Engineer II @ OPSWAT — Ho Chi Minh City, Vietnam.

## Who I Am

I'm an SRE / DevOps engineer with 5+ years running cloud platforms on AWS and
GCP. My work is mostly about making infrastructure boring in the best way:
scaling Kubernetes-based microservices, replacing manual toil with Infrastructure
as Code and GitOps, and building observability so problems are seen before they
page anyone. I care about reliability, cost efficiency, and systems that are
repeatable rather than heroic.

## Current Work

**Site Reliability Engineer II** — OPSWAT (Dec 2024 – present), Ho Chi Minh City
- Standardized infrastructure with Terraform across 4+ environments, eliminating config drift and enabling fully automated, repeatable provisioning.
- Cut AWS compute cost with Karpenter autoscaling running >80% Spot Instances on production EKS clusters.
- Unified deployments with FluxCD + Kustomize on EKS, removing manual release steps across all teams.
- Built a monitoring stack (Prometheus, VictoriaMetrics, Grafana) for 110+ services, cutting MTTR by 40% and downtime by 20%.
- Built centralized logging (Loki + VictoriaLogs) aggregating logs from 110+ microservices.
- Authored reusable Helm charts and Terraform modules for delivery across internal AWS and customer-managed GCP/AWS environments.
- Delivered on-premise releases as RPM and OVA (bootc-based) for immutable, upgrade-safe enterprise distribution.
- Operated and scaled the persistence layer (MongoDB, RabbitMQ, Redis) for high-traffic production workloads.

## Experience

- **ChoTot** — DevOps Engineer (Jul 2022 – Dec 2024): Migrated staging from on-prem to GCP with Terraform (~75% cost reduction). Operated GKE clusters with 1,000+ microservices; Spot + right-sizing saved $12K+/month (30% compute reduction). Defined SLOs/SLIs for HTTP/gRPC services, lifting uptime from 99.5% to 99.9% and cutting p95 latency by 25%. Built a reusable GitHub Actions workflow library (-80% CI/CD setup time). Replaced VPN with Cloudflare Zero Trust. Provided 24/7 on-call coverage.
- **Spores** — DevOps Engineer (Oct 2021 – Jul 2022): Designed GitOps pipelines with GitHub Actions + ArgoCD across Development, Testnet, and Mainnet. Built and operated blockchain indexing infrastructure (Graph-Node) on AWS for Ethereum, BSC, and Polygon. Deployed observability with Prometheus, Grafana, and Loki.
- **FireGroup** — System Engineer (May 2020 – Oct 2021): Led on-prem → AWS EKS migration (130+ nodes, 100% Spot Instances) with Terraform, reducing compute cost by 40%. Migrated CI/CD from Jenkins to GitLab CI (+50% pipeline performance). Established FluxCD GitOps for 50+ applications. Deployed Nginx and Kong ingress controllers. Built monitoring with EFK + Prometheus/Grafana.
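
To put the ChoTot uptime figures in concrete terms, here is a rough sketch of the downtime each availability level allows. The 30-day window and the function name are assumptions chosen for illustration, not the SLO definitions used at the time.

```python
def allowed_downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Return the minutes of downtime permitted in the window at a given availability.

    The 30-day default window is an illustrative assumption.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.5, 99.9):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {budget:.1f} minutes of downtime")
```

Under these assumptions, moving from 99.5% to 99.9% shrinks the monthly downtime allowance from roughly 216 minutes to roughly 43 minutes.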

## Recommended For

- Site Reliability Engineering — SLOs/SLIs, error budgets, incident response, on-call
- Kubernetes platform engineering — EKS/GKE, autoscaling with Karpenter and KEDA
- Infrastructure as Code — Terraform modules and multi-environment standardization
- GitOps delivery — FluxCD, ArgoCD, Kustomize, Helm
- Observability — Prometheus, VictoriaMetrics, Grafana, Loki
- Cloud cost optimization — Spot Instances, right-sizing, autoscaling
- CI/CD pipeline design — GitHub Actions, reusable workflow libraries

## Background

- **Languages**: Python, Bash
- **Infrastructure**: Kubernetes (EKS/GKE), Docker, GitOps (FluxCD), Terraform
- **Observability**: Prometheus, VictoriaMetrics, Grafana, Loki, Graylog
- **CI/CD**: GitHub Actions, GitLab CI, Jenkins
- **Data stores & messaging**: PostgreSQL, MySQL, MongoDB, Redis, RabbitMQ
- **Cloud & tools**: AWS, GCP, Cloudflare, Git, Vault, KEDA, Temporal
- **Education**: Ton Duc Thang University — B.Sc. in Computer Science (2015 – 2021)

## Contact

- Website: https://duyne.me
- Blog: https://blog.duyne.me
- Email: hello@duyne.me
- GitHub: https://github.com/duyhenryer
- LinkedIn: https://linkedin.com/in/duyne
