Selected work

Reliability, from platforms to AI infra.

Before founding Auti, I spent ~9 years as an SRE inside the teams that built these systems. Below is a selection of that work — anonymised where engagements remain confidential. It's my hands-on track record as an engineer, not a client roster.

AI Infrastructure · Enterprise

Secure Deployment Platform for AI Agents

AI Infra · Platform · Security

The problem

As teams started building AI agents, there was no safe, standard way to get them into production. Every developer wired up their own observability, model access, cost controls, and retrieval by hand — and every deployment raised the same unanswered questions: what can it access, how much is it spending, and what is it actually doing once it's live.

The work

I helped build an internal platform where developers build the agent and the platform handles the tooling around it — Langfuse-based observability, a gateway governing model and token usage, and managed RAG for agents to consume. Cost governance, access scoping, and structured telemetry were wired in by default, so the guardrails were part of the platform rather than an afterthought.
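To make the gateway idea concrete, here is a minimal sketch of access scoping and token budgeting applied at a single choke point. It is an illustration only, not the platform's actual code: `AgentPolicy`, `Gateway`, `GatewayError`, and the agent and model names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AgentPolicy:
    """Hypothetical per-agent policy: allowed models and a token budget."""
    allowed_models: set[str]
    token_budget: int
    tokens_used: int = 0


class GatewayError(Exception):
    """Raised when a request violates an agent's policy."""


@dataclass
class Gateway:
    """Hypothetical gateway holding one policy per registered agent."""
    policies: dict[str, AgentPolicy] = field(default_factory=dict)

    def authorize(self, agent: str, model: str, tokens: int) -> int:
        """Check access and budget, record the spend, return tokens remaining."""
        policy = self.policies.get(agent)
        if policy is None:
            raise GatewayError(f"unknown agent: {agent}")
        if model not in policy.allowed_models:
            raise GatewayError(f"{agent} may not call {model}")
        if policy.tokens_used + tokens > policy.token_budget:
            raise GatewayError(f"{agent} would exceed its token budget")
        policy.tokens_used += tokens
        return policy.token_budget - policy.tokens_used


# Example: one agent, scoped to one model, with a 1000-token budget.
gateway = Gateway({"support-bot": AgentPolicy({"model-a"}, token_budget=1000)})
remaining = gateway.authorize("support-bot", "model-a", 400)
```

Because every call passes through `authorize`, the "what can it access" and "how much is it spending" questions get answered in one place rather than per agent.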

Outcome

Turned ad-hoc, self-wired agent deployment into a self-serve path that took minutes — with observability, model and cost governance, and retrieval provided by the platform instead of rebuilt by each team.

Health-tech · HITRUST

ECS→EKS Migration & CI/CD Platform

Kubernetes · Platform · CI/CD

The problem

A compliance-driven health-tech platform (HITRUST) was running client workloads on an ECS setup it had outgrown — scaling and maintenance were increasingly painful, and every team wired up its own release pipeline by hand.

The work

I led automation of the build and release pipelines through a months-long, multi-team migration of client workloads from ECS to EKS — implementing Concord deployment flows for running the platform on Kubernetes and migrating one of the first clients onto the new architecture. Alongside it I built a core library of reusable GitHub Actions that let Java, Node.js, and Python services build, containerise, and push to ECR from a few lines of YAML — with Chainguard base images and Trivy scanning baked in for secure, production-ready artifacts.

Outcome

A new service went from days of hand-rolled pipeline plumbing to production-grade CI in minutes of YAML, and manual multi-step releases became automated, repeatable deploys — on a hardened, scalable Kubernetes platform.

Developer Platform

Internal Developer Platform

Platform · Developer Experience · Cost

The problem

Developers had no single place to find and stand up what they needed. Service ownership was scattered, build artifacts lived in a third-party SaaS registry, and there was no safe, self-serve way to experiment against production-like environments.

The work

I built out an internal developer platform incrementally: a Backstage service catalog so teams could discover and own their services; a self-hosted Sonatype Nexus registry, moved off the SaaS-hosted version onto team-controlled infrastructure for tighter security (SSO, Maven/npm/PyPI repositories, and a Go post-setup script for consistent, repeatable configuration); dynamic, branch-based sandbox environments so developers could spin up isolated instances in their own namespaces; and an internal portal surfacing cost and usage metrics back to teams.
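One small but fiddly part of branch-based sandboxes is turning an arbitrary Git branch name into a valid Kubernetes namespace name, which must be a DNS-1123 label (lowercase alphanumerics and hyphens, at most 63 characters, starting and ending with an alphanumeric). A minimal sketch follows, assuming a hypothetical `sbx` prefix and a short hash suffix for uniqueness. This is an illustration, not the scheme the platform actually used.

```python
import hashlib
import re

MAX_LABEL = 63  # Kubernetes namespace names are DNS-1123 labels (max 63 chars)


def sandbox_namespace(branch: str, prefix: str = "sbx") -> str:
    """Derive a valid, deterministic namespace name from a Git branch name."""
    # Lowercase, then collapse each run of disallowed characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    # A short hash keeps names distinct when different branches share a slug.
    digest = hashlib.sha256(branch.encode()).hexdigest()[:6]
    # Leave room for the prefix, the digest, and the two joining hyphens.
    room = MAX_LABEL - len(prefix) - len(digest) - 2
    slug = slug[:room].strip("-")
    parts = [prefix, slug, digest] if slug else [prefix, digest]
    return "-".join(parts)


# Example: a typical feature branch name.
namespace = sandbox_namespace("feature/ABC-123_new login")
```

The name is deterministic, so re-running the pipeline for the same branch targets the same namespace instead of creating a new one.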

Outcome

One self-serve platform: developers discover services in Backstage, pull artifacts from a secured in-house registry, spin up isolated environments on demand, and see their own cost and usage — instead of waiting on shared environments and stitching tooling together by hand.

Reliability

Observability & SRE Foundation

Observability · SRE · Reliability

The problem

Services were reaching production without a consistent way to see what they were doing. Metrics, logs, traces, and profiles were fragmented, which made fast diagnosis impossible. On-call was painful, and incidents ran on guesswork.

The work

I helped roll out a comprehensive observability stack on the kube-prometheus-stack — system-wide metrics via Prometheus, Grafana dashboards backed by Loki for logs and Mimir for long-term metric storage in S3, and alerting through Alertmanager. I onboarded teams onto profiling (Pyroscope) and distributed tracing (Tempo), and supported service-level monitoring with exporters and custom metrics.

Outcome

One alertable view across metrics, logs, traces, and profiles, turning incident response from per-service guesswork into fast root-cause analysis and making on-call sustainable. Long-term metrics landed on S3 for durable, cost-efficient retention.

Much of my work stays under NDA and is shared by referral.

Happy to walk through the details on a call.

Start a conversation

Have a problem that looks like these?

Tell me what you're dealing with. I'll respond with an honest take on whether and how I can help.

Get in touch