
I rebuilt the AWS analytics stack in my homelab

Open-source parts, two mini PCs, and the lessons the managed services quietly handle for you.

I build data platforms on AWS for a living, at a scale where the monthly bill has its own stakeholders. So when I wanted somewhere to experiment without filing a change request, I did the slightly unhinged thing and rebuilt a smaller version of the same stack at home, on two mini PCs that fit in a shoebox.

It started as a way to keep my hands dirty. It turned into the best teacher I have had in years. Running the open-source equivalent of every managed service you take for granted is the fastest way to learn what those services were actually doing for you.

01  Why do this to myself

The honest answer is that I missed building. Most of my day is reviews, design docs, and meetings about meetings, so I wanted something where I owned the whole thing from the metal up, and where breaking it only annoyed me.

The useful answer is that I wanted one place to demo a full pipeline end to end, ingestion to dashboard, without spinning up an account and watching the cost graph climb. So I wrote it all down as code and put it on GitHub as data-platforms-infra. If you can read a compose file, you can stand the whole thing up in an afternoon.

02  Mapping AWS to open source

The trick was to keep the shapes familiar. I did not want a pile of random tools; I wanted the same mental model I use at work, with the labels swapped. So S3 became MinIO, the Glue Data Catalog became a Hive metastore, Athena became Trino, Glue jobs became plain Spark, and Airflow stayed Airflow, because some things are already open source and already good.

Streaming was the one I went back and forth on. Kinesis is lovely until you read the pricing, so I landed on Redpanda, which speaks the Kafka protocol and does not need a JVM or ZooKeeper to babysit. For the BI layer I used Metabase, because I wanted something a non-engineer could actually open and trust.

Half of what you pay AWS for is the part you never see.
docker-compose.yml
# the storage and query core, the rest hangs off this
services:
  minio:           # stands in for S3
    image: minio/minio
    command: server /data --console-address ":9001"

  metastore:       # stands in for the Glue Data Catalog
    image: apache/hive:4.0.0
    environment:
      SERVICE_NAME: metastore

  trino:           # stands in for Athena
    image: trinodb/trino
    depends_on: [minio, metastore]
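
The streaming piece could hang off the same file. This is a minimal single-node sketch, not the configuration from the repository: the image tag, the listener name `internal`, and the port are all assumptions, and `--mode dev-container` is a development setting rather than anything you would run in production.

```yaml
# sketch only: a single-node Redpanda broker, assumed settings throughout
services:
  redpanda:        # stands in for Kinesis, speaks the Kafka protocol
    image: docker.redpanda.com/redpandadata/redpanda:latest  # assumed tag, pin a version in practice
    command:
      - redpanda
      - start
      - --mode dev-container      # development defaults, not for production
      - --smp 1                   # one core is plenty on a mini PC
      - --kafka-addr internal://0.0.0.0:9092
      # other containers reach the broker by its service name
      - --advertise-kafka-addr internal://redpanda:9092
```

With something like this in place, a Kafka client on the same Compose network would use `redpanda:9092` as its bootstrap server.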

03  What the cloud quietly hides

Here is where it got educational. The Hive metastore, which AWS hands you as a checkbox called the Glue Data Catalog, is a finicky little service with a real database behind it and an opinion about every table you register. I spent a full evening on a schema mismatch that AWS would have swallowed without a word.
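
To make the "real database behind it" part concrete, here is a rough sketch of a metastore pointed at Postgres. The post does not say which database the build uses, so Postgres, the service name `metastore-db`, the credentials, and the local driver path are all assumptions for illustration. The `DB_DRIVER` and `SERVICE_OPTS` variables follow the pattern in the Apache Hive image's own documentation, but check them against the image version you actually run.

```yaml
# sketch only: Hive metastore backed by Postgres (the database choice is an assumption)
services:
  metastore-db:
    image: postgres:16
    environment:
      POSTGRES_DB: metastore_db
      POSTGRES_USER: hive
      POSTGRES_PASSWORD: password        # placeholder, change it

  metastore:
    image: apache/hive:4.0.0
    depends_on: [metastore-db]
    environment:
      SERVICE_NAME: metastore
      DB_DRIVER: postgres                # without this the image falls back to embedded Derby
      SERVICE_OPTS: >-
        -Djavax.jdo.option.ConnectionDriverName=org.postgresql.Driver
        -Djavax.jdo.option.ConnectionURL=jdbc:postgresql://metastore-db:5432/metastore_db
        -Djavax.jdo.option.ConnectionUserName=hive
        -Djavax.jdo.option.ConnectionPassword=password
    volumes:
      # the image does not bundle the Postgres JDBC driver; this local path is hypothetical
      - ./drivers/postgresql.jar:/opt/hive/lib/postgres.jar
```

Every line in that second service is something the Glue Data Catalog checkbox does on your behalf.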

Networking was the other one. In the cloud, a managed service talks to another managed service and you never think about it. At home, every container has to find every other container, and one wrong hostname turns into an hour with docker logs and a strong coffee. None of this is hard. It is just the work the bill was quietly paying someone else to do.
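
The hostname problem is easier to see written down. In this sketch the network name `lake` is hypothetical, and the ports in the comments are the defaults for MinIO's S3 API and the Hive metastore's Thrift endpoint, which your own setup may override.

```yaml
# sketch only: service names double as hostnames on a shared Compose network
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    networks: [lake]

  metastore:
    image: apache/hive:4.0.0
    environment:
      SERVICE_NAME: metastore
    networks: [lake]

  trino:
    image: trinodb/trino
    depends_on: [minio, metastore]
    networks: [lake]
    # from inside this container the other services resolve by service name:
    #   thrift://metastore:9083   (Hive metastore, default Thrift port)
    #   http://minio:9000         (MinIO, default S3 API port)
    # "localhost" here means the trino container itself, not the host machine

networks:
  lake: {}         # hypothetical name
```

Most of my lost hours came down to some config file saying `localhost` where it should have named a service.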

04  What I kept, what I tossed

I kept the catalog as the center of gravity. The single best decision in the whole build, exactly like at work, was treating discovery as a first-class thing instead of an afterthought. A lake nobody can search is just an expensive hard drive.

I tossed the idea that I needed everything. The first version had a tool for every box on the AWS analytics page, and most of them sat idle. The version on GitHub now is smaller and meaner, and it teaches the same lessons with a third of the containers. Freddie, my Aussie and chief morale officer, approves of anything that runs quieter.

Vitor Garbim
Senior Data Engineer at Amazon, in Austin. Nineteen years of data, one homelab, and a blue merle Aussie named Freddie who supervises every draft.