Build a Mini Lakehouse at Home: A Portfolio Project

A hands-on project to learn ingestion, transforms, and BI.

Reading time: ~8–12 min
Level: All levels

A “mini lakehouse” is the fastest way to learn real data engineering without getting lost in enterprise complexity: you’ll ingest data into a raw file lake, transform it into analytics-ready tables, and ship a simple BI dashboard on top. This post walks you through a home-friendly stack that fits in Docker and produces portfolio-grade artifacts: versioned data layers, repeatable runs, and a dashboard you can demo in 2 minutes.


Quickstart

If you only have an afternoon, build the smallest version of a lakehouse that still teaches the important lessons: idempotent ingestion, layered transforms, and a serving layer your BI tool can query.

1) Pick one dataset + one business question

Constraints create clarity. Your project is easier to explain when it has a single “story”.

  • Choose a source you can re-run (public API, CSV drop, or open dataset)
  • Write one question: “What do we want to track weekly?”
  • Define 3–5 core metrics (counts, rates, averages, top-N)

2) Stand up the local stack

Run it locally with Docker so anyone (including future-you) can reproduce it.

  • Object storage (local S3): MinIO
  • Serving warehouse: Postgres
  • BI: Metabase
  • Keep credentials in a local .env (never commit)

3) Ingest to a Bronze layer (raw files)

Store raw data in immutable files so you can reprocess and debug later.

  • Write Parquet (compressed, columnar)
  • Partition by date (e.g., ingest_date=YYYY-MM-DD)
  • Write a tiny run log (what was ingested, when)

4) Produce one Gold table + one dashboard

A portfolio project should end with a demo: a table someone can query and a chart someone can read.

  • Create a clean dimension + fact (or one wide “mart” table)
  • Load into Postgres for BI
  • Build 3 simple charts (trend, top-N, breakdown)

Portfolio framing in one sentence

“This project ingests raw data into an object store, transforms it through Bronze/Silver/Gold layers, and serves analytics tables to a BI dashboard — all reproducible via Docker and versioned code.”

Overview

“Build a Mini Lakehouse at Home” is a portfolio-first project: you’re not trying to clone a full enterprise platform. You’re building a compact system that demonstrates the same engineering principles hiring managers look for: reproducibility, clear data contracts, and the ability to debug failures end-to-end.

What you’ll build

  • Bronze (Raw): immutable files (Parquet) + ingestion logs. Why it exists: re-runs, audits, and debugging (“what did we get?”)
  • Silver (Clean): standardized schema, types, deduped records. Why it exists: consistency and correctness (“what does it mean?”)
  • Gold (Marts): analytics-ready tables (facts/dims or wide marts). Why it exists: fast BI and clear business definitions
  • Serving: Postgres tables exposed to BI. Why it exists: a stable interface for dashboards and queries

The goal is not a perfect “lakehouse definition” debate. The goal is hands-on fluency: you’ll handle ingestion, schema drift, idempotency, transforms, and a real consumer (BI). Once you can do this locally, moving to managed tools in the cloud becomes a tooling decision — not a conceptual leap.

Why this is a great portfolio project
  • It shows end-to-end ownership (source → dashboard)
  • It produces tangible artifacts (tables, metrics, dashboards)
  • It’s demoable and reviewable (Docker + README + SQL)
  • It exposes “real-world” problems: late data, duplicates, bad types

Core concepts

Before you touch tools, lock in a few mental models. These are the “why” behind lakehouse-style projects — and they’re what you’ll talk about in interviews.

Lakehouse, in practical terms

A lakehouse combines two worlds: a data lake (cheap files, flexible storage) and a warehouse (fast SQL, governance, BI). In a home project, that usually means: keep raw data as files (Parquet), do transformations with SQL, and expose curated tables to dashboards.

Medallion architecture (Bronze / Silver / Gold)

The medallion pattern is less about buzzwords and more about debugging: you want to know whether a bug happened during ingestion, cleaning, or business transforms. Separate layers turn “mystery failures” into “oh, this broke in Silver because a timestamp changed format”.

Good layer boundaries

  • Bronze: “What did we receive?” (no interpretation)
  • Silver: “What does each field mean?” (types, dedupe, standardize)
  • Gold: “What does the business want?” (metrics, dimensions, marts)

Bad layer boundaries

  • Fixing business logic in Bronze
  • Letting Silver be “whatever the analyst needs today”
  • Putting dashboard-only calculations directly in BI

Idempotent pipelines

Idempotent means: if you run the pipeline twice for the same time window, you get the same result and you don’t duplicate data. This is the single most “production” concept you can demonstrate at home.

Idempotency checklist

  • Partition outputs by date (or another stable key)
  • Write to a temp location, then atomically move/rename
  • Use natural keys + MERGE / upserts for tables
  • Record run metadata (window, row counts, hash/checksum)
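The temp-then-atomic-move rule from the checklist can be shown in a few lines. This is a minimal sketch, not the project's actual code: the function name, JSONL format, and paths are illustrative. Re-running it for the same partition overwrites rather than appends.

```python
import json
import os
import tempfile
from pathlib import Path

def write_partition(base: Path, ingest_date: str, rows: list[dict]) -> Path:
    """Write a partition atomically: temp file first, then os.replace.

    Re-running for the same ingest_date replaces the partition instead
    of appending to it, which is what makes the run idempotent.
    """
    part_dir = base / f"ingest_date={ingest_date}"
    part_dir.mkdir(parents=True, exist_ok=True)
    final_path = part_dir / "data.jsonl"

    # Write to a temp file in the SAME directory so the rename is atomic.
    fd, tmp_name = tempfile.mkstemp(dir=part_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    os.replace(tmp_name, final_path)  # atomic on POSIX; overwrites prior runs
    return final_path
```

Running this twice for the same date leaves exactly one file with the contents of the last run — no duplicates, no half-written partitions visible to readers.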

Schema & data contracts

A “contract” is simply an agreement: which columns exist, what their types mean, and what values are allowed. Without contracts, every downstream query becomes a fragile guess.

The silent killer: schema drift

Many sources change slowly: a new field appears, a type changes, or a timestamp format flips. A robust mini lakehouse catches this in Silver (validation) instead of letting it corrupt Gold metrics.
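One lightweight way to catch drift in Silver is a declared contract plus a check function. The sketch below is an illustration, assuming pandas; the contract contents and function name are hypothetical, not part of any standard:

```python
import pandas as pd

# A minimal "contract" for a Bronze feed: required columns and the pandas
# dtype kind expected after a plain read ("O" = object/string, "f" = float).
WEATHER_CONTRACT = {
    "time": "O",            # raw string timestamps stay strings in Bronze
    "temperature_2m": "f",
    "precipitation": "f",
}

def check_contract(df: pd.DataFrame, contract: dict[str, str]) -> list[str]:
    """Return human-readable violations instead of letting drift reach Gold."""
    problems = []
    for col, kind in contract.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif df[col].dtype.kind != kind:
            problems.append(
                f"{col}: expected kind {kind!r}, got {df[col].dtype.kind!r}"
            )
    return problems
```

Call this at the start of the Silver step and fail fast when the returned list is non-empty; the messages double as your debugging breadcrumbs.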

Serving layer vs compute layer

Your compute layer reads and transforms (files → tables). Your serving layer is what tools query (stable SQL endpoint). For a home project, Postgres is a great serving layer: everyone understands it, Metabase connects easily, and you can show indexes, constraints, and query performance basics.

  • Primary job: compute transforms raw data into clean shapes; serving answers queries fast and consistently
  • Typical storage: compute reads Parquet in an object store; serving holds relational tables
  • Failure modes: compute fails on bad parsing, drift, and duplicates; serving fails as slow dashboards and inconsistent definitions

Step-by-step

This guide uses a “home-friendly” stack: MinIO (local S3), DuckDB (fast SQL over files), Postgres (serving warehouse), and Metabase (BI). You can swap tools later — the architecture stays the same.

Dataset suggestion

Use a source you can refresh daily without API keys. For example: weather (Open-Meteo), crypto prices, public transport feeds, GitHub issues, or any CSV that updates. The specifics don’t matter — the pipeline patterns do.

Step 0 — Create a repo structure that looks professional

A portfolio project should be readable in under 5 minutes. A clean repo is part of the deliverable.

  • infra/: Docker Compose files, env examples, setup scripts
  • pipelines/: ingestion + transform jobs (Python/SQL)
  • sql/: reusable SQL models (Silver/Gold)
  • docs/: architecture diagram, decisions, runbook
  • data/ (optional, gitignored): local dev outputs that mirror what MinIO stores
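The layout above can be scaffolded with a few commands. The file names are suggestions, not requirements; `-p` makes the scaffold safe to re-run:

```shell
# Scaffold the suggested repo layout; idempotent thanks to mkdir -p.
mkdir -p infra pipelines sql docs data
touch infra/docker-compose.yml infra/.env.example
touch pipelines/ingest_bronze.py sql/silver.sql sql/gold.sql
echo "data/" >> .gitignore   # keep local dev outputs out of git
```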

Step 1 — Run MinIO + Postgres + Metabase locally

The quickest “real stack” is a single Docker Compose file. Keep it boring and obvious: stable ports, named volumes, and one place to configure credentials.

services:
  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: ${MINIO_ROOT_USER:-minio}
      MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD:-minio12345}
    volumes:
      - minio_data:/data

  postgres:
    image: postgres:16
    ports:
      - "5432:5432"
    environment:
      POSTGRES_USER: ${PGUSER:-lakehouse}
      POSTGRES_PASSWORD: ${PGPASSWORD:-lakehouse123}
      POSTGRES_DB: ${PGDATABASE:-lakehouse}
    volumes:
      - pg_data:/var/lib/postgresql/data

  metabase:
    image: metabase/metabase:latest
    ports:
      - "3000:3000"
    depends_on:
      - postgres

volumes:
  minio_data:
  pg_data:

What to do after it starts
  • Open MinIO Console (port 9001) and create buckets: bronze, silver, gold
  • Create a schema in Postgres (e.g., analytics) for curated tables
  • Open Metabase (port 3000) and connect it to Postgres

Step 2 — Ingest into Bronze as Parquet (repeatable & partitioned)

Your ingestion job should be safe to re-run. A simple pattern: fetch data for a date window, write a Parquet file partitioned by ingest_date, and record a small run log. Even if you later switch to a scheduler, the job stays the same.

"""
Minimal Bronze ingestion example.

- Pulls time-series data from a public API (no keys).
- Writes Parquet partitioned by ingest_date.
- Creates an append-only run log (CSV/JSON is fine for a mini project).

Tip: keep raw fields as-is in Bronze. Clean types in Silver.
"""
from __future__ import annotations

import json
from datetime import date, timedelta
from pathlib import Path

import pandas as pd
import requests

BRONZE_DIR = Path("data/bronze/open_meteo")  # mirror your object store layout
RUNLOG_DIR = Path("data/runlogs")
BRONZE_DIR.mkdir(parents=True, exist_ok=True)
RUNLOG_DIR.mkdir(parents=True, exist_ok=True)

def fetch_day(lat: float, lon: float, day: date) -> pd.DataFrame:
    url = "https://api.open-meteo.com/v1/forecast"
    params = {
        "latitude": lat,
        "longitude": lon,
        "start_date": day.isoformat(),
        "end_date": day.isoformat(),
        "hourly": "temperature_2m,precipitation,wind_speed_10m",
        "timezone": "UTC",
    }
    r = requests.get(url, params=params, timeout=30)
    r.raise_for_status()
    payload = r.json()
    hourly = payload.get("hourly", {})
    df = pd.DataFrame(hourly)
    df["location_lat"] = lat
    df["location_lon"] = lon
    df["source"] = "open-meteo"
    df["ingest_date"] = day.isoformat()
    return df

def write_bronze(df: pd.DataFrame) -> Path:
    part = df["ingest_date"].iloc[0]
    out_dir = BRONZE_DIR / f"ingest_date={part}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "hourly.parquet"
    df.to_parquet(out_path, index=False)  # needs a Parquet engine installed (e.g. pyarrow)
    return out_path

def append_runlog(window_start: date, window_end: date, rows: int, files: list[str]) -> None:
    record = {
        "run_utc": pd.Timestamp.now(tz="UTC").isoformat(),
        "window_start": window_start.isoformat(),
        "window_end": window_end.isoformat(),
        "rows": rows,
        "files": files,
    }
    with (RUNLOG_DIR / "bronze_runs.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    # Example: last 3 days for one location
    lat, lon = 48.1486, 17.1077  # Bratislava-ish; replace with your target location
    end = date.today()
    start = end - timedelta(days=2)

    all_frames: list[pd.DataFrame] = []
    files: list[str] = []
    d = start
    while d <= end:
        df_day = fetch_day(lat, lon, d)
        out_file = write_bronze(df_day)
        all_frames.append(df_day)
        files.append(str(out_file))
        d += timedelta(days=1)

    df_all = pd.concat(all_frames, ignore_index=True)
    append_runlog(start, end, rows=len(df_all), files=files)

Avoid “Bronze = cleaned”

If you start “fixing” the source in Bronze, you lose your audit trail. Keep Bronze raw and immutable. Put all interpretation (types, dedupe, business rules) into Silver/Gold.

Step 3 — Transform with SQL (Silver) and publish marts (Gold)

For a mini lakehouse, SQL is your superpower. Use it to standardize types, handle nulls, and define stable business logic. The easiest pattern is: read Bronze Parquet with a compute engine, produce clean Silver views/tables, then write Gold tables to Postgres for BI.

-- DuckDB example: read Bronze Parquet, build Silver/Gold, and write Gold to Postgres.
-- Run in DuckDB CLI or via a small Python runner.
-- Note: You can also keep everything in Postgres if you prefer; the patterns stay the same.

-- 1) Read Bronze files (schema-on-read)
CREATE OR REPLACE VIEW bronze_weather AS
SELECT
  CAST(time AS TIMESTAMP) AS ts_utc,
  CAST(temperature_2m AS DOUBLE) AS temperature_c,
  CAST(precipitation AS DOUBLE) AS precipitation_mm,
  CAST(wind_speed_10m AS DOUBLE) AS wind_speed_kmh,
  CAST(location_lat AS DOUBLE) AS location_lat,
  CAST(location_lon AS DOUBLE) AS location_lon,
  CAST(source AS VARCHAR) AS source,
  CAST(ingest_date AS DATE) AS ingest_date
FROM read_parquet('data/bronze/open_meteo/ingest_date=*/hourly.parquet');

-- 2) Silver: standardize + basic quality rules
CREATE OR REPLACE TABLE silver_weather AS
SELECT
  ts_utc,
  temperature_c,
  precipitation_mm,
  wind_speed_kmh,
  location_lat,
  location_lon,
  source,
  ingest_date
FROM bronze_weather
WHERE ts_utc IS NOT NULL
  AND temperature_c IS NOT NULL;

-- 3) Gold: analytics mart (daily rollup)
CREATE OR REPLACE TABLE gold_weather_daily AS
SELECT
  DATE_TRUNC('day', ts_utc) AS day_utc,
  source,
  location_lat,
  location_lon,
  AVG(temperature_c) AS avg_temp_c,
  MAX(temperature_c) AS max_temp_c,
  MIN(temperature_c) AS min_temp_c,
  SUM(precipitation_mm) AS total_precip_mm,
  AVG(wind_speed_kmh) AS avg_wind_kmh,
  COUNT(*) AS hourly_points
FROM silver_weather
GROUP BY 1, 2, 3, 4;

-- 4) Publish Gold to Postgres (serving)
INSTALL postgres;
LOAD postgres;

ATTACH 'dbname=lakehouse user=lakehouse password=lakehouse123 host=localhost port=5432' AS pg (TYPE postgres);

CREATE SCHEMA IF NOT EXISTS pg.analytics;

CREATE OR REPLACE TABLE pg.analytics.weather_daily AS
SELECT * FROM gold_weather_daily;
-- If your DuckDB version rejects CREATE OR REPLACE on an attached database,
-- use DROP TABLE IF EXISTS followed by CREATE TABLE ... AS SELECT.

What to expose to BI (minimum)

  • One Gold fact/mart table (like weather_daily)
  • A clear date column and a stable grain (daily/hourly)
  • Human-readable dimensions (location, source, category)
  • Document metric definitions in the repo

Metabase dashboard ideas

  • Line chart: avg temperature over time
  • Bar chart: total precipitation by week
  • Table: top 10 windiest days
  • Filter: date range + location

Step 4 — Add tiny but real data quality checks

You don’t need a big testing framework to show maturity. Add a few checks that prove you can protect downstream users. Put them in your transform runner (or a separate “validate” step) and fail fast when something is off.

Practical checks for a mini lakehouse

  • Schema presence: required columns exist in Bronze reads
  • Null thresholds: key columns (timestamp, metric) not mostly null
  • Row-count sanity: today isn’t suddenly 10× smaller than yesterday
  • Uniqueness: expected unique key is actually unique (e.g., (ts_utc, lat, lon))
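All four checks fit in one small function your runner calls before publishing Gold. This is a sketch assuming pandas; the function name and thresholds (5% nulls, 10× row-count drop) are illustrative choices, not rules:

```python
import pandas as pd

def quality_report(today: pd.DataFrame, yesterday_rows: int,
                   key_cols: list[str], unique_key: list[str]) -> list[str]:
    """Tiny data-quality gate; returns failure messages so the runner can abort."""
    failures = []
    # Row-count sanity: today should not be drastically smaller than yesterday.
    if yesterday_rows > 0 and len(today) < yesterday_rows / 10:
        failures.append(f"row count dropped: {len(today)} vs {yesterday_rows}")
    # Schema presence + null thresholds on key columns.
    for col in key_cols:
        if col not in today.columns:
            failures.append(f"missing required column: {col}")
        elif today[col].isna().mean() > 0.05:
            failures.append(f"{col}: more than 5% nulls")
    # Uniqueness of the expected key.
    if all(c in today.columns for c in unique_key) and today.duplicated(unique_key).any():
        failures.append(f"duplicate rows for key {unique_key}")
    return failures
```

If the returned list is non-empty, raise and stop the run; the messages themselves are your example failure output for the README.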

Step 5 — Make it portfolio-ready (the “demo checklist”)

This is where a tutorial becomes a portfolio project. You’re aiming for an experience where someone can: clone your repo, run one command, and see a dashboard.

Artifacts to include

  • README with a 5-minute quickstart
  • An architecture diagram (even a simple box diagram)
  • Example queries (SQL snippets) for Gold tables
  • Screenshot(s) of the Metabase dashboard
  • A short “decisions” doc: why this stack, what you’d change in prod

Talking points (interview-ready)

  • How you ensured idempotency (partitions, upserts, run logs)
  • How you handled drift (Silver validation, typed casting)
  • How you defined metrics (Gold layer, documented grain)
  • How you would scale it (scheduler, incremental loads, table formats)

Common mistakes

Most mini lakehouse projects fail for the same reasons production pipelines fail — just on a smaller scale. Fix these, and your project becomes dramatically more credible.

Mistake 1 — Mixing raw and business logic

If Bronze contains “fixes”, you can’t trust it as an audit trail.

  • Fix: keep Bronze immutable; apply all casting/cleaning in Silver.
  • Fix: document exactly what each layer represents (one paragraph in README).

Mistake 2 — Non-idempotent ingestion (duplicate rows)

Re-running jobs should not duplicate data. Duplicates silently poison metrics.

  • Fix: partition outputs by a stable window (date/hour) and overwrite that partition.
  • Fix: use a unique key in Silver/Gold and dedupe explicitly.
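Explicit dedupe is a small, reusable pattern. A pandas sketch (names are illustrative); it mirrors SQL's ROW_NUMBER() OVER (PARTITION BY key ORDER BY ingested_at DESC) = 1 idiom:

```python
import pandas as pd

def dedupe_latest(df: pd.DataFrame, key: list[str], order_col: str) -> pd.DataFrame:
    """Keep one row per key, preferring the most recently ingested record."""
    return (
        df.sort_values(order_col)          # oldest first...
          .drop_duplicates(subset=key, keep="last")  # ...so "last" wins per key
          .reset_index(drop=True)
    )
```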

Mistake 3 — “One big table” with unclear grain

If you can’t say the grain, you can’t safely aggregate.

  • Fix: name your mart tables with the grain (e.g., weather_daily).
  • Fix: include a README snippet: “one row per day per location per source”.

Mistake 4 — Not validating timestamps/timezones

Time bugs are the #1 reason dashboards “look wrong”.

  • Fix: standardize to UTC in Silver; convert in BI if needed.
  • Fix: store both ts_utc and day_utc when doing daily rollups.
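In code, standardizing time once in Silver might look like the pandas sketch below (column names are assumptions based on the earlier examples):

```python
import pandas as pd

def standardize_time(df: pd.DataFrame, ts_col: str = "time") -> pd.DataFrame:
    """Parse timestamps as UTC once, in Silver, and derive the daily grain.

    Storing both ts_utc and day_utc means Gold rollups never re-derive
    the day from a possibly-local timestamp.
    """
    out = df.copy()
    out["ts_utc"] = pd.to_datetime(out[ts_col], utc=True)  # offsets normalized to UTC
    out["day_utc"] = out["ts_utc"].dt.floor("D").dt.date
    return out
```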

Mistake 5 — Building dashboards on Silver

Silver is for correctness; Gold is for consumption.

  • Fix: only connect BI to Gold/analytics schema.
  • Fix: keep metric definitions in Gold SQL, not in BI formulas.

Mistake 6 — No run metadata (can’t debug)

When something breaks, you need to answer: what ran, what changed, and what was produced.

  • Fix: write a small JSONL run log with window + row counts + output paths.
  • Fix: define a failure policy: abort the job when key checks fail.

How to prove quality quickly

Add a “Data Quality” section to your README with 3 checks and an example failure message. It instantly signals you know how pipelines behave in real life.

FAQ

Do I need Spark to build a mini lakehouse?

No. Spark is great at scale, but a mini lakehouse is about patterns, not cluster management. A fast local SQL engine (like DuckDB) + a serving database (like Postgres) is enough to demonstrate ingestion, layered transforms, and BI consumption.

What makes this a “lakehouse” and not just an ETL demo?

The separation of storage and serving. You keep raw data as files in an object store (the “lake”), you transform through explicit layers (Bronze/Silver/Gold), and you expose curated tables to BI (the “warehouse” behavior). The key is that you can reprocess from raw files whenever logic changes.

Should I choose ETL or ELT for a home project?

Prefer ELT-style thinking: land raw data first (Bronze), then transform with SQL (Silver/Gold). Even if some steps are “ETL-ish” (e.g., moving files), the principle is the same: don’t bake business logic into ingestion.

How do I handle late-arriving or corrected data?

Use partitions and upserts. Re-ingest the affected window (e.g., yesterday/today), then rebuild Silver/Gold for that window and overwrite or MERGE into serving tables. Keep the run log so you can prove what was recalculated.
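The MERGE/upsert half of that answer can be sketched with standard UPSERT syntax. SQLite stands in below only so the example is self-contained; Postgres accepts the same INSERT ... ON CONFLICT ... DO UPDATE shape (the table and column names are illustrative):

```python
import sqlite3

def upsert_daily(conn: sqlite3.Connection, day: str, avg_temp: float) -> None:
    """Insert a daily row, or update it in place if that day already exists."""
    conn.execute(
        """
        INSERT INTO weather_daily (day_utc, avg_temp_c)
        VALUES (?, ?)
        ON CONFLICT (day_utc) DO UPDATE SET avg_temp_c = excluded.avg_temp_c
        """,
        (day, avg_temp),
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather_daily (day_utc TEXT PRIMARY KEY, avg_temp_c REAL)")
upsert_daily(conn, "2024-01-01", 3.0)   # first load
upsert_daily(conn, "2024-01-01", 2.5)   # corrected data for the same day wins
```

Rebuilding a window and upserting it means late corrections replace old values instead of duplicating them.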

What should I put in Gold: star schema or one wide table?

Either is fine as long as you clearly define the grain and metric definitions. For a portfolio project, one wide mart table (one row per day per entity) is often easiest to explain and demo. If you want to flex modeling skills, create one fact + one dimension and show joins.

How can I show this project is “production-minded”?

Add reproducibility + checks. Docker Compose for infrastructure, scripted pipeline runs, small data quality checks, and versioned SQL transforms are the signals that matter most.

Cheatsheet

Scan this when you’re building (or debugging) your mini lakehouse at home.

Architecture (minimum viable)

  • Bronze: raw Parquet in object storage
  • Compute: SQL engine reads Bronze and produces Silver/Gold
  • Serving: Postgres tables in analytics schema
  • BI: Metabase connected only to analytics

Pipeline rules

  • Idempotent runs: overwrite by partition/window
  • Immutable Bronze: no business logic
  • Silver = types + dedupe + validation
  • Gold = metrics + marts + documented grain

Data quality (tiny but real)

  • Required columns exist
  • Null thresholds on keys
  • Row-count sanity per window
  • Uniqueness of expected keys

Portfolio “done” definition

  • One command boots infra (Docker)
  • One command runs pipeline (ingest + transform)
  • Gold tables are queryable in Postgres
  • Metabase dashboard exists + screenshot in repo
  • README explains grain, metrics, and reruns

If you get stuck

Debug in order: Bronze files (did we ingest?) → Silver table (types/filters) → Gold table (grouping/grain) → BI query (joins/filters/timezones). Most “BI bugs” are really grain or timezone bugs.

Wrap-up

You just designed a mini lakehouse the right way: raw data lands in an object-store-style lake, transformations are layered and reproducible, and BI queries a stable serving layer. That’s the whole game — the “cloud version” is just different tooling.

Next upgrades (optional, but impressive)

  • Incremental loads: ingest only new windows and upsert Gold tables
  • Orchestration: run daily via cron, then graduate to Airflow/Dagster/Prefect
  • Observability: store run logs in Postgres and build a “pipeline health” dashboard
  • Modeling: add a star schema mart (fact + dimension) and show join patterns
  • Table formats: experiment with Iceberg/Delta once the basics are solid
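For the incremental-loads upgrade, the JSONL run log from the ingestion step already contains what you need. A sketch (the function is hypothetical; the "window_end" field matches the earlier run-log example):

```python
import json
from datetime import date, timedelta
from pathlib import Path

def next_window(runlog: Path, default_start: date, today: date) -> tuple[date, date]:
    """Compute the next ingestion window from the JSONL run log.

    Returns (start, end) covering everything after the latest recorded
    window_end, so reruns pick up exactly where the pipeline left off.
    """
    last_end = None
    if runlog.exists():
        for line in runlog.read_text(encoding="utf-8").splitlines():
            rec = json.loads(line)
            end = date.fromisoformat(rec["window_end"])
            if last_end is None or end > last_end:
                last_end = end
    start = default_start if last_end is None else last_end + timedelta(days=1)
    return start, today
```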

One final polish move

Add a “Repro steps” section to your README with exact commands and expected outputs (row counts, table names). Reviewers love anything that makes verification fast.

Want to go deeper? This project pairs well with further reading on ETL vs ELT decisions, scalable SQL patterns, and Postgres performance moves you can apply to your serving layer.

Quiz

Quick self-check.

1) In a mini lakehouse, what should the Bronze layer contain?
2) What does “idempotent pipeline” mean?
3) Why do teams separate Silver and Gold layers?
4) What is the best practice for what BI should query in this project?