A “mini lakehouse” is the fastest way to learn real data engineering without getting lost in enterprise complexity: you’ll ingest data into a raw file lake, transform it into analytics-ready tables, and ship a simple BI dashboard on top. This post walks you through a home-friendly stack that fits in Docker and produces portfolio-grade artifacts: versioned data layers, repeatable runs, and a dashboard you can demo in 2 minutes.
Quickstart
If you only have an afternoon, build the smallest version of a lakehouse that still teaches the important lessons: idempotent ingestion, layered transforms, and a serving layer your BI tool can query.
1) Pick one dataset + one business question
Constraints create clarity. Your project is easier to explain when it has a single “story”.
- Choose a source you can re-run (public API, CSV drop, or open dataset)
- Write one question: “What do we want to track weekly?”
- Define 3–5 core metrics (counts, rates, averages, top-N)
2) Stand up the local stack
Run it locally with Docker so anyone (including future-you) can reproduce it.
- Object storage (local S3): MinIO
- Serving warehouse: Postgres
- BI: Metabase
- Keep credentials in a local `.env` (never commit)
3) Ingest to a Bronze layer (raw files)
Store raw data in immutable files so you can reprocess and debug later.
- Write Parquet (compressed, columnar)
- Partition by date (e.g., `ingest_date=YYYY-MM-DD`)
- Write a tiny run log (what was ingested, when)
4) Produce one Gold table + one dashboard
A portfolio project should end with a demo: a table someone can query and a chart someone can read.
- Create a clean dimension + fact (or one wide “mart” table)
- Load into Postgres for BI
- Build 3 simple charts (trend, top-N, breakdown)
Your one-sentence pitch: “This project ingests raw data into an object store, transforms it through Bronze/Silver/Gold layers, and serves analytics tables to a BI dashboard — all reproducible via Docker and versioned code.”
Overview
“Build a Mini Lakehouse at Home” is a portfolio-first project: you’re not trying to clone a full enterprise platform. You’re building a compact system that demonstrates the same engineering principles hiring managers look for: reproducibility, clear data contracts, and the ability to debug failures end-to-end.
What you’ll build
| Layer | What lives there | Why it exists |
|---|---|---|
| Bronze (Raw) | Immutable files (Parquet) + ingestion logs | Re-runs, audits, and debugging (“what did we get?”) |
| Silver (Clean) | Standardized schema, types, deduped records | Consistency and correctness (“what does it mean?”) |
| Gold (Marts) | Analytics-ready tables (facts/dims or wide marts) | Fast BI and clear business definitions |
| Serving | Postgres tables exposed to BI | Stable interface for dashboards and queries |
The goal is not a perfect “lakehouse definition” debate. The goal is hands-on fluency: you’ll handle ingestion, schema drift, idempotency, transforms, and a real consumer (BI). Once you can do this locally, moving to managed tools in the cloud becomes a tooling decision — not a conceptual leap.
Why this works as a portfolio project:
- It shows end-to-end ownership (source → dashboard)
- It produces tangible artifacts (tables, metrics, dashboards)
- It’s demoable and reviewable (Docker + README + SQL)
- It exposes “real-world” problems: late data, duplicates, bad types
Core concepts
Before you touch tools, lock in a few mental models. These are the “why” behind lakehouse-style projects — and they’re what you’ll talk about in interviews.
Lakehouse, in practical terms
A lakehouse combines two worlds: a data lake (cheap files, flexible storage) and a warehouse (fast SQL, governance, BI). In a home project, that usually means: keep raw data as files (Parquet), do transformations with SQL, and expose curated tables to dashboards.
Medallion architecture (Bronze / Silver / Gold)
The medallion pattern is less about buzzwords and more about debugging: you want to know whether a bug happened during ingestion, cleaning, or business transforms. Separate layers turn “mystery failures” into “oh, this broke in Silver because a timestamp changed format”.
Good layer boundaries
- Bronze: “What did we receive?” (no interpretation)
- Silver: “What does each field mean?” (types, dedupe, standardize)
- Gold: “What does the business want?” (metrics, dimensions, marts)
Bad layer boundaries
- Fixing business logic in Bronze
- Letting Silver be “whatever the analyst needs today”
- Putting dashboard-only calculations directly in BI
Idempotent pipelines
Idempotent means: if you run the pipeline twice for the same time window, you get the same result and you don’t duplicate data. This is the single most “production” concept you can demonstrate at home.
Idempotency checklist
- Partition outputs by date (or another stable key)
- Write to a temp location, then atomically move/rename
- Use natural keys + `MERGE`/upserts for tables
- Record run metadata (window, row counts, hash/checksum)
Schema & data contracts
A “contract” is simply an agreement: which columns exist, what their types mean, and what values are allowed. Without contracts, every downstream query becomes a fragile guess.
Many sources change slowly: a new field appears, a type changes, or a timestamp format flips. A robust mini lakehouse catches this in Silver (validation) instead of letting it corrupt Gold metrics.
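A contract check in Silver can be very small. Here is a sketch using the weather example's column names; the `CONTRACT` dict and helper name are illustrative, not part of any library:

```python
import pandas as pd

# Illustrative contract for the weather example: column -> type family we must be able to cast to.
CONTRACT = {
    "time": "datetime",
    "temperature_2m": "float",
    "precipitation": "float",
    "ingest_date": "str",
}


def validate_contract(df: pd.DataFrame) -> list[str]:
    """Return human-readable violations; an empty list means the batch passes."""
    problems: list[str] = []
    for col, kind in CONTRACT.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
            continue
        try:
            # Attempt the cast Silver would perform; failure here means drift.
            if kind == "datetime":
                pd.to_datetime(df[col])
            elif kind == "float":
                pd.to_numeric(df[col])
        except (ValueError, TypeError):
            problems.append(f"column {col} failed cast to {kind}")
    return problems
```

Run it before writing Silver and fail the job on a non-empty result, so drift surfaces as a loud error instead of corrupted Gold metrics.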
Serving layer vs compute layer
Your compute layer reads and transforms (files → tables). Your serving layer is what tools query (stable SQL endpoint). For a home project, Postgres is a great serving layer: everyone understands it, Metabase connects easily, and you can show indexes, constraints, and query performance basics.
| Concern | Compute (files + engine) | Serving (warehouse) |
|---|---|---|
| Primary job | Transform raw data into clean shapes | Answer queries fast and consistently |
| Typical storage | Parquet in object store | Relational tables |
| Failure mode | Bad parsing, drift, duplicates | Slow dashboards, inconsistent definitions |
Step-by-step
This guide uses a “home-friendly” stack: MinIO (local S3), DuckDB (fast SQL over files), Postgres (serving warehouse), and Metabase (BI). You can swap tools later — the architecture stays the same.
Use a source you can refresh daily without API keys. For example: weather (Open-Meteo), crypto prices, public transport feeds, GitHub issues, or any CSV that updates. The specifics don’t matter — the pipeline patterns do.
Step 0 — Create a repo structure that looks professional
A portfolio project should be readable in under 5 minutes. A clean repo is part of the deliverable.
| Path | What goes there |
|---|---|
| `infra/` | Docker compose, env examples, setup scripts |
| `pipelines/` | Ingestion + transform jobs (Python/SQL) |
| `sql/` | Reusable SQL models (Silver/Gold) |
| `docs/` | Architecture diagram, decisions, runbook |
| `data/` (optional local) | Local dev outputs (gitignored), mirrors what MinIO stores |
Step 1 — Run MinIO + Postgres + Metabase locally
The quickest “real stack” is a single Docker Compose file. Keep it boring and obvious: stable ports, named volumes, and one place to configure credentials.
```yaml
services:
  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: ${MINIO_ROOT_USER:-minio}
      MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD:-minio12345}
    volumes:
      - minio_data:/data

  postgres:
    image: postgres:16
    ports:
      - "5432:5432"
    environment:
      POSTGRES_USER: ${PGUSER:-lakehouse}
      POSTGRES_PASSWORD: ${PGPASSWORD:-lakehouse123}
      POSTGRES_DB: ${PGDATABASE:-lakehouse}
    volumes:
      - pg_data:/var/lib/postgresql/data

  metabase:
    image: metabase/metabase:latest
    ports:
      - "3000:3000"
    depends_on:
      - postgres

volumes:
  minio_data:
  pg_data:
```
- Open the MinIO Console (port 9001) and create buckets: `bronze`, `silver`, `gold`
- Create a schema in Postgres (e.g., `analytics`) for curated tables
- Open Metabase (port 3000) and connect it to Postgres
Step 2 — Ingest into Bronze as Parquet (repeatable & partitioned)
Your ingestion job should be safe to re-run. A simple pattern: fetch data for a date window, write a Parquet file partitioned by `ingest_date`, and record a small run log. Even if you later switch to a scheduler, the job stays the same.
```python
"""
Minimal Bronze ingestion example.

- Pulls time-series data from a public API (no keys).
- Writes Parquet partitioned by ingest_date.
- Creates an append-only run log (CSV/JSON is fine for a mini project).

Tip: keep raw fields as-is in Bronze. Clean types in Silver.
"""
from __future__ import annotations

import json
from datetime import date, datetime, timedelta, timezone
from pathlib import Path

import pandas as pd
import requests

BRONZE_DIR = Path("data/bronze/open_meteo")  # mirror your object store layout
RUNLOG_DIR = Path("data/runlogs")
BRONZE_DIR.mkdir(parents=True, exist_ok=True)
RUNLOG_DIR.mkdir(parents=True, exist_ok=True)


def fetch_day(lat: float, lon: float, day: date) -> pd.DataFrame:
    url = "https://api.open-meteo.com/v1/forecast"
    params = {
        "latitude": lat,
        "longitude": lon,
        "start_date": day.isoformat(),
        "end_date": day.isoformat(),
        "hourly": "temperature_2m,precipitation,wind_speed_10m",
        "timezone": "UTC",
    }
    r = requests.get(url, params=params, timeout=30)
    r.raise_for_status()
    payload = r.json()

    hourly = payload.get("hourly", {})
    df = pd.DataFrame(hourly)
    df["location_lat"] = lat
    df["location_lon"] = lon
    df["source"] = "open-meteo"
    df["ingest_date"] = day.isoformat()
    return df


def write_bronze(df: pd.DataFrame) -> Path:
    part = df["ingest_date"].iloc[0]
    out_dir = BRONZE_DIR / f"ingest_date={part}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "hourly.parquet"
    df.to_parquet(out_path, index=False)
    return out_path


def append_runlog(window_start: date, window_end: date, rows: int, files: list[str]) -> None:
    record = {
        "run_utc": datetime.now(timezone.utc).isoformat(),
        "window_start": window_start.isoformat(),
        "window_end": window_end.isoformat(),
        "rows": rows,
        "files": files,
    }
    with (RUNLOG_DIR / "bronze_runs.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    # Example: last 3 days for one location
    lat, lon = 48.1486, 17.1077  # Bratislava-ish; replace with your target location
    end = date.today()
    start = end - timedelta(days=2)

    all_frames: list[pd.DataFrame] = []
    files: list[str] = []
    d = start
    while d <= end:
        df_day = fetch_day(lat, lon, d)
        out_file = write_bronze(df_day)
        all_frames.append(df_day)
        files.append(str(out_file))
        d += timedelta(days=1)

    df_all = pd.concat(all_frames, ignore_index=True)
    append_runlog(start, end, rows=len(df_all), files=files)
```
If you start “fixing” the source in Bronze, you lose your audit trail. Keep Bronze raw and immutable. Put all interpretation (types, dedupe, business rules) into Silver/Gold.
Step 3 — Transform with SQL (Silver) and publish marts (Gold)
For a mini lakehouse, SQL is your superpower. Use it to standardize types, handle nulls, and define stable business logic. The easiest pattern is: read Bronze Parquet with a compute engine, produce clean Silver views/tables, then write Gold tables to Postgres for BI.
```sql
-- DuckDB example: read Bronze Parquet, build Silver/Gold, and write Gold to Postgres.
-- Run in DuckDB CLI or via a small Python runner.
-- Note: You can also keep everything in Postgres if you prefer; the patterns stay the same.

-- 1) Read Bronze files (schema-on-read)
CREATE OR REPLACE VIEW bronze_weather AS
SELECT
  CAST(time AS TIMESTAMP)            AS ts_utc,
  CAST(temperature_2m AS DOUBLE)     AS temperature_c,
  CAST(precipitation AS DOUBLE)      AS precipitation_mm,
  CAST(wind_speed_10m AS DOUBLE)     AS wind_speed_kmh,
  CAST(location_lat AS DOUBLE)       AS location_lat,
  CAST(location_lon AS DOUBLE)       AS location_lon,
  CAST(source AS VARCHAR)            AS source,
  CAST(ingest_date AS DATE)          AS ingest_date
FROM read_parquet('data/bronze/open_meteo/ingest_date=*/hourly.parquet');

-- 2) Silver: standardize + basic quality rules
CREATE OR REPLACE TABLE silver_weather AS
SELECT
  ts_utc,
  temperature_c,
  precipitation_mm,
  wind_speed_kmh,
  location_lat,
  location_lon,
  source,
  ingest_date
FROM bronze_weather
WHERE ts_utc IS NOT NULL
  AND temperature_c IS NOT NULL;

-- 3) Gold: analytics mart (daily rollup)
CREATE OR REPLACE TABLE gold_weather_daily AS
SELECT
  DATE_TRUNC('day', ts_utc) AS day_utc,
  source,
  location_lat,
  location_lon,
  AVG(temperature_c)        AS avg_temp_c,
  MAX(temperature_c)        AS max_temp_c,
  MIN(temperature_c)        AS min_temp_c,
  SUM(precipitation_mm)     AS total_precip_mm,
  AVG(wind_speed_kmh)       AS avg_wind_kmh,
  COUNT(*)                  AS hourly_points
FROM silver_weather
GROUP BY 1, 2, 3, 4;

-- 4) Publish Gold to Postgres (serving)
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=lakehouse user=lakehouse password=lakehouse123 host=localhost port=5432' AS pg (TYPE postgres);
CREATE SCHEMA IF NOT EXISTS pg.analytics;
CREATE OR REPLACE TABLE pg.analytics.weather_daily AS
SELECT * FROM gold_weather_daily;
```
What to expose to BI (minimum)
- One Gold fact/mart table (like `weather_daily`)
- A clear date column and a stable grain (daily/hourly)
- Human-readable dimensions (location, source, category)
- Document metric definitions in the repo
Metabase dashboard ideas
- Line chart: avg temperature over time
- Bar chart: total precipitation by week
- Table: top 10 windiest days
- Filter: date range + location
Step 4 — Add tiny but real data quality checks
You don’t need a big testing framework to show maturity. Add a few checks that prove you can protect downstream users. Put them in your transform runner (or a separate “validate” step) and fail fast when something is off.
Practical checks for a mini lakehouse
- Schema presence: required columns exist in Bronze reads
- Null thresholds: key columns (timestamp, metric) not mostly null
- Row-count sanity: today isn’t suddenly 10× smaller than yesterday
- Uniqueness: the expected unique key is actually unique (e.g., `(ts_utc, lat, lon)`)
Step 5 — Make it portfolio-ready (the “demo checklist”)
This is where a tutorial becomes a portfolio project. You’re aiming for an experience where someone can: clone your repo, run one command, and see a dashboard.
Artifacts to include
- `README` with a 5-minute quickstart
- An architecture diagram (even a simple box diagram)
- Example queries (SQL snippets) for Gold tables
- Screenshot(s) of the Metabase dashboard
- A short “decisions” doc: why this stack, what you’d change in prod
Talking points (interview-ready)
- How you ensured idempotency (partitions, upserts, run logs)
- How you handled drift (Silver validation, typed casting)
- How you defined metrics (Gold layer, documented grain)
- How you would scale it (scheduler, incremental loads, table formats)
Common mistakes
Most mini lakehouse projects fail for the same reasons production pipelines fail — just on a smaller scale. Fix these, and your project becomes dramatically more credible.
Mistake 1 — Mixing raw and business logic
If Bronze contains “fixes”, you can’t trust it as an audit trail.
- Fix: keep Bronze immutable; apply all casting/cleaning in Silver.
- Fix: document exactly what each layer represents (one paragraph in README).
Mistake 2 — Non-idempotent ingestion (duplicate rows)
Re-running jobs should not duplicate data. Duplicates silently poison metrics.
- Fix: partition outputs by a stable window (date/hour) and overwrite that partition.
- Fix: use a unique key in Silver/Gold and dedupe explicitly.
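The “dedupe explicitly” fix is often a single pandas expression: sort by ingest time so the latest record wins, then drop duplicates on the key. The values below are invented for illustration:

```python
import pandas as pd

# Illustrative: two ingestion runs delivered the same hour; keep the latest record per key.
df = pd.DataFrame({
    "ts_utc": ["2024-01-01T00:00", "2024-01-01T00:00", "2024-01-01T01:00"],
    "ingest_date": ["2024-01-01", "2024-01-02", "2024-01-01"],
    "temperature_c": [3.1, 3.2, 2.9],
})

deduped = (
    df.sort_values("ingest_date")                         # oldest ingest first
      .drop_duplicates(subset=["ts_utc"], keep="last")    # latest ingest wins per key
      .reset_index(drop=True)
)
```

In the real project the subset would be your full natural key, e.g. `(ts_utc, location_lat, location_lon)`.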
Mistake 3 — “One big table” with unclear grain
If you can’t say the grain, you can’t safely aggregate.
- Fix: name your mart tables with the grain (e.g.,
weather_daily). - Fix: include a README snippet: “one row per day per location per source”.
Mistake 4 — Not validating timestamps/timezones
Time bugs are the #1 reason dashboards “look wrong”.
- Fix: standardize to UTC in Silver; convert in BI if needed.
- Fix: store both `ts_utc` and `day_utc` when doing daily rollups.
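For the UTC standardization, pandas handles mixed offsets directly; the timestamps below are invented for illustration:

```python
import pandas as pd

# Illustrative: a source that mixes timezone offsets across rows.
raw = pd.Series(["2024-01-01T12:00:00+02:00", "2024-01-01T11:00:00+01:00"])

# utc=True converts every offset into one tz-aware UTC column (both rows = 10:00 UTC).
ts_utc = pd.to_datetime(raw, utc=True)

# Derive the daily rollup key from the UTC timestamp, not the local one.
day_utc = ts_utc.dt.floor("D")
```

Doing this once in Silver means every downstream layer (and every BI chart) agrees on what “the same hour” means.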
Mistake 5 — Building dashboards on Silver
Silver is for correctness; Gold is for consumption.
- Fix: only connect BI to Gold/analytics schema.
- Fix: keep metric definitions in Gold SQL, not in BI formulas.
Mistake 6 — No run metadata (can’t debug)
When something breaks, you need to answer: what ran, what changed, and what was produced.
- Fix: write a small JSONL run log with window + row counts + output paths.
- Fix: fail the run loudly when key checks fail, instead of letting bad data flow downstream.
Add a “Data Quality” section to your README with 3 checks and an example failure message. It instantly signals you know how pipelines behave in real life.
FAQ
Do I need Spark to build a mini lakehouse?
No. Spark is great at scale, but a mini lakehouse is about patterns, not cluster management. A fast local SQL engine (like DuckDB) + a serving database (like Postgres) is enough to demonstrate ingestion, layered transforms, and BI consumption.
What makes this a “lakehouse” and not just an ETL demo?
The separation of storage and serving. You keep raw data as files in an object store (the “lake”), you transform through explicit layers (Bronze/Silver/Gold), and you expose curated tables to BI (the “warehouse” behavior). The key is that you can reprocess from raw files whenever logic changes.
Should I choose ETL or ELT for a home project?
Prefer ELT-style thinking: land raw data first (Bronze), then transform with SQL (Silver/Gold). Even if some steps are “ETL-ish” (e.g., moving files), the principle is the same: don’t bake business logic into ingestion.
How do I handle late-arriving or corrected data?
Use partitions and upserts. Re-ingest the affected window (e.g., yesterday/today), then rebuild Silver/Gold for that window and overwrite or `MERGE` into serving tables. Keep the run log so you can prove what was recalculated.
What should I put in Gold: star schema or one wide table?
Either is fine as long as you clearly define the grain and metric definitions. For a portfolio project, one wide mart table (one row per day per entity) is often easiest to explain and demo. If you want to flex modeling skills, create one fact + one dimension and show joins.
How can I show this project is “production-minded”?
Add reproducibility + checks. Docker Compose for infrastructure, scripted pipeline runs, small data quality checks, and versioned SQL transforms are the signals that matter most.
Cheatsheet
Scan this when you’re building (or debugging) your mini lakehouse at home.
Architecture (minimum viable)
- Bronze: raw Parquet in object storage
- Compute: SQL engine reads Bronze and produces Silver/Gold
- Serving: Postgres tables in the `analytics` schema
- BI: Metabase connected only to `analytics`
Pipeline rules
- Idempotent runs: overwrite by partition/window
- Immutable Bronze: no business logic
- Silver = types + dedupe + validation
- Gold = metrics + marts + documented grain
Data quality (tiny but real)
- Required columns exist
- Null thresholds on keys
- Row-count sanity per window
- Uniqueness of expected keys
Portfolio “done” definition
- One command boots infra (Docker)
- One command runs pipeline (ingest + transform)
- Gold tables are queryable in Postgres
- Metabase dashboard exists + screenshot in repo
- README explains grain, metrics, and reruns
Debug in order: Bronze files (did we ingest?) → Silver table (types/filters) → Gold table (grouping/grain) → BI query (joins/filters/timezones). Most “BI bugs” are really grain or timezone bugs.
Wrap-up
You just designed a mini lakehouse the right way: raw data lands in an object-store-style lake, transformations are layered and reproducible, and BI queries a stable serving layer. That’s the whole game — the “cloud version” is just different tooling.
Next upgrades (optional, but impressive)
- Incremental loads: ingest only new windows and upsert Gold tables
- Orchestration: run daily via cron, then graduate to Airflow/Dagster/Prefect
- Observability: store run logs in Postgres and build a “pipeline health” dashboard
- Modeling: add a star schema mart (fact + dimension) and show join patterns
- Table formats: experiment with Iceberg/Delta once the basics are solid
Add a “Repro steps” section to your README with exact commands and expected outputs (row counts, table names). Reviewers love anything that makes verification fast.
Want to go deeper? The related posts below pair well with this project: ETL vs ELT decisions, scalable SQL patterns, and Postgres performance moves you can apply to your serving layer.