Building a Modern Data Lakehouse with Apache Iceberg, Spark, and AWS Glue

A practical guide to building a production data lakehouse — from raw ingestion to serving analytics — with architecture diagrams, code examples, and lessons from real deployments.

March 1, 2026 · 10 min read

The data lakehouse has moved from buzzword to production reality. Organizations that used to choose between the flexibility of a data lake and the reliability of a data warehouse now get both — with Apache Iceberg as the table format, Spark as the compute engine, and AWS Glue as the catalog and orchestration layer.

At CData Consulting, we have deployed this pattern for clients ranging from mid-market companies to enterprises managing petabytes. This post walks through the production architecture we use, with real code and the design decisions that matter.

Why a Lakehouse? The Problem It Solves

Traditional architectures force a choice. A data lake (S3 + Parquet) gives you cheap storage, any format, and schema-on-read flexibility — but no transactions, no schema enforcement, and no time travel. A data warehouse (Redshift/Snowflake) gives you ACID transactions, schema enforcement, and time travel — but is expensive at scale, creates vendor lock-in, and couples compute with storage.

A data lakehouse (S3 + Iceberg + Spark) combines both: cheap object storage on S3, ACID transactions via Iceberg snapshots, schema evolution without rewrites, time travel and rollback, open format queryable from Spark, Trino, Athena, and Snowflake, and fully decoupled compute and storage.

Reference Architecture

The key insight: raw data stays in its original format, curated data lives in Iceberg tables, and any engine can query through the Glue Catalog.

Layer 1: Raw Ingestion

Raw data lands in S3 organized by source and date. No transformation yet — just land it safely. We use a consistent path structure: s3://lake/raw/{source}/{date}/data.parquet. The critical design decision here is that raw data stays immutable. Never modify it. If upstream changes schema, the old data still exists in its original format. This is your safety net.
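The path convention above can be captured in a small helper. This is a sketch with placeholder bucket and source names, not code from a real deployment:

```python
from datetime import date

def raw_path(source: str, dt: date, bucket: str = "lake") -> str:
    """Build the immutable raw-zone key: s3://{bucket}/raw/{source}/{date}/data.parquet"""
    return f"s3://{bucket}/raw/{source}/{dt.isoformat()}/data.parquet"

# Example: where the March 1 drop of a hypothetical "orders" feed lands.
print(raw_path("orders", date(2026, 3, 1)))
# s3://lake/raw/orders/2026-03-01/data.parquet
```

Because the path encodes source and date, a bad upstream day can be quarantined or replayed by prefix without touching anything else in the raw zone.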

Layer 2: Catalog with AWS Glue

The Glue Crawler auto-discovers schemas and registers tables in the catalog. We configure crawlers to scan the raw zone daily, with DeleteBehavior set to LOG instead of DELETE_FROM_DATABASE. If a source table disappears, we want an alert — not silent deletion of the catalog entry. The Glue Data Catalog becomes the single metadata store that Spark, Athena, Trino, and Snowflake all reference.
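The crawler setup above can be sketched as the keyword arguments to boto3's `create_crawler` call. The role ARN, database name, schedule, and bucket are placeholder values:

```python
# Crawler settings for the raw zone. DeleteBehavior="LOG" records a missing
# source table in the crawl log instead of silently deleting the catalog entry.
crawler_config = {
    "Name": "raw-zone-daily",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder ARN
    "DatabaseName": "raw",
    "Targets": {"S3Targets": [{"Path": "s3://lake/raw/"}]},
    "Schedule": "cron(0 2 * * ? *)",  # daily at 02:00 UTC
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",  # alert on disappearance, never delete
    },
}

# To apply (requires AWS credentials and the boto3 package):
#   import boto3
#   boto3.client("glue").create_crawler(**crawler_config)
```

Pair the `LOG` behavior with an alert on the crawl logs so a vanished source table is a page, not a mystery.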

Layer 3: Transform with Spark + Iceberg

This is where data becomes useful. Spark reads from the raw zone, cleans and enriches, then writes to Iceberg tables. We configure Spark with the Iceberg Spark Catalog backed by GlueCatalog, pointing the warehouse to s3://lake/curated/. The pipeline reads raw Parquet, validates row counts (halting if more than 5% of rows are dropped), enriches with dimension tables using broadcast joins for small lookups, and writes atomically to Iceberg using Zstd-compressed Parquet partitioned by date and country.
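A minimal sketch of the session configuration and the row-count guard described above, assuming the iceberg-spark-runtime and iceberg-aws jars are on the Spark classpath. The catalog name `glue`, the warehouse path, and the 5% threshold mirror the text; everything else is illustrative:

```python
# Iceberg catalog backed by AWS Glue, warehouse in the curated zone.
iceberg_conf = {
    "spark.sql.extensions":
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.glue": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.glue.warehouse": "s3://lake/curated/",
}

def check_row_drop(raw_count: int, clean_count: int, max_drop: float = 0.05) -> float:
    """Halt the pipeline if validation drops more than max_drop of the rows."""
    dropped = (raw_count - clean_count) / raw_count
    if dropped > max_drop:
        raise ValueError(f"dropped {dropped:.1%} of rows, above {max_drop:.0%} limit")
    return dropped

# With pyspark installed, the session is built from the conf dict:
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder.appName("curate")
#   for k, v in iceberg_conf.items():
#       builder = builder.config(k, v)
#   spark = builder.getOrCreate()
```

The guard runs between the cleaning step and the Iceberg write, so a bad upstream drop fails the job before anything is committed.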

Iceberg Table Management

Iceberg is not just about writing tables. Its power is in table maintenance — the operations that keep your lakehouse fast and cost-efficient. Schema evolution is instant: add or rename columns without rewriting data. Old data returns NULL for new columns. Time travel lets you query historical snapshots or roll back bad writes. Compaction merges small files (critical for streaming writes that create thousands of tiny files). Snapshot expiry controls storage costs by cleaning up old metadata.
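The four operations above can be expressed as Spark SQL statements run through `spark.sql(...)`. The catalog name `glue`, the table `analytics.events`, the snapshot id, and the timestamp are illustrative placeholders:

```python
# Iceberg maintenance as Spark SQL, in the order discussed above.
maintenance_sql = [
    # Schema evolution: metadata-only change, no data files rewritten.
    "ALTER TABLE glue.analytics.events ADD COLUMN referrer STRING",
    # Time travel: read a historical snapshot by id (see the snapshots
    # metadata table for real ids).
    "SELECT * FROM glue.analytics.events VERSION AS OF 1234567890123456789",
    # Compaction: merge small files produced by streaming writes.
    "CALL glue.system.rewrite_data_files(table => 'analytics.events')",
    # Snapshot expiry: drop snapshots older than the cutoff, keep the latest 10.
    "CALL glue.system.expire_snapshots(table => 'analytics.events', "
    "older_than => TIMESTAMP '2026-02-22 00:00:00', retain_last => 10)",
]
```

The two `CALL` procedures are the ones worth scheduling; the first two are ad-hoc tools you reach for during schema changes and incident debugging.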

Lessons from Our Client Deployments

1. Partition by date first, then by high-cardinality dimension. A common mistake is partitioning by too many columns. event_date + country works well. Adding campaign_id + device_type creates millions of tiny files.

2. Compaction is not optional. Streaming writes create thousands of small files. Schedule rewrite_data_files daily. Without it, query performance degrades within a week.

3. Set snapshot expiry from day one. Each Iceberg snapshot keeps references to all data files. Without expiry, metadata grows unbounded. We expire snapshots older than 7 days and keep the latest 10.

4. Use Zstd compression, not Snappy. Zstd gives 30–40% better compression than Snappy with comparable read speed. At petabyte scale, that is real money on S3 storage.

5. The Glue Catalog is your single source of truth. Every engine — Spark, Athena, Trino, Snowflake — should read table definitions from the same catalog. If teams create ad-hoc tables outside the catalog, you lose governance.
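Lessons 1 and 4 meet at table-creation time: the partition spec and the compression codec are both declared in the DDL. A sketch, with an illustrative table name and schema:

```python
# Date-first partitioning plus a low-cardinality dimension, and Zstd set
# as a table property so every writer inherits it.
create_events_sql = """
CREATE TABLE glue.analytics.events (
    event_id   BIGINT,
    event_ts   TIMESTAMP,
    event_date DATE,
    country    STRING,
    payload    STRING
)
USING iceberg
PARTITIONED BY (event_date, country)
TBLPROPERTIES (
    'write.parquet.compression-codec' = 'zstd'
)
"""
```

Declaring the codec as a table property, rather than per-job Spark config, is what keeps ad-hoc writers from quietly reverting to Snappy.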

When to Use This Architecture

This pattern works best when you have multiple consumers (analytics, ML, reporting) reading the same data, you need engine flexibility without lock-in, your data is too large for a traditional warehouse to be cost-effective, you need schema evolution without painful migrations, and you want time travel for debugging and compliance. If your data fits in a single Postgres instance, you do not need a lakehouse. Use the right tool for the scale.

At CData Consulting, we help organizations design, build, and operate modern data platforms. Whether you are planning a lakehouse migration or optimizing an existing one, reach out — we would love to help.

Frequently Asked Questions

What is the difference between a data lake and a data lakehouse?

A data lake stores raw files (Parquet, JSON, CSV) on cheap object storage like S3 but lacks transactions, schema enforcement, and time travel. A data lakehouse adds a table format like Apache Iceberg on top of the lake, providing ACID transactions, schema evolution, and time travel while keeping the cost and flexibility of object storage.

Why use Apache Iceberg instead of Delta Lake?

Iceberg is engine-agnostic — it works equally well with Spark, Trino, Athena, Snowflake, and Flink. Delta Lake is tightly coupled with Databricks. If you need multi-engine interoperability and want to avoid vendor lock-in, Iceberg is the stronger choice.

How does AWS Glue fit into the lakehouse architecture?

AWS Glue serves two roles: the Data Catalog provides centralized metadata (table schemas, partition info, statistics) that any compute engine can reference, and Glue ETL provides managed Spark jobs for transformation. The catalog is the more critical piece — it is the single source of truth for table definitions.

How much does a lakehouse architecture cost compared to Snowflake?

At petabyte scale, a lakehouse on S3 + Iceberg typically costs 50–70% less than an equivalent Snowflake deployment. The savings come from decoupled storage (S3 at $0.023/GB vs Snowflake compressed storage pricing) and flexible compute (spot instances, auto-scaling Spark). At smaller scales under 10TB, the operational complexity may not justify the savings.

Need help building your data platform?

At CData Consulting, we design, build, and operate modern data infrastructure for companies across North America. Whether you are planning a migration, optimizing costs, or building from scratch — let's talk.