Apache Iceberg and the Modern Data Lakehouse

How an open table format quietly became the foundation of modern analytics

by Daniil Lypenets

Full Stack Developer

May, 2026 · 15 min read

Introduction

For most of analytics history, you had two choices for storing data at scale.

A data warehouse gave you structure, transactions, and fast queries — but locked you into one vendor's compute, charged you for storage at a premium, and made it painful to share data across systems.

A data lake gave you cheap object storage, vendor independence, and the flexibility to dump anything into it — but lost transactions, schema enforcement, and most of what made warehouses pleasant to query.

For years, the conventional wisdom was that you needed both. You loaded data into the lake, ran some pipeline jobs, and copied curated subsets into the warehouse for analytics. Two systems, two copies of the truth, twice the bills, twice the bugs.

Apache Iceberg quietly destroyed that assumption. By 2026, it has become the default open standard for storing analytical data, supported by every major data platform — Snowflake, Databricks, BigQuery, Trino, Spark, Flink, and more. This article explains what Iceberg actually does, why it changed the architecture conversation, and how to think about adopting it.

The Problem Before Iceberg

To see why Iceberg matters, look at what was broken.

The data lake model — typically Parquet files sitting in S3 or equivalent — had clear strengths. Storage was cheap. Files were vendor-neutral. Anyone could read them with any tool.

But it had three problems that everyone worked around but no one solved cleanly:

No transactions. If two jobs wrote to the same dataset at the same time, you could end up with partial reads or corrupted state. Most teams handled this by running pipelines sequentially and praying;
No safe schema evolution. Renaming a column or changing its type meant rewriting every file in the dataset, or living with a broken read path forever;
No time travel. Once you overwrote a partition, the old data was gone. Reproducing yesterday's report meant hoping you had a backup somewhere.

Data warehouses solved these problems decades ago, but did so by tightly coupling storage and compute. Iceberg's contribution was to bring warehouse semantics to cheap object storage without giving up the openness of the lake.

What a Table Format Actually Is

The phrase "table format" sounds vague, but it has a precise meaning.

A file format like Parquet describes how a single file is laid out — columns, encoding, compression. It tells you how to read one file's worth of data.

A table format like Iceberg describes how a collection of files is composed into a logical table. It tells you which files belong to the table right now, how they should be combined when you query, and how to safely change the set of files without breaking readers in flight.

In practice, an Iceberg table is three layers:

Data files: the actual data, usually in Parquet, sitting in object storage;
Metadata files: a table metadata file (JSON) plus manifest lists and manifests (Avro) that describe which data files belong to which version of the table;
A catalog: a small piece of state that records the current version of each table.

When you query an Iceberg table, the engine reads the catalog to find the current metadata, reads the metadata to find the active data files, and only then opens the actual data. That extra layer of indirection is what makes everything else possible.
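
The lookup chain can be sketched as a toy in-memory model. Everything here (the `CATALOG` dict, the file names, the `plan_scan` helper) is a hypothetical illustration rather than a real Iceberg API, and it skips the manifest-list level for brevity; a real engine reads these structures from object storage through an Iceberg library.

```python
# Toy model of the read path: catalog -> table metadata -> manifests -> data files.
# All names and paths below are illustrative assumptions, not real Iceberg APIs.

# Catalog: maps a table name to the location of its current metadata file.
CATALOG = {"db.orders": "metadata/v2.metadata.json"}

# Metadata files: each lists the known snapshots and marks the current one.
METADATA_FILES = {
    "metadata/v2.metadata.json": {
        "current_snapshot_id": 2,
        "snapshots": {
            1: {"manifests": ["manifest-a.avro"]},
            2: {"manifests": ["manifest-a.avro", "manifest-b.avro"]},
        },
    },
}

# Manifests: each lists the data files it tracks.
MANIFESTS = {
    "manifest-a.avro": ["data/part-000.parquet"],
    "manifest-b.avro": ["data/part-001.parquet", "data/part-002.parquet"],
}


def plan_scan(table_name: str) -> list[str]:
    """Return the data files a reader would open for the current snapshot."""
    metadata = METADATA_FILES[CATALOG[table_name]]          # step 1: catalog lookup
    snapshot = metadata["snapshots"][metadata["current_snapshot_id"]]
    files: list[str] = []
    for manifest in snapshot["manifests"]:                  # step 2: walk metadata
        files.extend(MANIFESTS[manifest])
    return files                                            # step 3: open these files


print(plan_scan("db.orders"))
```

Note that snapshot 2 reuses `manifest-a.avro` from snapshot 1: a new version only adds metadata for what changed.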

Iceberg's Three Key Tricks

The architecture above gives Iceberg three capabilities that no plain Parquet-on-S3 setup can match.

Snapshot isolation. Every change to an Iceberg table produces a new snapshot — a new metadata file that points to the new set of data files. Readers in flight continue to see their original snapshot until they finish. Writers do not block readers, and two writers cannot corrupt each other's output. You get genuine ACID semantics on object storage.
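
One way to picture this is an optimistic commit: a writer prepares a new snapshot and the catalog accepts it only if the table has not moved since the writer read it. The sketch below is a simplified in-memory model under that assumption; `ToyCatalog`, `CommitConflict`, and `append_files` are hypothetical names, not part of any Iceberg library.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table changed after the writer read it."""


class ToyCatalog:
    """Hypothetical catalog holding an append-only list of immutable snapshots."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._snapshots = [frozenset()]  # version 0 is the empty table

    def current(self):
        """Return (version, files) for the latest committed snapshot."""
        with self._lock:
            version = len(self._snapshots) - 1
            return version, self._snapshots[version]

    def commit(self, base_version, files):
        """Publish a new snapshot only if base_version is still the latest."""
        with self._lock:
            if base_version != len(self._snapshots) - 1:
                raise CommitConflict("table changed since it was read")
            self._snapshots.append(frozenset(files))
            return len(self._snapshots) - 1


def append_files(catalog, added, retries=3):
    """Optimistic append: re-read the latest snapshot and retry on conflict."""
    for _ in range(retries):
        version, files = catalog.current()
        try:
            return catalog.commit(version, files | set(added))
        except CommitConflict:
            continue
    raise CommitConflict("gave up after retries")


catalog = ToyCatalog()
reader_version, reader_view = catalog.current()  # a reader pins version 0

append_files(catalog, {"a.parquet"})             # writer A commits version 1

stale_rejected = False
try:
    # Writer B tries to commit on top of the stale version 0.
    catalog.commit(reader_version, {"b.parquet"})
except CommitConflict:
    stale_rejected = True

append_files(catalog, {"b.parquet"})             # B retries against the latest

latest_version, latest_files = catalog.current()
print(stale_rejected, latest_version, sorted(latest_files), sorted(reader_view))
```

The reader's pinned view stays empty throughout, while the stale commit is rejected instead of silently overwriting writer A's work.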

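Time travel. Because snapshots are immutable, you can query the table as of any past snapshot that is still retained. In Spark SQL, SELECT * FROM orders TIMESTAMP AS OF '2026-01-15 00:00:00' is a real query, not a stunt. (VERSION AS OF takes a snapshot ID or a branch or tag name rather than a date, and Trino spells the clauses FOR TIMESTAMP AS OF and FOR VERSION AS OF.) Reproducing yesterday's report or auditing a specific moment becomes trivial.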
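
Resolving a point in time to a snapshot amounts to finding the last snapshot committed at or before that moment. A minimal sketch, assuming a hypothetical `SNAPSHOT_LOG` of commit times and snapshot IDs (the IDs and dates are made up for illustration):

```python
from bisect import bisect_right
from datetime import datetime, timezone

# Hypothetical snapshot log: (commit time, snapshot id), sorted by commit time.
SNAPSHOT_LOG = [
    (datetime(2026, 1, 10, tzinfo=timezone.utc), 101),
    (datetime(2026, 1, 14, tzinfo=timezone.utc), 102),
    (datetime(2026, 1, 20, tzinfo=timezone.utc), 103),
]


def snapshot_as_of(ts: datetime) -> int:
    """Return the id of the latest snapshot committed at or before ts."""
    commit_times = [committed for committed, _ in SNAPSHOT_LOG]
    idx = bisect_right(commit_times, ts)
    if idx == 0:
        raise ValueError("no snapshot exists at or before the requested time")
    return SNAPSHOT_LOG[idx - 1][1]


print(snapshot_as_of(datetime(2026, 1, 15, tzinfo=timezone.utc)))
```

A request that predates the oldest retained snapshot fails, which mirrors what happens once old snapshots are expired.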

Schema and partition evolution. Iceberg tracks columns by stable internal IDs, not by name, and records each partition layout as a versioned spec in metadata. Renaming a column, widening its type (for example int to long), or changing the partitioning for new data does not require rewriting any existing files. The metadata layer absorbs the change and readers keep working.
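
The effect of ID-based column tracking can be shown with a small sketch. The row layout, the two schema dicts, and `read_rows` are assumptions made for illustration; real data files store field IDs in the Parquet schema rather than as dictionary keys.

```python
# Rows as stored in an old data file, keyed by field ID rather than column name.
data_file_rows = [
    {1: 1001, 2: "alice"},
    {1: 1002, 2: "bob"},
]

# Schema versions map field ID -> current column name.
schema_v1 = {1: "order_id", 2: "customer"}
# v2 renames field 2 and adds field 3; the old file is not touched.
schema_v2 = {1: "order_id", 2: "customer_name", 3: "country"}


def read_rows(rows, schema):
    """Project stored rows through a schema; missing fields read as None."""
    return [
        {name: row.get(field_id) for field_id, name in schema.items()}
        for row in rows
    ]


print(read_rows(data_file_rows, schema_v1))
print(read_rows(data_file_rows, schema_v2))
```

Because the file never recorded the name `customer`, the rename is a metadata-only change, and the new `country` column reads as null for rows written before it existed.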

Iceberg is what happens when you treat the table not as a folder of files but as a versioned data structure.

That shift is small in description and enormous in consequence.

The Lakehouse Architecture in Practice

A data lakehouse is what you get when you put Iceberg (or a comparable table format) on top of object storage and run multiple engines against it. The architecture looks deceptively simple:

Storage layer. Object storage (S3, GCS, Azure Blob). Cheap, durable, vendor-neutral;
Table format layer. Iceberg defines how files compose into tables;
Catalog layer. A small service tracking the current version of each table, usually AWS Glue, Apache Polaris (open-sourced by Snowflake), or a project like Nessie;
Compute layer. Whatever engine you want: Spark for ETL, Trino for interactive analytics, Flink for streaming, Snowflake or BigQuery for warehouse-grade workloads, DuckDB for local exploration.

The cleanest property of this architecture is that the storage layer is shared. Your batch pipeline, your real-time stream, your ad-hoc analyst, and your ML training job all read the same Iceberg tables. No copies, no syncs, no drift.

This is what people mean when they say the lakehouse "unifies" data infrastructure. It is not a marketing claim — it is a structural property of the architecture.

Iceberg vs Delta Lake vs Hudi

Iceberg is not the only open table format. Two others matter in 2026.

Aspect	Apache Iceberg	Delta Lake	Apache Hudi
Primary origin	Netflix	Databricks	Uber
Governance	Apache Software Foundation	Linux Foundation (Delta Lake project)	Apache Software Foundation
Schema evolution	Strong — by column ID	Strong — column renaming via mapping	Strong
Partition evolution	Yes — hidden partitioning	Limited	Limited
Streaming writes	Good	Native (Spark Structured Streaming)	Best — designed for streaming first
Engine support	Broadest — multi-engine first	Strong in Databricks ecosystem	Strong in Spark, narrower elsewhere
Adoption trend	Fastest-growing as the cross-vendor standard	Strong inside Databricks customers	Steady, especially streaming-heavy use cases

The honest summary is that all three are production-ready, and your choice depends more on your existing platform than on technical merit. Iceberg wins on vendor neutrality: it is supported by every major engine, including ones built by Databricks, the company behind Delta Lake. Delta Lake wins inside the Databricks ecosystem, where the integration is deepest. Hudi wins for streaming-first architectures where mutations on the lake are the primary use case.

If you are starting fresh and do not have a strong reason to pick otherwise, Iceberg is the default in 2026 — exactly because it is the safe cross-vendor bet.

Adopting Iceberg in Your Stack

Adopting Iceberg is rarely a single big migration. It is a sequence of small decisions.

Pick a catalog first. Without a catalog, Iceberg tables are just files. AWS Glue is the default in AWS shops. Apache Polaris, originally open-sourced by Snowflake, is the open-source choice when you want vendor-neutral but production-grade. Nessie is the right call when you want Git-like branching for data;
Write new tables to Iceberg first. Do not migrate existing tables on day one. Instead, point new pipelines at Iceberg from the start. Coexistence is easy; mass migration is hard;
Convert in-place when you can. Engines such as Spark can register an existing Parquet dataset as an Iceberg table without rewriting the data (Spark exposes this through its migrate, snapshot, and add_files procedures). This is the cheapest possible migration path;
Standardize on one engine for writes initially. Iceberg is multi-engine on reads from day one, but having one canonical writer simplifies your operational story until your team is comfortable;
Treat the catalog as critical infrastructure. If the catalog is down, every Iceberg query is down. Operate it accordingly: high availability, backups, monitoring.

The most common mistake is treating Iceberg adoption as a technology project rather than an architectural one. The format is easy. The discipline of running a real multi-engine lakehouse (naming conventions, ownership, access control, lifecycle policies) is the hard part.

The teams that get value from Iceberg are not the ones with the fanciest engines. They are the ones with the cleanest governance.

That is the lesson worth internalizing before the rollout, not after.

Conclusion

Apache Iceberg is one of those rare technologies that wins not by being flashy but by being correct. It solved a real problem, did it in an open way, and let the rest of the ecosystem build on top of it.

For data engineers, the practical takeaway is to stop thinking about the warehouse and the lake as separate worlds. The modern stack is one storage layer, one table format, many engines. The architecture is simpler, the bills are lower, and the data is genuinely shared across teams instead of copied between systems.

The data lakehouse is not a new product. It is what happens when the table format problem finally gets solved properly.

Iceberg is the format that solved it. The rest is a matter of catching up.

FAQ

Q: Is Iceberg only for huge datasets?

A: No. Iceberg works at any scale. For very small tables, the metadata overhead is a real cost — but for anything above a few gigabytes, the benefits in transactionality and schema evolution easily outweigh the cost.

Q: Do I need Spark to use Iceberg?

A: No. Iceberg is engine-agnostic. Spark, Trino, Flink, Snowflake, BigQuery, DuckDB, and others all read and write Iceberg tables directly. Pick the engine that fits your workload.

Q: Can I migrate a Parquet dataset to Iceberg without rewriting the files?

A: Yes, in most cases. Engines like Spark and Trino can register an existing Parquet dataset as an Iceberg table without copying or rewriting the data. The result is an Iceberg table that points at your existing files.

Q: Does Iceberg replace my data warehouse?

A: Not necessarily. It often coexists with warehouses — Snowflake, BigQuery, and Redshift all read Iceberg natively. The point is that you no longer have to choose between warehouse semantics and lake economics.

Q: What is a catalog in Iceberg and which one should I use?

A: The catalog is the small service that tracks which version of each table is current. AWS Glue, Apache Polaris (open-sourced by Snowflake), Hive Metastore, and Nessie are the most common choices. Pick one based on your platform: if you are AWS-first, Glue is the easy default.

Q: How does Iceberg handle deletes and updates?

A: Iceberg supports row-level deletes and updates through two strategies: copy-on-write (rewrite affected files) and merge-on-read (write delete markers, apply at query time). The right strategy depends on your read-vs-write ratio.
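
The difference between the two strategies can be sketched with plain Python structures. This is a toy model, not Iceberg's implementation: the `files` dict, the `.rewritten` suffix, and the three helper functions are hypothetical, and the delete set only loosely mirrors Iceberg's positional deletes (a file path plus a row position).

```python
def copy_on_write_delete(files, predicate):
    """Rewrite every file containing a matching row; untouched files are reused."""
    new_files = {}
    for name, rows in files.items():
        if any(predicate(row) for row in rows):
            # Pay the cost at write time: produce a new file without the rows.
            new_files[name + ".rewritten"] = [r for r in rows if not predicate(r)]
        else:
            new_files[name] = rows
    return new_files


def merge_on_read_delete(files, predicate):
    """Leave data files alone; record (file, row position) pairs to skip later."""
    return {
        (name, pos)
        for name, rows in files.items()
        for pos, row in enumerate(rows)
        if predicate(row)
    }


def read_with_deletes(files, deletes):
    """Pay the cost at read time: filter out rows listed in the delete set."""
    return [
        row
        for name, rows in files.items()
        for pos, row in enumerate(rows)
        if (name, pos) not in deletes
    ]


files = {"f1": [{"id": 1}, {"id": 2}], "f2": [{"id": 3}]}


def is_order_2(row):
    return row["id"] == 2


cow_files = copy_on_write_delete(files, is_order_2)
mor_deletes = merge_on_read_delete(files, is_order_2)
print(cow_files)
print(mor_deletes, read_with_deletes(files, mor_deletes))
```

Both paths return the same rows to a reader; they differ in whether the writer or the reader does the extra work, which is why the read-vs-write ratio drives the choice.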

Q: Is Iceberg good for real-time streaming workloads?

A: It works for streaming, especially with Flink, and it has improved substantially in 2026. For pure streaming-first workloads with constant mutations, Apache Hudi was historically a stronger fit, but the gap has narrowed.
