Introduction
Data lakes store data in its raw format and provide the flexibility to assign a schema at read time (schema-on-read). A table format adds an abstraction layer that lets you interact with the files in the data lake as if they were tables, by defining the schema, columns and data types.
Hive tables are the first-generation table format, which provided the flexibility to read files in various formats (CSV, TSV, Parquet, etc.) as tables. The Hive Metastore stored the metadata about the table, such as columns, file location and data types.
As the analytics space evolved, table formats needed to support scalability, improved performance, and data governance/privacy requirements like GDPR and CCPA (data deletion, the Right to Be Forgotten, RTBF).
In this article I will focus on the three most popular open source table formats and the underlying concepts behind them.
- Apache Hudi
- Apache Iceberg
- Delta Lake
Before comparing the pros and cons of each format, let's look at some of the concepts behind data lake table formats.
Features of Data Lake Table Formats
1. ACID Transactions
ACID is the most fundamental principle in database design. All relational databases support ACID properties.
ACID stands for:
Atomic — all parts of a transaction succeed, or all fail and roll back.
Consistent — every committed transaction leaves the data in a valid state that satisfies all data rules, including constraints, triggers and cascades.
Isolated — no transaction can interfere with other concurrent transactions.
Durable — once a transaction is committed, the data survives system failures and can be recovered after an unwanted deletion.
Enabling ACID transactions solves one of the major challenges in data lakes: it allows data lakes to support record-level operations (insert, upsert, delete) required by data governance initiatives like GDPR, which call for user_id-level operations (for example, deleting a user's data as part of an account closure). A short sketch of such an operation follows below.
ACID in Data Lake
- All three formats support ACID transactions in the data lake.
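As a minimal sketch of what record-level ACID operations enable, here is the kind of statement a GDPR deletion job could run in Spark SQL (the user_events table and the user_id value are hypothetical; DELETE FROM is supported by all three formats):
-- Delete all records for one user (e.g. a "right to be forgotten" request).
-- The delete either commits atomically or rolls back; concurrent readers keep
-- seeing the previous snapshot until the commit succeeds.
DELETE FROM user_events WHERE user_id = 'user-123';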
2. Schema Evolution
Schema evolution simply means the ability to change the table structure over time. Supported changes include:
- Add — add a new column to the table or to a nested struct
- Drop — remove an existing column from the table or a nested struct
- Rename — rename an existing column or field in a nested struct
- Update — widen the type of a column, struct field, map key, map value, or list element
- Reorder — change the order of columns or fields in a nested struct
There are specific restrictions in each table format, but in general schema evolution is supported in all three table formats on Apache Spark (a short sketch follows the documentation links below).
Schema Evolution
- Hudi supports schema evolution only on Spark.
Hudi: https://hudi.apache.org/docs/schema_evolution/
Delta Lake: https://docs.delta.io/latest/delta-batch.html#update-table-schema
Iceberg: https://iceberg.apache.org/docs/latest/evolution/#schema-evolution
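As an illustration, here is a minimal sketch using Iceberg's Spark SQL DDL extensions (the db.events table and column names are hypothetical; Delta Lake and Hudi expose similar ALTER TABLE statements with their own restrictions, e.g. dropping or renaming columns in Delta Lake requires column mapping):
-- Add a new optional column; existing data files are not rewritten.
ALTER TABLE db.events ADD COLUMN device_type STRING;
-- Rename an existing column; only metadata changes.
ALTER TABLE db.events RENAME COLUMN device_type TO device;
-- Widen a column type (int to bigint is an allowed promotion).
ALTER TABLE db.events ALTER COLUMN visit_count TYPE BIGINT;
-- Drop a column that is no longer needed.
ALTER TABLE db.events DROP COLUMN device;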
3. Partition Evolution
The partition evolution feature provides the ability to change the partition keys of a table without rewriting the entire table. (Yes! It is possible to keep two different partition grains in the same table; a table can be partitioned at both the month and the day level.)
Image Ref: Iceberg documentation
The older data stays in the old partition layout while new data is written in the new layout; queries keep working across both thanks to hidden partitioning.
Hidden Partitioning — SQL queries don't have to filter on partition keys explicitly to benefit from partition pruning.
The partition evolution feature is specific to Iceberg at this time; a short sketch follows.
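Here is a minimal sketch with Iceberg's Spark SQL extensions (the db.logs table and event_ts column are hypothetical, and the Iceberg SQL extensions must be enabled in the Spark session). Changing the partition grain only updates metadata, and data already written stays in the old layout:
-- Table initially created with: PARTITIONED BY (days(event_ts))
-- Switch new writes to a monthly grain; existing files are not rewritten.
ALTER TABLE db.logs ADD PARTITION FIELD months(event_ts);
ALTER TABLE db.logs DROP PARTITION FIELD days(event_ts);
-- Hidden partitioning: this query prunes partitions in both layouts without
-- referencing a partition column explicitly.
SELECT count(*) FROM db.logs WHERE event_ts >= TIMESTAMP '2022-01-01 00:00:00';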
4. Write Operations
These are the options to write or rewrite data into a data lake table. In general, the common write operations are:
Append — Add new data to the existing table
Overwrite — Replace the existing data with new data. For tables with a partition key, only the affected partitions are replaced. There are two overwrite modes for partitioned tables: static (the default) and dynamic.
- Static — The partition clause specified in the query determines which partitions are overwritten. Omitting the partition filter replaces the entire table.
- Dynamic — The partitions produced by the SQL query determine what is overwritten, i.e. a single query can overwrite one or more partitions (a short sketch of both modes follows this list).
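A minimal sketch of the two overwrite modes in Spark SQL (the sales table, the staging_sales source and the dt partition column are hypothetical; the exact configuration property can differ slightly between formats):
-- Static overwrite: the partition named in the PARTITION clause is replaced.
INSERT OVERWRITE sales PARTITION (dt = '2022-06-01')
SELECT order_id, amount FROM staging_sales WHERE dt = '2022-06-01';
-- Dynamic overwrite: only the partitions produced by the query are replaced.
SET spark.sql.sources.partitionOverwriteMode = dynamic;
INSERT OVERWRITE sales
SELECT order_id, amount, dt FROM staging_sales;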
Apache Hudi has two different table types for handling write operations.
Copy on Write (CoW): In simple terms, CoW means that for each update to a file, a new version of the file is created. Data is stored only in Parquet format. CoW tables are write-heavy and read-efficient, so CoW is better suited for read-heavy workloads on data that changes less frequently.
Merge on Read (MoR): Updates are logged to row-based delta files, and a compaction process merges the base files and delta files based on configuration. MoR is better suited for write- or change-heavy workloads with fewer reads.
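A minimal sketch of creating each Hudi table type with Spark SQL (table and column names are hypothetical; preCombineField tells Hudi which column decides the latest version of a record when keys collide):
-- Copy on Write table: updates rewrite the affected Parquet files.
CREATE TABLE orders_cow (
  order_id STRING,
  amount DOUBLE,
  updated_at TIMESTAMP
) USING hudi
TBLPROPERTIES (type = 'cow', primaryKey = 'order_id', preCombineField = 'updated_at');
-- Merge on Read table: updates go to row-based log files and are compacted later.
CREATE TABLE orders_mor (
  order_id STRING,
  amount DOUBLE,
  updated_at TIMESTAMP
) USING hudi
TBLPROPERTIES (type = 'mor', primaryKey = 'order_id', preCombineField = 'updated_at');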
Hudi also provides three logical views for accessing the data:
- Read-optimized view — Provides the latest committed dataset from CoW tables and the latest compacted dataset from MoR tables.
- Incremental view — Provides a change stream between two actions out of a CoW dataset to feed downstream jobs and extract, transform, load (ETL) workflows.
- Real-time view — Provides the latest committed data from a MoR table by merging the columnar and row-based files inline.
5. Time Travel
The time travel feature provides the ability to query older snapshots of a table. It is very helpful for recreating analyses, fixing data issues and making historical comparisons.
All three table formats support time travel queries. Here is a SQL sample; it looks very similar to a filter clause.
SELECT * FROM table_name TIMESTAMP AS OF timestamp_expression
SELECT * FROM table_name VERSION AS OF version
A few other important features:
- Incremental queries — Handling change data capture
- Bootstrap — Migrating existing tables to new formats without complete table overwrites
- Concurrency — Concurrent writes and reads on the tables; optimistic concurrency control (OCC) vs. multi-version concurrency control (MVCC). All three formats support OCC.
- Primary Keys — Having primary keys in the tables. (Hudi supports primary keys)
- Schema Enforcement — Enforcing the schema in the ETL workloads.
Conclusion
All three data lake table formats have their own pros and cons. Choosing the right format is essential for the performance and maintainability of the data lake. Both Apache Hudi and Delta Lake have native support in EMR, which makes them easy to adopt for AWS users. Databricks also provides a commercial version of Delta Lake. So the choice of table format will depend on the compute provider and the features required.
The purpose of this post is to provide a high-level overview of the concepts behind data lake table formats.
Thanks for reading!!
References
https://hudi.apache.org/docs/table_types/
https://docs.delta.io/latest/delta-intro.html
https://www.databricks.com/session_na20/a-thorough-comparison-of-delta-lake-iceberg-and-hudi
https://iceberg.apache.org/docs/latest/spark-writes/