尊敬的 微信汇率:1円 ≈ 0.046078 元 支付宝汇率:1円 ≈ 0.046168元 [退出登录]
SlideShare a Scribd company logo
Apache Iceberg
Ryan Blue
2019 Big Data Orchestration Summit
Netflix’s Data Warehouse
● Smarter processing engines
○ CBO, better join implementations
○ Result set caching, materialized views
● Reduce manual data maintenance
○ Data librarian services
○ Declarative instead of imperative
5-year Challenges
● Unsafe operations are everywhere
○ Writing to multiple partitions
○ Renaming a column
● Interaction with object stores causes major headaches
○ Eventual consistency to performance problems
○ Output committers can’t fix it
● Endless scale challenges
Problem Whack-a-mole
What is Iceberg?
Iceberg is a scalable format for
tables with a lot of best
practices built in.
A format?
We already have Parquet, Avro
and ORC . . .
A table format.
● File formats help you modify or skip data in a single file
● Table formats do the same thing for a collection of files
● To demonstrate this, consider Hive tables . . .
A table format
● Key idea: organize data in a directory tree
date=20180513/
|- hour=18/
| |- ...
|- hour=19/
| |- part-000.parquet
| |- ...
| |- part-031.parquet
|- hour=20/
| |- ...
Hive Tables
● Filter: WHERE date = '20180513' AND hour = 19
date=20180513/
|- hour=18/
| |- ...
|- hour=19/
| |- part-000.parquet
| |- ...
| |- part-031.parquet
|- hour=20/
| |- ...
Hive Tables
● Problem: too much directory listing for large tables
● Solution: use HMS to track partitions
date=20180513/hour=19 -> hdfs:/.../date=20180513/hour=19
date=20180513/hour=20 -> hdfs:/.../date=20180513/hour=20
● The file system still tracks the files in each partition . . .
Hive Metastore
● State is kept in both the metastore and in a file system
● Changes are not atomic without locking
● Requires directory listing
○ O(n) listing calls, n = # matching partitions
○ Eventual consistency breaks correctness
Hive Tables: Problems
● Everything supports Hive tables*
○ Engines: Hive, Spark, Presto, Flink, Pig
○ Tools: Hudi, NiFi, Flume, Sqoop
● Simplicity and ubiquity have made Hive tables indispensable
● The whole ecosystem uses the same at-rest data!
Hive Tables: Benefits
Iceberg
● An open spec and community for at-rest data interchange
○ Maintain a clear spec for the format
○ Design for multiple implementations across languages
○ Support needs across projects to avoid fragmentation
Iceberg’s Goals
● Improve scale and reliability
○ Work on a single node, scale to a cluster
○ All changes are atomic, with serializable isolation
○ Native support for cloud object stores
○ Support many concurrent writers
Iceberg’s Goals
● Fix persistent usability problems
○ In-place evolution for schema and layout (no side-effects)
○ Hide partitioning: insulate queries from physical layout
○ Support time-travel, rollback, and metadata inspection
○ Configure tables, not jobs
● Tables should have no unpleasant surprises
Iceberg’s Goals
● Key idea: track all files in a table over time
○ A snapshot is a complete list of files in a table
○ Each write produces and commits a new snapshot
Iceberg’s Design
S1 S2 S3 ...
S1 S2 S3 ...
R W
● Readers use the current snapshot
● Writers optimistically create new snapshots, then commit
Iceberg’s Design
In reality, it’s a bit more
complicated.
● All changes are atomic
● No expensive (or inconsistent) file system operations
● Snapshots are indexed for scan planning on a single node
● CBO metrics are reliable
● Versions for incremental updates and materialized views
Iceberg Design Benefits
Iceberg at Netflix
● Production tables: tens of petabytes, millions of partitions
○ Scan planning fits on a single node
○ Advanced filtering enables more use cases
○ Overall performance is better
● Low latency queries are faster for large tables
Scale
● Production Flink pipeline writing in 3 AWS regions
● Lift service moving data into a single region
● Merge service compacting small files
Concurrency
● Rollback is popular
● Metadata tables
○ Track down the version a job read
○ Find the process that wrote a bad version
Usability
● Spark vectorization for faster bulk reads
○ Presto vectorization already done
● Row-level delete encodings
○ MERGE INTO
○ ID equality predicates
Future Work
Thank you!
Questions?
Ryan Blue
rblue@netflix.com

More Related Content

What's hot

Dynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationDynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisation
Ori Reshef
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Anant Corporation
 
Snowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglySnowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the Ugly
Tyler Wishnoff
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
Dremio Corporation
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
Alluxio, Inc.
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
Julien Le Dem
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
DataWorks Summit/Hadoop Summit
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
Altinity Ltd
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptx
Dori Waldman
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation
Brett VanderPlaats
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
Databricks
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
kbajda
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
Databricks
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
Gleb Kanterov
 

What's hot (20)

Dynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationDynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisation
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
 
Snowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglySnowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the Ugly
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptx
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 

Similar to Apache Iceberg - A Table Format for Hige Analytic Datasets

The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
Ryan Blue
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
Flink Forward
 
week1slides1704202828322.pdf
week1slides1704202828322.pdfweek1slides1704202828322.pdf
week1slides1704202828322.pdf
TusharAgarwal49094
 
Change data capture
Change data captureChange data capture
Change data capture
Ron Barabash
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
Junping Du
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
DataWorks Summit
 
Fluent Bit: Log Forwarding at Scale
Fluent Bit: Log Forwarding at ScaleFluent Bit: Log Forwarding at Scale
Fluent Bit: Log Forwarding at Scale
Eduardo Silva Pereira
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
Itai Yaffe
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
Databricks
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
C4Media
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
aspyker
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
Ahmed Ossama
 
How to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data PlatformsHow to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data Platforms
Alluxio, Inc.
 
ApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptx
XinliShang1
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
Type safe, versioned, and rewindable stream processing with Apache {Avro, K...
Type safe, versioned, and rewindable stream processing  with  Apache {Avro, K...Type safe, versioned, and rewindable stream processing  with  Apache {Avro, K...
Type safe, versioned, and rewindable stream processing with Apache {Avro, K...
Hisham Mardam-Bey
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS Cloud
Zhenxiao Luo
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB
 
RubiX
RubiXRubiX
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
Omid Vahdaty
 

Similar to Apache Iceberg - A Table Format for Hige Analytic Datasets (20)

The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
week1slides1704202828322.pdf
week1slides1704202828322.pdfweek1slides1704202828322.pdf
week1slides1704202828322.pdf
 
Change data capture
Change data captureChange data capture
Change data capture
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Fluent Bit: Log Forwarding at Scale
Fluent Bit: Log Forwarding at ScaleFluent Bit: Log Forwarding at Scale
Fluent Bit: Log Forwarding at Scale
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
How to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data PlatformsHow to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data Platforms
 
ApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptx
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Type safe, versioned, and rewindable stream processing with Apache {Avro, K...
Type safe, versioned, and rewindable stream processing  with  Apache {Avro, K...Type safe, versioned, and rewindable stream processing  with  Apache {Avro, K...
Type safe, versioned, and rewindable stream processing with Apache {Avro, K...
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS Cloud
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
RubiX
RubiXRubiX
RubiX
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 

More from Alluxio, Inc.

Alluxio Webinar | 10x Faster Trino Queries on Your Data Platform
Alluxio Webinar | 10x Faster Trino Queries on Your Data PlatformAlluxio Webinar | 10x Faster Trino Queries on Your Data Platform
Alluxio Webinar | 10x Faster Trino Queries on Your Data Platform
Alluxio, Inc.
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
Alluxio, Inc.
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
Alluxio, Inc.
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning Framework
Alluxio, Inc.
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
Alluxio, Inc.
 
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-CloudAlluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio, Inc.
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio, Inc.
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with Alluxio
Alluxio, Inc.
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio Caching
Alluxio, Inc.
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
Alluxio, Inc.
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Alluxio, Inc.
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio, Inc.
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio, Inc.
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Alluxio, Inc.
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeData Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Alluxio, Inc.
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Alluxio, Inc.
 
Data Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderData Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet Reader
Alluxio, Inc.
 
Data Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionData Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage Evolution
Alluxio, Inc.
 
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio, Inc.
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
Alluxio, Inc.
 

More from Alluxio, Inc. (20)

Alluxio Webinar | 10x Faster Trino Queries on Your Data Platform
Alluxio Webinar | 10x Faster Trino Queries on Your Data PlatformAlluxio Webinar | 10x Faster Trino Queries on Your Data Platform
Alluxio Webinar | 10x Faster Trino Queries on Your Data Platform
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning Framework
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
 
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-CloudAlluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with Alluxio
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio Caching
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeData Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
 
Data Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderData Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet Reader
 
Data Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionData Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage Evolution
 
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
 

Recently uploaded

Top 5 Ways To Use Instagram API in 2024 for your business
Top 5 Ways To Use Instagram API in 2024 for your businessTop 5 Ways To Use Instagram API in 2024 for your business
Top 5 Ways To Use Instagram API in 2024 for your business
Yara Milbes
 
Revolutionizing Task Scheduling in CFML!
Revolutionizing Task Scheduling in CFML!Revolutionizing Task Scheduling in CFML!
Revolutionizing Task Scheduling in CFML!
Ortus Solutions, Corp
 
Extreme DDD Modelling Patterns - 2024 Devoxx Poland
Extreme DDD Modelling Patterns - 2024 Devoxx PolandExtreme DDD Modelling Patterns - 2024 Devoxx Poland
Extreme DDD Modelling Patterns - 2024 Devoxx Poland
Alberto Brandolini
 
Streamlining End-to-End Testing Automation
Streamlining End-to-End Testing AutomationStreamlining End-to-End Testing Automation
Streamlining End-to-End Testing Automation
Anand Bagmar
 
Folding Cheat Sheet #6 - sixth in a series
Folding Cheat Sheet #6 - sixth in a seriesFolding Cheat Sheet #6 - sixth in a series
Folding Cheat Sheet #6 - sixth in a series
Philip Schwarz
 
🔥 Call Girls In Pune 💯Call Us 🔝 7737669865 🔝💃Top Class Call Girl Service Avai...
🔥 Call Girls In Pune 💯Call Us 🔝 7737669865 🔝💃Top Class Call Girl Service Avai...🔥 Call Girls In Pune 💯Call Us 🔝 7737669865 🔝💃Top Class Call Girl Service Avai...
🔥 Call Girls In Pune 💯Call Us 🔝 7737669865 🔝💃Top Class Call Girl Service Avai...
nikhilkumarji0156
 
Erotic Call Girls Bangalore🫱9079923931🫲 High Quality Call Girl Service Right ...
Erotic Call Girls Bangalore🫱9079923931🫲 High Quality Call Girl Service Right ...Erotic Call Girls Bangalore🫱9079923931🫲 High Quality Call Girl Service Right ...
Erotic Call Girls Bangalore🫱9079923931🫲 High Quality Call Girl Service Right ...
meenusingh4354543
 
AI Based Testing - A Comprehensive Guide.pdf
AI Based Testing - A Comprehensive Guide.pdfAI Based Testing - A Comprehensive Guide.pdf
AI Based Testing - A Comprehensive Guide.pdf
kalichargn70th171
 
119321250-History-of-Computer-Programming.ppt
119321250-History-of-Computer-Programming.ppt119321250-History-of-Computer-Programming.ppt
119321250-History-of-Computer-Programming.ppt
lavesingh522
 
🔥 Chennai Call Girls  👉 6350257716 👫 High Profile Call Girls Whatsapp Number ...
🔥 Chennai Call Girls  👉 6350257716 👫 High Profile Call Girls Whatsapp Number ...🔥 Chennai Call Girls  👉 6350257716 👫 High Profile Call Girls Whatsapp Number ...
🔥 Chennai Call Girls  👉 6350257716 👫 High Profile Call Girls Whatsapp Number ...
tinakumariji156
 
一比一原版宾夕法尼亚大学毕业证(UPenn毕业证书)学历如何办理
一比一原版宾夕法尼亚大学毕业证(UPenn毕业证书)学历如何办理一比一原版宾夕法尼亚大学毕业证(UPenn毕业证书)学历如何办理
一比一原版宾夕法尼亚大学毕业证(UPenn毕业证书)学历如何办理
eydbbz
 
🔥 Kolkata Call Girls  👉 9079923931 👫 High Profile Call Girls Whatsapp Number ...
🔥 Kolkata Call Girls  👉 9079923931 👫 High Profile Call Girls Whatsapp Number ...🔥 Kolkata Call Girls  👉 9079923931 👫 High Profile Call Girls Whatsapp Number ...
🔥 Kolkata Call Girls  👉 9079923931 👫 High Profile Call Girls Whatsapp Number ...
tinakumariji156
 
Trailhead Talks_ Journey of an All-Star Ranger .pptx
Trailhead Talks_ Journey of an All-Star Ranger .pptxTrailhead Talks_ Journey of an All-Star Ranger .pptx
Trailhead Talks_ Journey of an All-Star Ranger .pptx
ImtiazBinMohiuddin
 
Hot Call Girls In Ahmedabad ✔ 7737669865 ✔ Hi I Am Divya Vip Call Girl Servic...
Hot Call Girls In Ahmedabad ✔ 7737669865 ✔ Hi I Am Divya Vip Call Girl Servic...Hot Call Girls In Ahmedabad ✔ 7737669865 ✔ Hi I Am Divya Vip Call Girl Servic...
Hot Call Girls In Ahmedabad ✔ 7737669865 ✔ Hi I Am Divya Vip Call Girl Servic...
ns9201415
 
European Standard S1000D, an Unnecessary Expense to OEM.pptx
European Standard S1000D, an Unnecessary Expense to OEM.pptxEuropean Standard S1000D, an Unnecessary Expense to OEM.pptx
European Standard S1000D, an Unnecessary Expense to OEM.pptx
Digital Teacher
 
Photo Copier Xerox Machine annual maintenance contract system.pdf
Photo Copier Xerox Machine annual maintenance contract system.pdfPhoto Copier Xerox Machine annual maintenance contract system.pdf
Photo Copier Xerox Machine annual maintenance contract system.pdf
SERVE WELL CRM NASHIK
 
Call Girls Bangalore🔥7023059433🔥Best Profile Escorts in Bangalore Available 24/7
Call Girls Bangalore🔥7023059433🔥Best Profile Escorts in Bangalore Available 24/7Call Girls Bangalore🔥7023059433🔥Best Profile Escorts in Bangalore Available 24/7
Call Girls Bangalore🔥7023059433🔥Best Profile Escorts in Bangalore Available 24/7
manji sharman06
 
Call Girls in Varanasi || 7426014248 || Quick Booking at Affordable Price
Call Girls in Varanasi || 7426014248 || Quick Booking at Affordable PriceCall Girls in Varanasi || 7426014248 || Quick Booking at Affordable Price
Call Girls in Varanasi || 7426014248 || Quick Booking at Affordable Price
vickythakur209464
 
Accelerate your Sitecore development with GenAI
Accelerate your Sitecore development with GenAIAccelerate your Sitecore development with GenAI
Accelerate your Sitecore development with GenAI
Ahmed Okour
 
What’s new in VictoriaMetrics - Q2 2024 Update
What’s new in VictoriaMetrics - Q2 2024 UpdateWhat’s new in VictoriaMetrics - Q2 2024 Update
What’s new in VictoriaMetrics - Q2 2024 Update
VictoriaMetrics
 

Recently uploaded (20)

Top 5 Ways To Use Instagram API in 2024 for your business
Top 5 Ways To Use Instagram API in 2024 for your businessTop 5 Ways To Use Instagram API in 2024 for your business
Top 5 Ways To Use Instagram API in 2024 for your business
 
Revolutionizing Task Scheduling in CFML!
Revolutionizing Task Scheduling in CFML!Revolutionizing Task Scheduling in CFML!
Revolutionizing Task Scheduling in CFML!
 
Extreme DDD Modelling Patterns - 2024 Devoxx Poland
Extreme DDD Modelling Patterns - 2024 Devoxx PolandExtreme DDD Modelling Patterns - 2024 Devoxx Poland
Extreme DDD Modelling Patterns - 2024 Devoxx Poland
 
Streamlining End-to-End Testing Automation
Streamlining End-to-End Testing AutomationStreamlining End-to-End Testing Automation
Streamlining End-to-End Testing Automation
 
Folding Cheat Sheet #6 - sixth in a series
Folding Cheat Sheet #6 - sixth in a seriesFolding Cheat Sheet #6 - sixth in a series
Folding Cheat Sheet #6 - sixth in a series
 
🔥 Call Girls In Pune 💯Call Us 🔝 7737669865 🔝💃Top Class Call Girl Service Avai...
🔥 Call Girls In Pune 💯Call Us 🔝 7737669865 🔝💃Top Class Call Girl Service Avai...🔥 Call Girls In Pune 💯Call Us 🔝 7737669865 🔝💃Top Class Call Girl Service Avai...
🔥 Call Girls In Pune 💯Call Us 🔝 7737669865 🔝💃Top Class Call Girl Service Avai...
 
Erotic Call Girls Bangalore🫱9079923931🫲 High Quality Call Girl Service Right ...
Erotic Call Girls Bangalore🫱9079923931🫲 High Quality Call Girl Service Right ...Erotic Call Girls Bangalore🫱9079923931🫲 High Quality Call Girl Service Right ...
Erotic Call Girls Bangalore🫱9079923931🫲 High Quality Call Girl Service Right ...
 
AI Based Testing - A Comprehensive Guide.pdf
AI Based Testing - A Comprehensive Guide.pdfAI Based Testing - A Comprehensive Guide.pdf
AI Based Testing - A Comprehensive Guide.pdf
 
119321250-History-of-Computer-Programming.ppt
119321250-History-of-Computer-Programming.ppt119321250-History-of-Computer-Programming.ppt
119321250-History-of-Computer-Programming.ppt
 
🔥 Chennai Call Girls  👉 6350257716 👫 High Profile Call Girls Whatsapp Number ...
🔥 Chennai Call Girls  👉 6350257716 👫 High Profile Call Girls Whatsapp Number ...🔥 Chennai Call Girls  👉 6350257716 👫 High Profile Call Girls Whatsapp Number ...
🔥 Chennai Call Girls  👉 6350257716 👫 High Profile Call Girls Whatsapp Number ...
 
一比一原版宾夕法尼亚大学毕业证(UPenn毕业证书)学历如何办理
一比一原版宾夕法尼亚大学毕业证(UPenn毕业证书)学历如何办理一比一原版宾夕法尼亚大学毕业证(UPenn毕业证书)学历如何办理
一比一原版宾夕法尼亚大学毕业证(UPenn毕业证书)学历如何办理
 
🔥 Kolkata Call Girls  👉 9079923931 👫 High Profile Call Girls Whatsapp Number ...
🔥 Kolkata Call Girls  👉 9079923931 👫 High Profile Call Girls Whatsapp Number ...🔥 Kolkata Call Girls  👉 9079923931 👫 High Profile Call Girls Whatsapp Number ...
🔥 Kolkata Call Girls  👉 9079923931 👫 High Profile Call Girls Whatsapp Number ...
 
Trailhead Talks_ Journey of an All-Star Ranger .pptx
Trailhead Talks_ Journey of an All-Star Ranger .pptxTrailhead Talks_ Journey of an All-Star Ranger .pptx
Trailhead Talks_ Journey of an All-Star Ranger .pptx
 
Hot Call Girls In Ahmedabad ✔ 7737669865 ✔ Hi I Am Divya Vip Call Girl Servic...
Hot Call Girls In Ahmedabad ✔ 7737669865 ✔ Hi I Am Divya Vip Call Girl Servic...Hot Call Girls In Ahmedabad ✔ 7737669865 ✔ Hi I Am Divya Vip Call Girl Servic...
Hot Call Girls In Ahmedabad ✔ 7737669865 ✔ Hi I Am Divya Vip Call Girl Servic...
 
European Standard S1000D, an Unnecessary Expense to OEM.pptx
European Standard S1000D, an Unnecessary Expense to OEM.pptxEuropean Standard S1000D, an Unnecessary Expense to OEM.pptx
European Standard S1000D, an Unnecessary Expense to OEM.pptx
 
Photo Copier Xerox Machine annual maintenance contract system.pdf
Photo Copier Xerox Machine annual maintenance contract system.pdfPhoto Copier Xerox Machine annual maintenance contract system.pdf
Photo Copier Xerox Machine annual maintenance contract system.pdf
 
Call Girls Bangalore🔥7023059433🔥Best Profile Escorts in Bangalore Available 24/7
Call Girls Bangalore🔥7023059433🔥Best Profile Escorts in Bangalore Available 24/7Call Girls Bangalore🔥7023059433🔥Best Profile Escorts in Bangalore Available 24/7
Call Girls Bangalore🔥7023059433🔥Best Profile Escorts in Bangalore Available 24/7
 
Call Girls in Varanasi || 7426014248 || Quick Booking at Affordable Price
Call Girls in Varanasi || 7426014248 || Quick Booking at Affordable PriceCall Girls in Varanasi || 7426014248 || Quick Booking at Affordable Price
Call Girls in Varanasi || 7426014248 || Quick Booking at Affordable Price
 
Accelerate your Sitecore development with GenAI
Accelerate your Sitecore development with GenAIAccelerate your Sitecore development with GenAI
Accelerate your Sitecore development with GenAI
 
What’s new in VictoriaMetrics - Q2 2024 Update
What’s new in VictoriaMetrics - Q2 2024 UpdateWhat’s new in VictoriaMetrics - Q2 2024 Update
What’s new in VictoriaMetrics - Q2 2024 Update
 

Apache Iceberg - A Table Format for Hige Analytic Datasets

  • 1. Apache Iceberg Ryan Blue 2019 Big Data Orchestration Summit
  • 3. ● Smarter processing engines ○ CBO, better join implementations ○ Result set caching, materialized views ● Reduce manual data maintenance ○ Data librarian services ○ Declarative instead of imperative 5-year Challenges
  • 4. ● Unsafe operations are everywhere ○ Writing to multiple partitions ○ Renaming a column ● Interaction with object stores causes major headaches ○ Eventual consistency to performance problems ○ Output committers can’t fix it ● Endless scale challenges Problem Whack-a-mole
  • 6. Iceberg is a scalable format for tables with a lot of best practices built in.
  • 7. A format? We already have Parquet, Avro and ORC . . .
  • 9. ● File formats help you modify or skip data in a single file ● Table formats do the same thing for a collection of files ● To demonstrate this, consider Hive tables . . . A table format
  • 10. ● Key idea: organize data in a directory tree date=20180513/ |- hour=18/ | |- ... |- hour=19/ | |- part-000.parquet | |- ... | |- part-031.parquet |- hour=20/ | |- ... Hive Tables
  • 11. ● Filter: WHERE date = '20180513' AND hour = 19 date=20180513/ |- hour=18/ | |- ... |- hour=19/ | |- part-000.parquet | |- ... | |- part-031.parquet |- hour=20/ | |- ... Hive Tables
  • 12. ● Problem: too much directory listing for large tables ● Solution: use HMS to track partitions date=20180513/hour=19 -> hdfs:/.../date=20180513/hour=19 date=20180513/hour=20 -> hdfs:/.../date=20180513/hour=20 ● The file system still tracks the files in each partition . . . Hive Metastore
  • 13. ● State is kept in both the metastore and in a file system ● Changes are not atomic without locking ● Requires directory listing ○ O(n) listing calls, n = # matching partitions ○ Eventual consistency breaks correctness Hive Tables: Problems
  • 14. ● Everything supports Hive tables* ○ Engines: Hive, Spark, Presto, Flink, Pig ○ Tools: Hudi, NiFi, Flume, Sqoop ● Simplicity and ubiquity have made Hive tables indispensable ● The whole ecosystem uses the same at-rest data! Hive Tables: Benefits
  • 16. ● An open spec and community for at-rest data interchange ○ Maintain a clear spec for the format ○ Design for multiple implementations across languages ○ Support needs across projects to avoid fragmentation Iceberg’s Goals
  • 17. ● Improve scale and reliability ○ Work on a single node, scale to a cluster ○ All changes are atomic, with serializable isolation ○ Native support for cloud object stores ○ Support many concurrent writers Iceberg’s Goals
  • 18. ● Fix persistent usability problems ○ In-place evolution for schema and layout (no side-effects) ○ Hide partitioning: insulate queries from physical layout ○ Support time-travel, rollback, and metadata inspection ○ Configure tables, not jobs ● Tables should have no unpleasant surprises Iceberg’s Goals
  • 19. ● Key idea: track all files in a table over time ○ A snapshot is a complete list of files in a table ○ Each write produces and commits a new snapshot Iceberg’s Design S1 S2 S3 ...
  • 20. S1 S2 S3 ... R W ● Readers use the current snapshot ● Writers optimistically create new snapshots, then commit Iceberg’s Design
  • 21. In reality, it’s a bit more complicated.
  • 22. ● All changes are atomic ● No expensive (or inconsistent) file system operations ● Snapshots are indexed for scan planning on a single node ● CBO metrics are reliable ● Versions for incremental updates and materialized views Iceberg Design Benefits
  • 24. ● Production tables: tens of petabytes, millions of partitions ○ Scan planning fits on a single node ○ Advanced filtering enables more use cases ○ Overall performance is better ● Low latency queries are faster for large tables Scale
  • 25. ● Production Flink pipeline writing in 3 AWS regions ● Lift service moving data into a single region ● Merge service compacting small files Concurrency
  • 26. ● Rollback is popular ● Metadata tables ○ Track down the version a job read ○ Find the process that wrote a bad version Usability
  • 27. ● Spark vectorization for faster bulk reads ○ Presto vectorization already done ● Row-level delete encodings ○ MERGE INTO ○ ID equality predicates Future Work
  翻译: