Introducing Databricks Delta

Unifying Data Warehousing with Data Lakes
Ali Ghodsi, Co-Founder & CEO
Oct 25, 2017

Many enterprises are undergoing
a data transformation

Databricks Customers Across Industries
• Financial Services
• Healthcare & Pharma
• Media & Entertainment
• Technology
• Public Sector
• Retail & CPG
• Consumer Services
• Energy & Industrial IoT
• Marketing & AdTech
• Data & Analytics Services

Healthcare AI use case
Correlate the EMRs of 50,000 patients
with their DNA

Enterprise AI use case
Provide recommendations to sales
using NLP and deep learning

Real-time AI use case
Curb abusive behavior
across gamers globally

Big Data was the Missing Link for AI
BIG DATA
• Customer Data
• Emails/Web pages
• Click Streams
• Sensor data (IoT)
• Video/Speech
• …
GREAT RESULTS
Most companies are struggling with Big Data

Hardest part of AI isn’t AI
“Hidden Technical Debt in Machine Learning Systems”, Google, NIPS 2015
[Figure label: ML Code]
The hardest part of AI is Big Data

Building Predictive Applications
is really Hard!

Unified Analytics Platform
• UNIFIED EXPERIENCE ACROSS TEAMS
• UNIFIED PROCESSING ENGINE

Data Warehouse (DW)
ETL important data to central DW and get Business Intelligence (BI)
THE GOOD
• Pristine Data
• Fast Queries
• Transactional
THE BAD
• Expensive to Scale, not Elastic
• Requires ETL, Stale Data, No Real-Time
• No Predictions, No ML
• Closed formats (lock in)
Not Future Proof: Missing Predictions, Real-time, Scale

Hadoop Data Lake
ETL all data to central scalable open lake for all use cases
THE GOOD
• Massive scale
• Inexpensive Storage
• Open Formats (Parquet, ORC)
• Promise of ML & Real-Time Streaming
THE BAD
• Inconsistent Data
• Unreliable for Analytics
• Lack of Schema
• Poor Performance
Becomes a cheap, messy data store with poor performance

The Current State
of Data Platforms

Info Sec at a Fortune 100 Company
DISADVANTAGES OF ARCHITECTURE
• Poor agility in responding to new threats
• Scale Limitations, no historical data
• 6 months and 20 people to build
ENTERPRISE DATA WAREHOUSE
• Only 2 weeks of data
• Very expensive to scale
• Proprietary Formats
• No Predictions (ML)
[Diagram: Billions of records a day → HADOOP DATA LAKE (messy data not ready for analytics) → Complex ETL → EDW (×3) → Incident Response, Alerting, Reports]

The Next Generation
Data Platform

Announcing Databricks Delta
First UNIFIED data management system that delivers:
• The SCALE of data lake
• The LOW-LATENCY of streaming
• The RELIABILITY & PERFORMANCE of data warehouse

Databricks Delta
Enables Predictions, Real-time and Ad Hoc Analytics at Massive Scale
THE GOOD OF DATA LAKES
• Massive scale on Amazon S3
• Open Formats (Parquet, ORC)
• Predictions (ML) & Real-Time Streaming
THE GOOD OF DATA WAREHOUSES
• Pristine Data
• Transactional Reliability
• Fast Queries (10-100x)

Databricks Delta Under the Hood
• MASSIVE SCALE: Decouple Compute & Storage
• RELIABILITY: ACID Transactions & Data Validation
• PERFORMANCE: Data Indexing & Caching (10-100x)
• LOW-LATENCY: Real-Time Streaming Ingest
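
The "ACID Transactions & Data Validation" point can be made concrete with a small sketch. The slides do not describe how Databricks Delta implements this, so the following is not its implementation. It is a minimal illustration of one common way to get atomic commits on top of plain files: an ordered log in which each commit is a numbered file created with an exclusive-create flag, so two writers cannot both claim the same version. The names `TinyLog`, `commit`, and `snapshot`, and the required-column check, are all hypothetical.

```python
import json
import os
import tempfile


class TinyLog:
    """Hypothetical append-only commit log stored as numbered JSON files."""

    def __init__(self, required_columns, root=None):
        # Assumption: a local directory stands in for shared storage.
        self.root = root or tempfile.mkdtemp(prefix="tinylog_")
        self.required_columns = set(required_columns)

    def _path(self, version):
        return os.path.join(self.root, f"{version:08d}.json")

    def latest_version(self):
        """Return the highest committed version, or -1 for an empty log."""
        versions = [int(name[:-5]) for name in os.listdir(self.root)
                    if name.endswith(".json")]
        return max(versions, default=-1)

    def commit(self, expected_version, rows):
        """Validate rows, then claim version expected_version + 1.

        Raises ValueError if a row lacks a required column, and
        FileExistsError if another writer already claimed that version.
        """
        for row in rows:
            missing = self.required_columns - set(row)
            if missing:
                raise ValueError(f"row rejected, missing columns: {sorted(missing)}")
        target = self._path(expected_version + 1)
        # O_EXCL makes creation fail if the file already exists, so only one
        # writer can win a given version. A real system would also need to
        # protect readers from seeing a partially written file.
        fd = os.open(target, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
        with os.fdopen(fd, "w") as handle:
            json.dump(rows, handle)
        return expected_version + 1

    def snapshot(self):
        """Replay every commit in version order and return all rows."""
        rows = []
        for version in range(self.latest_version() + 1):
            with open(self._path(version)) as handle:
                rows.extend(json.load(handle))
        return rows
```

A writer reads `latest_version()`, prepares its rows, and calls `commit` with the version it saw. If another writer got there first, the commit fails cleanly and can be retried against the new version, which is the optimistic-concurrency idea in miniature.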

Info Sec with Databricks Delta
[Diagram: Trillion Records a Day; DATABRICKS DELTA; ETL, Schema Validation; SQL, ML, Stream; powered by DATABRICKS RUNTIME]
ADVANTAGES
• AI-capable data warehouse at the scale of a data lake
• Interactive analysis on 2 years of data
• 2 weeks to build with a 5-person data platform team

UNIFIED EXPERIENCE ACROSS TEAMS
Notebooks, Dashboards, Reports

+
UNIFIED DATA MANAGEMENT
Reliable Transactions, Performance

Evolution of a Cutting-Edge Data Pipeline
[Diagram labels: Events, ?, Streaming Analytics, Data Lake, Reporting]

Evolution of a Cutting-Edge Data Pipeline
[Diagram labels: Events, Streaming Analytics, Data Lake, Reporting]

Challenge #1: Historical Queries?
[Diagram: Events, Streaming Analytics, Data Lake, Reporting, with λ-arch components marked 1]
Challenge #2: Messy Data?
[Diagram: Events, Streaming Analytics, Data Lake, Reporting, with λ-arch components marked 1 and Validation steps marked 2]
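
A Validation step of the kind shown in the diagram could look roughly like the sketch below, which checks each record against an expected schema and sets aside anything that does not conform. The schema, the field names, and the `split_records` helper are hypothetical; the slide does not say how validation is performed.

```python
# Assumed schema for illustration: field name -> required Python type.
EXPECTED_SCHEMA = {"user_id": int, "event": str, "ts": float}


def validate(record, schema=EXPECTED_SCHEMA):
    """Return a list of problems found in one record (empty list means valid)."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field {field!r}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(
                f"field {field!r} expected {expected_type.__name__}, got {actual}"
            )
    return problems


def split_records(records, schema=EXPECTED_SCHEMA):
    """Separate clean records from quarantined (record, problems) pairs."""
    valid, quarantined = [], []
    for record in records:
        problems = validate(record, schema)
        if problems:
            quarantined.append((record, problems))
        else:
            valid.append(record)
    return valid, quarantined
```

Keeping the rejected records together with the reasons they failed, rather than dropping them, makes it possible to inspect and repair them later.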

Challenge #3: Mistakes and Failures?
[Diagram: Events, Streaming Analytics, Data Lake, Reporting, with λ-arch components marked 1, Validation steps marked 2, and Reprocessing and Partitioned marked 3]
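
The Partitioned and Reprocessing labels suggest a familiar recovery pattern: keep the data partitioned (by date, for example) so that a mistake can be fixed by rebuilding only the affected partition from the raw events. The sketch below illustrates that idea with in-memory dictionaries. The function names, the `date` field, and the choice of date as the partition key are assumptions, not details from the slides.

```python
def build_table(raw_events, transform):
    """Build a table as {partition date: [transformed rows]}."""
    table = {}
    for event in raw_events:
        table.setdefault(event["date"], []).append(transform(event))
    return table


def reprocess_partition(table, raw_events, date, transform):
    """Rebuild one partition from raw events and swap it in wholesale.

    The original table is left untouched. Because the partition is replaced
    rather than appended to, running this twice gives the same result.
    """
    rebuilt = [transform(event) for event in raw_events if event["date"] == date]
    new_table = dict(table)
    new_table[date] = rebuilt
    return new_table
```

Replacing a whole partition keeps the fix idempotent, but readers querying during the swap may still see inconsistent data unless the replacement itself is atomic.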

Challenge #4: Query Performance?
[Diagram: Events, Streaming Analytics, Data Lake, Reporting, with λ-arch components marked 1, Validation steps marked 2, Reprocessing and Partitioned marked 3, and Compaction, Compact Small Files, and Scheduled to Avoid Compaction marked 4]
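
The Compact Small Files step addresses a common problem: streaming writes tend to produce many tiny files, and queries slow down when they must open each one. A compaction job typically groups small files into batches and rewrites each batch as a single larger file. The planner below is a minimal sketch of the grouping step only. The `plan_compaction` name, the greedy strategy, and the sizes used are assumptions for illustration.

```python
def plan_compaction(file_sizes, target_size):
    """Group small files into batches whose combined size stays within target_size.

    file_sizes maps file name -> size in bytes. Files already at or above
    target_size are left alone, and a batch containing a single file is
    dropped because rewriting it alone would gain nothing.
    """
    batches = []
    current, current_size = [], 0
    # Sort by size (then name) so the plan is deterministic.
    for name, size in sorted(file_sizes.items(), key=lambda item: (item[1], item[0])):
        if size >= target_size:
            continue
        if current and current_size + size > target_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return [batch for batch in batches if len(batch) > 1]
```

For example, with a 100-byte target, files of 10, 20, and 30 bytes would be merged into one batch, while a lone 50-byte file and a 200-byte file would be left as they are. Running the rewrite safely while queries and streaming writes are in flight is the harder part, which is presumably why the diagram schedules jobs around compaction.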

Let’s try it instead with DELTA

The Canonical Data Pipeline
[Diagram: Events, Streaming Analytics, Data Lake, Reporting, with each added component labeled by its Challenge number: λ-arch (1), Validation (2), Reprocessing and Partitioned (3), and Compaction, Compact Small Files, and Scheduled to Avoid Compaction (4)]

The Delta Architecture
[Diagram labels: DELTA, DATA LAKE, Reporting, Streaming Analytics]
• The LOW-LATENCY of streaming
• The RELIABILITY & PERFORMANCE of data warehouse
• The SCALE of data lake

Sign up for the Private Beta
visit databricks.
