尊敬的 微信汇率:1円 ≈ 0.046166 元 支付宝汇率:1円 ≈ 0.046257元 [退出登录]
SlideShare a Scribd company logo
Vicky Avison Cox Automotive UK
Alex Bush KPMG Lighthouse New Zealand
Best Practices for Building and Deploying Data
Pipelines in Apache Spark
#UnifiedDataAnalytics #SparkAISummit
Cox Automotive
Data Platform
KPMG Lighthouse
Centre of Excellence for Information and Analytics
We provide services across the data value chain including:
● Data strategy and analytics maturity assessment
● Information management
● Data engineering
● Data warehousing, business intelligence (BI) and data visualisation
● Data science, advanced analytics and artificial intelligence (AI)
● Cloud-based analytics services
What is this talk about?
- What are data pipelines and who builds them?
- Why is data pipeline development difficult to get right?
- How have we changed the way we develop and deploy our data
pipelines?
What are data pipelines and who builds them?
Data Sources
(e.g. files, relational
databases, REST APIs)
What do we mean by ‘Data Pipeline’?
Data Platform
(storage + compute)
Ingest the raw data1
raw data
table_a
table_b
table_c
table_d
table_e
table_f
...
...
prepared data
data model_a
data model_b
data model_c
data model_d
1
Prepare the data for use
in further analysis and
dashboards, meaning:
a. Deduplication
b. Cleansing
c. Enrichment
d. Creation of data
models i.e joins,
aggregations etc.
2
2
Who is in a data team?
Data Engineering
Deep understanding of the
technology
Know how to build robust,
performant data pipelines
Business Intelligence
and Data Science
Deep understanding of the data
Know how to extract business
value from the data
Why is data pipeline development difficult to get
right?
What do we need to think about when building a pipeline?
1. How do we handle late-arriving or duplicate data?
2. How can we ensure that if the pipeline fails part-way through, we can run it again without
any problems?
3. How do we avoid the small-file problem?
4. How do we monitor data quality?
5. How do we configure our application? (credentials, input paths, output paths etc.)
6. How do we maximise performance?
7. How do extract only what we need from the source (e.g. only extract new records from
RDBM)?
What about the business logic?
table_a table_b table_c table_d
cleanup, convert data types, create user-friendly column names, add derived columns etc.
a_b_model
join a and b together
d_counts
group by and perform counts
b_c_d_model
join b, c and the aggregated d
together
deduplicated raw data
deployment
location
environment
(data location)
paths
Hive
databases
What about deployments?
environment (server)
software
interaction response
Traditional software development e.g
web development
softwareinput data
Data development
deployment
deployment
What are the main challenges?
- A lot of overhead for every new data pipeline, even when the problems are
very similar each time
- Production-grade business logic is hard to write without specialist Data
Engineering skills
- No tools or best practice around deploying and managing environments for
data pipelines
How have we changed the way we develop and
deploy our data pipelines?
A long(ish) time ago, in an office quite far away….
How were we dealing with the main challenges?
A lot of overhead for every new data pipeline, even when the problems are very
similar each time
We were… shoehorning new pipeline requirements into a single application in an
attempt to avoid the overhead
How were we dealing with the main challenges?
Production-grade business logic hard to write without specialist Data
Engineering skills
We were… taking business logic defined by our BI and Data Science colleagues
and reimplementing it
How were we dealing with the main challenges?
No tools or best practice around deploying and managing environments for data
pipelines
We were… manually deploying jars, passing environment-specific configuration
to our applications each time we ran them
Business
Logic
Applications
Business
Logic
Applications
Could we make better use of the skills in the team?
Business Intelligence
and Data Science
Data Engineering
Data
Platform
Business
Engagement
Data
Ingestion
Applications Modelling
Consulting
Business
Logic
Definition
Deep understanding of the technology Deep understanding of the data
Data
ExplorationTools and
Frameworks
What tools and frameworks would we need to provide?
Spark and Hadoop KMS Delta Lake Deequ Etc...
Third-party services and libraries
Configuration
Management
Idempotency and
Atomicity
Deduplication
Compaction
Table Metadata
Management
Action Coordination
Boilerplate and
Structuring
Data Engineering frameworks
Business Logic
Data Science and Business
Intelligence Applications
Environment
Management
Application
Deployment
Data Engineering
tools
Data Ingestion
Data Engineering
Applications
How would we design a Data Engineering framework?
Input
Output
{ Inputs }
{ Transformations }
{ Outputs }
Business Logic
Performance
Optimisations
Data Quality
Monitoring
Deduplication
Compaction
ConfigurationMgmt
etc.
Spark and Hadoop
High-level APIs
Business Logic
Complexity hidden
behind high-level APIs
Intuitive structuring of
business logic code
Injection of optimisations
and monitoring
Efficient scheduling of
transformations and
actions
How would we like to manage deployments?
v 0.1 v 0.2
Paths
/data/prod/my_project
Hive databases
prod_my_project
Deployed jars
my_project-0.1.jar
/data/dev/my_project/feature_one
dev_my_project_feature_one
my_project_feature_one-0.2-SNAPSHOT.jar
/data/dev/my_project/feature_two
dev_my_project_feature_two
my_project_feature_two-0.2-SNAPSHOT.jar
master
feature/one
feature/two
my_project-0.2.jar
What does this look like in practice?
Simpler data ingestion
case class SQLServerConnectionDetails(server: String, user: String, password: String)
val dbConf = CaseClassConfigParser[SQLServerConnectionDetails](
SparkFlowContext(spark), "app1.dbconf"
)
Retrieve server configuration from
combination of Spark conf and
Databricks Secrets
Pull deltas from SQLServer
temporal tables and store in
storage layer
Storage layer will capture last
updated values and primary keys
Small files will be compacted once
between 11pm and 4am
Flow is lazy, nothing happens until
execute is called
val flow = Waimak.sparkFlow(spark)
.extractToStorageFromRDBM(
rdbmExtractor = new SQLServerTemporalExtractor(spark, dbConf),
dbSchema = ...,
storageBasePath = ...,
tableConfigs = ...,
extractDateTime = ZonedDateTime.now(),
doCompaction = runSingleCompactionDuringWindow(23, 4)
)("table1", "table2", "table3")
flow.execute()
val flow = Waimak.sparkFlow(spark)
.snapshotFromStorage(basePath, tsNow)("table1", "table2", "table3")
.transform("table1", "table2")("model1")(
(t1, t2) => t1.join(t2, "pk1")
)
.transform("table3", "model1")("model2")(
(t3, m1) => t3.join(m1, "pk2")
)
.sql("table1", "model2")("reporting1",
"""select m2.pk1, count(t1.col1) from model2 m2
left join table1 t1 on m2.pk1 = t1.fpk1
group by m2.pk1"""
)
.writeHiveManagedTable("model_db")("model1", "model2")
.writeHiveManagedTable("reporting_db")("reporting1")
.sparkCache("model2")
.addDeequCheck(
"reporting1",
Check(Error, "Not valid PK").isPrimaryKey("pk1")
)(ExceptionQualityAlert())
Simpler business logic development
Execute with flow.execute()
Read from storage layer and deduplicate
Perform two transformations on labels
using the DataFrame API, generating
two more labels
Perform a Spark SQL transformation on
a table and label generated during a
transform, generating one more label
Write labels to two different databases
Add explicit caching and data quality
monitoring actions
Simpler environment management
case class MySparkAppEnv(project: String, //e.g. my_spark_app
environment: String, //e.g. dev
branch: String //e.g. feature/one
) extends HiveEnv
Environment consists of a base path:
/data/dev/my_spark_app/feature_one/
And a Hive database:
dev_my_spark_app_feature_one
object MySparkApp extends SparkApp[MySparkAppEnv] {
override def run(sparkSession: SparkSession, env: MySparkAppEnv): Unit =
//We have a base path and Hive database available in env via env.basePath and env.baseDBName
???
}
Define application logic given a
SparkSession and an environment
Use MultiAppRunner to run apps
individually or together with
dependencies
spark.waimak.apprunner.apps = my_spark_app
spark.waimak.apprunner.my_spark_app.appClassName = com.example.MySparkApp
spark.waimak.environment.my_spark_app.environment = dev
spark.waimak.environment.my_spark_app.branch = feature/one
Simpler deployments
Questions?
github.com/
CoxAutomotiveDataSolutions/
waimak

More Related Content

What's hot

Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
James Serra
 
Understanding my database through SQL*Plus using the free tool eDB360
Understanding my database through SQL*Plus using the free tool eDB360Understanding my database through SQL*Plus using the free tool eDB360
Understanding my database through SQL*Plus using the free tool eDB360
Carlos Sierra
 
Modern Data Pipelines
Modern Data PipelinesModern Data Pipelines
Modern Data Pipelines
Karthik Ramasamy
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
P. Taylor Goetz
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
Laurent Leturgez
 
Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...
Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...
Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...
Cathrine Wilhelmsen
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Paris Data Engineers !
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
James Serra
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
Trivadis
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
Databricks
 
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
Databricks
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
Adam Doyle
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
Rodney Joyce
 
오픈소스를 활용한 Batch_처리_플랫폼_공유
오픈소스를 활용한 Batch_처리_플랫폼_공유오픈소스를 활용한 Batch_처리_플랫폼_공유
오픈소스를 활용한 Batch_처리_플랫폼_공유
knight1128
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
Dustin Vannoy
 
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
Chris Hoyean Song
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
Prakash Chockalingam
 
Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)
Daniel Toomey
 

What's hot (20)

Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Understanding my database through SQL*Plus using the free tool eDB360
Understanding my database through SQL*Plus using the free tool eDB360Understanding my database through SQL*Plus using the free tool eDB360
Understanding my database through SQL*Plus using the free tool eDB360
 
Modern Data Pipelines
Modern Data PipelinesModern Data Pipelines
Modern Data Pipelines
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...
Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...
Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
 
오픈소스를 활용한 Batch_처리_플랫폼_공유
오픈소스를 활용한 Batch_처리_플랫폼_공유오픈소스를 활용한 Batch_처리_플랫폼_공유
오픈소스를 활용한 Batch_처리_플랫폼_공유
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
 
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)
 

Similar to Best Practices for Building and Deploying Data Pipelines in Apache Spark

Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications
Humoyun Ahmedov
 
Mr bi
Mr biMr bi
Mr bi
renjan131
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
eRic Choo
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
Databricks
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
Data Science Thailand
 
Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101
MongoDB
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Databricks
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
Yousun Jeong
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
Paulo Gutierrez
 
My Master's Thesis
My Master's ThesisMy Master's Thesis
My Master's Thesis
Humoyun Ahmedov
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
Databricks
 
Big Data Meetup #7
Big Data Meetup #7Big Data Meetup #7
Big Data Meetup #7
Paul Lo
 
Resume_Asad_updated_DEC2016
Resume_Asad_updated_DEC2016Resume_Asad_updated_DEC2016
Resume_Asad_updated_DEC2016
Asadullah Khan
 
Azure Data.pptx
Azure Data.pptxAzure Data.pptx
Azure Data.pptx
FedoRam1
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
Amazon Web Services
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media
 
Democratization of Data @Indix
Democratization of Data @IndixDemocratization of Data @Indix
Democratization of Data @Indix
Manoj Mahalingam
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Paco Nathan
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Spark Summit
 
Developing Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsDeveloping Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data Platforms
ScyllaDB
 

Similar to Best Practices for Building and Deploying Data Pipelines in Apache Spark (20)

Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications
 
Mr bi
Mr biMr bi
Mr bi
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
 
Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
My Master's Thesis
My Master's ThesisMy Master's Thesis
My Master's Thesis
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Big Data Meetup #7
Big Data Meetup #7Big Data Meetup #7
Big Data Meetup #7
 
Resume_Asad_updated_DEC2016
Resume_Asad_updated_DEC2016Resume_Asad_updated_DEC2016
Resume_Asad_updated_DEC2016
 
Azure Data.pptx
Azure Data.pptxAzure Data.pptx
Azure Data.pptx
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Democratization of Data @Indix
Democratization of Data @IndixDemocratization of Data @Indix
Democratization of Data @Indix
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
Developing Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsDeveloping Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data Platforms
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Recently uploaded

Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...
Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...
Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...
ThinkInnovation
 
Mumbai Call Girls service 9920874524 Call Girl service in Mumbai Mumbai Call ...
Mumbai Call Girls service 9920874524 Call Girl service in Mumbai Mumbai Call ...Mumbai Call Girls service 9920874524 Call Girl service in Mumbai Mumbai Call ...
Mumbai Call Girls service 9920874524 Call Girl service in Mumbai Mumbai Call ...
hanshkumar9870
 
Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...
Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...
Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...
uthkarshkumar987000
 
Telemetry Solution for Gaming (AWS Summit'24)
Telemetry Solution for Gaming (AWS Summit'24)Telemetry Solution for Gaming (AWS Summit'24)
Telemetry Solution for Gaming (AWS Summit'24)
GeorgiiSteshenko
 
IBM watsonx.data - Seller Enablement Deck.PPTX
IBM watsonx.data - Seller Enablement Deck.PPTXIBM watsonx.data - Seller Enablement Deck.PPTX
IBM watsonx.data - Seller Enablement Deck.PPTX
EbtsamRashed
 
Direct Lake Deep Dive slides from Fabric Engineering Roadshow
Direct Lake Deep Dive slides from Fabric Engineering RoadshowDirect Lake Deep Dive slides from Fabric Engineering Roadshow
Direct Lake Deep Dive slides from Fabric Engineering Roadshow
Gabi Münster
 
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
Timothy Spann
 
Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...
Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...
Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...
mona lisa $A12
 
🔥Book Call Girls Lucknow 💯Call Us 🔝 6350257716 🔝💃Independent Lucknow Escorts ...
🔥Book Call Girls Lucknow 💯Call Us 🔝 6350257716 🔝💃Independent Lucknow Escorts ...🔥Book Call Girls Lucknow 💯Call Us 🔝 6350257716 🔝💃Independent Lucknow Escorts ...
🔥Book Call Girls Lucknow 💯Call Us 🔝 6350257716 🔝💃Independent Lucknow Escorts ...
AK47
 
一比一原版(sfu学位证书)西蒙弗雷泽大学毕业证如何办理
一比一原版(sfu学位证书)西蒙弗雷泽大学毕业证如何办理一比一原版(sfu学位证书)西蒙弗雷泽大学毕业证如何办理
一比一原版(sfu学位证书)西蒙弗雷泽大学毕业证如何办理
gebegu
 
Fabric Engineering Deep Dive Keynote from Fabric Engineering Roadshow
Fabric Engineering Deep Dive Keynote from Fabric Engineering RoadshowFabric Engineering Deep Dive Keynote from Fabric Engineering Roadshow
Fabric Engineering Deep Dive Keynote from Fabric Engineering Roadshow
Gabi Münster
 
Call Girls In Tirunelveli 👯‍♀️ 7339748667 🔥 Safe Housewife Call Girl Service ...
Call Girls In Tirunelveli 👯‍♀️ 7339748667 🔥 Safe Housewife Call Girl Service ...Call Girls In Tirunelveli 👯‍♀️ 7339748667 🔥 Safe Housewife Call Girl Service ...
Call Girls In Tirunelveli 👯‍♀️ 7339748667 🔥 Safe Housewife Call Girl Service ...
wwefun9823#S0007
 
🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...
🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...
🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...
rukmnaikaseen
 
Pune Call Girls <BOOK> 😍 Call Girl Pune Escorts Service
Pune Call Girls <BOOK> 😍 Call Girl Pune Escorts ServicePune Call Girls <BOOK> 😍 Call Girl Pune Escorts Service
Pune Call Girls <BOOK> 😍 Call Girl Pune Escorts Service
vashimk775
 
machine learning notes by Andrew Ng and Tengyu Ma
machine learning notes by Andrew Ng and Tengyu Mamachine learning notes by Andrew Ng and Tengyu Ma
machine learning notes by Andrew Ng and Tengyu Ma
Vijayabaskar Uthirapathy
 
saps4hanaandsapanalyticswheretodowhat1565272000538.pdf
saps4hanaandsapanalyticswheretodowhat1565272000538.pdfsaps4hanaandsapanalyticswheretodowhat1565272000538.pdf
saps4hanaandsapanalyticswheretodowhat1565272000538.pdf
newdirectionconsulta
 
Essential Skills for Family Assessment - Marital and Family Therapy and Couns...
Essential Skills for Family Assessment - Marital and Family Therapy and Couns...Essential Skills for Family Assessment - Marital and Family Therapy and Couns...
Essential Skills for Family Assessment - Marital and Family Therapy and Couns...
PsychoTech Services
 
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call GirlCall Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
sapna sharmap11
 
202406 - Cape Town Snowflake User Group - LLM & RAG.pdf
202406 - Cape Town Snowflake User Group - LLM & RAG.pdf202406 - Cape Town Snowflake User Group - LLM & RAG.pdf
202406 - Cape Town Snowflake User Group - LLM & RAG.pdf
Douglas Day
 
❻❸❼⓿❽❻❷⓿⓿❼KALYAN MATKA CHART FINAL OPEN JODI PANNA FIXXX DPBOSS MATKA RESULT ...
❻❸❼⓿❽❻❷⓿⓿❼KALYAN MATKA CHART FINAL OPEN JODI PANNA FIXXX DPBOSS MATKA RESULT ...❻❸❼⓿❽❻❷⓿⓿❼KALYAN MATKA CHART FINAL OPEN JODI PANNA FIXXX DPBOSS MATKA RESULT ...
❻❸❼⓿❽❻❷⓿⓿❼KALYAN MATKA CHART FINAL OPEN JODI PANNA FIXXX DPBOSS MATKA RESULT ...
#kalyanmatkaresult #dpboss #kalyanmatka #satta #matka #sattamatka
 

Recently uploaded (20)

Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...
Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...
Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...
 
Mumbai Call Girls service 9920874524 Call Girl service in Mumbai Mumbai Call ...
Mumbai Call Girls service 9920874524 Call Girl service in Mumbai Mumbai Call ...Mumbai Call Girls service 9920874524 Call Girl service in Mumbai Mumbai Call ...
Mumbai Call Girls service 9920874524 Call Girl service in Mumbai Mumbai Call ...
 
Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...
Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...
Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...
 
Telemetry Solution for Gaming (AWS Summit'24)
Telemetry Solution for Gaming (AWS Summit'24)Telemetry Solution for Gaming (AWS Summit'24)
Telemetry Solution for Gaming (AWS Summit'24)
 
IBM watsonx.data - Seller Enablement Deck.PPTX
IBM watsonx.data - Seller Enablement Deck.PPTXIBM watsonx.data - Seller Enablement Deck.PPTX
IBM watsonx.data - Seller Enablement Deck.PPTX
 
Direct Lake Deep Dive slides from Fabric Engineering Roadshow
Direct Lake Deep Dive slides from Fabric Engineering RoadshowDirect Lake Deep Dive slides from Fabric Engineering Roadshow
Direct Lake Deep Dive slides from Fabric Engineering Roadshow
 
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
 
Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...
Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...
Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...
 
🔥Book Call Girls Lucknow 💯Call Us 🔝 6350257716 🔝💃Independent Lucknow Escorts ...
🔥Book Call Girls Lucknow 💯Call Us 🔝 6350257716 🔝💃Independent Lucknow Escorts ...🔥Book Call Girls Lucknow 💯Call Us 🔝 6350257716 🔝💃Independent Lucknow Escorts ...
🔥Book Call Girls Lucknow 💯Call Us 🔝 6350257716 🔝💃Independent Lucknow Escorts ...
 
一比一原版(sfu学位证书)西蒙弗雷泽大学毕业证如何办理
一比一原版(sfu学位证书)西蒙弗雷泽大学毕业证如何办理一比一原版(sfu学位证书)西蒙弗雷泽大学毕业证如何办理
一比一原版(sfu学位证书)西蒙弗雷泽大学毕业证如何办理
 
Fabric Engineering Deep Dive Keynote from Fabric Engineering Roadshow
Fabric Engineering Deep Dive Keynote from Fabric Engineering RoadshowFabric Engineering Deep Dive Keynote from Fabric Engineering Roadshow
Fabric Engineering Deep Dive Keynote from Fabric Engineering Roadshow
 
Call Girls In Tirunelveli 👯‍♀️ 7339748667 🔥 Safe Housewife Call Girl Service ...
Call Girls In Tirunelveli 👯‍♀️ 7339748667 🔥 Safe Housewife Call Girl Service ...Call Girls In Tirunelveli 👯‍♀️ 7339748667 🔥 Safe Housewife Call Girl Service ...
Call Girls In Tirunelveli 👯‍♀️ 7339748667 🔥 Safe Housewife Call Girl Service ...
 
🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...
🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...
🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...
 
Pune Call Girls <BOOK> 😍 Call Girl Pune Escorts Service
Pune Call Girls <BOOK> 😍 Call Girl Pune Escorts ServicePune Call Girls <BOOK> 😍 Call Girl Pune Escorts Service
Pune Call Girls <BOOK> 😍 Call Girl Pune Escorts Service
 
machine learning notes by Andrew Ng and Tengyu Ma
machine learning notes by Andrew Ng and Tengyu Mamachine learning notes by Andrew Ng and Tengyu Ma
machine learning notes by Andrew Ng and Tengyu Ma
 
saps4hanaandsapanalyticswheretodowhat1565272000538.pdf
saps4hanaandsapanalyticswheretodowhat1565272000538.pdfsaps4hanaandsapanalyticswheretodowhat1565272000538.pdf
saps4hanaandsapanalyticswheretodowhat1565272000538.pdf
 
Essential Skills for Family Assessment - Marital and Family Therapy and Couns...
Essential Skills for Family Assessment - Marital and Family Therapy and Couns...Essential Skills for Family Assessment - Marital and Family Therapy and Couns...
Essential Skills for Family Assessment - Marital and Family Therapy and Couns...
 
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call GirlCall Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
 
202406 - Cape Town Snowflake User Group - LLM & RAG.pdf
202406 - Cape Town Snowflake User Group - LLM & RAG.pdf202406 - Cape Town Snowflake User Group - LLM & RAG.pdf
202406 - Cape Town Snowflake User Group - LLM & RAG.pdf
 
❻❸❼⓿❽❻❷⓿⓿❼KALYAN MATKA CHART FINAL OPEN JODI PANNA FIXXX DPBOSS MATKA RESULT ...
❻❸❼⓿❽❻❷⓿⓿❼KALYAN MATKA CHART FINAL OPEN JODI PANNA FIXXX DPBOSS MATKA RESULT ...❻❸❼⓿❽❻❷⓿⓿❼KALYAN MATKA CHART FINAL OPEN JODI PANNA FIXXX DPBOSS MATKA RESULT ...
❻❸❼⓿❽❻❷⓿⓿❼KALYAN MATKA CHART FINAL OPEN JODI PANNA FIXXX DPBOSS MATKA RESULT ...
 

Best Practices for Building and Deploying Data Pipelines in Apache Spark

  • 1. Vicky Avison Cox Automotive UK Alex Bush KPMG Lighthouse New Zealand Best Practices for Building and Deploying Data Pipelines in Apache Spark #UnifiedDataAnalytics #SparkAISummit
  • 3. KPMG Lighthouse Centre of Excellence for Information and Analytics We provide services across the data value chain including: ● Data strategy and analytics maturity assessment ● Information management ● Data engineering ● Data warehousing, business intelligence (BI) and data visualisation ● Data science, advanced analytics and artificial intelligence (AI) ● Cloud-based analytics services
  • 4. What is this talk about? - What are data pipelines and who builds them? - Why is data pipeline development difficult to get right? - How have we changed the way we develop and deploy our data pipelines?
  • 5. What are data pipelines and who builds them?
  • 6. Data Sources (e.g. files, relational databases, REST APIs) What do we mean by ‘Data Pipeline’? Data Platform (storage + compute) Ingest the raw data1 raw data table_a table_b table_c table_d table_e table_f ... ... prepared data data model_a data model_b data model_c data model_d 1 Prepare the data for use in further analysis and dashboards, meaning: a. Deduplication b. Cleansing c. Enrichment d. Creation of data models i.e joins, aggregations etc. 2 2
  • 7. Who is in a data team? Data Engineering Deep understanding of the technology Know how to build robust, performant data pipelines Business Intelligence and Data Science Deep understanding of the data Know how to extract business value from the data
  • 8. Why is data pipeline development difficult to get right?
  • 9. What do we need to think about when building a pipeline? 1. How do we handle late-arriving or duplicate data? 2. How can we ensure that if the pipeline fails part-way through, we can run it again without any problems? 3. How do we avoid the small-file problem? 4. How do we monitor data quality? 5. How do we configure our application? (credentials, input paths, output paths etc.) 6. How do we maximise performance? 7. How do extract only what we need from the source (e.g. only extract new records from RDBM)?
  • 10. What about the business logic? table_a table_b table_c table_d cleanup, convert data types, create user-friendly column names, add derived columns etc. a_b_model join a and b together d_counts group by and perform counts b_c_d_model join b, c and the aggregated d together deduplicated raw data
  • 11. deployment location environment (data location) paths Hive databases What about deployments? environment (server) software interaction response Traditional software development e.g web development softwareinput data Data development deployment deployment
  • 12. What are the main challenges? - A lot of overhead for every new data pipeline, even when the problems are very similar each time - Production-grade business logic is hard to write without specialist Data Engineering skills - No tools or best practice around deploying and managing environments for data pipelines
  • 13. How have we changed the way we develop and deploy our data pipelines?
  • 14. A long(ish) time ago, in an office quite far away….
  • 15. How were we dealing with the main challenges? A lot of overhead for every new data pipeline, even when the problems are very similar each time We were… shoehorning new pipeline requirements into a single application in an attempt to avoid the overhead
  • 16. How were we dealing with the main challenges? Production-grade business logic hard to write without specialist Data Engineering skills We were… taking business logic defined by our BI and Data Science colleagues and reimplementing it
  • 17. How were we dealing with the main challenges? No tools or best practice around deploying and managing environments for data pipelines We were… manually deploying jars, passing environment-specific configuration to our applications each time we ran them
  • 18. Business Logic Applications Business Logic Applications Could we make better use of the skills in the team? Business Intelligence and Data Science Data Engineering Data Platform Business Engagement Data Ingestion Applications Modelling Consulting Business Logic Definition Deep understanding of the technology Deep understanding of the data Data ExplorationTools and Frameworks
  • 19. What tools and frameworks would we need to provide? Spark and Hadoop KMS Delta Lake Deequ Etc... Third-party services and libraries Configuration Management Idempotency and Atomicity Deduplication Compaction Table Metadata Management Action Coordination Boilerplate and Structuring Data Engineering frameworks Business Logic Data Science and Business Intelligence Applications Environment Management Application Deployment Data Engineering tools Data Ingestion Data Engineering Applications
  • 20. How would we design a Data Engineering framework? Input Output { Inputs } { Transformations } { Outputs } Business Logic Performance Optimisations Data Quality Monitoring Deduplication Compaction ConfigurationMgmt etc. Spark and Hadoop High-level APIs Business Logic Complexity hidden behind high-level APIs Intuitive structuring of business logic code Injection of optimisations and monitoring Efficient scheduling of transformations and actions
  • 21. How would we like to manage deployments? v 0.1 v 0.2 Paths /data/prod/my_project Hive databases prod_my_project Deployed jars my_project-0.1.jar /data/dev/my_project/feature_one dev_my_project_feature_one my_project_feature_one-0.2-SNAPSHOT.jar /data/dev/my_project/feature_two dev_my_project_feature_two my_project_feature_two-0.2-SNAPSHOT.jar master feature/one feature/two my_project-0.2.jar
  • 22. What does this look like in practice?
  • 23. Simpler data ingestion case class SQLServerConnectionDetails(server: String, user: String, password: String) val dbConf = CaseClassConfigParser[SQLServerConnectionDetails]( SparkFlowContext(spark), "app1.dbconf" ) Retrieve server configuration from combination of Spark conf and Databricks Secrets Pull deltas from SQLServer temporal tables and store in storage layer Storage layer will capture last updated values and primary keys Small files will be compacted once between 11pm and 4am Flow is lazy, nothing happens until execute is called val flow = Waimak.sparkFlow(spark) .extractToStorageFromRDBM( rdbmExtractor = new SQLServerTemporalExtractor(spark, dbConf), dbSchema = ..., storageBasePath = ..., tableConfigs = ..., extractDateTime = ZonedDateTime.now(), doCompaction = runSingleCompactionDuringWindow(23, 4) )("table1", "table2", "table3") flow.execute()
  • 24. val flow = Waimak.sparkFlow(spark) .snapshotFromStorage(basePath, tsNow)("table1", "table2", "table3") .transform("table1", "table2")("model1")( (t1, t2) => t1.join(t2, "pk1") ) .transform("table3", "model1")("model2")( (t3, m1) => t3.join(m1, "pk2") ) .sql("table1", "model2")("reporting1", """select m2.pk1, count(t1.col1) from model2 m2 left join table1 t1 on m2.pk1 = t1.fpk1 group by m2.pk1""" ) .writeHiveManagedTable("model_db")("model1", "model2") .writeHiveManagedTable("reporting_db")("reporting1") .sparkCache("model2") .addDeequCheck( "reporting1", Check(Error, "Not valid PK").isPrimaryKey("pk1") )(ExceptionQualityAlert()) Simpler business logic development Execute with flow.execute() Read from storage layer and deduplicate Perform two transformations on labels using the DataFrame API, generating two more labels Perform a Spark SQL transformation on a table and label generated during a transform, generating one more label Write labels to two different databases Add explicit caching and data quality monitoring actions
  • 25. Simpler environment management case class MySparkAppEnv(project: String, //e.g. my_spark_app environment: String, //e.g. dev branch: String //e.g. feature/one ) extends HiveEnv Environment consists of a base path: /data/dev/my_spark_app/feature_one/ And a Hive database: dev_my_spark_app_feature_one object MySparkApp extends SparkApp[MySparkAppEnv] { override def run(sparkSession: SparkSession, env: MySparkAppEnv): Unit = //We have a base path and Hive database available in env via env.basePath and env.baseDBName ??? } Define application logic given a SparkSession and an environment Use MultiAppRunner to run apps individually or together with dependencies spark.waimak.apprunner.apps = my_spark_app spark.waimak.apprunner.my_spark_app.appClassName = com.example.MySparkApp spark.waimak.environment.my_spark_app.environment = dev spark.waimak.environment.my_spark_app.branch = feature/one
  翻译: