Modularized ETL
Writing with Spark
Neelesh Salian
Software Engineer - Stitch Fix
May 27, 2021
whoami
Neelesh Salian
Software Engineer - Data Platform
Agenda
▪ What is Stitch Fix?
▪ Apache Spark @ Stitch Fix
▪ Spark Writer Modules
▪ Learnings & Future Work
What is Stitch Fix?
What does the company do?
Stitch Fix is a personalized styling service
Shop at your personal curated store. Check out what you like.
Data Science is behind everything we do
algorithms-tour.stitchfix.com
• Algorithms org
• 145+ Data Scientists and Platform engineers
• 3 main verticals + platform
Apache Spark @ Stitch Fix
How do we use Spark in our teams?
Spark @ Stitch Fix - History and Current State
How it started
▪ Spark was introduced to enhance and scale ETL capabilities (circa 2016)
▪ Starting version: 1.2.x
▪ Spark SQL was the dominant use case
▪ Used for reading and writing data into the warehouse as Hive tables
How it’s going
▪ Current version: 2.4.x, with 3.1.x in prototyping
▪ Used for all ETL reads and writes, production and test
▪ Spark serves regular PySpark, SQL, and Scala jobs, notebooks, and pandas-based readers/writers
▪ Controls all writing with added functionality [this talk]
Spark @ Stitch Fix - Spark Tooling
• Spark SQL + PySpark + Scala
• Containerized Spark driver + AWS EMR (for compute)
• Used for production and staging ETL by Data Scientists
• Notebooks
• Jupyterhub setup with Stitch Fix libraries and python packages pre-installed.
• Used by Data Scientists to test and prototype
• Pandas-based Readers - Writers
• Reads and writes data using pandas dataframes
• No bootstrap time for Spark jobs - uses Apache Livy for execution
• Used for test + production
All the tooling available to Data Scientists to use Spark to read and write data
Spark @ Stitch Fix - Writing data to the warehouse
Spark @ Stitch Fix - Steps while writing data
At the start, and even today, writing data through the writer library
has these steps.
1. Validation - check dataframe for type matches, schema matches
to the Hive table, overflow type checks.
2. Writing the data into files in S3 - parquet or text format based on
the Hive table’s configuration
3. Update the Hive Metastore - with versioning scheme for data.
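As a minimal sketch of what step 1 could look like in Spark, the hypothetical helper below (not the actual Stitch Fix library) checks that the incoming DataFrame's column names and types match the target Hive table before anything is written; the overflow type checks are omitted.

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical validation helper: compare the DataFrame schema with the Hive table schema.
def validateAgainstTable(spark: SparkSession,
                         df: DataFrame,
                         databaseName: String,
                         tableName: String): DataFrame = {
  val tableSchema = spark.table(s"$databaseName.$tableName").schema

  val expected = tableSchema.map(f => (f.name, f.dataType)).toSet
  val actual   = df.schema.map(f => (f.name, f.dataType)).toSet

  require(expected == actual,
    s"Schema mismatch for $databaseName.$tableName: " +
      s"missing=${expected.diff(actual)} unexpected=${actual.diff(expected)}")

  // Reorder columns to match the table definition before writing.
  df.select(tableSchema.map(f => df.col(f.name)): _*)
}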
Spark @ Stitch Fix - Data Versioning
• Writing into a Partitioned Table (e.g. partitioned by a date_column for a date value of 20210527)
• s3://<bucket>/<hive_db_name>/<hive_table_name>/date_column=20210527/batch_id=epoch_ts
• Writing into a Non-Partitioned Table
• s3://<bucket>/<hive_db_name>/<hive_table_name>/batch_id=epoch_ts
We also add the latest write_timestamp to the Hive table metadata, to indicate when the last write
was done to the table.
Writing data into the Data Warehouse with versioning to distinguish old vs new data.
We add the epoch_timestamp of the write time to indicate the freshness of the data.
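As a rough illustration of the versioning scheme above, the batch_id is just the epoch timestamp of the write appended to the table's S3 prefix; the helper below is a sketch, not the actual library code.

import java.time.Instant

// Sketch: build the versioned S3 location for a write, mirroring the layout above.
def versionedPath(bucket: String,
                  database: String,
                  table: String,
                  partition: Option[(String, String)]): String = {
  val batchId = Instant.now.getEpochSecond              // epoch_ts used as the data version
  val base    = s"s3://$bucket/$database/$table"
  partition match {
    case Some((column, value)) => s"$base/$column=$value/batch_id=$batchId"   // partitioned table
    case None                  => s"$base/batch_id=$batchId"                  // non-partitioned table
  }
}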
Since we have a single path to
validate and write to the Data
Warehouse, what other
common functionality could
we add to provide more value
to our Data Scientists?
Spark Writer Modules
Config driven transformations while writing data to the Data Warehouse
Spark Writer Modules - Adding modules
Adding them as transformations in the writer library was
straightforward. In addition, we had to:
• Make each module configurable via spark properties
• Make each module behave the same for every write pipeline
• Make them configurable to either block writing data or not in
the event of failure
• Add documentation for each module to help steer Data
Scientists
How do we add additional functionality to the writing pipeline behind the scenes?
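A sketch of how a module could be gated and made blocking or non-blocking through Spark properties; the property names and the runModule wrapper are illustrative assumptions, not the actual internal API.

import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.{Failure, Success, Try}

// Hypothetical wrapper: run a module only if enabled, and decide whether a
// failure should block the write or just be reported.
def runModule(spark: SparkSession, df: DataFrame, moduleName: String)
             (transform: DataFrame => DataFrame): DataFrame = {
  val enabled  = spark.conf.get(s"spark.writer.$moduleName.enabled", "false").toBoolean
  val blocking = spark.conf.get(s"spark.writer.$moduleName.blocking", "true").toBoolean

  if (!enabled) df
  else Try(transform(df)) match {
    case Success(result)         => result
    case Failure(e) if !blocking =>
      println(s"[$moduleName] failed but is configured as non-blocking: ${e.getMessage}")
      df                                  // continue the write with the untransformed data
    case Failure(e)              => throw e   // blocking: fail the whole write
  }
}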
Spark Writer Modules - 3 Modules
• Journalizer
• Data Cleanser
• Data Quality Checker
The 3 modules we built
Journalizer
Journalizing - Data can change
Example: Data about a client has the potential to change and we need to capture it
Note: These are Slowly Changing Dimensions (Type 2) - where we preserve the old values.
Current on 2021-05-21:   client_id 10 | favorite_color blue   | dress_style formal
Current on 2021-05-22:   client_id 10 | favorite_color black  | dress_style formal
Current on 2021-05-23:   client_id 10 | favorite_color purple | dress_style formal
Current on 2021-07-23:   client_id 10 | favorite_color green  | dress_style formal
Journalizing - 2 ways of capturing historical information
2 types of Hive tables to store this information.

History Tables
▪ Record of all data - written daily and partitioned by date
▪ Contains all records - duplicated across partitions
▪ Difficult to find nuanced information or track changes in data by date, since all the data is included
▪ Harder to access the data because of the size of the table

Journal Tables
▪ Compressed, de-duped information
▪ Two partitions: is_current = 1 (latest data) & is_current = 0 (old data)
▪ Tracks changing values by timestamp, e.g. sets a start and end date for a value to show its duration of validity
▪ Sorted for easy access by primary key
History Table

client_id | favorite_color | dress_style | date (partition column)
10        | blue           | formal      | 2021-05-20
10        | blue           | formal      | 2021-05-21
10        | black          | formal      | 2021-05-21
10        | blue           | formal      | 2021-05-22
10        | black          | formal      | 2021-05-22
10        | purple         | formal      | 2021-05-22
...       | ...            | ...         | ...
10        | blue           | formal      | 2021-07-23
10        | black          | formal      | 2021-07-23
10        | purple         | formal      | 2021-07-23
10        | green          | formal      | 2021-07-23

Journal Table

client_id | favorite_color | start_date                       | end_date                      | is_current (partition column)
10        | blue           | 2021-01-01 (first time recorded) | 2021-05-20                    | 0
10        | black          | 2021-05-21                       | 2021-05-21                    | 0
10        | purple         | 2021-05-22                       | 2021-07-22                    | 0
10        | green          | 2021-07-23                       | 2999-01-01 (default end time) | 1

Note: Tracking changes to favorite_color across time.
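A brief sketch of why this layout is convenient to query; the table name styling.client_journal is made up for the example.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("journal-example").enableHiveSupport().getOrCreate()

// Latest values: only the is_current = 1 partition needs to be scanned.
val latest = spark.table("styling.client_journal")
  .where("is_current = 1")
  .select("client_id", "favorite_color")

// Full change history for one client, ordered by validity window.
val history = spark.table("styling.client_journal")
  .where("client_id = 10")
  .orderBy("start_date")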
Given the compressed nature of
Journal tables, we moved
historical data into them.
A Journal table is meant to be a
ledger of the change in values and
a pointer to the current values.
Let’s now look at how Journal
tables are created.
Journalizing - How do we create a journal table?
Some questions we asked ourselves:
1. How could we get easy access to latest information about a
particular key?
2. How can information be compressed and de-duplicated?
3. Can we determine - how long was the favorite_color set to
<value>?
4. But, how do we update the table each time to maintain this
ordering?
5. Where and when do we run this process of conversion?
What do we need to get to the table structure?
Compression / de-dupe of daily snapshots:

client_id | favorite_color | date
10        | blue           | 2021-05-20
10        | blue           | 2021-05-21
10        | blue           | 2021-05-22
10        | purple         | 2021-05-23

becomes

client_id | favorite_color | start_date                       | end_date
10        | blue           | 2021-01-01 (first time recorded) | 2021-05-22
10        | purple         | 2021-05-23                       | 2999-01-01 (default end time)

start_date is when the value became valid, end_date is when it stopped being valid, and the default end time marks the latest value without a specified end.
Before the change - only blue has been observed:

client_id | favorite_color | date
10        | blue           | 2021-05-20
10        | blue           | 2021-05-21

Current-pointer partition of the journal table:

client_id | favorite_color | start_date                       | end_date                      | is_current
10        | blue           | 2021-01-01 (first time recorded) | 2999-01-01 (default end time) | 1

In a history table, we don’t know the changed value since it’s not marked.

After the change - purple appears on 2021-05-22:

client_id | favorite_color | date
10        | blue           | 2021-05-20
10        | blue           | 2021-05-21
10        | blue           | 2021-05-22
10        | purple         | 2021-05-22

Current-pointer partition of the journal table:

client_id | favorite_color | start_date                       | end_date                      | is_current
10        | blue           | 2021-01-01 (first time recorded) | 2021-05-21                    | 0
10        | purple         | 2021-05-22                       | 2999-01-01 (default end time) | 1

purple is now marked as the current value, and blue is moved to the older partition.
Journalizing - Process of Journalizing
1. User creates a Journal table and sets a field to track using
metadata e.g. (client_id is set as primary key)
2. When data is written to this table, the table is reloaded in its
entirety and we perform
a. Deduplication and compression
b. Set the current values in partitions - if there are changes
c. Sort the table based on the date
3. Rewrite this new DataFrame into the table
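A rough sketch of steps 2a-2c for the running example, assuming the reloaded table is a set of (client_id, favorite_color, date) snapshots with client_id as the primary key and favorite_color as the tracked value; this illustrates the idea rather than the production implementation.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

def journalize(snapshots: DataFrame): DataFrame = {
  val byKey = Window.partitionBy("client_id").orderBy("date")

  snapshots
    // 2a. Compression / de-dupe: keep only the rows where the tracked value changed.
    .withColumn("prev_value", lag("favorite_color", 1).over(byKey))
    .where(col("prev_value").isNull || col("favorite_color") =!= col("prev_value"))
    // A value stays valid until the day before the next change.
    .withColumn("next_change", lead("date", 1).over(byKey))
    .withColumn("start_date", col("date"))
    .withColumn("end_date",
      when(col("next_change").isNull, lit("2999-01-01"))                        // default end time
        .otherwise(date_format(date_sub(col("next_change"), 1), "yyyy-MM-dd")))
    // 2b. The row with no later change is the current value.
    .withColumn("is_current", when(col("next_change").isNull, 1).otherwise(0))
    // 2c. Sort by primary key and date before rewriting the table.
    .select("client_id", "favorite_color", "start_date", "end_date", "is_current")
    .orderBy("client_id", "start_date")
}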
Journalizing - The workflow
Journalizing - Journal Table Pros & Cons

Pros
▪ De-duped data
▪ Two partitions for easy querying - is_current = 1 (latest data) & is_current = 0 (old data); a data pipeline needs to access only one partition for all the latest values
▪ Compressed, with timestamps indicating each field value’s lifespan to track changes
▪ Sorted for easy access by primary key

Cons
▪ Complicated process with multiple steps prior to writing
▪ Rewriting the table is a must to maintain the rules of compression and deduplication
Data Cleanser
Data Cleanser - What and why?
Data can be old, un-referenced, or meant to be excluded.
• How do we make sure some record values don’t continue to
persist in a table?
• How do we delete records or nullify them consistently
throughout the warehouse?
• Can this be configured by the Data Scientists to apply to their
table?
Can we cleanse data based on a configuration?
Data Cleanser - What does cleansing mean?
Let’s say we wish to nullify or delete some column values in a table.

Original:

id | column_a | column_b         | color | style
9  | value_a  | “string_field_1” | blue  | formal
10 | value_a1 | “string_field_2” | red   | casual
11 | value_a2 | “string_field_3” | white | formal

Nullified:

id | column_a | column_b | color | style
9  | null     | null     | blue  | formal
10 | null     | null     | red   | casual
11 | null     | null     | white | formal

OR deleted:

id | column_a | column_b | color | style
9  | <empty>  | <empty>  | blue  | formal
10 | <empty>  | <empty>  | red   | casual
11 | <empty>  | <empty>  | white | formal
Data Cleanser - Criteria
1. Has to be configurable
2. Users should be able to specify the key to be monitored and
columns for cleansing
3. At least two treatments should be available:
a. nullify
b. delete
4. Should happen to data at write and/or at rest
What does the cleanser have to do?
Data Cleanser - How?
• How?
• Perform cleansing at write time to ensure all future records are cleansed, even if the source data still includes the values.
• Separately, cleanse the entire Hive table at rest when the data is no longer used, to make sure older partitions don’t retain the un-referenced data.
• What do we need?
• A mechanism to configure what to cleanse - nullify/delete per table
• This mechanism needs to be accessible at write time and at rest to run the cleansing on the data.
How do we cleanse data?
Data Cleanser - Implementation

Table Configuration
▪ We have a metadata infrastructure that allows users to add metadata to the tables they own.
▪ Hive tables have metadata fields that can be used to store auxiliary information about them.
▪ The cleanser can simply access the table’s metadata and perform cleansing accordingly.

Cleansing
▪ Each table can have a configuration naming the columns to be cleansed, e.g. [column_a, column_b], along with the treatment.
▪ Reacting to the specified metadata means the cleanser module works as configured at all times.
▪ The same module can perform cleansing for data while writing and/or at rest.
Data Cleanser - The workflow
1. User specifies the metadata configuration for cleansing on a Hive table:
metadata = {"key": "id",
            "treatment": "nullify",
            "columns": ["column_a", "column_b"]}
2. The cleanser reads the table and checks all the columns that match.
3. It performs the nullify/delete on the DataFrame and proceeds to the next transformation, or writes the cleansed DataFrame to the Data Warehouse.
How does it come together?
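A minimal sketch of how the two treatments could be applied to a DataFrame once the metadata has been read; the CleanseConfig class, the keysToCleanse argument, and the exact meaning of "delete" (dropping whole rows here) are assumptions for illustration.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}

// Mirrors the metadata fields shown above (key, treatment, columns).
case class CleanseConfig(key: String, treatment: String, columns: Seq[String])

def cleanse(df: DataFrame, config: CleanseConfig, keysToCleanse: Seq[Any]): DataFrame = {
  // Only rows whose monitored key is flagged for cleansing are affected.
  val flagged = col(config.key).isin(keysToCleanse: _*)

  config.treatment match {
    case "nullify" =>
      // Blank out the configured columns on flagged rows; keep everything else.
      config.columns.foldLeft(df) { (acc, c) =>
        acc.withColumn(c, when(flagged, lit(null)).otherwise(col(c)))
      }
    case "delete" =>
      // Drop the flagged rows entirely.
      df.where(!flagged)
    case other =>
      throw new IllegalArgumentException(s"Unknown cleansing treatment: $other")
  }
}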
Data Cleanser - The workflow
Data Quality Checker
Data Quality - Background
• How do we detect errors or skews in data?
• When do we check for data problems?
• How do Data Scientists set up Data Quality checks?
What motivated the data quality initiative?
Data Quality - What do we need to check data?
• Service to initialize tests and run tests on Hive tables.
• Mechanism that calculates metrics based on the configured
tests on the data prior to writing it to the warehouse
• Interface that allows users to autonomously setup Data quality
and run tests on their pipelines.
What components were needed for running data quality checks?
Data Quality - What would a Test look like?
• NullCount(column_name)
• Is the null count on this column higher than “value”?
• Average(column_name)
• Is the average below what is expected?
• Max(column_name)
• Is the max value for this column exceeding a certain limit?
• RowCount(table)
• Are we suddenly writing more rows than anticipated?
Some examples of tests that we started off with.
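As a sketch, metrics like these are plain Spark aggregations over the DataFrame that is about to be written; the metric names, the single-column signature, and the simple threshold test below are placeholders for the service-driven configuration, not the actual implementation.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Compute a few example metrics for one (numeric) column in a single pass over the data.
def computeMetrics(df: DataFrame, column: String): Map[String, Double] = {
  val row = df.agg(
    count(when(col(column).isNull, 1)).alias("null_count"),
    avg(col(column)).alias("average"),
    max(col(column)).cast("double").alias("max"),
    count(lit(1)).alias("row_count")
  ).head()

  Map(
    "null_count" -> row.getAs[Long]("null_count").toDouble,
    "average"    -> row.getAs[Double]("average"),
    "max"        -> row.getAs[Double]("max"),
    "row_count"  -> row.getAs[Long]("row_count").toDouble
  )
}

// A test is then just a predicate over a metric, e.g. NullCount(column) <= threshold.
def nullCountWithin(metrics: Map[String, Double], threshold: Double): Boolean =
  metrics("null_count") <= threshold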
Data Quality - How we built it
• Built a service that was equipped to:
• Enable CRUD operations on tests for Hive tables
• Run tests on metrics when triggered
• At the same time, we built the ability to calculate metrics into a module in the Spark writer library.
• This module interacted with the data quality service to find the metrics that needed to be calculated.
• It ran these calculations in Spark on the input DataFrame - e.g. average(column_name)
• It triggered tests on these metrics and posted the results to the user.
Putting the components together
Data Quality - Surfacing Data Quality to users
1. The data quality service had a python client that helped users
run CRUD operations on tests
2. The writer module could be configured to run on a write
operation for a table.
a. Setting spark.enable.data.quality.checks=true in Spark properties helped run
these tests at write time.
3. Separately, we created an offline mode to run tests on already
written data, if the user doesn’t wish to block writes to the table.
What did the interface look like?
Spark Writer Modules - Transformations in code
def writeDataFrame(inputDataframe: DataFrame,
                   databaseName: String,
                   tableName: String): Unit = {
  // Validation
  val validatedDataframe = sfWriter.validateDataframe(inputDataframe, databaseName, tableName)
  // Journalizing
  val journalizedDataframe = sfWriter.journalizeDataframe(validatedDataframe, databaseName, tableName)
  // Data Cleanser
  val cleansedDataframe = sfWriter.dataCleanser(journalizedDataframe, databaseName, tableName)
  // Data Quality Checker
  sfWriter.dataQualityChecker(cleansedDataframe, databaseName, tableName)
  // Write to the Data Warehouse + update the Hive Metastore
  sfWriter.writeToS3(cleansedDataframe, databaseName, tableName)
}
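A hypothetical call site for the entry point above; the table names are made up, and enabling the data quality property is the configuration step described on the previous slide.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("client-etl").enableHiveSupport().getOrCreate()
spark.conf.set("spark.enable.data.quality.checks", "true")   // run DQ tests at write time

// Push a DataFrame through the single write path: validate, journalize,
// cleanse, check quality, then write to S3 and update the Hive Metastore.
val clientSnapshot = spark.table("staging.client_snapshot")
writeDataFrame(clientSnapshot, databaseName = "warehouse", tableName = "client_journal")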
Learnings & Future Work
What we learned and where we are headed
Learnings & Future Work - Lessons learnt
• Adding new modules meant more complexity to the write pipeline, but
each step was doing a valuable transformation
• Making each transformation performant and efficient was a top priority
when each module was being created.
• Testing - unit & integration was key in rolling out without mishaps
• Introducing these modules to Data Scientists meant we needed better
communication and more documentation
• Getting data quality checks to run efficiently was a challenge, since we
had to programmatically calculate the partitions of the DataFrame and
run tests against each potential Hive partition. This took some effort to
run smoothly.
By adding modularized transformations to data, what changed and how did we adapt?
Learnings & Future Work - Future Work
Now, additional modules can easily be added in a similar fashion
• Data Quality is being enhanced with support for customized testing rather than simple thresholds or values.
• The goal is to have Data quality ingrained in the ETL process of our
Data Science workflows.
• Journalizer and data cleansing are mostly static but we are exploring
alternate solutions to help augment and delete records more
efficiently.
By adding modularized transformations to data, what changed and how did we adapt?
Summary
TL;DR:
Summary
Writing data with Spark @ Stitch Fix:
• We have a singular write path to input data into the warehouse driven
by Spark
• 3 modules that perform transformations are config driven and
available at the time of write.
• Journalizing: writing a non-duplicated historical record of data for quick access and compression.
• Data Cleanser: Delete or nullify values based on table configuration.
• Data Quality: Enabling the calculation of metrics and running tests on incoming
data into the warehouse.
Thank you.
Questions?

More Related Content

What's hot

Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
StreamNative
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
Laurent Leturgez
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
Julian Hyde
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
Adam Doyle
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
Databricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Patrick Wendell
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Anant Corporation
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
 
Data Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its EcosystemData Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its Ecosystem
Databricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
Databricks
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
Databricks
 

What's hot (20)

Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Data Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its EcosystemData Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its Ecosystem
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
 

Similar to Modularized ETL Writing with Apache Spark

Optimizing Data Accessin Sq Lserver2005
Optimizing Data Accessin Sq Lserver2005Optimizing Data Accessin Sq Lserver2005
Optimizing Data Accessin Sq Lserver2005
rainynovember12
 
Designing Database Solutions for Microsoft SQL Server 2012 2012 Microsoft 70-...
Designing Database Solutions for Microsoft SQL Server 2012 2012 Microsoft 70-...Designing Database Solutions for Microsoft SQL Server 2012 2012 Microsoft 70-...
Designing Database Solutions for Microsoft SQL Server 2012 2012 Microsoft 70-...
ChristopherBow2
 
Move a successful onpremise oltp application to the cloud
Move a successful onpremise oltp application to the cloudMove a successful onpremise oltp application to the cloud
Move a successful onpremise oltp application to the cloud
Ike Ellis
 
Evolutionary database design
Evolutionary database designEvolutionary database design
Evolutionary database design
Salehein Syed
 
SQL Server 2008 Development for Programmers
SQL Server 2008 Development for ProgrammersSQL Server 2008 Development for Programmers
SQL Server 2008 Development for Programmers
Adam Hutson
 
DSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BI
DSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BI
DSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BI
EzekielJames8
 
Sql Portfolio
Sql PortfolioSql Portfolio
Sql Portfolio
Shelli Ciaschini
 
Tech-Spark: Scaling Databases
Tech-Spark: Scaling DatabasesTech-Spark: Scaling Databases
Tech-Spark: Scaling Databases
Ralph Attard
 
Performance Tuning And Optimization Microsoft SQL Database
Performance Tuning And Optimization Microsoft SQL DatabasePerformance Tuning And Optimization Microsoft SQL Database
Performance Tuning And Optimization Microsoft SQL Database
Tung Nguyen Thanh
 
ETL
ETL ETL
ETL
butest
 
Sql Server 2005 Business Inteligence
Sql Server 2005 Business InteligenceSql Server 2005 Business Inteligence
Sql Server 2005 Business Inteligence
abercius24
 
The strength of a spatial database
The strength of a spatial databaseThe strength of a spatial database
The strength of a spatial database
Peter Horsbøll Møller
 
You Can Do It in SQL
You Can Do It in SQLYou Can Do It in SQL
You Can Do It in SQL
Databricks
 
123448572 all-in-one-informatica
123448572 all-in-one-informatica123448572 all-in-one-informatica
123448572 all-in-one-informatica
homeworkping9
 
Sql 2016 - What's New
Sql 2016 - What's NewSql 2016 - What's New
Sql 2016 - What's New
dpcobb
 
MMYERS Portfolio
MMYERS PortfolioMMYERS Portfolio
MMYERS Portfolio
Mike Myers
 
Db2 migration -_tips,_tricks,_and_pitfalls
Db2 migration -_tips,_tricks,_and_pitfallsDb2 migration -_tips,_tricks,_and_pitfalls
Db2 migration -_tips,_tricks,_and_pitfalls
sam2sung2
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
Sap abap
Sap abapSap abap
Sap abap
nrj10
 
Tuning and Optimizing U-SQL Queries (SQLPASS 2016)
Tuning and Optimizing U-SQL Queries (SQLPASS 2016)Tuning and Optimizing U-SQL Queries (SQLPASS 2016)
Tuning and Optimizing U-SQL Queries (SQLPASS 2016)
Michael Rys
 

Similar to Modularized ETL Writing with Apache Spark (20)

Optimizing Data Accessin Sq Lserver2005
Optimizing Data Accessin Sq Lserver2005Optimizing Data Accessin Sq Lserver2005
Optimizing Data Accessin Sq Lserver2005
 
Designing Database Solutions for Microsoft SQL Server 2012 2012 Microsoft 70-...
Designing Database Solutions for Microsoft SQL Server 2012 2012 Microsoft 70-...Designing Database Solutions for Microsoft SQL Server 2012 2012 Microsoft 70-...
Designing Database Solutions for Microsoft SQL Server 2012 2012 Microsoft 70-...
 
Move a successful onpremise oltp application to the cloud
Move a successful onpremise oltp application to the cloudMove a successful onpremise oltp application to the cloud
Move a successful onpremise oltp application to the cloud
 
Evolutionary database design
Evolutionary database designEvolutionary database design
Evolutionary database design
 
SQL Server 2008 Development for Programmers
SQL Server 2008 Development for ProgrammersSQL Server 2008 Development for Programmers
SQL Server 2008 Development for Programmers
 
DSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BI
DSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BI
DSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BIDSN_Power BI
 
Sql Portfolio
Sql PortfolioSql Portfolio
Sql Portfolio
 
Tech-Spark: Scaling Databases
Tech-Spark: Scaling DatabasesTech-Spark: Scaling Databases
Tech-Spark: Scaling Databases
 
Performance Tuning And Optimization Microsoft SQL Database
Performance Tuning And Optimization Microsoft SQL DatabasePerformance Tuning And Optimization Microsoft SQL Database
Performance Tuning And Optimization Microsoft SQL Database
 
ETL
ETL ETL
ETL
 
Sql Server 2005 Business Inteligence
Sql Server 2005 Business InteligenceSql Server 2005 Business Inteligence
Sql Server 2005 Business Inteligence
 
The strength of a spatial database
The strength of a spatial databaseThe strength of a spatial database
The strength of a spatial database
 
You Can Do It in SQL
You Can Do It in SQLYou Can Do It in SQL
You Can Do It in SQL
 
123448572 all-in-one-informatica
123448572 all-in-one-informatica123448572 all-in-one-informatica
123448572 all-in-one-informatica
 
Sql 2016 - What's New
Sql 2016 - What's NewSql 2016 - What's New
Sql 2016 - What's New
 
MMYERS Portfolio
MMYERS PortfolioMMYERS Portfolio
MMYERS Portfolio
 
Db2 migration -_tips,_tricks,_and_pitfalls
Db2 migration -_tips,_tricks,_and_pitfallsDb2 migration -_tips,_tricks,_and_pitfalls
Db2 migration -_tips,_tricks,_and_pitfalls
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Sap abap
Sap abapSap abap
Sap abap
 
Tuning and Optimizing U-SQL Queries (SQLPASS 2016)
Tuning and Optimizing U-SQL Queries (SQLPASS 2016)Tuning and Optimizing U-SQL Queries (SQLPASS 2016)
Tuning and Optimizing U-SQL Queries (SQLPASS 2016)
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
 

Recently uploaded

Classifying Shooting Incident Fatality in New York project presentation
Classifying Shooting Incident Fatality in New York project presentationClassifying Shooting Incident Fatality in New York project presentation
Classifying Shooting Incident Fatality in New York project presentation
Boston Institute of Analytics
 
Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...
Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...
Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...
mona lisa $A12
 
🔥Night Call Girls Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servi...
🔥Night Call Girls Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servi...🔥Night Call Girls Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servi...
🔥Night Call Girls Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servi...
yuvishachadda
 
satta matka Dpboss guessing Kalyan matka Today Kalyan Panel Chart Kalyan Jodi...
satta matka Dpboss guessing Kalyan matka Today Kalyan Panel Chart Kalyan Jodi...satta matka Dpboss guessing Kalyan matka Today Kalyan Panel Chart Kalyan Jodi...
satta matka Dpboss guessing Kalyan matka Today Kalyan Panel Chart Kalyan Jodi...
#kalyanmatkaresult #dpboss #kalyanmatka #satta #matka #sattamatka
 
Product Cluster Analysis: Unveiling Hidden Customer Preferences
Product Cluster Analysis: Unveiling Hidden Customer PreferencesProduct Cluster Analysis: Unveiling Hidden Customer Preferences
Product Cluster Analysis: Unveiling Hidden Customer Preferences
Boston Institute of Analytics
 
Ahmedabad Call Girls 7339748667 With Free Home Delivery At Your Door
Ahmedabad Call Girls 7339748667 With Free Home Delivery At Your DoorAhmedabad Call Girls 7339748667 With Free Home Delivery At Your Door
Ahmedabad Call Girls 7339748667 With Free Home Delivery At Your Door
Russian Escorts in Delhi 9711199171 with low rate Book online
 
Mumbai Call Girls service 9920874524 Call Girl service in Mumbai Mumbai Call ...
Mumbai Call Girls service 9920874524 Call Girl service in Mumbai Mumbai Call ...Mumbai Call Girls service 9920874524 Call Girl service in Mumbai Mumbai Call ...
Mumbai Call Girls service 9920874524 Call Girl service in Mumbai Mumbai Call ...
hanshkumar9870
 
202406 - Cape Town Snowflake User Group - LLM & RAG.pdf
202406 - Cape Town Snowflake User Group - LLM & RAG.pdf202406 - Cape Town Snowflake User Group - LLM & RAG.pdf
202406 - Cape Town Snowflake User Group - LLM & RAG.pdf
Douglas Day
 
CAP Excel Formulas & Functions July - Copy (4).pdf
CAP Excel Formulas & Functions July - Copy (4).pdfCAP Excel Formulas & Functions July - Copy (4).pdf
CAP Excel Formulas & Functions July - Copy (4).pdf
frp60658
 
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
Call Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call GirlCall Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call Girl
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
sapna sharmap11
 
🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...
🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...
🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...
rukmnaikaseen
 
Bangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts ServiceBangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts Service
nhero3888
 
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENTHigh Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
ranjeet3341
 
Direct Lake Deep Dive slides from Fabric Engineering Roadshow
Direct Lake Deep Dive slides from Fabric Engineering RoadshowDirect Lake Deep Dive slides from Fabric Engineering Roadshow
Direct Lake Deep Dive slides from Fabric Engineering Roadshow
Gabi Münster
 
machine learning notes by Andrew Ng and Tengyu Ma
machine learning notes by Andrew Ng and Tengyu Mamachine learning notes by Andrew Ng and Tengyu Ma
machine learning notes by Andrew Ng and Tengyu Ma
Vijayabaskar Uthirapathy
 
A review of I_O behavior on Oracle database in ASM
A review of I_O behavior on Oracle database in ASMA review of I_O behavior on Oracle database in ASM
A review of I_O behavior on Oracle database in ASM
Alireza Kamrani
 
🔥Call Girl Price Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servic...
🔥Call Girl Price Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servic...🔥Call Girl Price Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servic...
🔥Call Girl Price Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servic...
Ak47
 
Hyderabad Call Girls Service 🔥 9352988975 🔥 High Profile Call Girls Hyderabad
Hyderabad Call Girls Service 🔥 9352988975 🔥 High Profile Call Girls HyderabadHyderabad Call Girls Service 🔥 9352988975 🔥 High Profile Call Girls Hyderabad
Hyderabad Call Girls Service 🔥 9352988975 🔥 High Profile Call Girls Hyderabad
binna singh$A17
 
Mumbai Central Call Girls ☑ +91-9833325238 ☑ Available Hot Girls Aunty Book Now
Mumbai Central Call Girls ☑ +91-9833325238 ☑ Available Hot Girls Aunty Book NowMumbai Central Call Girls ☑ +91-9833325238 ☑ Available Hot Girls Aunty Book Now
Mumbai Central Call Girls ☑ +91-9833325238 ☑ Available Hot Girls Aunty Book Now
radhika ansal $A12
 
Call Girls Lucknow 8923113531 Independent Call Girl Service in Lucknow
Call Girls Lucknow 8923113531 Independent Call Girl Service in LucknowCall Girls Lucknow 8923113531 Independent Call Girl Service in Lucknow
Call Girls Lucknow 8923113531 Independent Call Girl Service in Lucknow
hiju9823
 

Recently uploaded (20)

Classifying Shooting Incident Fatality in New York project presentation
Classifying Shooting Incident Fatality in New York project presentationClassifying Shooting Incident Fatality in New York project presentation
Classifying Shooting Incident Fatality in New York project presentation
 
Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...
Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...
Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...
 
🔥Night Call Girls Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servi...
🔥Night Call Girls Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servi...🔥Night Call Girls Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servi...
🔥Night Call Girls Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servi...
 
satta matka Dpboss guessing Kalyan matka Today Kalyan Panel Chart Kalyan Jodi...
satta matka Dpboss guessing Kalyan matka Today Kalyan Panel Chart Kalyan Jodi...satta matka Dpboss guessing Kalyan matka Today Kalyan Panel Chart Kalyan Jodi...
satta matka Dpboss guessing Kalyan matka Today Kalyan Panel Chart Kalyan Jodi...
 
Product Cluster Analysis: Unveiling Hidden Customer Preferences
Product Cluster Analysis: Unveiling Hidden Customer PreferencesProduct Cluster Analysis: Unveiling Hidden Customer Preferences
Product Cluster Analysis: Unveiling Hidden Customer Preferences
 
Ahmedabad Call Girls 7339748667 With Free Home Delivery At Your Door
Ahmedabad Call Girls 7339748667 With Free Home Delivery At Your DoorAhmedabad Call Girls 7339748667 With Free Home Delivery At Your Door
Ahmedabad Call Girls 7339748667 With Free Home Delivery At Your Door
 
Mumbai Call Girls service 9920874524 Call Girl service in Mumbai Mumbai Call ...
Mumbai Call Girls service 9920874524 Call Girl service in Mumbai Mumbai Call ...Mumbai Call Girls service 9920874524 Call Girl service in Mumbai Mumbai Call ...
Mumbai Call Girls service 9920874524 Call Girl service in Mumbai Mumbai Call ...
 
202406 - Cape Town Snowflake User Group - LLM & RAG.pdf
202406 - Cape Town Snowflake User Group - LLM & RAG.pdf202406 - Cape Town Snowflake User Group - LLM & RAG.pdf
202406 - Cape Town Snowflake User Group - LLM & RAG.pdf
 
CAP Excel Formulas & Functions July - Copy (4).pdf
CAP Excel Formulas & Functions July - Copy (4).pdfCAP Excel Formulas & Functions July - Copy (4).pdf
CAP Excel Formulas & Functions July - Copy (4).pdf
 
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
Call Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call GirlCall Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call Girl
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
 
🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...
🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...
🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...
 
Bangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts ServiceBangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts Service
 
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENTHigh Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
 
Direct Lake Deep Dive slides from Fabric Engineering Roadshow
Direct Lake Deep Dive slides from Fabric Engineering RoadshowDirect Lake Deep Dive slides from Fabric Engineering Roadshow
Direct Lake Deep Dive slides from Fabric Engineering Roadshow
 
machine learning notes by Andrew Ng and Tengyu Ma
machine learning notes by Andrew Ng and Tengyu Mamachine learning notes by Andrew Ng and Tengyu Ma
machine learning notes by Andrew Ng and Tengyu Ma
 
A review of I_O behavior on Oracle database in ASM
A review of I_O behavior on Oracle database in ASMA review of I_O behavior on Oracle database in ASM
A review of I_O behavior on Oracle database in ASM
 
🔥Call Girl Price Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servic...
🔥Call Girl Price Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servic...🔥Call Girl Price Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servic...
🔥Call Girl Price Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servic...
 
Hyderabad Call Girls Service 🔥 9352988975 🔥 High Profile Call Girls Hyderabad
Hyderabad Call Girls Service 🔥 9352988975 🔥 High Profile Call Girls HyderabadHyderabad Call Girls Service 🔥 9352988975 🔥 High Profile Call Girls Hyderabad
Hyderabad Call Girls Service 🔥 9352988975 🔥 High Profile Call Girls Hyderabad
 
Mumbai Central Call Girls ☑ +91-9833325238 ☑ Available Hot Girls Aunty Book Now
Mumbai Central Call Girls ☑ +91-9833325238 ☑ Available Hot Girls Aunty Book NowMumbai Central Call Girls ☑ +91-9833325238 ☑ Available Hot Girls Aunty Book Now
Mumbai Central Call Girls ☑ +91-9833325238 ☑ Available Hot Girls Aunty Book Now
 
Call Girls Lucknow 8923113531 Independent Call Girl Service in Lucknow
Call Girls Lucknow 8923113531 Independent Call Girl Service in LucknowCall Girls Lucknow 8923113531 Independent Call Girl Service in Lucknow
Call Girls Lucknow 8923113531 Independent Call Girl Service in Lucknow
 

Modularized ETL Writing with Apache Spark

  • 1. Modularized ETL Writing with Spark Neelesh Salian Software Engineer - Stitch Fix May 27, 2021
  • 3. Agenda ▪ What is Stitch Fix? ▪ Apache Spark @ Stitch Fix ▪ Spark Writer Modules ▪ Learnings & Future Work
  • 4. What is Stitch Fix? What does the company do?
  • 5. Stitch Fix is a personalized styling service Shop at your personal curated store. Check out what you like.
  • 6. Data Science is behind everything we do algorithms-tour.stitchfix.com • Algorithms org • 145+ Data Scientists and Platform engineers • 3 main verticals + platform
  • 7. Apache Spark @ Stitch Fix How we use Spark in our teams?
  • 8. Spark @ Stitch Fix - History and Current State ▪ Spark was introduced to enhance and scale ETL capabilities (circa 2016) ▪ Starting version: 1.2.x ▪ Spark SQL was the dominant use case ▪ Used for reading and writing data into the warehouse as Hive Tables. ▪ Current Version: 2.4.x, 3.1.x [ prototyping] ▪ For all ETL reads and writes, production and test ▪ Spark serves regular pyspark,sql and scala jobs, notebooks & pandas-based readers - writers ▪ Controls all writing with more functionality [this talk] How it’s going How it started
  • 9. Spark @ Stitch Fix - Spark Tooling • Spark Sql + Pyspark + Scala • Containerized Spark driver + AWS EMR (for compute) • Used for production and staging ETL by Data Scientists • Notebooks • Jupyterhub setup with Stitch Fix libraries and python packages pre-installed. • Used by Data Scientists to test and prototype • Pandas-based Readers - Writers • Reads and writes data using pandas dataframes • No bootstrap time for Spark jobs - uses Apache Livy for execution • Used for test + production All the tooling available to Data Scientists to use Spark to read and write data
  • 10. Spark @ Stitch Fix - Writing data to the warehouse
  • 11. Spark @ Stitch Fix - Steps while writing data At the start, and even today, writing data through the writer library has these steps. 1. Validation - check dataframe for type matches, schema matches to the Hive table, overflow type checks. 2. Writing the data into files in S3 - parquet or text format based on the Hive table’s configuration 3. Update the Hive Metastore - with versioning scheme for data.
  • 12.
  • 13. Spark @ Stitch Fix - Data Versioning • Writing into a Partitioned Table (e.g partitioned by a date_column for a date value of 20210527) • s3:<bucket>/<hive_db_name>/<hive_table_name>/date_column=20210527/batch_id=epoch_ts • Writing into a Non-Partitioned Table • s3:<bucket>/<hive_db_name>/<hive_table_name>/batch_id=epoch_ts We also add the latest write_timestamp to the Hive table metadata, to indicate when the last write was done to the table. Writing data into the Data Warehouse with versioning to distinguish old vs new data. We add the epoch_timestamp of the write time to indicate the freshness of the data.
  • 14. Since we have a single path to validate and write to the Data Warehouse, what other common functionality could we add to provide more value to our Data Scientists?
  • 15. Spark Writer Modules Config driven transformations while writing data to the Data Warehouse
  • 16. Spark Writer Modules - Adding modules Adding them as transformations in the writer library was straightforward. In addition, we had to: • Make each module configurable via spark properties • Make each module behave the same for every write pipeline • Make them configurable to either block writing data or not in the event of failure • Add documentation for each module to help steer Data Scientists How do we add additional functionality to the writing pipeline behind the scenes?
  • 17. Spark Writer Modules - 3 Modules • Journalizer • Data Cleanser • Data Quality Checker The 3 modules we built
  • 18.
  • 20.
  • 21. Journalizing - Data can change Example: Data about a client has the potential to change and we need to capture it Note: These are Slowly Changing Dimensions (Type 2) - where we preserve the old values. client_id favorite_color dress_style 10 blue formal Current on Date: 2021-05-21 client_id favorite_color dress_style 10 black formal Current on Date: 2021-05-22 client_id favorite_color dress_style 10 green formal Current on Date: 2021-07-23 client_id favorite_color dress_style 10 purple formal Current on Date: 2021-05-23
  • 22. Journalizing - 2 ways of capturing historical information ▪ Record of all data - written daily and partitioned by date ▪ Contains all records - duplicated across partitions ▪ Difficult to find nuanced information or track changes in data by date since all the data is included. ▪ Harder to access the data because of the size of the table ▪ Compressed, de-duped information ▪ Two partitions: is_current = 1 (latest data) & is_current = 0 (old data) ▪ Tracks changing values by timestamp. e.g sets start and end date to a value to show duration of validity ▪ Sorted for easy access by primary key Journal Tables History Tables 2 types of Hive Tables to store this information.
  • 23. client_id favorite_color dress_style date (partition_column) 10 blue formal 2021-05-20 10 blue formal 2021-05-21 10 black formal 2021-05-21 10 blue formal 2021-05-22 10 black formal 2021-05-22 10 purple formal 2021-05-22 ….. ….. ….. ……. 10 blue formal 2021-07-23 10 black formal 2021-07-23 10 purple formal 2021-07-23 10 green formal 2021-07-23 History Table Journal Table client_id favorite_color start_date end_date is_current (partition column) 10 blue 2021-01-01 (first time recorded) 2021-05-20 0 10 black 2021-05-21 2021-05-21 0 10 purple 2021-05-22 2021-07-22 0 10 green 2021-07-23 2999-01-01 (default end time) 1 Note: Tracking changes to favorite_color across time
  • 24. Given the compressed nature of Journal tables, we moved historical data into them. A Journal table is meant to be a ledger of the change in values and a pointer to the current values. Let’s now look at how Journal tables are created.
  • 25. Journalizing - How do we create a journal table?
    What do we need to get to this table structure? Some questions we asked ourselves:
    1. How can we get easy access to the latest information about a particular key?
    2. How can information be compressed and de-duplicated?
    3. Can we determine how long favorite_color was set to a given <value>?
    4. How do we update the table each time to maintain this ordering?
    5. Where and when do we run this process of conversion?
  • 26. Compression / de-dupe
    Input (daily snapshots):
    | client_id | favorite_color | date       |
    | 10        | blue           | 2021-05-20 |
    | 10        | blue           | 2021-05-21 |
    | 10        | blue           | 2021-05-22 |
    | 10        | purple         | 2021-05-23 |
    After compression / de-dupe:
    | client_id | favorite_color | start_date                       | end_date                      |
    | 10        | blue           | 2021-01-01 (first time recorded) | 2021-05-22                    |
    | 10        | purple         | 2021-05-23                       | 2999-01-01 (default end time) |
    start_date: when the value became valid. end_date: when the value stopped being valid; the default end time symbolizes the latest value, which has no specified end.
  • 27. The Current Pointer Partition
    History table:
    | client_id | favorite_color | date       |
    | 10        | blue           | 2021-05-20 |
    | 10        | blue           | 2021-05-21 |
    Journal table (current pointer partition):
    | client_id | favorite_color | start_date                       | end_date                      | is_current |
    | 10        | blue           | 2021-01-01 (first time recorded) | 2999-01-01 (default end time) | 1          |
    In a history table, we don't know the changed value, since it's not marked.
    History table after a new value arrives on 2021-05-22:
    | client_id | favorite_color | date       |
    | 10        | blue           | 2021-05-20 |
    | 10        | blue           | 2021-05-21 |
    | 10        | blue           | 2021-05-22 |
    | 10        | purple         | 2021-05-22 |
    Journal table: purple is now marked as the current value, and blue is moved to the older partition.
    | client_id | favorite_color | start_date                       | end_date                      | is_current |
    | 10        | blue           | 2021-01-01 (first time recorded) | 2021-05-21                    | 0          |
    | 10        | purple         | 2021-05-22                       | 2999-01-01 (default end time) | 1          |
  • 28. Journalizing - Process of Journalizing
    1. The user creates a Journal table and sets a field to track using metadata, e.g. client_id is set as the primary key.
    2. When data is written to this table, the table is reloaded in its entirety and we perform:
       a. Deduplication and compression
       b. Setting the current values in partitions, if there are changes
       c. Sorting the table based on the date
    3. The resulting DataFrame is rewritten into the table.
    A rough sketch of the compression step follows below.
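    To make the compression and current-pointer steps concrete, here is a minimal sketch in Spark/Scala. It is not the actual writer implementation: the column names (client_id, favorite_color, date) are taken from the example slides, the first-recorded start date is simplified to the first snapshot date, and the metadata-driven primary key lookup is omitted.

        import org.apache.spark.sql.DataFrame
        import org.apache.spark.sql.expressions.Window
        import org.apache.spark.sql.functions._

        // Sketch: collapse daily snapshots (client_id, favorite_color, date) into
        // journal rows (client_id, favorite_color, start_date, end_date, is_current).
        def journalize(snapshots: DataFrame): DataFrame = {
          val byKey = Window.partitionBy("client_id").orderBy("date")

          // 1. Deduplication / compression: keep only the rows where the tracked
          //    value changed compared to the previous snapshot.
          val changesOnly = snapshots
            .withColumn("prev_value", lag(col("favorite_color"), 1).over(byKey))
            .filter(col("prev_value").isNull || col("prev_value") =!= col("favorite_color"))
            .drop("prev_value")

          // 2. Turn each change into a validity range: valid from its own date until
          //    the day before the next change, or the default open end date otherwise.
          val ranged = changesOnly
            .withColumn("next_start", lead(col("date"), 1).over(byKey))
            .withColumn("start_date", col("date"))
            .withColumn("end_date",
              when(col("next_start").isNull, lit("2999-01-01"))
                .otherwise(date_format(date_sub(to_date(col("next_start")), 1), "yyyy-MM-dd")))

          // 3. Current-pointer partition: the open-ended row holds the current value.
          ranged
            .withColumn("is_current", when(col("next_start").isNull, 1).otherwise(0))
            .select("client_id", "favorite_color", "start_date", "end_date", "is_current")
            .orderBy("client_id", "start_date")
        }

    As described on the slide, the real process also reloads the existing journal table and merges it with the incoming data before rewriting; that step is left out of this sketch.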
  • 29. Journalizing - The workflow
  • 30. Journalizing - Journal Table Pros & Cons
    Pros:
    ▪ De-duped data
    ▪ Two partitions for easy querying: is_current = 1 (latest data) & is_current = 0 (old data). A data pipeline needs to access only one partition for all the latest values.
    ▪ Compressed, with timestamps indicating each field value's lifespan, to track changes
    ▪ Sorted for easy access by primary key
    Cons:
    ▪ Complicated process with multiple steps prior to writing
    ▪ Rewriting the table is a must to maintain the rules of compression and deduplication
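    As an illustration of the querying benefit, the two reads below assume a hypothetical journal table named user_db.client_journal with the layout from the earlier slides; only the is_current partitioning scheme is taken from the talk.

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

        // Latest values: a single, small is_current = 1 partition read.
        val currentValues = spark.sql(
          "SELECT client_id, favorite_color, start_date FROM user_db.client_journal WHERE is_current = 1")

        // Full, de-duplicated change history for one client, already expressed as
        // validity ranges rather than repeated daily rows.
        val changeHistory = spark.sql(
          """SELECT favorite_color, start_date, end_date
            |FROM user_db.client_journal
            |WHERE client_id = 10
            |ORDER BY start_date""".stripMargin)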
  • 33. Data Cleanser - What and why?
    Data can be old, un-referenced, or meant to be excluded. Can we cleanse data based on a configuration?
    • How do we make sure some record values don't continue to persist in a table?
    • How do we delete records or nullify them consistently throughout the warehouse?
    • Can Data Scientists configure this to apply to their own tables?
  • 34. Data Cleanser - What does cleansing mean?
    Let's say we wish to nullify/delete some column values in a table:
    | id | column_a | column_b         | color | style  |
    | 9  | value_a  | "string_field_1" | blue  | formal |
    | 10 | value_a1 | "string_field_2" | red   | casual |
    | 11 | value_a2 | "string_field_3" | white | formal |
    Nullified:
    | id | column_a | column_b | color | style  |
    | 9  | null     | null     | blue  | formal |
    | 10 | null     | null     | red   | casual |
    | 11 | null     | null     | white | formal |
    OR Deleted:
    | id | column_a | column_b | color | style  |
    | 9  | <empty>  | <empty>  | blue  | formal |
    | 10 | <empty>  | <empty>  | red   | casual |
    | 11 | <empty>  | <empty>  | white | formal |
  • 35. Data Cleanser - Criteria
    What does the cleanser have to do?
    1. It has to be configurable.
    2. Users should be able to specify the key to be monitored and the columns to cleanse.
    3. At least two treatments should be available: a. nullify, b. delete.
    4. Cleansing should happen to data at write time and/or at rest.
  • 36. Data Cleanser - How do we cleanse data?
    How?
    • Perform cleansing at write time, to ensure all future records are cleansed even if the source still includes the values.
    • Separately, cleanse the entire Hive table when the data is no longer used, to make sure older partitions don't retain the un-referenced data.
    What do we need?
    • A mechanism to configure what to cleanse (nullify/delete) per table.
    • This mechanism needs to be accessible at write time and at rest, to run the cleansing on the data.
  • 37. Data Cleanser - Implementation
    Cleansing is driven by table configuration. We have a metadata infrastructure that allows users to add metadata to the tables they own.
    ▪ Hive tables have metadata fields that can be used to store auxiliary information about them.
    ▪ The cleanser can simply read a table's metadata and perform cleansing accordingly. Each table can have a configuration naming the columns to be cleansed, like [column_a, column_b], along with the treatment.
    ▪ Reacting to the specified metadata means the cleanser module works as configured at all times.
    ▪ The same module can perform cleansing on data while writing and/or at rest.
  • 38. Data Cleanser - The workflow
    How does it come together?
    1. The user specifies the cleansing configuration in the Hive table's metadata, e.g.:
       metadata = {"key": "id", "treatment": "nullify", "columns": ["column_a", "column_b"]}
    2. The cleanser reads the table's metadata and finds all the columns that match.
    3. It performs the nullify/delete on the DataFrame and proceeds to the next transformation, or writes the cleansed DataFrame to the Data Warehouse.
    A rough sketch of this transformation follows below.
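    A minimal sketch of the cleansing step, assuming the metadata above has already been fetched and parsed into a small config type; the metadata service, JSON parsing, and the at-rest variant are omitted, and "delete" follows the <empty>-value reading shown on the earlier slide.

        import org.apache.spark.sql.DataFrame
        import org.apache.spark.sql.functions.lit

        // Parsed form of the table's cleansing metadata (field names mirror the
        // example configuration above; this is an illustrative type, not the real one).
        case class CleansingConfig(key: String, treatment: String, columns: Seq[String])

        def cleanse(df: DataFrame, config: CleansingConfig): DataFrame = {
          // Only cleanse columns that actually exist in the incoming DataFrame.
          val targets = config.columns.filter(df.columns.contains)
          config.treatment match {
            case "nullify" =>
              // Keep the rows, null out the configured columns (preserving their types).
              targets.foldLeft(df)((acc, c) =>
                acc.withColumn(c, lit(null).cast(acc.schema(c).dataType)))
            case "delete" =>
              // The slides show "delete" as emptying the configured values; this
              // simple reading assumes string columns.
              targets.foldLeft(df)((acc, c) => acc.withColumn(c, lit("")))
            case other =>
              throw new IllegalArgumentException(s"Unknown cleansing treatment: $other")
          }
        }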
  • 39. Data Cleanser - The workflow
  • 42. Data Quality - Background
    What motivated the data quality initiative?
    • How do we detect errors or skews in data?
    • When do we check for data problems?
    • How do Data Scientists set up data quality checks?
  • 43. Data Quality - What do we need to check data?
    Components needed for running data quality checks:
    • A service to initialize tests and run tests on Hive tables.
    • A mechanism that calculates metrics, based on the configured tests, on the data prior to writing it to the warehouse.
    • An interface that allows users to autonomously set up data quality and run tests on their pipelines.
  • 44. Data Quality - What would a test look like?
    Some examples of tests that we started off with:
    • NullCount(column_name) - is the null count on this column higher than "value"?
    • Average(column_name) - is the average below what is expected?
    • Max(column_name) - is the max value for this column exceeding a certain limit?
    • RowCount(table) - are we suddenly writing more rows than anticipated?
  • 45. Data Quality - How we built it
    Putting the components together:
    • We built a service that could:
      • support CRUD operations on tests for Hive tables
      • run tests on metrics when triggered
    • At the same time, we built the ability to calculate metrics into a module in the Spark writer library:
      • The module interacts with the data quality service to find which metrics need to be calculated.
      • It runs these calculations in Spark on the input DataFrame, e.g. average(column_name).
      • It triggers tests on these metrics and posts the results to the user.
    A rough sketch of the metric calculation and checks follows below.
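    The following is only a sketch of how the metric calculation and checks could look for the example tests above: the column names (column_a, amount), thresholds, and result type are illustrative, and the real module pulls the configured tests from the data quality service instead of hard-coding them.

        import org.apache.spark.sql.DataFrame
        import org.apache.spark.sql.functions._

        // Result of one check; in the real setup results are posted back to the service.
        case class TestResult(name: String, observed: Double, passed: Boolean)

        // Compute all needed metrics in one pass over the (assumed non-empty) incoming
        // DataFrame, then compare each metric against its configured threshold.
        def runBasicChecks(df: DataFrame): Seq[TestResult] = {
          val m = df.agg(
            sum(when(col("column_a").isNull, 1).otherwise(0)).cast("double").as("null_count"),
            avg(col("amount")).cast("double").as("avg_amount"),
            max(col("amount")).cast("double").as("max_amount"),
            count(lit(1)).cast("double").as("row_count")
          ).first()

          def metric(name: String): Double = m.getAs[Double](name)

          Seq(
            TestResult("NullCount(column_a) <= 100", metric("null_count"), metric("null_count") <= 100),
            TestResult("Average(amount) >= 10.0", metric("avg_amount"), metric("avg_amount") >= 10.0),
            TestResult("Max(amount) <= 10000.0", metric("max_amount"), metric("max_amount") <= 10000.0),
            TestResult("RowCount(table) <= 5000000", metric("row_count"), metric("row_count") <= 5000000)
          )
        }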
  • 46. Data Quality - Surfacing Data Quality to users
    What did the interface look like?
    1. The data quality service has a Python client that lets users run CRUD operations on tests.
    2. The writer module can be configured to run on a write operation for a table:
       a. Setting spark.enable.data.quality.checks=true in the Spark properties runs these tests at write time.
    3. Separately, we created an offline mode to run tests on already-written data, if the user doesn't wish to block writes to the table.
    A small illustration of enabling the write-time checks follows below.
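    For illustration, the write-time checks could be switched on when the Spark session is created; only the property name comes from the slide, while the session setup and app name are assumptions.

        import org.apache.spark.sql.SparkSession

        // The writer library reads this property and runs the configured data
        // quality tests before data is written (sketch; names are assumed).
        val spark = SparkSession.builder()
          .appName("client-features-etl")
          .config("spark.enable.data.quality.checks", "true")
          .enableHiveSupport()
          .getOrCreate()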
  • 48. Spark Writer Modules - Transformations in code
    def writeDataFrame(inputDataframe: DataFrame, databaseName: String, tableName: String): Unit = {
      // Validation
      val validatedDataframe = sfWriter.validateDataframe(inputDataframe, databaseName, tableName)
      // Journalizing
      val journalizedDataframe = sfWriter.journalizeDataframe(validatedDataframe, databaseName, tableName)
      // Data Cleanser
      val cleansedDataframe = sfWriter.dataCleanser(journalizedDataframe, databaseName, tableName)
      // Data Quality Checker
      sfWriter.dataQualityChecker(cleansedDataframe, databaseName, tableName)
      // Write to the Data Warehouse + Update Metastore
      sfWriter.writeToS3(cleansedDataframe, databaseName, tableName)
    }
  • 49. Learnings & Future Work What we learnt and where are we headed?
  • 50. Learnings & Future Work - Lessons learnt • Adding new modules meant more complexity to the write pipeline, but each step was doing a valuable transformation • Making each transformation performant and efficient was a top priority when each module was being created. • Testing - unit & integration was key in rolling out without mishaps • Introducing these modules to Data Scientists meant we needed better communication and more documentation • Getting data quality checks to run efficiently was a challenge, since we had to programmatically calculate the partitions of the DataFrame and run tests against each potential Hive partition. This took some effort to run smoothly. By adding modularized transformations to data, what changed and how did we adapt?
  • 51. Learnings & Future Work - Future Work Now, additional modules can easily be added in a similar fashion • Data Quality is being enhanced with support for customized testing rather than simple threshold or values. • The goal is to have Data quality ingrained in the ETL process of our Data Science workflows. • Journalizer and data cleansing are mostly static but we are exploring alternate solutions to help augment and delete records more efficiently. By adding modularized transformations to data, what changed and how did we adapt?
  • 53. Summary
    Writing data with Spark @ Stitch Fix:
    • We have a single write path, driven by Spark, for loading data into the warehouse.
    • Three config-driven modules perform transformations at write time:
      • Journalizer: writes a de-duplicated, compressed historical record of data for quick access.
      • Data Cleanser: deletes or nullifies values based on the table's configuration.
      • Data Quality: calculates metrics and runs tests on data coming into the warehouse.