Azure Data Factory Data Flow Performance Tuning 101

Azure Data Factory: Mapping Data Flows
Performance Tuning Data Flows
v001

Sample Timings 1
Scenario 1
• Source: Delimited Text Blob Store
• Sink: Azure SQL DB
• File size: 421 MB, 74 columns, 887k rows
• Transforms: Single derived column to mask 3 fields
• Time: 4 mins end-to-end using memory optimized 80-core debug Azure IR
• Recommended settings: Current partitioning used throughout

Sample Timings 2
Scenario 2
• Source: Azure SQL DB Table
• Sink: Azure SQL DB Table
• Table size: 74 columns, 887k rows
• Recommended settings: Source partitioning on SQL DB Source, current partitioning on Derived Column and Sink

Sample Timings 3
Scenario 3
• Source: Delimited Text Blob Store
• Sink: Delimited Text Blob Store
• Table size: 74 columns, 887k rows
• Recommended settings: Leaving default/current partitioning throughout allows ADF to scale partitions up or down based on the size of the Azure IR (i.e. number of worker cores)

File conversion Source->Sink property findings
• Large data sizes should use more vCores (16+) with memory optimized or general purpose compute
• Compute optimized does not improve performance in this scenario
• Converting CSV to Parquet has a 45% time overhead compared with CSV to CSV
• Converting CSV to JSON has a 24% time overhead compared with CSV to CSV
• CSV to JSON performs better than CSV to Parquet even though it has a lot of data to write
• CSV to Parquet has a slight lag because of time spent in decompression
• Scaling vCores improves performance for both IO and computation

File Conversion Timing
Compute type: General Purpose
• Dataset has 36 columns of types string, integer, short, and double
• CSV dataset has 25 files of different sizes
• Performance improvement scales proportionately with the increase in vCores
• Going from 8 vCores to 64 vCores gives roughly an 8x performance increase
Synapse DW Timing
Adding cores proportionally decreases the time it takes to process data into staging files for PolyBase. However, writing that data from Parquet into SQL tables using PolyBase takes a fairly static amount of time.

CosmosDB Timing

Window / Aggregate Timing
• Performance improvement scales with the increase in vCores, though less than linearly
• Going from 8 vCores to 64 vCores gives roughly a 5x performance increase
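
To compare the two scaling results quoted so far, a small sketch can express each observed speedup as a fraction of the ideal linear speedup. The function name `scaling_efficiency` is hypothetical; the inputs are the 8x (file conversion) and 5x (window/aggregate) figures from the slides.

```python
def scaling_efficiency(cores_before: int, cores_after: int, speedup: float) -> float:
    """Return the observed speedup as a fraction of the ideal (linear) speedup."""
    ideal_speedup = cores_after / cores_before
    return speedup / ideal_speedup


# Figures quoted on the File Conversion and Window / Aggregate slides.
file_conversion = scaling_efficiency(8, 64, 8.0)
window_aggregate = scaling_efficiency(8, 64, 5.0)

print(f"File conversion:  {file_conversion:.1%} of linear scaling")
print(f"Window/aggregate: {window_aggregate:.1%} of linear scaling")
```

Under these numbers, file conversion scales linearly while window and aggregate work recovers about five eighths of the added cores, which fits the point that shuffle-heavy transformations benefit less from extra cores.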

Transformation Timings
Transformation recommendations
• When ranking data across the entire dataset, use the Rank transformation instead of Window with rank()
• To add a unique row counter to each row across the entire dataset, use the Surrogate Key transformation instead of rowNumber() in Window

TPCH Timings
TPCH CSV in ADLS Gen 2

Azure Data Factory Data Flow Performance
* Includes cold cluster start-up time

Azure Synapse Data Flow Performance
* Includes cold cluster start-up time

Identifying bottlenecks
1. Cluster startup time: Setting a TTL in the Azure IR lowers the cluster startup time for sequential executions.
2. Sink processing time: Total time to process the stream from source to sink. Clicking on the Sink also shows a post-processing time, which is how long Spark spent on partition and job clean-up. Writing to a single file and slow database connections will increase this time.
3. Source read time: Shows how long it took to read data from the source. Optimize with different source partition strategies.
4. Transformation stage time: Shows bottlenecks in your transformation logic. With larger general purpose and memory optimized IRs, most of these operations occur in memory in data frames and are usually the fastest operations in your data flow.

File Partitioning
• Maintain current partitioning
• Avoid output to single file
• For manual partitioning, multiply the number of cores in your Azure IR by 5
• Example: to transform a series of files in your ADLS folders with a 32-core Azure IR, the number of partitions would be 32 x 5 = 160
• If you know your data well enough to identify high-cardinality columns, use those columns for Hash partitioning
• If you do not know your data patterns very well, use Round Robin
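
The manual partitioning rule above can be written as a short calculation. This is only a sketch: `manual_partition_count` and `suggest_partition_type` are hypothetical helper names, not part of any ADF API, and the strings returned simply echo the options named on the slide.

```python
def manual_partition_count(ir_cores: int, multiplier: int = 5) -> int:
    """Apply the slide's rule of thumb: Azure IR cores multiplied by 5."""
    if ir_cores <= 0:
        raise ValueError("ir_cores must be a positive integer")
    return ir_cores * multiplier


def suggest_partition_type(high_cardinality_columns: list[str]) -> str:
    """Pick Hash when good key columns are known, otherwise Round Robin."""
    if high_cardinality_columns:
        return "Hash on " + ", ".join(high_cardinality_columns)
    return "Round Robin"


print(manual_partition_count(32))            # the 32-core example from the slide
print(suggest_partition_type(["order_id"]))  # "order_id" is a made-up column name
print(suggest_partition_type([]))
```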

Best practices - Sources
• When reading from file-based sources, data flow automatically partitions the data based on size: ~128 MB per partition, evenly distributed
• Using current partitioning will be fastest for file-based sources and for Synapse using PolyBase
• Enable staging for Synapse
• For Azure SQL DB, use Source partitioning on a column with high cardinality
• This improves performance, but can saturate your source database
• Reading can be limited by the I/O of your source
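
As a rough illustration of the ~128 MB per partition figure, the sketch below estimates how many partitions a file-based source would be split into. The helper name and the simple ceiling division are assumptions; the actual partitioning chosen by the service may differ.

```python
import math

TARGET_PARTITION_MB = 128  # approximate per-partition size quoted on the slide


def estimated_file_partitions(file_size_mb: float) -> int:
    """Estimate the partition count by dividing file size by the target size."""
    return max(1, math.ceil(file_size_mb / TARGET_PARTITION_MB))


# The 421 MB delimited text file from Scenario 1.
print(estimated_file_partitions(421))
```

For the 421 MB file this gives 4, which lines up with the 4 partitions Spark chose in the General Purpose test of the 887k-row CSV file.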

Optimizing transformations
• Each transformation has its own Optimize tab
  • Generally better not to alter it, because reshuffling is a relatively slow process
  • Reshuffling can occur if data is very skewed, i.e. one node has a disproportionate amount of data
• For Joins, Exists and Lookups:
  • If you have many of these transforms, memory optimized greatly increases performance
  • Can ‘Broadcast’ if the data on one side is small (rule of thumb: less than 50k rows)
• Use the Window transformation partitioned over segments of data
  • For rank() across the entire dataset, use the Rank transformation instead
  • For rowNumber() across the entire dataset, use the Surrogate Key transformation instead
• Transformations that require reshuffling, like Sort, negatively impact performance
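
Two of the points above, data skew and the broadcast rule of thumb, can be checked with simple arithmetic once you have row counts. The sketch below is illustrative only: `skew_ratio` and `can_broadcast` are hypothetical helpers, the partition row counts are invented, and the 50k threshold is the rule of thumb from the slide rather than a hard limit.

```python
from statistics import mean


def skew_ratio(rows_per_partition: list[int]) -> float:
    """Largest partition divided by the mean partition size (1.0 means even)."""
    return max(rows_per_partition) / mean(rows_per_partition)


def can_broadcast(row_count: int, threshold: int = 50_000) -> bool:
    """Apply the slide's rule of thumb for the small side of a join."""
    return row_count < threshold


even_partitions = [1000, 1000, 1000, 1000]    # made-up example data
skewed_partitions = [3700, 100, 100, 100]     # one node holds most of the rows

print(skew_ratio(even_partitions))
print(skew_ratio(skewed_partitions))
print(can_broadcast(12_000), can_broadcast(2_000_000))
```

A ratio well above 1.0 suggests one node is doing a disproportionate share of the work.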

Best practices – Debug (Data Preview)
• Data Preview
  • Data preview is inside the data flow designer transformation properties
  • Uses row limits and sampling techniques to preview data from a small sample of data
  • Allows you to build and validate units of logic with samples of data in real time
  • You have control over the size of the data limits under Debug Settings
  • If you wish to test with larger datasets, set a larger compute size in the Azure IR when switching on “Debug Mode”
  • Data Preview is only a snapshot of data in memory from Spark data frames. This feature does not write any data, so the sink drivers are not utilized and not tested in this mode.

Best practices – Debug (Pipeline Debug)
• Pipeline Debug
  • Click the debug button to test your data flow inside of a pipeline
  • Default debug limits the execution runtime, so you will want to limit data sizes
  • Sampling can be applied here as well by using the “Enable Sampling” option in each Source
  • Use the debug button option “use activity IR” when you wish to use a job execution compute environment
  • This option is good for debugging with larger datasets. It will not have the same execution timeout limit as the default debug setting

Best practices - Sinks
• SQL:
  • Disable indexes on target with pre/post SQL scripts
  • Increase SQL capacity during pipeline execution
  • Enable staging when using Synapse
  • Use Source Partitioning on Source under Optimize
  • Set number of partitions based on size of IR
• File-based sinks:
  • Using current partitioning lets Spark create the output
  • Output to single file is a slow operation
  • Often unnecessary for whoever is consuming the data
  • Can set naming patterns or use data in a column
  • Any reshuffling of data is slow
• Cosmos DB:
  • Set throughput and batch size to meet performance requirements
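
For the SQL sink advice on disabling indexes, the sketch below generates a pair of pre and post SQL scripts using the T-SQL `ALTER INDEX ... DISABLE` and `ALTER INDEX ... REBUILD` statements. The helper name, schema, table, and index names are all hypothetical. It assumes the listed indexes are nonclustered; disabling a clustered index makes the table inaccessible, so that should not be done before a load.

```python
def index_toggle_scripts(schema: str, table: str, index_names: list[str]) -> tuple[list[str], list[str]]:
    """Build pre-load (disable) and post-load (rebuild) statements per index."""
    target = f"[{schema}].[{table}]"
    pre_sql = [f"ALTER INDEX [{ix}] ON {target} DISABLE;" for ix in index_names]
    post_sql = [f"ALTER INDEX [{ix}] ON {target} REBUILD;" for ix in index_names]
    return pre_sql, post_sql


# Hypothetical target table and nonclustered index names.
pre, post = index_toggle_scripts("dbo", "SalesStaging", ["IX_Sales_Customer", "IX_Sales_Date"])

print("Pre SQL scripts:")
print("\n".join(pre))
print("Post SQL scripts:")
print("\n".join(post))
```

The generated statements would be pasted into the sink's pre and post SQL script settings, so the load runs without index maintenance and the indexes are rebuilt once afterwards.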

Azure Integration Runtime
• Data Flows use JIT compute to avoid running expensive clusters when they are mostly idle
  • Generally more economical, but each cluster takes ~4 minutes to spin up
• IR specifies what cluster type and core-count to use
  • Memory optimized is best; compute optimized generally doesn’t work for production workloads
• When running sequential jobs, use Time to Live (TTL) to reuse the cluster between executions
  • Keeps compute resources alive for TTL minutes after execution for a new job to use
  • Maximum one job per cluster
  • Reduces job startup latency to ~1.5 minutes
• Rule of thumb: start small and scale up
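
To see what TTL is worth for a chain of sequential data flow jobs, the sketch below totals the startup overhead with and without cluster reuse, using the ~4 minute and ~1.5 minute figures from the slide. It assumes every job after the first starts within the TTL window and that jobs run strictly one after another; the function name is hypothetical.

```python
COLD_START_MIN = 4.0   # approximate cluster spin-up time from the slide
WARM_START_MIN = 1.5   # approximate startup latency when reusing a TTL cluster


def startup_overhead_minutes(jobs: int, use_ttl: bool) -> float:
    """Total minutes spent on startup across a sequence of jobs."""
    if jobs <= 0:
        return 0.0
    if not use_ttl:
        return jobs * COLD_START_MIN
    # First job pays the cold start; later jobs reuse the warm cluster.
    return COLD_START_MIN + (jobs - 1) * WARM_START_MIN


for n in (1, 5, 10):
    without_ttl = startup_overhead_minutes(n, use_ttl=False)
    with_ttl = startup_overhead_minutes(n, use_ttl=True)
    print(f"{n:>2} jobs: {without_ttl:.1f} min without TTL, {with_ttl:.1f} min with TTL")
```

The saving grows with the number of sequential jobs, but the cluster is billed while it sits idle within the TTL window, so the trade-off depends on how closely the jobs follow each other.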

Azure IR – General Purpose
• This was General Purpose 4+4, the default auto resolve Azure IR
• For prod workloads, GP is usually sufficient at >= 16 cores
• You get 1 driver and 1 worker node, both with 4 vCores
• Good for debugging, testing, and many production workloads
• Tested with 887k row CSV file with 74 columns
• Default partitioning
• Spark chose 4 partitions
• Cluster startup time: 4.5 mins
• Sink IO writing: 46s
• Transformation time: 42s
• Sink post-processing time: 45s
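
The timings above can be turned into a quick breakdown to see which stage dominates. This sketch assumes the four stages run one after another and do not overlap, which the slide does not state; the dictionary keys are shortened labels for the figures listed.

```python
# Stage timings in seconds, taken from the General Purpose 4+4 test above.
timings_seconds = {
    "Cluster startup": 4.5 * 60,
    "Sink IO writing": 46,
    "Transformation": 42,
    "Sink post-processing": 45,
}

total = sum(timings_seconds.values())
bottleneck = max(timings_seconds, key=timings_seconds.get)

# Print stages from slowest to fastest with their share of the total.
for stage, seconds in sorted(timings_seconds.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stage:<22}{seconds:>6.0f}s  {seconds / total:6.1%}")

print(f"Total: {total:.0f}s, largest contributor: {bottleneck}")
```

On this small run, cluster startup accounts for roughly two thirds of the total, which is why setting a TTL matters most for short sequential jobs.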

Azure IR – Compute Optimized
• Compute Optimized is intended for smaller workloads
• 8+8 is the smallest CO option; you get 1 driver and 2 workers
• Not suitable for large production workloads
• More worker nodes gave us more partitions and better perf than General Purpose

Azure IR – Memory Optimized
• Memory Optimized is well suited to large production workloads that need reliability with many aggregates, lookups, and joins
• 64+16 gives you 16 vCores for the driver and 64 across the worker nodes
