Azure Data Factory: Mapping Data Flows
Performance Tuning Data Flows
v001
Sample Timings 1
Scenario 1
 Source: Delimited Text Blob Store
 Sink: Azure SQL DB
 File size: 421 MB, 74 columns, 887k rows
 Transforms: Single derived column to mask 3 fields
 Time: 4 mins end-to-end using memory optimized 80-core debug Azure IR
 Recommended settings: Current partitioning used throughout
Sample Timings 2
 Scenario 2
 Source: Azure SQL DB Table
 Sink: Azure SQL DB Table
 Table size: 74 columns, 887k rows
 Transforms: Single derived column to mask 3 fields
 Time: 3 mins end-to-end using memory optimized 80-core debug Azure IR
 Recommended settings: Source partitioning on SQL DB Source, current partitioning
on Derived Column and Sink
Sample Timings 3
 Scenario 3
 Source: Delimited Text Blob Store
 Sink: Delimited Text Blob store
 Table size: 74 columns, 887k rows
 Transforms: Single derived column to mask 3 fields
 Time: 2 mins end-to-end using memory optimized 80-core debug Azure IR
 Recommended settings: Leaving default/current partitioning throughout allows
ADF to scale-up/down partitions based on size of Azure IR (i.e. number of worker
cores)
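For a rough comparison, the three scenario timings above can be turned into rows-per-second figures. This is a minimal sketch, assuming the rounded end-to-end minutes quoted on the slides; `rows_per_second` is a hypothetical helper name, and end-to-end time may include overheads beyond data movement, so treat the results as ballpark only.

```python
def rows_per_second(rows: int, minutes: float) -> float:
    """Average end-to-end throughput for one run (hypothetical helper)."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return rows / (minutes * 60)


# (rows, minutes) taken from the three scenarios above.
scenarios = {
    "Blob CSV -> Azure SQL DB": (887_000, 4),
    "Azure SQL DB -> Azure SQL DB": (887_000, 3),
    "Blob CSV -> Blob CSV": (887_000, 2),
}

for name, (rows, minutes) in scenarios.items():
    print(f"{name}: ~{rows_per_second(rows, minutes):,.0f} rows/s")
```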
File conversion Source->Sink property findings
 Large data sizes should use more vCores (16+) with memory optimized or
general purpose compute
 Compute optimized does not improve performance in this scenario
 Converting CSV to Parquet has a 45% time overhead compared with CSV to CSV
 Converting CSV to JSON has a 24% time overhead compared with CSV to CSV
 CSV to JSON performs better even though it has a lot of data to write
 CSV to Parquet has a slight lag because of time spent in decompression
 Scaling V-cores improves performance for both IO and computation
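The overhead percentages above can be applied to a measured CSV-to-CSV run to get a first estimate for the other sink formats. A minimal sketch, assuming the 45% and 24% figures carry over to your data (they were measured on the deck's own dataset); the 100-second baseline and the function name are hypothetical.

```python
# Overheads relative to CSV -> CSV, as reported above (percent).
OVERHEAD_PERCENT = {"csv": 0, "json": 24, "parquet": 45}


def estimated_conversion_time(csv_to_csv_seconds: float, sink_format: str) -> float:
    """Scale a measured CSV -> CSV time by the reported format overhead."""
    pct = OVERHEAD_PERCENT[sink_format.lower()]
    return csv_to_csv_seconds * (100 + pct) / 100


baseline = 100.0  # hypothetical CSV -> CSV run time in seconds
for fmt in OVERHEAD_PERCENT:
    print(f"CSV -> {fmt}: ~{estimated_conversion_time(baseline, fmt):.0f} s")
```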
File Conversion Timing
Compute type: General Purpose
• Dataset has 36 Columns of string, integer, short, double
• CSV dataset has 25 files with different file sizes
• Performance improvement scales proportionately with the increase
in vCores
• Going from 8 to 64 vCores increases performance by around 8 times
SQL Database Timing
Synapse DW Timing
Compute type: General Purpose
Adding cores proportionally decreases the time it takes to process data into staging files for PolyBase. However, there is a
fairly static amount of time that it takes to write that data from Parquet into SQL tables using PolyBase.
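The behaviour described above can be modelled as a core-dependent staging phase plus a fixed load phase. This is only a sketch under the assumption that staging time is inversely proportional to core count; the 400-second and 60-second figures are invented for illustration and do not come from the deck.

```python
def estimated_synapse_load_seconds(
    staging_seconds_at_base: float,
    base_cores: int,
    cores: int,
    polybase_seconds: float,
) -> float:
    """Staging shrinks with added cores; the PolyBase load is treated as fixed."""
    if base_cores <= 0 or cores <= 0:
        raise ValueError("core counts must be positive")
    staging = staging_seconds_at_base * base_cores / cores
    return staging + polybase_seconds


# Hypothetical run: 400 s of staging on 8 cores, 60 s of PolyBase load.
for cores in (8, 16, 32, 64):
    total = estimated_synapse_load_seconds(400, 8, cores, 60)
    print(f"{cores} cores: ~{total:.0f} s")
```

With these made-up inputs the fixed load phase makes up a growing share of the total as cores are added, which is why adding cores gives diminishing returns for this sink.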
CosmosDB Timing
Compute type: General Purpose
Window / Aggregate Timing
Compute type: General Purpose
• Performance improvement scales proportionately with the
increase in vCores
• Going from 8 to 64 vCores increases performance by around 5
times
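One way to compare the two scaling results in this deck (about 8x for file conversion and about 5x for window/aggregate when going from 8 to 64 vCores) is scaling efficiency: the observed speedup divided by the increase in cores. A small sketch; the function name is hypothetical.

```python
def scaling_efficiency(speedup: float, base_cores: int, scaled_cores: int) -> float:
    """Observed speedup divided by the core ratio (1.0 means linear scaling)."""
    if base_cores <= 0 or scaled_cores <= 0:
        raise ValueError("core counts must be positive")
    return speedup / (scaled_cores / base_cores)


# Speedups quoted in the deck for 8 -> 64 vCores.
print("file conversion:", scaling_efficiency(8, 8, 64))
print("window/aggregate:", scaling_efficiency(5, 8, 64))
```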
Transformation Timings
Compute type: General Purpose
Transformation recommendations
• When ranking data across entire dataset, use Rank
transformation instead of Window with rank()
• When using rowNumber() in Window to uniquely
add a row counter to each row across entire
dataset, instead use the Surrogate Key
transformation
TPCH Timings
Compute type: General Purpose
TPCH CSV in ADLS Gen 2
Azure Data Factory Data Flow Performance
* Includes cold cluster start-up time
Azure Synapse Data Flow Performance
* Includes cold cluster start-up time
Identifying bottlenecks
1. Cluster startup time: sequential executions can lower the cluster startup time when a TTL is set in the Azure IR.
2. Sink processing time: the total time to process the stream from source to sink. Clicking on the Sink also shows a post-processing time, which is how long Spark spent on partition and job clean-up. Writing to a single file and slow database connections will increase this time.
3. Source read time: shows how long it took to read data from the source. Optimize with different source partition strategies.
4. Transformation stage time: shows bottlenecks in your transformation logic. With larger General Purpose and Memory Optimized IRs, most of these operations occur in memory in data frames and are usually the fastest operations in your data flow.
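Once the stage timings have been read off the monitoring view, picking out the dominant one is simple arithmetic. The sketch below uses the General Purpose 4+4 timings reported later in this deck (startup 4.5 minutes, sink write 46s, transformation 42s, sink post-processing 45s); source read time is not listed there, so it is left out. The helper name is hypothetical.

```python
from typing import Dict, Tuple


def largest_stage(stage_seconds: Dict[str, float]) -> Tuple[str, float]:
    """Return the slowest stage and its share of the summed stage time."""
    name = max(stage_seconds, key=stage_seconds.get)
    share = stage_seconds[name] / sum(stage_seconds.values())
    return name, share


# Timings from the General Purpose 4+4 test in this deck, in seconds.
gp_run = {
    "cluster startup": 4.5 * 60,
    "sink write": 46,
    "transformation": 42,
    "sink post-processing": 45,
}

stage, share = largest_stage(gp_run)
print(f"Largest stage: {stage} ({share:.0%} of total)")
```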
File Partitioning
 Maintain current partitioning
 Avoid output to single file
 For manual partitioning, use number of cores from your Azure IR
and multiply by 5
 Example: transform a series of files in your ADLS folders w/32-core Azure IR, number of
partitions would be 32 x 5 = 160 partitions
 If you know data well enough to have high-cardinality columns, use those columns as Hash
partition
 If you do not know data patterns very well, use Round Robin
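The partition-count rule and the two partitioning strategies can be illustrated in a few lines. This is a conceptual sketch only: it shows how hash and round-robin assignment distribute rows, not how Spark or ADF implement them, and all function names are hypothetical.

```python
import zlib


def manual_partition_count(ir_cores: int, multiplier: int = 5) -> int:
    """Rule of thumb from the text: IR cores multiplied by 5."""
    if ir_cores <= 0:
        raise ValueError("ir_cores must be positive")
    return ir_cores * multiplier


def hash_partition(key: object, num_partitions: int) -> int:
    """Same key always lands in the same partition (stable CRC32 hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def round_robin_partition(row_index: int, num_partitions: int) -> int:
    """Rows are dealt out evenly regardless of their content."""
    return row_index % num_partitions


partitions = manual_partition_count(32)  # the 32-core example above
print(partitions)
print(hash_partition("customer-42", partitions))
print(round_robin_partition(161, partitions))
```

Hash partitioning only spreads data evenly when the key has many distinct values, which is why the text recommends high-cardinality columns; round robin gives an even spread without knowing anything about the data.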
Best practices - Sources
 When reading from file-based sources, data flow automatically
partitions the data based on size
 ~128 MB per partition, evenly distributed
 Using current partitioning is fastest for file-based sources and for Synapse using PolyBase
 Enable staging for Synapse
 For Azure SQL DB, use Source partitioning on column with high
cardinality
 Improves performance, but can saturate your source database
 Reading can be limited by the I/O of your source
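The ~128 MB figure gives a quick estimate of how many partitions a file-based source might be split into. A minimal sketch, assuming size is the only factor; the IR tests later in this deck show Spark choosing 4, 8, and 64 partitions for the same file on different cluster sizes, so the real count also depends on the cluster. The function name is hypothetical.

```python
import math


def estimated_file_partitions(total_size_mb: float, target_mb: float = 128) -> int:
    """Rough partition count when splitting at about target_mb per partition."""
    if total_size_mb < 0 or target_mb <= 0:
        raise ValueError("sizes must be non-negative and target positive")
    return max(1, math.ceil(total_size_mb / target_mb))


# The 421 MB delimited text file used in Scenario 1.
print(estimated_file_partitions(421))
```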
Optimizing transformations
 Each transformation has its own optimize tab
 Generally better to not alter -> reshuffling is a relatively slow process
 Reshuffling can occur if data is very skewed
 One node has a disproportionate amount of data
 For Joins, Exists and Lookups:
 If you have many of these transforms, memory optimized greatly increases performance
 Can ‘Broadcast’ if the data on one side is small
 Rule of thumb: Less than 50k rows
 Use Window transformation partitioned over segments of data
 For Rank() across entire dataset, use the Rank transformation instead
 For RowNumber() across entire dataset, use the Surrogate Key transformation instead
 Transformations that require reshuffling like Sort negatively impact
performance
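Two of the checks above lend themselves to a quick calculation: whether one side of a join is small enough to broadcast, and how skewed the partitions are. A sketch under the stated rule of thumb (fewer than 50k rows); the skew measure (largest partition divided by the average) is an illustrative choice, not something the deck defines, and the names are hypothetical.

```python
from statistics import mean
from typing import Sequence

BROADCAST_ROW_THRESHOLD = 50_000  # rule of thumb from the text


def should_broadcast(smaller_side_rows: int) -> bool:
    """True when the smaller join side is under the rule-of-thumb limit."""
    return smaller_side_rows < BROADCAST_ROW_THRESHOLD


def skew_ratio(partition_row_counts: Sequence[int]) -> float:
    """Largest partition relative to the average; 1.0 means perfectly even."""
    if not partition_row_counts:
        raise ValueError("need at least one partition")
    average = mean(partition_row_counts)
    return max(partition_row_counts) / average if average else 0.0


print(should_broadcast(10_000))
print(skew_ratio([100, 100, 100, 700]))
```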
Best practices – Debug (Data Preview)
 Data Preview
 Data preview is inside the data flow designer transformation properties
 Uses row limits and sampling techniques to preview data from a small sample of data
 Allows you to build and validate units of logic with samples of data in real time
 You have control over the size of the data limits under Debug Settings
 If you wish to test with larger datasets, set a larger compute size in the Azure IR when
switching on “Debug Mode”
 Data Preview is only a snapshot of data in memory from Spark data frames. This feature does
not write any data, so the sink drivers are not utilized and not tested in this mode.
Best practices – Debug (Pipeline Debug)
 Pipeline Debug
 Click debug button to test your data flow inside of a pipeline
 Default debug limits the execution runtime so you will want to limit data sizes
 Sampling can be applied here as well by using the “Enable Sampling” option in each Source
 Use the debug button option of “use activity IR” when you wish to use a job execution
compute environment
 This option is good for debugging with larger datasets. It will not have the same execution timeout limit as the
default debug setting
Best practices - Sinks
 SQL:
 Disable indexes on target with pre/post SQL scripts
 Increase SQL capacity during pipeline execution
 Enable staging when using Synapse
 Use Source Partitioning on Source under Optimize
 Set number of partitions based on size of IR
 File-based sinks:
 Using current partitioning allows Spark to create output
 Output to single file is a slow operation
 Often unnecessary for whoever is consuming the data
 Can set naming patterns or use data in column
 Any reshuffling of data is slow
 Cosmos DB
 Set throughput and batch size to meet performance requirements
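For the SQL sink advice on disabling indexes, the pre and post scripts are typically pairs of `ALTER INDEX ... DISABLE` and `ALTER INDEX ... REBUILD` statements. The sketch below only builds those T-SQL strings; the schema, table, and index names are hypothetical, and names are not escaped, so it is not meant for untrusted input.

```python
from typing import List, Sequence, Tuple


def index_toggle_scripts(
    schema: str, table: str, index_names: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Build pre-load (disable) and post-load (rebuild) T-SQL statements.

    Intended for nonclustered indexes: disabling a clustered index
    makes the table data inaccessible until it is rebuilt.
    """
    target = f"[{schema}].[{table}]"
    pre = [f"ALTER INDEX [{ix}] ON {target} DISABLE;" for ix in index_names]
    post = [f"ALTER INDEX [{ix}] ON {target} REBUILD;" for ix in index_names]
    return pre, post


# Hypothetical target table and index.
pre_sql, post_sql = index_toggle_scripts("dbo", "orders", ["IX_orders_customer"])
print("\n".join(pre_sql))
print("\n".join(post_sql))
```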
Azure Integration Runtime
 Data Flows use JIT compute to minimize running expensive clusters
when they are mostly idle
 Generally more economical, but each cluster takes ~4 minutes to spin up
 IR specifies what cluster type and core-count to use
 Memory optimized is best; compute optimized generally doesn’t work for production workloads
 When running sequential jobs, use Time to Live (TTL) to reuse the cluster
between executions
 Keeps compute resources alive for TTL minutes after execution for new job to use
 Maximum one job per cluster
 Reduces job startup latency to ~1.5 minutes
 Rule of thumb: start small and scale up
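The startup figures above (~4 minutes cold, ~1.5 minutes when a warm cluster is reused) make it easy to estimate what TTL saves over a series of sequential jobs. A minimal sketch, assuming every job after the first starts before the TTL expires; the function name is hypothetical.

```python
COLD_START_MINUTES = 4.0   # approximate cold cluster spin-up from the text
WARM_START_MINUTES = 1.5   # approximate startup latency with TTL reuse


def total_startup_minutes(job_count: int, use_ttl: bool) -> float:
    """Total startup overhead for sequential jobs, with or without TTL."""
    if job_count <= 0:
        return 0.0
    if not use_ttl:
        return job_count * COLD_START_MINUTES
    # First job pays the cold start; later jobs reuse the warm cluster.
    return COLD_START_MINUTES + (job_count - 1) * WARM_START_MINUTES


jobs = 10
print("without TTL:", total_startup_minutes(jobs, use_ttl=False), "min")
print("with TTL:   ", total_startup_minutes(jobs, use_ttl=True), "min")
```

The saving has to be weighed against the cost of keeping the cluster alive during the TTL window between jobs.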
Azure IR – General Purpose
• This was General Purpose 4+4, the default auto resolve Azure IR
• For prod workloads, GP is usually sufficient at >= 16 cores
• You get 1 driver and 1 worker node, both with 4 vcores
• Good for debugging, testing, and many production workloads
• Tested with 887k row CSV file with 74 columns
• Default partitioning
• Spark chose 4 partitions
• Cluster startup time: 4.5 mins
• Sink IO writing: 46s
• Transformation time: 42s
• Sink post-processing time: 45s
Azure IR – Compute Optimized
• Compute Optimized is intended for smaller workloads
• 8+8 is the smallest CO option; you get 1 driver and 2
workers
• Not suitable for large production workloads
• Tested with 887k row CSV file with 74 columns
• Default partitioning
• Spark chose 8 partitions
• Cluster startup time: 4.5 mins
• Sink IO writing: 20s
• Transformation time: 35s
• Sink post-processing time: 40s
• More worker nodes gave us more partitions and better perf than
General Purpose
Azure IR – Memory Optimized
• Memory Optimized is well suited for large production workloads
that need reliability with many aggregates, lookups, and joins
• 64+16 gives you 16 vcores for driver and 64 across worker nodes
• Tested with 887k row CSV file with 74 columns
• Default partitioning
• Spark chose 64 partitions
• Cluster startup time: 4.8 mins
• Sink IO writing: 19s
• Transformation time: 17s
• Sink post-processing time: 40s
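The three IR tests above are easier to compare side by side. The sketch below sums the reported sink write, transformation, and sink post-processing times for each configuration, leaving out cluster startup; the dictionary keys and function name are hypothetical labels.

```python
from typing import Dict

# Seconds reported above for the 887k-row, 74-column CSV test.
ir_timings: Dict[str, Dict[str, int]] = {
    "General Purpose 4+4": {"sink_io": 46, "transform": 42, "post": 45},
    "Compute Optimized 8+8": {"sink_io": 20, "transform": 35, "post": 40},
    "Memory Optimized 64+16": {"sink_io": 19, "transform": 17, "post": 40},
}


def processing_seconds(timings: Dict[str, int]) -> int:
    """Sum of the per-stage times, excluding cluster startup."""
    return sum(timings.values())


for ir_name, timings in ir_timings.items():
    print(f"{ir_name}: {processing_seconds(timings)} s")
```

Cluster startup (4.5 to 4.8 minutes in these tests) is longer than all three totals, so for a dataset of this size startup dominates regardless of the IR type.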

More Related Content

What's hot

Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
Databricks
 
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech TalksTackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
Amazon Web Services
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
Data Con LA
 
Azure data factory
Azure data factoryAzure data factory
Azure data factory
BizTalk360
 
Azure Data Factory Data Flow
Azure Data Factory Data FlowAzure Data Factory Data Flow
Azure Data Factory Data Flow
Mark Kromer
 
Build data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelinesBuild data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelines
Mark Kromer
 
Microsoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview SlidesMicrosoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview Slides
Mark Kromer
 
Big Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best PracticesBig Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best Practices
Amazon Web Services
 
Microsoft Data Integration Pipelines: Azure Data Factory and SSIS
Microsoft Data Integration Pipelines: Azure Data Factory and SSISMicrosoft Data Integration Pipelines: Azure Data Factory and SSIS
Microsoft Data Integration Pipelines: Azure Data Factory and SSIS
Mark Kromer
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
Databricks
 
Azure Data Factory Data Flows Training (Sept 2020 Update)
Azure Data Factory Data Flows Training (Sept 2020 Update)Azure Data Factory Data Flows Training (Sept 2020 Update)
Azure Data Factory Data Flows Training (Sept 2020 Update)
Mark Kromer
 
Lift SSIS package to Azure Data Factory V2
Lift SSIS package to Azure Data Factory V2Lift SSIS package to Azure Data Factory V2
Lift SSIS package to Azure Data Factory V2
Manjeet Singh
 
Best Practices for Amazon S3 and Amazon Glacier (STG203-R2) - AWS re:Invent 2018
Best Practices for Amazon S3 and Amazon Glacier (STG203-R2) - AWS re:Invent 2018Best Practices for Amazon S3 and Amazon Glacier (STG203-R2) - AWS re:Invent 2018
Best Practices for Amazon S3 and Amazon Glacier (STG203-R2) - AWS re:Invent 2018
Amazon Web Services
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020
Riccardo Zamana
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
James Serra
 
Azure storage
Azure storageAzure storage
Azure storage
Adam Skibicki
 
Batch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing DifferenceBatch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing Difference
jeetendra mandal
 
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQLNEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
Amazon Web Services
 
Azure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data FlowsAzure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data Flows
Thomas Sykes
 

What's hot (20)

Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech TalksTackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Azure data factory
Azure data factoryAzure data factory
Azure data factory
 
Azure Data Factory Data Flow
Azure Data Factory Data FlowAzure Data Factory Data Flow
Azure Data Factory Data Flow
 
Build data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelinesBuild data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelines
 
Microsoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview SlidesMicrosoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview Slides
 
Big Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best PracticesBig Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best Practices
 
Microsoft Data Integration Pipelines: Azure Data Factory and SSIS
Microsoft Data Integration Pipelines: Azure Data Factory and SSISMicrosoft Data Integration Pipelines: Azure Data Factory and SSIS
Microsoft Data Integration Pipelines: Azure Data Factory and SSIS
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Azure Data Factory Data Flows Training (Sept 2020 Update)
Azure Data Factory Data Flows Training (Sept 2020 Update)Azure Data Factory Data Flows Training (Sept 2020 Update)
Azure Data Factory Data Flows Training (Sept 2020 Update)
 
Lift SSIS package to Azure Data Factory V2
Lift SSIS package to Azure Data Factory V2Lift SSIS package to Azure Data Factory V2
Lift SSIS package to Azure Data Factory V2
 
Best Practices for Amazon S3 and Amazon Glacier (STG203-R2) - AWS re:Invent 2018
Best Practices for Amazon S3 and Amazon Glacier (STG203-R2) - AWS re:Invent 2018Best Practices for Amazon S3 and Amazon Glacier (STG203-R2) - AWS re:Invent 2018
Best Practices for Amazon S3 and Amazon Glacier (STG203-R2) - AWS re:Invent 2018
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
Azure storage
Azure storageAzure storage
Azure storage
 
Batch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing DifferenceBatch Processing vs Stream Processing Difference
Batch Processing vs Stream Processing Difference
 
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQLNEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
 
Azure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data FlowsAzure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data Flows
 

Similar to Azure Data Factory Data Flow Performance Tuning 101

Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021
Mark Kromer
 
Processing and Analytics
Processing and AnalyticsProcessing and Analytics
Processing and Analytics
Amazon Web Services
 
MongoDB World 2019: Finding the Right MongoDB Atlas Cluster Size: Does This I...
MongoDB World 2019: Finding the Right MongoDB Atlas Cluster Size: Does This I...MongoDB World 2019: Finding the Right MongoDB Atlas Cluster Size: Does This I...
MongoDB World 2019: Finding the Right MongoDB Atlas Cluster Size: Does This I...
MongoDB
 
SRV407 Deep Dive on Amazon Aurora
SRV407 Deep Dive on Amazon AuroraSRV407 Deep Dive on Amazon Aurora
SRV407 Deep Dive on Amazon Aurora
Amazon Web Services
 
Performance tuning in sql server
Performance tuning in sql serverPerformance tuning in sql server
Performance tuning in sql server
Antonios Chatzipavlis
 
45 ways to speed up firebird database
45 ways to speed up firebird database45 ways to speed up firebird database
45 ways to speed up firebird database
Fabio Codebue
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
Amazon Web Services
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
javier ramirez
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
Amazon Web Services
 
Espc17 make your share point fly by tuning and optimising sql server
Espc17 make your share point  fly by tuning and optimising sql serverEspc17 make your share point  fly by tuning and optimising sql server
Espc17 make your share point fly by tuning and optimising sql server
Isabelle Van Campenhoudt
 
Make your SharePoint fly by tuning and optimizing SQL Server
Make your SharePoint  fly by tuning and optimizing SQL ServerMake your SharePoint  fly by tuning and optimizing SQL Server
Make your SharePoint fly by tuning and optimizing SQL Server
serge luca
 
071410 sun a_1515_feldman_stephen
071410 sun a_1515_feldman_stephen071410 sun a_1515_feldman_stephen
071410 sun a_1515_feldman_stephen
Steve Feldman
 
Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2
Marco Tusa
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
Splunk
 
MariaDB Performance Tuning and Optimization
MariaDB Performance Tuning and OptimizationMariaDB Performance Tuning and Optimization
MariaDB Performance Tuning and Optimization
MariaDB plc
 
Cassandra CLuster Management by Japan Cassandra Community
Cassandra CLuster Management by Japan Cassandra CommunityCassandra CLuster Management by Japan Cassandra Community
Cassandra CLuster Management by Japan Cassandra Community
Hiromitsu Komatsu
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
Malin Weiss
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
Speedment, Inc.
 
Optimize SQL server performance for SharePoint
Optimize SQL server performance for SharePointOptimize SQL server performance for SharePoint
Optimize SQL server performance for SharePoint
serge luca
 
Fabric Data Factory Pipeline Copy Perf Tips.pptx
Fabric Data Factory Pipeline Copy Perf Tips.pptxFabric Data Factory Pipeline Copy Perf Tips.pptx
Fabric Data Factory Pipeline Copy Perf Tips.pptx
Mark Kromer
 

Similar to Azure Data Factory Data Flow Performance Tuning 101 (20)

Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021
 
Processing and Analytics
Processing and AnalyticsProcessing and Analytics
Processing and Analytics
 
MongoDB World 2019: Finding the Right MongoDB Atlas Cluster Size: Does This I...
MongoDB World 2019: Finding the Right MongoDB Atlas Cluster Size: Does This I...MongoDB World 2019: Finding the Right MongoDB Atlas Cluster Size: Does This I...
MongoDB World 2019: Finding the Right MongoDB Atlas Cluster Size: Does This I...
 
SRV407 Deep Dive on Amazon Aurora
SRV407 Deep Dive on Amazon AuroraSRV407 Deep Dive on Amazon Aurora
SRV407 Deep Dive on Amazon Aurora
 
Performance tuning in sql server
Performance tuning in sql serverPerformance tuning in sql server
Performance tuning in sql server
 
45 ways to speed up firebird database
45 ways to speed up firebird database45 ways to speed up firebird database
45 ways to speed up firebird database
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
 
Espc17 make your share point fly by tuning and optimising sql server
Espc17 make your share point  fly by tuning and optimising sql serverEspc17 make your share point  fly by tuning and optimising sql server
Espc17 make your share point fly by tuning and optimising sql server
 
Make your SharePoint fly by tuning and optimizing SQL Server
Make your SharePoint  fly by tuning and optimizing SQL ServerMake your SharePoint  fly by tuning and optimizing SQL Server
Make your SharePoint fly by tuning and optimizing SQL Server
 
071410 sun a_1515_feldman_stephen
071410 sun a_1515_feldman_stephen071410 sun a_1515_feldman_stephen
071410 sun a_1515_feldman_stephen
 
Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
 
MariaDB Performance Tuning and Optimization
MariaDB Performance Tuning and OptimizationMariaDB Performance Tuning and Optimization
MariaDB Performance Tuning and Optimization
 
Cassandra CLuster Management by Japan Cassandra Community
Cassandra CLuster Management by Japan Cassandra CommunityCassandra CLuster Management by Japan Cassandra Community
Cassandra CLuster Management by Japan Cassandra Community
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
 
Optimize SQL server performance for SharePoint
Optimize SQL server performance for SharePointOptimize SQL server performance for SharePoint
Optimize SQL server performance for SharePoint
 
Fabric Data Factory Pipeline Copy Perf Tips.pptx
Fabric Data Factory Pipeline Copy Perf Tips.pptxFabric Data Factory Pipeline Copy Perf Tips.pptx
Fabric Data Factory Pipeline Copy Perf Tips.pptx
 

More from Mark Kromer

Data cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flowsData cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flows
Mark Kromer
 
Data cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flowsData cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flows
Mark Kromer
 
Data Lake ETL in the Cloud with ADF
Data Lake ETL in the Cloud with ADFData Lake ETL in the Cloud with ADF
Data Lake ETL in the Cloud with ADF
Mark Kromer
 
Azure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power QueryAzure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power Query
Mark Kromer
 
Data Quality Patterns in the Cloud with ADF
Data Quality Patterns in the Cloud with ADFData Quality Patterns in the Cloud with ADF
Data Quality Patterns in the Cloud with ADF
Mark Kromer
 
Data quality patterns in the cloud with ADF
Data quality patterns in the cloud with ADFData quality patterns in the cloud with ADF
Data quality patterns in the cloud with ADF
Mark Kromer
 
Azure Data Factory Data Flows Training v005
Azure Data Factory Data Flows Training v005Azure Data Factory Data Flows Training v005
Azure Data Factory Data Flows Training v005
Mark Kromer
 
ADF Mapping Data Flows Level 300
ADF Mapping Data Flows Level 300ADF Mapping Data Flows Level 300
ADF Mapping Data Flows Level 300
Mark Kromer
 
ADF Mapping Data Flows Training V2
ADF Mapping Data Flows Training V2ADF Mapping Data Flows Training V2
ADF Mapping Data Flows Training V2
Mark Kromer
 
ADF Mapping Data Flows Training Slides V1
ADF Mapping Data Flows Training Slides V1ADF Mapping Data Flows Training Slides V1
ADF Mapping Data Flows Training Slides V1
Mark Kromer
 
ADF Mapping Data Flow Private Preview Migration
ADF Mapping Data Flow Private Preview MigrationADF Mapping Data Flow Private Preview Migration
ADF Mapping Data Flow Private Preview Migration
Mark Kromer
 
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudSQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
Mark Kromer
 
Azure Data Factory Data Flow Limited Preview for January 2019
Azure Data Factory Data Flow Limited Preview for January 2019Azure Data Factory Data Flow Limited Preview for January 2019
Azure Data Factory Data Flow Limited Preview for January 2019
Mark Kromer
 
Microsoft Azure Data Factory Data Flow Scenarios
Microsoft Azure Data Factory Data Flow ScenariosMicrosoft Azure Data Factory Data Flow Scenarios
Microsoft Azure Data Factory Data Flow Scenarios
Mark Kromer
 
Azure Data Factory Data Flow Preview December 2019
Azure Data Factory Data Flow Preview December 2019Azure Data Factory Data Flow Preview December 2019
Azure Data Factory Data Flow Preview December 2019
Mark Kromer
 
Azure Data Factory for Azure Data Week
Azure Data Factory for Azure Data WeekAzure Data Factory for Azure Data Week
Azure Data Factory for Azure Data Week
Mark Kromer
 
Azure Data Factory for Redmond SQL PASS UG Sept 2018
Azure Data Factory for Redmond SQL PASS UG Sept 2018Azure Data Factory for Redmond SQL PASS UG Sept 2018
Azure Data Factory for Redmond SQL PASS UG Sept 2018
Mark Kromer
 
Microsoft Build 2018 Analytic Solutions with Azure Data Factory and Azure SQL...
Microsoft Build 2018 Analytic Solutions with Azure Data Factory and Azure SQL...Microsoft Build 2018 Analytic Solutions with Azure Data Factory and Azure SQL...
Microsoft Build 2018 Analytic Solutions with Azure Data Factory and Azure SQL...
Mark Kromer
 
Microsoft Azure BI Solutions in the Cloud
Microsoft Azure BI Solutions in the CloudMicrosoft Azure BI Solutions in the Cloud
Microsoft Azure BI Solutions in the Cloud
Mark Kromer
 

More from Mark Kromer (19)

Data cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flowsData cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flows
 
Data cleansing and data prep with synapse data flows


Azure Data Factory Data Flow Performance Tuning 101

  • 1. Azure Data Factory: Mapping Data Flows Performance Tuning Data Flows v001
  • 2. Sample Timings 1: Scenario 1
      • Source: Delimited Text Blob Store
      • Sink: Azure SQL DB
      • File size: 421 MB, 74 columns, 887k rows
      • Transforms: Single derived column to mask 3 fields
      • Time: 4 mins end-to-end using memory optimized 80-core debug Azure IR
      • Recommended settings: Current partitioning used throughout
  • 3. Sample Timings 2: Scenario 2
      • Source: Azure SQL DB Table
      • Sink: Azure SQL DB Table
      • Table size: 74 columns, 887k rows
      • Transforms: Single derived column to mask 3 fields
      • Time: 3 mins end-to-end using memory optimized 80-core debug Azure IR
      • Recommended settings: Source partitioning on SQL DB Source, current partitioning on Derived Column and Sink
  • 4. Sample Timings 3: Scenario 3
      • Source: Delimited Text Blob Store
      • Sink: Delimited Text Blob Store
      • Table size: 74 columns, 887k rows
      • Transforms: Single derived column to mask 3 fields
      • Time: 2 mins end-to-end using memory optimized 80-core debug Azure IR
      • Recommended settings: Leaving default/current partitioning throughout allows ADF to scale partitions up or down based on the size of the Azure IR (i.e. number of worker cores)
  • 5. File conversion Source->Sink property findings
      • Large data sizes should use more vcores (16+) with memory optimized or general purpose
      • Compute optimized does not improve performance in this scenario
      • CSV to Parquet conversion has 45% time overhead compared with CSV to CSV
      • CSV to JSON conversion has 24% time overhead compared with CSV to CSV
      • CSV to JSON performs better than CSV to Parquet even though it has a lot of data to write
      • CSV to Parquet has a slight lag because of time spent in decompression
      • Scaling vcores improves performance for both IO and computation
  • 6. File Conversion Timing (Compute type: General Purpose)
      • Dataset has 36 columns of string, integer, short, double
      • CSV dataset has 25 files with different file sizes
      • Performance improvement scales proportionately with the increase in vcores
      • Going from 8 to 64 vcores gives roughly an 8x performance increase
  • 8. Synapse DW Timing (Compute type: General Purpose)
      • Adding cores proportionally decreases the time it takes to process data into staging files for PolyBase
      • However, writing that data from Parquet into SQL tables using PolyBase takes a fairly static amount of time
  • 10. Window / Aggregate Timing (Compute type: General Purpose)
      • Performance improvement scales proportionately with the increase in vcores
      • Going from 8 to 64 vcores gives roughly a 5x performance increase
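The two scaling figures quoted in the timing slides (about 8x for file conversion and about 5x for Window / Aggregate when going from 8 to 64 vcores) can be compared as a fraction of ideal linear scaling. The following is a minimal sketch; the function name `scaling_efficiency` is illustrative and does not come from the deck.

```python
def scaling_efficiency(speedup: float, base_cores: int, scaled_cores: int) -> float:
    """Return the observed speedup as a fraction of ideal linear speedup."""
    if base_cores <= 0 or scaled_cores <= 0:
        raise ValueError("core counts must be positive")
    ideal_speedup = scaled_cores / base_cores
    return speedup / ideal_speedup


# Figures quoted in the deck for an 8 -> 64 vcore scale-up.
file_conversion = scaling_efficiency(8.0, base_cores=8, scaled_cores=64)
window_aggregate = scaling_efficiency(5.0, base_cores=8, scaled_cores=64)

print(f"File conversion:    {file_conversion:.1%} of linear")
print(f"Window / Aggregate: {window_aggregate:.1%} of linear")
```

Read this way, file conversion scales roughly linearly, while Window / Aggregate recovers about five eighths of the added capacity.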
  • 11. Transformation Timings (Compute type: General Purpose)
      • Transformation recommendations:
          • When ranking data across the entire dataset, use the Rank transformation instead of Window with rank()
          • When using rowNumber() in Window to add a unique row counter to each row across the entire dataset, use the Surrogate Key transformation instead
  • 12. TPCH Timings Compute type: General Purpose TPCH CSV in ADLS Gen 2
  • 13. Azure Data Factory Data Flow Performance * Includes cold cluster start-up time
  • 14. Azure Synapse Data Flow Performance * Includes cold cluster start-up time
  • 15. Identifying bottlenecks
      1. Cluster startup time: For sequential executions, you can lower the cluster startup time by setting a TTL in the Azure IR
      2. Sink processing time: Total time to process the stream from source to sink. When you click on the Sink, there is also a post-processing time that shows how long Spark spent on partition and job clean-up. Writing to a single file and slow database connections will increase this time
      3. Source read time: Shows how long it took to read data from the source. Optimize with different source partition strategies
      4. Transformation stage time: Shows bottlenecks in your transformation logic. With larger general purpose and memory optimized IRs, most of these operations occur in memory in data frames and are usually the fastest operations in your data flow
  • 16. File Partitioning
      • Maintain current partitioning
      • Avoid output to single file
      • For manual partitioning, use the number of cores from your Azure IR and multiply by 5
          • Example: to transform a series of files in your ADLS folders with a 32-core Azure IR, the number of partitions would be 32 x 5 = 160 partitions
      • If you know the data well enough to have high-cardinality columns, use those columns as the Hash partition
      • If you do not know the data patterns very well, use Round Robin
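The manual partitioning rule of thumb on this slide (Azure IR cores multiplied by 5) is simple arithmetic. A small sketch follows; the helper name `manual_partition_count` and its parameter names are illustrative only.

```python
PARTITIONS_PER_CORE = 5  # multiplier quoted in the deck's rule of thumb


def manual_partition_count(ir_cores: int, per_core: int = PARTITIONS_PER_CORE) -> int:
    """Suggest a manual partition count from the Azure IR core count."""
    if ir_cores <= 0:
        raise ValueError("ir_cores must be positive")
    return ir_cores * per_core


# The slide's example: a 32-core Azure IR.
print(manual_partition_count(32))  # 160
```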
  • 17. Best practices - Sources
      • When reading from file-based sources, data flow automatically partitions the data based on size
          • ~128 MB per partition, evenly distributed
      • Using current partitioning will be fastest for file-based sources and for Synapse using PolyBase
      • Enable staging for Synapse
      • For Azure SQL DB, use Source partitioning on a column with high cardinality
          • Improves performance, but can saturate your source database
      • Reading can be limited by the I/O of your source
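As a rough illustration of the ~128 MB per partition figure, the sketch below estimates how many partitions a file-based source of a given size might be split into. Rounding up to a whole partition is an assumption made here, not something the deck states, and the function name is illustrative. The deck's own test runs also report different partition counts on larger clusters, so treat this as a lower-bound estimate rather than a prediction.

```python
import math

DEFAULT_PARTITION_MB = 128.0  # approximate per-partition size quoted in the deck


def estimated_file_partitions(total_mb: float, partition_mb: float = DEFAULT_PARTITION_MB) -> int:
    """Estimate the partition count for a file-based source of the given size.

    Assumes sizes are rounded up to whole partitions (an assumption of this sketch).
    """
    if total_mb < 0 or partition_mb <= 0:
        raise ValueError("sizes must be non-negative and partition size positive")
    return max(1, math.ceil(total_mb / partition_mb))


# The 421 MB delimited text file used in the deck's sample timings.
print(estimated_file_partitions(421))  # 4
```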
  • 18. Optimizing transformations
      • Each transformation has its own Optimize tab
          • Generally better not to alter it, because reshuffling is a relatively slow process
      • Reshuffling can occur if data is very skewed
          • One node has a disproportionate amount of data
      • For Joins, Exists and Lookups:
          • If you have many of these transforms, memory optimized greatly increases performance
          • Can 'Broadcast' if the data on one side is small
              • Rule of thumb: fewer than 50k rows
      • Use the Window transformation partitioned over segments of data
          • For rank() across the entire dataset, use the Rank transformation instead
          • For rowNumber() across the entire dataset, use the Surrogate Key transformation instead
      • Transformations that require reshuffling, like Sort, negatively impact performance
  • 19. Best practices - Debug (Data Preview)
      • Data preview is inside the data flow designer transformation properties
      • Uses row limits and sampling techniques to preview data from a small sample of data
      • Allows you to build and validate units of logic with samples of data in real time
      • You have control over the size of the data limits under Debug Settings
      • If you wish to test with larger datasets, set a larger compute size in the Azure IR when switching on "Debug Mode"
      • Data Preview is only a snapshot of data in memory from Spark data frames. This feature does not write any data, so the sink drivers are not utilized and not tested in this mode
  • 20. Best practices - Debug (Pipeline Debug)
      • Click the debug button to test your data flow inside of a pipeline
      • Default debug limits the execution runtime, so you will want to limit data sizes
      • Sampling can be applied here as well by using the "Enable Sampling" option in each Source
      • Use the debug button option "use activity IR" when you wish to use a job execution compute environment
          • This option is good for debugging with larger datasets. It will not have the same execution timeout limit as the default debug setting
  • 21. Best practices - Sinks
      • SQL:
          • Disable indexes on the target with pre/post SQL scripts
          • Increase SQL capacity during pipeline execution
          • Enable staging when using Synapse
          • Use Source Partitioning on the Source under Optimize
          • Set the number of partitions based on the size of the IR
      • File-based sinks:
          • Using current partitioning allows Spark to create the output
          • Output to a single file is a slow operation
              • Often unnecessary for whoever is consuming the data
          • Can set naming patterns or use data in a column
          • Any reshuffling of data is slow
      • Cosmos DB:
          • Set throughput and batch size to meet performance requirements
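The advice to disable indexes on a SQL target with pre/post SQL scripts could be prepared along the following lines. This is a sketch under stated assumptions: the helper `index_toggle_scripts`, the table `dbo.Sales`, and the index name `IX_Sales_Date` are all hypothetical, and the generated T-SQL is meant for nonclustered indexes only, since disabling a clustered index makes the table inaccessible.

```python
from typing import List, Tuple


def quote_ident(name: str) -> str:
    """Bracket-quote a SQL Server identifier, escaping any closing brackets."""
    return "[" + name.replace("]", "]]") + "]"


def index_toggle_scripts(schema: str, table: str, index_names: List[str]) -> Tuple[str, str]:
    """Build (pre_sql, post_sql): disable indexes before the load, rebuild after."""
    target = f"{quote_ident(schema)}.{quote_ident(table)}"
    pre = [f"ALTER INDEX {quote_ident(ix)} ON {target} DISABLE;" for ix in index_names]
    post = [f"ALTER INDEX {quote_ident(ix)} ON {target} REBUILD;" for ix in index_names]
    return "\n".join(pre), "\n".join(post)


# Hypothetical table and nonclustered index names, for illustration only.
pre_sql, post_sql = index_toggle_scripts("dbo", "Sales", ["IX_Sales_Date"])
print(pre_sql)
print(post_sql)
```

The two strings would then be pasted into the sink's pre and post SQL script settings.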
  • 22. Azure Integration Runtime
      • Data Flows use JIT compute to minimize running expensive clusters when they are mostly idle
          • Generally more economical, but each cluster takes ~4 minutes to spin up
      • The IR specifies what cluster type and core count to use
          • Memory optimized is best; compute optimized doesn't generally work for production workloads
      • When running sequential jobs, utilize Time to Live (TTL) to reuse the cluster between executions
          • Keeps compute resources alive for TTL minutes after execution for a new job to use
          • Maximum one job per cluster
          • Reduces job startup latency to ~1.5 minutes
      • Rule of thumb: start small and scale up
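To see what the TTL setting might save across a chain of sequential jobs, the sketch below uses the two startup figures from this slide (~4 minutes cold, ~1.5 minutes when a cluster is reused). It assumes every follow-on job starts within the TTL window and that only one job runs at a time; the function name is illustrative.

```python
COLD_START_MIN = 4.0  # approximate cold cluster spin-up quoted in the deck
WARM_START_MIN = 1.5  # approximate startup latency with TTL quoted in the deck


def total_startup_minutes(job_count: int, use_ttl: bool) -> float:
    """Estimate total startup overhead for a chain of sequential data flow jobs.

    Assumes each follow-on job begins before the TTL expires (assumption of this sketch).
    """
    if job_count <= 0:
        return 0.0
    if not use_ttl:
        return job_count * COLD_START_MIN
    # The first job still pays the cold start; later jobs reuse the warm cluster.
    return COLD_START_MIN + (job_count - 1) * WARM_START_MIN


jobs = 5
print(total_startup_minutes(jobs, use_ttl=False))  # 20.0
print(total_startup_minutes(jobs, use_ttl=True))   # 10.0
```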
  • 23. Azure IR - General Purpose
      • This was General Purpose 4+4, the default auto resolve Azure IR
      • For prod workloads, GP is usually sufficient at >= 16 cores
      • You get 1 driver and 1 worker node, both with 4 vcores
      • Good for debugging, testing, and many production workloads
      • Tested with 887k row CSV file with 74 columns
      • Default partitioning
      • Spark chose 4 partitions
      • Cluster startup time: 4.5 mins
      • Sink IO writing: 46s
      • Transformation time: 42s
      • Sink post-processing time: 45s
  • 24. Azure IR - Compute Optimized
      • Compute Optimized is intended for smaller workloads
      • 8+8 is the smallest CO option; you get 1 driver and 2 workers
      • Not suitable for large production workloads
      • Tested with 887k row CSV file with 74 columns
      • Default partitioning
      • Spark chose 8 partitions
      • Cluster startup time: 4.5 mins
      • Sink IO writing: 20s
      • Transformation time: 35s
      • Sink post-processing time: 40s
      • More worker nodes gave us more partitions and better perf than General Purpose
  • 25. Azure IR - Memory Optimized
      • Memory Optimized is well suited for large production workloads that need reliability with many aggregates, lookups, and joins
      • 64+16 gives you 16 vcores for the driver and 64 across worker nodes
      • Tested with 887k row CSV file with 74 columns
      • Default partitioning
      • Spark chose 64 partitions
      • Cluster startup time: 4.8 mins
      • Sink IO writing: 19s
      • Transformation time: 17s
      • Sink post-processing time: 40s
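The three Azure IR test runs above can be lined up side by side. The sketch below records the figures from slides 23 to 25 and sums the three post-startup stage times for each run. Adding the stages assumes they run one after another without overlap, which the deck does not state, so the totals are only a rough comparison. The class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IrRun:
    name: str
    partitions: int
    startup_min: float
    sink_io_s: int
    transform_s: int
    sink_post_s: int

    @property
    def processing_s(self) -> int:
        """Sum of the stage times after startup (assumes no overlap between stages)."""
        return self.sink_io_s + self.transform_s + self.sink_post_s


# Figures as reported on the deck's three Azure IR slides (887k-row, 74-column CSV).
RUNS = [
    IrRun("General Purpose 4+4", 4, 4.5, 46, 42, 45),
    IrRun("Compute Optimized 8+8", 8, 4.5, 20, 35, 40),
    IrRun("Memory Optimized 64+16", 64, 4.8, 19, 17, 40),
]

fastest = min(RUNS, key=lambda run: run.processing_s)

for run in RUNS:
    print(f"{run.name:<24} partitions={run.partitions:<3} "
          f"startup={run.startup_min} min  processing={run.processing_s}s")
print(f"Lowest summed stage time: {fastest.name}")
```

On these figures, cluster startup (4.5 to 4.8 minutes) is larger than the summed stage times in every run, which is consistent with the deck listing cluster startup first among bottlenecks.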