What’s coming in Apache Airflow 2.0
NYC Meetup
13th of May 2020
Who are we?
Tomek Urbaszek
Committer
Software Engineer @ Polidea
Jarek Potiuk
Committer, PMC member
Principal Software Engineer @ Polidea
Kamil Breguła
Committer
Software Engineer @ Polidea
Ash Berlin-Taylor
Committer, PMC member
Airflow Engineering Lead @ Astronomer
Daniel Imberman
Committer
Senior Data Engineer @ Astronomer
Kaxil Naik
Committer, PMC member
Senior Data Engineer @ Astronomer
High Availability
Scheduler High Availability
Goals:
● Performance - reduce task-to-task schedule "lag"
● Scalability - increase task throughput by horizontal scaling
● Resiliency - kill a scheduler and have tasks continue to be scheduled
Scheduler High Availability: Design
● Active-active model. Each scheduler does everything
● Uses the existing database - no new components needed, no extra operational burden
● Plan to use row-level locks in the DB
● Will re-evaluate if performance/stress testing shows the need
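The active-active design above depends on every scheduler being able to claim work without waiting on the others. Below is a toy in-memory sketch of that idea, not Airflow's implementation: threads stand in for schedulers, and a per-row `threading.Lock` stands in for a database row-level lock. `TaskRow`, `scheduler_loop` and `run_schedulers` are hypothetical names introduced for illustration.

```python
import threading


class TaskRow:
    """Stand-in for one task row in the metadata DB (hypothetical)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.state = "scheduled"
        self.claimed_by = None
        self.claims = 0
        self.lock = threading.Lock()  # plays the role of a row-level lock


def scheduler_loop(name, rows):
    """Every scheduler scans every row; locked rows are skipped, not waited on."""
    for row in rows:
        if not row.lock.acquire(blocking=False):
            continue  # another scheduler holds this row, so move on
        try:
            if row.state == "scheduled":
                row.state = "queued"
                row.claimed_by = name
                row.claims += 1
        finally:
            row.lock.release()


def run_schedulers(n_schedulers, n_tasks):
    """Run several schedulers concurrently over the same set of rows."""
    rows = [TaskRow(f"task_{i}") for i in range(n_tasks)]
    threads = [
        threading.Thread(target=scheduler_loop, args=(f"scheduler_{i}", rows))
        for i in range(n_schedulers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return rows


rows = run_schedulers(n_schedulers=3, n_tasks=50)
```

Because the state check happens while the row lock is held, each task is queued by exactly one scheduler, and any scheduler can stop without blocking the rest.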
Example HA configuration
Scheduler High Availability: Tasks
● Separate DAG parsing from DAG scheduling
This removes the tie between parsing and scheduling that is still present
● Run a mini scheduler in the worker after each task is completed
A.K.A. "fast follow": look at the immediate downstream tasks of what just finished and see what we can schedule
● Test it to destruction
This is a big architectural change; we need to be sure it works well.
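The "fast follow" step can be pictured as a small dependency check run right after a task finishes. This is a minimal sketch under assumed data structures (a dict of upstream task ids and a dict of task states); `fast_follow` is a hypothetical helper, not an Airflow function.

```python
def fast_follow(finished_task, upstream, states):
    """Return immediate downstream tasks of `finished_task` that are ready to run.

    upstream: dict mapping task id -> set of upstream task ids
    states:   dict mapping task id -> state string (missing means "not started")
    """
    ready = []
    for task, deps in upstream.items():
        if finished_task not in deps:
            continue  # not an immediate downstream task
        if states.get(task) is not None:
            continue  # already scheduled, running or finished
        if all(states.get(dep) == "success" for dep in deps):
            ready.append(task)
    return sorted(ready)


upstream = {
    "extract": set(),
    "transform": {"extract"},
    "audit": {"extract"},
    "load": {"transform", "audit"},
}
states = {"extract": "success"}
print(fast_follow("extract", upstream, states))  # ['audit', 'transform']
```

Only the direct children of the finished task are examined, which keeps the check cheap enough to run in the worker.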
DAG Serialization
Dag Serialization (Tasks Completed)
● Stateless Webserver: the Scheduler parses the DAG files, serializes them in JSON format and saves them in the Metadata DB.
● Lazy Loading of DAGs: instead of loading an entire DagBag when the Webserver starts, we only load each DAG on demand. This helps reduce Webserver startup time and memory. The reduction is notable with a large number of DAGs.
● Deploying new DAGs to Airflow no longer requires long restarts of the Webserver (if DAGs are baked into the Docker image)
● Option to use the JSON library of your choice for serialization (default is the built-in ‘json’ library)
● Paves the way for DAG Versioning & Scheduler HA
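The serialize-then-lazy-load flow can be sketched with the standard library alone. This is an illustration under assumptions, not Airflow's code: the `serialized_dag` table layout, `save_serialized_dag` and `LazyDagBag` are hypothetical, and an in-memory SQLite database stands in for the Metadata DB.

```python
import json
import sqlite3


def save_serialized_dag(conn, dag):
    """Scheduler side: store the DAG structure as JSON, keyed by dag_id."""
    conn.execute(
        "INSERT OR REPLACE INTO serialized_dag (dag_id, data) VALUES (?, ?)",
        (dag["dag_id"], json.dumps(dag, sort_keys=True)),
    )


class LazyDagBag:
    """Webserver side: nothing is loaded at startup, each DAG is read on demand."""

    def __init__(self, conn):
        self.conn = conn
        self._cache = {}

    def get_dag(self, dag_id):
        if dag_id not in self._cache:
            row = self.conn.execute(
                "SELECT data FROM serialized_dag WHERE dag_id = ?", (dag_id,)
            ).fetchone()
            if row is None:
                raise KeyError(dag_id)
            self._cache[dag_id] = json.loads(row[0])
        return self._cache[dag_id]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data TEXT)")
save_serialized_dag(conn, {"dag_id": "example", "tasks": ["extract", "load"]})

bag = LazyDagBag(conn)            # starts empty, so startup stays cheap
dag = bag.get_dag("example")      # first access reads and deserializes one DAG
```

The webserver side never touches the DAG files, which is what makes it stateless in this sketch.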
Dag Serialization (Tasks In-Progress for Airflow 2.0)
● Decouple DAG Parsing and Serializing from the scheduling loop.
● Scheduler will fetch DAGs from DB
● DAG will be parsed, serialized and saved to DB by a separate component “Serializer” / “Dag Parser”
● This should reduce the delay in scheduling tasks when the number of DAGs is large
DAG Versioning
Dag Versioning
Current Problem:
● Change in DAG structure affects viewing previous DagRuns too
● Not possible to view the code associated with a specific DagRun
Dag Versioning (Current Problem)
New task is shown in Graph View for older DAG Runs too with “no status”.
Dag Versioning
Goal:
● Support for storing multiple versions of Serialized DAGs
● Baked-in maintenance DAGs to clean up old DagRuns & associated Serialized DAGs
● Graph View shows the DAG associated with that DagRun
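One way to picture the versioning goal is to key each serialized DAG by a hash of its content and have every DagRun point at the version it ran with. The sketch below is a guess at the shape of such a store, not the planned design; `DagVersionStore` and its methods are hypothetical.

```python
import hashlib
import json


class DagVersionStore:
    """Keeps every distinct serialized DAG and remembers which one each run used."""

    def __init__(self):
        self.versions = {}     # version hash -> serialized DAG (JSON string)
        self.run_version = {}  # run id -> version hash

    def save(self, dag):
        data = json.dumps(dag, sort_keys=True)
        version = hashlib.sha256(data.encode("utf-8")).hexdigest()
        self.versions.setdefault(version, data)  # identical DAGs share a version
        return version

    def record_run(self, run_id, dag):
        self.run_version[run_id] = self.save(dag)

    def dag_for_run(self, run_id):
        return json.loads(self.versions[self.run_version[run_id]])


store = DagVersionStore()
store.record_run("run_1", {"dag_id": "example", "tasks": ["a", "b"]})
store.record_run("run_2", {"dag_id": "example", "tasks": ["a", "b", "c"]})

print(store.dag_for_run("run_1")["tasks"])  # ['a', 'b']
```

With this layout the Graph View for `run_1` would not show task `c`, which was added later.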
Performance Improvements
● Review each component of the scheduler in turn and optimize it.
● Perf kit
○ A set of tools that allows you to quickly check the performance of a component
Do you see a performance problem?
Results for DagFileProcessor
When we have one DAG file with 200 DAGs, each DAG with 5 tasks:
                  Before         After         Diff
Average time:     8080.246 ms    628.801 ms    -7452 ms (92%)
Queries count:    2692           5             -2687 (99%)
How to avoid regression?
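One common guard against this kind of regression is a test helper that fails when a code path issues more queries than expected. The sketch below uses SQLite's statement trace hook from the standard library; `assert_max_queries` and the two loader functions are hypothetical and only illustrate the idea on a toy schema.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def assert_max_queries(conn, limit):
    """Fail if the wrapped block runs more than `limit` SELECT statements."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        yield statements
    finally:
        conn.set_trace_callback(None)
    # Count only SELECTs so implicit transaction statements are ignored.
    selects = [s for s in statements if s.lstrip().upper().startswith("SELECT")]
    if len(selects) > limit:
        raise AssertionError(f"{len(selects)} queries executed, limit is {limit}")


def load_tasks_one_query_per_dag(conn):
    """Slow pattern: one query for the DAG list, then one more per DAG."""
    dag_ids = [row[0] for row in conn.execute("SELECT dag_id FROM dag")]
    return {
        dag_id: [
            row[0]
            for row in conn.execute(
                "SELECT task_id FROM task WHERE dag_id = ?", (dag_id,)
            )
        ]
        for dag_id in dag_ids
    }


def load_tasks_single_query(conn):
    """Fast pattern: fetch everything in one query and group in Python."""
    result = {}
    for dag_id, task_id in conn.execute("SELECT dag_id, task_id FROM task"):
        result.setdefault(dag_id, []).append(task_id)
    return result


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag (dag_id TEXT)")
conn.execute("CREATE TABLE task (dag_id TEXT, task_id TEXT)")
for d in range(200):
    conn.execute("INSERT INTO dag VALUES (?)", (f"dag_{d}",))
    for t in range(5):
        conn.execute("INSERT INTO task VALUES (?, ?)", (f"dag_{d}", f"task_{t}"))
conn.commit()

with assert_max_queries(conn, limit=5):
    tasks = load_tasks_single_query(conn)
```

Wrapping `load_tasks_one_query_per_dag` in the same guard raises `AssertionError`, so a change that reintroduces the per-DAG queries is caught by the test suite.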
REST API
API: follows the OpenAPI 3.0 specification
Outreachy interns
Ephraim Anierobi
Omair Khan
Dev/CI environment
CI environment
● Moving to GitHub Actions
○ Kubernetes Tests
○ Easier way to run Kubernetes tests locally
● Quarantined tests
○ Process of fixing the Quarantined tests
● Thinning CI image
○ Move integrations out of the image (hadoop etc)
● Automated System Tests (AIP-21)
GitHub Actions
Dev environment
● Breeze
○ unit testing
○ package building
○ release preparation
○ refreshing videos
● CodeSpaces integration
Backport Packages
● Bring Airflow 2.0 providers to 1.10.*
● Packages per-provider
● 58 packages (!)
● Python 3.6+ only(!)
● Automatically tested on CI
● Future
○ Automated System Tests (AIP-21)
○ Split Airflow (AIP-8)?
Automated release notes for backport packages
Support for Production Deployments
Production Image
● Alpha quality image is ready
● Gathering feedback
● Started with “bare image”
● Listening to use cases from users
● Integration with Docker Compose
● Integration with Helm Chart
KEDA Autoscaling
KubernetesExecutor
KubernetesExecutor vs. CeleryExecutor
KEDA Autoscaling
● Kubernetes Event-driven Autoscaler
● Scales based on # of RUNNING and QUEUED tasks in PostgreSQL backend
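The scaling rule above amounts to simple arithmetic on the number of active tasks. A minimal sketch, assuming each worker handles a fixed number of tasks concurrently; the `worker_concurrency` of 16, the `max_workers` cap and the function name are assumptions made for illustration, not values from the talk.

```python
import math


def desired_workers(task_states, worker_concurrency=16, max_workers=10):
    """Number of workers needed for the tasks that are RUNNING or QUEUED."""
    active = sum(1 for state in task_states if state in ("running", "queued"))
    return min(max_workers, math.ceil(active / worker_concurrency))


states = ["running"] * 20 + ["queued"] * 15 + ["success"] * 100
print(desired_workers(states))  # 3
```

With no running or queued tasks the result is 0, which is what lets the worker deployment scale down to nothing when idle.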
KEDA Queues
● Historically, queues were expensive and hard to allocate
● With KEDA, queues are free! (can have 100 queues)
● KEDA works with k8s deployments, so any customization you can make in a k8s pod, you can make in a k8s queue (worker size, GPU, secrets, etc.)
KubernetesExecutor Pod Templating from YAML/JSON
KubernetesExecutor Pod Templating
● In the KubernetesExecutor, users can currently modify certain parts of the pod, but many features of the k8s API are abstracted away
● We did this because at the time the Airflow community was not well acquainted with the k8s API
● We want to enable users to modify their worker pods to better match their use cases
KubernetesExecutor Pod Templating
● Users can now set the pod_template_file config in their airflow.cfg
● Given a path, the KubernetesExecutor will now parse the YAML file when launching a worker pod
● Huge thank you to @davlum for this feature
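To give a feel for what a pod template might contain, the sketch below builds a minimal Kubernetes Pod definition as a Python dict and renders it as JSON (which is also valid YAML). The top-level keys follow the standard Kubernetes Pod schema; the container name `base`, the image name and the memory limit are assumptions for illustration, and `render_pod_template` is a hypothetical helper.

```python
import json


def render_pod_template(image, memory_limit):
    """Render a minimal worker pod definition as JSON text."""
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "airflow-worker-template"},  # placeholder name
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "base",  # assumed name of the task container
                    "image": image,
                    "resources": {"limits": {"memory": memory_limit}},
                }
            ],
        },
    }
    return json.dumps(pod, indent=2)


# Hypothetical image name; write the result to the path set in pod_template_file.
template_text = render_pod_template("example.com/my-airflow:latest", "2Gi")
print(template_text)
```

Anything the Pod schema supports (node selectors, volumes, GPU limits, secrets) could be added to the same structure, which is the point of exposing the full template.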
Official Airflow Helm Chart
Helm Chart
● Donated by astronomer.io.
● This is the official helm chart that we have used both in our
enterprise and in our cloud offerings (thousands of deployments
of varying sizes)
● Users can turn on KEDA autoscaling through helm variables
Helm Chart
● New chart releases will be cut with each Airflow release
● Will be tested on the official Docker image
● Significantly simplifies the Airflow onboarding process for Kubernetes users
DAG authoring "sugar"
Functional DAGs
➔ PythonOperator boilerplate code
➔ Define order and data relation separately
➔ Writing jinja strings by hand
Functional DAGs
No PythonOperator boilerplate code!
Data and order relationships are the same!
And it works for all operators
Functional DAGs
AIP-31: Airflow functional DAG definition
➔ Easy way to convert a function to an operator
➔ Simplified way of writing DAGs
➔ Pluggable XCom Storage engine
Example: store and retrieve DataFrames on GCS or S3 buckets without boilerplate code
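The idea that data flow and task order become the same thing can be shown with a toy decorator. This is not the Airflow API: `Dag`, `Output` and `task` below are hypothetical stand-ins, and unlike a real scheduler the sketch runs each function immediately instead of deferring it.

```python
import functools


class Dag:
    """Collects dependency edges as (upstream task, downstream task) pairs."""

    def __init__(self):
        self.edges = set()


class Output:
    """Handle for a task's result; passing it to another task creates an edge."""

    def __init__(self, task_name, value):
        self.task_name = task_name
        self.value = value


def task(dag):
    """Turn a plain function into a 'task' that records its data dependencies."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args):
            resolved = []
            for arg in args:
                if isinstance(arg, Output):
                    dag.edges.add((arg.task_name, func.__name__))
                    resolved.append(arg.value)
                else:
                    resolved.append(arg)
            return Output(func.__name__, func(*resolved))

        return wrapper

    return decorator


dag = Dag()


@task(dag)
def extract():
    return [1, 2, 3]


@task(dag)
def double(values):
    return [v * 2 for v in values]


@task(dag)
def total(values):
    return sum(values)


result = total(double(extract()))
print(result.value)       # 12
print(sorted(dag.edges))  # [('double', 'total'), ('extract', 'double')]
```

No dependency is declared by hand: the edges fall out of how return values are passed between functions.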
Smaller changes
Other changes of note
● Connection IDs now need to be unique
Duplicate IDs were often confusing, and there are better ways to do load balancing
● Python 3 only
Python 2.7 unsupported upstream since Jan 1, 2020
● "RBAC" UI is now the only UI.
It was a config option before; now it is the only option. Charts/data profiling were removed due to security risks
Road to Airflow 2.0
When will Airflow 2.0 be available?
Airflow 2.0 – deprecate, but (try) not to remove
● Breaking changes should be avoided where we can – if the upgrade is too difficult, users will be left behind
● Before 2.0 we want to make sure we've fixed everything we want to remove or break.
● Release "backport providers" to make the new code layout available "now":
pip install apache-airflow-backport-providers-aws \
    apache-airflow-backport-providers-google
How to upgrade to 2.0 safely
● Install the latest 1.10 release
● Run airflow upgrade-check (doesn't exist yet)
● Fix any warnings
● Upgrade Airflow
Thank you!
Time for Q & A

What's coming in Airflow 2.0? - NYC Apache Airflow Meetup

  • 2. Who are we? Tomek Urbaszek Committer Software Engineer @ Polidea Jarek Potiuk Committer, PMC member Principal Software Engineer @ Polidea Kamil Breguła Committer Software Engineer @ Polidea Ash Berlin-Taylor Committer, PMC member Airflow Engineering Lead @ Astronomer Daniel Imberman Committer Senior Data Engineer @ Astronomer Kaxil Naik Committer, PMC member Senior Data Engineer @ Astronomer
  • 4. Scheduler High Availability
    Goals:
    ● Performance - reduce task-to-task schedule "lag"
    ● Scalability - increase task throughput by horizontal scaling
    ● Resiliency - kill a scheduler and have tasks continue to be scheduled
  • 5. Scheduler High Availability: Design
    ● Active-active model. Each scheduler does everything
    ● Uses the existing database - no new components needed, no extra operational burden
    ● Plan to use row-level locks in the DB
    ● Will re-evaluate if performance/stress testing shows the need
  • 7. Scheduler High Availability: Tasks
    ● Separate DAG parsing from DAG scheduling
      This removes the tie between parsing and scheduling that is still present
    ● Run a mini scheduler in the worker after each task is completed
      A.K.A. "fast follow". Look at the immediate downstream tasks of what just finished and see what we can schedule
    ● Test it to destruction
      This is a big architectural change, so we need to be sure it works well.
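The "fast follow" idea on slide 7 can be sketched in a few lines of plain Python. This is only an illustration of the selection logic, not Airflow's implementation: the function name `fast_follow`, the dictionary-based DAG layout, and the task names are all assumptions made for the example.

```python
from typing import Dict, List, Set


def fast_follow(
    finished: str,
    downstream: Dict[str, List[str]],
    upstream: Dict[str, List[str]],
    done: Set[str],
) -> List[str]:
    """Return the immediate downstream tasks of `finished` that are ready to run.

    A task is considered ready when every one of its upstream tasks is done.
    Only direct children of `finished` are inspected, mirroring the
    "look at immediate downstream tasks" description on the slide.
    """
    completed = done | {finished}
    return [
        task
        for task in downstream.get(finished, [])
        if task not in completed
        and all(parent in completed for parent in upstream.get(task, []))
    ]


# Hypothetical DAG: extract -> transform -> load, and audit -> load.
downstream = {"extract": ["transform"], "transform": ["load"], "audit": ["load"]}
upstream = {"transform": ["extract"], "load": ["transform", "audit"]}
```

With this layout, finishing `transform` schedules `load` only if `audit` has already completed; otherwise nothing is returned and the regular scheduler loop picks `load` up later.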
  • 10. Dag Serialization (Tasks Completed)
    ● Stateless Webserver: the Scheduler parses the DAG files, serializes them in JSON format and saves them in the Metadata DB.
    ● Lazy Loading of DAGs: instead of loading an entire DagBag when the Webserver starts, we only load each DAG on demand. This helps reduce Webserver startup time and memory. The reduction is notable with a large number of DAGs.
    ● Deploying new DAGs to Airflow no longer requires long restarts of the webserver (if DAGs are baked into the Docker image)
    ● Feature to use the “JSON” library of choice for Serialization (default is the inbuilt ‘json’ library)
    ● Paves the way for DAG Versioning & Scheduler HA
  • 11. Dag Serialization (Tasks In-Progress for Airflow 2.0)
    ● Decouple DAG Parsing and Serializing from the scheduling loop.
    ● Scheduler will fetch DAGs from the DB
    ● DAGs will be parsed, serialized and saved to the DB by a separate component, the “Serializer” / “Dag Parser”
    ● This should reduce the delay in scheduling tasks when the number of DAGs is large
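The serialization flow described on slides 10 and 11 can be illustrated with a toy example. This is a minimal sketch under stated assumptions: the JSON layout, the function names, and the in-memory `metadata_db` dictionary (standing in for the Metadata DB) are hypothetical and do not reflect Airflow's actual schema.

```python
import json
from typing import Dict, List

# Stand-in for the metadata DB table that holds serialized DAGs (assumption).
metadata_db: Dict[str, str] = {}


def serialize_dag(dag_id: str, tasks: Dict[str, List[str]]) -> str:
    """Serialize a DAG structure (task_id -> downstream task ids) to JSON."""
    payload = {
        "dag_id": dag_id,
        "tasks": [
            {"task_id": task_id, "downstream": sorted(children)}
            for task_id, children in sorted(tasks.items())
        ],
    }
    return json.dumps(payload, sort_keys=True)


def save_dag(dag_id: str, tasks: Dict[str, List[str]]) -> None:
    """Parser side: write the serialized DAG to the (fake) metadata DB."""
    metadata_db[dag_id] = serialize_dag(dag_id, tasks)


def load_dag(dag_id: str) -> dict:
    """Webserver side: lazily load a single DAG on demand, without parsing files."""
    return json.loads(metadata_db[dag_id])


save_dag("etl", {"extract": ["transform"], "transform": ["load"], "load": []})
```

The point of the sketch is that the reader (`load_dag`) needs only the stored JSON for one DAG, never the DAG files themselves, which is what makes a stateless webserver and on-demand loading possible.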
  • 13. Dag Versioning
    Current Problem:
    ● Change in DAG structure affects viewing previous DagRuns too
    ● Not possible to view the code associated with a specific DagRun
  • 15. Dag Versioning (Current Problem)
    A new task is shown in Graph View for older DAG Runs too, with “no status”.
  • 16. Dag Versioning
    Current Problem:
    ● Change in DAG structure affects viewing previous DagRuns too
    ● Not possible to view the code associated with a specific DagRun
    Goal:
    ● Support for storing multiple versions of Serialized DAGs
    ● Baked-In Maintenance DAGs to clean up old DagRuns & associated Serialized DAGs
    ● Graph View shows the DAG associated with that DagRun
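One way to picture the versioning goal on slide 16 is a store that keeps every serialized version of a DAG and lets each DagRun point at the version it ran with. The sketch below is purely illustrative: the class `VersionedDagStore`, its methods, and the use of a SHA-256 content hash as the version key are assumptions, not the design the slides commit to.

```python
import hashlib
import json
from typing import Dict, Tuple


class VersionedDagStore:
    """Toy store keeping multiple serialized versions per DAG (hypothetical)."""

    def __init__(self) -> None:
        # (dag_id, version_hash) -> serialized JSON
        self._versions: Dict[Tuple[str, str], str] = {}
        # run_id -> (dag_id, version_hash) the run was executed with
        self._run_version: Dict[str, Tuple[str, str]] = {}

    def save(self, dag_id: str, structure: dict) -> str:
        """Store a serialized DAG and return its content hash as the version."""
        blob = json.dumps(structure, sort_keys=True)
        version = hashlib.sha256(blob.encode("utf-8")).hexdigest()
        self._versions[(dag_id, version)] = blob
        return version

    def record_run(self, run_id: str, dag_id: str, version: str) -> None:
        """Remember which DAG version a given run used."""
        self._run_version[run_id] = (dag_id, version)

    def structure_for_run(self, run_id: str) -> dict:
        """Return the DAG structure as it was when the run executed."""
        return json.loads(self._versions[self._run_version[run_id]])


store = VersionedDagStore()
v1 = store.save("etl", {"tasks": ["extract", "load"]})
store.record_run("run_1", "etl", v1)
v2 = store.save("etl", {"tasks": ["extract", "transform", "load"]})
store.record_run("run_2", "etl", v2)
```

Here `run_1` still resolves to the two-task structure after `transform` is added, which is the behaviour the Graph View goal asks for.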
  • 18. Performance improvements
    ● Review each component of the scheduler in turn and optimize it.
    ● Perf kit
      ○ A set of tools that allows you to quickly check the performance of a component
  • 19. Do you see a performance problem?
  • 21. Results for DagFileProcessor
    When we have one DAG file with 200 DAGs, each DAG with 5 tasks:

                      Before         After        Diff
    Average time:     8080.246 ms    628.801 ms   -7452 ms (92%)
    Queries count:    2692           5            -2687 (99%)
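The differences and percentages on slide 21 can be re-derived from the Before and After columns. The helper below is a small sketch; the function name `improvement` is an assumption introduced for this example, and the inputs are the figures quoted on the slide.

```python
from typing import Tuple


def improvement(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute difference, percentage reduction) between two measurements."""
    diff = after - before
    percent = (before - after) / before * 100
    return diff, percent


# Figures taken from the slide.
time_diff, time_pct = improvement(8080.246, 628.801)
query_diff, query_pct = improvement(2692, 5)

print(f"Average time: {time_diff:.3f} ms ({time_pct:.1f}% reduction)")
print(f"Queries count: {query_diff} ({query_pct:.1f}% reduction)")
```

The time difference works out to roughly -7451.4 ms, within rounding of the -7452 ms shown on the slide, and the reductions come to about 92.2% and 99.8%, which the slide reports as 92% and 99%.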
  • 22. How to avoid regression?
  • 24. API: follows the OpenAPI 3.0 specification
    Outreachy interns: Ephraim Anierobi, Omair Khan
  • 26. CI environment
    ● Moving to GitHub Actions
      ○ Kubernetes Tests
      ○ Easier way to test Kubernetes Tests locally
    ● Quarantined tests
      ○ Process of fixing the Quarantined tests
    ● Thinning CI image
      ○ Move integrations out of the image (hadoop etc)
    ● Automated System Tests (AIP-21)
  • 28. Dev environment
    ● Breeze
      ○ unit testing
      ○ package building
      ○ release preparation
      ○ refreshing videos
    ● CodeSpaces integration
  • 29. Backport Packages
    ● Bring Airflow 2.0 providers to 1.10.*
    ● Packages per-provider
    ● 58 packages (!)
    ● Python 3.6+ only (!)
    ● Automatically tested on CI
    ● Future
      ○ Automated System Tests (AIP-21)
      ○ Split Airflow (AIP-8)?
  • 30. Automated release notes for backport packages
  • 32. Production Image
    ● Alpha quality image is ready
    ● Gathering feedback
    ● Started with “bare image”
    ● Listening to use cases from users
    ● Integration with Docker Compose
    ● Integration with Helm Chart
  • 39. KEDA Autoscaling
    ● Kubernetes Event-driven Autoscaler
    ● Scales based on # of RUNNING and QUEUED tasks in PostgreSQL backend
  • 43. KEDA Queues
    ● Historically Queues were expensive and hard to allocate
    ● With KEDA, queues are free! (can have 100 queues)
    ● KEDA works with k8s deployments so any customization you can make in a k8s pod, you can make in a k8s queue (worker size, GPU, secrets, etc.)
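The scaling rule on slide 39 (worker count driven by the number of RUNNING and QUEUED tasks) can be sketched as a simple calculation. This is an illustration only: the function name `desired_workers`, the `tasks_per_worker` value of 16, and the `max_workers` cap are assumptions for the example and are not taken from the slides.

```python
import math


def desired_workers(
    running: int,
    queued: int,
    tasks_per_worker: int = 16,  # assumed per-worker concurrency
    max_workers: int = 10,       # assumed upper bound on replicas
) -> int:
    """Return how many workers are needed for the current task load.

    Scales to zero when there is nothing running or queued, and never
    exceeds `max_workers`.
    """
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    needed = math.ceil((running + queued) / tasks_per_worker)
    return min(max_workers, needed)
```

In a real deployment the task counts would come from a query against the metadata database, and one such rule could be evaluated per queue, which is what makes per-queue worker pools cheap.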
  • 45. KubernetesExecutor Pod Templating
    ● In the K8sExecutor currently, users can modify certain parts of the pod, but many features of the k8s API are abstracted away
    ● We did this because at the time the Airflow community was not well acquainted with the k8s API
    ● We want to enable users to modify their worker pods to better match their use-cases
  • 46. KubernetesExecutor Pod Templating
    ● Users can now set the pod_template_file config in their airflow.cfg
    ● Given a path, the KubernetesExecutor will now parse the YAML file when launching a worker pod
    ● Huge thank you to @davlum for this feature
  • 48. Helm Chart
    ● Donated by astronomer.io.
    ● This is the official helm chart that we have used both in our enterprise and in our cloud offerings (thousands of deployments of varying sizes)
    ● Users can turn on KEDA autoscaling through helm variables
  • 49. Helm Chart
    ● Chart will cut new releases with each Airflow release
    ● Will be tested on the official Docker image
    ● Significantly simplifies the Airflow onboarding process for Kubernetes users
  • 51. Functional DAGs
    ➔ PythonOperator boilerplate code
    ➔ Define order and data relation separately
    ➔ Writing jinja strings by hand
  • 52. Functional DAGs
    ➔ No PythonOperator boilerplate code!
    ➔ Data and order relationships are the same!
    ➔ And it works for all operators
  • 53. Functional DAGs
    AIP-31: Airflow functional DAG definition
    ➔ Easy way to convert a function to an operator
    ➔ Simplified way of writing DAGs
    ➔ Pluggable XCom Storage engine
      Example: store and retrieve DataFrames on GCS or S3 buckets without boilerplate code
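The functional style described on slides 51 to 53 can be illustrated without Airflow at all. The sketch below is not the AIP-31 API: the `task` decorator and `Task` class here are hypothetical stand-ins that only show the idea that calling a decorated function builds a node, and that passing one node's result to another defines both the data flow and the execution order.

```python
import functools
from typing import Any, Callable, Dict, List, Optional, Tuple


class Task:
    """A node in the toy DAG: a function plus the arguments it will receive."""

    def __init__(self, func: Callable[..., Any], args: Tuple[Any, ...]) -> None:
        self.func = func
        self.args = args
        # Any argument that is itself a Task becomes an upstream dependency.
        self.upstream: List["Task"] = [a for a in args if isinstance(a, Task)]

    def run(self, cache: Optional[Dict[int, Any]] = None) -> Any:
        """Run upstream tasks first, then this one, passing results along."""
        cache = {} if cache is None else cache
        if id(self) not in cache:
            resolved = [a.run(cache) if isinstance(a, Task) else a for a in self.args]
            cache[id(self)] = self.func(*resolved)
        return cache[id(self)]


def task(func: Callable[..., Any]) -> Callable[..., Task]:
    """Turn a plain function into a Task factory (hypothetical decorator)."""

    @functools.wraps(func)
    def build(*args: Any) -> Task:
        return Task(func, args)

    return build


@task
def extract() -> List[int]:
    return [1, 2, 3]


@task
def double(values: List[int]) -> List[int]:
    return [v * 2 for v in values]


@task
def total(values: List[int]) -> int:
    return sum(values)


# Data dependencies and execution order are declared in one expression.
pipeline = total(double(extract()))
```

No operator classes or hand-written template strings appear in the pipeline definition; the dependency of `total` on `double` follows from the function call itself.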
  • 55. Other changes of note
    ● Connection IDs now need to be unique
      It was often confusing, and there are better ways to do load balancing
    ● Python 3 only
      Python 2.7 unsupported upstream since Jan 1, 2020
    ● "RBAC" UI is now the only UI.
      Was a config option before, now the only option. Charts/data profiling removed due to security risks
  • 57. When will Airflow 2.0 be available?
  • 58. Airflow 2.0 – deprecate, but (try) not to remove
    ● Breaking changes should be avoided where we can: if the upgrade is too difficult, users will be left behind
    ● Release "backport providers" to make the new code layout available "now":
      pip install apache-airflow-backport-providers-aws apache-airflow-backport-providers-google
    ● Before 2.0 we want to make sure we've fixed everything we want to remove or break.
  • 59. How to upgrade to 2.0 safely
    ● Install the latest 1.10 release
    ● Run airflow upgrade-check (doesn't exist yet)
    ● Fix any warnings
    ● Upgrade Airflow