Building Data Pipelines Using
Apache Airflow
PURNA CHANDER RAO . KATHULA
Agenda
1. What is a Data Pipeline?
2. Components of a Data Pipeline
3. Traditional Data Flows and Issues
4. Introduction to Apache Airflow
5. Features
6. Core Components
7. Key Concepts
8. Demo
What is a Data Pipeline
A data pipeline is a set of data processing elements connected in series, where the
output of one element is the input of the next one. The elements of the pipeline are
often executed in parallel or in time-sliced fashion. The name ‘pipeline’ comes from
a rough analogy with physical plumbing.
● Modern data pipelines are used to ingest & process vast volumes of data in
real time.
● Real time processing of data as opposed to traditional ETL / batch modes.
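The "connected in series" idea can be sketched with plain Python generators. This is only an illustration, not Airflow code; the stage names (`ingest`, `keep_valid`, `to_upper`) and the sample records are made up for the example.

```python
from typing import Iterable, Iterator


def ingest(lines: Iterable[str]) -> Iterator[str]:
    """Stage 1: yield raw records one at a time, stripped of whitespace."""
    for line in lines:
        yield line.strip()


def keep_valid(records: Iterable[str]) -> Iterator[str]:
    """Stage 2: filter out empty records."""
    for record in records:
        if record:
            yield record


def to_upper(records: Iterable[str]) -> Iterator[str]:
    """Stage 3: transform each surviving record."""
    for record in records:
        yield record.upper()


# Each stage consumes the previous stage's output, like sections of pipe.
raw = ["alpha\n", "\n", "beta\n"]
result = list(to_upper(keep_valid(ingest(raw))))
print(result)
```

Because generators are lazy, each record flows through all three stages before the next one is read, which is a small-scale version of the time-sliced execution described above.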
Common Components of a data pipeline
Typical parts of a data pipeline
● Data Ingestion
● Filtering
● Processing
● Querying of the data
● Data warehousing
● Reprocessing capabilities
Typical Requirements
● Scalability
○ Billions of messages and terabytes of data, 24/7
● Availability and redundancy
○ Across physical locations
● Latency
○ Real time / Batch
● Platform support
Traditional data flow model
[Diagram labels: web clients, apps, public REST API, microservices, reporting, billing system, analytics, OLTP DB, report DB, metrics DB]
$ curl api.example.com | filter.py | psql
Messy data flow model (6-12 months later)
[Diagram labels: web clients, apps, public REST API, microservices, reporting, billing system, analytics, OLTP DB, report DB, metrics DB, external cloud, doc store, DWH]
Apache Airflow Introduction
● Apache Airflow is a way to programmatically author, schedule, and monitor
workflows.
● Developed in Python and open source.
● Workflows are configured as Python code.
● Because it uses Python as the programming language, we can enrich the quality
of data pipelines using Python's libraries.
● Has multiple hooks and operators for handling big data ecosystem components
(Hive, Sqoop, etc.) and DB hooks for RDBMS and NoSQL databases.
Features
● Cron replacement.
● Fault tolerant.
● Dependency rules.
● Beautiful UI.
● Handles task failures.
● Python code.
● Reports and alerts on failures.
● Monitor your pipelines from the web UI.
● And more.
Core Components
● Webserver - serves the Apache Airflow web UI.
● Scheduler - responsible for scheduling your jobs.
● Executor - bound to the scheduler; determines the worker process that
executes each scheduled task (SequentialExecutor, LocalExecutor, CeleryExecutor).
● Worker - the process that executes the task, as determined by the executor.
● Metadata database - the database where all metadata related to your jobs is stored.
Key Concepts
● DAG - Directed Acyclic Graph: the graph representation of your data
pipeline.
● Operator - describes a single task in your data pipeline.
● Task - an instance of an operator.
● Workflow - DAG + Operator + Task
Overview
● What is a DAG?
● What is an Operator?
● Operator relationships and bitshift composition
● How does the scheduler work?
● What is a Workflow?
DAG (Directed Acyclic Graph)
A simple DAG, where we could imagine that:
Task 1 - downloads the data.
Task 2 - sends the data for processing.
Task 3 - monitors the data processing.
Task 4 - generates the report.
Task 5 - sends the email to the DAG owner or intended recipients.
[Diagram: Task 1 through Task 5 connected by directed edges, with no cycles]
Not a DAG
[Diagram: the same five tasks, with an edge that forms a cycle]
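To make the DAG / not-a-DAG distinction concrete, here is a small sketch using Python's standard `graphlib` module (Python 3.9+) rather than Airflow itself. The task names mirror the five tasks listed above. The original diagrams are not preserved, so the cycle is an assumption: a dependency from the first task back on the last.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
dag = {
    "download": set(),
    "process": {"download"},
    "monitor": {"process"},
    "report": {"monitor"},
    "email": {"report"},
}

# A valid DAG always has at least one execution order.
order = list(TopologicalSorter(dag).static_order())
print(order)

# Assumed cycle: "download" now also waits for "email", closing the loop.
not_a_dag = dict(dag, download={"email"})
try:
    list(TopologicalSorter(not_a_dag).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

With a cycle, no task can ever be first, which is why a scheduler cannot run such a graph.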
Operators
While a DAG describes how to run a workflow, an operator defines what actually gets
done.
● An operator describes a single task in a workflow.
● Operators should be idempotent (they should produce the same result
irrespective of how many times they are executed).
● Tasks are retried automatically in case of failure.
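Idempotency and automatic retries work together: a task can only be retried safely if running it again leaves the same end state. The sketch below is plain Python, not Airflow's retry mechanism; `store`, `load_daily_total`, `run_with_retries`, and the simulated failure are all hypothetical names for illustration.

```python
import time
from typing import Callable, Dict

store: Dict[str, int] = {}


def load_daily_total(day: str, total: int) -> None:
    """Idempotent write: overwrite the key instead of appending or incrementing."""
    store[day] = total


def run_with_retries(task: Callable[[], None], retries: int = 2,
                     delay_seconds: float = 0.0) -> int:
    """Run task, retrying after any exception; return the number of attempts used."""
    attempts = 0
    while True:
        attempts += 1
        try:
            task()
            return attempts
        except Exception:
            if attempts > retries:
                raise  # retries exhausted, surface the failure
            time.sleep(delay_seconds)


calls = {"n": 0}


def flaky_load() -> None:
    """Fails on the first call to simulate a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("transient failure")
    load_daily_total("2020-01-01", 42)


attempts = run_with_retries(flaky_load)
load_daily_total("2020-01-01", 42)  # running the load again changes nothing
print(attempts, store)
```

Had the load appended a row or incremented a counter, the extra run would have corrupted the result; overwriting by key is what makes the retry harmless.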
Different Operators
● BashOperator
○ Executes a bash command
● PythonOperator
○ Calls an arbitrary Python function
● EmailOperator
○ Sends an email
● MySqlOperator, SqliteOperator, PostgresOperator
○ Execute SQL commands
● Custom operators, which inherit from BaseOperator
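The pattern behind custom operators and bitshift composition can be sketched without Airflow. `ToyOperator` below is a stand-in written for this example and is not Airflow's BaseOperator; it only shows the shape of the idea: a subclass overrides `execute`, and `a >> b` records `b` as downstream of `a`.

```python
from typing import List


class ToyOperator:
    """Stand-in for a base operator class; NOT Airflow's BaseOperator."""

    def __init__(self, task_id: str) -> None:
        self.task_id = task_id
        self.downstream: List["ToyOperator"] = []

    def execute(self) -> str:
        raise NotImplementedError("subclasses define what the task does")

    def __rshift__(self, other: "ToyOperator") -> "ToyOperator":
        """`a >> b` records b as downstream of a and returns b for chaining."""
        self.downstream.append(other)
        return other


class GreetOperator(ToyOperator):
    """A custom operator: it adds one parameter and overrides execute()."""

    def __init__(self, task_id: str, name: str) -> None:
        super().__init__(task_id)
        self.name = name

    def execute(self) -> str:
        return f"hello, {self.name}"


first = GreetOperator("greet_a", "a")
second = GreetOperator("greet_b", "b")
third = GreetOperator("greet_c", "c")

# Chained composition: first runs before second, second before third.
first >> second >> third
print([t.task_id for t in first.downstream], second.execute())
```

Returning `other` from `__rshift__` is what lets the chain read left to right in execution order.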
Types of Operators
There are 3 types of operators
● Action operators
○ Perform an action (BashOperator, PythonOperator, EmailOperator)
● Transfer operators
○ Move data from one system to another (PrestoToMySql operator, SFTP operator)
● Sensor operators
○ Wait for a condition to be met, such as data arriving at an expected location.
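A sensor is essentially a polling loop with a timeout. The sketch below shows that loop in plain Python; the function `wait_for` and its parameter names are chosen for this example (loosely echoing sensor terminology) and should not be read as Airflow's sensor API.

```python
import time
from typing import Callable


def wait_for(poke: Callable[[], bool], poke_interval: float = 0.01,
             timeout: float = 1.0) -> bool:
    """Call poke() repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if poke():
            return True
        time.sleep(poke_interval)
    return False


# Hypothetical condition: the "data" shows up on the third check.
checks = {"count": 0}


def data_has_arrived() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


arrived = wait_for(data_has_arrived)
timed_out = wait_for(lambda: False, timeout=0.05)
print(arrived, timed_out)
```

In a real pipeline the check would look for a file, a table partition, or an upstream job status, and downstream tasks would start only once it passes.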
Important Properties
● DAGs are defined in Python files placed in Airflow's DAG_FOLDER.
● dag_id - serves as a unique identifier for your DAG.
● description - the description of your DAG.
● start_date - tells when your DAG should start.
● schedule_interval - defines how often your DAG runs.
● depends_on_past - run a task only if its instance in the previous DAG run
completed successfully.
● default_args - a dictionary of variables to be used as constructor keyword
parameters when initializing operators.
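The properties above might be assembled as follows. To keep the example runnable without Airflow installed, it only builds the dictionaries; with Airflow available they would be passed to the DAG constructor. The concrete values (`example_pipeline`, `data-team`, the dates) are hypothetical, and the scheduling argument's name can differ between Airflow versions, so check the documentation for the release you run.

```python
from datetime import datetime, timedelta

# Applied to every operator in the DAG unless a task overrides them.
default_args = {
    "owner": "data-team",              # hypothetical owner
    "depends_on_past": False,          # do not wait on the previous run's task
    "retries": 1,                      # retry a failed task once
    "retry_delay": timedelta(minutes=5),
}

# Keyword arguments for the DAG itself.
dag_kwargs = {
    "dag_id": "example_pipeline",      # hypothetical; must be unique
    "description": "Illustrative DAG settings",
    "start_date": datetime(2020, 1, 1),
    "schedule_interval": "@daily",     # run once per day
    "default_args": default_args,
}

# With Airflow installed, these would be used as: DAG(**dag_kwargs)
print(sorted(dag_kwargs))
```

Keeping shared settings in `default_args` avoids repeating them on every operator in the file.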
Airflow WebUI
DAG Code
PythonOperator tasks (fetching_tweet.py)
PythonOperator tasks (cleansing_tweet.py)
Start the DAG (toggle the ON/OFF button)
Graph View of the DAG
Tree View of the DAG
Executing the DAG and checking the Hive tables
Check the Hive table count after the DAG run
Questions
THANK YOU

Apache airflow

  • 1. Building Data Pipelines Using Apache Airflow PURNA CHANDER RAO . KATHULA
  • 2. Agenda 1. What is a Data Pipeline ? 2. Components of a Data pipeline. 3. Traditional Data Flows and issues 4. Introduction to Apache Airflow 5. Features 6. Core Components 7. Key Components 8. Demo
  • 3. What is a Data Pipeline Data Pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of the pipeline are often executed in parallel or in time-sliced fashion. The name ‘pipeline’ come from a rough analogy with physical plumbing. ● Modern data pipelines are used to ingest & process vast volumes of data in real time. ● Real time processing of data as opposed to traditional ETL / batch modes.
  • 4. Common Components of a data pipeline
      Typical parts of a data pipeline
      ● Data ingestion
      ● Filtering
      ● Processing
      ● Querying of the data
      ● Data warehousing
      ● Reprocessing capabilities
      Typical requirements
      ● Scalability
        ○ Billions of messages and terabytes of data, 24/7
      ● Availability and redundancy
        ○ Across physical locations
      ● Latency
        ○ Real time / batch
      ● Platform support
  • 5. Traditional data flow model
      [Diagram; labels: Web clients, Reporting Apps, Public REST API, Billing System, Microservices, OLTP DB, Report DB, Metrics DB, Analytics]
      $ curl api.example.com | filter.py | psql
  • 6. Messy data flow model (6 / 12 months later)
      [Diagram; labels: Web clients, Reporting Apps, Public REST API, Billing System, Microservices, OLTP DB, Report DB, Metrics DB, Analytics, External cloud, Doc Store, DWH]
  • 7. Apache Airflow Introduction
      ● Apache Airflow is a way to programmatically author, schedule and monitor workflows.
      ● Developed in Python and open source.
      ● Workflows are configured as Python code.
      ● Because pipelines are written in Python, they can use Python's built-in libraries to improve their quality.
      ● Has multiple hooks and operators for handling Big Data ecosystem components (Hive, Sqoop, etc.) and DB hooks for RDBMS and other NoSQL databases.
  • 8. Features
      ● Cron replacement.
      ● Fault tolerant.
      ● Dependency rules.
      ● Beautiful UI.
      ● Handles task failures.
      ● Python code.
      ● Reports / alerts on failures.
      ● Monitor your pipelines from the web UI.
      ● And more.
  • 9. Core Components
      ● Webserver - the Apache Airflow web UI.
      ● Scheduler - responsible for scheduling your jobs.
      ● Executor - bound to the scheduler; determines the worker process that executes the scheduled task (SequentialExecutor, LocalExecutor, CeleryExecutor).
      ● Worker - the process that executes the task, as determined by the executor.
      ● Metadata database - the database where all the metadata related to your jobs is stored.
  • 10. Key Concepts
      ● DAG - Directed Acyclic Graph; the graphical representation of your data pipeline.
      ● Operator - describes a single task in your data pipeline.
      ● Task - an instance of an operator.
      ● Workflow - DAG + Operator + Task.
  • 11. Overview
      ● What is a DAG?
      ● What is an Operator?
      ● Operator relationships and bitshift composition
      ● How does the scheduler work?
      ● What is a Workflow?
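Bitshift composition means declaring task order with the `>>` and `<<` operators. The following is a toy sketch of the idea only: the `Task` class and the task names are hypothetical stand-ins, not Airflow's own operator classes.

```python
class Task:
    """Toy stand-in for an operator (hypothetical; not Airflow's BaseOperator)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []  # tasks that must run after this one

    def __rshift__(self, other):
        # a >> b: b runs after a. Returning `other` allows chains like a >> b >> c.
        self.downstream.append(other)
        return other

    def __lshift__(self, other):
        # a << b: a runs after b.
        other.downstream.append(self)
        return other


download = Task("download")
process = Task("process")
report = Task("report")

# Reads left to right: download, then process, then report.
download >> process >> report
```

Because `a >> b` returns `b`, the chain is evaluated as `(download >> process) >> report`, so each task ends up with exactly one downstream task.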
  • 12. DAG (Directed Acyclic Graph)
      A simple DAG, where we could imagine that:
      ● Task 1 - downloads the data.
      ● Task 2 - sends the data for processing.
      ● Task 3 - monitors the data processing.
      ● Task 4 - generates the report.
      ● Task 5 - sends the email to the DAG owner or intended recipients.
      [Diagram: Task 1, Task 2, Task 3, Task 4, Task 5]
  • 13. Not a DAG
      [Diagram: Task 1, Task 2, Task 3, Task 4, Task 5]
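The difference between a DAG and a graph that is not one can be checked mechanically: a DAG always has a valid execution order, while a graph with a cycle has none. Below is a minimal sketch using Python's standard `graphlib` module. The edges are assumptions, since the slide diagrams are not preserved in the transcript: the five tasks are treated as a linear chain, and the broken variant adds a hypothetical dependency from the first task back onto the last. The task names are illustrative.

```python
from graphlib import CycleError, TopologicalSorter

# Assumed edges: each task maps to the set of tasks it depends on.
linear_pipeline = {
    "download_data": set(),
    "send_for_processing": {"download_data"},
    "monitor_processing": {"send_for_processing"},
    "generate_report": {"monitor_processing"},
    "send_email": {"generate_report"},
}


def execution_order(dependencies):
    """Return a valid run order, or None if the graph contains a cycle."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError:
        return None


# Hypothetical broken variant: the first task now waits on the last one,
# which closes a loop, so the graph is no longer acyclic.
cyclic_pipeline = {**linear_pipeline, "download_data": {"send_email"}}

print(execution_order(linear_pipeline))
print(execution_order(cyclic_pipeline))
```

With a cycle there is no task that can safely run first, which is why a scheduler cannot work with such a graph.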
  • 14. Operators
      While a DAG describes how to run a workflow, an Operator defines what actually gets done.
      ● An Operator describes a single task in a workflow.
      ● Operators should be idempotent (they should produce the same result no matter how many times they are executed).
      ● Tasks are retried automatically in case of failure.
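Idempotency matters because of automatic retries: a task that fails halfway may already have written part of its output before it runs again. Here is a small sketch of that interaction, assuming a plain dict as a stand-in for a partitioned table. `load_partition`, `run_with_retries`, and the date key are hypothetical names for illustration, not Airflow APIs.

```python
def load_partition(store, run_date, rows):
    """Idempotent write: replaces the partition for run_date instead of appending."""
    store[run_date] = list(rows)


def run_with_retries(task, retries):
    """Call task(); if it raises, try again up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the failure


store = {}
attempts = {"count": 0}


def flaky_load():
    """Writes the data, then fails on the first attempt only."""
    attempts["count"] += 1
    load_partition(store, "2020-01-01", [1, 2, 3])
    if attempts["count"] == 1:
        raise RuntimeError("simulated failure after the write")


run_with_retries(flaky_load, retries=2)
```

The write happened twice, but because it overwrites rather than appends, the partition holds the rows exactly once. An append-based write would have duplicated them on the retry.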
  • 15. Different Operators
      ● Bash Operator
        ○ Executes a bash command
      ● Python Operator
        ○ Calls an arbitrary Python function
      ● Email Operator
        ○ Sends an email
      ● MySQL Operator, SQLite Operator, Postgres Operator
        ○ Execute SQL commands
      ● Custom operators
        ○ Inherit from BaseOperator
  • 16. Types of Operators
      There are 3 types of operators:
      ● Action Operators
        ○ Perform an action (Bash Operator, Python Operator, Email Operator)
      ● Transfer Operators
        ○ Move data from one system to another (PrestoToMySQL operator, SFTP operator)
      ● Sensor Operators
        ○ Wait for a condition to be met, such as data arriving at an expected location
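A sensor can be pictured as a polling loop: check a condition, sleep, and check again until the condition holds or a time limit passes. The sketch below is plain Python, not Airflow's sensor classes. The names `wait_for` and `poke` are illustrative, and returning `False` on timeout is a simplification chosen for this example.

```python
import time


def wait_for(poke, poke_interval=1.0, timeout=10.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call poke() until it returns True; return False once `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        if poke():
            return True       # condition met, downstream work can start
        if clock() >= deadline:
            return False      # gave up waiting
        sleep(poke_interval)  # wait before checking again


# Hypothetical condition: the data "arrives" on the third check.
checks = {"count": 0}


def data_has_arrived():
    checks["count"] += 1
    return checks["count"] >= 3


arrived = wait_for(data_has_arrived, poke_interval=0.01, timeout=5.0)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting. In practice the condition would check something external, such as whether a file exists.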
  • 17. Important Properties
      ● DAGs are defined in Python files placed in Airflow's DAG_FOLDER.
      ● dag_id - a unique identifier for your DAG.
      ● description - the description of your DAG.
      ● start_date - tells when your DAG should start.
      ● schedule_interval - defines how often your DAG runs.
      ● depends_on_past - when true, a task runs only if the same task succeeded in the previous DAG run.
      ● default_args - a dictionary of variables to be used as constructor keyword parameters when initializing operators.
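The role of default_args can be illustrated without Airflow itself: shared defaults are merged into each operator's constructor keyword arguments, and values given explicitly for a task take precedence. This is a minimal sketch of that merge. `operator_kwargs` is a hypothetical helper and the values are made up for illustration.

```python
from datetime import timedelta

# Hypothetical defaults shared by every task in a DAG.
default_args = {
    "owner": "data_team",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}


def operator_kwargs(defaults, **explicit):
    """Merge DAG-level defaults with per-task arguments; explicit values win."""
    return {**defaults, **explicit}


# One task inherits everything; the other overrides the retry count.
fetch_kwargs = operator_kwargs(default_args, task_id="fetch")
clean_kwargs = operator_kwargs(default_args, task_id="clean", retries=0)
```

Keeping common settings such as the owner or retry policy in one dictionary avoids repeating them on every task while still allowing per-task exceptions.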
  • 20. Python Operator tasks (fetching_tweet.py)
  • 21. Python Operator tasks (cleansing_tweet.py)
  • 22. Start the DAG (toggle the ON/OFF button)
  • 23. Graph View of the DAG
  • 24. Tree View of the DAG
  • 25. Executing the DAG and checking the Hive tables
  • 26. Check the Hive table count after the DAG run