DATA PIPELINE
A set of data processing elements connected in series, where the output of one element is the
input of the next one
WHAT IS A DATA PIPELINE
The data we care about most, in one place, in the same format, with consistency
and quality, and in real time
We don’t have to gather it again every time we want to build an analysis or a
predictive model.
It is reproducible
You define what, where, and how data are collected, transformed, and loaded
This means that you’ll need to manually pick every field, table, data source,
transformation, join, etc. The good news is that if you do it right, you’ll have to
do it just once. Afterwards, everything will be automated.
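As a minimal sketch of the definition above (elements connected in series, each one's output feeding the next), the following plain Python composes three stages. The stage names (`clean`, `keep_valid`, `make_loader`), the `build_pipeline` helper, and the sample records are hypothetical and chosen only for illustration; none of the evaluated tools is involved.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator, List

Record = dict
Stage = Callable[[Iterable[Record]], Iterable[Record]]

def clean(records: Iterable[Record]) -> Iterator[Record]:
    """Transform: put every record into the same format."""
    for r in records:
        yield {"user": r["user"].strip().lower(), "amount": float(r["amount"])}

def keep_valid(records: Iterable[Record]) -> Iterator[Record]:
    """Quality check: drop records with a non-positive amount."""
    return (r for r in records if r["amount"] > 0)

def make_loader(sink: List[Record]) -> Stage:
    """Load: append each record to an in-memory sink (stand-in for a real store)."""
    def load(records: Iterable[Record]) -> Iterator[Record]:
        for r in records:
            sink.append(r)
            yield r
    return load

def build_pipeline(*stages: Stage) -> Stage:
    """Connect stages in series: the output of one is the input of the next."""
    def run(records: Iterable[Record]) -> Iterable[Record]:
        return reduce(lambda data, stage: stage(data), stages, records)
    return run

# Hypothetical raw input with inconsistent formatting and one bad record.
raw = [
    {"user": " Alice ", "amount": "10.5"},
    {"user": "BOB", "amount": "-1"},
]
warehouse: List[Record] = []
pipeline = build_pipeline(clean, keep_valid, make_loader(warehouse))
list(pipeline(raw))  # drain the lazy generators so every stage runs
print(warehouse)
```

Once the stages are defined, the same pipeline object can be run again on new input without picking fields or transformations a second time, which is the "do it just once" point made above.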
CANDIDATES I EVALUATED
Spark Structured Streaming
Apache NIFI
Spring Cloud Data Flow
AWS Glue
SPARK STRUCTURED STREAMING
Scalable and fault-tolerant stream processing engine built on the
Spark SQL engine
Express your streaming computation in the same way you express a
batch computation on static data.
The engine runs the computation incrementally and continuously, updating the
final result as streaming data continues to arrive.
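The idea of expressing a streaming computation the same way as a batch one can be illustrated without Spark. The sketch below is plain Python, not the Structured Streaming API: `word_count` stands for the batch computation, and `IncrementalWordCount` (a hypothetical class) folds in one micro-batch at a time, so that after every arrival the running result equals what the batch computation would give on all data seen so far.

```python
from collections import Counter
from typing import Iterable, List

def word_count(lines: Iterable[str]) -> Counter:
    """Batch computation over a static data set."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

class IncrementalWordCount:
    """Keeps a running result and updates it as each micro-batch arrives."""

    def __init__(self) -> None:
        self.result: Counter = Counter()

    def process(self, micro_batch: List[str]) -> Counter:
        # Only the new lines are scanned; earlier data is never re-read.
        self.result.update(word_count(micro_batch))
        return self.result

# Hypothetical micro-batches arriving one after another.
batches = [["spark nifi", "spark glue"], ["nifi nifi"], ["glue"]]
stream = IncrementalWordCount()
seen: List[str] = []
for batch in batches:
    seen.extend(batch)
    # The incremental result matches a full batch recomputation at every step.
    assert stream.process(batch) == word_count(seen)
print(dict(stream.result))
```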
APACHE NIFI
• Lends well to visual creation and management of directed graphs of processors
• Is inherently asynchronous which allows for very high throughput and natural buffering
even as processing and flow rates fluctuate
• Provides a highly concurrent model without a developer having to worry about the
typical complexities of concurrency
• Promotes the development of cohesive and loosely coupled components which can
then be reused in other contexts and promotes testable units
• The resource-constrained connections make critical functions such as back-pressure
and pressure release very natural and intuitive
• Error handling becomes as natural as the happy path rather than a coarse-grained
catch-all
• The points at which data enters and exits the system as well as how it flows through
are well understood and easily tracked
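The back-pressure point in the list above can be sketched with a bounded queue standing in for a resource-constrained connection between two processors. This is plain Python using the standard `queue` and `threading` modules, not NiFi code; `CAPACITY`, `producer`, `consumer`, and `run_flow` are hypothetical names, and the capacity of 2 is an arbitrary assumption.

```python
import queue
import threading
from typing import List

CAPACITY = 2      # the connection holds at most two items (assumed limit)
DONE = object()   # sentinel marking the end of the flow

def producer(conn: queue.Queue, n_items: int) -> None:
    """Upstream processor: put() blocks whenever the connection is full."""
    for i in range(n_items):
        conn.put(i)  # back-pressure: waits here until there is room
    conn.put(DONE)

def consumer(conn: queue.Queue) -> List[int]:
    """Downstream processor: drains the connection at its own pace."""
    received: List[int] = []
    while True:
        item = conn.get()
        if item is DONE:
            return received
        received.append(item)

def run_flow(n_items: int) -> List[int]:
    """Wire producer and consumer together through one bounded connection."""
    conn: queue.Queue = queue.Queue(maxsize=CAPACITY)
    upstream = threading.Thread(target=producer, args=(conn, n_items))
    upstream.start()
    received = consumer(conn)
    upstream.join()
    return received

print(run_flow(10))
```

Because the queue is bounded, a fast producer is slowed to the consumer's rate instead of filling memory, and neither side needs explicit locking code.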
SPRING CLOUD DATA FLOW
• The Data Flow server is also responsible for:
• Interpreting and executing a stream DSL that describes the logical
flow of data through multiple long-lived applications.
• Launching a long-lived task application.
• Interpreting and executing a composed task DSL that describes the
logical flow of data through multiple short-lived applications.
• Applying a deployment manifest that describes the mapping of
applications onto the runtime - for example, to set the initial number
of instances, memory requirements, and data partitioning.
• Providing the runtime status of deployed applications.
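To make "a stream DSL that describes the logical flow of data" more concrete: Spring Cloud Data Flow stream definitions use a pipe syntax, for example `http | transform | log`. The toy parser below is not the Data Flow server's implementation. `parse_stream` and the role labels are hypothetical, and the sketch ignores application options, labels, and named destinations.

```python
from typing import List, Tuple

def parse_stream(definition: str) -> List[Tuple[str, str]]:
    """Split a pipe-delimited stream definition into (role, app name) pairs.

    The first app is treated as the source, the last as the sink, and
    anything in between as a processor.
    """
    apps = [part.strip() for part in definition.split("|")]
    if len(apps) < 2 or any(not app for app in apps):
        raise ValueError(f"need at least a source and a sink: {definition!r}")
    roles = ["source"] + ["processor"] * (len(apps) - 2) + ["sink"]
    return list(zip(roles, apps))

for role, app in parse_stream("http | transform | log"):
    print(f"{role:9} -> {app}")
```

Each parsed element corresponds to one long-lived application that the server would deploy and connect to its neighbours.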
SPRING CLOUD DATA FLOW
• The Data Flow Server deploys applications onto the target runtime
that conform to the microservice architectural style.
• A stream represents a high-level application that consists of multiple
small microservice applications each running in their own process.
• Each microservice application can be scaled up or down
independently of the other and each has its own versioning lifecycle.
• Both Streaming and Task-based microservice applications build upon
Spring Boot as the foundational library.
• This gives all microservice applications functionality such as health
checks, security, configurable logging, monitoring, and management
functionality, as well as executable JAR packaging.
AWS GLUE: BUILD YOUR DATA CATALOG
AWS GLUE: GENERATE AND EDIT TRANSFORMATIONS
AWS GLUE: SCHEDULE AND RUN YOUR JOBS
AWS GLUE: ANALYZE LOG DATA
AWS GLUE: UNIFIED VIEW OF DATA ACROSS MULTIPLE DATA STORES
AWS GLUE: QUERY AGAINST S3 DATA LAKE
AWS GLUE: EVENT-DRIVEN ETL PIPELINES
EVALUATION
Evaluation criteria, compared across Spark Structured Streaming, Spring Cloud Data Flow, Apache NIFI, and AWS Glue

Real Time
• Spark Structured Streaming: PRO. The system runs in near real time and can checkpoint its status.
• Spring Cloud Data Flow: PRO. Runs data from point to point across a predefined acyclic graph of microservices in near real time.
• Apache NIFI: PRO. Probably the best here.
• AWS Glue: CON. Glue uses the basic Spark SQL engine, and each job must be scheduled or triggered. It could be set up to run every x minutes.

Manages updating data
• Spark Structured Streaming: PRO. The Spark writing infrastructure is very sophisticated for writing to large data systems like AWS S3 or Hadoop HDFS. It has been extended here to manage the updating nature of a real-time system.
• Spring Cloud Data Flow: CON. Managed by code you write in Java or Groovy.
• Apache NIFI: PRO
• AWS Glue: CON. Spark has to be managed effectively to overwrite a dataset.

Manage Stops and Restarts
• Spark Structured Streaming: PRO. This system has a checkpointing system that allows it to stop and restart and remember what has been processed and what has not. It makes sure duplicates are not created.
• Spring Cloud Data Flow: CON. Managed by code or another third-party system you must use.
• Apache NIFI: PRO
• AWS Glue: CON. This system runs each job
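The stop-and-restart behaviour credited to checkpointing above can be sketched in a few lines. This is not Spark's checkpoint mechanism: the JSON checkpoint file, the `run` function, and the integer offsets are all assumptions made for illustration. The point is only that persisting the last processed position lets a restarted job skip what it has already handled.

```python
import json
import os
import tempfile
from typing import List

def read_checkpoint(path: str) -> int:
    """Return the next offset to process, or 0 if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_offset"]

def write_checkpoint(path: str, next_offset: int) -> None:
    """Persist the position of the next unprocessed record."""
    with open(path, "w") as f:
        json.dump({"next_offset": next_offset}, f)

def run(source: List[str], sink: List[str], path: str, max_records: int) -> None:
    """Process up to max_records, then stop (simulating a shutdown)."""
    start = read_checkpoint(path)
    for offset in range(start, min(start + max_records, len(source))):
        sink.append(source[offset].upper())
        write_checkpoint(path, offset + 1)  # record progress after each write

# Hypothetical source records and an in-memory sink.
source = ["a", "b", "c", "d", "e"]
sink: List[str] = []
checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

run(source, sink, checkpoint, max_records=2)   # first run stops early
run(source, sink, checkpoint, max_records=10)  # restart resumes at offset 2
print(sink)
```

In this sketch a crash between the append and the checkpoint write would still replay one record on restart, so real systems typically pair checkpoints with idempotent or transactional sinks.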
EVALUATION
Evaluation criteria, compared across Spark Structured Streaming, Spring Cloud Data Flow, Apache NIFI, and AWS Glue

Manage Failures and Duplicates
• Spark Structured Streaming: PRO. Checkpointing.
• Spring Cloud Data Flow: CON. Write it yourself or use a third-party system.
• Apache NIFI: PRO
• AWS Glue: CON

Secure
• Spark Structured Streaming: PRO
• Spring Cloud Data Flow: PRO
• Apache NIFI: CON. Uses Kerberos. Can be somewhat secure.
• AWS Glue: PRO. All the security any AWS and Spark/Hadoop system would have in AWS.

Manage Large Data Sets
• Spark Structured Streaming: PRO. All the power of Spark's RDDs, Spark SQL, and MLlib. World-class system.
• Spring Cloud Data Flow: CON. Difficult to code, or requires some kind of third-party system that would work in Spring Boot and on a cluster such as YARN or Cloud Foundry.
• Apache NIFI: CON. While the system seems scalable, the data must be sent in as discrete units to scale, and must not rely on joins or grouping across multiple systems.
• AWS Glue: PRO. All the power of Spark.

Scale out using YARN, Mesos, and/or Cloud Foundry
• Spark Structured Streaming: PRO. YARN and Mesos.
• Spring Cloud Data Flow: PRO. YARN, Mesos, and Cloud Foundry.
• Apache NIFI: CON. This seems to manage its load itself; you just start it up on other computers. There is an open JIRA task for NIFI to use Cloud Foundry.
• AWS Glue: PRO. YARN, Mesos.
EVALUATION
Evaluation criteria, compared across Spark Structured Streaming, Spring Cloud Data Flow, Apache NIFI, and AWS Glue

Maturity of Product
• Spark Structured Streaming: CON. Alpha product, but it has been very reliable for the last two months of use.
• Spring Cloud Data Flow: PRO
• Apache NIFI: PRO. Started as an NSA product and was then open sourced. Hortonworks is investing in the Apache project.
• AWS Glue: CON. Spark DataFrames and RDDs are mature, but the Data Catalog and Glue API extensions are still in beta.

Extraction
• Spark Structured Streaming: PRO. Spark extraction is mature, but Structured Streaming does not support the full range of what Spark currently supports. It is a PRO because of its support for S3 and Kafka.
• Spring Cloud Data Flow: CON. Must be programmed, or a third-party product used.
• Apache NIFI: PRO. Works with Kafka; not sure about Kinesis. Not sure if it works with S3.
• AWS Glue: PRO. Uses the power of Spark but also uses the AWS Glue Catalog, where data sources and their structure can be crawled and then recreated in Spark Python. Very easy to use data sources with the Glue API.

Transformation
• Spark Structured Streaming: PRO. The power of Spark.
• Spring Cloud Data Flow: CON. Managed by coding or a third-party product.
• Apache NIFI: PRO/CON. The purpose of the product is ingestion. Transformation is possible, but not sure how sophisticated.
• AWS Glue: PRO. Glue is an ETL product.
EVALUATION
Evaluation criteria, compared across Spark Structured Streaming, Spring Cloud Data Flow, Apache NIFI, and AWS Glue

Load
• Spark Structured Streaming: CON. This part is alpha but will catch up with the Spark architecture, which is very strong.
• Spring Cloud Data Flow: CON. Managed by coding or a third-party product.
• Apache NIFI: PRO. The key purpose of the product.
• AWS Glue: PRO. Glue is an ETL product.

Machine Learning
• Spark Structured Streaming: PRO. Power of Spark.
• Spring Cloud Data Flow: CON. Managed by coding or a third-party product.
• Apache NIFI: CON (with Spark PRO). A Spark Streaming or Structured Streaming job can be placed in NIFI to support ML jobs within a NIFI pipeline.
• AWS Glue: PRO. All the power of Spark, but not part of the Glue infrastructure.

Machine Learning on Large Datasets
• Spark Structured Streaming: PRO. Integrates with Spark MLlib.
• Spring Cloud Data Flow: CON. Managed by coding or a third-party product.
• Apache NIFI: CON (with Spark PRO). A Spark Streaming or Structured Streaming job can be placed in NIFI to support ML jobs within a NIFI pipeline.
• AWS Glue: PRO. All the power of Spark, but not part of the Glue infrastructure.
EVALUATION
Evaluation criteria, compared across Spark Structured Streaming, Spring Cloud Data Flow, Apache NIFI, and AWS Glue

Data Catalog and integration with AWS services
• Spark Structured Streaming: CON. Would need integration with a third party.
• Spring Cloud Data Flow: CON. Managed by coding or a third-party product.
• Apache NIFI: CON
• AWS Glue: PRO. The AWS Glue Data Catalog can automatically update itself, be used to create Spark DataFrames, and integrate with AWS Athena and AWS Redshift. This product is still young, but its usefulness is already impressive. More Data Catalog products should be evaluated.

Count
• Spark Structured Streaming: PRO 11, CON 3
• Spring Cloud Data Flow: PRO 4, CON 10
• Apache NIFI: PRO 8 or 10, CON 6 or 4
• AWS Glue: PRO 9, CON 5
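The totals above can be re-derived from the individual verdicts. The sketch below transcribes each verdict from the evaluation slides and counts them. The encoding ("P", "C", "P/C", "C*") and the `tally` function are shorthand introduced here, not part of the slides. Reading the mixed PRO/CON verdict for Transformation as a PRO is an inference from the arithmetic, since that is the reading under which the Apache NIFI totals come out as 8 and 6; the slides do not state it.

```python
from collections import Counter
from typing import Dict, Tuple

TOOLS = ("Spark Structured Streaming", "Spring Cloud Data Flow",
         "Apache NIFI", "AWS Glue")

# One row per criterion, verdicts in TOOLS order.
# "P" = PRO, "C" = CON, "P/C" = mixed verdict, "C*" = "CON (with Spark PRO)".
VERDICTS = {
    "Real Time":                      ("P", "P", "P", "C"),
    "Manages updating data":          ("P", "C", "P", "C"),
    "Manage Stops and Restarts":      ("P", "C", "P", "C"),
    "Manage Failures and Duplicates": ("P", "C", "P", "C"),
    "Secure":                         ("P", "P", "C", "P"),
    "Manage Large Data Sets":         ("P", "C", "C", "P"),
    "Scale out":                      ("P", "P", "C", "P"),
    "Maturity of Product":            ("C", "P", "P", "C"),
    "Extraction":                     ("P", "C", "P", "P"),
    "Transformation":                 ("P", "C", "P/C", "P"),
    "Load":                           ("C", "C", "P", "P"),
    "Machine Learning":               ("P", "C", "C*", "P"),
    "ML on Large Datasets":           ("P", "C", "C*", "P"),
    "Data Catalog":                   ("C", "C", "C", "P"),
}

def tally(mixed_as: str = "P", con_with_spark_as: str = "C") -> Dict[str, Tuple[int, int]]:
    """Return (PRO count, CON count) per tool under the chosen readings."""
    counts = {tool: Counter() for tool in TOOLS}
    for row in VERDICTS.values():
        for tool, verdict in zip(TOOLS, row):
            if verdict == "P/C":
                verdict = mixed_as
            elif verdict == "C*":
                verdict = con_with_spark_as
            counts[tool][verdict] += 1
    return {tool: (c["P"], c["C"]) for tool, c in counts.items()}

print(tally())                       # "CON (with Spark PRO)" counted as CON
print(tally(con_with_spark_as="P"))  # the same verdicts counted as PRO
```

The two calls correspond to the "8 or 10" and "6 or 4" alternatives for Apache NIFI: the difference is only whether the two machine learning rows are credited to NIFI when Spark is embedded in the pipeline.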

9711199012⎷❤✨ Call Girls RK Puram Special Price with a special young9711199012⎷❤✨ Call Girls RK Puram Special Price with a special young
9711199012⎷❤✨ Call Girls RK Puram Special Price with a special young
 
Royal-Class Call Girls Thane🌹9967824496🌹369+ call girls @₹6K-18K/full night cash
Royal-Class Call Girls Thane🌹9967824496🌹369+ call girls @₹6K-18K/full night cashRoyal-Class Call Girls Thane🌹9967824496🌹369+ call girls @₹6K-18K/full night cash
Royal-Class Call Girls Thane🌹9967824496🌹369+ call girls @₹6K-18K/full night cash
 
satta matka Dpboss guessing Kalyan matka Today Kalyan Panel Chart Kalyan Jodi...
satta matka Dpboss guessing Kalyan matka Today Kalyan Panel Chart Kalyan Jodi...satta matka Dpboss guessing Kalyan matka Today Kalyan Panel Chart Kalyan Jodi...
satta matka Dpboss guessing Kalyan matka Today Kalyan Panel Chart Kalyan Jodi...
 
Call Girls In Tirunelveli 👯‍♀️ 7339748667 🔥 Safe Housewife Call Girl Service ...
Call Girls In Tirunelveli 👯‍♀️ 7339748667 🔥 Safe Housewife Call Girl Service ...Call Girls In Tirunelveli 👯‍♀️ 7339748667 🔥 Safe Housewife Call Girl Service ...
Call Girls In Tirunelveli 👯‍♀️ 7339748667 🔥 Safe Housewife Call Girl Service ...
 
Salesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - CanariasSalesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - Canarias
 
MySQL Notes For Professionals sttudy.pdf
MySQL Notes For Professionals sttudy.pdfMySQL Notes For Professionals sttudy.pdf
MySQL Notes For Professionals sttudy.pdf
 
Mumbai Central Call Girls ☑ +91-9833325238 ☑ Available Hot Girls Aunty Book Now
Mumbai Central Call Girls ☑ +91-9833325238 ☑ Available Hot Girls Aunty Book NowMumbai Central Call Girls ☑ +91-9833325238 ☑ Available Hot Girls Aunty Book Now
Mumbai Central Call Girls ☑ +91-9833325238 ☑ Available Hot Girls Aunty Book Now
 
Product Cluster Analysis: Unveiling Hidden Customer Preferences
Product Cluster Analysis: Unveiling Hidden Customer PreferencesProduct Cluster Analysis: Unveiling Hidden Customer Preferences
Product Cluster Analysis: Unveiling Hidden Customer Preferences
 
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENTHigh Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
 
🔥Mature Women / Aunty Call Girl Chennai 💯Call Us 🔝 8094342248 🔝💃Top Class Cal...
🔥Mature Women / Aunty Call Girl Chennai 💯Call Us 🔝 8094342248 🔝💃Top Class Cal...🔥Mature Women / Aunty Call Girl Chennai 💯Call Us 🔝 8094342248 🔝💃Top Class Cal...
🔥Mature Women / Aunty Call Girl Chennai 💯Call Us 🔝 8094342248 🔝💃Top Class Cal...
 
machine learning notes by Andrew Ng and Tengyu Ma
machine learning notes by Andrew Ng and Tengyu Mamachine learning notes by Andrew Ng and Tengyu Ma
machine learning notes by Andrew Ng and Tengyu Ma
 
Interview Methods - Marital and Family Therapy and Counselling - Psychology S...
Interview Methods - Marital and Family Therapy and Counselling - Psychology S...Interview Methods - Marital and Family Therapy and Counselling - Psychology S...
Interview Methods - Marital and Family Therapy and Counselling - Psychology S...
 
Bangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts ServiceBangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts Service
 
Ahmedabad Call Girls 7339748667 With Free Home Delivery At Your Door
Ahmedabad Call Girls 7339748667 With Free Home Delivery At Your DoorAhmedabad Call Girls 7339748667 With Free Home Delivery At Your Door
Ahmedabad Call Girls 7339748667 With Free Home Delivery At Your Door
 
Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...
Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...
Delhi Call Girls Karol Bagh 👉 9711199012 👈 unlimited short high profile full ...
 
PCI-DSS-Data Security Standard v4.0.1.pdf
PCI-DSS-Data Security Standard v4.0.1.pdfPCI-DSS-Data Security Standard v4.0.1.pdf
PCI-DSS-Data Security Standard v4.0.1.pdf
 
AI WITH THE HELP OF NAGALAND CAN WIN. DOWNLOAD NOW
AI WITH THE HELP OF NAGALAND CAN WIN. DOWNLOAD NOWAI WITH THE HELP OF NAGALAND CAN WIN. DOWNLOAD NOW
AI WITH THE HELP OF NAGALAND CAN WIN. DOWNLOAD NOW
 
Direct Lake Deep Dive slides from Fabric Engineering Roadshow
Direct Lake Deep Dive slides from Fabric Engineering RoadshowDirect Lake Deep Dive slides from Fabric Engineering Roadshow
Direct Lake Deep Dive slides from Fabric Engineering Roadshow
 
🔥Book Call Girls Lucknow 💯Call Us 🔝 6350257716 🔝💃Independent Lucknow Escorts ...
🔥Book Call Girls Lucknow 💯Call Us 🔝 6350257716 🔝💃Independent Lucknow Escorts ...🔥Book Call Girls Lucknow 💯Call Us 🔝 6350257716 🔝💃Independent Lucknow Escorts ...
🔥Book Call Girls Lucknow 💯Call Us 🔝 6350257716 🔝💃Independent Lucknow Escorts ...
 

Data Pipeline for The Big Data/Data Science OKC

  • 1. DATA PIPELINE A set of data processing elements connected in series, where the output of one element is the input of the next one
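The definition above can be sketched in a few lines of plain Python: each stage is a function, and the output of one stage is passed as the input of the next. The stage names (`collect`, `transform`, `load`) and the sample records are hypothetical, chosen only for illustration; they do not come from the slides.

```python
from functools import reduce
from typing import Callable, Iterable, List

Stage = Callable[[Iterable[dict]], List[dict]]

def collect(_: Iterable[dict]) -> List[dict]:
    # Hypothetical source stage: ignores its input and emits raw records.
    return [{"city": " okc ", "temp_f": "72"}, {"city": "Tulsa", "temp_f": "68"}]

def transform(records: Iterable[dict]) -> List[dict]:
    # Normalise formats so every record has the same shape and types.
    return [{"city": r["city"].strip().upper(), "temp_f": int(r["temp_f"])}
            for r in records]

def load(records: Iterable[dict]) -> List[dict]:
    # Stand-in for writing to a store; here it just materialises the list.
    return list(records)

def run_pipeline(stages: List[Stage], seed: Iterable[dict] = ()) -> List[dict]:
    # Feed the output of each stage into the next one, in series.
    return reduce(lambda data, stage: stage(data), stages, seed)

result = run_pipeline([collect, transform, load])
```

Because the stages are defined once and composed in a fixed order, rerunning `run_pipeline` repeats the same steps without manual work.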
  • 2. WHAT IS A DATA PIPELINE The data we care most about in one place, in the same format, with consistency, quality, and in real time. We don’t have to do it every time we want to build an analysis or predictive models. It is reproducible. You will define what, where, and how data are collected, transformed, and loaded. This means that you’ll need to manually pick every field, table, data source, transformation, join, etc. The good news is that if you do it right, you’ll have to do it just once. Afterwards everything will be automated.
  • 3. CANDIDATES I EVALUATED Spark Structured Streaming Apache NIFI Spring Cloud Data Flow AWS Glue
  • 4. SPARK STRUCTURED STREAMING Scalable and fault-tolerant stream processing engine built on the Spark SQL engine. Express your streaming computation in the same way you express batch computation on static data. The stream runs incrementally and continuously, updating the final result as streaming data continues to arrive.
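The idea of a result that is updated incrementally as data arrives can be illustrated without Spark. The following is a plain-Python sketch of the concept only, not the Spark API: the function name `running_word_counts` and the sample batches are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

def running_word_counts(batches: Iterable[List[str]]) -> Iterator[Dict[str, int]]:
    # Keep one running total and update it as each new batch of lines arrives,
    # instead of recomputing the counts over all data seen so far.
    totals: Counter = Counter()
    for batch in batches:
        for line in batch:
            totals.update(line.split())
        # Emit a snapshot of the up-to-date result after every batch.
        yield dict(totals)

snapshots = list(running_word_counts([["a b"], ["b c", "a"]]))
```

Each snapshot is the same answer a batch job would give over the data received so far, which is the equivalence the slide describes.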
  • 6. APACHE NIFI
      • Lends itself well to visual creation and management of directed graphs of processors
      • Is inherently asynchronous, which allows for very high throughput and natural buffering even as processing and flow rates fluctuate
      • Provides a highly concurrent model without a developer having to worry about the typical complexities of concurrency
      • Promotes the development of cohesive and loosely coupled components which can then be reused in other contexts, and promotes testable units
      • The resource-constrained connections make critical functions such as back-pressure and pressure release very natural and intuitive
      • Error handling becomes as natural as the happy path rather than a coarse-grained catch-all
      • The points at which data enters and exits the system, as well as how it flows through, are well understood and easily tracked
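Back-pressure from a resource-constrained connection can be sketched with a bounded queue from the Python standard library. This is an illustration of the general idea, not NiFi's implementation or API; the limit of two items and the helper `offer` are hypothetical.

```python
import queue

# A connection between two processors, limited to two queued items (assumed limit).
connection: "queue.Queue[int]" = queue.Queue(maxsize=2)

def offer(item: int) -> bool:
    # The upstream processor tries to hand over an item. When the connection
    # is full it is refused, which signals the upstream side to slow down.
    try:
        connection.put_nowait(item)
        return True
    except queue.Full:
        return False

accepted = [offer(i) for i in range(3)]  # the third offer hits the limit
consumed = connection.get_nowait()       # downstream takes one item, freeing space
retry = offer(2)                         # the refused item can now be accepted
```

The upstream side never needs to know how fast the downstream side is; the full connection carries that information.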
  • 10. SPRING CLOUD DATA FLOW
      • The Data Flow server is also responsible for:
          • Interpreting and executing a stream DSL that describes the logical flow of data through multiple long-lived applications.
          • Launching a long-lived task application.
          • Interpreting and executing a composed task DSL that describes the logical flow of data through multiple short-lived applications.
          • Applying a deployment manifest that describes the mapping of applications onto the runtime - for example, to set the initial number of instances, memory requirements, and data partitioning.
          • Providing the runtime status of deployed applications.
  • 11. SPRING CLOUD DATA FLOW
      • The Data Flow Server deploys applications onto the target runtime that conform to the microservice architectural style.
      • A stream represents a high-level application that consists of multiple small microservice applications, each running in its own process.
      • Each microservice application can be scaled up or down independently of the others, and each has its own versioning lifecycle.
      • Both Streaming and Task-based microservice applications build upon Spring Boot as the foundational library.
      • This gives all microservice applications functionality such as health checks, security, configurable logging, monitoring, and management, as well as executable JAR packaging.
  • 13. AWS GLUE: BUILD YOUR DATA CATALOG
  • 14. AWS GLUE: GENERATE AND EDIT TRANSFORMATIONS
  • 15. AWS GLUE: SCHEDULE AND RUN YOUR JOBS
  • 17. AWS GLUE: UNIFIED VIEW OF DATA ACROSS MULTIPLE DATA STORES
  • 18. AWS GLUE: QUERY AGAINST S3 DATA LAKE
  • 19. AWS GLUE: EVENT-DRIVEN ETL PIPELINES
  • 20. EVALUATION
      • Real Time
          • Spark Structured Streaming: PRO. The system runs in near real time and can checkpoint its status.
          • Spring Cloud Data Flow: PRO. Runs data from point to point across a predefined acyclic graph of microservices in near real time.
          • Apache NIFI: PRO. Probably the best here.
          • AWS Glue: CON. Glue uses the basic Spark SQL engine, and each job must be scheduled or triggered. It could be set up to run every x minutes.
      • Manages updating data
          • Spark Structured Streaming: PRO. The Spark writing infrastructure is very sophisticated for writing to large data systems like AWS S3 or Hadoop HDFS. It has been extended here to manage the updating nature of a real-time system.
          • Spring Cloud Data Flow: CON. Managed by code you write in Java or Groovy.
          • Apache NIFI: PRO
          • AWS Glue: CON. Spark has to be managed effectively to overwrite a dataset.
      • Manage Stops and Restarts
          • Spark Structured Streaming: PRO. This system has a checkpointing system that allows it to stop and restart and remember what has been processed and what has not. It makes sure duplicates are not created.
          • Spring Cloud Data Flow: CON. Managed by code or another 3rd party system you must use.
          • Apache NIFI: PRO
          • AWS Glue: CON. This system runs each job
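The stop-and-restart behaviour credited to checkpointing can be sketched as follows. This is a minimal plain-Python illustration of the general technique, not Spark's checkpoint format or API; the file name `checkpoint.json`, the `next_offset` field, and the `process` helper are all hypothetical.

```python
import json
import os
import tempfile
from typing import List

def process(records: List[str], checkpoint_path: str, sink: List[str]) -> None:
    # Resume from the last recorded offset, or from 0 on the very first run.
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_offset"]
    for offset in range(start, len(records)):
        sink.append(records[offset])  # stand-in for the real write
        # Record progress so a restart skips what has already been written.
        with open(checkpoint_path, "w") as f:
            json.dump({"next_offset": offset + 1}, f)

records = ["a", "b", "c", "d"]
sink: List[str] = []
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.json")
    process(records[:2], path, sink)  # first run stops after two records
    process(records, path, sink)      # restart sees all four, resumes at offset 2
```

Note that this sketch writes the record and the checkpoint in two separate steps, so a crash between them could still repeat one record; avoiding that requires making the two steps atomic or the write idempotent.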
  • 21. EVALUATION
      • Manage Failures and Duplicates
          • Spark Structured Streaming: PRO. Checkpointing.
          • Spring Cloud Data Flow: CON. Write it yourself or use a 3rd party system.
          • Apache NIFI: PRO
          • AWS Glue: CON
      • Secure
          • Spark Structured Streaming: PRO
          • Spring Cloud Data Flow: PRO
          • Apache NIFI: CON. Uses Kerberos. Can be somewhat secure.
          • AWS Glue: PRO. All the security any AWS and Spark/Hadoop system would have in AWS.
      • Manage Large Data Sets
          • Spark Structured Streaming: PRO. All the power of Spark's RDD, Spark SQL, and ML Lib. World-class system.
          • Spring Cloud Data Flow: CON. Difficult to code, or requires some kind of third party system that would work in Spring Boot and a cluster such as Yarn or Cloud Foundry.
          • Apache NIFI: CON. While the system seems scalable, the data must be sent in as a discrete unit to scale, and cannot rely on joining or grouping across multiple systems.
          • AWS Glue: PRO. All the power of Spark.
      • Scale out using Yarn, Mesos and/or Cloud Foundry
          • Spark Structured Streaming: PRO. Yarn and Mesos.
          • Spring Cloud Data Flow: PRO. Yarn, Mesos, and Cloud Foundry.
          • Apache NIFI: CON. This seems to manage its load itself; you just start it up on other computers. There is an open JIRA task for NIFI to use Cloud Foundry.
          • AWS Glue: PRO. Yarn, Mesos.
  • 22. EVALUATION
      • Maturity of Product
          • Spark Structured Streaming: CON. Alpha product, but it has been very reliable for the last two months of use.
          • Spring Cloud Data Flow: PRO
          • Apache NIFI: PRO. Started as an NSA product and was then open sourced. Hortonworks is investing in the Apache project.
          • AWS Glue: CON. Spark DataFrames and RDDs are mature, but the added extensions of the Data Catalog and Glue API are still in beta.
      • Extraction
          • Spark Structured Streaming: PRO. Spark extraction is mature, but Structured Streaming does not support the full range of what Spark currently supports. It is a pro because of its support for S3 and Kafka.
          • Spring Cloud Data Flow: CON. Must be programmed or a 3rd party product used.
          • Apache NIFI: PRO. Works with Kafka; not sure about Kinesis. Not sure if it works with S3.
          • AWS Glue: PRO. Uses the power of Spark but also uses the AWS Glue Catalog, where data sources and their structure can be crawled and then recreated in Spark Python. Very easy to use data sources with the Glue API.
      • Transformation
          • Spark Structured Streaming: PRO. The power of Spark.
          • Spring Cloud Data Flow: CON. Managed by coding or 3rd party product.
          • Apache NIFI: PRO/CON. The purpose of the product is ingestion. Transformation is possible, but not sure how sophisticated.
          • AWS Glue: PRO. Glue is an ETL product.
  • 23. EVALUATION
      • Load
          • Spark Structured Streaming: CON. This part is alpha but will catch up with the Spark architecture, which is very strong.
          • Spring Cloud Data Flow: CON. Managed by coding or 3rd party product.
          • Apache NIFI: PRO. The key purpose of the product.
          • AWS Glue: PRO. Glue is an ETL product.
      • Machine Learning
          • Spark Structured Streaming: PRO. Power of Spark.
          • Spring Cloud Data Flow: CON. Managed by coding or 3rd party product.
          • Apache NIFI: CON (with Spark, PRO). This system can put a Spark Streaming or Structured Streaming job in NIFI to support ML jobs within a NIFI pipeline.
          • AWS Glue: PRO. All the power of Spark, but not part of the Glue infrastructure.
      • Machine Learning on Large Datasets
          • Spark Structured Streaming: PRO. Integrates with the Spark ML Lib.
          • Spring Cloud Data Flow: CON. Managed by coding or 3rd party product.
          • Apache NIFI: CON (with Spark, PRO). This system can put a Spark Streaming or Structured Streaming job in NIFI to support ML jobs within a NIFI pipeline.
          • AWS Glue: PRO. All the power of Spark, but not part of the Glue infrastructure.
  • 24. EVALUATION
      • Data Catalog and integration to AWS services
          • Spark Structured Streaming: CON. Would need integration with a 3rd party.
          • Spring Cloud Data Flow: CON. Managed by coding or 3rd party product.
          • Apache NIFI: CON
          • AWS Glue: PRO. The AWS Glue Data Catalog can automatically update itself, be used to create Spark DataFrames, and integrate with AWS Athena and AWS Redshift. This product is still young, but its usefulness is already impressive. Data Catalog products warrant further evaluation.
      • Count
          • Spark Structured Streaming: PRO 11, CON 3
          • Spring Cloud Data Flow: PRO 4, CON 10
          • Apache NIFI: PRO 8 or 10, CON 6 or 4
          • AWS Glue: PRO 9, CON 5
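The final counts can be reproduced by tallying the ratings across the fourteen criteria on slides 20 to 24. In the sketch below the single-letter codes are an assumption made for compactness: `P` is PRO, `C` is CON, `M` is the one "PRO/CON" rating (counted as PRO, which is what the slide's totals imply), and `S` is a "CON (with Spark, PRO)" rating that can swing either way.

```python
from typing import Dict, List, Tuple

# Ratings in slide order: Real Time, Manages updating data, Stops and Restarts,
# Failures and Duplicates, Secure, Large Data Sets, Scale out, Maturity,
# Extraction, Transformation, Load, ML, ML on Large Datasets, Data Catalog.
RATINGS: Dict[str, List[str]] = {
    "Spark Structured Streaming": "P P P P P P P C P P C P P C".split(),
    "Spring Cloud Data Flow":     "P C C C P C P P C C C C C C".split(),
    "Apache NIFI":                "P P P P C C C P P M P S S C".split(),
    "AWS Glue":                   "C C C C P P P C P P P P P P".split(),
}

def tally(marks: List[str]) -> Tuple[int, int, int]:
    # Returns (firm PRO, firm CON, swing ratings that depend on adding Spark).
    pro = sum(m in ("P", "M") for m in marks)
    con = marks.count("C")
    swing = marks.count("S")
    return pro, con, swing

totals = {name: tally(marks) for name, marks in RATINGS.items()}
```

For Apache NIFI the two swing ratings explain the "8 or 10" and "6 or 4" figures: counted as CON they give 8 PRO and 6 CON, counted as PRO they give 10 PRO and 4 CON.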