Cloud Native Data Pipelines
Sid Anand (@r39132)
Cloud Data Next 2017
About Me
• Work [ed | s] @ (company logos not recovered)
• Committer & PPMC on Apache Airflow
• Father of 2
• Co-Chair for (conference logo not recovered)
Agari : What We Do!
Agari’s Previous EP Version (Batch)
Enterprise Customers → email metadata → apply trust models → email md + trust score

Agari’s Current EP Version (Near-real time)
Enterprise Customers → email metadata → apply trust models → email md + trust score
Actions: Quarantine, Label, PassThrough
Motivation
Cloud Native Data Pipelines
Big Data Companies like LinkedIn, Facebook, Twitter, & Google
have large teams to manage their data pipelines (100s of
engineers)

Most start-ups have small teams (10s of engineers) & run in the
public cloud. Can they leverage aspects of the public cloud to
build comparable pipelines?
Cloud Native Techniques + Open Source Technologies ~ Data Pipelines seen in Big Data companies
Design Goals
Desirable Qualities of a Resilient Data Pipeline
Correctness
• Data Integrity (no loss, etc…)
• Expected data distributions
Timeliness
• All output within time-bound SLAs
Operability
• Minimize Operational Fatigue / Automate Everything
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability
Cost
• Pay-as-you-go
Quickly Recoverable
17
• Bugs happen!
• Bugs in Predictive Data Pipelines have a large blast radius
• Optimize for MTTR
Predictive Analytics @ Agari
Use Cases
18
Use Cases
19
Apply trust models
(message scoring)
batch + near real
time
Build trust models
batch
(Enterprise Protect)
Focus of this talk
Use-Case : Message
Scoring (batch)
Batch Pipeline Architecture
20
Use-Case : Message Scoring
Diagram: enterprise A, enterprise B, enterprise C → S3 → Spark (EMR) → S3 → SNS → SQS → Importers (ASG) → DB
• Each enterprise uploads an Avro file to S3 every 15 minutes
• Airflow kicks off a Spark message scoring job every hour (EMR)
• The Spark job writes scored messages and stats to another S3 bucket
• This triggers SNS/SQS message events
• An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages
• The importers rapidly ingest scored messages and aggregate statistics into the DB
• Users receive alerts of untrusted emails & can review them in the web app
• Airflow manages the entire process
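The hand-off from S3 to the importers can be sketched as follows. This is a minimal illustration, not Agari's importer code: it assumes the S3 bucket publishes object-created events to SNS, that SNS delivers its standard JSON envelope to SQS (raw message delivery off), and that the importer only needs the bucket and key. The function name, bucket name, and object key are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sqs_body(body: str) -> list:
    """Return (bucket, key) pairs for every S3 object named in one SQS message body."""
    envelope = json.loads(body)              # outer layer: the SNS notification
    event = json.loads(envelope["Message"])  # inner layer: the S3 event, stored as a JSON string
    return [
        # S3 URL-encodes object keys in event notifications, so decode them.
        (rec["s3"]["bucket"]["name"], unquote_plus(rec["s3"]["object"]["key"]))
        for rec in event.get("Records", [])
    ]


# Hypothetical message body, for illustration only.
sample_body = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({
        "Records": [{
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "scored-messages"},
                "object": {"key": "enterprise-a/2017/part+0001.avro"},
            },
        }]
    }),
})
objects = s3_objects_from_sqs_body(sample_body)
```

An importer would then fetch each object and load it into the DB; that part is omitted because the slides do not describe it.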
Use-Case : Message Scoring
Architectural Components
Component | Role | Uses | Salient Features | Operability Model
S3 | Data Lake | All data stored in S3; all processing uses S3 | Scalable, Available, Performant | Serverless
SNS, SQS | Messaging | Reliable, Transactional, Pub/Sub | Scalable, Available, Performant | Serverless
ASG | General Processing | Used for importing, data cleansing, business logic | Scalable, Available, Performant | Managed
Spark | Data Science Processing | Aggregation; Model Building; Scoring | Nice programming model at the cost of debugging complexity | We Operate
Airflow | Workflow Engine | Coordinates all Spark Jobs & complex flows | Lightweight, DAGs as Code, Steep learning curve | We Operate
DB | Persistence for WebApp | Holds subset of data needed for Web App | Rails + Postgres (‘nuff said) | We Operate
Tackling Cost & Timeliness
Leveraging the AWS Cloud
30
Tackling Cost
Charts: Between Daily Runs | During Daily Runs
When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR
Charts: Between Hourly Runs | During Hourly Runs
This does not help when runs are hourly since AWS charges at an hourly rate for EC2 instances!
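The arithmetic behind this point can be sketched as below. It takes the slide's premise that EC2 usage is billed in whole hours; the run lengths and the instance count are made-up numbers, and the function name is hypothetical.

```python
import math


def billed_instance_hours(run_minutes: float, runs_per_day: int, instances: int) -> int:
    """Instance-hours billed per day when each run is rounded up to whole hours.

    Assumes instances are launched for a run and terminated right after it.
    """
    hours_per_run = math.ceil(run_minutes / 60)
    return hours_per_run * runs_per_day * instances


# Hypothetical workload: 10 instances per run.
daily = billed_instance_hours(run_minutes=60, runs_per_day=1, instances=10)
hourly = billed_instance_hours(run_minutes=15, runs_per_day=24, instances=10)
always_on = 24 * 10  # the same 10 instances left running all day
```

With these numbers the daily schedule bills 10 instance-hours, while the hourly schedule bills 240, the same as never shutting the instances down, even though each hourly run is only 15 minutes long.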
Tackling Timeliness
Auto Scaling Group (ASG)
33
ASG - Overview
34
What is it?
A means to automatically scale out/in clusters to handle
variable load/traffic
A means to keep a cluster/service of a fixed size always up
ASG - Data Pipeline
Diagram: SQS → Importer ASG (importers, scale out/in) → DB
ASG : CPU-based
CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant
Chart: Sent, ACKd/Recvd, CPU
Premature Scale-in:
• The CPU drops to noise-levels before all messages are consumed
• This causes scale in to occur while the last few messages are still being committed
ASG : Queue-based
• Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0). This causes the ASG to grow
• Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d). This causes the ASG to shrink
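The queue-based rule above can be written down as a small decision function. This is only a sketch of the rule as stated on the slide: in practice the two counts would come from the SQS queue's visible and not-visible message metrics, and the action would be wired to scaling policies rather than returned from Python. The names `Action` and `decide` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    SCALE_OUT = "scale_out"
    SCALE_IN = "scale_in"
    HOLD = "hold"


def decide(visible: int, in_flight: int) -> Action:
    """Apply the queue-based scaling rule to the current queue counts."""
    if visible > 0:
        # Work is waiting in the queue: grow the ASG.
        return Action.SCALE_OUT
    if in_flight == 0:
        # Nothing waiting and the last in-flight message was ACK'd: shrink the ASG.
        return Action.SCALE_IN
    # Queue is drained but some messages are still being processed:
    # holding here is what avoids the premature scale-in seen with CPU-based scaling.
    return Action.HOLD
```

For example, `decide(0, 3)` holds the cluster size while three messages are still being committed, which is the case CPU-based scaling gets wrong.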
Auto Scaling Groups
Build & Deploy
39
ASG - Build & Deploy
Component | Role | Details
Terraform | Spins up Cloud Resources | Spins up SQS, Kinesis, EC2, ASG, ELB, etc. and associates them using Terraform
Ansible | A better version of Chef & Puppet; sets up an EC2 instance | Agentless, idempotent, & declarative tool to set up EC2 instances, by installing & configuring packages, and more
Packer | Spins up an EC2 instance for the purposes of building an AMI! | Can be used with Ansible & Terraform to bake AMIs & launch Auto-Scaling Groups
ASG - Build & Deploy
Step 1 : Packer spins up a temporary EC2 node - a blank canvas!
Step 2 : Packer runs an Ansible role against the EC2 node to set it up.
Step 3 : Packer snapshots the machine & registers the AMI.
Step 4 : Packer terminates the EC2 instance!
Step 5 : Using the AMI, Terraform spins up an auto-scaled compute cluster (ASG)
Desirable Qualities of a Resilient Data Pipeline
Operability
Correctness
Timeliness
• ASG
• EMR Spark
Cost
• Daily : ASG, EMR Spark
• Hourly : ASG, No Cost Savings
Tackling Operability &
Correctness
Leveraging Tooling
47
Tackling Operability : Requirements
• A simple way to author, configure, manage workflows
• Provides visual insight into the state & performance of workflow runs
• Integrates with our alerting and monitoring tools
Apache Airflow
Workflow Automation & Scheduling
49
Apache Airflow - Authoring DAGs
Airflow: Author DAGs in Python! No need to bundle many config files!
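To illustrate what "DAGs as code" means, here is a small sketch that uses only the Python standard library. It is deliberately not Airflow's API: the task names are hypothetical labels for the batch pipeline steps described earlier, and `graphlib` stands in for the scheduler's dependency resolution.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a task; its value is the set of tasks that must finish first.
# Task names are hypothetical.
pipeline = {
    "score_messages_spark": {"wait_for_avro_upload"},
    "import_scored_messages": {"score_messages_spark"},
    "send_alerts": {"import_scored_messages"},
}

# A workflow engine runs tasks in an order that respects these dependencies.
order = list(TopologicalSorter(pipeline).static_order())
```

Because the whole workflow is ordinary Python data and code, it can be reviewed, versioned, and generated programmatically, which is the point the slide makes about not bundling many config files.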
Airflow: Visualizing a DAG
Apache Airflow - Perf. Insights
Airflow: Gantt chart view reveals the slowest tasks for a run!
Airflow: Task Duration chart view shows task completion time trends!
Apache Airflow - Alerting
Airflow: …And easy to integrate with Ops tools!
Desirable Qualities of a Resilient Data Pipeline
Operability, Correctness, Timeliness, Cost
Use-Case : Message
Scoring (near-real time)
NRT Pipeline Architecture
56
Use-Case : Message Scoring
Diagram: enterprise A, enterprise B, enterprise C → K (Kinesis) → Scorers (ASG) → Kinesis → Importers (ASG) → DB, and Importers (ASG) → K → Alerters (ASG) → Quarantine Email
• Kinesis batch put every second
• An ASG of scorers is scaled up to one process per core per Kinesis shard
• Scorers apply the trust model and send scored messages downstream
• An ASG of importers is scaled up to rapidly import messages
• Imported messages are also consumed by the alerter
• Quarantine Email
Stream Processing Architecture
Component | Role | Details | Pros | Operability Model
S3 | Data Lake | All data stored in S3 via Kinesis Firehose | Scalable, Available, Performant, Serverless | Serverless
Kinesis | Messaging | Streaming transport modeled on Kafka | Scalable, Available, Serverless | Serverless
(logo not recovered) | General Processing | ASG Replacement except for Rails Apps | Scalable, Available, Serverless | Serverless
ASG | General Processing | Used for importing, data cleansing, business logic | Scalable, Available, Managed | Managed
(logo not recovered) | Data Science Processing | Model Building | | We Operate
Airflow | Workflow Engine | Nightly model builds + some classic Ops cron workloads | Lightweight, DAGs as Code | We Operate
DB | Persistence for WebApp | Holds smaller subset of data needed for Web App | Rails + Postgres (‘nuff said) | We Operate
ES, Elasticache Redis | Persistence for WebApp | Aggregation + Search moved from DB to ES; Model Building queries moved to Elasticache Redis | Faster, more accurate for aggregates, frees up headroom for DB (polyglot persistence) | Managed
Innovations
NRT Pipeline Architecture
64
Apache Avro
What is Avro?
65
What is Avro?
Avro is a self-describing serialization format that supports
• primitive data types : int, long, boolean, float, string, bytes, etc…
• complex data types : records, arrays, unions, maps, enums, etc…
• many language bindings : Java, Scala, Python, Ruby, etc…
One of the most common formats for storing structured Big Data at rest in HDFS, S3, Google Cloud Storage, etc…
Supports Schema Evolution!
Apache Avro
Why is it useful?
68
Why is Avro Useful?
Agari is an IoT company!
• Agari Sensors, deployed at customer sites, stream data to Agari’s Cloud SAAS
• Data is sent via Kinesis!
• At any point in time, customers run different versions of the Agari Sensor
• These Sensors might send different format versions of the data!
Diagram: enterprise A (v1), enterprise B (v2), enterprise C (v3) → Kinesis → Agari SAAS in AWS (v4)
Avro allows Agari to seamlessly handle different IoT data format versions
datum_reader = DatumReader( writers_schema = writers_schema,
                            readers_schema = readers_schema)
Requirements:
• Schemas are backward-compatible
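What a reader schema and a writer schema buy you can be sketched without the Avro library. The function below is a simplified stand-in for Avro's record schema resolution, not the library's implementation: it works on plain dicts, handles only top-level record fields, and the field names `legacy_flag` and the defaults added to the reader schema are hypothetical.

```python
def resolve(record: dict, reader_fields: list) -> dict:
    """Project a record written with another schema version onto the reader's fields.

    Simplified from Avro's record resolution rules:
    - fields the writer sent but the reader does not declare are ignored
    - fields the reader declares but the writer did not send take the reader's default
    - a missing field without a default is an error (the schemas are not compatible)
    """
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


# Hypothetical reader schema fields on the SAAS side (v4).
reader_v4 = [
    {"name": "name"},
    {"name": "favorite_number", "default": None},
    {"name": "favorite_color", "default": None},
]

# Hypothetical record from an older sensor (v1): one extra field, two missing.
from_v1 = resolve({"name": "Alyssa", "legacy_flag": True}, reader_v4)
```

This is why the backward-compatibility requirement matters: every field the newest reader adds needs a default, otherwise records from older sensors cannot be resolved.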
Why is Avro Useful?
Avro Everywhere!
• Avro is so useful, we don’t just use it to communicate between our Sensors & our SAAS infrastructure
• We also use it as the common data-interchange format between all services (streaming & batch) within our AWS deployment
• Good Language Bindings : Data Pipelines services are written in Java, Ruby, & Python
Diagram: Agari SAAS in AWS: S1, S2, S3, s3, Spark
Apache Avro
By Example
76
Avro Schema Example
{"namespace": "agari",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
• complex type (record)
• Schema name : User
• 3 fields in the record: 1 required, 2 optional
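To make "1 required, 2 optional" concrete, the sketch below checks plain Python dicts against the User schema. It is a toy validator written for this example, not the Avro library: it covers only the `string`, `int`, and `null` types used here, and it treats a missing key as null, which is an assumption of the sketch.

```python
import json

USER_SCHEMA = json.loads("""
{"namespace": "agari",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
""")

# Only the Avro types that appear in the User schema.
PY_TYPES = {"string": str, "int": int, "null": type(None)}


def conforms(record: dict, schema: dict) -> bool:
    """True if every field's value matches one of the types the schema allows."""
    for field in schema["fields"]:
        allowed = field["type"] if isinstance(field["type"], list) else [field["type"]]
        value = record.get(field["name"])  # a missing key is treated as null
        # Exact type match, so that True/False is not accepted as an int.
        if not any(type(value) is PY_TYPES[t] for t in allowed):
            return False
    return True
```

`name` is required because its type is a bare `string`; the other two fields are optional because their union types include `null`.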
Avro Schema Data File Example
• A data file stores the User schema once, followed by the Data (x 1,000,000,000)
• Schema : 0.0001 %
• Data : 99.999 %

Avro Schema Streaming Example
• Each streamed message carries the User schema plus one Binary Data block
• Schema : 99 %
• Data : 1 %
• OVERHEAD!!
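The difference between the two cases is simple arithmetic: the schema is paid for once per container, so its share depends on how many records the container holds. The byte sizes below are made-up round numbers, not measurements, and the function name is hypothetical.

```python
def schema_overhead(schema_bytes: int, record_bytes: int, records: int) -> float:
    """Fraction of a file or message taken up by the embedded schema."""
    return schema_bytes / (schema_bytes + record_bytes * records)


# Hypothetical sizes: a 250-byte schema and 20-byte encoded records.
in_file = schema_overhead(schema_bytes=250, record_bytes=20, records=1_000_000_000)
in_stream = schema_overhead(schema_bytes=250, record_bytes=20, records=1)
```

With one billion records per file the schema's share is negligible, while a message carrying a single record is mostly schema. That per-message cost is what motivates sending a short schema identifier instead of the schema itself.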
Apache Avro
Schema Registry
81
Avro Schema Registry
Components: Message Producer (P), Schema Registry (Lambda), Message Consumer (C)
• The Message Producer calls register_schema with the User schema
• register_schema returns a UUID
• Message Producer sends UUID + Data to the Message Consumer
• The Message Consumer calls getSchemaById (UUID) and receives the schema
• Message Consumers download & cache the schema, then decode the data
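The register, send, look up, and cache flow can be sketched in memory as below. This is an illustration only: the real registry in the slides runs on Lambda and its storage and API details are not given, so the class names, the idempotent registration, and the `lookups` counter are assumptions of this sketch. Only `register_schema` and "get schema by id" mirror names from the slides.

```python
import json
import uuid


class SchemaRegistry:
    """In-memory stand-in for the schema registry service."""

    def __init__(self):
        self._by_id = {}
        self._id_by_schema = {}

    def register_schema(self, schema: dict) -> str:
        # Assumption: registering the same schema twice returns the same UUID.
        canonical = json.dumps(schema, sort_keys=True)
        if canonical not in self._id_by_schema:
            schema_id = str(uuid.uuid4())
            self._id_by_schema[canonical] = schema_id
            self._by_id[schema_id] = schema
        return self._id_by_schema[canonical]

    def get_schema_by_id(self, schema_id: str) -> dict:
        return self._by_id[schema_id]


class Consumer:
    """Caches schemas so the registry is called once per schema, not once per message."""

    def __init__(self, registry: SchemaRegistry):
        self._registry = registry
        self._cache = {}
        self.lookups = 0  # registry calls made; kept only to show the cache working

    def schema_for(self, schema_id: str) -> dict:
        if schema_id not in self._cache:
            self.lookups += 1
            self._cache[schema_id] = self._registry.get_schema_by_id(schema_id)
        return self._cache[schema_id]


registry = SchemaRegistry()
user_schema = {"type": "record", "name": "User",
               "fields": [{"name": "name", "type": "string"}]}

schema_id = registry.register_schema(user_schema)  # producer side
consumer = Consumer(registry)
first = consumer.schema_for(schema_id)             # fetched from the registry
second = consumer.schema_for(schema_id)            # served from the local cache
```

Each message then only needs to carry the UUID and the binary data, and the consumer pays for one registry call per schema version it encounters.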
Avro Schema Registry
Diagram: the near-real time pipeline (enterprise A, enterprise B, enterprise C → K → Scorers ASG → Kinesis → Importers ASG → DB, K → Alerters ASG) with three Schema Registry (SR) markers added
Acknowledgments
None of this work would be possible without the essential contributions of the team below
• Vidur Apparao
• Stephen Cattaneo
• Jon Chase
• Andrew Flury
• William Forrester
• Chris Haag
• Chris Buchanan
• Neil Chapin
• Wil Collins
• Don Spencer
• Scot Kennedy
• Natia Chachkhiani
• Patrick Cockwell
• Kevin Mandich
• Gabriel Ortiz
• Jacob Rideout
• Josh Yang
• Julian Mehnle
• Gabriel Poon
• Spencer Sun
• Nathan Bryant
Questions?
(@r39132)

More Related Content

What's hot

Ml 3 ways
Ml 3 waysMl 3 ways
Ml 3 ways
PhilipBasford
 
A Tour of Google Cloud Platform
A Tour of Google Cloud PlatformA Tour of Google Cloud Platform
A Tour of Google Cloud Platform
Colin Su
 
Aws Summit Berlin 2013 - Understanding database options on AWS
Aws Summit Berlin 2013 - Understanding database options on AWSAws Summit Berlin 2013 - Understanding database options on AWS
Aws Summit Berlin 2013 - Understanding database options on AWS
AWS Germany
 
Data Science on Google Cloud Platform
Data Science on Google Cloud PlatformData Science on Google Cloud Platform
Data Science on Google Cloud Platform
Virot "Ta" Chiraphadhanakul
 
Google Cloud Platform
Google Cloud PlatformGoogle Cloud Platform
Google Cloud Platform
VMware Tanzu
 
Google Cloud Platform: Prototype ->Production-> Planet scale
Google Cloud Platform: Prototype ->Production-> Planet scaleGoogle Cloud Platform: Prototype ->Production-> Planet scale
Google Cloud Platform: Prototype ->Production-> Planet scale
Idan Tohami
 
AWS re:Invent 2016: Extending Hadoop and Spark to the AWS Cloud (GPST304)
AWS re:Invent 2016: Extending Hadoop and Spark to the AWS Cloud (GPST304)AWS re:Invent 2016: Extending Hadoop and Spark to the AWS Cloud (GPST304)
AWS re:Invent 2016: Extending Hadoop and Spark to the AWS Cloud (GPST304)
Amazon Web Services
 
What's New with Big Data Analytics
What's New with Big Data AnalyticsWhat's New with Big Data Analytics
What's New with Big Data Analytics
Amazon Web Services
 
AWS re:Invent 2016: Visualizing Big Data Insights with Amazon QuickSight (BDM...
AWS re:Invent 2016: Visualizing Big Data Insights with Amazon QuickSight (BDM...AWS re:Invent 2016: Visualizing Big Data Insights with Amazon QuickSight (BDM...
AWS re:Invent 2016: Visualizing Big Data Insights with Amazon QuickSight (BDM...
Amazon Web Services
 
Deep Learning in the Cloud at Scale: A Data Orchestration Story
Deep Learning in the Cloud at Scale: A Data Orchestration StoryDeep Learning in the Cloud at Scale: A Data Orchestration Story
Deep Learning in the Cloud at Scale: A Data Orchestration Story
Alluxio, Inc.
 
Getting Started with Big Data and HPC in the Cloud - August 2015
Getting Started with Big Data and HPC in the Cloud - August 2015Getting Started with Big Data and HPC in the Cloud - August 2015
Getting Started with Big Data and HPC in the Cloud - August 2015
Amazon Web Services
 
AWS Compute Overview: Servers, Containers, Serverless, and Batch | AWS Public...
AWS Compute Overview: Servers, Containers, Serverless, and Batch | AWS Public...AWS Compute Overview: Servers, Containers, Serverless, and Batch | AWS Public...
AWS Compute Overview: Servers, Containers, Serverless, and Batch | AWS Public...
Amazon Web Services
 
#lspe Q1 2013 dynamically scaling netflix in the cloud
#lspe Q1 2013   dynamically scaling netflix in the cloud#lspe Q1 2013   dynamically scaling netflix in the cloud
#lspe Q1 2013 dynamically scaling netflix in the cloud
Coburn Watson
 
AWS re:Invent 2016: Dollars and Sense: Technical Tips for Continual Cost Opti...
AWS re:Invent 2016: Dollars and Sense: Technical Tips for Continual Cost Opti...AWS re:Invent 2016: Dollars and Sense: Technical Tips for Continual Cost Opti...
AWS re:Invent 2016: Dollars and Sense: Technical Tips for Continual Cost Opti...
Amazon Web Services
 
Understand AWS Pricing
Understand AWS PricingUnderstand AWS Pricing
Understand AWS Pricing
Lynn Langit
 
Amazon EC2 Instances, Featuring Performance Optimisation Best Practices
Amazon EC2 Instances, Featuring Performance Optimisation Best PracticesAmazon EC2 Instances, Featuring Performance Optimisation Best Practices
Amazon EC2 Instances, Featuring Performance Optimisation Best Practices
Amazon Web Services
 
Tensorflow in production with AWS Lambda
Tensorflow in production with AWS LambdaTensorflow in production with AWS Lambda
Tensorflow in production with AWS Lambda
Fabian Dubois
 
AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...
AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...
AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...
Amazon Web Services
 
AWS Summit Berlin 2013 - Optimizing your AWS applications and usage to reduce...
AWS Summit Berlin 2013 - Optimizing your AWS applications and usage to reduce...AWS Summit Berlin 2013 - Optimizing your AWS applications and usage to reduce...
AWS Summit Berlin 2013 - Optimizing your AWS applications and usage to reduce...
AWS Germany
 
Analytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan FritzAnalytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan Fritz
Databricks
 

What's hot (20)

Ml 3 ways
Ml 3 waysMl 3 ways
Ml 3 ways
 
A Tour of Google Cloud Platform
A Tour of Google Cloud PlatformA Tour of Google Cloud Platform
A Tour of Google Cloud Platform
 
Aws Summit Berlin 2013 - Understanding database options on AWS
Aws Summit Berlin 2013 - Understanding database options on AWSAws Summit Berlin 2013 - Understanding database options on AWS
Aws Summit Berlin 2013 - Understanding database options on AWS
 
Data Science on Google Cloud Platform
Data Science on Google Cloud PlatformData Science on Google Cloud Platform
Data Science on Google Cloud Platform
 
Google Cloud Platform
Google Cloud PlatformGoogle Cloud Platform
Google Cloud Platform
 
Google Cloud Platform: Prototype ->Production-> Planet scale
Google Cloud Platform: Prototype ->Production-> Planet scaleGoogle Cloud Platform: Prototype ->Production-> Planet scale
Google Cloud Platform: Prototype ->Production-> Planet scale
 
AWS re:Invent 2016: Extending Hadoop and Spark to the AWS Cloud (GPST304)
AWS re:Invent 2016: Extending Hadoop and Spark to the AWS Cloud (GPST304)AWS re:Invent 2016: Extending Hadoop and Spark to the AWS Cloud (GPST304)
AWS re:Invent 2016: Extending Hadoop and Spark to the AWS Cloud (GPST304)
 
What's New with Big Data Analytics
What's New with Big Data AnalyticsWhat's New with Big Data Analytics
What's New with Big Data Analytics
 
AWS re:Invent 2016: Visualizing Big Data Insights with Amazon QuickSight (BDM...
AWS re:Invent 2016: Visualizing Big Data Insights with Amazon QuickSight (BDM...AWS re:Invent 2016: Visualizing Big Data Insights with Amazon QuickSight (BDM...
AWS re:Invent 2016: Visualizing Big Data Insights with Amazon QuickSight (BDM...
 
Deep Learning in the Cloud at Scale: A Data Orchestration Story
Deep Learning in the Cloud at Scale: A Data Orchestration StoryDeep Learning in the Cloud at Scale: A Data Orchestration Story
Deep Learning in the Cloud at Scale: A Data Orchestration Story
 
Getting Started with Big Data and HPC in the Cloud - August 2015
Getting Started with Big Data and HPC in the Cloud - August 2015Getting Started with Big Data and HPC in the Cloud - August 2015
Getting Started with Big Data and HPC in the Cloud - August 2015
 
AWS Compute Overview: Servers, Containers, Serverless, and Batch | AWS Public...
AWS Compute Overview: Servers, Containers, Serverless, and Batch | AWS Public...AWS Compute Overview: Servers, Containers, Serverless, and Batch | AWS Public...
AWS Compute Overview: Servers, Containers, Serverless, and Batch | AWS Public...
 
#lspe Q1 2013 dynamically scaling netflix in the cloud
#lspe Q1 2013   dynamically scaling netflix in the cloud#lspe Q1 2013   dynamically scaling netflix in the cloud
#lspe Q1 2013 dynamically scaling netflix in the cloud
 
AWS re:Invent 2016: Dollars and Sense: Technical Tips for Continual Cost Opti...
AWS re:Invent 2016: Dollars and Sense: Technical Tips for Continual Cost Opti...AWS re:Invent 2016: Dollars and Sense: Technical Tips for Continual Cost Opti...
AWS re:Invent 2016: Dollars and Sense: Technical Tips for Continual Cost Opti...
 
Understand AWS Pricing
Understand AWS PricingUnderstand AWS Pricing
Understand AWS Pricing
 
Amazon EC2 Instances, Featuring Performance Optimisation Best Practices
Amazon EC2 Instances, Featuring Performance Optimisation Best PracticesAmazon EC2 Instances, Featuring Performance Optimisation Best Practices
Amazon EC2 Instances, Featuring Performance Optimisation Best Practices
 
Tensorflow in production with AWS Lambda
Tensorflow in production with AWS LambdaTensorflow in production with AWS Lambda
Tensorflow in production with AWS Lambda
 
AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...
AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...
AWS re:Invent 2016: How Mapbox Uses the AWS Edge to Deliver Fast Maps for Mob...
 
AWS Summit Berlin 2013 - Optimizing your AWS applications and usage to reduce...
AWS Summit Berlin 2013 - Optimizing your AWS applications and usage to reduce...AWS Summit Berlin 2013 - Optimizing your AWS applications and usage to reduce...
AWS Summit Berlin 2013 - Optimizing your AWS applications and usage to reduce...
 
Analytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan FritzAnalytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan Fritz
 

Similar to Cloud Native Data Pipelines

Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)
Sid Anand
 
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese)  - QCon TokyoCloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo
Sid Anand
 
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Sid Anand
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016
Sid Anand
 
Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS
Amazon Web Services
 
Phil Basford - machine learning at scale with aws sage maker
Phil Basford - machine learning at scale with aws sage makerPhil Basford - machine learning at scale with aws sage maker
Phil Basford - machine learning at scale with aws sage maker
AWSCOMSUM
 
Machine learning at scale with aws sage maker
Machine learning at scale with aws sage makerMachine learning at scale with aws sage maker
Machine learning at scale with aws sage maker
PhilipBasford
 
The Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceThe Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open Source
DataWorks Summit/Hadoop Summit
 
Airflow @ Agari
Airflow @ Agari Airflow @ Agari
Airflow @ Agari
Sid Anand
 
Big Data on AWS
Big Data on AWSBig Data on AWS
Big Data on AWS
Amazon Web Services
 
Resilient Predictive Data Pipelines (QCon London 2016)
Resilient Predictive Data Pipelines (QCon London 2016)Resilient Predictive Data Pipelines (QCon London 2016)
Resilient Predictive Data Pipelines (QCon London 2016)
Sid Anand
 
Spark logs made easy
Spark logs made easySpark logs made easy
Spark logs made easy
Simona Meriam
 
London Redshift Meetup - July 2017
London Redshift Meetup - July 2017London Redshift Meetup - July 2017
London Redshift Meetup - July 2017
Pratim Das
 
(CMP403) AWS Lambda: Simplifying Big Data Workloads
(CMP403) AWS Lambda: Simplifying Big Data Workloads(CMP403) AWS Lambda: Simplifying Big Data Workloads
(CMP403) AWS Lambda: Simplifying Big Data Workloads
Amazon Web Services
 
Creating a scalable & cost efficient BI infrastructure for a startup in the A...
Creating a scalable & cost efficient BI infrastructure for a startup in the A...Creating a scalable & cost efficient BI infrastructure for a startup in the A...
Creating a scalable & cost efficient BI infrastructure for a startup in the A...
vcrisan
 
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
Amazon Web Services
 
Data analytics master class: predict hotel revenue
Data analytics master class: predict hotel revenueData analytics master class: predict hotel revenue
Data analytics master class: predict hotel revenue
Kris Peeters
 
DCEU 18: From Legacy Mainframe to the Cloud: The Finnish Railways Evolution w...
DCEU 18: From Legacy Mainframe to the Cloud: The Finnish Railways Evolution w...DCEU 18: From Legacy Mainframe to the Cloud: The Finnish Railways Evolution w...
DCEU 18: From Legacy Mainframe to the Cloud: The Finnish Railways Evolution w...
Docker, Inc.
 
Using Grid Technologies in the Cloud for High Scalability
Using Grid Technologies in the Cloud for High ScalabilityUsing Grid Technologies in the Cloud for High Scalability
Using Grid Technologies in the Cloud for High Scalability
mabuhr
 
AWS re:Invent 2016: State of the Union: Containers (CON316)
AWS re:Invent 2016: State of the Union:  Containers (CON316)AWS re:Invent 2016: State of the Union:  Containers (CON316)
AWS re:Invent 2016: State of the Union: Containers (CON316)
Amazon Web Services
 

Cloud Native Data Pipelines

  • 1. Cloud Native Data Pipelines. Sid Anand (@r39132), Cloud Data Next 2017
  • 2. About Me: Work[ed|s] @ (logos); Committer & PPMC on Apache Airflow; Father of 2; Co-Chair for (logos)
  • 4. Agari: What We Do
  • 9. Agari: What We Do. Agari’s Previous EP Version (Batch): Enterprise Customers → email metadata → apply trust models → email md + trust score
  • 10. Agari: What We Do. Agari’s Current EP Version (Near-real time): Enterprise Customers → email metadata → apply trust models → email md + trust score → Quarantine, Label, PassThrough
  • 12. Cloud Native Data Pipelines. Big Data companies like LinkedIn, Facebook, Twitter, & Google have large teams to manage their data pipelines (100s of engineers). Most start-ups have small teams (10s of engineers) & run in the public cloud. Can they leverage aspects of the public cloud to build comparable pipelines?
  • 13. Cloud Native Data Pipelines. Cloud Native Techniques, Open Source Technologies ~ Data Pipelines seen in Big Data companies
  • 14. Design Goals: Desirable Qualities of a Resilient Data Pipeline
  • 15. Desirable Qualities of a Resilient Data Pipeline: Operability, Correctness, Timeliness, Cost
  • 16. Desirable Qualities of a Resilient Data Pipeline. Correctness: • Data Integrity (no loss, etc…) • Expected data distributions. Timeliness: • All output within time-bound SLAs. Operability: • Minimize Operational Fatigue / Automate Everything • Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs • Quick Recoverability. Cost: • Pay-as-you-go
  • 17. Quickly Recoverable: • Bugs happen! • Bugs in Predictive Data Pipelines have a large blast radius • Optimize for MTTR
  • 18. Predictive Analytics @ Agari: Use Cases
  • 19. Use Cases. Apply trust models (message scoring): batch + near real time. Build trust models: batch. (Enterprise Protect) Focus of this talk
  • 20. Use-Case: Message Scoring (batch). Batch Pipeline Architecture
  • 21. Use-Case: Message Scoring. Each enterprise (A, B, C) uploads an Avro file to S3 every 15 minutes
  • 22. Use-Case: Message Scoring. Airflow kicks off a Spark message scoring job every hour (EMR)
  • 23. Use-Case: Message Scoring. The Spark job writes scored messages and stats to another S3 bucket
  • 24. Use-Case: Message Scoring. This triggers SNS/SQS message events
  • 25. Use-Case: Message Scoring. An Autoscale Group (ASG) of Importers spins up when it detects SQS messages
  • 26. Use-Case: Message Scoring. The importers rapidly ingest scored messages and aggregate statistics into the DB
  • 27. Use-Case: Message Scoring. Users receive alerts of untrusted emails & can review them in the web app
  • 28. Use-Case: Message Scoring. Airflow manages the entire process (diagram: enterprises A, B, C → S3 → S3 → SNS → SQS → Importers ASG → DB)
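The batch flow on slides 21-28 is a chain of dependent steps, which is the kind of structure a workflow engine orders and runs. The sketch below is not Airflow code; it uses only Python's standard-library `graphlib` (Python 3.9+) to show the dependency-ordering idea. The step names are hypothetical labels for the stages named on the slides.

```python
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps that must finish first.
# Step names are hypothetical labels for the stages described on slides 21-27.
pipeline = {
    "upload_avro_to_s3": set(),
    "spark_score_messages": {"upload_avro_to_s3"},
    "write_scores_to_s3": {"spark_score_messages"},
    "notify_sns_sqs": {"write_scores_to_s3"},
    "import_into_db": {"notify_sns_sqs"},
    "alert_users": {"import_into_db"},
}


def run_order(dag):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)
for step in order:
    print(step)
```

A real workflow engine adds scheduling, retries, and monitoring on top of this ordering; the sketch only shows why describing the pipeline as a graph in code is convenient.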
  • 29. Architectural Components (columns: Component / Role; Uses; Salient Features; Operability Model)
      • Data Lake (S3): all data stored in S3, all processing uses S3; Scalable, Available, Performant; Serverless
      • Messaging (SNS, SQS): Reliable, Transactional, Pub/Sub; Scalable, Available, Performant; Serverless
      • General Processing (ASG): used for importing, data cleansing, business logic; Scalable, Available, Performant; Managed
      • Data Science Processing: Aggregation, Model Building, Scoring; nice programming model at the cost of debugging complexity; We Operate
      • Workflow Engine: coordinates all Spark jobs & complex flows; Lightweight, DAGs as Code, Steep learning curve; We Operate
      • Persistence for WebApp (DB): holds subset of data needed for Web App; Rails + Postgres, ‘nuff said; We Operate
  • 30. Tackling Cost & Timeliness: Leveraging the AWS Cloud
  • 31. Tackling Cost (Between Daily Runs vs. During Daily Runs). When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR
  • 32. Tackling Cost (Between Hourly Runs vs. During Hourly Runs). When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR. This does not help when runs are hourly, since AWS charges at an hourly rate for EC2 instances!
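The arithmetic behind slides 31-32 can be sketched as follows. It assumes, as the slide states, that each instance is billed in whole hours; the 20-minute run length is a hypothetical figure chosen for illustration, and runs are assumed not to overlap.

```python
import math


def billed_instance_hours(run_minutes, runs_per_day, instances=1):
    """Instance-hours billed per day when each run is rounded up to whole hours.

    Assumes instances are launched for each run and terminated afterwards,
    and that runs do not overlap.
    """
    hours_per_run = math.ceil(run_minutes / 60)
    return hours_per_run * runs_per_day * instances


# Hypothetical 20-minute job on a single instance.
daily = billed_instance_hours(run_minutes=20, runs_per_day=1)
hourly = billed_instance_hours(run_minutes=20, runs_per_day=24)
print(daily, hourly)
```

Under these assumptions the daily schedule is billed for 1 of 24 hours, while the hourly schedule is billed for all 24, so shutting instances down between hourly runs saves nothing.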
  • 34. ASG - Overview. What is it? • A means to automatically scale out/in clusters to handle variable load/traffic • A means to keep a cluster/service of a fixed size always up
  • 35. ASG - Data Pipeline: SQS → Importer ASG (importers, scale out/in) → DB
  • 36. ASG: CPU-based. CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant (chart series: Sent, CPU, ACKd/Recvd)
  • 37. ASG: CPU-based (chart series: Sent, CPU, Recv). Premature Scale-in: • The CPU drops to noise levels before all messages are consumed • This causes scale-in to occur while the last few messages are still being committed
  • 38. ASG: Queue-based. Scale-out: when Visible Messages > 0 (a.k.a. when queue depth > 0); this causes the ASG to grow. Scale-in: when Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d); this causes the ASG to shrink
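The queue-based rule on slide 38 can be written as a small decision function. This is a sketch, not Agari's implementation: the function name and return values are hypothetical, and giving scale-out precedence when both conditions hold is an assumption. In SQS, the two counts would plausibly come from the `ApproximateNumberOfMessages` (visible) and `ApproximateNumberOfMessagesNotVisible` (in-flight) queue attributes.

```python
def scaling_action(visible, in_flight):
    """Decide what the ASG should do from two queue counters.

    visible   -- messages waiting in the queue (queue depth)
    in_flight -- messages received by a consumer but not yet ACK'd
    """
    if visible > 0:
        # Work is waiting: grow the group.
        return "scale_out"
    if in_flight == 0:
        # Nothing waiting and nothing being processed: safe to shrink.
        return "scale_in"
    # Queue is empty but consumers are still committing: do nothing yet.
    return "hold"


print(scaling_action(visible=5, in_flight=0))
print(scaling_action(visible=0, in_flight=3))
print(scaling_action(visible=0, in_flight=0))
```

The `hold` branch is what avoids the premature scale-in described on slide 37: instances are kept until the last in-flight message has been acknowledged, regardless of CPU.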
  • 40. ASG - Build & Deploy (columns: Component; Role; Details)
      • Terraform: spins up Cloud Resources; spins up SQS, Kinesis, EC2, ASG, ELB, etc. and associates them
      • Ansible: sets up an EC2 instance; a better version of Chef & Puppet; agentless, idempotent, & declarative tool to set up EC2 instances by installing & configuring packages, and more
      • Packer: spins up an EC2 instance for the purposes of building an AMI! Can be used with Ansible & Terraform to bake AMIs & launch Auto-Scaling Groups
  • 41. ASG - Build & Deploy. Step 1: Packer spins up a temporary EC2 node - a blank canvas!
  • 42. ASG - Build & Deploy. Step 1: Packer spins up a temporary EC2 node - a blank canvas! Step 2: Packer runs an Ansible role against the EC2 node to set it up.
  • 43. ASG - Build & Deploy. Step 1: Packer spins up a temporary EC2 node - a blank canvas! Step 2: Packer runs an Ansible role against the EC2 node to set it up. Step 3: Snapshots the machine & registers the AMI.
  • 44. ASG - Build & Deploy. Step 1: Packer spins up a temporary EC2 node - a blank canvas! Step 2: Packer runs an Ansible role against the EC2 node to set it up. Step 3: Snapshots the machine & registers the AMI. Step 4: Terminates the EC2 instance!
  • 45. ASG - Build & Deploy. Step 1: Packer spins up a temporary EC2 node - a blank canvas! Step 2: Packer runs an Ansible role against the EC2 node to set it up. Step 3: Snapshots the machine & registers the AMI. Step 4: Terminates the EC2 instance! Step 5: Using the AMI, Terraform spins up an auto-scaled compute cluster (ASG)
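Steps 1-4 above correspond to a single Packer build: a builder that launches the temporary instance and an Ansible provisioner that configures it, after which Packer registers the AMI and terminates the instance. The sketch below generates a legacy JSON-style Packer template from Python. The region, AMI ID, instance type, SSH user, playbook path, and AMI name are all placeholder assumptions, not values from the talk.

```python
import json


def packer_template(source_ami, playbook, ami_name):
    """Build a minimal Packer template (legacy JSON format) as a dict."""
    return {
        "builders": [{
            "type": "amazon-ebs",         # launches a temporary EC2 instance (step 1)
            "region": "us-west-2",        # placeholder region
            "source_ami": source_ami,     # base image to start from
            "instance_type": "t2.micro",  # placeholder instance size
            "ssh_username": "ubuntu",     # placeholder login user
            "ami_name": ami_name,         # name of the AMI registered in step 3
        }],
        "provisioners": [{
            "type": "ansible",            # runs the playbook against the node (step 2)
            "playbook_file": playbook,
        }],
    }


# All three arguments are hypothetical values.
template = packer_template("ami-00000000", "importer.yml", "importer-ami")
print(json.dumps(template, indent=2))
```

Step 5 is outside this sketch: the resulting AMI ID would be passed to a Terraform configuration that defines the launch settings and the auto-scaling group.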
  • 46. Desirable Qualities of a Resilient Data Pipeline: Operability, Correctness, Timeliness, Cost. • ASG • EMR Spark Daily • ASG • EMR Spark Hourly ASG • No Cost Savings
  • 48. Tackling Operability: Requirements. • A simple way to author, configure, manage workflows • Provides visual insight into the state & performance of workflow runs • Integrates with our alerting and monitoring tools
  • 50. Apache Airflow - Authoring DAGs. Airflow: Author DAGs in Python! No need to bundle many config files!
  • 51. Apache Airflow - Authoring DAGs. Airflow: Visualizing a DAG
  • 52. Apache Airflow - Perf. Insights. Airflow: Gantt chart view reveals the slowest tasks for a run!
  • 53. Apache Airflow - Perf. Insights. Airflow: Task Duration chart view shows task completion time trends!
  • 54. Apache Airflow - Alerting. Airflow: …And easy to integrate with Ops tools!
  • 55. Desirable Qualities of a Resilient Data Pipeline: Operability, Correctness, Timeliness, Cost
  • 56. Use-Case: Message Scoring (near-real time). NRT Pipeline Architecture
  • 57. Use-Case: Message Scoring. Kinesis batch put every second (diagram: enterprises A, B, C → Kinesis)
  • 58. Use-Case: Message Scoring. An ASG of scorers is scaled up to one process per core per Kinesis shard
  • 59. Use-Case: Message Scoring. Scorers apply the trust model and send scored messages downstream
  • 60. Use-Case: Message Scoring. An ASG of importers is scaled up to rapidly import messages
  • 61. Use-Case: Message Scoring. Imported messages are also consumed by the alerter
  • 62. Use-Case: Message Scoring. Imported messages are also consumed by the alerter (diagram: enterprises A, B, C → Kinesis → Scorers ASG → Kinesis → Importers ASG → DB; Kinesis → Alerters ASG → Quarantine Email)
  • 63. Stream Processing Architecture (columns: Component / Role; Details; Pros; Operability Model)
      • Data Lake (S3): all data stored in S3 via Kinesis Firehose; Scalable, Available, Performant, Serverless; Serverless
      • Messaging (Kinesis): streaming transport modeled on Kafka; Scalable, Available, Serverless; Serverless
      • General Processing: ASG replacement except for Rails apps; Scalable, Available, Serverless; Serverless
      • General Processing (ASG): used for importing, data cleansing, business logic; Scalable, Available, Managed; Managed
      • Data Science Processing: Model Building; We Operate
      • Workflow Engine: nightly model builds + some classic Ops cron workloads; Lightweight, DAGs as Code; We Operate
      • Persistence for WebApp (DB): holds smaller subset of data needed for Web App; Rails + Postgres, ‘nuff said; We Operate
      • Persistence for WebApp: Aggregation + Search moved from DB to ES, Model Building queries moved to Elasticache Redis; faster, more accurate for aggregates, frees up headroom for DB (polyglot persistence); Managed
  • 66. What is Avro? Avro is a self-describing serialization format that supports • primitive data types: int, long, boolean, float, string, bytes, etc… • complex data types: records, arrays, unions, maps, enums, etc… • many language bindings: Java, Scala, Python, Ruby, etc…
  • 67. What is Avro? Avro is a self-describing serialization format that supports • primitive data types: int, long, boolean, float, string, bytes, etc… • complex data types: records, arrays, unions, maps, enums, etc… • many language bindings: Java, Scala, Python, Ruby, etc… It is the most common format for storing structured Big Data at rest in HDFS, S3, Google Cloud Storage, etc… Supports Schema Evolution!
  • 68. Apache Avro: Why is it useful?
  • 69. 69 Why is Avro Useful? Agari is an IoT company! Agari Sensors, deployed at customer sites, stream data to Agari’s Cloud SAAS. Data is sent via Kinesis! (Diagram: enterprise A, enterprise B, enterprise C → Kinesis → Agari SAAS in AWS)
  • 70. 70 Why is Avro Useful? Agari is an IoT company! Agari Sensors, deployed at customer sites, stream data to Agari’s Cloud SAAS. Data is sent via Kinesis! At any point in time, customers run different versions of the Agari Sensor. (Diagram: enterprise A : v1, enterprise B : v2, enterprise C : v3 → Kinesis → Agari SAAS in AWS)
  • 71. 71 Why is Avro Useful? Agari is an IoT company! Agari Sensors, deployed at customer sites, stream data to Agari’s Cloud SAAS. Data is sent via Kinesis! At any point in time, customers run different versions of the Agari Sensor. These Sensors might send different format versions of the data! (Diagram: enterprise A : v1, enterprise B : v2, enterprise C : v3 → Kinesis → Agari SAAS in AWS)
  • 72. 72 Why is Avro Useful? Agari is an IoT company! Agari Sensors, deployed at customer sites, stream data to Agari’s Cloud SAAS. Data is sent via Kinesis! At any point in time, customers run different versions of the Agari Sensor. These Sensors might send different format versions of the data! (Diagram: enterprise A : v1, enterprise B : v2, enterprise C : v3 → Kinesis → Agari SAAS in AWS : v4)
  • 73. 73 Why is Avro Useful? Avro allows Agari to seamlessly handle different IoT data format versions: datum_reader = DatumReader(writers_schema=writers_schema, readers_schema=readers_schema). Requirements: • Schemas are backward-compatible. (Diagram: enterprise A : v1, enterprise B : v2, enterprise C : v3 → Kinesis → Agari SAAS in AWS : v4)
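The writer/reader schema pairing on slide 73 can be illustrated without the Avro library. The sketch below is a simplified, dict-based imitation of how Avro resolves records: fields are matched by name, writer-only fields are dropped, and reader-only fields are filled from their defaults. The `resolve_record` helper and both schemas are hypothetical, not Agari's actual code or schemas, and real Avro resolution also covers type promotion, aliases, and nested types.

```python
# Simplified illustration of Avro-style schema resolution for flat records.
# This is NOT the avro library: resolve_record and both schemas are hypothetical.

WRITER_SCHEMA_V1 = {
    "type": "record", "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "legacy_flag", "type": "boolean"},
    ],
}

READER_SCHEMA_V4 = {
    "type": "record", "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        # New in the reader's version; the default keeps old data readable.
        {"name": "favorite_color", "type": ["null", "string"], "default": None},
    ],
}


def resolve_record(datum, writers_schema, readers_schema):
    """Project a record written with writers_schema onto readers_schema."""
    writer_fields = {f["name"] for f in writers_schema["fields"]}
    resolved = {}
    for field in readers_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            resolved[name] = datum[name]          # present in both schemas
        elif "default" in field:
            resolved[name] = field["default"]     # reader-only field with default
        else:
            raise ValueError(f"reader field {name!r} was not written and has no default")
    return resolved                               # writer-only fields are dropped


v1_datum = {"name": "Alyssa", "legacy_flag": True}
print(resolve_record(v1_datum, WRITER_SCHEMA_V1, READER_SCHEMA_V4))
```

The `ValueError` branch is the case the backward-compatibility requirement rules out: a reader field that old writers never produced needs a default.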
  • 74. 74 Why is Avro Useful? Avro Everywhere! Avro is so useful, we don’t just use it to communicate between our Sensors & our SAAS infrastructure. We also use it as the common data-interchange format between all services (streaming & batch) within our AWS deployment. (Diagram labels: Agari SAAS in AWS; S1, S2, S3, s3, Spark)
  • 75. 75 Why is Avro Useful? Avro Everywhere! Good Language Bindings : Data Pipelines services are written in Java, Ruby, & Python. (Diagram labels: Agari SAAS in AWS; S1, S2, S3, s3, Spark)
  • 77. 77 Avro Schema Example: {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } Annotations: complex type (record); Schema name : User; 3 fields in the record: 1 required, 2 optional
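The "1 required, 2 optional" annotation on slide 77 follows from the field types: a field whose type is a union that includes "null" may be absent. A minimal sketch that derives this from the schema JSON with the standard library only; `split_fields` is a hypothetical helper, not part of any Avro binding.

```python
import json

# The schema from the slide, as a JSON string.
SCHEMA_JSON = """
{"namespace": "agari", "type": "record", "name": "User",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "favorite_number", "type": ["int", "null"]},
   {"name": "favorite_color", "type": ["string", "null"]}
 ]}
"""


def split_fields(schema):
    """Return (required, optional) field names.

    A field counts as optional when its type is a union (a JSON list)
    that includes "null".
    """
    required, optional = [], []
    for field in schema["fields"]:
        ftype = field["type"]
        is_optional = isinstance(ftype, list) and "null" in ftype
        (optional if is_optional else required).append(field["name"])
    return required, optional


schema = json.loads(SCHEMA_JSON)
required, optional = split_fields(schema)
print(schema["type"], schema["name"], required, optional)
```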
  • 78. 78 Avro Schema Data File Example: {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } followed by Data x 1,000,000,000. Schema: 0.0001 %; Data: 99.999 %
  • 79. 79 Avro Schema Streaming Example: {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } followed by one Binary Data block. Schema: 99 %; Data: 1 %
  • 80. 80 Avro Schema Streaming Example: {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } followed by one Binary Data block. Schema: 99 %; Data: 1 %. OVERHEAD!!
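The overhead on slides 79 and 80 can be checked with a little arithmetic: encode one record in Avro's binary encoding and compare its size with the size of the schema sent alongside it. The sketch below hand-rolls only the parts of the encoding this schema needs (zigzag varints, length-prefixed strings, union branch indexes). The sample record is hypothetical, and the exact ratio depends on the record and on how the schema JSON is serialized, so it will not match the slide's 99 % / 1 % exactly.

```python
import json

SCHEMA = {
    "namespace": "agari", "type": "record", "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]},
    ],
}


def encode_long(n):
    """Avro int/long: zigzag encoding, then base-128 varint, low bits first."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Avro string: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_user(name, favorite_number=None, favorite_color=None):
    """Encode one User record; unions are a branch index plus the value."""
    out = encode_string(name)
    if favorite_number is None:
        out += encode_long(1)                      # branch 1 = "null", no value bytes
    else:
        out += encode_long(0) + encode_long(favorite_number)
    if favorite_color is None:
        out += encode_long(1)
    else:
        out += encode_long(0) + encode_string(favorite_color)
    return out


payload = encode_user("Alyssa", 256, None)         # hypothetical sample record
schema_bytes = json.dumps(SCHEMA, separators=(",", ":")).encode("utf-8")
overhead = len(schema_bytes) / (len(schema_bytes) + len(payload))
print(f"schema={len(schema_bytes)} B, data={len(payload)} B, schema share={overhead:.0%}")
```

In a data file the same schema is written once and amortized over every record, which is why the proportions flip between slide 78 and slide 79.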
  • 82. 82 Avro Schema Registry: the Message Producer (P) calls register_schema on the Schema Registry (Lambda) with the schema {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] }
  • 83. 83 Avro Schema Registry: register_schema on the Schema Registry (Lambda) returns a UUID to the Message Producer (P)
  • 84. 84 Avro Schema Registry: the Message Producer (P) sends UUID + Data to the Message Consumer (C). (Diagram also shows: Schema Registry (Lambda))
  • 86. 86 Avro Schema Registry: the Message Producer (P) sends Data to the Message Consumer (C); the Message Consumer (C) calls getSchemaById (UUID) on the Schema Registry (Lambda), which returns {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] }
  • 87. 87 Avro Schema Registry: the Message Consumer (C) calls getSchemaById (UUID) on the Schema Registry (Lambda), which returns {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } Message Consumers • download & cache the schema • then decode the data
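The registry flow on slides 82 to 87 (register a schema, get a UUID back, ship UUID + data, look the schema up once and cache it) can be sketched in memory. Everything here is a hypothetical stand-in: `InMemorySchemaRegistry` replaces the Lambda-backed service, the message is a plain dict rather than a Kinesis record, and the payload is JSON rather than Avro binary to keep the example dependency-free.

```python
import json
import uuid

USER_SCHEMA = {
    "namespace": "agari", "type": "record", "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]},
    ],
}


class InMemorySchemaRegistry:
    """Hypothetical stand-in for the Lambda-backed schema registry."""

    def __init__(self):
        self._schemas_by_id = {}
        self._ids_by_schema = {}
        self.lookups = 0  # counts get_schema_by_id calls, to show caching works

    def register_schema(self, schema):
        """Return a UUID for the schema; re-registering returns the same UUID."""
        canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
        if canonical not in self._ids_by_schema:
            schema_id = str(uuid.uuid4())
            self._ids_by_schema[canonical] = schema_id
            self._schemas_by_id[schema_id] = canonical
        return self._ids_by_schema[canonical]

    def get_schema_by_id(self, schema_id):
        self.lookups += 1
        return json.loads(self._schemas_by_id[schema_id])


class Consumer:
    """Downloads and caches schemas, then decodes each message."""

    def __init__(self, registry):
        self._registry = registry
        self._cache = {}

    def schema_for(self, schema_id):
        if schema_id not in self._cache:
            self._cache[schema_id] = self._registry.get_schema_by_id(schema_id)
        return self._cache[schema_id]

    def receive(self, message):
        schema = self.schema_for(message["schema_id"])
        # A real consumer would Avro-decode the bytes using this schema;
        # JSON is used here only for brevity.
        record = json.loads(message["payload"])
        return schema["name"], record


# Producer side: register once, then send UUID + data with every message.
registry = InMemorySchemaRegistry()
schema_id = registry.register_schema(USER_SCHEMA)
messages = [
    {"schema_id": schema_id,
     "payload": json.dumps({"name": f"user{i}", "favorite_number": i,
                            "favorite_color": None}).encode("utf-8")}
    for i in range(3)
]

# Consumer side: three messages, one registry lookup.
consumer = Consumer(registry)
for message in messages:
    print(consumer.receive(message))
print("registry lookups:", registry.lookups)
```

Sending a short identifier instead of the full schema is what removes the per-message overhead shown for the streaming case, at the cost of one registry round trip per new schema per consumer.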
  • 88. 88 Avro Schema Registry. Imported messages are also consumed by the alerter. (Diagram labels: enterprise A, enterprise B, enterprise C; Kinesis; Importers ASG; K; Scorers ASG; K; Alerters ASG; DB; SR, SR, SR)
  • 90. Acknowledgments 90 None of this work would be possible without the essential contributions of the team below: • Vidur Apparao • Stephen Cattaneo • Jon Chase • Andrew Flury • William Forrester • Chris Haag • Chris Buchanan • Neil Chapin • Wil Collins • Don Spencer • Scot Kennedy • Natia Chachkhiani • Patrick Cockwell • Kevin Mandich • Gabriel Ortiz • Jacob Rideout • Josh Yang • Julian Mehnle • Gabriel Poon • Spencer Sun • Nathan Bryant