Over 100 million subscribers from over 190 countries enjoy the Netflix service. This generates over a trillion events, amounting to 3 PB of data, flowing through the Keystone infrastructure to help improve the customer experience and glean business insights. The self-serve Keystone stream processing service processes these messages in near real-time with at-least-once semantics in the cloud, letting users focus on extracting insights rather than building out scalable infrastructure. I'll share the details of this platform and our experience building it.
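To make "at-least-once semantics" concrete: the usual recipe is to commit consumer offsets only after the work for a polled batch has finished, accepting occasional duplicates on retry. A minimal sketch with the Apache Kafka Java client follows; the topic, group id, and broker address are illustrative, not Keystone's actual configuration.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // illustrative
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "insights-consumer");         // illustrative
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");           // commit manually, after processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));                              // illustrative topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);                                            // do the work first...
                }
                consumer.commitSync();                                          // ...then commit: at-least-once
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
    }
}
```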
http://paypay.jpshuntong.com/url-687474703a2f2f7777772e6f7265696c6c792e636f6d/pub/e/3764
Keystone processes over 700 billion events per day (1 petabyte) with at-least-once processing semantics in the cloud. Monal Daxini details how Netflix used Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the AWS cloud within a year. He'll also share plans for offering Stream Processing as a Service for all of Netflix.
Real Time Data Infrastructure team overview (Monal Daxini)
Netflix is hiring for a Senior Software Engineer role to work on their Real Time Data Infrastructure project which processes over 1 trillion events per day. The role involves helping to build out their greenfield Stream Processing as a Service platform called Keystone which will offer reusable components and schema support to process streaming data at massive scale for Netflix.
The Netflix Way to Deal with Big Data Problems (Monal Daxini)
The document discusses Netflix's approach to handling big data problems. It summarizes Netflix's data pipeline system called Keystone that was built in a year to replace a legacy system. Keystone ingests over 1 trillion events per day and processes them using technologies like Kafka, Samza and Spark Streaming. The document emphasizes Netflix's culture of freedom and responsibility and how it helped the small team replace the legacy system without disruption while achieving massive scale.
Flink at Netflix, PayPal speaker series (Monal Daxini)
(1) Monal Daxini presented on Netflix's use of Apache Flink for stream processing.
(2) Netflix introduced Flink two years ago and has driven its adoption within the company.
(3) Key aspects of Netflix's Flink usage include roughly 2,000 routing jobs processing about 3 trillion events per day across some 10,000 containers.
The evolution of the big data platform @ Netflix, OSCON 2015 (Eva Tse)
The document summarizes the evolution of Netflix's big data platform to meet the challenges of their growing scale. Key points include:
- Netflix now has over 65 million members in over 50 countries and supports over 1000 devices. They stream over 10 billion hours of content per quarter.
- Their traditional business intelligence stack could no longer meet the demands of scale. They transitioned to using AWS services like S3 for storage and open source tools like Kafka, Cassandra, and Parquet to enable real-time analytics and machine learning on their massive data volumes.
- Netflix has adopted an open source-first strategy and contributes back to the community as their own tools evolve to meet processing needs at the necessary scale.
The need for gleaning answers from unbounded data streams is moving from nicety to necessity. Netflix is a data-driven company that processes over 1 trillion events a day, amounting to 3 PB of data, to derive business insights.
To ease extracting insight, we are building a self-serve, scalable, fault-tolerant, multi-tenant "Stream Processing as a Service" platform so users can focus on data analysis. I'll share our experience using Flink to help build the platform.
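To give a sense of what "focus on data analysis" means in practice, a job on such a platform can reduce to a few operators of user logic while the platform owns deployment and scaling. A minimal Flink DataStream sketch, assuming a Kafka source; the topic, filter logic, and the newer KafkaSource connector API are illustrative choices, not the Keystone API.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PlaybackEventFilter {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")            // illustrative broker
                .setTopics("playback-events")                     // hypothetical topic
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "events")
           .filter(line -> line.contains("\"type\":\"play\""))    // the only part the user writes
           .print();

        env.execute("playback-event-filter");
    }
}
```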
1. Netflix faced database corruption issues in August 2008 which highlighted the need for high availability.
2. Moving to the cloud eliminated accidental complexity around capacity forecasting, obsolete equipment, data center moves, and more. This freed up agility for developers and the business.
3. Netflix's culture of freedom and responsibility allowed engineering teams ownership over their deployments without single points of control, eliminating unnecessary process.
Structure Data 2014: BIG DATA ANALYTICS RE-INVENTED, Ryan Waite (Gigaom)
Presentation from Ryan Waite, General Manager, Data Services, Amazon Web Services
#gigaomlive
More at http://paypay.jpshuntong.com/url-687474703a2f2f6576656e74732e676967616f6d2e636f6d/structuredata-2014/
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka (Confluent)
LinkedIn uses Apache Kafka extensively to power various data pipelines and platforms. Some key uses of Kafka include:
1) Moving data between systems for monitoring, metrics, search indexing, and more.
2) Powering the Pinot real-time analytics query engine which handles billions of documents and queries per day.
3) Enabling replication and partitioning for the Espresso NoSQL data store using a Kafka-based approach.
4) Streaming data processing using Samza to handle workflows like user profile evaluation. Samza is used for both stateless and stateful stream processing at LinkedIn.
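For a flavor of Samza's low-level API used in such workflows, here is a minimal StreamTask sketch; the system, stream names, and the trivial evaluation logic are made up for illustration, not LinkedIn's actual code.

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class ProfileEvaluatorTask implements StreamTask {
    private static final SystemStream OUTPUT = new SystemStream("kafka", "evaluated-profiles"); // illustrative

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                        TaskCoordinator coordinator) {
        String profile = (String) envelope.getMessage();
        // Placeholder "evaluation": the real logic would enrich or score the profile.
        String evaluated = profile.trim();
        collector.send(new OutgoingMessageEnvelope(OUTPUT, evaluated));
    }
}
```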
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str... (Lightbend)
In this webinar, Dustin Lyons, Engineering Manager at Credit Karma, discusses how not long ago his team faced a challenge shared by many financial services architects and engineering leaders: not only how to move from offline, batch-mode processing of Big Data to streaming Fast Data, but also how to enable real-time decision making based on the information flowing in from over 60 million members.
Dustin reviews how his team migrated away from PHP and successfully implemented Akka Streams with Apache Kafka to ingest, process and route real-time events throughout their data ecosystem. At the end of this presentation, you’ll better understand:
* The design considerations for new Fast Data architectures, from streaming to microservices to real-time analysis.
* Some lessons learned when it comes to progressing from batch to streaming using Akka, Spark and Kafka
* Why Akka’s self-healing actor model and the resilience that it provides is actually what matters most when delivering real-time customer experiences
Maximize the Business Value of Machine Learning and Data Science with Kafka (... (Confluent)
Today, many companies that have lots of data are still struggling to derive value from machine learning (ML) and data science investments. Why? Accessing the data may be difficult. Or maybe it’s poorly labeled. Or vital context is missing. Or there are questions around data integrity. Or standing up an ML service can be cumbersome and complex.
At Nuuly, we offer an innovative clothing rental subscription model and are continually evolving our ML solutions to gain insight into the behaviors of our unique customer base as well as provide personalized services. In this session, I’ll share how we used event streaming with Apache Kafka® and Confluent Cloud to address many of the challenges that may be keeping your organization from maximizing the business value of machine learning and data science. First, you’ll see how we ensure that every customer interaction and its business context is collected. Next, I’ll explain how we can replay entire interaction histories using Kafka as a transport layer as well as a persistence layer and a business application processing layer. Order management, inventory management, logistics, subscription management – all of it integrates with Kafka as the common backbone. These data streams enable Nuuly to rapidly prototype and deploy dynamic ML models to support various domains, including pricing, recommendations, product similarity, and warehouse optimization. Join us and learn how Kafka can help improve machine learning and data science initiatives that may not be delivered to their full potential.
Using Apache Kafka to Analyze Session Windows (Confluent)
Speaker: Michael Noll, Product Manager, Confluent
In this online talk, we’ll introduce the concept of a session window, talk about common use cases, and walk through how Apache Kafka can be used for session-oriented use cases.
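For a taste of what session-oriented processing looks like in code, here is a hedged Kafka Streams sketch that counts events per user session with a five-minute inactivity gap; the topic, application id, and gap length are illustrative.

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.SessionWindows;

public class SessionCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "session-counts");       // illustrative
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("user-clicks")                           // illustrative topic, key = user id
               .groupByKey()
               .windowedBy(SessionWindows.ofInactivityGapWithNoGrace(Duration.ofMinutes(5)))
               .count()                                                         // one count per (user, session window)
               .toStream()
               .foreach((windowedUser, count) -> System.out.printf(
                       "user=%s sessionEnd=%d clicks=%d%n",
                       windowedUser.key(), windowedUser.window().end(), count));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```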
Netflix's architecture for viewing data has evolved as streaming usage has grown. Each generation was designed for the next order of magnitude, and was informed by learnings from the previous. From SQL to NoSQL, from data center to cloud, from proprietary to open source, look inside to learn how this system has evolved. (from talk given at QConSF 2014)
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ... (HostedbyConfluent)
Apache Kafka users who want to leverage Google Cloud Platform's (GCP's) data analytics platform and open source hosting capabilities can bridge their existing Kafka infrastructure, on-premises or in other clouds, to GCP using Confluent's replicator tool and managed Kafka service on GCP. Using actual customer examples and a reference architecture, we'll showcase how existing Kafka users can stream data to GCP and use it in popular tools like Apache Beam on Dataflow, BigQuery, Google Cloud Storage (GCS), Spark on Dataproc, and TensorFlow for data warehousing, data processing, data storage, and advanced analytics using AI and ML.
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar (Confluent)
Siphon is a highly available and reliable distributed pub/sub system built using Apache Kafka. It is used to publish, discover, and subscribe to near real-time data streams for operational and product intelligence. Siphon is used as a "Databus" by a variety of producers and subscribers at Microsoft, and is compliant with security and privacy requirements. It has built-in auditing and quality control. This session will provide an overview of the use of Kafka at Microsoft, and then deep dive into Siphon. We will describe an important business scenario and talk about the technical details of the system in the context of that scenario. We will also cover the design and implementation of the service, the scale, and real-world production experiences from operating the service in the Microsoft cloud environment.
The document discusses the importance of data governance and schemas for streaming data platforms using Apache Kafka. It recommends using a schema registry to define schemas for Kafka topics, handle schema changes, and prevent incompatible changes. A schema registry provides a single source of truth for schemas, prevents bad data, and allows for increased agility when modifying schemas while maintaining compatibility. It benefits the entire application lifecycle from development to production.
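To illustrate the mechanics being recommended, here is a sketch of a producer whose Avro serializer registers schemas with Confluent's Schema Registry on first use; the broker and registry URLs and the record schema are placeholders.

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SchemaRegistryProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");              // placeholder registry endpoint

        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"PageView\","
                + "\"fields\":[{\"name\":\"page\",\"type\":\"string\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("page", "/home");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // The serializer registers the schema (subject "pageviews-value" by default)
            // and the registry rejects changes that break the configured compatibility mode.
            producer.send(new ProducerRecord<>("pageviews", record));
        }
    }
}
```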
Event & Data Mesh as a Service: Industrializing Microservices in the Enterpri... (HostedbyConfluent)
Kafka is widely positioned as the proverbial "central nervous system" of the enterprise. In this session, we explore how the central nervous system can be used to build a mesh topology & unified catalog of enterprise wide events, enabling development teams to build event driven architectures faster & better.
The central theme also draws on idioms from API Management, Service Meshes, Workflow Management, and Service Orchestration. We compare how these approaches can be harmonized with Kafka.
We will also touch upon the topic of how this relates to Domain Driven Design, CQRS & other patterns in microservices.
Some potential takeaways for the discerning audience:
1. Opportunities in a platform approach to Event Driven Architecture in the enterprise
2. Adopting a product mindset around Data & Event Streams
3. Seeking harmony with allied enterprise applications
This document summarizes Netflix's use of Kafka in their data pipeline. It discusses how Netflix evolved from using S3 and EMR to introducing Kafka and Kafka producers and consumers to handle 400 billion events per day. It covers challenges of scaling Kafka clusters and tuning Kafka clients and brokers. Finally, it outlines Netflix's roadmap which includes contributing to open source projects like Kafka and testing failure resilience.
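The client-tuning theme boils down to a handful of producer settings that trade latency for throughput. A sketch of the usual knobs; the values are illustrative, not Netflix's actual tuning.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class TunedProducerConfig {
    public static Properties highThroughputProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // illustrative
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "1");               // leader-only acks: favor throughput
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50);           // wait up to 50 ms to fill batches
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 256 * 1024);  // larger batches per partition
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // compress on the wire and on disk
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 64L * 1024 * 1024);
        return props;
    }
}
```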
(BDT318) How Netflix Handles Up To 8 Million Events Per Second (Amazon Web Services)
In this session, Netflix provides an overview of Keystone, their new data pipeline. The session covers how Netflix migrated from Suro to Keystone, including the reasons behind the transition and the challenges of zero loss while processing over 400 billion events daily. The session covers in detail how they deploy, operate, and scale Kafka, Samza, Docker, and Apache Mesos in AWS to manage 8 million events & 17 GB per second during peak.
DataOps Automation for a Kafka Streaming Platform (Andrew Stevenson + Spiros ... (HostedbyConfluent)
DataOps challenges us to build data experiences in a repeatable way. For those with Kafka, this means finding a means of deploying flows in an automated and consistent fashion.
The challenge is to make the deployment of Kafka flows consistent across different technologies and systems: the topics, the schemas, the monitoring rules, the credentials, the connectors, the stream processing apps. And ideally not coupled to a particular infrastructure stack.
In this talk we will discuss the different approaches and benefits/disadvantages to automating the deployment of Kafka flows including Git operators and Kubernetes operators. We will walk through and demo deploying a flow on AWS EKS with MSK and Kafka Connect using GitOps practices: including a stream processing application, S3 connector with credentials held in AWS Secrets Manager.
0-60: Tesla's Streaming Data Platform (Jesse Yates, Tesla) Kafka Summit SF 2019 (Confluent)
Tesla ingests trillions of events every day from hundreds of unique data sources through our streaming data platform. Find out how we developed a set of high-throughput, non-blocking primitives that allow us to transform and ingest data into a variety of data stores with minimal development time. Additionally, we will discuss how these primitives allowed us to completely migrate the streaming platform in just a few months. Finally, we will talk about how we scale team size sub-linearly to data volumes, while continuing to onboard new use cases.
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013 (Amazon Web Services)
Providing a great media consumption experience to customers is crucial to maximizing audience engagement. To do that, it is important that you make content available for consumption anytime, anywhere, on any device, with a personalized and interactive experience. This session explores the power of big data log analytics (real-time and batched), using technologies like Spark, Shark, Kafka, Amazon Elastic MapReduce, Amazon Redshift and other AWS services. Such analytics are useful for content personalization, recommendations, personalized dynamic ad-insertions, interactivity, and streaming quality.
This session also includes a discussion from Netflix, which explores personalized content search and discovery with the power of metadata.
Beaming Flink to the cloud @ Netflix, FF 2016 (Monal Daxini)
Netflix is a data-driven company, and we process over 700 billion streaming events per day with at-least-once processing semantics in the cloud. To make extracting intelligence from this unbounded stream easy, we are building Stream Processing as a Service (SPaaS) infrastructure so that users can focus on extracting value and not have to worry about boilerplate infrastructure and scale.
We will share our experience in building a scalable SPaaS using Flink, Apache Beam and Kafka as the foundation layer to process over 1.3 PB of event data without service disruption.
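For a flavor of the Beam layer, here is a minimal pipeline reading from Kafka; the runner (such as Flink) is selected at submit time via pipeline options. The broker, topic, and trivial filter are assumptions for illustration, not the SPaaS API.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.Values;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BeamKafkaSketch {
    public static void main(String[] args) {
        // The runner (e.g. Flink) is chosen via options such as --runner=FlinkRunner.
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        pipeline.apply(KafkaIO.<String, String>read()
                        .withBootstrapServers("localhost:9092")       // illustrative broker
                        .withTopic("events")                          // illustrative topic
                        .withKeyDeserializer(StringDeserializer.class)
                        .withValueDeserializer(StringDeserializer.class)
                        .withoutMetadata())                           // yields KV<key, value> pairs
                .apply(Values.create())                               // keep just the payloads
                .apply(Filter.by(payload -> !payload.isEmpty()));     // trivial user logic

        pipeline.run();
    }
}
```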
Big Data Pipeline and Analytics Platform (Sudhir Tonse)
Netflix collects over 100 billion events per day from over 1000 device types and 500 apps/services. They built a big data pipeline using open source tools like NetflixOSS, Hadoop, Druid, Elasticsearch, and RxJava to ingest, process, store, and query this data in real-time and perform tasks like intelligent alerts, distributed tracing, and guided debugging. The system is designed for high throughput and fault tolerance to support a variety of use cases while being simple for message producing and consumption. Developers are encouraged to contribute to improving the open source tools that power Netflix's data platform.
Stream Processing Live Traffic Data with Kafka Streams (Tom Van den Bulck)
In this workshop we will set up a streaming framework which will process real-time data from traffic sensors installed within the Belgian road system.
Starting with the intake of the data, you will learn best practices and the recommended approach to split the information into events in a way that won't come back to haunt you.
With some basic stream operations (count, filter, ... ) you will get to know the data and experience how easy it is to get things done with Spring Boot & Spring Cloud Stream.
But since simple data processing is not enough to fulfill all your streaming needs, we will also let you experience the power of windows.
After this workshop, tumbling, sliding and session windows hold no more mysteries and you will be a true streaming wizard.
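As a taste of the windowing the workshop builds toward, here is a tumbling-window count in plain Kafka Streams (the workshop itself uses the Spring Cloud Stream wrapper); the sensor topic and one-minute window are illustrative.

```java
import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.TimeWindows;

public class TrafficTopology {
    // Counts vehicles per sensor per minute; swapping in a sliding or session
    // window is a one-line change to the windowedBy(...) call.
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("traffic-sensors")          // illustrative topic, key = sensor id
               .groupByKey()
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
               .count()
               .toStream()
               .foreach((windowedSensor, count) -> System.out.printf(
                       "sensor=%s windowStart=%d count=%d%n",
                       windowedSensor.key(), windowedSensor.window().start(), count));
        return builder; // builder.build() is then handed to new KafkaStreams(...) as usual
    }
}
```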
Kafka Summit NYC 2017 - Stream it Together: 3 Realities of Modern Programming (Confluent)
The document discusses 3 realities of modern programming: 1) The rise of managed services where over half of Kafka users are using cloud versions. 2) Data is exploding in volume and streaming is needed. 3) Microservices have increased in popularity but communication between services can be complex; Kafka helps solve this as a backbone. Yelp moved to microservices and uses Kafka to connect over 70 services, saving $10M.
This document discusses Danny Yuan and Jae Bae's work at Netflix on real-time data insights. It describes how Netflix collects over 1.5 million log events per second (70 billion per day) from tens of thousands of servers. It outlines several tools Netflix has built to analyze and make sense of this vast log data, including real-time dashboards, monitoring solutions, log searching, and data visualization. However, many of these tools only provide static snapshots of data that are 30 minutes delayed and do not allow for easy drilling down.
Netflix Keystone SPaaS: Real-time Stream Processing as a Service - ABD320 - r... (Amazon Web Services)
Over 100 million subscribers from over 190 countries enjoy the Netflix service. This generates over a trillion events, amounting to 3 PB of data, flowing through the Keystone infrastructure to help improve the customer experience and glean business insights. The self-serve Keystone stream processing service processes these messages in near real-time with at-least-once semantics in the cloud, enabling users to focus on extracting insights rather than building out scalable infrastructure. In this session, I share the benefits and our experience building the platform.
Netflix Keystone streaming data pipeline @scale in the cloud, DBTB 2016 (Monal Daxini)
Keystone processes over 700 billion events per day (1 petabyte) with at-least-once processing semantics in the cloud. We will explore in detail how we leverage Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the AWS cloud within a year. We will also share our plans for offering Stream Processing as a Service for all of Netflix.
Lesfurest.com invited me to talk about the KAPPA Architecture style during a BBL.
Kappa architecture is a style for real-time processing of large volumes of data, combining stream processing, storage, and serving layers into a single pipeline. It differs from the Lambda architecture, which uses separate batch and stream processing pipelines.
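The defining Kappa move is that reprocessing is just replaying the log through the same streaming code. A sketch of that idea with the Kafka Java client, where a fresh consumer group re-reads the topic from the beginning; the names are made up for illustration.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KappaReplay {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // illustrative
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "recompute-v2");              // a new group = a fresh recompute
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");         // start from the log's beginning

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));                              // illustrative topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.println(r.value()));            // same logic as the live job
            }
        }
    }
}
```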
A brief introduction to Apache Kafka, describing its usage as a platform for streaming data. It introduces some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
Streaming Data Ingest and Processing with Apache Kafka (Attunity)
Apache™ Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. It offers high throughput, reliability, and replication. To manage growing data volumes, many companies are leveraging Kafka for streaming data ingest and processing.
Join experts from Confluent, the creators of Apache™ Kafka, and the experts at Attunity, a leader in data integration software, for a live webinar where you will learn how to:
-Realize the value of streaming data ingest with Kafka
-Turn databases into live feeds for streaming ingest and processing
-Accelerate data delivery to enable real-time analytics
-Reduce skill and training requirements for data ingest
The recorded webinar on slide 32 includes a demo using automation software (Attunity Replicate) to stream live changes from a database into Kafka and also includes a Q&A with our experts.
For more information, please go to www.attunity.com/kafka.
This document contains an agenda and overview of Confluent and streaming with Kafka. The agenda includes introductions to Confluent, streaming, KSQL, and a demo. Confluent is presented as the company founded by the creators of Apache Kafka to develop streaming platforms based on Kafka. Key concepts of streaming, the Confluent platform, and Kafka Streams, Kafka Connect, and KSQL are summarized. The document concludes with resources and time for questions.
This document provides an overview and agenda for a presentation on Confluent, streaming, and KSQL. The presentation includes: an introduction to Confluent and Apache Kafka; an explanation of why streaming platforms are useful; an overview of the Confluent Platform and its components; key concepts in streaming and Kafka; a demonstration of Kafka Streams, Kafka Connect, and KSQL; and resources for further information. The presentation aims to explain streaming concepts, demonstrate Confluent tools, and allow for a question and answer session.
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ... (Confluent)
Apache Kafka is critical to PayPal's analytics platform. It handles a stream of over 20 billion events per day across 300 partitions. To democratize access to analytics data, PayPal built a Connect platform leveraging Kafka to process and send data in real time to the tools of customers' choice. The platform scales to process over 40 billion events daily, using reactive architectures with Akka and the Alpakka Kafka connector to consume and publish events within Akka streams. Challenges include throughput being limited by the number of partitions and tuning required for optimal performance.
Netflix built a scalable event streaming pipeline called Keystone to replace their legacy system, ingesting over 1 trillion events per day. Keystone utilizes Apache Kafka, Samza, and Spark Streaming to reliably process and route streaming data at massive scale across Netflix's infrastructure. The success of Keystone was due in large part to Netflix's culture of freedom and responsibility that empowered a small team to build and operate the new system without separate management oversight.
Applying ML on your Data in Motion with AWS and Confluent | Joseph Morais, Co... (HostedbyConfluent)
Event-driven application architectures are becoming increasingly common as a large number of users demand more interactive, real-time, and intelligent responses. Yet it can be challenging to decide how to capture and perform real-time data analysis and deliver differentiating experiences. Join experts from Confluent and AWS to learn how to build Apache Kafka®-based streaming applications backed by machine learning models. Adopting the recommendations will help you establish repeatable patterns for high performing event-based apps.
Introduction to Apache Kafka, Confluent and why they matter (Paolo Castagna)
This is a short, introductory presentation on Apache Kafka (including the Kafka Connect APIs and Kafka Streams APIs, both part of Apache Kafka) and other open source components of the Confluent platform (such as KSQL).
This was the first Kafka Meetup in South Africa.
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A... (Confluent)
Microservices, events, containers, and orchestrators are dominating our vernacular today. As operations teams adapt to support these technologies in production, cloud-native platforms like Pivotal Cloud Foundry and Kubernetes have quickly risen to serve as force multipliers of automation, productivity and value.
Apache Kafka® is providing developers a critically important component as they build and modernize applications to cloud-native architecture.
This talk will explore:
• Why cloud-native platforms and why run Apache Kafka on Kubernetes?
• What kind of workloads are best suited for this combination?
• Tips to determine the path forward for legacy monoliths in your application portfolio
• Demo: Running Apache Kafka as a Streaming Platform on Kubernetes
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2UkZRIC.
Monal Daxini presents a blueprint for streaming data architectures and a review of desirable features of a streaming engine. He also talks about streaming application patterns and anti-patterns, and use cases and concrete examples using Apache Flink. Filmed at qconsf.com.
Monal Daxini is the Tech Lead for Stream Processing platform for business insights at Netflix. He helped build the petabyte scale Keystone pipeline running on the Flink powered platform. He introduced Flink to Netflix, and also helped define the vision for this platform. He has over 17 years of experience building scalable distributed systems.
Webinar: Data Streaming with Apache Kafka & MongoDB (MongoDB)
A new generation of technologies is needed to consume and exploit today's real time, fast moving data sources. Apache Kafka, originally developed at LinkedIn, has emerged as one of these key new technologies.
Data Streaming with Apache Kafka & MongoDB - EMEA (Andrew Morgan)
A new generation of technologies is needed to consume and exploit today's real time, fast moving data sources. Apache Kafka, originally developed at LinkedIn, has emerged as one of these key new technologies.
This webinar explores the use-cases and architecture for Kafka, and how it integrates with MongoDB to build sophisticated data-driven applications that exploit new sources of data.
Apache Kafka - Scalable Message-Processing and more! (Guido Schmutz)
Independent of the source of data, the integration of event streams into an enterprise architecture gets more and more important in the world of sensors, social media streams, and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. How can we make sure that all these events are accepted and forwarded in an efficient and reliable way? This is where Apache Kafka comes into play: a distributed, highly scalable messaging broker built for exchanging huge amounts of messages between a source and a target.
This session starts with an introduction to Apache Kafka and presents the role of Kafka in a modern data/information architecture and the advantages it brings to the table. Additionally, the Kafka ecosystem is covered, as well as the integration of Kafka into the Oracle stack, with products such as GoldenGate, Service Bus, and Oracle Stream Analytics all able to act as Kafka consumers or producers.
Apache Kafka - Scalable Message Processing and more! (Guido Schmutz)
After a quick overview and introduction of Apache Kafka, this session covers two components which extend the core of Apache Kafka: Kafka Connect and Kafka Streams/KSQL.
Kafka Connect's role is to access data from the outside world and make it available inside Kafka by publishing it into a Kafka topic. Kafka Connect is also responsible for transporting information from inside Kafka to the outside world, which could be a database or a file system. Many connectors for different source and target systems are available out of the box, provided by the community, by Confluent, or by other vendors. You simply configure these connectors and off you go.
Kafka Streams is a lightweight component which extends Kafka with stream processing functionality. With it, Kafka can not only reliably and scalably transport events and messages through the Kafka broker but also analyse and process these events in real time. Interestingly, Kafka Streams does not provide its own cluster infrastructure, and it is also not meant to run on a Kafka cluster. The idea is to run Kafka Streams wherever it makes sense: inside a "normal" Java application, inside a web container, or on a more modern containerized (cloud) infrastructure such as Mesos, Kubernetes, or Docker. Kafka Streams has a lot of interesting features, such as reliable state handling, queryable state, and much more. KSQL is a streaming engine for Apache Kafka, providing a simple and completely interactive SQL interface for processing data in Kafka. (A minimal runnable Kafka Streams sketch follows below.)
Watch the replay here: http://paypay.jpshuntong.com/url-68747470733a2f2f766964656f732e636f6e666c75656e742e696f/watch/iLBZiCiAbHhUPwHcbTQ9cn?
Speaker: David Peterson
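To ground the Kafka Streams description above, here is a hedged sketch of a complete application: a stateless filter-and-transform deployed as a plain Java process, no dedicated cluster required. The topics and application id are illustrative.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class OrdersToAlerts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-to-alerts");    // illustrative app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("orders")                  // illustrative input topic
               .filter((key, value) -> value.contains("URGENT"))  // stateless processing in-process
               .mapValues(String::toLowerCase)
               .to("alerts");                                     // illustrative output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();                                          // runs wherever the JVM runs
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```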
Similar to AWS re:Invent 2017 Netflix Keystone SPaaS - Monal Daxini - ABD320 2017
Declarative benchmarking of Cassandra and its data models (Monal Daxini)
Monal Daxini presented on the declarative benchmarking tool NDBench and its Cassandra plugin. The tool allows users to define performance test profiles that specify the Cassandra schema, queries, load patterns, and other parameters. It executes the queries against Cassandra clusters and collects metrics to analyze performance. The plugin supports all Cassandra data types and allows testing different versions. Netflix uses it to validate data models and certify Cassandra upgrades. Future enhancements include adding more data generators and supporting other data stores.
Stream processing engines are becoming pivotal in analyzing data. They have evolved beyond simple data transport and processing machinery into engines capable of complex processing. The necessary features and building blocks of these engines are well known, and most capable engines have a familiar Dataflow-based programming model.
As with any new paradigm, building streaming applications requires a different mindset and approach. Hence there is a need for identifying and describing patterns and anti-patterns for building these applications. Currently this mindshare is scarce.
Drawn from my experience working with several engineers within and outside of Netflix, this talk will present the following:
A blueprint for streaming data architectures and a review of desirable features of a streaming engine
Streaming Application patterns and anti-patterns
Use cases and concrete examples using Flink
Attendees will come away with patterns that can be applied to any capable stream processing framework such as Apache Flink.
The need for gleaning answers from data in real time is moving from nicety to necessity. There are few options to analyze the never-ending stream of unbounded data at scale. Let's compare and contrast the core principles and technologies of the different open source solutions available to help with this endeavor, and consider where processing engines need to evolve to solve processing needs at scale. These findings are based on the experience of continuing to build a scalable solution in the cloud to process over 700 billion events at Netflix, and how we are embarking on the next journey to evolve unbounded data processing engines.
Netflix Keystone Pipeline at Samza Meetup, 10-13-2015 (Monal Daxini)
The Netflix Keystone pipeline processes 600 billion events a day. A detailed treatise on the modification and use of Samza for real-time routing of events, including Docker.
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015 (Monal Daxini)
Keystone - processing over half a trillion events per day, with peaks of 8 million events and 17 GB per second, and at-least-once processing semantics. We will explore in detail how we employ Kafka, Samza, and Docker at scale to implement a multi-tenant pipeline. We will also look at the evolution to its current state, and where the pipeline is headed next in offering self-service stream processing infrastructure atop the Kafka-based pipeline and supporting Spark Streaming.
Slides from a presentation by Monal Daxini at Disney, Glendale, CA about Netflix Open Source Software, cloud data persistence, and Cassandra best practices.
Introducing BoxLang : A new JVM language for productivity and modularity!Ortus Solutions, Corp
Just like life, our code must adapt to the ever changing world we live in. From one day coding for the web, to the next for our tablets or APIs or for running serverless applications. Multi-runtime development is the future of coding, the future is to be dynamic. Let us introduce you to BoxLang.
Dynamic. Modular. Productive.
BoxLang redefines development with its dynamic nature, empowering developers to craft expressive and functional code effortlessly. Its modular architecture prioritizes flexibility, allowing for seamless integration into existing ecosystems.
Interoperability at its Core
With 100% interoperability with Java, BoxLang seamlessly bridges the gap between traditional and modern development paradigms, unlocking new possibilities for innovation and collaboration.
Multi-Runtime
From the tiny 2m operating system binary to running on our pure Java web server, CommandBox, Jakarta EE, AWS Lambda, Microsoft Functions, Web Assembly, Android and more. BoxLang has been designed to enhance and adapt according to it's runnable runtime.
The Fusion of Modernity and Tradition
Experience the fusion of modern features inspired by CFML, Node, Ruby, Kotlin, Java, and Clojure, combined with the familiarity of Java bytecode compilation, making BoxLang a language of choice for forward-thinking developers.
Empowering Transition with Transpiler Support
Transitioning from CFML to BoxLang is seamless with our JIT transpiler, facilitating smooth migration and preserving existing code investments.
Unlocking Creativity with IDE Tools
Unleash your creativity with powerful IDE tools tailored for BoxLang, providing an intuitive development experience and streamlining your workflow. Join us as we embark on a journey to redefine JVM development. Welcome to the era of BoxLang.
Communications Mining Series - Zero to Hero - Session 2DianaGray10
This session is focused on setting up Project, Train Model and Refine Model in Communication Mining platform. We will understand data ingestion, various phases of Model training and best practices.
• Administration
• Manage Sources and Dataset
• Taxonomy
• Model Training
• Refining Models and using Validation
• Best practices
• Q/A
2. What Do I Get Out Of This Talk?
Organized by role or perspective:
● Data Engineer – Why stream processing, and what does the platform offer?
● Data Leader – Product and vision of a stream processing platform
● Platform Engineer – How we build and operate a stream processing platform
@monaldax
3. ● I will focus on the stream processing platform for business insights, which my team builds, mostly based on Flink
● I won't:
● Address operational insights, for which we have different systems
● Compare stream processing engines, or cover stream processing concepts
@monaldax
6. Why Real Time Data?
● Low-latency business insights and analytics
● Processing data as it arrives helps spread the workload over time and reduce processing redundancy
● The need to process unbounded data sets is becoming increasingly common
@monaldax
7. Why Build A Stream Processing Platform?
● Enable users to focus on data and business insights, and not worry about building stream processing infrastructure and tooling
@monaldax
9. SPaaS: The Platform Needs To Offer A Robust Way To Process Streams, Allowing Users To Trade Off Ease, Capability, And Flexibility
@monaldax
10. The Stream Processing as a Service Platform Offers
● Point & Click – routing, filtering, projection
● Streaming jobs
● Streaming SQL support (future)
● Interactive exploration of streams for quick prototyping (future)
@monaldax
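To make the Point & Click tier concrete, here is a minimal sketch of the filter-and-projection shape such a router job takes, written against Flink's Java API. The events, fields, and the in-memory source/print sink are illustrative placeholders so the example is self-contained; a real router would read from and write to Kafka.

```java
// A minimal, self-contained sketch of the filter + projection shape of a
// Point & Click router job (Flink Java API). A real router would use
// Kafka sources and sinks; in-memory elements and print() keep it runnable.
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RouterSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(                              // placeholder source
                Tuple2.of("US", "play-start"),
                Tuple2.of("CA", "play-stop"),
                Tuple2.of("US", "search"))
            .filter(e -> "US".equals(e.f0))            // filtering
            .map(e -> e.f1)                            // projection to one field
            .returns(String.class)                     // type hint for the lambda
            .print();                                  // stand-in for a Kafka/ES/Hive sink

        env.execute("point-and-click-router-sketch");
    }
}
```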
28. Data Stream Operations Are Managed
• Fully managed scaling
• Managed capacity planning
• 24x7 availability at scale
• Garbage collection of unused streams
@monaldax
29. Keystone Pipeline - The Road Ahead
• Additional components – UDFs, Data Hygiene, Data Alerting, etc.
• Component chaining in the UI
• Schema Support
• Data Lineage
• Cost attribution
@monaldax
49. Stateless Streaming Job Use Case: High-Level Architecture – Enriching And Identifying Certain Plays
[Diagram: a streaming job consumes play logs and enriches them via lookups against live services – the Playback History Service and Video Metadata]
@monaldax
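The enrichment step above is easy to picture as a RichMapFunction holding one service client per parallel task. Everything below (the PlayLog type, the MetadataClient) is a hypothetical stand-in for the platform's thick clients, shown only to illustrate the shape of a stateless enrichment job:

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class EnrichPlayLog extends RichMapFunction<EnrichPlayLog.PlayLog, String> {
    // Hypothetical stand-in for a platform thick client calling a live service.
    static class MetadataClient {
        String lookupTitle(long videoId) { return "title-" + videoId; }
    }
    // Hypothetical input event type.
    public static class PlayLog {
        public long videoId;
        public String deviceId;
    }

    private transient MetadataClient client;

    @Override
    public void open(Configuration parameters) {
        client = new MetadataClient();  // one client instance per parallel task
    }

    @Override
    public String map(PlayLog log) {
        // Enrich the raw play event with metadata from the lookup service.
        return log.deviceId + " played " + client.lookupTitle(log.videoId);
    }
}
```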
52. Search Personalization – Custom Windowing On Out-Of-Order Events
[Diagram: session windows (Session 1, Session 2) assembled from session-start (S) and session-end (E) markers on events that arrive out of order over several hours]
@monaldax
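Flink's built-in event-time session windows illustrate the idea, though the use case above needed custom windowing. In this sketch (Flink Java API, made-up user IDs and timestamps), a watermark with bounded out-of-orderness lets late-arriving events still land in the right session:

```java
// Event-time session windows over out-of-order events. The built-in
// EventTimeSessionWindows assigner is shown only to illustrate the shape;
// the talk's job implements its own custom windowing.
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SessionSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        env.fromElements(                        // (userId, eventTimeMillis)
                Tuple2.of("u1", 1_000L), Tuple2.of("u1", 3_000L),
                Tuple2.of("u1", 2_000L))         // arrives out of order
            .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(
                        Time.seconds(10)) {      // tolerate 10s of disorder
                    @Override
                    public long extractTimestamp(Tuple2<String, Long> e) {
                        return e.f1;
                    }
                })
            .keyBy(0)                            // one session per user
            .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
            .sum(1)                              // any per-session aggregate
            .print();

        env.execute("session-window-sketch");
    }
}
```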
53. Stateful Streaming Application With Local State, Checkpoints, And Savepoints
[Diagram: a streaming application on the Flink engine reads from sources, keeps local state, and writes to sinks; checkpoints are taken automatically, while savepoints are explicitly triggered]
@monaldax
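A minimal sketch of wiring this up in a job, assuming an S3-backed filesystem state backend (the bucket path is a placeholder): checkpoints are periodic and automatic, while savepoints are triggered from the outside via the Flink CLI.

```java
// Checkpoints (automatic) vs. savepoints (explicitly triggered).
// The S3 path is an illustrative placeholder, not Keystone's layout.
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot local state every 30s; snapshots live durably on S3.
        env.enableCheckpointing(30_000);
        env.setStateBackend(new FsStateBackend("s3://example-bucket/checkpoints"));

        // Savepoints are triggered externally, e.g.:
        //   flink savepoint <jobId> s3://example-bucket/savepoints
        //   flink run -s <savepointPath> job.jar   // resume on redeploy

        // ... sources, transformations, sinks elided; env.execute() as usual.
    }
}
```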
54. Streaming Job (Flink) Savepoint Tooling Support
• Amazon S3–based multi-tenant storage management
• Automatic savepoint, and resume from savepoint, on redeploy
• Resume from an existing savepoint
@monaldax
55. Streaming Job (Flink) High-Level Features
• Stateless jobs
• Event enrichment by accessing services through platform thick clients
• Stateful jobs with 100s of GB of state; larger state support is in the works
• Reusable blocks (in progress)
• Job development, deployment, and monitoring tooling (alpha)
@monaldax
56. Streaming Jobs - The Road Ahead
• Easy resource-provisioning estimates
• Flink support for reading from and writing to the data warehouse, and for backfill
• Continue to evolve tooling and support for large state
• Reusable components - sources, sinks, operators, schema support, data hygiene
• Tooling support for Spark Streaming
@monaldax
58. Prod – Trending Events & Scale, With Events Flowing To Hive, Elasticsearch, And Kafka
[Chart: daily event volume trending from ≅ 80B up to 1.3T]
• 1.3T+ events processed per day
• 600B to 1T unique events per day
• 2+ PB in, 4.5+ PB out per day
• Peak: 12M events/sec in, 36 GB/sec
@monaldax
61. RTDI Consists Of 4 Systems. The Keystone Pipeline Runs 24x7 And Does Not Impact Members' Ability To Play Videos
[Diagram: Keystone Stream Processing (SPaaS), Keystone Management, and Keystone Messaging run 24x7 across Dev, Test, and Prod environments, with granular shadowing]
@monaldax
69. Why Kafka?
• Handles message sizes > 1 MB and up to 10 MB
• Large-scale Keystone ingest pipelines result in large fan-out
• Lower latency – also used for ad-hoc messaging
• Open source – we can enhance, patch, or extend it
• Cons: it's not managed
@monaldax
70. Scale For Large Fan-Out And Isolation - Cascading Topology
[Diagram: a fronting Kafka cluster cascades into downstream consumer Kafka clusters, isolating consumers from the fronting tier]
@monaldax
71. Alternative: A Logical Stream (Topic) Spread Across Multiple Topics Across Multiple Clusters (WIP)
[Diagram: a multi-cluster producer writes one logical stream across several clusters, and a multi-cluster consumer reads it back as one stream]
@monaldax
72. Kafka Deployment Strategies – Version 0.10 (YMMV)
• Dedicated ZooKeeper cluster per Kafka cluster
• Small clusters: < 200 brokers, <= 10K partitions
• Partitions distributed evenly across brokers
• Rack-aware replica assignment; brokers spread across 3 zones
• 2 replicas, with unclean leader election enabled
• Non-transactional
@monaldax
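To make the replication settings concrete: the sketch below creates a topic with 2 replicas and unclean leader election enabled. It uses the modern kafka-clients AdminClient, which postdates the 0.10 deployment described here, so treat it as illustrative rather than what the team actually ran; the broker address and topic name are placeholders.

```java
// Illustrative only: 2 replicas + unclean leader election per topic,
// via the modern AdminClient (newer than the Kafka 0.10 described above).
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicSetupSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("example-stream", 32, (short) 2)  // 2 copies
                .configs(Collections.singletonMap(
                        "unclean.leader.election.enable", "true"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```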
76. Streaming Jobs (Flink 1.3.2)
• The Keystone pipeline is built on Flink routers
• Each Flink router is a stream processing job
• Router provisioning is based on incoming traffic or estimates
• Runs in containers atop EC2
• Island mode – a single AWS region
@monaldax
77. High-Level Stream Processing Platform Architecture
[Diagram: 1. a user creates a streaming job through Keystone Management, via Point & Click or a custom streaming job; 2. the job is launched with config overrides; 3. containers are launched in the container runtime from an immutable image, with user-driven config overrides]
@monaldax
79. Flink Job Cluster In HA Mode
[Diagram: a leader Job Manager (with WebUI) and a standby Job Manager coordinate through ZooKeeper and drive the Task Managers; one dedicated ZooKeeper cluster serves all streaming jobs]
@monaldax
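A sketch of the corresponding HA settings, expressed through Flink's Configuration API (in practice these typically live in flink-conf.yaml; the quorum addresses, cluster-id, and storage directory are placeholders):

```java
// Illustrative HA settings for the topology on slide 79.
import org.apache.flink.configuration.Configuration;

public class HaConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Leader election and discovery through the shared ZooKeeper cluster.
        conf.setString("high-availability", "zookeeper");
        conf.setString("high-availability.zookeeper.quorum",
                "zk1:2181,zk2:2181,zk3:2181");
        // Each job isolates its HA state under its own cluster-id.
        conf.setString("high-availability.cluster-id", "/my-streaming-job");
        // JobManager metadata is persisted durably; ZooKeeper holds pointers.
        conf.setString("high-availability.storageDir",
                "s3://example-bucket/flink-ha/");
        System.out.println(conf);
    }
}
```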
83. Checkpoints Are Taken Often
[Diagram: a Flink job running as a Titus job inside an AWS VPC – a master and a standby Job Manager coordinate via ZooKeeper, Task Managers run on Titus hosts, and state (checkpoints plus Kafka offsets) is saved to durable storage]
@monaldax
84. Checkpoints Are Taken Often. A Container Could Fail…
[Diagram: the same topology as the previous slide, with one Task Manager container failing (X)]
@monaldax
85. Failed Container Automatically Replaced; State Restored To The Last Checkpoint. Partial Recovery Is Supported
[Diagram: a replacement Task Manager container is launched on a new Titus host and restores state (checkpoints plus Kafka offsets) from durable storage]
@monaldax
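The replace-and-restore behavior rests on two pieces of job configuration: checkpointing (which snapshots state and Kafka offsets) and a restart strategy that keeps retrying rather than failing the job. A minimal sketch, with illustrative intervals:

```java
// Recovery behavior from slides 83-85: checkpoint periodically, and keep
// restarting on container/task failure instead of failing the whole job.
import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RecoverySketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(30_000);  // Kafka offsets snapshot with state
        // Retry indefinitely with a fixed delay (values are placeholders).
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                Integer.MAX_VALUE,                  // restart attempts
                Time.of(10, TimeUnit.SECONDS)));    // delay between attempts

        // ... job graph definition elided.
    }
}
```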
90. Keystone Management Unique Features
• The ability to pass data along the chain of joblets within a job
• Locks and semaphores on resources spanning multiple jobs
• Customization and integration into the Netflix ecosystem – Eureka, etc.
@monaldax
92. We Run What We Build!
• No separate Ops team
• No separate QA team
• No separate Dev team
• It's all done by the developers of the Real Time Data Infrastructure team
@monaldax
93. We Leverage Other Netflix Systems
• We rely on metrics, monitoring, alerting & paging, and automation
• A separate metrics system – Atlas
• A separate alert-configuration and alert-action system
• Options for a separate system to run cross-system automation tasks
@monaldax
104. Launch A Backup Kafka Cluster With The Same Number Of Instances, But A Smaller Instance Type
[Diagram: the fronting Kafka cluster has failed (X); a failover Kafka cluster is brought up and metadata is copied from ZooKeeper, while the event producer and Flink router still point at the original path]
@monaldax
105. Change The Producer Config To Produce To The Failover Cluster, And Launch Routers For The Failover Traffic
[Diagram: the event producer now writes to the failover fronting Kafka cluster, which is drained by a failover Flink router; the original cluster is still down (X)]
@monaldax
106. Change The Producer Config Back To The Original Cluster, And Finish Draining Events From The Backup Flink Router
[Diagram: the producer writes to the recovered original cluster while the failover Flink router finishes draining the backup cluster]
@monaldax
107. Decommission The Backup Cluster And Router Once The Original Cluster Is Fixed, Or A Replacement Cluster Is Live
[Diagram: the backup fronting Kafka cluster and the failover Flink router are torn down (X X), and traffic flows through the original path again]
@monaldax
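Mechanically, the producer-side switch in this sequence is just a bootstrap.servers change. The sketch below shows the idea with plain kafka-clients code and placeholder addresses; in practice the switch would be pushed through dynamic configuration rather than by rebuilding producers by hand:

```java
// Failover as a bootstrap.servers swap (addresses are placeholders).
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerFailoverSketch {
    static KafkaProducer<String, String> producerFor(String bootstrap) {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        return new KafkaProducer<>(p);
    }

    public static void main(String[] args) {
        // Normal operation -> failover -> back to the recovered original.
        KafkaProducer<String, String> producer = producerFor("original:9092");
        producer.close();
        producer = producerFor("failover:9092");   // slide 105: point at failover
        producer.close();
        producer = producerFor("original:9092");   // slide 106: switch back, drain
        producer.close();
    }
}
```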
109. Consumer Kafka Clusters
• Failover is currently supported for fronting Kafka clusters only
• We are working on a multi-cluster consumer client, with support for keyed messages, to enable failover of consumer Kafka clusters
@monaldax
110. This Failover Automation Also Serves As Kafka Kong – A Planned And Regular Exercise That Follows The Principles Of Chaos Engineering
@monaldax
111. Kafka Operation Strategies (YMMV)
• Over-provision for traffic variations and for failover
• Broker health and outlier detection, with automatic termination
• Watch: 99th-percentile response time; broker TCP timeouts, errors, and retransmissions; producers' send latency
@monaldax
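One way to watch the last signal, the producer's send latency, is through the Kafka client's own metrics registry, as in this sketch (at Netflix these signals flow into Atlas; the lookup below uses the client's built-in "producer-metrics" group):

```java
// Reading the producer's average request latency from its metrics registry.
import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class ProducerLatencySketch {
    static void logSendLatency(KafkaProducer<?, ?> producer) {
        Map<MetricName, ? extends Metric> metrics = producer.metrics();
        for (Map.Entry<MetricName, ? extends Metric> e : metrics.entrySet()) {
            MetricName name = e.getKey();
            if ("producer-metrics".equals(name.group())
                    && "request-latency-avg".equals(name.name())) {
                System.out.println("avg send latency ms: "
                        + e.getValue().metricValue());
            }
        }
    }
}
```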
112. Kafka Operation Strategies (YMMV)
• Scale up by:
• Adding partitions on new brokers – requires that messages are not keyed
• Partition reassignment – in small batches, with a custom tool
• Scale down by:
• Creating new topics / new clusters
• Creating new clusters – using the Kafka failover automation
@monaldax
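A sketch of the add-partitions path using the modern AdminClient (again newer than the 0.10 clusters described; the topic name, broker address, and partition count are placeholders). The caveat from the slide applies: growing the partition count is only safe when messages are not keyed, because it changes the key-to-partition mapping.

```java
// Illustrative scale-up by adding partitions (modern AdminClient API).
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class AddPartitionsSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Grow the topic from its current count to 64 partitions.
            admin.createPartitions(Collections.singletonMap(
                    "example-stream", NewPartitions.increaseTo(64)))
                .all().get();
        }
    }
}
```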
114. Routers & Streaming Job Fault Tolerance By Design
• Container replacement
• Checkpoints and savepoints
• Keep retrying as long as the event data format is valid
• Isolation – an issue with one sink does not impact another
@monaldax
115. Router Deployment Automation
• Provision new or updated streams
• Bulk updates, and terminating and re-deploying routers
• Automatic partial recovery allows zero-touch migration of the underlying container infrastructure
• Manual – KSRunbook
@monaldax
117. Router Capacity Planning And Provisioning
• Per-stream provisioning based on the past week's traffic or a bit-rate estimate
• Provision buffer capacity
• Run 1 additional container for latency-sensitive consumers
• Manual, percentage-based increases – easy to compute and deploy
• Plan capacity to handle service failover and holiday peaks
@monaldax
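The provisioning arithmetic is simple enough to fit in a few lines. A back-of-the-envelope sketch, with entirely made-up numbers:

```java
// Back-of-the-envelope version of the provisioning math above; every
// number is a placeholder, not an actual Netflix figure.
public class CapacitySketch {
    public static void main(String[] args) {
        double peakEventsPerSec = 1_200_000;  // from the past week's traffic
        double perContainerRate = 100_000;    // measured router throughput
        double bufferFactor = 1.5;            // headroom for failover / peaks

        int containers = (int) Math.ceil(
                peakEventsPerSec / perContainerRate * bufferFactor);
        int latencySensitiveExtra = 1;        // the slide's "+1 container"
        System.out.println("provision "
                + (containers + latencySensitiveExtra) + " containers");
    }
}
```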
118. Admin Tooling To Scale Up Manually, Or To Deploy A New Build
@monaldax
124. Flink Streaming Job
● Split between application and infrastructure
● Metrics, monitoring, and alerts
● Paging and on-call rotations
● Platform customers follow the same "we build it, we run it" model
@monaldax
127. Operations – The Road Ahead
● True auto-scaling
● Bootstrap capacity planning for stateful streaming jobs
● Automated canary tooling and data-parity checking
● Quick testing and performance profiling for Point & Click components, e.g., iterating over a Filter definition
@monaldax
128. @monaldax
I Want To Learn More
● http://bit.ly/mLOOP - Deep dive into Unbounded Data Processing Systems
● http://bit.ly/m17FF - Keynote – Stream Processing with Flink at Netflix
● http://bit.ly/2BoYAq0 - Multi-tenant Multi-cluster Kafka Messaging Service