Apache Spark Crash Course

Robert Hryniewicz
Developer Advocate
@RobertH8z
Apache Spark
Crash Course - DataWorks Summit - Munich 2017

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
“Big Data”
• Internet of Anything (IoT)
– Wind Turbines, Oil Rigs
– Beacons, Wearables
– Smart Cars
• User Generated Content (Social, Web & Mobile)
– Twitter, Facebook, Snapchat
– Clickstream
– Paypal, Venmo
44ZB in 2020

Visualizing 44ZB
100 pixels = 1M TB
100 px -> 1M TB assumes 5M pixel resolution screen

The “Big Data” Problem
Problem
• A single machine cannot process or even store all the data!
Solution
• Distribute data over large clusters
Difficulty
• How to split work across machines?
• Moving data over network is expensive
• Must consider data & network locality
• How to deal with failures?
• How to deal with slow nodes?

Spark Background

What Is Apache Spark?
• Apache open source project originally developed at AMPLab (University of California, Berkeley)
• Unified data processing engine that operates across varied data workloads and platforms

Why Apache Spark?
• Elegant Developer APIs
– Single environment for data munging, data wrangling, and Machine Learning (ML)
• In-memory computation model – Fast!
– Effective for iterative computations and ML
• Machine Learning
– Implementation of distributed ML algorithms
– Pipeline API (Spark ML)

Spark SQL
Structured Data
Spark Streaming
Near Real-time
Spark MLlib
Machine Learning
GraphX
Graph Analysis

Spark Basics

SparkSession
What is it?
• Main entry point for Spark functionality
• Allows programming with DataFrame and Dataset APIs
– Fewer concepts and constructs for a developer to juggle while interacting with Spark
• Available as the variable spark, auto-initialized in the Zeppelin environment
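Outside Zeppelin or spark-shell the session has to be created explicitly. The following is a minimal sketch of doing so; the application name and the local master are assumptions chosen for illustration, not values from the talk.

```scala
import org.apache.spark.sql.SparkSession

// In Zeppelin or spark-shell `spark` already exists; this is only needed
// in a standalone application. App name and master are illustrative.
val spark = SparkSession.builder()
  .appName("crash-course-sketch")
  .master("local[*]")
  .getOrCreate()

// The session is the single entry point for DataFrames, Datasets and SQL.
val numbers = spark.range(5)   // Dataset with one column named "id"
numbers.show()
println(spark.version)
```

`getOrCreate()` returns an already running session if there is one, so the same snippet can also be pasted into a notebook without creating a second session.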

Spark SQL
Structured Data

More Flexible Better Storage and Performance

Spark SQL Overview
• Spark module for structured data processing (e.g. DB tables, JSON files, CSV)
• Three ways to manipulate data:
– DataFrames API
– SQL queries
– Datasets API
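Of the three options, the Datasets API is the only one that is not demonstrated with code in this deck. A minimal sketch of what it might look like follows; the `Flight` case class and the in-memory rows are hypothetical and exist only to make the example self-contained.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dataset-sketch").master("local[*]").getOrCreate()
import spark.implicits._   // encoders for case classes, needed by toDS()

// Hypothetical record type. In a compiled application, define it at the
// top level (outside any method) so Spark can derive an encoder for it.
case class Flight(Origin: String, Dest: String, DepDelay: Long)

val flightsDS = Seq(
  Flight("IAD", "TPA", 19),
  Flight("IND", "BWI", 34),
  Flight("IND", "JAX", 5)
).toDS()

// Typed filter: the lambda is checked at compile time against Flight.
val delayed = flightsDS.filter(_.DepDelay > 15)
delayed.show()
```

The trade-off compared with the DataFrame API is compile-time type checking in exchange for having to declare the record type up front.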

DataFrames
• Distributed collection of data organized into named columns
• Conceptually equivalent to a table in a relational DB or a data frame in R/Python
• API available in Scala, Java, Python, and R
[Diagram: a DataFrame drawn as a table with columns Col1, Col2, …, ColN and rows]
Data is described as a DataFrame with rows, columns, and a schema

Sources
[Diagram: CSV, Avro, Hive, and JSON sources feed Spark SQL, which produces a DataFrame with columns Col1, Col2, …, ColN and rows]

Create a DataFrame
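Example
// Read a JSON file (one JSON object per line) into a DataFrame;
// the schema is inferred from the data.
val path = "examples/flights.json"
val flights = spark.read.json(path)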

Register a Temporary View (SQL API)
Example
flights.createOrReplaceTempView("flightsView")

Two API Examples: DataFrame and SQL APIs
DataFrame API
flights.select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15).show(5)
SQL API
SELECT Origin, Dest, DepDelay
FROM flightsView
WHERE DepDelay > 15 LIMIT 5
Results
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
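To see that the two APIs express the same query, both can be run side by side from Scala. This is a sketch only: it swaps the flights.json file for a small in-memory DataFrame so that it runs without any input file, and the last row is made up so the filter has something to exclude.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("two-apis-sketch").master("local[*]").getOrCreate()
import spark.implicits._   // enables toDF and the $"column" syntax

// In-memory stand-in for flights.json; the ORD row is invented.
val flights = Seq(
  ("IAD", "TPA", 19),
  ("IND", "BWI", 34),
  ("IND", "JAX", 25),
  ("IND", "ORD", 3)
).toDF("Origin", "Dest", "DepDelay")

flights.createOrReplaceTempView("flightsView")

// DataFrame API
val viaDataFrame = flights
  .select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15)

// SQL API, issued against the temporary view
val viaSql = spark.sql(
  "SELECT Origin, Dest, DepDelay FROM flightsView WHERE DepDelay > 15")

viaDataFrame.show()
viaSql.show()
```

`spark.sql` returns a DataFrame as well, so results from either API can be chained with further DataFrame operations.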

Spark Streaming
Near Real-time

What is Stream Processing?
Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics
Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action
Stream Processing + Batch Processing = All Data Analytics
real-time (now) historical (past)

Modern Data Applications approach to Insights
Traditional Analytics
Structured & Repeatable
Structure built to store data
Start with hypothesis
Test against selected data
Analyze after landing…
Next Generation Analytics
Iterative & Exploratory
Data is the structure
Data leads the way
Explore all data, identify correlations
Analyze in motion…

Spark Streaming
Overview
• Extension of Spark Core API
• Stream processing of live data streams
– Scalable
– High-throughput
– Fault-tolerant
[Diagram note: ZeroMQ and MQTT are no longer supported in Spark 2.x]

Spark Streaming
Discretized Streams (DStreams)
• High-level abstraction representing continuous stream of data
• Internally represented as a sequence of RDDs
• Operation applied on a DStream translates to operations on the underlying RDDs

Spark Streaming
Example: flatMap operation
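The slide for this example is an image, so the following is only a sketch of what a flatMap over a DStream could look like. It uses `queueStream` with a made-up batch of lines so that no external source is needed; the function name `splitWords` is hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

val spark = SparkSession.builder().appName("dstream-flatmap-sketch").master("local[2]").getOrCreate()
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))   // 1-second micro-batches

// One input line becomes zero or more words.
val splitWords: String => Seq[String] =
  line => line.split("\\s+").filter(_.nonEmpty).toSeq

// queueStream replays pre-built RDDs as batches (handy for experiments).
val batches = mutable.Queue[RDD[String]](
  ssc.sparkContext.parallelize(Seq("to be or", "not to be"))
)
val lines = ssc.queueStream(batches)

// flatMap on the DStream is applied to each underlying RDD.
val words = lines.flatMap(splitWords)
words.print()

ssc.start()
ssc.awaitTerminationOrTimeout(3000)   // run briefly, then shut down
ssc.stop(stopSparkContext = false, stopGracefully = false)
```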

Spark Streaming
Window Operations
• Apply transformations over a sliding window of data, e.g. rolling average
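A rolling average can be sketched with the DStream `window` operation. The readings, the 3-second window, and the 1-second slide below are assumptions chosen for illustration; window length and slide interval must be multiples of the batch interval.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

val spark = SparkSession.builder().appName("window-sketch").master("local[2]").getOrCreate()
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

// Combine two (sum, count) pairs; used to average all readings in a window.
val merge: ((Double, Long), (Double, Long)) => (Double, Long) =
  (a, b) => (a._1 + b._1, a._2 + b._2)

// Made-up sensor readings, one RDD per 1-second batch.
val batches = mutable.Queue[RDD[Double]](
  ssc.sparkContext.parallelize(Seq(10.0, 12.0)),
  ssc.sparkContext.parallelize(Seq(14.0)),
  ssc.sparkContext.parallelize(Seq(20.0, 22.0))
)
val readings = ssc.queueStream(batches)

// Average over the last 3 seconds, re-evaluated every second.
val rollingAverage = readings
  .window(Seconds(3), Seconds(1))
  .map(x => (x, 1L))
  .reduce(merge)
  .map { case (sum, count) => sum / count }

rollingAverage.print()
ssc.start()
ssc.awaitTerminationOrTimeout(5000)
ssc.stop(stopSparkContext = false, stopGracefully = false)
```

Carrying a (sum, count) pair through the reduce step keeps the computation associative, which is what lets Spark combine partial results from different partitions.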

Challenges in Streaming Data
• Consistency
• Fault tolerance
• Out-of-order data

Structured Streaming: Basics

Structured Streaming: Model

Handling late arriving data
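The Structured Streaming slides are images, so what follows is a sketch rather than the talk's own example. It assumes the built-in `rate` test source (available from Spark 2.2) and made-up thresholds, and shows how a watermark bounds how late an event may arrive and still be counted in its event-time window.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().appName("watermark-sketch").master("local[2]").getOrCreate()
import spark.implicits._

// "rate" emits rows with columns (timestamp, value); used here as a stand-in source.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "5")
  .load()

// Accept events up to 10 seconds late, then count per 5-second event-time window.
val counts = events
  .withWatermark("timestamp", "10 seconds")
  .groupBy(window($"timestamp", "5 seconds"))
  .count()

val query = counts.writeStream
  .outputMode("update")        // emit only windows whose count changed
  .format("console")
  .start()

query.awaitTermination(10000)  // run for about 10 seconds in this sketch
query.stop()
```

The query is written like a batch aggregation over a table; the engine treats the stream as rows being appended to that table and updates the result incrementally.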

Spark MLlib
Machine Learning

AI in Media & Pop Culture

Machine Learning use cases
Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates
Financial services
• Fraud Detection/prevention
• Predict underwriting risk
• New account risk screens
Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security
Retail
• Product recommendation
• Inventory management
• Price optimization
Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis
Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels

Scatter 2D Data Visualized
scatterData ← DataFrame
+-----+--------+
|label|features|
+-----+--------+
|-12.0|  [-4.9]|
| -6.0|  [-4.5]|
| -7.2|  [-4.1]|
| -5.0|  [-3.2]|
| -2.0|  [-3.0]|
| -3.1|  [-2.1]|
| -4.0|  [-1.5]|
| -2.2|  [-1.2]|
| -2.0|  [-0.7]|
|  1.0|  [-0.5]|
| -0.7|  [-0.2]|
...

Linear Regression Model Training (one feature)
Coefficients: 2.81 Intercept: 3.05
y = 2.81x + 3.05
Training
Result
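A fit like this can be sketched with Spark ML's `LinearRegression`. The example below trains only on the first five rows of the scatterData table shown earlier, so its coefficient and intercept will differ from the 2.81 and 3.05 reported on the slide.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("linreg-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// First rows of scatterData: a label and a one-element feature vector.
val scatterData = Seq(
  (-12.0, Vectors.dense(-4.9)),
  (-6.0,  Vectors.dense(-4.5)),
  (-7.2,  Vectors.dense(-4.1)),
  (-5.0,  Vectors.dense(-3.2)),
  (-2.0,  Vectors.dense(-3.0))
).toDF("label", "features")

// Default settings: no regularization, "label" and "features" columns.
val model = new LinearRegression().fit(scatterData)

println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept}")
```

With one feature the fitted model is the line y = coefficient * x + intercept; with two features `model.coefficients` simply holds two values.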

Linear Regression (two features)
Coefficients: [0.464, 0.464]
Intercept: 0.0563

Spark API for building ML pipelines
[Diagram: Train: Input DataFrame → Pipeline (Feature transform 1 → Feature transform 2 → Combine features → Linear Regression) → Pipeline Model. Predict: Input DataFrame → Pipeline Model → Output DataFrame. Export Model.]
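The stages in the diagram map naturally onto the Spark ML Pipeline API. The sketch below is one possible arrangement: the choice of `StringIndexer` and `Bucketizer` as the two feature transforms, the column names, and the training rows are all assumptions made for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{Bucketizer, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pipeline-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up training rows: a categorical column, a numeric column, and a label.
val training = Seq(
  ("a", -2.0, -3.1), ("a", -1.0, 0.2), ("b", -0.5, 1.6),
  ("b", 0.5, 4.0), ("b", 1.5, 7.9), ("a", 2.0, 8.7)
).toDF("kind", "x", "label")

// Feature transform 1: encode the string column as a numeric index.
val indexer = new StringIndexer().setInputCol("kind").setOutputCol("kindIndex")
// Feature transform 2: bucket x into negative / non-negative.
val bucketizer = new Bucketizer()
  .setInputCol("x").setOutputCol("xBucket")
  .setSplits(Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
// Combine features into the single vector column that estimators expect.
val assembler = new VectorAssembler()
  .setInputCols(Array("kindIndex", "xBucket", "x")).setOutputCol("features")
val lr = new LinearRegression()

val pipeline = new Pipeline().setStages(Array(indexer, bucketizer, assembler, lr))
val model: PipelineModel = pipeline.fit(training)   // Train
val predictions = model.transform(training)         // Predict
predictions.select("label", "prediction").show()

// Export the fitted model and load it back; the temp directory is illustrative.
val modelDir = Files.createTempDirectory("lr-pipeline").toString
model.write.overwrite().save(modelDir)
val reloaded = PipelineModel.load(modelDir)
```

Because the fitted `PipelineModel` bundles the feature transforms with the regression model, the same preprocessing is applied at prediction time without repeating it by hand.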

GraphX
Graph Analysis

• PageRank
• Topic Modeling (LDA)
• Community Detection
Source: ampcamp.berkeley.edu
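As a small illustration of the first item, here is a sketch of PageRank with GraphX on a made-up four-vertex graph. The edges and the convergence tolerance are assumptions for the example.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pagerank-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Made-up link graph: 1 -> 2 -> 3 -> 1 form a cycle, and 4 links to 1.
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1),
  Edge(2L, 3L, 1),
  Edge(3L, 1L, 1),
  Edge(4L, 1L, 1)
))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// Iterate until ranks change by less than the tolerance.
val ranks = graph.pageRank(tol = 0.0001).vertices   // RDD of (vertexId, rank)

ranks.collect().sortBy(-_._2).foreach { case (id, rank) =>
  println(s"vertex $id -> $rank")
}
```

Vertex 4 has no incoming links, so it ends up with the lowest rank of the four.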

Zeppelin & HDP

What’s Apache Zeppelin?
Web-based notebook that enables interactive data analytics.
You can make beautiful data-driven, interactive and collaborative documents with SQL, Python, Scala and more

Simple line chart

Horizontal plot of three line charts

Streaming data into a line chart

Plotting Iris data features in one plot

Comparing Iris data distributions

What is a Note/Notebook?
• A web based GUI for small code snippets
• Write code snippets in browser
• Zeppelin sends code to backend for execution
• Zeppelin gets data back from backend
• Zeppelin visualizes data
• Zeppelin Note = Set of (Paragraphs/Cells)
• Other Features - Sharing/Collaboration/Reports/Import/Export

How does Zeppelin work?
[Diagram: the notebook author and collaborators/report viewers work in Zeppelin, which runs against a cluster with Spark | Hive | HBase or any of 30+ back ends]

Big Data Lifecycle
Collect → ETL / Process → Analysis → Report → Data Product
Roles: Data Engineer, Data Scientist, Business user, Customer
All in Zeppelin!

• Zeppelin → Interactive notebook
• Spark
• YARN → Resource Management
• HDFS → Distributed Storage Layer
[Diagram: Scala, Java, Python, and R APIs over Spark SQL, Spark Streaming, MLlib, and GraphX on the Spark Core Engine, running on YARN over HDFS nodes 1 … N]

Access patterns enabled by YARN
YARN: Data Operating System
HDFS: Hadoop Distributed File System (nodes 1 … N)
Applications: Batch, Interactive, Real-Time
Batch: needs to happen, but with no timeframe limitations
Interactive: needs to happen at human time
Real-Time: needs to happen at machine execution time

Why Apache Spark on YARN?
• Resource management
• Utilizes existing HDP cluster infrastructure
• Scheduling and queues
[Diagram: a Client running the Spark Driver, a Spark Application Master in a YARN container, and three Spark Executors, each in its own YARN container running two tasks]
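The layout in the diagram can be requested through standard Spark configuration keys. The sketch below only builds the configuration and does not contact a cluster; the queue name "default" and the application name are assumptions.

```scala
import org.apache.spark.SparkConf

// Settings mirror the diagram: three executors, two concurrent tasks each.
val conf = new SparkConf()
  .setAppName("yarn-sketch")
  .setMaster("yarn")                         // let YARN allocate the containers
  .set("spark.submit.deployMode", "client")  // driver stays in the client process
  .set("spark.executor.instances", "3")      // one YARN container per executor
  .set("spark.executor.cores", "2")          // task slots per executor
  .set("spark.yarn.queue", "default")        // YARN scheduler queue (assumed name)

println(conf.toDebugString)
```

The same keys can be passed on the command line or set in the Zeppelin interpreter settings instead of in code.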

Why HDFS?
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
• Not just storage but computation
[Diagram: a logical file is split into blocks 1 to 4, and three copies of each block are spread across the cluster]
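The block-and-replica idea can be illustrated with a few lines of plain Scala. This is a toy model, not HDFS code: the 128 MB block size, the node names, and the uniform random placement are assumptions (real HDFS placement also takes racks into account).

```scala
import scala.util.Random

// Assumed values for illustration.
val blockSizeBytes = 128L * 1024 * 1024
val replication = 3

// Number of blocks needed for a file; the last block may be partial.
def blockCount(fileSizeBytes: Long): Long =
  (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes

// For each block, pick `replication` distinct nodes at random.
def placeBlocks(fileSizeBytes: Long, nodes: Seq[String], rng: Random): Map[Long, Seq[String]] =
  (1L to blockCount(fileSizeBytes)).map { block =>
    block -> rng.shuffle(nodes).take(replication)
  }.toMap

val nodes = Seq("node1", "node2", "node3", "node4", "node5")
val placement = placeBlocks(500L * 1024 * 1024, nodes, new Random(42))

placement.toSeq.sortBy(_._1).foreach { case (block, replicas) =>
  println(s"block $block -> ${replicas.mkString(", ")}")
}
```

Because every block lives on several nodes, a scheduler can usually run a task on a machine that already holds the data, which is the data locality point made above.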

Spark and HDP

HDCloud

Hortonworks Cloud Solutions
Cloud providers: Microsoft, AWS, Google
Managed: Azure HDInsight
Non-Managed / Marketplace: Hortonworks Data Cloud for AWS
Cloud IaaS: Hortonworks Data Platform (via Ambari and via Cloudbreak)

Hortonworks Cloud Solutions: Flexibility and Choice
Hortonworks Data Cloud for AWS: More Prescriptive, More Ephemeral
Cloudbreak, HDP on Cloud IaaS: More Options, More Long Running

HDP 2.6 and New Cluster Types
Spark 2.1
Druid TP
Interactive Hive

Multitenancy with Zeppelin

Livy
• Livy is the open source REST interface for interacting with Apache Spark from anywhere
• Installed as Spark Ambari Service
[Diagram: Livy Client → HTTP → Livy Server → HTTP (RPC) → Spark Interactive Session (SparkContext) and Spark Batch Session (SparkContext)]
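A client talks to Livy with plain HTTP and JSON. The sketch below builds the request bodies for creating an interactive session and submitting a statement; the host name is hypothetical, 8998 is Livy's default port, and the helper names are made up for this example. The actual calls are left as comments because they need a running Livy server.

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import scala.io.Source

// Assumed Livy endpoint; replace the hypothetical host with a real one.
val livyUrl = "http://livy-host:8998"

// Minimal JSON string escaping for the payloads below.
def jsonString(s: String): String =
  "\"" + s.map {
    case '"'  => "\\\""
    case '\\' => "\\\\"
    case '\n' => "\\n"
    case c    => c.toString
  }.mkString + "\""

// Body for POST /sessions: start an interactive Scala session.
def newSessionBody(kind: String = "spark"): String =
  s"""{"kind": ${jsonString(kind)}}"""

// Body for POST /sessions/{id}/statements: run a snippet in that session.
def statementBody(code: String): String =
  s"""{"code": ${jsonString(code)}}"""

// Send a JSON POST and return the response body.
def post(url: String, body: String): String = {
  val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setRequestProperty("Content-Type", "application/json")
  conn.setDoOutput(true)
  conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))
  val src = Source.fromInputStream(conn.getInputStream, "UTF-8")
  try src.mkString finally { src.close(); conn.disconnect() }
}

// Usage against a running Livy server (not executed here):
//   post(s"$livyUrl/sessions", newSessionBody())
//   post(s"$livyUrl/sessions/0/statements", statementBody("sc.parallelize(1 to 10).sum()"))
```

On a Kerberos-secured cluster the requests would additionally need SPNEGO authentication, which this sketch leaves out.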

Security Across Zeppelin-Livy-Spark
[Diagram: Zeppelin (Shiro, LDAP) → Ispark Group Interpreter → SPNego: Kerberos → Livy APIs on the Livy Server → Kerberos → Driver in Spark on YARN]

Reasons to Integrate with Livy
• Bring Sessions to Apache Zeppelin
– Isolation
– Session sharing
• Enable efficient cluster resource utilization
– Default Spark interpreter keeps YARN/Spark job running forever
– Livy interpreter recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)
• Identity Propagation
– Send user identity from Zeppelin > Livy > Spark on YARN

SparkSession Sharing
[Diagram: on the Livy Server, Client 1 and Client 2 share Session-1 (SparkSession-1 with its SparkContext), while Client 3 uses Session-2 (SparkSession-2 with its SparkContext)]

Apache Zeppelin Security: Authentication + SSL
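[Diagram: user Tommy Callahan connects through a firewall over SSL to Zeppelin, which authenticates against LDAP and works with Spark on YARN; steps numbered 1 to 3]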

Apache Zeppelin + Livy End-to-End Security
[Diagram: Tommy Callahan logs in to Zeppelin (LDAP); the Ispark Group Interpreter calls Livy APIs on the Livy Server via SPNego: Kerberos; the Livy Server reaches Spark on YARN via Kerberos/RPC; the job runs as Tommy Callahan]

Sample Architecture

Modern Data Apps
• HDP 2.6
– Batch Processing
• HDF 2.1
– Streaming Apps
[Diagram: DATA AT REST and DATA IN MOTION combine into ACTIONABLE INTELLIGENCE for Modern Data Applications]

Custom or Off the Shelf
Real-Time Cyber Security: protects systems with superior threat detection
Smart Manufacturing: dramatically improves yields by managing more variables in greater detail
Connected, Autonomous Cars: drive themselves and improve road safety
Future Farming: optimizes soil, seeds and equipment to measured conditions on each square foot
Automatic Recommendation Engines: match products to preferences in milliseconds
[Diagram: DATA IN MOTION (Hortonworks DataFlow) and DATA AT REST (Hortonworks Data Platform) yield ACTIONABLE INTELLIGENCE]

Managed Dataflow
[Diagram: SOURCES → REGIONAL INFRASTRUCTURE → CORE INFRASTRUCTURE]

High-Level Overview
[Diagram: IoT Devices connect to single-node IoT Edge instances, which feed a NiFi Hub and Data Broker in the Data Center (on prem/cloud); components shown there include a Column DB and Data Store (HDFS/S3, HBase/Cassandra) and a Live Dashboard]

Labs / Tutorials

Future Tutorials
• Deploying Models with Spark Structured Streaming
• Predicting Airline Delays with SparkR
• Sentiment Analysis with Apache Spark (Gradient Boosting)
• Auto Text Classification (Naïve Bayes)

Hortonworks Community Connection
Read access for everyone, join to participate and be recognized
• Full Q&A Platform (like StackOverflow)
• Knowledge Base Articles
• Code Samples and Repositories

www.futureofdata.io
Future of Data Meetups

FB Sort
• Spark job that reads 60 TB of compressed data and performs a 90 TB shuffle and sort.
• Largest real-world Spark job to date!
– Databricks’ PetaByte sort was on synthetic data.
• Multiple reliability fixes.
“Spark could reliably shuffle and sort 90 TB+ intermediate data and run 250,000 tasks in a single job [...] and it has been running in production for several months.”

What’s new in HDP 2.6 – Spark & Zeppelin
Spark
• Spark 1.6.3 GA
• Spark 2.1 GA
• REST API (Livy) GA
• Spark Thrift Server doAs GA
• SparkSQL – Row/Column Security (GA)
• Spark Streaming + Kafka over SSL
• Multi Cluster HBase support for SHC
• Package support in PySpark & SparkR
Zeppelin 0.7.x
• Spark 2.x support
• Improved Livy integration
• No password in clear
• JDBC interpreter improvements
• SmartSense integration
• Knox proxy Zeppelin UI

Robert Hryniewicz
@RobertH8z
Thanks!
