Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria?

Big Data Analytics
new challenges new tools
Jon Ander Gómez Adrián
jon@dsic.upv.es
Pattern Recognition and Human Language
Technologies (PRHLT) Research Group
Universitat Politècnica de València

Main Idea
How can we take profit of new software developments
for working with (processing, managing, analyzing …)
huge amounts of data?
December 18, 2015 jon@dsic.upv.es 2

What is Big Data?
• The concept or idea of Big Data appears with the
necessity of working with huge amounts of data,
• when the tasks of collecting, storing, processing and
analyzing data cannot be done with a traditional system,
even in High Performance Computing (HPC) systems,
• because the requirements of CPU time (processing
power) and memory (RAM and/or Disk) are too big.

What is Big Data?
• The Big Data phenomenon is a direct consequence of
the digitization of every activity in personal, public and
commercial life [1]
• Smartphones
• Financial transactions
• Internet of Things
• Internet of People ≈ Social Networks
• Wearable Devices
• …

What is Big Data?
• Smartphones
• Conversations
• Geolocation
• Searches (restaurants, cinemas, … )
• People to who each person is connected with

0.00E+00
5.00E+08
1.00E+09
1.50E+09
2.00E+09
2.50E+09
3.00E+09
3.50E+09
4.00E+09
2012 2015 2020
1 billion
2 billion
4 billion
Evolution of the use of Smartphones
2012 2015 2020
Source: Benedict Evans, a partner with Andreessen Horowitz [1,2]

What is Big Data?
• Smartphones
• Credit/Debit card transactions
• Accounting
• Loans’ data / Delay in payment
• Domestic/International transactions between companies
• Type of clients’ purchases
• …

% of conserved data
% of discarded data
0
20
40
60
80
100
Before 1850
1850-1930
1930-1960
1960-1990
Since 2010
Future
Evolution of the percentage of conserved data vs
the volume of generated financial data*
% of conserved data % of discarded data
(*) Non real data for illustrating the relevance of today storing 100% of generated data.

What is Big Data?
• Smartphones
• The growing network of everyday objects equipped with sensors
• that can send and receive data over Internet
• without human intervention
• A good example: Factory 4.0

2020 ≈ 25
billion
(estimation)
2015
4.9 billion
2014
3.4 billion
Source: Gartner Inc. [1]
Objectsequippedwithsensors

What is Big Data?
• Smartphones
• Source of non-structured data
• Data with a high level of ambiguity: metaphor, irony, sarcasm, …
• Text with grammatical mistakes, misspelling, misuse and abuse of symbols that
are no letters, …
• Large variety of images

What is Big Data?
• Smartphones
• Growing number of people monitoring themselves
• and storing all collected data
• In USA people share they vital signs data, collected daily and
properly anonymized, for helping to improve early diagnosis

The Famous Vs of Big Data
Volume vs Storage Capacity
Velocity vs Streaming
Variety vs Structure
Veracity vs Security

What is Big Data?
• A social and economical phenomenon
• Rethinking business strategies: data have high value
• Facing problems in a different way, thanks to
• the availability of enough data for learning statistical
(predictive) models is an inflection point
• The way people live: wearable devices
• Privacy of data and security of the computerized infrastructure
• A set of technological challenges

What is Big Data?
• A social and economical phenomenon
• The value of data: it is business as petrol is
• A set of technological challenges
• Traditional computer systems are not enough to work
with huge volumes of data
• We need to massively exploit low-cost hardware
• New software tools have been developed during recent
years
• Hadoop, Spark, Mesos, … (middleware)

What is Big Data?
• More isn’t just more …
• The basis of commercial enterprise is information
• Big Data tools allows society to deal with more data than ever
• When one changes the amount, one change the form
• The change in scale leads to a change in state
• By having more data, we can fundamentally do new things, with
more accuracy
• More isn’t just more. More is new, better and different [3]

What is Big Data? ─ In summary
• Currently, human beings are collecting all generated data from
different areas of regular file: from financial data up to health data
passing through geolocation data, travelling data, Internet
searches, …
• What implies several technological challenges at different levels
• A lifestyle change  a social and economical phenomena
• Better predictive models  an inflection point
• More isn’t just more, … more is new, more is different, more
is better [3]

Big Data in relation to other areas

Data Driven Decision Making
Business Intelligence
Data Visualization
Data Science
Machine Learning | Data Mining | Information Retrieval | Knowledge Data Discovery
Infrastructure: Hadoop, Spark, Mesos, …
System Manager
Big Data Infrastructure
Data Scientist
Machine Learning …
Data Analyst, CDO?
Statistical Analysis

Pre-processing
Curation
Storing
Curation
KDD
Data & Text Mining
Information Retrieval
Visualization
Synthesis
Analytics
Infrastructure Data Science Analytics

Business
Intelligence
Machine
Learning
Distributed
Computing
and Storage
Data
Visualization
Information
Retrieval
Data
Analytics
Data & Text
Mining
Knowledge
Data
Discovery
Data
Curation

Infrastructure for Big Data
• We need to massively exploit low-cost hardware
• Distributed File Systems for storing Big Data
• Structured and non-structured distributed databases
• Middleware for exploiting the low-cost hardware in
parallel
• Machine Learning algorithms for processing data in
order to extract relevant information
• Analytical and visualization tools for giving support to
decision making

Infrastructure for Big Data
Spark stack [4]

Infrastructure for Big Data:
MapReduce
Source: www.millanor.net

Cloud Service Models

Cloud Service Models
Traditional
Systems
Applications
Data
Runtime
Middleware
O/S
Virtualization
Servers
Storage
Networking
Infrastructure
as a Service
Applications
Data
Runtime
Middleware
O/S
Virtualization
Servers
Storage
Networking
Platform as a
Service
Applications
Data
Runtime
Middleware
O/S
Virtualization
Servers
Storage
Networking
Software as a
Service
Applications
Data
Runtime
Middleware
O/S
Virtualization
Servers
Storage
Networking
Managedbytheclient
Managedbythevendor

References
1. Securing the Big Data Life Cycle, MIT Tech Review
2. The Truly Personal Computer, The Economist,
2015
3. Big Data and the Future of Business, Kenneth
Cukier, The Economist (Reinventing the Company in
the Digital Age, BBVA-OpenMind)
4. Learning Spark, H.Karau, A. Konwinski, P. Wendell
& M. Zaharia, O’Reilly 2015

An Introduction to Spark
and to its Programming
Model
Jon Ander Gómez Adrián
jon@dsic.upv.es
Pattern Recognition and Human Language Technologies
(PRHLT) Research Group
Universitat Politècnica de València

Introduction to Spark
• In a very short time, Apache Spark has emerged as
the next generation big data processing engine.
• Spark improves over Hadoop MapReduce, which
helped ignite the big data revolution.
• It is much faster and much easier to use due to its
rich APIs.
• And it goes far beyond batch applications to support
a variety of workloads, including interactive queries,
streaming, machine learning, and graph processing.

• As parallel data analysis has grown common,
practitioners in many fields have sought easier tools
for this task.
• Apache Spark has quickly emerged as one of the
most popular, extending and generalizing
MapReduce.
• In Spark data is stored into the memory of the
worker nodes, except if data size exceeds the
capacity. Unlike Hadoop, where MapReduce tasks
operates on disk files.

• Spark offers three main benefits:
1. It is easy to use—you can develop applications on your
laptop, using a high-level API that lets you focus on the
content of your computation.
2. Spark is fast, enabling interactive use and complex
algorithms.
3. Spark is a general engine, letting you combine multiple
types of computations (e.g., SQL queries, text
processing, and machine learning) that might
previously have required different engines.
These features make Spark an excellent starting point to
learn about Big Data in general.

Introduction to Spark: history
• Spark is an open source project that has been built and
is maintained by a diverse community of developers.
• Spark started in 2009 as a research project in the UC
Berkeley RAD Lab, that later became AMPLab.
• Research papers were published about Spark at
academic conferences since its creation in 2009.
• It was early used by Machine Learning researchers at
the Mobile Millennium project, where it was used to
monitor and predict traffic congestion in the San
Francisco Bay Area.

What is Apache Spark?
• Apache Spark is a cluster computing platform
designed to be fast and general purpose.
• Spark extends the popular MapReduce model to
efficiently support more types of computations,
including interactive queries and stream processing.
• In addition to run computations in memory, it is
more efficient than Hadoop MapReduce for
complex applications running on disk.

• Spark is designed to cover a wide range of
workloads that previously required separate
distributed systems.
• It is also designed to be highly accessible by offering
simple APIs in Python, Java, Scala and SQL.
• Spark can run in Hadoop clusters and access any
Hadoop data source, including Cassandra.

• As a Unified Stack, Spark contains multiple closely
integrated components.
• In its core, Spark is a computational engine that is
responsible for scheduling, distributing and
monitoring applications.
• Applications that consist of many computational
tasks across many worker machines, or a computer
cluster.

Spark stack [4]

Spark Core
• Contains the basic functionality for
• task scheduling,
• memory management,
• fault recovery,
• interacting with storage systems,
• and more.
• Defines the Resilient Distributed Data sets (RDDs), the
main Spark programming abstraction.
• RDDs represent collections of items distributed across
many worker nodes that can be manipulated in parallel.

Spark SQL
• For working with structured data.
• It allows querying data via SQL as well as the Apache
Hive variant of SQL – called the Hive Query
Language (HQL).
• It supports many sources of data, including Hive
tables, Parquet and JSON.
• Allows developers to mix SQL queries with data
manipulations supported by RDDs in Python, Java
and Scala.

Spark Streaming
• It is a component that enables processing of live
streams of data: log files generated by production
web servers, for instance.
• It provides with an API for manipulating data
streams that closely matches the RDD API.
• Making easy for programmers to learn the project
and move between applications that manipulate
data stored in memory, on disk, or arriving in real
time.

Spark MLlib
• MLlib is a library that contains common Machine
Learning (ML) functionality.
• MLlib provides multiple types of ML algorithms,
including classification, regression, clustering and
collaborative filtering.
• It also supports functionality for model evaluation and
data import.
• MLlib provides some lower-level ML primitives,
including a generic gradient descent algorithm.
• All the methods are designed to scale out across a
cluster.

Spark GraphX
• It is a library for manipulating graphs,
• and performing graph-parallel computations.
• GraphX extends the Spark RDD API, allowing us to create
a directed graph with arbitrary properties attached to
each vertex and edge.
• GraphX also provides various operators for manipulating
graphs (e.g. subgraph and mapVertices)
• And a library of common graphs algorithms (e.g.
PageRank and triangle couting).

Cluster Managers
• Spark is designed to efficiently scale up from one to
many thousands of compute nodes.
• Spark can run over a variety of cluster managers,
• including Hadoop YARN, Apache Mesos,
• and a simple cluster manager included in Spark itself
called the Standalone Scheduler.

Storage Layers for Spark
• Spark can create distributed datasets from any file
stored in the Hadoop distributed filesystem (HDFS)
• or other storage systems supported by Hadoop
APIs,
• including your local filesystem, Amazon S3,
Cassandra, Hive, HBase, etc.
• Spark supports text files, SequenceFiles, Avro,
Parquet, and any other Hadoop InputFormat.

Example Cluster Configuration
node1
HDFS NameNode + Spark Master
node3HDFS DataNode + Spark Slave
node4HDFS DataNode + Spark Slave
nodeNHDFS DataNode + Spark Slave
node2YARN Resource Manager
+ Job History Server + ProxyServer
. . .

Starting Services in the Cluster
• Start up the cluster.
Run the command
$ vagrant up
in the same directory where the file Vagrantfile is located
The Vagrantfile contains the configuration and instructions
for Vagrant including references to the scripts used for
configuring and provisioning the virtual machines.

• Format the HDFS, it should be done the first time
the cluster is started.
• First step: connect to the HDFS NameNode
$ vagrant ssh node-1
• Second step: once logged in the node1 run the
following command:
$ ${HADOOP_HOME}/bin/hdfs namenode -format

• Start HADOOP daemons for HDFS
$ ${HADOOP_HOME}/sbin/start-dfs.sh
Commands in red are to be executed in a node of the cluster,
commands in black are to be executed in the host.

• Start HADOOP daemons for YARN and the
MapReduce Job History Server
$ ${HADOOP_HOME}/sbin/start-yarn.sh*
$ ${HADOOP_HOME}/sbin/mr-jobhistory-daemon.sh
start historyserver --config ${HADOOP_CONF_DIR}
(*) This script should be modified before the first time it is executed by
uncommenting the last line concerning the execution of the proxyserver.

• Start the Spark master.
$ ${SPARK_HOME}/sbin/start-all.sh

Monitoring the cluster services
• HDFS NameNode
http://10.211.55.101:50070/dfshealth.html
• Resource Manager
http://10.211.55.102:8088/cluster
• Job History Server
http://10.211.55.102:19888/jobhistory
• Spark
http://10.211.55.101:8080

Stopping Services in the Cluster
$ ${SPARK_HOME}/sbin/stop-all.sh
$ exit
$ ${HADOOP_HOME}/sbin/mr-jobhistory-daemon.sh stop
historyserver –config ${HADOOP_CONF_DIR}
$ ${HADOOP_HOME}/sbin/stop-yarn.sh
$ exit
$ ${HADOOP_HOME}/sbin/stop-dfs.sh
$ exit

Stopping Services in the Cluster
• Shutting down the cluster
$ vagrant halt
• Or destroying it
$ vagrant destroy
• Every time the cluster is booted, if it is not yet
created Vagrant will create each configured virtual
machine, provision it, configure it by means of the
scripts referenced in the Vagrantfile and finally
each node of the cluster is booted.

Programming environment: Spark concepts
• Every Spark application consists of a driver program
that launches several parallel operations on a
cluster.
• The driver program contains your application’s main
function and defines distributed datasets on the
cluster,
• then applies operations to them.

• Driver programs access Spark through a
SparkContext object which represents a connection
to the computing cluster.
• In a shell the SparkContext is created for you and
available as the variable sc.
• You can use it to build Resilient Distributed Data
(RDD) objects.
• Driver programs manage a number of worker nodes
called executors.

Worker NodeWorker Node
Driver Program
SparkContext
Executor
Task Task
Executor
Task Task

• The Spark API provides with a set of operators to
run functions on the cluster.
• Functions that usually are provided by the
programmer.
lines = sc.textFile( “README.txt” )
vagrantLines = lines.filter( lambda line : “vagrant” in line )
sparkLines = lines.filter( lambda line : “Spark” in line )

• Passing functions to Spark.
With lambda syntax allows us to define “simple”
functions inline. But we can pass defined functions.
def hasHadoop( line ):
return “Hadoop” in line
lines = sc.textFile( “README.txt” )
hadoopLines = lines.filter( hasHadoop )

• Initializing a SparkContext
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster( “local” )
conf.setAppName( “example1” )
sc = SparckContext( conf = conf )

Programming with RDDs
• Spark’s core abstraction for working with data are
the Resilient Distributed Dataset (RDD) objects.
• An RDD object is distributed collection of items.
• All work is expressed as either, creating new RDDs,
transforming existing RDDs, or calling operations on
RDDs to compute a result.
• Spark automatically distributes the data contained
in RDDs across the nodes in the cluster and
parallelizes the operations you perform on them.

• An RDD in Spark is an immutable distributed
collection of objects.
• Each RDD is split into multiple partitions, which can
be computed on different nodes of the cluster.
• RDD objects can contain any type of Python, Java or
Scala objects, including user defined classes.
• Once created, RDDs offer two types of operations
transformations and actions.

• Transformations construct a new RDD object from a
previous one.
• Actions compute a result based on an existing RDD
object, and either return it to the driver program or
save it to an external storage system.
• Transformations and actions are different because
of the way Spark computes RDDs.
• Spark computes RDDs in a lazy way, i.e., the first
time they are used in an action.

• RDDs are by default recomputed each time you run
an action on them.
• If you want to reuse an RDD in multiple actions, you
can ask Spark to persist it using persist().
• Then, Spark will store the RDD contents in memory
(partitioned across the nodes in the cluster), and
reuse them in future actions.
• It is necessary to call unpersist() once you known
the RDD contents will not be used again.

Creating RDDs
• Spark provides two ways for creating RDDs
• Loading an external dataset
lines = sc.textFile( “/path/to/filename” )
• and parallelizing a collection in your driver program
list1 = [“hello”, “world”]
lines = sc.parallelize( list1 )

RDD Operations
• Two types:
• transformations return RDDs,
• actions return a result to the driver program.
• Transformations are operations on RDDs that return a
new RDD. Never modify existing RDDs because are
immutable.
• Transformed RDDs are computed lazily.
• Sparks keeps track of the set of dependencies between
different RDDs, called the lineage graph.
• The lineage graph is used for computing each RDD on
demand, when an action is carried out.

Example of lineage graph

RDD Operations
• Actions are operations that return a final value to
the driver program or write data to an external
storage system.
• Actions force the evaluation of the transformations
required for the RDD they were called on, since they
need to actually produce output.
• Because transformations are lazily executed, Spark
will not begin to execute until it sees an action.

Common Transformations and Actions
• Element-wise transformations:
• map(): takes in a function and applies it to each
element.
• filter(): takes in a function and returns an RDD that
has only the elements that pass the filter function.
nums = sc.parallelize( [1,2,3,4,5,6,7,8,9] )
squares = nums.map( lambda x: x*x )
odd_numbers = squares.filter( lambda x: (x%2)==1 )
sum = odd_numbers.reduce( lambda x,y: x+y )

• Element-wise transformations:
• flatMap(): takes in a function that returns an iterator
lines = sc.parallelize( [ “hello world”, “bye” ] )
words = lines.flatMap( lambda line: line.split() )
print( words.first() )

Pseudo set operations

Actions (uncomplete list)
• collect(): returns all elements from the RDD
• count(): number of elements in the RDD
• countByValue(): number of times each element
occurs in the RDD
• take(num): returns num elements from the RDD
• top(num): returns the top num elements from the RDD
• takeOrdered(num)(ordering): returns num
elements based on the provided ordering

• reduce(func): combines the elements of the RDD
together in parallel
• fold(zero)(func): same as reduce() but with the
provided zero value
• aggregate(zeroValue)(seq_op)(comb_op):
similar to reduce() but used to return a different type
• foreach(func): apply the provided function to each
element of the RDD

• reduceByKey(func): combines values with the same
key
• groupByKey(): Group values with the same key
• mapValues(func): apply a function to each value of a
pair RDD without changing the key
• keys(): returns an RDD of just the keys
• values(): returns an RDD of just the values
• sortByKey(): returns an RDD sorted by the key

Lab practices
• Let’s go to see two basic examples and three Pyhton
programs:
1. An estimation of π
2. Word count of the contents of a file or several files in
the same directory
3. The same word count but loading the contents of each
file separately

Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria?

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria?

Similar to Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria? (20)

More from INACAP

More from INACAP (20)

Recently uploaded

Recently uploaded (20)

Sr. Jon Ander, Internet de las Cosas y Big Data: ¿hacia dónde va la Industria?

Editor's Notes