Big Data Analytics
new challenges new tools
Jon Ander Gómez Adrián
jon@dsic.upv.es
Pattern Recognition and Human Language
Technologies (PRHLT) Research Group
Universitat Politècnica de València
Main Idea
How can we take advantage of new software developments
for working with (processing, managing, analyzing …)
huge amounts of data?
December 18, 2015 jon@dsic.upv.es 2
What is Big Data?
• The concept or idea of Big Data appears with the
necessity of working with huge amounts of data,
• when the tasks of collecting, storing, processing and
analyzing data cannot be done with a traditional system,
even in High Performance Computing (HPC) systems,
• because the requirements of CPU time (processing
power) and memory (RAM and/or Disk) are too big.
What is Big Data?
• The Big Data phenomenon is a direct consequence of
the digitization of every activity in personal, public and
commercial life [1]
• Smartphones
• Financial transactions
• Internet of Things
• Internet of People ≈ Social Networks
• Wearable Devices
• …
What is Big Data?
• Smartphones
• Conversations
• Geolocation
• Searches (restaurants, cinemas, … )
• People to whom each person is connected
• Financial transactions
• Internet of Things
• Internet of People ≈ Social Networks
• Wearable Devices
[Figure: Evolution of the use of smartphones: 1 billion (2012), 2 billion (2015), 4 billion (2020). Source: Benedict Evans, a partner with Andreessen Horowitz [1,2]]
What is Big Data?
• Smartphones
• Financial transactions
• Credit/Debit card transactions
• Accounting
• Loans’ data / Delay in payment
• Domestic/International transactions between companies
• Type of clients’ purchases
• …
• Internet of Things
• Internet of People ≈ Social Networks
• Wearable Devices
[Figure: Evolution of the percentage of conserved data vs. discarded data as the volume of generated financial data grows (*). Periods shown: before 1850, 1850-1930, 1930-1960, 1960-1990, since 2010, future.]
(*) Not real data; it illustrates the relevance of storing, today, 100% of the generated data.
What is Big Data?
• Smartphones
• Financial transactions
• Internet of Things
• The growing network of everyday objects equipped with sensors
• that can send and receive data over Internet
• without human intervention
• A good example: Factory 4.0
• Internet of People ≈ Social Networks
• Wearable Devices
[Figure: Objects equipped with sensors: 3.4 billion (2014), 4.9 billion (2015), about 25 billion (2020, estimation). Source: Gartner Inc. [1]]
What is Big Data?
• Smartphones
• Financial transactions
• Internet of Things
• Internet of People ≈ Social Networks
• Source of non-structured data
• Data with a high level of ambiguity: metaphor, irony, sarcasm, …
• Text with grammatical mistakes, misspellings, and misuse and abuse of symbols that
are not letters, …
• Large variety of images
• Wearable Devices
What is Big Data?
• Smartphones
• Financial transactions
• Internet of Things
• Internet of People ≈ Social Networks
• Wearable Devices
• Growing number of people monitoring themselves
• and storing all collected data
• In the USA, people share their vital-signs data, collected daily and
properly anonymized, to help improve early diagnosis
The Famous Vs of Big Data
Volume vs Storage Capacity
Velocity vs Streaming
Variety vs Structure
Veracity vs Security
What is Big Data?
• A social and economic phenomenon
• Rethinking business strategies: data have high value
• Facing problems in a different way: the availability of
enough data for learning statistical (predictive) models
is an inflection point
• The way people live: wearable devices
• Privacy of data and security of the computerized infrastructure
• A set of technological challenges
What is Big Data?
• A social and economic phenomenon
• The value of data: data is a business, just as petrol is
• A set of technological challenges
• Traditional computer systems are not enough to work
with huge volumes of data
• We need to massively exploit low-cost hardware
• New software tools have been developed during recent
years
• Hadoop, Spark, Mesos, … (middleware)
What is Big Data?
• More isn’t just more …
• The basis of commercial enterprise is information
• Big Data tools allow society to deal with more data than ever
• When one changes the amount, one changes the form
• The change in scale leads to a change in state
• By having more data, we can fundamentally do new things, with
more accuracy
• More isn’t just more. More is new, better and different [3]
What is Big Data? ─ In summary
• Currently, human beings are collecting all the data generated in
different areas of everyday life: from financial data to health data,
passing through geolocation data, travelling data, Internet
searches, …
• This implies several technological challenges at different levels
• A lifestyle change → a social and economic phenomenon
• Better predictive models → an inflection point
• More isn’t just more, … more is new, more is different, more
is better [3]
Big Data in relation to other areas
Data Driven Decision Making
Business Intelligence
Data Visualization
Data Science
Machine Learning | Data Mining | Information Retrieval | Knowledge Data Discovery
Infrastructure: Hadoop, Spark, Mesos, …
[Figure: roles and their areas of work, all supporting Data Driven Decision Making. Labels: System Manager, Big Data Infrastructure; Data Scientist, Machine Learning …; Data Analyst, CDO?, Statistical Analysis]
[Figure: stages grouped under Infrastructure, Data Science and Analytics, leading to Data Driven Decision Making. Labels: Pre-processing, Curation, Storing, Curation, KDD, Data & Text Mining, Information Retrieval, Visualization, Synthesis, Analytics]
[Figure: related areas. Labels: Business Intelligence, Machine Learning, Distributed Computing and Storage, Data Visualization, Information Retrieval, Data Analytics, Data & Text Mining, Knowledge Data Discovery, Data Curation]
Infrastructure for Big Data
• We need to massively exploit low-cost hardware
• Distributed File Systems for storing Big Data
• Structured and non-structured distributed databases
• Middleware for exploiting the low-cost hardware in
parallel
• Machine Learning algorithms for processing data in
order to extract relevant information
• Analytical and visualization tools for giving support to
decision making
Infrastructure for Big Data
[Figure: Spark stack [4]]
Infrastructure for Big Data:
MapReduce
[Figure: MapReduce. Source: www.millanor.net]
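The MapReduce model named on this slide can be illustrated with a minimal, single-machine sketch in plain Python. It is not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase) and the sample lines are hypothetical, chosen only to show the three conceptual phases on a word count.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


# Hypothetical input; in a real cluster each line could live on a different node.
lines = ["big data new tools", "new challenges new tools"]
pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(shuffle_phase(pairs))
```

In a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data over the network; here everything runs sequentially to keep the idea visible.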
Infrastructure for Big Data:
Cloud Service Models
[Figure: four columns (Traditional Systems, Infrastructure as a Service, Platform as a Service, Software as a Service), each showing the same stack of layers: Applications, Data, Runtime, Middleware, O/S, Virtualization, Servers, Storage, Networking. Each layer is marked as either managed by the client or managed by the vendor.]
References
1. Securing the Big Data Life Cycle, MIT Tech Review
2. The Truly Personal Computer, The Economist, 2015
3. Big Data and the Future of Business, Kenneth Cukier, The Economist (Reinventing the Company in the Digital Age, BBVA-OpenMind)
4. Learning Spark, H. Karau, A. Konwinski, P. Wendell & M. Zaharia, O’Reilly, 2015
Thanks for your
attention!
An Introduction to Spark
and to its Programming
Model
Jon Ander Gómez Adrián
jon@dsic.upv.es
Pattern Recognition and Human Language Technologies
(PRHLT) Research Group
Universitat Politècnica de València
Introduction to Spark
• In a very short time, Apache Spark has emerged as
the next generation big data processing engine.
• Spark improves over Hadoop MapReduce, which
helped ignite the big data revolution.
• It is much faster and much easier to use due to its
rich APIs.
• And it goes far beyond batch applications to support
a variety of workloads, including interactive queries,
streaming, machine learning, and graph processing.
Introduction to Spark
• As parallel data analysis has grown common,
practitioners in many fields have sought easier tools
for this task.
• Apache Spark has quickly emerged as one of the
most popular, extending and generalizing
MapReduce.
• In Spark, data is kept in the memory of the
worker nodes unless its size exceeds the available
capacity, unlike Hadoop, where MapReduce tasks
operate on disk files.
Introduction to Spark
• Spark offers three main benefits:
1. It is easy to use—you can develop applications on your
laptop, using a high-level API that lets you focus on the
content of your computation.
2. Spark is fast, enabling interactive use and complex
algorithms.
3. Spark is a general engine, letting you combine multiple
types of computations (e.g., SQL queries, text
processing, and machine learning) that might
previously have required different engines.
These features make Spark an excellent starting point to
learn about Big Data in general.
Introduction to Spark: history
• Spark is an open source project that has been built and
is maintained by a diverse community of developers.
• Spark started in 2009 as a research project in the UC
Berkeley RAD Lab, which later became the AMPLab.
• Research papers about Spark have been published at
academic conferences since its creation in 2009.
• Among its early users were Machine Learning researchers
at the Mobile Millennium project, who used it to
monitor and predict traffic congestion in the San
Francisco Bay Area.
What is Apache Spark?
• Apache Spark is a cluster computing platform
designed to be fast and general purpose.
• Spark extends the popular MapReduce model to
efficiently support more types of computations,
including interactive queries and stream processing.
• In addition to running computations in memory, Spark
is also more efficient than Hadoop MapReduce for
complex applications running on disk.
What is Apache Spark?
• Spark is designed to cover a wide range of
workloads that previously required separate
distributed systems.
• It is also designed to be highly accessible by offering
simple APIs in Python, Java, Scala and SQL.
• Spark can run in Hadoop clusters and access any
Hadoop data source, including Cassandra.
What is Apache Spark?
• As a Unified Stack, Spark contains multiple closely
integrated components.
• At its core, Spark is a computational engine that is
responsible for scheduling, distributing and
monitoring applications.
• These applications consist of many computational
tasks spread across many worker machines, i.e., a
computer cluster.
What is Apache Spark?
[Figure: Spark stack [4]]
Spark Core
• Contains the basic functionality for
• task scheduling,
• memory management,
• fault recovery,
• interacting with storage systems,
• and more.
• Defines the Resilient Distributed Datasets (RDDs), the
main Spark programming abstraction.
• RDDs represent collections of items distributed across
many worker nodes that can be manipulated in parallel.
Spark SQL
• For working with structured data.
• It allows querying data via SQL as well as the Apache
Hive variant of SQL – called the Hive Query
Language (HQL).
• It supports many sources of data, including Hive
tables, Parquet and JSON.
• Allows developers to mix SQL queries with data
manipulations supported by RDDs in Python, Java
and Scala.
Spark Streaming
• It is a component that enables processing of live
streams of data: log files generated by production
web servers, for instance.
• It provides an API for manipulating data
streams that closely matches the RDD API.
• This makes it easy for programmers to learn the project
and move between applications that manipulate
data stored in memory, on disk, or arriving in real
time.
Spark MLlib
• MLlib is a library that contains common Machine
Learning (ML) functionality.
• MLlib provides multiple types of ML algorithms,
including classification, regression, clustering and
collaborative filtering.
• It also supports functionality for model evaluation and
data import.
• MLlib provides some lower-level ML primitives,
including a generic gradient descent algorithm.
• All the methods are designed to scale out across a
cluster.
Spark GraphX
• It is a library for manipulating graphs,
• and performing graph-parallel computations.
• GraphX extends the Spark RDD API, allowing us to create
a directed graph with arbitrary properties attached to
each vertex and edge.
• GraphX also provides various operators for manipulating
graphs (e.g. subgraph and mapVertices)
• And a library of common graph algorithms (e.g.
PageRank and triangle counting).
Cluster Managers
• Spark is designed to efficiently scale up from one to
many thousands of compute nodes.
• Spark can run over a variety of cluster managers,
• including Hadoop YARN, Apache Mesos,
• and a simple cluster manager included in Spark itself
called the Standalone Scheduler.
Storage Layers for Spark
• Spark can create distributed datasets from any file
stored in the Hadoop distributed filesystem (HDFS)
• or other storage systems supported by Hadoop
APIs,
• including your local filesystem, Amazon S3,
Cassandra, Hive, HBase, etc.
• Spark supports text files, SequenceFiles, Avro,
Parquet, and any other Hadoop InputFormat.
Example Cluster Configuration
node1: HDFS NameNode + Spark Master
node2: YARN Resource Manager + Job History Server + ProxyServer
node3: HDFS DataNode + Spark Slave
node4: HDFS DataNode + Spark Slave
. . .
nodeN: HDFS DataNode + Spark Slave
Starting Services in the Cluster
• Start up the cluster.
Run the command
$ vagrant up
in the same directory where the file Vagrantfile is located
The Vagrantfile contains the configuration and instructions
for Vagrant including references to the scripts used for
configuring and provisioning the virtual machines.
Starting Services in the Cluster
• Format HDFS; this should be done the first time
the cluster is started.
• First step: connect to the HDFS NameNode
$ vagrant ssh node-1
• Second step: once logged in to node1, run the
following command:
$ ${HADOOP_HOME}/bin/hdfs namenode -format
Starting Services in the Cluster
• Start HADOOP daemons for HDFS
$ vagrant ssh node-1
$ ${HADOOP_HOME}/sbin/start-dfs.sh
Commands that follow a vagrant ssh line are to be executed in that node of the cluster (shown in red in the original slides);
the vagrant commands themselves are to be executed in the host (shown in black).
Starting Services in the Cluster
• Start HADOOP daemons for YARN and the
MapReduce Job History Server
$ vagrant ssh node-2
$ ${HADOOP_HOME}/sbin/start-yarn.sh    (*)
$ ${HADOOP_HOME}/sbin/mr-jobhistory-daemon.sh \
      start historyserver --config ${HADOOP_CONF_DIR}
(*) Before this script is executed for the first time, it should be modified by
uncommenting its last line, which concerns the execution of the proxyserver.
Starting Services in the Cluster
• Start the Spark master and the workers.
$ vagrant ssh node-1
$ ${SPARK_HOME}/sbin/start-all.sh
Monitoring the cluster services
• HDFS NameNode
http://10.211.55.101:50070/dfshealth.html
• Resource Manager
http://10.211.55.102:8088/cluster
• Job History Server
http://10.211.55.102:19888/jobhistory
• Spark
http://10.211.55.101:8080
Stopping Services in the Cluster
$ vagrant ssh node-1
$ ${SPARK_HOME}/sbin/stop-all.sh
$ exit
$ vagrant ssh node-2
$ ${HADOOP_HOME}/sbin/mr-jobhistory-daemon.sh stop \
      historyserver --config ${HADOOP_CONF_DIR}
$ ${HADOOP_HOME}/sbin/stop-yarn.sh
$ exit
$ vagrant ssh node-1
$ ${HADOOP_HOME}/sbin/stop-dfs.sh
$ exit
Stopping Services in the Cluster
• Shutting down the cluster
$ vagrant halt
• Or destroying it
$ vagrant destroy
• Every time the cluster is booted, if the virtual
machines do not exist yet, Vagrant creates each
configured virtual machine, provisions it and configures
it by means of the scripts referenced in the Vagrantfile;
finally, each node of the cluster is booted.
Programming environment: Spark concepts
• Every Spark application consists of a driver program
that launches several parallel operations on a
cluster.
• The driver program contains your application’s main
function and defines distributed datasets on the
cluster,
• then applies operations to them.
Programming environment: Spark concepts
• Driver programs access Spark through a
SparkContext object which represents a connection
to the computing cluster.
• In a shell the SparkContext is created for you and
available as the variable sc.
• You can use it to build Resilient Distributed Dataset
(RDD) objects.
• Driver programs manage a number of worker nodes
called executors.
Programming environment: Spark concepts
[Figure: the Driver Program, which holds the SparkContext, connected to two Worker Nodes; each Worker Node runs an Executor with several Tasks.]
Programming environment: Spark concepts
• The Spark API provides a set of operators to
run functions on the cluster.
• These functions are usually provided by the
programmer.
lines = sc.textFile("README.txt")
vagrantLines = lines.filter(lambda line: "vagrant" in line)
sparkLines = lines.filter(lambda line: "Spark" in line)
Programming environment: Spark concepts
• Passing functions to Spark.
The lambda syntax allows us to define “simple”
functions inline, but we can also pass functions defined with def.
def hasHadoop(line):
    return "Hadoop" in line
lines = sc.textFile("README.txt")
hadoopLines = lines.filter(hasHadoop)
Programming environment: Spark concepts
• Initializing a SparkContext
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local")
conf.setAppName("example1")
sc = SparkContext(conf=conf)
Programming with RDDs
• Spark’s core abstraction for working with data is
the Resilient Distributed Dataset (RDD).
• An RDD object is a distributed collection of items.
• All work is expressed as either creating new RDDs,
transforming existing RDDs, or calling operations on
RDDs to compute a result.
• Spark automatically distributes the data contained
in RDDs across the nodes in the cluster and
parallelizes the operations you perform on them.
Programming with RDDs
• An RDD in Spark is an immutable distributed
collection of objects.
• Each RDD is split into multiple partitions, which can
be computed on different nodes of the cluster.
• RDD objects can contain any type of Python, Java or
Scala objects, including user defined classes.
• Once created, RDDs offer two types of operations:
transformations and actions.
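The idea of a collection split into partitions that are processed independently can be sketched in plain Python. This is only an analogy and not Spark code: worker threads stand in for cluster nodes, and the helper names (split_into_partitions, square_partition) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(items, num_partitions):
    """Distribute the items round-robin over num_partitions lists."""
    return [items[i::num_partitions] for i in range(num_partitions)]


def square_partition(partition):
    """Work done independently on one partition."""
    return [x * x for x in partition]


data = list(range(1, 11))
partitions = split_into_partitions(data, 3)

# Each partition is handled by a different worker thread,
# standing in for a different node of the cluster.
with ThreadPoolExecutor(max_workers=3) as pool:
    processed = list(pool.map(square_partition, partitions))

# Partial results are combined at the end, as a driver program would do.
total = sum(sum(part) for part in processed)
```

Because no partition depends on another, the work can be spread over as many workers as there are partitions.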
Programming with RDDs
• Transformations construct a new RDD object from a
previous one.
• Actions compute a result based on an existing RDD
object, and either return it to the driver program or
save it to an external storage system.
• Transformations and actions are different because
of the way Spark computes RDDs.
• Spark computes RDDs in a lazy way, i.e., the first
time they are used in an action.
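Lazy evaluation can be illustrated with a toy class in plain Python. LazyDataset is a hypothetical stand-in written for this sketch, not part of the Spark API: transformations only record what has to be done, and the recorded steps run when an action (here collect()) is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, func):
        # Transformation: returns a new dataset and computes nothing.
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, func):
        # Transformation: returns a new dataset and computes nothing.
        return LazyDataset(self._source, self._steps + (("filter", func),))

    def collect(self):
        # Action: only now are the recorded steps applied to the data.
        items = list(self._source)
        for kind, func in self._steps:
            if kind == "map":
                items = [func(x) for x in items]
            else:
                items = [x for x in items if func(x)]
        return items


calls = []


def traced_square(x):
    calls.append(x)  # records when the function really runs
    return x * x


squares = LazyDataset([1, 2, 3, 4]).map(traced_square)
calls_before_action = len(calls)  # nothing has been computed yet
evens = squares.filter(lambda x: x % 2 == 0).collect()
```

The call counter stays at zero until collect() runs, which is the behaviour the slide describes for Spark transformations.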
Programming with RDDs
• RDDs are by default recomputed each time you run
an action on them.
• If you want to reuse an RDD in multiple actions, you
can ask Spark to persist it using persist().
• Then, Spark will store the RDD contents in memory
(partitioned across the nodes in the cluster), and
reuse them in future actions.
• Call unpersist() once you know the RDD contents
will not be used again, so that the memory is released.
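The effect of persisting can be shown with a small counter in plain Python. CountingDataset is a hypothetical class written for this sketch; only the method names persist() and unpersist() are borrowed from the slide, and the caching logic is a simplification of what Spark does.

```python
class CountingDataset:
    """Toy dataset that counts how often its contents are recomputed."""

    def __init__(self, compute):
        self._compute = compute  # function producing the contents
        self._cache = None
        self._persisted = False
        self.computations = 0

    def persist(self):
        self._persisted = True
        return self

    def unpersist(self):
        self._persisted = False
        self._cache = None  # release the stored contents
        return self

    def _contents(self):
        if self._persisted and self._cache is not None:
            return self._cache
        self.computations += 1
        result = self._compute()
        if self._persisted:
            self._cache = result
        return result

    # Two actions that both need the contents.
    def count(self):
        return len(self._contents())

    def total(self):
        return sum(self._contents())


plain = CountingDataset(lambda: [x * x for x in range(5)])
plain.count()
plain.total()   # contents recomputed for the second action

cached = CountingDataset(lambda: [x * x for x in range(5)]).persist()
cached.count()
cached.total()  # contents reused from the first action
```

Without persist() the contents are computed once per action; with it they are computed once and reused.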
Creating RDDs
• Spark provides two ways for creating RDDs
• Loading an external dataset
lines = sc.textFile("/path/to/filename")
• and parallelizing a collection in your driver program
list1 = ["hello", "world"]
lines = sc.parallelize(list1)
RDD Operations
• Two types:
• transformations return RDDs,
• actions return a result to the driver program.
• Transformations are operations on RDDs that return a
new RDD. They never modify existing RDDs, because RDDs
are immutable.
• Transformed RDDs are computed lazily.
• Spark keeps track of the set of dependencies between
different RDDs, called the lineage graph.
• The lineage graph is used for computing each RDD on
demand, when an action is carried out.
Example of lineage graph
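A lineage can be pictured as each dataset remembering the dataset it was derived from. The sketch below is plain Python with a hypothetical Node class; it models only single-parent chains, whereas a real lineage graph may have several parents per RDD (for example after a union).

```python
class Node:
    """Toy dataset that remembers which dataset it was derived from."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def derive(self, name):
        # A transformation creates a child that points back to its parent.
        return Node(name, parent=self)

    def lineage(self):
        # Walk back to the original source, then report source first.
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))


# Mirrors the README.txt filters used earlier in the slides.
lines = Node("textFile(README.txt)")
spark_lines = lines.derive("filter(contains 'Spark')")
vagrant_lines = lines.derive("filter(contains 'vagrant')")
```

Both filtered datasets share the same parent, so either one can be recomputed from the source on demand. In PySpark itself, the lineage of an RDD can be inspected with its toDebugString() method.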
RDD Operations
• Actions are operations that return a final value to
the driver program or write data to an external
storage system.
• Actions force the evaluation of the transformations
required for the RDD they were called on, since they
need to actually produce output.
• Because transformations are lazily executed, Spark
will not begin to execute until it sees an action.
Common Transformations and Actions
• Element-wise transformations:
• map(): takes in a function and applies it to each
element.
• filter(): takes in a function and returns an RDD that
has only the elements that pass the filter function.
nums = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9])
squares = nums.map(lambda x: x * x)
odd_numbers = squares.filter(lambda x: (x % 2) == 1)
total = odd_numbers.reduce(lambda x, y: x + y)
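To check by hand what the chain above computes, the same three steps can be reproduced on an ordinary Python list. This is a local equivalent for illustration only; nothing here runs on Spark.

```python
from functools import reduce

nums = [1, 2, 3, 4, 5, 6, 7, 8, 9]
squares = [x * x for x in nums]                    # like map()
odd_numbers = [x for x in squares if x % 2 == 1]   # like filter()
total = reduce(lambda x, y: x + y, odd_numbers)    # like reduce()
```

The odd squares are 1, 9, 25, 49 and 81, so the final value is 165.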
Common Transformations and Actions
• Element-wise transformations:
• flatMap(): takes in a function that returns an iterator
for each element, and flattens all the results into a single RDD
lines = sc.parallelize(["hello world", "bye"])
words = lines.flatMap(lambda line: line.split())
print(words.first())
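The difference between map() and flatMap() is easiest to see side by side. The following plain-Python sketch mimics both on a list, using itertools.chain for the flattening step; it is an analogy, not Spark code.

```python
from itertools import chain

lines = ["hello world", "bye"]

# Like map(): one output element (a list of words) per input line.
mapped = [line.split() for line in lines]

# Like flatMap(): the per-line results are flattened into one sequence.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

# Like first(): the first element of the flattened result.
first_word = flat_mapped[0]
```

Here mapped keeps one list per line, while flat_mapped is a single list of three words.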
Common Transformations and Actions
Pseudo set operations
Actions (incomplete list)
• collect(): returns all elements from the RDD
• count(): number of elements in the RDD
• countByValue(): number of times each element
occurs in the RDD
• take(num): returns num elements from the RDD
• top(num): returns the top num elements from the RDD
• takeOrdered(num)(ordering): returns num
elements based on the provided ordering
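What each of these actions returns can be illustrated with local equivalents on a Python list. The sample data is hypothetical and the standard-library calls only imitate the results; on a cluster the elements would come from several partitions. Note that the curried form takeOrdered(num)(ordering) follows the Scala API; in PySpark the call is takeOrdered(num, key=None).

```python
from collections import Counter
import heapq

data = [3, 5, 3, 9, 1, 5, 3]

collected = list(data)                    # like collect()
count = len(data)                         # like count()
count_by_value = dict(Counter(data))      # like countByValue()
first_two = data[:2]                      # like take(2)
top_two = heapq.nlargest(2, data)         # like top(2)
smallest_two = heapq.nsmallest(2, data)   # like takeOrdered(2)
```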
Actions (incomplete list)
• reduce(func): combines the elements of the RDD
together in parallel
• fold(zero)(func): same as reduce() but with the
provided zero value
• aggregate(zeroValue)(seq_op)(comb_op):
similar to reduce() but used to return a different type
• foreach(func): apply the provided function to each
element of the RDD
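The role of the two functions passed to aggregate() can be sketched locally: one folds elements into a per-partition result, the other merges partition results. The example below computes an average, returning a (sum, count) pair whose type differs from the element type. The partition layout and the names seq_op and comb_op are assumptions made for this plain-Python illustration.

```python
from functools import reduce

# Data as it might be split across three nodes (hypothetical layout).
partitions = [[1, 2, 3], [4, 5], [6]]

zero = (0, 0)  # (running sum, running count)


def seq_op(acc, x):
    """Fold one element into the result of a partition."""
    return (acc[0] + x, acc[1] + 1)


def comb_op(a, b):
    """Merge the results of two partitions."""
    return (a[0] + b[0], a[1] + b[1])


partials = [reduce(seq_op, part, zero) for part in partitions]
total, count = reduce(comb_op, partials, zero)
average = total / count
```

The zero value must be neutral for both functions, since it is used once per partition and once more when merging.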
Transformations on pair RDDs (incomplete list)
• reduceByKey(func): combines values with the same
key
• groupByKey(): Group values with the same key
• mapValues(func): apply a function to each value of a
pair RDD without changing the key
• keys(): returns an RDD of just the keys
• values(): returns an RDD of just the values
• sortByKey(): returns an RDD sorted by the key
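The key-based operations listed above work on collections of (key, value) pairs. Their results can be imitated on a local list of tuples, as in this plain-Python sketch with hypothetical sample data.

```python
from collections import defaultdict

pairs = [("spark", 1), ("hadoop", 1), ("spark", 1), ("mesos", 1), ("spark", 1)]

# Like groupByKey(): collect all values that share a key.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)
grouped = dict(groups)

# Like reduceByKey(lambda a, b: a + b): one combined value per key.
reduced = {key: sum(values) for key, values in grouped.items()}

# Like mapValues(lambda v: v * 10): keys are left untouched.
scaled = [(key, value * 10) for key, value in pairs]

keys = [key for key, _ in pairs]          # like keys()
values = [value for _, value in pairs]    # like values()
sorted_pairs = sorted(reduced.items())    # like sortByKey()
```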
Lab practices
• Let’s look at two basic examples and three Python
programs:
1. An estimation of π
2. Word count of the contents of a file or several files in
the same directory
3. The same word count but loading the contents of each
file separately
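The lab programs themselves are not included in the slides. As an assumption about the first one, a common way to estimate π is the Monte Carlo method: draw random points in the unit square and count how many fall inside the quarter circle. The plain-Python sketch below uses the same filter-and-count shape that a Spark version would use; the function name, sample size and seed are hypothetical.

```python
import random


def inside_quarter_circle(_):
    """Draw a random point in the unit square; True if inside the quarter circle."""
    x, y = random.random(), random.random()
    return x * x + y * y < 1.0


random.seed(42)          # fixed seed so the run is repeatable
num_samples = 100_000

# filter() keeps the samples that landed inside; counting them is the "action".
hits = sum(1 for _ in filter(inside_quarter_circle, range(num_samples)))
pi_estimate = 4.0 * hits / num_samples
```

The fraction of hits approximates π/4, so the estimate improves as the number of samples grows, which is why the task parallelizes well.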
Thanks for your
attention!

 
Big Data et eGovernment
Big Data et eGovernmentBig Data et eGovernment
Big Data et eGovernment
 
Big data Analytics
Big data Analytics Big data Analytics
Big data Analytics
 
Bigdatappt 140225061440-phpapp01
Bigdatappt 140225061440-phpapp01Bigdatappt 140225061440-phpapp01
Bigdatappt 140225061440-phpapp01
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
SKILLWISE-BIGDATA ANALYSIS
SKILLWISE-BIGDATA ANALYSISSKILLWISE-BIGDATA ANALYSIS
SKILLWISE-BIGDATA ANALYSIS
 
BigDataFinal.pptx
BigDataFinal.pptxBigDataFinal.pptx
BigDataFinal.pptx
 
big-datagroup6-150317090053-conversion-gate01.pdf
big-datagroup6-150317090053-conversion-gate01.pdfbig-datagroup6-150317090053-conversion-gate01.pdf
big-datagroup6-150317090053-conversion-gate01.pdf
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
 
Big data
Big dataBig data
Big data
 
ppt final.pptx
ppt final.pptxppt final.pptx
ppt final.pptx
 
Presentation on Big Data
Presentation on Big DataPresentation on Big Data
Presentation on Big Data
 
Big_Data_ppt[1] (1).pptx
Big_Data_ppt[1] (1).pptxBig_Data_ppt[1] (1).pptx
Big_Data_ppt[1] (1).pptx
 
Identifying the new frontier of big data as an enabler for T&T industries: Re...
Identifying the new frontier of big data as an enabler for T&T industries: Re...Identifying the new frontier of big data as an enabler for T&T industries: Re...
Identifying the new frontier of big data as an enabler for T&T industries: Re...
 
Special issues on big data
Special issues on big dataSpecial issues on big data
Special issues on big data
 
The New Convergence of Data; The Next Strategic Business Advantage
The New Convergence of Data; The Next Strategic Business AdvantageThe New Convergence of Data; The Next Strategic Business Advantage
The New Convergence of Data; The Next Strategic Business Advantage
 
Datascience
DatascienceDatascience
Datascience
 
Content1. Introduction2. What is Big Data3. Characte.docx
Content1. Introduction2. What is Big Data3. Characte.docxContent1. Introduction2. What is Big Data3. Characte.docx
Content1. Introduction2. What is Big Data3. Characte.docx
 

Mr. Jon Ander, Internet of Things and Big Data: Where Is Industry Heading?

  • 1. Big Data Analytics: new challenges, new tools. Jon Ander Gómez Adrián (jon@dsic.upv.es), Pattern Recognition and Human Language Technologies (PRHLT) Research Group, Universitat Politècnica de València
  • 2. Main Idea: How can we take advantage of new software developments to work with (process, manage, analyze, …) huge amounts of data? December 18, 2015 jon@dsic.upv.es
  • 3. What is Big Data? • The concept of Big Data arises from the need to work with huge amounts of data, • when collecting, storing, processing and analyzing the data cannot be done with a traditional system, or even with a High Performance Computing (HPC) system, • because the requirements in CPU time (processing power) and memory (RAM and/or disk) are too large.
  • 4. What is Big Data? • The Big Data phenomenon is a direct consequence of the digitization of every activity in personal, public and commercial life [1] • Smartphones • Financial transactions • Internet of Things • Internet of People ≈ Social Networks • Wearable Devices • …
  • 5. What is Big Data? • Smartphones • Conversations • Geolocation • Searches (restaurants, cinemas, …) • People each person is connected with • Financial transactions • Internet of Things • Internet of People ≈ Social Networks • Wearable Devices
  • 6. [Chart: Evolution of the use of Smartphones. 2012: 1 billion; 2015: 2 billion; 2020: 4 billion.] Source: Benedict Evans, a partner with Andreessen Horowitz [1,2]
  • 7. What is Big Data? • Smartphones • Financial transactions • Credit/Debit card transactions • Accounting • Loan data / Payment delays • Domestic/International transactions between companies • Types of customer purchases • … • Internet of Things • Internet of People ≈ Social Networks • Wearable Devices
  • 8. [Chart: Evolution of the percentage of conserved data vs the volume of generated financial data (*). Series: % of conserved data, % of discarded data. Periods: Before 1850, 1850-1930, 1930-1960, 1960-1990, Since 2010, Future.] (*) Not real data; the chart only illustrates the relevance of storing 100% of generated data today.
  • 9. What is Big Data? • Smartphones • Financial transactions • Internet of Things • The growing network of everyday objects equipped with sensors • that can send and receive data over the Internet • without human intervention • A good example: Factory 4.0 • Internet of People ≈ Social Networks • Wearable Devices
  • 10. [Chart: Objects equipped with sensors. 2014: 3.4 billion; 2015: 4.9 billion; 2020: ≈ 25 billion (estimate).] Source: Gartner Inc. [1]
  • 11. What is Big Data? • Smartphones • Financial transactions • Internet of Things • Internet of People ≈ Social Networks • Source of unstructured data • Data with a high level of ambiguity: metaphor, irony, sarcasm, … • Text with grammatical mistakes, misspellings, and misuse and abuse of non-letter symbols, … • Large variety of images • Wearable Devices
  • 12. What is Big Data? • Smartphones • Financial transactions • Internet of Things • Internet of People ≈ Social Networks • Wearable Devices • Growing number of people monitoring themselves • and storing all collected data • In the USA, people share their vital-sign data, collected daily and properly anonymized, to help improve early diagnosis
  • 13. The Famous Vs of Big Data • Volume vs Storage Capacity • Velocity vs Streaming • Variety vs Structure • Veracity vs Security
  • 14. What is Big Data? • A social and economic phenomenon • Rethinking business strategies: data have high value • Facing problems in a different way: the availability of enough data to learn statistical (predictive) models is an inflection point • The way people live: wearable devices • Privacy of data and security of the computerized infrastructure • A set of technological challenges
  • 15. What is Big Data? • A social and economic phenomenon • The value of data: data are a business, just as petrol is • A set of technological challenges • Traditional computer systems are not enough to work with huge volumes of data • We need to massively exploit low-cost hardware • New software tools have been developed in recent years • Hadoop, Spark, Mesos, … (middleware)
  • 16. What is Big Data? • More isn’t just more … • The basis of commercial enterprise is information • Big Data tools allow society to deal with more data than ever • When one changes the amount, one changes the form • The change in scale leads to a change in state • By having more data, we can do fundamentally new things, with more accuracy • More isn’t just more. More is new, better and different [3]
  • 17. What is Big Data? In summary • Currently, human beings are collecting all the data generated in different areas of regular life: from financial data to health data, including geolocation data, travel data, Internet searches, … • This implies several technological challenges at different levels • A lifestyle change → a social and economic phenomenon • Better predictive models → an inflection point • More isn’t just more: more is new, more is different, more is better [3]
  • 18. Big Data in relation to other areas
  • 19. Data Driven Decision Making [Diagram] • Business Intelligence, Data Visualization • Data Science: Machine Learning | Data Mining | Information Retrieval | Knowledge Data Discovery • Infrastructure: Hadoop, Spark, Mesos, … • Roles: System Manager (Big Data Infrastructure); Data Scientist (Machine Learning, …); Data Analyst, CDO? (Statistical Analysis)
  • 20. Data Driven Decision Making [Diagram labels: Pre-processing, Curation, Storing, Curation, KDD, Data & Text Mining, Information Retrieval, Visualization, Synthesis, Analytics; groups: Infrastructure, Data Science, Analytics]
  • 21. Data Driven Decision Making [Diagram labels: Business Intelligence, Machine Learning, Distributed Computing and Storage, Data Visualization, Information Retrieval, Data Analytics, Data & Text Mining, Knowledge Data Discovery, Data Curation]
  • 22. Infrastructure for Big Data • We need to massively exploit low-cost hardware • Distributed file systems for storing Big Data • Structured and unstructured distributed databases • Middleware for exploiting the low-cost hardware in parallel • Machine Learning algorithms for processing data in order to extract relevant information • Analytical and visualization tools to support decision making
  • 24. Infrastructure for Big Data [Figure: Spark stack [4]]
  • 25. Infrastructure for Big Data: MapReduce [Figure] Source: www.millanor.net
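The MapReduce model named on this slide can be sketched in a few lines of plain Python. This is only a single-machine illustration of the map, shuffle and reduce phases applied to word counting; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and are not part of Hadoop or Spark.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit one (word, 1) pair for every word in a document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    docs = ["big data needs new tools", "new tools for big data"]
    print(word_count(docs))
```

In a real cluster the map and reduce calls would run on different worker machines and the shuffle would move data over the network; here everything runs in one process so that the data flow is easy to follow.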
  • 26. Infrastructure for Big Data: Cloud Service Models
  • 27. Infrastructure for Big Data: Cloud Service Models [Figure: the stack Applications, Data, Runtime, Middleware, O/S, Virtualization, Servers, Storage, Networking is shown for Traditional Systems, Infrastructure as a Service, Platform as a Service and Software as a Service; each layer is marked as managed by the client or managed by the vendor]
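The transcript loses the colour coding that shows which layers each party manages. The sketch below encodes the split as it is commonly drawn for these four models; that split is an assumption taken from the usual version of this diagram, not something recoverable from the slide text, and the names `LAYERS`, `CLIENT_MANAGED_TOP_LAYERS` and `responsibilities` are hypothetical.

```python
# Layers of the stack, top to bottom, as listed on the slide.
LAYERS = [
    "Applications", "Data", "Runtime", "Middleware", "O/S",
    "Virtualization", "Servers", "Storage", "Networking",
]

# ASSUMPTION: how many of the top layers the client manages in each model,
# following the commonly cited split (the slide's own marking is lost).
CLIENT_MANAGED_TOP_LAYERS = {
    "Traditional Systems": 9,
    "Infrastructure as a Service": 5,
    "Platform as a Service": 2,
    "Software as a Service": 0,
}


def responsibilities(model):
    """Return which layers the client and the vendor manage for a model."""
    n = CLIENT_MANAGED_TOP_LAYERS[model]
    return {"client": LAYERS[:n], "vendor": LAYERS[n:]}


if __name__ == "__main__":
    for model in CLIENT_MANAGED_TOP_LAYERS:
        split = responsibilities(model)
        print(model, "| client:", split["client"], "| vendor:", split["vendor"])
```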
  • 28. References: 1. Securing the Big Data Life Cycle, MIT Tech Review. 2. The Truly Personal Computer, The Economist, 2015. 3. Big Data and the Future of Business, Kenneth Cukier, The Economist (Reinventing the Company in the Digital Age, BBVA-OpenMind). 4. Learning Spark, H. Karau, A. Konwinski, P. Wendell & M. Zaharia, O’Reilly, 2015.
  • 30. An Introduction to Spark and to its Programming Model. Jon Ander Gómez Adrián (jon@dsic.upv.es), Pattern Recognition and Human Language Technologies (PRHLT) Research Group, Universitat Politècnica de València
  • 31. Introduction to Spark • In a very short time, Apache Spark has emerged as the next generation big data processing engine. • Spark improves over Hadoop MapReduce, which helped ignite the big data revolution. • It is much faster and much easier to use due to its rich APIs. • And it goes far beyond batch applications to support a variety of workloads, including interactive queries, streaming, machine learning, and graph processing.
  • 32. Introduction to Spark • As parallel data analysis has grown common, practitioners in many fields have sought easier tools for this task. • Apache Spark has quickly emerged as one of the most popular, extending and generalizing MapReduce. • In Spark, data is kept in the memory of the worker nodes unless its size exceeds the available capacity, unlike Hadoop, where MapReduce tasks operate on disk files.
  • 33. Introduction to Spark • Spark offers three main benefits: 1. It is easy to use: you can develop applications on your laptop, using a high-level API that lets you focus on the content of your computation. 2. Spark is fast, enabling interactive use and complex algorithms. 3. Spark is a general engine, letting you combine multiple types of computations (e.g., SQL queries, text processing, and machine learning) that might previously have required different engines. These features make Spark an excellent starting point to learn about Big Data in general.
  • 34. Introduction to Spark: history • Spark is an open source project that has been built and is maintained by a diverse community of developers. • Spark started in 2009 as a research project in the UC Berkeley RAD Lab, which later became the AMPLab. • Research papers about Spark have been published at academic conferences since its creation in 2009. • Early on, it was used by Machine Learning researchers in the Mobile Millennium project to monitor and predict traffic congestion in the San Francisco Bay Area.
  • 35. What is Apache Spark? • Apache Spark is a cluster computing platform designed to be fast and general purpose. • Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. • In addition to running computations in memory, it is also more efficient than Hadoop MapReduce for complex applications running on disk.
  • 36. What is Apache Spark? • Spark is designed to cover a wide range of workloads that previously required separate distributed systems. • It is also designed to be highly accessible by offering simple APIs in Python, Java, Scala and SQL. • Spark can run in Hadoop clusters and access any Hadoop data source, including Cassandra.
  • 37. What is Apache Spark? • As a Unified Stack, Spark contains multiple closely integrated components. • At its core, Spark is a computational engine that is responsible for scheduling, distributing and monitoring applications that consist of many computational tasks spread across many worker machines, i.e. a computing cluster.
  • 38. What is Apache Spark? • [Figure: Spark stack [4]]
  • 39. Spark Core • Contains the basic functionality for • task scheduling, • memory management, • fault recovery, • interacting with storage systems, • and more. • Defines the Resilient Distributed Datasets (RDDs), the main Spark programming abstraction. • RDDs represent collections of items distributed across many worker nodes that can be manipulated in parallel.
  • 40. Spark SQL • A component for working with structured data. • It allows querying data via SQL as well as via the Apache Hive variant of SQL, called the Hive Query Language (HQL). • It supports many sources of data, including Hive tables, Parquet and JSON. • It allows developers to mix SQL queries with the data manipulations supported by RDDs in Python, Java and Scala.
  • 41. Spark Streaming • It is a component that enables processing of live streams of data: log files generated by production web servers, for instance. • It provides an API for manipulating data streams that closely matches the RDD API. • This makes it easy for programmers to learn the project and to move between applications that manipulate data stored in memory, on disk, or arriving in real time.
  • 42. Spark MLlib • MLlib is a library that contains common Machine Learning (ML) functionality. • MLlib provides multiple types of ML algorithms, including classification, regression, clustering and collaborative filtering. • It also supports functionality for model evaluation and data import. • MLlib provides some lower-level ML primitives, including a generic gradient descent algorithm. • All the methods are designed to scale out across a cluster.
  • 43. Spark GraphX • It is a library for manipulating graphs and performing graph-parallel computations. • GraphX extends the Spark RDD API, allowing us to create a directed graph with arbitrary properties attached to each vertex and edge. • GraphX also provides various operators for manipulating graphs (e.g. subgraph and mapVertices) and a library of common graph algorithms (e.g. PageRank and triangle counting).
  • 44. Cluster Managers • Spark is designed to efficiently scale up from one to many thousands of compute nodes. • Spark can run over a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster manager included in Spark itself called the Standalone Scheduler.
  • 45. Storage Layers for Spark • Spark can create distributed datasets from any file stored in the Hadoop distributed filesystem (HDFS) or in other storage systems supported by the Hadoop APIs, including your local filesystem, Amazon S3, Cassandra, Hive, HBase, etc. • Spark supports text files, SequenceFiles, Avro, Parquet, and any other Hadoop InputFormat.
  • 46. Example Cluster Configuration
      node1: HDFS NameNode + Spark Master
      node2: YARN Resource Manager + Job History Server + ProxyServer
      node3: HDFS DataNode + Spark Slave
      node4: HDFS DataNode + Spark Slave
      ...
      nodeN: HDFS DataNode + Spark Slave
  • 47. Starting Services in the Cluster • Start up the cluster by running the following command in the directory where the file Vagrantfile is located:
      $ vagrant up
    • The Vagrantfile contains the configuration and instructions for Vagrant, including references to the scripts used for configuring and provisioning the virtual machines.
  • 48. Starting Services in the Cluster • Format HDFS. This should be done the first time the cluster is started. • First step: connect to the HDFS NameNode:
      $ vagrant ssh node-1
    • Second step: once logged in to node1, run the following command:
      $ ${HADOOP_HOME}/bin/hdfs namenode -format
  • 49. Starting Services in the Cluster • Start the Hadoop daemons for HDFS:
      $ vagrant ssh node-1
      $ ${HADOOP_HOME}/sbin/start-dfs.sh
    • Note: on the original slides, commands in red are to be executed in a node of the cluster, and commands in black are to be executed in the host.
  • 50. Starting Services in the Cluster • Start the Hadoop daemons for YARN and the MapReduce Job History Server:
      $ vagrant ssh node-2
      $ ${HADOOP_HOME}/sbin/start-yarn.sh
      $ ${HADOOP_HOME}/sbin/mr-jobhistory-daemon.sh start historyserver --config ${HADOOP_CONF_DIR}
    • Note: before start-yarn.sh is executed for the first time, it should be modified by uncommenting its last line, which concerns the execution of the proxyserver.
  • 51. Starting Services in the Cluster • Start the Spark master and its workers:
      $ vagrant ssh node-1
      $ ${SPARK_HOME}/sbin/start-all.sh
  • 52. Monitoring the cluster services
      • HDFS NameNode: http://10.211.55.101:50070/dfshealth.html
      • Resource Manager: http://10.211.55.102:8088/cluster
      • Job History Server: http://10.211.55.102:19888/jobhistory
      • Spark: http://10.211.55.101:8080
  • 53. Stopping Services in the Cluster
      $ vagrant ssh node-1
      $ ${SPARK_HOME}/sbin/stop-all.sh
      $ exit
      $ vagrant ssh node-2
      $ ${HADOOP_HOME}/sbin/mr-jobhistory-daemon.sh stop historyserver --config ${HADOOP_CONF_DIR}
      $ ${HADOOP_HOME}/sbin/stop-yarn.sh
      $ exit
      $ vagrant ssh node-1
      $ ${HADOOP_HOME}/sbin/stop-dfs.sh
      $ exit
  • 54. Stopping Services in the Cluster • Shut down the cluster:
      $ vagrant halt
    • Or destroy it:
      $ vagrant destroy
    • Whenever the cluster is booted and its virtual machines do not exist yet, Vagrant creates each configured virtual machine, provisions and configures it by means of the scripts referenced in the Vagrantfile, and finally boots each node of the cluster.
  • 55. Programming environment: Spark concepts • Every Spark application consists of a driver program that launches several parallel operations on a cluster. • The driver program contains your application’s main function, defines distributed datasets on the cluster, and then applies operations to them.
  • 56. Programming environment: Spark concepts • Driver programs access Spark through a SparkContext object, which represents a connection to the computing cluster. • In a shell the SparkContext is created for you and is available as the variable sc. • You can use it to build Resilient Distributed Dataset (RDD) objects. • Driver programs manage a number of executors, the processes that run tasks on the worker nodes.
  • 57. Programming environment: Spark concepts • [Figure: a Driver Program holding the SparkContext, connected to two Worker Nodes; each Worker Node runs an Executor with two Tasks]
  • 58. Programming environment: Spark concepts • The Spark API provides a set of operators that run functions on the cluster. • These functions are usually provided by the programmer.
      lines = sc.textFile( "README.txt" )
      vagrantLines = lines.filter( lambda line: "vagrant" in line )
      sparkLines = lines.filter( lambda line: "Spark" in line )
  • 59. Programming environment: Spark concepts • Passing functions to Spark: the lambda syntax allows us to define “simple” functions inline, but we can also pass functions defined with def.
      def hasHadoop( line ):
          return "Hadoop" in line

      lines = sc.textFile( "README.txt" )
      hadoopLines = lines.filter( hasHadoop )
  • 60. Programming environment: Spark concepts • Initializing a SparkContext:
      from pyspark import SparkConf, SparkContext

      conf = SparkConf().setMaster( "local" )   # run Spark locally
      conf.setAppName( "example1" )             # name shown for this application
      sc = SparkContext( conf = conf )
  • 61. Programming with RDDs • Spark’s core abstraction for working with data is the Resilient Distributed Dataset (RDD). • An RDD object is a distributed collection of items. • All work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. • Spark automatically distributes the data contained in RDDs across the nodes in the cluster and parallelizes the operations you perform on them.
  • 62. Programming with RDDs • An RDD in Spark is an immutable distributed collection of objects. • Each RDD is split into multiple partitions, which can be computed on different nodes of the cluster. • RDD objects can contain any type of Python, Java or Scala objects, including user-defined classes. • Once created, RDDs offer two types of operations: transformations and actions.
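The partitioning idea can be illustrated without a cluster. What follows is a plain-Python sketch, not Spark code: `split_into_partitions` and `map_partitions` are hypothetical helpers introduced only for this illustration, and threads stand in for worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(items, num_partitions):
    """Split a list into num_partitions chunks of nearly equal size."""
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


def map_partitions(func, partitions):
    """Apply func to each partition independently (threads stand in for workers)."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(func, partitions))


data = list(range(1, 11))                    # the numbers 1..10
parts = split_into_partitions(data, 3)       # three partitions
partial_sums = map_partitions(sum, parts)    # one partial result per partition
total = sum(partial_sums)                    # combined on the "driver"
```

Each partition is processed on its own and only the small partial results are combined at the end, which is the pattern Spark applies across the nodes of a cluster.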
  • 63. Programming with RDDs • Transformations construct a new RDD object from a previous one. • Actions compute a result based on an existing RDD object, and either return it to the driver program or save it to an external storage system. • Transformations and actions are different because of the way Spark computes RDDs. • Spark computes RDDs in a lazy way, i.e., the first time they are used in an action.
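The difference between transformations and actions can be imitated in a few lines of plain Python. This is only a toy model under stated assumptions: `LazyCollection` is a hypothetical class written for this illustration and is not part of Spark; it records transformations and runs them only when the action `collect()` is called.

```python
class LazyCollection:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source      # the original data
        self._steps = steps        # recorded transformations, not yet executed

    # Transformations: return a new object and compute nothing.
    def map(self, func):
        return LazyCollection(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        return LazyCollection(self._source, self._steps + (("filter", pred),))

    # Action: only now are the recorded steps executed.
    def collect(self):
        items = iter(self._source)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return list(items)


calls = []


def square(x):
    calls.append(x)                # record that the function really ran
    return x * x


squares = LazyCollection([1, 2, 3, 4]).map(square)
evens = squares.filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # nothing has been computed so far
result = evens.collect()           # the action triggers the computation
```

Until `collect()` runs, `square` is never called; this mirrors how Spark postpones work until an action needs a result.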
  • 64. Programming with RDDs • RDDs are by default recomputed each time you run an action on them. • If you want to reuse an RDD in multiple actions, you can ask Spark to persist it using persist(). • Then, Spark will store the RDD contents in memory (partitioned across the nodes in the cluster), and reuse them in future actions. • Call unpersist() once you know the RDD contents will not be used again, so that the memory can be released.
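The effect of persisting can be sketched with a toy class that counts how often its contents are rebuilt. `ToyRDD`, `expensive` and `counter` are hypothetical names used only here; real Spark caching is distributed and has several storage levels, which this sketch ignores.

```python
class ToyRDD:
    """Toy stand-in for an RDD that can cache its computed contents."""

    def __init__(self, compute):
        self._compute = compute    # function that rebuilds the contents
        self._persisted = False
        self._cache = None

    def persist(self):
        self._persisted = True     # contents are cached by the next action
        return self

    def unpersist(self):
        self._persisted = False
        self._cache = None         # release the stored contents
        return self

    def _contents(self):
        if self._persisted and self._cache is not None:
            return self._cache
        data = self._compute()
        if self._persisted:
            self._cache = data
        return data

    # Two actions
    def count(self):
        return len(self._contents())

    def total(self):
        return sum(self._contents())


counter = {"computations": 0}


def expensive():
    counter["computations"] += 1   # count how often the data is rebuilt
    return [x * x for x in range(5)]


rdd = ToyRDD(expensive)
rdd.count()
rdd.total()
without_persist = counter["computations"]   # rebuilt once per action

counter["computations"] = 0
rdd.persist()
rdd.count()
rdd.total()
with_persist = counter["computations"]      # built once, then reused
rdd.unpersist()
```

Without `persist()` the two actions rebuild the data twice; with it, the data is built once and reused.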
  • 65. Creating RDDs • Spark provides two ways of creating RDDs: • loading an external dataset,
      lines = sc.textFile( "/path/to/filename" )
    • and parallelizing a collection in your driver program.
      list1 = ["hello", "world"]
      lines = sc.parallelize( list1 )
  • 66. RDD Operations • Two types: • transformations return RDDs, • actions return a result to the driver program. • Transformations are operations on RDDs that return a new RDD. They never modify existing RDDs, because RDDs are immutable. • Transformed RDDs are computed lazily. • Spark keeps track of the set of dependencies between different RDDs, called the lineage graph. • The lineage graph is used for computing each RDD on demand, when an action is carried out.
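The bookkeeping behind a lineage graph can be sketched in plain Python. `ToyNode` is a hypothetical class for illustration only; it keeps a single parent per node, whereas a real lineage graph can have several parents (for example after a union or a join).

```python
class ToyNode:
    """Toy RDD that remembers which parent and operation produced it."""

    def __init__(self, name, parent=None, op=None):
        self.name = name
        self.parent = parent       # the node this one was derived from
        self.op = op               # the transformation applied to the parent

    def transform(self, name, op):
        return ToyNode(name, parent=self, op=op)

    def lineage(self):
        """Walk back through the parents and list the steps from the source."""
        chain, node = [], self
        while node is not None:
            chain.append(node.name if node.op is None else f"{node.op} -> {node.name}")
            node = node.parent
        return list(reversed(chain))


lines = ToyNode("lines")                        # e.g. loaded from a file
words = lines.transform("words", "flatMap")
pairs = words.transform("pairs", "map")
errors = lines.transform("errors", "filter")    # a second branch from the same source

pairs_lineage = pairs.lineage()
errors_lineage = errors.lineage()
```

Because every node knows how it was derived, any of them can be rebuilt on demand from the source, which is what allows lazy evaluation and recomputation.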
  • 67. Example of lineage graph • [Figure: example of a lineage graph]
  • 68. RDD Operations • Actions are operations that return a final value to the driver program or write data to an external storage system. • Actions force the evaluation of the transformations required for the RDD they were called on, since they need to actually produce output. • Because transformations are lazily executed, Spark will not begin to execute until it sees an action.
  • 69. Common Transformations and Actions • Element-wise transformations: • map(): takes in a function and applies it to each element. • filter(): takes in a function and returns an RDD that has only the elements that pass the filter function.
      nums = sc.parallelize( [1,2,3,4,5,6,7,8,9] )
      squares = nums.map( lambda x: x*x )
      odd_numbers = squares.filter( lambda x: (x%2)==1 )
      total = odd_numbers.reduce( lambda x,y: x+y )   # named total to avoid shadowing the built-in sum
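To check what the pipeline above should produce, the same steps can be run on a plain Python list. This is a local analogy using the built-in `map`, `filter` and `functools.reduce`, not Spark itself.

```python
from functools import reduce

nums = [1, 2, 3, 4, 5, 6, 7, 8, 9]
squares = map(lambda x: x * x, nums)                    # 1, 4, 9, ..., 81
odd_numbers = filter(lambda x: x % 2 == 1, squares)     # 1, 9, 25, 49, 81
total = reduce(lambda x, y: x + y, odd_numbers)         # 1 + 9 + 25 + 49 + 81
```

In Python 3, `map` and `filter` are themselves lazy: nothing is computed until `reduce` consumes the values, which loosely resembles the way a Spark action triggers the pending transformations.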
  • 70. Common Transformations and Actions • Element-wise transformations: • flatMap(): takes in a function that returns an iterator for each input element, and flattens all the results into a single RDD.
      lines = sc.parallelize( [ "hello world", "bye" ] )
      words = lines.flatMap( lambda line: line.split() )
      print( words.first() )
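The difference between map() and flatMap() is easy to see on a plain list. The sketch below is a local analogy that uses `itertools.chain` for the flattening step; it does not call Spark.

```python
from itertools import chain

lines = ["hello world", "bye"]

# map-like: one output element per input element, so a list of lists
mapped = [line.split() for line in lines]

# flatMap-like: the per-element results are flattened into one sequence
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

first_word = flat_mapped[0]
```

With the map-like version each line stays grouped with its own words, while the flatMap-like version yields a single sequence of words.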
  • 71. Common Transformations and Actions
  • 72. Pseudo set operations
  • 73. Pseudo set operations
  • 74. Actions (incomplete list) • collect(): returns all elements from the RDD • count(): number of elements in the RDD • countByValue(): number of times each element occurs in the RDD • take(num): returns num elements from the RDD • top(num): returns the top num elements from the RDD • takeOrdered(num)(ordering): returns num elements based on the provided ordering
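The meaning of these actions can be approximated on a local list with the standard library. The mapping in the comments is an analogy chosen for this sketch, not the Spark implementation; in particular, `take(3)` is modeled here simply as the first three elements.

```python
import heapq
from collections import Counter

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]

count = len(data)                              # like count()
count_by_value = dict(Counter(data))           # like countByValue()
first_three = data[:3]                         # like take(3)
top_three = heapq.nlargest(3, data)            # like top(3)
smallest_three = heapq.nsmallest(3, data)      # like takeOrdered(3) with the natural ordering
```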
  • 75. Actions (incomplete list) • reduce(func): combines the elements of the RDD together in parallel • fold(zero)(func): same as reduce() but with the provided zero value • aggregate(zeroValue)(seq_op)(comb_op): similar to reduce() but used to return a different type • foreach(func): applies the provided function to each element of the RDD
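The roles of the zero value, seq_op and comb_op in aggregate() can be sketched on local data. The example computes an average by carrying a (sum, count) pair, so the result type differs from the element type. The `partitions` list is a hypothetical split of the data, and the code is a plain-Python analogy rather than Spark.

```python
from functools import reduce

partitions = [[1, 2, 3], [4, 5], [6]]          # hypothetical split of the data
zero = (0, 0)                                  # (running sum, running count)


def seq_op(acc, value):
    """Fold one element into a partition-local accumulator."""
    return (acc[0] + value, acc[1] + 1)


def comb_op(a, b):
    """Merge the accumulators of two partitions."""
    return (a[0] + b[0], a[1] + b[1])


# seq_op runs inside each partition, comb_op merges the partial results
per_partition = [reduce(seq_op, part, zero) for part in partitions]
total, count = reduce(comb_op, per_partition, zero)
average = total / count
```

The zero value is used both within each partition and when merging, so it should be neutral for both functions, as (0, 0) is here.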
  • 76. Transformations on pair RDDs (incomplete list) • reduceByKey(func): combines values with the same key • groupByKey(): groups values with the same key • mapValues(func): applies a function to each value of a pair RDD without changing the key • keys(): returns an RDD of just the keys • values(): returns an RDD of just the values • sortByKey(): returns an RDD sorted by the key
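The key-based operations listed above can be mimicked on a local list of (key, value) pairs. `reduce_by_key` and `group_by_key` are hypothetical helpers written for this sketch; they reproduce the results of the corresponding operations, not their distributed execution.

```python
from collections import defaultdict

pairs = [("spark", 1), ("hadoop", 1), ("spark", 1), ("yarn", 1), ("spark", 1)]


def reduce_by_key(func, pairs):
    """Combine the values that share a key, like reduceByKey(func)."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


def group_by_key(pairs):
    """Collect the values that share a key, like groupByKey()."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


counts = reduce_by_key(lambda a, b: a + b, pairs)
groups = group_by_key(pairs)
sorted_counts = sorted(counts.items())              # like sortByKey()
doubled = {k: v * 2 for k, v in counts.items()}     # like mapValues(lambda v: v * 2)
```

Starting from pairs of the form (word, 1), `reduce_by_key` with addition yields a word count, which is the pattern used in the word-count lab exercise.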
  • 77. Lab practices • Let’s look at a few basic examples, implemented as three Python programs: 1. An estimation of π 2. Word count of the contents of a file or several files in the same directory 3. The same word count but loading the contents of each file separately
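The slides do not say which method the lab uses to estimate π. A common choice, assumed here, is Monte Carlo sampling: draw random points in the unit square and count how many fall inside the quarter circle. The sketch below runs on a single machine in plain Python; `inside_unit_circle` and `estimate_pi` are hypothetical names, and in Spark the sampling would typically be spread over the cluster with a filter followed by a count.

```python
import random


def inside_unit_circle(_):
    """Draw a random point in the unit square and test it against the quarter circle."""
    x, y = random.random(), random.random()
    return x * x + y * y < 1.0


def estimate_pi(num_samples, seed=0):
    random.seed(seed)                          # fixed seed, only for reproducibility
    hits = sum(1 for i in range(num_samples) if inside_unit_circle(i))
    return 4.0 * hits / num_samples            # area ratio: quarter circle / square = pi / 4


pi_estimate = estimate_pi(100_000)
```

The accuracy improves slowly with the number of samples, which is why this exercise is a natural candidate for parallel execution.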

Editor's Notes

  1. Before the 19th century, collected data was evidently hardly ever saved for future use, even though scientists had been collecting data for a long time. Up to the crash of 1929, financial data was stored irregularly, so, depending on the firm, the preserved data may cover only a few years. Up to the sixties, data was stored on paper, which made it impractical to analyze. Then it began to be stored, processed and analyzed using electro-mechanical devices. Nowadays all generated data is preserved.
  2. Volume: distributed filesystems are needed. Velocity: data must be processed, and databases accessed, in real time. Variety: unstructured data must be processed with NLP techniques in order to perform information retrieval. Veracity: common attacks on big data systems are sophisticated; instead of removing data, they modify it.
  3. Man shapes his tools. And his tools shape him.
  4. Infrastructure as a service (IaaS) is a standardized, highly automated offering, where compute resources, complemented by storage and networking capabilities, are owned and hosted by a service provider and offered to customers on demand. Customers are able to self-provision this infrastructure, using a Web-based graphical user interface that serves as an IT operations management console for the overall environment. API access to the infrastructure may also be offered as an option. Platform as a service (PaaS) is a category of cloud computing services that provides a platform allowing customers to develop, run, and manage Web applications without the complexity of building and maintaining the infrastructure typically associated with developing and launching an app. PaaS can be delivered in two ways: as a public cloud service from a provider, where the consumer controls software deployment and configuration settings, and the provider provides the networks, servers, storage and other services to host the consumer's application; or as software installed in private data centers or public infrastructure as a service and managed by internal IT departments. Software as a service (SaaS; pronounced /sæs/ or /sɑːs/) is a software licensing and delivery model in which software is licensed on a subscription basis and is centrally hosted. It is sometimes referred to as "on-demand software". SaaS is typically accessed by users using a thin client via a web browser.
SaaS has become a common delivery model for many business applications, including office and messaging software, payroll processing software, DBMS software, management software, CAD software, development software, gamification, virtualization, accounting, collaboration, customer relationship management (CRM), management information systems (MIS), enterprise resource planning (ERP), invoicing, human resource management (HRM), talent acquisition, content management (CM), antivirus software, and service desk management. SaaS has been incorporated into the strategy of all leading enterprise software companies. One of the biggest selling points for these companies is the potential to reduce IT support costs by outsourcing hardware and software maintenance and support to the SaaS provider.