Clickstream & Social Media Analysis 
Use cases and examples using Apache Spark 
Michael Cutler @ TUMRA – November 2014
About Me 
• Early adopter of Hadoop 
• Spoke at Hadoop World on 
machine learning 
• Twitter: @cotdp 
TUMRA 
We use Data Science and Big Data 
technology to help ecommerce 
companies understand their 
customers and increase sales. 
This Talk 
• Slides are on SlideShare 
• Code examples are on GitHub 
• Twitter: @tumra
1 Background 
2 Introducing Apache Spark 
3 Examples
1 Background
Clickstream & Social Media Analysis 
Generalised Approach 
Sources: Mobile/Tablet App, Web Site, Social Network 
Stages: Data Collection, Data Processing, Reporting & Analysis 
Data forms: Events, Files, Tables
How has this approach evolved? 
Rapidly reducing the ‘time to insight’ 
pre-Historic Hadoop 
• Proprietary & Expensive 
• Slow & Constrained 
Time to Insight 
48+ hours 
2008 - Hadoop 
• Open-source & Inexpensive 
• Flexible but complex to use 
Time to Insight 
hours 
2014 - Spark 
• Batch, Streaming & Interactive 
• Fast & Easy to use 
Time to Insight 
minutes 
Weaving a story from a string of activities 
Understanding the shopper's journey 
Timeline: Day #0, Day #7, Day #10, Day #13, Day #17 
Touchpoints: PPC long-tail keyword; Opened Email Newsletter on iPad; PPC brand keyword & signed up email; PPC brand keyword; Add To Cart; Order Placed
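One way to read "weaving a story from a string of activities" is to group raw events by person and order them in time. The sketch below is plain Python rather than Spark, and the `Event` fields, action names and sample data are all hypothetical; it only illustrates the grouping and ordering step.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    person: str  # hypothetical visitor identifier
    day: int     # days since the first recorded touchpoint
    action: str  # hypothetical action name, e.g. "ppc_click" or "order"


def build_journeys(events):
    """Group events by person and order each group by day."""
    journeys = defaultdict(list)
    for event in events:
        journeys[event.person].append(event)
    return {p: sorted(evts, key=lambda e: e.day) for p, evts in journeys.items()}


def days_to_purchase(journey):
    """Days between the first touchpoint and the first order, or None."""
    for event in journey:
        if event.action == "order":
            return event.day - journey[0].day
    return None


# Made-up events, deliberately out of order.
events = [
    Event("p1", 13, "add_to_cart"),
    Event("p1", 0, "ppc_click"),
    Event("p1", 17, "order"),
    Event("p1", 7, "email_open"),
    Event("p2", 3, "ppc_click"),
]

journeys = build_journeys(events)
print([e.action for e in journeys["p1"]])
print(days_to_purchase(journeys["p1"]))
```

A journey with no order yields `None`, which keeps unfinished journeys distinguishable from same-day purchases.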
Shopper Journey 
Understanding the shopper's journey 
Stages: Research, Consideration, Purchase 
It’s all about People & Products 
Not just boring log files! 
Activity & Interactions 
Turn low-level events like “Page Views” into something meaningful, 
e.g. <Person1234> <viewed-a> <Product:Camera> 
Bought a … 
Gauging Interest 
Measure the degree of interest a Person has in a Product, 
e.g. are 10 views of a certain Product a good or bad thing? 
Affinities 
Either inferred from other people's activities, or from Product similarity 
Properties 
Both people and products have properties, 
e.g. <Person1234> <is:gender> <Female>
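The `<Person> <verb> <Product>` notation above can be sketched as subject/predicate/object triples. The example below is a minimal illustration in plain Python; the `added-to-cart` and `bought-a` predicates and the weights are assumptions made up for this sketch, not values from the talk.

```python
from collections import Counter
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical weight per interaction type; a real system would tune or learn these.
WEIGHTS = {"viewed-a": 1.0, "added-to-cart": 3.0, "bought-a": 5.0}


def interest_scores(triples):
    """Sum interaction weights per (person, product) pair, skipping property triples."""
    scores = Counter()
    for t in triples:
        if t.predicate in WEIGHTS:
            scores[(t.subject, t.obj)] += WEIGHTS[t.predicate]
    return scores


triples = [
    Triple("Person1234", "viewed-a", "Product:Camera"),
    Triple("Person1234", "viewed-a", "Product:Camera"),
    Triple("Person1234", "added-to-cart", "Product:Camera"),
    Triple("Person1234", "is:gender", "Female"),  # a property, not an interaction
    Triple("Person5678", "viewed-a", "Product:Tripod"),
]

scores = interest_scores(triples)
print(scores[("Person1234", "Product:Camera")])
```

A raw score like this does not by itself say whether 10 views is good or bad; it would still need comparing against what is typical for that Product.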
People & Product Interactions 
Source: Snowplow Analytics 
e.g. “Michael” “bought a” “Americano” “Starbucks, Shoreditch”
That sounds like a Graph … 
Use graphs to understand user intent 
Interest Graph Visualisation 
• Collect user activity data in real-time, not just 
clicks but mouse-overs, images, video, social. 
• Algorithms identify products, categories and 
brands a particular person is interested in. 
• Cluster users into ‘neighborhoods’ to infer what to 
show to existing and future visitors. 
This visualisation illustrates just 1% of six weeks of visitor 
activity data. Blue data points are People; orange 
data points are Products.
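As a rough illustration of the graph idea, people and products can be held as a bipartite graph, with a person's "neighbourhood" taken to be everyone who shares at least one product with them. This is a deliberately simple stand-in written in plain Python; the talk does not say which clustering algorithm was used, and the names and edges below are invented.

```python
from collections import defaultdict


def build_graph(edges):
    """Return two adjacency maps for a bipartite person/product graph."""
    products_of = defaultdict(set)
    people_of = defaultdict(set)
    for person, product in edges:
        products_of[person].add(product)
        people_of[product].add(person)
    return products_of, people_of


def neighbours(person, products_of, people_of):
    """People who share at least one product with `person`."""
    found = set()
    for product in products_of[person]:
        found |= people_of[product]
    found.discard(person)
    return found


def suggestions(person, products_of, people_of):
    """Products seen by neighbours but not yet by `person`."""
    seen = products_of[person]
    candidates = set()
    for other in neighbours(person, products_of, people_of):
        candidates |= products_of[other] - seen
    return candidates


# Invented (person, product) edges.
edges = [
    ("alice", "camera"),
    ("alice", "tripod"),
    ("bob", "camera"),
    ("bob", "lens"),
    ("carol", "phone"),
]

products_of, people_of = build_graph(edges)
print(neighbours("alice", products_of, people_of))
print(suggestions("alice", products_of, people_of))
```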
2 Introducing Apache Spark
Three reasons Apache Spark is awesome! 
Apart from “no more Java Map/Reduce code!!!” 
Fast 
• In-memory Caching 
• DAG execution optimisation 
• Easy to use in Scala, Java, Python 
Smart 
• Machine Learning baked in 
• Graph algorithms 
• Interactive Shell 
Flexible 
• Query from Spark SQL 
• Streaming 
• Batch (file based)
Apache Spark 
Architecture Overview 
Apache ZooKeeper, Hadoop Filesystem (HDFS) 
YARN / Mesos (optional) 
Apache Spark 
Coexists with your existing Hadoop Infrastructure 
Hadoop Filesystem (HDFS) 
Apache ZooKeeper 
Apache Hive etc. 
Map / Reduce 
Yarn / Mesos
Apache Spark can … 
Simple example of Spark SQL used from Scala 
Source: Databricks 
Go from a SQL query… 
… to a trained machine learning 
model in three lines of code.
3 Examples
Example Architecture 
Coexists with your existing Hadoop Infrastructure 
Reporting Dashboard 
Hadoop Filesystem (HDFS), NoSQL Store (Cassandra) 
Apache ZooKeeper 
Apache Kafka 
Analytics Jobs 
Social Media Analysis 
Converting a low-level event into a meaningful high-level interaction 
• A user-interaction from the 
Facebook firehose, received as a 
real-time stream of JSON 
• Streamed into Apache Kafka, 
also stored in SequenceFiles 
• Modeled into Scala Case Class:
Example - Spark (Scala) 
Using the Spark (Scala) interface to analyze the data 
• Parse JSON 
• Extract interesting attributes 
• ‘Reduce by Key’ to sum the result 
• Print results
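The code from this slide is not in the text, so the following is only a sketch of the four steps above, written in plain Python instead of Spark (Scala). The JSON field names and sample lines are assumptions and not the real firehose schema. In Spark the counting step would typically be a `map` to `(key, 1)` pairs followed by `reduceByKey`.

```python
import json
from collections import Counter

# Hypothetical JSON lines; the field names are assumptions for illustration.
raw_lines = [
    '{"user": "u1", "type": "like", "page": "brand-a"}',
    '{"user": "u2", "type": "comment", "page": "brand-a"}',
    '{"user": "u3", "type": "like", "page": "brand-b"}',
    "not valid json",
]


def parse(line):
    """Parse one JSON line, returning None for malformed input."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None


def count_by_key(lines, key):
    """Extract `key` from each record and sum one per occurrence (a local reduce by key)."""
    totals = Counter()
    for record in map(parse, lines):
        if isinstance(record, dict) and key in record:
            totals[record[key]] += 1
    return totals


page_counts = count_by_key(raw_lines, "page")
for page, total in sorted(page_counts.items()):
    print(page, total)
```

Malformed lines are dropped rather than raising, which is a common choice for stream data where a single bad record should not stop the job.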
Example - Spark SQL 
Using the Spark SQL interface to analyze the data 
• Parse JSON 
• Extract interesting attributes, 
transform into Case Classes 
• ‘Register as table’ 
• Execute SQL, print results
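The same flow can be sketched without a cluster. Here a Python dataclass stands in for the Scala case class and an in-memory SQLite table stands in for Spark SQL's "register as table" step; neither is the Spark API, and the `Interaction` fields and sample data are hypothetical.

```python
import json
import sqlite3
from dataclasses import astuple, dataclass


@dataclass
class Interaction:  # stand-in for the Scala case class; fields are hypothetical
    user_id: str
    kind: str
    page: str


# Hypothetical JSON lines; the field names are assumptions for illustration.
raw_lines = [
    '{"user": "u1", "type": "like", "page": "brand-a"}',
    '{"user": "u2", "type": "comment", "page": "brand-a"}',
    '{"user": "u3", "type": "like", "page": "brand-b"}',
]

# Parse JSON and transform each record into a typed object.
rows = [Interaction(d["user"], d["type"], d["page"]) for d in map(json.loads, raw_lines)]

# "Register as table": load the typed rows into an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interactions (user_id TEXT, kind TEXT, page TEXT)")
conn.executemany("INSERT INTO interactions VALUES (?, ?, ?)", [astuple(r) for r in rows])

# Execute SQL and print the results.
result = conn.execute(
    "SELECT page, COUNT(*) AS total FROM interactions "
    "GROUP BY page ORDER BY total DESC, page"
).fetchall()
for page, total in result:
    print(page, total)
```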
Want to play with awesome tech and data? 
We’re hiring! team@tumra.com 
Data Engineer 
Scala, functional programming, 
Hadoop, NoSQL 
Sales & Marketing 
Experience with SaaS and ecommerce sales


Clickstream & Social Media Analysis using Apache Spark

  • 1. Clickstream & Social Media Analysis Use cases and examples using Apache Spark Michael Cutler @ TUMRA – November 2014
  • 2. Hello About Me • Early adopter of Hadoop • Spoke at Hadoop World on machine learning • Twitter: @cotdp TUMRA We use Data Science and Big Data technology to help ecommerce companies understand their customers and increase sales. This Talk • Slides are on SlideShare • Code examples are on GitHub • Twitter: @tumra
  • 3. 1 Background 2 Introducing Apache Spark 3 Examples
  • 5. Clickstream & Social Media Analysis Generalised Approach Mobile/Tablet App Data Collection Data Processing Reporting & Analysis Web Site You People Social Network Events Files Tables
  • 6. How has this approach evolved? Rapidly reducing the ‘time to insight’ pre-Historic Hadoop • Proprietary & Expensive • Slow & Constrained Time to Insight 48+ hours 2008 - Hadoop • Open-source & Inexpensive • Flexible but complex to use Time to Insight hours 2014 - Spark • Batch, Streaming & Interactive • Fast & Easy to use Time to Insight minutes
  • 7. Weaving a story from a string of activities Understanding the shopper's journey PPC long-tail keyword Day #0 Opened Email Newsletter on iPad PPC brand PPC brand keyword & signed up email keyword Add To Cart Order Placed Day #7 Day #10 Day #13 Day #17
  • 8. Shopper Journey Understanding the shopper's journey Time Shopper Consumer Research Consideration Purchase Need
  • 9. It’s all about People & Products Not just boring log files! Turn low-level events like “Page Views” into something meaningful e.g. <Person1234> <viewed-a> <Product:Camera> Bought a … Activity & Interactions Gauging Interest Measuring the degree of interest a Person has in a Product e.g. are 10 views of a certain Product a good or bad thing? Affinities Either inferred from other people's activities, or from Product similarity Properties Both people and products have properties, e.g. <Person1234> <is:gender> <Female>
  • 10. People & Product Interactions Source: Snowplow Analytics e.g. “Michael” “bought a” “Americano” “Starbucks, Shoreditch”
  • 11. That sounds like a Graph … Use graphs to understand user intent Interest Graph Visualisation • Collect user activity data in real-time, not just clicks but mouse-overs, images, video, social. • Algorithms identify products, categories and brands a particular person is interested in. • Cluster users into ‘neighborhoods’ to infer what to show to existing and future visitors. This visualization illustrates just 1% of 6 weeks visitor activity data. Blue data points are People, Orange data points are Products.
  • 13. Three reasons Apache Spark is awesome! Apart from “no more Java Map/Reduce code!!!” Fast • In-memory Caching • DAG execution optimisation • Easy to use in Scala, Java, Python Smart • Machine Learning baked in • Graph algorithms • Interactive Shell Flexible • Query from Spark SQL • Streaming • Batch (file based)
  • 14. Apache Spark Architecture Overview Apache ZooKeeper Hadoop Filesystem (HDFS) Yarn / Mesos (optional)
  • 15. Apache Spark Coexists with your existing Hadoop Infrastructure Hadoop Filesystem (HDFS) Apache ZooKeeper Apache Hive etc. Map / Reduce Yarn / Mesos
  • 16. Apache Spark can … Simple example of Spark SQL used from Scala Source: Databricks Go from a SQL query… … to a trained machine learning model in three lines of code.
  • 18. Example Architecture Coexists with your existing Hadoop Infrastructure Reporting Dashboard Hadoop Filesystem (HDFS) NoSQL Store Apache ZooKeeper (Cassandra) Apache Kafka Analytics Jobs
  • 19. Social Media Analysis Converting a low-level event into a meaningful high-level interaction • A user-interaction from the Facebook firehose, received as a real-time stream of JSON • Streamed into Apache Kafka, also stored in SequenceFiles • Modeled into Scala Case Class:
  • 20. Example - Spark (Scala) Using the Spark (Scala) interface to analyze the data • Parse JSON • Extract interesting attributes • ‘Reduce by Key’ to sum the result • Print results
  • 21. Example - Spark SQL Using the Spark SQL interface to analyze the data • Parse JSON • Extract interesting attributes, transform into Case Classes • ‘Register as table’ • Execute SQL, print results
  • 22. Want to play with awesome tech and data? We’re hiring! team@tumra.com Data Engineer Scala, functional programming, Hadoop, NoSQL Sales & Marketing Experience with SaaS and ecommerce sales