Clickstream & Social Media Analysis using Apache Spark

Clickstream & Social Media Analysis
Use cases and examples using Apache Spark
Michael Cutler @ TUMRA – November 2014

Hello
About Me
• Early adopter of Hadoop
• Spoke at Hadoop World on
machine learning
• Twitter: @cotdp
TUMRA
We use Data Science and Big Data
technology to help ecommerce
companies understand their
customers and increase sales.
This Talk
• Slide are on Slideshare
• Code example on Github
• Twitter: @tumra

1 Background
2 Introducing Apache Spark
3 Examples

Clickstream & Social Media Analysis
Generalised Approach
Mobile/Tablet App
Data
Collection
Data
Processing
Reporting &
Analysis
Web Site
You
People
Social Network
Events Files Tables

How has this approach evolved?
Rapidly reducing the ‘time to insight’
pre-Historic Hadoop
• Proprietary & Expensive
• Slow Constrained
Time to Insight
48+ hours
2008 - Hadoop
• Open-source & Inexpensive
• Flexible but complex to use
Time to Insight
hours
2014 - Spark
• Batch, Streaming & Interactive
• Fast & Easy to use
Time to Insight
minutes

Weaving a story from a string of activities
Understanding the shoppers journey
PPC long-tail
keyword
Day #0
Opened Email
Newsletter on iPad PPC brand
PPC brand keyword &
signed up email
keyword
Add To Cart
Order
Placed
Day #7 Day #10 Day #13 Day #17

Shopper Journey
Understanding the shoppers journey
Time
Shopper
Consumer
Research Consideration Purchase
Need

It’s all about People & Products
Not just boring log files!
Turn low-level events like “Page Views” into something meaningful
e.g. <Person1234> <viewed-a> <Product:Camera>
Bought a …
Activity & Interactions
Gauging Interest
Measuring the degree of interest a Person has about a Product
e.g. are 10 views for a certain Product a good or bad thing?
Affinities
Either inferred from other Peoples activities, or Product similarity
Properties
Both people and products have properties,
e.g. <Person1234> <is:gender> <Female>

People & Product Interactions
Source: Snowplow Analytics
e.g. “Michael” “bought a” “Americano” “Starbucks, Shoreditch”

That sounds like a Graph …
Use graphs to understand user intent
Interest Graph Visualisation
• Collect user activity data in real-time, not just
clicks but mouse-overs, images, video, social.
• Algorithms identify products, categories and
brands a particular person is interested in.
• Cluster users into ‘neighborhoods’ to infer what to
show to existing and future visitors.
This visualization illustrates just 1% of 6 weeks visitor
activity data. Blue data points are People, Orange
data points are Products.

Three reasons Apache Spark is awesome!
Apart from “no more Java Map/Reduce code!!!”
Fast
• In-memory Caching
• DAG execution optimisation
• Easy to use in Scala, Java, Python
Smart
• Machine Learning baked in
• Graph algorithms
• Interactive Shell
Flexible
• Query from Spark SQL
• Streaming
• Batch (file based)

Apache Spark
Architecture Overview
Apache ZooKeeper Hadoop Filesystem
(HDFS)
Yarn / Mesos
(optional)

Apache Spark
Coexists with your existing Hadoop Infrastructure
Hadoop Filesystem (HDFS)
Apache ZooKeeper
Apache Hive etc.
Map / Reduce
Yarn / Mesos

Apache Spark can …
Simple example of Spark SQL used from Scala
Source: Databricks
Go from a SQL query…
… to a trained machine learning
model in three lines of code.

Example Architecture
Coexists with your existing Hadoop Infrastructure
Reporting
Dashboard
Hadoop Filesystem (HDFS) NoSQL Store
Apache ZooKeeper
(Cassandra)
Apache Kafka
Analytics
Jobs

Social Media Analysis
Converting a low-level event into a meaningful high-level interaction
• A user-interaction from the
Facebook firehose, received as a
real-time stream of JSON
• Streamed into Apache Kafka,
also stored in SequenceFiles
• Modeled into Scala Case Class:

Example - Spark (Scala)
Using the Spark (Scala) interface to analyze the data
• Parse JSON
• Extract interesting attributes
• ‘Reduce by Key’ to sum the result
• Print results

Example - Spark SQL
Using the Spark SQL interface to analyze the data
• Parse JSON
• Extract interesting attributes,
transform into Case Classes
• ‘Register as table’
• Execute SQL, print results

Want to play with awesome tech and data?
We’re hiring! team@tumra.com
Data Engineer
Scala, functional programming,
Hadoop, NoSQL
Sales & Marketing
Experience with SaaS and ecommerce sales

Clickstream & Social Media Analysis using Apache Spark

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Clickstream & Social Media Analysis using Apache Spark

Similar to Clickstream & Social Media Analysis using Apache Spark (20)

Recently uploaded

Recently uploaded (20)

Clickstream & Social Media Analysis using Apache Spark