June 2013
BIG DATA SCIENCE: A PATH FORWARD
CONFIDENTIAL | 2
linkedin.com/in/danmallinger/
@danmallinger
www.thinkbiganalytics.com
• Data Science Lead @ Think Big
• Product/Brand Obsessive
• Teacher
• Occasional Engineer
TODAY
• High-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.
INFRASTRUCTURE, TALENT, & CAPABILITIES
• Understand our organizational needs for data science
• Infrastructure: Technological tools and platforms.
• Talent: Staff hired and trained.
• Capabilities: Data science techniques utilized.
Hadoop NoSQL Analytics SQL/MPP Real Time
Scripting MapReduce Data Exploration Basic Modeling PhD Math
Visualization Clustering Categorization Continuous Models Text Analysis
ANALYTICS TOOLS
• Boxed Solutions: Mahout & Platform
• Toolkits: RHadoop, Scikit, etc.
• You will need toolkits to solve unique problems, but smart techniques make that easier.
• Boxed solutions are limited but can be a good source of early velocity.
DATA
Presenter
• Gigabytes from Stack Overflow
• Questions from users
• With metadata
• Users have reputations
• Questions open or closed
Audience
• Follow along
• Thinking about your data
• To learn in a familiar context and plan
STEP 1: EXPLORE
select count(1) as total
     , sum(has_code)
     , avg(body_count)
     , stddev_samp(body_count)
     , corr(reputation, owner_questions)
     , histogram_numeric(body_count, 10)
from questions
;
Patterns through Hive Patterns through Tableau
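For readers without a Hive cluster at hand, the same kind of first-pass summary can be sketched in plain Python over a small sample. This is only an illustration under stated assumptions: the column names come from the query above, the three sample rows are made up, and the approximate histogram (`histogram_numeric`) is left out.

```python
import math

def summarize(questions):
    """Compute in-memory analogues of the aggregates in the Hive query."""
    n = len(questions)
    body = [q["body_count"] for q in questions]
    rep = [q["reputation"] for q in questions]
    owned = [q["owner_questions"] for q in questions]

    mean_body = sum(body) / n
    # Sample standard deviation (n - 1 denominator), as stddev_samp does.
    sd_body = math.sqrt(sum((b - mean_body) ** 2 for b in body) / (n - 1))

    # Pearson correlation between reputation and owner_questions.
    mean_rep, mean_owned = sum(rep) / n, sum(owned) / n
    cov = sum((r - mean_rep) * (o - mean_owned) for r, o in zip(rep, owned))
    var_rep = sum((r - mean_rep) ** 2 for r in rep)
    var_owned = sum((o - mean_owned) ** 2 for o in owned)
    corr = cov / math.sqrt(var_rep * var_owned)

    return {
        "total": n,
        "with_code": sum(q["has_code"] for q in questions),
        "avg_body": mean_body,
        "sd_body": sd_body,
        "corr": corr,
    }

# Made-up rows, for illustration only.
SAMPLE = [
    {"has_code": 1, "body_count": 100, "reputation": 10, "owner_questions": 1},
    {"has_code": 0, "body_count": 200, "reputation": 20, "owner_questions": 2},
    {"has_code": 1, "body_count": 300, "reputation": 30, "owner_questions": 3},
]

summary = summarize(SAMPLE)
```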
STEP 2: FEATURE BUILDING
• Summaries of unstructured data
• Time-since metrics
select transform(…)
using 'python …'
• Clustering: Browsing cohorts
/bin/mahout canopy
SQL Windowing Cross-Record Features
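A script called through Hive's `transform … using` clause reads tab-separated rows on standard input and writes tab-separated rows to standard output. The sketch below shows what a time-since feature script might look like. The three input columns (question id, post time, and account creation time, both as epoch seconds) are hypothetical, since the slide elides the actual arguments.

```python
import sys

SECONDS_PER_DAY = 86400

def transform_line(line):
    """Map 'id<TAB>post_ts<TAB>owner_created_ts' to 'id<TAB>days_since'.

    The column layout is an assumption made for this sketch.
    """
    question_id, post_ts, owner_created_ts = line.rstrip("\n").split("\t")
    days = (int(post_ts) - int(owner_created_ts)) // SECONDS_PER_DAY
    return f"{question_id}\t{days}"

def main(stream_in=sys.stdin, stream_out=sys.stdout):
    """Stream rows through transform_line, skipping blank lines."""
    for line in stream_in:
        if line.strip():
            stream_out.write(transform_line(line) + "\n")
```

To use it as a streaming script, the file would end with a call to `main()`.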
• Sample (don’t parallelize)
• Naturally parallel
• SVD
• Random Forests
• Estimators and Ensembles
• Bootstrapping
• Localizing
• Advanced Parallelization
• Linear models with SGD
• Neural networks
PARALLEL MODELS IN HADOOP
STEP 3: STRUCTURED MODEL (BAGGING)
• Single R model
• run many times
• over samples
• and aggregated
m <- C5.0(status ~ …)
Mapper 1:
Define n reducer keys
Send any record to reducer i with probability p
Reducer 1:
Key: Id of sample
Value: List of records
Perform analysis over records
Reducer 2:
Key: One
Value: List of models
Aggregate the models (e.g. average)
Bagging a Model
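The three stages above can be simulated in a single process to see how they fit together. This is a minimal sketch, not the deck's implementation: a one-variable least-squares line stands in for the C5.0 model, and the names `mapper`, `fit_line`, and `bag` are illustrative. Note that sending each record to each sample with probability p draws without replacement inside a sample, which differs from the classic with-replacement bootstrap.

```python
import random
from collections import defaultdict

def mapper(records, n_samples, p, rng):
    """Mapper 1: emit (sample_id, record); each record joins each sample with probability p."""
    for record in records:
        for sample_id in range(n_samples):
            if rng.random() < p:
                yield sample_id, record

def fit_line(points):
    """Least-squares fit y = a + b*x; a stand-in for the per-sample model."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    b = sum((x - mx) * (y - my) for x, y in points) / sxx
    return my - b * mx, b

def bag(records, n_samples=10, p=0.5, seed=0):
    rng = random.Random(seed)
    # Shuffle phase: group records by sample id.
    groups = defaultdict(list)
    for sample_id, record in mapper(records, n_samples, p, rng):
        groups[sample_id].append(record)
    # Reducer 1: fit one model per sample (skip samples too small to fit).
    models = [fit_line(pts) for pts in groups.values()
              if len({x for x, _ in pts}) >= 2]
    # Reducer 2: aggregate the models by averaging their coefficients.
    a = sum(m[0] for m in models) / len(models)
    b = sum(m[1] for m in models) / len(models)
    return a, b

# Made-up data lying exactly on y = 2x + 1.
intercept, slope = bag([(x, 2 * x + 1) for x in range(20)])
```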
WHERE ARE WE?
• We’ve created a structured model to flag questions that won’t be closed using Big Data.
• But we haven’t used unstructured data.
TEXT ANALYSIS
• Is “the big dog” really different from “dog is big”?
• How about “I like eggs but hate tofu” and “I hate eggs but like tofu”?
• Language has lexical and syntactical features
• Different techniques leverage these in different ways
• Bag of Words: Structure doesn’t matter
• n-gram: Structure matters (but not that much)
• Feature Extraction: BACON! BACON! BACON!
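The eggs-and-tofu pair makes the difference concrete. In the sketch below (the function names and the simple regex tokenizer are assumptions for illustration), the two sentences have identical bags of words, while their bigrams share only one pair.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text):
    """Token counts; word order is discarded."""
    return Counter(tokens(text))

def ngrams(text, n=2):
    """Counts of n consecutive tokens; local word order is kept."""
    toks = tokens(text)
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

a = "I like eggs but hate tofu"
b = "I hate eggs but like tofu"

same_bag = bag_of_words(a) == bag_of_words(b)    # order ignored
shared_bigrams = set(ngrams(a)) & set(ngrams(b))  # order respected
```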
STEP 4: UNSTRUCTURED MODEL
• Similar to Hadoop’s Word Count
• Create counts for token/category pairs
• Use counts to calculate Information Gain
MR Job 1:
Calculate information gain (IG) for all tokens.
MR Job 2:
Select tokens with largest IG.
Create structured data for record, tokens:
question #4 | 0 | 1 | 0 | 1 | 1
MR Job 3:
Build a classifier over the newly structured data (prior slides)
Information Gain
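A small in-memory sketch of jobs 1 and 2 may help show how token/category counts turn into information gain and then into indicator columns. It assumes each document is reduced to a set of tokens (presence or absence only). The four documents and all function names are made up for illustration.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(docs):
    """docs: list of (set_of_tokens, category). Returns {token: IG}."""
    n = len(docs)
    class_counts = Counter(cat for _, cat in docs)
    # Word-count-style tallies of (token, category) pairs.
    pair_counts = Counter((tok, cat) for toks, cat in docs for tok in toks)
    base = entropy(list(class_counts.values()))
    gains = {}
    for tok in {t for t, _ in pair_counts}:
        present = [pair_counts[(tok, c)] for c in class_counts]
        absent = [class_counts[c] - pair_counts[(tok, c)] for c in class_counts]
        n_present = sum(present)
        cond = (n_present / n) * entropy(present)
        if n - n_present:
            cond += ((n - n_present) / n) * entropy(absent)
        gains[tok] = base - cond
    return gains

def to_indicator_row(doc_tokens, selected):
    """Job 2 output: one 0/1 column per selected token."""
    return [1 if tok in doc_tokens else 0 for tok in selected]

# Made-up documents, for illustration only.
DOCS = [
    ({"homework", "please"}, "closed"),
    ({"homework"}, "closed"),
    ({"python", "error"}, "open"),
    ({"python", "please"}, "open"),
]
gains = information_gain(DOCS)
```

Here "homework" and "python" separate the classes perfectly and get the highest gain, while "please" appears equally in both classes and gets none.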
WHERE ARE WE?
• We’ve created two models: one structured, one unstructured.
• But they don’t work together.
STEP 5: ENSEMBLE MODEL
• Join many models together by using their output as input to an ensemble model.
• Best when models perform differently
• Exploit differences with nonlinearities, like interaction effects.
Ensembling
Mapper 1:
Load multiple models
Score the models per record and output
Reducer 1:
Key: Id of record
Value: List of model outputs
Join model outputs to make new records
MR Job 2:
Build a model over the output data as if it
was raw data.
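The flow above can be sketched in a few lines. Everything here is an illustrative assumption: the two base models are toy rules, the records are made up, and the second-level model is a simple lookup table over combinations of base-model outputs. The lookup table is chosen because it captures interaction effects that a plain average of the outputs would miss.

```python
from collections import Counter, defaultdict

def score_records(records, models):
    """Mapper 1 + Reducer 1: score each model per record and join by record id."""
    joined = defaultdict(dict)
    for rec_id, features in records.items():
        for name, model in models.items():
            joined[rec_id][name] = model(features)
    return joined

def fit_ensemble(joined, labels, model_names):
    """MR Job 2: learn from the joined outputs as if they were raw data.

    The learner here is a majority label per combination of outputs.
    """
    votes = defaultdict(Counter)
    for rec_id, outputs in joined.items():
        key = tuple(outputs[name] for name in model_names)
        votes[key][labels[rec_id]] += 1
    table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
    return lambda outputs: table.get(tuple(outputs[n] for n in model_names))

# Hypothetical base models and records, for illustration only.
MODELS = {
    "structured": lambda f: int(f["reputation"] > 100),
    "text": lambda f: int("homework" in f["body"]),
}
RECORDS = {
    1: {"reputation": 500, "body": "python error"},
    2: {"reputation": 500, "body": "homework please"},
    3: {"reputation": 5, "body": "python error"},
    4: {"reputation": 5, "body": "homework"},
}
LABELS = {1: "open", 2: "closed", 3: "closed", 4: "closed"}

joined = score_records(RECORDS, MODELS)
ensemble = fit_ensemble(joined, LABELS, ["structured", "text"])
```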
WHERE ARE WE?
• We’ve created two models: one structured, one unstructured
• and have ensembled them to create a single, powerful model
• and solve a practical business problem.
HOW DID WE GET HERE?
• This required simple infrastructure
• a blend of analysis and scripting skills
• an understanding of BIG data science techniques
• but not a team of PhDs or a billion dollars.
Questions?
www.thinkbiganalytics.com
@danmallinger

More Related Content

What's hot

Building a Data Science as a Service Platform in Azure with Databricks
Building a Data Science as a Service Platform in Azure with DatabricksBuilding a Data Science as a Service Platform in Azure with Databricks
Building a Data Science as a Service Platform in Azure with Databricks
Databricks
 
Building data "Py-pelines"
Building data "Py-pelines"Building data "Py-pelines"
Building data "Py-pelines"
Rob Winters
 
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahLeveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Databricks
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbench
Sheetal Pratik
 
Building Real-Time Data Pipeline for Diabetes Medication Recommender System U...
Building Real-Time Data Pipeline for Diabetes Medication Recommender System U...Building Real-Time Data Pipeline for Diabetes Medication Recommender System U...
Building Real-Time Data Pipeline for Diabetes Medication Recommender System U...
Databricks
 
Building the Artificially Intelligent Enterprise
Building the Artificially Intelligent EnterpriseBuilding the Artificially Intelligent Enterprise
Building the Artificially Intelligent Enterprise
Databricks
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Big Data Spain
 
Operationalizing analytics to scale
Operationalizing analytics to scaleOperationalizing analytics to scale
Operationalizing analytics to scale
Looker
 
Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...
Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...
Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...
Data Con LA
 
Predicting Patient Outcomes in Real-Time at HCA
Predicting Patient Outcomes in Real-Time at HCAPredicting Patient Outcomes in Real-Time at HCA
Predicting Patient Outcomes in Real-Time at HCA
Sri Ambati
 
Initiate Edinburgh 2019 - Big Data Meets AI
Initiate Edinburgh 2019 - Big Data Meets AIInitiate Edinburgh 2019 - Big Data Meets AI
Initiate Edinburgh 2019 - Big Data Meets AI
Amazon Web Services
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering Meetup
Blake Irvine
 
SplunkSummit 2015 - Real World Big Data Architecture
SplunkSummit 2015 -  Real World Big Data ArchitectureSplunkSummit 2015 -  Real World Big Data Architecture
SplunkSummit 2015 - Real World Big Data Architecture
Splunk
 
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
Albert Wong
 
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Databricks
 
Spark Summit Keynote by Seshu Adunuthula
Spark Summit Keynote by Seshu AdunuthulaSpark Summit Keynote by Seshu Adunuthula
Spark Summit Keynote by Seshu Adunuthula
Spark Summit
 
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian BharadwajH2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
Sri Ambati
 
Agile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachAgile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric Approach
SoftServe
 
Data Science At Zillow
Data Science At ZillowData Science At Zillow
Data Science At Zillow
Nicholas McClure
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016
Dataiku
 

What's hot (20)

Building a Data Science as a Service Platform in Azure with Databricks
Building a Data Science as a Service Platform in Azure with DatabricksBuilding a Data Science as a Service Platform in Azure with Databricks
Building a Data Science as a Service Platform in Azure with Databricks
 
Building data "Py-pelines"
Building data "Py-pelines"Building data "Py-pelines"
Building data "Py-pelines"
 
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahLeveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbench
 
Building Real-Time Data Pipeline for Diabetes Medication Recommender System U...
Building Real-Time Data Pipeline for Diabetes Medication Recommender System U...Building Real-Time Data Pipeline for Diabetes Medication Recommender System U...
Building Real-Time Data Pipeline for Diabetes Medication Recommender System U...
 
Building the Artificially Intelligent Enterprise
Building the Artificially Intelligent EnterpriseBuilding the Artificially Intelligent Enterprise
Building the Artificially Intelligent Enterprise
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier Dominguez
 
Operationalizing analytics to scale
Operationalizing analytics to scaleOperationalizing analytics to scale
Operationalizing analytics to scale
 
Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...
Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...
Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...
 
Predicting Patient Outcomes in Real-Time at HCA
Predicting Patient Outcomes in Real-Time at HCAPredicting Patient Outcomes in Real-Time at HCA
Predicting Patient Outcomes in Real-Time at HCA
 
Initiate Edinburgh 2019 - Big Data Meets AI
Initiate Edinburgh 2019 - Big Data Meets AIInitiate Edinburgh 2019 - Big Data Meets AI
Initiate Edinburgh 2019 - Big Data Meets AI
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering Meetup
 
SplunkSummit 2015 - Real World Big Data Architecture
SplunkSummit 2015 -  Real World Big Data ArchitectureSplunkSummit 2015 -  Real World Big Data Architecture
SplunkSummit 2015 - Real World Big Data Architecture
 
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
 
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
 
Spark Summit Keynote by Seshu Adunuthula
Spark Summit Keynote by Seshu AdunuthulaSpark Summit Keynote by Seshu Adunuthula
Spark Summit Keynote by Seshu Adunuthula
 
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian BharadwajH2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
 
Agile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachAgile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric Approach
 
Data Science At Zillow
Data Science At ZillowData Science At Zillow
Data Science At Zillow
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016
 

Viewers also liked

Data Science for Business Managers by TektosData
Data Science for Business Managers by TektosDataData Science for Business Managers by TektosData
Data Science for Business Managers by TektosData
Maurício Garcia
 
How can Data Science benefit your business?
How can Data Science benefit your business?How can Data Science benefit your business?
How can Data Science benefit your business?
Peadar Coyle
 
Business model canvas
Business model canvasBusiness model canvas
Business model canvas
Hamza Jounaidi
 
Moving From The Art To The Science
Moving From The Art To The ScienceMoving From The Art To The Science
Moving From The Art To The Science
Capgemini
 
Creating Big Data Success with the Collaboration of Business and IT
Creating Big Data Success with the Collaboration of Business and ITCreating Big Data Success with the Collaboration of Business and IT
Creating Big Data Success with the Collaboration of Business and IT
Edward Chenard
 
Data Science & Data Products at Neue Zürcher Zeitung
Data Science & Data Products at Neue Zürcher ZeitungData Science & Data Products at Neue Zürcher Zeitung
Data Science & Data Products at Neue Zürcher Zeitung
René Pfitzner
 
Data Science for Smart Manufacturing
Data Science for Smart ManufacturingData Science for Smart Manufacturing
Data Science for Smart Manufacturing
Carlo Torniai
 
From Data Science to Business Value - Analytics Applied
From Data Science to Business Value - Analytics AppliedFrom Data Science to Business Value - Analytics Applied
From Data Science to Business Value - Analytics Applied
Teradata Aster
 
Girish Sathyanarayana, Senior Data Scientist at AppLift, " Business Value Thr...
Girish Sathyanarayana, Senior Data Scientist at AppLift, " Business Value Thr...Girish Sathyanarayana, Senior Data Scientist at AppLift, " Business Value Thr...
Girish Sathyanarayana, Senior Data Scientist at AppLift, " Business Value Thr...
Dataconomy Media
 
Harvesting business Value with Data Science
Harvesting business Value with Data ScienceHarvesting business Value with Data Science
Harvesting business Value with Data Science
InfoFarm
 
Business model innovation for the digital age
Business model innovation for the digital ageBusiness model innovation for the digital age
Business model innovation for the digital age
Chanade Hemming
 
11 Principles of Applied Analytics
11 Principles of Applied Analytics11 Principles of Applied Analytics
11 Principles of Applied Analytics
Georgian
 
On Big Data Analytics - opportunities and challenges
On Big Data Analytics - opportunities and challengesOn Big Data Analytics - opportunities and challenges
On Big Data Analytics - opportunities and challenges
Petteri Alahuhta
 
Malang Digital Core - Business Model Navigator
Malang Digital Core - Business Model NavigatorMalang Digital Core - Business Model Navigator
Malang Digital Core - Business Model Navigator
Evans Winata
 
Business Model Canvas - Definition & Some examples
Business Model Canvas - Definition & Some examplesBusiness Model Canvas - Definition & Some examples
Business Model Canvas - Definition & Some examples
Federico Giovanni Rega
 
UCD Smurfit: Digital Merchants Business Model Analysis
UCD Smurfit: Digital Merchants Business Model AnalysisUCD Smurfit: Digital Merchants Business Model Analysis
UCD Smurfit: Digital Merchants Business Model Analysis
Lara Zaccaria
 
The Value of Data for Digital Business Models
The Value of Data for Digital Business ModelsThe Value of Data for Digital Business Models
The Value of Data for Digital Business Models
Boris Otto
 
Applying Data Science to Your Business Problem
Applying Data Science to Your Business ProblemApplying Data Science to Your Business Problem
Applying Data Science to Your Business Problem
CA Technologies
 
Monetization - The Right Business Model for Your Digital Assets
Monetization - The Right Business Model for Your Digital AssetsMonetization - The Right Business Model for Your Digital Assets
Monetization - The Right Business Model for Your Digital Assets
Apigee | Google Cloud
 
How to choose the right business model? by @boardofinno - @nickdemey
How to choose the right business model? by @boardofinno - @nickdemeyHow to choose the right business model? by @boardofinno - @nickdemey
How to choose the right business model? by @boardofinno - @nickdemey
Board of Innovation
 

Viewers also liked (20)

Data Science for Business Managers by TektosData
Data Science for Business Managers by TektosDataData Science for Business Managers by TektosData
Data Science for Business Managers by TektosData
 
How can Data Science benefit your business?
How can Data Science benefit your business?How can Data Science benefit your business?
How can Data Science benefit your business?
 
Business model canvas
Business model canvasBusiness model canvas
Business model canvas
 
Moving From The Art To The Science
Moving From The Art To The ScienceMoving From The Art To The Science
Moving From The Art To The Science
 
Creating Big Data Success with the Collaboration of Business and IT
Creating Big Data Success with the Collaboration of Business and ITCreating Big Data Success with the Collaboration of Business and IT
Creating Big Data Success with the Collaboration of Business and IT
 
Data Science & Data Products at Neue Zürcher Zeitung
Data Science & Data Products at Neue Zürcher ZeitungData Science & Data Products at Neue Zürcher Zeitung
Data Science & Data Products at Neue Zürcher Zeitung
 
Data Science for Smart Manufacturing
Data Science for Smart ManufacturingData Science for Smart Manufacturing
Data Science for Smart Manufacturing
 
From Data Science to Business Value - Analytics Applied
From Data Science to Business Value - Analytics AppliedFrom Data Science to Business Value - Analytics Applied
From Data Science to Business Value - Analytics Applied
 
Girish Sathyanarayana, Senior Data Scientist at AppLift, " Business Value Thr...
Girish Sathyanarayana, Senior Data Scientist at AppLift, " Business Value Thr...Girish Sathyanarayana, Senior Data Scientist at AppLift, " Business Value Thr...
Girish Sathyanarayana, Senior Data Scientist at AppLift, " Business Value Thr...
 
Harvesting business Value with Data Science
Harvesting business Value with Data ScienceHarvesting business Value with Data Science
Harvesting business Value with Data Science
 
Business model innovation for the digital age
Business model innovation for the digital ageBusiness model innovation for the digital age
Business model innovation for the digital age
 
11 Principles of Applied Analytics
11 Principles of Applied Analytics11 Principles of Applied Analytics
11 Principles of Applied Analytics
 
On Big Data Analytics - opportunities and challenges
On Big Data Analytics - opportunities and challengesOn Big Data Analytics - opportunities and challenges
On Big Data Analytics - opportunities and challenges
 
Malang Digital Core - Business Model Navigator
Malang Digital Core - Business Model NavigatorMalang Digital Core - Business Model Navigator
Malang Digital Core - Business Model Navigator
 
Business Model Canvas - Definition & Some examples
Business Model Canvas - Definition & Some examplesBusiness Model Canvas - Definition & Some examples
Business Model Canvas - Definition & Some examples
 
UCD Smurfit: Digital Merchants Business Model Analysis
UCD Smurfit: Digital Merchants Business Model AnalysisUCD Smurfit: Digital Merchants Business Model Analysis
UCD Smurfit: Digital Merchants Business Model Analysis
 
The Value of Data for Digital Business Models
The Value of Data for Digital Business ModelsThe Value of Data for Digital Business Models
The Value of Data for Digital Business Models
 
Applying Data Science to Your Business Problem
Applying Data Science to Your Business ProblemApplying Data Science to Your Business Problem
Applying Data Science to Your Business Problem
 
Monetization - The Right Business Model for Your Digital Assets
Monetization - The Right Business Model for Your Digital AssetsMonetization - The Right Business Model for Your Digital Assets
Monetization - The Right Business Model for Your Digital Assets
 
How to choose the right business model? by @boardofinno - @nickdemey
How to choose the right business model? by @boardofinno - @nickdemeyHow to choose the right business model? by @boardofinno - @nickdemey
How to choose the right business model? by @boardofinno - @nickdemey
 

Similar to Big data-science-oanyc

A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
Sanket Shikhar
 
Data science and OSS
Data science and OSSData science and OSS
Data science and OSS
Kevin Crocker
 
The Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the Masses
Alice Zheng
 
Topic modeling using big data analytics
Topic modeling using big data analytics Topic modeling using big data analytics
Topic modeling using big data analytics
Farheen Nilofer
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
Stepan Pushkarev
 
Multiplaform Solution for Graph Datasources
Multiplaform Solution for Graph DatasourcesMultiplaform Solution for Graph Datasources
Multiplaform Solution for Graph Datasources
Stratio
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
Daniel Marcous
 
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SFTed Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
MLconf
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
Rajesh Muppalla
 
Python and data analytics
Python and data analyticsPython and data analytics
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
Databricks
 
BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for Spark
DESMOND YUEN
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
Vijay Srinivas Agneeswaran, Ph.D
 
Spark
SparkSpark
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-System
inside-BigData.com
 
How to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database WorldHow to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database World
Karen Lopez
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Anyscale
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
VMware Tanzu
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
DataWorks Summit/Hadoop Summit
 
How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists
CCG
 

Similar to Big data-science-oanyc (20)

A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Data science and OSS
Data science and OSSData science and OSS
Data science and OSS
 
The Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the Masses
 
Topic modeling using big data analytics
Topic modeling using big data analytics Topic modeling using big data analytics
Topic modeling using big data analytics
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
 
Multiplaform Solution for Graph Datasources
Multiplaform Solution for Graph DatasourcesMultiplaform Solution for Graph Datasources
Multiplaform Solution for Graph Datasources
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
 
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SFTed Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Big data-science-oanyc

  • 1. June 2013 BIG DATA SCIENCE: A PATH FORWARD
  • 2. linkedin.com/in/danmallinger/ · @danmallinger · www.thinkbiganalytics.com. Data Science Lead @ Think Big; Product/Brand Obsessive; Teacher; Occasional Engineer.
  • 3. TODAY: High-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.
  • 4. INFRASTRUCTURE, TALENT, & CAPABILITIES: Understand our organizational needs for data science. Infrastructure: technological tools and platforms (Hadoop, NoSQL, Analytics, SQL/MPP, Real Time). Talent: staff hired and trained (Scripting, MapReduce, Data Exploration, Basic Modeling, PhD Math). Capabilities: data science techniques utilized (Visualization, Clustering, Categorization, Continuous Models, Text Analysis).
  • 5. ANALYTICS TOOLS. Boxed solutions: Mahout & Platform. Toolkits: RHadoop, Scikit, etc. You will need toolkits to solve unique problems, but smart techniques make that easier. Boxed solutions are limited, but can be a good source of early velocity.
  • 6. DATA. Presenter: gigabytes from Stack Overflow; questions from users, with metadata; users have reputations; questions are open or closed. Audience: follow along, thinking about your data, to learn in a familiar context and plan.
  • 7. STEP 1: EXPLORE. Patterns through Hive; patterns through Tableau. Hive query: select count(1) as total, sum(has_code), avg(body_count), stddev_samp(body_count), corr(reputation, owner_questions), histogram_numeric(body_count, 10) from questions;
  • 8. STEP 2: FEATURE BUILDING. Summaries of unstructured data and time-since metrics: select transform(…) using 'python …'. Clustering (browsing cohorts): /bin/mahout canopy. SQL windowing: cross-record features.
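The `select transform(…) using 'python …'` pattern on slide 8 hands each row to an external script. By default Hive streams rows to that script as tab-separated lines on standard input and reads tab-separated lines back. The sketch below is one possible shape for such a script, not the presenter's code: the four input columns (`question_id`, `posted_at`, `owner_joined_at`, `body`), the timestamp layout, and the `<code>` check are all assumptions made for illustration.

```python
import sys
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def build_features(line):
    """Turn one tab-separated input row into a tab-separated feature row.

    Assumed (hypothetical) input columns:
    question_id, posted_at, owner_joined_at, body.
    """
    question_id, posted_at, owner_joined_at, body = line.rstrip("\n").split("\t")
    posted = datetime.strptime(posted_at, TIME_FORMAT)
    joined = datetime.strptime(owner_joined_at, TIME_FORMAT)
    days_since_join = (posted - joined).days   # a "time-since" metric
    body_words = len(body.split())             # a summary of unstructured text
    has_code = 1 if "<code>" in body else 0    # crude code detector (assumption)
    fields = (question_id, days_since_join, body_words, has_code)
    return "\t".join(str(value) for value in fields)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from stdin and write one feature row per non-empty line."""
    for line in stdin:
        if line.strip():
            stdout.write(build_features(line) + "\n")
```

To use a script like this from a transform query, the file would end with a call to `main()`. Keeping the per-row logic in `build_features` makes it easy to test on a laptop before it runs on the cluster.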
  • 9. PARALLEL MODELS IN HADOOP. Sample (don't parallelize); Naturally parallel; SVD; Random Forests; Estimators and Ensembles; Bootstrapping; Localizing; Advanced Parallelization; Linear models with SGD; Neural networks.
  • 10. STEP 3: STRUCTURED MODEL (BAGGING). A single R model, run many times over samples and aggregated: m <- C5.0(status ~ …). Bagging a model: Mapper 1: define n reducer keys; send any record to reducer i with probability p. Reducer 1: key: id of sample; value: list of records; perform analysis over records. Reducer 2: key: one; value: list of models; aggregate the models (e.g. average).
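The mapper and reducer steps on slide 10 can be sketched in plain Python, with ordinary functions standing in for the MapReduce stages. This is a minimal illustration under stated assumptions, not the presenter's implementation. The slide fits a C5.0 tree in R, whereas this sketch swaps in a one-feature threshold rule (`fit_stump`) so that it stays self-contained. The function names and the toy `(x, label)` records are hypothetical.

```python
import random
from collections import defaultdict


def mapper(records, n_samples, p, rng):
    """Mapper 1: send each record to each of n sample keys with probability p."""
    for record in records:
        for sample_id in range(n_samples):
            if rng.random() < p:
                yield sample_id, record


def fit_stump(records):
    """Stand-in model (not C5.0): the threshold on x with the fewest errors,
    predicting label 1 whenever x > threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in records:
        errors = sum((x > threshold) != bool(y) for x, y in records)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def reducer_fit(pairs):
    """Reducer 1: group records by sample id and fit one model per sample."""
    groups = defaultdict(list)
    for sample_id, record in pairs:
        groups[sample_id].append(record)
    return [fit_stump(group) for _, group in sorted(groups.items())]


def reducer_aggregate(thresholds):
    """Reducer 2: aggregate the models, here by averaging their thresholds."""
    return sum(thresholds) / len(thresholds)


def predict(threshold, x):
    """Score a new value with the aggregated model."""
    return 1 if x > threshold else 0


# Toy data (hypothetical): label is 1 when x is at least 10.
records = [(x, 1 if x >= 10 else 0) for x in range(20)]
models = reducer_fit(mapper(records, n_samples=5, p=0.6, rng=random.Random(0)))
bagged_threshold = reducer_aggregate(models)
```

Averaging works here because every model is a single number. For tree models such as C5.0, the aggregation step would more likely average or vote over the models' predictions.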
  • 11. WHERE ARE WE? We’ve created a structured model to flag questions that won’t be closed, using Big Data. But we haven’t used unstructured data.
  • 12. TEXT ANALYSIS. Is “the big dog” really different from “dog is big”? How about “I like eggs but hate tofu” and “I hate eggs but like tofu”? Language has lexical and syntactical features, and different techniques leverage these in different ways. Bag of Words: structure doesn’t matter. n-gram: structure matters (but not that much). Feature Extraction: BACON! BACON! BACON!
  • 13. STEP 4: UNSTRUCTURED MODEL. Information Gain: similar to Hadoop’s Word Count; create counts for token/category pairs; use counts to calculate Information Gain. MR Job 1: calculate information gain (IG) for all tokens. MR Job 2: select tokens with largest IG and create structured data for record, tokens: question #4 | 0 | 1 | 0 | 1 | 1. MR Job 3: build a classifier over the newly structured data (prior slides).
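The first two jobs on slide 13 can be sketched in a single process as follows. The sketch uses the standard definition of information gain: the entropy of the category, minus the entropy that remains once you know whether a token is present. Treating each question as a set of tokens, and the function names `information_gain`, `top_tokens`, and `indicator_row`, are assumptions made for illustration. A real job would compute the token/category counts in a word-count-style MapReduce pass.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy, in bits, of a list of category counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs):
    """docs: list of (set_of_tokens, category). Returns {token: IG in bits}."""
    n = len(docs)
    category_counts = Counter(category for _, category in docs)
    pair_counts = Counter()  # (token, category) -> number of documents
    for tokens, category in docs:
        for token in set(tokens):
            pair_counts[(token, category)] += 1
    base = entropy(category_counts.values())
    gains = {}
    for token in {t for t, _ in pair_counts}:
        present = [pair_counts[(token, c)] for c in category_counts]
        absent = [category_counts[c] - pair_counts[(token, c)] for c in category_counts]
        n_present = sum(present)
        remaining = (n_present / n) * entropy(present) \
            + ((n - n_present) / n) * entropy(absent)
        gains[token] = base - remaining
    return gains


def top_tokens(gains, k):
    """Pick the k tokens with the largest IG (ties broken alphabetically)."""
    return sorted(gains, key=lambda t: (-gains[t], t))[:k]


def indicator_row(tokens, selected):
    """Structured row for one record: 1 if the selected token occurs, else 0."""
    return [1 if t in tokens else 0 for t in selected]


# Toy questions (hypothetical) labeled open or closed.
docs = [
    ({"plz", "help"}, "closed"),
    ({"plz", "urgent"}, "closed"),
    ({"python", "error"}, "open"),
    ({"python", "help"}, "open"),
]
gains = information_gain(docs)
selected = top_tokens(gains, 2)
rows = [indicator_row(tokens, selected) for tokens, _ in docs]
```

The rows of 0s and 1s play the role of the `question #4 | 0 | 1 | 0 | 1 | 1` record on the slide and can be fed to whatever structured classifier was built earlier.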
  • 14. WHERE ARE WE? We’ve created two models, one structured and one unstructured. But they don’t work together.
  • 15. STEP 5: ENSEMBLE MODEL. Join many models together by using their output as input to an ensemble model. Best when models perform differently: exploit differences with nonlinearities, like interaction effects. Ensembling: Mapper 1: load multiple models; score the models per record and output. Reducer 1: key: id of record; value: list of model outputs; join model outputs to make new records. MR Job 2: build a model over the output data as if it were raw data.
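The ensembling flow on slide 15 can be sketched as three small functions. This is an illustrative sketch under assumptions, not the presenter's code. The two base models are hypothetical stand-ins for the structured and unstructured models. The second-stage model is a simple lookup table that stores the mean label for each combination of thresholded base-model scores, chosen because it captures the interaction between the models that the slide mentions.

```python
from collections import defaultdict


def score_models(records, models):
    """Mapper 1: emit (record_id, (model_name, score)) for every model and record."""
    for record_id, features, _ in records:
        for name, model in models.items():
            yield record_id, (name, model(features))


def join_scores(pairs, model_names):
    """Reducer 1: join each record's model outputs into one new row."""
    by_record = defaultdict(dict)
    for record_id, (name, score) in pairs:
        by_record[record_id][name] = score
    return {rid: tuple(scores[n] for n in model_names)
            for rid, scores in by_record.items()}


def to_cell(scores):
    """Threshold each score at 0.5 so score combinations form discrete cells."""
    return tuple(int(s >= 0.5) for s in scores)


def fit_ensemble(rows, labels):
    """Job 2 stand-in: mean label per combination of thresholded scores."""
    totals = defaultdict(lambda: [0, 0])
    for rid, scores in rows.items():
        cell = totals[to_cell(scores)]
        cell[0] += labels[rid]
        cell[1] += 1
    return {key: label_sum / count for key, (label_sum, count) in totals.items()}


def predict(ensemble, scores, default=0.5):
    """Look up the ensemble's estimate; fall back to default for unseen cells."""
    return ensemble.get(to_cell(scores), default)


# Hypothetical base models and records: (id, features, label).
models = {
    "structured": lambda f: 1.0 if f["rep"] > 100 else 0.0,
    "text": lambda f: float(f["has_code"]),
}
records = [
    (1, {"rep": 500, "has_code": 1}, 1),
    (2, {"rep": 500, "has_code": 0}, 0),
    (3, {"rep": 10, "has_code": 1}, 0),
    (4, {"rep": 10, "has_code": 0}, 0),
    (5, {"rep": 900, "has_code": 1}, 1),
]
rows = join_scores(score_models(records, models), ["structured", "text"])
labels = {rid: label for rid, _, label in records}
ensemble = fit_ensemble(rows, labels)
```

In this toy data the label is positive only when both base models agree. Neither model finds that pattern alone, but the lookup table does.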
  • 16. WHERE ARE WE? We’ve created two models, one structured and one unstructured, and have ensembled them to create a single, powerful model and solve a practical business problem.
  • 17. HOW DID WE GET HERE? This required simple infrastructure, a blend of analysis and scripting skills, and an understanding of BIG data science techniques, but not a team of PhDs or a billion dollars.