Big Data, Disruption
and the 800 Pound
Gorilla in the Corner
Michael Stonebraker
Speakers
Michael Stonebraker
Adjunct Professor at MIT and
Tamr Co-founder
Mingo Sanchez
Sales Engineer
The Meaning of Big Data - 3 V’s
• Big Volume
— simple (SQL) analytics
— complex (non-SQL) analytics
• Big Velocity
— Drink from a fire hose
• Big Variety
— Large number of diverse data sources to integrate
Big Volume - Little Analytics
• Well addressed by data warehouse crowd
• Who are pretty good at SQL analytics on
• Hundreds of nodes
• Petabytes of data
Big Volume - Little Analytics
• Architecture
• Multi-node, partitioned, parallel column stores
• Except Oracle!!!
• No gorilla here
• But two disrupters are looming
First Disrupter – The Cloud
• Everybody will move there sooner or later
• Hamilton’s vignette
• DeWitt’s vignette
• You should be planning on moving decision support to
the cloud aggressively
• And cloud DBMSs are all moving to support elasticity
• Apply 1000 nodes or 5, depending on query requirements – pay
for what you need
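The "1000 nodes or 5" point can be made concrete with a little arithmetic. The sketch below is only an illustration: the per-node-hour rate is a hypothetical number, and it assumes perfect linear speedup, which real queries rarely achieve. Under those assumptions the bill is the same either way, and only the wall-clock time changes.

```python
def query_cost(nodes: int, single_node_hours: float, rate_per_node_hour: float):
    """Return (wall_clock_hours, total_cost), assuming perfect linear speedup."""
    wall_clock_hours = single_node_hours / nodes
    total_cost = nodes * wall_clock_hours * rate_per_node_hour
    return wall_clock_hours, total_cost


# Hypothetical workload: 1000 node-hours of work at a made-up rate of 0.5 per node-hour.
small = query_cost(nodes=5, single_node_hours=1000.0, rate_per_node_hour=0.5)
large = query_cost(nodes=1000, single_node_hours=1000.0, rate_per_node_hour=0.5)

print(small)  # (wall-clock hours, cost) on 5 nodes
print(large)  # (wall-clock hours, cost) on 1000 nodes
```

With less-than-linear speedup the larger cluster costs more per query, so the right size depends on how much the answer's latency is worth.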
First Disrupter – The Cloud
● AWS plays by different rules
• S3 has a dramatic (artificial) pricing advantage
relative to EBS
• House solutions are favored
• >50 tee shirt sizes!!!!
● Cloud architecture is a challenge!
Second Disrupter — Warehouses are
Yesterday’s Issue
• Data science will supersede business
intelligence
• As soon as enterprises can hire enough
competent data scientists
• After all, would you like a predictive model or
a big table of numbers?
Second Disrupter — Warehouses are
Yesterday’s Issue
● Data Science is machine learning
• Deep learning
• Conventional machine learning (Deep learning will not take
over the world because of training data and explainability)
● Data Science is Non-SQL data analytics (PCA, SVD, …)
● Whatever your marketing department comes up with!
Data Science is
Big Data - Big Analytics
• Complex math operations (machine learning, clustering,
trend detection, …)
• The world of the “quants” and the “rocket scientists”
• Mostly specified as linear algebra on array data
• A dozen or so common ‘inner loops’
• Matrix multiply
• QR decomposition
• SVD decomposition
• Linear regression
• Run on CPUs, GPUs and/or other stuff
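Two of the inner loops listed above, written out in plain Python as a rough sketch. Production systems would call tuned linear algebra libraries instead; the function names here are illustrative, and the regression is the one-variable least-squares case only.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows (the classic triple loop)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def linear_regression(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


product = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
slope, intercept = linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
```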
Solution Options
ML Packages
• scikit-learn, TensorFlow, …
• Code for popular algos (scikit-learn)
• Platform to code custom algos (TensorFlow)
• No data management or persistence!
Solution Options
R, SAS, MATLAB, et al.
● Weak or non-existent data management
● File system storage
● R doesn’t scale and is not a parallel system
Solution Options
RDBMS alone
● Code analytics in SQL
• Orders of magnitude too slow!!
• Tamr quickly discarded this tactic
● Coding operations as UDFs still requires you to
simulate arrays on top of tables
• And the current UDF model is not powerful enough to support
iteration
• Have to code in the UDF
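To show what "simulating arrays on top of tables" looks like in practice, here is a small sketch that stores each matrix as (row, column, value) tuples and expresses matrix multiply as a join plus GROUP BY. SQLite is used only as a convenient stand-in for an RDBMS, and the table and column names are made up for the example. It demonstrates the encoding, not the performance gap the slide describes.

```python
import sqlite3


def matmul_sql(a, b):
    """Multiply matrices a and b by storing them as (row, col, val) tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE a (i INTEGER, k INTEGER, val REAL)")
    conn.execute("CREATE TABLE b (k INTEGER, j INTEGER, val REAL)")
    conn.executemany(
        "INSERT INTO a VALUES (?, ?, ?)",
        [(i, k, v) for i, row in enumerate(a) for k, v in enumerate(row)],
    )
    conn.executemany(
        "INSERT INTO b VALUES (?, ?, ?)",
        [(k, j, v) for k, row in enumerate(b) for j, v in enumerate(row)],
    )
    # Each output cell (i, j) is the sum over k of a[i][k] * b[k][j].
    cells = conn.execute(
        "SELECT a.i, b.j, SUM(a.val * b.val) "
        "FROM a JOIN b ON a.k = b.k "
        "GROUP BY a.i, b.j ORDER BY a.i, b.j"
    ).fetchall()
    conn.close()
    result = [[0.0] * len(b[0]) for _ in a]
    for i, j, total in cells:
        result[i][j] = total
    return result


product = matmul_sql([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Every array element becomes a row, so an n-by-n multiply turns into a join over n² rows on each side, and an iterative algorithm would have to re-run such queries from outside the database.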
Array DBMSs
● SciDB, TileDB, et al.
• Storage and filtering
• With competitive built-in algos
• Traction in genomics space
Solution Options
Spark
● Spark SQL is not competitive
● Neither is Spark Streaming
● Spark is a popular parallel platform – do your own
plumbing
• Tamr uses Spark, HBase, Elasticsearch,
Postgres
Solution Options
MapReduce / Hadoop
● Not good for anything
• MapReduce discarded by Google 5 years ago
● Hadoop has been rebranded as HDFS plus a bunch of
other stuff
● Remember that the cloud vendors have more
complete offerings (S3 plus …) in this area.
Big Data — Big Analytics
● The wild west right now
• Hold onto your seat belt!
Big Velocity
• Sensor tagging everything of value sends velocity
through the roof
• E.g. car insurance
• Smartphones as a mobile platform send velocity
through the roof
• State of multiplayer internet games must be
recorded — sends velocity through the roof
Two Different Solutions
• Big pattern – little state (electronic trading)
• Find me a ‘strawberry’ followed within 100 ms by a ‘banana’
• Complex event processing (CEP) is focused on this
problem
• Patterns in a firehose
P.S. I started StreamBase but I have no current relationship with the
company
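The ‘strawberry’ followed within 100 ms by a ‘banana’ pattern can be sketched in a few lines. This is a toy matcher, not how any CEP product is implemented; the function name and event format are assumptions made for the example. Note how little state it keeps: a single timestamp.

```python
def find_followed_by(events, first, second, window_ms):
    """Return (t_first, t_second) pairs where `second` arrives within
    `window_ms` after the most recent unmatched `first`.

    `events` is an iterable of (timestamp_ms, symbol) in timestamp order.
    """
    matches = []
    last_first = None  # the only state carried between events
    for ts, symbol in events:
        if symbol == second and last_first is not None and ts - last_first <= window_ms:
            matches.append((last_first, ts))
            last_first = None  # each `first` is matched at most once
        elif symbol == first:
            last_first = ts
    return matches


stream = [
    (0, "strawberry"),
    (250, "banana"),      # too late: 250 ms after the strawberry
    (300, "strawberry"),
    (360, "banana"),      # within 100 ms of the strawberry at 300
    (400, "apple"),
]
matches = find_followed_by(stream, "strawberry", "banana", window_ms=100)
```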
Big Pattern – Little State
● CEP is pretty mature
• Offerings from TIBCO, Kafka, …
● What transaction model do you want?
• Exactly once semantics for sure
• Failure model?
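One common way to get an exactly-once effect is to accept at-least-once delivery and make the update idempotent by remembering message IDs. The sketch below illustrates that idea only; it is not a description of any particular product, and the class and field names are invented for the example.

```python
class ExactlyOnceCounter:
    """Running totals per key that ignore redelivered messages."""

    def __init__(self):
        self.totals = {}
        self.seen_ids = set()

    def apply(self, message_id, key, amount):
        """Apply a message once; return False if it was a redelivery."""
        if message_id in self.seen_ids:
            return False
        self.seen_ids.add(message_id)
        self.totals[key] = self.totals.get(key, 0) + amount
        return True


counter = ExactlyOnceCounter()
# "m1" is delivered twice, as can happen after a retry.
deliveries = [("m1", "IBM", 100), ("m2", "IBM", 50), ("m1", "IBM", 100)]
applied = [counter.apply(*delivery) for delivery in deliveries]
```

In a real system the set of seen IDs and the totals would have to be persisted atomically, which is exactly where the failure-model question comes in.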
Two Different Solutions
• Big state - little pattern
• For every security, assemble my real-time global position
• And alert me if my exposure is greater than X
• Looks like high performance OLTP
• Want to update a database at very high speed
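A minimal sketch of the position-and-alert example. The pattern is trivial (a threshold check) while the state is one record per security, updated on every trade. Defining exposure as the absolute net position times the latest trade price is an assumption made for illustration, as are the class and method names.

```python
from collections import defaultdict


class PositionTracker:
    """Keeps a net position per security and flags exposure above a limit."""

    def __init__(self, exposure_limit):
        self.exposure_limit = exposure_limit
        self.positions = defaultdict(int)  # security -> net quantity held

    def on_trade(self, security, quantity, price):
        """Update the position; return an alert string or None."""
        self.positions[security] += quantity
        # Assumed definition: |net position| * latest trade price.
        exposure = abs(self.positions[security]) * price
        if exposure > self.exposure_limit:
            return (
                f"ALERT {security}: exposure {exposure:.2f} "
                f"exceeds {self.exposure_limit:.2f}"
            )
        return None


tracker = PositionTracker(exposure_limit=10_000.0)
alerts = [
    tracker.on_trade("IBM", 50, 100.0),   # exposure 5,000: fine
    tracker.on_trade("IBM", 60, 100.0),   # exposure 11,000: alert
    tracker.on_trade("IBM", -80, 100.0),  # exposure 3,000: fine
]
```

An in-memory dictionary loses everything on a crash, which is why this workload looks like OLTP: the updates need to be durable and transactional at very high rates.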
Solution Choices for New OLTP
Old SQL
● The elephants
• Slooooow!!!!
• See “OLTP Through the Looking Glass, and
What We Found There” SIGMOD 2008
Solution Choices for New OLTP
No SQL
● Give up SQL
• Interesting to note that Cassandra
and Mongo are moving to (yup) SQL
● Give up ACID
• If you need ACID, this is a decision to
tear your hair out by doing it in user
code
• Can you guarantee you won’t need
ACID tomorrow?
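For a sense of what is lost when ACID is given up, the sketch below shows a transfer that either fully happens or fully rolls back, with the database doing the undo work. SQLite stands in for a transactional DBMS here, and the table and account names are made up. Without transactions, the rollback logic would have to be hand-written in application code.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; undo everything on overdraft."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)         # succeeds
rejected = transfer(conn, "alice", "bob", 500)  # overdraft: both updates undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```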
Solution Choices for New OLTP
New SQL
● Keep ACID and SQL, but
• Use main memory
• Different XACT implementation
● Products from Microsoft, IBM, VoltDB, MemSQL, …
● I have not seen an application that overruns CEP
or NewSQL
• No gorilla here
Big Variety – The 800 Pound Gorilla
• Scenario #1 – Data Scientists
• Scenario #2 – Integration of silos
Data Science
• Data scientist has an idea – e.g. does Rogaine cause weight
gain in mice?
• He must
• Find relevant data sets (Merck has 4000 or so Oracle databases plus a
big data lake plus files) – and the public web is a treasure chest of info
• Perform data integration on the results
• A quote from a data scientist at iRobot: “I spend 90% of my
time finding and cleaning data and then 90% of the other
10% checking the cleaning”
Enterprise Problem
• GE has 75 procurement systems
• CFO estimates that GE can save $100M per year if it
can empower each of the 75 procurement offices to
discover the terms and conditions negotiated by
his/her 74 counterparts at contract renewal time
• And demand most favored nation status
• Requires integrating 75 supplier databases!
• Enterprises also want to integrate parts, customers, lab
data,…
Data Integration Challenge
• For each local data source, have to
— Ingest the source
— Perform transformations (e.g. $ to Euros)
— Perform data cleaning (a rule of thumb is at
least 10% of your data is wrong or missing)
— Perform schema integration
— Perform deduplication (GE’s problem)
— Find “golden values” in clusters of duplicates
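Two of the steps above, transformation and golden-value selection, can be sketched on a tiny cluster of duplicate supplier records. The exchange rate is a hypothetical constant, the records are invented, and picking the most common non-missing value is just one simple policy chosen for illustration.

```python
from collections import Counter

USD_TO_EUR = 0.9  # hypothetical fixed rate, for illustration only


def to_euros(record):
    """Transformation step: normalise amounts to euros."""
    if record["currency"] == "USD":
        return {
            **record,
            "amount": round(record["amount"] * USD_TO_EUR, 2),
            "currency": "EUR",
        }
    return record


def golden_value(cluster, field):
    """Pick the most common non-missing value of `field` in a duplicate cluster."""
    values = [r[field] for r in cluster if r.get(field) is not None]
    return Counter(values).most_common(1)[0][0] if values else None


# Three records already judged to be the same supplier.
cluster = [
    {"name": "Acme Corp", "city": "Boston", "amount": 100.0, "currency": "USD"},
    {"name": "Acme Corp", "city": None, "amount": 90.0, "currency": "EUR"},
    {"name": "ACME Corporation", "city": "Boston", "amount": 90.0, "currency": "EUR"},
]
normalised = [to_euros(r) for r in cluster]
golden = {field: golden_value(normalised, field) for field in ("name", "city", "amount")}
```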
At Scale
• GE has about 10M supplier records
• Toyota Motor Europe has about 30M customer
records
• Do not even think about naïve algorithms in Python!!!!
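To see why naive algorithms are ruled out at this scale, count the comparisons an all-pairs duplicate check would need for the record counts quoted above. For contrast, the sketch also shows blocking, a standard technique that only compares records sharing a cheap key. Blocking is not named in the slides, and the three-letter key is a hypothetical choice.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n):
    """Number of distinct record pairs among n records."""
    return n * (n - 1) // 2


def blocked_pairs(records, key):
    """Candidate pairs restricted to records that share a blocking key."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[key(record)].append(idx)
    return [pair for members in blocks.values() for pair in combinations(members, 2)]


ge_pairs = all_pairs_count(10_000_000)      # about 5 * 10**13
toyota_pairs = all_pairs_count(30_000_000)  # about 4.5 * 10**14

suppliers = ["Acme Corp", "ACME Corporation", "Globex", "Globex Inc.", "Initech"]
candidates = blocked_pairs(suppliers, key=lambda name: name.lower()[:3])
print(all_pairs_count(len(suppliers)), len(candidates))
```

A key this crude would miss duplicates whose names start differently, so real systems have to choose blocking keys with care.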
Traditional Solution
• Extract, Transform and Load (ETL) packages plus
Master Data Management (MDM) tools
• Brought to you by Informatica, IBM, Talend, KNIME, …
• ETL requires too much manual programmer effort
• And is architected wrong
• MDM does not scale
• Rule system issues
MDM
● Does “match/merge” with rules
• A human can grok about 500 rules….
● A GE classification problem
• 20M transaction records
• To be classified into a pre-existing hierarchy
● GE wrote 500 rules
• Which classified 2M transactions
• What about the other 18M?
A Better Solution — Tamr
● Uses ML for schema integration, deduplication
and golden value resolution
• E.g. used the result of the 500 rules as training data for an
ML model which classified the rest of the GE
transactions
• Dedup is not N ** 2 (that would be death)
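The idea of using rule output as training data can be illustrated with a toy example. The rules, categories and transactions below are invented, and the token-voting "model" is a deliberately simple stand-in, not Tamr's actual method. Records the rules can label become the training set, and the model then labels the records no rule matched.

```python
from collections import Counter, defaultdict

# Hypothetical hand-written rules: keyword -> category.
RULES = {"laptop": "IT Hardware", "flight": "Travel"}


def apply_rules(description):
    """Return the category of the first matching rule, or None."""
    text = description.lower()
    for keyword, category in RULES.items():
        if keyword in text:
            return category
    return None


def train(labelled):
    """Count how often each token co-occurs with each category."""
    token_votes = defaultdict(Counter)
    for description, category in labelled:
        for token in description.lower().split():
            token_votes[token][category] += 1
    return token_votes


def predict(token_votes, description):
    """Sum the votes of known tokens and return the winning category."""
    votes = Counter()
    for token in description.lower().split():
        votes.update(token_votes.get(token, {}))
    return votes.most_common(1)[0][0] if votes else None


transactions = [
    "Dell laptop docking station",
    "Lenovo laptop charger",
    "Flight Boston to Munich economy",
    "Return flight economy fare",
    "Dell docking station",  # no rule matches
    "Economy fare Munich",   # no rule matches
]
labelled = [(t, apply_rules(t)) for t in transactions if apply_rules(t) is not None]
unlabelled = [t for t in transactions if apply_rules(t) is None]

model = train(labelled)
predictions = {t: predict(model, t) for t in unlabelled}
```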
The Future
• Lots of startups in this space
• Some oriented toward “data preparation”
• A few focused on enterprise data integration
• Some focused on text
• Some focused on deep learning
• The wild, wild west
• Hold onto your seat belt
Summary
• ML will be omnipresent
• Some deep learning
• Some conventional ML
• Complex analytics (data science) will replace
business intelligence
• As soon as we can train enough data scientists
• Both will get nowhere without good data
• Requires data integration at scale
• The 800 pound gorilla
Questions?
Thank you!

Classifying Shooting Incident Fatality in New York project presentationClassifying Shooting Incident Fatality in New York project presentation
Classifying Shooting Incident Fatality in New York project presentation
Boston Institute of Analytics
 

Recently uploaded (20)

Product Cluster Analysis: Unveiling Hidden Customer Preferences
Product Cluster Analysis: Unveiling Hidden Customer PreferencesProduct Cluster Analysis: Unveiling Hidden Customer Preferences
Product Cluster Analysis: Unveiling Hidden Customer Preferences
 
Fabric Engineering Deep Dive Keynote from Fabric Engineering Roadshow
Fabric Engineering Deep Dive Keynote from Fabric Engineering RoadshowFabric Engineering Deep Dive Keynote from Fabric Engineering Roadshow
Fabric Engineering Deep Dive Keynote from Fabric Engineering Roadshow
 
一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理
一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理
一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理
 
AI WITH THE HELP OF NAGALAND CAN WIN. DOWNLOAD NOW
AI WITH THE HELP OF NAGALAND CAN WIN. DOWNLOAD NOWAI WITH THE HELP OF NAGALAND CAN WIN. DOWNLOAD NOW
AI WITH THE HELP OF NAGALAND CAN WIN. DOWNLOAD NOW
 
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENTHigh Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
 
❣VIP Call Girls Chennai 💯Call Us 🔝 7737669865 🔝💃Independent Chennai Escorts S...
❣VIP Call Girls Chennai 💯Call Us 🔝 7737669865 🔝💃Independent Chennai Escorts S...❣VIP Call Girls Chennai 💯Call Us 🔝 7737669865 🔝💃Independent Chennai Escorts S...
❣VIP Call Girls Chennai 💯Call Us 🔝 7737669865 🔝💃Independent Chennai Escorts S...
 
Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...
Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...
Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...
 
Royal-Class Call Girls Thane🌹9967824496🌹369+ call girls @₹6K-18K/full night cash
Royal-Class Call Girls Thane🌹9967824496🌹369+ call girls @₹6K-18K/full night cashRoyal-Class Call Girls Thane🌹9967824496🌹369+ call girls @₹6K-18K/full night cash
Royal-Class Call Girls Thane🌹9967824496🌹369+ call girls @₹6K-18K/full night cash
 
_Lufthansa Airlines MIA Terminal (1).pdf
_Lufthansa Airlines MIA Terminal (1).pdf_Lufthansa Airlines MIA Terminal (1).pdf
_Lufthansa Airlines MIA Terminal (1).pdf
 
Hot Call Girls In Bangalore 🔥 9352988975 🔥 Real Fun With Sexual Girl Availabl...
Hot Call Girls In Bangalore 🔥 9352988975 🔥 Real Fun With Sexual Girl Availabl...Hot Call Girls In Bangalore 🔥 9352988975 🔥 Real Fun With Sexual Girl Availabl...
Hot Call Girls In Bangalore 🔥 9352988975 🔥 Real Fun With Sexual Girl Availabl...
 
9711199012⎷❤✨ Call Girls RK Puram Special Price with a special young
9711199012⎷❤✨ Call Girls RK Puram Special Price with a special young9711199012⎷❤✨ Call Girls RK Puram Special Price with a special young
9711199012⎷❤✨ Call Girls RK Puram Special Price with a special young
 
🔥Book Call Girls Lucknow 💯Call Us 🔝 6350257716 🔝💃Independent Lucknow Escorts ...
🔥Book Call Girls Lucknow 💯Call Us 🔝 6350257716 🔝💃Independent Lucknow Escorts ...🔥Book Call Girls Lucknow 💯Call Us 🔝 6350257716 🔝💃Independent Lucknow Escorts ...
🔥Book Call Girls Lucknow 💯Call Us 🔝 6350257716 🔝💃Independent Lucknow Escorts ...
 
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
Call Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call GirlCall Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call Girl
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
 
IBM watsonx.data - Seller Enablement Deck.PPTX
IBM watsonx.data - Seller Enablement Deck.PPTXIBM watsonx.data - Seller Enablement Deck.PPTX
IBM watsonx.data - Seller Enablement Deck.PPTX
 
🔥Night Call Girls Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servi...
🔥Night Call Girls Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servi...🔥Night Call Girls Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servi...
🔥Night Call Girls Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Escorts Servi...
 
🔥Mature Women / Aunty Call Girl Chennai 💯Call Us 🔝 8094342248 🔝💃Top Class Cal...
🔥Mature Women / Aunty Call Girl Chennai 💯Call Us 🔝 8094342248 🔝💃Top Class Cal...🔥Mature Women / Aunty Call Girl Chennai 💯Call Us 🔝 8094342248 🔝💃Top Class Cal...
🔥Mature Women / Aunty Call Girl Chennai 💯Call Us 🔝 8094342248 🔝💃Top Class Cal...
 
Hyderabad Call Girls 7339748667 With Free Home Delivery At Your Door
Hyderabad Call Girls 7339748667 With Free Home Delivery At Your DoorHyderabad Call Girls 7339748667 With Free Home Delivery At Your Door
Hyderabad Call Girls 7339748667 With Free Home Delivery At Your Door
 
machine learning notes by Andrew Ng and Tengyu Ma
machine learning notes by Andrew Ng and Tengyu Mamachine learning notes by Andrew Ng and Tengyu Ma
machine learning notes by Andrew Ng and Tengyu Ma
 
Call Girls Hyderabad ❤️ 7339748667 ❤️ With No Advance Payment
Call Girls Hyderabad ❤️ 7339748667 ❤️ With No Advance PaymentCall Girls Hyderabad ❤️ 7339748667 ❤️ With No Advance Payment
Call Girls Hyderabad ❤️ 7339748667 ❤️ With No Advance Payment
 
Classifying Shooting Incident Fatality in New York project presentation
Classifying Shooting Incident Fatality in New York project presentationClassifying Shooting Incident Fatality in New York project presentation
Classifying Shooting Incident Fatality in New York project presentation
 

Michael Stonebraker: Big Data, Disruption, and the 800 Pound Gorilla in the Corner

  • 1. Big Data, Disruption and the 800 Pound Gorilla in the Corner Michael Stonebraker
  • 2. Speakers: Michael Stonebraker, Adjunct Professor at MIT and Tamr Co-founder; Mingo Sanchez, Sales Engineer
  • 3. The Meaning of Big Data - 3 V’s • Big Volume: simple (SQL) analytics and complex (non-SQL) analytics • Big Velocity: drink from a fire hose • Big Variety: large number of diverse data sources to integrate
  • 4. Big Volume - Little Analytics • Well addressed by data warehouse crowd • Who are pretty good at SQL analytics on • Hundreds of nodes • Petabytes of data
  • 5. Big Volume - Little Analytics • Architecture • Multi-node, partitioned, parallel column stores • Except Oracle!!! • No gorilla here • But two disrupters are looming
  • 6. First Disrupter - The Cloud • Everybody will move there sooner or later • Hamilton’s vignette • DeWitt’s vignette • You should be planning on moving decision support to the cloud aggressively • And cloud DBMSs are all moving to support elasticity • Apply 1000 nodes or 5, depending on query requirements – pay for what you need
  • 7. First Disrupter - The Cloud ● AWS plays by different rules • S3 has a dramatic (artificial) pricing advantage relative to EBS • House solutions are favored • >50 tee shirt sizes!!!! ● Cloud architecture is a challenge!
  • 8. Second Disrupter - Warehouses are Yesterday’s Issue • Data science will supersede business intelligence • As soon as enterprises can hire enough competent data scientists • After all, would you like a predictive model or a big table of numbers?
  • 9. Second Disrupter - Warehouses are Yesterday’s Issue ● Data Science is machine learning • Deep learning • Conventional machine learning (Deep learning will not take over the world because of training data and explainability) ● Data Science is Non-SQL data analytics (PCA, SVD, …) ● Whatever your marketing department comes up with!
  • 10. Data Science is Big Data - Big Analytics • Complex math operations (machine learning, clustering, trend detection, …) • The world of the “quants” and the “rocket scientists” • Mostly specified as linear algebra on array data • A dozen or so common ‘inner loops’ • Matrix multiply • QR decomposition • SVD decomposition • Linear regression • Run on CPUs, GPUs and/or other stuff
  • 11. Solution Options: ML Packages • scikit-learn, TensorFlow, … • Code for popular algos (scikit-learn) • Platform to code custom algos (TensorFlow) • No data management or persistence!
  • 12. Solution Options: R, SAS, Matlab, et al. ● Weak or non-existent data management ● File system storage ● R doesn’t scale and is not a parallel system
  • 13. Solution Options: RDBMS alone ● Code analytics in SQL • Orders of magnitude too slow!! • Tamr quickly discarded this tactic ● Coding operations as UDFs still requires you to simulate arrays on top of tables • And current UDF model not powerful enough to support iteration • Have to code in the UDF
  • 14. Array DBMSs ● SciDB, TileDB, et al. • Storage and filtering • With competitive built-in algos • Traction in genomics space
  • 15. Solution Options: Spark ● Spark SQL is not competitive ● Neither is Spark Streaming ● Spark is a popular parallel platform – do your own plumbing • Tamr uses Spark, HBase, Elasticsearch, Postgres
  • 16. Solution Options: Map Reduce / Hadoop ● Not good for anything • Map Reduce discarded by Google 5 years ago ● Hadoop has been rebranded as HDFS plus a bunch of other stuff ● Remember that the cloud vendors have more complete offerings (S3 plus …) in this area.
  • 17. Big Data - Big Analytics ● The wild west right now • Hold onto your seat belt!
  • 18. Big Velocity • Sensor tagging everything of value sends velocity through the roof • E.g. car insurance • Smartphones as a mobile platform send velocity through the roof • State of multiplayer internet games must be recorded, which sends velocity through the roof
  • 19. Two Different Solutions • Big pattern - little state (electronic trading) • Find me a ‘strawberry’ followed within 100 ms by a ‘banana’ • Complex event processing (CEP) is focused on this problem • Patterns in a firehose • P.S. I started StreamBase but I have no current relationship with the company
  • 20. Big Pattern – Little State ● CEP is pretty mature • Offerings from Tibco, Kafka, … ● What transaction model do you want? • Exactly once semantics for sure • Failure model?
  • 21. Two Different Solutions • Big state - little pattern • For every security, assemble my real-time global position • And alert me if my exposure is greater than X • Looks like high performance OLTP • Want to update a database at very high speed
  • 22. Solution Choices for New OLTP: Old SQL ● The elephants • Slooooow!!!! • See “Through the OLTP Looking Glass and What We Found There” VLDB 2007
  • 23. Solution Choices for New OLTP: No SQL ● Give up SQL • Interesting to note that Cassandra and Mongo are moving to (yup) SQL ● Give up ACID • If you need ACID, this is a decision to tear your hair out by doing it in user code • Can you guarantee you won’t need ACID tomorrow?
  • 24. Solution Choices for New OLTP: New SQL ● Keep ACID and SQL, but • Use main memory • Different XACT implementation ● Products from Microsoft, IBM, VoltDB, MemSQL, … ● I have not seen an application that overruns CEP or NewSQL • No gorilla here
  • 25. Big Variety – The 800 Pound Gorilla • Scenario #1 – Data Scientists • Scenario #2 – Integration of silos
  • 26. Data Science • Data scientist has an idea – e.g. does Rogaine cause weight gain in mice? • He must • Find relevant data sets (Merck has 4000 or so Oracle databases plus a big data lake plus files) – and the public web is a treasure chest of info • Perform data integration on the results • A quote from a data scientist at iRobot: “I spend 90% of my time finding and cleaning data and then 90% of the other 10% checking the cleaning”
  • 27. Enterprise Problem • GE has 75 procurement systems • CFO estimates that GE can save $100M per year if it can empower each of the 75 procurement offices to discover the terms and conditions negotiated by his/her 74 counterparts at contract renewal time • And demand most favored nation status • Requires integrating 75 supplier databases! • Enterprises also want to integrate parts, customers, lab data, …
  • 28. Data Integration Challenge • For each local data source, have to • Ingest the source • Perform transformations (e.g. $ to Euros) • Perform data cleaning (a rule of thumb is at least 10% of your data is wrong or missing) • Perform schema integration • Perform deduplication (GE’s problem) • Find “golden values” in clusters of duplicates
  • 29. At Scale • GE has about 10M supplier records • Toyota Motor Europe has about 30M customer records • Do not even think about naïve algorithms in Python!!!!
  • 30. Traditional Solution • Extract, Transform and Load (ETL) packages plus Master Data Management (MDM) tools • Brought to you by Informatica, IBM, Talend, Knime, … • ETL requires too much manual programmer effort • And is architected wrong • MDM does not scale • Rule system issues
  • 31. MDM ● Does “match/merge” with rules • A human can grok about 500 rules… ● A GE classification problem • 20M transaction records • To be classified into a pre-existing hierarchy ● GE wrote 500 rules • Which classified 2M transactions • What about the other 18M?
  • 32. A Better Solution - Tamr ● Uses ML for schema integration, deduplication and golden value resolution • E.g. used result of 500 rules as training data for an ML model which classified the rest of the GE transactions • Dedup is not N ** 2 (that would be death)
  • 33. The Future • Lots of startups in this space • Some oriented toward “data preparation” • A few focused on enterprise data integration • Some focused on text • Some focused on deep learning • The wild, wild west • Hold onto your seat belt
  • 34. Summary • ML will be omnipresent • Some deep learning • Some conventional ML • Complex analytics (data science) will replace business intelligence • As soon as we can train enough data scientists • Both will get nowhere without good data • Requires data integration at scale • The 800 pound gorilla
  • 35. Questions?
  • 36. Thank you!
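
Slide 32 notes that deduplication must not be N ** 2, but the deck does not say how that is avoided. One common way to avoid comparing every pair is blocking: group records by a cheap key and compare only within each group. The sketch below is an illustration under that assumption, not a description of Tamr's method; the `blocking_key` function, its three-character rule, and the sample supplier names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(name: str) -> str:
    """Hypothetical blocking key: first three alphanumeric characters, lowercased."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum())
    return cleaned[:3]


def candidate_pairs(records):
    """Return index pairs that share a blocking key, rather than all N*(N-1)/2 pairs."""
    blocks = defaultdict(list)
    for idx, name in enumerate(records):
        blocks[blocking_key(name)].append(idx)
    pairs = []
    for members in blocks.values():
        # Only records in the same block are compared with each other.
        pairs.extend(combinations(members, 2))
    return pairs


# Made-up supplier names for illustration only.
suppliers = ["Acme Corp", "ACME Corporation", "Globex Inc", "Globex, Inc.", "Initech"]
pairs = candidate_pairs(suppliers)
all_pairs = len(suppliers) * (len(suppliers) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

A blocking key this crude will miss duplicates whose names start differently, so a real system would likely combine several keys or use a learned similarity model to score the candidate pairs.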