Feature Ask
Optimize for
Make optimization
Re-usable for others
(0.78, 0.8, 0.4, 0.3, 0.9,...)
(0.75, 0.6, 0.1, 0.7, 0.2,...)
… …
• Semantic similarity
Vector Representation
Nearest Neighbor Search
in Semantic Space
Q: {is it legal for 17
year old to buy a car}
Bag of Words Inverted Index Matching
• L1: BM25F
Posting 1
Posting 2
Posting 3
Posting 4
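A minimal sketch of bag-of-words matching over an inverted index, to make the posting-list picture concrete. It is plain term matching, not BM25F scoring, and the function names and the three toy documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of doc ids that contain it (its postings)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def match_all_terms(index, query):
    """Return doc ids whose text contains every query term (AND matching)."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Toy corpus (made up for illustration).
docs = {
    1: "legal age to buy a car",
    2: "can a minor purchase a vehicle",
    3: "buy a used car online",
}
index = build_inverted_index(docs)
hits = match_all_terms(index, "buy car")          # exact terms are present
missed = match_all_terms(index, "purchase car")   # synonym mismatch, nothing matches
```

The second query returns nothing even though document 2 is about the same topic, which is the kind of recall gap that term matching alone leaves open.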
Semantic search can help with recall issues, which account for nearly a third of relevance DSATs.
Query: {how many women voices in Switchboard telephone corpus }
Query term alteration and term matching
cannot recall the good URLs.
DL model captures full context, and builds semantic
meanings into vectors. The query vector and
document vector are near in vector space.
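The idea that query and document vectors sit close together in vector space can be sketched with a brute-force cosine-similarity search. The vectors and names below are made up for illustration; a production system would use learned embeddings and an approximate index rather than a full scan.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query_vec, doc_vecs, k=2):
    """Score every document against the query and return the k closest ids."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical 3-dimensional embeddings.
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.1],
    "doc_c": [0.8, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]
top = nearest_neighbors(query_vec, doc_vecs, k=2)
```

Here `doc_a` and `doc_c` point in nearly the same direction as the query, so they rank ahead of `doc_b` regardless of which words the underlying texts share.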
Query: Where's the nearest fruit smoothies
Location: Omaha, Nebraska
Deep Learning Platform
DLIS Pluggable Runtime
Linux Container, Native Windows
CNTK TensorFlowWin
Hardware Acceleration
Theano ...
Web Text Speech Image Enterprise
DLVS Pluggable Runtime
HNSW K-D Tree Faiss
ANN Index Build
on Multi-Tenancy
• Customizable runtime
• Privacy and Compliance Certification
• In production globally
Millions of QPS, hundreds of models, hundreds of billions of vectors, 20+ regions
Optimal distribution to match model requirements to server fleet
Windows Machine SKU-2
Linux Machine SKU-4
Windows Machine SKU-1
Linux Machine SKU-5
Windows Machine SKU-3
Windows Machine SKU-2
Windows Machine SKU-2
Multiple model instances
across multiple machines
Multiple model instances
share same machine
System in
same bed
Different runtime
Linux Machine SKU-5
SKU in
same bed
coffee in Melbourne
Document 1
Document 2
Vector Set
Similarity Search
Vector Recall by Nearest Neighbor Search
Search among
points in bucket
Hash query
to this bucket
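The "hash the query to a bucket, then search among the points in that bucket" step can be sketched with random-hyperplane hashing. This is one common locality-sensitive hashing scheme, chosen here as an assumption; the slides do not say which hash function is used, and all names and vectors are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplane normals; each one contributes one hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def bucket_key(vec, planes):
    """Hash a vector to a bucket: one bit per hyperplane (which side it falls on)."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def build_buckets(vectors, planes):
    """Group vector ids by their bucket key."""
    buckets = defaultdict(list)
    for vec_id, vec in vectors.items():
        buckets[bucket_key(vec, planes)].append(vec_id)
    return buckets


def search(query, vectors, buckets, planes):
    """Hash the query, then rank only the points in its bucket by squared distance."""
    candidates = buckets.get(bucket_key(query, planes), [])

    def squared_distance(vec_id):
        return sum((a - b) ** 2 for a, b in zip(query, vectors[vec_id]))

    return sorted(candidates, key=squared_distance)


# Hypothetical vectors for illustration.
vectors = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.1, 0.9, 0.1],
    "doc3": [0.8, 0.2, 0.1],
}
planes = make_hyperplanes(num_planes=4, dim=3)
buckets = build_buckets(vectors, planes)
results = search(vectors["doc1"], vectors, buckets, planes)
```

Only the query's own bucket is scanned, so the search is approximate: a true neighbor that lands in a different bucket is missed, which is why practical systems typically probe several buckets or use several hash tables.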
Semantic word 1
Semantic word 2 Semantic word 3
Wang, Jingdong, and Shipeng Li. "Query-driven iterated neighborhood graph search for large scale indexing." Proceedings of the 20th ACM international conference on Multimedia. ACM, 2012.
Hardware +
BrainWave / FPGA
RNN Serving Performance Challenges
Language Modeling
Machine Translation
Machine Reading
Conversation Bot
Limited Parallelism
Limited Bandwidth
• Small batch size
• Sequential dependency
• Vector-matrix multiplication
• Low data reuse
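A short worked calculation shows why small batches and vector-matrix multiplication lead to low data reuse. It counts only weight traffic (activations are ignored) and assumes FP32 weights; the matrix size is an arbitrary example.

```python
def arithmetic_intensity(rows, cols, batch_size, bytes_per_weight=4):
    """FLOPs performed per byte of weights read for a (rows x cols) matrix multiply."""
    flops = 2 * rows * cols * batch_size        # one multiply and one add per weight per input
    weight_bytes = rows * cols * bytes_per_weight
    return flops / weight_bytes


single = arithmetic_intensity(1024, 1024, batch_size=1)    # vector-matrix multiply
batched = arithmetic_intensity(1024, 1024, batch_size=64)  # matrix-matrix multiply
```

With a batch size of 1, each weight is used for a single multiply-add after being loaded, giving 0.5 FLOPs per byte, so the computation tends to be limited by memory bandwidth rather than by arithmetic throughput. At a batch size of 64 the same weights are reused 64 times.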
Xt-1 Xt Xt+1
Ot-1 Ot Ot+1
St-1 St
1. Matrix computation:
2. Activation function
3. Operation Fusing
4. Affinity
5. Locality
6. Parallelism
7. Task scheduling
Collaborating with Yuxiong He, Minjia Zhang, Samyam Rajbhandari, Wenhan Wang,
Microsoft AI and Research.
$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$
$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$
$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tanh(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h)$
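The GRU update above can be written out as a single time step in plain Python. This is a readability sketch, not an optimized kernel; the parameter dictionary layout and helper names are assumptions made for the example.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, p):
    """One GRU step: update gate z, reset gate r, then the new hidden state."""
    z = [sigmoid(a + b + c)
         for a, b, c in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev), p["bz"])]
    r = [sigmoid(a + b + c)
         for a, b, c in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev), p["br"])]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]  # r_t ∘ h_{t-1}
    cand = [math.tanh(a + b + c)
            for a, b, c in zip(matvec(p["Wh"], x), matvec(p["Uh"], gated), p["bh"])]
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h_prev, cand)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# All-zero parameters (illustration only): z = 0.5 and the candidate is 0,
# so the step simply halves the previous hidden state.
hidden_dim, input_dim = 2, 3
params = {
    "Wz": zeros(hidden_dim, input_dim), "Uz": zeros(hidden_dim, hidden_dim), "bz": [0.0] * hidden_dim,
    "Wr": zeros(hidden_dim, input_dim), "Ur": zeros(hidden_dim, hidden_dim), "br": [0.0] * hidden_dim,
    "Wh": zeros(hidden_dim, input_dim), "Uh": zeros(hidden_dim, hidden_dim), "bh": [0.0] * hidden_dim,
}
h_next = gru_step([1.0, 2.0, 3.0], [1.0, -2.0], params)
```

Each step depends on `h_prev` from the step before it, which is the sequential dependency that limits parallelism across time steps.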
On a machine with 12 cores…
a) 1 core per operation, multiplications done in parallel
1 1 1 1 1
b) 12 cores per operation, multiplications done sequentially
12 12 12 12 12
many idle cores
unbalanced load
poor speedup of
intra-op parallelism
c) 4 cores per operation
4 4 4 4 4
d) an optimized configuration, reducing latency
2 2 3 3 2
1 1 1 2 2
1 2
Bad scheduling order
✓ Workload size
✓ Parallelism efficiency
✓ Critical path
✓ Load balancing
Cache-Aware Partitioning
DL Scenarios | Original Latency | | Optimized Latency | |
Turing Prototype 2 | ~100ms | 10ms | 9ms | >10X | >10X
Turing Prototype 3 | ~107ms | 10ms | 4.1ms | >20X | >50X
Deep Query Document | 10~12ms for [query, 1 doc] x 33 docs | | 1.5ms for [query, 1 doc]; <6ms for [query, 33 docs] | >6X | >30X
Malta Click Features | 10ms for [query, 1 passage] x 150 passages | | <1ms for [query, 1 passage]; <5ms for [query, 150 passages] | >10X | >100X
Ads seq2seq model for query rewriting | 51ms | 5ms | 4ms | >10X | >3X
AGI Encoder V2 | ~29ms | 10ms | 5.4ms | 5X | 5X
RNet (InfoBot + Bing) | ~45ms for 1 [query, | | 4.0ms for 1 [query, <8.5ms for 20 [query, | 11X | >100X
Bing query tagging | 9~16ms on CNTK | 3ms | 0.95ms | 10X | >10X
WideDeepRight Model (TP3 L1) | ~25ms for [query, 1 title url] | 7ms for a batch size of | 5.4ms for [query, 33 title url]; | 10X | >100X
TP3 L2 Classifier | 60ms | 3ms | 3ms | 20X | 20X
TP3 L1 | 8ms | 3ms | 1ms | 8X | 8X
Original TensorFlow model
TensorFlow model with DeepCPU operator
Pretrained DNN Model
in TF/CNTK/ONNX, etc.
Scalable DNN Hardware
Soft DPU
Instr Decoder
& Control
Neural FU
Network switches
Production Bing DNN Model 1
 | CPU only | Brainwave accelerated | Improvement
Model Details | GRU 128X200 (X2) + W2Vec | LSTM 500X200 (x8) + W2Vec | Brainwave accelerated model is > 10X larger and > 10X lower latency
End-to-End latency per Batch 1 request at 95% | 9ms | 0.85ms |
Production Bing DNN Model 2
 | CPU only | Brainwave accelerated | Improvement
Model Details | 1D CNN + W2Vec (RNNs | 1D CNN + W2Vec + GRU 500x500 (x4) | Brainwave accelerated model is > 10X larger and 3X lower latency
End-to-End latency per Batch 1 request at 95% | 15ms | 5ms |
Layer GEMM
𝑥 𝑡S
Recurrent GEMM
𝑈 𝑜
ℎ 𝑡−1H
S = synthetic_dim
H = hidden_dim
N = batch_size
G = num_gates
Shared Memory
ℎ 𝑡−1
GRU P4 - FP32, batch_size = 1
100 | 1 | 3∗100∗100∗4 ≅ 46% | (100+3∗100)∗4 ≅ 2%
20 | 1 | 3∗20∗20∗4 ≅ 2% | (20+3∗20)∗4 ≪ 1%
*Can add more work in this instance
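The byte counts in the expressions above can be reproduced with a few lines of arithmetic. The slide does not say what the percentages are relative to; the two capacities below (a 256 KiB register file and 96 KiB of shared memory per streaming multiprocessor) are assumptions, picked because they reproduce the slide's rounded figures.

```python
# Assumed per-SM capacities; not stated on the slide.
REGISTER_FILE_BYTES = 256 * 1024
SHARED_MEMORY_BYTES = 96 * 1024

BYTES_PER_FP32 = 4
NUM_GATES = 3  # G = num_gates for a GRU


def weight_bytes(hidden_dim):
    """Bytes for the recurrent weights: G * H * H values in FP32."""
    return NUM_GATES * hidden_dim * hidden_dim * BYTES_PER_FP32


def vector_bytes(hidden_dim):
    """Bytes for (H + G * H) FP32 vector elements."""
    return (hidden_dim + NUM_GATES * hidden_dim) * BYTES_PER_FP32


def percent(used, capacity):
    return 100.0 * used / capacity


for hidden_dim in (100, 20):
    w = percent(weight_bytes(hidden_dim), REGISTER_FILE_BYTES)
    v = percent(vector_bytes(hidden_dim), SHARED_MEMORY_BYTES)
    print(f"H={hidden_dim}: weights {w:.1f}%, vectors {v:.2f}%")
```

Under these assumptions a hidden size of 100 uses a little under half of the assumed register capacity, while a hidden size of 20 leaves almost all of it idle, consistent with the note that more work can be added in that case.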
Significant gain from deep learning
in search, speech, vision and
machine reading comprehension.
Large scale and low latency
inference and vector search service
in production
Heterogeneous hardware and
pluggable framework support
Deep Learning Inference at speed and scale

More Related Content

What's hot

Heuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient searchHeuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient search
Greg Makowski
How Artificial Intelligence & Machine Learning Are Transforming Modern Marketing
How Artificial Intelligence & Machine Learning Are Transforming Modern MarketingHow Artificial Intelligence & Machine Learning Are Transforming Modern Marketing
How Artificial Intelligence & Machine Learning Are Transforming Modern Marketing
Deep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate HelpdeskDeep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Saurabh Saxena
Rasa NLU and ML Interpretability
Rasa NLU and ML InterpretabilityRasa NLU and ML Interpretability
Rasa NLU and ML Interpretability
Statistical Models for Massive Web Data
Statistical Models for Massive Web DataStatistical Models for Massive Web Data
Statistical Models for Massive Web Data
Deepak Agarwal
How Artificial Intelligence & Machine Learning Are Transforming Modern Market...
How Artificial Intelligence & Machine Learning Are Transforming Modern Market...How Artificial Intelligence & Machine Learning Are Transforming Modern Market...
How Artificial Intelligence & Machine Learning Are Transforming Modern Market...
DigiMarCon - Digital Marketing, Media and Advertising Conferences & Exhibitions
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learning
Tamir Taha
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Greg Makowski
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
Dony Riyanto
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural Networks
10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems
Xavier Amatriain
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...
Alok Singh
Deep Learning Models for Question Answering
Deep Learning Models for Question AnsweringDeep Learning Models for Question Answering
Deep Learning Models for Question Answering
Sujit Pal
Barga ACM DEBS 2013 Keynote
Barga ACM DEBS 2013 KeynoteBarga ACM DEBS 2013 Keynote
Barga ACM DEBS 2013 Keynote
Roger Barga
Towards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTowards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning Benchmark
Turi, Inc.
Creating AnswerBot with Keras and TensorFlow (TensorBeat)
Creating AnswerBot with Keras and TensorFlow (TensorBeat)Creating AnswerBot with Keras and TensorFlow (TensorBeat)
Creating AnswerBot with Keras and TensorFlow (TensorBeat)
Avkash Chauhan
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackConformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Bhaskar Mitra
Feature Engineering
Feature Engineering Feature Engineering
Feature Engineering
Dato Keynote
Dato KeynoteDato Keynote
Dato Keynote
Turi, Inc.
Learning deep structured semantic models for web search
Learning deep structured semantic models for web searchLearning deep structured semantic models for web search
Learning deep structured semantic models for web search
hyunsung lee

What's hot (20)

Heuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient searchHeuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient search
How Artificial Intelligence & Machine Learning Are Transforming Modern Marketing
How Artificial Intelligence & Machine Learning Are Transforming Modern MarketingHow Artificial Intelligence & Machine Learning Are Transforming Modern Marketing
How Artificial Intelligence & Machine Learning Are Transforming Modern Marketing
Deep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate HelpdeskDeep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Rasa NLU and ML Interpretability
Rasa NLU and ML InterpretabilityRasa NLU and ML Interpretability
Rasa NLU and ML Interpretability
Statistical Models for Massive Web Data
Statistical Models for Massive Web DataStatistical Models for Massive Web Data
Statistical Models for Massive Web Data
How Artificial Intelligence & Machine Learning Are Transforming Modern Market...
How Artificial Intelligence & Machine Learning Are Transforming Modern Market...How Artificial Intelligence & Machine Learning Are Transforming Modern Market...
How Artificial Intelligence & Machine Learning Are Transforming Modern Market...
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learning
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural Networks
10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...
Deep Learning Models for Question Answering
Deep Learning Models for Question AnsweringDeep Learning Models for Question Answering
Deep Learning Models for Question Answering
Barga ACM DEBS 2013 Keynote
Barga ACM DEBS 2013 KeynoteBarga ACM DEBS 2013 Keynote
Barga ACM DEBS 2013 Keynote
Towards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTowards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning Benchmark
Creating AnswerBot with Keras and TensorFlow (TensorBeat)
Creating AnswerBot with Keras and TensorFlow (TensorBeat)Creating AnswerBot with Keras and TensorFlow (TensorBeat)
Creating AnswerBot with Keras and TensorFlow (TensorBeat)
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackConformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Feature Engineering
Feature Engineering Feature Engineering
Feature Engineering
Dato Keynote
Dato KeynoteDato Keynote
Dato Keynote
Learning deep structured semantic models for web search
Learning deep structured semantic models for web searchLearning deep structured semantic models for web search
Learning deep structured semantic models for web search

Similar to Deep Learning Inference at speed and scale

Javantura v4 - Java and lambdas and streams - are they better than for loops ...
Javantura v4 - Java and lambdas and streams - are they better than for loops ...Javantura v4 - Java and lambdas and streams - are they better than for loops ...
Javantura v4 - Java and lambdas and streams - are they better than for loops ...
HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and Distributed
Turi, Inc.
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Nitish Upreti
Memory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challengesMemory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challenges
mustafa sarac
Presto at Tivo, Boston Hadoop Meetup
Presto at Tivo, Boston Hadoop MeetupPresto at Tivo, Boston Hadoop Meetup
Presto at Tivo, Boston Hadoop Meetup
Justin Borgman
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
Jags Ramnarayan
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
AI and Deep Learning
AI and Deep Learning AI and Deep Learning
AI and Deep Learning
Subrat Panda, PhD
Predicting Optimal Parallelism for Data Analytics
Predicting Optimal Parallelism for Data AnalyticsPredicting Optimal Parallelism for Data Analytics
Predicting Optimal Parallelism for Data Analytics
Keynote at IWLS 2017
Keynote at IWLS 2017Keynote at IWLS 2017
Keynote at IWLS 2017
Manish Pandey
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
ivan provalov
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Spark Summit
Multidimensional Interfaces for Selecting Data with Order
Multidimensional Interfaces for Selecting Data with OrderMultidimensional Interfaces for Selecting Data with Order
Multidimensional Interfaces for Selecting Data with Order
Ruben Taelman
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Spark Summit
Two methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersTwo methods for optimising cognitive model parameters
Two methods for optimising cognitive model parameters
University of Huddersfield
ESWC2015 - Query Optimization for Clients of Linked Data Fragments
ESWC2015 - Query Optimization for Clients of Linked Data FragmentsESWC2015 - Query Optimization for Clients of Linked Data Fragments
ESWC2015 - Query Optimization for Clients of Linked Data Fragments
Joachim Van Herwegen
Deep Dive on Deep Learning (June 2018)
Deep Dive on Deep Learning (June 2018)Deep Dive on Deep Learning (June 2018)
Deep Dive on Deep Learning (June 2018)
Julien SIMON

Similar to Deep Learning Inference at speed and scale (20)

Javantura v4 - Java and lambdas and streams - are they better than for loops ...
Javantura v4 - Java and lambdas and streams - are they better than for loops ...Javantura v4 - Java and lambdas and streams - are they better than for loops ...
Javantura v4 - Java and lambdas and streams - are they better than for loops ...
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and Distributed
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Memory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challengesMemory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challenges
Presto at Tivo, Boston Hadoop Meetup
Presto at Tivo, Boston Hadoop MeetupPresto at Tivo, Boston Hadoop Meetup
Presto at Tivo, Boston Hadoop Meetup
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
AI and Deep Learning
AI and Deep Learning AI and Deep Learning
AI and Deep Learning
Predicting Optimal Parallelism for Data Analytics
Predicting Optimal Parallelism for Data AnalyticsPredicting Optimal Parallelism for Data Analytics
Predicting Optimal Parallelism for Data Analytics
Keynote at IWLS 2017
Keynote at IWLS 2017Keynote at IWLS 2017
Keynote at IWLS 2017
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Multidimensional Interfaces for Selecting Data with Order
Multidimensional Interfaces for Selecting Data with OrderMultidimensional Interfaces for Selecting Data with Order
Multidimensional Interfaces for Selecting Data with Order
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Two methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersTwo methods for optimising cognitive model parameters
Two methods for optimising cognitive model parameters
ESWC2015 - Query Optimization for Clients of Linked Data Fragments
ESWC2015 - Query Optimization for Clients of Linked Data FragmentsESWC2015 - Query Optimization for Clients of Linked Data Fragments
ESWC2015 - Query Optimization for Clients of Linked Data Fragments
Deep Dive on Deep Learning (June 2018)
Deep Dive on Deep Learning (June 2018)Deep Dive on Deep Learning (June 2018)
Deep Dive on Deep Learning (June 2018)

More from Bill Liu

Walk Through a Real World ML Production Project
Walk Through a Real World ML Production ProjectWalk Through a Real World ML Production Project
Walk Through a Real World ML Production Project
Bill Liu
Redefining MLOps with Model Deployment, Management and Observability in Produ...
Redefining MLOps with Model Deployment, Management and Observability in Produ...Redefining MLOps with Model Deployment, Management and Observability in Produ...
Redefining MLOps with Model Deployment, Management and Observability in Produ...
Bill Liu
Productizing Machine Learning at the Edge
Productizing Machine Learning at the EdgeProductizing Machine Learning at the Edge
Productizing Machine Learning at the Edge
Bill Liu
Transformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to HeroTransformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to Hero
Bill Liu
Deep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps WorkflowsDeep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps Workflows
Bill Liu
Metaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at NetflixMetaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at Netflix
Bill Liu
Practical Crowdsourcing for ML at Scale
Practical Crowdsourcing for ML at ScalePractical Crowdsourcing for ML at Scale
Practical Crowdsourcing for ML at Scale
Bill Liu
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
Bill Liu
Deep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsDeep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its Applications
Bill Liu
Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19
Bill Liu
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
Highly-scalable Reinforcement Learning RLlib for Real-world ApplicationsHighly-scalable Reinforcement Learning RLlib for Real-world Applications
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
Bill Liu
Build computer vision models to perform object detection and classification w...
Build computer vision models to perform object detection and classification w...Build computer vision models to perform object detection and classification w...
Build computer vision models to perform object detection and classification w...
Bill Liu
Causal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine LearningCausal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine Learning
Bill Liu
Weekly #106: Deep Learning on Mobile
Weekly #106: Deep Learning on MobileWeekly #106: Deep Learning on Mobile
Weekly #106: Deep Learning on Mobile
Bill Liu
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Weekly #105: AutoViz and Auto_ViML Visualization and Machine LearningWeekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Bill Liu
AISF19 - On Blending Machine Learning with Microeconomics
AISF19 - On Blending Machine Learning with MicroeconomicsAISF19 - On Blending Machine Learning with Microeconomics
AISF19 - On Blending Machine Learning with Microeconomics
Bill Liu
AISF19 - Travel in the AI-First World
AISF19 - Travel in the AI-First WorldAISF19 - Travel in the AI-First World
AISF19 - Travel in the AI-First World
Bill Liu
AISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the EdgeAISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the Edge
Bill Liu
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
Bill Liu
Toronto meetup 20190917
Toronto meetup 20190917Toronto meetup 20190917
Toronto meetup 20190917
Bill Liu

More from Bill Liu (20)

Walk Through a Real World ML Production Project
Walk Through a Real World ML Production ProjectWalk Through a Real World ML Production Project
Walk Through a Real World ML Production Project
Redefining MLOps with Model Deployment, Management and Observability in Produ...
Redefining MLOps with Model Deployment, Management and Observability in Produ...Redefining MLOps with Model Deployment, Management and Observability in Produ...
Redefining MLOps with Model Deployment, Management and Observability in Produ...
Productizing Machine Learning at the Edge
Productizing Machine Learning at the EdgeProductizing Machine Learning at the Edge
Productizing Machine Learning at the Edge
Transformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to HeroTransformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to Hero
Deep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps WorkflowsDeep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps Workflows
Metaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at NetflixMetaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at Netflix
Practical Crowdsourcing for ML at Scale
Practical Crowdsourcing for ML at ScalePractical Crowdsourcing for ML at Scale
Practical Crowdsourcing for ML at Scale
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
Deep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsDeep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its Applications
Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
Highly-scalable Reinforcement Learning RLlib for Real-world ApplicationsHighly-scalable Reinforcement Learning RLlib for Real-world Applications
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
Build computer vision models to perform object detection and classification w...
Build computer vision models to perform object detection and classification w...Build computer vision models to perform object detection and classification w...
Build computer vision models to perform object detection and classification w...
Causal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine LearningCausal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine Learning
Weekly #106: Deep Learning on Mobile
Weekly #106: Deep Learning on MobileWeekly #106: Deep Learning on Mobile
Weekly #106: Deep Learning on Mobile
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Weekly #105: AutoViz and Auto_ViML Visualization and Machine LearningWeekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
AISF19 - On Blending Machine Learning with Microeconomics
AISF19 - On Blending Machine Learning with MicroeconomicsAISF19 - On Blending Machine Learning with Microeconomics
AISF19 - On Blending Machine Learning with Microeconomics
AISF19 - Travel in the AI-First World
AISF19 - Travel in the AI-First WorldAISF19 - Travel in the AI-First World
AISF19 - Travel in the AI-First World
AISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the EdgeAISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the Edge
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
Toronto meetup 20190917
Toronto meetup 20190917Toronto meetup 20190917
Toronto meetup 20190917

Deep Learning Inference at speed and scale

  • 1.
  • 4.
  • 5. Applications: Semantic similarity. Example query: {is it legal for 17 year old to buy a car}. Bag of words path: inverted index matching over postings (Posting 1, Posting 2, Posting 3, Posting 4, …) with terms such as car, legal, buy, own combined by AND/OR, followed by L1 BM25F ranking and L2/L3/L4 re-ranking. Vector representation path: vectors such as (0.78, 0.8, 0.4, 0.3, 0.9,...) and (0.75, 0.6, 0.1, 0.7, 0.2,...), matched by nearest neighbor search in semantic space.
  • 6. Applications: Semantic search can help with recall issues, which account for nearly a third of relevance DSATs. Query: {how many women voices in Switchboard telephone corpus }. Query term alteration and term matching cannot recall the good URLs. A DL model captures the full context and builds semantic meaning into vectors, so the query vector and the document vector are near each other in vector space.
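The idea that a query vector and a relevant document vector sit close together can be illustrated with a brute-force cosine similarity search. This is only a minimal sketch: the three-dimensional embeddings are made-up toy values, and `cosine_top_k` is a hypothetical helper, not part of the system described in the slides.

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=2):
    """Return (indices, scores) of the k documents most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity of each document to the query
    order = np.argsort(-scores)[:k]     # highest similarity first
    return order.tolist(), scores[order].tolist()

# Toy embeddings (assumed values, for illustration only).
doc_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.8, 0.2, 0.1],
])
query_vec = np.array([1.0, 0.0, 0.0])

top_ids, top_scores = cosine_top_k(query_vec, doc_vecs, k=2)
print(top_ids, top_scores)
```

A brute-force scan like this is linear in the number of documents, which is why a production system would rely on an approximate nearest neighbor index instead.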
  • 8. Applications: Query: Where's the nearest fruit smoothies. Location: Omaha, Nebraska.
  • 9. Framework: Deep Learning Platform. Workloads: Web, Text, Speech, Image, Enterprise. Components: FrontDoor, DLIS, DLVS, on multi-tenancy. DLIS pluggable runtime: Linux Container, Native Windows; Microsoft CNTK, TensorFlow (Windows), DeepCPU, TensorFlow (Linux), Caffe, Theano, ...; hardware acceleration: CPU, GPU, FPGA. DLVS pluggable runtime: HNSW, K-D Tree, Faiss; ANN index build. Tooling: self-serve portal, model development toolkit, model repository. • Customizable runtime • Privacy and Compliance Certification • In production globally
  • 10. Framework: millions of QPS, hundreds of models, hundreds of billions of vectors, 20+ regions.
  • 11. Framework: Optimal distribution to match model requirements to server fleet. [Diagram: instances of Model1 through Model8 placed on Windows machines (SKU-1, SKU-2, SKU-3) and Linux machines (SKU-4, SKU-5), with runtimes CNTK, TensorFlow Windows, DeepCPU and TensorFlow Linux.] Callouts: multiple model instances across multiple machines; multiple model instances share the same machine; different operating systems in the same bed; different runtimes; different machine SKUs in the same bed.
  • 13. Framework: Vector Recall by Nearest Neighbor Search. Hash the query to a bucket, then search among the points in that bucket. Index structures shown: NNG, HNSW, KD-tree, TP-tree. [Diagram labels: Semantic word 1, Semantic word 2, Semantic word 3.] Reference: Wang, Jingdong, and Shipeng Li. "Query-driven iterated neighborhood graph search for large scale indexing." Proceedings of the 20th ACM international conference on Multimedia. ACM, 2012.
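The "hash query to this bucket, search among points in bucket" step can be sketched with random-hyperplane locality-sensitive hashing. This is an illustrative stand-in chosen for brevity; the slide names NNG, HNSW, KD-tree and TP-tree, and nothing here claims to be the index used in production. The class name `HyperplaneLSH` and the toy vectors are assumptions.

```python
import numpy as np
from collections import defaultdict

class HyperplaneLSH:
    """Toy LSH index: the sign pattern against random hyperplanes is the bucket key."""

    def __init__(self, dim, num_bits=4, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((num_bits, dim))
        self.buckets = defaultdict(list)   # bucket key -> list of vector ids
        self.vectors = []

    def _key(self, vec):
        return tuple((self.planes @ vec > 0).astype(int).tolist())

    def add(self, vec):
        vec = np.asarray(vec, dtype=float)
        self.buckets[self._key(vec)].append(len(self.vectors))
        self.vectors.append(vec)

    def query(self, vec, k=1):
        vec = np.asarray(vec, dtype=float)
        # Only the points in the query's bucket are compared exactly.
        candidates = self.buckets.get(self._key(vec), [])
        ranked = sorted(candidates, key=lambda i: np.linalg.norm(self.vectors[i] - vec))
        return ranked[:k]

index = HyperplaneLSH(dim=3, num_bits=4, seed=0)
for v in ([0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [-0.9, -0.1, 0.0]):
    index.add(v)

hits = index.query([0.9, 0.1, 0.0])
print(hits)
```

Because only one bucket is scanned, true neighbors that hash elsewhere are missed; graph and tree indexes such as those listed on the slide are other ways to trade recall against latency.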
  • 15. Optimization: RNN Serving Performance Challenges. Workloads: Language Modeling, Machine Translation, Machine Reading Comprehension, Conversation Bot, Speech Recognition, … Limited parallelism: small batch size, sequential dependency. Limited bandwidth: vector-matrix multiplication, low data reuse. [Diagram: unrolled RNN with inputs $X_{t-1}, X_t, X_{t+1}$, states $S_{t-1}, S_t, S_{t+1}$, outputs $O_{t-1}, O_t, O_{t+1}$, and weights U, W, V shared across time steps.]
  • 16. Optimization: 1. Matrix computation 2. Activation function 3. Operation fusing 4. Affinity 5. Locality 6. Parallelism 7. Task scheduling. In collaboration with Yuxiong He, Minjia Zhang, Samyam Rajbhandari, Wenhan Wang, Microsoft AI and Research.
  • 17. Optimization: $z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$; $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$; $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tanh(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h)$. On a machine with 12 cores: a) 1 core per operation, multiplications done in parallel; b) 12 cores per operation, multiplications done sequentially. [Charts: cores (up to 12) vs. time for the 6 multiplications.] Issues noted: many idle cores; unbalanced load; poor speedup of intra-op parallelism.
  • 18. Optimization: $z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$; $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$; $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tanh(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h)$. On a machine with 12 cores: c) 4 cores per operation (bad scheduling order); d) an optimized configuration, reducing latency. [Charts: cores (up to 12) vs. time for the 6 multiplications.] Factors considered: ✓ Workload size ✓ Parallelism efficiency ✓ Critical path ✓ Load balancing
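To make the six multiplications in the GRU equations concrete, here is a minimal NumPy sketch of a single time step. The parameter names in the dictionary `p` and the toy sizes are assumptions for illustration; this is not the DeepCPU implementation. Note that five of the products can start immediately, while `Uh @ (r * h_prev)` has to wait for `r`, which is one reason scheduling order and the critical path matter.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step following the equations on the slides (p holds W*, U*, b*)."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])   # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])   # reset gate
    # The Uh product depends on r, so it cannot run in parallel with the Ur product.
    h_cand = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])
    return z * h_prev + (1.0 - z) * h_cand

# Toy sizes and random parameters (assumed values).
S, H = 4, 3
rng = np.random.default_rng(0)
p = {name: rng.standard_normal((H, S)) for name in ("Wz", "Wr", "Wh")}
p.update({name: rng.standard_normal((H, H)) for name in ("Uz", "Ur", "Uh")})
p.update({name: np.zeros(H) for name in ("bz", "br", "bh")})

h_next = gru_step(rng.standard_normal(S), np.zeros(H), p)
print(h_next.shape)
```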
  • 23. Optimization: latency and throughput results per DL scenario.
      DL scenario | Original latency | Latency target | Optimized latency | Latency reduction | Throughput improvement
      Turing Prototype 2 | ~100ms | 10ms | 9ms | >10X | >10X
      Turing Prototype 3 | ~107ms | 10ms | 4.1ms | >20X | >50X
      Deep Query Document Similarity | 10~12ms for [query, 1 doc] x 33 docs | 6ms | 1.5ms for [query, 1 doc]; <6ms for [query, 33 docs] | >6X | >30X
      Malta Click Features | 10ms for [query, 1 passage] x 150 passages | 5ms | <1ms for [query, 1 passage]; <5ms for [query, 150 passages] | >10X | >100X
      Ads seq2seq model for query rewriting | 51ms | 5ms | 4ms | >10X | >3X
      AGI Encoder V2 | ~29ms | 10ms | 5.4ms | 5X | 5X
      RNet (InfoBot + Bing) | ~45ms for 1 [query, passage] | 10ms | 4.0ms for 1 [query, passage]; <8.5ms for 20 [query, passage] | 11X | >100X
      Bing query tagging | 9~16ms on CNTK | 3ms | 0.95ms | 10X | >10X
      WideDeepRight Model (TP3 L1) | ~25ms for [query, 1 title url] | 7ms for a batch size of 33 | 5.4ms for [query, 33 title url] | 10X | >100X
      TP3 L2 Classifier | 60ms | 3ms | 3ms | 20X | 20X
      TP3 L1 | 8ms | 3ms | 1ms | 8X | 8X
  • 25. Optimization: Original TensorFlow model vs. TensorFlow model with DeepCPU operator.
  • 26. Optimization: Pretrained DNN model in TF/CNTK/ONNX, etc. Scalable DNN hardware microservice: network switches, FPGAs. BrainWave Soft DPU: instruction decoder & control, neural FU.
  • 28. Optimization: CPU only vs. Brainwave accelerated.
      Production Bing DNN Model 1 | CPU only | Brainwave accelerated
      Model details | GRU 128X200 (x2) + W2Vec | LSTM 500X200 (x8) + W2Vec
      End-to-end latency per batch 1 request at 95% | 9ms | 0.85ms
      Improvement: the Brainwave accelerated model is > 10X larger with > 10X lower latency.
      Production Bing DNN Model 2 | CPU only | Brainwave accelerated
      Model details | 1D CNN + W2Vec (RNNs removed) | 1D CNN + W2Vec + GRU 500x500 (x4)
      End-to-end latency per batch 1 request at 95% | 15ms | 5ms
      Improvement: the Brainwave accelerated model is > 10X larger with 3X lower latency.
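  • 29. Optimization: Layer GEMM: $[W_i; W_f; W_o; W_c]$ (G*H × S) times $x_t$ (S × N) gives a G*H × N result. Recurrent GEMM: $[U_i; U_f; U_o; U_c]$ (G*H × H) times $h_{t-1}$ (H × N) gives a G*H × N result. S = synthetic_dim, H = hidden_dim, N = batch_size, G = num_gates.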
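The shapes on this slide can be checked with a short NumPy sketch in which the per-gate weight matrices are stacked along the row axis so that each time step needs one layer GEMM and one recurrent GEMM. The sizes below are arbitrary toy values, and the stacking order of the gates is an assumption taken from the order the symbols appear on the slide.

```python
import numpy as np

G, H, S, N = 4, 8, 5, 2          # num_gates, hidden_dim, synthetic_dim, batch_size
rng = np.random.default_rng(0)

W = rng.standard_normal((G * H, S))   # stacked [W_i; W_f; W_o; W_c]
U = rng.standard_normal((G * H, H))   # stacked [U_i; U_f; U_o; U_c]
x_t = rng.standard_normal((S, N))
h_prev = rng.standard_normal((H, N))

layer_out = W @ x_t                   # layer GEMM:     (G*H, S) x (S, N) -> (G*H, N)
recurrent_out = U @ h_prev            # recurrent GEMM: (G*H, H) x (H, N) -> (G*H, N)

# Split the stacked pre-activations back into one (H, N) block per gate.
i_pre, f_pre, o_pre, c_pre = np.split(layer_out + recurrent_out, G, axis=0)
print(layer_out.shape, recurrent_out.shape, i_pre.shape)
```

One stacked GEMM replaces G separate products, which gives the matrix library a larger unit of work per call.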
  • 30. Optimization: GRU on P4, FP32, batch_size = 1. Weights (G*H × H) are held in the register file (RF); $h_{t-1}$ (H × N) and the result (G*H × N) are held in shared memory (SMEM), along with other variables.
      H | N | RF usage | SMEM usage
      100 | 1 | 3*100*100*4 / (256*1024) ≅ 46% | (100 + 3*100)*4 / (96*1024) ≅ 2%
      20 | 1 | 3*20*20*4 / (256*1024) ≅ 2% (*can add more work in this instance) | (20 + 3*20)*4 / (96*1024) ≪ 1%
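The usage figures on this slide follow from simple arithmetic, reproduced below. The reading of the constants (3 gates for a GRU, 4 bytes per FP32 value, a 256 KB register file and 96 KB of shared memory) is inferred from the numbers in the slide's fractions, and the function name is hypothetical.

```python
def gru_memory_usage(hidden_dim, num_gates=3, bytes_per_value=4,
                     rf_bytes=256 * 1024, smem_bytes=96 * 1024):
    """Return (RF fraction, SMEM fraction) using the formulas shown on the slide."""
    weight_bytes = num_gates * hidden_dim * hidden_dim * bytes_per_value
    vector_bytes = (hidden_dim + num_gates * hidden_dim) * bytes_per_value
    return weight_bytes / rf_bytes, vector_bytes / smem_bytes

for h in (100, 20):
    rf, smem = gru_memory_usage(h)
    print(f"H={h}: RF {rf:.1%}, SMEM {smem:.2%}")
```

Weight storage grows with the square of the hidden dimension while the vectors grow linearly, so the register file is the first resource to fill as H increases.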
  • 31. Summary: Significant gains from deep learning in search, speech, vision and machine reading comprehension. Large-scale, low-latency inference and vector search service in production. Heterogeneous hardware and pluggable framework support.