The document discusses optimizing deep learning models for large-scale production deployment. It describes a framework that supports pluggable runtimes and heterogeneous hardware, including CPUs, GPUs, and FPGAs. Several model-optimization techniques are presented, such as operator fusion, affinity scheduling, and cache-aware partitioning. Case studies report latency reductions of over 10x and throughput gains of over 100x across various applications after optimization.
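As a rough illustration of the first technique, operator fusion merges a chain of adjacent operations into a single pass so intermediate results stay in registers or cache instead of being written to memory between steps. The sketch below is not from the document; it is a minimal toy example in Python/NumPy, and the function names (`unfused`, `fused_scale_add_relu`) are hypothetical:

```python
import numpy as np

def unfused(x, w, b):
    # Three separate passes over the data, each materializing
    # a full intermediate array in memory.
    t1 = x * w                   # pass 1: scale
    t2 = t1 + b                  # pass 2: shift
    return np.maximum(t2, 0.0)   # pass 3: ReLU

def fused_scale_add_relu(x, w, b):
    # One pass: each element is scaled, shifted, and clamped while
    # still "hot", avoiding the two intermediate buffers above.
    # (A real compiler would emit this as a single fused kernel.)
    out = np.empty_like(x)
    for i in range(x.size):
        v = x.flat[i] * w.flat[i] + b.flat[i]
        out.flat[i] = v if v > 0.0 else 0.0
    return out

x = np.array([[-1.0, 2.0], [3.0, -4.0]])
w = np.full((2, 2), 2.0)
b = np.full((2, 2), 1.0)
assert np.allclose(unfused(x, w, b), fused_scale_add_relu(x, w, b))
```

The payoff in practice comes from reduced memory traffic rather than fewer arithmetic operations, which is why fusion is typically paired with the cache-aware partitioning the document also mentions.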