The Latency Stack: Discovering Surprising Sources of Latency
Mark Gritter (he/him)
Principal Engineer at Postman
■ Working on monitoring and observability solutions for APIs at Akita Software, now Postman
■ Previously built VM-aware flash storage arrays at Tintri
■ Hobbies: gardening, weaving, math
Where does latency come from?
A simple picture of the world
func(w http.ResponseWriter, r *http.Request) {
}
start → end: latency!
● did I wait for a lock?
● are my queries slow?
● did I miss in cache?
A more realistic picture of the world
start → end: latency!
myService.call()
ret, err := ...
● did I make unnecessary round trips?
● did I send too much data?
But the more we look the worse it gets…
Client side: client application, client OS, client hypervisor
Server side: server application, server OS, server hypervisor
Also involved: DNS server, TCP/IP, TLS, load balancers, firewalls, service meshes
Understanding latency requires opening up our abstractions!
Example: Video Preprocessing @ Kealia
One of my first jobs was at a streaming video startup.
■ Custom hardware shipped video out; needed 1MB blocks.
■ My software had to preprocess MPEG; we wanted to handle live TV as well as ingested movies.
■ 3Mbit/s standard definition = one block every 2.7 seconds, delivered over TCP to a storage device.
■ Not a highly demanding schedule… but once in a while we were several seconds late.
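The 2.7-second figure follows from the block size and the bit rate. A minimal sketch of that arithmetic, assuming "1MB" means either 10^6 or 2^20 bytes (the slide does not say which); the helper name is hypothetical:

```go
package main

import "fmt"

// blockIntervalSeconds returns how long a stream running at bitsPerSecond
// takes to fill one block of blockBytes. The name is illustrative only.
func blockIntervalSeconds(blockBytes, bitsPerSecond float64) float64 {
	return blockBytes * 8 / bitsPerSecond
}

func main() {
	// Assumption: 3Mbit/s means 3,000,000 bits per second.
	fmt.Printf("10^6-byte block: %.2f s\n", blockIntervalSeconds(1e6, 3e6))
	fmt.Printf("2^20-byte block: %.2f s\n", blockIntervalSeconds(1<<20, 3e6))
}
```

Either reading of the block size lands close to the 2.7 seconds quoted on the slide, so a delay of several seconds means missing the deadline by more than a whole block.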
It must be my software’s fault
Latency = from video arriving at preprocessor to block landing in disk storage.
We looked at all the usual things, and some unusual ones.
● Inefficient algorithm?
● Lock contention?
● Bad scheduling?
● Memory allocation?
We even built a custom tracing tool so we could see what each thread was doing!
Is it the network?
We were all networking nerds… so we of course looked at a tcpdump trace on the receiver
● saw that the late-arriving blocks were transmitted with acceptably low latency
● they just started late…?
[Diagram: receiver-side packet trace; label: syn]
Is it the network? (2)
But finally we looked at the sending side, and caught this!
[Diagram: sender-side packet trace; labels: syn, syn]
The block was ready on time.
We opened up a new TCP connection to send it, and the initial SYN got lost.
And TCP’s retry for an unacknowledged SYN is: 3 seconds ?!?
Confession Time
We could have improved our TCP connection pooling, but because we were compiling our own kernel anyway, I just reduced the retry to a few hundred milliseconds.
Lesson learned: default protocol parameters cause latency.
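For readers who cannot patch the kernel, one application-level workaround is to bound each connection attempt and retry quickly. The sketch below is not what we did at Kealia; it assumes Go's standard `net.Dialer`, and the function name, attempt count, and 300 ms timeout are all illustrative choices:

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// dialWithShortTimeout makes up to attempts connection attempts, each bounded
// by perAttempt. Every attempt opens a fresh socket and so sends a fresh SYN,
// which keeps one lost SYN from costing a full kernel retransmission timeout.
func dialWithShortTimeout(addr string, attempts int, perAttempt time.Duration) (net.Conn, error) {
	if attempts < 1 {
		return nil, errors.New("attempts must be at least 1")
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		d := net.Dialer{Timeout: perAttempt}
		conn, err := d.Dial("tcp", addr)
		if err == nil {
			return conn, nil
		}
		lastErr = err
	}
	return nil, fmt.Errorf("dial %s failed after %d attempts: %w", addr, attempts, lastErr)
}

// demo connects to a throwaway listener on the loopback interface.
func demo() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()
	conn, err := dialWithShortTimeout(ln.Addr().String(), 3, 300*time.Millisecond)
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("connected")
}
```

Keeping a pool of already-open connections avoids the handshake entirely on the hot path, which is the improvement the slide alludes to; the retry loop only limits the damage when a new connection is unavoidable.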
It’s Always Storage’s Fault
Example: Flash storage @ Tintri
One of my more recent jobs was at a flash storage startup.
■ A potential customer using VMware was seeing high I/O latency.
■ Obviously they needed to upgrade their storage! So they rolled in a fancy all-flash storage array and… nothing improved.
■ That gave us a shot at the problem.
The Wrong Picture
VM sends request
Storage accesses SSD
VM receives response
Latency = network delay + controller software + device read
The Closer-to-Correct Picture
VM sends request
Storage accesses SSD
Hypervisor receives response
VM is scheduled to run and receives response
Latency = network delay + controller software + device read + time VM isn’t running
Hypervisor delays
Why might the hypervisor be slow?
● Swapping (sometimes)
● All Virtual Machine CPUs have to be scheduled together
○ More CPUs = harder to schedule!
“I have 8 physical CPUs, one of them busy.”
“I need 8 virtual CPUs.”
“lol, stby”
Measuring what matters
Our strategy:
● Measure latency locally (in our case, for each VM)
● Estimate network latency using inter-arrival times
● Ask the hypervisor for its measurement
● “Host” latency = hypervisor - storage - network
● Grab the appropriate hypervisor stats to shift blame away from storage.
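The subtraction in the strategy above can be sketched in a few lines. This is an illustration rather than Tintri's implementation: the type, field names, and sample numbers are made up, and clamping negative results to zero is an added assumption to absorb noisy estimates:

```go
package main

import (
	"fmt"
	"time"
)

// vmLatency holds one VM's measurements. Field names are illustrative.
type vmLatency struct {
	Hypervisor time.Duration // latency the hypervisor reports for the VM's I/O
	Storage    time.Duration // latency measured locally on the storage array
	Network    time.Duration // network latency estimated from inter-arrival times
}

// host returns the part of the hypervisor-observed latency that neither
// storage nor the network explains. Negative values are clamped to zero
// (an assumption, since the inputs are estimates from different clocks).
func (v vmLatency) host() time.Duration {
	h := v.Hypervisor - v.Storage - v.Network
	if h < 0 {
		return 0
	}
	return h
}

func main() {
	// Made-up numbers for illustration only.
	v := vmLatency{
		Hypervisor: 12 * time.Millisecond,
		Storage:    1 * time.Millisecond,
		Network:    500 * time.Microsecond,
	}
	fmt.Println("host latency:", v.host())
}
```

When the host share dominates like this, faster storage cannot help, which is the situation the customer was in.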
Weeks of finger-pointing solved!
Lesson learned: latency is the result of having too many CPUs.
When the Network is not a Network
Example: Autosupports @ Tintri
Our flash storage arrays uploaded “autosupport reports” periodically.
■ The back-end system thought the upload succeeded.
■ The storage array thought it did not.
■ And uploaded it again. And again. And again.
It’s just HTTP POST!
How can the client and the server disagree so badly – and so consistently – about whether the upload succeeded or not?
● Upload had a checksum (sent in the header) so we knew the data was not corrupt or truncated.
● Logging and debugging in our software showed nothing particularly unusual – the client was “just” getting some weird TLS error.
Is it the network?
I was still a networking nerd… so we of course looked at a tcpdump trace on the sender and the receiver (because I had learned.)
[Diagram: packet trace between the storage array and the server; labels: fin, rst]
Is it the network? (2)
But that picture was a lie. Once I matched up the timestamps…
[Diagram: the same trace between the storage array and the server with timestamps matched up; labels: fin, rst]
The Closer-to-Correct Picture
Array sends data quickly to ELB
ELB sends data slowly to server
ELB closes the “idle” connection… but keeps sending data anyway!
HTTP 200 OK
Network latency is negative: the client sees an (error) response before the server finishes!
Mitigation
N=1 so we removed the ELB, but also fixed the underlying problem:
● Don’t read from a network stream 1 byte at a time.
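To see why byte-at-a-time reads hurt, count how many read calls it takes to drain the same payload with different buffer sizes. The sketch below uses an in-memory reader as a stand-in for a socket; `countingReader` and `drain` are hypothetical helpers, and on a real connection each call would typically be a system call:

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// countingReader counts Read calls made on the wrapped reader.
type countingReader struct {
	r     io.Reader
	calls int
}

func (c *countingReader) Read(p []byte) (int, error) {
	c.calls++
	return c.r.Read(p)
}

// drain reads payload to EOF through a buffer of bufSize bytes and reports
// the bytes read and the number of Read calls it took.
func drain(payload []byte, bufSize int) (total, calls int, err error) {
	if bufSize < 1 {
		return 0, 0, errors.New("bufSize must be at least 1")
	}
	src := &countingReader{r: bytes.NewReader(payload)}
	buf := make([]byte, bufSize)
	for {
		n, rerr := src.Read(buf)
		total += n
		if rerr == io.EOF {
			return total, src.calls, nil
		}
		if rerr != nil {
			return total, src.calls, rerr
		}
	}
}

func main() {
	payload := make([]byte, 1<<20) // 1 MiB of zeros standing in for an upload
	for _, size := range []int{1, 64 * 1024} {
		total, calls, err := drain(payload, size)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("buffer=%d bytes: read %d bytes in %d calls\n", size, total, calls)
	}
}
```

In Go, wrapping the connection in a `bufio.Reader` or copying with `io.Copy` gives the larger reads without changing the parsing code much.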
Why do ELB load balancers have 10’s of megabytes of buffering? (Do they still? I don’t know.)
Lesson learned: AWS services cause latency
It’s always DNS
We’ve known it’s DNS since the early 2000’s
“The Contribution of DNS Lookup Costs to Web Object Retrieval” by Craig E. Wills and Hao Shang, 2000, WPI-CS-TR-00-12:
Finally we found that the DNS lookup time contributed more than one second to approximately 20% of retrievals for the Web objects on the home page of a larger list of popular servers.
● DNS is not normally slow, but when your HTTP request is slow…
● it’s probably DNS, particularly if you are still using a 1-second retry time.
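One way to check whether a slow request is "probably DNS" is to time the lookup separately from the rest of the request. A minimal sketch using Go's standard `net/http/httptrace` hooks; the `phases` type and function names are hypothetical, and the demo hits a local test server by IP address, so no lookup is expected to happen there:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
	"sync"
	"time"
)

// phases records how long each client-side phase of one request took.
type phases struct {
	DNS     time.Duration // stays zero when no lookup was performed
	Connect time.Duration
	Total   time.Duration
}

// timeRequest issues one GET on a fresh transport and times its phases.
func timeRequest(url string) (phases, error) {
	var (
		mu                  sync.Mutex // hooks may run on other goroutines
		p                   phases
		dnsStart, connStart time.Time
	)
	trace := &httptrace.ClientTrace{
		DNSStart: func(httptrace.DNSStartInfo) { mu.Lock(); dnsStart = time.Now(); mu.Unlock() },
		DNSDone:  func(httptrace.DNSDoneInfo) { mu.Lock(); p.DNS = time.Since(dnsStart); mu.Unlock() },
		ConnectStart: func(network, addr string) { mu.Lock(); connStart = time.Now(); mu.Unlock() },
		ConnectDone: func(network, addr string, err error) {
			mu.Lock()
			p.Connect = time.Since(connStart)
			mu.Unlock()
		},
	}
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return phases{}, err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	tr := &http.Transport{} // fresh transport, so no connection is reused
	defer tr.CloseIdleConnections()
	start := time.Now()
	resp, err := (&http.Client{Transport: tr}).Do(req)
	if err != nil {
		return phases{}, err
	}
	defer resp.Body.Close()
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return phases{}, err
	}
	mu.Lock()
	defer mu.Unlock()
	p.Total = time.Since(start)
	return p, nil
}

// demo runs timeRequest against a local test server.
func demo() (phases, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	}))
	defer srv.Close()
	return timeRequest(srv.URL)
}

func main() {
	p, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("dns=%v connect=%v total=%v\n", p.DNS, p.Connect, p.Total)
}
```

Pointing `timeRequest` at a URL with a real hostname is where the DNS number becomes interesting, especially on the first request after a cache expiry.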
Techniques
The Latency Stack
● Routing in the network
● Your server’s physical network hardware
● Your server’s hypervisor
● Your server’s network protocols
● Your server’s operating system
● Your server’s CPU and memory subsystems
● Your application’s queues
● Your code
● Your application’s dependencies
● The client software
● The client’s DNS
● The client’s operating system
● The client’s network protocols
● The client’s hypervisor
● The client’s physical network hardware
● Queuing in the network
● Middleboxes in the network
Start by looking at both sides
Compare client and server measurements, highlight discrepancies:
● Failure and error rates
● Duration and latency
Extend the scope of spans to cover more of the network and kernel:
● Server latency not just application latency
● eBPF probes to examine kernel queuing
● Network-based measurements to get the “ground truth”
Draw the whole picture:
● Start including pieces you otherwise abstract away
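Comparing both sides can start very small: time the same request in the handler and in the client, then look at the gap. The sketch below runs both ends in one process against a local test server, so the gap is tiny; the function name and the sleep standing in for application work are assumptions for illustration:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// measureBothSides sends one request to a local test server and returns the
// latency seen by the client and the latency seen inside the handler.
func measureBothSides(work time.Duration) (client, server time.Duration, err error) {
	serverSeen := make(chan time.Duration, 1)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		time.Sleep(work) // stand-in for application work
		fmt.Fprintln(w, "ok")
		serverSeen <- time.Since(start)
	}))
	defer srv.Close()

	start := time.Now()
	resp, err := http.Get(srv.URL)
	if err != nil {
		return 0, 0, err
	}
	defer resp.Body.Close()
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return 0, 0, err
	}
	client = time.Since(start)
	return client, <-serverSeen, nil
}

func main() {
	client, server, err := measureBothSides(5 * time.Millisecond)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The gap is everything the handler's own timer cannot see:
	// connection setup, kernel queues, the network, and so on.
	fmt.Printf("client=%v server=%v gap=%v\n", client, server, client-server)
}
```

In production the two numbers come from different machines and different clocks, so compare durations rather than absolute timestamps, and watch for requests where the gap is large or the two sides disagree about success.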
Measuring Latency from a Network Trace
C - B = processing latency
D - A = lower bound on end-to-end latency
If you’re really ambitious, RTT estimates can be derived too!
A = first packet of request
B = last packet of request
C = first packet of response
D = last packet of response
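Given the four timestamps A through D for one request/response pair, the two formulas are simple subtractions. A minimal sketch with made-up timestamps; the `exchange` type is hypothetical, and extracting the timestamps from a capture file is left out:

```go
package main

import (
	"fmt"
	"time"
)

// exchange holds capture timestamps for one request/response pair, expressed
// as offsets from the start of the capture.
type exchange struct {
	A time.Duration // first packet of request
	B time.Duration // last packet of request
	C time.Duration // first packet of response
	D time.Duration // last packet of response
}

// processing is C - B: the time between the request being fully seen and the
// response starting.
func (e exchange) processing() time.Duration { return e.C - e.B }

// endToEndLowerBound is D - A. It is only a lower bound because the capture
// point does not see delays between itself and the far endpoint.
func (e exchange) endToEndLowerBound() time.Duration { return e.D - e.A }

func main() {
	// Made-up timestamps for illustration only.
	e := exchange{
		A: 0,
		B: 2 * time.Millisecond,
		C: 47 * time.Millisecond,
		D: 53 * time.Millisecond,
	}
	fmt.Println("processing latency:", e.processing())
	fmt.Println("end-to-end lower bound:", e.endToEndLowerBound())
}
```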
When we’re examining tail latencies, by definition we’re looking at something unusual!
Mark Gritter
mgritter@gmail.com
@markgritter@mathstodon.xyz
Thank you! Let’s connect.

 
TrustArc Webinar - Your Guide for Smooth Cross-Border Data Transfers and Glob...
TrustArc Webinar - Your Guide for Smooth Cross-Border Data Transfers and Glob...TrustArc Webinar - Your Guide for Smooth Cross-Border Data Transfers and Glob...
TrustArc Webinar - Your Guide for Smooth Cross-Border Data Transfers and Glob...
 
Supplier Sourcing Presentation - Gay De La Cruz.pdf
Supplier Sourcing Presentation - Gay De La Cruz.pdfSupplier Sourcing Presentation - Gay De La Cruz.pdf
Supplier Sourcing Presentation - Gay De La Cruz.pdf
 
EverHost AI Review: Empowering Websites with Limitless Possibilities through ...
EverHost AI Review: Empowering Websites with Limitless Possibilities through ...EverHost AI Review: Empowering Websites with Limitless Possibilities through ...
EverHost AI Review: Empowering Websites with Limitless Possibilities through ...
 
Kubernetes Cloud Native Indonesia Meetup - June 2024
Kubernetes Cloud Native Indonesia Meetup - June 2024Kubernetes Cloud Native Indonesia Meetup - June 2024
Kubernetes Cloud Native Indonesia Meetup - June 2024
 
New ThousandEyes Product Features and Release Highlights: June 2024
New ThousandEyes Product Features and Release Highlights: June 2024New ThousandEyes Product Features and Release Highlights: June 2024
New ThousandEyes Product Features and Release Highlights: June 2024
 
DynamoDB to ScyllaDB: Technical Comparison and the Path to Success
DynamoDB to ScyllaDB: Technical Comparison and the Path to SuccessDynamoDB to ScyllaDB: Technical Comparison and the Path to Success
DynamoDB to ScyllaDB: Technical Comparison and the Path to Success
 

The Latency Stack: Discovering Surprising Sources of Latency

  • 1. HOSTED BY The Latency Stack: Discovering Surprising Sources of Latency Mark Gritter Principal Engineer at Postman
  • 2. Mark Gritter (he/him) Principal Engineer at Postman ■ Working on monitoring and observability solutions for APIs at Akita Software, now Postman ■ Previously built VM-aware flash storage arrays at Tintri ■ Hobbies: gardening, weaving, math
  • 4. A simple picture of the world func(w http.ResponseWriter, r *http.Request) { } start end latency! did I wait for a lock? are my queries slow? did I miss in cache?
  • 5. A more realistic picture of the world start end latency! myService.call() ret, err := ... did I make unnecessary round trips? did I send too much data?
  • 6. But the more we look the worse it gets… client application server application client OS client hypervisor server OS server hypervisor DNS server TCP/IP TLS Load balancers Firewalls Service meshes
  • 7. Understanding latency requires opening up our abstractions!
  • 8. Example: Video Preprocessing @ Kealia One of my first jobs was at a streaming video startup. ■ Custom hardware shipped video out; needed 1MB blocks. ■ My software had to preprocess MPEG; we wanted to handle live TV as well as ingested movies. ■ 3Mbit/s standard definition = one block every 2.7 seconds, delivered over TCP to a storage device. ■ Not a highly demanding schedule… but once in a while we were several seconds late.
  • 9. It must be my software’s fault Latency = from video arriving at preprocessor to block landing in disk storage. We looked at all the usual things, and some unusual ones. ● Inefficient algorithm? ● Lock contention? ● Bad scheduling? ● Memory allocation? We even built a custom tracing tool so we could see what each thread was doing!
  • 10. Is it the network? We were all networking nerds… so we of course looked at a tcpdump trace on the receiver ● saw that the late-arriving blocks were transmitted with acceptably low latency ● they just started late…? [Diagram label: syn]
  • 11. Is it the network? (2) But finally we looked at the sending side, and caught this! [Diagram labels: syn, syn] The block was ready on time. We opened up a new TCP connection to send it, and the initial SYN got lost. And TCP’s retry for an unacknowledged SYN is: 5 seconds ?!?
  • 12. Confession Time We could have improved our TCP connection pooling, but because we were compiling our own kernel anyway, I just reduced the retry to a few hundred milliseconds. Lesson learned: default protocol parameters cause latency.
  • 14. Example: Flash storage @ Tintri One of my more recent jobs was at a flash storage startup. ■ A potential customer using VMware was seeing high I/O latency. ■ Obviously they needed to upgrade their storage! So they rolled in a fancy all-flash storage array and… nothing improved. ■ That gave us a shot at the problem.
  • 15. The Wrong Picture: VM sends request; Storage accesses SSD; VM receives response. Latency = network delay + controller software + device read
  • 16. The Closer-to-Correct Picture: VM sends request; Storage accesses SSD; Hypervisor receives response; VM is scheduled to run and receives response. Latency = network delay + controller software + device read + time VM isn’t running
  • 17. Hypervisor delays Why might the hypervisor be slow? ● Swapping (sometimes) ● All Virtual Machine CPUs have to be scheduled together ○ More CPUs = harder to schedule! [Diagram captions: "I have 8 physical CPUs, one of them busy." / "I need 8 virtual CPUs" / "lol, stby"]
  • 18. Measuring what matters Our strategy: ● Measure latency locally (in our case, for each VM) ● Estimate network latency using inter-arrival times ● Ask the hypervisor for its measurement ● “Host” latency = hypervisor - storage - network ● Grab the appropriate hypervisor stats to shift blame away from storage. Weeks of finger-pointing solved! Lesson learned: latency is the result of having too many CPUs.
  • 19. When the Network is not a Network
  • 20. Example: Autosupports @ Tintri Our flash storage arrays uploaded “autosupport reports” periodically. ■ The back-end system thought the upload succeeded. ■ The storage array thought it did not. ■ And uploaded it again. And again. And again.
  • 21. It’s just HTTP POST! How can the client and the server disagree so badly – and so consistently – about whether the upload succeeded or not? ● Upload had a checksum (sent in the header) so we knew the data was not corrupt or truncated. ● Logging and debugging in our software showed nothing particularly unusual – the client was “just” getting some weird TLS error.
  • 22. Is it the network? I was still a networking nerd… so we of course looked at a tcpdump trace on the sender and the receiver (because I had learned). [Diagram labels: fin, rst; server, storage array]
  • 23. Is it the network? (2) But that picture was a lie. Once I matched up the timestamps… [Diagram labels: fin, rst; server, storage array]
  • 24. The Closer-to-Correct Picture: Array sends data quickly to ELB. ELB sends data slowly to server. ELB closes the “idle” connection… but keeps sending data anyway! Network latency is negative: the client sees an (error) response before the server finishes! [Diagram label: HTTP 200 OK]
  • 25. Mitigation N=1 so we removed the ELB, but also fixed the underlying problem: ● Don’t read from a network stream 1 byte at a time. Why do ELB load balancers have tens of megabytes of buffering? (Do they still? I don’t know.) Lesson learned: AWS services cause latency
  • 27. We’ve known it’s DNS since the early 2000s “The Contribution of DNS Lookup Costs to Web Object Retrieval” by Craig E. Wills and Hao Shang, 2000, WPI-CS-TR-00-12: Finally we found that the DNS lookup time contributed more than one second to approximately 20% of retrievals for the Web objects on the home page of a larger list of popular servers. ● DNS is not normally slow, but when your HTTP request is slow… ● it’s probably DNS, particularly if you are still using a 1-second retry time.
  • 29. The Latency Stack ● Routing in the network ● Your server’s physical network hardware ● Your server’s hypervisor ● Your server’s network protocols ● Your server’s operating system ● Your server’s CPU and memory subsystems ● Your application’s queues ● Your code ● Your application’s dependencies ● The client software ● The client’s DNS ● The client’s operating system ● The client’s network protocols ● The client’s hypervisor ● The client’s physical network hardware ● Queuing in the network ● Middleboxes in the network
  • 30. Start by looking at both sides Compare client and server measurements, highlight discrepancies: ● Failure and error rates ● Duration and latency Extend the scope of spans to cover more of the network and kernel: ● Server latency not just application latency ● eBPF probes to examine kernel queuing ● Network-based measurements to get the “ground truth” Draw the whole picture: ● Start including pieces you otherwise abstract away
  • 31. Measuring Latency from a Network Trace A = first packet of request; B = last packet of request; C = first packet of response; D = last packet of response. C - B = processing latency. D - A = lower bound on end-to-end latency. If you’re really ambitious, RTT estimates can be derived too!
  • 32. When we’re examining tail latencies, by definition we’re looking at something unusual!