The Latency Stack: Discovering Surprising Sources of Latency

HOSTED BY
The Latency Stack: Discovering
Surprising Sources of Latency
Mark Gritter
Principal Engineer at Postman

Mark Gritter (he/him)
Principal Engineer at Postman
■ Working on monitoring and observability
solutions for APIs at Akita Software, now
Postman
■ Previously built VM-aware flash storage arrays
at Tintri
■ Hobbies: gardening, weaving, math

A simple picture of the world
func(w http.ResponseWriter, r *http.Request) {
}
start
end
latency!
did I
wait for
a lock?
are my
queries
slow?
did I
miss in
cache?

A more realistic picture of the world
start
end
latency!
myService.call()
ret, err := ...
did I make
unnecessary
round trips?
did I send
too much
data?

But the more we look the worse it gets…
client application server application
client OS
client hypervisor
server OS
server hypervisor
DNS server
TCP/IP
TLS
Load balancers
Firewalls
Service meshes

Understanding latency requires opening
up our abstractions!

Example: Video Preprocessing @ Kealia
One of my first jobs was at a streaming video startup.
■ Custom hardware shipped video out; needed 1MB blocks.
■ My software had to preprocess MPEG; we wanted to handle live TV as well as
ingested movies.
■ 3Mbit/s standard definition = one block every 2.7 seconds, delivered over
TCP to a storage device.
■ Not a highly demanding schedule… but once in a while we were several
seconds late.

It must be my software’s fault
Latency = from video arriving at preprocessor to block landing in disk storage.
We looked at all the usual things, and some unusual ones.
● Inefficient algorithm?
● Lock contention?
● Bad scheduling?
● Memory allocation?
We even built a custom tracing tool so we could see what each thread was doing!

Is it the network?
We were all networking nerds… so we of course looked at a tcpdump trace on the
receiver
● saw that the late-arriving blocks were transmitted with acceptably low latency
● they just started late…?
syn

Is it the network? (2)
But finally we look at the sending side, and caught this!
syn
syn
The block was ready on time.
We opened up a new TCP connection to send it, and the initial SYN got lost.
And TCP’s retry for an unacknowledged SYN is: 5 seconds ?!?

Confession Time
We could have improved our TCP connection pooling, but because we were
compiling our own kernel anyway, I just reduced the retry to a few hundred
milliseconds.
Lesson learned: default protocol parameters cause latency.

It’s Always Storage’s Fault

Example: Flash storage @ Tintri
One of my more recent jobs was at a flash storage startup.
■ A potential customer using VMWare was seeing high I/O latency.
■ Obviously they needed to upgrade their storage! So they rolled in a fancy all-
flash storage array and… nothing improved.
■ That gave us a shot at the problem.

The Wrong Picture
VM sends request
Storage accesses SSD
VM receives response
Latency = network delay + controller software + device read

The Closer-to-Correct Picture
VM sends request
Storage accesses SSD
VM is scheduled to run and receives request
Hypervisor receives response
Latency = network delay + controller software + device read
+ time VM isn’t running

Hypervisor delays
Why might the hypervisor be slow?
● Swapping (sometimes)
● All Virtual Machine CPUs have to be scheduled together
○ More CPUs = harder to schedule!
I have 8
physical
CPUs, one
of them
busy.
I need 8
virtual
CPUs
lol, stby

Measuring what matters
Our strategy:
● Measure latency locally (in our case, for each VM)
● Estimate network latency using inter-arrival times
● Ask the hypervisor for its measurement
● “Host” latency = hypervisor - storage - network
● Grab the appropriate hypervisor stats to shift blame away from storage.
Weeks of finger-pointing solved!
Lesson learned: latency is the result of having too many CPUs.

When the Network is not a Network

Example: Autosupports @ Tintri
Our flash storage arrays uploaded “autosupport reports” periodically.
■ The back-end system thought the upload succeeded.
■ The storage array thought it did not.
■ And uploaded it it again. And again. And again.

It’s just HTTP POST!
How can the client and the server disagree so badly – and so consistently – about
whether the upload succeeded or not?
● Upload had a checksum (sent in the header) so we knew the data was not
corrupt or truncated.
● Logging and debugging in our software showed nothing particularly unusual
– the client was “just” getting some weird TLS error.

Is it the network?
I was still a networking nerd… so we of course looked at a tcpdump trace on the
sender and the receiver (because I had learned.)
fin
server
rst
storage
array

Is it the network? (2)
But that picture was a lie. Once I matched up the timestamps…
fin
server
rst
storage
array

The Closer-to-Correct Picture
ELB sends data
slowly to server
Array sends data
quickly to ELB
Network latency is negative: the client sees an (error) response
before the server finishes!
ELB closes the
“idle” connection.
… but keeps
sending data
anyway!
HTTP 200 OK

Mitigation
N=1 so we removed the ELB, but also fixed the underlying problem:
● Don’t read from a network stream 1 byte at a time.
Why do ELB load balancers have 10’s of megabytes of buffering? (Do they still? I
don’t know.)
Lesson learned: AWS services cause latency

We’ve known it’s DNS since the early 2000’s
“The contribution of DNS Lookup Costs to Web Object Retrieval" by Craig E Willis
and Hao Shang, 2004, WPI-CS-TR-00-12:
Finally we found that the DNS lookup time contributed more than one second to
approximately 20% of retrievals for the Web objects on the home page of a larger list
of popular servers.
● DNS is not normally slow, but when your HTTP request is slow…
● it’s probably DNS, particularly if you are still using a 1-second retry time.

The Latency Stack
● Routing in the network
● Your server’s physical network
hardware
● Your server’s hypervisor
● Your server’s network
protocols
● Your server’s operating system
● Your server’s CPU and memory
subsystems
● Your application’s queues
● Your code
● Your application’s
dependencies
● The client software
● The client’s DNS
● The client’s operating system
● The client’s network protocols
● The client’s hypervisor
● The client’s physical network
hardware
● Queuing in the network
● Middleboxes in the network

Start by looking at both sides
Compare client and server measurements, highlight discrepancies:
● Failure and error rates
● Duration and latency
Extend the scope of spans to cover more of the network and kernel:
● Server latency not just application latency
● eBPF probes to examine kernel queuing
● Network-based measurements to get the “ground truth”
Draw the whole picture:
● Start including pieces you otherwise abstract away

Measuring Latency from a Network Trace
C - B = processing latency
D - A = lower bound on end-to-end latency
If you’re really ambitious, RTT estimates can be derived too!
A = first
packet
of request
B = last
packet
of request
C = first
packet
of response
D = last
packet
of response

When we’re examining tail latencies,
by definition we’re looking at
something unusual!

Mark Gritter
mgritter@gmail.com
@markgritter@mathstodon.xyz
Thank you! Let’s connect.

The Latency Stack: Discovering Surprising Sources of Latency

Recommended

Recommended

More Related Content

Similar to The Latency Stack: Discovering Surprising Sources of Latency

Similar to The Latency Stack: Discovering Surprising Sources of Latency (20)

More from ScyllaDB

More from ScyllaDB (20)

Recently uploaded

Recently uploaded (20)

The Latency Stack: Discovering Surprising Sources of Latency