Fighting Packet Loss with Curl


Note: This post was originally published on the Healthchecks.io blog. It documents my experiments in getting curl to cope with high levels of synthetic packet loss.

One class of support requests I get at Healthchecks.io is about occasional failed HTTP requests to ping endpoints (hc-ping.com and hchk.io). Following an investigation, the conclusion is often that the failed requests are caused by packet loss somewhere along the path from the client to the server. The problem starts and ends seemingly at random, presumably as network operators fix failing equipment or change the routing rules. This is mostly opaque to the end users on both ends: you send packets into a “black hole” and they come out at the other end, and sometimes they don’t.

One way to measure packet loss is using the mtr utility:

$ mtr -w -c 1000 -s 1000 -r 2a01:4f8:231:1b68::2
Start: 2019-10-07T06:25:42+0000
HOST: vams                              Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ???                               100.0  1000    0.0   0.0   0.0   0.0   0.0
  2.|-- ???                               100.0  1000    0.0   0.0   0.0   0.0   0.0
  3.|-- 2001:19f0:5000::a48:129           46.3%  1000    1.7   2.2   1.0  28.9   2.5
  4.|-- ae4-0.ams10.core-backbone.com      4.0%  1000    1.1   2.5   1.0  43.7   4.6
  5.|-- ae16-2074.fra10.core-backbone.com  4.4%  1000    6.8   8.3   6.4  57.7   5.6
  6.|-- 2a01:4a0:0:2021::4                 4.7%  1000    6.8   6.8   6.4  26.4   2.2
  7.|-- 2a01:4a0:1338:3::2                 4.5%  1000    6.7  12.0   6.5 147.7  16.7
  8.|-- core22.fsn1.hetzner.com            4.4%  1000   11.6  16.4  11.4  84.9  14.2
  9.|-- ex9k1.dc14.fsn1.hetzner.com        5.2%  1000   16.7  12.4  11.5  47.4   3.5
 10.|-- 2a01:4f8:231:1b68::2               5.2%  1000   12.1  11.7  11.5  31.3   1.7

The command line parameters used:

  • -w Puts mtr into wide report mode. When in this mode, mtr will not cut hostnames in the report.

  • -c 1000 The number of pings sent to determine both the machines on the network and the reliability of those machines. Each cycle lasts one second.

  • -s 1000 The packet size used for probing, in bytes, including IP and ICMP headers.

  • -r Report mode. mtr will run for the number of cycles specified by the -c option, and then print statistics and exit.

The last parameter is the IP address to probe. You can also put a hostname (e.g. hc-ping.com) there. The above run shows a 5.2% packet loss from the host to one of the IPv6 addresses used by Healthchecks.io ping endpoints. That’s above what I would consider “normal”, and will sometimes cause latency spikes when making HTTP requests, but the requests will still usually succeed.
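When checking a path repeatedly, it can be handy to pull just the final hop's loss figure out of the report. The following is a minimal sketch, assuming the report layout shown above (hop number, host, then Loss% as the third column). The function name final_hop_loss is made up for this example.

```shell
#!/usr/bin/env bash

# Print the Loss% value of the last hop in an `mtr -r` report read from stdin.
# Assumes hop lines look like " 10.|-- host  5.2%  1000 ..." as in the report above.
final_hop_loss() {
    awk '/^ *[0-9]+\.[|]--/ { loss = $3 } END { if (loss != "") print loss }'
}

# Example (hypothetical invocation):
#   mtr -w -c 100 -r hc-ping.com | final_hop_loss
```

The last matching hop line wins, so the value printed is the loss measured at the destination rather than at an intermediate router.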

Packet loss cannot be completely eliminated: there are always going to be equipment failures and human errors. Some packet loss is also allowed by the IP protocol’s design: when a router or network segment is congested, it is expected to drop packets.


I’ve been experimenting with curl parameters to make it more resilient to packet loss. I learned that with enough brute force, curl can get a request through fairly reliably even at 80% packet loss levels. The extra parameters I’m testing below should not be needed, and in an ideal world the HTTP requests would just work. But sometimes they don’t.

For my testing I used iptables to simulate packet loss. For example, this incantation sets up 50% packet loss:

iptables -A INPUT -m statistic --mode random --probability 0.5 -j DROP    

Be careful when adding rules like this one over SSH: you may lose access to the remote machine. If you do add the rule, you will probably want to remove it later:

iptables -D INPUT -m statistic --mode random --probability 0.5 -j DROP
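Since a DROP rule is easy to leave behind, one option is to wrap the add and remove steps around the command being tested. This is a rough sketch, not something from the original experiment: with_packet_loss and the IPTABLES override variable are names invented here, and the function needs root to actually change rules. It does not install a signal trap, so interrupting it with Ctrl-C would still leave the rule in place.

```shell
#!/usr/bin/env bash

# Add a random-drop rule, run the given command, then remove the rule again.
# Usage: with_packet_loss PROBABILITY COMMAND [ARGS...]
# IPTABLES is a hypothetical override, handy for dry runs (e.g. IPTABLES=echo).
with_packet_loss() {
    local prob=$1; shift
    local ipt=${IPTABLES:-iptables}
    local rule=(INPUT -m statistic --mode random --probability "$prob" -j DROP)

    "$ipt" -A "${rule[@]}" || return 1
    "$@"
    local status=$?
    # Remove the rule whether or not the command succeeded.
    "$ipt" -D "${rule[@]}"
    return "$status"
}

# Example (as root; run-test.sh is a placeholder name):
#   with_packet_loss 0.5 ./run-test.sh
```

The function returns the exit status of the wrapped command, so it can be dropped into a larger script without hiding failures.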

I made a quick bash script to run curl in a loop and count failures:

#!/usr/bin/env bash
# Run the curl command under test repeatedly and count the failures.

attempts=20
errors=0
start=$(date +%s)

for ((i = 1; i <= attempts; i++)); do
    printf '\nAttempt %d\n\n' "$i"
    # This is the command we are testing:
    if ! curl --retry 3 --max-time 30 https://hc-ping.com/6e1fbf8f-c17e-4749-af44-0c81461bdd19; then
        errors=$((errors + 1))
    fi
done

end=$(date +%s)
printf '\nDone! Attempts: %d, errors: %d, ok: %d\n' "$attempts" "$errors" "$((attempts - errors))"
printf 'Total time: %d s\n' "$((end - start))"

For the baseline, I used the “--retry 3” and “--max-time 30” parameters: curl will retry transient errors up to 3 times, and each attempt is capped at 30 seconds. Without the 30 second limit, curl could sit for hours waiting for missing packets.

Baseline results with no packet loss:

  • 👍 Successful requests: 20

  • 💩 Failed requests: 0

  • ⏱️ Total time: 4 seconds

Baseline results with 50% packet loss:

  • 👍 Successful requests: 20

  • 💩 Failed requests: 0

  • ⏱️ Total time: 2 min 4 s

Baseline results with 80% packet loss:

  • 👍 Successful requests: 13

  • 💩 Failed requests: 7

  • ⏱️ Total time: 17 min 43 s

Next, I increased the number of retries to 20, and reduced the time-per-request to 5 seconds. The idea is to fail quickly and try again, rinse and repeat:

curl --retry 20 -m 5 https://hc-ping.com/6e1fbf8f-c17e-4749-af44-0c81461bdd19

When using the --retry parameter, curl delays the retries using an exponential backoff algorithm: 1 second, 2 seconds, 4 seconds, 8 seconds, … This test was going to take hours, so I added an explicit fixed delay:

curl --retry 20 -m 5 --retry-delay 1 https://hc-ping.com/6e1fbf8f-c17e-4749-af44-0c81461bdd19
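To see why the backoff matters here, a back-of-the-envelope upper bound helps: assume every attempt runs into the -m limit and every retry waits its full delay. The sketch below relies on curl's documented behavior of doubling the delay from 1 second up to a cap of 10 minutes. The helper name worst_case_seconds is made up, and the estimate ignores any overhead outside the timeouts and delays.

```shell
#!/usr/bin/env bash

# Upper bound, in seconds, on one curl invocation when every attempt times out.
# Usage: worst_case_seconds RETRIES MAX_TIME [FIXED_DELAY]
# Without FIXED_DELAY, delays double from 1 s and are capped at 600 s.
worst_case_seconds() {
    local retries=$1 max_time=$2 fixed_delay=${3:-}
    # One initial attempt plus RETRIES retries, each capped at MAX_TIME.
    local total=$(( (retries + 1) * max_time ))
    local delay=1 i

    for (( i = 0; i < retries; i++ )); do
        if [ -n "$fixed_delay" ]; then
            total=$(( total + fixed_delay ))
        else
            total=$(( total + delay ))
            delay=$(( delay * 2 ))
            if (( delay > 600 )); then delay=600; fi
        fi
    done
    echo "$total"
}

worst_case_seconds 20 5      # exponential backoff
worst_case_seconds 20 5 1    # fixed 1 second delay
```

Under these assumptions, a single request with --retry 20 -m 5 could take up to 7128 seconds (about two hours), versus 125 seconds once --retry-delay 1 is added.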

Fast retries with 1 second retry delay and 80% packet loss:

  • 👍 Successful requests: 15

  • 💩 Failed requests: 5

  • ⏱️ Total time: 18 min 18 s

Of the 5 errors, in 3 cases curl simply ran out of retries, and in 2 cases it aborted with the “Error in the HTTP2 framing layer” error. So I tried HTTP/1.0 instead. To make the results more statistically significant, I also increased the number of runs to 100:

curl -0 --retry 20 -m 5 --retry-delay 1 https://hc-ping.com/6e1fbf8f-c17e-4749-af44-0c81461bdd19

Fast retries over HTTP/1.0 with 80% packet loss:

  • 👍 Successful requests: 98

  • 💩 Failed requests: 2

  • ⏱️ Total time: 51 min 3 s

For good measure, I ran the baseline version again, now with 100 iterations. Baseline results with 80% packet loss:

  • 👍 Successful requests: 75

  • 💩 Failed requests: 25

  • ⏱️ Total time: 60 min 22 s

Summary: in a simulated 80% packet loss environment, the “retry early, retry often” strategy clearly beats the default strategy. It would likely reach a 100% success rate if I increased the number of retries some more.

Forcing HTTP/1.0 prevents curl from aborting prematurely when it hits the “Error in the HTTP2 framing layer” error.

Going from HTTPS to plain HTTP would likely also help a lot because of the reduced number of required round-trips per request. But trading privacy for potentially more reliability is a questionable trade-off.

In my experience, IPv6 communications over today’s internet are more prone to intermittent packet loss than IPv4. If you have the option to use either, you can pass the “-4” flag to curl and it will use IPv4. This might be a pragmatic choice in the short term, but we should also keep pestering ISPs to improve their IPv6 reliability.


If you experience failed HTTP requests, and fixing the root cause is outside of your control, adding the above retry parameters to your curl calls can help as a mitigation. Also, curl is awesome.
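As one way to package this, here is a small wrapper using the flags from the HTTP/1.0 test above. It is only a sketch: the function name ping_url and the CURL override variable are invented for this example, and -fsS -o /dev/null (fail on HTTP errors, stay quiet except for error messages, discard the response body) are additions that were not part of the tests.

```shell
#!/usr/bin/env bash

# Request a URL using the "retry early, retry often" settings from the tests.
# Usage: ping_url URL
# CURL is a hypothetical override, useful for dry runs or for a non-default curl binary.
ping_url() {
    "${CURL:-curl}" -0 --retry 20 --retry-delay 1 -m 5 -fsS -o /dev/null "$1"
}

# Example (replace the placeholder with your own check's UUID):
#   ping_url https://hc-ping.com/your-uuid-here
```

In a cron job, this could run after the real task so that the ping is only sent when the task succeeds.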

Happy curl’ing,
–Pēteris, Healthchecks.io


Comments

Great stuff, glad to see more unix content, thought I was alone in that sphere on read.cash.

You should join my read.cash community on opensource, and share your article. https://read.cash/c/opensource-0cc0

4 years ago

Nice experiment. I would like to add some extra information here.

As a coder, I haven't yet gone beyond HTTP/1.0 (to HTTP/2 or HTTPS) due to a lack of reliable information that would allow me to write code that can reliably decode and encode data streams for these protocols.

HTTP/1.0 is very simple, as it uses no special snowflake framing: it's basically raw text sent through normal TCP packets (which may or may not contain the overall size of the request). This design allows packet loss to be handled completely by the operating system itself at the TCP or IP level of the protocol stack, which is why you got no snowflake errors in your experiment.

4 years ago