Decreasing latency noise and maximizing performance during end-to-end benchmarking

While benchmarking some tweaks to graphql-engine, we noticed confusing, inconsistent, and sometimes misleading results, which led us down a bit of a rabbit hole. This is a summary of what we've learned so far, what is left to try, and what we've tried that didn't seem to have much effect.

We're starting to explore two interrelated things:

how to maximize performance in a running graphql-engine generally; this is a kind of noise reduction for benchmarking purposes, but it also obviously yields best practices to share with users, or to help them measure performance
how to run graphql-engine such that latency is consistent (even if possibly much slower), so that we can be confident about whether a change we've made was actually beneficial

Some of the results here probably apply to production deployments of graphql-engine (and other services) as well; but we don't have concrete recommendations there yet. If you have success with any of these techniques in production, let us know!

Things that were very effective

Background: modern Intel processors have extremely sophisticated power management that modifies the clock frequency and powers subsystems up and down dynamically and constantly (many times per second). Modern Linux exposes knobs for some amount of control over how all this behaves: the intel_idle driver allows some control over c-states (processor idle states), while intel_pstate deals with p-states (performance states, i.e. frequency scaling). See the references at the bottom for more.

I was taken by surprise to find that benchmarking graphql-engine at 100 RPS was such a light load that power management was skewing my measurements dramatically. Worse, it was actively misleading: e.g. having the browser open (i.e. more load) caused a dramatic performance improvement, which led me to think I'd made a performance breakthrough!

We can use turbostat (from the cpupower package) to look at power management states of the machine during an execution of some load, with:

$ turbostat --interval 0.1 sleep 120

For my laptop (with Core i7-3667U), without tweaks, running the 100 RPS benchmark:

    Core     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz     IRQ     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp GFX%rc6  GFXMHz Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
       -       -     110    9.44    1167    2494       0       0   14.44    4.30    0.01   71.81      57      57    0.00       0    4.58    1.66    3.58   35.50    4.14    1.16    0.31
       0       0     113    9.80    1161    2494       0       0   13.99    3.98    0.00   72.23      56      57    0.00       0    4.58    1.66    3.58   35.50    4.14    1.16    0.31
       0       1     103    8.92    1160    2494       0       0   14.87
       1       2     113    9.85    1151    2494       0       0   14.12    4.62    0.02   71.40      57
       1       3     110    9.18    1196    2494       0       0   14.79

        Thread Stats   Avg      Stdev     Max   +/- Stdev
          Latency     2.44ms  833.69us  13.82ms   78.94%
          Req/Sec    26.36     67.81   400.00     85.22%
        6004 requests in 1.00m, 2.35MB read

...we can see the processor was busy (in C-state 0) less than 10% of the time, while it spent over 70% of the time in the deep C7 state, from which it is slow to wake and do useful work.
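
To make that reading of the turbostat table less error-prone, here is a minimal Python sketch that pulls out the summary row (the one whose Core column is "-") and adds up the deeper idle-state residencies. The function names and the SAMPLE string are made up for illustration, and the column layout is assumed to match the output shown above; other machines and turbostat versions print different columns.

```python
# Abbreviated copy of the turbostat output above: the header, the summary row,
# and one per-CPU row (which has fewer fields and is ignored by the parser).
SAMPLE = """
    Core     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz     IRQ     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp GFX%rc6  GFXMHz Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
       -       -     110    9.44    1167    2494       0       0   14.44    4.30    0.01   71.81      57      57    0.00       0    4.58    1.66    3.58   35.50    4.14    1.16    0.31
       0       1     103    8.92    1160    2494       0       0   14.87
"""


def parse_turbostat_summary(text):
    """Return {column name: value} for the summary row (Core == '-')."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    header = rows[0]
    for fields in rows[1:]:
        if fields[0] == "-":
            # Skip the Core and CPU columns, which hold "-" in the summary row.
            return {name: float(value) for name, value in zip(header[2:], fields[2:])}
    raise ValueError("no summary row found")


def deep_idle_percent(summary, shallow=("CPU%c1",)):
    """Sum the CPU%cN residencies, leaving out the states treated as shallow."""
    return sum(
        value
        for name, value in summary.items()
        if name.startswith("CPU%c") and name not in shallow
    )


summary = parse_turbostat_summary(SAMPLE)
print("busy %:", summary["Busy%"])
print("deeper than C1 %:", round(deep_idle_percent(summary), 2))
```

Treating only C1 as "shallow" is a choice made for this sketch; adjust the shallow tuple to whatever states your machine reports.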

"performance" p-state governor

intel_pstate offers two "governors" or modes. Switching it to "performance", with...

$ sudo cpupower frequency-set -g performance

...significantly reduces latency as the CPU is (waves hands) more ready to ramp up and do work when it comes:

    Core     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz     IRQ     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp GFX%rc6  GFXMHz Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
       -       -      99    4.49    2206    2494       0       0   13.02    5.00    0.01   77.47      55      55    0.00       0    5.35    2.06    3.97   48.43    4.40    1.67    0.13
       0       0     104    5.01    2073    2494       0       0   12.72    5.05    0.02   77.20      54      55    0.00       0    5.35    2.06    3.97   48.43    4.40    1.67    0.13
       0       1      93    3.96    2361    2494       0       0   13.77
       1       2     103    4.99    2071    2494       0       0   12.31    4.95    0.00   77.74      55
       1       3      95    4.00    2386    2494       0       0   13.30

        Thread Stats   Avg      Stdev     Max   +/- Stdev
          Latency     1.52ms    1.14ms  33.86ms   97.87%
          Req/Sec    26.48     67.00   333.00     85.14%
        6004 requests in 1.00m, 2.35MB read

Notice that Bzy_MHz (the average clock rate while busy) has roughly doubled, from 1167 to 2206, and is now close to the advertised clock speed.
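
Before trusting a benchmark run it can help to confirm that the governor change actually took effect on every CPU. The following is a rough Python sketch that reads the standard cpufreq sysfs file scaling_governor for each CPU. The helper names are invented here, and the root directory is a parameter only so the function can be pointed at a test directory; on a machine without cpufreq support it simply returns an empty dict.

```python
from pathlib import Path


def read_governors(sysfs_cpu_root="/sys/devices/system/cpu"):
    """Map each CPU directory name (e.g. 'cpu0') to its current governor."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(sysfs_cpu_root).glob(pattern)):
        cpu_name = path.parent.parent.name  # .../cpuN/cpufreq/scaling_governor
        governors[cpu_name] = path.read_text().strip()
    return governors


def cpus_not_in(governors, wanted="performance"):
    """List the CPUs whose governor differs from the one we asked for."""
    return sorted(cpu for cpu, governor in governors.items() if governor != wanted)


if __name__ == "__main__":
    stragglers = cpus_not_in(read_governors())
    if stragglers:
        print("not in performance mode:", ", ".join(stragglers))
    else:
        print("all CPUs with cpufreq report the performance governor")
```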

Prohibit deep sleep states

Finally we can keep the processor from entering deep sleep states with:

$ sudo cpupower idle-set -D10

(Note: -D10 disables every idle state with an exit latency of 10 microseconds or more; you can re-enable all idle states with: sudo cpupower idle-set -E)

    Core     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz     IRQ     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp GFX%rc6  GFXMHz Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
       -       -      68    2.28    3000    2494       0       0   97.72    0.00    0.00    0.00      64      64    0.00       0    0.00    0.00    0.00    0.00    8.74    5.86    0.11
       0       0      71    2.36    3000    2494       0       0   97.64    0.00    0.00    0.00      60      64    0.00       0    0.00    0.00    0.00    0.00    8.74    5.86    0.11
       0       1      67    2.25    3000    2494       0       0   97.75
       1       2      73    2.43    3000    2494       0       0   97.57    0.00    0.00    0.00      64
       1       3      63    2.09    3000    2494       0       0   97.91

        Thread Stats   Avg      Stdev     Max   +/- Stdev
          Latency     1.25ms  404.96us   6.43ms   68.94%
          Req/Sec    25.66     63.97   333.00     84.95%
        6004 requests in 1.00m, 2.35MB read

Here we see a smaller but significant reduction in average latency (1.52ms to 1.25ms), along with a much lower maximum (33.86ms to 6.43ms). We can see that we're spending nearly all our time in the shallower C1 state, ready to wake up quickly.
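
To put numbers on the three runs above, here is a small Python sketch that parses the "Latency" line of each load-generator report and computes the relative change in average latency. The helper names are made up for illustration, and only the us, ms and s suffixes that appear in (or are adjacent to) the output above are handled.

```python
import re

# Conversion factors to milliseconds for the unit suffixes handled here.
_UNIT_TO_MS = {"us": 1e-3, "ms": 1.0, "s": 1e3}


def to_ms(value):
    """Convert a string such as '833.69us' or '2.44ms' to milliseconds."""
    match = re.fullmatch(r"([0-9.]+)(us|ms|s)", value)
    if match is None:
        raise ValueError(f"unrecognised latency value: {value!r}")
    return float(match.group(1)) * _UNIT_TO_MS[match.group(2)]


def parse_latency_line(line):
    """Parse a line like 'Latency  2.44ms  833.69us  13.82ms  78.94%'."""
    fields = line.split()
    return {"avg": to_ms(fields[1]), "stdev": to_ms(fields[2]), "max": to_ms(fields[3])}


def percent_change(before, after):
    """Relative change from before to after, in percent (negative is faster)."""
    return (after - before) / before * 100.0


runs = {
    "no tweaks": parse_latency_line("Latency     2.44ms  833.69us  13.82ms   78.94%"),
    "performance governor": parse_latency_line("Latency     1.52ms    1.14ms  33.86ms   97.87%"),
    "deep sleep disabled": parse_latency_line("Latency     1.25ms  404.96us   6.43ms   68.94%"),
}

baseline = runs["no tweaks"]["avg"]
for name, stats in runs.items():
    change = percent_change(baseline, stats["avg"])
    print(f"{name}: avg {stats['avg']:.2f}ms ({change:+.1f}% vs no tweaks)")
```

Averages alone hide the tail, so the stdev and max fields are kept in the parsed result as well.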

[Graph: latency improvements from these two changes, measured from instrumented graphql-engine]

FYI, here are all AWS instances under $3/hr that allow setting both c-states and p-states:

  Price/hr  Instance      CPUs   ECUs    RAM
  $1.591    c4.8xlarge    36     132     60 GiB
  $1.872    h1.8xlarge    32     99      128 GiB
  $2.00     m4.10xlarge   40     124.5   160 GiB
  $2.128    r4.8xlarge    32     99      244 GiB
  $2.496    i3.8xlarge    32     99      244 GiB

Things that seemed to have little or dubious effect

...but might when e.g. graphql-engine is carefully isolated on CPUs:

disabling ASLR: echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
running with the GHC RTS flags +RTS -qa -qm and/or pinning the parent Haskell process to the CPU it is currently running on with taskset -cp $(ps -F <pid> | awk '{print $7}' | tail -n1) <pid>
I'm still trying to understand what's happening here. Launching a new OS thread (at ~5 us latency) is unlikely to be a bottleneck, but maybe it's associated with some other type of blocking.
disabling transparent hugepages: echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled. This resulted in a 15% regression for small return payloads, and a 15% improvement for large ones. Worth revisiting.
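
Since these knobs are easy to leave in an unexpected state between runs, it may be worth recording them next to each benchmark result. Below is a minimal Python sketch of one way to do that. The function names and the choice of files are only an illustration; the one detail it relies on is that multi-choice files such as transparent_hugepage/enabled list every option and mark the active one in square brackets (e.g. "always madvise [never]"), while files such as randomize_va_space hold a single value.

```python
import re
from pathlib import Path

# The two files touched by the experiments above (an illustrative selection).
DEFAULT_TUNABLES = (
    "/proc/sys/kernel/randomize_va_space",
    "/sys/kernel/mm/transparent_hugepage/enabled",
)


def active_value(text):
    """Return the bracketed choice if there is one, else the stripped text."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else text.strip()


def snapshot_tunables(paths=DEFAULT_TUNABLES):
    """Read each tunable that exists and return {path: active value}."""
    snapshot = {}
    for name in paths:
        path = Path(name)
        if path.is_file():
            snapshot[str(path)] = active_value(path.read_text())
    return snapshot


if __name__ == "__main__":
    for path, value in snapshot_tunables().items():
        print(f"{path} = {value}")
```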

Things that are likely to be effective but not explored

Carefully isolating graphql-engine on particular CPUs. See also the references dump below.
The following is copied from the LLVM benchmarking docs:

Use https://github.com/lpechacek/cpuset to reserve cpus for just the
program you are benchmarking. If using perf, leave at least 2 cores
so that perf runs in one and your program in another::

cset shield -c N1,N2 -k on

This will move all threads out of N1 and N2. The -k on means
that even kernel threads are moved out.

Disable the SMT pair of the cpus you will use for the benchmark. The
pair of cpu N can be found in
/sys/devices/system/cpu/cpuN/topology/thread_siblings_list and
disabled with::

cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | \
awk -F, '{print $2}' | \
sort -n | \
uniq | \
( while read X ; do echo $X ; echo 0 | sudo tee /sys/devices/system/cpu/cpu$X/online ; done )

Run the program with::

cset shield --exec -- perf stat -r 10 <cmd>

This will run the command after -- in the isolated cpus. The
particular perf command runs the <cmd> 10 times and reports
statistics.
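
One caveat about the SMT step quoted above: the awk command splits each thread_siblings_list on commas, but on some machines the kernel prints siblings as a range such as 0-1 rather than 0,1, in which case there is no second comma-separated field to pick up. The following Python sketch is our own illustration (not part of the LLVM docs) of how to work out which CPUs to take offline from either format. It only computes the list; writing 0 to the corresponding online files is left to the shell loop above.

```python
def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0,2', '0-1' or '0-1,8-9' into ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def secondary_siblings(sibling_lists):
    """Return every CPU that is not the lowest-numbered thread of its core.

    sibling_lists holds the contents of each CPU's thread_siblings_list
    file; duplicates are expected, since siblings report the same list.
    """
    to_offline = set()
    for text in sibling_lists:
        cpus = sorted(parse_cpu_list(text))
        to_offline.update(cpus[1:])  # keep the first thread, drop the rest
    return sorted(to_offline)


# Comma style, as on a 2-core/4-thread machine where cpu0 pairs with cpu2:
print(secondary_siblings(["0,2", "1,3", "0,2", "1,3"]))
# Range style, where adjacent CPU numbers share a core:
print(secondary_siblings(["0-1", "0-1", "2-3", "2-3"]))
```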

...and not practical for actual deployment, but with potential for more stable (and slower) latencies

disabling Turbo Boost: echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
using tmpfs for our benchmarking Postgres instance so we never touch disk

Things that seemed to be a waste of time

fiddling with /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq and .../intel_pstate/min_perf_pct. It's not clear if this did anything. It certainly didn't produce a fixed CPU frequency. From https://www.kernel.org/doc/Documentation/cpu-freq/intel-pstate.txt :

For contemporary Intel processors, the frequency is controlled by the processor itself and the P-State exposed to software is related to performance levels. The idea that frequency can be set to a single frequency is fictional for Intel Core processors. Even if the scaling driver selects a single P-State, the actual frequency the processor will run at is selected by the processor itself.

References dump

Controlling placement of processes and isolating them: https://webcache.googleusercontent.com/search?q=cache:SmEke2ayDOAJ:https://access.redhat.com/solutions/2884991+&cd=1&hl=en&ct=clnk&gl=us
Solid info on p-states and c-states: https://metebalci.com/blog/a-minimum-complete-tutorial-of-cpu-power-management-c-states-and-p-states/
Kernel pstate driver docs: https://www.kernel.org/doc/html/v4.12/admin-guide/pm/intel_pstate.html
and https://www.kernel.org/doc/Documentation/cpu-freq/intel-pstate.txt
Restricting to C0 doesn't play nice with hyper-threading:
https://www.codeblueprint.co.uk/2017/03/06/intel_idle-max_cstate-considered-harmful.html
https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/535130
Controlling c-states: https://wiki.ntb.ch/infoportal/_media/embedded_systems/ethercat/controlling_processor_c-state_usage_in_linux_v1.1_nov2013.pdf
Constellation of kernel cargo cult crankery:
https://www.codethink.co.uk/articles/2018/configuring-linux-to-stabilise-latency/
https://gitlab.com/CodethinkLabs/determinism/wikis/Kernel Tunings
http://epickrram.blogspot.com/2015/09/reducing-system-jitter.html
http://epickrram.blogspot.com/2015/11/reducing-system-jitter-part-2.html
