Decreasing latency noise and maximizing performance during end-to-end benchmarking

While benchmarking some tweaks to graphql-engine, we noticed confusing, inconsistent results that led us down a bit of a rabbit hole. This is a summary of what we've learned so far, what is left to try, and what we tried that didn't seem to have much effect.

We're starting to explore two interrelated things:

  • understanding how to maximize the performance of a running graphql-engine generally; this reduces noise for benchmarking purposes, but is also obviously useful as best practice to share with users, or to help them measure performance
  • understanding how to run graphql-engine so that latency is consistent (though possibly much slower), so that we can be confident about whether a change we've made was actually beneficial

Some of the results here probably apply to production deployments of graphql-engine (and other services) as well; but we don't have concrete recommendations there yet. If you have success with any of these techniques in production, let us know!

Things that were very effective

Background: modern Intel processors have extremely sophisticated power management that modifies the clock frequency and powers subsystems up and down dynamically and constantly (many times per second). Modern Linux exposes knobs for some amount of control over how all this behaves: the intel_idle driver allows some control over C-states (processor idle states), while intel_pstate deals with P-states. See "Further reading" at the bottom for more.
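
To confirm which drivers are in play on a given machine, you can read a couple of sysfs files (a quick check; these are the standard paths on recent kernels):

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver    # "intel_pstate" on a machine using it
$ cat /sys/devices/system/cpu/cpuidle/current_driver         # likewise "intel_idle"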

I was taken by surprise to find that benchmarking graphql-engine at 100 RPS was such a light load that power management was skewing my measurements dramatically. Worse, it was actively misleading me: simply having the browser open (i.e. adding more load) caused a dramatic performance improvement, making me think I'd made a performance breakthrough!

We can use turbostat (from the cpupower package) to look at the power-management states of the machine while it runs some load, with:

$ turbostat --interval 0.1 sleep 120
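
Rather than sampling alongside a fixed sleep, turbostat can also fork the load generator itself and report for exactly that window. A sketch, assuming the 100 RPS load is driven by wrk2 (whose binary is typically named wrk; the URL, thread, and connection counts here are made up for illustration):

$ sudo turbostat wrk -t2 -c16 -d60s -R100 --latency http://127.0.0.1:8080/v1/graphql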

For my laptop (with Core i7-3667U), without tweaks, running the 100 RPS benchmark:

    Core     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz     IRQ     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp GFX%rc6  GFXMHz Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
       -       -     110    9.44    1167    2494       0       0   14.44    4.30    0.01   71.81      57      57    0.00       0    4.58    1.66    3.58   35.50    4.14    1.16    0.31
       0       0     113    9.80    1161    2494       0       0   13.99    3.98    0.00   72.23      56      57    0.00       0    4.58    1.66    3.58   35.50    4.14    1.16    0.31
       0       1     103    8.92    1160    2494       0       0   14.87
       1       2     113    9.85    1151    2494       0       0   14.12    4.62    0.02   71.40      57
       1       3     110    9.18    1196    2494       0       0   14.79

        Thread Stats   Avg      Stdev     Max   +/- Stdev
          Latency     2.44ms  833.69us  13.82ms   78.94%
          Req/Sec    26.36     67.81   400.00     85.22%
        6004 requests in 1.00m, 2.35MB read

...we can see the processor was busy (in C0) less than 10% of the time, while it spent over 70% of the time in the deep C7 state, from which it is slow to wake and do useful work.

"performance" p-state governor

intel_pstate offers two "governors" or modes: "powersave" (the default) and "performance". Switching it to "performance", with...

$ sudo cpupower frequency-set -g performance

...significantly reduces latency as the CPU is (waves hands) more ready to ramp up and do work when it comes:

    Core     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz     IRQ     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp GFX%rc6  GFXMHz Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
       -       -      99    4.49    2206    2494       0       0   13.02    5.00    0.01   77.47      55      55    0.00       0    5.35    2.06    3.97   48.43    4.40    1.67    0.13
       0       0     104    5.01    2073    2494       0       0   12.72    5.05    0.02   77.20      54      55    0.00       0    5.35    2.06    3.97   48.43    4.40    1.67    0.13
       0       1      93    3.96    2361    2494       0       0   13.77
       1       2     103    4.99    2071    2494       0       0   12.31    4.95    0.00   77.74      55
       1       3      95    4.00    2386    2494       0       0   13.30

        Thread Stats   Avg      Stdev     Max   +/- Stdev
          Latency     1.52ms    1.14ms  33.86ms   97.87%
          Req/Sec    26.48     67.00   333.00     85.14%
        6004 requests in 1.00m, 2.35MB read

Notice Bzy_MHz is close to the advertised clock speed here.
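
If you prefer to poke sysfs directly, the same switch can be made by writing to the cpufreq scaling_governor files (a sketch; these are the standard paths):

$ echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor    # should now read "performance"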

Prohibit deep sleep states

Finally we can keep the processor from entering deep sleep states with:

$ sudo cpupower idle-set -D10

...which disables every idle state with an exit latency greater than 10 microseconds. (Note: you can re-enable all idle states with sudo cpupower idle-set -E.)

    Core     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz     IRQ     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp GFX%rc6  GFXMHz Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
       -       -      68    2.28    3000    2494       0       0   97.72    0.00    0.00    0.00      64      64    0.00       0    0.00    0.00    0.00    0.00    8.74    5.86    0.11
       0       0      71    2.36    3000    2494       0       0   97.64    0.00    0.00    0.00      60      64    0.00       0    0.00    0.00    0.00    0.00    8.74    5.86    0.11
       0       1      67    2.25    3000    2494       0       0   97.75
       1       2      73    2.43    3000    2494       0       0   97.57    0.00    0.00    0.00      64
       1       3      63    2.09    3000    2494       0       0   97.91

        Thread Stats   Avg      Stdev     Max   +/- Stdev
          Latency     1.25ms  404.96us   6.43ms   68.94%
          Req/Sec    25.66     63.97   333.00     84.95%
        6004 requests in 1.00m, 2.35MB read

Here we see a smaller but significant reduction in latency. We can see that we're spending nearly all our time in the shallower C1 state, ready to wake up quickly.
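
For finer-grained control, individual idle states can also be toggled through sysfs. A sketch that disables everything deeper than C1 on cpu0 (state numbering varies by machine, so check each state's name file first):

$ cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
$ for s in /sys/devices/system/cpu/cpu0/cpuidle/state[2-9]; do echo 1 | sudo tee $s/disable; done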

We can graph latency improvements from these two changes. Here we're measuring from instrumented graphql-engine:

[plot: latency distribution from instrumented graphql-engine, before and after the governor and idle-state changes]

FYI, here are all the AWS instance types under $3/hr that allow setting both C-states and P-states:

  Price/hr   Instance      vCPUs   ECUs    RAM
  $1.591     c4.8xlarge    36      132      60 GiB
  $1.872     h1.8xlarge    32       99     128 GiB
  $2.00      m4.10xlarge   40      124.5   160 GiB
  $2.128     r4.8xlarge    32       99     244 GiB
  $2.496     i3.8xlarge    32       99     244 GiB

Things that seemed to have little or dubious effect

...but might have an effect when, e.g., graphql-engine is carefully isolated on particular CPUs:

  • echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
  • +RTS -qa -qm and/or pinning the parent Haskell process with taskset -cp $(ps -F <pid> | awk '{print $7}' | tail -n1) <pid> (see the sketch after this list).
    I'm still trying to understand what's happening here; launching a new OS thread (at ~5 µs latency) is unlikely to be a bottleneck, but maybe it's associated with some other type of blocking
  • echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled: this resulted in a 15% regression for small return payloads, and a 15% improvement for large ones. Worth revisiting.
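
A sketch of the RTS-flags-plus-pinning combination above (the -N value and core numbers are arbitrary, and graphql-engine's own serve flags are omitted; -qa asks the GHC RTS to pin its capabilities to CPU cores, -qm disables automatic thread migration):

$ graphql-engine serve +RTS -N2 -qa -qm -RTS &
$ taskset -acp 2,3 $!    # -a applies the affinity mask to all existing threads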

Things that are likely to be effective but not explored

Carefully isolating graphql-engine on particular CPUs. See also "Further reading" below.
The following is copied (lightly reformatted) from the LLVM benchmarking docs:

Use https://github.com/lpechacek/cpuset to reserve CPUs for just the program you are benchmarking. If using perf, leave at least 2 cores so that perf runs in one and your program in another:

$ cset shield -c N1,N2 -k on

This will move all threads out of N1 and N2. The -k on means that even kernel threads are moved out.

Disable the SMT pair of the CPUs you will use for the benchmark. The sibling of CPU N can be found in /sys/devices/system/cpu/cpuN/topology/thread_siblings_list and disabled with:

$ cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | \
    awk -F, '{print $2}' | \
    sort -n | \
    uniq | \
    ( while read X ; do echo $X ; echo 0 | sudo tee /sys/devices/system/cpu/cpu$X/online ; done )

Run the program with:

$ cset shield --exec -- perf stat -r 10 <cmd>

This will run the command after -- on the isolated CPUs. This particular perf invocation runs <cmd> 10 times and reports statistics.

...and, while not practical for actual deployment, with potential for more stable (and slower) latencies:

  • disable turboboost: echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
  • consider using tmpfs for our benchmarking postgres instance so we never touch disk (a sketch follows below)
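
A sketch of the tmpfs idea, assuming a local PostgreSQL installation (the mount point, size, and port are made up, and everything on the tmpfs is lost at unmount or reboot):

$ sudo mount -t tmpfs -o size=2G tmpfs /mnt/pg-bench
$ sudo chown $USER /mnt/pg-bench
$ initdb -D /mnt/pg-bench/data
$ pg_ctl -D /mnt/pg-bench/data -o "-p 5433" start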

Things that seemed to be a waste of time

Fiddling with /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq and .../intel_pstate/min_perf_pct. It's not clear this did anything; it certainly didn't produce a fixed CPU frequency. From https://www.kernel.org/doc/Documentation/cpu-freq/intel-pstate.txt :

For contemporary Intel processors, the frequency is controlled by the processor itself and the P-State exposed to software is related to performance levels.  The idea that frequency can be set to a single frequency is fictional for Intel Core processors. Even if the scaling driver selects a single P-State, the actual frequency the processor will run at is selected by the processor itself.
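
For reference, the fiddling in question looked roughly like this (the values are illustrative; scaling_min_freq is in kHz, and min_perf_pct is a percentage of maximum performance):

$ echo 3000000 | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
$ echo 100 | sudo tee /sys/devices/system/cpu/intel_pstate/min_perf_pct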

Further reading

We have explored this topic further in this blog post: https://hasura.io/blog/effect-of-intels-power-management-on-webservers/
