Decreasing latency noise and maximizing performance during end-to-end benchmarking
I was surprised to find that benchmarking graphql-engine at 100 RPS was such a light load that power management was skewing my measurements dramatically. Worse, it was misleading me: having the browser open (i.e. adding load) actually caused a dramatic performance improvement, which I briefly mistook for a performance breakthrough. So I became interested in two things:
- understanding how to maximize the performance of a running graphql-engine in general; this is a kind of noise reduction for benchmarking purposes, but it is also obviously useful as best practice to share with users, or to help them measure performance themselves
- understanding how to run graphql-engine so that latency is consistent (even if possibly much slower), so that we can be confident about whether a change we've made was actually beneficial
Things that were very effective
For reference, a baseline measurement with the default power-management settings, taken by running turbostat while the benchmark was running:
$ turbostat --interval 0.1 sleep 120
Core  CPU  Avg_MHz  Busy%  Bzy_MHz  TSC_MHz  IRQ  SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7  CoreTmp  PkgTmp  GFX%rc6  GFXMHz  Pkg%pc2  Pkg%pc3  Pkg%pc6  Pkg%pc7  PkgWatt  CorWatt  GFXWatt
-     -    110      9.44   1167     2494     0    0    14.44   4.30    0.01    71.81   57       57      0.00     0       4.58     1.66     3.58     35.50    4.14     1.16     0.31
0     0    113      9.80   1161     2494     0    0    13.99   3.98    0.00    72.23   56       57      0.00     0       4.58     1.66     3.58     35.50    4.14     1.16     0.31
0     1    103      8.92   1160     2494     0    0    14.87
1     2    113      9.85   1151     2494     0    0    14.12   4.62    0.02    71.40   57
1     3    110      9.18   1196     2494     0    0    14.79

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.44ms  833.69us  13.82ms   78.94%
    Req/Sec     26.36    67.81    400.00    85.22%
  6004 requests in 1.00m, 2.35MB read
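The Latency and Req/Sec numbers in these snippets come from a fixed-rate load generator run over the same window that turbostat is sampling. A minimal sketch of that side of the setup, assuming wrk2 as the load generator (its -R flag holds throughput at a constant rate; the binary built from wrk2 is usually still named wrk). The thread/connection counts, endpoint, and lua script name here are illustrative placeholders, not the exact ones used:

# constant ~100 req/s for one minute against a local graphql-engine, while
# turbostat (above) samples frequency and C-state residency in another terminal
$ wrk -t4 -c16 -d60s -R100 --latency -s post-query.lua http://localhost:8080/v1/graphql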
"performance" p-state governor
$ sudo cpupower frequency-set -g performance
Core  CPU  Avg_MHz  Busy%  Bzy_MHz  TSC_MHz  IRQ  SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7  CoreTmp  PkgTmp  GFX%rc6  GFXMHz  Pkg%pc2  Pkg%pc3  Pkg%pc6  Pkg%pc7  PkgWatt  CorWatt  GFXWatt
-     -    99       4.49   2206     2494     0    0    13.02   5.00    0.01    77.47   55       55      0.00     0       5.35     2.06     3.97     48.43    4.40     1.67     0.13
0     0    104      5.01   2073     2494     0    0    12.72   5.05    0.02    77.20   54       55      0.00     0       5.35     2.06     3.97     48.43    4.40     1.67     0.13
0     1    93       3.96   2361     2494     0    0    13.77
1     2    103      4.99   2071     2494     0    0    12.31   4.95    0.00    77.74   55
1     3    95       4.00   2386     2494     0    0    13.30

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.52ms    1.14ms   33.86ms   97.87%
    Req/Sec     26.48    67.00     333.00    85.14%
  6004 requests in 1.00m, 2.35MB read
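To double-check that the governor change took effect, and to revert it afterwards, the cpufreq sysfs files can be read directly (powersave is the usual default under the intel_pstate driver; your distribution's default may differ):

$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor   # should now print "performance" for every CPU
$ cpupower frequency-info                                     # shows the driver, current governor, and frequency limits
$ sudo cpupower frequency-set -g powersave                    # revert when done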
Prohibit deep sleep states
$ sudo cpupower idle-set -D10
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp GFX%rc6 GFXMHz Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt - - 68 2.28 3000 2494 0 0 97.72 0.00 0.00 0.00 64 64 0.00 0 0.00 0.00 0.00 0.00 8.74 5.86 0.11 0 0 71 2.36 3000 2494 0 0 97.64 0.00 0.00 0.00 60 64 0.00 0 0.00 0.00 0.00 0.00 8.74 5.86 0.11 0 1 67 2.25 3000 2494 0 0 97.75 1 2 73 2.43 3000 2494 0 0 97.57 0.00 0.00 0.00 64 1 3 63 2.09 3000 2494 0 0 97.91 Thread Stats Avg Stdev Max +/- Stdev Latency 1.25ms 404.96us 6.43ms 68.94% Req/Sec 25.66 63.97 333.00 84.95% 6004 requests in 1.00m, 2.35MB read
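The -D10 above disables the deeper idle states by their exit latency, leaving only the shallowest ones (consistent with the ~98% CPU%c1 residency and the zeroed c3/c6/c7 columns). cpupower can show exactly what got disabled and undo it afterwards:

$ cpupower idle-info           # lists each C-state, its latency, and whether it is currently disabled
$ sudo cpupower idle-set -E    # re-enable all idle states when done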
Candidate EC2 instance types (price per hour):

  $/hr     Instance      vCPUs   ECUs    RAM
  $1.591   c4.8xlarge    36      132     60 GiB
  $1.872   h1.8xlarge    32      99      128 GiB
  $2.00    m4.10xlarge   40      124.5   160 GiB
  $2.128   r4.8xlarge    32      99      244 GiB
  $2.496   i3.8xlarge    32      99      244 GiB
Things that seemed to have little or dubious effect
- echo 0 | sudo tee /proc/sys/kernel/randomize_va_space (disables ASLR)
- +RTS -qa -qm (GHC RTS options: use the OS's affinity facilities to pin the RTS's OS threads to cores, and disable automatic thread migration between capabilities), and/or pinning the parent haskell process with taskset -cp $(ps -F <pid> | awk '{print $7}' | tail -n1) <pid> (see the sketch after this list). I'm still trying to understand what's happening here, but launching a new OS thread (at ~5 us latency) is unlikely to be a bottleneck; maybe it's associated with some other type of blocking.
- echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled: this resulted in a 15% regression for small return payloads, and a 15% improvement for large ones. Worth revisiting.
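For reference, a sketch of how the pinning experiments above can be combined at launch time. The core numbers and the serve invocation are illustrative (database flags omitted), and passing +RTS ... -RTS on the command line assumes the binary was built with -rtsopts enabled:

# restrict the whole process to cores 2-3, with thread affinity and no migration between capabilities
$ taskset -c 2,3 graphql-engine serve +RTS -N2 -qa -qm -RTS

# or pin an already-running process to the core it last ran on (same idea as the ps -F pipeline above)
$ sudo taskset -cp "$(ps -o psr= -p <pid>)" <pid>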
Things that are likely to be effective but not explored
Use https://github.com/lpechacek/cpuset to reserve CPUs for just the program you are benchmarking. If using perf, leave at least 2 cores so that perf runs on one and your program on another:

cset shield -c N1,N2 -k on

This will move all threads out of N1 and N2. The -k on means that even kernel threads are moved out.
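cset can also report the current shield and tear it down again when you're done (standard cset subcommands, nothing specific to this setup):

$ sudo cset shield            # with no arguments, reports the current shield state
$ sudo cset shield --reset    # remove the shield so the kernel can schedule on those CPUs again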
Disable the SMT pair of the CPUs you will use for the benchmark. The pair of cpu N can be found in /sys/devices/system/cpu/cpuN/topology/thread_siblings_list and disabled with:
cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | \
awk -F, '{print $2}' | \
sort -n | \
uniq | \
( while read X ; do echo $X ; echo 0 | sudo tee /sys/devices/system/cpu/cpu$X/online ; done )
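To undo this after the run, bring the offlined siblings back up; writing 1 to a CPU that is already online is harmless, and cpu0 normally has no online file so the glob simply skips it:

$ for f in /sys/devices/system/cpu/cpu*/online ; do echo 1 | sudo tee "$f" ; done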
Run the program with:
cset shield --exec -- perf stat -r 10 <cmd>
This will run the command after -- on the isolated CPUs. The particular perf command runs the <cmd> 10 times and reports statistics.
...and not practical for actual deployment, but with potential for more stable (and slower) latencies
- disable Turbo Boost:
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
- consider using tmpfs for our benchmarking postgres instance so we never touch disk (see the sketch after this list)
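A minimal sketch of the tmpfs idea, assuming a standard postgres OS user and the stock initdb/pg_ctl tools; the mount point, size, and port are placeholders:

$ sudo mkdir -p /mnt/pg-tmpfs
$ sudo mount -t tmpfs -o size=2G,mode=0700 tmpfs /mnt/pg-tmpfs
$ sudo chown postgres:postgres /mnt/pg-tmpfs
$ sudo -u postgres initdb -D /mnt/pg-tmpfs/data
$ sudo -u postgres pg_ctl -D /mnt/pg-tmpfs/data -o '-p 5433' -l /tmp/pg-tmpfs.log start
# point graphql-engine's --database-url at port 5433; everything here disappears on unmount or reboot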
Things that seemed to be a waste of time
Trying to force a single fixed clock frequency. As the kernel's intel_pstate documentation (linked in the references below) explains: "For contemporary Intel processors, the frequency is controlled by the processor itself and the P-State exposed to software is related to performance levels. The idea that frequency can be set to a single frequency is fictional for Intel Core processors. Even if the scaling driver selects a single P-State, the actual frequency the processor will run at is selected by the processor itself."
References dump
- Controlling placement of processes and isolating them: https://webcache.googleusercontent.com/search?q=cache:SmEke2ayDOAJ:https://access.redhat.com/solutions/2884991+&cd=1&hl=en&ct=clnk&gl=us
- solid info on p/c-states https://metebalci.com/blog/a-minimum-complete-tutorial-of-cpu-power-management-c-states-and-p-states/
- Kernel pstate driver docs: https://www.kernel.org/doc/html/v4.12/admin-guide/pm/intel_pstate.html and https://www.kernel.org/doc/Documentation/cpu-freq/intel-pstate.txt
- restricting to C0 doesn't play nice with HT!:
  https://www.codeblueprint.co.uk/2017/03/06/intel_idle-max_cstate-considered-harmful.html
  https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/535130
- controlling C-states: https://wiki.ntb.ch/infoportal/_media/embedded_systems/ethercat/controlling_processor_c-state_usage_in_linux_v1.1_nov2013.pdf
- constellation of kernel cargo cult crankery:
https://www.codethink.co.uk/articles/2018/configuring-linux-to-stabilise-latency/
https://gitlab.com/CodethinkLabs/determinism/wikis/Kernel Tunings
http://epickrram.blogspot.com/2015/09/reducing-system-jitter.html
http://epickrram.blogspot.com/2015/11/reducing-system-jitter-part-2.html