Effect of Intel's power management on web servers

While benchmarking some tweaks to Hasura we noticed some strange behavior. We were measuring latencies for a simple GraphQL request being served by Hasura. Surprisingly Hasura was performing better when there was more load on the system than otherwise.

At 100 requests per second the processor load was low enough that the processor going to sleep was negatively impacting the latency measurements. However, when a browser was open on the same machine, latency measurements improved!

The explanation for this is quite intriguing and is documented in detail in this post by Brandon. This blog post explains that post from a layman’s perspective.

Background

Modern Intel processors have extremely sophisticated power management that modifies the clock frequency and powers up and down subsystems dynamically and constantly. This can happen several times a second.

p-states and c-states

Power management happens both when the processor is idle and otherwise. Idle states of a processor are called C-states and are denoted as C0, C1, C2... C0 is the operating state whereas C1, C2, C3.. are states where some or more of the subsystems are shutdown. Generally the higher the C-state the lower the power consumption but also the longer it takes for the processor to move back to the operating state C0.

We can look at the idle states available on my laptop using the cpupower command:

$ cpupower idle-info
CPUidle driver: intel_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 9
Available idle states: POLL C1 C1E C3 C6 C7s C8 C9 C10
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 5845
Duration: 3482887
C1:
Flags/Description: MWAIT 0x00
Latency: 2
Usage: 155904
Duration: 40263250
C1E:
Flags/Description: MWAIT 0x01
Latency: 10
Usage: 505874
Duration: 163384145
C3:
Flags/Description: MWAIT 0x10
Latency: 70
Usage: 405130
Duration: 115592556
C6:
Flags/Description: MWAIT 0x20
Latency: 85
Usage: 2407019
Duration: 1503034108
C7s:
Flags/Description: MWAIT 0x33
Latency: 124
Usage: 657
Duration: 674710
C8:
Flags/Description: MWAIT 0x40
Latency: 200
Usage: 1694254
Duration: 2067037470
C9:
Flags/Description: MWAIT 0x50
Latency: 480
Usage: 38
Duration: 65699
C10:
Flags/Description: MWAIT 0x60
Latency: 890
Usage: 44032
Duration: 79069478

We can see that my processor has these idle states available: POLL, C1, C1E, C3, C6, C7s, C8, C9, C10. We can also see a Latency field that tells us the maximum time (in µs) the processor will take to go from that state to the operating C0 state. It is possible to enable/disable these states using the cpupower idle-set command.

Similarly when a processor is not idle it will be in one of several P-states denoted by P0, P1, P2, P3. The higher the p-state, the lower the voltage and operating frequency of the processor. This means CPU bound programs will in general execute faster in a lower P-state than a higher p-state. We cannot explicitly set the processor to a particular state. However we can set a CPUFreq governor using the cpupower frequency-set -g <governor> command.

On my laptop I can see the available governors with:

$ cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 400 MHz - 4.00 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 400 MHz and 4.00 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 1.60 GHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: yes

We can see that I can set the governor to performance or powersave and the current governor is powersave.

This article explains the various P-states and C-states that intel processors implement in greater detail.

We can also measure the time spent by the processor in various states using the turbostat command:

$ sudo turbostat --quiet sleep 5
5.004098 sec
Core	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz	IRQ	SMI	C1	C1E	C3	C6	C7s	C8	C9	C10	C1%	C1E%	C3%	C6%	C7s%	C8%	C9%	C10%	CPU%c1	CPU%c3	CPU%c6	CPU%c7	CoreTmp	PkgTmp	GFX%rc6	GFXMHz	Totl%C0	Any%C0	GFX%C0	CPUGFX%	Pkg%pc2	Pkg%pc3	Pkg%pc6	Pkg%pc7	Pkg%pc8	Pkg%pc9	Pk%pc10	PkgWatt	CorWatt	GFXWatt	RAMWatt	PKG_%	RAM_%
-	-	6	0.35	1640	1991	2009	0	1	54	11	78	0	1555	0	523	0.00	0.01	0.00	0.18	0.00	27.94	0.00	71.50	0.95	0.00	0.30	98.39	54	55	99.96	300	2.76	1.83 0.00	0.00	20.55	0.12	0.09	0.15	75.31	0.00	0.00	0.76	0.09	0.00	0.32	0.00	0.00
0	0	12	0.77	1625	1991	730	0	0	2	5	48	0	603	0	90	0.00	0.00	0.01	0.98	0.00	47.68	0.00	50.57	1.25	0.00	0.90	97.07	54	55	99.96	300	2.76	1.83	0.00	0.00	20.56	0.12	0.09	0.15	75.31	0.00	0.00	0.76	0.09	0.00	0.32	0.00	0.00
0	4	4	0.25	1650	1991	154	0	0	0	0	4	0	125	0	70	0.00	0.00	0.00	0.07	0.00	22.48	0.00	77.17	1.77
1	1	2	0.15	1689	1991	95	0	0	8	0	3	0	66	0	32	0.00	0.01	0.00	0.05	0.00	9.76	0.00	90.01	0.86	0.00	0.08	98.91	54
1	5	6	0.35	1626	1991	222	0	0	0	0	4	0	129	0	115	0.00	0.00	0.00	0.06	0.00	18.13	0.00	81.44	0.67
2	2	5	0.31	1672	1991	187	0	0	1	0	6	0	163	0	53	0.00	0.00	0.00	0.10	0.00	39.79	0.00	59.78	1.02	0.01	0.13	98.53	54
2	6	8	0.50	1623	1991	299	0	0	1	6	5	0	246	0	61	0.00	0.00	0.01	0.08	0.00	46.01	0.00	53.39	0.84
3	3	4	0.27	1642	1991	180	0	0	5	0	5	0	139	0	43	0.00	0.01	0.00	0.07	0.00	29.59	0.00	70.05	0.58	0.00	0.10	99.06	54
3	7	3	0.19	1664	1992	142	0	1	37	0	3	0	84	0	59	0.00	0.05	0.00	0.05	0.00	10.11	0.00	89.58	0.66

In the above output we can see that the processor spends < 0.5% in C0 state (Busy%) because we are executing the sleep command and there is not much else happening on the machine. The bulk of the time is spent in C7 state (CPU%c7). This man page explains the various fields that turbostat will output.

Processor states while Hasura serves a GraphQL Request

How does all of this affect Hasura's performance you ask? Let us look at power management states of the machine while Hasura executes a simple GraphQL query. The below measurements were made on my laptop running Intel i7-8550U (1.8GHz turbo boost to 4GHz).

Both Hasura and Postgres were running on the same machine and a simple GraphQL query fetching about 5 rows was used. I've used wrk2 - launched using turbostat - to run a constant load of 100RPS.

Here's the output from the wrk2 command:

Running 1m test @ http://localhost:8181/v1/graphql
  1 threads and 10 connections
  Thread calibration: mean lat.: 2.489ms, rate sampling interval: 10ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.49ms  521.70us  12.94ms   75.29%
    Req/Sec   105.63    102.66   363.00     37.27%
  Latency Distribution (HdrHistogram - Recorded Latency)
 50.000%    2.52ms
 75.000%    2.80ms
 90.000%    3.03ms
 99.000%    3.53ms
 99.900%    5.90ms
 99.990%   12.94ms
 99.999%   12.94ms
100.000%   12.94ms

Here's the output from turbostat:

60.011521 sec
Core	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz	IRQ	SMI	C1	C1E	C3	C6	C7s	C8	C9	C10	C1%	C1E%	C3%	C6%	C7s%	C8%	C9%	C10%	CPU%c1	CPU%c3	CPU%c6	CPU%c7	CoreTmp	PkgTmp	GFX%rc6	GFXMHz	Totl%C0	Any%C0	GFX%C0	CPUGFX%	Pkg%pc2	Pkg%pc3	Pkg%pc6	Pkg%pc7	Pkg%pc8	Pkg%pc9	Pk%pc10	PkgWatt	CorWatt	GFXWatt	RAMWatt	PKG_%	RAM_%
-	-	50	3.00	1669	1992	70009	0	261	1971	1526	11017	34	71007	4	17698	0.01	0.04	0.06	1.62	0.02	49.45	0.00	45.85	5.20	0.07	2.63	89.10	39	40	100.00	300	24.38	20.60	0.00	0.00	25.92	1.42	0.42	0.11	47.07	0.00	0.00	1.15	0.46	0.00	0.43	0.00	0.00
0	0	50	3.08	1629	1992	8822	0	32	210	185	1225	8	9399	2	2005	0.00	0.03	0.05	1.39	0.03	51.97	0.00	43.49	5.23	0.07	2.56	89.07	39	40	100.00	300	24.38	20.60	0.00	0.00	25.92	1.42	0.42	0.11	47.07	0.00	0.00	1.15	0.46	0.00	0.43	0.00	0.00
0	4	50	3.07	1615	1992	9119	0	49	259	216	1523	5	9173	0	2120	0.00	0.04	0.06	1.81	0.03	48.68	0.00	46.35	5.24
1	1	60	3.29	1807	1992	8405	0	20	171	191	1201	0	8406	0	2056	0.01	0.02	0.06	1.35	0.00	49.63	0.00	45.67	4.86	0.07	2.42	89.36	39
1	5	45	2.75	1653	1992	8734	0	39	385	191	1384	8	8161	0	2498	0.00	0.07	0.05	1.64	0.02	44.75	0.00	50.75	5.40
2	2	52	3.15	1644	1992	9181	0	29	269	184	1436	3	9256	0	2008	0.00	0.07	0.06	1.70	0.02	50.97	0.00	44.07	5.19	0.07	2.49	89.10	39
2	6	53	2.99	1756	1992	7914	0	25	265	183	1138	2	8432	0	2248	0.01	0.03	0.05	1.34	0.01	49.80	0.00	45.80	5.35
3	3	45	2.80	1616	1992	8336	0	25	209	150	1401	6	8112	0	2663	0.00	0.05	0.05	1.66	0.02	44.79	0.00	50.68	5.20	0.07	3.06	88.88	39
3	7	46	2.85	1617	1992	9498	0	42	203	226	1709	2	10068	2	2100	0.01	0.03	0.06	2.12	0.00	55.00	0.02	39.95	5.14

We can see the processor was busy i.e in C-state 0 less than 3.5% of the time, while it spent about 90% of the time in the deep C-7 state, which is slow to wake and do useful work.

Essentially at 100 RPS Hasura takes so little of the processor time that the processor spends most of the time in deep sleep 🤓 That in turn increases the latency of requests because it takes time for the processor to wake up and serve the request.

Fine tuning processor states

"performance" p-state governor

How does using the performance governor improve the situation? Switching it to performance mode with...

$ sudo cpupower frequency-set -g performance

...reduces latency as the CPU is more ready to ramp up and do work when it comes:

wrk2 output:

Running 1m test @ http://localhost:8181/v1/graphql
  1 threads and 10 connections
  Thread calibration: mean lat.: 2.002ms, rate sampling interval: 10ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.94ms  579.10us   9.26ms   68.44%
    Req/Sec   105.33    100.88   333.00     43.93%
  Latency Distribution (HdrHistogram - Recorded Latency)
 50.000%    1.90ms
 75.000%    2.34ms
 90.000%    2.62ms
 99.000%    3.21ms
 99.900%    5.98ms
 99.990%    9.27ms
 99.999%    9.27ms
100.000%    9.27ms

We see better latencies at pretty much every percentile.

turbostat output:

60.006307 sec
Core	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz	IRQ	SMI	C1	C1E	C3	C6	C7s	C8	C9	C10	C1%	C1E%	C3%	C6%	C7s%	C8%	C9%	C10%	CPU%c1	CPU%c3	CPU%c6	CPU%c7	CoreTmp	PkgTmp	GFX%rc6	GFXMHz	Totl%C0	Any%C0	GFX%C0	CPUGFX%	Pkg%pc2	Pkg%pc3	Pkg%pc6	Pkg%pc7	Pkg%pc8	Pkg%pc9	Pk%pc10	PkgWatt	CorWatt	GFXWatt	RAMWatt	PKG_%	RAM_%
-	-	67	1.72	3930	1992	67347	0	227	1648	1301	9135	32	72585	2	17162	0.01	0.04	0.08	1.62	0.01	50.84	0.00	45.66	8.01	0.10	2.23	87.95	49	49	100.01	300	13.73	11.58	0.00	0.00	19.86	1.41	0.35	0.04	52.64	0.00	0.00	2.49	1.69	0.00	0.64	0.00	0.00
0	0	62	1.58	3923	1992	9298	0	34	167	175	1277	3	9688	0	1967	0.02	0.05	0.08	1.92	0.01	54.27	0.00	42.06	7.76	0.07	2.17	88.42	49	49	100.01	300	13.73	11.58	0.00	0.00	19.86	1.41	0.35	0.04	52.64	0.00	0.00	2.49	1.69	0.00	0.64	0.00	0.00
0	4	64	1.64	3924	1992	7662	0	36	196	144	917	2	8251	0	2133	0.00	0.02	0.06	1.19	0.01	47.09	0.00	49.97	7.69
1	1	71	1.81	3932	1992	8996	0	26	232	196	1253	4	9553	1	2035	0.00	0.05	0.09	1.73	0.01	53.37	0.00	42.92	8.42	0.08	2.25	87.44	46
1	5	68	1.74	3933	1992	8754	0	35	224	136	1132	5	9662	0	2293	0.02	0.04	0.06	1.59	0.01	53.98	0.00	42.54	8.48
2	2	70	1.78	3937	1992	7611	0	20	200	149	1063	3	8247	0	2260	0.00	0.02	0.08	1.53	0.00	47.71	0.00	48.85	7.35	0.08	1.78	89.01	47
2	6	60	1.54	3918	1992	7920	0	25	210	134	780	2	8559	1	1993	0.00	0.04	0.06	1.05	0.01	51.94	0.00	45.33	7.58
3	3	74	1.88	3937	1992	8414	0	25	259	164	1262	3	9352	0	2064	0.00	0.10	0.07	1.78	0.02	49.07	0.00	47.05	8.32	0.16	2.72	86.91	48
3	7	69	1.76	3930	1992	8692	0	26	160	203	1451	10	9273	0	2417	0.00	0.03	0.17	2.16	0.01	49.29	0.00	46.56	8.45

Notice Bzy_MHz is now close to the maximum advertised frequency of 4GHz (with turboboost).

Prohibit idle states

We can keep the processor from entering deep sleep states with:

$ sudo cpupower idle-set -D2

This will disable any idle state with a latency >= 2µs which on my laptop is all idle states.

Here's the results from wrk2:

Running 1m test @ http://localhost:8181/v1/graphql
  1 threads and 10 connections
  Thread calibration: mean lat.: 0.995ms, rate sampling interval: 10ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.13ms  353.31us   8.95ms   79.16%
    Req/Sec   108.45    104.68   333.00     19.19%
  Latency Distribution (HdrHistogram - Recorded Latency)
 50.000%    1.20ms
 75.000%    1.32ms
 90.000%    1.41ms
 99.000%    1.79ms
 99.900%    4.33ms
 99.990%    8.96ms
 99.999%    8.96ms
100.000%    8.96ms

Again we can see an improvement at almost every percentile.

Turbostat output:

60.005077 sec
Core	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz	IRQ	SMI	C1	C1E	C3	C6	C7s	C8	C9	C10	C1%	C1E%	C3%	C6%	C7s%	C8%	C9%	C10%	CPU%c1	CPU%c3	CPU%c6	CPU%c7	CoreTmp	PkgTmp	GFX%rc6	GFXMHz	Totl%C0	Any%C0	GFX%C0	CPUGFX%	Pkg%pc2	Pkg%pc3	Pkg%pc6	Pkg%pc7	Pkg%pc8	Pkg%pc9	Pk%pc10	PkgWatt	CorWatt	GFXWatt	RAMWatt	PKG_%	RAM_%
-	-	3685	99.76	3693	1992	68091	16	0	0	0	0	0	0	0	0	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.24	0.00	0.00	0.00	85	84	99.92	300	398.38	99.59	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	19.73	18.80	0.00	0.35	0.00	0.00
0	0	3685	99.76	3693	1992	9292	2	0	0	0	0	0	0	0	0	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.24	0.00	0.00	0.00	81	84	99.92	300	398.38	99.59	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	19.73	18.80	0.00	0.35	0.00	0.00
0	4	3685	99.76	3693	1992	7130	2	0	0	0	0	0	0	0	0	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.24
1	1	3685	99.76	3693	1992	9178	2	0	0	0	0	0	0	0	0	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.24	0.00	0.00	0.00	80
1	5	3685	99.76	3693	1992	9541	2	0	0	0	0	0	0	0	0	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.24
2	2	3685	99.76	3693	1992	7925	2	0	0	0	0	0	0	0	0	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.24	0.00	0.00	0.00	85
2	6	3685	99.76	3693	1992	8344	2	0	0	0	0	0	0	0	0	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.24
3	3	3685	99.76	3693	1992	9789	2	0	0	0	0	0	0	0	0	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.24	0.00	0.00	0.00	80
3	7	3685	99.76	3693	1992	6892	2	0	0	0	0	0	0	0	0	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.24

As expected the Busy% is now close to the 100%.

Conclusion

In this article we have seen that processor power management has significant effect on response times with Hasura. Using the performance governor and disabling idle states improves the average latency by almost 50%. A couple of caveats to be aware of though:

With wrk2's, measured latencies are [only] accurate to a +/- ~1 ms granularity, due to OS sleep time behavior.
Disabling idle states/using the performance governor has the effect of also consuming more power. In particular disabling all idle states means that the CPU is running at peak frequency continuously. You might want to experiment with disabling some of the deeper C-states and leaving the shallower ones on.

We have also not tried these in production and these are not official recommendations. However, if you do try these techniques out, let us know of your experience on Discord or tweet to us at @HasuraHQ

Gautam BT

Gautam BT is a senior technical evangel...

14 Apr, 2020

13 MIN READ

Blog

14 Apr, 2020