On 2021/6/27 14:07, Michael S. Tsirkin wrote:
On Fri, Jun 25, 2021 at 05:20:10PM +0800, Yunsheng Lin wrote:
On 2021/6/25 14:39, Michael S. Tsirkin wrote:
On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
Currently r->queue[] is cleared after r->consumer_head is moved forward, which makes the __ptr_ring_empty() checking called in page_pool_refill_alloc_cache() unreliable if the checking is done after the r->queue clearing and before the consumer_head moving forward.
Move the r->queue[] clearing after consumer_head moving forward to make __ptr_ring_empty() checking more reliable.
As a side effect of above change, a consumer_head checking is avoided for the likely case, and it has noticeable performance improvement when it is tested using the ptr_ring_test selftest added in the previous patch.
Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000" to test the case of single thread doing both the enqueuing and dequeuing:
arch unpatched patched delta arm64 4648 ms 4464 ms +3.9% X86 2562 ms 2401 ms +6.2%
Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000" to test the case of one thread doing enqueuing and another thread doing dequeuing concurrently, also known as single-producer/single- consumer:
arch unpatched patched delta arm64 3624 ms + 3624 ms 3462 ms + 3462 ms +4.4% x86 2758 ms + 2758 ms 2547 ms + 2547 ms +7.6%
Nice but it's small - could be a fluke. How many tests did you run? What is the variance? Did you try pinning to different CPUs to observe numa effects? Please use perf or some other modern tool for this kind of benchmark. Thanks!
The result is quite stable, and retest using perf stat:
How stable exactly? Try with -r so we can find out.
Retest with "perf stat -r":
For unpatched one: Performance counter stats for './ptr_ring_test -s 1000 -m 1 -N 100000000' (100 runs):
6780.97 msec task-clock # 2.000 CPUs utilized ( +- 5.36% ) 73 context-switches # 0.011 K/sec ( +- 5.07% ) 0 cpu-migrations # 0.000 K/sec ( +-100.00% ) 81 page-faults # 0.012 K/sec ( +- 0.76% ) 17629544748 cycles # 2.600 GHz ( +- 5.36% ) 25496488950 instructions # 1.45 insn per cycle ( +- 0.26% ) <not supported> branches 11489031 branch-misses ( +- 1.69% )
3.391 +- 0.182 seconds time elapsed ( +- 5.35% )
For patched one: Performance counter stats for './ptr_ring_test_opt -s 1000 -m 1 -N 100000000' (100 runs):
6567.83 msec task-clock # 2.000 CPUs utilized ( +- 5.53% ) 71 context-switches # 0.011 K/sec ( +- 5.26% ) 0 cpu-migrations # 0.000 K/sec 82 page-faults # 0.012 K/sec ( +- 0.85% ) 17075489298 cycles # 2.600 GHz ( +- 5.53% ) 23861051578 instructions # 1.40 insn per cycle ( +- 0.07% ) <not supported> branches 10473776 branch-misses ( +- 0.60% )
3.284 +- 0.182 seconds time elapsed ( +- 5.53% )
The result is more stable when using taskset to limit the running cpu, but I suppose the above data is stable enough to justify the performance improvement?