Patch 1: add a selftest app to benchmark the performance of ptr_ring. Patch 2: move r->queue[] clearing after r->consumer_head updating. Patch 3: add barrier to ensure the visiblity of r->queue[].
V3: add patch 3 and address most of Michael's comment. V2: add patch 1 and add performance data for patch 2.
--- Performance raw data using "perf stat -r" cmd, comparison is also done in patch 2/3. ptr_ring_test_org: patch 1 ptr_ring_test_opt1: patch 1 + patch 2 ptr_ring_test_opt2: patch 1 + patch 2 + patch 3
x86_64(as there is other workload in the x86_64 system, so run 1000 times to get more accurate result):
Performance counter stats for './ptr_ring_test_org -s 1000 -m 1 -N 100000000' (1000 runs):
5,291.83 msec task-clock # 1.994 CPUs utilized ( +- 0.41% ) 690 context-switches # 0.130 K/sec ( +- 3.65% ) 8 cpu-migrations # 0.002 K/sec ( +- 5.70% ) 291 page-faults # 0.055 K/sec ( +- 0.05% ) 12,660,040,758 cycles # 2.392 GHz ( +- 0.41% ) 24,202,160,722 instructions # 1.91 insn per cycle ( +- 0.06% ) 3,559,123,597 branches # 672.569 M/sec ( +- 0.07% ) 8,009,010 branch-misses # 0.23% of all branches ( +- 0.11% )
2.6538 +- 0.0109 seconds time elapsed ( +- 0.41% )
Performance counter stats for './ptr_ring_test_opt1 -s 1000 -m 1 -N 100000000' (1000 runs):
5,064.95 msec task-clock # 1.992 CPUs utilized ( +- 0.55% ) 668 context-switches # 0.132 K/sec ( +- 4.20% ) 9 cpu-migrations # 0.002 K/sec ( +- 4.45% ) 291 page-faults # 0.057 K/sec ( +- 0.06% ) 12,117,262,182 cycles # 2.392 GHz ( +- 0.55% ) 22,586,035,716 instructions # 1.86 insn per cycle ( +- 0.08% ) 3,404,652,345 branches # 672.199 M/sec ( +- 0.10% ) 7,864,190 branch-misses # 0.23% of all branches ( +- 0.16% )
2.5422 +- 0.0142 seconds time elapsed ( +- 0.56% )
Performance counter stats for './ptr_ring_test_opt2 -s 1000 -m 1 -N 100000000' (1000 runs):
5,105.33 msec task-clock # 1.995 CPUs utilized ( +- 0.47% ) 589 context-switches # 0.115 K/sec ( +- 4.24% ) 11 cpu-migrations # 0.002 K/sec ( +- 4.24% ) 292 page-faults # 0.057 K/sec ( +- 0.04% ) 12,214,160,307 cycles # 2.392 GHz ( +- 0.47% ) 22,756,292,370 instructions # 1.86 insn per cycle ( +- 0.10% ) 3,429,218,233 branches # 671.694 M/sec ( +- 0.12% ) 7,921,984 branch-misses # 0.23% of all branches ( +- 0.15% )
2.5587 +- 0.0122 seconds time elapsed ( +- 0.47% )
------------------------------------------------------------------------------------------------- arm64(using taskset to avoid the numa effects):
Performance counter stats for 'taskset -c 0-1 ./ptr_ring_test_org -s 1000 -m 1 -N 100000000' (100 runs):
4172.83 msec task-clock # 1.999 CPUs utilized ( +- 0.01% ) 54 context-switches # 0.013 K/sec ( +- 0.29% ) 1 cpu-migrations # 0.000 K/sec 115 page-faults # 0.028 K/sec ( +- 0.16% ) 10848085945 cycles # 2.600 GHz ( +- 0.01% ) 25808501369 instructions # 2.38 insn per cycle ( +- 0.00% ) <not supported> branches 11190266 branch-misses ( +- 0.02% )
2.087205 +- 0.000130 seconds time elapsed ( +- 0.01% )
Performance counter stats for 'taskset -c 0-1 ./ptr_ring_test_opt1 -s 1000 -m 1 -N 100000000' (100 runs):
3774.91 msec task-clock # 1.999 CPUs utilized ( +- 0.03% ) 50 context-switches # 0.013 K/sec ( +- 0.36% ) 1 cpu-migrations # 0.000 K/sec 114 page-faults # 0.030 K/sec ( +- 0.15% ) 9813658996 cycles # 2.600 GHz ( +- 0.03% ) 23920189000 instructions # 2.44 insn per cycle ( +- 0.01% ) <not supported> branches 10018927 branch-misses ( +- 0.04% )
1.888224 +- 0.000541 seconds time elapsed ( +- 0.03% )
Performance counter stats for 'taskset -c 0-1 ./ptr_ring_test_opt2 -s 1000 -m 1 -N 100000000' (100 runs):
3785.79 msec task-clock # 1.999 CPUs utilized ( +- 0.03% ) 49 context-switches # 0.013 K/sec ( +- 0.32% ) 1 cpu-migrations # 0.000 K/sec 114 page-faults # 0.030 K/sec ( +- 0.15% ) 9842067534 cycles # 2.600 GHz ( +- 0.03% ) 24074397270 instructions # 2.45 insn per cycle ( +- 0.01% ) <not supported> branches 10091918 branch-misses ( +- 0.04% )
1.893673 +- 0.000508 seconds time elapsed ( +- 0.03% )
Yunsheng Lin (3): selftests/ptr_ring: add benchmark application for ptr_ring ptr_ring: move r->queue[] clearing after r->consumer_head updating ptr_ring: add barrier to ensure the visiblity of r->queue[]
MAINTAINERS | 5 + include/linux/ptr_ring.h | 52 ++++-- tools/testing/selftests/ptr_ring/Makefile | 6 + tools/testing/selftests/ptr_ring/ptr_ring_test.c | 224 +++++++++++++++++++++++ tools/testing/selftests/ptr_ring/ptr_ring_test.h | 130 +++++++++++++ 5 files changed, 399 insertions(+), 18 deletions(-) create mode 100644 tools/testing/selftests/ptr_ring/Makefile create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.c create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.h
Currently ptr_ring selftest is embedded within the virtio selftest, which involves some specific virtio operation, such as notifying and kicking.
As ptr_ring has been used by various subsystems, it deserves it's owner selftest in order to benchmark different usecase of ptr_ring, such as page pool and pfifo_fast qdisc.
So add a simple application to benchmark ptr_ring performance. Currently two test mode is supported: Mode 0: Both producing and consuming is done in a single thread, it is called simple test mode in the test app. Mode 1: Producing and consuming is done in different thread concurrently, also known as SPSC(single-producer/ single-consumer) test.
The multi-producer/single-consumer test for pfifo_fast case is not added yet, which can be added if using CAS atomic operation to enable lockless multi-producer is proved to be better than using r->producer_lock.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com --- V3: Remove timestamp sampling, use standard C library as much as possible. --- MAINTAINERS | 5 + tools/testing/selftests/ptr_ring/Makefile | 6 + tools/testing/selftests/ptr_ring/ptr_ring_test.c | 224 +++++++++++++++++++++++ tools/testing/selftests/ptr_ring/ptr_ring_test.h | 130 +++++++++++++ 4 files changed, 365 insertions(+) create mode 100644 tools/testing/selftests/ptr_ring/Makefile create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.c create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.h
diff --git a/MAINTAINERS b/MAINTAINERS index cc375fd..1227022 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14847,6 +14847,11 @@ F: drivers/net/phy/dp83640* F: drivers/ptp/* F: include/linux/ptp_cl*
+PTR RING BENCHMARK +M: Yunsheng Lin linyunsheng@huawei.com +L: netdev@vger.kernel.org +F: tools/testing/selftests/ptr_ring/ + PTRACE SUPPORT M: Oleg Nesterov oleg@redhat.com S: Maintained diff --git a/tools/testing/selftests/ptr_ring/Makefile b/tools/testing/selftests/ptr_ring/Makefile new file mode 100644 index 0000000..346dea9 --- /dev/null +++ b/tools/testing/selftests/ptr_ring/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0 +LDLIBS = -lpthread + +TEST_GEN_PROGS := ptr_ring_test + +include ../lib.mk diff --git a/tools/testing/selftests/ptr_ring/ptr_ring_test.c b/tools/testing/selftests/ptr_ring/ptr_ring_test.c new file mode 100644 index 0000000..4a5312f --- /dev/null +++ b/tools/testing/selftests/ptr_ring/ptr_ring_test.c @@ -0,0 +1,224 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2021 HiSilicon Limited. + */ + +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <string.h> +#include <errno.h> +#include <malloc.h> +#include <stdbool.h> + +#include "ptr_ring_test.h" +#include "../../../../include/linux/ptr_ring.h" + +#define MIN_RING_SIZE 2 +#define MAX_RING_SIZE 10000000 + +static struct ptr_ring ring ____cacheline_aligned_in_smp; + +struct worker_info { + pthread_t tid; + int test_count; + bool error; +}; + +static void *produce_worker(void *arg) +{ + struct worker_info *info = arg; + unsigned long i = 0; + int ret; + + while (++i <= info->test_count) { + while (__ptr_ring_full(&ring)) + cpu_relax(); + + ret = __ptr_ring_produce(&ring, (void *)i); + if (ret) { + fprintf(stderr, "produce failed: %d\n", ret); + info->error = true; + return NULL; + } + } + + info->error = false; + + return NULL; +} + +static void *consume_worker(void *arg) +{ + struct worker_info *info = arg; + unsigned long i = 0; + int *ptr; + + while (++i <= info->test_count) { + while (__ptr_ring_empty(&ring)) + cpu_relax(); + + ptr = __ptr_ring_consume(&ring); + if ((unsigned long)ptr != i) { + fprintf(stderr, "consumer failed, ptr: %lu, i: %lu\n", + (unsigned long)ptr, i); + info->error = true; + return NULL; + } + } + + if (!__ptr_ring_empty(&ring)) { + fprintf(stderr, "ring should be empty, test failed\n"); + info->error = true; + return NULL; + } + + info->error = false; + return NULL; +} + +/* test case for single producer single consumer */ +static void spsc_test(int size, int count) +{ + struct worker_info producer, consumer; + pthread_attr_t attr; + void *res; + int ret; + + ret = ptr_ring_init(&ring, size, 0); + if (ret) { + fprintf(stderr, "init failed: %d\n", ret); + return; + } + + producer.test_count = count; + consumer.test_count = count; + + ret = pthread_attr_init(&attr); + if (ret) { + fprintf(stderr, "pthread attr init failed: %d\n", ret); + goto out; + } + + ret = pthread_create(&producer.tid, &attr, + produce_worker, &producer); + if (ret) { + fprintf(stderr, "create producer thread failed: %d\n", ret); + goto out; + } + + ret = pthread_create(&consumer.tid, &attr, + consume_worker, &consumer); + if (ret) { + fprintf(stderr, "create consumer thread failed: %d\n", ret); + goto out; + } + + ret = pthread_join(producer.tid, &res); + if (ret) { + fprintf(stderr, "join producer thread failed: %d\n", ret); + goto out; + } + + ret = pthread_join(consumer.tid, &res); + if (ret) { + fprintf(stderr, "join consumer thread failed: %d\n", ret); + goto out; + } + + if (producer.error || consumer.error) { + fprintf(stderr, "spsc test failed\n"); + goto out; + } + + printf("ptr_ring(size:%d) perf spsc test produced/comsumed %d items, finished\n", + size, count); +out: + ptr_ring_cleanup(&ring, NULL); +} + +static void simple_test(int size, int count) +{ + struct timeval start, end; + int i = 0; + int *ptr; + int ret; + + ret = ptr_ring_init(&ring, size, 0); + if (ret) { + fprintf(stderr, "init failed: %d\n", ret); + return; + } + + while (++i <= count) { + ret = __ptr_ring_produce(&ring, &count); + if (ret) { + fprintf(stderr, "produce failed: %d\n", ret); + goto out; + } + + ptr = __ptr_ring_consume(&ring); + if (ptr != &count) { + fprintf(stderr, "consume failed: %p\n", ptr); + goto out; + } + } + + printf("ptr_ring(size:%d) perf simple test produced/consumed %d items, finished\n", + size, count); + +out: + ptr_ring_cleanup(&ring, NULL); +} + +int main(int argc, char *argv[]) +{ + int count = 1000000; + int size = 1000; + int mode = 0; + int opt; + + while ((opt = getopt(argc, argv, "N:s:m:h")) != -1) { + switch (opt) { + case 'N': + count = atoi(optarg); + break; + case 's': + size = atoi(optarg); + break; + case 'm': + mode = atoi(optarg); + break; + case 'h': + printf("usage: ptr_ring_test [-N COUNT] [-s RING_SIZE] [-m TEST_MODE]\n"); + return 0; + default: + return -1; + } + } + + if (count <= 0) { + fprintf(stderr, "invalid test count, must be > 0\n"); + return -1; + } + + if (size < MIN_RING_SIZE || size > MAX_RING_SIZE) { + fprintf(stderr, "invalid ring size, must be in %d-%d\n", + MIN_RING_SIZE, MAX_RING_SIZE); + return -1; + } + + switch (mode) { + case 0: + simple_test(size, count); + break; + case 1: + spsc_test(size, count); + break; + default: + fprintf(stderr, "invalid test mode\n"); + return -1; + } + + return 0; +} diff --git a/tools/testing/selftests/ptr_ring/ptr_ring_test.h b/tools/testing/selftests/ptr_ring/ptr_ring_test.h new file mode 100644 index 0000000..32bfefb --- /dev/null +++ b/tools/testing/selftests/ptr_ring/ptr_ring_test.h @@ -0,0 +1,130 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ + +#ifndef _TEST_PTR_RING_TEST_H +#define _TEST_PTR_RING_TEST_H + +#include <assert.h> +#include <stdatomic.h> +#include <pthread.h> + +/* Assuming the cache line size is 64 for most cpu, + * change it accordingly if the running cpu has different + * cache line size in order to get more accurate result. + */ +#define SMP_CACHE_BYTES 64 + +#define cpu_relax() sched_yield() +#define smp_release() atomic_thread_fence(memory_order_release) +#define smp_acquire() atomic_thread_fence(memory_order_acquire) +#define smp_wmb() smp_release() +#define smp_store_release(p, v) \ + atomic_store_explicit(p, v, memory_order_release) + +#define READ_ONCE(x) (*(volatile typeof(x) *)&(x)) +#define WRITE_ONCE(x, val) ((*(volatile typeof(x) *)&(x)) = (val)) +#define cache_line_size SMP_CACHE_BYTES +#define unlikely(x) (__builtin_expect(!!(x), 0)) +#define likely(x) (__builtin_expect(!!(x), 1)) +#define ALIGN(x, a) (((x) + (a) - 1) / (a) * (a)) +#define SIZE_MAX (~(size_t)0) +#define KMALLOC_MAX_SIZE SIZE_MAX +#define spinlock_t pthread_spinlock_t +#define gfp_t int +#define __GFP_ZERO 0x1 + +#define ____cacheline_aligned_in_smp \ + __attribute__((aligned(SMP_CACHE_BYTES))) + +static inline void *kmalloc(unsigned int size, gfp_t gfp) +{ + void *p; + + p = memalign(64, size); + if (!p) + return p; + + if (gfp & __GFP_ZERO) + memset(p, 0, size); + + return p; +} + +static inline void *kzalloc(unsigned int size, gfp_t flags) +{ + return kmalloc(size, flags | __GFP_ZERO); +} + +static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags) +{ + if (size != 0 && n > SIZE_MAX / size) + return NULL; + return kmalloc(n * size, flags); +} + +static inline void *kcalloc(size_t n, size_t size, gfp_t flags) +{ + return kmalloc_array(n, size, flags | __GFP_ZERO); +} + +static inline void kfree(void *p) +{ + free(p); +} + +#define kvmalloc_array kmalloc_array +#define kvfree kfree + +static inline void spin_lock_init(spinlock_t *lock) +{ + int r = pthread_spin_init(lock, 0); + + assert(!r); +} + +static inline void spin_lock(spinlock_t *lock) +{ + int ret = pthread_spin_lock(lock); + + assert(!ret); +} + +static inline void spin_unlock(spinlock_t *lock) +{ + int ret = pthread_spin_unlock(lock); + + assert(!ret); +} + +static inline void spin_lock_bh(spinlock_t *lock) +{ + spin_lock(lock); +} + +static inline void spin_unlock_bh(spinlock_t *lock) +{ + spin_unlock(lock); +} + +static inline void spin_lock_irq(spinlock_t *lock) +{ + spin_lock(lock); +} + +static inline void spin_unlock_irq(spinlock_t *lock) +{ + spin_unlock(lock); +} + +static inline void spin_lock_irqsave(spinlock_t *lock, + unsigned long f) +{ + spin_lock(lock); +} + +static inline void spin_unlock_irqrestore(spinlock_t *lock, + unsigned long f) +{ + spin_unlock(lock); +} + +#endif
在 2021/7/1 下午8:26, Yunsheng Lin 写道:
Currently ptr_ring selftest is embedded within the virtio selftest, which involves some specific virtio operation, such as notifying and kicking.
As ptr_ring has been used by various subsystems, it deserves it's owner selftest in order to benchmark different usecase of ptr_ring, such as page pool and pfifo_fast qdisc.
So add a simple application to benchmark ptr_ring performance. Currently two test mode is supported: Mode 0: Both producing and consuming is done in a single thread, it is called simple test mode in the test app. Mode 1: Producing and consuming is done in different thread concurrently, also known as SPSC(single-producer/ single-consumer) test.
The multi-producer/single-consumer test for pfifo_fast case is not added yet, which can be added if using CAS atomic operation to enable lockless multi-producer is proved to be better than using r->producer_lock.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
V3: Remove timestamp sampling, use standard C library as much as possible.
MAINTAINERS | 5 + tools/testing/selftests/ptr_ring/Makefile | 6 + tools/testing/selftests/ptr_ring/ptr_ring_test.c | 224 +++++++++++++++++++++++ tools/testing/selftests/ptr_ring/ptr_ring_test.h | 130 +++++++++++++ 4 files changed, 365 insertions(+) create mode 100644 tools/testing/selftests/ptr_ring/Makefile create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.c create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.h
diff --git a/MAINTAINERS b/MAINTAINERS index cc375fd..1227022 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14847,6 +14847,11 @@ F: drivers/net/phy/dp83640* F: drivers/ptp/* F: include/linux/ptp_cl*
+PTR RING BENCHMARK +M: Yunsheng Lin linyunsheng@huawei.com +L: netdev@vger.kernel.org +F: tools/testing/selftests/ptr_ring/
- PTRACE SUPPORT M: Oleg Nesterov oleg@redhat.com S: Maintained
diff --git a/tools/testing/selftests/ptr_ring/Makefile b/tools/testing/selftests/ptr_ring/Makefile new file mode 100644 index 0000000..346dea9 --- /dev/null +++ b/tools/testing/selftests/ptr_ring/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0 +LDLIBS = -lpthread
+TEST_GEN_PROGS := ptr_ring_test
+include ../lib.mk diff --git a/tools/testing/selftests/ptr_ring/ptr_ring_test.c b/tools/testing/selftests/ptr_ring/ptr_ring_test.c new file mode 100644 index 0000000..4a5312f --- /dev/null +++ b/tools/testing/selftests/ptr_ring/ptr_ring_test.c @@ -0,0 +1,224 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/*
- Copyright (C) 2021 HiSilicon Limited.
- */
+#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <string.h> +#include <errno.h> +#include <malloc.h> +#include <stdbool.h>
+#include "ptr_ring_test.h" +#include "../../../../include/linux/ptr_ring.h"
+#define MIN_RING_SIZE 2 +#define MAX_RING_SIZE 10000000
+static struct ptr_ring ring ____cacheline_aligned_in_smp;
+struct worker_info {
- pthread_t tid;
- int test_count;
- bool error;
+};
+static void *produce_worker(void *arg) +{
- struct worker_info *info = arg;
- unsigned long i = 0;
- int ret;
- while (++i <= info->test_count) {
while (__ptr_ring_full(&ring))
cpu_relax();
ret = __ptr_ring_produce(&ring, (void *)i);
if (ret) {
fprintf(stderr, "produce failed: %d\n", ret);
info->error = true;
return NULL;
}
- }
- info->error = false;
- return NULL;
+}
+static void *consume_worker(void *arg) +{
- struct worker_info *info = arg;
- unsigned long i = 0;
- int *ptr;
- while (++i <= info->test_count) {
while (__ptr_ring_empty(&ring))
cpu_relax();
Any reason for not simply use __ptr_ring_consume() here?
ptr = __ptr_ring_consume(&ring);
if ((unsigned long)ptr != i) {
fprintf(stderr, "consumer failed, ptr: %lu, i: %lu\n",
(unsigned long)ptr, i);
info->error = true;
return NULL;
}
- }
- if (!__ptr_ring_empty(&ring)) {
fprintf(stderr, "ring should be empty, test failed\n");
info->error = true;
return NULL;
- }
- info->error = false;
- return NULL;
+}
+/* test case for single producer single consumer */ +static void spsc_test(int size, int count) +{
- struct worker_info producer, consumer;
- pthread_attr_t attr;
- void *res;
- int ret;
- ret = ptr_ring_init(&ring, size, 0);
- if (ret) {
fprintf(stderr, "init failed: %d\n", ret);
return;
- }
- producer.test_count = count;
- consumer.test_count = count;
- ret = pthread_attr_init(&attr);
- if (ret) {
fprintf(stderr, "pthread attr init failed: %d\n", ret);
goto out;
- }
- ret = pthread_create(&producer.tid, &attr,
produce_worker, &producer);
- if (ret) {
fprintf(stderr, "create producer thread failed: %d\n", ret);
goto out;
- }
- ret = pthread_create(&consumer.tid, &attr,
consume_worker, &consumer);
- if (ret) {
fprintf(stderr, "create consumer thread failed: %d\n", ret);
goto out;
- }
- ret = pthread_join(producer.tid, &res);
- if (ret) {
fprintf(stderr, "join producer thread failed: %d\n", ret);
goto out;
- }
- ret = pthread_join(consumer.tid, &res);
- if (ret) {
fprintf(stderr, "join consumer thread failed: %d\n", ret);
goto out;
- }
- if (producer.error || consumer.error) {
fprintf(stderr, "spsc test failed\n");
goto out;
- }
- printf("ptr_ring(size:%d) perf spsc test produced/comsumed %d items, finished\n",
size, count);
+out:
- ptr_ring_cleanup(&ring, NULL);
+}
+static void simple_test(int size, int count) +{
- struct timeval start, end;
- int i = 0;
- int *ptr;
- int ret;
- ret = ptr_ring_init(&ring, size, 0);
- if (ret) {
fprintf(stderr, "init failed: %d\n", ret);
return;
- }
- while (++i <= count) {
ret = __ptr_ring_produce(&ring, &count);
if (ret) {
fprintf(stderr, "produce failed: %d\n", ret);
goto out;
}
ptr = __ptr_ring_consume(&ring);
if (ptr != &count) {
fprintf(stderr, "consume failed: %p\n", ptr);
goto out;
}
- }
- printf("ptr_ring(size:%d) perf simple test produced/consumed %d items, finished\n",
size, count);
+out:
- ptr_ring_cleanup(&ring, NULL);
+}
+int main(int argc, char *argv[]) +{
- int count = 1000000;
- int size = 1000;
- int mode = 0;
- int opt;
- while ((opt = getopt(argc, argv, "N:s:m:h")) != -1) {
switch (opt) {
case 'N':
count = atoi(optarg);
break;
case 's':
size = atoi(optarg);
break;
case 'm':
mode = atoi(optarg);
break;
case 'h':
printf("usage: ptr_ring_test [-N COUNT] [-s RING_SIZE] [-m TEST_MODE]\n");
return 0;
default:
return -1;
}
- }
- if (count <= 0) {
fprintf(stderr, "invalid test count, must be > 0\n");
return -1;
- }
- if (size < MIN_RING_SIZE || size > MAX_RING_SIZE) {
fprintf(stderr, "invalid ring size, must be in %d-%d\n",
MIN_RING_SIZE, MAX_RING_SIZE);
return -1;
- }
- switch (mode) {
- case 0:
simple_test(size, count);
break;
- case 1:
spsc_test(size, count);
break;
- default:
fprintf(stderr, "invalid test mode\n");
return -1;
- }
- return 0;
+} diff --git a/tools/testing/selftests/ptr_ring/ptr_ring_test.h b/tools/testing/selftests/ptr_ring/ptr_ring_test.h new file mode 100644 index 0000000..32bfefb --- /dev/null +++ b/tools/testing/selftests/ptr_ring/ptr_ring_test.h
Let's reuse ptr_ring.c in tools/virtio/ringtest. Nothing virt specific there.
Thanks
@@ -0,0 +1,130 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _TEST_PTR_RING_TEST_H +#define _TEST_PTR_RING_TEST_H
+#include <assert.h> +#include <stdatomic.h> +#include <pthread.h>
+/* Assuming the cache line size is 64 for most cpu,
- change it accordingly if the running cpu has different
- cache line size in order to get more accurate result.
- */
+#define SMP_CACHE_BYTES 64
+#define cpu_relax() sched_yield() +#define smp_release() atomic_thread_fence(memory_order_release) +#define smp_acquire() atomic_thread_fence(memory_order_acquire) +#define smp_wmb() smp_release() +#define smp_store_release(p, v) \
atomic_store_explicit(p, v, memory_order_release)
+#define READ_ONCE(x) (*(volatile typeof(x) *)&(x)) +#define WRITE_ONCE(x, val) ((*(volatile typeof(x) *)&(x)) = (val)) +#define cache_line_size SMP_CACHE_BYTES +#define unlikely(x) (__builtin_expect(!!(x), 0)) +#define likely(x) (__builtin_expect(!!(x), 1)) +#define ALIGN(x, a) (((x) + (a) - 1) / (a) * (a)) +#define SIZE_MAX (~(size_t)0) +#define KMALLOC_MAX_SIZE SIZE_MAX +#define spinlock_t pthread_spinlock_t +#define gfp_t int +#define __GFP_ZERO 0x1
+#define ____cacheline_aligned_in_smp \
__attribute__((aligned(SMP_CACHE_BYTES)))
+static inline void *kmalloc(unsigned int size, gfp_t gfp) +{
- void *p;
- p = memalign(64, size);
- if (!p)
return p;
- if (gfp & __GFP_ZERO)
memset(p, 0, size);
- return p;
+}
+static inline void *kzalloc(unsigned int size, gfp_t flags) +{
- return kmalloc(size, flags | __GFP_ZERO);
+}
+static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags) +{
- if (size != 0 && n > SIZE_MAX / size)
return NULL;
- return kmalloc(n * size, flags);
+}
+static inline void *kcalloc(size_t n, size_t size, gfp_t flags) +{
- return kmalloc_array(n, size, flags | __GFP_ZERO);
+}
+static inline void kfree(void *p) +{
- free(p);
+}
+#define kvmalloc_array kmalloc_array +#define kvfree kfree
+static inline void spin_lock_init(spinlock_t *lock) +{
- int r = pthread_spin_init(lock, 0);
- assert(!r);
+}
+static inline void spin_lock(spinlock_t *lock) +{
- int ret = pthread_spin_lock(lock);
- assert(!ret);
+}
+static inline void spin_unlock(spinlock_t *lock) +{
- int ret = pthread_spin_unlock(lock);
- assert(!ret);
+}
+static inline void spin_lock_bh(spinlock_t *lock) +{
- spin_lock(lock);
+}
+static inline void spin_unlock_bh(spinlock_t *lock) +{
- spin_unlock(lock);
+}
+static inline void spin_lock_irq(spinlock_t *lock) +{
- spin_lock(lock);
+}
+static inline void spin_unlock_irq(spinlock_t *lock) +{
- spin_unlock(lock);
+}
+static inline void spin_lock_irqsave(spinlock_t *lock,
unsigned long f)
+{
- spin_lock(lock);
+}
+static inline void spin_unlock_irqrestore(spinlock_t *lock,
unsigned long f)
+{
- spin_unlock(lock);
+}
+#endif
On 2021/7/2 14:43, Jason Wang wrote:
在 2021/7/1 下午8:26, Yunsheng Lin 写道:
Currently ptr_ring selftest is embedded within the virtio selftest, which involves some specific virtio operation, such as notifying and kicking.
As ptr_ring has been used by various subsystems, it deserves it's owner selftest in order to benchmark different usecase of ptr_ring, such as page pool and pfifo_fast qdisc.
So add a simple application to benchmark ptr_ring performance. Currently two test mode is supported: Mode 0: Both producing and consuming is done in a single thread, it is called simple test mode in the test app. Mode 1: Producing and consuming is done in different thread concurrently, also known as SPSC(single-producer/ single-consumer) test.
The multi-producer/single-consumer test for pfifo_fast case is not added yet, which can be added if using CAS atomic operation to enable lockless multi-producer is proved to be better than using r->producer_lock.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
V3: Remove timestamp sampling, use standard C library as much as possible.
[...]
+static void *produce_worker(void *arg) +{
- struct worker_info *info = arg;
- unsigned long i = 0;
- int ret;
- while (++i <= info->test_count) {
while (__ptr_ring_full(&ring))
cpu_relax();
ret = __ptr_ring_produce(&ring, (void *)i);
if (ret) {
fprintf(stderr, "produce failed: %d\n", ret);
info->error = true;
return NULL;
}
- }
- info->error = false;
- return NULL;
+}
+static void *consume_worker(void *arg) +{
- struct worker_info *info = arg;
- unsigned long i = 0;
- int *ptr;
- while (++i <= info->test_count) {
while (__ptr_ring_empty(&ring))
cpu_relax();
Any reason for not simply use __ptr_ring_consume() here?
No particular reason, just to make sure the ring is non-empty before doing the enqueuing, we could check if the __ptr_ring_consume() return NULL to decide the if the ring is empty. Using __ptr_ring_consume() here enable testing the correctness and performance of __ptr_ring_consume() too.
ptr = __ptr_ring_consume(&ring);
if ((unsigned long)ptr != i) {
fprintf(stderr, "consumer failed, ptr: %lu, i: %lu\n",
(unsigned long)ptr, i);
info->error = true;
return NULL;
}
- }
- if (!__ptr_ring_empty(&ring)) {
fprintf(stderr, "ring should be empty, test failed\n");
info->error = true;
return NULL;
- }
- info->error = false;
- return NULL;
+}
[...]
- return 0;
+} diff --git a/tools/testing/selftests/ptr_ring/ptr_ring_test.h b/tools/testing/selftests/ptr_ring/ptr_ring_test.h new file mode 100644 index 0000000..32bfefb --- /dev/null +++ b/tools/testing/selftests/ptr_ring/ptr_ring_test.h
Let's reuse ptr_ring.c in tools/virtio/ringtest. Nothing virt specific there.
It *does* have some virtio specific at the end of ptr_ring.c. It can be argued that the ptr_ring.c in tools/virtio/ringtest could be refactored to remove the function related to virtio.
But as mentioned in the previous disscusion [1], the tools/virtio/ seems to have compile error in the latest kernel, it does not seems right to reuse that. And most of testcase in tools/virtio/ seems better be in tools/virtio/ringtest instead,so until the testcase in tools/virtio/ is compile-error-free and moved to tools/testing/ selftests/, it seems better not to reuse it for now.
1. https://patchwork.kernel.org/project/netdevbpf/patch/1624591136-6647-2-git-s...
Thanks
[...]
.
On Fri, Jul 02, 2021 at 04:17:17PM +0800, Yunsheng Lin wrote:
Let's reuse ptr_ring.c in tools/virtio/ringtest. Nothing virt specific there.
It *does* have some virtio specific at the end of ptr_ring.c. It can be argued that the ptr_ring.c in tools/virtio/ringtest could be refactored to remove the function related to virtio.
But as mentioned in the previous disscusion [1], the tools/virtio/ seems to have compile error in the latest kernel, it does not seems right to reuse that. And most of testcase in tools/virtio/ seems better be in tools/virtio/ringtest instead,so until the testcase in tools/virtio/ is compile-error-free and moved to tools/testing/ selftests/, it seems better not to reuse it for now.
That's a great reason to reuse - so tools/virtio/ stays working. Please just fix that.
On 2021/7/2 16:30, Michael S. Tsirkin wrote:
On Fri, Jul 02, 2021 at 04:17:17PM +0800, Yunsheng Lin wrote:
Let's reuse ptr_ring.c in tools/virtio/ringtest. Nothing virt specific there.
It *does* have some virtio specific at the end of ptr_ring.c. It can be argued that the ptr_ring.c in tools/virtio/ringtest could be refactored to remove the function related to virtio.
But as mentioned in the previous disscusion [1], the tools/virtio/ seems to have compile error in the latest kernel, it does not seems right to reuse that. And most of testcase in tools/virtio/ seems better be in tools/virtio/ringtest instead,so until the testcase in tools/virtio/ is compile-error-free and moved to tools/testing/ selftests/, it seems better not to reuse it for now.
That's a great reason to reuse - so tools/virtio/ stays working. Please just fix that.
I understand that you guys like to see a working testcase of virtio. I would love to do that if I have the time and knowledge of virtio, But I do not think I have the time and I am familiar enough with virtio to fix that now.
在 2021/7/2 下午4:46, Yunsheng Lin 写道:
On 2021/7/2 16:30, Michael S. Tsirkin wrote:
On Fri, Jul 02, 2021 at 04:17:17PM +0800, Yunsheng Lin wrote:
Let's reuse ptr_ring.c in tools/virtio/ringtest. Nothing virt specific there.
It *does* have some virtio specific at the end of ptr_ring.c.
They are just wrappers to make ptr ring works like a virtio ring. We can split them out into another file if necessary.
It can be argued that the ptr_ring.c in tools/virtio/ringtest could be refactored to remove the function related to virtio.
But as mentioned in the previous disscusion [1], the tools/virtio/ seems to have compile error in the latest kernel, it does not seems right to reuse that. And most of testcase in tools/virtio/ seems better be in tools/virtio/ringtest instead,so until the testcase in tools/virtio/ is compile-error-free and moved to tools/testing/ selftests/, it seems better not to reuse it for now.
That's a great reason to reuse - so tools/virtio/ stays working. Please just fix that.
+1
I understand that you guys like to see a working testcase of virtio. I would love to do that if I have the time and knowledge of virtio, But I do not think I have the time and I am familiar enough with virtio to fix that now.
So ringtest is used for bench-marking the ring performance for different format. Virtio is only one of the supported ring format, ptr ring is another. Wrappers were used to reuse the same test logic.
Though you may see host/guest in the test, it's in fact done via two processes.
We need figure out:
1) why the current ringtest.c does not fit for your requirement (it has SPSC test) 2) why can't we tweak the ptr_ring.c to be used by both ring_test and your benchmark
If neither of the above work, we can invent new ptr_ring infrastructure under tests/
Thanks
On 2021/7/2 17:04, Jason Wang wrote:
[...]
I understand that you guys like to see a working testcase of virtio. I would love to do that if I have the time and knowledge of virtio, But I do not think I have the time and I am familiar enough with virtio to fix that now.
So ringtest is used for bench-marking the ring performance for different format. Virtio is only one of the supported ring format, ptr ring is another. Wrappers were used to reuse the same test logic.
Though you may see host/guest in the test, it's in fact done via two processes.
We need figure out:
- why the current ringtest.c does not fit for your requirement (it has SPSC test)
There is MPSC case used by pfifo_fast, it make more sense to use a separate selftest for ptr_ring as ptr_ring has been used by various subsystems.
- why can't we tweak the ptr_ring.c to be used by both ring_test and your benchmark
Actually that is what I do in this patch, move the specific part related to ptr_ring to ptr_ring_test.h. When the virtio testing is refactored to work, it can reuse the abstract layer in ptr_ring_test.h too.
If neither of the above work, we can invent new ptr_ring infrastructure under tests/
Thanks
.
On Fri, Jul 02, 2021 at 05:54:42PM +0800, Yunsheng Lin wrote:
On 2021/7/2 17:04, Jason Wang wrote:
[...]
I understand that you guys like to see a working testcase of virtio. I would love to do that if I have the time and knowledge of virtio, But I do not think I have the time and I am familiar enough with virtio to fix that now.
So ringtest is used for bench-marking the ring performance for different format. Virtio is only one of the supported ring format, ptr ring is another. Wrappers were used to reuse the same test logic.
Though you may see host/guest in the test, it's in fact done via two processes.
We need figure out:
- why the current ringtest.c does not fit for your requirement (it has SPSC test)
There is MPSC case used by pfifo_fast, it make more sense to use a separate selftest for ptr_ring as ptr_ring has been used by various subsystems.
- why can't we tweak the ptr_ring.c to be used by both ring_test and your benchmark
Actually that is what I do in this patch, move the specific part related to ptr_ring to ptr_ring_test.h. When the virtio testing is refactored to work, it can reuse the abstract layer in ptr_ring_test.h too.
Sounds good. But that refactoring will be up to you as a contributor.
If neither of the above work, we can invent new ptr_ring infrastructure under tests/
Thanks
.
On 2021/7/2 22:18, Michael S. Tsirkin wrote:
On Fri, Jul 02, 2021 at 05:54:42PM +0800, Yunsheng Lin wrote:
On 2021/7/2 17:04, Jason Wang wrote:
[...]
I understand that you guys like to see a working testcase of virtio. I would love to do that if I have the time and knowledge of virtio, But I do not think I have the time and I am familiar enough with virtio to fix that now.
So ringtest is used for bench-marking the ring performance for different format. Virtio is only one of the supported ring format, ptr ring is another. Wrappers were used to reuse the same test logic.
Though you may see host/guest in the test, it's in fact done via two processes.
We need figure out:
- why the current ringtest.c does not fit for your requirement (it has SPSC test)
There is MPSC case used by pfifo_fast, it make more sense to use a separate selftest for ptr_ring as ptr_ring has been used by various subsystems.
- why can't we tweak the ptr_ring.c to be used by both ring_test and your benchmark
Actually that is what I do in this patch, move the specific part related to ptr_ring to ptr_ring_test.h. When the virtio testing is refactored to work, it can reuse the abstract layer in ptr_ring_test.h too.
Sounds good. But that refactoring will be up to you as a contributor.
It seems that tools/include/* have a lot of portability infrastructure for building kernel code from userspace, will try to refactor the ptr_ring.h to use the portability infrastructure in tools/include/* when building ptr_ring.h from userspace.
On Fri, Jul 02, 2021 at 05:04:44PM +0800, Jason Wang wrote:
在 2021/7/2 下午4:46, Yunsheng Lin 写道:
On 2021/7/2 16:30, Michael S. Tsirkin wrote:
On Fri, Jul 02, 2021 at 04:17:17PM +0800, Yunsheng Lin wrote:
Let's reuse ptr_ring.c in tools/virtio/ringtest. Nothing virt specific there.
It *does* have some virtio specific at the end of ptr_ring.c.
They are just wrappers to make ptr ring works like a virtio ring. We can split them out into another file if necessary.
It can be argued that the ptr_ring.c in tools/virtio/ringtest could be refactored to remove the function related to virtio.
But as mentioned in the previous disscusion [1], the tools/virtio/ seems to have compile error in the latest kernel, it does not seems right to reuse that. And most of testcase in tools/virtio/ seems better be in tools/virtio/ringtest instead,so until the testcase in tools/virtio/ is compile-error-free and moved to tools/testing/ selftests/, it seems better not to reuse it for now.
That's a great reason to reuse - so tools/virtio/ stays working. Please just fix that.
+1
I understand that you guys like to see a working testcase of virtio. I would love to do that if I have the time and knowledge of virtio, But I do not think I have the time and I am familiar enough with virtio to fix that now.
So ringtest is used for bench-marking the ring performance for different format. Virtio is only one of the supported ring format, ptr ring is another. Wrappers were used to reuse the same test logic.
Though you may see host/guest in the test, it's in fact done via two processes.
We need figure out:
- why the current ringtest.c does not fit for your requirement (it has SPSC
test) 2) why can't we tweak the ptr_ring.c to be used by both ring_test and your benchmark
If neither of the above work, we can invent new ptr_ring infrastructure under tests/
Thanks
For me 1) is not a question.
All the available/used terminology is not an ideal fit for ptr ring. With virtio buffers are always owned by driver (producer) so producer has a way to find out if a buffer has been consumed. With ptr ring there's no way for producer to know a buffer has been consumed. The test hacks around that but it is very reasonable not to want to rely on that.
However 2) is very much a question. We can split ptr_ring to the preamble and virtio related hacks. So all the portability infrastructure for building kernel code from userspace, command line parsing, run-on-all.sh to figure out affinity effects, all that can and should IMHO be reused and not copy-pasted.
Currently r->queue[] clearing is done before r->consumer_head updating, which makes the __ptr_ring_empty() returning false positive result(the ring is non-empty, but __ptr_ring_empty() suggest that it is empty) if the checking is done after the r->queue clearing and before the consumer_head moving forward.
Move the r->queue[] clearing after consumer_head moving forward to avoid the above case.
As a side effect of above change, a consumer_head checking is avoided for the likely case, and it has noticeable performance improvement when it is tested using the ptr_ring_test selftest added in the previous patch.
Tested using the "perf stat -r 1000 ./ptr_ring_test -s 1000 -m 1 -N 100000000", comparing the elapsed time:
arch unpatched patched improvement arm64 2.087205 sec 1.888224 sec +9.5% X86 2.6538 sec 2.5422 sec +4.2%
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com --- V3: adjust the title and comment log according to disscusion in V2, and update performance data using "perf stat -r". V2: Add performance data. --- include/linux/ptr_ring.h | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 808f9d3..db9c282 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) /* Note: we must keep consumer_head valid at all times for __ptr_ring_empty * to work correctly. */ - int consumer_head = r->consumer_head; - int head = consumer_head++; + int consumer_head = r->consumer_head + 1;
/* Once we have processed enough entries invalidate them in * the ring all at once so producer can reuse their space in the ring. @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) */ if (unlikely(consumer_head - r->consumer_tail >= r->batch || consumer_head >= r->size)) { + int tail = r->consumer_tail; + + if (unlikely(consumer_head >= r->size)) { + r->consumer_tail = 0; + WRITE_ONCE(r->consumer_head, 0); + } else { + r->consumer_tail = consumer_head; + WRITE_ONCE(r->consumer_head, consumer_head); + } + /* Zero out entries in the reverse order: this way we touch the * cache line that producer might currently be reading the last; * producer won't make progress and touch other cache lines * besides the first one until we write out all entries. */ - while (likely(head >= r->consumer_tail)) - r->queue[head--] = NULL; - r->consumer_tail = consumer_head; - } - if (unlikely(consumer_head >= r->size)) { - consumer_head = 0; - r->consumer_tail = 0; + while (likely(--consumer_head >= tail)) + r->queue[consumer_head] = NULL; + + return; } + /* matching READ_ONCE in __ptr_ring_empty for lockless tests */ WRITE_ONCE(r->consumer_head, consumer_head); }
在 2021/7/1 下午8:26, Yunsheng Lin 写道:
Currently r->queue[] clearing is done before r->consumer_head updating, which makes the __ptr_ring_empty() returning false positive result(the ring is non-empty, but __ptr_ring_empty() suggest that it is empty) if the checking is done after the r->queue clearing and before the consumer_head moving forward.
Move the r->queue[] clearing after consumer_head moving forward to avoid the above case.
As a side effect of above change, a consumer_head checking is avoided for the likely case, and it has noticeable performance improvement when it is tested using the ptr_ring_test selftest added in the previous patch.
Tested using the "perf stat -r 1000 ./ptr_ring_test -s 1000 -m 1 -N 100000000", comparing the elapsed time:
arch unpatched patched improvement arm64 2.087205 sec 1.888224 sec +9.5% X86 2.6538 sec 2.5422 sec +4.2%
I think we need the number of real workloads here.
Thanks
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
V3: adjust the title and comment log according to disscusion in V2, and update performance data using "perf stat -r". V2: Add performance data.
include/linux/ptr_ring.h | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 808f9d3..db9c282 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) /* Note: we must keep consumer_head valid at all times for __ptr_ring_empty * to work correctly. */
- int consumer_head = r->consumer_head;
- int head = consumer_head++;
int consumer_head = r->consumer_head + 1;
/* Once we have processed enough entries invalidate them in
- the ring all at once so producer can reuse their space in the ring.
@@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) */ if (unlikely(consumer_head - r->consumer_tail >= r->batch || consumer_head >= r->size)) {
int tail = r->consumer_tail;
if (unlikely(consumer_head >= r->size)) {
r->consumer_tail = 0;
WRITE_ONCE(r->consumer_head, 0);
} else {
r->consumer_tail = consumer_head;
WRITE_ONCE(r->consumer_head, consumer_head);
}
- /* Zero out entries in the reverse order: this way we touch the
*/
- cache line that producer might currently be reading the last;
- producer won't make progress and touch other cache lines
- besides the first one until we write out all entries.
while (likely(head >= r->consumer_tail))
r->queue[head--] = NULL;
r->consumer_tail = consumer_head;
- }
- if (unlikely(consumer_head >= r->size)) {
consumer_head = 0;
r->consumer_tail = 0;
while (likely(--consumer_head >= tail))
r->queue[consumer_head] = NULL;
}return;
- /* matching READ_ONCE in __ptr_ring_empty for lockless tests */ WRITE_ONCE(r->consumer_head, consumer_head); }
On 2021/7/2 14:45, Jason Wang wrote:
在 2021/7/1 下午8:26, Yunsheng Lin 写道:
Currently r->queue[] clearing is done before r->consumer_head updating, which makes the __ptr_ring_empty() returning false positive result(the ring is non-empty, but __ptr_ring_empty() suggest that it is empty) if the checking is done after the r->queue clearing and before the consumer_head moving forward.
Move the r->queue[] clearing after consumer_head moving forward to avoid the above case.
As a side effect of above change, a consumer_head checking is avoided for the likely case, and it has noticeable performance improvement when it is tested using the ptr_ring_test selftest added in the previous patch.
Tested using the "perf stat -r 1000 ./ptr_ring_test -s 1000 -m 1 -N 100000000", comparing the elapsed time:
arch unpatched patched improvement arm64 2.087205 sec 1.888224 sec +9.5% X86 2.6538 sec 2.5422 sec +4.2%
I think we need the number of real workloads here.
As it is a low optimization, and overhead of enqueuing and dequeuing is small for any real workloads, so the performance improvement could be buried in deviation. And that is why the ptr_ring_test is added, the about 10% improvement for arm64 seems big, but note that it is tested using the taskset to avoid the numa effects for arm64.
Anyway, here is the performance data for pktgen in queue_xmit mode + dummy netdev with pfifo_fast(which uses ptr_ring too), which is not obvious to the above data:
threads unpatched unpatched delta 1 3.21Mpps 3.23Mpps +0.6% 2 5.56Mpps 3.59Mpps +0.5% 4 5.58Mpps 5.61Mpps +0.5% 8 2.76Mpps 2.75Mpps -0.3% 16 2.23Mpps 2.22Mpps -0.4%
Thanks
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
After r->consumer_head is updated in __ptr_ring_discard_one(), r->queue[r->consumer_head] is already cleared in the previous round of __ptr_ring_discard_one(). But there is no guarantee other thread will see the r->queue[r->consumer_head] being NULL because there is no explicit barrier between r->queue[] clearing and r->consumer_head updating.
So add two explicit barrier to make sure r->queue[] cleared in __ptr_ring_discard_one() to be visible to other cpu, mainly to make sure the cpu calling the __ptr_ring_empty() will see the correct r->queue[r->consumer_head].
Hopefully the previous and this patch have ensured the correct visibility of r->queue[], so update the comment accordingly about __ptr_ring_empty().
Tested using the "perf stat -r 1000 ./ptr_ring_test -s 1000 -m 1 -N 100000000", comparing the elapsed time:
arch unpatched patched improvement arm64 1.888224 sec 1.893673 sec -0.2% X86 2.5422 sec 2.5587 sec -0.6%
Reported-by: Michael S. Tsirkin mst@redhat.com Signed-off-by: Yunsheng Lin linyunsheng@huawei.com --- include/linux/ptr_ring.h | 29 +++++++++++++++++++---------- 1 file changed, 19 insertions(+), 10 deletions(-)
diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index db9c282..d78aab8 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -178,15 +178,11 @@ static inline void *__ptr_ring_peek(struct ptr_ring *r) * * NB: This is only safe to call if ring is never resized. * - * However, if some other CPU consumes ring entries at the same time, the value - * returned is not guaranteed to be correct. - * - * In this case - to avoid incorrectly detecting the ring - * as empty - the CPU consuming the ring entries is responsible - * for either consuming all ring entries until the ring is empty, - * or synchronizing with some other CPU and causing it to - * re-test __ptr_ring_empty and/or consume the ring enteries - * after the synchronization point. + * caller might need to use the smp_rmb() to pair with smp_wmb() + * or smp_store_release() in __ptr_ring_discard_one() and smp_wmb() + * in __ptr_ring_produce() to ensure correct ordering between + * __ptr_ring_empty() checking and subsequent operation after + * __ptr_ring_empty() checking. * * Note: callers invoking this in a loop must use a compiler barrier, * for example cpu_relax(). @@ -274,7 +270,12 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
if (unlikely(consumer_head >= r->size)) { r->consumer_tail = 0; - WRITE_ONCE(r->consumer_head, 0); + + /* Make sure r->queue[0] ~ r->queue[r->consumer_tail] + * cleared in previous __ptr_ring_discard_one() is + * visible to other cpu. + */ + smp_store_release(&r->consumer_head, 0); } else { r->consumer_tail = consumer_head; WRITE_ONCE(r->consumer_head, consumer_head); @@ -288,6 +289,14 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) while (likely(--consumer_head >= tail)) r->queue[consumer_head] = NULL;
+ if (unlikely(!r->consumer_head)) { + /* Make sure r->queue[r->consumer_tail] ~ + * r->queue[r->size - 1] cleared above is visible to + * other cpu. + */ + smp_wmb(); + } + return; }