hulk inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/I8MV01
--------------------------------
Using the UnixBench test suite, the perf tool clearly shows that
osq_lock() causes extremely high overhead in the File Copy items:
Overhead  Shared Object  Symbol
  94.25%  [kernel]       [k] osq_lock
   0.74%  [kernel]       [k] rwsem_spin_on_owner
   0.32%  [kernel]       [k] filemap_get_read_batch
In response to this, we conducted an analysis and made some gains:
In its prologue, osq_lock() sets the `cpu` member of the percpu struct
optimistic_spin_node to the local cpu id. After that first assignment the
value never actually changes, so we can regard the `cpu` member as a
constant.
Meanwhile, the other members of the percpu struct (next, prev and locked)
are frequently modified by osq_lock() and osq_unlock(), which are called
by rwsem, mutex and so on. Because they share a cache line with the cpu
member, each of those writes also invalidates the cached copy of the cpu
member on other CPUs.
Therefore, we place padding here to split them into different cache
lines. This avoids cache misses when the next CPU in the queue is spinning
and checks another node's cpu member via vcpu_is_preempted().
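The layout effect described above can be checked in user space. The
following is only a sketch, not kernel code: it assumes 64-byte cache
lines and an object that starts on a cache-line boundary, uses C11
alignas as a stand-in for ____cacheline_aligned, and the names
node_packed, node_padded and same_cache_line() are hypothetical.

```c
#include <assert.h>   /* static_assert */
#include <stdalign.h> /* alignas */
#include <stddef.h>   /* offsetof, size_t */

/* Assumption: 64-byte cache lines; the kernel uses SMP_CACHE_BYTES. */
#define CACHE_LINE 64

/* Layout before the patch: cpu sits next to the frequently written fields. */
struct node_packed {
	struct node_packed *next, *prev;
	int locked;
	int cpu;
};

/* Layout after the patch: cpu starts on its own cache line. */
struct node_padded {
	struct node_padded *next, *prev;
	int locked;
	alignas(CACHE_LINE) int cpu;
};

/* Return 1 if two byte offsets into an aligned object share a cache line. */
static inline int same_cache_line(size_t a, size_t b)
{
	return a / CACHE_LINE == b / CACHE_LINE;
}

/* Compile-time checks of the layout claims. */
static_assert(offsetof(struct node_padded, cpu) % CACHE_LINE == 0,
	      "cpu must start a cache line");
static_assert(sizeof(struct node_padded) == 2 * CACHE_LINE,
	      "padded node spans exactly two cache lines");
```

With these definitions, same_cache_line() is true for the offsets of
locked and cpu in struct node_packed and false for the same pair in
struct node_padded. The cost is size: under the stated assumptions the
padded node occupies two cache lines instead of part of one.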
The UnixBench full-core test results are provided below:
Machine: Intel(R) Xeon(R) Gold 6248 CPU, 40 cores, 80 threads
Run the command "./Run -c 80 -i 3" 10 times and take the average.
System Benchmarks Index Values         Without Patch  With Patch    Diff
Dhrystone 2 using register variables       185876.43   185945.41   0.04%
Double-Precision Whetstone                  79637.27    79659.29   0.03%
Execl Throughput                             9909.61    10576.06   6.73%
File Copy 1024 bufsize 2000 maxblocks        1723.01     2086.08  21.07%
File Copy 256 bufsize 500 maxblocks          1150.24     1338.21  16.34%
File Copy 4096 bufsize 8000 maxblocks        3719.19     4011.99   7.87%
Pipe Throughput                             66184.84    66025.25  -0.24%
Pipe-based Context Switching                30606.18    31074.21   1.53%
Process Creation                             9442.48     9450.77   0.09%
Shell Scripts (1 concurrent)                44526.52    46548.54   4.54%
Shell Scripts (8 concurrent)                42903.96    45718.56   6.56%
System Call Overhead                         3645.20     3717.42   1.98%
                                            ========
System Benchmarks Index Score               15126.87    15931.29   5.32%
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 include/linux/osq_lock.h  | 2 +-
 kernel/locking/osq_lock.c | 8 +++++++-
 2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 5581dbd3bd34..deb90ad5f560 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -9,7 +9,7 @@
 struct optimistic_spin_node {
 	struct optimistic_spin_node *next, *prev;
 	int locked; /* 1 if lock acquired */
-	int cpu; /* encoded CPU # + 1 value */
+	int cpu ____cacheline_aligned; /* encoded CPU # + 1 value */
 };

 struct optimistic_spin_queue {
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 1de006ed3aa8..4fa8f3b9e2a1 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -96,7 +96,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)

 	node->locked = 0;
 	node->next = NULL;
-	node->cpu = curr;
+	/*
+	 * After this cpu member is initialized for the first time, it
+	 * would no longer change in fact. That could avoid cache misses
+	 * when spin and access the cpu member by other CPUs.
+	 */
+	if (node->cpu != curr)
+		node->cpu = curr;

 	/*
 	 * We need both ACQUIRE (pairs with corresponding RELEASE in