[Linuxarm] Re: [PATCH net v2] net: sched: add barrier to ensure correct ordering for lockless qdisc

19 Jun 2021

      On Fri, Jun 18, 2021 at 05:38:37PM -0700, Jakub Kicinski wrote:
...
On Fri, 18 Jun 2021 17:30:47 -0700 Jakub Kicinski wrote:
...
On Thu, 17 Jun 2021 09:04:14 +0800 Yunsheng Lin wrote:
...
The spin_trylock() was assumed to contain the implicit
barrier needed to ensure the correct ordering between
STATE_MISSED setting/clearing and STATE_MISSED checking
in commit a90c57f2cedd ("net: sched: fix packet stuck
problem for lockless qdisc").
But it turns out that spin_trylock() only has load-acquire
semantic, for strongly-ordered system(like x86), the compiler
barrier implicitly contained in spin_trylock() seems enough
to ensure the correct ordering. But for weakly-orderly system
(like arm64), the store-release semantic is needed to ensure
the correct ordering as clear_bit() and test_bit() is store
operation, see queued_spin_lock().
So add the explicit barrier to ensure the correct ordering
for the above case.
Fixes: a90c57f2cedd ("net: sched: fix packet stuck problem for lockless qdisc")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Actually.. do we really need the _before_atomic() barrier?
I'd think we only need to make sure we re-check the lock 
after we set the bit, ordering of the first check doesn't 
matter.
When debugging pointed to the misordering between STATE_MISSED
setting/clearing and STATE_MISSED checking, only _after_atomic()
was added first, and it did not fix the misordering problem,
when both _before_atomic() and _after_atomic() were added, the
misordering problem disappeared.

I suppose _before_atomic() matters because the STATE_MISSED
setting and the lock rechecking is only done when first check of
STATE_MISSED returns false. _before_atomic() is used to make sure
the first check returns correct result, if it does not return the
correct result, then we may have misordering problem too.

     cpu0                        cpu1
                              clear MISSED
                             _after_atomic()
                                dequeue
    enqueue
 first trylock() #false
  MISSED check #*true* ?

As above, even cpu1 has a _after_atomic() between clearing
STATE_MISSED and dequeuing, we might stiil need a barrier to
prevent cpu0 doing speculative MISSED checking before cpu1
clearing MISSED?

And the implicit load-acquire barrier contained in the first
trylock() does not seems to prevent the above case too.

And there is no load-acquire barrier in pfifo_fast_dequeue()
too, which possibly make the above case more likely to happen.