From: Linus Torvalds torvalds@linux-foundation.org
stable inclusion from stable-5.10.56 commit 27aa7171fe2b00c3de01e8e3a3298a3639f37fa3 bugzilla: 176004 https://gitee.com/openeuler/kernel/issues/I4DYZ4
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 3a34b13a88caeb2800ab44a4918f230041b37dd9 upstream.
Since commit 1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup logic") we have sanitized the pipe write logic, and would only try to wake up readers if they needed it.
In particular, if the pipe already had data in it before the write, there was no point in trying to wake up a reader, since any existing readers must have been aware of the pre-existing data already. Doing extraneous wakeups will only cause potential thundering herd problems.
However, it turns out that some Android libraries have misused the EPOLL interface, and expected "edge triggered" be to "any new write will trigger it". Even if there was no edge in sight.
Quoting Sandeep Patil: "The commit 1b6b26ae7053 ('pipe: fix and clarify pipe write wakeup logic') changed pipe write logic to wakeup readers only if the pipe was empty at the time of write. However, there are libraries that relied upon the older behavior for notification scheme similar to what's described in [1]
One such library 'realm-core'[2] is used by numerous Android applications. The library uses a similar notification mechanism as GNU Make but it never drains the pipe until it is full. When Android moved to v5.10 kernel, all applications using this library stopped working.
The library has since been fixed[3] but it will be a while before all applications incorporate the updated library"
Our regression rule for the kernel is that if applications break from new behavior, it's a regression, even if it was because the application did something patently wrong. Also note the original report [4] by Michal Kerrisk about a test for this epoll behavior - but at that point we didn't know of any actual broken use case.
So add the extraneous wakeup, to approximate the old behavior.
[ I say "approximate", because the exact old behavior was to do a wakeup not for each write(), but for each pipe buffer chunk that was filled in. The behavior introduced by this change is not that - this is just "every write will cause a wakeup, whether necessary or not", which seems to be sufficient for the broken library use. ]
It's worth noting that this adds the extraneous wakeup only for the write side, while the read side still considers the "edge" to be purely about reading enough from the pipe to allow further writes.
See commit f467a6a66419 ("pipe: fix and clarify pipe read wakeup logic") for the pipe read case, which remains that "only wake up if the pipe was full, and we read something from it".
Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0b... [1] Link: https://github.com/realm/realm-core [2] Link: https://github.com/realm/realm-core/issues/4666 [3] Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXx... [4] Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/ Reported-by: Sandeep Patil sspatil@android.com Cc: Michael Kerrisk mtk.manpages@gmail.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Chen Jun chenjun102@huawei.com Acked-by: Weilong Chen chenweilong@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- fs/pipe.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/pipe.c b/fs/pipe.c index 412b3b618994..6f5f97c0fdee 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -429,20 +429,20 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) #endif
/* - * Only wake up if the pipe started out empty, since - * otherwise there should be no readers waiting. + * Epoll nonsensically wants a wakeup whether the pipe + * was already empty or not. * * If it wasn't empty we try to merge new data into * the last buffer. * * That naturally merges small writes, but it also - * page-aligs the rest of the writes for large writes + * page-aligns the rest of the writes for large writes * spanning multiple pages. */ head = pipe->head; - was_empty = pipe_empty(head, pipe->tail); + was_empty = true; chars = total_len & (PAGE_SIZE-1); - if (chars && !was_empty) { + if (chars && !pipe_empty(head, pipe->tail)) { unsigned int mask = pipe->ring_size - 1; struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask]; int offset = buf->offset + buf->len;