On Wed, Aug 25, 2021 at 9:29 AM David Ahern <dsahern@gmail.com> wrote:
On 8/23/21 8:04 AM, Eric Dumazet wrote:
It seems PAGE_ALLOC_COSTLY_ORDER is mostly related to pcp pages, OOM, memory compaction and memory isolation. Since the test system has a lot of memory installed (about 500G, of which only 3-4G is used), I used the below patch to test the maximum possible performance improvement of making TCP frags twice as big. Throughput of a one-thread iperf TCP flow in IOMMU strict mode went from about 30Gbit to 32Gbit.
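(The patch itself is not quoted here; as a rough illustration, and only a guess at what was actually tested, doubling the frag size against the stock definitions would be a change of this shape:

--- a/include/linux/mmzone.h (illustrative)
-#define PAGE_ALLOC_COSTLY_ORDER 3
+#define PAGE_ALLOC_COSTLY_ORDER 4

--- a/net/core/sock.c (illustrative)
-#define SKB_FRAG_PAGE_ORDER	get_order(32768)
+#define SKB_FRAG_PAGE_ORDER	get_order(65536)
)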
This is encouraging, and means we can do much better.
Even with SKB_FRAG_PAGE_ORDER set to 4, typical skbs will need 3 mappings:
- One for the headers (in skb->head)
- Two page frags, because one TSO packet payload is not a nice power-of-two.
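Rough arithmetic, assuming 4KB base pages:

        frag size  = (1 << SKB_FRAG_PAGE_ORDER) * 4KB = 16 * 4KB = 64KB
        TSO payload ~= 64KB minus protocol headers, and successive
        payloads are packed back-to-back in the frag cache, so a
        payload usually straddles a frag boundary   -> 2 frags
        plus the mapping for skb->head              -> 3 mappings total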
Interesting observation. I have noticed 17 with the ZC API. That might explain the less-than-expected performance bump with IOMMU strict mode.
Note that if the application is using huge pages, things get better after
commit 394fcd8a813456b3306c423ec4227ed874dfc08b
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Aug 20 08:43:59 2020 -0700

net: zerocopy: combine pages in zerocopy_sg_from_iter()
Currently, tcp sendmsg(MSG_ZEROCOPY) is building skbs with order-0 fragments. Compared to standard sendmsg(), these skbs usually contain up to 16 fragments on arches with 4KB page sizes, instead of two.
This adds considerable costs on various ndo_start_xmit() handlers, especially when IOMMU is in the picture.
As high performance applications are often using huge pages, we can try to combine adjacent pages belonging to same compound page.
Tested on AMD Rome platform, with IOMMU, nominal single TCP flow speed is roughly doubled (~55Gbit -> ~100Gbit), when user application is using hugepages.
For reference, nominal single TCP flow speed on this platform without MSG_ZEROCOPY is ~65Gbit.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
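A minimal sketch of the combining idea from that commit; sketch_fill_frags() is a made-up name, and the real __zerocopy_sg_from_iter() is more careful about reference counting and tail-page offsets:

#include <linux/mm.h>
#include <linux/skbuff.h>

static void sketch_fill_frags(struct sk_buff *skb, struct page **pages,
                              int npages, size_t off, size_t len)
{
        int i = 0;

        while (i < npages && len) {
                struct page *head = compound_head(pages[i]);
                size_t chunk = min_t(size_t, len, PAGE_SIZE - off);
                int j = i + 1, k;

                /* extend the frag while the next page is the physically
                 * adjacent page of the same compound (huge) page
                 */
                while (j < npages && chunk < len &&
                       pages[j] == pages[j - 1] + 1 &&
                       compound_head(pages[j]) == head) {
                        chunk = min_t(size_t, len, chunk + PAGE_SIZE);
                        j++;
                }

                skb_fill_page_desc(skb, skb_shinfo(skb)->nr_frags,
                                   pages[i], off, chunk);

                /* every page came with a reference from
                 * iov_iter_get_pages(); keep one for the frag and drop
                 * those we merged (simplified vs the real code)
                 */
                for (k = i + 1; k < j; k++)
                        put_page(pages[k]);

                len -= chunk;
                off = 0;
                i = j;
        }
}

With 2MB huge pages, the 16 order-0 frags of a 64KB payload collapse into one or two frags, which is where the doubled throughput quoted above comes from.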
Ideally the gup stuff should really directly deal with hugepages, so that we avoid all these crazy refcounting games on the per-huge-page central refcount.
The first issue can be addressed using a piece of coherent memory (128 or 256 bytes per entry in the TX ring). Copying the headers into this area avoids one IOMMU mapping per packet and improves IOTLB hits, because all slots of the TX ring buffer use one single IOTLB slot.
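A rough sketch of this first idea, not taken from any existing driver; my_ring, HDR_SLOT_SIZE and the helpers are made up for illustration:

#include <linux/dma-mapping.h>
#include <linux/skbuff.h>

#define HDR_SLOT_SIZE 256       /* 128 or 256 bytes per TX entry */

struct my_ring {
        void *hdr_cpu;          /* size * HDR_SLOT_SIZE bytes */
        dma_addr_t hdr_dma;     /* one mapping, one IOTLB entry */
        unsigned int size;      /* number of TX descriptors */
};

static int my_ring_alloc_hdrs(struct device *dev, struct my_ring *r)
{
        /* one coherent allocation covering every slot of the ring */
        r->hdr_cpu = dma_alloc_coherent(dev, r->size * HDR_SLOT_SIZE,
                                        &r->hdr_dma, GFP_KERNEL);
        return r->hdr_cpu ? 0 : -ENOMEM;
}

/* at xmit time: copy the headers instead of dma_map'ing skb->head;
 * hlen is assumed <= HDR_SLOT_SIZE
 */
static dma_addr_t my_ring_push_hdr(struct my_ring *r, unsigned int slot,
                                   const struct sk_buff *skb,
                                   unsigned int hlen)
{
        memcpy(r->hdr_cpu + slot * HDR_SLOT_SIZE, skb->data, hlen);
        return r->hdr_dma + slot * HDR_SLOT_SIZE;
}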
The second issue can be solved by tweaking skb_page_frag_refill() a bit to accept an additional parameter, so that the whole skb payload fits in a single order-4 page.
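Something along these lines, as a sketch only; skb_page_frag_refill_whole() and its exact semantics are a guess at the tweak, not an existing kernel function:

#include <linux/mm.h>
#include <linux/skbuff.h>

static bool skb_page_frag_refill_whole(unsigned int sz, unsigned int whole,
                                       struct page_frag *pfrag, gfp_t gfp)
{
        /* if the leftover room in the current frag cannot hold the
         * whole payload contiguously, give the tail up and start a
         * fresh order-SKB_FRAG_PAGE_ORDER frag, instead of splitting
         * the payload across two frags (and two IOMMU mappings)
         */
        if (pfrag->page && pfrag->size - pfrag->offset < whole) {
                put_page(pfrag->page);
                pfrag->page = NULL;
        }
        return skb_page_frag_refill(sz, pfrag, gfp);
}

The trade-off is some wasted tail space in each frag in exchange for one mapping per TSO packet payload.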