Florian Westphal (7): netfilter: conntrack: tell compiler to not inline nf_ct_resolve_clash netfilter: conntrack: remove two args from resolve_clash netfilter: conntrack: place confirm-bit setting in a helper netfilter: conntrack: split resolve_clash function netfilter: conntrack: allow insertion of clashing entries netfilter: conntrack: do not auto-delete clash entries on reply netfilter: conntrack: fix infinite loop on rmmod
include/linux/rculist_nulls.h | 7 + .../linux/netfilter/nf_conntrack_common.h | 12 +- net/netfilter/nf_conntrack_core.c | 215 +++++++++++++++--- net/netfilter/nf_conntrack_proto_udp.c | 8 +- net/netfilter/nft_flow_offload.c | 2 +- 5 files changed, 206 insertions(+), 38 deletions(-)
反馈: 您发送到kernel@openeuler.org的补丁/补丁集,已成功转换为PR! PR链接地址: https://gitee.com/openeuler/kernel/pulls/2713 邮件列表地址:https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/6...
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/2713 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/6...
From: Florian Westphal fw@strlen.de
mainline inclusion from mainline-v5.5-rc3 commit c7c17e6a03e08860f7a40095643e72c24c3f896b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I63OS1 CVE: NA
--------------------------------
At this time compiler inlines it, but this code will not be executed under normal conditions.
Also, no inlining allows to use "nf_ct_resolve_clash%return" perf probe.
Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sun Shouxin sunshouxin@chinatelecom.cn Signed-off-by: Xibo.Wang wangxb12@chinatelecom.cn Signed-off-by: Lu Wei luwei32@huawei.com --- net/netfilter/nf_conntrack_core.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 34a9d69295d0..c3f84d002891 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -870,9 +870,10 @@ static void nf_ct_acct_merge(struct nf_conn *ct, enum ip_conntrack_info ctinfo, }
/* Resolve race on insertion if this protocol allows this. */ -static int nf_ct_resolve_clash(struct net *net, struct sk_buff *skb, - enum ip_conntrack_info ctinfo, - struct nf_conntrack_tuple_hash *h) +static __cold noinline int +nf_ct_resolve_clash(struct net *net, struct sk_buff *skb, + enum ip_conntrack_info ctinfo, + struct nf_conntrack_tuple_hash *h) { /* This is the conntrack entry already in hashes that won race. */ struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
From: Florian Westphal fw@strlen.de
mainline inclusion from mainline-v5.6-rc3 commit 3d1e0b406de16508de96f4a07fc3f94cfc678372 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I63OS1 CVE: NA
--------------------------------
ctinfo is whats taken from the skb, i.e. ct = nf_ct_get(skb, &ctinfo).
We do not pass 'ct' and instead re-fetch it from the skb. Just do the same for both netns and ctinfo.
Also add a comment on what clash resolution is supposed to do. While at it, one indent level can be removed.
Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sun Shouxin sunshouxin@chinatelecom.cn Signed-off-by: Xibo.Wang wangxb12@chinatelecom.cn
conflict: net/netfilter/nf_conntrack_core.c
Signed-off-by: Lu Wei luwei32@huawei.com --- net/netfilter/nf_conntrack_core.c | 69 +++++++++++++++++++++++-------- 1 file changed, 51 insertions(+), 18 deletions(-)
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index c3f84d002891..99d69e8b3a0b 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -869,31 +869,64 @@ static void nf_ct_acct_merge(struct nf_conn *ct, enum ip_conntrack_info ctinfo, } }
-/* Resolve race on insertion if this protocol allows this. */ +/** + * nf_ct_resolve_clash - attempt to handle clash without packet drop + * + * @skb: skb that causes the clash + * @h: tuplehash of the clashing entry already in table + * + * A conntrack entry can be inserted to the connection tracking table + * if there is no existing entry with an identical tuple. + * + * If there is one, @skb (and the assocated, unconfirmed conntrack) has + * to be dropped. In case @skb is retransmitted, next conntrack lookup + * will find the already-existing entry. + * + * The major problem with such packet drop is the extra delay added by + * the packet loss -- it will take some time for a retransmit to occur + * (or the sender to time out when waiting for a reply). + * + * This function attempts to handle the situation without packet drop. + * + * If @skb has no NAT transformation or if the colliding entries are + * exactly the same, only the to-be-confirmed conntrack entry is discarded + * and @skb is associated with the conntrack entry already in the table. + * + * Returns NF_DROP if the clash could not be resolved. + */ static __cold noinline int -nf_ct_resolve_clash(struct net *net, struct sk_buff *skb, - enum ip_conntrack_info ctinfo, - struct nf_conntrack_tuple_hash *h) +nf_ct_resolve_clash(struct sk_buff *skb, struct nf_conntrack_tuple_hash *h) { /* This is the conntrack entry already in hashes that won race. */ struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h); const struct nf_conntrack_l4proto *l4proto; - enum ip_conntrack_info oldinfo; - struct nf_conn *loser_ct = nf_ct_get(skb, &oldinfo); + enum ip_conntrack_info ctinfo; + struct nf_conn *loser_ct; + struct net *net; + + loser_ct = nf_ct_get(skb, &ctinfo);
l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct)); - if (l4proto->allow_clash && - !nf_ct_is_dying(ct) && - atomic_inc_not_zero(&ct->ct_general.use)) { - if (((ct->status & IPS_NAT_DONE_MASK) == 0) || - nf_ct_match(ct, loser_ct)) { - nf_ct_acct_merge(ct, ctinfo, loser_ct); - nf_conntrack_put(&loser_ct->ct_general); - nf_ct_set(skb, ct, oldinfo); - return NF_ACCEPT; - } - nf_ct_put(ct); + if (!l4proto->allow_clash) + goto drop; + + if (nf_ct_is_dying(ct)) + goto drop; + + if (!atomic_inc_not_zero(&ct->ct_general.use)) + goto drop; + + if (((ct->status & IPS_NAT_DONE_MASK) == 0) || + nf_ct_match(ct, loser_ct)) { + nf_ct_acct_merge(ct, ctinfo, loser_ct); + nf_conntrack_put(&loser_ct->ct_general); + nf_ct_set(skb, ct, ctinfo); + return NF_ACCEPT; } + + nf_ct_put(ct); +drop: + net = nf_ct_net(loser_ct); NF_CT_STAT_INC(net, drop); return NF_DROP; } @@ -1014,7 +1047,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
out: nf_ct_add_to_dying_list(ct); - ret = nf_ct_resolve_clash(net, skb, ctinfo, h); + ret = nf_ct_resolve_clash(skb, h); dying: nf_conntrack_double_unlock(hash, reply_hash); NF_CT_STAT_INC(net, insert_failed);
From: Florian Westphal fw@strlen.de
mainline inclusion from mainline-v5.6-rc3 commit b1b32552c1d81f0cf6a8e79043a2a47e769ff071 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I63OS1 CVE: NA
--------------------------------
... so it can be re-used from clash resolution in followup patch.
Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sun Shouxin sunshouxin@chinatelecom.cn Signed-off-by: Xibo.Wang wangxb12@chinatelecom.cn
conflict: net/netfilter/nf_conntrack_core.c
Signed-off-by: Lu Wei luwei32@huawei.com --- net/netfilter/nf_conntrack_core.c | 30 ++++++++++++++++++------------ 1 file changed, 18 insertions(+), 12 deletions(-)
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 99d69e8b3a0b..02538cbfc2cb 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -869,6 +869,23 @@ static void nf_ct_acct_merge(struct nf_conn *ct, enum ip_conntrack_info ctinfo, } }
+static void __nf_conntrack_insert_prepare(struct nf_conn *ct, struct sk_buff *skb) +{ + struct nf_conn_tstamp *tstamp; + + atomic_inc(&ct->ct_general.use); + ct->status |= IPS_CONFIRMED; + + /* set conntrack timestamp, if enabled. */ + tstamp = nf_conn_tstamp_find(ct); + if (tstamp) { + if (skb->tstamp == 0) + __net_timestamp(skb); + + tstamp->start = ktime_to_ns(skb->tstamp); + } +} + /** * nf_ct_resolve_clash - attempt to handle clash without packet drop * @@ -940,7 +957,6 @@ __nf_conntrack_confirm(struct sk_buff *skb) struct nf_conntrack_tuple_hash *h; struct nf_conn *ct; struct nf_conn_help *help; - struct nf_conn_tstamp *tstamp; struct hlist_nulls_node *n; enum ip_conntrack_info ctinfo; struct net *net; @@ -1017,17 +1033,7 @@ __nf_conntrack_confirm(struct sk_buff *skb) setting time, otherwise we'd get timer wrap in weird delay cases. */ ct->timeout += nfct_time_stamp; - atomic_inc(&ct->ct_general.use); - ct->status |= IPS_CONFIRMED; - - /* set conntrack timestamp, if enabled. */ - tstamp = nf_conn_tstamp_find(ct); - if (tstamp) { - if (skb->tstamp == 0) - __net_timestamp(skb); - - tstamp->start = ktime_to_ns(skb->tstamp); - } + __nf_conntrack_insert_prepare(ct, skb); /* Since the lookup is lockless, hash insertion must be done after * starting the timer and setting the CONFIRMED bit. The RCU barriers * guarantee that no other CPU can find the conntrack before the above
From: Florian Westphal fw@strlen.de
mainline inclusion from mainline-v5.6-rc3 commit bb89abe52bf426f1f40850c441efc77426cc31e1 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I63OS1 CVE: NA
--------------------------------
Followup patch will need a helper function with the 'clashing entries refer to the identical tuple in both directions' resolution logic.
This patch will add another resolve_clash helper where loser_ct must not be added to the dying list because it will be inserted into the table.
Therefore this also moves the stat counters and dying-list insertion of the losing ct.
Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sun Shouxin sunshouxin@chinatelecom.cn Signed-off-by: Xibo.Wang wangxb12@chinatelecom.cn
conflict: net/netfilter/nf_conntrack_core.c
Signed-off-by: Lu Wei luwei32@huawei.com --- net/netfilter/nf_conntrack_core.c | 58 ++++++++++++++++++++++--------- 1 file changed, 41 insertions(+), 17 deletions(-)
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 02538cbfc2cb..7eb3eedc7e3c 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -886,6 +886,39 @@ static void __nf_conntrack_insert_prepare(struct nf_conn *ct, struct sk_buff *sk } }
+static int __nf_ct_resolve_clash(struct sk_buff *skb, + struct nf_conntrack_tuple_hash *h) +{ + /* This is the conntrack entry already in hashes that won race. */ + struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h); + enum ip_conntrack_info ctinfo; + struct nf_conn *loser_ct; + + loser_ct = nf_ct_get(skb, &ctinfo); + + if (nf_ct_is_dying(ct)) + return NF_DROP; + + if (!atomic_inc_not_zero(&ct->ct_general.use)) + return NF_DROP; + + if (((ct->status & IPS_NAT_DONE_MASK) == 0) || + nf_ct_match(ct, loser_ct)) { + struct net *net = nf_ct_net(ct); + + nf_ct_acct_merge(ct, ctinfo, loser_ct); + nf_ct_add_to_dying_list(loser_ct); + nf_conntrack_put(&loser_ct->ct_general); + nf_ct_set(skb, ct, ctinfo); + + NF_CT_STAT_INC(net, insert_failed); + return NF_ACCEPT; + } + + nf_ct_put(ct); + return NF_DROP; +} + /** * nf_ct_resolve_clash - attempt to handle clash without packet drop * @@ -920,31 +953,23 @@ nf_ct_resolve_clash(struct sk_buff *skb, struct nf_conntrack_tuple_hash *h) enum ip_conntrack_info ctinfo; struct nf_conn *loser_ct; struct net *net; + int ret;
loser_ct = nf_ct_get(skb, &ctinfo); + net = nf_ct_net(loser_ct);
l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct)); if (!l4proto->allow_clash) goto drop;
- if (nf_ct_is_dying(ct)) - goto drop; - - if (!atomic_inc_not_zero(&ct->ct_general.use)) - goto drop; - - if (((ct->status & IPS_NAT_DONE_MASK) == 0) || - nf_ct_match(ct, loser_ct)) { - nf_ct_acct_merge(ct, ctinfo, loser_ct); - nf_conntrack_put(&loser_ct->ct_general); - nf_ct_set(skb, ct, ctinfo); - return NF_ACCEPT; - } + ret = __nf_ct_resolve_clash(skb, h); + if (ret == NF_ACCEPT) + return ret;
- nf_ct_put(ct); drop: - net = nf_ct_net(loser_ct); + nf_ct_add_to_dying_list(loser_ct); NF_CT_STAT_INC(net, drop); + NF_CT_STAT_INC(net, insert_failed); return NF_DROP; }
@@ -1013,6 +1038,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
if (unlikely(nf_ct_is_dying(ct))) { nf_ct_add_to_dying_list(ct); + NF_CT_STAT_INC(net, insert_failed); goto dying; }
@@ -1052,11 +1078,9 @@ __nf_conntrack_confirm(struct sk_buff *skb) return NF_ACCEPT;
out: - nf_ct_add_to_dying_list(ct); ret = nf_ct_resolve_clash(skb, h); dying: nf_conntrack_double_unlock(hash, reply_hash); - NF_CT_STAT_INC(net, insert_failed); local_bh_enable(); return ret; }
From: Florian Westphal fw@strlen.de
mainline inclusion from mainline-v5.6-rc3 commit 6a757c07e51f80ac34325fcd558490d2d1439e1b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I63OS1 CVE: NA
--------------------------------
This patch further relaxes the need to drop an skb due to a clash with an existing conntrack entry.
Current clash resolution handles the case where the clash occurs between two identical entries (distinct nf_conn objects with same tuples), i.e.:
Original Reply existing: 10.2.3.4:42 -> 10.8.8.8:53 10.2.3.4:42 <- 10.0.0.6:5353 clashing: 10.2.3.4:42 -> 10.8.8.8:53 10.2.3.4:42 <- 10.0.0.6:5353
... existing handling will discard the unconfirmed clashing entry and makes skb->_nfct point to the existing one. The skb can then be processed normally just as if the clash would not have existed in the first place.
For other clashes, the skb needs to be dropped. This frequently happens with DNS resolvers that send A and AAAA queries back-to-back when NAT rules are present that cause packets to get different DNAT transformations applied, for example:
-m statistics --mode random ... -j DNAT --dnat-to 10.0.0.6:5353 -m statistics --mode random ... -j DNAT --dnat-to 10.0.0.7:5353
In this case the A or AAAA query is dropped which incurs a costly delay during name resolution.
This patch also allows this collision type: Original Reply existing: 10.2.3.4:42 -> 10.8.8.8:53 10.2.3.4:42 <- 10.0.0.6:5353 clashing: 10.2.3.4:42 -> 10.8.8.8:53 10.2.3.4:42 <- 10.0.0.7:5353
In this case, clash is in original direction -- the reply direction is still unique.
The change makes it so that when the 2nd colliding packet is received, the clashing conntrack is tagged with new IPS_NAT_CLASH_BIT, gets a fixed 1 second timeout and is inserted in the reply direction only.
The entry is hidden from 'conntrack -L', it will time out quickly and it can be early dropped because it will never progress to the ASSURED state.
To avoid special-casing the delete code path to special case the ORIGINAL hlist_nulls node, a new helper, "hlist_nulls_add_fake", is added so hlist_nulls_del() will work.
Example:
CPU A: CPU B: 1. 10.2.3.4:42 -> 10.8.8.8:53 (A) 2. 10.2.3.4:42 -> 10.8.8.8:53 (AAAA) 3. Apply DNAT, reply changed to 10.0.0.6 4. 10.2.3.4:42 -> 10.8.8.8:53 (AAAA) 5. Apply DNAT, reply changed to 10.0.0.7 6. confirm/commit to conntrack table, no collisions 7. commit clashing entry
Reply comes in:
10.2.3.4:42 <- 10.0.0.6:5353 (A) -> Finds a conntrack, DNAT is reversed & packet forwarded to 10.2.3.4:42 10.2.3.4:42 <- 10.0.0.7:5353 (AAAA) -> Finds a conntrack, DNAT is reversed & packet forwarded to 10.2.3.4:42 The conntrack entry is deleted from table, as it has the NAT_CLASH bit set.
In case of a retransmit from ORIGINAL dir, all further packets will get the DNAT transformation to 10.0.0.6.
I tried to come up with other solutions but they all have worse problems.
Alternatives considered were: 1. Confirm ct entries at allocation time, not in postrouting. a. will cause uneccesarry work when the skb that creates the conntrack is dropped by ruleset. b. in case nat is applied, ct entry would need to be moved in the table, which requires another spinlock pair to be taken. c. breaks the 'unconfirmed entry is private to cpu' assumption: we would need to guard all nfct->ext allocation requests with ct->lock spinlock.
2. Make the unconfirmed list a hash table instead of a pcpu list. Shares drawback c) of the first alternative.
3. Document this is expected and force users to rearrange their ruleset (e.g. by using "-m cluster" instead of "-m statistics"). nft has the 'jhash' expression which can be used instead of 'numgen'.
Major drawback: doesn't fix what I consider a bug, not very realistic and I believe its reasonable to have the existing rulesets to 'just work'.
4. Document this is expected and force users to steer problematic packets to the same CPU -- this would serialize the "allocate new conntrack entry/nat table evaluation/perform nat/confirm entry", so no race can occur. Similar drawback to 3.
Another advantage of this patch compared to 1) and 2) is that there are no changes to the hot path; things are handled in the udp tracker and the clash resolution path.
Cc: rcu@vger.kernel.org Cc: "Paul E. McKenney" paulmck@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Jozsef Kadlecsik kadlec@netfilter.org Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sun Shouxin sunshouxin@chinatelecom.cn Signed-off-by: Xibo.Wang wangxb12@chinatelecom.cn
conflict: net/netfilter/nf_conntrack_proto_udp.c
Signed-off-by: Lu Wei luwei32@huawei.com --- include/linux/rculist_nulls.h | 7 ++ .../linux/netfilter/nf_conntrack_common.h | 12 ++- net/netfilter/nf_conntrack_core.c | 76 ++++++++++++++++++- net/netfilter/nf_conntrack_proto_udp.c | 16 +++- 4 files changed, 106 insertions(+), 5 deletions(-)
diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h index 61974c4c566b..ffa076d727b4 100644 --- a/include/linux/rculist_nulls.h +++ b/include/linux/rculist_nulls.h @@ -137,6 +137,13 @@ static inline void hlist_nulls_add_tail_rcu(struct hlist_nulls_node *n, } }
+/* after that hlist_nulls_del will work */ +static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n) +{ + n->pprev = &n->next; + n->next = (struct hlist_nulls_node *)NULLS_MARKER(NULL); +} + /** * hlist_nulls_for_each_entry_rcu - iterate over rcu list of given type * @tpos: the type * to use as a loop cursor. diff --git a/include/uapi/linux/netfilter/nf_conntrack_common.h b/include/uapi/linux/netfilter/nf_conntrack_common.h index 336014bf8868..b6f0bb1dc799 100644 --- a/include/uapi/linux/netfilter/nf_conntrack_common.h +++ b/include/uapi/linux/netfilter/nf_conntrack_common.h @@ -97,6 +97,15 @@ enum ip_conntrack_status { IPS_UNTRACKED_BIT = 12, IPS_UNTRACKED = (1 << IPS_UNTRACKED_BIT),
+#ifdef __KERNEL__ + /* Re-purposed for in-kernel use: + * Tags a conntrack entry that clashed with an existing entry + * on insert. + */ + IPS_NAT_CLASH_BIT = IPS_UNTRACKED_BIT, + IPS_NAT_CLASH = IPS_UNTRACKED, +#endif + /* Conntrack got a helper explicitly attached via CT target. */ IPS_HELPER_BIT = 13, IPS_HELPER = (1 << IPS_HELPER_BIT), @@ -110,7 +119,8 @@ enum ip_conntrack_status { */ IPS_UNCHANGEABLE_MASK = (IPS_NAT_DONE_MASK | IPS_NAT_MASK | IPS_EXPECTED | IPS_CONFIRMED | IPS_DYING | - IPS_SEQ_ADJUST | IPS_TEMPLATE | IPS_OFFLOAD), + IPS_SEQ_ADJUST | IPS_TEMPLATE | IPS_UNTRACKED | + IPS_OFFLOAD),
__IPS_MAX_BIT = 15, }; diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 7eb3eedc7e3c..f677400a0e98 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -919,11 +919,71 @@ static int __nf_ct_resolve_clash(struct sk_buff *skb, return NF_DROP; }
+/** + * nf_ct_resolve_clash_harder - attempt to insert clashing conntrack entry + * + * @skb: skb that causes the collision + * @repl_idx: hash slot for reply direction + * + * Called when origin or reply direction had a clash. + * The skb can be handled without packet drop provided the reply direction + * is unique or there the existing entry has the identical tuple in both + * directions. + * + * Caller must hold conntrack table locks to prevent concurrent updates. + * + * Returns NF_DROP if the clash could not be handled. + */ +static int nf_ct_resolve_clash_harder(struct sk_buff *skb, u32 repl_idx) +{ + struct nf_conn *loser_ct = (struct nf_conn *)skb_nfct(skb); + const struct nf_conntrack_zone *zone; + struct nf_conntrack_tuple_hash *h; + struct hlist_nulls_node *n; + struct net *net; + + zone = nf_ct_zone(loser_ct); + net = nf_ct_net(loser_ct); + + /* Reply direction must never result in a clash, unless both origin + * and reply tuples are identical. + */ + hlist_nulls_for_each_entry(h, n, &nf_conntrack_hash[repl_idx], hnnode) { + if (nf_ct_key_equal(h, + &loser_ct->tuplehash[IP_CT_DIR_REPLY].tuple, + zone, net)) + return __nf_ct_resolve_clash(skb, h); + } + + /* We want the clashing entry to go away real soon: 1 second timeout. */ + loser_ct->timeout = nfct_time_stamp + HZ; + + /* IPS_NAT_CLASH removes the entry automatically on the first + * reply. Also prevents UDP tracker from moving the entry to + * ASSURED state, i.e. the entry can always be evicted under + * pressure. + */ + loser_ct->status |= IPS_FIXED_TIMEOUT | IPS_NAT_CLASH; + + __nf_conntrack_insert_prepare(loser_ct, skb); + + /* fake add for ORIGINAL dir: we want lookups to only find the entry + * already in the table. This also hides the clashing entry from + * ctnetlink iteration, i.e. conntrack -L won't show them. + */ + hlist_nulls_add_fake(&loser_ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); + + hlist_nulls_add_head_rcu(&loser_ct->tuplehash[IP_CT_DIR_REPLY].hnnode, + &nf_conntrack_hash[repl_idx]); + return NF_ACCEPT; +} + /** * nf_ct_resolve_clash - attempt to handle clash without packet drop * * @skb: skb that causes the clash * @h: tuplehash of the clashing entry already in table + * @hash_reply: hash slot for reply direction * * A conntrack entry can be inserted to the connection tracking table * if there is no existing entry with an identical tuple. @@ -942,10 +1002,18 @@ static int __nf_ct_resolve_clash(struct sk_buff *skb, * exactly the same, only the to-be-confirmed conntrack entry is discarded * and @skb is associated with the conntrack entry already in the table. * + * Failing that, the new, unconfirmed conntrack is still added to the table + * provided that the collision only occurs in the ORIGINAL direction. + * The new entry will be added after the existing one in the hash list, + * so packets in the ORIGINAL direction will continue to match the existing + * entry. The new entry will also have a fixed timeout so it expires -- + * due to the collision, it will not see bidirectional traffic. + * * Returns NF_DROP if the clash could not be resolved. */ static __cold noinline int -nf_ct_resolve_clash(struct sk_buff *skb, struct nf_conntrack_tuple_hash *h) +nf_ct_resolve_clash(struct sk_buff *skb, struct nf_conntrack_tuple_hash *h, + u32 reply_hash) { /* This is the conntrack entry already in hashes that won race. */ struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h); @@ -966,6 +1034,10 @@ nf_ct_resolve_clash(struct sk_buff *skb, struct nf_conntrack_tuple_hash *h) if (ret == NF_ACCEPT) return ret;
+ ret = nf_ct_resolve_clash_harder(skb, reply_hash); + if (ret == NF_ACCEPT) + return ret; + drop: nf_ct_add_to_dying_list(loser_ct); NF_CT_STAT_INC(net, drop); @@ -1078,7 +1150,7 @@ __nf_conntrack_confirm(struct sk_buff *skb) return NF_ACCEPT;
out: - ret = nf_ct_resolve_clash(skb, h); + ret = nf_ct_resolve_clash(skb, h, reply_hash); dying: nf_conntrack_double_unlock(hash, reply_hash); local_bh_enable(); diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c index 31e2b85f8f17..f4cc96261c67 100644 --- a/net/netfilter/nf_conntrack_proto_udp.c +++ b/net/netfilter/nf_conntrack_proto_udp.c @@ -37,6 +37,18 @@ static unsigned int *udp_get_timeouts(struct net *net) return nf_udp_pernet(net)->timeouts; }
+static void nf_conntrack_udp_refresh_unreplied(struct nf_conn *ct, + struct sk_buff *skb, + enum ip_conntrack_info ctinfo, + u32 extra_jiffies) +{ + if (unlikely(ctinfo == IP_CT_ESTABLISHED_REPLY && + ct->status & IPS_NAT_CLASH)) + nf_ct_kill(ct); + else + nf_ct_refresh_acct(ct, ctinfo, skb, extra_jiffies); +} + /* Returns verdict for packet, and may modify conntracktype */ static int udp_packet(struct nf_conn *ct, const struct sk_buff *skb, @@ -58,8 +70,8 @@ static int udp_packet(struct nf_conn *ct, if (!test_and_set_bit(IPS_ASSURED_BIT, &ct->status)) nf_conntrack_event_cache(IPCT_ASSURED, ct); } else { - nf_ct_refresh_acct(ct, ctinfo, skb, - timeouts[UDP_CT_UNREPLIED]); + nf_conntrack_udp_refresh_unreplied(ct, skb, ctinfo, + timeouts[UDP_CT_UNREPLIED]); } return NF_ACCEPT; }
From: Florian Westphal fw@strlen.de
mainline inclusion from mainline-v5.9-rc4 commit c46172147ebbeb70094db48d76ab7945d96c638b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I63OS1 CVE: NA
--------------------------------
Its possible that we have more than one packet with the same ct tuple simultaneously, e.g. when an application emits n packets on same UDP socket from multiple threads.
NAT rules might be applied to those packets. With the right set of rules, n packets will be mapped to m destinations, where at least two packets end up with the same destination.
When this happens, the existing clash resolution may merge the skb that is processed after the first has been received with the identical tuple already in hash table.
However, its possible that this identical tuple is a NAT_CLASH tuple. In that case the second skb will be sent, but no reply can be received since the reply that is processed first removes the NAT_CLASH tuple.
Do not auto-delete, this gives a 1 second window for replies to be passed back to originator.
Packets that are coming later (udp stream case) will not be affected: they match the original ct entry, not a NAT_CLASH one.
Also prevent NAT_CLASH entries from getting offloaded.
Fixes: 6a757c07e51f ("netfilter: conntrack: allow insertion of clashing entries") Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org
conflict: net/netfilter/nf_conntrack_proto_udp.c
Signed-off-by: Lu Wei luwei32@huawei.com --- net/netfilter/nf_conntrack_proto_udp.c | 20 ++++++-------------- net/netfilter/nft_flow_offload.c | 2 +- 2 files changed, 7 insertions(+), 15 deletions(-)
diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c index f4cc96261c67..83ad9f53afe3 100644 --- a/net/netfilter/nf_conntrack_proto_udp.c +++ b/net/netfilter/nf_conntrack_proto_udp.c @@ -37,18 +37,6 @@ static unsigned int *udp_get_timeouts(struct net *net) return nf_udp_pernet(net)->timeouts; }
-static void nf_conntrack_udp_refresh_unreplied(struct nf_conn *ct, - struct sk_buff *skb, - enum ip_conntrack_info ctinfo, - u32 extra_jiffies) -{ - if (unlikely(ctinfo == IP_CT_ESTABLISHED_REPLY && - ct->status & IPS_NAT_CLASH)) - nf_ct_kill(ct); - else - nf_ct_refresh_acct(ct, ctinfo, skb, extra_jiffies); -} - /* Returns verdict for packet, and may modify conntracktype */ static int udp_packet(struct nf_conn *ct, const struct sk_buff *skb, @@ -66,12 +54,16 @@ static int udp_packet(struct nf_conn *ct, if (test_bit(IPS_SEEN_REPLY_BIT, &ct->status)) { nf_ct_refresh_acct(ct, ctinfo, skb, timeouts[UDP_CT_REPLIED]); + + /* never set ASSURED for IPS_NAT_CLASH, they time out soon */ + if (unlikely((ct->status & IPS_NAT_CLASH))) + return NF_ACCEPT; + /* Also, more likely to be important, and not a probe */ if (!test_and_set_bit(IPS_ASSURED_BIT, &ct->status)) nf_conntrack_event_cache(IPCT_ASSURED, ct); } else { - nf_conntrack_udp_refresh_unreplied(ct, skb, ctinfo, - timeouts[UDP_CT_UNREPLIED]); + nf_ct_refresh_acct(ct, ctinfo, skb, timeouts[UDP_CT_UNREPLIED]); } return NF_ACCEPT; } diff --git a/net/netfilter/nft_flow_offload.c b/net/netfilter/nft_flow_offload.c index 166edea0e452..df545267b933 100644 --- a/net/netfilter/nft_flow_offload.c +++ b/net/netfilter/nft_flow_offload.c @@ -100,7 +100,7 @@ static void nft_flow_offload_eval(const struct nft_expr *expr, }
if (nf_ct_ext_exist(ct, NF_CT_EXT_HELPER) || - ct->status & IPS_SEQ_ADJUST) + ct->status & (IPS_SEQ_ADJUST | IPS_NAT_CLASH)) goto out;
if (!nf_ct_is_confirmed(ct))
From: Florian Westphal fw@strlen.de
mainline inclusion from mainline-v5.7-rc6 commit 54ab49fde95605a1077f759ce454d94e84b5ca45 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I63OS1 CVE: NA
--------------------------------
'rmmod nf_conntrack' can hang forever, because the netns exit gets stuck in nf_conntrack_cleanup_net_list():
i_see_dead_people: busy = 0; list_for_each_entry(net, net_exit_list, exit_list) { nf_ct_iterate_cleanup(kill_all, net, 0, 0); if (atomic_read(&net->ct.count) != 0) busy = 1; } if (busy) { schedule(); goto i_see_dead_people; }
When nf_ct_iterate_cleanup iterates the conntrack table, all nf_conn structures can be found twice: once for the original tuple and once for the conntracks reply tuple.
get_next_corpse() only calls the iterator when the entry is in original direction -- the idea was to avoid unneeded invocations of the iterator callback.
When support for clashing entries was added, the assumption that all nf_conn objects are added twice, once in original, once for reply tuple no longer holds -- NF_CLASH_BIT entries are only added in the non-clashing reply direction.
Thus, if at least one NF_CLASH entry is in the list then nf_conntrack_cleanup_net_list() always skips it completely.
During normal netns destruction, this causes a hang of several seconds, until the gc worker removes the entry (NF_CLASH entries always have a 1 second timeout).
But in the rmmod case, the gc worker has already been stopped, so ct.count never becomes 0.
We can fix this in two ways:
1. Add a second test for CLASH_BIT and call iterator for those entries as well, or: 2. Skip the original tuple direction and use the reply tuple.
2) is simpler, so do that.
Fixes: 6a757c07e51f80ac ("netfilter: conntrack: allow insertion of clashing entries") Reported-by: Chen Yi yiche@redhat.com Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Lu Wei luwei32@huawei.com --- net/netfilter/nf_conntrack_core.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index f677400a0e98..6bfdc8f6e961 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -2059,8 +2059,19 @@ get_next_corpse(int (*iter)(struct nf_conn *i, void *data), nf_conntrack_lock(lockp); if (*bucket < nf_conntrack_htable_size) { hlist_nulls_for_each_entry(h, n, &nf_conntrack_hash[*bucket], hnnode) { - if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL) + if (NF_CT_DIRECTION(h) != IP_CT_DIR_REPLY) continue; + /* All nf_conn objects are added to hash table twice, one + * for original direction tuple, once for the reply tuple. + * + * Exception: In the IPS_NAT_CLASH case, only the reply + * tuple is added (the original tuple already existed for + * a different object). + * + * We only need to call the iterator once for each + * conntrack, so we just use the 'reply' direction + * tuple while iterating. + */ ct = nf_ct_tuplehash_to_ctrack(h); if (iter(ct, data)) goto found;