[PATCH openEuler-1.0-LTS V3 0/6] Fix the problem that the number of tcp timeout retransmissions is lost

List overview All Threads
Download

newer

older

Re: [PATCH openEuler-21.03]...

[PATCH openEuler-1.0-LTS V2 0/6]...

Laibin Qiu

29 Oct 2021 29 Oct '21

11:21 a.m.

issue: https://gitee.com/openeuler/kernel/issues/I4AFRJ?from=project-issue

Eric Dumazet (4): tcp: switch tcp and sch_fq to new earliest departure time model net_sched: sch_fq: ensure maxrate fq parameter applies to EDT flows tcp: address problems caused by EDT misshaps tcp: adjust rto_base in retransmits_timed_out()

Yuchung Cheng (2): tcp: always set retrans_stamp on recovery tcp: create a helper to model exponential backoff

net/ipv4/tcp_bbr.c | 7 +++-- net/ipv4/tcp_input.c | 17 +++++++----- net/ipv4/tcp_output.c | 31 +++++++++++++++------ net/ipv4/tcp_timer.c | 64 ++++++++++++++++++++----------------------- net/sched/sch_fq.c | 46 ++++++++++++++++++------------- 5 files changed, 93 insertions(+), 72 deletions(-)

-- 2.22.0

Show replies by date

Laibin Qiu

29 Oct 29 Oct

11:21 a.m.

New subject: [PATCH openEuler-1.0-LTS V3 1/6] tcp: switch tcp and sch_fq to new earliest departure time model

From: Eric Dumazet edumazet@google.com

mainline inclusion from mainline-v4.20-rc1 commit ab408b6dc7449c0f791e9e5f8de72fa7428584f2 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4AFRJ?from=project-issue CVE: NA

------------------------------------------------------------

TCP keeps track of tcp_wstamp_ns by itself, meaning sch_fq no longer has to do it.

Thanks to this model, TCP can get more accurate RTT samples, since pacing no longer inflates them.

This has the nice effect of removing some delays caused by FQ quantum mechanism, causing inflated max/P99 latencies.

Also we might relax TCP Small Queue tight limits in the future, since this new model allow TCP to build bigger batches, since sch_fq (or a device with earliest departure time offload) ensure these packets will be delivered on time.

Note that other protocols are not converted (they will probably never be) so sch_fq has still support for SO_MAX_PACING_RATE

Tested:

Test showing FQ pacing quantum artifact for low-rate flows, adding unexpected throttles for RPC flows, inflating max and P99 latencies.

The parameters chosen here are to show what happens typically when a TCP flow has a reduced pacing rate (this can be caused by a reduced cwin after few losses, or/and rtt above few ms)

MIBS="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY" Before : $ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds 19,82.78,5279,3825,482.02

After : $ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds 20,49.94,128,63,3.18

Signed-off-by: Eric Dumazet edumazet@google.com Signed-off-by: David S. Miller davem@davemloft.net conflict: net/ipv4/tcp_output.c Signed-off-by: Jiazhenyuan jiazhenyuan@uniontech.com #openEuler_contributor Signed-off-by: Laibin Qiu qiulaibin@huawei.com --- net/ipv4/tcp_bbr.c | 7 ++++--- net/ipv4/tcp_output.c | 22 ++++++++++++++++++---- net/sched/sch_fq.c | 21 +++++++++++---------- 3 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c index 1740de053..095effcab 100644 --- a/net/ipv4/tcp_bbr.c +++ b/net/ipv4/tcp_bbr.c @@ -136,6 +136,9 @@ static const u32 bbr_probe_rtt_mode_ms = 200; /* Skip TSO below the following bandwidth (bits/sec): */ static const int bbr_min_tso_rate = 1200000;

+/* Pace at ~1% below estimated bw, on average, to reduce queue at bottleneck. */ +static const int bbr_pacing_marging_percent = 1; + /* We use a high_gain value of 2/ln(2) because it's the smallest pacing gain * that will allow a smoothly increasing pacing rate that will double each RTT * and send the same number of packets per RTT that an un-paced, slow-starting @@ -235,12 +238,10 @@ static u64 bbr_rate_bytes_per_sec(struct sock *sk, u64 rate, int gain) { unsigned int mss = tcp_sk(sk)->mss_cache;

- if (!tcp_needs_internal_pacing(sk)) - mss = tcp_mss_to_mtu(sk, mss); rate *= mss; rate *= gain; rate >>= BBR_SCALE; - rate *= USEC_PER_SEC; + rate *= USEC_PER_SEC / 100 * (100 - bbr_pacing_marging_percent); return rate >> BW_SCALE; }

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 97b9d671a..7d174fa21 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1038,9 +1038,23 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb) sock_hold(sk); }

-static void tcp_update_skb_after_send(struct tcp_sock *tp, struct sk_buff *skb) +static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb) { + struct tcp_sock *tp = tcp_sk(sk); + skb->skb_mstamp = tp->tcp_mstamp; + if (sk->sk_pacing_status != SK_PACING_NONE) { + u32 rate = sk->sk_pacing_rate; + + /* Original sch_fq does not pace first 10 MSS + * Note that tp->data_segs_out overflows after 2^32 packets, + * this is a minor annoyance. + */ + if (rate != ~0U && rate && tp->data_segs_out >= 10) { + tp->tcp_mstamp += div_u64((u64)skb->len * NSEC_PER_SEC, rate); + /* TODO: update internal pacing here */ + } + } list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue); }

@@ -1205,7 +1219,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, err = net_xmit_eval(err); } if (!err && oskb) { - tcp_update_skb_after_send(tp, oskb); + tcp_update_skb_after_send(sk, oskb); tcp_rate_skb_sent(sk, oskb); } return err; @@ -2387,7 +2401,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,

if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) { /* "skb_mstamp" is used as a start point for the retransmit timer */ - tcp_update_skb_after_send(tp, skb); + tcp_update_skb_after_send(sk, skb); goto repair; /* Skip network transmission */ }

@@ -2978,7 +2992,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs) } tcp_skb_tsorted_restore(skb);

if (!err) { - tcp_update_skb_after_send(tp, skb); + tcp_update_skb_after_send(sk, skb); tcp_rate_skb_sent(sk, skb); } } else { diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c index ba60a8dd5..41cb94791 100644 --- a/net/sched/sch_fq.c +++ b/net/sched/sch_fq.c @@ -491,11 +491,16 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch) }

skb = f->head; - if (unlikely(skb && now < f->time_next_packet && - !skb_is_tcp_pure_ack(skb))) { - head->first = f->next; - fq_flow_set_throttled(q, f); - goto begin; + if (skb && !skb_is_tcp_pure_ack(skb)) { + u64 time_next_packet = max_t(u64, ktime_to_ns(skb->tstamp), + f->time_next_packet); + + if (now < time_next_packet) { + head->first = f->next; + f->time_next_packet = time_next_packet; + fq_flow_set_throttled(q, f); + goto begin; + } }

skb = fq_dequeue_head(sch, f); @@ -513,11 +518,7 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch) prefetch(&skb->end); f->credit -= qdisc_pkt_len(skb);

- if (!q->rate_enable) - goto out; - - /* Do not pace locally generated ack packets */ - if (skb_is_tcp_pure_ack(skb)) + if (ktime_to_ns(skb->tstamp) || !q->rate_enable) goto out;

rate = q->flow_max_rate;

-- 2.22.0

Laibin Qiu

11:21 a.m.

New subject: [PATCH openEuler-1.0-LTS V3 2/6] net_sched: sch_fq: ensure maxrate fq parameter applies to EDT flows

From: Eric Dumazet edumazet@google.com

mainline inclusion from mainline-v4.20-rc4 commit 08e14fe429a07475ee9f29a283945d602e4a6d92 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4AFRJ?from=project-issue CVE: NA

------------------------------------------------------------

When EDT conversion happened, fq lost the ability to enfore a maxrate for all flows. It kept it for non EDT flows.

This commit restores the functionality.

Tested:

tc qd replace dev eth0 root fq maxrate 500Mbit netperf -P0 -H host -- -O THROUGHPUT 489.75

Fixes: ab408b6dc744 ("tcp: switch tcp and sch_fq to new earliest departure time model") Signed-off-by: Eric Dumazet edumazet@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Jiazhenyuan jiazhenyuan@uniontech.com #openEuler_contributor Signed-off-by: Laibin Qiu qiulaibin@huawei.com --- net/sched/sch_fq.c | 29 ++++++++++++++++++----------- 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c index 41cb94791..64d7f11b0 100644 --- a/net/sched/sch_fq.c +++ b/net/sched/sch_fq.c @@ -516,22 +516,29 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch) goto begin; } prefetch(&skb->end); - f->credit -= qdisc_pkt_len(skb); + plen = qdisc_pkt_len(skb); + f->credit -= plen;

- if (ktime_to_ns(skb->tstamp) || !q->rate_enable) + if (!q->rate_enable) goto out;

rate = q->flow_max_rate; - if (skb->sk) - rate = min(skb->sk->sk_pacing_rate, rate);

- if (rate <= q->low_rate_threshold) { - f->credit = 0; - plen = qdisc_pkt_len(skb); - } else { - plen = max(qdisc_pkt_len(skb), q->quantum); - if (f->credit > 0) - goto out; + /* If EDT time was provided for this skb, we need to + * update f->time_next_packet only if this qdisc enforces + * a flow max rate. + */ + if (!skb->tstamp) { + if (skb->sk) + rate = min(skb->sk->sk_pacing_rate, rate); + + if (rate <= q->low_rate_threshold) { + f->credit = 0; + } else { + plen = max(plen, q->quantum); + if (f->credit > 0) + goto out; + } } if (rate != ~0U) { u64 len = (u64)plen * NSEC_PER_SEC;

-- 2.22.0

Laibin Qiu

11:21 a.m.

New subject: [PATCH openEuler-1.0-LTS V3 3/6] tcp: address problems caused by EDT misshaps

From: Eric Dumazet edumazet@google.com

mainline inclusion from mainline-v4.20-rc5 commit 9efdda4e3abed13f0903b7b6e4d4c2102019440a category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4AFRJ?from=project-issue CVE: NA

------------------------------------------------------------

When a qdisc setup including pacing FQ is dismantled and recreated, some TCP packets are sent earlier than instructed by TCP stack.

TCP can be fooled when ACK comes back, because the following operation can return a negative value.

tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr;

Some paths in TCP stack were not dealing properly with this, this patch addresses four of them.

Fixes: ab408b6dc744 ("tcp: switch tcp and sch_fq to new earliest departure time model") Signed-off-by: Eric Dumazet edumazet@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Jiazhenyuan jiazhenyuan@uniontech.com #openEuler_contributor Signed-off-by: Laibin Qiu qiulaibin@huawei.com --- net/ipv4/tcp_input.c | 17 ++++++++++------- net/ipv4/tcp_timer.c | 10 ++++++---- 2 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index a058fb936..993a97557 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -581,10 +581,12 @@ static inline void tcp_rcv_rtt_measure_ts(struct sock *sk, u32 delta = tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr; u32 delta_us;

- if (!delta) - delta = 1; - delta_us = delta * (USEC_PER_SEC / TCP_TS_HZ); - tcp_rcv_rtt_update(tp, delta_us, 0); + if (likely(delta < INT_MAX / (USEC_PER_SEC / TCP_TS_HZ))) { + if (!delta) + delta = 1; + delta_us = delta * (USEC_PER_SEC / TCP_TS_HZ); + tcp_rcv_rtt_update(tp, delta_us, 0); + } } }

@@ -2934,9 +2936,10 @@ static bool tcp_ack_update_rtt(struct sock *sk, const int flag, if (seq_rtt_us < 0 && tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr && flag & FLAG_ACKED) { u32 delta = tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr; - u32 delta_us = delta * (USEC_PER_SEC / TCP_TS_HZ); - - seq_rtt_us = ca_rtt_us = delta_us; + if (likely(delta < INT_MAX / (USEC_PER_SEC / TCP_TS_HZ))) { + seq_rtt_us = delta * (USEC_PER_SEC / TCP_TS_HZ); + ca_rtt_us = seq_rtt_us; + } } rs->rtt_us = ca_rtt_us; /* RTT of last (S)ACKed packet (or -1) */ if (seq_rtt_us < 0) diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index 681882a40..7c7f7e6cd 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -40,15 +40,17 @@ static u32 tcp_clamp_rto_to_user_timeout(const struct sock *sk) { struct inet_connection_sock *icsk = inet_csk(sk); u32 elapsed, start_ts; + s32 remaining;

start_ts = tcp_retransmit_stamp(sk); if (!icsk->icsk_user_timeout || !start_ts) return icsk->icsk_rto; elapsed = tcp_time_stamp(tcp_sk(sk)) - start_ts; - if (elapsed >= icsk->icsk_user_timeout) + remaining = icsk->icsk_user_timeout - elapsed; + if (remaining <= 0) return 1; /* user timeout has passed; fire ASAP */ - else - return min_t(u32, icsk->icsk_rto, msecs_to_jiffies(icsk->icsk_user_timeout - elapsed)); + + return min_t(u32, icsk->icsk_rto, msecs_to_jiffies(remaining)); }

/** @@ -210,7 +212,7 @@ static bool retransmits_timed_out(struct sock *sk, (boundary - linear_backoff_thresh) * TCP_RTO_MAX; timeout = jiffies_to_msecs(timeout); } - return (tcp_time_stamp(tcp_sk(sk)) - start_ts) >= timeout; + return (s32)(tcp_time_stamp(tcp_sk(sk)) - start_ts - timeout) >= 0; }

/* A write timeout has occurred. Process the after effects. */

-- 2.22.0

Laibin Qiu

11:21 a.m.

New subject: [PATCH openEuler-1.0-LTS V3 4/6] tcp: always set retrans_stamp on recovery

From: Yuchung Cheng ycheng@google.com

mainline inclusion from mainline-v5.1-rc1 commit 7ae189759cc48cf8b54beebff566e9fd2d4e7d7c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4AFRJ?from=project-issue CVE: NA

------------------------------------------------------------

Previously TCP socket's retrans_stamp is not set if the retransmission has failed to send. As a result if a socket is experiencing local issues to retransmit packets, determining when to abort a socket is complicated w/o knowning the starting time of the recovery since retrans_stamp may remain zero.

This complication causes sub-optimal behavior that TCP may use the latest, instead of the first, retransmission time to compute the elapsed time of a stalling connection due to local issues. Then TCP may disrecard TCP retries settings and keep retrying until it finally succeed: not a good idea when the local host is already strained.

The simple fix is to always timestamp the start of a recovery. It's worth noting that retrans_stamp is also used to compare echo timestamp values to detect spurious recovery. This patch does not break that because retrans_stamp is still later than when the original packet was sent.

Signed-off-by: Yuchung Cheng ycheng@google.com Signed-off-by: Eric Dumazet edumazet@google.com Reviewed-by: Neal Cardwell ncardwell@google.com Reviewed-by: Soheil Hassas Yeganeh soheil@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Jiazhenyuan jiazhenyuan@uniontech #openEuler_contributor Signed-off-by: Laibin Qiu qiulaibin@huawei.com --- net/ipv4/tcp_output.c | 9 ++++----- net/ipv4/tcp_timer.c | 23 +++-------------------- 2 files changed, 7 insertions(+), 25 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 7d174fa21..ac75aa86b 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3025,13 +3025,12 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs) #endif TCP_SKB_CB(skb)->sacked |= TCPCB_RETRANS; tp->retrans_out += tcp_skb_pcount(skb); - - /* Save stamp of the first retransmit. */ - if (!tp->retrans_stamp) - tp->retrans_stamp = tcp_skb_timestamp(skb); - }

+ /* Save stamp of the first (attempted) retransmit. */ + if (!tp->retrans_stamp) + tp->retrans_stamp = tcp_skb_timestamp(skb); + if (tp->undo_retrans < 0) tp->undo_retrans = 0; tp->undo_retrans += tcp_skb_pcount(skb); diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index 7c7f7e6cd..9a904bf8d 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -22,28 +22,14 @@ #include <linux/gfp.h> #include <net/tcp.h>

-static u32 tcp_retransmit_stamp(const struct sock *sk) -{ - u32 start_ts = tcp_sk(sk)->retrans_stamp; - - if (unlikely(!start_ts)) { - struct sk_buff *head = tcp_rtx_queue_head(sk); - - if (!head) - return 0; - start_ts = tcp_skb_timestamp(head); - } - return start_ts; -} - static u32 tcp_clamp_rto_to_user_timeout(const struct sock *sk) { struct inet_connection_sock *icsk = inet_csk(sk); u32 elapsed, start_ts; s32 remaining;

- start_ts = tcp_retransmit_stamp(sk); - if (!icsk->icsk_user_timeout || !start_ts) + start_ts = tcp_sk(sk)->retrans_stamp; + if (!icsk->icsk_user_timeout) return icsk->icsk_rto; elapsed = tcp_time_stamp(tcp_sk(sk)) - start_ts; remaining = icsk->icsk_user_timeout - elapsed; @@ -198,10 +184,7 @@ static bool retransmits_timed_out(struct sock *sk, if (!inet_csk(sk)->icsk_retransmits) return false;

- start_ts = tcp_retransmit_stamp(sk); - if (!start_ts) - return false; - + start_ts = tcp_sk(sk)->retrans_stamp; if (likely(timeout == 0)) { linear_backoff_thresh = ilog2(TCP_RTO_MAX/rto_base);

-- 2.22.0

Laibin Qiu

11:21 a.m.

New subject: [PATCH openEuler-1.0-LTS V3 5/6] tcp: create a helper to model exponential backoff

From: Yuchung Cheng ycheng@google.com

mainline inclusion from mainline-v5.1-rc1 commit 01a523b071618abbc634d1958229fe3bd2dfa5fa category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4AFRJ?from=project-issue CVE: NA

------------------------------------------------------------

Create a helper to model TCP exponential backoff for the next patch. This is pure refactor w no behavior change.

Signed-off-by: Yuchung Cheng ycheng@google.com Signed-off-by: Eric Dumazet edumazet@google.com Reviewed-by: Neal Cardwell ncardwell@google.com Reviewed-by: Soheil Hassas Yeganeh soheil@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Jiazhenyuan jiazhenyuan@uniontech.com #openEuler_contributor Signed-off-by: Laibin Qiu qiulaibin@huawei.com --- net/ipv4/tcp_timer.c | 30 ++++++++++++++++++------------ 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index 9a904bf8d..c9e63bd68 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -160,6 +160,20 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk) tcp_sync_mss(sk, icsk->icsk_pmtu_cookie); }

+static unsigned int tcp_model_timeout(struct sock *sk, + unsigned int boundary, + unsigned int rto_base) +{ + unsigned int linear_backoff_thresh, timeout; + + linear_backoff_thresh = ilog2(TCP_RTO_MAX / rto_base); + if (boundary <= linear_backoff_thresh) + timeout = ((2 << boundary) - 1) * rto_base; + else + timeout = ((2 << linear_backoff_thresh) - 1) * rto_base + + (boundary - linear_backoff_thresh) * TCP_RTO_MAX; + return jiffies_to_msecs(timeout); +}

/** * retransmits_timed_out() - returns true if this connection has timed out @@ -178,23 +192,15 @@ static bool retransmits_timed_out(struct sock *sk, unsigned int boundary, unsigned int timeout) { - const unsigned int rto_base = TCP_RTO_MIN; - unsigned int linear_backoff_thresh, start_ts; + unsigned int start_ts;

if (!inet_csk(sk)->icsk_retransmits) return false;

start_ts = tcp_sk(sk)->retrans_stamp; - if (likely(timeout == 0)) { - linear_backoff_thresh = ilog2(TCP_RTO_MAX/rto_base); - - if (boundary <= linear_backoff_thresh) - timeout = ((2 << boundary) - 1) * rto_base; - else - timeout = ((2 << linear_backoff_thresh) - 1) * rto_base + - (boundary - linear_backoff_thresh) * TCP_RTO_MAX; - timeout = jiffies_to_msecs(timeout); - } + if (likely(timeout == 0)) + timeout = tcp_model_timeout(sk, boundary, TCP_RTO_MIN); + return (s32)(tcp_time_stamp(tcp_sk(sk)) - start_ts - timeout) >= 0; }

-- 2.22.0

Laibin Qiu

11:21 a.m.

New subject: [PATCH openEuler-1.0-LTS V3 6/6] tcp: adjust rto_base in retransmits_timed_out()

From: Eric Dumazet edumazet@google.com

mainline inclusion from mainline-v5.4-rc2 commit 3256a2d6ab1f71f9a1bd2d7f6f18eb8108c48d17 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4AFRJ?from=project-issue CVE: NA

------------------------------------------------------------

The cited commit exposed an old retransmits_timed_out() bug which assumed it could call tcp_model_timeout() with TCP_RTO_MIN as rto_base for all states.

But flows in SYN_SENT or SYN_RECV state uses a different RTO base (1 sec instead of 200 ms, unless BPF choses another value)

This caused a reduction of SYN retransmits from 6 to 4 with the default /proc/sys/net/ipv4/tcp_syn_retries value.

Fixes: a41e8a88b06e ("tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state") Signed-off-by: Eric Dumazet edumazet@google.com Cc: Yuchung Cheng ycheng@google.com Cc: Marek Majkowski marek@cloudflare.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Jiazhenyuan jiazhenyuan@uniontech.com #openEuler_contributor Signed-off-by: Laibin Qiu qiulaibin@huawei.com --- net/ipv4/tcp_timer.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index c9e63bd68..657b95d8b 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -198,8 +198,13 @@ static bool retransmits_timed_out(struct sock *sk, return false;

start_ts = tcp_sk(sk)->retrans_stamp; - if (likely(timeout == 0)) - timeout = tcp_model_timeout(sk, boundary, TCP_RTO_MIN); + if (likely(timeout == 0)) { + unsigned int rto_base = TCP_RTO_MIN; + + if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) + rto_base = tcp_timeout_init(sk); + timeout = tcp_model_timeout(sk, boundary, rto_base); + }

return (s32)(tcp_time_stamp(tcp_sk(sk)) - start_ts - timeout) >= 0; }

-- 2.22.0

1168

Age (days ago)

1168

Last active (days ago)

kernel@openeuler.org

6 comments

1 participants

tags (0)

participants (1)

Laibin Qiu