Eric Dumazet [Fri, 21 Sep 2018 22:27:52 +0000 (15:27 -0700)]
tun: remove ndo_poll_controller
As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.
tun uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.
Eric Dumazet [Fri, 21 Sep 2018 22:27:51 +0000 (15:27 -0700)]
nfp: remove ndo_poll_controller
As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.
nfp uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.
Eric Dumazet [Fri, 21 Sep 2018 22:27:50 +0000 (15:27 -0700)]
bnxt: remove ndo_poll_controller
As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.
bnxt uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.
Eric Dumazet [Fri, 21 Sep 2018 22:27:49 +0000 (15:27 -0700)]
bnx2x: remove ndo_poll_controller
As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.
bnx2x uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.
Eric Dumazet [Fri, 21 Sep 2018 22:27:48 +0000 (15:27 -0700)]
mlx5: remove ndo_poll_controller
As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.
mlx5 uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.
Eric Dumazet [Fri, 21 Sep 2018 22:27:47 +0000 (15:27 -0700)]
mlx4: remove ndo_poll_controller
As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.
mlx4 uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.
Eric Dumazet [Fri, 21 Sep 2018 22:27:46 +0000 (15:27 -0700)]
i40evf: remove ndo_poll_controller
As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.
i40evf uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.
Eric Dumazet [Fri, 21 Sep 2018 22:27:45 +0000 (15:27 -0700)]
ice: remove ndo_poll_controller
As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.
ice uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.
Eric Dumazet [Fri, 21 Sep 2018 22:27:44 +0000 (15:27 -0700)]
igb: remove ndo_poll_controller
As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.
igb uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.
Eric Dumazet [Fri, 21 Sep 2018 22:27:43 +0000 (15:27 -0700)]
ixgb: remove ndo_poll_controller
As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.
ixgb uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.
This also removes a problematic use of disable_irq() in
a context it is forbidden, as explained in commit af3e0fcf7887 ("8139too: Use disable_irq_nosync() in
rtl8139_poll_controller()")
Eric Dumazet [Fri, 21 Sep 2018 22:27:42 +0000 (15:27 -0700)]
fm10k: remove ndo_poll_controller
As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
lasts for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.
fm10k uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.
Eric Dumazet [Fri, 21 Sep 2018 22:27:41 +0000 (15:27 -0700)]
ixgbevf: remove ndo_poll_controller
As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.
ixgbevf uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.
Eric Dumazet [Fri, 21 Sep 2018 22:27:40 +0000 (15:27 -0700)]
ixgbe: remove ndo_poll_controller
As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.
ixgbe uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.
Eric Dumazet [Fri, 21 Sep 2018 22:27:38 +0000 (15:27 -0700)]
netpoll: make ndo_poll_controller() optional
As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.
It seems that all networking drivers that do use NAPI
for their TX completions, should not provide a ndo_poll_controller().
NAPI drivers have netpoll support already handled
in core networking stack, since netpoll_poll_dev()
uses poll_napi(dev) to iterate through registered
NAPI contexts for a device.
This patch allows netpoll_poll_dev() to process NAPI
contexts even for drivers not providing ndo_poll_controller(),
allowing for following patches in NAPI drivers.
Also we export netpoll_poll_dev() so that it can be called
by bonding/team drivers in following patches.
Petr Machata [Sun, 23 Sep 2018 14:48:55 +0000 (17:48 +0300)]
mlxsw: Make MLXSW_SP1_FWREV_MINOR a hard requirement
Up until now, mlxsw tolerated firmware versions that weren't exactly
matching the required version, if the branch number matched. That
allowed the users to test various firmware versions as long as they were
on the right branch.
On the other hand, it made it impossible for mlxsw to put a hard lower
bound on a version that fixes all problems known to date. If a user had
a somewhat older FW version installed, mlxsw would start up just fine,
possibly performing non-optimally as it would use features that trigger
problematic behavior.
Therefore tweak the check to accept any FW version that is:
- on the same branch as the preferred version, and
- the same as or newer than the preferred version.
Merge tag 'mfd-fixes-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd
Lee writes:
"MFD fixes for v4.19
- Fix Dialog DA9063 regulator constraints issue causing failure in
probe
- Fix OMAP Device Tree compatible strings to match DT"
* tag 'mfd-fixes-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd:
mfd: omap-usb-host: Fix dts probe of children
mfd: da9063: Fix DT probing with constraints
Merge tag 'for-linus-4.19d-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Juergen writes:
"xen:
Two small fixes for xen drivers."
* tag 'for-linus-4.19d-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: issue warning message when out of grant maptrack entries
xen/x86/vpmu: Zero struct pt_regs before calling into sample handling code
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Thomas writes:
"A set of fixes for x86:
- Resolve the kvmclock regression on AMD systems with memory
encryption enabled. The rework of the kvmclock memory allocation
during early boot results in encrypted storage, which is not
shareable with the hypervisor. Create a new section for this data
which is mapped unencrypted and take care that the later
allocations for shared kvmclock memory is unencrypted as well.
- Fix the build regression in the paravirt code introduced by the
recent spectre v2 updates.
- Ensure that the initial static page tables cover the fixmap space
correctly so early console always works. This worked so far by
chance, but recent modifications to the fixmap layout can -
depending on kernel configuration - move the relevant entries to a
different place which is not covered by the initial static page
tables.
- Address the regressions and issues which got introduced with the
recent extensions to the Intel Recource Director Technology code.
- Update maintainer entries to document reality"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Expand static page table for fixmap space
MAINTAINERS: Add X86 MM entry
x86/intel_rdt: Add Reinette as co-maintainer for RDT
MAINTAINERS: Add Borislav to the x86 maintainers
x86/paravirt: Fix some warning messages
x86/intel_rdt: Fix incorrect loop end condition
x86/intel_rdt: Fix exclusive mode handling of MBA resource
x86/intel_rdt: Fix incorrect loop end condition
x86/intel_rdt: Do not allow pseudo-locking of MBA resource
x86/intel_rdt: Fix unchecked MSR access
x86/intel_rdt: Fix invalid mode warning when multiple resources are managed
x86/intel_rdt: Global closid helper to support future fixes
x86/intel_rdt: Fix size reporting of MBA resource
x86/intel_rdt: Fix data type in parsing callbacks
x86/kvm: Use __bss_decrypted attribute in shared variables
x86/mm: Add .bss..decrypted section to hold shared variables
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Thomas writes:
"- Provide a strerror_r wrapper so lib/bpf can be built on systems
without _GNU_SOURCE
- Unbreak the man page generator when building out of tree"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf Documentation: Fix out-of-tree asciidoctor man page generation
tools lib bpf: Provide wrapper for strerror_r to build in !_GNU_SOURCE systems
====================
hv_netvsc: Support LRO/RSC in the vSwitch
The patch adds support for LRO/RSC in the vSwitch feature. It reduces
the per packet processing overhead by coalescing multiple TCP segments
when possible. The feature is enabled by default on VMs running on
Windows Server 2019 and later.
The patch set also adds ethtool command handler and documents.
====================
LRO/RSC in the vSwitch is a feature available in Windows Server 2019
hosts and later. It reduces the per packet processing overhead by
coalescing multiple TCP segments when possible. This patch adds netvsc
driver support for this feature.
(userns)$ tcpdump -i usb_rndis0
tcpdump: WARNING: usb_rndis0: SIOCETHTOOL(ETHTOOL_GUFO) ioctl failed: Operation not permitted
Warning: Kernel filter failed: Bad file descriptor
tcpdump: can't remove kernel filter: Bad file descriptor
With this change it returns EOPNOTSUPP instead of EPERM.
See also https://github.com/the-tcpdump-group/libpcap/issues/689
Fixes: 08a00fea6de2 "net: Remove references to NETIF_F_UFO from ethtool." Cc: David S. Miller <[email protected]> Signed-off-by: Maciej Żenczykowski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Here are two additional fixes that are required in order for SGMII to
work correctly. This was discovered with using a copper SFP which would
make us use SGMII mode, we would actually leave the HW configured in its
default mode: Fiber.
====================
net: dsa: b53: Also include SGMII for mac_config and mac_link_state
In both 802.3z and SGMII modes we need to configure the MAC accordingly
to flip between Fiber and SGMII modes, and we need to read the MAC
status from the SGMII in-band control word.
Fixes: 0e01491de646 ("net: dsa: b53: Add SerDes support") Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Maths went wrong, to get 0x20, we need to do 0x1e + (x) * 2, not 0x18,
fix that offset so we access the correct registers. This would make us
not access the correct SerDes Digital control words, status would be
fine and so we would not be correctly flipping between Fiber and SGMII
modes resulting in incorrect status words being pulled into the SerDes
digital status register.
Fixes: 0e01491de646 ("net: dsa: b53: Add SerDes support") Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Peter Oskolkov [Fri, 21 Sep 2018 18:17:16 +0000 (11:17 -0700)]
net/ipfrag: let ip[6]frag_high_thresh in ns be higher than in init_net
Currently, ip[6]frag_high_thresh sysctl values in new namespaces are
hard-limited to those of the root/init ns.
There are at least two use cases when it would be desirable to
set the high_thresh values higher in a child namespace vs the global hard
limit:
- a security/ddos protection policy may lower the thresholds in the
root/init ns but allow for a special exception in a child namespace
- testing: a test running in a namespace may want to set these
thresholds higher in its namespace than what is in the root/init ns
RDS: IB: Use DEFINE_PER_CPU_SHARED_ALIGNED for rds_ib_stats
Clang warns when two declarations' section attributes don't match.
net/rds/ib_stats.c:40:1: warning: section does not match previous
declaration [-Wsection]
DEFINE_PER_CPU_SHARED_ALIGNED(struct rds_ib_statistics, rds_ib_stats);
^
./include/linux/percpu-defs.h:142:2: note: expanded from macro
'DEFINE_PER_CPU_SHARED_ALIGNED'
DEFINE_PER_CPU_SECTION(type, name,
PER_CPU_SHARED_ALIGNED_SECTION) \
^
./include/linux/percpu-defs.h:93:9: note: expanded from macro
'DEFINE_PER_CPU_SECTION'
extern __PCPU_ATTRS(sec) __typeof__(type) name;
\
^
./include/linux/percpu-defs.h:49:26: note: expanded from macro
'__PCPU_ATTRS'
__percpu __attribute__((section(PER_CPU_BASE_SECTION sec)))
\
^
net/rds/ib.h:446:1: note: previous attribute is here
DECLARE_PER_CPU(struct rds_ib_statistics, rds_ib_stats);
^
./include/linux/percpu-defs.h:111:2: note: expanded from macro
'DECLARE_PER_CPU'
DECLARE_PER_CPU_SECTION(type, name, "")
^
./include/linux/percpu-defs.h:87:9: note: expanded from macro
'DECLARE_PER_CPU_SECTION'
extern __PCPU_ATTRS(sec) __typeof__(type) name
^
./include/linux/percpu-defs.h:49:26: note: expanded from macro
'__PCPU_ATTRS'
__percpu __attribute__((section(PER_CPU_BASE_SECTION sec)))
\
^
1 warning generated.
The initial definition was added in commit ec16227e1414 ("RDS/IB:
Infiniband transport") and the cache aligned definition was added in
commit e6babe4cc4ce ("RDS/IB: Stats and sysctls") right after. The
definition probably should have been updated in net/rds/ib.h, which is
what this patch does.
Eric Dumazet [Fri, 21 Sep 2018 17:58:07 +0000 (10:58 -0700)]
net/ipv4: avoid compile error in fib_info_nh_uses_dev
net/ipv4/fib_frontend.c: In function 'fib_info_nh_uses_dev':
net/ipv4/fib_frontend.c:322:6: error: unused variable 'ret' [-Werror=unused-variable]
cc1: all warnings being treated as errors
====================
tcp: switch to Early Departure Time model
In the early days, pacing has been implemented in sch_fq (FQ)
in a generic way :
- SO_MAX_PACING_RATE could be used by any sockets.
- TCP would vary effective pacing rate based on CWND*MSS/SRTT
- FQ would ensure delays between packets based on current
sk->sk_pacing_rate, but with some quantum based artifacts.
(inflating RPC tail latencies)
- BBR then tweaked the pacing rate in its various phases
(PROBE, DRAIN, ...)
This worked reasonably well, but had the side effect that TCP RTT
samples would be inflated by the sojourn time of the packets in FQ.
Also note that when FQ is not used and TCP wants pacing, the
internal pacing fallback has very different behavior, since TCP
emits packets at the time they should be sent (with unreasonable
assumptions about scheduling costs)
Van Jacobson gave a talk at Netdev 0x12 in Montreal, about letting
TCP (or applications for UDP messages) decide of the Earliest
Departure Time, instead of letting packet schedulers derive it
from pacing rate.
Recent additions in linux provided SO_TXTIME and a new ETF qdisc
supporting the new skb->tstamp role
This patch series converts TCP and FQ to the same model.
This might in the future allow us to relax tight TSQ limits
(if FQ is present in the output path), and thus lower
number of callbacks to tcp_write_xmit(), thanks to batching.
This will be followed by FQ change allowing SO_TXTIME support
so that QUIC servers can let the pacing being done in FQ (or
offloaded if network device permits)
For example, a TCP flow rated at 24Mbps now shows a more meaningful RTT
A nice side effect of this patch series is a reduction of max/p99
latencies of RPC workloads, since the FQ quantum no longer adds
artifact.
====================
Eric Dumazet [Fri, 21 Sep 2018 15:51:54 +0000 (08:51 -0700)]
net_sched: sch_fq: remove dead code dealing with retransmits
With the earliest departure time model, we no longer plan
special casing TCP retransmits. We therefore remove dead
code (since most compilers understood skb_is_retransmit()
was false)
Eric Dumazet [Fri, 21 Sep 2018 15:51:52 +0000 (08:51 -0700)]
tcp: switch tcp and sch_fq to new earliest departure time model
TCP keeps track of tcp_wstamp_ns by itself, meaning sch_fq
no longer has to do it.
Thanks to this model, TCP can get more accurate RTT samples,
since pacing no longer inflates them.
This has the nice effect of removing some delays caused by FQ
quantum mechanism, causing inflated max/P99 latencies.
Also we might relax TCP Small Queue tight limits in the future,
since this new model allow TCP to build bigger batches, since
sch_fq (or a device with earliest departure time offload) ensure
these packets will be delivered on time.
Note that other protocols are not converted (they will probably
never be) so sch_fq has still support for SO_MAX_PACING_RATE
Tested:
Test showing FQ pacing quantum artifact for low-rate flows,
adding unexpected throttles for RPC flows, inflating max and P99 latencies.
The parameters chosen here are to show what happens typically when
a TCP flow has a reduced pacing rate (this can be caused by a reduced
cwin after few losses, or/and rtt above few ms)
MIBS="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY"
Before :
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
19,82.78,5279,3825,482.02
After :
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
20,49.94,128,63,3.18
Eric Dumazet [Fri, 21 Sep 2018 15:51:50 +0000 (08:51 -0700)]
tcp: provide earliest departure time in skb->tstamp
Switch internal TCP skb->skb_mstamp to skb->skb_mstamp_ns,
from usec units to nsec units.
Do not clear skb->tstamp before entering IP stacks in TX,
so that qdisc or devices can implement pacing based on the
earliest departure time instead of socket sk->sk_pacing_rate
Packets are fed with tcp_wstamp_ns, and following patch
will update tcp_wstamp_ns when both TCP and sch_fq switch to
the earliest departure time mechanism.
Eric Dumazet [Fri, 21 Sep 2018 15:51:48 +0000 (08:51 -0700)]
net_sched: sch_fq: switch to CLOCK_TAI
TCP will soon provide per skb->tstamp with earliest departure time,
so that sch_fq does not have to determine departure time by looking
at socket sk_pacing_rate.
We chose in linux-4.19 CLOCK_TAI as the clock base for transports,
qdiscs, and NIC offloads.
Eric Dumazet [Fri, 21 Sep 2018 15:51:46 +0000 (08:51 -0700)]
tcp: switch tcp_clock_ns() to CLOCK_TAI base
TCP pacing is either implemented in sch_fq or internally.
We have the goal of being able to offload pacing on the NICS.
TCP will soon provide per skb skb->tstamp as early departure time.
Like ETF in commit 25db26a91364 ("net/sched: Introduce the ETF Qdisc")
we chose CLOCK_T as the clock base, so that TCP and pacers can share
a common clock, to get better RTT samples (without pacing artificially
inflating these samples).
net: hns3: Fix speed/duplex information loss problem when executing ethtool ethx cmd of VF
Our VF has not implemented the ops for get_port_type. So when we executing
ethtool ethx cmd of VF, hns3_get_link_ksettings will return directly. And
we can not query anything.
To support get_link_ksettings for VF, this patch replaces get_port_type
with get_media_type. If the media type is HNAE3_MEDIA_TYPE_NONE,
hns3_get_link_ksettings will return link information of VF.
There are already multiple types packets statistics for error packets,
it's unnecessary to print them, which may affect the rx performance if
print too many.
net: hns3: Add nic state check before calling netif_tx_wake_queue
When nic down, it firstly calls netif_tx_stop_all_queues(), then calls
napi_disable(). But napi_disable() will wait current napi_poll finish,
it may call netif_tx_wake_queue(). This patch fixes it by add nic state
checking.
net: hns3: Fix tqp array traversal condition for vf
There are two tqp_num variables "hdev->tqp_num" and "kinfo->tqp_num"
used in VF. "hdev->tqp_num" is the total tqp number allocated to the
VF, and "kinfo->tqp_num" indicates the tqp number being used by the
VF. Usually the two variables are equal. But for the case hdev->tqp_num
larger than rss_size_max, and num_tc is 1, "kinfo->tqp_num" will be
less than "hdev->tqp_num".
In original codes, "hdev->tqp_num" is always used to traverse the
tqp array of kinfo. It may cause null pointer error when "hdev->tqp_num"
is larger than "kinfo->tqp_num"
Fixes: e2cb1dec9779 ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support") Signed-off-by: Jian Shen <[email protected]> Signed-off-by: Peng Li <[email protected]> Signed-off-by: Salil Mehta <[email protected]> Signed-off-by: David S. Miller <[email protected]>
For desc.data is already point to the address of struct member "data[6]",
it's unnecessary to use '&' to get its address. This patch unifies all
the type convert for dest.data, using "req = (struct name *)dest.data".
There is a defect in hclge_ets_validate(). If each member of tc_tsa is
not IEEE_8021QAZ_TSA_ETS, the variable total_ets_bw won't be updated.
In this case, the check for value of total_ets_bw will fail. This patch
fixes it by checking total_ets_bw only after it has been updated.
Fixes: cacde272dd00 ("net: hns3: Add hclge_dcb module for the support of DCB feature") Signed-off-by: Jian Shen <[email protected]> Signed-off-by: Peng Li <[email protected]> Signed-off-by: Salil Mehta <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
updating properly on 4.18. This is because we started using ktime to
track elapsed time, and we convert nanoseconds to jiffies when we update
the partition counter. However, this gets rounded down, so any I/Os that
take less than a jiffy are not accounted for. Previously in this case,
the value of jiffies would sometimes increment while we were doing I/O,
so at least some I/Os were accounted for.
Let's convert the stats to use nanoseconds internally. We still report
milliseconds as before, now more accurately than ever. The value is
still truncated to 32 bits for backwards compatibility.
Andrew Lunn [Fri, 21 Sep 2018 13:52:26 +0000 (15:52 +0200)]
ravb: Disable Pause Advertisement
The previous commit to ravb had the side effect of making the PHY
advertise Pause and Asym Pause, which previously did not happen. By
default, phydev->supported has both forms of pause enabled, but
phydev->advertising does not. The new phy_remove_link_mode() copies
phydev->supported to phydev->advertising after removing the requested
link mode. These Pause configuration bits appears it stops the PHY
from completing Auto-Neg and the link remains down. Be explicit and
remove the Pause and Asym Pause modes, so restoring the old behavior.
Fixes: 41124fa64d4b ("net: ethernet: Add helper to remove a supported link mode") Reported-by: Simon Horman <[email protected]> Signed-off-by: Andrew Lunn <[email protected]> Reviewed-by: Sergei Shtylyov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
====================
net: if_arp: use define instead of hard-coded value
Struct arpreq contains the name of the device. All other places in the
kernel, the define IFNAMSIZ is used to designate its size. But in
if_arp.h, a literal constant is used.
As it could be good reasons to use constants instead of the defines in
include files under uapi, it seems to be OK to use the define here,
without opening a can of worms in user-land.
This because if_arp.h includes netdevice.h, which also uses
IFNAMSIZ. For the distros I have checked, this also holds true for the
use-land side.
The series also fixes some incorrect indents.
====================
net/mlx4: Use cpumask_available for eq->affinity_mask
Clang warns that the address of a pointer will always evaluated as true
in a boolean context:
drivers/net/ethernet/mellanox/mlx4/eq.c:243:11: warning: address of
array 'eq->affinity_mask' will always evaluate to 'true'
[-Wpointer-bool-conversion]
if (!eq->affinity_mask || cpumask_empty(eq->affinity_mask))
~~~~~^~~~~~~~~~~~~
1 warning generated.
Use cpumask_available, introduced in commit f7e30f01a9e2 ("cpumask: Add
helper cpumask_available()"), which does the proper checking and avoids
this warning.
Dan Carpenter [Fri, 21 Sep 2018 08:07:55 +0000 (11:07 +0300)]
devlink: double free in devlink_resource_fill()
Smatch reports that devlink_dpipe_send_and_alloc_skb() frees the skb
on error so this is a double free. We fixed a bunch of these bugs in
commit 7fe4d6dcbcb4 ("devlink: Remove redundant free on error path") but
we accidentally overlooked this one.
Fixes: d9f9b9a4d05f ("devlink: Add support for resource abstraction") Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: David S. Miller <[email protected]>
net/tls: Add support for async encryption of records for performance
In current implementation, tls records are encrypted & transmitted
serially. Till the time the previously submitted user data is encrypted,
the implementation waits and on finish starts transmitting the record.
This approach of encrypt-one record at a time is inefficient when
asynchronous crypto accelerators are used. For each record, there are
overheads of interrupts, driver softIRQ scheduling etc. Also the crypto
accelerator sits idle most of time while an encrypted record's pages are
handed over to tcp stack for transmission.
This patch enables encryption of multiple records in parallel when an
async capable crypto accelerator is present in system. This is achieved
by allowing the user space application to send more data using sendmsg()
even while previously issued data is being processed by crypto
accelerator. This requires returning the control back to user space
application after submitting encryption request to accelerator. This
also means that zero-copy mode of encryption cannot be used with async
accelerator as we must be done with user space application buffer before
returning from sendmsg().
There can be multiple records in flight to/from the accelerator. Each of
the record is represented by 'struct tls_rec'. This is used to store the
memory pages for the record.
After the records are encrypted, they are added in a linked list called
tx_ready_list which contains encrypted tls records sorted as per tls
sequence number. The records from tx_ready_list are transmitted using a
newly introduced function called tls_tx_records(). The tx_ready_list is
polled for any record ready to be transmitted in sendmsg(), sendpage()
after initiating encryption of new tls records. This achieves parallel
encryption and transmission of records when async accelerator is
present.
There could be situation when crypto accelerator completes encryption
later than polling of tx_ready_list by sendmsg()/sendpage(). Therefore
we need a deferred work context to be able to transmit records from
tx_ready_list. The deferred work context gets scheduled if applications
are not sending much data through the socket. If the applications issue
sendmsg()/sendpage() in quick succession, then the scheduling of
tx_work_handler gets cancelled as the tx_ready_list would be polled from
application's context itself. This saves scheduling overhead of deferred
work.
The patch also brings some side benefit. We are able to get rid of the
concept of CLOSED record. This is because the records once closed are
either encrypted and then placed into tx_ready_list or if encryption
fails, the socket error is set. This simplifies the kernel tls
sendpath. However since tls_device.c is still using macros, accessory
functions for CLOSED records have been retained.
net: apple: fix return type of ndo_start_xmit function
The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
which is a typedef for an enum type, so make sure the implementation in
this driver has returns 'netdev_tx_t' value, and change the function
return type to netdev_tx_t.
net: i825xx: fix return type of ndo_start_xmit function
The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
which is a typedef for an enum type, so make sure the implementation in
this driver has returns 'netdev_tx_t' value, and change the function
return type to netdev_tx_t.
net: wiznet: fix return type of ndo_start_xmit function
The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
which is a typedef for an enum type, so make sure the implementation in
this driver has returns 'netdev_tx_t' value, and change the function
return type to netdev_tx_t.
net: sgi: fix return type of ndo_start_xmit function
The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
which is a typedef for an enum type, so make sure the implementation in
this driver has returns 'netdev_tx_t' value, and change the function
return type to netdev_tx_t.
net: cirrus: fix return type of ndo_start_xmit function
The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
which is a typedef for an enum type, so make sure the implementation in
this driver has returns 'netdev_tx_t' value, and change the function
return type to netdev_tx_t.
net: seeq: fix return type of ndo_start_xmit function
The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
which is a typedef for an enum type, so make sure the implementation in
this driver has returns 'netdev_tx_t' value, and change the function
return type to netdev_tx_t.
PCI: hv: Fix return value check in hv_pci_assign_slots()
In case of error, the function pci_create_slot() returns ERR_PTR() and
never returns NULL. The NULL test in the return value check should be
replaced with IS_ERR().
Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information") Signed-off-by: Wei Yongjun <[email protected]> Signed-off-by: David S. Miller <[email protected]>
net: freescale: fix return type of ndo_start_xmit function
The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
which is a typedef for an enum type, so make sure the implementation in
this driver has returns 'netdev_tx_t' value, and change the function
return type to netdev_tx_t.
net: micrel: fix return type of ndo_start_xmit function
The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
which is a typedef for an enum type, so make sure the implementation in
this driver has returns 'netdev_tx_t' value, and change the function
return type to netdev_tx_t.
Jeff Barnhill [Fri, 21 Sep 2018 00:45:27 +0000 (00:45 +0000)]
net/ipv6: Display all addresses in output of /proc/net/if_inet6
The backend handling for /proc/net/if_inet6 in addrconf.c doesn't properly
handle starting/stopping the iteration. The problem is that at some point
during the iteration, an overflow is detected and the process is
subsequently stopped. The item being shown via seq_printf() when the
overflow occurs is not actually shown, though. When start() is
subsequently called to resume iterating, it returns the next item, and
thus the item that was being processed when the overflow occurred never
gets printed.
Alter the meaning of the private data member "offset". Currently, when it
is not 0 (which only happens at the very beginning), "offset" represents
the next hlist item to be printed. After this change, "offset" always
represents the current item.
This is also consistent with the private data member "bucket", which
represents the current bucket, and also the use of "pos" as defined in
seq_file.txt:
The pos passed to start() will always be either zero, or the most
recent pos used in the previous session.
Allow the configuration of the MDIO clock divider when the Device Tree
contains 'clock-frequency' property (similar to I2C and SPI buses).
Because the hardware may have lost its state during suspend/resume,
re-apply the MDIO clock divider upon resumption.
Clang warns when a variable is assigned to itself.
drivers/net/usb/lan78xx.c:940:11: warning: explicitly assigning value of
variable of type 'u32' (aka 'unsigned int') to itself [-Wself-assign]
offset = offset;
~~~~~~ ^ ~~~~~~
1 warning generated.
Reorder the if statement to acheive the same result and avoid a self
assignment warning.
Clang warns when a variable is assigned to itself.
drivers/net/fddi/skfp/pcmplc.c:1257:6: warning: explicitly assigning
value of variable of type 'int' to itself [-Wself-assign]
phy = phy ; on_off = on_off ;
~~~ ^ ~~~
drivers/net/fddi/skfp/pcmplc.c:1257:21: warning: explicitly assigning
value of variable of type 'int' to itself [-Wself-assign]
phy = phy ; on_off = on_off ;
~~~~~~ ^ ~~~~~~
2 warnings generated.
Turns out this entire function doesn't actually do anything since
SK_UNUSED is just casting the pointer to void. Remove it to silence
this Clang warning.
Clang warns when a variable is assigned to itself.
drivers/net/ethernet/brocade/bna/bna_enet.c:1800:9: warning: explicitly
assigning value of variable of type 'int' to itself [-Wself-assign]
for (i = i; i < (bna->ioceth.attr.num_ucmac * 2); i++)
~ ^ ~
drivers/net/ethernet/brocade/bna/bna_enet.c:1835:9: warning: explicitly
assigning value of variable of type 'int' to itself [-Wself-assign]
for (i = i; i < (bna->ioceth.attr.num_mcmac * 2); i++)
~ ^ ~
2 warnings generated.
Clang warns when multiple pairs of parentheses are used for a single
conditional statement.
drivers/net/ethernet/neterion/vxge/vxge-traffic.c:2265:31: warning:
equality comparison with extraneous parentheses [-Wparentheses-equality]
if ((hldev->config.intr_mode ==
VXGE_HW_INTR_MODE_MSIX_ONE_SHOT))
~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/neterion/vxge/vxge-traffic.c:2265:31: note: remove
extraneous parentheses around the comparison to silence this warning
if ((hldev->config.intr_mode ==
VXGE_HW_INTR_MODE_MSIX_ONE_SHOT))
~ ^ ~
drivers/net/ethernet/neterion/vxge/vxge-traffic.c:2265:31: note: use '='
to turn this equality comparison into an assignment
if ((hldev->config.intr_mode ==
VXGE_HW_INTR_MODE_MSIX_ONE_SHOT))
^~
=
1 warning generated.
Sean Tranchetti [Thu, 20 Sep 2018 20:29:45 +0000 (14:29 -0600)]
netlabel: check for IPV4MASK in addrinfo_get
netlbl_unlabel_addrinfo_get() assumes that if it finds the
NLBL_UNLABEL_A_IPV4ADDR attribute, it must also have the
NLBL_UNLABEL_A_IPV4MASK attribute as well. However, this is
not necessarily the case as the current checks in
netlbl_unlabel_staticadd() and friends are not sufficent to
enforce this.
If passed a netlink message with NLBL_UNLABEL_A_IPV4ADDR,
NLBL_UNLABEL_A_IPV6ADDR, and NLBL_UNLABEL_A_IPV6MASK attributes,
these functions will all call netlbl_unlabel_addrinfo_get() which
will then attempt dereference NULL when fetching the non-existent
NLBL_UNLABEL_A_IPV4MASK attribute:
Unable to handle kernel NULL pointer dereference at virtual address 0
Process unlab (pid: 31762, stack limit = 0xffffff80502d8000)
Call trace:
netlbl_unlabel_addrinfo_get+0x44/0xd8
netlbl_unlabel_staticremovedef+0x98/0xe0
genl_rcv_msg+0x354/0x388
netlink_rcv_skb+0xac/0x118
genl_rcv+0x34/0x48
netlink_unicast+0x158/0x1f0
netlink_sendmsg+0x32c/0x338
sock_sendmsg+0x44/0x60
___sys_sendmsg+0x1d0/0x2a8
__sys_sendmsg+0x64/0xb4
SyS_sendmsg+0x34/0x4c
el0_svc_naked+0x34/0x38
Code: 510011497100113f540000a0f9401508 (79400108)
---[ end trace f6438a488e737143 ]---
Kernel panic - not syncing: Fatal exception
Daniel Borkmann [Sat, 22 Sep 2018 00:46:42 +0000 (02:46 +0200)]
Merge branch 'bpf-sockmap-estab-fixes'
John Fastabend says:
====================
Eric noted that using the close callback is not sufficient
to catch all transitions from ESTABLISHED state to a LISTEN
state. So this series does two things. First, only allow
adding socks in ESTABLISH state and second use unhash callback
to catch tcp_disconnect() transitions.
v2: added check for ESTABLISH state in hash update sockmap as well
v3: Do not release lock from unhash in error path, no lock was
used in the first place. And drop not so useful code comments
v4: convert,
if (unhash()) return unhash(); return
to if (unhash()) unhash(); return;
Thanks for reviewing Yonghong I carried your ACKs forward.
====================
John Fastabend [Tue, 18 Sep 2018 16:01:49 +0000 (09:01 -0700)]
bpf: sockmap, fix transition through disconnect without close
It is possible (via shutdown()) for TCP socks to go trough TCP_CLOSE
state via tcp_disconnect() without actually calling tcp_close which
would then call our bpf_tcp_close() callback. Because of this a user
could disconnect a socket then put it in a LISTEN state which would
break our assumptions about sockets always being ESTABLISHED state.
To resolve this rely on the unhash hook, which is called in the
disconnect case, to remove the sock from the sockmap.
John Fastabend [Tue, 18 Sep 2018 16:01:44 +0000 (09:01 -0700)]
bpf: sockmap only allow ESTABLISHED sock state
After this patch we only allow socks that are in ESTABLISHED state or
are being added via a sock_ops event that is transitioning into an
ESTABLISHED state. By allowing sock_ops events we allow users to
manage sockmaps directly from sock ops programs. The two supported
sock_ops ops are BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB and
BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB.
Similar to TLS ULP this ensures sk_user_data is correct.
following commit:
commit d58e468b1112 ("flow_dissector: implements flow dissector BPF hook")
added struct bpf_flow_keys which conflicts with the struct with
same name in sockex2_kern.c and sockex3_kern.c
similar to commit:
commit 534e0e52bc23 ("samples/bpf: fix a compilation failure")
we tried the rename it "flow_keys" but it also conflicted with struct
having same name in include/net/flow_dissector.h. Hence renaming the
struct to "flow_key_record". Also, this commit doesn't fix the
compilation error completely because the similar struct is present in
sockex3_kern.c. Hence renaming it in both files sockex3_user.c and
sockex3_kern.c
Merge tag 'pinctrl-v4.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Linus writes:
"Pin control fixes for v4.19:
- Two fixes for the Intel pin controllers than cause
problems on laptops."
* tag 'pinctrl-v4.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: intel: Do pin translation in other GPIO operations as well
pinctrl: cannonlake: Fix gpio base for GPP-E
scsi: sd: don't crash the host on invalid commands
When sd_init_command() get's a command with a unknown req_op() it crashes the
system via BUG().
This makes debugging the actual reason for the broken request cmd_flags pretty
hard as the system is down before it's able to write out debugging data on the
serial console or the trace buffer.
Change the BUG() to a WARN_ON() and return BLKPREP_KILL to fail gracefully and
return an I/O error to the producer of the request.
scsi: ipr: System hung while dlpar adding primary ipr adapter back
While dlpar adding primary ipr adapter back, driver goes through adapter
initialization then schedule ipr_worker_thread to start te disk scan by
dropping the host lock, calling scsi_add_device. Then get the adapter reset
request again, so driver does scsi_block_requests, this will cause the
scsi_add_device get hung until we unblock. But we can't run ipr_worker_thread
to do the unblock because its stuck in scsi_add_device.
scsi: target: iscsi: Use hex2bin instead of a re-implementation
This change has the following effects, in order of descreasing importance:
1) Prevent a stack buffer overflow
2) Do not append an unnecessary NULL to an anyway binary buffer, which
is writing one byte past client_digest when caller is:
chap_string_to_hex(client_digest, chap_r, strlen(chap_r));
The latter was found by KASAN (see below) when input value hes expected size
(32 hex chars), and further analysis revealed a stack buffer overflow can
happen when network-received value is longer, allowing an unauthenticated
remote attacker to smash up to 17 bytes after destination buffer (16 bytes
attacker-controlled and one null). As switching to hex2bin requires
specifying destination buffer length, and does not internally append any null,
it solves both issues.
This addresses CVE-2018-14633.
Beyond this:
- Validate received value length and check hex2bin accepted the input, to log
this rejection reason instead of just failing authentication.
- Only log received CHAP_R and CHAP_C values once they passed sanity checks.
==================================================================
BUG: KASAN: stack-out-of-bounds in chap_string_to_hex+0x32/0x60 [iscsi_target_mod]
Write of size 1 at addr ffff8801090ef7c8 by task kworker/0:0/1021