Jie Wang [Wed, 27 Oct 2021 12:11:46 +0000 (20:11 +0800)]
net: hns3: fix data endian problem of some functions of debugfs
The member data in struct hclge_desc is type of __le32, it needs endian
conversion before using it, and some functions of debugfs didn't do that,
so this patch fixes it.
Fixes: c0ebebb9ccc1 ("net: hns3: Add "dcb register" status information query function") Signed-off-by: Jie Wang <[email protected]> Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Guangbin Huang [Wed, 27 Oct 2021 12:11:45 +0000 (20:11 +0800)]
net: hns3: ignore reset event before initialization process is done
Currently, if there is a reset event triggered by RAS during device in
initialization process, driver may run reset process concurrently with
initialization process. In this case, it may cause problem. For example,
the RSS indirection table may has not been alloc memory in initialization
process yet, but it is used in reset process, it will cause a call trace
like this:
Yufeng Mo [Wed, 27 Oct 2021 12:11:44 +0000 (20:11 +0800)]
net: hns3: change hclge/hclgevf workqueue to WQ_UNBOUND mode
Currently, the workqueue of hclge/hclgevf is executed on
the CPU that initiates scheduling requests by default. In
stress scenarios, the CPU may be busy and workqueue scheduling
is completed after a long period of time. To avoid this
situation and implement proper scheduling, use the WQ_UNBOUND
mode instead. In this way, the workqueue can be performed on
a relatively idle CPU.
Guangbin Huang [Wed, 27 Oct 2021 12:11:43 +0000 (20:11 +0800)]
net: hns3: fix pause config problem after autoneg disabled
If a TP port is configured by follow steps:
1.ethtool -s ethx autoneg off speed 100 duplex full
2.ethtool -A ethx rx on tx on
3.ethtool -s ethx autoneg on(rx&tx negotiated pause results are off)
4.ethtool -s ethx autoneg off speed 100 duplex full
In step 3, driver will set rx&tx pause parameters of hardware to off as
pause parameters negotiated with link partner are off.
After step 4, the "ethtool -a ethx" command shows both rx and tx pause
parameters are on. However, pause parameters of hardware are still off
and port has no flow control function actually.
To fix this problem, if autoneg is disabled, driver uses its saved
parameters to restore pause of hardware. If the speed is not changed in
this case, there is no link state changed for phy, it will cause the pause
parameter is not taken effect, so we need to force phy to go down and up.
Fixes: aacbe27e82f0 ("net: hns3: modify how pause options is displayed") Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
block: Fix partition check for host-aware zoned block devices
Commit a33df75c6328 ("block: use an xarray for disk->part_tbl") modified
the method to check partition existence in host-aware zoned block
devices from disk_has_partitions() helper function call to empty check
of xarray disk->part_tbl. However, disk->part_tbl always has single
entry for disk->part0 and never becomes empty. This resulted in the
host-aware zoned devices always judged to have partitions, and it made
the sysfs queue/zoned attribute to be "none" instead of "host-aware"
regardless of partition existence in the devices.
This also caused DEBUG_LOCKS_WARN_ON(lock->magic != lock) for
sdkp->rev_mutex in scsi layer when the kernel detects host-aware zoned
device. Since block layer handled the host-aware zoned devices as non-
zoned devices, scsi layer did not have chance to initialize the mutex
for zone revalidation. Therefore, the warning was triggered.
To fix the issues, call the helper function disk_has_partitions() in
place of disk->part_tbl empty check. Since the function was removed with
the commit a33df75c6328, reimplement it to walk through entries in the
xarray disk->part_tbl.
The kmaps in compression code are still needed and cause crashes on
32bit machines (ARM, x86). Reproducible eg. by running fstest btrfs/004
with enabled LZO or ZSTD compression.
Amit Engel [Wed, 27 Oct 2021 06:49:27 +0000 (09:49 +0300)]
nvmet-tcp: fix header digest verification
Pass the correct length to nvmet_tcp_verify_hdgst, which is the pdu
header length. This fixes a wrong behaviour where header digest
verification passes although the digest is wrong.
Varun Prakash [Mon, 25 Oct 2021 17:16:54 +0000 (22:46 +0530)]
nvmet-tcp: fix data digest pointer calculation
exp_ddgst is of type __le32, &cmd->exp_ddgst + cmd->offset increases
&cmd->exp_ddgst by 4 * cmd->offset, fix this by type casting
&cmd->exp_ddgst to u8 *.
Varun Prakash [Tue, 26 Oct 2021 13:31:55 +0000 (19:01 +0530)]
nvme-tcp: fix possible req->offset corruption
With commit db5ad6b7f8cd ("nvme-tcp: try to send request in queue_rq
context") r2t and response PDU can get processed while send function
is executing.
Current data digest send code uses req->offset after kernel_sendmsg(),
this creates a race condition where req->offset gets reset before it
is used in send function.
This can happen in two cases -
1. Target sends r2t PDU which resets req->offset.
2. Target send response PDU which completes the req and then req is
used for a new command, nvme_tcp_setup_cmd_pdu() resets req->offset.
Fix this by storing req->offset in a local variable and using
this local variable after kernel_sendmsg().
Linus Torvalds [Tue, 26 Oct 2021 22:24:33 +0000 (15:24 -0700)]
Merge tag 'arm-soc-fixes-5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"One last set of small fixes for the soc tree:
- Incorrect ethernet phy settings found on i.mx and allwinner
platforms
- a revert for a Qualcomm DT change that caused a boot regression
- four patches for incorrect settings in i.MX DT files
- new MAINTAINER file entries for dhcom boards
- a Kconfig fix for a reset driver that became unselectable
- three more code changes for bugs in reset drivers"
* tag 'arm-soc-fixes-5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
MAINTAINERS: Add maintainers for DHCOM i.MX6 and DHCOM/DHCOR STM32MP1
Revert "arm64: dts: qcom: sm8250: remove bus clock from the mdss node for sm8250 target"
arm64: dts: imx8mm-kontron: Fix connection type for VSC8531 RGMII PHY
arm64: dts: imx8mm-kontron: Fix CAN SPI clock frequency
arm64: dts: imx8mm-kontron: Fix polarity of reg_rst_eth2
arm64: dts: imx8mm-kontron: Set lower limit of VDD_SNVS to 800 mV
arm64: dts: imx8mm-kontron: Make sure SOC and DRAM supply voltages are correct
reset: socfpga: add empty driver allowing consumers to probe
reset: tegra-bpmp: Handle errors in BPMP response
reset: pistachio: Re-enable driver selection
reset: brcmstb-rescal: fix incorrect polarity of status bit
ARM: dts: sun7i: A20-olinuxino-lime2: Fix ethernet phy-mode
arm64: dts: allwinner: h5: NanoPI Neo 2: Fix ethernet node
Naohiro Aota [Tue, 26 Oct 2021 16:51:27 +0000 (01:51 +0900)]
block: schedule queue restart after BLK_STS_ZONE_RESOURCE
When dispatching a zone append write request to a SCSI zoned block device,
if the target zone of the request is already locked, the device driver will
return BLK_STS_ZONE_RESOURCE and the request will be pushed back to the
hctx dipatch queue. The queue will be marked as RESTART in
dd_finish_request() and restarted in __blk_mq_free_request(). However, this
restart applies to the hctx of the completed request. If the requeued
request is on a different hctx, dispatch will no be retried until another
request is submitted or the next periodic queue run triggers, leading to up
to 30 seconds latency for the requeued request.
Fix this problem by scheduling a queue restart similarly to the
BLK_STS_RESOURCE case or when we cannot get the budget.
Also, consolidate the checks into the "need_resource" variable to simplify
the condition.
We've added 12 non-merge commits during the last 7 day(s) which contain
a total of 23 files changed, 118 insertions(+), 98 deletions(-).
The main changes are:
1) Fix potential race window in BPF tail call compatibility check, from Toke Høiland-Jørgensen.
2) Fix memory leak in cgroup fs due to missing cgroup_bpf_offline(), from Quanyang Wang.
3) Fix file descriptor reference counting in generic_map_update_batch(), from Xu Kuohai.
4) Fix bpf_jit_limit knob to the max supported limit by the arch's JIT, from Lorenz Bauer.
5) Fix BPF sockmap ->poll callbacks for UDP and AF_UNIX sockets, from Cong Wang and Yucong Sun.
6) Fix BPF sockmap concurrency issue in TCP on non-blocking sendmsg calls, from Liu Jian.
7) Fix build failure of INODE_STORAGE and TASK_STORAGE maps on !CONFIG_NET, from Tejun Heo.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Fix potential race in tail call compatibility check
bpf: Move BPF_MAP_TYPE for INODE_STORAGE and TASK_STORAGE outside of CONFIG_NET
selftests/bpf: Use recv_timeout() instead of retries
net: Implement ->sock_is_readable() for UDP and AF_UNIX
skmsg: Extract and reuse sk_msg_is_readable()
net: Rename ->stream_memory_read to ->sock_is_readable
tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function
cgroup: Fix memory leak caused by missing cgroup_bpf_offline
bpf: Fix error usage of map_fd and fdget() in generic_map_update_batch()
bpf: Prevent increasing bpf_jit_limit above max
bpf: Define bpf_jit_alloc_exec_limit for arm64 JIT
bpf: Define bpf_jit_alloc_exec_limit for riscv JIT
====================
bpf: Fix potential race in tail call compatibility check
Lorenzo noticed that the code testing for program type compatibility of
tail call maps is potentially racy in that two threads could encounter a
map with an unset type simultaneously and both return true even though they
are inserting incompatible programs.
The race window is quite small, but artificially enlarging it by adding a
usleep_range() inside the check in bpf_prog_array_compatible() makes it
trivial to trigger from userspace with a program that does, essentially:
While the race window is small, it has potentially serious ramifications in
that triggering it would allow a BPF program to tail call to a program of a
different type. So let's get rid of it by protecting the update with a
spinlock. The commit in the Fixes tag is the last commit that touches the
code in question.
v2:
- Use a spinlock instead of an atomic variable and cmpxchg() (Alexei)
v3:
- Put lock and the members it protects into an embedded 'owner' struct (Daniel)
Tejun Heo [Thu, 21 Oct 2021 18:46:10 +0000 (08:46 -1000)]
bpf: Move BPF_MAP_TYPE for INODE_STORAGE and TASK_STORAGE outside of CONFIG_NET
bpf_types.h has BPF_MAP_TYPE_INODE_STORAGE and BPF_MAP_TYPE_TASK_STORAGE
declared inside #ifdef CONFIG_NET although they are built regardless of
CONFIG_NET. So, when CONFIG_BPF_SYSCALL && !CONFIG_NET, they are built
without the declarations leading to spurious build failures and not
registered to bpf_map_types making them unavailable.
Fix it by moving the BPF_MAP_TYPE for the two map types outside of
CONFIG_NET.
Merge branch 'sock_map: fix ->poll() and update selftests'
Cong Wang says:
====================
This patchset fixes ->poll() for sockets in sockmap and updates
selftests accordingly with select(). Please check each patch
for more details.
Fixes: c50524ec4e3a ("Merge branch 'sockmap: add sockmap support for unix datagram socket'") Fixes: 89d69c5d0fbc ("Merge branch 'sockmap: introduce BPF_SK_SKB_VERDICT and support UDP'") Acked-by: John Fastabend <[email protected]>
---
v4: add a comment in udp_poll()
v3: drop sk_psock_get_checked()
reuse tcp_bpf_sock_is_readable()
v2: rename and reuse ->stream_memory_read()
fix a compile error in sk_psock_get_checked()
Cong Wang (3):
net: rename ->stream_memory_read to ->sock_is_readable
skmsg: extract and reuse sk_msg_is_readable()
net: implement ->sock_is_readable() for UDP and AF_UNIX
Yucong Sun [Fri, 8 Oct 2021 20:33:06 +0000 (13:33 -0700)]
selftests/bpf: Use recv_timeout() instead of retries
We use non-blocking sockets in those tests, retrying for
EAGAIN is ugly because there is no upper bound for the packet
arrival time, at least in theory. After we fix poll() on
sockmap sockets, now we can switch to select()+recv().
Cong Wang [Fri, 8 Oct 2021 20:33:05 +0000 (13:33 -0700)]
net: Implement ->sock_is_readable() for UDP and AF_UNIX
Yucong noticed we can't poll() sockets in sockmap even
when they are the destination sockets of redirections.
This is because we never poll any psock queues in ->poll(),
except for TCP. With ->sock_is_readable() now we can
overwrite >sock_is_readable(), invoke and implement it for
both UDP and AF_UNIX sockets.
Cong Wang [Fri, 8 Oct 2021 20:33:03 +0000 (13:33 -0700)]
net: Rename ->stream_memory_read to ->sock_is_readable
The proto ops ->stream_memory_read() is currently only used
by TCP to check whether psock queue is empty or not. We need
to rename it before reusing it for non-TCP protocols, and
adjust the exsiting users accordingly.
Liu Jian [Tue, 12 Oct 2021 05:20:19 +0000 (13:20 +0800)]
tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function
With two Msgs, msgA and msgB and a user doing nonblocking sendmsg calls (or
multiple cores) on a single socket 'sk' we could get the following flow.
msgA, sk msgB, sk
----------- ---------------
tcp_bpf_sendmsg()
lock(sk)
psock = sk->psock
tcp_bpf_sendmsg()
lock(sk) ... blocking
tcp_bpf_send_verdict
if (psock->eval == NONE)
psock->eval = sk_psock_msg_verdict
..
< handle SK_REDIRECT case >
release_sock(sk) < lock dropped so grab here >
ret = tcp_bpf_sendmsg_redir
psock = sk->psock
tcp_bpf_send_verdict
lock_sock(sk) ... blocking on B
if (psock->eval == NONE) <- boom.
psock->eval will have msgA state
The problem here is we dropped the lock on msgA and grabbed it with msgB.
Now we have old state in psock and importantly psock->eval has not been
cleared. So msgB will run whatever action was done on A and the verdict
program may never see it.
Walter Stoll [Thu, 14 Oct 2021 10:22:29 +0000 (12:22 +0200)]
watchdog: Fix OMAP watchdog early handling
TI's implementation does not service the watchdog even if the kernel
command line parameter omap_wdt.early_enable is set to 1. This patch
fixes the issue.
All registers are 32 bits in size and should be accessed using 32-bit
reads and writes. If an access size other than 32 bits is used then
the results are IMPLEMENTATION DEFINED.
and for qemu, the implementation will only allow 32-bit accesses
resulting in a synchronous external abort when configuring the watchdog.
Use lo_hi_* accessors rather than a readq/writeq.
Guenter Roeck [Fri, 8 Oct 2021 00:33:02 +0000 (17:33 -0700)]
Revert "watchdog: iTCO_wdt: Account for rebooting on second timeout"
This reverts commit cb011044e34c ("watchdog: iTCO_wdt: Account for
rebooting on second timeout") and commit aec42642d91f ("watchdog: iTCO_wdt:
Fix detection of SMI-off case") since those patches cause a regression
on certain boards (https://bugzilla.kernel.org/show_bug.cgi?id=213809).
While this revert may result in some boards to only reset after twice
the configured timeout value, that is still better than a watchdog reset
after half the configured value.
Wenbin Mei [Tue, 26 Oct 2021 07:08:12 +0000 (15:08 +0800)]
mmc: cqhci: clear HALT state after CQE enable
While mmc0 enter suspend state, we need halt CQE to send legacy cmd(flush
cache) and disable cqe, for resume back, we enable CQE and not clear HALT
state.
In this case MediaTek mmc host controller will keep the value for HALT
state after CQE disable/enable flow, so the next CQE transfer after resume
will be timeout due to CQE is in HALT state, the log as below:
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: timeout for tag 2
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Caps: 0x100020b6 | Version: 0x00000510
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Config: 0x00001103 | Control: 0x00000001
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000006
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Int sig: 0x00000006 | Int Coal: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: TDL base: 0xfd05f000 | TDL up32: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Doorbell: 0x8000203c | TCN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00001000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: SSC2: 0x00000001 | DCMD rsp: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: Resp idx: 0x00000000 | Resp arg: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQP: 0x00000000 | CRNQDUN: 0x00000000
<4>.(4)[318:kworker/4:1H]mmc0: cqhci: CRNQIS: 0x00000000 | CRNQIE: 0x00000000
This change check HALT state after CQE enable, if CQE is in HALT state, we
will clear it.
Jaehoon Chung [Fri, 22 Oct 2021 08:21:06 +0000 (17:21 +0900)]
mmc: dw_mmc: exynos: fix the finding clock sample value
Even though there are candiates value if can't find best value, it's
returned -EIO. It's not proper behavior.
If there is not best value, use a first candiate value to work eMMC.
Ming Lei [Tue, 26 Oct 2021 10:12:04 +0000 (18:12 +0800)]
block: drain queue after disk is removed from sysfs
Before removing disk from sysfs, userspace still may change queue via
sysfs, such as switching elevator or setting wbt latency, both may
reinitialize wbt, then the warning in blk_free_queue_stats() will be
triggered since rq_qos_exit() is moved to del_gendisk().
Fixes the issue by moving draining queue & tearing down after disk is
removed from sysfs, at that time no one can come into queue's
store()/show().
Arnd Bergmann [Tue, 26 Oct 2021 14:20:49 +0000 (16:20 +0200)]
Merge tag 'qcom-arm64-fixes-for-5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into arm/fixes
Qualcomm ARM64 DTS one more fix for 5.15
This reverts a clock change in the Qualcomm RB5 devicetree which in some
combinations of firmware and configuration causes the device to crash
during boot.
Data on an adjacent platform indicates that this is probably not be the
root cause of the problem, but this resolves the regression seen on RB5
and will allow the SM8250 platform to boot v5.15.
* tag 'qcom-arm64-fixes-for-5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux:
Revert "arm64: dts: qcom: sm8250: remove bus clock from the mdss node for sm8250 target"
Johan Hovold [Tue, 26 Oct 2021 10:36:17 +0000 (12:36 +0200)]
net: lan78xx: fix division by zero in send path
Add the missing endpoint max-packet sanity check to probe() to avoid
division by zero in lan78xx_tx_bh() in case a malicious device has
broken descriptors (or when doing descriptor fuzz testing).
Note that USB core will reject URBs submitted for endpoints with zero
wMaxPacketSize but that drivers doing packet-size calculations still
need to handle this (cf. commit 2548288b4fb0 ("USB: Fix: Don't skip
endpoint descriptors with maxpacket=0")).
Pavel Skripkin [Sun, 24 Oct 2021 13:13:56 +0000 (16:13 +0300)]
net: batman-adv: fix error handling
Syzbot reported ODEBUG warning in batadv_nc_mesh_free(). The problem was
in wrong error handling in batadv_mesh_init().
Before this patch batadv_mesh_init() was calling batadv_mesh_free() in case
of any batadv_*_init() calls failure. This approach may work well, when
there is some kind of indicator, which can tell which parts of batadv are
initialized; but there isn't any.
All written above lead to cleaning up uninitialized fields. Even if we hide
ODEBUG warning by initializing bat_priv->nc.work, syzbot was able to hit
GPF in batadv_nc_purge_paths(), because hash pointer in still NULL. [1]
To fix these bugs we can unwind batadv_*_init() calls one by one.
It is good approach for 2 reasons: 1) It fixes bugs on error handling
path 2) It improves the performance, since we won't call unneeded
batadv_*_free() functions.
So, this patch makes all batadv_*_init() clean up all allocated memory
before returning with an error to no call correspoing batadv_*_free()
and open-codes batadv_mesh_free() with proper order to avoid touching
uninitialized fields.
Max VA [Mon, 25 Oct 2021 15:31:53 +0000 (17:31 +0200)]
tipc: fix size validations for the MSG_CRYPTO type
The function tipc_crypto_key_rcv is used to parse MSG_CRYPTO messages
to receive keys from other nodes in the cluster in order to decrypt any
further messages from them.
This patch verifies that any supplied sizes in the message body are
valid for the received message.
nfc: port100: fix using -ERRNO as command type mask
During probing, the driver tries to get a list (mask) of supported
command types in port100_get_command_type_mask() function. The value
is u64 and 0 is treated as invalid mask (no commands supported). The
function however returns also -ERRNO as u64 which will be interpret as
valid command mask.
Return 0 on every error case of port100_get_command_type_mask(), so the
probing will stop.
Cyril Strejc [Sun, 24 Oct 2021 20:14:25 +0000 (22:14 +0200)]
net: multicast: calculate csum of looped-back and forwarded packets
During a testing of an user-space application which transmits UDP
multicast datagrams and utilizes multicast routing to send the UDP
datagrams out of defined network interfaces, I've found a multicast
router does not fill-in UDP checksum into locally produced, looped-back
and forwarded UDP datagrams, if an original output NIC the datagrams
are sent to has UDP TX checksum offload enabled.
The datagrams are sent malformed out of the NIC the datagrams have been
forwarded to.
It is because:
1. If TX checksum offload is enabled on the output NIC, UDP checksum
is not calculated by kernel and is not filled into skb data.
2. dev_loopback_xmit(), which is called solely by
ip_mc_finish_output(), sets skb->ip_summed = CHECKSUM_UNNECESSARY
unconditionally.
3. Since 35fc92a9 ("[NET]: Allow forwarding of ip_summed except
CHECKSUM_COMPLETE"), the ip_summed value is preserved during
forwarding.
4. If ip_summed != CHECKSUM_PARTIAL, checksum is not calculated during
a packet egress.
The minimum fix in dev_loopback_xmit():
1. Preserves skb->ip_summed CHECKSUM_PARTIAL. This is the
case when the original output NIC has TX checksum offload enabled.
The effects are:
a) If the forwarding destination interface supports TX checksum
offloading, the NIC driver is responsible to fill-in the
checksum.
b) If the forwarding destination interface does NOT support TX
checksum offloading, checksums are filled-in by kernel before
skb is submitted to the NIC driver.
c) For local delivery, checksum validation is skipped as in the
case of CHECKSUM_UNNECESSARY, thanks to skb_csum_unnecessary().
2. Translates ip_summed CHECKSUM_NONE to CHECKSUM_UNNECESSARY. It
means, for CHECKSUM_NONE, the behavior is unmodified and is there
to skip a looped-back packet local delivery checksum validation.
Thomas Perrot [Fri, 22 Oct 2021 14:21:04 +0000 (16:21 +0200)]
spi: spl022: fix Microwire full duplex mode
There are missing braces in the function that verify controller parameters,
then an error is always returned when the parameter to select Microwire
frames operation is used on devices allowing it.
Sagi Grimberg [Sun, 24 Oct 2021 07:43:31 +0000 (10:43 +0300)]
nvme-tcp: fix H2CData PDU send accounting (again)
We should not access request members after the last send, even to
determine if indeed it was the last data payload send. The reason is
that a completion could have arrived and trigger a new execution of the
request which overridden these members. This was fixed by commit 825619b09ad3 ("nvme-tcp: fix possible use-after-completion").
Commit e371af033c56 broke that assumption again to address cases where
multiple r2t pdus are sent per request. To fix it, we need to record the
request data_sent and data_len and after the payload network send we
reference these counters to determine weather we should advance the
request iterator.
nvmet-tcp: fix a memory leak when releasing a queue
page_frag_free() won't completely release the memory
allocated for the commands, the cache page must be explicitly
freed by calling __page_frag_cache_drain().
This bug can be easily reproduced by repeatedly
executing the following command on the initiator:
Imre Deak [Mon, 18 Oct 2021 09:41:49 +0000 (12:41 +0300)]
drm/i915/dp: Skip the HW readout of DPCD on disabled encoders
Reading out the DP encoders' DPCD during booting or resume is only
required for enabled encoders: such encoders may be modesetted during
the initial commit and the link training this involves depends on an
initialized DPCD. For DDI encoders reading out the DPCD is skipped, do
the same on pre-DDI platforms.
Atm, the first DPCD readout without a sink connected - which is a likely
scneario if the encoder is disabled - leaves intel_dp->num_common_rates
at 0, which resulted in
intel_dp_sync_state()->intel_dp_max_common_rate()
in a
intel_dp->common_rates[-1]
access. This by definition results in an undefined behaviour, though to
my best knowledge in all HW/compiler configurations it actually results
in accessing the array item type value preceding the array. In this
case the preceding value happens to be intel_dp->num_common_rates,
which is 0, so this issue - by luck - didn't cause a user visible
problem.
Nevertheless it's still an undefined behaviour and in CONFIG_UBSAN
builds leads to a kernel BUG() (which revealed this problem for us),
hence CC:stable.
A related problem in case the encoder is enabled but the sink is not
connected or the DPCD readout fails is fixed by the next patch.
v2: Amend the commit message describing the root cause of the
CONFIG_UBSAN BUG().
Ville Syrjälä [Thu, 14 Oct 2021 09:09:39 +0000 (12:09 +0300)]
drm/i915: Convert unconditional clflush to drm_clflush_virt_range()
This one is apparently a "clflush for good measure", so bit more
justification (if you can call it that) than some of the others.
Convert to drm_clflush_virt_range() again so that machines without
clflush will survive the ordeal.
Ido Schimmel [Sun, 24 Oct 2021 06:40:14 +0000 (09:40 +0300)]
mlxsw: pci: Recycle received packet upon allocation failure
When the driver fails to allocate a new Rx buffer, it passes an empty Rx
descriptor (contains zero address and size) to the device and marks it
as invalid by setting the skb pointer in the descriptor's metadata to
NULL.
After processing enough Rx descriptors, the driver will try to process
the invalid descriptor, but will return immediately seeing that the skb
pointer is NULL. Since the driver no longer passes new Rx descriptors to
the device, the Rx queue will eventually become full and the device will
start to drop packets.
Fix this by recycling the received packet if allocation of the new
packet failed. This means that allocation is no longer performed at the
end of the Rx routine, but at the start, before tearing down the DMA
mapping of the received packet.
Remove the comment about the descriptor being zeroed as it is no longer
correct. This is OK because we either use the descriptor as-is (when
recycling) or overwrite its address and size fields with that of the
newly allocated Rx buffer.
The issue was discovered when a process ("perf") consumed too much
memory and put the system under memory pressure. It can be reproduced by
injecting slab allocation failures [1]. After the fix, the Rx queue no
longer comes to a halt.
nvdimm/pmem: stop using q_usage_count as external pgmap refcount
Originally all DAX access when through block_device operations and thus
needed a queue reference. But since commit cccbce671582
("filesystem-dax: convert to dax_direct_access()") all this happens at
the DAX device level which uses its own refcounting. Having the external
refcount thus wasn't needed but has otherwise been harmless for long
time.
But now that "block: drain file system I/O on del_gendisk" waits for
q_usage_count to reach 0 in del_gendisk this whole scheme can't work
anymore (and pmem is the only driver abusing q_usage_count like that).
So switch to the internal reference and remove the unbalanced
blk_freeze_queue_start that is taken care of by del_gendisk.
Yongxin Liu [Mon, 11 Oct 2021 07:02:16 +0000 (15:02 +0800)]
ice: check whether PTP is initialized in ice_ptp_release()
PTP is currently only supported on E810 devices, it is checked
in ice_ptp_init(). However, there is no check in ice_ptp_release().
For other E800 series devices, ice_ptp_release() will be wrongly executed.
Fix the following calltrace.
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
Workqueue: ice ice_service_task [ice]
Call Trace:
dump_stack_lvl+0x5b/0x82
dump_stack+0x10/0x12
register_lock_class+0x495/0x4a0
? find_held_lock+0x3c/0xb0
__lock_acquire+0x71/0x1830
lock_acquire+0x1e6/0x330
? ice_ptp_release+0x3c/0x1e0 [ice]
? _raw_spin_lock+0x19/0x70
? ice_ptp_release+0x3c/0x1e0 [ice]
_raw_spin_lock+0x38/0x70
? ice_ptp_release+0x3c/0x1e0 [ice]
ice_ptp_release+0x3c/0x1e0 [ice]
ice_prepare_for_reset+0xcb/0xe0 [ice]
ice_do_reset+0x38/0x110 [ice]
ice_service_task+0x138/0xf10 [ice]
? __this_cpu_preempt_check+0x13/0x20
process_one_work+0x26a/0x650
worker_thread+0x3f/0x3b0
? __kthread_parkme+0x51/0xb0
? process_one_work+0x650/0x650
kthread+0x161/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x1f/0x30
Dave Ertman [Thu, 7 Oct 2021 15:40:31 +0000 (08:40 -0700)]
ice: Respond to a NETDEV_UNREGISTER event for LAG
When the PF is a member of a link aggregate, and the driver
is removed, the process will hang unless we respond to the
NETDEV_UNREGISTER event that is sent to the event_handler
for LAG.
Add a case statement for the ice_lag_event_handler to unlink
the PF from the link aggregate.
Also remove code that was incorrectly applying a dev_hold to
peer_netdevs that were associated with the ice driver.
Fixes: df006dd4b1dc ("ice: Add initial support framework for LAG") Signed-off-by: Dave Ertman <[email protected]> Tested-by: Tony Brelinski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
This upstream commit broke AOSP (post Android 12 merge) build
on RB5. The device either silently crashes into USB crash mode
after android boot animation or we see a blank blue screen
with following dpu errors in dmesg:
[ T444] hw recovery is not complete for ctl:3
[ T444] [drm:dpu_encoder_phys_vid_prepare_for_kickoff:539] [dpu error]enc31 intf1 ctl 3 reset failure: -22
[ T444] [drm:dpu_encoder_phys_vid_wait_for_commit_done:513] [dpu error]vblank timeout
[ T444] [drm:dpu_kms_wait_for_commit_done:454] [dpu error]wait for commit done returned -110
[ C7] [drm:dpu_encoder_frame_done_timeout:2127] [dpu error]enc31 frame done timeout
[ T444] [drm:dpu_encoder_phys_vid_wait_for_commit_done:513] [dpu error]vblank timeout
[ T444] [drm:dpu_kms_wait_for_commit_done:454] [dpu error]wait for commit done returned -110
secretmem: Prevent secretmem_users from wrapping to zero
Commit 110860541f44 ("mm/secretmem: use refcount_t instead of atomic_t")
attempted to fix the problem of secretmem_users wrapping to zero and
allowing suspend once again.
But it was reverted in commit 87066fdd2e30 ("Revert 'mm/secretmem: use
refcount_t instead of atomic_t'") because of the problems it caused - a
refcount_t was not semantically the right type to use.
Instead prevent secretmem_users from wrapping to zero by forbidding new
users if the number of users has wrapped from positive to negative.
This stops a long way short of reaching the necessary 4 billion users
where it wraps to zero again, so there's no need to be clever with
special anti-wrap types or checking the return value from atomic_inc().
Linus Torvalds [Mon, 25 Oct 2021 17:46:41 +0000 (10:46 -0700)]
spi: Fix tegra20 build with CONFIG_PM=n once again
Commit efafec27c565 ("spi: Fix tegra20 build with CONFIG_PM=n") already
fixed the build without PM support once. There was an alternative fix
by Guenter in commit 2bab94090b01 ("spi: tegra20-slink: Declare runtime
suspend and resume functions conditionally"), and Mark then merged the
two correctly in ffb1e76f4f32 ("Merge tag 'v5.15-rc2' into spi-5.15").
But for some inexplicable reason, Mark then merged things _again_ in
commit 59c4e190b10c ("Merge tag 'v5.15-rc3' into spi-5.15"), and screwed
things up at that point, and the __maybe_unused attribute on
tegra_slink_runtime_resume() went missing.
Reinstate it, so that alpha (and other architectures without PM support)
builds cleanly again.
Btw, this is another prime example of how random back-merges are not
good. Just don't do them. Subsystem developers should not merge my
tree in any normal circumstances. Both of those merge commits pointed
to above are bad: even the one that got the merge result right doesn't
even mention _why_ it was done, and the one that got it wrong is
obviously broken.
Linus Torvalds [Mon, 25 Oct 2021 16:47:18 +0000 (09:47 -0700)]
Merge tag 'pinctrl-v5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
"Some late pin control fixes, the most generally annoying will probably
be the AMD IRQ storm fix affecting the Microsoft surface.
Summary:
- Three fixes pertaining to Broadcom DT bindings. Some stuff didn't
work out as inteded, we need to back out
- A resume bug fix in the STM32 driver
- Disable and mask the interrupts on probe in the AMD pinctrl driver,
affecting Microsoft surface"
* tag 'pinctrl-v5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: amd: disable and mask interrupts on probe
pinctrl: stm32: use valid pin identifier in stm32_pinctrl_resume()
Revert "pinctrl: bcm: ns: support updated DT binding as syscon subnode"
dt-bindings: pinctrl: brcm,ns-pinmux: drop unneeded CRU from example
Revert "dt-bindings: pinctrl: bcm4708-pinmux: rework binding to use syscon"
Xin Long [Mon, 25 Oct 2021 06:31:48 +0000 (02:31 -0400)]
net-sysfs: initialize uid and gid before calling net_ns_get_ownership
Currently in net_ns_get_ownership() it may not be able to set uid or gid
if make_kuid or make_kgid returns an invalid value, and an uninit-value
issue can be triggered by this.
This patch is to fix it by initializing the uid and gid before calling
net_ns_get_ownership(), as it does in kobject_get_ownership()
Dongli Zhang [Fri, 22 Oct 2021 23:31:39 +0000 (16:31 -0700)]
xen/netfront: stop tx queues during live migration
The tx queues are not stopped during the live migration. As a result, the
ndo_start_xmit() may access netfront_info->queues which is freed by
talk_to_netback()->xennet_destroy_queues().
This patch is to netif_device_detach() at the beginning of xen-netfront
resuming, and netif_device_attach() at the end of resuming.
CPU A CPU B
talk_to_netback()
-> if (info->queues)
xennet_destroy_queues(info);
to free netfront_info->queues
xennet_start_xmit()
to access netfront_info->queues
Michael Chan [Mon, 25 Oct 2021 09:05:28 +0000 (05:05 -0400)]
net: Prevent infinite while loop in skb_tx_hash()
Drivers call netdev_set_num_tc() and then netdev_set_tc_queue()
to set the queue count and offset for each TC. So the queue count
and offset for the TCs may be zero for a short period after dev->num_tc
has been set. If a TX packet is being transmitted at this time in the
code path netdev_pick_tx() -> skb_tx_hash(), skb_tx_hash() may see
nonzero dev->num_tc but zero qcount for the TC. The while loop that
keeps looping while hash >= qcount will not end.
Fix it by checking the TC's qcount to be nonzero before using it.
Fixes: eadec877ce9c ("net: Add support for subordinate traffic classes to netdev_pick_tx") Reviewed-by: Andy Gospodarek <[email protected]> Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Mark Zhang [Sun, 24 Oct 2021 06:08:20 +0000 (09:08 +0300)]
RDMA/sa_query: Use strscpy_pad instead of memcpy to copy a string
When copying the device name, the length of the data memcpy copied exceeds
the length of the source buffer, which cause the KASAN issue below. Use
strscpy_pad() instead.
Trevor Woerner [Sun, 24 Oct 2021 17:50:02 +0000 (13:50 -0400)]
net: nxp: lpc_eth.c: avoid hang when bringing interface down
A hard hang is observed whenever the ethernet interface is brought
down. If the PHY is stopped before the LPC core block is reset,
the SoC will hang. Comparing lpc_eth_close() and lpc_eth_open() I
re-arranged the ordering of the functions calls in lpc_eth_close() to
reset the hardware before stopping the PHY. Fixes: b7370112f519 ("lpc32xx: Added ethernet driver") Signed-off-by: Trevor Woerner <[email protected]> Acked-by: Vladimir Zapolskiy <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Johannes Berg [Mon, 25 Oct 2021 11:31:12 +0000 (13:31 +0200)]
cfg80211: fix management registrations locking
The management registrations locking was broken, the list was
locked for each wdev, but cfg80211_mgmt_registrations_update()
iterated it without holding all the correct spinlocks, causing
list corruption.
Rather than trying to fix it with fine-grained locking, just
move the lock to the wiphy/rdev (still need the list on each
wdev), we already need to hold the wdev lock to change it, so
there's no contention on the lock in any case. This trivially
fixes the bug since we hold one wdev's lock already, and now
will hold the lock that protects all lists.
Walter Stoll <[email protected]> reported a race condition
between "ethtool -s eth0 speed 100 duplex full autoneg off" and phylib
reading the current status from the PHY. Both ksetting_get and
ksetting_set fail the take the phydev mutex, and as a result, there is
a small window of time where the phydev members are not self
consistent.
Patch 1 fixes phy_ethtool_ksettings_get by adding the needed lock.
Patches 2 and 3 move code around and perform to refactoring, to allow
patch 4 to fix phy_ethtool_ksettings_set by added the lock.
Thanks go to Walter for the detailed origional report, suggested fix,
and testing of the proposed patches.
====================
Andrew Lunn [Sun, 24 Oct 2021 19:48:05 +0000 (21:48 +0200)]
phy: phy_ethtool_ksettings_set: Lock the PHY while changing settings
There is a race condition where the PHY state machine can change
members of the phydev structure at the same time userspace requests a
change via ethtool. To prevent this, have phy_ethtool_ksettings_set
take the PHY lock.
Andrew Lunn [Sun, 24 Oct 2021 19:48:04 +0000 (21:48 +0200)]
phy: phy_start_aneg: Add an unlocked version
Split phy_start_aneg into a wrapper which takes the PHY lock, and a
helper doing the real work. This will be needed when
phy_ethtook_ksettings_set takes the lock.
Fixes: 2d55173e71b0 ("phy: add generic function to support ksetting support") Signed-off-by: Andrew Lunn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Andrew Lunn [Sun, 24 Oct 2021 19:48:03 +0000 (21:48 +0200)]
phy: phy_ethtool_ksettings_set: Move after phy_start_aneg
This allows it to make use of a helper which assume the PHY is already
locked.
Fixes: 2d55173e71b0 ("phy: add generic function to support ksetting support") Signed-off-by: Andrew Lunn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Andrew Lunn [Sun, 24 Oct 2021 19:48:02 +0000 (21:48 +0200)]
phy: phy_ethtool_ksettings_get: Lock the phy for consistency
The PHY structure should be locked while copying information out if
it, otherwise there is no guarantee of self consistency. Without the
lock the PHY state machine could be updating the structure.
Fixes: 2d55173e71b0 ("phy: add generic function to support ksetting support") Signed-off-by: Andrew Lunn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
LABBE Corentin [Thu, 21 Oct 2021 09:26:57 +0000 (10:26 +0100)]
ARM: 9148/1: handle CONFIG_CPU_ENDIAN_BE32 in arch/arm/kernel/head.S
My intel-ixp42x-welltech-epbx100 no longer boot since 4.14.
This is due to commit 463dbba4d189 ("ARM: 9104/2: Fix Keystone 2 kernel
mapping regression")
which forgot to handle CONFIG_CPU_ENDIAN_BE32 as possible BE config.
Suggested-by: Krzysztof Hałasa <[email protected]> Fixes: 463dbba4d189 ("ARM: 9104/2: Fix Keystone 2 kernel mapping regression") Signed-off-by: Corentin Labbe <[email protected]> Signed-off-by: Russell King (Oracle) <[email protected]>
powerpc/pseries/iommu: Create huge DMA window if no MMIO32 is present
The iommu_init_table() helper takes an address range to reserve in
the IOMMU table being initialized to exclude MMIO addresses, this is
useful if the window stretches far beyond 4GB (although wastes some TCEs).
At the moment the code searches for such MMIO32 range and fails if none
found which is considered a problem while it really is not: it is actually
better as this says there is no MMIO32 to reserve and we can use
usually wasted TCEs. Furthermore PHYP never actually allows creating
windows starting at busaddress=0 so this MMIO32 range is never useful.
This removes error exit and initializes the table with zero range if
no MMIO32 is detected.
powerpc/pseries/iommu: Check if the default window in use before removing it
At the moment this check is performed after we remove the default window
which is late and disallows to revert whatever changes enable_ddw()
has made to DMA windows.
This moves the check and error exit before removing the window.
This raised the message severity from "debug" to "warning" as this
should not happen in practice and cannot be triggered by the userspace.
Converting the "secretmem_users" counter to a refcount is incorrect,
because a refcount is special in zero and can't just be incremented (but
a count of users is not, and "no users" is actually perfectly valid and
not a sign of a free'd resource).
* tag '5.15-rc6-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: add buffer validation in session setup
ksmbd: throttle session setup failures to avoid dictionary attacks
ksmbd: validate OutputBufferLength of QUERY_DIR, QUERY_INFO, IOCTL requests
ksmbd: validate credit charge after validating SMB2 PDU body size
ksmbd: add buffer validation for smb direct
ksmbd: limit read/write/trans buffer size not to exceed 8MB
ksmbd: validate compound response buffer
ksmbd: fix potencial 32bit overflow from data area check in smb2_write
ksmbd: improve credits management
ksmbd: add validation in smb2_ioctl
Linus Torvalds [Sun, 24 Oct 2021 16:23:48 +0000 (06:23 -1000)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Ten fixes, seven of which are in drivers.
The core fixes are one to fix a potential crash on resume, one to sort
out our reference count releases to avoid releasing in-use modules and
one to adjust the cmd per lun calculation to avoid an overflow in
hyper-v"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: ufs-pci: Force a full restore after suspend-to-disk
scsi: qla2xxx: Fix unmap of already freed sgl
scsi: qla2xxx: Fix a memory leak in an error path of qla2x00_process_els()
scsi: qla2xxx: Return -ENOMEM if kzalloc() fails
scsi: sd: Fix crashes in sd_resume_runtime()
scsi: mpi3mr: Fix duplicate device entries when scanning through sysfs
scsi: core: Put LLD module refcnt after SCSI device is released
scsi: storvsc: Fix validation for unsolicited incoming packets
scsi: iscsi: Fix set_param() handling
scsi: core: Fix shost->cmd_per_lun calculation in scsi_add_host_with_dma()
Yuiko Oshino [Fri, 22 Oct 2021 15:53:43 +0000 (11:53 -0400)]
net: ethernet: microchip: lan743x: Fix dma allocation failure by using dma_set_mask_and_coherent
The dma failure was reported in the raspberry pi github (issue #4117).
https://github.com/raspberrypi/linux/issues/4117
The use of dma_set_mask_and_coherent fixes the issue.
Tested on 32/64-bit raspberry pi CM4 and 64-bit ubuntu x86 PC with EVB-LAN7430.
Fixes: 23f0703c125b ("lan743x: Add main source files for new lan743x driver") Signed-off-by: Yuiko Oshino <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Yuiko Oshino [Fri, 22 Oct 2021 15:13:53 +0000 (11:13 -0400)]
net: ethernet: microchip: lan743x: Fix driver crash when lan743x_pm_resume fails
The driver needs to clean up and return when the initialization fails on resume.
Fixes: 23f0703c125b ("lan743x: Add main source files for new lan743x driver") Signed-off-by: Yuiko Oshino <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Linus Torvalds [Sat, 23 Oct 2021 03:42:13 +0000 (17:42 -1000)]
Merge tag 'block-5.15-2021-10-22' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Fix for the cgroup code not ussing irq safe stats updates, and one fix
for an error handling condition in add_partition()"
* tag 'block-5.15-2021-10-22' of git://git.kernel.dk/linux-block:
block: fix incorrect references to disk objects
blk-cgroup: blk_cgroup_bio_start() should use irq-safe operations on blkg->iostat_cpu
Linus Torvalds [Sat, 23 Oct 2021 03:34:31 +0000 (17:34 -1000)]
Merge tag 'io_uring-5.15-2021-10-22' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Two fixes for the max workers limit API that was introduced this
series: one fix for an issue with that code, and one fixing a linked
timeout regression in this series"
* tag 'io_uring-5.15-2021-10-22' of git://git.kernel.dk/linux-block:
io_uring: apply worker limits to previous users
io_uring: fix ltimeout unprep
io_uring: apply max_workers limit to all future users
io-wq: max_worker fixes
This is because that since the commit 2b0d3d3e4fcf ("percpu_ref: reduce
memory footprint of percpu_ref in fast path") root_cgrp->bpf.refcnt.data
is allocated by the function percpu_ref_init in cgroup_bpf_inherit which
is called by cgroup_setup_root when mounting, but not freed along with
root_cgrp when umounting. Adding cgroup_bpf_offline which calls
percpu_ref_kill to cgroup_kill_sb can free root_cgrp->bpf.refcnt.data in
umount path.
This patch also fixes the commit 4bfc0bb2c60e ("bpf: decouple the lifetime
of cgroup_bpf from cgroup itself"). A cgroup_bpf_offline is needed to do a
cleanup that frees the resources which are allocated by cgroup_bpf_inherit
in cgroup_setup_root.
And inside cgroup_bpf_offline, cgroup_get() is at the beginning and
cgroup_put is at the end of cgroup_bpf_release which is called by
cgroup_bpf_offline. So cgroup_bpf_offline can keep the balance of
cgroup's refcount.
====================
Fix some inconsistencies of bpf_jit_limit on non-x86 platforms.
I've dropped exposing bpf_jit_current since we couldn't agree on
file modes, correct names, etc.
====================
Linus Torvalds [Fri, 22 Oct 2021 20:39:47 +0000 (10:39 -1000)]
Merge tag 'fuse-fixes-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
"Syzbot discovered a race in case of reusing the fuse sb (introduced in
this cycle).
Fix it by doing the s_fs_info initialization at the proper place"
* tag 'fuse-fixes-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: clean up error exits in fuse_fill_super()
fuse: always initialize sb->s_fs_info
fuse: clean up fuse_mount destruction
fuse: get rid of fuse_put_super()
fuse: check s_root when destroying sb
====================
sctp: enhancements for the verification tag
This patchset is to address CVE-2021-3772:
A flaw was found in the Linux SCTP stack. A blind attacker may be able to
kill an existing SCTP association through invalid chunks if the attacker
knows the IP-addresses and port numbers being used and the attacker can
send packets with spoofed IP addresses.
This is caused by the missing VTAG verification for the received chunks
and the incorrect vtag for the ABORT used to reply to these invalid
chunks.
This patchset is to go over all processing functions for the received
chunks and do:
1. Make sure sctp_vtag_verify() is called firstly to verify the vtag from
the received chunk and discard this chunk if it fails. With some
exceptions:
a. sctp_sf_do_5_1B_init()/5_2_2_dupinit()/9_2_reshutack(), processing
INIT chunk, as sctphdr vtag is always 0 in INIT chunk.
b. sctp_sf_do_5_2_4_dupcook(), processing dupicate COOKIE_ECHO chunk,
as the vtag verification will be done by sctp_tietags_compare() and
then it takes right actions according to the return.
c. sctp_sf_shut_8_4_5(), processing SHUTDOWN_ACK chunk for cookie_wait
and cookie_echoed state, as RFC demand sending a SHUTDOWN_COMPLETE
even if the vtag verification failed.
d. sctp_sf_ootb(), called in many types of chunks for closed state or
no asoc, as the same reason to c.
2. Always use the vtag from the received INIT chunk to make the response
ABORT in sctp_ootb_pkt_new().
3. Fix the order for some checks and add some missing checks for the
received chunk.
This patch series has been tested with SCTP TAHI testing to make sure no
regression caused on protocol conformance.
====================
Xin Long [Wed, 20 Oct 2021 11:42:47 +0000 (07:42 -0400)]
sctp: add vtag check in sctp_sf_ootb
sctp_sf_ootb() is called when processing DATA chunk in closed state,
and many other places are also using it.
The vtag in the chunk's sctphdr should be verified, otherwise, as
later in chunk length check, it may send abort with the existent
asoc's vtag, which can be exploited by one to cook a malicious
chunk to terminate a SCTP asoc.
When fails to verify the vtag from the chunk, this patch sets asoc
to NULL, so that the abort will be made with the vtag from the
received chunk later.
Xin Long [Wed, 20 Oct 2021 11:42:46 +0000 (07:42 -0400)]
sctp: add vtag check in sctp_sf_do_8_5_1_E_sa
sctp_sf_do_8_5_1_E_sa() is called when processing SHUTDOWN_ACK chunk
in cookie_wait and cookie_echoed state.
The vtag in the chunk's sctphdr should be verified, otherwise, as
later in chunk length check, it may send abort with the existent
asoc's vtag, which can be exploited by one to cook a malicious
chunk to terminate a SCTP asoc.
Note that when fails to verify the vtag from SHUTDOWN-ACK chunk,
SHUTDOWN COMPLETE message will still be sent back to peer, but
with the vtag from SHUTDOWN-ACK chunk, as said in 5) of
rfc4960#section-8.4.
While at it, also remove the unnecessary chunk length check from
sctp_sf_shut_8_4_5(), as it's already done in both places where
it calls sctp_sf_shut_8_4_5().
Xin Long [Wed, 20 Oct 2021 11:42:45 +0000 (07:42 -0400)]
sctp: add vtag check in sctp_sf_violation
sctp_sf_violation() is called when processing HEARTBEAT_ACK chunk
in cookie_wait state, and some other places are also using it.
The vtag in the chunk's sctphdr should be verified, otherwise, as
later in chunk length check, it may send abort with the existent
asoc's vtag, which can be exploited by one to cook a malicious
chunk to terminate a SCTP asoc.