Commit 72adf2421d9b ("ice: Move common functions out of ice_main.c part
2/7") moved an older version of ice_setup_rx_ctx() function with
usage of magic number 7.
Reimplement the commit 5ab522443bd1 ("ice: Cleanup magic number") to use
ICE_RLAN_BASE_S instead of magic number.
Resets may occur with or without user interaction. For example, a TX hang
or reconfiguration of parameters will result in a reset. During reset, the
VSI is freed, freeing any statistics structures inside as well. This would
create an issue for the user where a reset happens in the background,
statistics set to zero, and the user checks ring statistics expecting them
to be populated.
To ensure this doesn't happen, accumulate ring statistics over reset.
Define a new ring statistics structure, ice_ring_stats. The new structure
lives in the VSI's parent, preserving ring statistics when VSI is freed.
1. Define a new structure vsi_ring_stats in the PF scope
2. Allocate/free stats only during probe, unload, or change in ring size
3. Replace previous ring statistics functionality with new structure
ice: Accumulate HW and Netdev statistics over reset
Resets happen with or without user interaction. For example, incidents
such as TX hang or a reconfiguration of parameters will result in a reset.
During reset, hardware and software statistics were set to zero. This
created an issue for the user where a reset happens in the background,
statistics set to zero, and the user checks statistics expecting them to
be populated.
To ensure this doesn't happen, keep accumulating stats over reset.
1. Remove function calls which reset hardware and netdev statistics.
2. Do not rollover statistics in ice_stat_update40 during reset.
Brett Creeley [Mon, 31 Oct 2022 17:09:12 +0000 (10:09 -0700)]
ice: Remove and replace ice speed defines with ethtool.h versions
The driver is currently using ICE_LINK_SPEED_* defines that mirror what
ethtool.h defines, with one exception ICE_LINK_SPEED_UNKNOWN.
This issue is fixed by the following changes:
1. replace ICE_LINK_SPEED_UNKNOWN with 0 because SPEED_UNKNOWN in
ethtool.h is "-1" and that doesn't match the driver's expected behavior
2. transform ICE_LINK_SPEED_*MBPS to SPEED_* using static tables and
fls()-1 to convert from BIT() to an index in a table.
Wang Hai [Tue, 15 Nov 2022 17:24:07 +0000 (01:24 +0800)]
e100: Fix possible use after free in e100_xmit_prepare
In e100_xmit_prepare(), if we can't map the skb, then return -ENOMEM, so
e100_xmit_frame() will return NETDEV_TX_BUSY and the upper layer will
resend the skb. But the skb is already freed, which will cause UAF bug
when the upper layer resends the skb.
Remove the harmful free.
Fixes: 5e5d49422dfb ("e100: Release skb when DMA mapping is failed in e100_xmit_prepare") Signed-off-by: Wang Hai <[email protected]> Reviewed-by: Alexander Duyck <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
Yuan Can [Mon, 14 Nov 2022 08:26:40 +0000 (08:26 +0000)]
iavf: Fix error handling in iavf_init_module()
The iavf_init_module() won't destroy workqueue when pci_register_driver()
failed. Call destroy_workqueue() when pci_register_driver() failed to
prevent the resource leak.
Similar to the handling of u132_hcd_init in commit f276e002793c
("usb: u132-hcd: fix resource leak")
The reason is that fm10k_init_module() returns fm10k_register_pci_driver()
directly without checking its return value, if fm10k_register_pci_driver()
failed, it returns without removing debugfs and destroy workqueue,
resulting the debugfs of fm10k can never be created later and leaks the
workqueue.
Fix by remove debugfs and destroy workqueue when
fm10k_register_pci_driver() returns error.
Fixes: 7461fd913afe ("fm10k: Add support for debugfs") Fixes: b382bb1b3e2d ("fm10k: use separate workqueue for fm10k driver") Signed-off-by: Yuan Can <[email protected]> Reviewed-by: Jacob Keller <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
Shang XiaoJing [Wed, 16 Nov 2022 01:27:25 +0000 (09:27 +0800)]
i40e: Fix error handling in i40e_init_module()
i40e_init_module() won't free the debugfs directory created by
i40e_dbg_init() when pci_register_driver() failed. Add fail path to
call i40e_dbg_exit() to remove the debugfs entries to prevent the bug.
Shang XiaoJing [Mon, 14 Nov 2022 02:57:58 +0000 (10:57 +0800)]
ixgbevf: Fix resource leak in ixgbevf_init_module()
ixgbevf_init_module() won't destroy the workqueue created by
create_singlethread_workqueue() when pci_register_driver() failed. Add
destroy_workqueue() in fail path to prevent the resource leak.
Similar to the handling of u132_hcd_init in commit f276e002793c
("usb: u132-hcd: fix resource leak")
Fixes: 40a13e2493c9 ("ixgbevf: Use a private workqueue to avoid certain possible hangs") Signed-off-by: Shang XiaoJing <[email protected]> Reviewed-by: Saeed Mahameed <[email protected]> Tested-by: Konrad Jankowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
Takashi Iwai [Wed, 23 Nov 2022 16:14:10 +0000 (17:14 +0100)]
Merge tag 'asoc-fix-v6.1-rc6' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v6.1
A clutch of small fixes that have come in in the past week, people seem
to have been unusually active for this late in the release cycle. The
most critical one here is the fix to renumber the SOF DAI types in order
to restore ABI compatibility which was broken by the addition of AMD
support.
Zhen Lei [Tue, 22 Nov 2022 11:50:02 +0000 (19:50 +0800)]
btrfs: sysfs: normalize the error handling branch in btrfs_init_sysfs()
Although kset_unregister() can eventually remove all attribute files,
explicitly rolling back with the matching function makes the code logic
look clearer.
Filipe Manana [Mon, 21 Nov 2022 10:23:22 +0000 (10:23 +0000)]
btrfs: do not modify log tree while holding a leaf from fs tree locked
When logging an inode in full mode, or when logging xattrs or when logging
the dir index items of a directory, we are modifying the log tree while
holding a read lock on a leaf from the fs/subvolume tree. This can lead to
a deadlock in rare circumstances, but it is a real possibility, and it was
recently reported by syzbot with the following trace from lockdep:
WARNING: possible circular locking dependency detected
6.1.0-rc5-next-20221116-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.1/16154 is trying to acquire lock: ffff88807e3084a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
but task is already holding lock: ffff88807df33078 (btrfs-log-00){++++}-{3:3}, at: __btrfs_tree_lock+0x32/0x3d0 fs/btrfs/locking.c:197
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
Holding a read lock on a leaf from a fs/subvolume tree creates a nasty
lock dependency when we are COWing extent buffers for the log tree and we
have two tasks modifying the log tree, with each one in one of the
following 2 scenarios:
1) Modifying the log tree triggers an extent buffer allocation while
holding a write lock on a parent extent buffer from the log tree.
Allocating the pages for an extent buffer, or the extent buffer
struct, can trigger inode eviction and finally the inode eviction
will trigger a release/remove of a delayed node, which requires
taking the delayed node's mutex;
2) Allocating a metadata extent for a log tree can trigger the async
reclaim thread and make us wait for it to release enough space and
unblock our reservation ticket. The reclaim thread can start flushing
delayed items, and that in turn results in the need to lock delayed
node mutexes and in the need to write lock extent buffers of a
subvolume tree - all this while holding a write lock on the parent
extent buffer in the log tree.
So one task in scenario 1) running in parallel with another task in
scenario 2) could lead to a deadlock, one wanting to lock a delayed node
mutex while having a read lock on a leaf from the subvolume, while the
other is holding the delayed node's mutex and wants to write lock the same
subvolume leaf for flushing delayed items.
Fix this by cloning the leaf of the fs/subvolume tree, release/unlock the
fs/subvolume leaf and use the clone leaf instead.
Otherwise the kernel memory allocator seems to be unhappy about failing
order 6 allocations for the zones array, that cause 100% reproducible
mount failures in my qemu setup:
Calling drm_connector_update_edid_property() in
amdgpu_connector_free_edid() causes a noticeable pause in
the system every 10 seconds on polled outputs so revert this
part of the change.
Tsung-hua Lin [Wed, 9 Nov 2022 04:54:22 +0000 (12:54 +0800)]
drm/amd/display: No display after resume from WB/CB
[why]
First MST sideband message returns AUX_RET_ERROR_HPD_DISCON
on certain intel platform. Aux transaction considered failure
if HPD unexpected pulled low. The actual aux transaction success
in such case, hence do not return error.
[how]
Not returning error when AUX_RET_ERROR_HPD_DISCON detected
on the first sideband message.
Dillon Varone [Fri, 11 Nov 2022 19:06:58 +0000 (14:06 -0500)]
drm/amd/display: Use new num clk levels struct for max mclk index
[WHY?]
When calculating watermark and dlg values, the max mclk level index and
associated speed are needed to find the correlated dummy latency value.
Currently the incorrect index is given due to a clock manager refactor.
[HOW?]
Use num_memclk_level from num_entries_per_clk struct for getting the correct max
mem speed.
There's been a very long running bug that seems to have been neglected for
a while, where amdgpu consistently triggers a KASAN error at start:
BUG: KASAN: global-out-of-bounds in read_indirect_azalia_reg+0x1d4/0x2a0 [amdgpu]
Read of size 4 at addr ffffffffc2274b28 by task modprobe/1889
After digging through amd's rather creative method for accessing registers,
I eventually discovered the problem likely has to do with the fact that on
my dce120 GPU there are supposedly 7 sets of audio registers. But we only
define a register mapping for 6 sets.
Edward Cree [Mon, 21 Nov 2022 21:37:08 +0000 (21:37 +0000)]
sfc: ensure type is valid before updating seen_gen
In the case of invalid or corrupted v2 counter update packets,
efx_tc_rx_version_2() returns EFX_TC_COUNTER_TYPE_MAX. In this case
we should not attempt to update generation counts as this will write
beyond the end of the seen_gen array.
Reported-by: coverity-bot <[email protected]>
Addresses-Coverity-ID: 1527356 ("Memory - illegal accesses") Fixes: 25730d8be5d8 ("sfc: add extra RX channel to receive MAE counter updates on ef100") Signed-off-by: Edward Cree <[email protected]> Reviewed-by: Kees Cook <[email protected]> Signed-off-by: David S. Miller <[email protected]>
net/cdc_ncm: Fix multicast RX support for CDC NCM devices with ZLP
ZLP for DisplayLink ethernet devices was enabled in 6.0: 266c0190aee3 ("net/cdc_ncm: Enable ZLP for DisplayLink ethernet devices").
The related driver_info should be the "same as cdc_ncm_info, but with
FLAG_SEND_ZLP". However, set_rx_mode that enables handling multicast
traffic was missing in the new cdc_ncm_zlp_info.
usbnet_cdc_update_filter rx mode was introduced in linux 5.9 with: e10dcb1b6ba7 ("net: cdc_ncm: hook into set_rx_mode to admit multicast
traffic")
Without this hook, multicast, and then IPv6 SLAAC, is broken.
Fixes: 266c0190aee3 ("net/cdc_ncm: Enable ZLP for DisplayLink ethernet devices") Signed-off-by: Santiago Ruano Rincón <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Davide Tronchin [Mon, 21 Nov 2022 12:54:55 +0000 (13:54 +0100)]
net: usb: qmi_wwan: add u-blox 0x1342 composition
Add RmNet support for LARA-L6.
LARA-L6 module can be configured (by AT interface) in three different
USB modes:
* Default mode (Vendor ID: 0x1546 Product ID: 0x1341) with 4 serial
interfaces
* RmNet mode (Vendor ID: 0x1546 Product ID: 0x1342) with 4 serial
interfaces and 1 RmNet virtual network interface
* CDC-ECM mode (Vendor ID: 0x1546 Product ID: 0x1343) with 4 serial
interface and 1 CDC-ECM virtual network interface
In RmNet mode LARA-L6 exposes the following interfaces:
If 0: Diagnostic
If 1: AT parser
If 2: AT parser
If 3: AT parset/alternative functions
If 4: RMNET interface
Jakub Sitnicki [Mon, 21 Nov 2022 08:54:26 +0000 (09:54 +0100)]
l2tp: Don't sleep and disable BH under writer-side sk_callback_lock
When holding a reader-writer spin lock we cannot sleep. Calling
setup_udp_tunnel_sock() with write lock held violates this rule, because we
end up calling percpu_down_read(), which might sleep, as syzbot reports
[1]:
Trim the writer-side critical section for sk_callback_lock down to the
minimum, so that it covers only operations on sk_user_data.
Also, when grabbing the sk_callback_lock, we always need to disable BH, as
Eric points out. Failing to do so leads to deadlocks because we acquire
sk_callback_lock in softirq context, which can get stuck waiting on us if:
Bagas Sanjaya [Mon, 21 Nov 2022 03:58:55 +0000 (10:58 +0700)]
Documentation: devlink: Add blank line padding on numbered lists in Devlink Port documentation
kernel test robot reported indentation warnings:
Documentation/networking/devlink/devlink-port.rst:220: WARNING: Unexpected indentation.
Documentation/networking/devlink/devlink-port.rst:222: WARNING: Block quote ends without a blank line; unexpected unindent.
These warnings cause lists (arbitration flow for which the warnings blame to
and 3-step subfunction setup) to be rendered inline instead. Also, for the
former list, automatic list numbering is messed up.
Fix these warnings by adding missing blank line padding.
Wang Hai [Sun, 20 Nov 2022 06:24:38 +0000 (14:24 +0800)]
arcnet: fix potential memory leak in com20020_probe()
In com20020_probe(), if com20020_config() fails, dev and info
will not be freed, which will lead to a memory leak.
This patch adds freeing dev and info after com20020_config()
fails to fix this bug.
Compile tested only.
Fixes: 15b99ac17295 ("[PATCH] pcmcia: add return value to _config() functions") Signed-off-by: Wang Hai <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Miklos Szeredi [Wed, 23 Nov 2022 08:10:42 +0000 (09:10 +0100)]
fuse: lock inode unconditionally in fuse_fallocate()
file_modified() must be called with inode lock held. fuse_fallocate()
didn't lock the inode in case of just FALLOC_KEEP_SIZE flags value, which
resulted in a kernel Warning in notify_change().
Lock the inode unconditionally, like all other fallocate implementations
do.
trans_xen did not check the data fits into the buffer before copying
from the xen ring, but we probably should.
Add a check that just skips the request and return an error to
userspace if it did not fit
====================
Revert "veth: Avoid drop packets when xdp_redirect performs" and its fix
This patch 2e0de6366ac16 enables napi of the peer veth automatically when
the veth loads the xdp, but it breaks down as reported by Paolo and John.
So reverting it and its fix, we will rework the patch and make it more
robust based on comments.
====================
This patch fixes the panic maked by 2e0de6366ac16. Now Paolo
and Toke suggest reverting the patch 2e0de6366ac16 and making
it stronger, so do this first.
Jakub Kicinski [Wed, 23 Nov 2022 04:41:57 +0000 (20:41 -0800)]
Merge branch 'remove-dsa_priv-h'
Vladimir Oltean says:
====================
Remove dsa_priv.h
After working on the "Autoload DSA tagging driver when dynamically
changing protocol" series:
https://patchwork.kernel.org/project/netdevbpf/cover/20221115011847.2843127[email protected]/
it became clear to me that the situation with DSA headers is a bit
messy, and I put the tagging protocol driver macros in a pretty random
temporary spot in dsa_priv.h.
Now is the time to make the net/dsa/ folder a bit more organized, and to
make tagging protocol driver modules include just headers they're going
to use.
Another thing is the merging and cleanup of dsa.c and dsa2.c. Before,
dsa.c had 589 lines and dsa2.c had 1817 lines. Now, the combined dsa.c
has 1749 lines, the rest went to some other places.
Sorry for the set size, I know the rules, but since this is basically
code movement for the most part, I thought more patches are better.
====================
Vladimir Oltean [Mon, 21 Nov 2022 13:55:55 +0000 (15:55 +0200)]
net: dsa: kill off dsa_priv.h
The last remnants in dsa_priv.h are a netlink-related definition for
which we create a new header, and DSA_MAX_NUM_OFFLOADING_BRIDGES which
is only used from dsa.c, so move it there.
Some inclusions need to be adjusted now that we no longer have headers
included transitively from dsa_priv.h.
Vladimir Oltean [Mon, 21 Nov 2022 13:55:54 +0000 (15:55 +0200)]
net: dsa: move tag_8021q headers to their proper place
tag_8021q definitions are all over the place. Some are exported to
linux/dsa/8021q.h (visible by DSA core, taggers, switch drivers and
everyone else), and some are in dsa_priv.h.
Move the structures that don't need external visibility into tag_8021q.c,
and the ones which don't need the world or switch drivers to see them
into tag_8021q.h.
We also have the tag_8021q.h inclusion from switch.c, which is basically
the entire reason why tag_8021q.c was built into DSA in commit 8b6e638b4be2 ("net: dsa: build tag_8021q.c as part of DSA core").
I still don't know how to better deal with that, so leave it alone.
Vladimir Oltean [Mon, 21 Nov 2022 13:55:52 +0000 (15:55 +0200)]
net: dsa: rename dsa2.c back into dsa.c and create its header
The previous change moved the code into the larger file (dsa2.c) to
minimize the delta. Rename that now to dsa.c, and create dsa.h, where
all related definitions from dsa_priv.h go.
Vladimir Oltean [Mon, 21 Nov 2022 13:55:49 +0000 (15:55 +0200)]
net: dsa: move dsa_tree_notify() and dsa_broadcast() to switch.c
There isn't an intuitive place for these 2 cross-chip notifier functions
according to the function-to-file classification based on names
(dsa_switch_*() goes to switch.c), but I consider these to be part of
the cross-chip notifier handling, therefore part of switch.c. Move them
there to reduce bloat in dsa2.c (the place where all code with no better
place to go goes).
Vladimir Oltean [Mon, 21 Nov 2022 13:55:47 +0000 (15:55 +0200)]
net: dsa: move tagging protocol code to tag.{c,h}
It would be nice if tagging protocol drivers could include just the
header they need, since they are (mostly) data path and isolated from
most of the other DSA core code does.
Create a tag.c and a tag.h file which are meant to support tagging
protocol drivers.
Vladimir Oltean [Mon, 21 Nov 2022 13:55:46 +0000 (15:55 +0200)]
net: dsa: move headers exported by slave.c to slave.h
Minimize the use of the bloated dsa_priv.h by moving the prototypes
exported by slave.c to their own header file.
This is just approximate to get the code structure right. There are some
interdependencies with static inline code left in dsa_priv.h, so leave
slave.h included from there for now.
Vladimir Oltean [Mon, 21 Nov 2022 13:55:43 +0000 (15:55 +0200)]
net: dsa: move rest of devlink setup/teardown to devlink.c
The code that needed further refactoring into dedicated functions in
dsa2.c was left aside. Move it now to devlink.c, and make dsa2.c stop
including net/devlink.h.
Vladimir Oltean [Mon, 21 Nov 2022 13:55:41 +0000 (15:55 +0200)]
net: dsa: move bulk of devlink code to devlink.{c,h}
dsa.c and dsa2.c are bloated with too much off-topic code. Identify all
code related to devlink and move it to a new devlink.c file.
Steer clear of the dsa_priv.h dumping ground antipattern and create a
dedicated devlink.h for it, which will be included only by the C files
which need it. Usage of dsa_priv.h will be minimized in later patches.
Vladimir Oltean [Mon, 21 Nov 2022 13:55:40 +0000 (15:55 +0200)]
net: dsa: modularize DSA_TAG_PROTO_NONE
There is no reason that I can see why the no-op tagging protocol should
be registered manually, so make it a module and make all drivers which
have any sort of reference to DSA_TAG_PROTO_NONE select it.
Note that I don't know if ksz_get_tag_protocol() really needs this,
or if it's just the logic which is poorly written. All switches seem to
have their own tagging protocol, and DSA_TAG_PROTO_NONE is just a
fallback that never gets used.
Saeed Mahameed [Tue, 22 Nov 2022 18:41:58 +0000 (10:41 -0800)]
tcp: Fix build break when CONFIG_IPV6=n
The cited commit caused the following build break when CONFIG_IPV6 was
disabled
net/ipv4/tcp_input.c: In function ‘tcp_syn_flood_action’:
include/net/sock.h:387:37: error: ‘const struct sock_common’ has no member named ‘skc_v6_rcv_saddr’; did you mean ‘skc_rcv_saddr’?
Fix by using inet6_rcv_saddr() macro which handles this situation
nicely.
Jakub Kicinski [Wed, 23 Nov 2022 04:20:58 +0000 (20:20 -0800)]
Merge tag 'mlx5-fixes-2022-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2022-11-21
This series provides bug fixes to mlx5 driver.
* tag 'mlx5-fixes-2022-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5e: Fix possible race condition in macsec extended packet number update routine
net/mlx5e: Fix MACsec update SecY
net/mlx5e: Fix MACsec SA initialization routine
net/mlx5e: Remove leftovers from old XSK queues enumeration
net/mlx5e: Offload rule only when all encaps are valid
net/mlx5e: Fix missing alignment in size of MTT/KLM entries
net/mlx5: Fix sync reset event handler error flow
net/mlx5: E-Switch, Set correctly vport destination
net/mlx5: Lag, avoid lockdep warnings
net/mlx5: Fix handling of entry refcount when command is not issued to FW
net/mlx5: cmdif, Print info on any firmware cmd failure to tracepoint
net/mlx5: SF: Fix probing active SFs during driver probe phase
net/mlx5: Fix FW tracer timestamp calculation
net/mlx5: Do not query pci info while pci disabled
====================
Yan Cangang [Sun, 20 Nov 2022 05:52:59 +0000 (13:52 +0800)]
net: ethernet: mtk_eth_soc: fix memory leak in error path
In mtk_ppe_init(), when dmam_alloc_coherent() or devm_kzalloc() failed,
the rhashtable ppe->l2_flows isn't destroyed. Fix it.
In mtk_probe(), when mtk_ppe_init() or mtk_eth_offload_init() or
register_netdev() failed, have the same problem. Fix it.
Fixes: 33fc42de3327 ("net: ethernet: mtk_eth_soc: support creating mac address based offload entries") Signed-off-by: Yan Cangang <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
Ziyang Xuan [Sun, 20 Nov 2022 03:54:05 +0000 (11:54 +0800)]
net: ethernet: mtk_eth_soc: fix potential memory leak in mtk_rx_alloc()
When fail to dma_map_single() in mtk_rx_alloc(), it returns directly.
But the memory allocated for local variable data is not freed, and
local variabel data has not been attached to ring->data[i] yet, so the
memory allocated for local variable data will not be freed outside
mtk_rx_alloc() too. Thus memory leak would occur in this scenario.
Add skb_free_frag(data) when dma_map_single() failed.
====================
dccp/tcp: Fix bhash2 issues related to WARN_ON() in inet_csk_get_port().
syzkaller was hitting a WARN_ON() in inet_csk_get_port() in the 4th patch,
which was because we forgot to fix up bhash2 bucket when connect() for a
socket bound to a wildcard address fails in __inet_stream_connect().
There was a similar report [0], but its repro does not fire the WARN_ON() due
to inconsistent error handling.
When connect() for a socket bound to a wildcard address fails, saddr may or
may not be reset depending on where the failure happens. When we fail in
__inet_stream_connect(), sk->sk_prot->disconnect() resets saddr. OTOH, in
(dccp|tcp)_v[46]_connect(), if we fail after inet_hash6?_connect(), we
forget to reset saddr.
We fix this inconsistent error handling in the 1st patch, and then we'll
fix the bhash2 WARN_ON() issue.
Note that there is still an issue in that we reset saddr without checking
if there are conflicting sockets in bhash and bhash2, but this should be
another series.
dccp/tcp: Fixup bhash2 bucket when connect() fails.
If a socket bound to a wildcard address fails to connect(), we
only reset saddr and keep the port. Then, we have to fix up the
bhash2 bucket; otherwise, the bucket has an inconsistent address
in the list.
Also, listen() for such a socket will fire the WARN_ON() in
inet_csk_get_port(). [0]
Note that when a system runs out of memory, we give up fixing the
bucket and unlink sk from bhash and bhash2 by inet_put_port().
When we call connect() for a socket bound to a wildcard address, we update
saddr locklessly. However, it could result in a data race; another thread
iterating over bhash might see a corrupted address.
dccp/tcp: Reset saddr on failure after inet6?_hash_connect().
When connect() is called on a socket bound to the wildcard address,
we change the socket's saddr to a local address. If the socket
fails to connect() to the destination, we have to reset the saddr.
However, when an error occurs after inet_hash6?_connect() in
(dccp|tcp)_v[46]_conect(), we forget to reset saddr and leave
the socket bound to the address.
From the user's point of view, whether saddr is reset or not varies
with errno. Let's fix this inconsistent behaviour.
Note that after this patch, the repro [0] will trigger the WARN_ON()
in inet_csk_get_port() again, but this patch is not buggy and rather
fixes a bug papering over the bhash2's bug for which we need another
fix.
For the record, the repro causes -EADDRNOTAVAIL in inet_hash6_connect()
by this sequence:
Li Hua [Mon, 21 Nov 2022 03:06:20 +0000 (11:06 +0800)]
test_kprobes: fix implicit declaration error of test_kprobes
If KPROBES_SANITY_TEST and ARCH_CORRECT_STACKTRACE_ON_KRETPROBE is enabled, but
STACKTRACE is not set. Build failed as below:
lib/test_kprobes.c: In function `stacktrace_return_handler':
lib/test_kprobes.c:228:8: error: implicit declaration of function `stack_trace_save'; did you mean `stacktrace_driver'? [-Werror=implicit-function-declaration]
ret = stack_trace_save(stack_buf, STACK_BUF_SIZE, 0);
^~~~~~~~~~~~~~~~
stacktrace_driver
cc1: all warnings being treated as errors
scripts/Makefile.build:250: recipe for target 'lib/test_kprobes.o' failed
make[2]: *** [lib/test_kprobes.o] Error 1
To fix this error, Select STACKTRACE if ARCH_CORRECT_STACKTRACE_ON_KRETPROBE is enabled.
Chen Zhongjin [Fri, 18 Nov 2022 06:33:04 +0000 (14:33 +0800)]
nilfs2: fix nilfs_sufile_mark_dirty() not set segment usage as dirty
When extending segments, nilfs_sufile_alloc() is called to get an
unassigned segment, then mark it as dirty to avoid accidentally allocating
the same segment in the future.
But for some special cases such as a corrupted image it can be unreliable.
If such corruption of the dirty state of the segment occurs, nilfs2 may
reallocate a segment that is in use and pick the same segment for writing
twice at the same time.
This case started with segbuf1.segnum = 3, nextnum = 4 when constructed.
It supposed segment 4 has already been allocated and marked as dirty.
However the dirty state was corrupted and segment 4 usage was not dirty.
For the first time nilfs_segctor_extend_segments() segment 4 was allocated
again, which made segbuf2 and next segbuf3 had same segment 4.
sb_getblk() will get same bh for segbuf2 and segbuf3, and this bh is added
to both buffer lists of two segbuf. It makes the lists broken which
causes NULL pointer dereference.
Fix the problem by setting usage as dirty every time in
nilfs_sufile_mark_dirty(), which is called during constructing current
segment to be written out and before allocating next segment.
mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1
balance_dirty_pages doesn't do the required dirty throttling on cgroupv1.
See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
on traditional hierarchies"). Instead, the kernel depends on writeback
throttling in shrink_folio_list to achieve the same goal. With large
memory systems, the flusher may not be able to writeback quickly enough
such that we will start finding pages in the shrink_folio_list already in
writeback. Hence for cgroupv1 let's do a reclaim throttle after waking up
the flusher.
The below test which used to fail on a 256GB system completes till the the
file system is full with this change.
Qi Zheng [Fri, 18 Nov 2022 10:00:11 +0000 (18:00 +0800)]
mm: fix unexpected changes to {failslab|fail_page_alloc}.attr
When we specify __GFP_NOWARN, we only expect that no warnings will be
issued for current caller. But in the __should_failslab() and
__should_fail_alloc_page(), the local GFP flags alter the global
{failslab|fail_page_alloc}.attr, which is persistent and shared by all
tasks. This is not what we expected, let's fix it.
Chen Wandun [Fri, 18 Nov 2022 13:38:50 +0000 (21:38 +0800)]
swapfile: fix soft lockup in scan_swap_map_slots
A softlockup occurs in scan free swap slot under huge memory pressure.
The test scenario is: 64 CPU cores, 64GB memory, and 28 zram devices, the
disksize of each zram device is 50MB.
LATENCY_LIMIT is used to prevent softlockups in scan_swap_map_slots(), but
the real loop number would more than LATENCY_LIMIT because of "goto checks
and goto scan" repeatly without decreasing latency limit.
In order to fix it, decrease latency_ration in advance.
There is also a suspicious place that will cause softlockups in
get_swap_pages(). In this function, the "goto start_over" may result in
continuous scanning of the swap partition. If there is no cond_sched in
scan_swap_map_slots(), it would cause a softlockup (I am not sure about
this).
Mike Kravetz [Fri, 18 Nov 2022 19:52:49 +0000 (11:52 -0800)]
hugetlb: fix __prep_compound_gigantic_page page flag setting
Commit 2b21624fc232 ("hugetlb: freeze allocated pages before creating
hugetlb pages") changed the order page flags were cleared and set in the
head page. It moved the __ClearPageReserved after __SetPageHead.
However, there is a check to make sure __ClearPageReserved is never done
on a head page. If CONFIG_DEBUG_VM_PGFLAGS is enabled, the following BUG
will be hit when creating a hugetlb gigantic page:
page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page))
------------[ cut here ]------------
kernel BUG at include/linux/page-flags.h:500!
Call Trace will differ depending on whether hugetlb page is created
at boot time or run time.
Make sure to __ClearPageReserved BEFORE __SetPageHead.
Marco Elver [Fri, 18 Nov 2022 15:22:16 +0000 (16:22 +0100)]
kfence: fix stack trace pruning
Commit b14051352465 ("mm/sl[au]b: generalize kmalloc subsystem")
refactored large parts of the kmalloc subsystem, resulting in the stack
trace pruning logic done by KFENCE to no longer work.
While b14051352465 attempted to fix the situation by including
'__kmem_cache_free' in the list of functions KFENCE should skip through,
this only works when the compiler actually optimized the tail call from
kfree() to __kmem_cache_free() into a jump (and thus kfree() _not_
appearing in the full stack trace to begin with).
In some configurations, the compiler no longer optimizes the tail call
into a jump, and __kmem_cache_free() appears in the stack trace. This
means that the pruned stack trace shown by KFENCE would include kfree()
which is not intended - for example:
| BUG: KFENCE: invalid free in kfree+0x7c/0x120
|
| Invalid free of 0xffff8883ed8fefe0 (in kfence-#126):
| kfree+0x7c/0x120
| test_double_free+0x116/0x1a9
| kunit_try_run_case+0x90/0xd0
| [...]
Fix it by moving __kmem_cache_free() to the list of functions that may be
tail called by an allocator entry function, making the pruning logic work
in both the optimized and unoptimized tail call cases.
Yu Zhao [Wed, 16 Nov 2022 01:38:07 +0000 (18:38 -0700)]
mm: multi-gen LRU: retry folios written back while isolated
The page reclaim isolates a batch of folios from the tail of one of the
LRU lists and works on those folios one by one. For a suitable
swap-backed folio, if the swap device is async, it queues that folio for
writeback. After the page reclaim finishes an entire batch, it puts back
the folios it queued for writeback to the head of the original LRU list.
In the meantime, the page writeback flushes the queued folios also by
batches. Its batching logic is independent from that of the page reclaim.
For each of the folios it writes back, the page writeback calls
folio_rotate_reclaimable() which tries to rotate a folio to the tail.
folio_rotate_reclaimable() only works for a folio after the page reclaim
has put it back. If an async swap device is fast enough, the page
writeback can finish with that folio while the page reclaim is still
working on the rest of the batch containing it. In this case, that folio
will remain at the head and the page reclaim will not retry it before
reaching there.
This patch adds a retry to evict_folios(). After evict_folios() has
finished an entire batch and before it puts back folios it cannot free
immediately, it retries those that may have missed the rotation.
Before this patch, ~60% of folios swapped to an Intel Optane missed
folio_rotate_reclaimable(). After this patch, ~99% of missed folios were
reclaimed upon retry.
This problem affects relatively slow async swap devices like Samsung 980
Pro much less and does not affect sync swap devices like zram or zswap at
all.
Alistair Popple [Fri, 11 Nov 2022 00:51:35 +0000 (11:51 +1100)]
mm/migrate_device: return number of migrating pages in args->cpages
migrate_vma->cpages originally contained a count of the number of pages
migrating including non-present pages which can be populated directly on
the target.
Commit 241f68859656 ("mm/migrate_device.c: refactor migrate_vma and
migrate_device_coherent_page()") inadvertantly changed this to contain
just the number of pages that were unmapped. Usage of migrate_vma->cpages
isn't documented, but most drivers use it to see if all the requested
addresses can be migrated so restore the original behaviour.
Sam James [Wed, 16 Nov 2022 18:26:34 +0000 (18:26 +0000)]
kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible
Add missing <linux/string.h> include for strcmp.
Clang 16 makes -Wimplicit-function-declaration an error by default.
Unfortunately, out of tree modules may use this in configure scripts,
which means failure might cause silent miscompilation or misconfiguration.
For more information, see LWN.net [0] or LLVM's Discourse [1], gentoo-dev@ [2],
or the (new) c-std-porting mailing list [3].
[0] https://lwn.net/Articles/913505/
[1] https://discourse.llvm.org/t/configure-script-breakage-with-the-new-werror-implicit-function-declaration/65213
[2] https://archives.gentoo.org/gentoo-dev/message/dd9f2d3082b8b6f8dfbccb0639e6e240
[3] hosted at lists.linux.dev.
Ian Cowan [Mon, 14 Nov 2022 00:33:49 +0000 (19:33 -0500)]
mm: mmap: fix documentation for vma_mas_szero
When the struct_mm input, mm, was changed to a struct ma_state, mas, the
documentation for the function was never updated. This updates that
documentation reference.
SeongJae Park [Mon, 14 Nov 2022 17:55:52 +0000 (17:55 +0000)]
mm/damon/sysfs-schemes: skip stats update if the scheme directory is removed
A DAMON sysfs interface user can start DAMON with a scheme, remove the
sysfs directory for the scheme, and then ask update of the scheme's stats.
Because the schemes stats update logic isn't aware of the situation, it
results in an invalid memory access. Fix the bug by checking if the
scheme sysfs directory exists.
Alistair Popple [Mon, 14 Nov 2022 11:55:37 +0000 (22:55 +1100)]
mm/memory: return vm_fault_t result from migrate_to_ram() callback
The migrate_to_ram() callback should always succeed, but in rare cases can
fail usually returning VM_FAULT_SIGBUS. Commit 16ce101db85d
("mm/memory.c: fix race when faulting a device private page") incorrectly
stopped passing the return code up the stack. Fix this by setting the ret
variable, restoring the previous behaviour on migrate_to_ram() failure.
Li Liguang [Mon, 14 Nov 2022 19:48:28 +0000 (14:48 -0500)]
mm: correctly charge compressed memory to its memcg
Kswapd will reclaim memory when memory pressure is high, the annonymous
memory will be compressed and stored in the zpool if zswap is enabled.
The memcg_kmem_bypass() in get_obj_cgroup_from_page() will bypass the
kernel thread and cause the compressed memory not be charged to its memory
cgroup.
Remove the memcg_kmem_bypass() call and properly charge compressed memory
to its corresponding memory cgroup.
Mike Kravetz [Mon, 14 Nov 2022 21:00:18 +0000 (13:00 -0800)]
ipc/shm: call underlying open/close vm_ops
Shared memory segments can be created that are backed by hugetlb pages.
When this happens, the vmas associated with any mappings (shmat) are
marked VM_HUGETLB, yet the vm_ops for such mappings are provided by
ipc/shm (shm_vm_ops). There is a mechanism to call the underlying hugetlb
vm_ops, and this is done for most operations. However, it is not done for
open and close.
This was not an issue until the introduction of the hugetlb vma_lock.
This lock structure is pointed to by vm_private_data and the open/close
vm_ops help maintain this structure. The special hugetlb routine called
at fork took care of structure updates at fork time. However,
vma_splitting is not properly handled for ipc shared memory mappings
backed by hugetlb pages. This can result in a "kernel NULL pointer
dereference" BUG or use after free as two vmas point to the same lock
structure.
Update the shm open and close routines to always call the underlying open
and close routines.
Mukesh Ojha [Wed, 9 Nov 2022 19:01:37 +0000 (00:31 +0530)]
gcov: clang: fix the buffer overflow issue
Currently, in clang version of gcov code when module is getting removed
gcov_info_add() incorrectly adds the sfn_ptr->counter to all the
dst->functions and it result in the kernel panic in below crash report.
Fix this by properly handling it.
Gautam Menghani [Wed, 26 Oct 2022 04:45:24 +0000 (10:15 +0530)]
mm/khugepaged: refactor mm_khugepaged_scan_file tracepoint to remove filename from function call
Refactor the mm_khugepaged_scan_file tracepoint to move filename
dereference to the tracepoint definition, to maintain consistency with
other tracepoints[1].
mm/page_exit: fix kernel doc warning in page_ext_put()
Fix the below compiler warnings reported with 'make W=1 mm/'.
mm/page_ext.c:178: warning: Function parameter or member 'page_ext' not
described in 'page_ext_put'.
The khugepaged code would pick up the node with the most hit as the preferred
node, and also tries to do some balance if several nodes have the same
hit record. Basically it does conceptually:
* If the target_node <= last_target_node, then iterate from
last_target_node + 1 to MAX_NUMNODES (1024 on default config)
* If the max_value == node_load[nid], then target_node = nid
But there is a corner case, paritucularly for MADV_COLLAPSE, that the
non-existing node may be returned as preferred node.
Assuming the system has 2 nodes, the target_node is 0 and the
last_target_node is 1, if MADV_COLLAPSE path is hit, the max_value may
be 0, then it may return 2 for target_node, but it is actually not
existing (offline), so the warn is triggered.
The node balance was introduced by commit 9f1b868a13ac ("mm: thp:
khugepaged: add policy for finding target node") to satisfy
"numactl --interleave=all". But interleaving is a mere hint rather than
something that has hard requirements.
So use nodemask to record the nodes which have the same hit record, the
hugepage allocation could fallback to those nodes. And remove
__GFP_THISNODE since it does disallow fallback. And if the nodemask
just has one node set, it means there is one single node has the most
hit record, the nodemask approach actually behaves like __GFP_THISNODE.
While he reclaimer requested 4M, vmscan reclaimed close to 16G, most of it
by swapping. These requests take over a minute, during which the write()
to memory.reclaim is unkillably stuck inside the kernel.
Digging into the source, this is caused by the proportional reclaim
bailout logic. This code tries to resolve a fundamental conflict: to
reclaim roughly what was requested, while also aging all LRUs fairly and
in accordance to their size, swappiness, refault rates etc. The way it
attempts fairness is that once the reclaim goal has been reached, it stops
scanning the LRUs with the smaller remaining scan targets, and adjusts the
remainder of the bigger LRUs according to how much of the smaller LRUs was
scanned. It then finishes scanning that remainder regardless of the
reclaim goal.
This works fine if priority levels are low and the LRU lists are
comparable in size. However, in this instance, the cgroup that is
targeted by proactive reclaim has almost no files left - they've already
been squeezed out by proactive reclaim earlier - and the remaining anon
pages are hot. Anon rotations cause the priority level to drop to 0,
which results in reclaim targeting all of anon (a lot) and all of file
(almost nothing). By the time reclaim decides to bail, it has scanned
most or all of the file target, and therefor must also scan most or all of
the enormous anon target. This target is thousands of times larger than
the reclaim goal, thus causing the overreclaim.
The bailout code hasn't changed in years, why is this failing now? The
most likely explanations are two other recent changes in anon reclaim:
1. Before the series starting with commit 5df741963d52 ("mm: fix LRU
balancing effect of new transparent huge pages"), the VM was
overall relatively reluctant to swap at all, even if swap was
configured. This means the LRU balancing code didn't come into play
as often as it does now, and mostly in high pressure situations
where pronounced swap activity wouldn't be as surprising.
2. For historic reasons, shrink_lruvec() loops on the scan targets of
all LRU lists except the active anon one, meaning it would bail if
the only remaining pages to scan were active anon - even if there
were a lot of them.
Before the series starting with commit ccc5dc67340c ("mm/vmscan:
make active/inactive ratio as 1:1 for anon lru"), most anon pages
would live on the active LRU; the inactive one would contain only a
handful of preselected reclaim candidates. After the series, anon
gets aged similarly to file, and the inactive list is the default
for new anon pages as well, making it often the much bigger list.
As a result, the VM is now more likely to actually finish large
anon targets than before.
Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the
larger LRU lists is made before bailing out on a met reclaim goal.
This fixes the extreme overreclaim problem.
Fairness is more subtle and harder to evaluate. No obvious misbehavior
was observed on the test workload, in any case. Conceptually, fairness
should primarily be a cumulative effect from regular, lower priority
scans. Once the VM is in trouble and needs to escalate scan targets to
make forward progress, fairness needs to take a backseat. This is also
acknowledged by the myriad exceptions in get_scan_count(). This patch
makes fairness decrease gradually, as it keeps fairness work static over
increasing priority levels with growing scan targets. This should make
more sense - although we may have to re-visit the exact values.