Mikulas Patocka [Wed, 29 Jun 2022 17:40:57 +0000 (13:40 -0400)]
dm raid: fix KASAN warning in raid5_add_disks
There's a KASAN warning in raid5_add_disk when running the LVM testsuite.
The warning happens in the test
lvconvert-raid-reshape-linear_to_raid6-single-type.sh. We fix the warning
by verifying that rdev->saved_raid_disk is within limits.
Mikulas Patocka [Wed, 29 Jun 2022 17:40:01 +0000 (13:40 -0400)]
dm raid: fix KASAN warning in raid5_remove_disk
There's a KASAN warning in raid5_remove_disk when running the LVM
testsuite. We fix this warning by verifying that the "number" variable is
within limits.
John Garry [Wed, 29 Jun 2022 09:18:44 +0000 (17:18 +0800)]
ata: pata_cs5535: Fix W=1 warnings
x86_64 allmodconfig build with W=1 gives these warnings:
drivers/ata/pata_cs5535.c: In function ‘cs5535_set_piomode’:
drivers/ata/pata_cs5535.c:93:11: error: variable ‘dummy’ set but not
used [-Werror=unused-but-set-variable]
u32 reg, dummy;
^~~~~
drivers/ata/pata_cs5535.c: In function ‘cs5535_set_dmamode’:
drivers/ata/pata_cs5535.c:132:11: error: variable ‘dummy’ set but not
used [-Werror=unused-but-set-variable]
u32 reg, dummy;
^~~~~
cc1: all warnings being treated as errors
Mark variables 'dummy' as "maybe unused" as they are only ever written
in rdmsr() calls.
Jiang Jian [Wed, 22 Jun 2022 06:32:31 +0000 (14:32 +0800)]
hwmon: (pmbus/ucd9200) fix typos in comments
Drop the redundant word 'the' in the comments following
/*
* Set PHASE registers on all pages to 0xff to ensure that phase
* specific commands will apply to all phases of a given page (rail).
* This only affects the READ_IOUT and READ_TEMPERATURE2 registers.
* READ_IOUT will return the sum of currents of all phases of a rail,
* and READ_TEMPERATURE2 will return the maximum temperature detected
* for the [the - DROP] phases of the rail.
*/
Eddie James [Tue, 28 Jun 2022 20:30:29 +0000 (15:30 -0500)]
hwmon: (occ) Prevent power cap command overwriting poll response
Currently, the response to the power cap command overwrites the
first eight bytes of the poll response, since the commands use
the same buffer. This means that user's get the wrong data between
the time of sending the power cap and the next poll response update.
Fix this by specifying a different buffer for the power cap command
response.
Lukas Bulwahn [Tue, 28 Jun 2022 05:34:11 +0000 (07:34 +0200)]
PM / devfreq: passive: revert an editing accident in SPDX-License line
Commit 26984d9d581e ("PM / devfreq: passive: Keep cpufreq_policy for
possible cpus") reworked governor_passive.c, and accidently added a
tab in the first line, i.e., the SPDX-License-Identifier line.
The checkpatch script warns with the SPDX_LICENSE_TAG warning, and hence
pointed this issue out while investigating checkpatch warnings.
Revert this editing accident. No functional change.
Remove cpufreq_passive_unregister_notifier from
cpufreq_passive_register_notifier in case of error as devfreq core
already call unregister on GOV_START fail.
This fix the kernel always printing a WARN on governor PROBE_DEFER as
cpufreq_passive_unregister_notifier is called two times and return
error on the second call as the cpufreq is already unregistered.
Fixes: a03dacb0316f ("PM / devfreq: Add cpu based scaling support to passive governor") Signed-off-by: Christian Marangi <[email protected]> Signed-off-by: Chanwoo Choi <[email protected]>
PM / devfreq: Rework freq_table to be local to devfreq struct
On a devfreq PROBE_DEFER, the freq_table in the driver profile struct,
is never reset and may be leaved in an undefined state.
This comes from the fact that we store the freq_table in the driver
profile struct that is commonly defined as static and not reset on
PROBE_DEFER.
We currently skip the reinit of the freq_table if we found
it's already defined since a driver may declare his own freq_table.
This logic is flawed in the case devfreq core generate a freq_table, set
it in the profile struct and then PROBE_DEFER, freeing the freq_table.
In this case devfreq will found a NOT NULL freq_table that has been
freed, skip the freq_table generation and probe the driver based on the
wrong table.
To fix this and correctly handle PROBE_DEFER, use a local freq_table and
max_state in the devfreq struct and never modify the freq_table present
in the profile struct if it does provide it.
Miaoqian Lin [Thu, 26 May 2022 08:28:56 +0000 (12:28 +0400)]
PM / devfreq: exynos-ppmu: Fix refcount leak in of_get_devfreq_events
of_get_child_by_name() returns a node pointer with refcount
incremented, we should use of_node_put() on it when done.
This function only calls of_node_put() in normal path,
missing it in error paths.
Add missing of_node_put() to avoid refcount leak.
PM / devfreq: Fix cpufreq passive unregister erroring on PROBE_DEFER
With the passive governor, the cpu based scaling can PROBE_DEFER due to
the fact that CPU policy are not ready.
The cpufreq passive unregister notifier is called both from the
GOV_START errors and for the GOV_STOP and assume the notifier is
successfully registred every time. With GOV_START failing it's wrong to
loop over each possible CPU since the register path has failed for
some CPU policy not ready. Change the logic and unregister the notifer
based on the current allocated parent_cpu_data list to correctly handle
errors and the governor unregister path.
Fixes: a03dacb0316f ("PM / devfreq: Add cpu based scaling support to passive governor") Signed-off-by: Christian 'Ansuel' Marangi <[email protected]> Signed-off-by: Chanwoo Choi <[email protected]>
PM / devfreq: Mute warning on governor PROBE_DEFER
Don't print warning when a governor PROBE_DEFER as it's not a real
GOV_START fail.
Fixes: a03dacb0316f ("PM / devfreq: Add cpu based scaling support to passive governor") Signed-off-by: Christian 'Ansuel' Marangi <[email protected]> Signed-off-by: Chanwoo Choi <[email protected]>
PM / devfreq: Fix kernel panic with cpu based scaling to passive gov
The cpufreq passive register notifier can PROBE_DEFER and the devfreq
struct is freed and then reallocaed on probe retry.
The current logic assume that the code can't PROBE_DEFER so the devfreq
struct in the this variable in devfreq_passive_data is assumed to be
(if already set) always correct.
This cause kernel panic as the code try to access the wrong address.
To correctly handle this, update the this variable in
devfreq_passive_data to the devfreq reallocated struct.
Fixes: a03dacb0316f ("PM / devfreq: Add cpu based scaling support to passive governor") Signed-off-by: Christian 'Ansuel' Marangi <[email protected]> Signed-off-by: Chanwoo Choi <[email protected]>
Carlos Llamas [Tue, 21 Jun 2022 20:39:21 +0000 (20:39 +0000)]
drm/fourcc: fix integer type usage in uapi header
Kernel uapi headers are supposed to use __[us]{8,16,32,64} types defined
by <linux/types.h> as opposed to 'uint32_t' and similar. See [1] for the
relevant discussion about this topic. In this particular case, the usage
of 'uint64_t' escaped headers_check as these macros are not being called
here. However, the following program triggers a compilation error:
#include <drm/drm_fourcc.h>
int main()
{
unsigned long x = AMD_FMT_MOD_CLEAR(RB);
return 0;
}
gcc error:
drm.c:5:27: error: ‘uint64_t’ undeclared (first use in this function)
5 | unsigned long x = AMD_FMT_MOD_CLEAR(RB);
| ^~~~~~~~~~~~~~~~~
This patch changes AMD_FMT_MOD_{SET,CLEAR} macros to use the correct
integer types, which fixes the above issue.
Linus Torvalds [Wed, 29 Jun 2022 16:20:40 +0000 (09:20 -0700)]
Merge tag '5.19-rc4-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull ksmbd server fixes from Steve French:
- seek null check (don't use f_seek op directly and blindly)
- offset validation in FSCTL_SET_ZERO_DATA
- fallocate fix (relates e.g. to xfstests generic/091 and 263)
- two cleanup fixes
- fix socket settings on some arch
* tag '5.19-rc4-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: use vfs_llseek instead of dereferencing NULL
ksmbd: check invalid FileOffset and BeyondFinalZero in FSCTL_ZERO_DATA
ksmbd: set the range of bytes to zero without extending file size in FSCTL_ZERO_DATA
ksmbd: remove duplicate flag set in smb2_write
ksmbd: smbd: Remove useless license text when SPDX-License-Identifier is already used
ksmbd: use SOCK_NONBLOCK type for kernel_accept()
Jeff Layton [Mon, 6 Jun 2022 23:31:42 +0000 (19:31 -0400)]
ceph: wait on async create before checking caps for syncfs
Currently, we'll call ceph_check_caps, but if we're still waiting
on the reply, we'll end up spinning around on the same inode in
flush_dirty_session_caps. Wait for the async create reply before
flushing caps.
Cc: [email protected]
URL: https://tracker.ceph.com/issues/55823 Fixes: fbed7045f552 ("ceph: wait for async create reply before sending any cap messages") Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Xiubo Li <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
Darrick J. Wong [Sat, 25 Jun 2022 17:47:45 +0000 (10:47 -0700)]
xfs: dont treat rt extents beyond EOF as eofblocks to be cleared
On a system with a realtime volume and a 28k realtime extent,
generic/491 fails because the test opens a file on a frozen filesystem
and closing it causes xfs_release -> xfs_can_free_eofblocks to
mistakenly think that the the blocks of the realtime extent beyond EOF
are posteof blocks to be freed. Realtime extents cannot be partially
unmapped, so this is pointless. Worse yet, this triggers posteof
cleanup, which stalls on a transaction allocation, which is why the test
fails.
Teach the predicate to account for realtime extents properly.
Darrick J. Wong [Sat, 25 Jun 2022 17:01:20 +0000 (10:01 -0700)]
xfs: don't hold xattr leaf buffers across transaction rolls
Now that we've established (again!) that empty xattr leaf buffers are
ok, we no longer need to bhold them to transactions when we're creating
new leaf blocks. Get rid of the entire mechanism, which should simplify
the xattr code quite a bit.
The original justification for using bhold here was to prevent the AIL
from trying to write the empty leaf block into the fs during the brief
time that we release the buffer lock. The reason for /that/ was to
prevent recovery from tripping over the empty ondisk block.
Darrick J. Wong [Fri, 24 Jun 2022 22:01:28 +0000 (15:01 -0700)]
xfs: empty xattr leaf header blocks are not corruption
TLDR: Revert commit 51e6104fdb95 ("xfs: detect empty attr leaf blocks in
xfs_attr3_leaf_verify") because it was wrong.
Every now and then we get a corruption report from the kernel or
xfs_repair about empty leaf blocks in the extended attribute structure.
We've long thought that these shouldn't be possible, but prior to 5.18
one would shake loose in the recoveryloop fstests about once a month.
A new addition to the xattr leaf block verifier in 5.19-rc1 makes this
happen every 7 minutes on my testing cloud. I added a ton of logging to
detect any time we set the header count on an xattr leaf block to zero.
This produced the following dmesg output on generic/388:
So now we know that someone is creating empty xattr leaf blocks as part
of converting a sf xattr structure into a leaf xattr structure. The
conversion routine logs any existing sf attributes in the same
transaction that creates the leaf block, so we know this is a setxattr
to a file that has no attributes at all.
Next, g/388 calls the shutdown ioctl and cycles the mount to trigger log
recovery. I also augmented buffer item recovery to call ->verify_struct
on any attr leaf blocks and complain if it finds a failure:
As you can see, the bhold keeps the leaf buffer locked and thus prevents
the *AIL* from tripping over the ichdr.count==0 check in the write
verifier. Unfortunately, it doesn't prevent the log from getting
flushed to disk, which sets up log recovery to fail.
So. It's clear that the kernel has always had the ability to persist
attr leaf blocks with ichdr.count==0, which means that it's part of the
ondisk format now.
Unfortunately, this check has been added and removed multiple times
throughout history. It first appeared in[1] kernel 3.10 as part of the
early V5 format patches. The check was later discovered to break log
recovery and hence disabled[2] during log recovery in kernel 4.10.
Simultaneously, the check was added[3] to xfs_repair 4.9.0 to try to
weed out the empty leaf blocks. This was still not correct because log
recovery would recover an empty attr leaf block successfully only for
regular xattr operations to trip over the empty block during of the
block during regular operation. Therefore, the check was removed
entirely[4] in kernel 5.7 but removal of the xfs_repair check was
forgotten. The continued complaints from xfs_repair lead to us
mistakenly re-adding[5] the verifier check for kernel 5.19. Remove it
once again.
[1] 517c22207b04 ("xfs: add CRCs to attr leaf blocks")
[2] 2e1d23370e75 ("xfs: ignore leaf attr ichdr.count in verifier
during log replay")
[3] f7140161 ("xfs_repair: junk leaf attribute if count == 0")
[4] f28cef9e4dac ("xfs: don't fail verifier on empty attr3 leaf
block")
[5] 51e6104fdb95 ("xfs: detect empty attr leaf blocks in
xfs_attr3_leaf_verify")
Looking at the rest of the xattr code, it seems that files with empty
leaf blocks behave as expected -- listxattr reports no attributes;
getxattr on any xattr returns nothing as expected; removexattr does
nothing; and setxattr can add attributes just fine.
Original-bug: 517c22207b04 ("xfs: add CRCs to attr leaf blocks") Still-not-fixed-by: 2e1d23370e75 ("xfs: ignore leaf attr ichdr.count in verifier during log replay")
Removed-in: f28cef9e4dac ("xfs: don't fail verifier on empty attr3 leaf block") Fixes: 51e6104fdb95 ("xfs: detect empty attr leaf blocks in xfs_attr3_leaf_verify") Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
Ruozhu Li [Thu, 23 Jun 2022 06:45:39 +0000 (14:45 +0800)]
nvme: fix regression when disconnect a recovering ctrl
We encountered a problem that the disconnect command hangs.
After analyzing the log and stack, we found that the triggering
process is as follows:
CPU0 CPU1
nvme_rdma_error_recovery_work
nvme_rdma_teardown_io_queues
nvme_do_delete_ctrl nvme_stop_queues
nvme_remove_namespaces
--clear ctrl->namespaces
nvme_start_queues
--no ns in ctrl->namespaces
nvme_ns_remove return(because ctrl is deleting)
blk_freeze_queue
blk_mq_freeze_queue_wait
--wait for ns to unquiesce to clean infligt IO, hang forever
This problem was not found in older kernels because we will flush
err work in nvme_stop_ctrl before nvme_remove_namespaces.It does not
seem to be modified for functional reasons, the patch can be revert
to solve the problem.
Revert commit 794a4cb3d2f7 ("nvme: remove the .stop_ctrl callout")
Pablo Greco [Sat, 25 Jun 2022 12:15:02 +0000 (09:15 -0300)]
nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA XPG SX6000LNP (AKA SPECTRIX S40G)
ADATA XPG SPECTRIX S40G drives report bogus eui64 values that appear to
be the same across drives in one system. Quirk them out so they are
not marked as "non globally unique" duplicates.
Sagi Grimberg [Sun, 26 Jun 2022 09:24:51 +0000 (12:24 +0300)]
nvme-tcp: always fail a request when sending it failed
queue stoppage and inflight requests cancellation is fully fenced from
io_work and thus failing a request from this context. Hence we don't
need to try to guess from the socket retcode if this failure is because
the queue is about to be torn down or not.
We are perfectly safe to just fail it, the request will not be cancelled
later on.
This solves possible very long shutdown delays when the users issues a
'nvme disconnect-all'
Sagi Grimberg [Thu, 23 Jun 2022 21:49:53 +0000 (00:49 +0300)]
nvmet-tcp: fix regression in data_digest calculation
Data digest calculation iterates over command mapped iovec. However
since commit bac04454ef9f we unmap the iovec before we handle the data
digest, and since commit 69b85e1f1d1d we clear nr_mapped when we unmap
the iov.
Instead of open-coding the command iov traversal, simply call
crypto_ahash_digest with the command sg that is already allocated (we
already do that for the send path). Rename nvmet_tcp_send_ddgst to
nvmet_tcp_calc_ddgst and call it from send and recv paths.
Fixes: 69b85e1f1d1d ("nvmet-tcp: add an helper to free the cmd buffers") Fixes: bac04454ef9f ("nvmet-tcp: fix kmap leak when data digest in use") Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
Michael Walle [Mon, 27 Jun 2022 17:06:42 +0000 (19:06 +0200)]
NFC: nxp-nci: Don't issue a zero length i2c_master_read()
There are packets which doesn't have a payload. In that case, the second
i2c_master_read() will have a zero length. But because the NFC
controller doesn't have any data left, it will NACK the I2C read and
-ENXIO will be returned. In case there is no payload, just skip the
second i2c master read.
Fixes: 6be88670fc59 ("NFC: nxp-nci_i2c: Add I2C support to NXP NCI driver") Signed-off-by: Michael Walle <[email protected]> Reviewed-by: Krzysztof Kozlowski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
powerpc/memhotplug: Add add_pages override for PPC
With commit ffa0b64e3be5 ("powerpc: Fix virt_addr_valid() for 64-bit Book3E & 32-bit")
the kernel now validate the addr against high_memory value. This results
in the below BUG_ON with dax pfns.
The fix is to make sure we update high_memory on memory hotplug.
This is similar to what x86 does in commit 3072e413e305 ("mm/memory_hotplug: introduce add_pages")
Naveen N. Rao [Mon, 27 Jun 2022 19:11:19 +0000 (00:41 +0530)]
powerpc/bpf: Fix use of user_pt_regs in uapi
Trying to build a .c file that includes <linux/bpf_perf_event.h>:
$ cat test_bpf_headers.c
#include <linux/bpf_perf_event.h>
throws the below error:
/usr/include/linux/bpf_perf_event.h:14:28: error: field ‘regs’ has incomplete type
14 | bpf_user_pt_regs_t regs;
| ^~~~
This is because we typedef bpf_user_pt_regs_t to 'struct user_pt_regs'
in arch/powerpc/include/uaps/asm/bpf_perf_event.h, but 'struct
user_pt_regs' is not exposed to userspace.
Powerpc has both pt_regs and user_pt_regs structures. However, unlike
arm64 and s390, we expose user_pt_regs to userspace as just 'pt_regs'.
As such, we should typedef bpf_user_pt_regs_t to 'struct pt_regs' for
userspace.
Within the kernel though, we want to typedef bpf_user_pt_regs_t to
'struct user_pt_regs'.
Remove arch/powerpc/include/uapi/asm/bpf_perf_event.h so that the
uapi/asm-generic version of the header is exposed to userspace.
Introduce arch/powerpc/include/asm/bpf_perf_event.h so that we can
typedef bpf_user_pt_regs_t to 'struct user_pt_regs' for use within the
kernel.
Note that this was not showing up with the bpf selftest build since
tools/include/uapi/asm/bpf_perf_event.h didn't include the powerpc
variant.
fbdev: Disable sysfb device registration when removing conflicting FBs
The platform devices registered by sysfb match with firmware-based DRM or
fbdev drivers, that are used to have early graphics using a framebuffer
provided by the system firmware.
DRM or fbdev drivers later are probed and remove conflicting framebuffers,
leading to these platform devices for generic drivers to be unregistered.
But the current solution has a race, since the sysfb_init() function could
be called after a DRM or fbdev driver is probed and request to unregister
the devices for drivers with conflicting framebuffes.
To prevent this, disable any future sysfb platform device registration by
calling sysfb_disable(), if a driver requests to remove the conflicting
framebuffers.
As of commit 5801f064e351 ("net: ipv6: unexport __init-annotated seg6_hmac_init()"),
EXPORT_SYMBOL and __init is a bad combination because the .init.text
section is freed up after the initialization. Hence, modules cannot
use symbols annotated __init. The access to a freed symbol may end up
with kernel panic.
This remove the EXPORT_SYMBOL to fix modpost warning:
WARNING: modpost: vmlinux.o(___ksymtab+seg6_hmac_net_init+0x0): Section mismatch in reference from the variable __ksymtab_seg6_hmac_net_init to the function .init.text:seg6_hmac_net_init()
The symbol seg6_hmac_net_init is exported and annotated __init
Fix this by removing the __init annotation of seg6_hmac_net_init or drop the export.
Dave Airlie [Wed, 29 Jun 2022 04:16:46 +0000 (14:16 +1000)]
Merge tag 'drm-msm-fixes-2022-06-28' of https://gitlab.freedesktop.org/drm/msm into drm-fixes
Fixes for v5.19-rc5
- Fix to increment vsync_cnt before calling drm_crtc_handle_vblank so that
userspace sees the value *after* it is incremented if waiting for vblank
events
- Fix to reset drm_dev to NULL in dp_display_unbind to avoid a crash in
probe/bind error paths
- Fix to resolve the smatch error of de-referencing before NULL check in
dpu_encoder_phys_wb.c
- Fix error return to userspace if fence-id allocation fails in submit
ioctl
Jakub Kicinski [Wed, 29 Jun 2022 03:45:46 +0000 (20:45 -0700)]
Merge branch 'mptcp-fixes-for-5-19'
Mat Martineau says:
====================
mptcp: Fixes for 5.19
Several categories of fixes from the mptcp tree:
Patches 1-3 are fixes related to MP_FAIL and FASTCLOSE, to make sure
MIBs are accurate, and to handle MP_FAIL transmission and responses at
the correct times. sk_timer conflicts are also resolved.
Patches 4 and 6 handle two separate race conditions, one at socket
shutdown and one with unaccepted subflows.
Patch 5 makes sure read operations are not blocked during fallback to
TCP.
Patch 7 improves the diag selftest, which were incorrectly failing on
slow machines (like the VMs used for CI testing).
Patch 8 avoids possible symbol redefinition errors in the userspace
mptcp.h file.
Patch 9 fixes a selftest build issue with gcc 12.
====================
Mat Martineau [Tue, 28 Jun 2022 01:02:43 +0000 (18:02 -0700)]
selftests: mptcp: Initialize variables to quiet gcc 12 warnings
In a few MPTCP selftest tools, gcc 12 complains that the 'sock' variable
might be used uninitialized. This is a false positive because the only
code path that could lead to uninitialized access is where getaddrinfo()
fails, but the local xgetaddrinfo() wrapper exits if such a failure
occurs.
Initialize the 'sock' variable anyway to allow the tools to build with
gcc 12.
Ossama Othman [Tue, 28 Jun 2022 01:02:42 +0000 (18:02 -0700)]
mptcp: fix conflict with <netinet/in.h>
Including <linux/mptcp.h> before the C library <netinet/in.h> header
causes symbol redefinition errors at compile-time due to duplicate
declarations and definitions in the <linux/in.h> header included by
<linux/mptcp.h>.
Explicitly include <netinet/in.h> before <linux/in.h> in
<linux/mptcp.h> when __KERNEL__ is not defined so that the C library
compatibility logic in <linux/libc-compat.h> is enabled when including
<linux/mptcp.h> in user space code.
Paolo Abeni [Tue, 28 Jun 2022 01:02:41 +0000 (18:02 -0700)]
selftests: mptcp: more stable diag tests
The mentioned test-case still use an hard-coded-len sleep to
wait for a relative large number of connection to be established.
On very slow VM and with debug build such timeout could be exceeded,
causing failures in our CI.
Address the issue polling for the expected condition several times,
up to an unreasonable high amount of time. On reasonably fast system
the self-tests will be faster then before, on very slow one we will
still catch the correct condition.
Fixes: df62f2ec3df6 ("selftests/mptcp: add diag interface tests") Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
Paolo Abeni [Tue, 28 Jun 2022 01:02:40 +0000 (18:02 -0700)]
mptcp: fix race on unaccepted mptcp sockets
When the listener socket owning the relevant request is closed,
it frees the unaccepted subflows and that causes later deletion
of the paired MPTCP sockets.
The mptcp socket's worker can run in the time interval between such delete
operations. When that happens, any access to msk->first will cause an UaF
access, as the subflow cleanup did not cleared such field in the mptcp
socket.
Address the issue explicitly traversing the listener socket accept
queue at close time and performing the needed cleanup on the pending
msk.
Note that the locking is a bit tricky, as we need to acquire the msk
socket lock, while still owning the subflow socket one.
Fixes: 86e39e04482b ("mptcp: keep track of local endpoint still available for each msk") Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
Paolo Abeni [Tue, 28 Jun 2022 01:02:39 +0000 (18:02 -0700)]
mptcp: consistent map handling on failure
When the MPTCP receive path reach a non fatal fall-back condition, e.g.
when the MPC sockets must fall-back to TCP, the existing code is a little
self-inconsistent: it reports that new data is available - return true -
but sets the MPC flag to the opposite value.
As the consequence read operations in some exceptional scenario may block
unexpectedly.
Address the issue setting the correct MPC read status. Additionally avoid
some code duplication in the fatal fall-back scenario.
Fixes: 9c81be0dbc89 ("mptcp: add MP_FAIL response support") Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
Geliang Tang [Tue, 28 Jun 2022 01:02:37 +0000 (18:02 -0700)]
mptcp: invoke MP_FAIL response when needed
mptcp_mp_fail_no_response shouldn't be invoked on each worker run, it
should be invoked only when MP_FAIL response timeout occurs.
This patch refactors the MP_FAIL response logic.
It leverages the fact that only the MPC/first subflow can gracefully
fail to avoid unneeded subflows traversal: the failing subflow can
be only msk->first.
A new 'fail_tout' field is added to the subflow context to record the
MP_FAIL response timeout and use such field to reliably share the
timeout timer between the MP_FAIL event and the MPTCP socket close
timeout.
Finally, a new ack is generated to send out MP_FAIL notification as soon
as we hit the relevant condition, instead of waiting a possibly unbound
time for the next data packet.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/281 Fixes: d9fb797046c5 ("mptcp: Do not traverse the subflow connection list without lock") Co-developed-by: Paolo Abeni <[email protected]> Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Geliang Tang <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
Paolo Abeni [Tue, 28 Jun 2022 01:02:36 +0000 (18:02 -0700)]
mptcp: introduce MAPPING_BAD_CSUM
This allow moving a couple of conditional out of the fast path,
making the code more easy to follow and will simplify the next
patch.
Fixes: ae66fb2ba6c3 ("mptcp: Do TCP fallback on early DSS checksum failure") Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
Paolo Abeni [Tue, 28 Jun 2022 01:02:35 +0000 (18:02 -0700)]
mptcp: fix error mibs accounting
The current accounting for MP_FAIL and FASTCLOSE is not very
accurate: both can be increased even when the related option is
not really sent. Move the accounting into the correct place.
Fixes: eb7f33654dc1 ("mptcp: add the mibs for MP_FAIL") Fixes: 1e75629cb964 ("mptcp: add the mibs for MP_FASTCLOSE") Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
Mark Pearson [Mon, 27 Jun 2022 18:14:49 +0000 (14:14 -0400)]
platform/x86: thinkpad_acpi: do not use PSC mode on Intel platforms
PSC platform profile mode is only supported on Linux for AMD platforms.
Some older Intel platforms (e.g T490) are advertising it's capability
as Windows uses it - but on Linux we should only be using MMC profile
for Intel systems.
Add a check to prevent it being enabled incorrectly.
Hans de Goede [Fri, 24 Jun 2022 11:23:39 +0000 (13:23 +0200)]
platform/x86: panasonic-laptop: filter out duplicate volume up/down/mute keypresses
On some Panasonic models the volume up/down/mute keypresses get
reported both through the Panasonic ACPI HKEY interface as well as
through the atkbd device.
Filter out the atkbd scan-codes for these to avoid reporting presses
twice.
Note normally we would leave the filtering of these to userspace by mapping
the scan-codes to KEY_UNKNOWN through /lib/udev/hwdb.d/60-keyboard.hwdb.
However in this case that would cause regressions since we were filtering
the Panasonic ACPI HKEY events before, so filter these in the kernel.
The brightness key-presses might also get reported by the ACPI video bus,
check for this and in this case don't report the presses to avoid reporting
2 presses for a single key-press.
In hindsight blindly throwing away most of the key-press events is not
a good idea. So revert commit ed83c9171829 ("platform/x86:
panasonic-laptop: Resolve hotkey double trigger bug").
In the definition of panasonic_keymap[] the key codes are given in
decimal, later checks are done with hexadecimal values, which does
not help in understanding the code.
Additionally use two helper variables to shorten the code and make
the logic more obvious.
Hans de Goede [Fri, 24 Jun 2022 11:23:34 +0000 (13:23 +0200)]
ACPI: video: Change how we determine if brightness key-presses are handled
Some systems have an ACPI video bus but not ACPI video devices with
backlight capability. On these devices brightness key-presses are
(logically) not reported through the ACPI video bus.
Change how acpi_video_handles_brightness_key_presses() determines if
brightness key-presses are handled by the ACPI video driver to avoid
vendor specific drivers/platform/x86 drivers filtering out their
brightness key-presses even though they are the only ones reporting
these presses.
tools headers UAPI: Sync linux/kvm.h with the kernel sources
To pick the changes in:
bfbab44568779e16 ("KVM: arm64: Implement PSCI SYSTEM_SUSPEND") 7b33a09d036ffd9a ("KVM: arm64: Add support for userspace to suspend a vCPU") ffbb61d09fc56c85 ("KVM: x86: Accept KVM_[GS]ET_TSC_KHZ as a VM ioctl.") 661a20fab7d156cf ("KVM: x86/xen: Advertise and document KVM_XEN_HVM_CONFIG_EVTCHN_SEND") fde0451be8fb3208 ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC") 28d1629f751c4a5f ("KVM: x86/xen: Kernel acceleration for XENVER_version") 536395260582be74 ("KVM: x86/xen: handle PV timers oneshot mode") 942c2490c23f2800 ("KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID") 2fd6df2f2b47d430 ("KVM: x86/xen: intercept EVTCHNOP_send from guests") 35025735a79eaa89 ("KVM: x86/xen: Support direct injection of event channel events")
That automatically adds support for this new ioctl:
$ tools/perf/trace/beauty/kvm_ioctl.sh > before
$ cp include/uapi/linux/kvm.h tools/include/uapi/linux/kvm.h
$ tools/perf/trace/beauty/kvm_ioctl.sh > after
$ diff -u before after
--- before 2022-06-28 12:13:07.281150509 -0300
+++ after 2022-06-28 12:13:16.423392896 -0300
@@ -98,6 +98,7 @@
[0xcc] = "GET_SREGS2",
[0xcd] = "SET_SREGS2",
[0xce] = "GET_STATS_FD",
+ [0xd0] = "XEN_HVM_EVTCHN_SEND",
[0xe0] = "CREATE_DEVICE",
[0xe1] = "SET_DEVICE_ATTR",
[0xe2] = "GET_DEVICE_ATTR",
$
This silences these perf build warning:
Warning: Kernel ABI header at 'tools/include/uapi/linux/kvm.h' differs from latest version at 'include/uapi/linux/kvm.h'
diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h
Merge tag 'cpufreq-arm-fixes-5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm
Pull cpufreq ARM fixes for 5.19-rc5 from Viresh Kumar:
- Fix missing of_node_put for qoriq and pmac32 driver (Liang He).
- Fix issues around throttle interrupt for qcom driver (Stephen Boyd).
- Add MT8186 to cpufreq-dt-platdev blocklist (AngeloGioacchino Del Regno).
* tag 'cpufreq-arm-fixes-5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
cpufreq: Add MT8186 to cpufreq-dt-platdev blocklist
cpufreq: pmac32-cpufreq: Fix refcount leak bug
cpufreq: qcom-hw: Don't do lmh things without a throttle interrupt
drivers: cpufreq: Add missing of_node_put() in qoriq-cpufreq.c
Ian Rogers [Tue, 14 Jun 2022 01:47:14 +0000 (18:47 -0700)]
perf bpf: 8 byte align bpil data
bpil data is accessed assuming 64-bit alignment resulting in undefined
behavior as the data is just byte aligned. With an -fsanitize=undefined
build the following errors are observed:
$ sudo perf record -a sleep 1
util/bpf-event.c:310:22: runtime error: load of misaligned address 0x55f61084520f for type '__u64', which requires 8 byte alignment
0x55f61084520f: note: pointer points here
a8 fe ff ff 3c 51 d3 c0 ff ff ff ff 04 84 d3 c0 ff ff ff ff d8 aa d3 c0 ff ff ff ff a4 c0 d3 c0
^
util/bpf-event.c:311:20: runtime error: load of misaligned address 0x55f61084522f for type '__u32', which requires 4 byte alignment
0x55f61084522f: note: pointer points here
ff ff ff ff c7 17 00 00 f1 02 00 00 1f 04 00 00 58 04 00 00 00 00 00 00 0f 00 00 00 63 02 00 00
^
util/bpf-event.c:198:33: runtime error: member access within misaligned address 0x55f61084523f for type 'const struct bpf_func_info', which requires 4 byte alignment
0x55f61084523f: note: pointer points here
58 04 00 00 00 00 00 00 0f 00 00 00 63 02 00 00 3b 00 00 00 ab 02 00 00 44 00 00 00 14 03 00 00
Correct this by rouding up the data sizes and aligning the pointers.
tools kvm headers arm64: Update KVM headers from the kernel sources
To pick the changes from:
2cde51f1e10f2600 ("KVM: arm64: Hide KVM_REG_ARM_*_BMAP_BIT_COUNT from userspace") b22216e1a617ca55 ("KVM: arm64: Add vendor hypervisor firmware register") 428fd6788d4d0e0d ("KVM: arm64: Add standard hypervisor firmware register") 05714cab7d63b189 ("KVM: arm64: Setup a framework for hypercall bitmap firmware registers") 18f3976fdb5da2ba ("KVM: arm64: uapi: Add kvm_debug_exit_arch.hsr_high") a5905d6af492ee6a ("KVM: arm64: Allow SMCCC_ARCH_WORKAROUND_3 to be discovered and migrated")
That don't causes any changes in tooling (when built on x86), only
addresses this perf build warning:
Warning: Kernel ABI header at 'tools/arch/arm64/include/uapi/asm/kvm.h' differs from latest version at 'arch/arm64/include/uapi/asm/kvm.h'
diff -u tools/arch/arm64/include/uapi/asm/kvm.h arch/arm64/include/uapi/asm/kvm.h
Namhyung Kim [Fri, 24 Jun 2022 23:13:09 +0000 (16:13 -0700)]
perf offcpu: Accept allowed sample types only
As offcpu-time event is synthesized at the end, it could not get the
all the sample info. Define OFFCPU_SAMPLE_TYPES for allowed ones and
mask out others in evsel__config() to prevent parse errors.
Because perf sample parsing assumes a specific ordering with the
sample types, setting unsupported one would make it fail to read
data like perf record -d/--data.
Namhyung Kim [Fri, 24 Jun 2022 23:13:08 +0000 (16:13 -0700)]
perf offcpu: Fix build failure on old kernels
Old kernels have a 'struct task_struct' which contains a "state" field
and newer kernels have "__state" instead.
While the get_task_state() in the BPF code handles that in some way, it
assumed the current kernel has the new definition and it caused a build
error on old kernels.
We should not assume anything and access them carefully. Do not use
'task struct' directly access it instead using new and old definitions
in a row.
Fabio Estevam [Wed, 22 Jun 2022 11:48:10 +0000 (08:48 -0300)]
ARM: at91: pm: Mark at91_pm_secure_init as __init
at91_pm_secure_init() is used inside sama5d2_pm_init(), which has
the __init notation.
Pass the __init notation to at91_pm_secure_init() as well to fix the
following section mismatch warning:
WARNING: modpost: vmlinux.o(.text.unlikely+0x2138): Section mismatch in reference from the function at91_pm_secure_init() to the (unknown reference) .init.rodata:(unknown)
Eugen Hristev [Tue, 7 Jun 2022 09:04:54 +0000 (12:04 +0300)]
ARM: dts: at91: sam9x60ek: fix eeprom compatible and size
The board has a microchip 24aa025e48 eeprom, which is a 2 Kbits memory,
so it's compatible with at24c02 not at24c32.
Also the size property is wrong, it's not 128 bytes, but 256 bytes.
Thus removing and leaving it to the default (256).
Amir Goldstein [Mon, 27 Jun 2022 17:47:19 +0000 (20:47 +0300)]
fanotify: refine the validation checks on non-dir inode mask
Commit ceaf69f8eadc ("fanotify: do not allow setting dirent events in
mask of non-dir") added restrictions about setting dirent events in the
mask of a non-dir inode mark, which does not make any sense.
For backward compatibility, these restictions were added only to new
(v5.17+) APIs.
It also does not make any sense to set the flags FAN_EVENT_ON_CHILD or
FAN_ONDIR in the mask of a non-dir inode. Add these flags to the
dir-only restriction of the new APIs as well.
Move the check of the dir-only flags for new APIs into the helper
fanotify_events_supported(), which is only called for FAN_MARK_ADD,
because there is no need to error on an attempt to remove the dir-only
flags from non-dir inode.
Stafford Horne [Tue, 14 Jun 2022 23:54:26 +0000 (08:54 +0900)]
irqchip: or1k-pic: Undefine mask_ack for level triggered hardware
The mask_ack operation clears the interrupt by writing to the PICSR
register. This we don't want for level triggered interrupt because
it does not actually clear the interrupt on the source hardware.
This was causing issues in qemu with multi core setups where
interrupts would continue to fire even though they had been cleared in
PICSR.
Liang He [Sat, 18 Jun 2022 02:25:45 +0000 (10:25 +0800)]
cpufreq: pmac32-cpufreq: Fix refcount leak bug
In pmac_cpufreq_init_MacRISC3(), we need to add corresponding
of_node_put() for the three node pointers whose refcount have
been incremented by of_find_node_by_name().
Stephen Boyd [Thu, 16 Jun 2022 22:45:31 +0000 (15:45 -0700)]
cpufreq: qcom-hw: Don't do lmh things without a throttle interrupt
Offlining cpu6 and cpu7 and then onlining cpu6 hangs on
sc7180-trogdor-lazor because the throttle interrupt doesn't exist.
Similarly, things go sideways when suspend/resume runs. That's because
the qcom_cpufreq_hw_cpu_online() and qcom_cpufreq_hw_lmh_exit()
functions are calling genirq APIs with an interrupt value of '-6', i.e.
-ENXIO, and that isn't good.
Check the value of the throttle interrupt like we already do in other
functions in this file and bail out early from lmh code to fix the hang.
Liang He [Wed, 15 Jun 2022 09:48:07 +0000 (17:48 +0800)]
drivers: cpufreq: Add missing of_node_put() in qoriq-cpufreq.c
In qoriq_cpufreq_probe(), of_find_matching_node() will return a
node pointer with refcount incremented. We should use of_node_put()
when it is not used anymore.
Fixes: 157f527639da ("cpufreq: qoriq: convert to a platform driver")
[ Viresh: Fixed Author's name in commit log ] Signed-off-by: Liang He <[email protected]> Signed-off-by: Viresh Kumar <[email protected]>
Nicolas Dichtel [Thu, 23 Jun 2022 12:00:15 +0000 (14:00 +0200)]
ipv6: take care of disable_policy when restoring routes
When routes corresponding to addresses are restored by
fixup_permanent_addr(), the dst_nopolicy parameter was not set.
The typical use case is a user that configures an address on a down
interface and then put this interface up.
Let's take care of this flag in addrconf_f6i_alloc(), so that every callers
benefit ont it.
Oleksij Rempel [Fri, 24 Jun 2022 07:51:39 +0000 (09:51 +0200)]
net: usb: asix: do not force pause frames support
We should respect link partner capabilities and not force flow control
support on every link. Even more, in current state the MAC driver do not
advertises pause support so we should not keep flow control enabled at
all.
Oleksij Rempel [Fri, 24 Jun 2022 07:51:38 +0000 (09:51 +0200)]
net: asix: fix "can't send until first packet is send" issue
If cable is attached after probe sequence, the usbnet framework would
not automatically start processing RX packets except at least one
packet was transmitted.
On systems with any kind of address auto configuration this issue was
not detected, because some packets are send immediately after link state
is changed to "running".
With this patch we will notify usbnet about link status change provided by the
PHYlib.
====================
Notify user space if any actions were flushed before error
This patch series fixes the behaviour of actions flush so that the
kernel always notifies user space whenever it deletes actions during a
flush operation, even if it didn't flush all the actions. This series
also introduces tdc tests to verify this new behaviour.
====================
Victor Nogueira [Thu, 23 Jun 2022 14:07:42 +0000 (11:07 -0300)]
selftests: tc-testing: Add testcases to test new flush behaviour
Add tdc test cases to verify new flush behaviour is correct, which do
the following:
- Try to flush only one action which is being referenced by a filter
- Try to flush three actions where the last one (index 3) is being
referenced by a filter
Victor Nogueira [Thu, 23 Jun 2022 14:07:41 +0000 (11:07 -0300)]
net/sched: act_api: Notify user space if any actions were flushed before error
If during an action flush operation one of the actions is still being
referenced, the flush operation is aborted and the kernel returns to
user space with an error. However, if the kernel was able to flush, for
example, 3 actions and failed on the fourth, the kernel will not notify
user space that it deleted 3 actions before failing.
This patch fixes that behaviour by notifying user space of how many
actions were deleted before flush failed and by setting extack with a
message describing what happened.
Fixes: 55334a5db5cd ("net_sched: act: refuse to remove bound action outside") Signed-off-by: Victor Nogueira <[email protected]> Acked-by: Jamal Hadi Salim <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
Tong Zhang [Mon, 27 Jun 2022 04:33:48 +0000 (21:33 -0700)]
epic100: fix use after free on rmmod
epic_close() calls epic_rx() and uses dma buffer, but in epic_remove_one()
we already freed the dma buffer. To fix this issue, reorder function calls
like in the .probe function.
Jakub Kicinski [Thu, 23 Jun 2022 04:21:05 +0000 (21:21 -0700)]
net: tun: stop NAPI when detaching queues
While looking at a syzbot report I noticed the NAPI only gets
disabled before it's deleted. I think that user can detach
the queue before destroying the device and the NAPI will never
be stopped.
dm raid: fix accesses beyond end of raid member array
On dm-raid table load (using raid_ctr), dm-raid allocates an array
rs->devs[rs->raid_disks] for the raid device members. rs->raid_disks
is defined by the number of raid metadata and image tupples passed
into the target's constructor.
In the case of RAID layout changes being requested, that number can be
different from the current number of members for existing raid sets as
defined in their superblocks. Example RAID layout changes include:
- raid1 legs being added/removed
- raid4/5/6/10 number of stripes changed (stripe reshaping)
- takeover to higher raid level (e.g. raid5 -> raid6)
When accessing array members, rs->raid_disks must be used in control
loops instead of the potentially larger value in rs->md.raid_disks.
Otherwise it will cause memory access beyond the end of the rs->devs
array.
Fix this by changing code that is prone to out-of-bounds access.
Also fix validate_raid_redundancy() to validate all devices that are
added. Also, use braces to help clean up raid_iterate_devices().
The out-of-bounds memory accesses was discovered using KASAN.
This commit was verified to pass all LVM2 RAID tests (with KASAN
enabled).
"make dtbs_check" complains about the missing "-supply" suffix for
vdd_lvs1_2 which is clearly a typo, originally introduced in the
msm8994-smd-rpm.dtsi file and apparently later copied to
msm8992-xiaomi-libra.dts:
msm8992-lg-bullhead-rev-10/101.dtb: pm8994-regulators: 'vdd_lvs1_2'
does not match any of the regexes:
'.*-supply$', '^((s|l|lvs|5vs)[0-9]*)|(boost-bypass)|(bob)$', 'pinctrl-[0-9]+'
From schema: regulator/qcom,smd-rpm-regulator.yaml
msm8992-xiaomi-libra.dtb: pm8994-regulators: 'vdd_lvs1_2'
does not match any of the regexes:
'.*-supply$', '^((s|l|lvs|5vs)[0-9]*)|(boost-bypass)|(bob)$', 'pinctrl-[0-9]+'
From schema: regulator/qcom,smd-rpm-regulator.yaml
Helge Deller [Sun, 26 Jun 2022 23:39:11 +0000 (01:39 +0200)]
parisc/unaligned: Fix emulate_ldw() breakage
The commit e8aa7b17fe41 broke the 32-bit load-word unalignment exception
handler because it calculated the wrong amount of bits by which the value
should be shifted. This patch fixes it.
Linus Torvalds [Mon, 27 Jun 2022 17:47:34 +0000 (10:47 -0700)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
"Fixes all over the place, most notably we are disabling
IRQ hardening (again!)"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio_ring: make vring_create_virtqueue_split prettier
vhost-vdpa: call vhost_vdpa_cleanup during the release
virtio_mmio: Restore guest page size on resume
virtio_mmio: Add missing PM calls to freeze/restore
caif_virtio: fix race between virtio_device_ready() and ndo_open()
virtio-net: fix race between ndo_open() and virtio_device_ready()
virtio: disable notification hardening by default
virtio: Remove unnecessary variable assignments
virtio_ring : keep used_wrap_counter in vq->last_used_idx
vduse: Tie vduse mgmtdev and its device
vdpa/mlx5: Initialize CVQ vringh only once
vdpa/mlx5: Update Control VQ callback information
EXPORT_SYMBOL and __init is a bad combination because the .init.text
section is freed up after the initialization. Hence, modules cannot
use symbols annotated __init. The access to a freed symbol may end up
with kernel panic.
modpost used to detect it, but it had been broken for a decade.
Commit 28438794aba4 ("modpost: fix section mismatch check for exported
init/exit sections") fixed it so modpost started to warn it again, then
this showed up:
MODPOST vmlinux.symvers
WARNING: modpost: vmlinux.o(___ksymtab_gpl+tick_nohz_full_setup+0x0): Section mismatch in reference from the variable __ksymtab_tick_nohz_full_setup to the function .init.text:tick_nohz_full_setup()
The symbol tick_nohz_full_setup is exported and annotated __init
Fix this by removing the __init annotation of tick_nohz_full_setup or drop the export.
Drop the export because tick_nohz_full_setup() is only called from the
built-in code in kernel/sched/isolation.c.
Fixes: ae9e557b5be2 ("time: Export tick start/stop functions for rcutorture") Reported-by: Linus Torvalds <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]> Tested-by: Paul E. McKenney <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Florian Westphal [Tue, 21 Jun 2022 16:26:03 +0000 (18:26 +0200)]
netfilter: br_netfilter: do not skip all hooks with 0 priority
When br_netfilter module is loaded, skbs may be diverted to the
ipv4/ipv6 hooks, just like as if we were routing.
Unfortunately, bridge filter hooks with priority 0 may be skipped
in this case.
Example:
1. an nftables bridge ruleset is loaded, with a prerouting
hook that has priority 0.
2. interface is added to the bridge.
3. no tcp packet is ever seen by the bridge prerouting hook.
4. flush the ruleset
5. load the bridge ruleset again.
6. tcp packets are processed as expected.
After 1) the only registered hook is the bridge prerouting hook, but its
not called yet because the bridge hasn't been brought up yet.
After 2), hook order is:
0 br_nf_pre_routing // br_netfilter internal hook
0 chain bridge f prerouting // nftables bridge ruleset
The packet is diverted to br_nf_pre_routing.
If call-iptables is off, the nftables bridge ruleset is called as expected.
But if its enabled, br_nf_hook_thresh() will skip it because it assumes
that all 0-priority hooks had been called previously in bridge context.
To avoid this, check for the br_nf_pre_routing hook itself, we need to
resume directly after it, even if this hook has a priority of 0.
Unfortunately, this still results in different packet flow.
With this fix, the eval order after in 3) is:
1. br_nf_pre_routing
2. ip(6)tables (if enabled)
3. nftables bridge
but after 5 its the much saner:
1. nftables bridge
2. br_nf_pre_routing
3. ip(6)tables (if enabled)
Unfortunately I don't see a solution here:
It would be possible to move br_nf_pre_routing to a higher priority
so that it will be called later in the pipeline, but this also impacts
ebtables evaluation order, and would still result in this very ordering
problem for all nftables-bridge hooks with the same priority as the
br_nf_pre_routing one.
Searching back through the git history I don't think this has
ever behaved in any other way, hence, no fixes-tag.
Florian Westphal [Wed, 22 Jun 2022 14:43:57 +0000 (16:43 +0200)]
netfilter: nf_tables: avoid skb access on nf_stolen
When verdict is NF_STOLEN, the skb might have been freed.
When tracing is enabled, this can result in a use-after-free:
1. access to skb->nf_trace
2. access to skb->mark
3. computation of trace id
4. dump of packet payload
To avoid 1, keep a cached copy of skb->nf_trace in the
trace state struct.
Refresh this copy whenever verdict is != STOLEN.
Avoid 2 by skipping skb->mark access if verdict is STOLEN.
3 is avoided by precomputing the trace id.
Only dump the packet when verdict is not "STOLEN".