Git Repo - linux.git/log

net: microchip: sparx5: fix deletion of existing DSCP mappings

Fix deletion of existing DSCP mappings in the APP table.

Adding and deleting DSCP entries are replicated per-port, since the
mapping table is global for all ports in the chip. Whenever a mapping
for a DSCP value already exists, the old mapping is deleted first.
However, it is only deleted for the specified port. Fix this by calling
sparx5_dcb_ieee_delapp() instead of dcb_ieee_delapp() as it ought to be.

Reproduce:

// Map and remap DSCP value 63
$ dcb app add dev eth0 dscp-prio 63:1
$ dcb app add dev eth0 dscp-prio 63:2

$ dcb app show dev eth0 dscp-prio
dscp-prio 63:2

$ dcb app show dev eth1 dscp-prio
dscp-prio 63:1 63:2 <-- 63:1 should not be there

Fixes: 8dcf69a64118 ("net: microchip: sparx5: add support for offloading dscp table")
Signed-off-by: Daniel Machon <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

octeontx2-af: Unlock contexts in the queue context cache in case of fault detection

NDC caches contexts of frequently used queue's (Rx and Tx queues)
contexts. Due to a HW errata when NDC detects fault/poision while
accessing contexts it could go into an illegal state where a cache
line could get locked forever. To makesure all cache lines in NDC
are available for optimum performance upon fault/lockerror/posion
errors scan through all cache lines in NDC and clear the lock bit.

Fixes: 4a3581cd5995 ("octeontx2-af: NPA AQ instruction enqueue support")
Signed-off-by: Suman Ghosh <[email protected]>
Signed-off-by: Sunil Kovvuri Goutham <[email protected]>
Signed-off-by: Sai Krishna <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/smc: fix fallback failed while sendmsg with fastopen

Before determining whether the msg has unsupported options, it has been
prematurely terminated by the wrong status check.

For the application, the general usages of MSG_FASTOPEN likes

fd = socket(...)
/* rather than connect */
sendto(fd, data, len, MSG_FASTOPEN)

Hence, We need to check the flag before state check, because the sock
state here is always SMC_INIT when applications tries MSG_FASTOPEN.
Once we found unsupported options, fallback it to TCP.

Fixes: ee9dfbef02d1 ("net/smc: handle sockopts forcing fallback")
Signed-off-by: D. Wythe <[email protected]>
Signed-off-by: Simon Horman <[email protected]>
v2 -> v1: Optimize code style
Reviewed-by: Tony Lu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

selftests/clone3: test clone3 with CLONE_NEWTIME

Verify that clone3 can be called successfully with CLONE_NEWTIME in
flags.

Cc: Andrey Vagin <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Tobias Klauser <[email protected]>
Signed-off-by: Christian Brauner (Microsoft) <[email protected]>

fork: allow CLONE_NEWTIME in clone3 flags

Currently, calling clone3() with CLONE_NEWTIME in clone_args->flags
fails with -EINVAL. This is because CLONE_NEWTIME intersects with
CSIGNAL. However, CSIGNAL was deprecated when clone3 was introduced in
commit 7f192e3cd316 ("fork: add clone3"), allowing re-use of that part
of clone flags.

Fix this by explicitly allowing CLONE_NEWTIME in clone3_args_valid. This
is also in line with the respective check in check_unshare_flags which
allow CLONE_NEWTIME for unshare().

Fixes: 769071ac9f20 ("ns: Introduce Time Namespace")
Cc: Andrey Vagin <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: [email protected]
Signed-off-by: Tobias Klauser <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Christian Brauner (Microsoft) <[email protected]>

watch_queue: fix IOC_WATCH_QUEUE_SET_SIZE alloc error paths

The watch_queue_set_size() allocation error paths return the ret value
set via the prior pipe_resize_ring() call, which will always be zero.

As a result, IOC_WATCH_QUEUE_SET_SIZE callers such as "keyctl watch"
fail to detect kernel wqueue->notes allocation failures and proceed to
KEYCTL_WATCH_KEY, with any notifications subsequently lost.

Fixes: c73be61cede58 ("pipe: Add general notification queue support")
Signed-off-by: David Disseldorp <[email protected]>
Signed-off-by: Christian Brauner (Microsoft) <[email protected]>

ext4: Fix deadlock during directory rename

As lockdep properly warns, we should not be locking i_rwsem while having
transactions started as the proper lock ordering used by all directory
handling operations is i_rwsem -> transaction start. Fix the lock
ordering by moving the locking of the directory earlier in
ext4_rename().

Reported-by: [email protected]
Fixes: 0813299c586b ("ext4: Fix possible corruption when moving a directory")
Link: https://syzkaller.appspot.com/bug?extid=9d16c39efb5fade84574
Signed-off-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>

ext4: Fix comment about the 64BIT feature

64BIT is part of the incompatible feature set, update the comment
accordingly.

Signed-off-by: Tudor Ambarus <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>

docs: ext4: modify the group desc size to 64

Since the default ext4 group desc size is 64 now (assuming that the
64-bit feature is enbled). And the size mentioned in this doc is 64 too.
Change it to 64.

Signed-off-by: Wu Bo <[email protected]>
Acked-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>

ext4: fix another off-by-one fsmap error on 1k block filesystems

Apparently syzbot figured out that issuing this FSMAP call:

struct fsmap_head cmd = {
.fmh_count = ...;
.fmh_keys = {
{ .fmr_device = /* ext4 dev */, .fmr_physical = 0, },
{ .fmr_device = /* ext4 dev */, .fmr_physical = 0, },
},
...
};
ret = ioctl(fd, FS_IOC_GETFSMAP, &cmd);

Produces this crash if the underlying filesystem is a 1k-block ext4
filesystem:

kernel BUG at fs/ext4/ext4.h:3331!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 3 PID: 3227965 Comm: xfs_io Tainted: G        W  O       6.2.0-rc8-achx
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
RIP: 0010:ext4_mb_load_buddy_gfp+0x47c/0x570 [ext4]
RSP: 0018:ffffc90007c03998 EFLAGS: 00010246
RAX: ffff888004978000 RBX: ffffc90007c03a20 RCX: ffff888041618000
RDX: 0000000000000000 RSI: 00000000000005a4 RDI: ffffffffa0c99b11
RBP: ffff888012330000 R08: ffffffffa0c2b7d0 R09: 0000000000000400
R10: ffffc90007c03950 R11: 0000000000000000 R12: 0000000000000001
R13: 00000000ffffffff R14: 0000000000000c40 R15: ffff88802678c398
FS:  00007fdf2020c880(0000) GS:ffff88807e100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd318a5fe8 CR3: 000000007f80f001 CR4: 00000000001706e0
Call Trace:
<TASK>
ext4_mballoc_query_range+0x4b/0x210 [ext4 dfa189daddffe8fecd3cdfd00564e0f265a8ab80]
ext4_getfsmap_datadev+0x713/0x890 [ext4 dfa189daddffe8fecd3cdfd00564e0f265a8ab80]
ext4_getfsmap+0x2b7/0x330 [ext4 dfa189daddffe8fecd3cdfd00564e0f265a8ab80]
ext4_ioc_getfsmap+0x153/0x2b0 [ext4 dfa189daddffe8fecd3cdfd00564e0f265a8ab80]
__ext4_ioctl+0x2a7/0x17e0 [ext4 dfa189daddffe8fecd3cdfd00564e0f265a8ab80]
__x64_sys_ioctl+0x82/0xa0
do_syscall_64+0x2b/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fdf20558aff
RSP: 002b:00007ffd318a9e30 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00000000000200c0 RCX: 00007fdf20558aff
RDX: 00007fdf1feb2010 RSI: 00000000c0c0583b RDI: 0000000000000003
RBP: 00005625c0634be0 R08: 00005625c0634c40 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fdf1feb2010
R13: 00005625be70d994 R14: 0000000000000800 R15: 0000000000000000

For GETFSMAP calls, the caller selects a physical block device by
writing its block number into fsmap_head.fmh_keys[01].fmr_device.
To query mappings for a subrange of the device, the starting byte of the
range is written to fsmap_head.fmh_keys[0].fmr_physical and the last
byte of the range goes in fsmap_head.fmh_keys[1].fmr_physical.

IOWs, to query what mappings overlap with bytes 3-14 of /dev/sda, you'd
set the inputs as follows:

fmh_keys[0] = { .fmr_device = major(8, 0), .fmr_physical = 3},
fmh_keys[1] = { .fmr_device = major(8, 0), .fmr_physical = 14},

Which would return you whatever is mapped in the 12 bytes starting at
physical offset 3.

The crash is due to insufficient range validation of keys[1] in
ext4_getfsmap_datadev.  On 1k-block filesystems, block 0 is not part of
the filesystem, which means that s_first_data_block is nonzero.
ext4_get_group_no_and_offset subtracts this quantity from the blocknr
argument before cracking it into a group number and a block number
within a group.  IOWs, block group 0 spans blocks 1-8192 (1-based)
instead of 0-8191 (0-based) like what happens with larger blocksizes.

The net result of this encoding is that blocknr < s_first_data_block is
not a valid input to this function.  The end_fsb variable is set from
the keys that are copied from userspace, which means that in the above
example, its value is zero.  That leads to an underflow here:

blocknr = blocknr - le32_to_cpu(es->s_first_data_block);

The division then operates on -1:

offset = do_div(blocknr, EXT4_BLOCKS_PER_GROUP(sb)) >>
EXT4_SB(sb)->s_cluster_bits;

Leaving an impossibly large group number (2^32-1) in blocknr.
ext4_getfsmap_check_keys checked that keys[0].fmr_physical and
keys[1].fmr_physical are in increasing order, but
ext4_getfsmap_datadev adjusts keys[0].fmr_physical to be at least
s_first_data_block.  This implies that we have to check it again after
the adjustment, which is the piece that I forgot.

Reported-by: [email protected]
Fixes: 4a4956249dac ("ext4: fix off-by-one fsmap error on 1k block filesystems")
Link: https://syzkaller.appspot.com/bug?id=79d5768e9bfe362911ac1a5057a36fc6b5c30002
Cc: [email protected]
Signed-off-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/Y+58NPTH7VNGgzdd@magnolia
Signed-off-by: Theodore Ts'o <[email protected]>

ext4: fix RENAME_WHITEOUT handling for inline directories

A significant number of xfstests can cause ext4 to log one or more
warning messages when they are run on a test file system where the
inline_data feature has been enabled.  An example:

"EXT4-fs warning (device vdc): ext4_dirblock_csum_set:425: inode
#16385: comm fsstress: No space for directory leaf checksum. Please
run e2fsck -D."

The xfstests include: ext4/057, 058, and 307; generic/013, 051, 068,
070, 076, 078, 083, 232, 269, 270, 390, 461, 475, 476, 482, 579, 585,
589, 626, 631, and 650.

In this situation, the warning message indicates a bug in the code that
performs the RENAME_WHITEOUT operation on a directory entry that has
been stored inline.  It doesn't detect that the directory is stored
inline, and incorrectly attempts to compute a dirent block checksum on
the whiteout inode when creating it.  This attempt fails as a result
of the integrity checking in get_dirent_tail (usually due to a failure
to match the EXT4_FT_DIR_CSUM magic cookie), and the warning message
is then emitted.

Fix this by simply collecting the inlined data state at the time the
search for the source directory entry is performed.  Existing code
handles the rest, and this is sufficient to eliminate all spurious
warning messages produced by the tests above.  Go one step further
and do the same in the code that resets the source directory entry in
the event of failure.  The inlined state should be present in the
"old" struct, but given the possibility of a race there's no harm
in taking a conservative approach and getting that information again
since the directory entry is being reread anyway.

Fixes: b7ff91fd030d ("ext4: find old entry again if failed to rename whiteout")
Cc: [email protected]
Signed-off-by: Eric Whitney <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>

ext4: make kobj_type structures constant

Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.

Take advantage of this to constify the structure definitions to prevent
modification at runtime.

Signed-off-by: Thomas Weißschuh <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>

ext4: fix cgroup writeback accounting with fs-layer encryption

When writing a page from an encrypted file that is using
filesystem-layer encryption (not inline encryption), ext4 encrypts the
pagecache page into a bounce page, then writes the bounce page.

It also passes the bounce page to wbc_account_cgroup_owner(). That's
incorrect, because the bounce page is a newly allocated temporary page
that doesn't have the memory cgroup of the original pagecache page.
This makes wbc_account_cgroup_owner() not account the I/O to the owner
of the pagecache page as it should.

Fix this by always passing the pagecache page to
wbc_account_cgroup_owner().

Fixes: 001e4a8775f6 ("ext4: implement cgroup writeback support")
Cc: [email protected]
Reported-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Eric Biggers <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>

btrfs: fix block group item corruption after inserting new block group

We can often end up inserting a block group item, for a new block group,
with a wrong value for the used bytes field.

This happens if for the new allocated block group, in the same transaction
that created the block group, we have tasks allocating extents from it as
well as tasks removing extents from it.

For example:

1) Task A creates a metadata block group X;

2) Two extents are allocated from block group X, so its "used" field is
   updated to 32K, and its "commit_used" field remains as 0;

3) Transaction commit starts, by some task B, and it enters
   btrfs_start_dirty_block_groups(). There it tries to update the block
   group item for block group X, which currently has its "used" field with
   a value of 32K. But that fails since the block group item was not yet
   inserted, and so on failure update_block_group_item() sets the
   "commit_used" field of the block group back to 0;

4) The block group item is inserted by task A, when for example
   btrfs_create_pending_block_groups() is called when releasing its
   transaction handle. This results in insert_block_group_item() inserting
   the block group item in the extent tree (or block group tree), with a
   "used" field having a value of 32K, but without updating the
   "commit_used" field in the block group, which remains with value of 0;

5) The two extents are freed from block X, so its "used" field changes
   from 32K to 0;

6) The transaction commit by task B continues, it enters
   btrfs_write_dirty_block_groups() which calls update_block_group_item()
   for block group X, and there it decides to skip the block group item
   update, because "used" has a value of 0 and "commit_used" has a value
   of 0 too.

   As a result, we end up with a block item having a 32K "used" field but
   no extents allocated from it.

When this issue happens, a btrfs check reports an error like this:

   [1/7] checking root items
   [2/7] checking extents
   block group [1104150528 1073741824] used 39796736 but extent items used 0
   ERROR: errors found in extent allocation tree or chunk allocation
   (...)

Fix this by making insert_block_group_item() update the block group's
"commit_used" field.

Fixes: 7248e0cebbef ("btrfs: skip update of block group item if used bytes are the same")
CC: [email protected] # 6.2+
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>

docs: sysfs-block: document hidden sysfs entry

/sys/block/<disk>/hidden is undocumented. Document it.

Signed-off-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

ynl: re-license uniformly under GPL-2.0 OR BSD-3-Clause

I was intending to make all the Netlink Spec code BSD-3-Clause
to ease the adoption but it appears that:
- I fumbled the uAPI and used "GPL WITH uAPI note" there
- it gives people pause as they expect GPL in the kernel
As suggested by Chuck re-license under dual. This gives us benefit
of full BSD freedom while fulfilling the broad "kernel is under GPL"
expectations.

Link: https://lore.kernel.org/all/[email protected]/
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

mailmap: update entries for Stephen Hemminger

Map all my old email addresses to current address.

Signed-off-by: Stephen Hemminger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

mailmap: add entry for Maxim Mikityanskiy

Map Maxim's old corporate addresses to his personal one.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

nfc: change order inside nfc_se_io error path

cb_context should be freed on the error path in nfc_se_io as stated by
commit 25ff6f8a5a3b ("nfc: fix memory leak of se_io context in
nfc_genl_se_io").

Make the error path in nfc_se_io unwind everything in reverse order, i.e.
free the cb_context after unlocking the device.

Suggested-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Fedor Pchelkin <[email protected]>
Reviewed-by: Krzysztof Kozlowski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

ethernet: ice: avoid gcc-9 integer overflow warning

With older compilers like gcc-9, the calculation of the vlan
priority field causes a false-positive warning from the byteswap:

In file included from drivers/net/ethernet/intel/ice/ice_tc_lib.c:4:
drivers/net/ethernet/intel/ice/ice_tc_lib.c: In function 'ice_parse_cls_flower':
include/uapi/linux/swab.h:15:15: error: integer overflow in expression '(int)(short unsigned int)((int)match.key-><U67c8>.<U6698>.vlan_priority << 13) & 57344 & 255' of type 'int' results in '0' [-Werror=overflow]
   15 |  (((__u16)(x) & (__u16)0x00ffU) << 8) |   \
      |   ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
include/uapi/linux/swab.h:106:2: note: in expansion of macro '___constant_swab16'
  106 |  ___constant_swab16(x) :   \
      |  ^~~~~~~~~~~~~~~~~~
include/uapi/linux/byteorder/little_endian.h:42:43: note: in expansion of macro '__swab16'
   42 | #define __cpu_to_be16(x) ((__force __be16)__swab16((x)))
      |                                           ^~~~~~~~
include/linux/byteorder/generic.h:96:21: note: in expansion of macro '__cpu_to_be16'
   96 | #define cpu_to_be16 __cpu_to_be16
      |                     ^~~~~~~~~~~~~
drivers/net/ethernet/intel/ice/ice_tc_lib.c:1458:5: note: in expansion of macro 'cpu_to_be16'
1458 |     cpu_to_be16((match.key->vlan_priority <<
      |     ^~~~~~~~~~~

After a change to be16_encode_bits(), the code becomes more
readable to both people and compilers, which avoids the warning.

Fixes: 34800178b302 ("ice: Add support for VLAN priority filters in switchdev")
Suggested-by: Alexander Lobakin <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Reviewed-by: Alexander Lobakin <[email protected]>
Tested-by: Sujai Buvaneswaran <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: don't ignore return codes in VSI related code

There were few smatch warnings reported by Dan:
- ice_vsi_cfg_xdp_txqs can return 0 instead of ret, which is cleaner
- return values in ice_vsi_cfg_def were ignored
- in ice_vsi_rebuild return value was ignored in case rebuild failed,
it was a never reached code, however, rewrite it for clarity.
- ice_vsi_cfg_tc can return 0 instead of ret

Fixes: 6624e780a577 ("ice: split ice_vsi_setup into smaller functions")
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Michal Swiatkowski <[email protected]>
Tested-by: Gurucharan G <[email protected]> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <[email protected]>

ice: Fix DSCP PFC TLV creation

When creating the TLV to send to the FW for configuring DSCP mode PFC,the
PFCENABLE field was being masked with a 4 bit mask (0xF), but this is an 8
bit bitmask for enabled classes for PFC. This means that traffic classes
4-7 could not be enabled for PFC.

Remove the mask completely, as it is not necessary, as we are assigning 8
bits to an 8 bit field.

Fixes: 2a87bd73e50d ("ice: Add DSCP support")
Signed-off-by: Dave Ertman <[email protected]>
Signed-off-by: Karen Ostrowska <[email protected]>
Tested-by: Gurucharan G <[email protected]> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <[email protected]>

drm/amd/display: Update clock table to include highest clock setting

[Why]
Currently, the clk manager matches SocVoltage with voltage from
fused settings (dfPstate clock table). And then corresponding clocks
are selected.

However in certain situations, this leads to clk manager not
including at least one entry with highest supported clock setting.

[How]
Update the clk manager to include at least one entry with highest
supported clock setting.

Reviewed-by: Pavle Kotarac <[email protected]>
Acked-by: Qingqing Zhuo <[email protected]>
Signed-off-by: Swapnil Patel <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/pm: Enable ecc_info table support for smu v13_0_10

Support EccInfoTable which includes umc ras error count and
error address.

Signed-off-by: Candice Li <[email protected]>
Reviewed-by: Evan Quan <[email protected]>
Reviewed-by: Stanley.Yang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amdgpu: Support umc node harvest config on umc v8_10

Don't need to query error count and error address on harvest umc nodes.
v2: Fix code bug, use active_mask instead of harvsest_config
and remove unnecessary argument in LOOP macro.
v3: Leave adev->gmc.num_umc unchanged.

Signed-off-by: Candice Li <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/connector: print max_requested_bpc in state debugfs

This is useful to understand the bpc defaults and
support of a driver.

Signed-off-by: Harry Wentland <[email protected]>
Cc: Pekka Paalanen <[email protected]>
Cc: Sebastian Wick <[email protected]>
Cc: [email protected]
Cc: Uma Shankar <[email protected]>
Cc: Ville Syrjälä <[email protected]>
Cc: Joshua Ashton <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: [email protected]
Cc: [email protected]
Reviewed-By: Joshua Ashton <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]

drm/display: Don't block HDR_OUTPUT_METADATA on unknown EOTF

The EDID of an HDR display defines EOTFs that are supported
by the display and can be set in the HDR metadata infoframe.
Userspace is expected to read the EDID and set an appropriate
HDR_OUTPUT_METADATA.

In drm_parse_hdr_metadata_block the kernel reads the supported
EOTFs from the EDID and stores them in the
drm_connector->hdr_sink_metadata. While doing so it also
filters the EOTFs to the EOTFs the kernel knows about.
When an HDR_OUTPUT_METADATA is set it then checks to
make sure the EOTF is a supported EOTF. In cases where
the kernel doesn't know about a new EOTF this check will
fail, even if the EDID advertises support.

Since it is expected that userspace reads the EDID to understand
what the display supports it doesn't make sense for DRM to block
an HDR_OUTPUT_METADATA if it contains an EOTF the kernel doesn't
understand.

This comes with the added benefit of future-proofing metadata
support. If the spec defines a new EOTF there is no need to
update DRM and an compositor can immediately make use of it.

Bug: https://gitlab.freedesktop.org/wayland/weston/-/issues/609

v2: Distinguish EOTFs defind in kernel and ones defined
in EDID in the commit description (Pekka)

v3: Rebase; drm_hdmi_infoframe_set_hdr_metadata moved
to drm_hdmi_helper.c

Signed-off-by: Harry Wentland <[email protected]>
Cc: Pekka Paalanen <[email protected]>
Cc: Sebastian Wick <[email protected]>
Cc: [email protected]
Cc: Uma Shankar <[email protected]>
Cc: Ville Syrjälä <[email protected]>
Cc: Joshua Ashton <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: [email protected]
Cc: [email protected]
Acked-by: Pekka Paalanen <[email protected]>
Reviewed-By: Joshua Ashton <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]

RISC-V: fix taking the text_mutex twice during sifive errata patching

Chris pointed out that some bonehead, *cough* me *cough*, added two
mutex_locks() to the SiFive errata patching. The second was meant to
have been a mutex_unlock().

This results in errors such as

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000030
Oops [#1]
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted
6.2.0-rc1-starlight-00079-g9493e6f3ce02 #229
Hardware name: BeagleV Starlight Beta (DT)
epc : __schedule+0x42/0x500
ra : schedule+0x46/0xce
epc : ffffffff8065957c ra : ffffffff80659a80 sp : ffffffff81203c80
gp : ffffffff812d50a0 tp : ffffffff8120db40 t0 : ffffffff81203d68
t1 : 0000000000000001 t2 : 4c45203a76637369 s0 : ffffffff81203cf0
s1 : ffffffff8120db40 a0 : 0000000000000000 a1 : ffffffff81213958
a2 : ffffffff81213958 a3 : 0000000000000000 a4 : 0000000000000000
a5 : ffffffff80a1bd00 a6 : 0000000000000000 a7 : 0000000052464e43
s2 : ffffffff8120db41 s3 : ffffffff80a1ad00 s4 : 0000000000000000
s5 : 0000000000000002 s6 : ffffffff81213938 s7 : 0000000000000000
s8 : 0000000000000000 s9 : 0000000000000001 s10: ffffffff812d7204
s11: ffffffff80d3c920 t3 : 0000000000000001 t4 : ffffffff812e6dd7
t5 : ffffffff812e6dd8 t6 : ffffffff81203bb8
status: 0000000200000100 badaddr: 0000000000000030 cause: 000000000000000d
[<ffffffff80659a80>] schedule+0x46/0xce
[<ffffffff80659dce>] schedule_preempt_disabled+0x16/0x28
[<ffffffff8065ae0c>] __mutex_lock.constprop.0+0x3fe/0x652
[<ffffffff8065b138>] __mutex_lock_slowpath+0xe/0x16
[<ffffffff8065b182>] mutex_lock+0x42/0x4c
[<ffffffff8000ad94>] sifive_errata_patch_func+0xf6/0x18c
[<ffffffff80002b92>] _apply_alternatives+0x74/0x76
[<ffffffff80802ee8>] apply_boot_alternatives+0x3c/0xfa
[<ffffffff80803cb0>] setup_arch+0x60c/0x640
[<ffffffff80800926>] start_kernel+0x8e/0x99c
---[ end trace 0000000000000000 ]---

Reported-by: Chris Hofstaedtler <[email protected]>
Fixes: 9493e6f3ce02 ("RISC-V: take text_mutex during alternative patching")
Signed-off-by: Conor Dooley <[email protected]>
Tested-by: Geert Uytterhoeven <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[Palmer: pick up Geert's bug report from the thread]
Signed-off-by: Palmer Dabbelt <[email protected]>

cpumask: be more careful with 'cpumask_setall()'

Commit 596ff4a09b89 ("cpumask: re-introduce constant-sized cpumask
optimizations") changed cpumask_setall() to use "bitmap_set()" instead
of "bitmap_fill()", because bitmap_fill() would explicitly set all the
bits of a constant sized small bitmap, and that's exactly what we don't
want: we want to only set bits up to 'nr_cpu_ids', which is what
"bitmap_set()" does.

However, Yury correctly points out that while "bitmap_set()" does indeed
only set bits up to the required bitmap size, it doesn't _clear_ bits
above that size, so the upper bits would still not have well-defined
values.

Now, none of this should really matter, since any bits set past
'nr_cpu_ids' should always be ignored in the first place.  Yes, the bit
scanning functions might return them as a result, but since users should
always consider the ">= nr_cpu_ids" condition to mean "no more bits",
that shouldn't have any actual effect (see previous commit 8ca09d5fa354
"cpumask: fix incorrect cpumask scanning result checks").

But let's just do it right, the way the code was _intended_ to work.  We
have had enough lazy code that works but bites us in the *rse later
(again, see previous commit) that there's no reason to not just do this
properly.

It turns out that "bitmap_fill()" gets this all right for the complex
case, and really only fails for the inlined optimized case that just
fills the whole word.  And while we could just fix bitmap_fill() to use
the proper last word mask, there's two issues with that:

- the cpumask case wants to do the _optimization_ based on "NR_CPUS is
   a small constant", but then wants to do the actual bit _fill_ based
   on "nr_cpu_ids" that isn't necessarily that same constant

- we have lots of non-cpumask users of bitmap_fill(), and while they
   hopefully don't care, and probably would want the proper semantics
   anyway ("only set bits up to the limit"), I do not want the cpumask
   changes to impact other parts

So this ends up just doing the single-word optimization by hand in the
cpumask code.  If our cpumask is fundamentally limited to a single word,
just do the proper "fill in that word" exactly.  And if it's the more
complex multi-word case, then the generic bitmap_fill() will DTRT.

This is all an example of how our bitmap function optimizations really
are somewhat broken.  They conflate the "this is size of the bitmap"
optimizations with the actual bit(s) we want to set.

In many cases we really want to have the two be separate things:
sometimes we base our optimizations on the size of the whole bitmap ("I
know this whole bitmap fits in a single word, so I'll just use
single-word accesses"), and sometimes we base them on the bit we are
looking at ("this is just acting on bits that are in the first word, so
I'll use single-word accesses").

Notice how the end result of the two optimizations are the same, but the
way we get to them are quite different.

And all our cpumask optimization games are really about that fundamental
distinction, and we'd often really want to pass in both the "this is the
bit I'm working on" (which _can_ be a small constant but might be
variable), and "I know it's in this range even if it's variable" (based
on CONFIG_NR_CPUS).

So this cpumask_setall() implementation just makes that explicit.  It
checks the "I statically know the size is small" using the known static
size of the cpumask (which is what that 'small_cpumask_bits' is all
about), but then sets the actual bits using the exact number of cpus we
have (ie 'nr_cpumask_bits')

Of course, in a perfect world, the compiler would have done all the
range analysis (possibly with help from us just telling it that
"this value is always in this range"), and would do all of this for us.
But that is not the world we live in.

While we dream of that perfect world, this does that manual logic to
make it all work out.  And this was a very long explanation for a small
code change that shouldn't even matter.

Reported-by: Yury Norov <[email protected]>
Link: https://lore.kernel.org/lkml/ZAV9nGG9e1%2FrV+L%2F@yury-laptop/
Signed-off-by: Linus Torvalds <[email protected]>

NFSD: Protect against filesystem freezing

Flole observes this WARNING on occasion:

[1210423.486503] WARNING: CPU: 8 PID: 1524732 at fs/ext4/ext4_jbd2.c:75 ext4_journal_check_start+0x68/0xb0

Reported-by: <[email protected]>
Suggested-by: Jan Kara <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217123
Fixes: 73da852e3831 ("nfsd: use vfs_iter_read/write")
Reviewed-by: Jeff Layton <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>

net: usb: qmi_wwan: add Telit 0x1080 composition

Add the following Telit FE990 composition:

0x1080: tty, adb, rmnet, tty, tty, tty, tty

Signed-off-by: Enrico Sau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

net: usb: cdc_mbim: avoid altsetting toggling for Telit FE990

Add quirk CDC_MBIM_FLAG_AVOID_ALTSETTING_TOGGLE for Telit FE990
0x1081 composition in order to avoid bind error.

Signed-off-by: Enrico Sau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

block: fix wrong mode for blkdev_put() from disk_scan_partitions()

If disk_scan_partitions() is called with 'FMODE_EXCL',
blkdev_get_by_dev() will be called without 'FMODE_EXCL', however, follow
blkdev_put() is still called with 'FMODE_EXCL', which will cause
'bd_holders' counter to leak.

Fix the problem by using the right mode for blkdev_put().

Reported-by: [email protected]
Link: https://lore.kernel.org/lkml/[email protected]/T/
Tested-by: Julian Ruess <[email protected]>
Fixes: e5cfefa97bcc ("block: fix scan partition for exclusively open device again")
Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

Merge branch 'main' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Restore ctnetlink zero mark in events and dump, from Ivan Delalande.

2) Fix deadlock due to missing disabled bh in tproxy, from Florian Westphal.

3) Safer maximum chain load in conntrack, from Eric Dumazet.

* 'main' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: conntrack: adopt safer max chain length
  netfilter: tproxy: fix deadlock due to missing BH disable
  netfilter: ctnetlink: revert to dumping mark regardless of event type
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

platform: mellanox: mlx-platform: Initialize shift variable to 0

Initialize shift variable in mlxplat_mlxcpld_verify_bus_topology()
to 0 to avoid the following compile error:

drivers/platform/x86/mlx-platform.c:6013
mlxplat_mlxcpld_verify_bus_topology() error: uninitialized symbol 'shift'.

Fixes: 50b823fdd357 ("platform: mellanox: mlx-platform: Move bus shift assignment out of the loop")
Cc: Vadim Pasternak <[email protected]>
Cc: Michael Shych <[email protected]>
Signed-off-by: Hans de Goede <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

platform/x86: int3472: Add GPIOs to Surface Go 3 Board data

Add the INT347E GPIO lookup table to the board data for the Surface
Go 3. This is necessary to allow the ov7251 IR camera to probe
properly on that platform.

Signed-off-by: Daniel Scally <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Hans de Goede <[email protected]>
Signed-off-by: Hans de Goede <[email protected]>

platform/x86: ISST: Fix kernel documentation warnings

Fix warning displayed for "make W=1" for kernel documentation.

Signed-off-by: Srinivas Pandruvada <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Hans de Goede <[email protected]>
Reviewed-by: Hans de Goede <[email protected]>

platform: x86: MLX_PLATFORM: select REGMAP instead of depending on it

REGMAP is a hidden (not user visible) symbol. Users cannot set it
directly thru "make *config", so drivers should select it instead of
depending on it if they need it.

Consistently using "select" or "depends on" can also help reduce
Kconfig circular dependency issues.

Therefore, change the use of "depends on REGMAP" to "select REGMAP".

Fixes: ef0f62264b2a ("platform/x86: mlx-platform: Add physical bus number auto detection")
Signed-off-by: Randy Dunlap <[email protected]>
Cc: Vadim Pasternak <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Hans de Goede <[email protected]>
Cc: Mark Gross <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Hans de Goede <[email protected]>
Reviewed-by: Hans de Goede <[email protected]>

platform: mellanox: select REGMAP instead of depending on it

REGMAP is a hidden (not user visible) symbol. Users cannot set it
directly thru "make *config", so drivers should select it instead of
depending on it if they need it.

Consistently using "select" or "depends on" can also help reduce
Kconfig circular dependency issues.

Therefore, change the use of "depends on REGMAP" to "select REGMAP".

For NVSW_SN2201, select REGMAP_I2C instead of depending on it.

Fixes: c6acad68eb2d ("platform/mellanox: mlxreg-hotplug: Modify to use a regmap interface")
Fixes: 5ec4a8ace06c ("platform/mellanox: Introduce support for Mellanox register access driver")
Fixes: 62f9529b8d5c ("platform/mellanox: mlxreg-lc: Add initial support for Nvidia line card devices")
Fixes: 662f24826f95 ("platform/mellanox: Add support for new SN2201 system")
Signed-off-by: Randy Dunlap <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Hans de Goede <[email protected]>
Cc: Michael Shych <[email protected]>
Cc: Mark Gross <[email protected]>
Cc: Vadim Pasternak <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Hans de Goede <[email protected]>
Reviewed-by: Hans de Goede <[email protected]>

platform/x86/intel/tpmi: Fix double free reported by Smatch

Fix warning:
drivers/platform/x86/intel/tpmi.c:253 tpmi_create_device()
warn: 'feature_vsec_dev' was already freed.

If there is some error, feature_vsec_dev memory is freed as part
of resource managed call intel_vsec_add_aux(). So, additional
kfree() call is not required.

Reordered res allocation and feature_vsec_dev, so that on error
only res is freed.

Reported-by: Dan Carpenter <[email protected]>
Link: https://lore.kernel.org/platform-driver-x86/Y%2FxYR7WGiPayZu%2FR@kili/T/#u
Fixes: 47731fd2865f ("platform/x86/intel: Intel TPMI enumeration driver")
Signed-off-by: Srinivas Pandruvada <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Hans de Goede <[email protected]>
Reviewed-by: Hans de Goede <[email protected]>

platform/x86: ISST: Increase range of valid mail box commands

A new command CONFIG_TDP_GET_RATIO_INFO is added, with sub command type
of 0x0C. The previous range of valid sub commands was from 0x00 to 0x0B.
Change the valid range from 0x00 to 0x0C.

Signed-off-by: Srinivas Pandruvada <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Hans de Goede <[email protected]>
Reviewed-by: Hans de Goede <[email protected]>

platform/x86: dell-ddv: Fix temperature scaling

After using the built-in UEFI hardware diagnostics to compare
the measured battery temperature, i noticed that the temperature
is actually expressed in tenth degree kelvin, similar to the
SBS-Data standard. For example, a value of 2992 is displayed as
26 degrees celsius.
Fix the scaling so that the correct values are being displayed.

Tested on a Dell Inspiron 3505.

Fixes: a77272c16041 ("platform/x86: dell: Add new dell-wmi-ddv driver")
Signed-off-by: Armin Wolf <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Hans de Goede <[email protected]>
Reviewed-by: Hans de Goede <[email protected]>

platform/x86: dell-ddv: Fix cache invalidation on resume

If one or both sensor buffers could not be initialized, either
due to missing hardware support or due to some error during probing,
the resume handler will encounter undefined behaviour when
attempting to lock buffers then protected by an uninitialized or
destroyed mutex.
Fix this by introducing a "active" flag which is set during probe,
and only invalidate buffers which where flaged as "active".

Tested on a Dell Inspiron 3505.

Fixes: 3b7eeff93d29 ("platform/x86: dell-ddv: Add hwmon support")
Signed-off-by: Armin Wolf <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Hans de Goede <[email protected]>
Reviewed-by: Hans de Goede <[email protected]>

platform/x86/amd: pmc: remove CONFIG_SUSPEND checks

The amd_pmc_write_stb() function was previously hidden in an
ifdef to avoid a warning when CONFIG_SUSPEND is disabled, but
now there is an additional caller:

drivers/platform/x86/amd/pmc.c: In function 'amd_pmc_stb_debugfs_open_v2':
drivers/platform/x86/amd/pmc.c:256:8: error: implicit declaration of function 'amd_pmc_write_stb'; did you mean 'amd_pmc_read_stb'? [-Werror=implicit-function-declaration]
  256 |  ret = amd_pmc_write_stb(dev, AMD_PMC_STB_DUMMY_PC);
      |        ^~~~~~~~~~~~~~~~~
      |        amd_pmc_read_stb

There is now an easier way to handle this using DEFINE_SIMPLE_DEV_PM_OPS()
to replace all the #ifdefs, letting gcc drop any of the unused functions
silently.

Fixes: b0d4bb973539 ("platform/x86/amd: pmc: Write dummy postcode into the STB DRAM")
Signed-off-by: Arnd Bergmann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Hans de Goede <[email protected]>
Reviewed-by: Hans de Goede <[email protected]>

netfilter: conntrack: adopt safer max chain length

Customers using GKE 1.25 and 1.26 are facing conntrack issues
root caused to commit c9c3b6811f74 ("netfilter: conntrack: make
max chain length random").

Even if we assume Uniform Hashing, a bucket often reachs 8 chained
items while the load factor of the hash table is smaller than 0.5

With a limit of 16, we reach load factors of 3.
With a limit of 32, we reach load factors of 11.
With a limit of 40, we reach load factors of 15.
With a limit of 50, we reach load factors of 24.

This patch changes MIN_CHAINLEN to 50, to minimize risks.

Ideally, we could in the future add a cushion based on expected
load factor (2 * nf_conntrack_max / nf_conntrack_buckets),
because some setups might expect unusual values.

Fixes: c9c3b6811f74 ("netfilter: conntrack: make max chain length random")
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

new helper: put_and_unmap_page()

kunmap_local() + put_page(), as done by e.g. ext2 directory handling.

Signed-off-by: Al Viro <[email protected]>

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2023-03-06

We've added 8 non-merge commits during the last 7 day(s) which contain
a total of 9 files changed, 64 insertions(+), 18 deletions(-).

The main changes are:

1) Fix BTF resolver for DATASEC sections when a VAR points at a modifier,
   that is, keep resolving such instances instead of bailing out,
   from Lorenz Bauer.

2) Fix BPF test framework with regards to xdp_frame info misplacement
   in the "live packet" code, from Alexander Lobakin.

3) Fix an infinite loop in BPF sockmap code for TCP/UDP/AF_UNIX,
   from Liu Jian.

4) Fix a build error for riscv BPF JIT under PERF_EVENTS=n,
   from Randy Dunlap.

5) Several BPF doc fixes with either broken links or external instead
   of internal doc links, from Bagas Sanjaya.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: check that modifier resolves after pointer
  btf: fix resolving BTF_KIND_VAR after ARRAY, STRUCT, UNION, PTR
  bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES
  bpf, doc: Link to submitting-patches.rst for general patch submission info
  bpf, doc: Do not link to docs.kernel.org for kselftest link
  bpf, sockmap: Fix an infinite loop error when len is 0 in tcp_bpf_recvmsg_parser()
  riscv, bpf: Fix patch_text implicit declaration
  bpf, docs: Fix link to BTF doc
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

alpha: fix lazy-FPU mis(merged/applied/whatnot)

Looks like a braino that used to be fixed in e.g. #next.alpha
had gotten into alpha.git cherry-picked version of that patch.

Sure, alpha has no preempt, but preempt_enable() in place of
preempt_disable() is actively confusing the readers...

Other than that, the cherry-picked variant matches what I have.

Signed-off-by: Al Viro <[email protected]>

RISC-V: Stop emitting attributes

The RISC-V ELF attributes don't contain any useful information. New
toolchains ignore them, but they frequently trip up various older/mixed
toolchains. So just turn them off.

Tested-by: Conor Dooley <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Cc: [email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

scsi: sd: Fix wrong zone_write_granularity value during revalidate

When the sd driver revalidates host-managed SMR disks, it calls
disk_set_zoned() which changes the zone_write_granularity attribute value
to the logical block size regardless of the device type. After that, the sd
driver overwrites the value in sd_zbc_read_zone() with the physical block
size, since ZBC/ZAC requires this for host-managed disks. Between the calls
to disk_set_zoned() and sd_zbc_read_zone(), there exists a window where the
attribute shows the logical block size as the zone_write_granularity value,
which is wrong for host-managed disks. The duration of the window is from
20ms to 200ms, depending on report zone command execution time.

To avoid the wrong zone_write_granularity value between disk_set_zoned()
and sd_zbc_read_zone(), modify the value not in sd_zbc_read_zone() but
just after disk_set_zoned() call.

Fixes: a805a4fa4fa3 ("block: introduce zone_write_granularity limit")
Signed-off-by: Shin'ichiro Kawasaki <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Damien Le Moal <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: storvsc: Handle BlockSize change in Hyper-V VHD/VHDX file

Hyper-V uses a VHD or VHDX file on the host as the underlying storage for a
virtual disk.  The VHD/VHDX file format is a sparse format where real disk
space on the host is assigned in chunks that the VHD/VHDX file format calls
the BlockSize.  This BlockSize is not to be confused with the 512-byte (or
4096-byte) sector size of the underlying storage device.  The default block
size for a new VHD/VHDX file is 32 Mbytes.  When a guest VM touches any
disk space within a 32 Mbyte chunk of the VHD/VHDX file, Hyper-V allocates
32 Mbytes of real disk space for that section of the VHD/VHDX. Similarly,
if a discard operation is done that covers an entire 32 Mbyte chunk,
Hyper-V will free the real disk space for that portion of the VHD/VHDX.
This BlockSize is surfaced in Linux as the "discard_granularity" in
/sys/block/sd<x>/queue, which makes sense.

Hyper-V also has differencing disks that can overlay a VHD/VHDX file to
capture changes to the VHD/VHDX while preserving the original VHD/VHDX.
One example of this differencing functionality is for VM snapshots.  When a
snapshot is created, a differencing disk is created.  If the snapshot is
rolled back, Hyper-V can just delete the differencing disk, and the VM will
see the original disk contents at the time the snapshot was taken.
Differencing disks are used in other scenarios as well.

The BlockSize for a differencing disk defaults to 2 Mbytes, not 32 Mbytes.
The smaller default is used because changes to differencing disks are
typically scattered all over, and Hyper-V doesn't want to allocate 32
Mbytes of real disk space for a stray write here or there.  The smaller
BlockSize provides more efficient use of real disk space.

When a differencing disk is added to a VHD/VHDX, Hyper-V reports
UNIT_ATTENTION with a sense code indicating "Operating parameters have
changed", because the value of discard_granularity should be changed to 2
Mbytes. When the differencing disk is removed, discard_granularity should
be changed back to 32 Mbytes.  However, current code simply reports a
message from scsi_report_sense() and the value of
/sys/block/sd<x>/queue/discard_granularity is not updated. The message
isn't very actionable by a sysadmin.

Fix this by having the storvsc driver check for the sense code indicating
that the underly VHD/VHDX block size has changed, and do a rescan of the
device to pick up the new discard_granularity.  With this change the entire
transition to/from differencing disks is handled automatically and
transparently, with no confusing messages being output.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael Kelley <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: megaraid_sas: Driver version update to 07.725.01.00-rc1

Update driver version.

Signed-off-by: Chandrakanth Patil <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: megaraid_sas: Add crash dump mode capability bit in MFI capabilities

In kdump kernel mode, the driver works in reduced functionality mode with
some features disabled such as reduced MSI-X count and RDPQ disabled, etc.
However, the firmware is not aware of this mode in some cases, which
results in undefined behavior.

To address this, the driver informs the firmware about the kdump mode
through MPI capabilities bit during driver initialization. This allows
firmware to adjust its behavior accordingly.

Signed-off-by: Chandrakanth Patil <[email protected]>
Signed-off-by: Sumit Saxena <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: megaraid_sas: Update max supported LD IDs to 240

The firmware only supports Logical Disk IDs up to 240 and LD ID 255 (0xFF)
is reserved for deleted LDs. However, in some cases, firmware was assigning
LD ID 254 (0xFE) to deleted LDs and this was causing the driver to mark the
wrong disk as deleted. This in turn caused the wrong disk device to be
taken offline by the SCSI midlayer.

To address this issue, limit the LD ID range from 255 to 240. This ensures
the deleted LD ID is properly identified and removed by the driver without
accidently deleting any valid LDs.

Fixes: ae6874ba4b43 ("scsi: megaraid_sas: Early detection of VD deletion through RaidMap update")
Reported-by: Martin K. Petersen <[email protected]>
Signed-off-by: Chandrakanth Patil <[email protected]>
Signed-off-by: Sumit Saxena <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: mpi3mr: Bad drive in topology results kernel crash

When the SAS Transport Layer support is enabled and a device exposed to
the OS by the driver fails INQUIRY commands, the driver frees up the memory
allocated for an internal HBA port data structure. However, in some places,
the reference to the freed memory is not cleared. When the firmware sends
the Device Info change event for the same device again, the freed memory is
accessed and that leads to memory corruption and OS crash.

Signed-off-by: Ranjan Kumar <[email protected]>
Signed-off-by: Sreekanth Reddy <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: mpi3mr: NVMe command size greater than 8K fails

A wrong variable is checked while populating PRP entries in the PRP page
and this results in failure. No PRP entries in the PRP page were
successfully created and any NVMe Encapsulated commands with PRP of size
greater than 8K failed.

Signed-off-by: Ranjan Kumar <[email protected]>
Signed-off-by: Sreekanth Reddy <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: mpi3mr: Return proper values for failures in firmware init path

Return proper non-zero return values for all the cases when the controller
initialization and re-initialization fails.

Signed-off-by: Ranjan Kumar <[email protected]>
Signed-off-by: Sreekanth Reddy <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: mpi3mr: Wait for diagnostic save during controller init

If a controller reset operation is triggered to recover the controller from
a fault state, then wait for the snapdump to be saved in the firmware
region before proceeding to reset the controller.

Signed-off-by: Ranjan Kumar <[email protected]>
Signed-off-by: Sreekanth Reddy <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: mpi3mr: Driver unload crashes host when enhanced logging is enabled

Prevent driver from trying to dereference a NULL pointer in a debug print
while removing a device during driver unload.

Signed-off-by: Ranjan Kumar <[email protected]>
Signed-off-by: Sreekanth Reddy <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: mpi3mr: ioctl timeout when disabling/enabling interrupt

As part of Task Management handling, the driver will disable and enable the
MSIx index zero which belongs to the Admin reply queue. During this
transition the driver loses some interrupts and this leads to Admin request
and ioctl timeouts.

After enabling the interrupts, poll the Admin reply queue to avoid
timeouts.

Signed-off-by: Ranjan Kumar <[email protected]>
Signed-off-by: Sreekanth Reddy <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: lpfc: Avoid usage of list iterator variable after loop

If the &epd_pool->list is empty when executing
lpfc_get_io_buf_from_expedite_pool() the function would return an invalid
pointer. Even in the case if the list is guaranteed to be populated, the
iterator variable should not be used after the loop to be more robust for
future changes.

Linus proposed to avoid any use of the list iterator variable after the
loop, in the attempt to move the list iterator variable declaration into
the macro to avoid any potential misuse after the loop [1].

Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/
Signed-off-by: Jakob Koschel <[email protected]>
Link: https://lore.kernel.org/r/20230301-scsi-lpfc-avoid-list-iterator-after-loop-v1-1-325578ae7561@gmail.com
Reviewed-by: Justin Tee <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: lpfc: Check kzalloc() in lpfc_sli4_cgn_params_read()

If kzalloc() fails in lpfc_sli4_cgn_params_read(), then we rely on
lpfc_read_object()'s routine to NULL check pdata.

Currently, an early return error is thrown from lpfc_read_object() to
protect us from NULL ptr dereference, but the errno code is -ENODEV.

Change the errno code to a more appropriate -ENOMEM.

Reported-by: Kang Chen <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Signed-off-by: Justin Tee <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: ufs: mcq: qcom: Clean the return path of ufs_qcom_mcq_config_resource()

Smatch static checker reported:
drivers/ufs/host/ufs-qcom.c:1469
ufs_qcom_mcq_config_resource() info: returning a literal zero is
cleaner

Fix the above warning by returning in place instead of a jump to a label.
Also remove the usage of devm_kfree() as it's unnecessary in this function.

Fixes: c263b4ef737e ("scsi: ufs: core: mcq: Configure resource regions")
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Asutosh Das <[email protected]>
Link: https://lore.kernel.org/r/3ebd2582af74b81ef7b57149f57c6a3bf0963953.1677721229.git.quic_asutoshd@quicinc.com
Reviewed-by: Bjorn Andersson <[email protected]>
Reviewed-by: Manivannan Sadhasivam <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: ufs: mcq: qcom: Fix passing zero to PTR_ERR

Fix an error case in ufs_qcom_mcq_config_resource(), where the return value
is set to 0 before passing it to PTR_ERR.

This led to Smatch warning:

drivers/ufs/host/ufs-qcom.c:1455 ufs_qcom_mcq_config_resource() warn:
passing zero to 'PTR_ERR'

Fixes: c263b4ef737e ("scsi: ufs: core: mcq: Configure resource regions")
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Asutosh Das <[email protected]>
Link: https://lore.kernel.org/r/94ca99b327af634799ce5f25d0112c28cd00970d.1677721072.git.quic_asutoshd@quicinc.com
Reviewed-by: Bjorn Andersson <[email protected]>
Reviewed-by: Manivannan Sadhasivam <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: ufs: ufs-qcom: Remove impossible check

The "dev_req_params" pointer points to inside the middle of a struct so it
can't be NULL. Removing this impossible condition is nice because now we
don't need to consider the correct error code for that situation.

Link: https://lore.kernel.org/r/Y/yA3niWUcGYgBU8@kili
Fixes: f06fcc7155dc ("scsi: ufs-qcom: add QUniPro hardware support and power optimizations")
Signed-off-by: Dan Carpenter <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: ufs: core: Add soft dependency on governor_simpleondemand

The ufshcd driver uses simpleondemand governor for devfreq. Add it to the
list of ufshcd softdeps to allow userspace initramfs tools like dracut to
automatically pull the governor module into the initramfs together with UFS
drivers.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Adrien Thierry <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: hisi_sas: Check devm_add_action() return value

In case devm_add_action() fails, check it in the caller of
interrupt_preinit_v3_hw().

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Kang Chen <[email protected]>
Acked-by: Xiang Chen <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: qla2xxx: Add option to disable FC2 Target support

Commit 44c57f205876 ("scsi: qla2xxx: Changes to support FCP2 Target") added
support for FC2 Targets. Unfortunately, there are older setups which break
with this new feature enabled.

Allow to disable it via module option.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Daniel Wagner <[email protected]>
Reviewed-by: Himanshu Madhani <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: target: iscsi: Fix an error message in iscsi_check_key()

The first half of the error message is printed by pr_err(), the second half
is printed by pr_debug(). The user will therefore see only the first part
of the message and will miss some useful information.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Maurizio Lombardi <[email protected]>
Reviewed-by: Mike Christie <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

net: tls: fix device-offloaded sendpage straddling records

Adrien reports that incorrect data is transmitted when a single
page straddles multiple records. We would transmit the same
data in all iterations of the loop.

Reported-by: Adrien Moulin <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Fixes: c1318b39c7d3 ("tls: Add opt-in zerocopy mode of sendfile()")
Tested-by: Adrien Moulin <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Acked-by: Maxim Mikityanskiy <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: ethernet: mtk_eth_soc: fix RX data corruption issue

Fix data corruption issue with SerDes connected PHYs operating at 1.25
Gbps speed where we could previously observe about 30% packet loss while
the bad packet counter was increasing.

As almost all boards with MediaTek MT7622 or MT7986 use either the MT7531
switch IC operating at 3.125Gbps SerDes rate or single-port PHYs using
rate-adaptation to 2500Base-X mode, this issue only got exposed now when
we started trying to use SFP modules operating with 1.25 Gbps with the
BananaPi R3 board.

The fix is to set bit 12 which disables the RX FIFO clear function when
setting up MAC MCR, MediaTek SDK did the same change stating:
"If without this patch, kernel might receive invalid packets that are
corrupted by GMAC."[1]

[1]: https://git01.mediatek.com/plugins/gitiles/openwrt/feeds/mtk-openwrt-feeds/+/d8a2975939a12686c4a95c40db21efdc3f821f63

Fixes: 42c03844e93d ("net-next: mediatek: add support for MediaTek MT7622 SoC")
Tested-by: Bjørn Mork <[email protected]>
Signed-off-by: Daniel Golle <[email protected]>
Reviewed-by: Vladimir Oltean <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Link: https://lore.kernel.org/r/138da2735f92c8b6f8578ec2e5a794ee515b665f.1677937317.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <[email protected]>

net: phy: smsc: fix link up detection in forced irq mode

Currently link up can't be detected in forced mode if polling
isn't used. Only link up interrupt source we have is aneg
complete which isn't applicable in forced mode. Therefore we
have to use energy-on as link up indicator.

Fixes: 7365494550f6 ("net: phy: smsc: skip ENERGYON interrupt if disabled")
Signed-off-by: Heiner Kallweit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

perf tools: Add Adrian Hunter to MAINTAINERS as a reviewer

Adrian is the main author of the Intel PT codebase and has been
reviewing perf tooling patches consistently for a long time, so lets
reflect that in the MAINTAINERS file so that contributors add him to the
CC list in patch submissions.

Acked-by: Adrian Hunter <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

tools headers UAPI: Sync linux/perf_event.h with the kernel sources

To pick up the changes in:

  09519ec3b19e4144 ("perf: Add perf_event_attr::config3")

The patches for the tooling side will come later.

This addresses this perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/perf_event.h' differs from latest version at 'include/uapi/linux/perf_event.h'
  diff -u tools/include/uapi/linux/perf_event.h include/uapi/linux/perf_event.h

Cc: Adrian Hunter <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Will Deacon <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

cpumask: fix incorrect cpumask scanning result checks

It turns out that commit 596ff4a09b89 ("cpumask: re-introduce
constant-sized cpumask optimizations") exposed a number of cases of
drivers not checking the result of "cpumask_next()" and friends
correctly.

The documented correct check for "no more cpus in the cpumask" is to
check for the result being equal or larger than the number of possible
CPU ids, exactly _because_ we've always done those constant-sized
cpumask scans using a widened type before.  So the return value of a
cpumask scan should be checked with

if (cpu >= nr_cpu_ids)
...

because the cpumask scan did not necessarily stop exactly *at* that
maximum CPU id.

But a few cases ended up instead using checks like

if (cpu == nr_cpumask_bits)
...

which used that internal "widened" number of bits.  And that used to
work pretty much by accident (ok, in this case "by accident" is simply
because it matched the historical internal implementation of the cpumask
scanning, so it was more of a "intentionally using implementation
details rather than an accident").

But the extended constant-sized optimizations then did that internal
implementation differently, and now that code that did things wrong but
matched the old implementation no longer worked at all.

Which then causes subsequent odd problems due to using what ends up
being an invalid CPU ID.

Most of these cases require either unusual hardware or special uses to
hit, but the random.c one triggers quite easily.

All you really need is to have a sufficiently small CONFIG_NR_CPUS value
for the bit scanning optimization to be triggered, but not enough CPUs
to then actually fill that widened cpumask.  At that point, the cpumask
scanning will return the NR_CPUS constant, which is _not_ the same as
nr_cpumask_bits.

This just does the mindless fix with

   sed -i 's/== nr_cpumask_bits/>= nr_cpu_ids/'

to fix the incorrect uses.

The ones in the SCSI lpfc driver in particular could probably be fixed
more cleanly by just removing that repeated pattern entirely, but I am
not emptionally invested enough in that driver to care.

Reported-and-tested-by: Guenter Roeck <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
Reported-and-tested-by: Geert Uytterhoeven <[email protected]>
Link: https://lore.kernel.org/lkml/CAMuHMdUKo_Sf7TjKzcNDa8Ve+6QrK+P8nSQrSQ=6LTRmcBKNww@mail.gmail.com/
Reported-by: Vernon Yang <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
Cc: Yury Norov <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge branch 'fix resolving VAR after DATASEC'

Lorenz Bauer says:

====================

See the first patch for a detailed explanation.

v2:
- Move RESOLVE_TBD assignment out of the loop (Martin)
====================

Signed-off-by: Martin KaFai Lau <[email protected]>

selftests/bpf: check that modifier resolves after pointer

Add a regression test that ensures that a VAR pointing at a
modifier which follows a PTR (or STRUCT or ARRAY) is resolved
correctly by the datasec validator.

Signed-off-by: Lorenz Bauer <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>

btf: fix resolving BTF_KIND_VAR after ARRAY, STRUCT, UNION, PTR

btf_datasec_resolve contains a bug that causes the following BTF
to fail loading:

    [1] DATASEC a size=2 vlen=2
        type_id=4 offset=0 size=1
        type_id=7 offset=1 size=1
    [2] INT (anon) size=1 bits_offset=0 nr_bits=8 encoding=(none)
    [3] PTR (anon) type_id=2
    [4] VAR a type_id=3 linkage=0
    [5] INT (anon) size=1 bits_offset=0 nr_bits=8 encoding=(none)
    [6] TYPEDEF td type_id=5
    [7] VAR b type_id=6 linkage=0

This error message is printed during btf_check_all_types:

    [1] DATASEC a size=2 vlen=2
        type_id=7 offset=1 size=1 Invalid type

By tracing btf_*_resolve we can pinpoint the problem:

    btf_datasec_resolve(depth: 1, type_id: 1, mode: RESOLVE_TBD) = 0
        btf_var_resolve(depth: 2, type_id: 4, mode: RESOLVE_TBD) = 0
            btf_ptr_resolve(depth: 3, type_id: 3, mode: RESOLVE_PTR) = 0
        btf_var_resolve(depth: 2, type_id: 4, mode: RESOLVE_PTR) = 0
    btf_datasec_resolve(depth: 1, type_id: 1, mode: RESOLVE_PTR) = -22

The last invocation of btf_datasec_resolve should invoke btf_var_resolve
by means of env_stack_push, instead it returns EINVAL. The reason is that
env_stack_push is never executed for the second VAR.

    if (!env_type_is_resolve_sink(env, var_type) &&
        !env_type_is_resolved(env, var_type_id)) {
        env_stack_set_next_member(env, i + 1);
        return env_stack_push(env, var_type, var_type_id);
    }

env_type_is_resolve_sink() changes its behaviour based on resolve_mode.
For RESOLVE_PTR, we can simplify the if condition to the following:

    (btf_type_is_modifier() || btf_type_is_ptr) && !env_type_is_resolved()

Since we're dealing with a VAR the clause evaluates to false. This is
not sufficient to trigger the bug however. The log output and EINVAL
are only generated if btf_type_id_size() fails.

    if (!btf_type_id_size(btf, &type_id, &type_size)) {
        btf_verifier_log_vsi(env, v->t, vsi, "Invalid type");
        return -EINVAL;
    }

Most types are sized, so for example a VAR referring to an INT is not a
problem. The bug is only triggered if a VAR points at a modifier. Since
we skipped btf_var_resolve that modifier was also never resolved, which
means that btf_resolved_type_id returns 0 aka VOID for the modifier.
This in turn causes btf_type_id_size to return NULL, triggering EINVAL.

To summarise, the following conditions are necessary:

- VAR pointing at PTR, STRUCT, UNION or ARRAY
- Followed by a VAR pointing at TYPEDEF, VOLATILE, CONST, RESTRICT or
  TYPE_TAG

The fix is to reset resolve_mode to RESOLVE_TBD before attempting to
resolve a VAR from a DATASEC.

Fixes: 1dc92851849c ("bpf: kernel side support for BTF Var and DataSec")
Signed-off-by: Lorenz Bauer <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>

Merge tag 'drm-misc-fixes-2023-02-23' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

A fix for nouveau preventing the system shutdown and one for a build
warning, and NULL pointer dereference fix for cirrus.

Signed-off-by: Dave Airlie <[email protected]>
From: Maxime Ripard <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/20230223083839.5gtmu6i42bnj7pfh@houat

bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES

&xdp_buff and &xdp_frame are bound in a way that

xdp_buff->data_hard_start == xdp_frame

It's always the case and e.g. xdp_convert_buff_to_frame() relies on
this.
IOW, the following:

for (u32 i = 0; i < 0xdead; i++) {
xdpf = xdp_convert_buff_to_frame(&xdp);
xdp_convert_frame_to_buff(xdpf, &xdp);
}

shouldn't ever modify @xdpf's contents or the pointer itself.
However, "live packet" code wrongly treats &xdp_frame as part of its
context placed *before* the data_hard_start. With such flow,
data_hard_start is sizeof(*xdpf) off to the right and no longer points
to the XDP frame.

Instead of replacing `sizeof(ctx)` with `offsetof(ctx, xdpf)` in several
places and praying that there are no more miscalcs left somewhere in the
code, unionize ::frm with ::data in a flex array, so that both starts
pointing to the actual data_hard_start and the XDP frame actually starts
being a part of it, i.e. a part of the headroom, not the context.
A nice side effect is that the maximum frame size for this mode gets
increased by 40 bytes, as xdp_buff::frame_sz includes everything from
data_hard_start (-> includes xdpf already) to the end of XDP/skb shared
info.
Also update %MAX_PKT_SIZE accordingly in the selftests code. Leave it
hardcoded for 64 bit && 4k pages, it can be made more flexible later on.

Minor: align `&head->data` with how `head->frm` is assigned for
consistency.
Minor #2: rename 'frm' to 'frame' in &xdp_page_head while at it for
clarity.

(was found while testing XDP traffic generator on ice, which calls
xdp_convert_frame_to_buff() for each XDP frame)

Fixes: b530e9e1063e ("bpf: Add "live packet" mode for XDP in BPF_PROG_RUN")
Acked-by: Toke Høiland-Jørgensen <[email protected]>
Signed-off-by: Alexander Lobakin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>

cpumask: Fix typo nr_cpumask_size --> nr_cpumask_bits

The never used nr_cpumask_size is just a typo, hence use existing
redefinition that's called nr_cpumask_bits.

Signed-off-by: Andy Shevchenko <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

btrfs: fix extent map logging bit not cleared for split maps after dropping range

At btrfs_drop_extent_map_range() we are clearing the EXTENT_FLAG_LOGGING
bit on a 'flags' variable that was not initialized. This makes static
checkers complain about it, so initialize the 'flags' variable before
clearing the bit.

In practice this has no consequences, because EXTENT_FLAG_LOGGING should
not be set when btrfs_drop_extent_map_range() is called, as an fsync locks
the inode in exclusive mode, locks the inode's mmap semaphore in exclusive
mode too and it always flushes all delalloc.

Also add a comment about why we clear EXTENT_FLAG_LOGGING on a copy of the
flags of the split extent map.

Reported-by: Dan Carpenter <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/Y%2FyipSVozUDEZKow@kili/
Fixes: db21370bffbc ("btrfs: drop extent map range more efficiently")
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: fix percent calculation for bg reclaim message

We have a report, that the info message for block-group reclaim is
crossing the 100% used mark.

This is happening as we were truncating the divisor for the division
(the block_group->length) to a 32bit value.

Fix this by using div64_u64() to not truncate the divisor.

In the worst case, it can lead to a div by zero error and should be
possible to trigger on 4 disks RAID0, and each device is large enough:

  $ mkfs.btrfs  -f /dev/test/scratch[1234] -m raid1 -d raid0
  btrfs-progs v6.1
  [...]
  Filesystem size:    40.00GiB
  Block group profiles:
    Data:             RAID0             4.00GiB <<<
    Metadata:         RAID1           256.00MiB
    System:           RAID1             8.00MiB

Reported-by: Forza <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Fixes: 5f93e776c673 ("btrfs: zoned: print unusable percentage when reclaiming block groups")
CC: [email protected] # 5.15+
Reviewed-by: Anand Jain <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
[ add Qu's note ]
Signed-off-by: David Sterba <[email protected]>

btrfs: fix unnecessary increment of read error stat on write error

Current btrfs_log_dev_io_error() increases the read error count even if the
erroneous IO is a WRITE request. This is because it forget to use "else
if", and all the error WRITE requests counts as READ error as there is (of
course) no REQ_RAHEAD bit set.

Fixes: c3a62baf21ad ("btrfs: use chained bios when cloning")
CC: [email protected] # 6.1+
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Naohiro Aota <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: handle btrfs_del_item errors in __btrfs_update_delayed_inode

Even if the slot is already read out, we may still need to re-balance
the tree, thus it can cause error in that btrfs_del_item() call and we
need to handle it properly.

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: void0red <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: ioctl: return device fsid from DEV_INFO ioctl

Currently user space utilizes dev info ioctl to grab the info of a
certain devid, this includes its device uuid.  But the returned info is
not enough to determine if a device is a seed.

Commit a26d60dedf9a ("btrfs: sysfs: add devinfo/fsid to retrieve actual
fsid from the device") exports the same value in sysfs so this is for
parity with ioctl.  Add a new member, fsid, into
btrfs_ioctl_dev_info_args, and populate the member with fsid value.

This should not cause any compatibility problem, following the
combinations:

- Old user space, old kernel
- Old user space, new kernel
  User space tool won't even check the new member.

- New user space, old kernel
  The kernel won't touch the new member, and user space tool should
  zero out its argument, thus the new member is all zero.

  User space tool can then know the kernel doesn't support this fsid
  reporting, and falls back to whatever they can.

- New user space, new kernel
  Go as planned.

  Would find the fsid member is no longer zero, and trust its value.

Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: fix potential dead lock in size class loading logic

As reported by Filipe, there's a potential deadlock caused by
using btrfs_search_forward on commit_root. The locking there is
unconditional, even if ->skip_locking and ->search_commit_root is set.
It's not meant to be used for commit roots, so it always needs to do
locking.

So if another task is COWing a child node of the same root node and
then needs to wait for block group caching to complete when trying to
allocate a metadata extent, it deadlocks.

For example:

[539604.239315] sysrq: Show Blocked State
[539604.240133] task:kworker/u16:6   state:D stack:0     pid:2119594 ppid:2      flags:0x00004000
[539604.241613] Workqueue: btrfs-cache btrfs_work_helper [btrfs]
[539604.242673] Call Trace:
[539604.243129]  <TASK>
[539604.243925]  __schedule+0x41d/0xee0
[539604.244797]  ? rcu_read_lock_sched_held+0x12/0x70
[539604.245399]  ? rwsem_down_read_slowpath+0x185/0x490
[539604.246111]  schedule+0x5d/0xf0
[539604.246593]  rwsem_down_read_slowpath+0x2da/0x490
[539604.247290]  ? rcu_barrier_tasks_trace+0x10/0x20
[539604.248090]  __down_read_common+0x3d/0x150
[539604.248702]  down_read_nested+0xc3/0x140
[539604.249280]  __btrfs_tree_read_lock+0x24/0x100 [btrfs]
[539604.250097]  btrfs_read_lock_root_node+0x48/0x60 [btrfs]
[539604.250915]  btrfs_search_forward+0x59/0x460 [btrfs]
[539604.251781]  ? btrfs_global_root+0x50/0x70 [btrfs]
[539604.252476]  caching_thread+0x1be/0x920 [btrfs]
[539604.253167]  btrfs_work_helper+0xf6/0x400 [btrfs]
[539604.253848]  process_one_work+0x24f/0x5a0
[539604.254476]  worker_thread+0x52/0x3b0
[539604.255166]  ? __pfx_worker_thread+0x10/0x10
[539604.256047]  kthread+0xf0/0x120
[539604.256591]  ? __pfx_kthread+0x10/0x10
[539604.257212]  ret_from_fork+0x29/0x50
[539604.257822]  </TASK>
[539604.258233] task:btrfs-transacti state:D stack:0     pid:2236474 ppid:2      flags:0x00004000
[539604.259802] Call Trace:
[539604.260243]  <TASK>
[539604.260615]  __schedule+0x41d/0xee0
[539604.261205]  ? rcu_read_lock_sched_held+0x12/0x70
[539604.262000]  ? rwsem_down_read_slowpath+0x185/0x490
[539604.262822]  schedule+0x5d/0xf0
[539604.263374]  rwsem_down_read_slowpath+0x2da/0x490
[539604.266228]  ? lock_acquire+0x160/0x310
[539604.266917]  ? rcu_read_lock_sched_held+0x12/0x70
[539604.267996]  ? lock_contended+0x19e/0x500
[539604.268720]  __down_read_common+0x3d/0x150
[539604.269400]  down_read_nested+0xc3/0x140
[539604.270057]  __btrfs_tree_read_lock+0x24/0x100 [btrfs]
[539604.271129]  btrfs_read_lock_root_node+0x48/0x60 [btrfs]
[539604.272372]  btrfs_search_slot+0x143/0xf70 [btrfs]
[539604.273295]  update_block_group_item+0x9e/0x190 [btrfs]
[539604.274282]  btrfs_start_dirty_block_groups+0x1c4/0x4f0 [btrfs]
[539604.275381]  ? __mutex_unlock_slowpath+0x45/0x280
[539604.276390]  btrfs_commit_transaction+0xee/0xed0 [btrfs]
[539604.277391]  ? lock_acquire+0x1a4/0x310
[539604.278080]  ? start_transaction+0xcb/0x6c0 [btrfs]
[539604.279099]  transaction_kthread+0x142/0x1c0 [btrfs]
[539604.279996]  ? __pfx_transaction_kthread+0x10/0x10 [btrfs]
[539604.280673]  kthread+0xf0/0x120
[539604.281050]  ? __pfx_kthread+0x10/0x10
[539604.281496]  ret_from_fork+0x29/0x50
[539604.281966]  </TASK>
[539604.282255] task:fsstress        state:D stack:0     pid:2236483 ppid:1      flags:0x00004006
[539604.283897] Call Trace:
[539604.284700]  <TASK>
[539604.285088]  __schedule+0x41d/0xee0
[539604.285660]  schedule+0x5d/0xf0
[539604.286175]  btrfs_wait_block_group_cache_progress+0xf2/0x170 [btrfs]
[539604.287342]  ? __pfx_autoremove_wake_function+0x10/0x10
[539604.288450]  find_free_extent+0xd93/0x1750 [btrfs]
[539604.289256]  ? _raw_spin_unlock+0x29/0x50
[539604.289911]  ? btrfs_get_alloc_profile+0x127/0x2a0 [btrfs]
[539604.290843]  btrfs_reserve_extent+0x147/0x290 [btrfs]
[539604.291943]  btrfs_alloc_tree_block+0xcb/0x3e0 [btrfs]
[539604.292903]  __btrfs_cow_block+0x138/0x580 [btrfs]
[539604.293773]  btrfs_cow_block+0x10e/0x240 [btrfs]
[539604.294595]  btrfs_search_slot+0x7f3/0xf70 [btrfs]
[539604.295585]  btrfs_update_device+0x71/0x1b0 [btrfs]
[539604.296459]  btrfs_chunk_alloc_add_chunk_item+0xe0/0x340 [btrfs]
[539604.297489]  btrfs_chunk_alloc+0x1bf/0x490 [btrfs]
[539604.298335]  find_free_extent+0x6fa/0x1750 [btrfs]
[539604.299174]  ? _raw_spin_unlock+0x29/0x50
[539604.299950]  ? btrfs_get_alloc_profile+0x127/0x2a0 [btrfs]
[539604.300918]  btrfs_reserve_extent+0x147/0x290 [btrfs]
[539604.301797]  btrfs_alloc_tree_block+0xcb/0x3e0 [btrfs]
[539604.303017]  ? lock_release+0x224/0x4a0
[539604.303855]  __btrfs_cow_block+0x138/0x580 [btrfs]
[539604.304789]  btrfs_cow_block+0x10e/0x240 [btrfs]
[539604.305611]  btrfs_search_slot+0x7f3/0xf70 [btrfs]
[539604.306682]  ? btrfs_global_root+0x50/0x70 [btrfs]
[539604.308198]  lookup_inline_extent_backref+0x17b/0x7a0 [btrfs]
[539604.309254]  lookup_extent_backref+0x43/0xd0 [btrfs]
[539604.310122]  __btrfs_free_extent+0xf8/0x810 [btrfs]
[539604.310874]  ? lock_release+0x224/0x4a0
[539604.311724]  ? btrfs_merge_delayed_refs+0x17b/0x1d0 [btrfs]
[539604.313023]  __btrfs_run_delayed_refs+0x2ba/0x1260 [btrfs]
[539604.314271]  btrfs_run_delayed_refs+0x8f/0x1c0 [btrfs]
[539604.315445]  ? rcu_read_lock_sched_held+0x12/0x70
[539604.316706]  btrfs_commit_transaction+0xa2/0xed0 [btrfs]
[539604.317855]  ? do_raw_spin_unlock+0x4b/0xa0
[539604.318544]  ? _raw_spin_unlock+0x29/0x50
[539604.319240]  create_subvol+0x53d/0x6e0 [btrfs]
[539604.320283]  btrfs_mksubvol+0x4f5/0x590 [btrfs]
[539604.321220]  __btrfs_ioctl_snap_create+0x11b/0x180 [btrfs]
[539604.322307]  btrfs_ioctl_snap_create_v2+0xc6/0x150 [btrfs]
[539604.323295]  btrfs_ioctl+0x9f7/0x33e0 [btrfs]
[539604.324331]  ? rcu_read_lock_sched_held+0x12/0x70
[539604.325137]  ? lock_release+0x224/0x4a0
[539604.325808]  ? __x64_sys_ioctl+0x87/0xc0
[539604.326467]  __x64_sys_ioctl+0x87/0xc0
[539604.327109]  do_syscall_64+0x38/0x90
[539604.327875]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[539604.328792] RIP: 0033:0x7f05a7babaeb

This needs to use regular btrfs_search_slot() with some skip and stop
logic.

Since we only consider five samples (five search slots), don't bother
with the complexity of looking for commit_root_sem contention. If
necessary, it can be added to the load function in between samples.

Reported-by: Filipe Manana <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H7eKMD44Z1+=Kb-1RFMMeZpAm2fwyO59yeBwCcSOU80Pg@mail.gmail.com/
Fixes: c7eec3d9aa95 ("btrfs: load block group size class when caching")
Signed-off-by: Boris Burkov <[email protected]>
Signed-off-by: David Sterba <[email protected]>

tools headers x86 cpufeatures: Sync with the kernel sources

To pick the changes from:

  8415a74852d7c247 ("x86/cpu, kvm: Add support for CPUID_80000021_EAX")

This only causes these perf files to be rebuilt:

  CC       /tmp/build/perf/bench/mem-memcpy-x86-64-asm.o
  CC       /tmp/build/perf/bench/mem-memset-x86-64-asm.o

And addresses these perf build warnings:

  Warning: Kernel ABI header at 'tools/arch/x86/include/asm/disabled-features.h' differs from latest version at 'arch/x86/include/asm/disabled-features.h'
  diff -u tools/arch/x86/include/asm/disabled-features.h arch/x86/include/asm/disabled-features.h
  Warning: Kernel ABI header at 'tools/arch/x86/include/asm/required-features.h' differs from latest version at 'arch/x86/include/asm/required-features.h'
  diff -u tools/arch/x86/include/asm/required-features.h arch/x86/include/asm/required-features.h

Cc: Adrian Hunter <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kim Phillips <[email protected]>
Cc: Namhyung Kim <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

bpf, doc: Link to submitting-patches.rst for general patch submission info

The link for patch submission information in general refers to index
page for "Working with the kernel development community" section of
kernel docs, whereas the link should have been
Documentation/process/submitting-patches.rst instead.

Fix it by replacing the index target with the appropriate doc.

Fixes: 542228384888f5 ("bpf, doc: convert bpf_devel_QA.rst to use RST formatting")
Signed-off-by: Bagas Sanjaya <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

bpf, doc: Do not link to docs.kernel.org for kselftest link

The question on how to run BPF selftests have a reference link to kernel
selftest documentation (Documentation/dev-tools/kselftest.rst). However,
it uses external link to the documentation at kernel.org/docs (aka
docs.kernel.org) instead, which requires Internet access.

Fix this and replace the link with internal linking, by using :doc: directive
while keeping the anchor text.

Fixes: b7a27c3aafa252 ("bpf, doc: howto use/run the BPF selftests")
Signed-off-by: Bagas Sanjaya <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

udf: Warn if block mapping is done for in-ICB files

Now that address space operations are merge dfor in-ICB and normal
files, it is more likely some code mistakenly tries to map blocks for
in-ICB files. WARN and return error instead of silently returning
garbage.

Signed-off-by: Jan Kara <[email protected]>

udf: Fix reading of in-ICB files

After merging address space operations of normal and in-ICB files,
readahead could get called for in-ICB files which resulted in
udf_get_block() being called for these files. udf_get_block() is not
prepared to be called for in-ICB files and ends up returning garbage
results as it interprets file data as extent list. Fix the problem by
skipping readahead for in-ICB files.

Fixes: 37a8a39f7ad3 ("udf: Switch to single address_space_operations")
Signed-off-by: Jan Kara <[email protected]>

udf: Fix lost writes in udf_adinicb_writepage()

The patch converting udf_adinicb_writepage() to avoid manually kmapping
the page used memcpy_to_page() however that copies in the wrong
direction (effectively overwriting file data with the old contents).
What we should be using is memcpy_from_page() to copy data from the page
into the inode and then mark inode dirty to store the data.

Fixes: 5cfc45321a6d ("udf: Convert udf_adinicb_writepage() to memcpy_to_page()")
Signed-off-by: Jan Kara <[email protected]>

m68k: Only force 030 bus error if PC not in exception table

__get_kernel_nofault() does copy data in supervisor mode when
forcing a task backtrace log through /proc/sysrq_trigger.
This is expected cause a bus error exception on e.g. NULL
pointer dereferencing when logging a kernel task has no
workqueue associated. This bus error ought to be ignored.

Our 030 bus error handler is ill equipped to deal with this:

Whenever ssw indicates a kernel mode access on a data fault,
we don't even attempt to handle the fault and instead always
send a SEGV signal (or panic). As a result, the check
for exception handling at the fault PC (buried in
send_sig_fault() which gets called from do_page_fault()
eventually) is never used.

In contrast, both 040 and 060 access error handlers do not
care whether a fault happened on supervisor mode access,
and will call do_page_fault() on those, ultimately honoring
the exception table.

Add a check in bus_error030 to call do_page_fault() in case
we do have an entry for the fault PC in our exception table.

I had attempted a fix for this earlier in 2019 that did rely
on testing pagefault_disabled() (see link below) to achieve
the same thing, but this patch should be more generic.

Tested on 030 Atari Falcon.

Reported-by: Eero Tamminen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael Schmitz <[email protected]>
Reviewed-by: Geert Uytterhoeven <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Geert Uytterhoeven <[email protected]>

m68k: mm: Move initrd phys_to_virt handling after paging_init()

When booting with an initial ramdisk on platforms where physical memory
does not start at address zero (e.g. on Amiga):

    initrd: 0ef0602c - 0f800000
    Zone ranges:
      DMA      [mem 0x0000000008000000-0x000000f7ffffffff]
      Normal   empty
    Movable zone start for each node
    Early memory node ranges
      node   0: [mem 0x0000000008000000-0x000000000f7fffff]
    Initmem setup node 0 [mem 0x0000000008000000-0x000000000f7fffff]
    Unable to handle kernel access at virtual address (ptrval)
    Oops: 00000000
    Modules linked in:
    PC: [<00201d3c>] memcmp+0x28/0x56

As phys_to_virt() relies on m68k_memoffset and module_fixup(), it must
not be called before paging_init().  Hence postpone the phys_to_virt
handling for the initial ramdisk until after calling paging_init().

While at it, reduce #ifdef clutter by using IS_ENABLED() instead.

Fixes: 376e3fdecb0dcae2 ("m68k: Enable memtest functionality")
Reported-by: Stephen Walsh <[email protected]>
Link: https://lists.debian.org/debian-68k/2022/09/msg00007.html
Reported-by: John Paul Adrian Glaubitz <[email protected]>
Link: https://lore.kernel.org/r/4f45f05f377bf3f5baf88dbd5c3c8aeac59d94f0.camel@physik.fu-berlin.de
Signed-off-by: Geert Uytterhoeven <[email protected]>
Acked-by: Finn Thain <[email protected]>
Link: https://lore.kernel.org/r/dff216da09ab7a60217c3fc2147e671ae07d636f.1677528627.git.geert@linux-m68k.org

m68k: mm: Fix systems with memory at end of 32-bit address space

The calculation of end addresses of memory chunks overflowed to 0 when
a memory chunk is located at the end of 32-bit address space.
This is the case for the HP300 architecture.

Link: https://lore.kernel.org/linux-m68k/CACz-3rhUo5pgNwdWHaPWmz+30Qo9xCg70wNxdf7o5x-6tXq8QQ@mail.gmail.com/
Signed-off-by: Kars de Jong <[email protected]>
Reviewed-by: Geert Uytterhoeven <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Geert Uytterhoeven <[email protected]>

tools include UAPI: Sync linux/vhost.h with the kernel sources

To get the changes in:

  3b688d7a086d0438 ("vhost-vdpa: uAPI to resume the device")

To pick up these changes and support them:

  $ tools/perf/trace/beauty/vhost_virtio_ioctl.sh > before
  $ cp ../linux/include/uapi/linux/vhost.h tools/include/uapi/linux/vhost.h
  $ tools/perf/trace/beauty/vhost_virtio_ioctl.sh > after
  $ diff -u before after
  --- before 2023-03-06 09:26:14.889251817 -0300
  +++ after 2023-03-06 09:26:20.594406270 -0300
  @@ -30,6 +30,7 @@
    [0x77] = "VDPA_SET_CONFIG_CALL",
    [0x7C] = "VDPA_SET_GROUP_ASID",
    [0x7D] = "VDPA_SUSPEND",
  + [0x7E] = "VDPA_RESUME",
   };
   static const char *vhost_virtio_ioctl_read_cmds[] = {
    [0x00] = "GET_FEATURES",
  $

For instance, see how those 'cmd' ioctl arguments get translated, now
VDPA_RESUME will be as well:

  # perf trace -a -e ioctl --max-events=10
       0.000 ( 0.011 ms): pipewire/2261 ioctl(fd: 60, cmd: SNDRV_PCM_HWSYNC, arg: 0x1)                        = 0
      21.353 ( 0.014 ms): pipewire/2261 ioctl(fd: 60, cmd: SNDRV_PCM_HWSYNC, arg: 0x1)                        = 0
      25.766 ( 0.014 ms): gnome-shell/2196 ioctl(fd: 14, cmd: DRM_I915_IRQ_WAIT, arg: 0x7ffe4a22c740)            = 0
      25.845 ( 0.034 ms): gnome-shel:cs0/2212 ioctl(fd: 14, cmd: DRM_I915_IRQ_EMIT, arg: 0x7fd43915dc70)            = 0
      25.916 ( 0.011 ms): gnome-shell/2196 ioctl(fd: 9, cmd: DRM_MODE_ADDFB2, arg: 0x7ffe4a22c8a0)               = 0
      25.941 ( 0.025 ms): gnome-shell/2196 ioctl(fd: 9, cmd: DRM_MODE_ATOMIC, arg: 0x7ffe4a22c840)               = 0
      32.915 ( 0.009 ms): gnome-shell/2196 ioctl(fd: 9, cmd: DRM_MODE_RMFB, arg: 0x7ffe4a22cf9c)                 = 0
      42.522 ( 0.013 ms): gnome-shell/2196 ioctl(fd: 14, cmd: DRM_I915_IRQ_WAIT, arg: 0x7ffe4a22c740)            = 0
      42.579 ( 0.031 ms): gnome-shel:cs0/2212 ioctl(fd: 14, cmd: DRM_I915_IRQ_EMIT, arg: 0x7fd43915dc70)            = 0
      42.644 ( 0.010 ms): gnome-shell/2196 ioctl(fd: 9, cmd: DRM_MODE_ADDFB2, arg: 0x7ffe4a22c8a0)               = 0
  #

Cc: Adrian Hunter <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Sebastien Boeuf <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

netfilter: tproxy: fix deadlock due to missing BH disable

The xtables packet traverser performs an unconditional local_bh_disable(),
but the nf_tables evaluation loop does not.

Functions that are called from either xtables or nftables must assume
that they can be called in process context.

inet_twsk_deschedule_put() assumes that no softirq interrupt can occur.
If tproxy is used from nf_tables its possible that we'll deadlock
trying to aquire a lock already held in process context.

Add a small helper that takes care of this and use it.

Link: https://lore.kernel.org/netfilter-devel/[email protected]/
Fixes: 4ed8eb6570a4 ("netfilter: nf_tables: Add native tproxy support")
Reported-and-tested-by: Major Dávid <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: ctnetlink: revert to dumping mark regardless of event type

It seems that change was unintentional, we have userspace code that
needs the mark while listening for events like REPLY, DESTROY, etc.
Also include 0-marks in requested dumps, as they were before that fix.

Fixes: 1feeae071507 ("netfilter: ctnetlink: fix compilation warning after data race fixes in ct mark")
Signed-off-by: Ivan Delalande <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

bnxt_en: Fix the double free during device removal

Following warning reported by KASAN during driver unload

==================================================================
BUG: KASAN: double-free in bnxt_remove_one+0x103/0x200 [bnxt_en]
Free of addr ffff88814e8dd4c0 by task rmmod/17469
CPU: 47 PID: 17469 Comm: rmmod Kdump: loaded Tainted: G S 6.2.0-rc7+ #2
Hardware name: Dell Inc. PowerEdge R740/01YM03, BIOS 2.3.10 08/15/2019
Call Trace:
<TASK>
dump_stack_lvl+0x33/0x46
print_report+0x17b/0x4b3
? __call_rcu_common.constprop.79+0x27e/0x8c0
? __pfx_free_object_rcu+0x10/0x10
? __virt_addr_valid+0xe3/0x160
? bnxt_remove_one+0x103/0x200 [bnxt_en]
kasan_report_invalid_free+0x64/0xd0
? bnxt_remove_one+0x103/0x200 [bnxt_en]
? bnxt_remove_one+0x103/0x200 [bnxt_en]
__kasan_slab_free+0x179/0x1c0
? bnxt_remove_one+0x103/0x200 [bnxt_en]
__kmem_cache_free+0x194/0x350
bnxt_remove_one+0x103/0x200 [bnxt_en]
pci_device_remove+0x62/0x110
device_release_driver_internal+0xf6/0x1c0
driver_detach+0x76/0xe0
bus_remove_driver+0x89/0x160
pci_unregister_driver+0x26/0x110
? strncpy_from_user+0x188/0x1c0
bnxt_exit+0xc/0x24 [bnxt_en]
__x64_sys_delete_module+0x21f/0x390
? __pfx___x64_sys_delete_module+0x10/0x10
? __pfx_mem_cgroup_handle_over_high+0x10/0x10
? _raw_spin_lock+0x87/0xe0
? __pfx__raw_spin_lock+0x10/0x10
? __audit_syscall_entry+0x185/0x210
? ktime_get_coarse_real_ts64+0x51/0x80
? syscall_trace_enter.isra.18+0x126/0x1a0
do_syscall_64+0x37/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7effcb6fd71b
Code: 73 01 c3 48 8b 0d 6d 17 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 3d 17 2c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffeada270b8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 00005623660e0750 RCX: 00007effcb6fd71b
RDX: 000000000000000a RSI: 0000000000000800 RDI: 00005623660e07b8
RBP: 0000000000000000 R08: 00007ffeada26031 R09: 0000000000000000
R10: 00007effcb771280 R11: 0000000000000206 R12: 00007ffeada272e0
R13: 00007ffeada28bc4 R14: 00005623660e02a0 R15: 00005623660e0750
</TASK>

Auxiliary device structures are freed in bnxt_aux_dev_release. So avoid
calling kfree from bnxt_remove_one.

Also, set bp->edev to NULL before freeing the auxilary private structure.

Fixes: d80d88b0dfff ("bnxt_en: Add auxiliary driver support")
Reviewed-by: Ajit Khaparde <[email protected]>
Reviewed-by: Andy Gospodarek <[email protected]>
Signed-off-by: Selvin Xavier <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>