Git Repo - linux.git/log

Merge tag 'fs.acl.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull vfs acl update from Christian Brauner:
"This contains a single update to the internal get acl method and
  replaces an open-coded cmpxchg() comparison with with try_cmpxchg().

  It's clearer and also beneficial on some architectures"

* tag 'fs.acl.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
  posix_acl: Use try_cmpxchg in get_acl

Merge tag 'fs.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull vfs hardening update from Christian Brauner:
"Jan pointed out that during shutdown both filp_close() and super block
  destruction will use basic printk logging when bugs are detected. This
  causes issues in a few scenarios:

   - Tools like syzkaller cannot figure out that the logged message
     indicates a bug.

   - Users that explicitly opt in to have the kernel bug on data
     corruption by selecting CONFIG_BUG_ON_DATA_CORRUPTION should see
     the kernel crash when they did actually select that option.

   - When there are busy inodes after the superblock is shut down later
     access to such a busy inodes walks through freed memory. It would
     be better to cleanly crash instead.

  All of this can be addressed by using the already existing
  CHECK_DATA_CORRUPTION() macro in these places when kernel bugs are
  detected. Its logging improvement is useful for all users.

  Otherwise this only has a meaningful behavioral effect when users do
  select CONFIG_BUG_ON_DATA_CORRUPTION which means this is backward
  compatible for regular users"

* tag 'fs.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
  fs: Use CHECK_DATA_CORRUPTION() when kernel bugs are detected

Merge tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull vfs idmapping updates from Christian Brauner:

- Last cycle we introduced the dedicated struct mnt_idmap type for
   mount idmapping and the required infrastucture in 256c8aed2b42 ("fs:
   introduce dedicated idmap type for mounts"). As promised in last
   cycle's pull request message this converts everything to rely on
   struct mnt_idmap.

   Currently we still pass around the plain namespace that was attached
   to a mount. This is in general pretty convenient but it makes it easy
   to conflate namespaces that are relevant on the filesystem with
   namespaces that are relevant on the mount level. Especially for
   non-vfs developers without detailed knowledge in this area this was a
   potential source for bugs.

   This finishes the conversion. Instead of passing the plain namespace
   around this updates all places that currently take a pointer to a
   mnt_userns with a pointer to struct mnt_idmap.

   Now that the conversion is done all helpers down to the really
   low-level helpers only accept a struct mnt_idmap argument instead of
   two namespace arguments.

   Conflating mount and other idmappings will now cause the compiler to
   complain loudly thus eliminating the possibility of any bugs. This
   makes it impossible for filesystem developers to mix up mount and
   filesystem idmappings as they are two distinct types and require
   distinct helpers that cannot be used interchangeably.

   Everything associated with struct mnt_idmap is moved into a single
   separate file. With that change no code can poke around in struct
   mnt_idmap. It can only be interacted with through dedicated helpers.
   That means all filesystems are and all of the vfs is completely
   oblivious to the actual implementation of idmappings.

   We are now also able to extend struct mnt_idmap as we see fit. For
   example, we can decouple it completely from namespaces for users that
   don't require or don't want to use them at all. We can also extend
   the concept of idmappings so we can cover filesystem specific
   requirements.

   In combination with the vfs{g,u}id_t work we finished in v6.2 this
   makes this feature substantially more robust and thus difficult to
   implement wrong by a given filesystem and also protects the vfs.

- Enable idmapped mounts for tmpfs and fulfill a longstanding request.

   A long-standing request from users had been to make it possible to
   create idmapped mounts for tmpfs. For example, to share the host's
   tmpfs mount between multiple sandboxes. This is a prerequisite for
   some advanced Kubernetes cases. Systemd also has a range of use-cases
   to increase service isolation. And there are more users of this.

   However, with all of the other work going on this was way down on the
   priority list but luckily someone other than ourselves picked this
   up.

   As usual the patch is tiny as all the infrastructure work had been
   done multiple kernel releases ago. In addition to all the tests that
   we already have I requested that Rodrigo add a dedicated tmpfs
   testsuite for idmapped mounts to xfstests. It is to be included into
   xfstests during the v6.3 development cycle. This should add a slew of
   additional tests.

* tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits)
  shmem: support idmapped mounts for tmpfs
  fs: move mnt_idmap
  fs: port vfs{g,u}id helpers to mnt_idmap
  fs: port fs{g,u}id helpers to mnt_idmap
  fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
  fs: port i_{g,u}id_{needs_}update() to mnt_idmap
  quota: port to mnt_idmap
  fs: port privilege checking helpers to mnt_idmap
  fs: port inode_owner_or_capable() to mnt_idmap
  fs: port inode_init_owner() to mnt_idmap
  fs: port acl to mnt_idmap
  fs: port xattr to mnt_idmap
  fs: port ->permission() to pass mnt_idmap
  fs: port ->fileattr_set() to pass mnt_idmap
  fs: port ->set_acl() to pass mnt_idmap
  fs: port ->get_acl() to pass mnt_idmap
  fs: port ->tmpfile() to pass mnt_idmap
  fs: port ->rename() to pass mnt_idmap
  fs: port ->mknod() to pass mnt_idmap
  fs: port ->mkdir() to pass mnt_idmap
  ...

sched/topology: fix KASAN warning in hop_cmp()

Despite that prev_hop is used conditionally on cur_hop
is not the first hop, it's initialized unconditionally.

Because initialization implies dereferencing, it might happen
that the code dereferences uninitialized memory, which has been
spotted by KASAN. Fix it by reorganizing hop_cmp() logic.

Reported-by: Bruno Goncalves <[email protected]>
Fixes: cd7f55359c90 ("sched: add sched_numa_find_nth_cpu()")
Signed-off-by: Yury Norov <[email protected]>
Link: https://lore.kernel.org/r/Y+7avK6V9SyAWsXi@yury-laptop/
Signed-off-by: Jakub Kicinski <[email protected]>

Merge tag 'iversion-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull i_version updates from Jeff Layton:
"This overhauls how we handle i_version queries from nfsd.

  Instead of having special routines and grabbing the i_version field
  directly out of the inode in some cases, we've moved most of the
  handling into the various filesystems' getattr operations. As a bonus,
  this makes ceph's change attribute usable by knfsd as well.

  This should pave the way for future work to make this value queryable
  by userland, and to make it more resilient against rolling back on a
  crash"

* tag 'iversion-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  nfsd: remove fetch_iversion export operation
  nfsd: use the getattr operation to fetch i_version
  nfsd: move nfsd4_change_attribute to nfsfh.c
  ceph: report the inode version in getattr if requested
  nfs: report the inode version in getattr if requested
  vfs: plumb i_version handling into struct kstat
  fs: clarify when the i_version counter must be updated
  fs: uninline inode_query_iversion

Merge tag 'locks-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull file locking updates from Jeff Layton:
"The main change here is that I've broken out most of the file locking
  definitions into a new header file. I also went ahead and completed
  the removal of locks_inode function"

* tag 'locks-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fs: remove locks_inode
  filelock: move file locking definitions to separate header file

Merge tag 'tpm-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd

Pull tpm updates from Jarkko Sakkinen:
"In additon to bug fixes, these are noteworthy changes:

   - In TPM I2C drivers, migrate from probe() to probe_new() (a new
     driver model in I2C).

   - TPM CRB: Pluton support

   - Add duplicate hash detection to the blacklist keyring in order to
     give more meaningful klog output than e.g. [1]"

Link: https://askubuntu.com/questions/1436856/ubuntu-22-10-blacklist-problem-blacklisting-hash-13-message-on-boot
* tag 'tpm-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
  tpm: add vendor flag to command code validation
  tpm: Add reserved memory event log
  tpm: Use managed allocation for bios event log
  tpm: tis_i2c: Convert to i2c's .probe_new()
  tpm: tpm_i2c_nuvoton: Convert to i2c's .probe_new()
  tpm: tpm_i2c_infineon: Convert to i2c's .probe_new()
  tpm: tpm_i2c_atmel: Convert to i2c's .probe_new()
  tpm: st33zp24: Convert to i2c's .probe_new()
  KEYS: asymmetric: Fix ECDSA use via keyctl uapi
  certs: don't try to update blacklist keys
  KEYS: Add new function key_create()
  certs: make blacklisted hash available in klog
  tpm_crb: Add support for CRB devices based on Pluton
  crypto: certs: fix FIPS selftest dependency

Merge tag 'rust-6.3' of https://github.com/Rust-for-Linux/linux

Pull Rust updates from Miguel Ojeda:
"More core additions, getting closer to a point where the first Rust
  modules can be upstreamed. The major ones being:

   - Sync: new types 'Arc', 'ArcBorrow' and 'UniqueArc'.

   - Types: new trait 'ForeignOwnable' and new type 'ScopeGuard'.

  There is also a substantial removal in terms of lines:

   - 'alloc' crate: remove the 'borrow' module (type 'Cow' and trait
     'ToOwned')"

* tag 'rust-6.3' of https://github.com/Rust-for-Linux/linux:
  rust: delete rust-project.json when running make clean
  rust: MAINTAINERS: Add the zulip link
  rust: types: implement `ForeignOwnable` for `Arc<T>`
  rust: types: implement `ForeignOwnable` for the unit type
  rust: types: implement `ForeignOwnable` for `Box<T>`
  rust: types: introduce `ForeignOwnable`
  rust: types: introduce `ScopeGuard`
  rust: prelude: prevent doc inline of external imports
  rust: sync: add support for dispatching on Arc and ArcBorrow.
  rust: sync: introduce `UniqueArc`
  rust: sync: allow type of `self` to be `ArcBorrow<T>`
  rust: sync: introduce `ArcBorrow`
  rust: sync: allow coercion from `Arc<T>` to `Arc<U>`
  rust: sync: allow type of `self` to be `Arc<T>` or variants
  rust: sync: add `Arc` for ref-counted allocations
  rust: compiler_builtins: make stubs non-global
  rust: alloc: remove the `borrow` module (`ToOwned`, `Cow`)

arm64: fix .idmap.text assertion for large kernels

When building a kernel with many debug options enabled (which happens in
test configurations use by myself and syzbot), the kernel can become
large enough that portions of .text can be more than 128M away from
.idmap.text (which is placed inside the .rodata section). Where idmap
code branches into .text, the linker will place veneers in the
.idmap.text section to make those branches possible.

Unfortunately, as Ard reports, GNU LD has bseen observed to add 4K of
padding when adding such veneers, e.g.

| .idmap.text    0xffffffc01e48e5c0      0x32c arch/arm64/mm/proc.o
|                0xffffffc01e48e5c0                idmap_cpu_replace_ttbr1
|                0xffffffc01e48e600                idmap_kpti_install_ng_mappings
|                0xffffffc01e48e800                __cpu_setup
| *fill*         0xffffffc01e48e8ec        0x4
| .idmap.text.stub
|                0xffffffc01e48e8f0       0x18 linker stubs
|                0xffffffc01e48f8f0                __idmap_text_end = .
|                0xffffffc01e48f000                . = ALIGN (0x1000)
| *fill*         0xffffffc01e48f8f0      0x710
|                0xffffffc01e490000                idmap_pg_dir = .

This makes the __idmap_text_start .. __idmap_text_end region bigger than
the 4K we require it to fit within, and triggers an assertion in arm64's
vmlinux.lds.S, which breaks the build:

| LD      .tmp_vmlinux.kallsyms1
| aarch64-linux-gnu-ld: ID map text too big or misaligned
| make[1]: *** [scripts/Makefile.vmlinux:35: vmlinux] Error 1
| make: *** [Makefile:1264: vmlinux] Error 2

Avoid this by using an `ADRP+ADD+BLR` sequence for branches out of
.idmap.text, which avoids the need for veneers. These branches are only
executed once per boot, and only when the MMU is on, so there should be
no noticeable performance penalty in replacing `BL` with `ADRP+ADD+BLR`.

At the same time, remove the "x" and "w" attributes when placing code in
.idmap.text, as these are not necessary, and this will prevent the
linker from assuming that it is safe to place PLTs into .idmap.text,
causing it to warn if and when there are out-of-range branches within
.idmap.text, e.g.

|   LD      .tmp_vmlinux.kallsyms1
| arch/arm64/kernel/head.o: in function `primary_entry':
| (.idmap.text+0x1c): relocation truncated to fit: R_AARCH64_CALL26 against symbol `dcache_clean_poc' defined in .text section in arch/arm64/mm/cache.o
| arch/arm64/kernel/head.o: in function `init_el2':
| (.idmap.text+0x88): relocation truncated to fit: R_AARCH64_CALL26 against symbol `dcache_clean_poc' defined in .text section in arch/arm64/mm/cache.o
| make[1]: *** [scripts/Makefile.vmlinux:34: vmlinux] Error 1
| make: *** [Makefile:1252: vmlinux] Error 2

Thus, if future changes add out-of-range branches in .idmap.text, it
should be easy enough to identify those from the resulting linker
errors.

Reported-by: [email protected]
Link: https://lore.kernel.org/linux-arm-kernel/[email protected]/
Signed-off-by: Mark Rutland <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Will Deacon <[email protected]>
Tested-by: Ard Biesheuvel <[email protected]>
Reviewed-by: Ard Biesheuvel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>

Merge tag 'remove-get_kernel_pages-for-6.3' of https://git.linaro.org/people/jens.wiklander/linux-tee

Pull TEE update from Jens Wiklander:
"Remove get_kernel_pages()

  Vmalloc page support is removed from shm_get_kernel_pages() and the
  get_kernel_pages() call is replaced by calls to get_page(). With no
  remaining callers of get_kernel_pages() the function is removed"

[ This looks like it's just some random 'tee' cleanup, but the bigger
  picture impetus for this is really to to to remove historical
  confusion with mixed use of kernel virtual addresses and 'struct page'
  pointers.

  Kernel virtual pointers in the vmalloc space is then particularly
  confusing - both for looking up a page pointer (when trying to then
  unify a "virtual address or page" interface) and _particularly_ when
  mixed with HIGHMEM support and the kmap*() family of remapping.

  This is particularly true with HIGHMEM getting much less test coverage
  with 32-bit architectures being increasingly legacy targets.

  So we actively wanted to remove get_kernel_pages() to make sure nobody
  else used it too, and thus the 'tee' part is "finally remove last
  user".

  See also commit 6647e76ab623 ("v4l2: don't fall back to follow_pfn()
  if pin_user_pages_fast() fails") for a totally different version of a
  conceptually similar "let's stop this confusion of different ways of
  referring to memory".   - Linus ]

* tag 'remove-get_kernel_pages-for-6.3' of https://git.linaro.org/people/jens.wiklander/linux-tee:
  mm: Remove get_kernel_pages()
  tee: Remove call to get_kernel_pages()
  tee: Remove vmalloc page support
  highmem: Enhance is_kmap_addr() to check kmap_local_page() mappings

net: bcmgenet: Support wake-up from s2idle

When we suspend into s2idle we also need to enable the interrupt line
that generates the MPD and HFB interrupts towards the host CPU interrupt
controller (typically the ARM GIC or MIPS L1) to make it exit s2idle.

When we suspend into other modes such as "standby" or "mem" we engage a
power management state machine which will gate off the CPU L1 controller
(priv->irq0) and ungate the side band wake-up interrupt (priv->wol_irq).
It is safe to have both enabled as wake-up sources because they are
mutually exclusive given any suspend mode.

Signed-off-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

scm: add user copy checks to put_cmsg()

This is a followup of commit 2558b8039d05 ("net: use a bounce
buffer for copying skb->mark")

x86 and powerpc define user_access_begin, meaning
that they are not able to perform user copy checks
when using user_write_access_begin() / unsafe_copy_to_user()
and friends [1]

Instead of waiting bugs to trigger on other arches,
add a check_object_size() in put_cmsg() to make sure
that new code tested on x86 with CONFIG_HARDENED_USERCOPY=y
will perform more security checks.

[1] We can not generically call check_object_size() from
unsafe_copy_to_user() because UACCESS is enabled at this point.

Signed-off-by: Eric Dumazet <[email protected]>
Cc: Kees Cook <[email protected]>
Acked-by: Kees Cook <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

devlink: drop leftover duplicate/unused code

The recent merge from net left-over some unused code in
leftover.c - nomen omen.

Just drop the unused bits.

Signed-off-by: Paolo Abeni <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'linux-can-next-for-6.3-20230217' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:

====================
pull-request: can-next 2023-02-17 - fixed

this is a pull request of 4 patches for net-next/master.

The first patch is by Yang Li and converts the ctucanfd driver to
devm_platform_ioremap_resource().

The last 3 patches are by Frank Jungclaus, target the esd_usb driver
and contains preparations for the upcoming support of the esd
CAN-USB/3 hardware.
====================

Signed-off-by: David S. Miller <[email protected]>

net: lan966x: Use automatic selection of VCAP rule actionset

Since commit 81e164c4aec5 ("net: microchip: sparx5: Add automatic
selection of VCAP rule actionset") the VCAP API has the capability to
select automatically the actionset based on the actions that are attached
to the rule. So it is not needed anymore to hardcode the actionset in the
driver, therefore it is OK to remove this.

Signed-off-by: Horatiu Vultur <[email protected]>
Reviewed-by: Alexander Lobakin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'default_rps_mask-follow-up'

Paolo Abeni says:

====================
net: default_rps_mask follow-up

The first patch namespacify the setting. In the common case, once
proper isolation is in place in the main namespace, forwarding
to/from each child netns will allways happen on the desidered CPUs.

Any additional RPS stage inside the child namespace will not provide
additional isolation and could hurt performance badly if picking a
CPU on a remote node.

The 2nd patch adds more self-tests coverage.
====================

Signed-off-by: David S. Miller <[email protected]>

self-tests: more rps self tests

Explicitly check for child netns and main ns independency

Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: make default_rps_mask a per netns attribute

That really was meant to be a per netns attribute from the beginning.

The idea is that once proper isolation is in place in the main
namespace, additional demux in the child namespaces will be redundant.
Let's make child netns default rps mask empty by default.

To avoid bloating the netns with a possibly large cpumask, allocate
it on-demand during the first write operation.

Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'wireless-next-2023-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v6.3

Third set of patches for v6.3. This time only a set of small fixes
submitted during the last day or two.
====================

Signed-off-by: David S. Miller <[email protected]>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Add safeguard to check for NULL tupe in objects updates via
   NFT_MSG_NEWOBJ, this should not ever happen. From Alok Tiwari.

2) Incorrect pointer check in the new destroy rule command,
   from Yang Yingliang.

3) Incorrect status bitcheck in nf_conntrack_udp_packet(),
   from Florian Westphal.

4) Simplify seq_print_acct(), from Ilia Gavrilov.

5) Use 2-arg optimal variant of kfree_rcu() in IPVS,
   from Julian Anastasov.

6) TCP connection enters CLOSE state in conntrack for locally
   originated TCP reset packet from the reject target,
   from Florian Westphal.

The fixes #2 and #3 in this series address issues from the previous pull
nf-next request in this net-next cycle.
====================

Signed-off-by: David S. Miller <[email protected]>

net: microchip: sparx5: reduce stack usage

The vcap_admin structures in vcap_api_next_lookup_advanced_test()
take several hundred bytes of stack frame, but when CONFIG_KASAN_STACK
is enabled, each one of them also has extra padding before and after
it, which ends up blowing the warning limit:

In file included from drivers/net/ethernet/microchip/vcap/vcap_api.c:3521:
drivers/net/ethernet/microchip/vcap/vcap_api_kunit.c: In function 'vcap_api_next_lookup_advanced_test':
drivers/net/ethernet/microchip/vcap/vcap_api_kunit.c:1954:1: error: the frame size of 1448 bytes is larger than 1400 bytes [-Werror=frame-larger-than=]
1954 | }

Reduce the total stack usage by replacing the five structures with
an array that only needs one pair of padding areas.

Fixes: 1f741f001160 ("net: microchip: sparx5: Add KUNIT tests for enabling/disabling chains")
Signed-off-by: Arnd Bergmann <[email protected]>
Reviewed-by: Alexander Lobakin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sfc: use IS_ENABLED() checks for CONFIG_SFC_SRIOV

One local variable has become unused after a recent change:

drivers/net/ethernet/sfc/ef100_nic.c: In function 'ef100_probe_netdev_pf':
drivers/net/ethernet/sfc/ef100_nic.c:1155:21: error: unused variable 'net_dev' [-Werror=unused-variable]
struct net_device *net_dev = efx->net_dev;
^~~~~~~

The variable is still used in an #ifdef. Replace the #ifdef with
an if(IS_ENABLED()) check that lets the compiler see where it is
used, rather than adding another #ifdef.

This also fixes an uninitialized return value in ef100_probe_netdev_pf()
that gcc did not spot.

Fixes: 7e056e2360d9 ("sfc: obtain device mac address based on firmware handle for ef100")
Signed-off-by: Arnd Bergmann <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ice: properly alloc ICE_VSI_LB

Devlink reload patchset introduced regression. ICE_VSI_LB wasn't
taken into account when doing default allocation. Fix it by adding a
case for ICE_VSI_LB in ice_vsi_alloc_def().

Fixes: 6624e780a577 ("ice: split ice_vsi_setup into smaller functions")
Reported-by: Maciej Fijalkowski <[email protected]>
Acked-by: Maciej Fijalkowski <[email protected]>
Signed-off-by: Michal Swiatkowski <[email protected]>
Reviewed-by: Alexander Lobakin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sfc: Fix spelling mistake "creationg" -> "creating"

There is a spelling mistake in a pci_warn message. Fix it.

Signed-off-by: Colin Ian King <[email protected]>
Reviewed-by: Alejandro Lucero <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

octeontx2-af: Add NIX Errata workaround on CN10K silicon

This patch adds workaround for below 2 HW erratas

1. Due to improper clock gating, NIXRX may free the same
NPA buffer multiple times.. to avoid this, always enable
NIX RX conditional clock.

2. NIX FIFO does not get initialized on reset, if the SMQ
flush is triggered before the first packet is processed, it
will lead to undefined state. The workaround to perform SMQ
flush only if packet count is non-zero in MDQ.

Signed-off-by: Geetha sowjanya <[email protected]>
Signed-off-by: Sunil Kovvuri Goutham <[email protected]>
Signed-off-by: Sai Krishna <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: Read EEE abilities when using .features

A PHY driver can use a static integer value to indicate what link mode
features it supports, i.e, its abilities.. This is the old way, but
useful when dynamically determining the devices features does not
work, e.g. support of fibre.

EEE support has been moved into phydev->supported_eee. This needs to
be set otherwise the code assumes EEE is not supported. It is normally
set as part of reading the devices abilities. However if a static
integer value was used, the dynamic reading of the abilities is not
performed. Add a call to genphy_c45_read_eee_abilities() to read the
EEE abilities.

Fixes: 8b68710a3121 ("net: phy: start using genphy_c45_ethtool_get/set_eee()")
Signed-off-by: Andrew Lunn <[email protected]>
Reviewed-by: Oleksij Rempel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'phydev-locks'

Andrew Lunn says:

====================
Add additional phydev locks

The phydev lock should be held when accessing members of phydev, or
calling into the driver. Some of the phy_ethtool_ functions are
missing locks. Add them. To avoid deadlock the marvell driver is
modified since it calls one of the functions which gain locks, which
would result in a deadlock.

The missing locks have not caused noticeable issues, so these patches
are for net-next.
====================

Signed-off-by: David S. Miller <[email protected]>

net: phy: Add locks to ethtool functions

The phydev lock should be held while accessing members of phydev,
or calling into the driver.

Signed-off-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: marvell: Use the unlocked genphy_c45_ethtool_get_eee()

phy_ethtool_get_eee() is about to gain locking of the phydev lock.
This means it cannot be used within a PHY driver without causing a
deadlock. Swap to using genphy_c45_ethtool_get_eee() which assumes the
lock has already been taken.

Signed-off-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: bcmgenet: fix MoCA LED control

When the bcmgenet_mii_config() code was refactored it was missed
that the LED control for the MoCA interface got overwritten by
the port_ctrl value. Its previous programming is restored here.

Fixes: 4f8d81b77e66 ("net: bcmgenet: Refactor register access in bcmgenet_mii_config")
Signed-off-by: Doug Berger <[email protected]>
Acked-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

l2tp: Avoid possible recursive deadlock in l2tp_tunnel_register()

When a file descriptor of pppol2tp socket is passed as file descriptor
of UDP socket, a recursive deadlock occurs in l2tp_tunnel_register().
This situation is reproduced by the following program:

int main(void)
{
int sock;
struct sockaddr_pppol2tp addr;

sock = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP);
if (sock < 0) {
perror("socket");
return 1;
}

addr.sa_family = AF_PPPOX;
addr.sa_protocol = PX_PROTO_OL2TP;
addr.pppol2tp.pid = 0;
addr.pppol2tp.fd = sock;
addr.pppol2tp.addr.sin_family = PF_INET;
addr.pppol2tp.addr.sin_port = htons(0);
addr.pppol2tp.addr.sin_addr.s_addr = inet_addr("192.168.0.1");
addr.pppol2tp.s_tunnel = 1;
addr.pppol2tp.s_session = 0;
addr.pppol2tp.d_tunnel = 0;
addr.pppol2tp.d_session = 0;

if (connect(sock, (const struct sockaddr *)&addr, sizeof(addr)) < 0) {
perror("connect");
return 1;
}

return 0;
}

This program causes the following lockdep warning:

============================================
WARNING: possible recursive locking detected
6.2.0-rc5-00205-gc96618275234 #56 Not tainted
--------------------------------------------
repro/8607 is trying to acquire lock:
ffff8880213c8130 (sk_lock-AF_PPPOX){+.+.}-{0:0}, at: l2tp_tunnel_register+0x2b7/0x11c0

but task is already holding lock:
ffff8880213c8130 (sk_lock-AF_PPPOX){+.+.}-{0:0}, at: pppol2tp_connect+0xa82/0x1a30

other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(sk_lock-AF_PPPOX);
   lock(sk_lock-AF_PPPOX);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

1 lock held by repro/8607:
  #0: ffff8880213c8130 (sk_lock-AF_PPPOX){+.+.}-{0:0}, at: pppol2tp_connect+0xa82/0x1a30

stack backtrace:
CPU: 0 PID: 8607 Comm: repro Not tainted 6.2.0-rc5-00205-gc96618275234 #56
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
Call Trace:
  <TASK>
  dump_stack_lvl+0x100/0x178
  __lock_acquire.cold+0x119/0x3b9
  ? lockdep_hardirqs_on_prepare+0x410/0x410
  lock_acquire+0x1e0/0x610
  ? l2tp_tunnel_register+0x2b7/0x11c0
  ? lock_downgrade+0x710/0x710
  ? __fget_files+0x283/0x3e0
  lock_sock_nested+0x3a/0xf0
  ? l2tp_tunnel_register+0x2b7/0x11c0
  l2tp_tunnel_register+0x2b7/0x11c0
  ? sprintf+0xc4/0x100
  ? l2tp_tunnel_del_work+0x6b0/0x6b0
  ? debug_object_deactivate+0x320/0x320
  ? lockdep_init_map_type+0x16d/0x7a0
  ? lockdep_init_map_type+0x16d/0x7a0
  ? l2tp_tunnel_create+0x2bf/0x4b0
  ? l2tp_tunnel_create+0x3c6/0x4b0
  pppol2tp_connect+0x14e1/0x1a30
  ? pppol2tp_put_sk+0xd0/0xd0
  ? aa_sk_perm+0x2b7/0xa80
  ? aa_af_perm+0x260/0x260
  ? bpf_lsm_socket_connect+0x9/0x10
  ? pppol2tp_put_sk+0xd0/0xd0
  __sys_connect_file+0x14f/0x190
  __sys_connect+0x133/0x160
  ? __sys_connect_file+0x190/0x190
  ? lockdep_hardirqs_on+0x7d/0x100
  ? ktime_get_coarse_real_ts64+0x1b7/0x200
  ? ktime_get_coarse_real_ts64+0x147/0x200
  ? __audit_syscall_entry+0x396/0x500
  __x64_sys_connect+0x72/0xb0
  do_syscall_64+0x38/0xb0
  entry_SYSCALL_64_after_hwframe+0x63/0xcd

This patch fixes the issue by getting/creating the tunnel before
locking the pppol2tp socket.

Fixes: 0b2c59720e65 ("l2tp: close all race conditions in l2tp_tunnel_register()")
Cc: Cong Wang <[email protected]>
Signed-off-by: Shigeru Yoshida <[email protected]>
Reviewed-by: Guillaume Nault <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'icmp6-drop-reason'

Eric Dumazet says:

====================
ipv6: icmp6: better drop reason support

This series aims to have more precise drop reason reports for icmp6.

This should reduce false positives on most usual cases.

This can be extended as needed later.
====================

Signed-off-by: David S. Miller <[email protected]>

ipv6: icmp6: add drop reason support to icmpv6_echo_reply()

Change icmpv6_echo_reply() to return a drop reason.

For the moment, return NOT_SPECIFIED or SKB_CONSUMED.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv6: icmp6: add SKB_DROP_REASON_IPV6_NDISC_NS_OTHERHOST

Hosts can often receive neighbour discovery messages
that are not for them.

Use a dedicated drop reason to make clear the packet is dropped
for this normal case.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv6: icmp6: add SKB_DROP_REASON_IPV6_NDISC_BAD_OPTIONS

This is a generic drop reason for any error detected
in ndisc_parse_options().

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv6: icmp6: add drop reason support to ndisc_redirect_rcv()

Change ndisc_redirect_rcv() to return a drop reason.

For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED
and values from icmpv6_notify().

More reasons are added later.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv6: icmp6: add drop reason support to ndisc_router_discovery()

Change ndisc_router_discovery() to return a drop reason.

For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED
and SKB_CONSUMED.

More reasons are added later.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv6: icmp6: add drop reason support to ndisc_recv_rs()

Change ndisc_recv_rs() to return a drop reason.

For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED
or SKB_CONSUMED. More reasons are added later.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv6: icmp6: add drop reason support to ndisc_recv_na()

Change ndisc_recv_na() to return a drop reason.

For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED
or SKB_CONSUMED. More reasons are added later.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv6: icmp6: add drop reason support to ndisc_recv_ns()

Change ndisc_recv_ns() to return a drop reason.

For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED
or SKB_CONSUMED.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: add location to trace_consume_skb()

kfree_skb() includes the location, it makes sense
to add it to consume_skb() as well.

After patch:

taskd_EventMana  8602 [004]   420.406239: skb:consume_skb: skbaddr=0xffff893a4a6d0500 location=unix_stream_read_generic
         swapper     0 [011]   422.732607: skb:consume_skb: skbaddr=0xffff89597f68cee0 location=mlx4_en_free_tx_desc
      discipline  9141 [043]   423.065653: skb:consume_skb: skbaddr=0xffff893a487e9c00 location=skb_consume_udp
         swapper     0 [010]   423.073166: skb:consume_skb: skbaddr=0xffff8949ce9cdb00 location=icmpv6_rcv
         borglet  8672 [014]   425.628256: skb:consume_skb: skbaddr=0xffff8949c42e9400 location=netlink_dump
         swapper     0 [028]   426.263317: skb:consume_skb: skbaddr=0xffff893b1589dce0 location=net_rx_action
            wget 14339 [009]   426.686380: skb:consume_skb: skbaddr=0xffff893a51b552e0 location=tcp_rcv_state_process

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

selftests/net: Interpret UDP_GRO cmsg data as an int value

Data passed to user-space with a (SOL_UDP, UDP_GRO) cmsg carries an
int (see udp_cmsg_recv), not a u16 value, as strace confirms:

  recvmsg(8, {msg_name=...,
              msg_iov=[{iov_base="\0\0..."..., iov_len=96000}],
              msg_iovlen=1,
              msg_control=[{cmsg_len=20,         <-- sizeof(cmsghdr) + 4
                            cmsg_level=SOL_UDP,
                            cmsg_type=0x68}],    <-- UDP_GRO
                            msg_controllen=24,
                            msg_flags=0}, 0) = 11200

Interpreting the data as an u16 value won't work on big-endian platforms.
Since it is too late to back out of this API decision [1], fix the test.

[1]: https://lore.kernel.org/netdev/20230131174601 [email protected]/

Fixes: 3327a9c46352 ("selftests: add functionals test for UDP GRO")
Suggested-by: Eric Dumazet <[email protected]>
Signed-off-by: Jakub Sitnicki <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qede: fix interrupt coalescing configuration

On default driver load device gets configured with unexpected
higher interrupt coalescing values instead of default expected
values as memory allocated from krealloc() is not supposed to
be zeroed out and may contain garbage values.

Fix this by allocating the memory of required size first with
kcalloc() and then use krealloc() to resize and preserve the
contents across down/up of the interface.

Signed-off-by: Manish Chopra <[email protected]>
Fixes: b0ec5489c480 ("qede: preserve per queue stats across up/down of interface")
Cc: [email protected]
Cc: Bhaskar Upadhaya <[email protected]>
Cc: David S. Miller <[email protected]>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2160054
Signed-off-by: Alok Prasad <[email protected]>
Signed-off-by: Ariel Elior <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

xsk: support use vaddr as ring

When we try to start AF_XDP on some machines with long running time, due
to the machine's memory fragmentation problem, there is no sufficient
contiguous physical memory that will cause the start failure.

If the size of the queue is 8 * 1024, then the size of the desc[] is
8 * 1024 * 8 = 16 * PAGE, but we also add struct xdp_ring size, so it is
16page+. This is necessary to apply for a 4-order memory. If there are a
lot of queues, it is difficult to these machine with long running time.

Here, that we actually waste 15 pages. 4-Order memory is 32 pages, but
we only use 17 pages.

This patch replaces __get_free_pages() by vmalloc() to allocate memory
to solve these problems.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Magnus Karlsson <[email protected]>
Reviewed-by: Alexander Lobakin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/smc: fix application data exception

There is a certain probability that following
exceptions will occur in the wrk benchmark test:

Running 10s test @ http://11.213.45.6:80
  8 threads and 64 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.72ms   13.94ms 245.33ms   94.17%
    Req/Sec     1.96k   713.67     5.41k    75.16%
  155262 requests in 10.10s, 23.10MB read
Non-2xx or 3xx responses: 3

We will find that the error is HTTP 400 error, which is a serious
exception in our test, which means the application data was
corrupted.

Consider the following scenarios:

CPU0                            CPU1

buf_desc->used = 0;
                                cmpxchg(buf_desc->used, 0, 1)
                                deal_with(buf_desc)

memset(buf_desc->cpu_addr,0);

This will cause the data received by a victim connection to be cleared,
thus triggering an HTTP 400 error in the server.

This patch exchange the order between clear used and memset, add
barrier to ensure memory consistency.

Fixes: 1c5526968e27 ("net/smc: Clear memory when release and reuse buffer")
Signed-off-by: D. Wythe <[email protected]>
Reviewed-by: Wenjia Zhang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/smc: fix potential panic dues to unprotected smc_llc_srv_add_link()

There is a certain chance to trigger the following panic:

PID: 5900   TASK: ffff88c1c8af4100  CPU: 1   COMMAND: "kworker/1:48"
#0 [ffff9456c1cc79a0] machine_kexec at ffffffff870665b7
#1 [ffff9456c1cc79f0] __crash_kexec at ffffffff871b4c7a
#2 [ffff9456c1cc7ab0] crash_kexec at ffffffff871b5b60
#3 [ffff9456c1cc7ac0] oops_end at ffffffff87026ce7
#4 [ffff9456c1cc7ae0] page_fault_oops at ffffffff87075715
#5 [ffff9456c1cc7b58] exc_page_fault at ffffffff87ad0654
#6 [ffff9456c1cc7b80] asm_exc_page_fault at ffffffff87c00b62
    [exception RIP: ib_alloc_mr+19]
    RIP: ffffffffc0c9cce3  RSP: ffff9456c1cc7c38  RFLAGS: 00010202
    RAX: 0000000000000000  RBX: 0000000000000002  RCX: 0000000000000004
    RDX: 0000000000000010  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ffff88c1ea281d00   R8: 000000020a34ffff   R9: ffff88c1350bbb20
    R10: 0000000000000000  R11: 0000000000000001  R12: 0000000000000000
    R13: 0000000000000010  R14: ffff88c1ab040a50  R15: ffff88c1ea281d00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#7 [ffff9456c1cc7c60] smc_ib_get_memory_region at ffffffffc0aff6df [smc]
#8 [ffff9456c1cc7c88] smcr_buf_map_link at ffffffffc0b0278c [smc]
#9 [ffff9456c1cc7ce0] __smc_buf_create at ffffffffc0b03586 [smc]

The reason here is that when the server tries to create a second link,
smc_llc_srv_add_link() has no protection and may add a new link to
link group. This breaks the security environment protected by
llc_conf_mutex.

Fixes: 2d2209f20189 ("net/smc: first part of add link processing as SMC server")
Signed-off-by: D. Wythe <[email protected]>
Reviewed-by: Larysa Zaremba <[email protected]>
Reviewed-by: Wenjia Zhang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'taprio-queuemaxsdu-fixes'

Vladimir Oltean says:

====================
taprio queueMaxSDU fixes

This fixes 3 issues noticed while attempting to reoffload the
dynamically calculated queueMaxSDU values. These are:
- Dynamic queueMaxSDU is not calculated correctly due to a lost patch
- Dynamically calculated queueMaxSDU needs to be clamped on the low end
- Dynamically calculated queueMaxSDU needs to be clamped on the high end
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

net/sched: taprio: dynamic max_sdu larger than the max_mtu is unlimited

It makes no sense to keep randomly large max_sdu values, especially if
larger than the device's max_mtu. These are visible in "tc qdisc show".
Such a max_sdu is practically unlimited and will cause no packets for
that traffic class to be dropped on enqueue.

Just set max_sdu_dynamic to U32_MAX, which in the logic below causes
taprio to save a max_frm_len of U32_MAX and a max_sdu presented to user
space of 0 (unlimited).

Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Kurt Kanzenbach <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net/sched: taprio: don't allow dynamic max_sdu to go negative after stab adjustment

The overhead specified in the size table comes from the user. With small
time intervals (or gates always closed), the overhead can be larger than
the max interval for that traffic class, and their difference is
negative.

What we want to happen is for max_sdu_dynamic to have the smallest
non-zero value possible (1) which means that all packets on that traffic
class are dropped on enqueue. However, since max_sdu_dynamic is u32, a
negative is represented as a large value and oversized dropping never
happens.

Use max_t with int to force a truncation of max_frm_len to no smaller
than dev->hard_header_len + 1, which in turn makes max_sdu_dynamic no
smaller than 1.

Fixes: fed87cc6718a ("net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations")
Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Kurt Kanzenbach <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net/sched: taprio: fix calculation of maximum gate durations

taprio_calculate_gate_durations() depends on netdev_get_num_tc() and
this returns 0. So it calculates the maximum gate durations for no
traffic class.

I had tested the blamed commit only with another patch in my tree, one
which in the end I decided isn't valuable enough to submit ("net/sched:
taprio: mask off bits in gate mask that exceed number of TCs").

The problem is that having this patch threw off my testing. By moving
the netdev_set_num_tc() call earlier, we implicitly gave to
taprio_calculate_gate_durations() the information it needed.

Extract only the portion from the unsubmitted change which applies the
mqprio configuration to the netdev earlier.

Link: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
Fixes: a306a90c8ffe ("net/sched: taprio: calculate tc gate durations")
Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Kurt Kanzenbach <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

rxrpc: Fix overproduction of wakeups to recvmsg()

Fix three cases of overproduction of wakeups:

(1) rxrpc_input_split_jumbo() conditionally notifies the app that there's
     data for recvmsg() to collect if it queues some data - and then its
     only caller, rxrpc_input_data(), goes and wakes up recvmsg() anyway.

     Fix the rxrpc_input_data() to only do the wakeup in failure cases.

(2) If a DATA packet is received for a call by the I/O thread whilst
     recvmsg() is busy draining the call's rx queue in the app thread, the
     call will left on the recvmsg() queue for recvmsg() to pick up, even
     though there isn't any data on it.

     This can cause an unexpected recvmsg() with a 0 return and no MSG_EOR
     set after the reply has been posted to a service call.

     Fix this by discarding pending calls from the recvmsg() queue that
     don't need servicing yet.

(3) Not-yet-completed calls get requeued after having data read from them,
     even if they have no data to read.

     Fix this by only requeuing them if they have data waiting on them; if
     they don't, the I/O thread will requeue them when data arrives or they
     fail.

Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

Merge branch 'net-final-gsi-register-updates'

Alex Elder says:

====================
net: final GSI register updates

I believe this is the last set of changes required to allow IPA v5.0
to be supported.  There is a little cleanup work remaining, but that
can happen in the next Linux release cycle.  Otherwise we just need
config data and register definitions for IPA v5.0 (and DTS updates).
These are ready but won't be posted without further testing.

The first patch in this series fixes a minor bug in a patch just
posted, which I found too late.  The second eliminates the GSI
memory "adjustment"; this was done previously to avoid/delay the
need to implement a more general way to define GSI register offsets.
Note that this patch causes "checkpatch" warnings due to indentation
that aligns with an open parenthesis.

The third patch makes use of the newly-defined register offsets, to
eliminate the need for a function that hid a few details.  The next
modifies a different helper function to work properly for IPA v5.0+.
The fifth patch changes the way the event ring size is specified
based on how it's now done for IPA v5.0+.  And the last defines a
new register required for IPA v5.0+.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

net: ipa: add HW_PARAM_4 GSI register

Starting at IPA v5.0, the number of event rings per EE is defined
in a field in a new HW_PARAM_4 GSI register rather than HW_PARAM_2.
Define this new register and its fields, and update the code that
checks the number of rings supported by hardware to use the proper
field based on IPA version.

Signed-off-by: Alex Elder <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: ipa: support different event ring encoding

Starting with IPA v5.0, a channel's event ring index is encoded in
a field in the CH_C_CNTXT_1 GSI register rather than CH_C_CNTXT_0.
Define a new field ID for the former register and encode the event
ring in the appropriate register.

Signed-off-by: Alex Elder <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: ipa: avoid setting an undefined field

The GSI channel protocol field in the CH_C_CNTXT_0 GSI register is
widened starting IPA v5.0, making the CHTYPE_PROTOCOL_MSB field
added in IPA v4.5 unnecessary. Update the code to reflect this.

Signed-off-by: Alex Elder <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: ipa: kill ev_ch_e_cntxt_1_length_encode()

Now that we explicitly define each register field width there is no
need to have a special encoding function for the event ring length.
Add a field for this to the EV_CH_E_CNTXT_1 GSI register, and use it
in place of ev_ch_e_cntxt_1_length_encode() (which can be removed).

Signed-off-by: Alex Elder <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: ipa: kill gsi->virt_raw

Starting at IPA v4.5, almost all GSI registers had their offsets
changed by a fixed amount (shifted downward by 0xd000).  Rather than
defining offsets for all those registers dependent on version, an
adjustment was applied for most register accesses.  This was
implemented in commit cdeee49f3ef7f ("net: ipa: adjust GSI register
addresses").  It was later modified to be a bit more obvious about
the adjusment, in commit 571b1e7e58ad3 ("net: ipa: use a separate
pointer for adjusted GSI memory").

We now are able to define every GSI register with its own offset, so
there's no need to implement this special adjustment.

So get rid of the "virt_raw" pointer, and just maintain "virt" as
the (non-adjusted) base address of I/O mapped GSI register memory.

Redefine the offsets of all GSI registers (other than the INTER_EE
ones, which were not subject to the adjustment) for IPA v4.5+,
subtracting 0xd000 from their defined offsets instead.

Move the ERROR_LOG and ERROR_LOG_CLR definitions further down in the
register definition files so all registers are defined in order of
their offset.

Signed-off-by: Alex Elder <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: ipa: fix an incorrect assignment

I spotted an error in a patch posted this week, unfortunately just
after it got accepted.  The effect of the bug is that time-based
interrupt moderation is disabled.  This is not technically a bug,
but it is not what is intended.  The problem is that a |= assignment
got implemented as a simple assignment, so the previously assigned
value was ignored.

Fixes: edc6158b18af ("net: ipa: define fields for event-ring related registers")
Signed-off-by: Alex Elder <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: dpaa2-eth: do not always set xsk support in xdp_features flag

Do not always add NETDEV_XDP_ACT_XSK_ZEROCOPY bit in xdp_features flag
but check if the NIC really supports it.

Signed-off-by: Lorenzo Bianconi <[email protected]>
Reviewed-by: Larysa Zaremba <[email protected]>
Link: https://lore.kernel.org/r/3dba6ea42dc343a9f2d7d1a6a6a6c173235e1ebf.1676471386.git.lorenzo@kernel.org
Signed-off-by: Paolo Abeni <[email protected]>

Linux 6.2

Merge tag 'x86-urgent-2023-02-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fix from Thomas Gleixner:
"A single fix for x86.

  Revert the recent change to the MTRR code which aimed to support
  SEV-SNP guests on Hyper-V. It caused a regression on XEN Dom0 kernels.

  The underlying issue of MTTR (mis)handling in the x86 code needs some
  deeper investigation and is definitely not 6.2 material"

* tag 'x86-urgent-2023-02-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mtrr: Revert 90b926e68f50 ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case")

Merge tag 'timers-urgent-2023-02-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Thomas Gleixner:
"A fix for a long standing issue in the alarmtimer code.

  Posix-timers armed with a short interval with an ignored signal result
  in an unpriviledged DoS. Due to the ignored signal the timer switches
  into self rearm mode. This issue had been "fixed" before but a rework
  of the alarmtimer code 5 years ago lost that workaround.

  There is no real good solution for this issue, which is also worked
  around in the core posix-timer code in the same way, but it certainly
  moved way up on the ever growing todo list"

* tag 'timers-urgent-2023-02-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  alarmtimer: Prevent starvation by small intervals and SIG_IGN

Merge tag 'irq-urgent-2023-02-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fix from Thomas Gleixner:
"A single build fix for the PCI/MSI infrastructure.

  The addition of the new alloc/free interfaces in this cycle forgot to
  add stub functions for pci_msix_alloc_irq_at() and pci_msix_free_irq()
  for the CONFIG_PCI_MSI=n case"

* tag 'irq-urgent-2023-02-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  PCI/MSI: Provide missing stubs for CONFIG_PCI_MSI=n

Merge tag 'irqchip-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core

Pull irqchip updates from Marc Zyngier:

   - New and improved irqdomain locking, closing a number of races that
     became apparent now that we are able to probe drivers in parallel

   - A bunch of OF node refcounting bugs have been fixed

   - We now have a new IPI mux, lifted from the Apple AIC code and
     made common. It is expected that riscv will eventually benefit
     from it

   - Two small fixes for the Broadcom L2 drivers

   - Various cleanups and minor bug fixes

Link: https://lore.kernel.org/r/[email protected]

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm/x86 fixes from Paolo Bonzini:

- zero all padding for KVM_GET_DEBUGREGS

- fix rST warning

- disable vPMU support on hybrid CPUs

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  kvm: initialize all of the kvm_debugregs structure before sending it to userspace
  perf/x86: Refuse to export capabilities for hybrid PMUs
  KVM: x86/pmu: Disable vPMU support on hybrid CPUs (host PMUs)
  Documentation/hw-vuln: Fix rST warning

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 regression fix from Will Deacon:
"Apologies for the _extremely_ late pull request here, but we had a
  'perf' (i.e. CPU PMU) regression on the Apple M1 reported on Wednesday
  [1] which was introduced by bd2756811766 ("perf: Rewrite core context
  handling") during the merge window.

  Mark and I looked into this and noticed an additional problem caused
  by the same patch, where the 'CHAIN' event (used to combine two
  adjacent 32-bit counters into a single 64-bit counter) was not being
  filtered correctly. Mark posted a series on Thursday [2] which
  addresses both of these regressions and I queued it the same day.

  The changes are small, self-contained and have been confirmed to fix
  the original regression.

  Summary:

   - Fix 'perf' regression for non-standard CPU PMU hardware (i.e. Apple
     M1)"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: perf: reject CHAIN events at creation time
  arm_pmu: fix event CPU filtering

Merge tag 'block-6.2-2023-02-17' of git://git.kernel.dk/linux

Pull block fix from Jens Axboe:
"I guess this is what can happen when you prep things early for going
  away, something else comes in last minute. This one fixes another
  regression in 6.2 for NVMe, from this release, and hence we should
  probably get it submitted for 6.2.

  Still waiting for the original reporter (see bugzilla linked in the
  commit) to test this, but Keith managed to setup and recreate the
  issue and tested the patch that way"

* tag 'block-6.2-2023-02-17' of git://git.kernel.dk/linux:
  nvme-pci: refresh visible attrs for cmb attributes

xen: sysfs: make kobj_type structure constant

Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.

Take advantage of this to constify the structure definition to prevent
modification at runtime.

Signed-off-by: Thomas Weißschuh <[email protected]>
Reviewed-by: Juergen Gross <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Juergen Gross <[email protected]>

ieee802154: Drop device trackers

In order to prevent a device from disappearing when a background job was
started, dev_hold() and dev_put() calls were made. During the
stabilization phase of the scan/beacon features, it was later decided
that removing the device while a background job was ongoing was a valid use
case, and we should instead stop the background job and then remove the
device, rather than prevent the device from being removed. This is what
is currently done, which means manually reference counting the device
during background jobs is no longer needed.

Fixes: ed3557c947e1 ("ieee802154: Add support for user scanning requests")
Fixes: 9bc114504b07 ("ieee802154: Add support for user beaconing requests")
Reported-by: Jakub Kicinski <[email protected]>
Signed-off-by: Miquel Raynal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Stefan Schmidt <[email protected]>

mac802154: Fix an always true condition

At this stage we simply do not care about the delayed work value,
because active scan is not yet supported, so we can blindly queue
another work once a beacon has been sent.

It fixes a smatch warning:
mac802154_beacon_worker() warn: always true condition
'(local->beacon_interval >= 0) => (0-u32max >= 0)'

Fixes: 3accf4762734 ("mac802154: Handle basic beaconing")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Miquel Raynal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Stefan Schmidt <[email protected]>

mac802154: Send beacons using the MLME Tx path

Using ieee802154_subif_start_xmit() to bypass the net queue when
sending beacons is broken because it does not acquire the
HARD_TX_LOCK(), hence not preventing datagram buffers to be smashed by
beacons upon contention situation. Using the mlme_tx helper is not the
best fit either but at least it is not buggy and has little-to-no
performance hit. More details are given in the comment explaining this
choice in the code.

Fixes: 3accf4762734 ("mac802154: Handle basic beaconing")
Reported-by: Alexander Aring <[email protected]>
Signed-off-by: Miquel Raynal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Stefan Schmidt <[email protected]>

ieee802154: Change error code on monitor scan netlink request

Returning EPERM gives the impression that "right now" it is not
possible, but "later" it could be, while what we want to express is the
fact that this is not currently supported at all (might change in the
future). So let's return EOPNOTSUPP instead.

Fixes: ed3557c947e1 ("ieee802154: Add support for user scanning requests")
Suggested-by: Alexander Aring <[email protected]>
Signed-off-by: Miquel Raynal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Stefan Schmidt <[email protected]>

ieee802154: Convert scan error messages to extack

Instead of printing error messages in the kernel log, let's use extack.
When there is a netlink error returned that could be further specified
with a string, use extack as well.

Apply this logic to the very recent scan/beacon infrastructure.

Fixes: ed3557c947e1 ("ieee802154: Add support for user scanning requests")
Fixes: 9bc114504b07 ("ieee802154: Add support for user beaconing requests")
Suggested-by: Jakub Kicinski <[email protected]>
Signed-off-by: Miquel Raynal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Stefan Schmidt <[email protected]>

ieee802154: Use netlink policies when relevant on scan parameters

Instead of open-coding scan parameters (page, channels, duration, etc),
let's use the existing NLA_POLICY* macros. This help greatly reducing
the error handling and clarifying the overall logic.

Fixes: ed3557c947e1 ("ieee802154: Add support for user scanning requests")
Suggested-by: Jakub Kicinski <[email protected]>
Signed-off-by: Miquel Raynal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Stefan Schmidt <[email protected]>

Merge branch irq/bcm-l2-fixes into irq/irqchip-next

* irq/bcm-l2-fixes:
  : .
  : Broadcom L2 irqchip fixes for correct handling of level interrupts,
  : courtesy of Florian Fainelli.
  : .
  irqchip/irq-bcm7120-l2: Set IRQ_LEVEL for level triggered interrupts
  irqchip/irq-brcmstb-l2: Set IRQ_LEVEL for level triggered interrupts

Signed-off-by: Marc Zyngier <[email protected]>

irqchip/irq-bcm7120-l2: Set IRQ_LEVEL for level triggered interrupts

When support for the interrupt controller was added with a5042de2688d,
we forgot to update the flags to be set to contain IRQ_LEVEL. While the
flow handler is correct, the output from /proc/interrupts does not show
such interrupts as being level triggered when they are, correct that.

Fixes: a5042de2688d ("irqchip: bcm7120-l2: Add Broadcom BCM7120-style Level 2 interrupt controller")
Signed-off-by: Florian Fainelli <[email protected]>
Reviewed-by: Philippe Mathieu-Daudé <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

irqchip/irq-brcmstb-l2: Set IRQ_LEVEL for level triggered interrupts

When support for the level triggered interrupt controller flavor was
added with c0ca7262088e, we forgot to update the flags to be set to
contain IRQ_LEVEL. While the flow handler is correct, the output from
/proc/interrupts does not show such interrupts as being level triggered
when they are, correct that.

Fixes: c0ca7262088e ("irqchip/brcmstb-l2: Add support for the BCM7271 L2 controller")
Signed-off-by: Florian Fainelli <[email protected]>
Reviewed-by: Philippe Mathieu-Daudé <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

platform/x86: nvidia-wmi-ec-backlight: Add force module parameter

On some Lenovo Legion models, the backlight might be driven by either
one of nvidia_wmi_ec_backlight or amdgpu_bl0 at different times.

When the Nvidia WMI EC backlight interface reports the backlight is
controlled by the EC, the current backlight handling only registers
nvidia_wmi_ec_backlight (and registers no other backlight interfaces).

This hides (never registers) the amdgpu_bl0 interface, where as prior
to 6.1.4 users would have both nvidia_wmi_ec_backlight and amdgpu_bl0
and could work around things in userspace.

Add a force module parameter which can be used with acpi_backlight=native
to restore the old behavior as a workound (for now) by passing:

"acpi_backlight=native nvidia-wmi-ec-backlight.force=1"

Fixes: 8d0ca287fd8c ("platform/x86: nvidia-wmi-ec-backlight: Use acpi_video_get_backlight_type()")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217026
Cc: [email protected]
Signed-off-by: Hans de Goede <[email protected]>
Reviewed-by: Daniel Dadap <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

net/mlx5e: RX, Remove doubtful unlikely call

When building an skb in non-linear mode, it is not likely nor unlikely
that the xdp buff has fragments, it depends on the size of the packet
received.

Signed-off-by: Gal Pressman <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Maciej Fijalkowski <[email protected]>

net/mlx5e: Fix outdated TLS comment

Comment is outdated since
commit 40379a0084c2 ("net/mlx5_fpga: Drop INNOVA TLS support").

Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Gal Pressman <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Remove unused function mlx5e_sq_xmit_simple

The last usage was removed as part of
commit 40379a0084c2 ("net/mlx5_fpga: Drop INNOVA TLS support").

Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Gal Pressman <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Allow offloading of ct 'new' match

Allow offloading filters that match on conntrack 'new' state in order to
enable UDP NEW offload in the following patch.

Unhardcode ct 'established' from ct modify header infrastructure code and
determine correct ct state bit according to the metadata action 'cookie'
field.

Signed-off-by: Vlad Buslov <[email protected]>
Reviewed-by: Oz Shlomo <[email protected]>
Reviewed-by: Paul Blakey <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Implement CT entry update

With support for UDP NEW offload the flow_table may now send updates for
existing flows. Support properly replacing existing entries by updating
flow restore_cookie and replacing the rule with new one with the same match
but new mod_hdr action that sets updated ctinfo.

Signed-off-by: Vlad Buslov <[email protected]>
Reviewed-by: Oz Shlomo <[email protected]>
Reviewed-by: Paul Blakey <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: Simplify eq list traversal

EQ list is read only while finding the matching EQ.
Hence, avoid *_safe() version.

Signed-off-by: Parav Pandit <[email protected]>
Reviewed-by: Shay Drory <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Maciej Fijalkowski <[email protected]>

net/mlx5e: Remove redundant page argument in mlx5e_xdp_handle()

Remove the page parameter, it can be derived from the xdp_buff member
of mlx5e_xdp_buff.

Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Dragos Tatulea <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Remove redundant page argument in mlx5e_xmit_xdp_buff()

Remove the page parameter, it can be derived from the xdp_buff.

Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Dragos Tatulea <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Switch to using napi_build_skb()

Use napi_build_skb() which uses NAPI percpu caches to obtain
skbuff_head instead of inplace allocation.

napi_build_skb() calls napi_skb_cache_get(), which returns a cached
skb, or allocates a bulk of NAPI_SKB_CACHE_BULK (16) if cache is empty.

Performance test:
TCP single stream, single ring, single core, default MTU (1500B).

Before: 26.5 Gbits/sec
After: 30.1 Gbits/sec (+13.6%)

Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Gal Pressman <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Maciej Fijalkowski <[email protected]>
Reviewed-by: Alexander Lobakin <[email protected]>

x86/Xen: drop leftover VM-assist uses

Both the 4Gb-segments and the PAE-extended-CR3 one are applicable to
32-bit guests only. The PAE-extended-CR3 use, furthermore, was redundant
with the PAE_MODE ELF note anyway.

Signed-off-by: Jan Beulich <[email protected]>
Reviewed-by: Juergen Gross <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Juergen Gross <[email protected]>

Merge tag 'mm-hotfixes-stable-2023-02-17-15-16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
"Six hotfixes. Five are cc:stable: four for MM, one for nilfs2.

  Also a MAINTAINERS update"

* tag 'mm-hotfixes-stable-2023-02-17-15-16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  nilfs2: fix underflow in second superblock position calculations
  hugetlb: check for undefined shift on 32 bit architectures
  mm/migrate: fix wrongly apply write bit after mkdirty on sparc64
  MAINTAINERS: update FPU EMULATOR web page
  mm/MADV_COLLAPSE: set EAGAIN on unexpected page refcount
  mm/filemap: fix page end in filemap_get_read_batch

nilfs2: fix underflow in second superblock position calculations

Macro NILFS_SB2_OFFSET_BYTES, which computes the position of the second
superblock, underflows when the argument device size is less than 4096
bytes.  Therefore, when using this macro, it is necessary to check in
advance that the device size is not less than a lower limit, or at least
that underflow does not occur.

The current nilfs2 implementation lacks this check, causing out-of-bound
block access when mounting devices smaller than 4096 bytes:

I/O error, dev loop0, sector 36028797018963960 op 0x0:(READ) flags 0x0
phys_seg 1 prio class 2
NILFS (loop0): unable to read secondary superblock (blocksize = 1024)

In addition, when trying to resize the filesystem to a size below 4096
bytes, this underflow occurs in nilfs_resize_fs(), passing a huge number
of segments to nilfs_sufile_resize(), corrupting parameters such as the
number of segments in superblocks.  This causes excessive loop iterations
in nilfs_sufile_resize() during a subsequent resize ioctl, causing
semaphore ns_segctor_sem to block for a long time and hang the writer
thread:

INFO: task segctord:5067 blocked for more than 143 seconds.
      Not tainted 6.2.0-rc8-syzkaller-00015-gf6feea56f66d #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:segctord        state:D stack:23456 pid:5067  ppid:2
flags:0x00004000
Call Trace:
  <TASK>
  context_switch kernel/sched/core.c:5293 [inline]
  __schedule+0x1409/0x43f0 kernel/sched/core.c:6606
  schedule+0xc3/0x190 kernel/sched/core.c:6682
  rwsem_down_write_slowpath+0xfcf/0x14a0 kernel/locking/rwsem.c:1190
  nilfs_transaction_lock+0x25c/0x4f0 fs/nilfs2/segment.c:357
  nilfs_segctor_thread_construct fs/nilfs2/segment.c:2486 [inline]
  nilfs_segctor_thread+0x52f/0x1140 fs/nilfs2/segment.c:2570
  kthread+0x270/0x300 kernel/kthread.c:376
  ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
  </TASK>
...
Call Trace:
  <TASK>
  folio_mark_accessed+0x51c/0xf00 mm/swap.c:515
  __nilfs_get_page_block fs/nilfs2/page.c:42 [inline]
  nilfs_grab_buffer+0x3d3/0x540 fs/nilfs2/page.c:61
  nilfs_mdt_submit_block+0xd7/0x8f0 fs/nilfs2/mdt.c:121
  nilfs_mdt_read_block+0xeb/0x430 fs/nilfs2/mdt.c:176
  nilfs_mdt_get_block+0x12d/0xbb0 fs/nilfs2/mdt.c:251
  nilfs_sufile_get_segment_usage_block fs/nilfs2/sufile.c:92 [inline]
  nilfs_sufile_truncate_range fs/nilfs2/sufile.c:679 [inline]
  nilfs_sufile_resize+0x7a3/0x12b0 fs/nilfs2/sufile.c:777
  nilfs_resize_fs+0x20c/0xed0 fs/nilfs2/super.c:422
  nilfs_ioctl_resize fs/nilfs2/ioctl.c:1033 [inline]
  nilfs_ioctl+0x137c/0x2440 fs/nilfs2/ioctl.c:1301
  ...

This fixes these issues by inserting appropriate minimum device size
checks or anti-underflow checks, depending on where the macro is used.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryusuke Konishi <[email protected]>
Reported-by: <[email protected]>
Tested-by: Ryusuke Konishi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

hugetlb: check for undefined shift on 32 bit architectures

Users can specify the hugetlb page size in the mmap, shmget and
memfd_create system calls.  This is done by using 6 bits within the flags
argument to encode the base-2 logarithm of the desired page size.  The
routine hstate_sizelog() uses the log2 value to find the corresponding
hugetlb hstate structure.  Converting the log2 value (page_size_log) to
potential hugetlb page size is the simple statement:

1UL << page_size_log

Because only 6 bits are used for page_size_log, the left shift can not be
greater than 63.  This is fine on 64 bit architectures where a long is 64
bits.  However, if a value greater than 31 is passed on a 32 bit
architecture (where long is 32 bits) the shift will result in undefined
behavior.  This was generally not an issue as the result of the undefined
shift had to exactly match hugetlb page size to proceed.

Recent improvements in runtime checking have resulted in this undefined
behavior throwing errors such as reported below.

Fix by comparing page_size_log to BITS_PER_LONG before doing shift.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/lkml/CA+G9fYuei_Tr-vN9GS7SfFyU1y9hNysnf=PB7kT0=yv4MiPgVg@mail.gmail.com/
Fixes: 42d7395feb56 ("mm: support more pagesizes for MAP_HUGETLB/SHM_HUGETLB")
Signed-off-by: Mike Kravetz <[email protected]>
Reported-by: Naresh Kamboju <[email protected]>
Reviewed-by: Jesper Juhl <[email protected]>
Acked-by: Muchun Song <[email protected]>
Tested-by: Linux Kernel Functional Testing <[email protected]>
Tested-by: Naresh Kamboju <[email protected]>
Cc: Anders Roxell <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/migrate: fix wrongly apply write bit after mkdirty on sparc64

Nick Bowler reported another sparc64 breakage after the young/dirty
persistent work for page migration (per "Link:" below). That's after a
similar report [2].

It turns out page migration was overlooked, and it wasn't failing before
because page migration was not enabled in the initial report test
environment.

David proposed another way [2] to fix this from sparc64 side, but that
patch didn't land somehow. Neither did I check whether there's any other
arch that has similar issues.

Let's fix it for now as simple as moving the write bit handling to be
after dirty, like what we did before.

Note: this is based on mm-unstable, because the breakage was since 6.1 and
we're at a very late stage of 6.2 (-rc8), so I assume for this specific
case we should target this at 6.3.

[1] https://lore.kernel.org/all/20221021160603 [email protected]/
[2] https://lore.kernel.org/all/20221212130213 [email protected]/

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 2e3468778dbe ("mm: remember young/dirty bit for page migrations")
Link: https://lore.kernel.org/all/CADyTPExpEqaJiMGoV+Z6xVgL50ZoMJg49B10LcZ=8eg19u34BA@mail.gmail.com/
Signed-off-by: Peter Xu <[email protected]>
Reported-by: Nick Bowler <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Tested-by: Nick Bowler <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Merge tag 'powerpc-6.2-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fix from Michael Ellerman:

- Prevent fallthrough to hash TLB flush when using radix

Thanks to Benjamin Gray and Erhard Furtner.

* tag 'powerpc-6.2-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Prevent fallthrough to hash TLB flush when using radix

Merge tag 'nfs-for-6.2-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fix from Trond Myklebust:
"Unfortunately, we found another bug in the NFSv4.2 READ_PLUS code.

  Since it has not been possible to fix the bug in time for the 6.2
  release, let's just revert the Kconfig change that enables it:

   - Revert 'NFSv4.2: Change the default KConfig value for READ_PLUS'"

* tag 'nfs-for-6.2-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  Revert "NFSv4.2: Change the default KConfig value for READ_PLUS"

Merge tag 'sound-fix-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"A few last-minute fixes. The significant ones are two ASoC SOF
  regression fixes while the rest are trivial HD-audio quirks.

  All are small / one-liners and should be pretty safe to take"

* tag 'sound-fix-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ASoC: SOF: Intel: hda-dai: fix possible stream_tag leak
  ALSA: hda/realtek: Enable mute/micmute LEDs and speaker support for HP Laptops
  ALSA: hda/realtek: fix mute/micmute LEDs don't work for a HP platform.
  ALSA: hda/realtek - fixed wrong gpio assigned
  ALSA: hda: Fix codec device field initializan
  ALSA: hda/conexant: add a new hda codec SN6180
  ASoC: SOF: ops: refine parameters order in function snd_sof_dsp_update8

Merge tag 'gpio-fixes-for-v6.2-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

Pull gpio fix from Bartosz Golaszewski:

- fix a memory leak in gpio-sim that was triggered every time libgpiod
tests are run in user-space

* tag 'gpio-fixes-for-v6.2-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: sim: fix a memory leak

Merge tag 'ata-6.2-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata

Pull ata fixes from Damien Le Moal:
"Three small fixes for 6.2 final:

   - Disable READ LOG DMA EXT for Samsung MZ7LH drives as these drives
     choke on that command, from Patrick.

   - Add Intel Tiger Lake UP{3,4} to the list of supported AHCI
     controllers (this is not technically a bug fix, but it is trivial
     enough that I add it here), from Simon.

   - Fix code comments in the pata_octeon_cf driver as incorrect
     formatting was causing warnings from kernel-doc, from Randy"

* tag 'ata-6.2-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
  ata: pata_octeon_cf: drop kernel-doc notation
  ata: ahci: Add Tiger Lake UP{3,4} AHCI controller
  ata: libata-core: Disable READ LOG DMA EXT for Samsung MZ7LH

Merge tag 'mmc-v6.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc

Pull MMC fixes from Ulf Hansson:
"MMC core:
   - Fix potential resource leaks in SDIO card detection error path

  MMC host:
   - jz4740: Decrease maximum clock rate to workaround bug on JZ4760(B)
   - meson-gx: Fix SDIO support to get some WiFi modules to work again
   - mmc_spi: Fix error handling in ->probe()"

* tag 'mmc-v6.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: jz4740: Work around bug on JZ4760(B)
  mmc: mmc_spi: fix error handling in mmc_spi_probe()
  mmc: sdio: fix possible resource leaks in some error paths
  mmc: meson-gx: fix SDIO mode if cap_sdio_irq isn't set

Merge tag 'sched-urgent-2023-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:

- Fix user-after-free bug in call_usermodehelper_exec()

- Fix missing user_cpus_ptr update in __set_cpus_allowed_ptr_locked()

- Fix PSI use-after-free bug in ep_remove_wait_queue()

* tag 'sched-urgent-2023-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/psi: Fix use-after-free in ep_remove_wait_queue()
  sched/core: Fix a missed update of user_cpus_ptr
  freezer,umh: Fix call_usermode_helper_exec() vs SIGKILL

selftests/bpf: Add bpf_fib_lookup test

This patch tests the bpf_fib_lookup helper when looking up
a neigh in NUD_FAILED and NUD_STALE state. It also adds test
for the new BPF_FIB_LOOKUP_SKIP_NEIGH flag.

Signed-off-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]