Git Repo - linux.git/log

Merge tag 'leds-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/leds

Pull LED updates from Lee Jones:
"Core Frameworks:
   - Add new framework to support Group Multi-Color (GMC) LEDs
   - Offer an 'optional' API for non-essential LEDs
   - Support obtaining 'max brightness' values from Device Tree
   - Provide new led_classdev member 'color' (settable via DT and SYFS)
   - Stop TTY Trigger from using the old LED_ON constraints
   - Statically allocate leds_class

  New Drivers:
   - Add support for NXP PCA995x I2C Constant Current LED Driver

  New Device Support:
   - Add support for Siemens Simatic IPC BX-21 to Simatic IPC

  Fix-ups:
   - Some dependency / Kconfig tweaking
   - Move final probe() functions back over from .probe_new()
   - Simplify obtaining resources (memory, device data) using unified
     API helpers
   - Bunch of Device Tree additions, conversions and adaptions
   - Fix trivial styling issues; comments
   - Ensure correct includes are present and remove some that are not
     required
   - Omit the use of redundant casts and if relevant replace with better
     ones
   - Use purpose-built APIs for various actions; sysfs_emit(),
     module_led_trigger()
   - Remove a bunch of superfluous locking

  Bug Fixes:
   - Ensure error codes are correctly propagated back up the call chain
   - Fix incorrect error values from being returned (missing '-')
   - Ensure get'ed resources are put'ed to prevent leaks
   - Use correct class when exporting module resources
   - Fixing rounding (or lack there of) issues
   - Fix 'always false' LED_COLOR_ID_MULTI BUG() check"

* tag 'leds-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/leds: (40 commits)
  leds: aw2013: Enable pull-up supply for interrupt and I2C
  dt-bindings: leds: Document pull-up supply for interrupt and I2C
  dt-bindings: leds: aw2013: Document interrupt
  leds: uleds: Use module_misc_device macro to simplify the code
  leds: trigger: netdev: Use module_led_trigger macro to simplify the code
  dt-bindings: leds: Fix reference to definition of default-state
  leds: turris-omnia: Drop unnecessary mutex locking
  leds: turris-omnia: Use sysfs_emit() instead of sprintf()
  leds: Make leds_class a static const structure
  leds: Remove redundant of_match_ptr()
  dt-bindings: leds: Add gpio-line-names to PCA9532 GPIO
  leds: trigger: tty: Do not use LED_ON/OFF constants, use led_blink_set_oneshot instead
  dt-bindings: leds: rohm,bd71828: Drop select:false
  leds: Fix BUG_ON check for LED_COLOR_ID_MULTI that is always false
  leds: multicolor: Use rounded division when calculating color components
  leds: rgb: Add a multicolor LED driver to group monochromatic LEDs
  dt-bindings: leds: Add binding for a multicolor group of LEDs
  leds: class: Store the color index in struct led_classdev
  leds: Provide devm_of_led_get_optional()
  leds: pca995x: Fix MODULE_DEVICE_TABLE for OF
  ...

Merge tag 'mfd-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd

Pull NFD updates from Lee Jones:
"New Drivers:
   - Add support for the Cirrus Logic CS42L43 Audio CODEC

  Fix-ups:
   - Make use of specific printk() format tags for various optimisations
   - Kconfig / module modifications / tweaking
   - Simplify obtaining resources (memory, device data) using unified
     API helpers
   - Bunch of Device Tree additions, conversions and adaptions
   - Convert a bunch of Regmap configurations to use the Maple Tree
     cache
   - Ensure correct includes are present and remove some that are not
     required
   - Remove superfluous code
   - Reduce amount of cycles spent in critical sections
   - Omit the use of redundant casts and if relevant replace with better
     ones
   - Swap out raw_spin_{un}lock_irq{save,restore}() for
     spin_{un}lock_irq{save,restore}()

  Bug Fixes:
   - Repair theoretical deadlock situation
   - Fix some link-time dependencies
   - Use more appropriate datatype when casting"

* tag 'mfd-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (70 commits)
  mfd: mc13xxx: Simplify device data fetching in probe()
  mfd: rz-mtu3: Replace raw_spin_lock->spin_lock()
  mfd: rz-mtu3: Reduce critical sections
  mfd: mxs-lradc: Fix Wvoid-pointer-to-enum-cast warning
  mfd: wm31x: Fix Wvoid-pointer-to-enum-cast warning
  mfd: wm8994: Fix Wvoid-pointer-to-enum-cast warning
  mfd: tc3589: Fix Wvoid-pointer-to-enum-cast warning
  mfd: lp87565: Fix Wvoid-pointer-to-enum-cast warning
  mfd: hi6421-pmic: Fix Wvoid-pointer-to-enum-cast warning
  mfd: max77541: Fix Wvoid-pointer-to-enum-cast warning
  mfd: max14577: Fix Wvoid-pointer-to-enum-cast warning
  mfd: stmpe: Fix Wvoid-pointer-to-enum-cast warning
  mfd: rn5t618: Remove redundant of_match_ptr()
  mfd: lochnagar-i2c: Remove redundant of_match_ptr()
  mfd: stpmic1: Remove redundant of_match_ptr()
  mfd: act8945a: Remove redundant of_match_ptr()
  mfd: rsmu_spi: Remove redundant of_match_ptr()
  mfd: altera-a10sr: Remove redundant of_match_ptr()
  mfd: rsmu_i2c: Remove redundant of_match_ptr()
  mfd: tc3589x: Remove redundant of_match_ptr()
  ...

Merge tag 'i2c-for-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c updates from Wolfram Sang:
"I2C has mainly cleanups this time and a few driver improvements.

  Because a lot of developers were on holidays (including myself) it was
  a good timing to apply lots of cleanups which would normally cause
  merge conflicts with other floating patches. Extra thanks go to Andi
  Shyti who backed me up when I was on a four week hiatus. This is also
  the reason that some patches were commited later than ideal"

* tag 'i2c-for-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (67 commits)
  i2c: at91: Use dev_err_probe() instead of dev_err()
  I2C: ali15x3: Do PCI error checks on own line
  i2c: Make return value check more accurate and explicit for devm_pinctrl_get()
  i2c: designware: Add support for recovery when GPIO need pinctrl
  i2c: mlxcpld: Add support for extended transaction length
  i2c: mlxcpld: Allow driver to run on ARM64 architecture
  i2c: nforce2: Do PCI error check on own line
  i2c: sis5595: Do PCI error checks on own line
  i2c: qcom-cci: Fix error checking in cci_probe()
  i2c: muxes: pca954x: Add regulator support
  i2c: muxes: pca954x: Add MAX735x/MAX736x support
  dt-bindings: i2c: Add Maxim MAX735x/MAX736x variants
  dt-bindings: i2c: pca954x: Correct interrupt support
  i2c: pnx: Use devm_platform_get_and_ioremap_resource()
  i2c: pxa: Use devm_platform_get_and_ioremap_resource()
  i2c: s3c2410: Use devm_platform_get_and_ioremap_resource()
  i2c: sh_mobile: Use devm_platform_get_and_ioremap_resource()
  i2c: st: Use devm_platform_get_and_ioremap_resource()
  i2c: qcom-geni: Convert to devm_platform_ioremap_resource()
  i2c: stm32f4: Use devm_platform_get_and_ioremap_resource()
  ...

Merge tag 'printk-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk updates from Petr Mladek:

- Do not try to get the console lock when it is not need or useful in
   panic()

- Replace the global console_suspended state by a per-console flag

- Export symbols needed for dumping the raw printk buffer in panic()

- Fix documentation of printf formats for integer types

- Moved Sergey Senozhatsky to the reviewer role

- Misc cleanups

* tag 'printk-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk: export symbols for debug modules
  lib: test_scanf: Add explicit type cast to result initialization in test_number_prefix()
  printk: ringbuffer: Fix truncating buffer size min_t cast
  printk: Rename abandon_console_lock_in_panic() to other_cpu_in_panic()
  printk: Add per-console suspended state
  printk: Consolidate console deferred printing
  printk: Do not take console lock for console_flush_on_panic()
  printk: Keep non-panic-CPUs out of console lock
  printk: Reduce console_unblank() usage in unsafe scenarios
  kdb: Do not assume write() callback available
  docs: printk-formats: Treat char as always unsigned
  docs: printk-formats: Fix hex printing of signed values
  MAINTAINERS: adjust printk/vsprintf entries

Merge tag 'timers-core-2023-09-04-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull clocksource/clockevent driver updates from Thomas Gleixner:

- Remove the OXNAS driver instead of adding a new one!

- A set of boring fixes, cleanups and improvements

* tag 'timers-core-2023-09-04-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource: Explicitly include correct DT includes
  clocksource/drivers/sun5i: Convert to platform device driver
  clocksource/drivers/sun5i: Remove pointless struct
  clocksource/drivers/sun5i: Remove duplication of code and data
  clocksource/drivers/loongson1: Set variable ls1x_timer_lock storage-class-specifier to static
  clocksource/drivers/arm_arch_timer: Disable timer before programming CVAL
  dt-bindings: timer: oxsemi,rps-timer: remove obsolete bindings
  clocksource/drivers/timer-oxnas-rps: Remove obsolete timer driver

tpm: Enable hwrng only for Pluton on AMD CPUs

The vendor check introduced by commit 554b841d4703 ("tpm: Disable RNG for
all AMD fTPMs") doesn't work properly on a number of Intel fTPMs. On the
reported systems the TPM doesn't reply at bootup and returns back the
command code. This makes the TPM fail probe on Lenovo Legion Y540 laptop.

Since only Microsoft Pluton is the only known combination of AMD CPU and
fTPM from other vendor, disable hwrng otherwise. In order to make sysadmin
aware of this, print also info message to the klog.

Cc: [email protected]
Fixes: 554b841d4703 ("tpm: Disable RNG for all AMD fTPMs")
Reported-by: Todd Brandt <[email protected]>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217804
Reported-by: Patrick Steinhardt <[email protected]>
Reported-by: Raymond Jay Golo <[email protected]>
Reported-by: Ronan Pigott <[email protected]>
Reviewed-by: Jerry Snitselaar <[email protected]>
Signed-off-by: Jarkko Sakkinen <[email protected]>

tpm_crb: Fix an error handling path in crb_acpi_add()

Some error paths don't call acpi_put_table() before returning.
Branch to the correct place instead of doing some direct return.

Fixes: 4d2732882703 ("tpm_crb: Add support for CRB devices based on Pluton")
Signed-off-by: Christophe JAILLET <[email protected]>
Acked-by: Matthew Garrett <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Signed-off-by: Jarkko Sakkinen <[email protected]>

Merge tag 'm68knommu-for-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu

Pull m68knommu updates from Greg Ungerer:
"Two changes, one a trivial white space clean up, the other removes the
  unnecessary local pcibios_setup() code"

* tag 'm68knommu-for-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
  m68k: coldfire: dma_timer: ERROR: "foo __init bar" should be "foo __init bar"
  m68k/pci: Drop useless pcibios_setup()

Merge tag 'uml-for-linus-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux

Pull UML updates from Richard Weinberger:

- Drop 32-bit checksum implementation and re-use it from arch/x86

- String function cleanup

- Fixes for -Wmissing-variable-declarations and -Wmissing-prototypes
   builds

* tag 'uml-for-linus-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux:
  um: virt-pci: fix missing declaration warning
  um: Refactor deprecated strncpy to memcpy
  um: fix 3 instances of -Wmissing-prototypes
  um: port_kern: fix -Wmissing-variable-declarations
  uml: audio: fix -Wmissing-variable-declarations
  um: vector: refactor deprecated strncpy
  um: use obj-y to descend into arch/um/*/
  um: Hard-code the result of 'uname -s'
  um: Use the x86 checksum implementation on 32-bit
  asm-generic: current: Don't include thread-info.h if building asm
  um: Remove unsued extern declaration ldt_host_info()
  um: Fix hostaudio build errors
  um: Remove strlcpy usage

Merge tag 'hyperv-next-signed-20230902' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv updates from Wei Liu:

- Support for SEV-SNP guests on Hyper-V (Tianyu Lan)

- Support for TDX guests on Hyper-V (Dexuan Cui)

- Use SBRM API in Hyper-V balloon driver (Mitchell Levy)

- Avoid dereferencing ACPI root object handle in VMBus driver (Maciej
   Szmigiero)

- A few misecllaneous fixes (Jiapeng Chong, Nathan Chancellor, Saurabh
   Sengar)

* tag 'hyperv-next-signed-20230902' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (24 commits)
  x86/hyperv: Remove duplicate include
  x86/hyperv: Move the code in ivm.c around to avoid unnecessary ifdef's
  x86/hyperv: Remove hv_isolation_type_en_snp
  x86/hyperv: Use TDX GHCI to access some MSRs in a TDX VM with the paravisor
  Drivers: hv: vmbus: Bring the post_msg_page back for TDX VMs with the paravisor
  x86/hyperv: Introduce a global variable hyperv_paravisor_present
  Drivers: hv: vmbus: Support >64 VPs for a fully enlightened TDX/SNP VM
  x86/hyperv: Fix serial console interrupts for fully enlightened TDX guests
  Drivers: hv: vmbus: Support fully enlightened TDX guests
  x86/hyperv: Support hypercalls for fully enlightened TDX guests
  x86/hyperv: Add hv_isolation_type_tdx() to detect TDX guests
  x86/hyperv: Fix undefined reference to isolation_type_en_snp without CONFIG_HYPERV
  x86/hyperv: Add missing 'inline' to hv_snp_boot_ap() stub
  hv: hyperv.h: Replace one-element array with flexible-array member
  Drivers: hv: vmbus: Don't dereference ACPI root object handle
  x86/hyperv: Add hyperv-specific handling for VMMCALL under SEV-ES
  x86/hyperv: Add smp support for SEV-SNP guest
  clocksource: hyper-v: Mark hyperv tsc page unencrypted in sev-snp enlightened guest
  x86/hyperv: Use vmmcall to implement Hyper-V hypercall in sev-snp enlightened guest
  drivers: hv: Mark percpu hvcall input arg page unencrypted in SEV-SNP enlightened guest
  ...

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:
"A small pull request this time around, mostly because the vduse
  network got postponed to next relase so we can be sure we got the
  security store right"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  virtio_ring: fix avail_wrap_counter in virtqueue_add_packed
  virtio_vdpa: build affinity masks conditionally
  virtio_net: merge dma operations when filling mergeable buffers
  virtio_ring: introduce dma sync api for virtqueue
  virtio_ring: introduce dma map api for virtqueue
  virtio_ring: introduce virtqueue_reset()
  virtio_ring: separate the logic of reset/enable from virtqueue_resize
  virtio_ring: correct the expression of the description of virtqueue_resize()
  virtio_ring: skip unmap for premapped
  virtio_ring: introduce virtqueue_dma_dev()
  virtio_ring: support add premapped buf
  virtio_ring: introduce virtqueue_set_dma_premapped()
  virtio_ring: put mapping error check in vring_map_one_sg
  virtio_ring: check use_dma_api before unmap desc for indirect
  vdpa_sim: offer VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK
  vdpa: add get_backend_features vdpa operation
  vdpa: accept VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK backend feature
  vdpa: add VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK flag
  vdpa/mlx5: Remove unused function declarations

Merge tag 'tomoyo-pr-20230903' of git://git.osdn.net/gitroot/tomoyo/tomoyo-test1

Pull tomoyo updates from Tetsuo Handa:
"Three cleanup patches, no behavior changes"

* tag 'tomoyo-pr-20230903' of git://git.osdn.net/gitroot/tomoyo/tomoyo-test1:
  tomoyo: remove unused function declaration
  tomoyo: refactor deprecated strncpy
  tomoyo: add format attributes to functions

Merge branch 'pm-cpufreq'

Merge additional cpufreq updates for 6.6-rc1:

- Add support for per-policy performance boost (Jie Zhan).

- Fix assorted issues in the cpufreq core, common governor code and in
   the pcc cpufreq driver (Liao Chang).

* pm-cpufreq:
  cpufreq: Support per-policy performance boost
  cpufreq: pcc: Fix the potentinal scheduling delays in target_index()
  cpufreq: governor: Free dbs_data directly when gov->init() fails
  cpufreq: Fix the race condition while updating the transition_task of policy
  cpufreq: Avoid printing kernel addresses in cpufreq_resume()

ALSA: hda/cirrus: Fix broken audio on hardware with two CS42L42 codecs.

Recently in v6.3-rc1 there was a change affecting behaviour of hrtimers
(commit 0c52310f260014d95c1310364379772cb74cf82d) and causing
few issues on platforms with two CS42L42 codecs. Canonical/Dell
has reported an issue with Vostro-3910.
We need to increase this value by 15ms.

Link: https://bugs.launchpad.net/somerville/+bug/2031060
Fixes: 9fb9fa18fb50 ("ALSA: hda/cirrus: Add extra 10 ms delay to allow PLL settle and lock.")
Signed-off-by: Vitaly Rodionov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>

x86/smp: Don't send INIT to non-present and non-booted CPUs

Vasant reported that kexec() can hang or reset the machine when it tries to
park CPUs via INIT. This happens when the kernel is using extended APIC,
but the present mask has APIC IDs >= 0x100 enumerated.

As extended APIC can only handle 8 bit of APIC ID sending INIT to APIC ID
0x100 sends INIT to APIC ID 0x0. That's the boot CPU which is special on
x86 and INIT causes the system to hang or resets the machine.

Prevent this by sending INIT only to those CPUs which have been booted
once.

Fixes: 45e34c8af58f ("x86/smp: Put CPUs into INIT on shutdown if possible")
Reported-by: Dheeraj Kumar Srivastava <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Vasant Hegde <[email protected]>
Link: https://lore.kernel.org/r/87cyzwjbff.ffs@tglx

spi: sun6i: fix race between DMA RX transfer completion and RX FIFO drain

Previously the transfer complete IRQ immediately drained to RX FIFO to
read any data remaining in FIFO to the RX buffer. This behaviour is
correct when dealing with SPI in interrupt mode. However in DMA mode the
transfer complete interrupt still fires as soon as all bytes to be
transferred have been stored in the FIFO. At that point data in the FIFO
still needs to be picked up by the DMA engine. Thus the drain procedure
and DMA engine end up racing to read from RX FIFO, corrupting any data
read. Additionally the RX buffer pointer is never adjusted according to
DMA progress in DMA mode, thus calling the RX FIFO drain procedure in DMA
mode is a bug.
Fix corruptions in DMA RX mode by draining RX FIFO only in interrupt mode.
Also wait for completion of RX DMA when in DMA mode before returning to
ensure all data has been copied to the supplied memory buffer.

Signed-off-by: Tobias Schramm <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

spi: sun6i: reduce DMA RX transfer width to single byte

Through empirical testing it has been determined that sometimes RX SPI
transfers with DMA enabled return corrupted data. This is down to single
or even multiple bytes lost during DMA transfer from SPI peripheral to
memory. It seems the RX FIFO within the SPI peripheral can become
confused when performing bus read accesses wider than a single byte to it
during an active SPI transfer.

This patch reduces the width of individual DMA read accesses to the
RX FIFO to a single byte to mitigate that issue.

Signed-off-by: Tobias Schramm <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

ASoC: rt5645: NULL pointer access when removing jack

Machine driver calls snd_soc_component_set_jack() function with NULL
jack and data parameters when removing jack in codec exit function.
Do not access data when jack is NULL.

Signed-off-by: Brent Lu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

ASoC: amd: yc: Add DMI entries to support Victus by HP Gaming Laptop 15-fb0xxx (8A3E)

This model requires an additional detection quirk to
enable the internal microphone.

Signed-off-by: Shubh <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

MAINTAINERS: Update the MAINTAINERS enties for TEXAS INSTRUMENTS ASoC DRIVERS

Update the MAINTAINERS email for TEXAS INSTRUMENTS ASoC DRIVERS.

Signed-off-by: Kevin-Lu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

Merge branch 'af_unix-data-races'

Kuniyuki Iwashima says:

====================
af_unix: Fix four data-races.

While running syzkaller, KCSAN reported 3 data-races with
systemd-coredump using AF_UNIX sockets.

This series fixes the three and another one inspiered by
one of the reports.
====================

Signed-off-by: David S. Miller <[email protected]>

af_unix: Fix data race around sk->sk_err.

As with sk->sk_shutdown shown in the previous patch, sk->sk_err can be
read locklessly by unix_dgram_sendmsg().

Let's use READ_ONCE() for sk_err as well.

Note that the writer side is marked by commit cc04410af7de ("af_unix:
annotate lockless accesses to sk->sk_err").

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

af_unix: Fix data-races around sk->sk_shutdown.

sk->sk_shutdown is changed under unix_state_lock(sk), but
unix_dgram_sendmsg() calls two functions to read sk_shutdown locklessly.

sock_alloc_send_pskb
`- sock_wait_for_wmem

Let's use READ_ONCE() there.

Note that the writer side was marked by commit e1d09c2c2f57 ("af_unix:
Fix data races around sk->sk_shutdown.").

BUG: KCSAN: data-race in sock_alloc_send_pskb / unix_release_sock

write (marked) to 0xffff8880069af12c of 1 bytes by task 1 on cpu 1:
unix_release_sock+0x75c/0x910 net/unix/af_unix.c:631
unix_release+0x59/0x80 net/unix/af_unix.c:1053
__sock_release+0x7d/0x170 net/socket.c:654
sock_close+0x19/0x30 net/socket.c:1386
__fput+0x2a3/0x680 fs/file_table.c:384
____fput+0x15/0x20 fs/file_table.c:412
task_work_run+0x116/0x1a0 kernel/task_work.c:179
resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204
__syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297
do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x6e/0xd8

read to 0xffff8880069af12c of 1 bytes by task 28650 on cpu 0:
sock_alloc_send_pskb+0xd2/0x620 net/core/sock.c:2767
unix_dgram_sendmsg+0x2f8/0x14f0 net/unix/af_unix.c:1944
unix_seqpacket_sendmsg net/unix/af_unix.c:2308 [inline]
unix_seqpacket_sendmsg+0xba/0x130 net/unix/af_unix.c:2292
sock_sendmsg_nosec net/socket.c:725 [inline]
sock_sendmsg+0x148/0x160 net/socket.c:748
____sys_sendmsg+0x4e4/0x610 net/socket.c:2494
___sys_sendmsg+0xc6/0x140 net/socket.c:2548
__sys_sendmsg+0x94/0x140 net/socket.c:2577
__do_sys_sendmsg net/socket.c:2586 [inline]
__se_sys_sendmsg net/socket.c:2584 [inline]
__x64_sys_sendmsg+0x45/0x50 net/socket.c:2584
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x6e/0xd8

value changed: 0x00 -> 0x03

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 28650 Comm: systemd-coredum Not tainted 6.4.0-11989-g6843306689af #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzkaller <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

af_unix: Fix data-race around unix_tot_inflight.

unix_tot_inflight is changed under spin_lock(unix_gc_lock), but
unix_release_sock() reads it locklessly.

Let's use READ_ONCE() for unix_tot_inflight.

Note that the writer side was marked by commit 9d6d7f1cb67c ("af_unix:
annote lockless accesses to unix_tot_inflight & gc_in_progress")

BUG: KCSAN: data-race in unix_inflight / unix_release_sock

write (marked) to 0xffffffff871852b8 of 4 bytes by task 123 on cpu 1:
unix_inflight+0x130/0x180 net/unix/scm.c:64
unix_attach_fds+0x137/0x1b0 net/unix/scm.c:123
unix_scm_to_skb net/unix/af_unix.c:1832 [inline]
unix_dgram_sendmsg+0x46a/0x14f0 net/unix/af_unix.c:1955
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg+0x148/0x160 net/socket.c:747
____sys_sendmsg+0x4e4/0x610 net/socket.c:2493
___sys_sendmsg+0xc6/0x140 net/socket.c:2547
__sys_sendmsg+0x94/0x140 net/socket.c:2576
__do_sys_sendmsg net/socket.c:2585 [inline]
__se_sys_sendmsg net/socket.c:2583 [inline]
__x64_sys_sendmsg+0x45/0x50 net/socket.c:2583
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x72/0xdc

read to 0xffffffff871852b8 of 4 bytes by task 4891 on cpu 0:
unix_release_sock+0x608/0x910 net/unix/af_unix.c:671
unix_release+0x59/0x80 net/unix/af_unix.c:1058
__sock_release+0x7d/0x170 net/socket.c:653
sock_close+0x19/0x30 net/socket.c:1385
__fput+0x179/0x5e0 fs/file_table.c:321
____fput+0x15/0x20 fs/file_table.c:349
task_work_run+0x116/0x1a0 kernel/task_work.c:179
resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204
__syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297
do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x72/0xdc

value changed: 0x00000000 -> 0x00000001

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 4891 Comm: systemd-coredum Not tainted 6.4.0-rc5-01219-gfa0e21fa4443 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014

Fixes: 9305cfa4443d ("[AF_UNIX]: Make unix_tot_inflight counter non-atomic")
Reported-by: syzkaller <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

af_unix: Fix data-races around user->unix_inflight.

user->unix_inflight is changed under spin_lock(unix_gc_lock),
but too_many_unix_fds() reads it locklessly.

Let's annotate the write/read accesses to user->unix_inflight.

BUG: KCSAN: data-race in unix_attach_fds / unix_inflight

write to 0xffffffff8546f2d0 of 8 bytes by task 44798 on cpu 1:
unix_inflight+0x157/0x180 net/unix/scm.c:66
unix_attach_fds+0x147/0x1e0 net/unix/scm.c:123
unix_scm_to_skb net/unix/af_unix.c:1827 [inline]
unix_dgram_sendmsg+0x46a/0x14f0 net/unix/af_unix.c:1950
unix_seqpacket_sendmsg net/unix/af_unix.c:2308 [inline]
unix_seqpacket_sendmsg+0xba/0x130 net/unix/af_unix.c:2292
sock_sendmsg_nosec net/socket.c:725 [inline]
sock_sendmsg+0x148/0x160 net/socket.c:748
____sys_sendmsg+0x4e4/0x610 net/socket.c:2494
___sys_sendmsg+0xc6/0x140 net/socket.c:2548
__sys_sendmsg+0x94/0x140 net/socket.c:2577
__do_sys_sendmsg net/socket.c:2586 [inline]
__se_sys_sendmsg net/socket.c:2584 [inline]
__x64_sys_sendmsg+0x45/0x50 net/socket.c:2584
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x6e/0xd8

read to 0xffffffff8546f2d0 of 8 bytes by task 44814 on cpu 0:
too_many_unix_fds net/unix/scm.c:101 [inline]
unix_attach_fds+0x54/0x1e0 net/unix/scm.c:110
unix_scm_to_skb net/unix/af_unix.c:1827 [inline]
unix_dgram_sendmsg+0x46a/0x14f0 net/unix/af_unix.c:1950
unix_seqpacket_sendmsg net/unix/af_unix.c:2308 [inline]
unix_seqpacket_sendmsg+0xba/0x130 net/unix/af_unix.c:2292
sock_sendmsg_nosec net/socket.c:725 [inline]
sock_sendmsg+0x148/0x160 net/socket.c:748
____sys_sendmsg+0x4e4/0x610 net/socket.c:2494
___sys_sendmsg+0xc6/0x140 net/socket.c:2548
__sys_sendmsg+0x94/0x140 net/socket.c:2577
__do_sys_sendmsg net/socket.c:2586 [inline]
__se_sys_sendmsg net/socket.c:2584 [inline]
__x64_sys_sendmsg+0x45/0x50 net/socket.c:2584
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x6e/0xd8

value changed: 0x000000000000000c -> 0x000000000000000d

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 44814 Comm: systemd-coredum Not tainted 6.4.0-11989-g6843306689af #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014

Fixes: 712f4aad406b ("unix: properly account for FDs passed over unix sockets")
Reported-by: syzkaller <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Acked-by: Willy Tarreau <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

af_unix: Fix msg_controllen test in scm_pidfd_recv() for MSG_CMSG_COMPAT.

Heiko Carstens reported that SCM_PIDFD does not work with MSG_CMSG_COMPAT
because scm_pidfd_recv() always checks msg_controllen against sizeof(struct
cmsghdr).

We need to use sizeof(struct compat_cmsghdr) for the compat case.

Fixes: 5e2ff6704a27 ("scm: add SO_PASSPIDFD and SCM_PIDFD")
Reported-by: Heiko Carstens <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Tested-by: Heiko Carstens <[email protected]>
Reviewed-by: Alexander Mikhalitsyn <[email protected]>
Reviewed-by: Michal Swiatkowski <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

docs: netdev: update the netdev infra URLs

Some corporate proxies block our current NIPA URLs because
they use a free / shady DNS domain. As suggested by Jesse
we got a new DNS entry from Konstantin - netdev.bots.linux.dev,
use it.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'rework/misc-cleanups' into for-linus

Merge branch 'for-6.6-vsprintf-doc' into for-linus

docs: netdev: document patchwork patch states

The patchwork states are largely self-explanatory but small
ambiguities may still come up. Document how we interpret
the states in networking.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bpf, sockmap: Fix skb refcnt race after locking changes

There is a race where skb's from the sk_psock_backlog can be referenced
after userspace side has already skb_consumed() the sk_buff and its refcnt
dropped to zer0 causing use after free.

The flow is the following:

  while ((skb = skb_peek(&psock->ingress_skb))
    sk_psock_handle_Skb(psock, skb, ..., ingress)
    if (!ingress) ...
    sk_psock_skb_ingress
       sk_psock_skb_ingress_enqueue(skb)
          msg->skb = skb
          sk_psock_queue_msg(psock, msg)
    skb_dequeue(&psock->ingress_skb)

The sk_psock_queue_msg() puts the msg on the ingress_msg queue. This is
what the application reads when recvmsg() is called. An application can
read this anytime after the msg is placed on the queue. The recvmsg hook
will also read msg->skb and then after user space reads the msg will call
consume_skb(skb) on it effectively free'ing it.

But, the race is in above where backlog queue still has a reference to
the skb and calls skb_dequeue(). If the skb_dequeue happens after the
user reads and free's the skb we have a use after free.

The !ingress case does not suffer from this problem because it uses
sendmsg_*(sk, msg) which does not pass the sk_buff further down the
stack.

The following splat was observed with 'test_progs -t sockmap_listen':

  [ 1022.710250][ T2556] general protection fault, ...
  [...]
  [ 1022.712830][ T2556] Workqueue: events sk_psock_backlog
  [ 1022.713262][ T2556] RIP: 0010:skb_dequeue+0x4c/0x80
  [ 1022.713653][ T2556] Code: ...
  [...]
  [ 1022.720699][ T2556] Call Trace:
  [ 1022.720984][ T2556]  <TASK>
  [ 1022.721254][ T2556]  ? die_addr+0x32/0x80^M
  [ 1022.721589][ T2556]  ? exc_general_protection+0x25a/0x4b0
  [ 1022.722026][ T2556]  ? asm_exc_general_protection+0x22/0x30
  [ 1022.722489][ T2556]  ? skb_dequeue+0x4c/0x80
  [ 1022.722854][ T2556]  sk_psock_backlog+0x27a/0x300
  [ 1022.723243][ T2556]  process_one_work+0x2a7/0x5b0
  [ 1022.723633][ T2556]  worker_thread+0x4f/0x3a0
  [ 1022.723998][ T2556]  ? __pfx_worker_thread+0x10/0x10
  [ 1022.724386][ T2556]  kthread+0xfd/0x130
  [ 1022.724709][ T2556]  ? __pfx_kthread+0x10/0x10
  [ 1022.725066][ T2556]  ret_from_fork+0x2d/0x50
  [ 1022.725409][ T2556]  ? __pfx_kthread+0x10/0x10
  [ 1022.725799][ T2556]  ret_from_fork_asm+0x1b/0x30
  [ 1022.726201][ T2556]  </TASK>

To fix we add an skb_get() before passing the skb to be enqueued in the
engress queue. This bumps the skb->users refcnt so that consume_skb()
and kfree_skb will not immediately free the sk_buff. With this we can
be sure the skb is still around when we do the dequeue. Then we just
need to decrement the refcnt or free the skb in the backlog case which
we do by calling kfree_skb() on the ingress case as well as the sendmsg
case.

Before locking change from fixes tag we had the sock locked so we
couldn't race with user and there was no issue here.

Fixes: 799aa7f98d53e ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
Reported-by: Jiri Olsa <[email protected]>
Signed-off-by: John Fastabend <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Xu Kuohai <[email protected]>
Tested-by: Jiri Olsa <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

fbdev/g364fb: fix build failure with mips

Fix the typo which resulted in the driver using FB_DEFAULT_IOMEM_HELPERS
instead of FB_DEFAULT_IOMEM_OPS as the fbdev I/O helpers.

Fixes: 501126083855 ("fbdev/g364fb: Use fbdev I/O helpers")
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Sudip Mukherjee <[email protected]>
Signed-off-by: Thomas Zimmermann <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

net: phy: micrel: Correct bit assignments for phy_device flags

Previously, the defines for phy_device flags in the Micrel driver were
ambiguous in their representation. They were intended to be bit masks
but were mistakenly defined as bit positions. This led to the following
issues:

- MICREL_KSZ8_P1_ERRATA, designated for KSZ88xx switches, overlapped
  with MICREL_PHY_FXEN and MICREL_PHY_50MHZ_CLK.
- Due to this overlap, the code path for MICREL_PHY_FXEN, tailored for
  the KSZ8041 PHY, was not executed for KSZ88xx PHYs.
- Similarly, the code associated with MICREL_PHY_50MHZ_CLK wasn't
  triggered for KSZ88xx.

To rectify this, all three flags have now been explicitly converted to
use the `BIT()` macro, ensuring they are defined as bit masks and
preventing potential overlaps in the future.

Fixes: 49011e0c1555 ("net: phy: micrel: ksz886x/ksz8081: add cabletest support")
Signed-off-by: Oleksij Rempel <[email protected]>
Reviewed-by: Russell King (Oracle) <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ipv6/addrconf: avoid integer underflow in ipv6_create_tempaddr

The existing code incorrectly casted a negative value (the result of a
subtraction) to an unsigned value without checking. For example, if
/proc/sys/net/ipv6/conf/*/temp_prefered_lft was set to 1, the preferred
lifetime would jump to 4 billion seconds. On my machine and network the
shortest lifetime that avoided underflow was 3 seconds.

Fixes: 76506a986dc3 ("IPv6: fix DESYNC_FACTOR")
Signed-off-by: Alex Henrie <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

veth: Fixing transmit return status for dropped packets

The veth_xmit function returns NETDEV_TX_OK even when packets are dropped.
This behavior leads to incorrect calculations of statistics counts, as
well as things like txq->trans_start updates.

Fixes: e314dbdc1c0d ("[NET]: Virtual ethernet device driver.")
Signed-off-by: Liang Chen <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

gve: fix frag_list chaining

gve_rx_append_frags() is able to build skbs chained with frag_list,
like GRO engine.

Problem is that shinfo->frag_list should only be used
for the head of the chain.

All other links should use skb->next pointer.

Otherwise, built skbs are not valid and can cause crashes.

Equivalent code in GRO (skb_gro_receive()) is:

    if (NAPI_GRO_CB(p)->last == p)
        skb_shinfo(p)->frag_list = skb;
    else
        NAPI_GRO_CB(p)->last->next = skb;
    NAPI_GRO_CB(p)->last = skb;

Fixes: 9b8dd5e5ea48 ("gve: DQO: Add RX path")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Bailey Forrest <[email protected]>
Cc: Willem de Bruijn <[email protected]>
Cc: Catherine Sullivan <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: deal with integer overflows in kmalloc_reserve()

Blamed commit changed:
    ptr = kmalloc(size);
    if (ptr)
      size = ksize(ptr);

to:
    size = kmalloc_size_roundup(size);
    ptr = kmalloc(size);

This allowed various crash as reported by syzbot [1]
and Kyle Zeng.

Problem is that if @size is bigger than 0x80000001,
kmalloc_size_roundup(size) returns 2^32.

kmalloc_reserve() uses a 32bit variable (obj_size),
so 2^32 is truncated to 0.

kmalloc(0) returns ZERO_SIZE_PTR which is not handled by
skb allocations.

Following trace can be triggered if a netdev->mtu is set
close to 0x7fffffff

We might in the future limit netdev->mtu to more sensible
limit (like KMALLOC_MAX_SIZE).

This patch is based on a syzbot report, and also a report
and tentative fix from Kyle Zeng.

[1]
BUG: KASAN: user-memory-access in __build_skb_around net/core/skbuff.c:294 [inline]
BUG: KASAN: user-memory-access in __alloc_skb+0x3c4/0x6e8 net/core/skbuff.c:527
Write of size 32 at addr 00000000fffffd10 by task syz-executor.4/22554

CPU: 1 PID: 22554 Comm: syz-executor.4 Not tainted 6.1.39-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/03/2023
Call trace:
dump_backtrace+0x1c8/0x1f4 arch/arm64/kernel/stacktrace.c:279
show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:286
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x120/0x1a0 lib/dump_stack.c:106
print_report+0xe4/0x4b4 mm/kasan/report.c:398
kasan_report+0x150/0x1ac mm/kasan/report.c:495
kasan_check_range+0x264/0x2a4 mm/kasan/generic.c:189
memset+0x40/0x70 mm/kasan/shadow.c:44
__build_skb_around net/core/skbuff.c:294 [inline]
__alloc_skb+0x3c4/0x6e8 net/core/skbuff.c:527
alloc_skb include/linux/skbuff.h:1316 [inline]
igmpv3_newpack+0x104/0x1088 net/ipv4/igmp.c:359
add_grec+0x81c/0x1124 net/ipv4/igmp.c:534
igmpv3_send_cr net/ipv4/igmp.c:667 [inline]
igmp_ifc_timer_expire+0x1b0/0x1008 net/ipv4/igmp.c:810
call_timer_fn+0x1c0/0x9f0 kernel/time/timer.c:1474
expire_timers kernel/time/timer.c:1519 [inline]
__run_timers+0x54c/0x710 kernel/time/timer.c:1790
run_timer_softirq+0x28/0x4c kernel/time/timer.c:1803
_stext+0x380/0xfbc
____do_softirq+0x14/0x20 arch/arm64/kernel/irq.c:79
call_on_irq_stack+0x24/0x4c arch/arm64/kernel/entry.S:891
do_softirq_own_stack+0x20/0x2c arch/arm64/kernel/irq.c:84
invoke_softirq kernel/softirq.c:437 [inline]
__irq_exit_rcu+0x1c0/0x4cc kernel/softirq.c:683
irq_exit_rcu+0x14/0x78 kernel/softirq.c:695
el0_interrupt+0x7c/0x2e0 arch/arm64/kernel/entry-common.c:717
__el0_irq_handler_common+0x18/0x24 arch/arm64/kernel/entry-common.c:724
el0t_64_irq_handler+0x10/0x1c arch/arm64/kernel/entry-common.c:729
el0t_64_irq+0x1a0/0x1a4 arch/arm64/kernel/entry.S:584

Fixes: 12d6c1d3a2ad ("skbuff: Proactively round up to kmalloc bucket size")
Reported-by: syzbot <[email protected]>
Reported-by: Kyle Zeng <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ksmbd: remove experimental warning

ksmbd has made significant improvements over the past two
years and is regularly tested and used. Remove the experimental
warning.

Acked-by: Namjae Jeon <[email protected]>
Signed-off-by: Steve French <[email protected]>

virtio_ring: fix avail_wrap_counter in virtqueue_add_packed

In current packed virtqueue implementation, the avail_wrap_counter won't
flip, in the case when the driver supplies a descriptor chain with a
length equals to the queue size; total_sg == vq->packed.vring.num.

Let’s assume the following situation:
vq->packed.vring.num=4
vq->packed.next_avail_idx: 1
vq->packed.avail_wrap_counter: 0

Then the driver adds a descriptor chain containing 4 descriptors.

We expect the following result with avail_wrap_counter flipped:
vq->packed.next_avail_idx: 1
vq->packed.avail_wrap_counter: 1

But, the current implementation gives the following result:
vq->packed.next_avail_idx: 1
vq->packed.avail_wrap_counter: 0

To reproduce the bug, you can set a packed queue size as small as
possible, so that the driver is more likely to provide a descriptor
chain with a length equal to the packed queue size. For example, in
qemu run following commands:
sudo qemu-system-x86_64 \
-enable-kvm \
-nographic \
-kernel "path/to/kernel_image" \
-m 1G \
-drive file="path/to/rootfs",if=none,id=disk \
-device virtio-blk,drive=disk \
-drive file="path/to/disk_image",if=none,id=rwdisk \
-device virtio-blk,drive=rwdisk,packed=on,queue-size=4,\
indirect_desc=off \
-append "console=ttyS0 root=/dev/vda rw init=/bin/bash"

Inside the VM, create a directory and mount the rwdisk device on it. The
rwdisk will hang and mount operation will not complete.

This commit fixes the wrap counter error by flipping the
packed.avail_wrap_counter, when start of descriptor chain equals to the
end of descriptor chain (head == i).

Fixes: 1ce9e6055fa0 ("virtio_ring: introduce packed ring support")
Signed-off-by: Yuan Yao <[email protected]>
Message-Id: <20230808051110.3492693 [email protected]>
Acked-by: Jason Wang <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_vdpa: build affinity masks conditionally

We try to build affinity mask via create_affinity_masks()
unconditionally which may lead several issues:

- the affinity mask is not used for parent without affinity support
  (only VDUSE support the affinity now)
- the logic of create_affinity_masks() might not work for devices
  other than block. For example it's not rare in the networking device
  where the number of queues could exceed the number of CPUs. Such
  case breaks the current affinity logic which is based on
  group_cpus_evenly() who assumes the number of CPUs are not less than
  the number of groups. This can trigger a warning[1]:

if (ret >= 0)
WARN_ON(nr_present + nr_others < numgrps);

Fixing this by only build the affinity masks only when

- Driver passes affinity descriptor, driver like virtio-blk can make
  sure to limit the number of queues when it exceeds the number of CPUs
- Parent support affinity setting config ops

This help to avoid the warning. More optimizations could be done on
top.

[1]
[  682.146655] WARNING: CPU: 6 PID: 1550 at lib/group_cpus.c:400 group_cpus_evenly+0x1aa/0x1c0
[  682.146668] CPU: 6 PID: 1550 Comm: vdpa Not tainted 6.5.0-rc5jason+ #79
[  682.146671] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[  682.146673] RIP: 0010:group_cpus_evenly+0x1aa/0x1c0
[  682.146676] Code: 4c 89 e0 5b 5d 41 5c 41 5d 41 5e c3 cc cc cc cc e8 1b c4 74 ff 48 89 ef e8 13 ac 98 ff 4c 89 e7 45 31 e4 e8 08 ac 98 ff eb c2 <0f> 0b eb b6 e8 fd 05 c3 00 45 31 e4 eb e5 cc cc cc cc cc cc cc cc
[  682.146679] RSP: 0018:ffffc9000215f498 EFLAGS: 00010293
[  682.146682] RAX: 000000000001f1e0 RBX: 0000000000000041 RCX: 0000000000000000
[  682.146684] RDX: ffff888109922058 RSI: 0000000000000041 RDI: 0000000000000030
[  682.146686] RBP: ffff888109922058 R08: ffffc9000215f498 R09: ffffc9000215f4a0
[  682.146687] R10: 00000000000198d0 R11: 0000000000000030 R12: ffff888107e02800
[  682.146689] R13: 0000000000000030 R14: 0000000000000030 R15: 0000000000000041
[  682.146692] FS:  00007fef52315740(0000) GS:ffff888237380000(0000) knlGS:0000000000000000
[  682.146695] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  682.146696] CR2: 00007fef52509000 CR3: 0000000110dbc004 CR4: 0000000000370ee0
[  682.146698] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  682.146700] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  682.146701] Call Trace:
[  682.146703]  <TASK>
[  682.146705]  ? __warn+0x7b/0x130
[  682.146709]  ? group_cpus_evenly+0x1aa/0x1c0
[  682.146712]  ? report_bug+0x1c8/0x1e0
[  682.146717]  ? handle_bug+0x3c/0x70
[  682.146721]  ? exc_invalid_op+0x14/0x70
[  682.146723]  ? asm_exc_invalid_op+0x16/0x20
[  682.146727]  ? group_cpus_evenly+0x1aa/0x1c0
[  682.146729]  ? group_cpus_evenly+0x15c/0x1c0
[  682.146731]  create_affinity_masks+0xaf/0x1a0
[  682.146735]  virtio_vdpa_find_vqs+0x83/0x1d0
[  682.146738]  ? __pfx_default_calc_sets+0x10/0x10
[  682.146742]  virtnet_find_vqs+0x1f0/0x370
[  682.146747]  virtnet_probe+0x501/0xcd0
[  682.146749]  ? vp_modern_get_status+0x12/0x20
[  682.146751]  ? get_cap_addr.isra.0+0x10/0xc0
[  682.146754]  virtio_dev_probe+0x1af/0x260
[  682.146759]  really_probe+0x1a5/0x410

Fixes: 3dad56823b53 ("virtio-vdpa: Support interrupt affinity spreading mechanism")
Signed-off-by: Jason Wang <[email protected]>
Message-Id: <20230811091539.1359865 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_net: merge dma operations when filling mergeable buffers

Currently, the virtio core will perform a dma operation for each
buffer. Although, the same page may be operated multiple times.

This patch, the driver does the dma operation and manages the dma
address based the feature premapped of virtio core.

This way, we can perform only one dma operation for the pages of the
alloc frag. This is beneficial for the iommu device.

kernel command line: intel_iommu=on iommu.passthrough=0

       |  strict=0  | strict=1
Before |  775496pps | 428614pps
After  | 1109316pps | 742853pps

Signed-off-by: Xuan Zhuo <[email protected]>
Message-Id: <20230810123057 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: introduce dma sync api for virtqueue

These API has been introduced:

* virtqueue_dma_need_sync
* virtqueue_dma_sync_single_range_for_cpu
* virtqueue_dma_sync_single_range_for_device

These APIs can be used together with the premapped mechanism to sync the
DMA address.

Signed-off-by: Xuan Zhuo <[email protected]>
Message-Id: <20230810123057 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: introduce dma map api for virtqueue

Added virtqueue_dma_map_api* to map DMA addresses for virtual memory in
advance. The purpose is to keep memory mapped across multiple add/get
buf operations.

Signed-off-by: Xuan Zhuo <[email protected]>
Message-Id: <20230810123057 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: introduce virtqueue_reset()

Introduce virtqueue_reset() to release all buffer inside vq.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20230810123057 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: separate the logic of reset/enable from virtqueue_resize

The subsequent reset function will reuse these logic.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20230810123057 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: correct the expression of the description of virtqueue_resize()

Modify the "useless" to a more accurate "unused".

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20230810123057 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: skip unmap for premapped

Now we add a case where we skip dma unmap, the vq->premapped is true.

We can't just rely on use_dma_api to determine whether to skip the dma
operation. For convenience, I introduced the "do_unmap". By default, it
is the same as use_dma_api. If the driver is configured with premapped,
then do_unmap is false.

So as long as do_unmap is false, for addr of desc, we should skip dma
unmap operation.

Signed-off-by: Xuan Zhuo <[email protected]>
Message-Id: <20230810123057 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: introduce virtqueue_dma_dev()

Added virtqueue_dma_dev() to get DMA device for virtio. Then the
caller can do dma operation in advance. The purpose is to keep memory
mapped across multiple add/get buf operations.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20230810123057 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: support add premapped buf

If the vq is the premapped mode, use the sg_dma_address() directly.

Signed-off-by: Xuan Zhuo <[email protected]>
Message-Id: <20230810123057 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: introduce virtqueue_set_dma_premapped()

This helper allows the driver change the dma mode to premapped mode.
Under the premapped mode, the virtio core do not do dma mapping
internally.

This just work when the use_dma_api is true. If the use_dma_api is false,
the dma options is not through the DMA APIs, that is not the standard
way of the linux kernel.

Signed-off-by: Xuan Zhuo <[email protected]>
Message-Id: <20230810123057 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: put mapping error check in vring_map_one_sg

This patch put the dma addr error check in vring_map_one_sg().

The benefits of doing this:

1. reduce one judgment of vq->use_dma_api.
2. make vring_map_one_sg more simple, without calling
vring_mapping_error to check the return value. simplifies subsequent
code

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20230810123057 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: check use_dma_api before unmap desc for indirect

Inside detach_buf_split(), if use_dma_api is false,
vring_unmap_one_split_indirect will be called many times, but actually
nothing is done. So this patch check use_dma_api firstly.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20230810123057 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vdpa_sim: offer VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK

Start offering the feature in the simulator. Other parent drivers can
follow this code to offer it too.

Signed-off-by: Eugenio Pérez <[email protected]>
Acked-by: Shannon Nelson <[email protected]>
Message-Id: <20230609092127 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vdpa: add get_backend_features vdpa operation

This operation allow vdpa parent to expose its own backend feature bits.

Next patches introduce a feature not compatible with all parent drivers:
the ability to enable vq after driver_ok. Each parent must declare if
it allows it or not.

Signed-off-by: Eugenio Pérez <[email protected]>
Acked-by: Shannon Nelson <[email protected]>
Message-Id: <20230609092127 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vdpa: accept VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK backend feature

Accepting VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK backend feature if
userland sets it.

Signed-off-by: Eugenio Pérez <[email protected]>
Acked-by: Shannon Nelson <[email protected]>
Message-Id: <20230609092127 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vdpa: add VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK flag

This feature flag allows the driver enabling virtqueues both before and
after DRIVER_OK.

This is needed for software assisted live migration, so userland can
restore the device status in devices with control virtqueue before the
dataplane is enabled.

Signed-off-by: Eugenio Pérez <[email protected]>
Acked-by: Shannon Nelson <[email protected]>
Message-Id: <20230609092127 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vdpa/mlx5: Remove unused function declarations

Commit 29064bfdabd5 ("vdpa/mlx5: Add support library for mlx5 VDPA implementation")
declared but never implemented these.

Signed-off-by: Yue Haibing <[email protected]>
Message-Id: <20230803143041 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

Merge tag 'dmaengine-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine

Pull dmaengine updates from Vinod Koul:
"New controller support and updates to drivers.

  New support:
   - Qualcomm SM6115 and QCM2290 dmaengine support
   - at_xdma support for microchip,sam9x7 controller

  Updates:
   - idxd updates for wq simplification and ats knob updates
   - fsl edma updates for v3 support
   - Xilinx AXI4-Stream control support
   - Yaml conversion for bcm dma binding"

* tag 'dmaengine-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (53 commits)
  dmaengine: fsl-edma: integrate v3 support
  dt-bindings: fsl-dma: fsl-edma: add edma3 compatible string
  dmaengine: fsl-edma: move tcd into struct fsl_dma_chan
  dmaengine: fsl-edma: refactor chan_name setup and safety
  dmaengine: fsl-edma: move clearing of register interrupt into setup_irq function
  dmaengine: fsl-edma: refactor using devm_clk_get_enabled
  dmaengine: fsl-edma: simply ATTR_DSIZE and ATTR_SSIZE by using ffs()
  dmaengine: fsl-edma: move common IRQ handler to common.c
  dmaengine: fsl-edma: Remove enum edma_version
  dmaengine: fsl-edma: transition from bool fields to bitmask flags in drvdata
  dmaengine: fsl-edma: clean up EXPORT_SYMBOL_GPL in fsl-edma-common.c
  dmaengine: fsl-edma: fix build error when arch is s390
  dmaengine: idxd: Fix issues with PRS disable sysfs knob
  dmaengine: idxd: Allow ATS disable update only for configurable devices
  dmaengine: xilinx_dma: Program interrupt delay timeout
  dmaengine: xilinx_dma: Use tasklet_hi_schedule for timing critical usecase
  dmaengine: xilinx_dma: Freeup active list based on descriptor completion bit
  dmaengine: xilinx_dma: Increase AXI DMA transaction segment count
  dmaengine: xilinx_dma: Pass AXI4-Stream control words to dma client
  dt-bindings: dmaengine: xilinx_dma: Add xlnx,irq-delay property
  ...

Merge tag 'phy-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy

Pull phy updates from Vinod Koul:
"As usual a couple of new drivers, a bunch of new device support and
  few updates to existing drivers

  New Support:
   - Starfive dphy rx, JH7110 usb and pcie support
   - Rockchip rv1126 inno-dsi phy, rk3588 usb and pcie support
   - Qualcomm sa8775p PCIe support, M31 USB PHY driver
   - Samsung Exynos850 usb support

  Updates:
   - Mediatek dsi driver clock updates
   - Qualcomm sm8150 combo phy with reworking of qmp pcie driver
   - Xilinx zynqmp runtime PM support"

* tag 'phy-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy: (83 commits)
  phy: exynos5-usbdrd: Add Exynos850 support
  phy: exynos5-usbdrd: Add 26MHz ref clk support
  phy: exynos5-usbdrd: Make it possible to pass custom phy ops
  dt-bindings: phy: samsung,usb3-drd-phy: Add Exynos850 support
  phy: qcom-qmp-combo: fix clock probing
  phy: qcom-qmp-pcie: support SM8150 PCIe QMP PHYs
  phy: qcom-qmp-pcie: populate offsets configuration
  phy: qcom-qmp-pcie: simplify clock handling
  phy: qcom-qmp-pcie: keep offset tables sorted
  phy: qcom-qmp-pcie: drop ln_shrd from v5_20 config
  dt-bindings: phy: qcom,qmp-pcie: describe SM8150 PCIe PHYs
  dt-bindings: phy: migrate QMP PCIe PHY bindings to qcom,sc8280xp-qmp-pcie-phy.yaml
  phy: fsl-imx8mq-usb: add dev_err_probe if getting vbus failed
  phy: qcom: Introduce M31 USB PHY driver
  dt-bindings: phy: qcom,m31: Document qcom,m31 USB phy
  phy: rockchip: inno-dsidphy: Add rv1126 support
  dt-bindings: phy: rockchip-inno-dsidphy: Document rv1126
  dt-bindings: phy: mediatek,tphy: allow simple nodename pattern
  phy: amlogic: meson-g12a-usb2: fix Wvoid-pointer-to-enum-cast warning
  phy: marvell pxa-usb: fix Wvoid-pointer-to-enum-cast warning
  ...

Merge tag 'soundwire-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire

Pull soundwire updates from Vinod Koul:
"Device numbering and intel driver changes are main features:

   - Core support for soundwire device number allocation

   - intel driver updates for adding hw_params for DAI ops, hybrid
     number allocation and power managemnt callback updates

   - DT header include changes for subsystem"

* tag 'soundwire-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire:
  soundwire: intel_ace2x: add DAI hw_params/prepare/hw_free callbacks
  soundwire: intel_auxdevice: add hybrid IDA-based device_number allocation
  soundwire: bus: add callbacks for device_number allocation
  soundwire: extend parameters of new_peripheral_assigned() callback
  soundWire: intel_auxdevice: resume 'sdw-master' on startup and system resume
  soundwire: intel_auxdevice: enable pm_runtime earlier on startup
  soundwire: Explicitly include correct DT includes

kbuild: Show marked Kconfig fragments in "help"

Currently the Kconfig fragments in kernel/configs and arch/*/configs
that aren't used internally aren't discoverable through "make help",
which consists of hard-coded lists of config fragments. Instead, list
all the fragment targets that have a "# Help: " comment prefix so the
targets can be generated dynamically.

Add logic to the Makefile to search for and display the fragment and
comment. Add comments to fragments that are intended to be direct targets.

Signed-off-by: Kees Cook <[email protected]>
Co-developed-by: Masahiro Yamada <[email protected]>
Acked-by: Michael Ellerman <[email protected]> (powerpc)
Reviewed-by: Nicolas Schier <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>

Merge tag 'mtd/for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux

Pull MTD updates from Miquel Raynal:
"Core MTD changes:
   - Use refcount to prevent corruption
   - Call external _get and _put in right order
   - Fix use-after-free in mtd release
   - Explicitly include correct DT includes
   - Clean refcounting with MTD_PARTITIONED_MASTER
   - mtdblock: make warning messages ratelimited
   - dt-bindings: Add SEAMA partition bindings

  Device driver changes:
   - Use devm helper functions
   - Fix questionable cast, remove pointless ones.
   - error handling fixes
   - add support for new chip versions
   - update DT bindings
   - misc cleanups - fix typos, whitespace, indentation"

* tag 'mtd/for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux: (105 commits)
  dt-bindings: mtd: amlogic,meson-nand: drop unneeded quotes
  mtd: spear_smi: Use helper function devm_clk_get_enabled()
  mtd: rawnand: orion: Use helper function devm_clk_get_optional_enabled()
  mtd: rawnand: vf610_nfc: Use helper function devm_clk_get_enabled()
  mtd: rawnand: sunxi: Use helper function devm_clk_get_enabled()
  mtd: rawnand: stm32_fmc2: Use helper function devm_clk_get_enabled()
  mtd: rawnand: mtk: Use helper function devm_clk_get_enabled()
  mtd: rawnand: mpc5121: Use helper function devm_clk_get_enabled()
  mtd: rawnand: lpc32xx_slc: Use helper function devm_clk_get_enabled()
  mtd: rawnand: intel: Use helper function devm_clk_get_enabled()
  mtd: rawnand: fsmc: Use helper function devm_clk_get_enabled()
  mtd: rawnand: arasan: Use helper function devm_clk_get_enabled()
  mtd: rawnand: qcom: Add read/read_start ops in exec_op path
  mtd: rawnand: qcom: Clear buf_count and buf_start in raw read
  mtd: maps: fix -Wvoid-pointer-to-enum-cast warning
  mtd: rawnand: fix -Wvoid-pointer-to-enum-cast warning
  mtd: rawnand: fsmc: handle clk prepare error in fsmc_nand_resume()
  mtd: rawnand: Propagate error and simplify ternary operators for brcmstb_nand_wait_for_completion()
  mtd: rawnand: qcom: Sort includes alphabetically
  mtd: rawnand: qcom: Do not override the error no of submit_descs()
  ...

igb: disable virtualization features on 82580

Disable virtualization features on 82580 just as on i210/i211.
This avoids that virt functions are acidentally called on 82850.

Fixes: 55cac248caa4 ("igb: Add full support for 82580 devices")
Signed-off-by: Corinna Vinschen <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'f2fs-for-6-6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
"In this cycle, we don't have a highlighted feature enhancement, but
  mostly have fixed issues mainly in two parts: 1) zoned block device,
  and 2) compression support.

  For zoned block device, we've tried to improve the power-off recovery
  flow as much as possible. For compression, we found some corner cases
  caused by wrong compression policy and logics. Other than them, there
  were some reverts and stat corrections.

  Bug fixes:
   - use finish zone command when closing a zone
   - check zone type before sending async reset zone command
   - fix to assign compress_level for lz4 correctly
   - fix error path of f2fs_submit_page_read()
   - don't {,de}compress non-full cluster
   - send small discard commands during checkpoint back
   - flush inode if atomic file is aborted
   - correct to account gc/cp stats

  And, there are minor bug fixes, avoiding false lockdep warning, and
  clean-ups"

* tag 'f2fs-for-6-6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (25 commits)
  f2fs: use finish zone command when closing a zone
  f2fs: compress: fix to assign compress_level for lz4 correctly
  f2fs: fix error path of f2fs_submit_page_read()
  f2fs: clean up error handling in sanity_check_{compress_,}inode()
  f2fs: avoid false alarm of circular locking
  Revert "f2fs: do not issue small discard commands during checkpoint"
  f2fs: doc: fix description of max_small_discards
  f2fs: should update REQ_TIME for direct write
  f2fs: fix to account cp stats correctly
  f2fs: fix to account gc stats correctly
  f2fs: remove unneeded check condition in __f2fs_setxattr()
  f2fs: fix to update i_ctime in __f2fs_setxattr()
  Revert "f2fs: fix to do sanity check on extent cache correctly"
  f2fs: increase usage of folio_next_index() helper
  f2fs: Only lfs mode is allowed with zoned block device feature
  f2fs: check zone type before sending async reset zone command
  f2fs: compress: don't {,de}compress non-full cluster
  f2fs: allow f2fs_ioc_{,de}compress_file to be interrupted
  f2fs: don't reopen the main block device in f2fs_scan_devices
  f2fs: fix to avoid mmap vs set_compress_option case
  ...

mm/kmemleak: move up cond_resched() call in page scanning loop

Commit bde5f6bc68db ("kmemleak: add scheduling point to kmemleak_scan()")
added a cond_resched() call to the struct page scanning loop to prevent
soft lockup from happening. However, soft lockup can still happen in that
loop in some corner cases when the pages that satisfy the "!(pfn & 63)"
check are skipped for some reasons.

Fix this corner case by moving up the cond_resched() check so that it will
be called every 64 pages unconditionally.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: bde5f6bc68db ("kmemleak: add scheduling point to kmemleak_scan()")
Signed-off-by: Waiman Long <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Yisheng Xie <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: page_alloc: remove stale CMA guard code

In the past, movable allocations could be disallowed from CMA through
PF_MEMALLOC_PIN.  As CMA pages are funneled through the MOVABLE pcplist,
this required filtering that cornercase during allocations, such that
pinnable allocations wouldn't accidentally get a CMA page.

However, since 8e3560d963d2 ("mm: honor PF_MEMALLOC_PIN for all movable
pages"), PF_MEMALLOC_PIN automatically excludes __GFP_MOVABLE.  Once
again, MOVABLE implies CMA is allowed.

Remove the stale filtering code.  Also remove a stale comment that was
introduced as part of the filtering code, because the filtering let
order-0 pages fall through to the buddy allocator.  See 1d91df85f399
("mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore}
APIs") for context.  The comment's been obsolete since the introduction of
the explicit ALLOC_HIGHATOMIC flag in eb2e2b425c69 ("mm/page_alloc:
explicitly record high-order atomic allocations in alloc_flags").

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

MAINTAINERS: add rmap.h to mm entry

Make it easier to figure out where to send patches for this file.

Link: https://lkml.kernel.org/r/efbc7689d35a48ff402644d696aa9a8d8bb6333a.1692877089.git.baruch@tkos.co.il
Signed-off-by: Baruch Siach <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

rmap: remove anon_vma_link() nommu stub

anon_vma_link() is unused since commit 5beb49305251 ("mm: change anon_vma
linking to fix multi-process server scalability issue").

Link: https://lkml.kernel.org/r/cdce9b00c9ab15f6d02eddf40dcad537d3e9676f.1692877089.git.baruch@tkos.co.il
Signed-off-by: Baruch Siach <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

proc/ksm: add ksm stats to /proc/pid/smaps

With madvise and prctl KSM can be enabled for different VMA's.  Once it is
enabled we can query how effective KSM is overall.  However we cannot
easily query if an individual VMA benefits from KSM.

This commit adds a KSM section to the /prod/<pid>/smaps file.  It reports
how many of the pages are KSM pages.  Note that KSM-placed zeropages are
not included, only actual KSM pages.

Here is a typical output:

7f420a000000-7f421a000000 rw-p 00000000 00:00 0
Size:             262144 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:               51212 kB
Pss:                8276 kB
Shared_Clean:        172 kB
Shared_Dirty:      42996 kB
Private_Clean:       196 kB
Private_Dirty:      7848 kB
Referenced:        15388 kB
Anonymous:         51212 kB
KSM:               41376 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:             202016 kB
SwapPss:            3882 kB
Locked:                0 kB
THPeligible:    0
ProtectionKey:         0
ksm_state:          0
ksm_skip_base:      0
ksm_skip_count:     0
VmFlags: rd wr mr mw me nr mg anon

This information also helps with the following workflow:
- First enable KSM for all the VMA's of a process with prctl.
- Then analyze with the above smaps report which VMA's benefit the most
- Change the application (if possible) to add the corresponding madvise
calls for the VMA's that benefit the most

[[email protected]: v5]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/hwpoison: rename hwp_walk* to hwpoison_walk*

In the discussion of "Improve hugetlbfs read on HWPOISON hugepages" [1],
Matthew Wilcox suggests hwp is a bad abbreviation of hwpoison, as hwp is
already used as "an acronym by acpi, intel_pstate, some clock drivers, an
ethernet driver, and a scsi driver"[1].

So rename hwp_walk and hwp_walk_ops to hwpoison_walk and
hwpoison_walk_ops respectively.

raw_hwp_(page|list), *_raw_hwp, and raw_hwp_unreliable flag are other
major appearances of "hwp". However, given the "raw" hint in the name, it
is easy to differentiate them from other "hwp" acronyms. Since renaming
them is not as straightforward as renaming hwp_walk*, they are not covered
by this commit.

[1] https://lore.kernel.org/lkml/20230707201904 [email protected]/T/#me6fecb8ce1ad4d5769199c9e162a44bc88f7bdec

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiaqi Yan <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Miaohe Lin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: memory-failure: add PageOffline() check

Memory failure is not interested in logically offlined pages. Skip this
type of page.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
"Updates to the usual drivers (ufs, lpfc, qla2xxx, mpi3mr, libsas) and
  the usual minor updates and bug fixes but no significant core changes"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (116 commits)
  scsi: storvsc: Handle additional SRB status values
  scsi: libsas: Delete sas_ata_task.retry_count
  scsi: libsas: Delete sas_ata_task.stp_affil_pol
  scsi: libsas: Delete sas_ata_task.set_affil_pol
  scsi: libsas: Delete sas_ssp_task.task_prio
  scsi: libsas: Delete sas_ssp_task.enable_first_burst
  scsi: libsas: Delete sas_ssp_task.retry_count
  scsi: libsas: Delete struct scsi_core
  scsi: libsas: Delete enum sas_phy_type
  scsi: libsas: Delete enum sas_class
  scsi: libsas: Delete sas_ha_struct.lldd_module
  scsi: target: Fix write perf due to unneeded throttling
  scsi: lpfc: Do not abuse UUID APIs and LPFC_COMPRESS_VMID_SIZE
  scsi: pm8001: Remove unused declarations
  scsi: fcoe: Fix potential deadlock on &fip->ctlr_lock
  scsi: elx: sli4: Remove code duplication
  scsi: bfa: Replace one-element array with flexible-array member in struct fc_rscn_pl_s
  scsi: qla2xxx: Remove unused declarations
  scsi: pmcraid: Use pci_dev_id() to simplify the code
  scsi: pm80xx: Set RETFIS when requested by libsas
  ...

Merge tag 'probes-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes updates from Masami Hiramatsu:

- kprobes: use struct_size() for variable size kretprobe_instance data
   structure.

- eprobe: Simplify trace_eprobe list iteration.

- probe events: Data structure field access support on BTF argument.

     - Update BTF argument support on the functions in the kernel
       loadable modules (only loaded modules are supported).

     - Move generic BTF access function (search function prototype and
       get function parameters) to a separated file.

     - Add a function to search a member of data structure in BTF.

     - Support accessing BTF data structure member from probe args by
       C-like arrow('->') and dot('.') operators. e.g.
          't sched_switch next=next->pid vruntime=next->se.vruntime'

     - Support accessing BTF data structure member from $retval. e.g.
          'f getname_flags%return +0($retval->name):string'

     - Add string type checking if BTF type info is available. This will
       reject if user specify ":string" type for non "char pointer"
       type.

     - Automatically assume the fprobe event as a function return event
       if $retval is used.

- selftests/ftrace: Add BTF data field access test cases.

- Documentation: Update fprobe event example with BTF data field.

* tag 'probes-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  Documentation: tracing: Update fprobe event example with BTF field
  selftests/ftrace: Add BTF fields access testcases
  tracing/fprobe-event: Assume fprobe is a return event by $retval
  tracing/probes: Add string type check with BTF
  tracing/probes: Support BTF field access from $retval
  tracing/probes: Support BTF based data structure field access
  tracing/probes: Add a function to search a member of a struct/union
  tracing/probes: Move finding func-proto API and getting func-param API to trace_btf
  tracing/probes: Support BTF argument on module functions
  tracing/eprobe: Iterate trace_eprobe directly
  kernel: kprobes: Use struct_size()

Merge tag 'trace-v6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull more tracing updates from Steven Rostedt:
"Tracing fixes and clean ups:

   - Replace strlcpy() with strscpy()

   - Initialize the pipe cpumask to zero on allocation

   - Use within_module() instead of open coding it

   - Remove extra space in hwlat_detectory/mode output

   - Use LIST_HEAD() instead of open coding it

   - A bunch of clean ups and fixes for the cpumask filter

   - Set local da_mon_##name to static

   - Fix race in snapshot buffer between cpu write and swap"

* tag 'trace-v6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/filters: Fix coding style issues
  tracing/filters: Change parse_pred() cpulist ternary into an if block
  tracing/filters: Fix double-free of struct filter_pred.mask
  tracing/filters: Fix error-handling of cpulist parsing buffer
  tracing: Zero the pipe cpumask on alloc to avoid spurious -EBUSY
  ftrace: Use LIST_HEAD to initialize clear_hash
  ftrace: Use within_module to check rec->ip within specified module.
  tracing: Replace strlcpy with strscpy in trace/events/task.h
  tracing: Fix race issue between cpu buffer write and swap
  tracing: Remove extra space at the end of hwlat_detector/mode
  rv: Set variable 'da_mon_##name' to static

Merge tag 'pstore-v6.6-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore fix from Kees Cook:

- Adjust sizes of buffers just avoid uncompress failures (Ard
Biesheuvel)

* tag 'pstore-v6.6-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
pstore: Base compression input buffer size on estimated compressed size

Merge tag 'x86-urgent-2023-09-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 selftest fix from Ingo Molnar:
"Fix the __NR_map_shadow_stack syscall-renumbering fallout in the x86
  self-test code.

  [ Arguably the existing code was unnecessarily fragile, and tooling
    should have picked up the new syscall number, and a wider fix is
    being worked on - but meanwhile, let's not have the old syscall
    number in the kernel tree. ]"

* tag 'x86-urgent-2023-09-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  selftests/x86: Update map_shadow_stack syscall nr

Merge tag 'timers-urgent-2023-09-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Ingo Molnar:
"Fix false positive 'softirq work is pending' messages on -rt kernels,
caused by a buggy factoring-out of existing code"

* tag 'timers-urgent-2023-09-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/rcu: Fix false positive "softirq work is pending" messages

Merge tag 'smp-urgent-2023-09-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull CPU hotplug fix from Ingo Molnar:
"Fix a CPU hotplug related deadlock between the task which initiates
and controls a CPU hot-unplug operation vs. the CFS bandwidth timer"

* tag 'smp-urgent-2023-09-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Prevent self deadlock on CPU hot-unplug

Merge tag 'sched-urgent-2023-09-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
"Miscellaneous scheduler fixes: a reporting fix, a static symbol fix,
  and a kernel-doc fix"

* tag 'sched-urgent-2023-09-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Report correct state for TASK_IDLE | TASK_FREEZABLE
  sched/fair: Make update_entity_lag() static
  sched/core: Add kernel-doc for set_cpus_allowed_ptr()

mm/pagewalk: fix bootstopping regression from extra pte_unmap()

Mikhail reports early-6.6-based Fedora Rawhide not booting: "rcu_preempt
detected expedited stalls", minutes wait, and then hung_task splat while
kworker trying to synchronize_rcu_expedited(). Nothing logged to disk.

He bisected to my 6.6 a349d72fd9ef ("mm/pgtable: add rcu_read_lock() and
rcu_read_unlock()s"): but the one to blame is my 6.5 commit to fix the
espfix "bad pmd" warnings when booting x86_64 with CONFIG_EFI_PGT_DUMP=y.

Gaah, that added an "addr >= TASK_SIZE" check to avoid pte_offset_map(),
but failed to add the equivalent check when choosing to pte_unmap().

It's not a problem on 6.5 (for different reasons, it's harmless on both
64-bit and 32-bit), but becomes a bootstopper on 6.6 with the unbalanced
rcu_read_unlock() - RCU has a WARN_ON_ONCE for that, but it would have
scrolled off Mikhail's console too quickly.

Reported-by: Mikhail Gavrilov <[email protected]>
Closes: https://lore.kernel.org/linux-mm/CABXGCsNi8Tiv5zUPNXr6UJw6qV1VdaBEfGqEAMkkXE3QPvZuAQ@mail.gmail.com/
Fixes: 8b1cb4a2e819 ("mm/pagewalk: fix EFI_PGT_DUMP of espfix area")
Fixes: a349d72fd9ef ("mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s")
Signed-off-by: Hugh Dickins <[email protected]>
Tested-by: Mikhail Gavrilov <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

cgroup: fix build when CGROUP_SCHED is not enabled

Sudip Mukherjee reports that the mips sb1250_swarm_defconfig build fails
with the current kernel.  It isn't actually MIPS-specific, it's just
that that defconfig does not have CGROUP_SCHED enabled like most configs
do, and as such shows this error:

  kernel/cgroup/cgroup.c: In function 'cgroup_local_stat_show':
  kernel/cgroup/cgroup.c:3699:15: error: implicit declaration of function 'cgroup_tryget_css'; did you mean 'cgroup_tryget'? [-Werror=implicit-function-declaration]
   3699 |         css = cgroup_tryget_css(cgrp, ss);
        |               ^~~~~~~~~~~~~~~~~
        |               cgroup_tryget
  kernel/cgroup/cgroup.c:3699:13: warning: assignment to 'struct cgroup_subsys_state *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
   3699 |         css = cgroup_tryget_css(cgrp, ss);
        |             ^

because cgroup_tryget_css() only exists when CGROUP_SCHED is enabled,
and the cgroup_local_stat_show() function should similarly be guarded by
that config option.

Move things around a bit to fix this all.

Fixes: d1d4ff5d11a5 ("cgroup: put cgroup_tryget_css() inside CONFIG_CGROUP_SCHED")
Reported-by: Sudip Mukherjee <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

fbdev/g364fb: fix build failure with mips

Fix the typo which resulted in the driver using FB_DEFAULT_IOMEM_HELPERS
instead of FB_DEFAULT_IOMEM_OPS as the fbdev I/O helpers.

Fixes: 501126083855 ("fbdev/g364fb: Use fbdev I/O helpers")
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Sudip Mukherjee <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

perf parse-events: Fixes relating to no_value terms

A term may have no value in which case it is assumed to have a value
of 1. It doesn't just apply to alias/event terms so change the
parse_events_term__to_strbuf assert.

Commit 99e7138eb7897aa0 ("perf tools: Fail on using multiple bits long
terms without value") made it so that no_value terms could only be for a
single bit. Prior to commit 64199ae4b8a3 ("perf parse-events: Fix
propagation of term's no_value when cloning") this missed a test case
where config1 had no_value.

Fixes: 64199ae4b8a36038 ("perf parse-events: Fix propagation of term's no_value when cloning")
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rob Herring <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

Merge branch 'fixes' into misc

ALSA: sb: Fix wrong argument in commented code

While rewriting the code from sockptr_t to iov_iter during the
development, I forgot to replace one place in emu8000-pcm code. As
it's in the disabled area (with ifdef), it's never built and
overlooked. Replace with the proper argument NULL.

Fixes: 9d0fdc602de9 ("ALSA: emu8000: Convert to generic PCM copy ops")
Reported-by: Al Viro <[email protected]>
Closes: https://lore.kernel.org/r/20230902053646.GK3390869@ZenIV
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>

ALSA: pcm: Fix error checks of default read/write copy ops

copy_from/to_iter() returns the actually copied bytes, and the more
correct check should be to compare with the given bytes, instead of
zero-check.

Fixes: cf393babb37a ("ALSA: pcm: Add copy ops with iov_iter")
Reported-by: Al Viro <[email protected]>
Closes: https://lore.kernel.org/r/20230902053044.GJ3390869@ZenIV
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>

ata: libata-core: Disable NCQ_TRIM on Micron 1100 drives

Micron 1100 drives lock up when encountering queued TRIM command. It is
a quite old hardware series, for past years we have been running our
machines with these drives using libata.force=noncqtrim.

[Damien] Move the "Crucial_CT*M500*" entry to keep Micron and Crucial
entries together.

Signed-off-by: Pawel Zmarzly <[email protected]>
Signed-off-by: Damien Le Moal <[email protected]>

ata: ahci: Add Elkhart Lake AHCI controller

Elkhart Lake is the successor of Apollo Lake and Gemini Lake. These
CPUs and their PCHs are used in mobile and embedded environments.

With this patch I suggest that Elkhart Lake SATA controllers [1] should
use the default LPM policy for mobile chipsets.
The disadvantage of missing hot-plug support with this setting should
not be an issue, as those CPUs are used in embedded environments and
not in servers with hot-plug backplanes.

We discovered that the Elkhart Lake SATA controllers have been missing
in ahci.c after a customer reported the throttling of his SATA SSD
after a short period of higher I/O. We determined the high temperature
of the SSD controller in idle mode as the root cause for that.

Depending on the used SSD, we have seen up to 1.8 Watt lower system
idle power usage and up to 30°C lower SSD controller temperatures in
our tests, when we set med_power_with_dipm manually. I have provided a
table showing seven different SATA SSDs from ATP, Intel/Solidigm and
Samsung [2].

Intel lists a total of 3 SATA controller IDs (4B60, 4B62, 4B63) in [1]
for those mobile PCHs.
This commit just adds 0x4b63 as I do not have test systems with 0x4b60
and 0x4b62 SATA controllers.
I have tested this patch with a system which uses 0x4b63 as SATA
controller.

[1] https://sata-io.org/product/8803
[2] https://www.thomas-krenn.com/en/wiki/SATA_Link_Power_Management#Example_LES_v4

Signed-off-by: Werner Fischer <[email protected]>
Cc: [email protected]
Signed-off-by: Damien Le Moal <[email protected]>

tracing/filters: Fix coding style issues

Recent commits have introduced some coding style issues, fix those up.

Link: https://lkml.kernel.org/r/[email protected]
Cc: Masami Hiramatsu <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

tracing/filters: Change parse_pred() cpulist ternary into an if block

Review comments noted that an if block would be clearer than a ternary, so
swap it out.

No change in behaviour intended

Link: https://lkml.kernel.org/r/[email protected]
Cc: Masami Hiramatsu <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

tracing/filters: Fix double-free of struct filter_pred.mask

When a cpulist filter is found to contain a single CPU, that CPU is saved
as a scalar and the backing cpumask storage is freed.

Also NULL the mask to avoid a double-free once we get down to
free_predicate().

Link: https://lkml.kernel.org/r/[email protected]
Cc: Masami Hiramatsu <[email protected]>
Reported-by: Steven Rostedt <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

tracing/filters: Fix error-handling of cpulist parsing buffer

parse_pred() allocates a string buffer to parse the user-provided cpulist,
but doesn't check the allocation result nor does it free the buffer once it
is no longer needed.

Add an allocation check, and free the buffer as soon as it is no longer
needed.

Link: https://lkml.kernel.org/r/[email protected]
Cc: Masami Hiramatsu <[email protected]>
Reported-by: Steven Rostedt <[email protected]>
Reported-by: Josh Poimboeuf <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

tracing: Zero the pipe cpumask on alloc to avoid spurious -EBUSY

The pipe cpumask used to serialize opens between the main and percpu
trace pipes is not zeroed or initialized. This can result in
spurious -EBUSY returns if underlying memory is not fully zeroed.
This has been observed by immediate failure to read the main
trace_pipe file on an otherwise newly booted and idle system:

# cat /sys/kernel/debug/tracing/trace_pipe
cat: /sys/kernel/debug/tracing/trace_pipe: Device or resource busy

Zero the allocation of pipe_cpumask to avoid the problem.

Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: [email protected]
Fixes: c2489bb7e6be ("tracing: Introduce pipe_cpumask to avoid race on trace_pipes")
Reviewed-by: Zheng Yejian <[email protected]>
Reviewed-by: Masami Hiramatsu (Google) <[email protected]>
Signed-off-by: Brian Foster <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

ftrace: Use LIST_HEAD to initialize clear_hash

Use LIST_HEAD() to initialize clear_hash instead of open-coding it.

Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Mark Rutland <[email protected]>
Signed-off-by: Ruan Jinjie <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

ftrace: Use within_module to check rec->ip within specified module.

within_module_core && within_module_init condition is same to
within module but it's more readable.

Use within_module instead of former condition to check rec->ip
within specified module area or not.

Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Signed-off-by: Levi Yun <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

tracing: Replace strlcpy with strscpy in trace/events/task.h

strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().

No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: Masami Hiramatsu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Azeem Shaikh <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

tracing: Fix race issue between cpu buffer write and swap

Warning happened in rb_end_commit() at code:
if (RB_WARN_ON(cpu_buffer, !local_read(&cpu_buffer->committing)))

  WARNING: CPU: 0 PID: 139 at kernel/trace/ring_buffer.c:3142
rb_commit+0x402/0x4a0
  Call Trace:
   ring_buffer_unlock_commit+0x42/0x250
   trace_buffer_unlock_commit_regs+0x3b/0x250
   trace_event_buffer_commit+0xe5/0x440
   trace_event_buffer_reserve+0x11c/0x150
   trace_event_raw_event_sched_switch+0x23c/0x2c0
   __traceiter_sched_switch+0x59/0x80
   __schedule+0x72b/0x1580
   schedule+0x92/0x120
   worker_thread+0xa0/0x6f0

It is because the race between writing event into cpu buffer and swapping
cpu buffer through file per_cpu/cpu0/snapshot:

  Write on CPU 0             Swap buffer by per_cpu/cpu0/snapshot on CPU 1
  --------                   --------
                             tracing_snapshot_write()
                               [...]

  ring_buffer_lock_reserve()
    cpu_buffer = buffer->buffers[cpu]; // 1. Suppose find 'cpu_buffer_a';
    [...]
    rb_reserve_next_event()
      [...]

                               ring_buffer_swap_cpu()
                                 if (local_read(&cpu_buffer_a->committing))
                                     goto out_dec;
                                 if (local_read(&cpu_buffer_b->committing))
                                     goto out_dec;
                                 buffer_a->buffers[cpu] = cpu_buffer_b;
                                 buffer_b->buffers[cpu] = cpu_buffer_a;
                                 // 2. cpu_buffer has swapped here.

      rb_start_commit(cpu_buffer);
      if (unlikely(READ_ONCE(cpu_buffer->buffer)
          != buffer)) { // 3. This check passed due to 'cpu_buffer->buffer'
        [...]           //    has not changed here.
        return NULL;
      }
                                 cpu_buffer_b->buffer = buffer_a;
                                 cpu_buffer_a->buffer = buffer_b;
                                 [...]

      // 4. Reserve event from 'cpu_buffer_a'.

  ring_buffer_unlock_commit()
    [...]
    cpu_buffer = buffer->buffers[cpu]; // 5. Now find 'cpu_buffer_b' !!!
    rb_commit(cpu_buffer)
      rb_end_commit()  // 6. WARN for the wrong 'committing' state !!!

Based on above analysis, we can easily reproduce by following testcase:
  ``` bash
  #!/bin/bash

  dmesg -n 7
  sysctl -w kernel.panic_on_warn=1
  TR=/sys/kernel/tracing
  echo 7 > ${TR}/buffer_size_kb
  echo "sched:sched_switch" > ${TR}/set_event
  while [ true ]; do
          echo 1 > ${TR}/per_cpu/cpu0/snapshot
  done &
  while [ true ]; do
          echo 1 > ${TR}/per_cpu/cpu0/snapshot
  done &
  while [ true ]; do
          echo 1 > ${TR}/per_cpu/cpu0/snapshot
  done &
  ```

To fix it, IIUC, we can use smp_call_function_single() to do the swap on
the target cpu where the buffer is located, so that above race would be
avoided.

Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: <[email protected]>
Fixes: f1affcaaa861 ("tracing: Add snapshot in the per_cpu trace directories")
Signed-off-by: Zheng Yejian <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

tracing: Remove extra space at the end of hwlat_detector/mode

Space is printed after each mode value including the last one:
$ echo \"$(sudo cat /sys/kernel/tracing/hwlat_detector/mode)\"
"none [round-robin] per-cpu "

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: Masami Hiramatsu <[email protected]>
Fixes: 8fa826b7344d ("trace/hwlat: Implement the mode config option")
Signed-off-by: Mikhail Kobuk <[email protected]>
Reviewed-by: Alexey Khoroshilov <[email protected]>
Acked-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

rv: Set variable 'da_mon_##name' to static

gcc with W=1 reports
kernel/trace/rv/monitors/wip/wip.c:20:1: sparse: sparse: symbol 'da_mon_wip' was not declared. Should it be static?

The per-cpu variable 'da_mon_##name' is only used in its defining file, so
it should be static.

Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Yu Liao <[email protected]>
Acked-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

Merge tag 'iommu-updates-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull iommu updates from Joerg Roedel:
"Core changes:

   - Consolidate probe_device path

   - Make the PCI-SAC IOVA allocation trick PCI-only

  AMD IOMMU:

   - Consolidate PPR log handling

   - Interrupt handling improvements

   - Refcount fixes for amd_iommu_v2 driver

  Intel VT-d driver:

   - Enable idxd device DMA with pasid through iommu dma ops

   - Lift RESV_DIRECT check from VT-d driver to core

   - Miscellaneous cleanups and fixes

  ARM-SMMU drivers:

   - Device-tree binding updates:
      - Add additional compatible strings for Qualcomm SoCs
      - Allow ASIDs to be configured in the DT to work around Qualcomm's
        broken hypervisor
      - Fix clocks for Qualcomm's MSM8998 SoC

   - SMMUv2:
      - Support for Qualcomm's legacy firmware implementation featured
        on at least MSM8956 and MSM8976
      - Match compatible strings for Qualcomm SM6350 and SM6375 SoC
        variants

   - SMMUv3:
      - Use 'ida' instead of a bitmap for VMID allocation

   - Rockchip IOMMU:
      - Lift page-table allocation restrictions on newer hardware

   - Mediatek IOMMU:
      - Add MT8188 IOMMU Support

   - Renesas IOMMU:
      - Allow PCIe devices

  .. and the usual set of cleanups an smaller fixes"

* tag 'iommu-updates-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (64 commits)
  iommu: Explicitly include correct DT includes
  iommu/amd: Remove unused declarations
  iommu/arm-smmu-qcom: Add SM6375 SMMUv2
  iommu/arm-smmu-qcom: Add SM6350 DPU compatible
  iommu/arm-smmu-qcom: Add SM6375 DPU compatible
  iommu/arm-smmu-qcom: Sort the compatible list alphabetically
  dt-bindings: arm-smmu: Fix MSM8998 clocks description
  iommu/vt-d: Remove unused extern declaration dmar_parse_dev_scope()
  iommu/vt-d: Fix to convert mm pfn to dma pfn
  iommu/vt-d: Fix to flush cache of PASID directory table
  iommu/vt-d: Remove rmrr check in domain attaching device path
  iommu: Prevent RESV_DIRECT devices from blocking domains
  dmaengine/idxd: Re-enable kernel workqueue under DMA API
  iommu/vt-d: Add set_dev_pasid callback for dma domain
  iommu/vt-d: Prepare for set_dev_pasid callback
  iommu/vt-d: Make prq draining code generic
  iommu/vt-d: Remove pasid_mutex
  iommu/vt-d: Add domain_flush_pasid_iotlb()
  iommu: Move global PASID allocation from SVA to core
  iommu: Generalize PASID 0 for normal DMA w/o PASID
  ...