Git Repo - linux.git/log

Merge tag 's390-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull more s390 updates from Heiko Carstens:

- Various virtual vs physical address usage fixes

- Add new bitwise types and helper functions and use them in s390
   specific drivers and code to make it easier to find virtual vs
   physical address usage bugs.

   Right now virtual and physical addresses are identical for s390,
   except for module, vmalloc, and similar areas. This will be changed,
   hopefully with the next merge window, so that e.g. the kernel image
   and modules will be located close to each other, allowing for direct
   branches and also for some other simplifications.

   As a prerequisite this requires to fix all misuses of virtual and
   physical addresses. As it turned out people are so used to the
   concept that virtual and physical addresses are the same, that new
   bugs got added to code which was already fixed. In order to avoid
   that even more code gets merged which adds such bugs add and use new
   bitwise types, so that sparse can be used to find such usage bugs.

   Most likely the new types can go away again after some time

- Provide a simple ARCH_HAS_DEBUG_VIRTUAL implementation

- Fix kprobe branch handling: if an out-of-line single stepped relative
   branch instruction has a target address within a certain address area
   in the entry code, the program check handler may incorrectly execute
   cleanup code as if KVM code was executed, leading to crashes

- Fix reference counting of zcrypt card objects

- Various other small fixes and cleanups

* tag 's390-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (41 commits)
  s390/entry: compare gmap asce to determine guest/host fault
  s390/entry: remove OUTSIDE macro
  s390/entry: add CIF_SIE flag and remove sie64a() address check
  s390/cio: use while (i--) pattern to clean up
  s390/raw3270: make class3270 constant
  s390/raw3270: improve raw3270_init() readability
  s390/tape: make tape_class constant
  s390/vmlogrdr: make vmlogrdr_class constant
  s390/vmur: make vmur_class constant
  s390/zcrypt: make zcrypt_class constant
  s390/mm: provide simple ARCH_HAS_DEBUG_VIRTUAL support
  s390/vfio_ccw_cp: use new address translation helpers
  s390/iucv: use new address translation helpers
  s390/ctcm: use new address translation helpers
  s390/lcs: use new address translation helpers
  s390/qeth: use new address translation helpers
  s390/zfcp: use new address translation helpers
  s390/tape: fix virtual vs physical address confusion
  s390/3270: use new address translation helpers
  s390/3215: use new address translation helpers
  ...

spi: docs: spidev: fix echo command format

The two example echo commands for binding the spidev driver were being
rendered as one line in the HTML output. This patch makes use of the
restructured text :: to format the commands as a code block instead
which preserves the line break.

Signed-off-by: David Lechner <[email protected]>
Link: https://msgid.link/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

tracing: Just use strcmp() for testing __string() and __assign_str() match

As __assign_str() no longer uses its "src" parameter, there's a check to
make sure nothing depends on it being different than what was passed to
__string(). It originally just compared the pointer passed to __string()
with the pointer passed into __assign_str() via the "src" parameter. But
there's a couple of outliers that just pass in a quoted string constant,
where comparing the pointers is UB to the compiler, as the compiler is
free to create multiple copies of the same string constant.

Instead, just use strcmp(). It may slow down the trace event, but this
will eventually be removed.

Also, fix the issue of passing NULL to strcmp() by adding a WARN_ON() to
make sure that both "src" and the pointer saved in __string() are either
both NULL or have content, and then checking if "src" is not NULL before
performing the strcmp().

Link: https://lore.kernel.org/all/CAHk-=wjxX16kWd=uxG5wzqt=aXoYDf1BgWOKk+qVmAO0zh7sjA@mail.gmail.com/
Fixes: b1afefa62ca9 ("tracing: Use strcmp() in __assign_str() WARN_ON() check")
Reported-by: Linus Torvalds <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge tag 'pm-6.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management updates from Rafael Wysocki:
"These update the Energy Model to make it prevent errors due to power
  unit mismatches, fix a typo in power management documentation, convert
  one driver to using a platform remove callback returning void, address
  two cpufreq issues (one in the core and one in the DT driver), and
  enable boost support in the SCMI cpufreq driver.

  Specifics:

   - Modify the Energy Model code to bail out and complain if the unit
     of power is not uW to prevent errors due to unit mismatches (Lukasz
     Luba)

   - Make the intel_rapl platform driver use a remove callback returning
     void (Uwe Kleine-König)

   - Fix typo in the suspend and interrupts document (Saravana Kannan)

   - Make per-policy boost flags actually take effect on platforms using
     cpufreq_boost_set_sw() (Sibi Sankar)

   - Enable boost support in the SCMI cpufreq driver (Sibi Sankar)

   - Make the DT cpufreq driver use zalloc_cpumask_var() for allocating
     cpumasks to avoid using unitinialized memory (Marek Szyprowski)"

* tag 'pm-6.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: scmi: Enable boost support
  firmware: arm_scmi: Add support for marking certain frequencies as turbo
  cpufreq: dt: always allocate zeroed cpumask
  cpufreq: Fix per-policy boost behavior on SoCs using cpufreq_boost_set_sw()
  Documentation: power: Fix typo in suspend and interrupts doc
  PM: EM: Force device drivers to provide power in uW
  powercap: intel_rapl: Convert to platform remove callback returning void

Merge tag 'acpi-6.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more ACPI updates from Rafael Wysocki:
"These update ACPI documentation and kerneldoc comments.

  Specifics:

   - Add markup to generate links from footnotes in the ACPI enumeration
     document (Chris Packham)

   - Update the handle_eject_request() kerneldoc comment to document the
     arguments of the function and improve kerneldoc comments for ACPI
     suspend and hibernation functions (Yang Li)"

* tag 'acpi-6.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: PM: Improve kerneldoc comments for suspend and hibernation functions
  ACPI: docs: enumeration: Make footnotes links
  ACPI: Document handle_eject_request() arguments

Merge tag 'thermal-6.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more thermal control updates from Rafael Wysocki:
"These update thermal drivers for ARM platforms by adding new hardware
  support (r8a779h0, H616 THS), addressing issues (Mediatek LVTS,
  Mediatek MT7896, thermal-of) and cleaning up code.

  Specifics:

   - Fix memory leak in the error path at probe time in the Mediatek
     LVTS driver (Christophe Jaillet)

   - Fix control buffer enablement regression on Meditek MT7896 (Frank
     Wunderlich)

   - Drop spaces before TABs in different places: thermal-of, ST drivers
     and Makefile (Geert Uytterhoeven)

   - Adjust DT binding for NXP as fsl,tmu-range min/maxItems can vary
     among several SoC versions (Fabio Estevam)

   - Add support for the H616 THS controller on Sun8i platforms (Martin
     Botka)

   - Don't fail probe due to zone registration failure because there is
     no trip points defined in the DT (Mark Brown)

   - Support variable TMU array size for new platforms (Peng Fan)

   - Adjust the DT binding for thermal-of and make the polling time not
     required and assume it is zero when not found in the DT (Konrad
     Dybcio)

   - Add r8a779h0 support in both the DT and the rcar_gen3 driver (Geert
     Uytterhoeven)"

* tag 'thermal-6.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  thermal/drivers/rcar_gen3: Add support for R-Car V4M
  dt-bindings: thermal: rcar-gen3-thermal: Add r8a779h0 support
  thermal/of: Assume polling-delay(-passive) 0 when absent
  dt-bindings: thermal-zones: Don't require polling-delay(-passive)
  thermal/drivers/qoriq: Fix getting tmu range
  thermal/drivers/sun8i: Don't fail probe due to zone registration failure
  thermal/drivers/sun8i: Add support for H616 THS controller
  thermal/drivers/sun8i: Add SRAM register access code
  thermal/drivers/sun8i: Extend H6 calibration to support 4 sensors
  thermal/drivers/sun8i: Explain unknown H6 register value
  dt-bindings: thermal: sun8i: Add H616 THS controller
  soc: sunxi: sram: export register 0 for THS on H616
  dt-bindings: thermal: qoriq-thermal: Adjust fsl,tmu-range min/maxItems
  thermal: Drop spaces before TABs
  thermal/drivers/mediatek: Fix control buffer enablement on MT7896
  thermal/drivers/mediatek/lvts_thermal: Fix a memory leak in an error handling path

Merge tag 'ata-6.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux

Pull ata fix from Niklas Cassel:
"A single fix for ASMedia HBAs.

  These HBAs do not indicate that they support SATA Port Multipliers
  CAP.SPM (Supports Port Multiplier) is not set.

  Likewise, they do not allow you to probe the devices behind an
  attached PMP, as defined according to the SATA-IO PMP specification.

  Instead, they have decided to implement their own version of PMP,
  and because of this, plugging in a PMP actually works, even if the
  HBA claims that it does not support PMP.

  Revert a recent quirk for these HBAs, as that breaks ASMedia's own
  implementation of PMP.

  Unfortunately, this will once again give some users of these HBAs
  significantly increased boot time. However, a longer boot time for
  some, is the lesser evil compared to some other users not being able
  to detect their drives at all"

* tag 'ata-6.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
  ahci: asm1064: asm1166: don't limit reported ports

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:

- Per vq sizes in vdpa

- Info query for block devices support in vdpa

- DMA sync callbacks in vduse

- Fixes, cleanups

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (35 commits)
  virtio_net: rename free_old_xmit_skbs to free_old_xmit
  virtio_net: unify the code for recycling the xmit ptr
  virtio-net: add cond_resched() to the command waiting loop
  virtio-net: convert rx mode setting to use workqueue
  virtio: packed: fix unmap leak for indirect desc table
  vDPA: report virtio-blk flush info to user space
  vDPA: report virtio-block read-only info to user space
  vDPA: report virtio-block write zeroes configuration to user space
  vDPA: report virtio-block discarding configuration to user space
  vDPA: report virtio-block topology info to user space
  vDPA: report virtio-block MQ info to user space
  vDPA: report virtio-block max segments in a request to user space
  vDPA: report virtio-block block-size to user space
  vDPA: report virtio-block max segment size to user space
  vDPA: report virtio-block capacity to user space
  virtio: make virtio_bus const
  vdpa: make vdpa_bus const
  vDPA/ifcvf: implement vdpa_config_ops.get_vq_num_min
  vDPA/ifcvf: get_max_vq_size to return max size
  virtio_vdpa: create vqs with the actual size
  ...

dm-integrity: fix a memory leak when rechecking the data

Memory for the "checksums" pointer will leak if the data is rechecked
after checksum failure (because the associated kfree won't happen due
to 'goto skip_io').

Fix this by freeing the checksums memory before recheck, and just use
the "checksum_onstack" memory for storing checksum during recheck.

Fixes: c88f5e553fe3 ("dm-integrity: recheck the integrity tag after a failure")
Signed-off-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>

Merge tag 'for-linus-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen updates from Juergen Gross:

- Xen event channel handling fix for a regression with a rare kernel
   config and some added hardening

- better support of running Xen dom0 in PVH mode

- a cleanup for the xen grant-dma-iommu driver

* tag 'for-linus-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/events: increment refcnt only if event channel is refcounted
  xen/evtchn: avoid WARN() when unbinding an event channel
  x86/xen: attempt to inflate the memory balloon on PVH
  xen/grant-dma-iommu: Convert to platform remove callback returning void

net: phy: fix phy_read_poll_timeout argument type in genphy_loopback

read_poll_timeout inside phy_read_poll_timeout can set val negative
in some cases (for example, __mdiobus_read inside phy_read can return
-EOPNOTSUPP).

Supposedly, commit 4ec732951702 ("net: phylib: fix phy_read*_poll_timeout()")
should fix problems with wrong-signed vals, but I do not see how
as val is sent to phy_read as is and __val = phy_read (not val)
is checked for sign.

Change val type for signed to allow better error handling as done in other
phy_read_poll_timeout callers. This will not fix any error handling
by itself, but allows, for example, to modify cond with appropriate
sign check or check resulting val separately.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: 014068dcb5b1 ("net: phy: genphy_loopback: add link speed configuration")
Signed-off-by: Nikita Kiryushin <[email protected]>
Reviewed-by: Russell King (Oracle) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

Merge branches 'misc' and 'fixes' into for-linus

ALSA: hda/realtek: Add quirk for HP Spectre x360 14 eu0000

Cirrus amps support for this laptop was added in patch:
33e5e648e631 ("ALSA: hda: cs35l41: Support additional HP Envy Models")

This patch adds fixes for wrong pincfgs, wrong DAC selection and
mute/micmute LEDs.

Signed-off-by: Anthony I Gilea <[email protected]>
Message-ID: <e2a7aaed-e9d7-4d36-8abf-b71dfd32a0ff@cpp.in>
Signed-off-by: Takashi Iwai <[email protected]>

net/sched: Add module alias for sch_fq_pie

The commit 2c15a5aee2f3 ("net/sched: Load modules via their alias")
starts loading modules via aliases and not canonical names. The new
aliases were added in commit 241a94abcf46 ("net/sched: Add module
aliases for cls_,sch_,act_ modules") via a Coccinele script.

sch_fq_pie.c is missing module.h header and thus Coccinele did not patch
it. Add the include and module alias manually, so that autoloading works
for sch_fq_pie too.

(Note: commit message in commit 241a94abcf46 ("net/sched: Add module
aliases for cls_,sch_,act_ modules") was mangled due to '#'
misinterpretation. The predicate haskernel is:

| @ haskernel @
| @@
|
| #include <linux/module.h>
|
.)

Fixes: 241a94abcf46 ("net/sched: Add module aliases for cls_,sch_,act_ modules")
Signed-off-by: Michal Koutný <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

can: kvaser_pciefd: Add additional Xilinx interrupts

Since Xilinx-based adapters now support up to eight CAN channels, the
TX interrupt mask array must have eight elements.

Signed-off-by: Martin Jocic <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Fixes: 9b221ba452aa ("can: kvaser_pciefd: Add support for Kvaser PCIe 8xCAN")
[mkl: replace Link by Fixes tag]
Signed-off-by: Marc Kleine-Budde <[email protected]>

ceph: set correct cap mask for getattr request for read

In case of hitting the file EOF, ceph_read_iter() needs to retrieve the
file size from MDS, and Fr caps aren't neccessary.

[ idryomov: fold into existing retry_op == READ_INLINE branch ]

Reported-by: Frank Hsiao <[email protected]>
Signed-off-by: Xiubo Li <[email protected]>
Reviewed-by: Ilya Dryomov <[email protected]>
Tested-by: Frank Hsiao <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>

ceph: stop copying to iter at EOF on sync reads

If EOF is encountered, ceph_sync_read() return value is adjusted down
according to i_size, but the "to" iter is advanced by the actual number
of bytes read. Then, when retrying, the remainder of the range may be
skipped incorrectly.

Ensure that the "to" iter is advanced only until EOF.

[ idryomov: changelog ]

Fixes: c3d8e0b5de48 ("ceph: return the real size read when it hits EOF")
Reported-by: Frank Hsiao <[email protected]>
Signed-off-by: Xiubo Li <[email protected]>
Reviewed-by: Ilya Dryomov <[email protected]>
Tested-by: Frank Hsiao <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>

nouveau/gsp: don't check devinit disable on GSP.

GSP should be handling this and I can see no evidence in opengpu
driver that this register should be touched.

Fixed acceleration on 2080 Ti GPUs.

Fixes: 15740541e8f0 ("drm/nouveau/devinit/tu102-: prepare for GSP-RM")
Signed-off-by: Dave Airlie <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

ipv4: raw: Fix sending packets from raw sockets via IPsec tunnels

Since the referenced commit, the xfrm_inner_extract_output() function
uses the protocol field to determine the address family. So not setting
it for IPv4 raw sockets meant that such packets couldn't be tunneled via
IPsec anymore.

IPv6 raw sockets are not affected as they already set the protocol since
9c9c9ad5fae7 ("ipv6: set skb->protocol on tcp, raw and ip6_append_data
genereated skbs").

Fixes: f4796398f21b ("xfrm: Remove inner/outer modes from output path")
Signed-off-by: Tobias Brunner <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Reviewed-by: Nicolas Dichtel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

hsr: Handle failures in module init

A failure during registration of the netdev notifier was not handled at
all. A failure during netlink initialization did not unregister the netdev
notifier.

Handle failures of netdev notifier registration and netlink initialization.
Both functions should only return negative values on failure and thereby
lead to the hsr module not being loaded.

Fixes: f421436a591d ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Felix Maurer <[email protected]>
Reviewed-by: Shigeru Yoshida <[email protected]>
Reviewed-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/r/3ce097c15e3f7ace98fc7fd9bcbf299f092e63d1.1710504184.git.fmaurer@redhat.com
Signed-off-by: Paolo Abeni <[email protected]>

Merge branches 'pm-em', 'pm-powercap' and 'pm-sleep'

Merge additional updates related to the Energy Model, power capping
and system-wide power management for 6.9-rc1:

- Modify the Energy Model code to bail out and complain if the unit of
   power is not uW to prevent errors due to unit mismatches (Lukasz
   Luba).

- Make the intel_rapl platform driver use a remove callback returning
   void (Uwe Kleine-König).

- Fix typo in the suspend and interrupts document (Saravana Kannan).

* pm-em:
  PM: EM: Force device drivers to provide power in uW

* pm-powercap:
  powercap: intel_rapl: Convert to platform remove callback returning void

* pm-sleep:
  Documentation: power: Fix typo in suspend and interrupts doc

fbmon: prevent division by zero in fb_videomode_from_videomode()

The expression htotal * vtotal can have a zero value on
overflow. It is necessary to prevent division by zero like in
fb_var_to_videomode().

Found by Linux Verification Center (linuxtesting.org) with Svace.

Signed-off-by: Roman Smirnov <[email protected]>
Reviewed-by: Sergey Shtylyov <[email protected]>
Signed-off-by: Helge Deller <[email protected]>

Merge branch 'acpi-docs'

Merge an ACPI documentation update for 6.9-rc1 which adds markup to
generate links from footnotes in the enumeration document.

* acpi-docs:
ACPI: docs: enumeration: Make footnotes links

usb: usb-acpi: Fix oops due to freeing uninitialized pld pointer

If reading the ACPI _PLD port location object fails, or the port
doesn't have a _PLD ACPI object then the *pld pointer will remain
uninitialized and oops when freed.

The patch that caused this is currently in next, on its way to v6.9.
So no need to add this to stable or current 6.8 kernel.

Reported-by: Klara Modin <[email protected]>
Closes: https://lore.kernel.org/linux-usb/[email protected]/
Tested-by: Klara Modin <[email protected]>
Fixes: f3ac348e6e04 ("usb: usb-acpi: Set port connect type of not connectable ports correctly")
Signed-off-by: Mathias Nyman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

exfat: remove duplicate update parent dir

For renaming, the directory only needs to be updated once if it
is in the same directory.

Signed-off-by: Yuezhang Mo <[email protected]>
Reviewed-by: Andy Wu <[email protected]>
Reviewed-by: Aoyama Wataru <[email protected]>
Reviewed-by: Sungjong Seo <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>

exfat: do not sync parent dir if just update timestamp

When sync or dir_sync is enabled, there is no need to sync the
parent directory's inode if only for updating its timestamp.

1. If an unexpected power failure occurs, the timestamp of the
   parent directory is not updated to the storage, which has no
   impact on the user.

2. The number of writes will be greatly reduced, which can not
   only improve performance, but also prolong device life.

Signed-off-by: Yuezhang Mo <[email protected]>
Reviewed-by: Andy Wu <[email protected]>
Reviewed-by: Aoyama Wataru <[email protected]>
Reviewed-by: Sungjong Seo <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>

exfat: remove unused functions

exfat_count_ext_entries() is no longer called, remove it.
exfat_update_dir_chksum() is no longer called, remove it and
rename exfat_update_dir_chksum_with_entry_set() to it.

Signed-off-by: Yuezhang Mo <[email protected]>
Reviewed-by: Andy Wu <[email protected]>
Reviewed-by: Aoyama Wataru <[email protected]>
Reviewed-by: Sungjong Seo <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>

exfat: convert exfat_find_empty_entry() to use dentry cache

Before this conversion, each dentry traversed needs to be read
from the storage device or page cache. There are at least 16
dentries in a sector. This will result in frequent page cache
searches.

After this conversion, if all directory entries in a sector are
used, the sector only needs to be read once.

Signed-off-by: Yuezhang Mo <[email protected]>
Reviewed-by: Andy Wu <[email protected]>
Reviewed-by: Aoyama Wataru <[email protected]>
Reviewed-by: Sungjong Seo <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>

exfat: convert exfat_init_ext_entry() to use dentry cache

Before this conversion, in exfat_init_ext_entry(), to init
the dentries in a dentry set, the sync times is equals the
dentry number if 'dirsync' or 'sync' is enabled.
That affects not only performance but also device life.

After this conversion, only needs to be synchronized once if
'dirsync' or 'sync' is enabled.

Signed-off-by: Yuezhang Mo <[email protected]>
Reviewed-by: Andy Wu <[email protected]>
Reviewed-by: Aoyama Wataru <[email protected]>
Reviewed-by: Sungjong Seo <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>

exfat: move free cluster out of exfat_init_ext_entry()

exfat_init_ext_entry() is an init function, it's a bit strange
to free cluster in it. And the argument 'inode' will be removed
from exfat_init_ext_entry(). So this commit changes to free the
cluster in exfat_remove_entries().

Code refinement, no functional changes.

Signed-off-by: Yuezhang Mo <[email protected]>
Reviewed-by: Andy Wu <[email protected]>
Reviewed-by: Aoyama Wataru <[email protected]>
Reviewed-by: Sungjong Seo <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>

exfat: convert exfat_remove_entries() to use dentry cache

Before this conversion, in exfat_remove_entries(), to mark the
dentries in a dentry set as deleted, the sync times is equals
the dentry numbers if 'dirsync' or 'sync' is enabled.
That affects not only performance but also device life.

After this conversion, only needs to be synchronized once if
'dirsync' or 'sync' is enabled.

Signed-off-by: Yuezhang Mo <[email protected]>
Reviewed-by: Andy Wu <[email protected]>
Reviewed-by: Aoyama Wataru <[email protected]>
Reviewed-by: Sungjong Seo <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>

exfat: convert exfat_add_entry() to use dentry cache

After this conversion, if "dirsync" or "sync" is enabled, the
number of synchronized dentries in exfat_add_entry() will change
from 2 to 1.

Signed-off-by: Yuezhang Mo <[email protected]>
Reviewed-by: Andy Wu <[email protected]>
Reviewed-by: Aoyama Wataru <[email protected]>
Reviewed-by: Sungjong Seo <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>

exfat: add exfat_get_empty_dentry_set() helper

This helper is used to lookup empty dentry set. If there are
no enough empty dentries at the input location, this helper will
return the number of dentries that need to be skipped for the
next lookup.

Signed-off-by: Yuezhang Mo <[email protected]>
Reviewed-by: Andy Wu <[email protected]>
Reviewed-by: Aoyama Wataru <[email protected]>
Reviewed-by: Sungjong Seo <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>

exfat: add __exfat_get_dentry_set() helper

Since exfat_get_dentry_set() invokes the validate functions of
exfat_validate_entry(), it only supports getting a directory
entry set of an existing file, doesn't support getting an empty
entry set.

To remove the limitation, add this helper.

Signed-off-by: Yuezhang Mo <[email protected]>
Reviewed-by: Andy Wu <[email protected]>
Reviewed-by: Aoyama Wataru <[email protected]>
Reviewed-by: Sungjong Seo <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>

rds: introduce acquire/release ordering in acquire/release_in_xmit()

acquire/release_in_xmit() work as bit lock in rds_send_xmit(), so they
are expected to ensure acquire/release memory ordering semantics.
However, test_and_set_bit/clear_bit() don't imply such semantics, on
top of this, following smp_mb__after_atomic() does not guarantee release
ordering (memory barrier actually should be placed before clear_bit()).

Instead, we use clear_bit_unlock/test_and_set_bit_lock() here.

Fixes: 0f4b1c7e89e6 ("rds: fix rds_send_xmit() serialization")
Fixes: 1f9ecd7eacfd ("RDS: Pass rds_conn_path to rds_send_xmit()")
Signed-off-by: Yewon Choi <[email protected]>
Reviewed-by: Michal Kubiak <[email protected]>
Link: https://lore.kernel.org/r/ZfQUxnNTO9AJmzwc@libra05
Signed-off-by: Paolo Abeni <[email protected]>

ahci: asm1064: asm1166: don't limit reported ports

Previously, patches have been added to limit the reported count of SATA
ports for asm1064 and asm1166 SATA controllers, as those controllers do
report more ports than physically having.

While it is allowed to report more ports than physically having in CAP.NP,
it is not allowed to report more ports than physically having in the PI
(Ports Implemented) register, which is what these HBAs do.
(This is a AHCI spec violation.)

Unfortunately, it seems that the PMP implementation in these ASMedia HBAs
is also violating the AHCI and SATA-IO PMP specification.

What these HBAs do is that they do not report that they support PMP
(CAP.SPM (Supports Port Multiplier) is not set).

Instead, they have decided to add extra "virtual" ports in the PI register
that is used if a port multiplier is connected to any of the physical
ports of the HBA.

Enumerating the devices behind the PMP as specified in the AHCI and
SATA-IO specifications, by using PMP READ and PMP WRITE commands to the
physical ports of the HBA is not possible, you have to use the "virtual"
ports.

This is of course bad, because this gives us no way to detect the device
and vendor ID of the PMP actually connected to the HBA, which means that
we can not apply the proper PMP quirks for the PMP that is connected to
the HBA.

Limiting the port map will thus stop these controllers from working with
SATA Port Multipliers.

This patch reverts both patches for asm1064 and asm1166, so old behavior
is restored and SATA PMP will work again, but it will also reintroduce the
(minutes long) extra boot time for the ASMedia controllers that do not
have a PMP connected (either on the PCIe card itself, or an external PMP).

However, a longer boot time for some, is the lesser evil compared to some
other users not being able to detect their drives at all.

Fixes: 0077a504e1a4 ("ahci: asm1166: correct count of reported ports")
Fixes: 9815e3961754 ("ahci: asm1064: correct count of reported ports")
Cc: [email protected]
Reported-by: Matt <[email protected]>
Signed-off-by: Conrad Kostecki <[email protected]>
Reviewed-by: Hans de Goede <[email protected]>
[cassel: rewrote commit message]
Signed-off-by: Niklas Cassel <[email protected]>

tools: ynl: add header guards for nlctrl

I "extracted" YNL C into a GitHub repo to make it easier
to use in other projects: https://github.com/linux-netdev/ynl-c

GitHub actions use Ubuntu by default, and the kernel headers
there are missing f329a0ebeaba ("genetlink: correct uAPI defines").
Add the direct include workaround for nlctrl.

Fixes: 768e044a5fd4 ("doc/netlink/specs: Add spec for nlctrl netlink family")
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Donald Hunter <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

Merge branch 'wireguard-fixes-for-6-9-rc1'

Jason A. Donenfeld says:

====================
wireguard fixes for 6.9-rc1

This series has four WireGuard fixes:

1) Annotate a data race that KCSAN found by using READ_ONCE/WRITE_ONCE,
   which has been causing syzkaller noise.

2) Use the generic netdev tstats allocation and stats getters instead of
   doing this within the driver.

3) Explicitly check a flag variable instead of an empty list in the
   netlink code, to prevent a UaF situation when paging through GET
   results during a remove-all SET operation.

4) Set a flag in the RISC-V CI config so the selftests continue to boot.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

wireguard: selftests: set RISCV_ISA_FALLBACK on riscv{32,64}

This option is needed to continue booting with QEMU. Recent changes that
made this optional meant that it gets unset in the test harness, and so
WireGuard CI has been broken. Fix this by simply setting this option.

Cc: [email protected]
Fixes: 496ea826d1e1 ("RISC-V: provide Kconfig & commandline options to control parsing "riscv,isa"")
Signed-off-by: Jason A. Donenfeld <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

wireguard: netlink: access device through ctx instead of peer

The previous commit fixed a bug that led to a NULL peer->device being
dereferenced. It's actually easier and faster performance-wise to
instead get the device from ctx->wg. This semantically makes more sense
too, since ctx->wg->peer_allowedips.seq is compared with
ctx->allowedips_seq, basing them both in ctx. This also acts as a
defence in depth provision against freed peers.

Cc: [email protected]
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Signed-off-by: Jason A. Donenfeld <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

wireguard: netlink: check for dangling peer via is_dead instead of empty list

If all peers are removed via wg_peer_remove_all(), rather than setting
peer_list to empty, the peer is added to a temporary list with a head on
the stack of wg_peer_remove_all(). If a netlink dump is resumed and the
cursored peer is one that has been removed via wg_peer_remove_all(), it
will iterate from that peer and then attempt to dump freed peers.

Fix this by instead checking peer->is_dead, which was explictly created
for this purpose. Also move up the device_update_lock lockdep assertion,
since reading is_dead relies on that.

It can be reproduced by a small script like:

    echo "Setting config..."
    ip link add dev wg0 type wireguard
    wg setconf wg0 /big-config
    (
            while true; do
                    echo "Showing config..."
                    wg showconf wg0 > /dev/null
            done
    ) &
    sleep 4
    wg setconf wg0 <(printf "[Peer]\nPublicKey=$(wg genkey)\n")

Resulting in:

    BUG: KASAN: slab-use-after-free in __lock_acquire+0x182a/0x1b20
    Read of size 8 at addr ffff88811956ec70 by task wg/59
    CPU: 2 PID: 59 Comm: wg Not tainted 6.8.0-rc2-debug+ #5
    Call Trace:
     <TASK>
     dump_stack_lvl+0x47/0x70
     print_address_description.constprop.0+0x2c/0x380
     print_report+0xab/0x250
     kasan_report+0xba/0xf0
     __lock_acquire+0x182a/0x1b20
     lock_acquire+0x191/0x4b0
     down_read+0x80/0x440
     get_peer+0x140/0xcb0
     wg_get_device_dump+0x471/0x1130

Cc: [email protected]
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Reported-by: Lillian Berry <[email protected]>
Signed-off-by: Jason A. Donenfeld <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

wireguard: device: remove generic .ndo_get_stats64

Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is
configured") moved the callback to dev_get_tstats64() to net core, so,
unless the driver is doing some custom stats collection, it does not
need to set .ndo_get_stats64.

Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it
doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64
function pointer.

Signed-off-by: Breno Leitao <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Jason A. Donenfeld <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

wireguard: device: leverage core stats allocator

With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core
and convert veth & vrf"), stats allocation could be done on net core
instead of in this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Remove the allocation in this driver and leverage the network core
allocation instead.

Signed-off-by: Breno Leitao <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Jason A. Donenfeld <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

wireguard: receive: annotate data-race around receiving_counter.counter

Syzkaller with KCSAN identified a data-race issue when accessing
keypair->receiving_counter.counter. Use READ_ONCE() and WRITE_ONCE()
annotations to mark the data race as intentional.

    BUG: KCSAN: data-race in wg_packet_decrypt_worker / wg_packet_rx_poll

    write to 0xffff888107765888 of 8 bytes by interrupt on cpu 0:
     counter_validate drivers/net/wireguard/receive.c:321 [inline]
     wg_packet_rx_poll+0x3ac/0xf00 drivers/net/wireguard/receive.c:461
     __napi_poll+0x60/0x3b0 net/core/dev.c:6536
     napi_poll net/core/dev.c:6605 [inline]
     net_rx_action+0x32b/0x750 net/core/dev.c:6738
     __do_softirq+0xc4/0x279 kernel/softirq.c:553
     do_softirq+0x5e/0x90 kernel/softirq.c:454
     __local_bh_enable_ip+0x64/0x70 kernel/softirq.c:381
     __raw_spin_unlock_bh include/linux/spinlock_api_smp.h:167 [inline]
     _raw_spin_unlock_bh+0x36/0x40 kernel/locking/spinlock.c:210
     spin_unlock_bh include/linux/spinlock.h:396 [inline]
     ptr_ring_consume_bh include/linux/ptr_ring.h:367 [inline]
     wg_packet_decrypt_worker+0x6c5/0x700 drivers/net/wireguard/receive.c:499
     process_one_work kernel/workqueue.c:2633 [inline]
     ...

    read to 0xffff888107765888 of 8 bytes by task 3196 on cpu 1:
     decrypt_packet drivers/net/wireguard/receive.c:252 [inline]
     wg_packet_decrypt_worker+0x220/0x700 drivers/net/wireguard/receive.c:501
     process_one_work kernel/workqueue.c:2633 [inline]
     process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2706
     worker_thread+0x525/0x730 kernel/workqueue.c:2787
     ...

Fixes: a9e90d9931f3 ("wireguard: noise: separate receive counter from send counter")
Reported-by: [email protected]
Signed-off-by: Nikita Zhandarovich <[email protected]>
Signed-off-by: Jason A. Donenfeld <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

net: move dev->state into net_device_read_txrx group

dev->state can be read in rx and tx fast paths.

netif_running() which needs dev->state is called from
- enqueue_to_backlog() [RX path]
- __dev_direct_xmit() [TX path]

Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Coco Li <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

timers: Fix removed self-IPI on global timer's enqueue in nohz_full

While running in nohz_full mode, a task may enqueue a timer while the
tick is stopped. However the only places where the timer wheel,
alongside the timer migration machinery's decision, may reprogram the
next event accordingly with that new timer's expiry are the idle loop or
any IRQ tail.

However neither the idle task nor an interrupt may run on the CPU if it
resumes busy work in userspace for a long while in full dynticks mode.

To solve this, the timer enqueue path raises a self-IPI that will
re-evaluate the timer wheel on its IRQ tail. This asynchronous solution
avoids potential locking inversion.

This is supposed to happen both for local and global timers but commit:

b2cf7507e186 ("timers: Always queue timers on the local CPU")

broke the global timers case with removing the ->is_idle field handling
for the global base. As a result, global timers enqueue may go unnoticed
in nohz_full.

Fix this with restoring the idle tracking of the global timer's base,
allowing self-IPIs again on enqueue time.

Fixes: b2cf7507e186 ("timers: Always queue timers on the local CPU")
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

timers/migration: Fix endless timer requeue after idle interrupts

When a CPU is an idle migrator, but another CPU wakes up before it,
becomes an active migrator and handles the queue, the initial idle
migrator may end up endlessly reprogramming its clockevent, chasing ghost
timers forever such as in the following scenario:

               [GRP0:0]
             migrator = 0
             active   = 0
             nextevt  = T1
              /         \
             0           1
          active        idle (T1)

0) CPU 1 is idle and has a timer queued (T1), CPU 0 is active and is
the active migrator.

               [GRP0:0]
             migrator = NONE
             active   = NONE
             nextevt  = T1
              /         \
             0           1
          idle        idle (T1)
          wakeup = T1

1) CPU 0 is now idle and is therefore the idle migrator. It has
programmed its next timer interrupt to handle T1.

                [GRP0:0]
             migrator = 1
             active   = 1
             nextevt  = KTIME_MAX
              /         \
             0           1
          idle        active
          wakeup = T1

2) CPU 1 has woken up, it is now active and it has just handled its own
timer T1.

3) CPU 0 gets a timer interrupt to handle T1 but tmigr_handle_remote()
realize it is not the migrator anymore. So it early returns without
observing that T1 has been expired already and therefore without
updating its ->wakeup value.

4) CPU 0 goes into tmigr_cpu_new_timer() which also early returns
because it doesn't queue a timer of its own. So ->wakeup is left
unchanged and the next timer is programmed to fire now.

5) goto 3) forever

This results in timer interrupt storms in idle and also in nohz_full (as
observed in rcutorture's TREE07 scenario).

Fix this with forcing a re-evaluation of tmc->wakeup while trying
remote timer handling when the CPU isn't the migrator anymmore. The
check is inherently racy but in the worst case the CPU just races setting
the KTIME_MAX value that a remote expiry also tries to set.

Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

LoongArch/crypto: Clean up useless assignment operations

The LoongArch CRC32 hw acceleration is based on arch/mips/crypto/
crc32-mips.c. While the MIPS code supports both MIPS32 and MIPS64,
but LoongArch32 lacks the CRC instruction. As a result, the line
"len -= sizeof(u32)" is unnecessary.

Removing it can make context code style more unified and improve
code readability.

Cc: [email protected]
Reviewed-by: WANG Xuerui <[email protected]>
Suggested-by: Wentao Guan <[email protected]>
Signed-off-by: Yuli Wang <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>

LoongArch: Define the __io_aw() hook as mmiowb()

Commit fb24ea52f78e0d595852e ("drivers: Remove explicit invocations of
mmiowb()") remove all mmiowb() in drivers, but it says:

"NOTE: mmiowb() has only ever guaranteed ordering in conjunction with
spin_unlock(). However, pairing each mmiowb() removal in this patch with
the corresponding call to spin_unlock() is not at all trivial, so there
is a small chance that this change may regress any drivers incorrectly
relying on mmiowb() to order MMIO writes between CPUs using lock-free
synchronisation."

The mmio in radeon_ring_commit() is protected by a mutex rather than a
spinlock, but in the mutex fastpath it behaves similar to spinlock. We
can add mmiowb() calls in the radeon driver but the maintainer says he
doesn't like such a workaround, and radeon is not the only example of
mutex protected mmio.

So we should extend the mmiowb tracking system from spinlock to mutex,
and maybe other locking primitives. This is not easy and error prone, so
we solve it in the architectural code, by simply defining the __io_aw()
hook as mmiowb(). And we no longer need to override queued_spin_unlock()
so use the generic definition.

Without this, we get such an error when run 'glxgears' on weak ordering
architectures such as LoongArch:

radeon 0000:04:00.0: ring 0 stalled for more than 10324msec
radeon 0000:04:00.0: ring 3 stalled for more than 10240msec
radeon 0000:04:00.0: GPU lockup (current fence id 0x000000000001f412 last fence id 0x000000000001f414 on ring 3)
radeon 0000:04:00.0: GPU lockup (current fence id 0x000000000000f940 last fence id 0x000000000000f941 on ring 0)
radeon 0000:04:00.0: scheduling IB failed (-35).
[drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-35)
radeon 0000:04:00.0: scheduling IB failed (-35).
[drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-35)
radeon 0000:04:00.0: scheduling IB failed (-35).
[drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-35)
radeon 0000:04:00.0: scheduling IB failed (-35).
[drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-35)
radeon 0000:04:00.0: scheduling IB failed (-35).
[drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-35)
radeon 0000:04:00.0: scheduling IB failed (-35).
[drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-35)
radeon 0000:04:00.0: scheduling IB failed (-35).
[drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-35)

Link: https://lore.kernel.org/dri-devel/[email protected]/T/#t
Link: https://lore.kernel.org/linux-arch/[email protected]/T/#t
Cc: [email protected]
Signed-off-by: Huacai Chen <[email protected]>

LoongArch: Remove superfluous flush_dcache_page() definition

LoongArch doesn't have cache aliases, so flush_dcache_page() is a no-op.
There is a generic implementation for this case in include/asm-generic/
cacheflush.h. So remove the superfluous flush_dcache_page() definition,
which also silences such build warnings:

   In file included from crypto/scompress.c:12:
   include/crypto/scatterwalk.h: In function 'scatterwalk_pagedone':
   include/crypto/scatterwalk.h:76:30: warning: variable 'page' set but not used [-Wunused-but-set-variable]
      76 |                 struct page *page;
         |                              ^~~~
   crypto/scompress.c: In function 'scomp_acomp_comp_decomp':
>> crypto/scompress.c:174:38: warning: unused variable 'dst_page' [-Wunused-variable]
     174 |                         struct page *dst_page = sg_page(req->dst);
         |

Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Suggested-by: Barry Song <[email protected]>
Acked-by: Barry Song <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>

LoongArch: Move {dmw,tlb}_virt_to_page() definition to page.h

These two functions are implemented in pgtable.c, and they are needed
only by the virt_to_page() macro in page.h. Having the prototypes in
pgtable.h causes a circular dependency between page.h and pgtable.h,
because the virt_to_page() macro in page.h needs pgtable.h for these
two functions, while pgtable.h needs various definitions from page.h
(e.g. pte_t and pgt_t).

Let's avoid this circular dependency by moving the function prototypes
to page.h.

Signed-off-by: Max Kellermann <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>

LoongArch: Change __my_cpu_offset definition to avoid mis-optimization

From GCC commit 3f13154553f8546a ("df-scan: remove ad-hoc handling of
global regs in asms"), global registers will no longer be forced to add
to the def-use chain. Then current_thread_info(), current_stack_pointer
and __my_cpu_offset may be lifted out of the loop because they are no
longer treated as "volatile variables".

This optimization is still correct for the current_thread_info() and
current_stack_pointer usages because they are associated to a thread.
However it is wrong for __my_cpu_offset because it is associated to a
CPU rather than a thread: if the thread migrates to a different CPU in
the loop, __my_cpu_offset should be changed.

Change __my_cpu_offset definition to treat it as a "volatile variable",
in order to avoid such a mis-optimization.

Cc: [email protected]
Reported-by: Xiaotian Wu <[email protected]>
Reported-by: Miao Wang <[email protected]>
Signed-off-by: Xing Li <[email protected]>
Signed-off-by: Hongchen Zhang <[email protected]>
Signed-off-by: Rui Wang <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>

LoongArch: Select HAVE_ARCH_USERFAULTFD_MINOR in Kconfig

This allocates the VM flag needed to support the userfaultfd minor fault
functionality. See commit 7677f7fd8be7665 ("userfaultfd: add minor fault
registration mode") for more information.

Signed-off-by: Huacai Chen <[email protected]>

LoongArch: Select ARCH_HAS_CURRENT_STACK_POINTER in Kconfig

LoongArch has implemented the current_stack_pointer macro, so select
ARCH_HAS_CURRENT_STACK_POINTER in Kconfig. This will let it be used in
non-arch places (like HARDENED_USERCOPY).

Reviewed-by: Guo Ren <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>

virtio_net: rename free_old_xmit_skbs to free_old_xmit

Since free_old_xmit_skbs not only deals with skb, but also xdp frame and
subsequent added xsk, so change the name of this function to
free_old_xmit.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20240229072044 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_net: unify the code for recycling the xmit ptr

There are two completely similar and independent implementations. This
is inconvenient for the subsequent addition of new types. So extract a
function from this piece of code and call this function uniformly to
recover old xmit ptr.

Signed-off-by: Xuan Zhuo <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <20240229072044 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio-net: add cond_resched() to the command waiting loop

Adding cond_resched() to the command waiting loop for a better
co-operation with the scheduler. This allows to give CPU a breath to
run other task(workqueue) instead of busy looping when preemption is
not allowed on a device whose CVQ might be slow.

Signed-off-by: Jason Wang <[email protected]>
Message-Id: <20230720083839 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Shannon Nelson <[email protected]>

virtio-net: convert rx mode setting to use workqueue

This patch convert rx mode setting to be done in a workqueue, this is
a must for allow to sleep when waiting for the cvq command to
response since current code is executed under addr spin lock.

Note that we need to disable and flush the workqueue during freeze,
this means the rx mode setting is lost after resuming. This is not the
bug of this patch as we never try to restore rx mode setting during
resume.

Signed-off-by: Jason Wang <[email protected]>
Message-Id: <20230720083839 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Shannon Nelson <[email protected]>

virtio: packed: fix unmap leak for indirect desc table

When use_dma_api and premapped are true, then the do_unmap is false.

Because the do_unmap is false, vring_unmap_extra_packed is not called by
detach_buf_packed.

  if (unlikely(vq->do_unmap)) {
                curr = id;
                for (i = 0; i < state->num; i++) {
                        vring_unmap_extra_packed(vq,
                                                 &vq->packed.desc_extra[curr]);
                        curr = vq->packed.desc_extra[curr].next;
                }
  }

So the indirect desc table is not unmapped. This causes the unmap leak.

So here, we check vq->use_dma_api instead. Synchronously, dma info is
updated based on use_dma_api judgment

This bug does not occur, because no driver use the premapped with
indirect.

Fixes: b319940f83c2 ("virtio_ring: skip unmap for premapped")
Signed-off-by: Xuan Zhuo <[email protected]>
Message-Id: <20240223071833 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vDPA: report virtio-blk flush info to user space

This commit reports whether a virtio-blk device
support cache flush command to user space

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240218185606 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vDPA: report virtio-block read-only info to user space

This commit report read-only information of
virtio-blk devices to user space.

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240218185606 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vDPA: report virtio-block write zeroes configuration to user space

This commits reports write zeroes configuration of
virtio-block devices to user space, includes:
1)maximum write zeroes sectors size
2)maximum write zeroes segment number

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240218185606 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vDPA: report virtio-block discarding configuration to user space

This commit reports virtio-blk discarding configuration
to user space,includes:
1) the maximum discard sectors
2) maximum number of discard segments for the block driver to use
3) the alignment for splitting a discarding request

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240218185606 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vDPA: report virtio-block topology info to user space

This commit allows vDPA reporting topology information of
virtio-blk devices to user space, includes:
1) the number of logical blocks per physical block
2) offset of first aligned logical block
3) suggested minimum I/O size in blocks
4) optimal (suggested maximum) I/O size in blocks

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240218185606 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vDPA: report virtio-block MQ info to user space

This commits allows vDPA reporting virtio-block multi-queue
configuration to user sapce.

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240218185606 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vDPA: report virtio-block max segments in a request to user space

This commit allows vDPA reporting the maximum number of
segments in a request of virtio-block devices to
user space.

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240218185606 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vDPA: report virtio-block block-size to user space

This commit allows reporting the block size of a
virtio-block device to user space.

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240218185606 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vDPA: report virtio-block max segment size to user space

This commit allows reporting the max size of any
single segment of virtio-block devices to user space.

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240218185606 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vDPA: report virtio-block capacity to user space

This commit allows userspace to query capacity of
a virtio-block device.

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240218185606 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio: make virtio_bus const

Now that the driver core can properly handle constant struct bus_type,
move the virtio_bus variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.

Cc: Greg Kroah-Hartman <[email protected]>
Suggested-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Ricardo B. Marliere <[email protected]>
Message-Id: <20240204-bus_cleanup-virtio-v1-1-3bcb2212aaa0@marliere.net>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Acked-by: Jason Wang <[email protected]>

vdpa: make vdpa_bus const

Now that the driver core can properly handle constant struct bus_type,
move the vdpa_bus variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.

Cc: Greg Kroah-Hartman <[email protected]>
Suggested-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Ricardo B. Marliere <[email protected]>
Message-Id: <20240204-bus_cleanup-vdpa-v1-1-1745eccb0a5c@marliere.net>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>

vDPA/ifcvf: implement vdpa_config_ops.get_vq_num_min

IFCVF HW supports operation with vq size less than the max size,
as the spec required.

This commit implements vdpa_config_ops.get_vq_num_min to report
the minimal size of the virtqueues, which gives vDPA framework
a chance to reduce the vring size.

We need at least one descriptor to be functional, but it is better
no less than 64 to meet ceratin performance requirements.
Actually the framework would allocate at least a PAGE for the vq.

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240202163905 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vDPA/ifcvf: get_max_vq_size to return max size

Since we already implemented vdpa_config_ops.get_vq_size,
so get_max_vq_size can return the acutal max size of the
virtqueues other than the max allowed safe size.

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240202163905 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_vdpa: create vqs with the actual size

The size of a virtqueue is a per vq configuration,
this commit allows virtio_vdpa to create
virtqueues with the actual size of a specific
vq size that supported by the backend device.

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240202163905 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vduse: implement vdpa_config_ops.get_vq_size for vduse

This commit implements get_vq_size for vdpa_config_ops. This
new interface is used to report per vq size.

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240202163905 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vdpa_sim: implement vdpa_config_ops.get_vq_size for vDPA simulator

This commit implements vdpa_config_ops.get_vq_size for vDPA
simulator, this new interface can help report per vq size.

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240202163905 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

eni_vdpa: implement vdpa_config_ops.get_vq_size

This commit implements get_vq_size which report
per vq size in vdpa_config_ops

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240202163905 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vp_vdpa: implement vdpa_config_ops.get_vq_size

This commit implements vdpa_config_ops.get_vq_size in
vp_vdpa, which reports per virtqueue size.

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240202163905 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vDPA/ifcvf: implement vdpa_config_ops.get_vq_size

This commit implements vdpa_ops.get_vq_size to report
the size of a specific virtqueue.

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240202163905 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vDPA: introduce get_vq_size to vdpa_config_ops

This commit introduces a new interface get_vq_size to
vDPA config ops, this new interface intends to report
the size of a specific virtqueue

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240202163905 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vhost-vdpa: uapi to support reporting per vq size

The size of a virtqueue is a per vq configuration.
This commit introduce a new ioctl uAPI to support this flexibility.

Signed-off-by: Zhu Lingshan <[email protected]>
Message-Id: <20240202163905 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vdpa/pds: fixes for VF vdpa flr-aer handling

This addresses a couple of things found while testing the FLR and AER
handling with the VFs.
- release irqs before calling vp_modern_remove()
- make sure we have a valid struct pointer before using it to release irqs
- make sure the FW is alive before trying to add a new device

Signed-off-by: Shannon Nelson <[email protected]>
Message-Id: <20240220011050 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vduse: implement DMA sync callbacks

Since commit 295525e29a5b ("virtio_net: merge dma
operations when filling mergeable buffers"), VDUSE device
require support for DMA's .sync_single_for_cpu() operation
as the memory is non-coherent between the device and CPU
because of the use of a bounce buffer.

This patch implements both .sync_single_for_cpu() and
.sync_single_for_device() callbacks, and also skip bounce
buffer copies during DMA map and unmap operations if the
DMA_ATTR_SKIP_CPU_SYNC attribute is set to avoid extra
copies of the same buffer.

Signed-off-by: Maxime Coquelin <[email protected]>
Message-Id: <20240219170606 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vdpa/mlx5: Allow CVQ size changes

The MLX driver was not updating its control virtqueue size at set_vq_num
and instead always initialized to MLX5_CVQ_MAX_ENT (16) at
setup_cvq_vring.

Qemu would try to set the size to 64 by default, however, because the
CVQ size always was initialized to 16, an error would be thrown when
sending >16 control messages (as used-ring entry 17 is initialized to 0).
For example, starting a guest with x-svq=on and then executing the
following command would produce the error below:

# for i in {1..20}; do ifconfig eth0 hw ether XX:xx:XX:xx:XX:XX; done

qemu-system-x86_64: Insufficient written data (0)
[ 435.331223] virtio_net virtio0: Failed to set mac address by vq command.
SIOCSIFHWADDR: Invalid argument

Acked-by: Dragos Tatulea <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
Signed-off-by: Jonah Palmer <[email protected]>
Message-Id: <20240216142502 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Tested-by: Lei Yang <[email protected]>
Fixes: 5262912ef3cf ("vdpa/mlx5: Add support for control VQ and MAC setting")

vdpa: skip suspend/resume ops if not DRIVER_OK

If a vdpa device is not in state DRIVER_OK, then there is no driver state
to preserve, so no need to call the suspend and resume driver ops.

Suggested-by: Eugenio Perez Martin <[email protected]>"
Signed-off-by: Steve Sistare <[email protected]>
Message-Id: <1707834358 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Eugenio Pérez <[email protected]>

virtio: reenable config if freezing device failed

Currently, we don't reenable the config if freezing the device failed.

For example, virtio-mem currently doesn't support suspend+resume, and
trying to freeze the device will always fail. Afterwards, the device
will no longer respond to resize requests, because it won't get notified
about config changes.

Let's fix this by re-enabling the config if freezing fails.

Fixes: 22b7050a024d ("virtio: defer config changed notifications")
Cc: <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Xuan Zhuo <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Message-Id: <20240213135425 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vdpa_sim: reset must not run

vdpasim_do_reset sets running to true, which is wrong, as it allows
vdpasim_kick_vq to post work requests before the device has been
configured. To fix, do not set running until VIRTIO_CONFIG_S_DRIVER_OK
is set.

Fixes: 0c89e2a3a9d0 ("vdpa_sim: Implement suspend vdpa op")
Signed-off-by: Steve Sistare <[email protected]>
Reviewed-by: Eugenio Pérez <[email protected]>
Acked-by: Jason Wang <[email protected]>
Message-Id: <1707517807 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio: uapi: Drop __packed attribute in linux/virtio_pci.h

Commit 92792ac752aa ("virtio-pci: Introduce admin command sending function")
added "__packed" structures to UAPI header linux/virtio_pci.h. This triggers
build failures in the consumer userspace applications without proper "definition"
of __packed (e.g., kvmtool build fails).

Moreover, the structures are already packed well, and doesn't need explicit
packing, similar to the rest of the structures in all virtio_* headers. Remove
the __packed attribute.

Fixes: 92792ac752aa ("virtio-pci: Introduce admin command sending function")
Cc: Feng Liu <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Yishai Hadas <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Jean-Philippe Brucker <[email protected]>
Reviewed-by: Jean-Philippe Brucker <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
Message-Id: <20240125232039 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

vhost: Added pad cleanup if vnet_hdr is not present.

When the Qemu launched with vhost but without tap vnet_hdr,
vhost tries to copy vnet_hdr from socket iter with size 0
to the page that may contain some trash.
That trash can be interpreted as unpredictable values for
vnet_hdr.
That leads to dropping some packets and in some cases to
stalling vhost routine when the vhost_net tries to process
packets and fails in a loop.

Qemu options:
-netdev tap,vhost=on,vnet_hdr=off,...

Signed-off-by: Andrew Melnychenko <[email protected]>
Message-Id: <20240115194840.1183077 [email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

Merge tag 'drm-misc-next-fixes-2024-03-14' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next

Short summary of fixes pull:

probe-helper:
- never return negative values from .get_modes() plus driver fixes

nouveau:
- clear bo resource bus after eviction
- documentation fixes

Signed-off-by: Dave Airlie <[email protected]>
From: Thomas Zimmermann <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

bcachefs: Fix lost wakeup on journal shutdown

We need to check for journal shutdown first in __journal_res_get() -
after the journal is shutdown, j->watermark won't be changing anymore.

Signed-off-by: Kent Overstreet <[email protected]>

bcachefs; Fix deadlock in bch2_btree_update_start()

BCH_TRANS_COMMIT_journal_reclaim with watermark != BCH_WATERMARK_reclaim
means nonblocking, and we need the journal_res_get() in
btree_update_start() to respect that.

In a future refactoring we'll be deleting
BCH_TRANS_COMMIT_journal_reclaim and replacing it with an explicit
BCH_TRANS_COMMIT_nonblocking.

Signed-off-by: Kent Overstreet <[email protected]>

io_uring/sqpoll: early exit thread if task_context wasn't allocated

Ideally we'd want to simply kill the task rather than wake it, but for
now let's just add a startup check that causes the thread to exit.
This can only happen if io_uring_alloc_task_context() fails, which
generally requires fault injection.

Reported-by: Ubisectech Sirius <[email protected]>
Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Jens Axboe <[email protected]>

ksmbd: remove module version

ksmbd module version marking is not needed. Since there is a
Linux kernel version, there is no point in increasing it anymore.

Signed-off-by: Namjae Jeon <[email protected]>
Signed-off-by: Steve French <[email protected]>

ksmbd: fix potencial out-of-bounds when buffer offset is invalid

I found potencial out-of-bounds when buffer offset fields of a few requests
is invalid. This patch set the minimum value of buffer offset field to
->Buffer offset to validate buffer length.

Cc: [email protected]
Signed-off-by: Namjae Jeon <[email protected]>
Signed-off-by: Steve French <[email protected]>

floppy: remove duplicated code in redo_fd_request()

duplicated code in redo_fd_request(),
unlock_fdc() function has the same code "do_floppy = NULL" inside.

Signed-off-by: Yufeng Wang <[email protected]>
Suggested-by: Denis Efremov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

Merge tag 'drm-xe-next-fixes-2024-03-14' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-next

Driver changes:

- Invalidate userptr VMA on page pin fault, allowing userspace
to free userptr while still having bindings
- Fail early on sysfs file creation error
- Skip VMA pinning on xe_exec with num_batch_buffer == 0

Signed-off-by: Dave Airlie <[email protected]>
From: Lucas De Marchi <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/c4epi2j6anpc77z73zbgibxg7bxsmmkb522aa7tyei6oa6uunn@3oad4cgomd5a

drm: Fix drm_fixp2int_round() making it add 0.5

As well noted by Pekka[1], the rounding of drm_fixp2int_round is wrong.
To round a number, you need to add 0.5 to the number and floor that,
drm_fixp2int_round() is adding 0.0000076. Make it add 0.5.

[1]: https://lore.kernel.org/all/20240301135327.22efe0dd [email protected]/

Fixes: 8b25320887d7 ("drm: Add fixed-point helper to get rounded integer values")
Suggested-by: Pekka Paalanen <[email protected]>
Reviewed-by: Harry Wentland <[email protected]>
Reviewed-by: Melissa Wen <[email protected]>
Signed-off-by: Arthur Grillo <[email protected]>
Signed-off-by: Melissa Wen <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

Merge tag 'dlm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:

- Fix mistaken variable assignment that caused a refcounting problem

- Revert a recent change that began using atomic counters where they
   were not needed (for lkb wait_count)

- Add comments around forced state reset for waiting lock operations
   during recovery

* tag 'dlm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: add comments about forced waiters reset
  dlm: revert atomic_t lkb_wait_count
  dlm: fix user space lkb refcounting

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
"Very small update this cycle:

   - Minor code improvements in fi, rxe, ipoib, mana, cxgb4, mlx5,
     irdma, rxe, rtrs, mana

   - Simplify the hns hem mechanism

   - Fix EFA's MSI-X allocation in resource constrained configurations

   - Fix a KASN splat in srpt

   - Narrow hns's congestion control selection to QPs granularity and
     allow userspace to select it

   - Solve a parallel module loading race between the CM module and a
     driver module

   - Flexible array cleanup

   - Dump hns's SCC Conext to 'rdma res' for debugging

   - Make mana build page lists for HW objects that require a 0 offset
     correctly

   - Stuck CM ID debugging"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (29 commits)
  RDMA/cm: add timeout to cm_destroy_id wait
  RDMA/mana_ib: Use virtual address in dma regions for MRs
  RDMA/mana_ib: Fix bug in creation of dma regions
  RDMA/hns: Append SCC context to the raw dump of QPC
  RDMA/uverbs: Avoid -Wflex-array-member-not-at-end warnings
  RDMA/hns: Support userspace configuring congestion control algorithm with QP granularity
  RDMA/rtrs-clt: Check strnlen return len in sysfs mpath_policy_store()
  RDMA/uverbs: Remove flexible arrays from struct *_filter
  RDMA/device: Fix a race between mad_client and cm_client init
  RDMA/hns: Fix mis-modifying default congestion control algorithm
  RDMA/rxe: Remove unused 'iova' parameter from rxe_mr_init_user
  RDMA/srpt: Do not register event handler until srpt device is fully setup
  RDMA/irdma: Remove duplicate assignment
  RDMA/efa: Limit EQs to available MSI-X vectors
  RDMA/mlx5: Delete unused mlx5_ib_copy_pas prototype
  RDMA/cxgb4: Delete unused c4iw_ep_redirect prototype
  RDMA/mana_ib: Introduce mana_ib_install_cq_cb helper function
  RDMA/mana_ib: Introduce mana_ib_get_netdev helper function
  RDMA/mana_ib: Introduce mdev_to_gc helper function
  RDMA/hns: Simplify 'struct hns_roce_hem' allocation
  ...