Merge tag 'bitmap-for-6.12' of https://github.com/norov/linux
Pull bitmap updates from Yury Norov:
- switch all bitmamp APIs from inline to __always_inline (Brian Norris)
The __always_inline series improves on code generation, and now with
the latest compiler versions is required to avoid compilation
warnings. It spent enough in my backlog, and I'm thankful to Brian
Norris for taking over and moving it forward.
GENMASK_U128() is a prerequisite needed for arm64 development
* tag 'bitmap-for-6.12' of https://github.com/norov/linux:
lib/test_bits.c: Add tests for GENMASK_U128()
uapi: Define GENMASK_U128
nodemask: Switch from inline to __always_inline
cpumask: Switch from inline to __always_inline
bitmap: Switch from inline to __always_inline
find: Switch from inline to __always_inline
Merge tag 'tomoyo-pr-20240927' of git://git.code.sf.net/p/tomoyo/tomoyo
Pull tomoyo updates from Tetsuo Handa:
"One bugfix patch, one preparation patch, and one conversion patch.
TOMOYO is useful as an analysis tool for learning how a Linux system
works. My boss was hoping that SELinux's policy is generated from what
TOMOYO has observed. A translated paper describing it is available at
Although that attempt failed due to mapping problem between inode and
pathname, TOMOYO remains as an access restriction tool due to ability
to write custom policy by individuals.
I was delivering pure LKM version of TOMOYO (named AKARI) to users who
cannot afford rebuilding their distro kernels with TOMOYO enabled. But
since the LSM framework was converted to static calls, it became more
difficult to deliver AKARI to such users. Therefore, I decided to
update TOMOYO so that people can use mostly LKM version of TOMOYO with
minimal burden for both distributors and users"
* tag 'tomoyo-pr-20240927' of git://git.code.sf.net/p/tomoyo/tomoyo:
tomoyo: fallback to realpath if symlink's pathname does not exist
tomoyo: allow building as a loadable LSM module
tomoyo: preparation step for building as a loadable LSM module
Merge tag 'cxl-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull compute express link (cxl) updates from Dave Jiang:
"Major changes address HDM decoder initialization from DVSEC ranges,
refactoring the code related to cxl mailboxes to be independent of the
memory devices, and adding support for shared upstream link
access_coordinate calculation, as well as a change to remove locking
from memory notifier callback.
In addition, a number of misc cleanups and refactoring of the code are
also included.
Address HDM decoder initialization from DVSEC ranges:
- Only register non-zero DVSEC ranges
- Remove duplicate implementation of waiting for memory_info_valid
- Simplify the checking of mem_enabled in cxl_hdm_decode_init()
Refactor the code related to cxl mailboxes to be independent of the memory devices:
- Move cxl headers in include/linux/ to include/cxl
- Move all mailbox related data to 'struct cxl_mailbox'
- Refactor mailbox APIs with 'struct cxl_mailbox' as input instead of
memory device state
Add support for shared upstream link access_coordinate calculation for
configurations that have multiple targets under a switch or a root
port where the aggregated bandwidth can be greater than the upstream
link of the switch/RP upstream link:
- Preserve the CDAT access_coordinate from an endpoint
- Add the support for shared upstream link access_coordinate calculation
- Add documentation to explain how the calculations are done
Remove locking from memory notifier callback.
Misc cleanups:
- Convert devm_cxl_add_root() to return using ERR_CAST()
- cxl_test use dev_is_platform() instead of open coding
- Remove duplicate include of header core.h in core/cdat.c
- use scoped resource management to drop put_device() for cxl_port
- Use scoped_guard to drop device_lock() for cxl_port
- Refactor __devm_cxl_add_port() to drop gotos
- Rename cxl_setup_parent_dport to cxl_dport_init_aer and
cxl_dport_map_regs() to cxl_dport_map_ras()
- Refactor cxl_dport_init_aer() to be more concise
- Remove duplicate host_bridge->native_aer checking in
cxl_dport_init_ras_reporting()
- Fix comment for cxl_query_cmd()"
* tag 'cxl-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (21 commits)
cxl: Add documentation to explain the shared link bandwidth calculation
cxl: Calculate region bandwidth of targets with shared upstream link
cxl: Preserve the CDAT access_coordinate for an endpoint
cxl: Fix comment regarding cxl_query_cmd() return data
cxl: Convert cxl_internal_send_cmd() to use 'struct cxl_mailbox' as input
cxl: Move mailbox related bits to the same context
cxl: move cxl headers to new include/cxl/ directory
cxl/region: Remove lock from memory notifier callback
cxl/pci: simplify the check of mem_enabled in cxl_hdm_decode_init()
cxl/pci: Check Mem_info_valid bit for each applicable DVSEC
cxl/pci: Remove duplicated implementation of waiting for memory_info_valid
cxl/pci: Fix to record only non-zero ranges
cxl/pci: Remove duplicate host_bridge->native_aer checking
cxl/pci: cxl_dport_map_rch_aer() cleanup
cxl/pci: Rename cxl_setup_parent_dport() and cxl_dport_map_regs()
cxl/port: Refactor __devm_cxl_add_port() to drop goto pattern
cxl/port: Use scoped_guard()/guard() to drop device_lock() for cxl_port
cxl/port: Use __free() to drop put_device() for cxl_port
cxl: Remove duplicate included header file core.h
tools/testing/cxl: Use dev_is_platform()
...
Merge tag 'loongarch-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch updates from Huacai Chen:
- Fix objtool about do_syscall() and Clang
- Enable generic CPU vulnerabilites support
- Enable ACPI BGRT handling
- Rework CPU feature probe from CPUCFG/IOCSR
- Add ARCH_HAS_SET_MEMORY support
- Add ARCH_HAS_SET_DIRECT_MAP support
- Improve hardware page table walker
- Simplify _percpu_read() and _percpu_write()
- Add advanced extended IRQ model documentions
- Some bug fixes and other small changes
* tag 'loongarch-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
Docs/LoongArch: Add advanced extended IRQ model description
LoongArch: Remove posix_types.h include from sigcontext.h
LoongArch: Fix memleak in pci_acpi_scan_root()
LoongArch: Simplify _percpu_read() and _percpu_write()
LoongArch: Improve hardware page table walker
LoongArch: Add ARCH_HAS_SET_DIRECT_MAP support
LoongArch: Add ARCH_HAS_SET_MEMORY support
LoongArch: Rework CPU feature probe from CPUCFG/IOCSR
LoongArch: Enable ACPI BGRT handling
LoongArch: Enable generic CPU vulnerabilites support
LoongArch: Remove STACK_FRAME_NON_STANDARD(do_syscall)
LoongArch: Set AS_HAS_THIN_ADD_SUB as y if AS_IS_LLVM
LoongArch: Enable objtool for Clang
objtool: Handle frame pointer related instructions
Merge tag 'sh-for-v6.12-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux
Pull sh updates from John Paul Adrian Glaubitz:
"The first change by Gaosheng Cui removes unused declarations which
have been obsoleted since commit 5a4053b23262 ("sh: Kill off dead
boards.") and the second by his colleague Hongbo Li replaces the use
of the unsafe simple_strtoul() with the safer kstrtoul() function in
the sh interrupt controller driver code"
* tag 'sh-for-v6.12-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux:
sh: intc: Replace simple_strtoul() with kstrtoul()
sh: Remove unused declarations for make_maskreg_irq() and irq_mask_register
Merge tag 'for-linus-6.12-rc1a-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull more xen updates from Juergen Gross:
"A second round of Xen related changes and features:
- a small fix of the xen-pciback driver for a warning issued by
sparse
- support PCI passthrough when using a PVH dom0
- enable loading the kernel in PVH mode at arbitrary addresses,
avoiding conflicts with the memory map when running as a Xen dom0
using the host memory layout"
* tag 'for-linus-6.12-rc1a-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/pvh: Add 64bit relocation page tables
x86/kernel: Move page table macros to header
x86/pvh: Set phys_base when calling xen_prepare_pvh()
x86/pvh: Make PVH entrypoint PIC for x86-64
xen: sync elfnote.h from xen tree
xen/pciback: fix cast to restricted pci_ers_result_t and pci_power_t
xen/privcmd: Add new syscall to get gsi from dev
xen/pvh: Setup gsi for passthrough device
xen/pci: Add a function to reset device for xen
- Dm-integrity: Support recalculation in the 'I' mode
- Revert "dm: requeue IO if mapping table not yet available"
- Dm-crypt: Small refactoring to make the code more readable
- Dm-cache: Remove pointless error check
- Dm: Fix spelling errors
- Dm-verity: Restart or panic on an I/O error if restart or panic was
requested
- Dm-verity: Fallback to platform keyring also if key in trusted
keyring is rejected
* tag 'for-6.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (26 commits)
dm verity: fallback to platform keyring also if key in trusted keyring is rejected
dm-verity: restart or panic on an I/O error
dm: fix spelling errors
dm-cache: remove pointless error check
dm vdo: handle unaligned discards correctly
dm vdo indexer: Convert comma to semicolon
dm-crypt: Use common error handling code in crypt_set_keyring_key()
dm-crypt: Use up_read() together with key_put() only once in crypt_set_keyring_key()
Revert "dm: requeue IO if mapping table not yet available"
dm-integrity: check mac_size against HASH_MAX_DIGESTSIZE in sb_mac()
dm-integrity: support recalculation in the 'I' mode
dm integrity: Convert comma to semicolon
dm integrity: fix gcc 5 warning
dm: Make use of __assign_bit() API
dm integrity: Remove extra unlikely helper
dm: Convert to use ERR_CAST()
dm bufio: Remove NULL check of list_entry()
dm-crypt: Allow to specify the integrity key size as option
dm: Remove unused declaration and empty definition "dm_zone_map_bio"
dm delay: enhance kernel documentation
...
Merge tag 'driver-core-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is a small set of patches for the driver core code for 6.12-rc1.
This set is the one that caused the most delay on my side, due to lots
of last-minute reports of problems in the async shutdown feature that
was added. In the end, I've reverted all of the patches in that series
so we are back to "normal" and the patch set is being reworked for the
next merge window.
Other than the async shutdown patches that were reverted, included in
here are:
- minor driver core cleanups
- minor driver core bus and class api cleanups and simplifications
for some callbacks
- some const markings of structures
- other even more minor cleanups
All of these, including the last minute reverts, have been in
linux-next, but all of the reports of problems in linux-next were
before the reverts happened. After the reverts, all is good"
* tag 'driver-core-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (32 commits)
Revert "driver core: don't always lock parent in shutdown"
Revert "driver core: separate function to shutdown one device"
Revert "driver core: shut down devices asynchronously"
Revert "nvme-pci: Make driver prefer asynchronous shutdown"
Revert "driver core: fix async device shutdown hang"
driver core: fix async device shutdown hang
driver core: attribute_container: Remove unused functions
driver core: Trivially simplify ((struct device_private *)curr)->device->p to @curr
devres: Correclty strip percpu address space of devm_free_percpu() argument
driver core: Make parameter check consistent for API cluster device_(for_each|find)_child()
bus: fsl-mc: make fsl_mc_bus_type const
nvme-pci: Make driver prefer asynchronous shutdown
driver core: shut down devices asynchronously
driver core: separate function to shutdown one device
driver core: don't always lock parent in shutdown
platform: Make platform_bus_type constant
driver core: class: Check namespace relevant parameters in class_register()
driver:base:core: Adding a "Return:" line in comment for device_link_add()
drivers/base: Introduce device_match_t for device finding APIs
firmware_loader: Block path traversal
...
Al Viro [Fri, 27 Sep 2024 01:56:11 +0000 (02:56 +0100)]
[tree-wide] finally take no_llseek out
no_llseek had been defined to NULL two years ago, in commit 868941b14441
("fs: remove no_llseek")
To quote that commit,
At -rc1 we'll need do a mechanical removal of no_llseek -
git grep -l -w no_llseek | grep -v porting.rst | while read i; do
sed -i '/\<no_llseek\>/d' $i
done
would do it.
Unfortunately, that hadn't been done. Linus, could you do that now, so
that we could finally put that thing to rest? All instances are of the
form
.llseek = no_llseek,
so it's obviously safe.
This is because when ocfs2_extent_map_get_blocks() fails, p_blkno is
uninitialized. So the error log will trigger the above uninit-value
access.
The error log is out-of-date since get_blocks() was removed long time ago.
And the error code will be logged in ocfs2_extent_map_get_blocks() once
ocfs2_get_cluster() fails, so fix this by only logging inode and block.
When CONFIG_ZRAM_MULTI_COMP isn't set ZRAM_SECONDARY_COMP can hold
default_compressor, because it's the same offset as ZRAM_PRIMARY_COMP, so
we need to make sure that we don't attempt to kfree() the statically
defined compressor name.
memory tiers: use default_dram_perf_ref_source in log message
Commit 3718c02dbd4c ("acpi, hmat: calculate abstract distance with HMAT")
added a default_dram_perf_ref_source variable that was initialized but
never used. This causes kmemleak to report the following memory leak:
This reminds us that we forget to use the performance data source
information. So, use the variable in the error log message to help
identify the root cause of inconsistent performance number.
Expected cur == &entries[i], but
cur == 0000037fffadfd80
&entries[i] == 0000037fffadfd60
# list_test_list_cut_position: pass:0 fail:1 skip:0 total:1
not ok 21 list_test_list_cut_position
# list_test_list_cut_before: EXPECTATION FAILED at lib/list-test.c:444
Expected cur == &entries[i], but
cur == 0000037fffa9fd70
&entries[i] == 0000037fffa9fd60
# list_test_list_cut_before: EXPECTATION FAILED at lib/list-test.c:444
Expected cur == &entries[i], but
cur == 0000037fffa9fd80
&entries[i] == 0000037fffa9fd70
The number is dependent on the architecture. The above data shows that:
x86 374
x86_64 323
The value of __NR_userfaultfd was changed to 282 when asm-generic/unistd.h
was included. It makes the test to fail every time as the correct number
of this syscall on x86_64 is 323. Fix the header to asm/unistd.h.
There is no above issue on x86 due to the generated section flag is only
"A" (allocatable). In order to silence the warning on LoongArch, specify
the attribute like ".rodata..c_jump_table,\"a\",@progbits #" explicitly,
then the section attribute of ".rodata..c_jump_table" must be readonly
in the kernel/bpf/core.o file.
By the way, AFAICT, maybe the root cause is related with the different
compiler behavior of various archs, so to some extent this change is a
workaround for LoongArch, and also there is no effect for x86 which is the
only port supported by objtool before LoongArch with this patch.
The old URL doesn't really work anymore and as the documentation has been
integrated in the main kernel documentation site, change the URL to point
to that.
ocfs2: reserve space for inline xattr before attaching reflink tree
One of our customers reported a crash and a corrupted ocfs2 filesystem.
The crash was due to the detection of corruption. Upon troubleshooting,
the fsck -fn output showed the below corruption
[EXTENT_LIST_FREE] Extent list in owner 33080590 claims 230 as the next free chain record,
but fsck believes the largest valid value is 227. Clamp the next record value? n
The stat output from the debugfs.ocfs2 showed the following corruption
where the "Next Free Rec:" had overshot the "Count:" in the root metadata
block.
The issue was in the reflink workfow while reserving space for inline
xattr. The problematic function is ocfs2_reflink_xattr_inline(). By the
time this function is called the reflink tree is already recreated at the
destination inode from the source inode. At this point, this function
reserves space for inline xattrs at the destination inode without even
checking if there is space at the root metadata block. It simply reduces
the l_count from 243 to 227 thereby making space of 256 bytes for inline
xattr whereas the inode already has extents beyond this index (in this
case up to 230), thereby causing corruption.
The fix for this is to reserve space for inline metadata at the destination
inode before the reflink tree gets recreated. The customer has verified the
fix.
Jeongjun Park [Tue, 24 Sep 2024 13:00:53 +0000 (22:00 +0900)]
mm: migrate: annotate data-race in migrate_folio_unmap()
I found a report from syzbot [1]
This report shows that the value can be changed, but in reality, the
value of __folio_set_movable() cannot be changed because it holds the
folio refcount.
Therefore, it is appropriate to add an annotate to make KCSAN
ignore that data-race.
[1]
==================================================================
BUG: KCSAN: data-race in __filemap_remove_folio / migrate_pages_batch
Steve Sistare [Wed, 4 Sep 2024 19:41:08 +0000 (12:41 -0700)]
mm/hugetlb: simplify refs in memfd_alloc_folio
The folio_try_get in memfd_alloc_folio is not necessary. Delete it, and
delete the matching folio_put in memfd_pin_folios. This also avoids
leaking a ref if the memfd_alloc_folio call to hugetlb_add_to_page_cache
fails. That error path is also broken in a second way -- when its
folio_put causes the ref to become 0, it will implicitly call
free_huge_folio, but then the path *explicitly* calls free_huge_folio.
Delete the latter.
This is a continuation of the fix
"mm/hugetlb: fix memfd_pin_folios free_huge_pages leak"
When memfd_pin_folios -> memfd_alloc_folio creates a hugetlb page, the
index is wrong. The subsequent call to filemap_get_folios_contig thus
cannot find it, and fails, and memfd_pin_folios loops forever. To fix,
adjust the index for the huge_page_order.
memfd_alloc_folio also forgets to unlock the folio, so the next touch of
the page calls hugetlb_fault which blocks forever trying to take the lock.
Unlock it.
memfd_pin_folios followed by unpin_folios leaves resv_huge_pages elevated
if the pages were not already faulted in. During a normal page fault,
resv_huge_pages is consumed here:
During memfd_pin_folios, the page is created by calling
alloc_hugetlb_folio_nodemask instead of alloc_hugetlb_folio, and
resv_huge_pages is not modified:
alloc_hugetlb_folio_nodemask has other callers that must not modify
resv_huge_pages. Therefore, to fix, define an alternate version of
alloc_hugetlb_folio_nodemask for this call site that adjusts
resv_huge_pages.
memfd_pin_folios followed by unpin_folios fails to restore free_huge_pages
if the pages were not already faulted in, because the folio refcount for
pages created by memfd_alloc_folio never goes to 0. memfd_pin_folios
needs another folio_put to undo the folio_try_get below:
With the fix, after memfd_pin_folios + unpin_folios, the refcount for the
(unfaulted) page is 512, which is correct, as the refcount for a faulted
unpinned page is 513.
Fix multiple bugs that occur when using memfd_pin_folios with hugetlb
pages and THP. The hugetlb bugs only bite when the page is not yet
faulted in when memfd_pin_folios is called. The THP bug bites when the
starting offset passed to memfd_pin_folios is not huge page aligned. See
the commit messages for details.
This patch (of 5):
memfd_pin_folios on memory backed by THP panics if the requested start
offset is not huge page aligned:
The fault occurs here, because xas_load returns a folio with value 2:
filemap_get_folios_contig()
for (folio = xas_load(&xas); folio && xas.xa_index <= end;
folio = xas_next(&xas)) {
...
if (!folio_try_get(folio)) <-- BOOM
"2" is an xarray sibling entry. We get it because memfd_pin_folios does
not round the indices passed to filemap_get_folios_contig to huge page
boundaries for THP, so we load from the middle of a huge page range see a
sibling. (It does round for hugetlbfs, at the is_file_hugepages test).
To fix, if the folio is a sibling, then return the next index as the
starting point for the next call to filemap_get_folios_contig.
SPLIT_PTE_PTLOCKS depends on "NR_CPUS >= 4". Unfortunately, that
evaluates to true if there is no NR_CPUS configuration option. This
results in CONFIG_SPLIT_PTE_PTLOCKS=y for mac_defconfig. This in turn
causes the m68k "q800" and "virt" machines to crash in qemu if debugging
options are enabled.
Making CONFIG_SPLIT_PTE_PTLOCKS dependent on the existence of NR_CPUS does
not work since a dependency on the existence of a numeric Kconfig entry
always evaluates to false. Example:
config HAVE_NO_NR_CPUS
def_bool y
depends on !NR_CPUS
After adding this to a Kconfig file, "make defconfig" includes:
$ grep NR_CPUS .config
CONFIG_NR_CPUS=64
CONFIG_HAVE_NO_NR_CPUS=y
Defining NR_CPUS for m68k does not help either since many architectures
define NR_CPUS only for SMP configurations.
Make SPLIT_PTE_PTLOCKS depend on SMP instead to solve the problem.
Lorenzo Stoakes [Tue, 24 Sep 2024 18:07:24 +0000 (19:07 +0100)]
tools: fix shared radix-tree build
The shared radix-tree build is not correctly recompiling when
lib/maple_tree.c and lib/test_maple_tree.c are modified - fix this by
adding these core components to the SHARED_DEPS list.
Additionally, add missing header guards to shared header files.
Merge tag 'asm-generic-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic updates from Arnd Bergmann:
"These are only two small patches, one cleanup for arch/alpha and a
preparation patch cleaning up the handling of runtime constants in the
linker scripts"
* tag 'asm-generic-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
runtime constants: move list of constants to vmlinux.lds.h
alpha: no need to include asm/xchg.h twice
Merge tag 'efi-next-for-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI updates from Ard Biesheuvel:
"Not a lot happening in EFI land this cycle.
- Prevent kexec from crashing on a corrupted TPM log by using a
memory type that is reserved by default
- Log correctable errors reported via CPER
- A couple of cosmetic fixes"
* tag 'efi-next-for-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi: Remove redundant null pointer checks in efi_debugfs_init()
efistub/tpm: Use ACPI reclaim memory for event log to avoid corruption
efi/cper: Print correctable AER information
efi: Remove unused declaration efi_initialize_iomem_resources()
Merge tag 'net-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter.
It looks like that most people are still traveling: both the ML volume
and the processing capacity are low.
Previous releases - regressions:
- netfilter:
- nf_reject_ipv6: fix nf_reject_ip6_tcphdr_put()
- nf_tables: keep deleted flowtable hooks until after RCU
- tcp: check skb is non-NULL in tcp_rto_delta_us()
- phy: aquantia: fix -ETIMEDOUT PHY probe failure when firmware not
present
- eth: virtio_net: fix mismatched buf address when unmapping for
small packets
- eth: stmmac: fix zero-division error when disabling tc cbs
- eth: bonding: fix unnecessary warnings and logs from
bond_xdp_get_xmit_slave()
Previous releases - always broken:
- netfilter:
- fix clash resolution for bidirectional flows
- fix allocation with no memcg accounting
- eth: r8169: add tally counter fields added with RTL8125
- eth: ravb: fix rx and tx frame size limit"
* tag 'net-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (35 commits)
selftests: netfilter: Avoid hanging ipvs.sh
kselftest: add test for nfqueue induced conntrack race
netfilter: nfnetlink_queue: remove old clash resolution logic
netfilter: nf_tables: missing objects with no memcg accounting
netfilter: nf_tables: use rcu chain hook list iterator from netlink dump path
netfilter: ctnetlink: compile ctnetlink_label_size with CONFIG_NF_CONNTRACK_EVENTS
netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=n
netfilter: nf_tables: Keep deleted flowtable hooks until after RCU
docs: tproxy: ignore non-transparent sockets in iptables
netfilter: ctnetlink: Guard possible unused functions
selftests: netfilter: nft_tproxy.sh: add tcp tests
selftests: netfilter: add reverse-clash resolution test case
netfilter: conntrack: add clash resolution for reverse collisions
netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash
selftests/net: packetdrill: increase timing tolerance in debug mode
usbnet: fix cyclical race on disconnect with work queue
net: stmmac: set PP_FLAG_DMA_SYNC_DEV only if XDP is enabled
virtio_net: Fix mismatched buf address when unmapping for small packets
bonding: Fix unnecessary warnings and logs from bond_xdp_get_xmit_slave()
r8169: add missing MODULE_FIRMWARE entry for RTL8126A rev.b
...
Merge tag 'char-misc-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char / misc driver updates from Greg KH:
"Here is the "big" set of char/misc and other driver subsystem changes
for 6.12-rc1.
Lots of changes in here, primarily dominated by the usual IIO driver
updates and additions, but there are also small driver subsystem
updates all over the place. Included in here are:
- lots and lots of new IIO drivers and updates to existing ones
- interconnect subsystem updates and new drivers
- nvmem subsystem updates and new drivers
- mhi driver updates
- power supply subsystem updates
- kobj_type const work for many different small subsystems
- comedi driver fix
- coresight subsystem and driver updates
- fpga subsystem improvements
- slimbus fixups
- binder new feature addition for "frozen" notifications
- lots and lots of other small driver updates and cleanups
All of these have been in linux-next for a long time with no reported
problems"
* tag 'char-misc-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (354 commits)
greybus: gb-beagleplay: Add firmware upload API
arm64: dts: ti: k3-am625-beagleplay: Add bootloader-backdoor-gpios to cc1352p7
dt-bindings: net: ti,cc1352p7: Add bootloader-backdoor-gpios
MAINTAINERS: Update path for U-Boot environment variables YAML
nvmem: layouts: add U-Boot env layout
comedi: ni_routing: tools: Check when the file could not be opened
ocxl: Remove the unused declarations in headr file
hpet: Fix the wrong format specifier
uio: Constify struct kobj_type
cxl: Constify struct kobj_type
binder: modify the comment for binder_proc_unlock
iio: adc: axp20x_adc: add support for AXP717 ADC
dt-bindings: iio: adc: Add AXP717 compatible
iio: adc: axp20x_adc: Add adc_en1 and adc_en2 to axp_data
w1: ds2482: Drop explicit initialization of struct i2c_device_id::driver_data to 0
tools: iio: rm .*.cmd when make clean
iio: adc: standardize on formatting for id match tables
iio: proximity: aw96103: Add support for aw96103/aw96105 proximity sensor
bus: mhi: host: pci_generic: Enable EDL trigger for Foxconn modems
bus: mhi: host: pci_generic: Update EDL firmware path for Foxconn modems
...
Merge tag 'staging-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging driver updates from Greg KH:
"Here is the big set of staging driver cleanups and removals for
6.12-rc1.
Nothing exciting here, just slow, constant, forward progress in
removing code and cleaning up some old drivers, along with removing
one of them that was not being used anymore at all. In discussions
with some developers this past week, even more deletions will be
happening for the next major merge window, as we seems to have code
here that obviously no one is using anymore.
Along with the normal cleanups is the good vme_user code forward
progress, the one major bright spot in the staging subsystem for code
that people rely on, and is getting good development behind it.
Hopefully it can graduate out of staging "soon".
All of these changes have been in linux-next for a long time with no
reported problems"
* tag 'staging-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (141 commits)
staging: vt6655: Rename variable apTD1Rings
staging: vt6655: Rename variable apTD0Rings
staging: rtl8723bs: remove unused 'poll_cnt' from rtw_set_rpwm()
staging: rtl8723bs: remove unused cnt from recv_func()
staging: rtl8723bs: remove unused efuseValue from efuse_OneByteWrite()
staging: rtl8712: remove unused drvinfo_sz from update_recvframe_attrib
staging: vt6655: mac.h: Fix possible precedence issue in macros
staging: rtl8723bs: include: Remove spaces before tabs in rtw_security.h
staging: rtl8723bs: include: Fix trailing */ position in rtw_security.h
staging: rtl8723bs: include: Fix indent for else block struct in rtw_security.h
staging: rtl8723bs: include: Fix indent for struct _byte_ in rtw_security.h
staging: rtl8723bs: include: Fix use of tabs for indent in rtw_security.h
staging: rtl8723bs: include: Fix indent for switch block in rtw_security.h
staging: rtl8723bs: include: Fix indent for switch case in rtw_security.h
staging: rtl8723bs: include: Fix open brace position in rtw_security.h
staging: nvec: Use IRQF_NO_AUTOEN flag in request_irq()
staging: rtl8723bs: Remove unused file rtw_rf.c
staging: rtl8723bs: Remove unused function rtw_ch2freq
staging: rtl8723bs: Remove unused files rtw_debug.c and rtw_debug.h
staging: rtl8723bs: Remove unused function dump_4_regs
...
Merge tag 'tty-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty / serial driver updates from Greg KH:
"Here is the "big" set of tty/serial driver updates for 6.12-rc1.
Nothing major in here, just nice forward progress in the slow cleanup
of the serial apis, and lots of other driver updates and fixes.
Included in here are:
- serial api updates from Jiri to make things more uniform and sane
- 8250_platform driver cleanups
- samsung serial driver fixes and updates
- qcom-geni serial driver fixes from Johan for the bizarre UART
engine that that chip seems to have. Hopefully it's in a better
state now, but hardware designers still seem to come up with more
ways to make broken UARTS 40+ years after this all should have
finished.
- sc16is7xx driver updates
- omap 8250 driver updates
- 8250_bcm2835aux driver updates
- a few new serial driver bindings added
- other serial minor driver updates
All of these have been in linux-next for a long time with no reported
problems"
* tag 'tty-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (65 commits)
tty: serial: samsung: Fix serial rx on Apple A7-A9
tty: serial: samsung: Fix A7-A11 serial earlycon SError
tty: serial: samsung: Use bit manipulation macros for APPLE_S5L_*
tty: rp2: Fix reset with non forgiving PCIe host bridges
serial: 8250_aspeed_vuart: Enable module autoloading
serial: qcom-geni: fix polled console corruption
serial: qcom-geni: disable interrupts during console writes
serial: qcom-geni: fix console corruption
serial: qcom-geni: introduce qcom_geni_serial_poll_bitfield()
serial: qcom-geni: fix arg types for qcom_geni_serial_poll_bit()
soc: qcom: geni-se: add GP_LENGTH/IRQ_EN_SET/IRQ_EN_CLEAR registers
serial: qcom-geni: fix false console tx restart
serial: qcom-geni: fix fifo polling timeout
tty: hvc: convert comma to semicolon
mxser: convert comma to semicolon
serial: 8250_bcm2835aux: Fix clock imbalance in PM resume
serial: sc16is7xx: convert bitmask definitions to use BIT() macro
serial: sc16is7xx: fix copy-paste errors in EFR_SWFLOWx_BIT constants
serial: sc16is7xx: remove SC16IS7XX_MSR_DELTA_MASK
serial: xilinx_uartps: Make cdns_rs485_supported static
...
Merge tag 'usb-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB/Thunderbolt updates from Greg KH:
"Here is the large set of USB and Thunderbolt changes for 6.12-rc1.
Nothing "major" in here, except for a new 9p network gadget that has
been worked on for a long time (all of the needed acks are here)
Other than that, it's the usual set of:
- Thunderbolt / USB4 driver updates and additions for new hardware
- dwc3 driver updates and new features added
- xhci driver updates
- typec driver updates
- USB gadget updates and api additions to make some gadgets more
configurable by userspace
- dwc2 driver updates
- usb phy driver updates
- usbip feature additions
- other minor USB driver updates
All of these have been in linux-next for a long time with no reported
issues"
* tag 'usb-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (145 commits)
sub: cdns3: Use predefined PCI vendor ID constant
sub: cdns2: Use predefined PCI vendor ID constant
USB: misc: yurex: fix race between read and write
USB: misc: cypress_cy7c63: check for short transfer
USB: appledisplay: close race between probe and completion handler
USB: class: CDC-ACM: fix race between get_serial and set_serial
usb: r8a66597-hcd: make read-only const arrays static
usb: typec: ucsi: Fix busy loop on ASUS VivoBooks
usb: dwc3: rtk: Clean up error code in __get_dwc3_maximum_speed()
usb: storage: ene_ub6250: Fix right shift warnings
usb: roles: Improve the fix for a false positive recursive locking complaint
locking/mutex: Introduce mutex_init_with_key()
locking/mutex: Define mutex_init() once
net/9p/usbg: fix CONFIG_USB_GADGET dependency
usb: xhci: fix loss of data on Cadence xHC
usb: xHCI: add XHCI_RESET_ON_RESUME quirk for Phytium xHCI host
usb: dwc3: imx8mp: disable SS_CON and U3 wakeup for system sleep
usb: dwc3: imx8mp: add 2 software managed quirk properties for host mode
usb: host: xhci-plat: Parse xhci-missing_cas_quirk and apply quirk
usb: misc: onboard_usb_dev: add Microchip usb5744 SMBus programming support
...
Merge tag 'hid-for-linus-2024092601' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
Pull HID fix from Jiri Kosina:
"A revert of Device Tree binding for Goodix SPI HID driver (while
keeping ACPI still available), as it conflicted with already existing
binding and the original submitter didn't respond in time with a fix.
We will be looking into ways how to reintroduce it properly (we have
to agree on a way how to handle cases where vendor uses the very same
product ID for I2C and SPI parts, leading to this kind conflict). But
before that is settled, let's revert the to unbreak everybody else
(Krzysztof Kozlowski)"
* tag 'hid-for-linus-2024092601' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
dt-bindings: input: Revert "dt-bindings: input: Goodix SPI HID Touchscreen"
HID: hid-goodix: drop unsupported and undocumented DT part
Merge tag 'v6.12-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
"Most are from the recent SMB3.1.1 test event, and also an important
netfs fix for a cifs mtime write regression
- fix mode reported by stat of readonly directories and files
- DFS (global namespace) related fixes
- fixes for special file support via reparse points
- mount improvement and reconnect fix
- fix for noisy log message on umount
- two netfs related fixes, one fixing a recent regression, and add
new write tracepoint"
* tag 'v6.12-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
netfs, cifs: Fix mtime/ctime update for mmapped writes
cifs: update internal version number
smb: client: print failed session logoffs with FYI
cifs: Fix reversion of the iter in cifs_readv_receive().
smb3: fix incorrect mode displayed for read-only files
smb: client: fix parsing of device numbers
smb: client: set correct device number on nfs reparse points
smb: client: propagate error from cifs_construct_tcon()
smb: client: fix DFS failover in multiuser mounts
cifs: Make the write_{enter,done,err} tracepoints display netfs info
smb: client: fix DFS interlink failover
smb: client: improve purging of cached referrals
smb: client: avoid unnecessary reconnects when refreshing referrals
Merge tag 'probes-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes updates from Masami Hiramatsu:
- uprobes: make trace_uprobe->nhit counter a per-CPU one
This makes uprobe event's hit counter per-CPU for improving
scalability on multi-core environment
- kprobes: Remove obsoleted declaration for init_test_probes
Remove unused init_test_probes() from header
- Raw tracepoint probe supports raw tracepoint events on modules:
- add a function for iterating over all tracepoints in all modules
- add a function for iterating over tracepoints in a module
- support raw tracepoint events on modules
- support raw tracepoints on future loaded modules
- add a test for tracepoint events on modules"
* tag 'probes-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
sefltests/tracing: Add a test for tracepoint events on modules
tracing/fprobe: Support raw tracepoints on future loaded modules
tracing/fprobe: Support raw tracepoint events on modules
tracepoint: Support iterating tracepoints in a loading module
tracepoint: Support iterating over tracepoints on modules
kprobes: Remove obsoleted declaration for init_test_probes
uprobes: turn trace_uprobe's nhit counter to be per-CPU one
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
"Several new features here:
- virtio-balloon supports new stats
- vdpa supports setting mac address
- vdpa/mlx5 suspend/resume as well as MKEY ops are now faster
- virtio_fs supports new sysfs entries for queue info
- virtio/vsock performance has been improved
And fixes, cleanups all over the place"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (34 commits)
vsock/virtio: avoid queuing packets when intermediate queue is empty
vsock/virtio: refactor virtio_transport_send_pkt_work
fw_cfg: Constify struct kobj_type
vdpa/mlx5: Postpone MR deletion
vdpa/mlx5: Introduce init/destroy for MR resources
vdpa/mlx5: Rename mr_mtx -> lock
vdpa/mlx5: Extract mr members in own resource struct
vdpa/mlx5: Rename function
vdpa/mlx5: Delete direct MKEYs in parallel
vdpa/mlx5: Create direct MKEYs in parallel
MAINTAINERS: add virtio-vsock driver in the VIRTIO CORE section
virtio_fs: add sysfs entries for queue information
virtio_fs: introduce virtio_fs_put_locked helper
vdpa: Remove unused declarations
vdpa/mlx5: Parallelize VQ suspend/resume for CVQ MQ command
vdpa/mlx5: Small improvement for change_num_qps()
vdpa/mlx5: Keep notifiers during suspend but ignore
vdpa/mlx5: Parallelize device resume
vdpa/mlx5: Parallelize device suspend
vdpa/mlx5: Use async API for vq modify commands
...
dm verity: fallback to platform keyring also if key in trusted keyring is rejected
If enabled, we fallback to the platform keyring if the trusted keyring doesn't have
the key used to sign the roothash. But if pkcs7_verify() rejects the key for other
reasons, such as usage restrictions, we do not fallback. Do so.
Maxim Suhanov reported that dm-verity doesn't crash if an I/O error
happens. In theory, this could be used to subvert security, because an
attacker can create sectors that return error with the Write Uncorrectable
command. Some programs may misbehave if they have to deal with EIO.
This commit fixes dm-verity, so that if "panic_on_corruption" or
"restart_on_corruption" was specified and an I/O error happens, the
machine will panic or restart.
This commit also changes kernel_restart to emergency_restart -
kernel_restart calls reboot notifiers and these reboot notifiers may wait
for the bio that failed. emergency_restart doesn't call the notifiers.
Rob Herring [Wed, 25 Sep 2024 17:35:10 +0000 (12:35 -0500)]
dt-bindings: gpio: ep9301: Add missing "#interrupt-cells" to examples
Enabling dtc interrupt_provider check reveals the examples are missing
the "#interrupt-cells" property as it is a dependency of
"interrupt-controller".
Paolo Abeni [Thu, 26 Sep 2024 13:47:10 +0000 (15:47 +0200)]
Merge tag 'nf-24-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
v2: with kdoc fixes per Paolo Abeni.
The following patchset contains Netfilter fixes for net:
Patch #1 and #2 handle an esoteric scenario: Given two tasks sending UDP
packets to one another, two packets of the same flow in each direction
handled by different CPUs that result in two conntrack objects in NEW
state, where reply packet loses race. Then, patch #3 adds a testcase for
this scenario. Series from Florian Westphal.
1) NAT engine can falsely detect a port collision if it happens to pick
up a reply packet as NEW rather than ESTABLISHED. Add extra code to
detect this and suppress port reallocation in this case.
2) To complete the clash resolution in the reply direction, extend conntrack
logic to detect clashing conntrack in the reply direction to existing entry.
3) Adds a test case.
Then, an assorted list of fixes follow:
4) Add a selftest for tproxy, from Antonio Ojea.
5) Guard ctnetlink_*_size() functions under
#if defined(CONFIG_NETFILTER_NETLINK_GLUE_CT) || defined(CONFIG_NF_CONNTRACK_EVENTS)
From Andy Shevchenko.
6) Use -m socket --transparent in iptables tproxy documentation.
From XIE Zhibang.
7) Call kfree_rcu() when releasing flowtable hooks to address race with
netlink dump path, from Phil Sutter.
8) Fix compilation warning in nf_reject with CONFIG_BRIDGE_NETFILTER=n.
From Simon Horman.
9) Guard ctnetlink_label_size() under CONFIG_NF_CONNTRACK_EVENTS which
is its only user, to address a compilation warning. From Simon Horman.
10) Use rcu-protected list iteration over basechain hooks from netlink
dump path.
11) Fix memcg for nf_tables, use GFP_KERNEL_ACCOUNT is not complete.
12) Remove old nfqueue conntrack clash resolution. Instead trying to
use same destination address consistently which requires double DNAT,
use the existing clash resolution which allows clashing packets
go through with different destination. Antonio Ojea originally
reported an issue from the postrouting chain, I proposed a fix:
https://lore.kernel.org/netfilter-devel/ZuwSwAqKgCB2a51-@calendula/T/
which he reported it did not work for him.
13) Adds a selftest for patch 12.
14) Fixes ipvs.sh selftest.
netfilter pull request 24-09-26
* tag 'nf-24-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
selftests: netfilter: Avoid hanging ipvs.sh
kselftest: add test for nfqueue induced conntrack race
netfilter: nfnetlink_queue: remove old clash resolution logic
netfilter: nf_tables: missing objects with no memcg accounting
netfilter: nf_tables: use rcu chain hook list iterator from netlink dump path
netfilter: ctnetlink: compile ctnetlink_label_size with CONFIG_NF_CONNTRACK_EVENTS
netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=n
netfilter: nf_tables: Keep deleted flowtable hooks until after RCU
docs: tproxy: ignore non-transparent sockets in iptables
netfilter: ctnetlink: Guard possible unused functions
selftests: netfilter: nft_tproxy.sh: add tcp tests
selftests: netfilter: add reverse-clash resolution test case
netfilter: conntrack: add clash resolution for reverse collisions
netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash
====================
soc: ep93xx: drop reference to removed EP93XX_SOC_COMMON config
Commit 6eab0ce6e1c6 ("soc: Add SoC driver for Cirrus ep93xx") adds the
config EP93XX_SOC referring to the config EP93XX_SOC_COMMON.
Within the same patch series of the commit above, the commit 046322f1e1d9
("ARM: ep93xx: DT for the Cirrus ep93xx SoC platforms") then removes the
config EP93XX_SOC_COMMON. With that the reference to this config is
obsolete.
Simplify the expression in the EP93XX_SOC config definition.
kselftest: add test for nfqueue induced conntrack race
The netfilter race happens when two packets with the same tuple are DNATed
and enqueued with nfqueue in the postrouting hook.
Once one of the packet is reinjected it may be DNATed again to a different
destination, but the conntrack entry remains the same and the return packet
was dropped.
netfilter: nfnetlink_queue: remove old clash resolution logic
For historical reasons there are two clash resolution spots in
netfilter, one in nfnetlink_queue and one in conntrack core.
nfnetlink_queue one was added first: If a colliding entry is found, NAT
NAT transformation is reversed by calling nat engine again with altered
tuple.
See commit 368982cd7d1b ("netfilter: nfnetlink_queue: resolve clash for
unconfirmed conntracks") for details.
One problem is that nf_reroute() won't take an action if the queueing
doesn't occur in the OUTPUT hook, i.e. when queueing in forward or
postrouting, packet will be sent via the wrong path.
Another problem is that the scenario addressed (2nd UDP packet sent with
identical addresses while first packet is still being processed) can also
occur without any nfqueue involvement due to threaded resolvers doing
A and AAAA requests back-to-back.
This lead us to add clash resolution logic to the conntrack core, see
commit 6a757c07e51f ("netfilter: conntrack: allow insertion of clashing
entries"). Instead of fixing the nfqueue based logic, lets remove it
and let conntrack core handle this instead.
Retain the ->update hook for sake of nfqueue based conntrack helpers.
We could axe this hook completely but we'd have to split confirm and
helper logic again, see commit ee04805ff54a ("netfilter: conntrack: make
conntrack userspace helpers work again").
This SHOULD NOT be backported to kernels earlier than v5.6; they lack
adequate clash resolution handling.
Patch was originally written by Pablo Neira Ayuso.
netfilter: nf_tables: missing objects with no memcg accounting
Several ruleset objects are still not using GFP_KERNEL_ACCOUNT for
memory accounting, update them. This includes:
- catchall elements
- compat match large info area
- log prefix
- meta secctx
- numgen counters
- pipapo set backend datastructure
- tunnel private objects
Fixes: 33758c891479 ("memcg: enable accounting for nft objects") Signed-off-by: Pablo Neira Ayuso <[email protected]>
netfilter: nf_tables: use rcu chain hook list iterator from netlink dump path
Lockless iteration over hook list is possible from netlink dump path,
use rcu variant to iterate over the hook list as is done with flowtable
hooks.
Fixes: b9703ed44ffb ("netfilter: nf_tables: support for adding new devices to an existing netdev chain") Reported-by: Phil Sutter <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
Simon Horman [Mon, 16 Sep 2024 15:14:41 +0000 (16:14 +0100)]
netfilter: ctnetlink: compile ctnetlink_label_size with CONFIG_NF_CONNTRACK_EVENTS
Only provide ctnetlink_label_size when it is used,
which is when CONFIG_NF_CONNTRACK_EVENTS is configured.
Flagged by clang-18 W=1 builds as:
.../nf_conntrack_netlink.c:385:19: warning: unused function 'ctnetlink_label_size' [-Wunused-function]
385 | static inline int ctnetlink_label_size(const struct nf_conn *ct)
| ^~~~~~~~~~~~~~~~~~~~
The condition on CONFIG_NF_CONNTRACK_LABELS being removed by
this patch guards compilation of non-trivial implementations
of ctnetlink_dump_labels() and ctnetlink_label_size().
However, this is not necessary as each of these functions
will always return 0 if CONFIG_NF_CONNTRACK_LABELS is not defined
as each function starts with the equivalent of:
And nf_ct_labels_find always returns NULL if CONFIG_NF_CONNTRACK_LABELS
is not enabled. So I believe that the compiler optimises the code away
in such cases anyway.
Found by inspection.
Compile tested only.
Originally splitted in two patches, Pablo Neira Ayuso collapsed them and
added Fixes: tag.
Simon Horman [Mon, 16 Sep 2024 09:50:34 +0000 (10:50 +0100)]
netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=n
If CONFIG_BRIDGE_NETFILTER is not enabled, which is the case for x86_64
defconfig, then building nf_reject_ipv4.c and nf_reject_ipv6.c with W=1
using gcc-14 results in the following warnings, which are treated as
errors:
net/ipv4/netfilter/nf_reject_ipv4.c: In function 'nf_send_reset':
net/ipv4/netfilter/nf_reject_ipv4.c:243:23: error: variable 'niph' set but not used [-Werror=unused-but-set-variable]
243 | struct iphdr *niph;
| ^~~~
cc1: all warnings being treated as errors
net/ipv6/netfilter/nf_reject_ipv6.c: In function 'nf_send_reset6':
net/ipv6/netfilter/nf_reject_ipv6.c:286:25: error: variable 'ip6h' set but not used [-Werror=unused-but-set-variable]
286 | struct ipv6hdr *ip6h;
| ^~~~
cc1: all warnings being treated as errors
Address this by reducing the scope of these local variables to where
they are used, which is code only compiled when CONFIG_BRIDGE_NETFILTER
enabled.
Compile tested and run through netfilter selftests.
Phil Sutter [Thu, 12 Sep 2024 12:21:33 +0000 (14:21 +0200)]
netfilter: nf_tables: Keep deleted flowtable hooks until after RCU
Documentation of list_del_rcu() warns callers to not immediately free
the deleted list item. While it seems not necessary to use the
RCU-variant of list_del() here in the first place, doing so seems to
require calling kfree_rcu() on the deleted item as well.
Fixes: 3f0465a9ef02 ("netfilter: nf_tables: dynamically allocate hooks per net_device in flowtables") Signed-off-by: Phil Sutter <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
docs: tproxy: ignore non-transparent sockets in iptables
The iptables example was added in commit d2f26037a38a (netfilter: Add
documentation for tproxy, 2008-10-08), but xt_socket 'transparent'
option was added in commit a31e1ffd2231 (netfilter: xt_socket: added new
revision of the 'socket' match supporting flags, 2009-06-09).
Now add the 'transparent' option to the iptables example to ignore
non-transparent sockets, which is also consistent with the nft example.
Andy Shevchenko [Tue, 10 Sep 2024 08:35:33 +0000 (11:35 +0300)]
netfilter: ctnetlink: Guard possible unused functions
Some of the functions may be unused (CONFIG_NETFILTER_NETLINK_GLUE_CT=n
and CONFIG_NF_CONNTRACK_EVENTS=n), it prevents kernel builds with clang,
`make W=1` and CONFIG_WERROR=y:
The TPROXY functionality is widely used, however, there are only mptcp
selftests covering this feature.
The selftests represent the most common scenarios and can also be used
as selfdocumentation of the feature.
UDP and TCP testcases are split in different files because of the
different nature of the protocols, specially due to the challenges that
present to reliable test UDP due to the connectionless nature of the
protocol. UDP only covers the scenarios involving the prerouting hook.
The UDP tests are signfinicantly slower than the TCP ones, hence they
use a larger timeout, it takes 20 seconds to run the full UDP suite
on a 48 vCPU Intel(R) Xeon(R) CPU @2.60GHz.
The colliding ct (and the associated skb) get dropped on insert.
Permit this by checking if the colliding entry matches the reply
direction.
Happens when both ends send packets at same time, both requests are picked
up as NEW, rather than NEW for the 'first' and 'ESTABLISHED' for the
second packet.
This is an esoteric condition, as ruleset must permit NEW connections
in either direction and both peers must already have a bidirectional
traffic flow at the time conntrack gets enabled.
Allow the 'reverse' skb to pass and assign the existing (clashing)
entry.
While at it, also drop the extra 'dying' check, this is already
tested earlier by the calling function.
and new locally originating connection wants:
ORIGINAL: $IP_NAT:$SPORT -> $IP_RESPONDER:$DPORT
REPLY: $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT
This is handled by the NAT engine which will do a source port reallocation
for the locally originating connection that is colliding with an existing
tuple by attempting a source port rewrite.
This is done even if this new connection attempt did not go through a
masquerade/snat rule.
There is a rare race condition with connection-less protocols like UDP,
where we do the port reallocation even though its not needed.
This happens when new packets from the same, pre-existing flow are received
in both directions at the exact same time on different CPUs after the
conntrack table was flushed (or conntrack becomes active for first time).
With strict ordering/single cpu, the first packet creates new ct entry and
second packet is resolved as established reply packet.
With parallel processing, both packets are picked up as new and both get
their own ct entry.
In this case, the 'reply' packet (picked up as ORIGINAL) can be mangled by
NAT engine because a port collision is detected.
This change isn't enough to prevent a packet drop later during
nf_conntrack_confirm(), the existing clash resolution strategy will not
detect such reverse clash case. This is resolved by a followup patch.
Oliver Neukum [Thu, 19 Sep 2024 12:33:42 +0000 (14:33 +0200)]
usbnet: fix cyclical race on disconnect with work queue
The work can submit URBs and the URBs can schedule the work.
This cycle needs to be broken, when a device is to be stopped.
Use a flag to do so.
This is a design issue as old as the driver.
net: stmmac: set PP_FLAG_DMA_SYNC_DEV only if XDP is enabled
Commit 5fabb01207a2 ("net: stmmac: Add initial XDP support") sets
PP_FLAG_DMA_SYNC_DEV flag for page_pool unconditionally,
page_pool_recycle_direct() will call page_pool_dma_sync_for_device()
on every page even the page is not going to be reused by XDP program.
When XDP is not enabled, the page which holds the received buffer
will be recycled once the buffer is copied into new SKB by
skb_copy_to_linear_data(), then the MAC core will never reuse this
page any longer. Always setting PP_FLAG_DMA_SYNC_DEV wastes CPU cycles
on unnecessary calling of page_pool_dma_sync_for_device().
After this patch, up to 9% noticeable performance improvement was observed
on certain platforms.
Wenbo Li [Thu, 19 Sep 2024 08:13:51 +0000 (16:13 +0800)]
virtio_net: Fix mismatched buf address when unmapping for small packets
Currently, the virtio-net driver will perform a pre-dma-mapping for
small or mergeable RX buffer. But for small packets, a mismatched address
without VIRTNET_RX_PAD and xdp_headroom is used for unmapping.
That will result in unsynchronized buffers when SWIOTLB is enabled, for
example, when running as a TDX guest.
This patch unifies the address passed to the virtio core as the address of
the virtnet header and fixes the mismatched buffer address.
Changes from v2: unify the buf that passed to the virtio core in small
and merge mode.
Changes from v1: Use ctx to get xdp_headroom.
- Fix for a regression with the IO scheduler switching freezing from
6.11 (Damien)
- Use a raw spinlock for sbitmap, as it may get called from preempt
disabled context (Ming)
- Cleanup for bd_claiming waiting, using var_waitqueue() rather than
the bit waitqueues, as that more accurately describes that it does
(Neil)
- Various cleanups (Kanchan, Qiu-ji, David)
* tag 'for-6.12/block-20240925' of git://git.kernel.dk/linux:
nvme: remove CC register read-back during enabling
nvme: null terminate nvme_tls_attrs
nvme-multipath: avoid hang on inaccessible namespaces
nvme-multipath: system fails to create generic nvme device
lib/sbitmap: define swap_lock as raw_spinlock_t
block: Remove unused blk_limits_io_{min,opt}
drbd: Fix atomicity violation in drbd_uuid_set_bm()
block: Fix elv_iosched_local_module handling of "none" scheduler
block: remove bogus union
block: change wait on bd_claiming to use a var_waitqueue
blk-integrity: improved sg segment mapping
block: unexport blk_rq_count_integrity_sg
nvme-rdma: use request to get integrity segments
scsi: use request to get integrity segments
block: provide a request helper for user integrity segments
blk-integrity: consider entire bio list for merging
blk-integrity: properly account for segments
blk-mq: set the nr_integrity_segments from bio
blk-mq: unconditional nr_integrity_segments
Merge tag 'spi-fix-v6.12-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"Some driver specific fixes that came in during the merge window.
Lorenzo Bianconi did some extra testing on the recently added arioha
driver and found some issues, Alexander Dahl fixed some issues with
signal delays in the Atmel QSPI driver and Jinjie Ruan has been fixing
some nits with runtime PM cleanup"
* tag 'spi-fix-v6.12-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: atmel-quadspi: Avoid overwriting delay register settings
spi: airoha: remove read cache in airoha_snand_dirmap_read()
spi: spi-fsl-lpspi: Undo runtime PM changes at driver exit time
spi: atmel-quadspi: Undo runtime PM changes at driver exit time
spi: airoha: fix airoha_snand_{write,read}_data data_len estimation
spi: airoha: fix dirmap_{read,write} operations
Merge tag 'rtc-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"More conversions of DT bindings to yaml. There is one new driver, for
the DFRobot SD2405AL and support for important features of the stm32
RTC. Summary:
New driver:
- DFRobot SD2405AL
Drivers:
- stm32: add alarm A out and LSCO support
- sun6i: disable automatic clock input switching
- m48t59: set range"
* tag 'rtc-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
rtc: rc5t619: use proper module tables
rtc: m48t59: set range
dt-bindings: rtc: microcrystal,rv3028: add #clock-cells property
rtc: m48t59: Remove division condition with direct comparison
rtc: at91sam9: fix OF node leak in probe() error path
rtc: sun6i: disable automatic clock input switching
dt-bindings: rtc: Drop non-trivial duplicate compatibles
dt-bindings: vendor-prefixes: Add DFRobot.
dt-bindings: rtc: Add support for SD2405AL.
rtc: Add driver for SD2405AL
rtc: s35390a: Drop vendorless compatible string from match table
rtc: twl: convert comma to semicolon
dt-bindings: rtc: sprd,sc2731-rtc: convert to YAML
rtc: stm32: add alarm A out feature
rtc: stm32: add Low Speed Clock Output (LSCO) support
rtc: stm32: add pinctrl and pinmux interfaces
dt-bindings: rtc: stm32: describe pinmux nodes
This reverts commit 9184b17fbc23 ("dt-bindings: input: Goodix SPI HID
Touchscreen") because it duplicates existing binding leadings to errors:
goodix,gt7986u.example.dtb:
touchscreen@0: compatible: 'oneOf' conditional failed, one must be fixed:
['goodix,gt7986u'] is too short
'goodix,gt7375p' was expected
This was reported on mailing list on 6th of September, but no reaction
happened from contributor or maintainer to fix it.
Therefore let's drop binding which breaks and duplicates existing one.
HID: hid-goodix: drop unsupported and undocumented DT part
Drop support for Devicetree from, because the binding is being reverted
(on basis of duplicating existing binding) and property was not added to
the original binding.
Merge tag 'memblock-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
Pull memblock updates from Mike Rapoport:
- new memblock_estimated_nr_free_pages() helper to replace
totalram_pages() which is less accurate when
CONFIG_DEFERRED_STRUCT_PAGE_INIT is set
- fixes for memblock tests
* tag 'memblock-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
s390/mm: get estimated free pages by memblock api
kernel/fork.c: get estimated free pages by memblock api
mm/memblock: introduce a new helper memblock_estimated_nr_free_pages()
memblock test: fix implicit declaration of function 'strscpy'
memblock test: fix implicit declaration of function 'isspace'
memblock test: fix implicit declaration of function 'memparse'
memblock test: add the definition of __setup()
memblock test: fix implicit declaration of function 'virt_to_phys'
tools/testing: abstract two init.h into common include directory
memblock tests: include export.h in linkage.h as kernel dose
memblock tests: include memory_hotplug.h in mmzone.h as kernel dose
Merge tag 'sparc-for-6.12-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/alarsson/linux-sparc
Pull sparc32 update from Andreas Larsson:
- Remove an unused variable for sparc32
* tag 'sparc-for-6.12-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/alarsson/linux-sparc:
arch/sparc: remove unused varible paddrbase in function leon_swprobe()
Merge tag 'powerpc-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix build error in vdso32 when building 64-bit with COMPAT=y and -Os
- Fix build error in pseries EEH when CONFIG_DEBUG_FS is not set
Thanks to Christophe Leroy, Narayana Murty N, Christian Zigotzky, and
Ritesh Harjani.
* tag 'powerpc-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/pseries/eeh: move pseries_eeh_err_inject() outside CONFIG_DEBUG_FS block
powerpc/vdso32: Fix use of crtsavres for PPC64
Kbuild: make MODVERSIONS support depend on not being a compile test build
Currently the Rust support is gated on not having MODVERSIONS enabled,
and as a result an "allmodconfig" build will disable Rust build tests.
While MODVERSIONS configurations are worth build testing, the feature is
not actually meaningful unless you run the result, and I'd rather get
build coverage of Rust than MODVERSIONS. So let's disable MODVERSIONS
for build testing until the Rust side clears up.
Merge tag 'rust-6.12' of https://github.com/Rust-for-Linux/linux
Pull Rust updates from Miguel Ojeda:
"Toolchain and infrastructure:
- Support 'MITIGATION_{RETHUNK,RETPOLINE,SLS}' (which cleans up
objtool warnings), teach objtool about 'noreturn' Rust symbols and
mimic '___ADDRESSABLE()' for 'module_{init,exit}'. With that, we
should be objtool-warning-free, so enable it to run for all Rust
object files.
- KASAN (no 'SW_TAGS'), KCFI and shadow call sanitizer support.
- Support 'RUSTC_VERSION', including re-config and re-build on
change.
- Split helpers file into several files in a folder, to avoid
conflicts in it. Eventually those files will be moved to the right
places with the new build system. In addition, remove the need to
manually export the symbols defined there, reusing existing
machinery for that.
- Relax restriction on configurations with Rust + GCC plugins to just
the RANDSTRUCT plugin.
'kernel' crate:
- New 'list' module: doubly-linked linked list for use with reference
counted values, which is heavily used by the upcoming Rust Binder.
This includes 'ListArc' (a wrapper around 'Arc' that is guaranteed
unique for the given ID), 'AtomicTracker' (tracks whether a
'ListArc' exists using an atomic), 'ListLinks' (the prev/next
pointers for an item in a linked list), 'List' (the linked list
itself), 'Iter' (an iterator over a 'List'), 'Cursor' (a cursor
into a 'List' that allows to remove elements), 'ListArcField' (a
field exclusively owned by a 'ListArc'), as well as support for
heterogeneous lists.
- New 'rbtree' module: red-black tree abstractions used by the
upcoming Rust Binder.
This includes 'RBTree' (the red-black tree itself), 'RBTreeNode' (a
node), 'RBTreeNodeReservation' (a memory reservation for a node),
'Iter' and 'IterMut' (immutable and mutable iterators), 'Cursor'
(bidirectional cursor that allows to remove elements), as well as
an entry API similar to the Rust standard library one.
- 'init' module: add 'write_[pin_]init' methods and the
'InPlaceWrite' trait. Add the 'assert_pinned!' macro.
- 'sync' module: implement the 'InPlaceInit' trait for 'Arc' by
introducing an associated type in the trait.
- 'alloc' module: add 'drop_contents' method to 'BoxExt'.
- 'types' module: implement the 'ForeignOwnable' trait for
'Pin<Box<T>>' and improve the trait's documentation. In addition,
add the 'into_raw' method to the 'ARef' type.
- 'error' module: in preparation for the upcoming Rust support for
32-bit architectures, like arm, locally allow Clippy lint for
those.
Documentation:
- https://rust.docs.kernel.org has been announced, so link to it.
- Enable rustdoc's "jump to definition" feature, making its output a
bit closer to the experience in a cross-referencer.
- Debian Testing now also provides recent Rust releases (outside of
the freeze period), so add it to the list.
MAINTAINERS:
- Trevor is joining as reviewer of the "RUST" entry.
And a few other small bits"
* tag 'rust-6.12' of https://github.com/Rust-for-Linux/linux: (54 commits)
kasan: rust: Add KASAN smoke test via UAF
kbuild: rust: Enable KASAN support
rust: kasan: Rust does not support KHWASAN
kbuild: rust: Define probing macros for rustc
kasan: simplify and clarify Makefile
rust: cfi: add support for CFI_CLANG with Rust
cfi: add CONFIG_CFI_ICALL_NORMALIZE_INTEGERS
rust: support for shadow call stack sanitizer
docs: rust: include other expressions in conditional compilation section
kbuild: rust: replace proc macros dependency on `core.o` with the version text
kbuild: rust: rebuild if the version text changes
kbuild: rust: re-run Kconfig if the version text changes
kbuild: rust: add `CONFIG_RUSTC_VERSION`
rust: avoid `box_uninit_write` feature
MAINTAINERS: add Trevor Gross as Rust reviewer
rust: rbtree: add `RBTree::entry`
rust: rbtree: add cursor
rust: rbtree: add mutable iterator
rust: rbtree: add iterator
rust: rbtree: add red-black tree implementation backed by the C version
...
tracing/fprobe: Support raw tracepoints on future loaded modules
Support raw tracepoint events on future loaded (unloaded) modules.
This allows user to create raw tracepoint events which can be used from
module's __init functions.
Note: since the kernel does not have any information about the tracepoints
in the unloaded modules, fprobe events can not check whether the tracepoint
exists nor extend the BTF based arguments.
tracing/fprobe: Support raw tracepoint events on modules
Support raw tracepoint event on module by fprobe events.
Since it only uses for_each_kernel_tracepoint() to find a tracepoint,
the tracepoints on modules are not handled. Thus if user specified a
tracepoint on a module, it shows an error.
This adds new for_each_module_tracepoint() API to tracepoint subsystem,
and uses it to find tracepoints on modules.
tracepoint: Support iterating tracepoints in a loading module
Add for_each_tracepoint_in_module() function to iterate tracepoints in
a module. This API is needed for handling tracepoints in a loading
module from tracepoint_module_notifier callback function.
This also update for_each_module_tracepoint() to pass the module to
callback function so that it can find module easily.
tracepoint: Support iterating over tracepoints on modules
Add for_each_module_tracepoint() for iterating over tracepoints
on modules. This is similar to the for_each_kernel_tracepoint()
but only for the tracepoints on modules (not including kernel
built-in tracepoints).
Jason Andryuk [Fri, 23 Aug 2024 19:36:30 +0000 (15:36 -0400)]
x86/pvh: Add 64bit relocation page tables
The PVH entry point is 32bit. For a 64bit kernel, the entry point must
switch to 64bit mode, which requires a set of page tables. In the past,
PVH used init_top_pgt.
This works fine when the kernel is loaded at LOAD_PHYSICAL_ADDR, as the
page tables are prebuilt for this address. If the kernel is loaded at a
different address, they need to be adjusted.
__startup_64() adjusts the prebuilt page tables for the physical load
address, but it is 64bit code. The 32bit PVH entry code can't call it
to adjust the page tables, so it can't readily be re-used.
64bit PVH entry needs page tables set up for identity map, the kernel
high map and the direct map. pvh_start_xen() enters identity mapped.
Inside xen_prepare_pvh(), it jumps through a pv_ops function pointer
into the highmap. The direct map is used for __va() on the initramfs
and other guest physical addresses.
Add a dedicated set of prebuild page tables for PVH entry. They are
adjusted in assembly before loading.
Add XEN_ELFNOTE_PHYS32_RELOC to indicate support for relocation
along with the kernel's loading constraints. The maximum load address,
KERNEL_IMAGE_SIZE - 1, is determined by a single pvh_level2_ident_pgt
page. It could be larger with more pages.
tomoyo: fallback to realpath if symlink's pathname does not exist
Alfred Agrell found that TOMOYO cannot handle execveat(AT_EMPTY_PATH)
inside chroot environment where /dev and /proc are not mounted, for
commit 51f39a1f0cea ("syscalls: implement execveat() system call") missed
that TOMOYO tries to canonicalize argv[0] when the filename fed to the
executed program as argv[0] is supplied using potentially nonexistent
pathname.
Since "/dev/fd/<fd>" already lost symlink information used for obtaining
that <fd>, it is too late to reconstruct symlink's pathname. Although
<filename> part of "/dev/fd/<fd>/<filename>" might not be canonicalized,
TOMOYO cannot use tomoyo_realpath_nofollow() when /dev or /proc is not
mounted. Therefore, fallback to tomoyo_realpath_from_path() when
tomoyo_realpath_nofollow() failed.
Jason Andryuk [Fri, 23 Aug 2024 19:36:28 +0000 (15:36 -0400)]
x86/pvh: Set phys_base when calling xen_prepare_pvh()
phys_base needs to be set for __pa() to work in xen_pvh_init() when
finding the hypercall page. Set it before calling into
xen_prepare_pvh(), which calls xen_pvh_init(). Clear it afterward to
avoid __startup_64() adding to it and creating an incorrect value.
Jason Andryuk [Fri, 23 Aug 2024 19:36:27 +0000 (15:36 -0400)]
x86/pvh: Make PVH entrypoint PIC for x86-64
The PVH entrypoint is 32bit non-PIC code running the uncompressed
vmlinux at its load address CONFIG_PHYSICAL_START - default 0x1000000
(16MB). The kernel is loaded at that physical address inside the VM by
the VMM software (Xen/QEMU).
When running a Xen PVH Dom0, the host reserved addresses are mapped 1-1
into the PVH container. There exist system firmwares (Coreboot/EDK2)
with reserved memory at 16MB. This creates a conflict where the PVH
kernel cannot be loaded at that address.
Modify the PVH entrypoint to be position-indepedent to allow flexibility
in load address. Only the 64bit entry path is converted. A 32bit
kernel is not PIC, so calling into other parts of the kernel, like
xen_prepare_pvh() and mk_pgtable_32(), don't work properly when
relocated.
This makes the code PIC, but the page tables need to be updated as well
to handle running from the kernel high map.
The UNWIND_HINT_END_OF_STACK is to silence:
vmlinux.o: warning: objtool: pvh_start_xen+0x7f: unreachable instruction
after the lret into 64bit code.
Andrii Nakryiko [Tue, 13 Aug 2024 20:34:09 +0000 (13:34 -0700)]
uprobes: turn trace_uprobe's nhit counter to be per-CPU one
trace_uprobe->nhit counter is not incremented atomically, so its value
is questionable in when uprobe is hit on multiple CPUs simultaneously.
Also, doing this shared counter increment across many CPUs causes heavy
cache line bouncing, limiting uprobe/uretprobe performance scaling with
number of CPUs.
Solve both problems by making this a per-CPU counter.
Luigi Leonardi [Tue, 30 Jul 2024 19:47:32 +0000 (21:47 +0200)]
vsock/virtio: avoid queuing packets when intermediate queue is empty
When the driver needs to send new packets to the device, it always
queues the new sk_buffs into an intermediate queue (send_pkt_queue)
and schedules a worker (send_pkt_work) to then queue them into the
virtqueue exposed to the device.
This increases the chance of batching, but also introduces a lot of
latency into the communication. So we can optimize this path by
adding a fast path to be taken when there is no element in the
intermediate queue, there is space available in the virtqueue,
and no other process that is sending packets (tx_lock held).
The following benchmarks were run to check improvements in latency and
throughput. The test bed is a host with Intel i7-10700KF CPU @ 3.80GHz
and L1 guest running on QEMU/KVM with vhost process and all vCPUs
pinned individually to pCPUs.
- Latency
Tool: Fio version 3.37-56
Mode: pingpong (h-g-h)
Test runs: 50
Runtime-per-test: 50s
Type: SOCK_STREAM
In the following fio benchmark (pingpong mode) the host sends
a payload to the guest and waits for the same payload back.
fio process pinned both inside the host and the guest system.
Before: Linux 6.9.8
Payload 64B:
1st perc. overall 99th perc.
Before 12.91 16.78 42.24 us
After 9.77 13.57 39.17 us
Payload 512B:
1st perc. overall 99th perc.
Before 13.35 17.35 41.52 us
After 10.25 14.11 39.58 us
Payload 4K:
1st perc. overall 99th perc.
Before 14.71 19.87 41.52 us
After 10.51 14.96 40.81 us
- Throughput
Tool: iperf-vsock
The size represents the buffer length (-l) to read/write
P represents the number of parallel streams
P=1
4K 64K 128K
Before 6.87 29.3 29.5 Gb/s
After 10.5 39.4 39.9 Gb/s
P=2
4K 64K 128K
Before 10.5 32.8 33.2 Gb/s
After 17.8 47.7 48.5 Gb/s
P=4
4K 64K 128K
Before 12.7 33.6 34.2 Gb/s
After 16.9 48.1 50.5 Gb/s
The performance improvement is related to this optimization,
I used a ebpf kretprobe on virtio_transport_send_skb to check
that each packet was sent directly to the virtqueue
Preliminary patch to introduce an optimization to the
enqueue system.
All the code used to enqueue a packet into the virtqueue
is removed from virtio_transport_send_pkt_work()
and moved to the new virtio_transport_send_skb() function.
Dragos Tatulea [Fri, 30 Aug 2024 10:58:38 +0000 (13:58 +0300)]
vdpa/mlx5: Postpone MR deletion
Currently, when a new MR is set up, the old MR is deleted. MR deletion
is about 30-40% the time of MR creation. As deleting the old MR is not
important for the process of setting up the new MR, this operation
can be postponed.
This series adds a workqueue that does MR garbage collection at a later
point. If the MR lock is taken, the handler will back off and
reschedule. The exception during shutdown: then the handler must
not postpone the work.
Note that this is only a speculative optimization: if there is some
mapping operation that is triggered while the garbage collector handler
has the lock taken, this operation it will have to wait for the handler
to finish.
Dragos Tatulea [Fri, 30 Aug 2024 10:58:37 +0000 (13:58 +0300)]
vdpa/mlx5: Introduce init/destroy for MR resources
There's currently not a lot of action happening during
the init/destroy of MR resources. But more will be added
in the upcoming patches.
As the mr mutex lock init/destroy has been moved to these
new functions, the lifetime has now shifted away from
mlx5_vdpa_alloc_resources() / mlx5_vdpa_free_resources()
into these new functions. However, the lifetime at the
outer scope remains the same:
mlx5_vdpa_dev_add() / mlx5_vdpa_dev_free()