This commit cancels all requests with io-wq, not just the ones from the
originating task. This breaks use cases that have thread pools, or just
multiple tasks issuing requests on the same ring. The liburing
regression test for this also shows that problem:
$ test/thread-exit.t
cqe->res=-125, Expected 512
where an IO thread gets its request canceled rather than complete
successfully.
Pavel Begunkov [Thu, 7 Sep 2023 12:50:07 +0000 (13:50 +0100)]
io_uring: break out of iowq iopoll on teardown
io-wq will retry iopoll even when it failed with -EAGAIN. If that
races with task exit, which sets TIF_NOTIFY_SIGNAL for all its workers,
such workers might spin indefinitely, retrying iopoll again and again and
each time failing on some allocation / waiting / etc. Don't
keep spinning if io-wq is dying.
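A minimal sketch of the shape of the fix, assuming simplified names for the
retry path (IO_WQ_BIT_EXIT is real io-wq state; issue_req() stands in for the
actual request issue path):

/* Sketch only: retry iopoll on -EAGAIN, but bail out once the io-wq is
 * being torn down instead of spinning forever. */
static void iowq_iopoll_retry(struct io_worker *worker, struct io_kiocb *req)
{
	int ret;

	do {
		ret = issue_req(req);		/* may return -EAGAIN */
		if (ret != -EAGAIN)
			break;
		if (test_bit(IO_WQ_BIT_EXIT, &worker->wq->state))
			break;			/* io-wq dying: stop */
		cond_resched();
	} while (1);
}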
Paolo Abeni [Thu, 7 Sep 2023 09:47:15 +0000 (11:47 +0200)]
Merge tag 'nf-23-09-06' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Florian Westphal says:
====================
netfilter updates for net
This PR contains nf_tables updates for your *net* tree.
This time almost all fixes are for old bugs:
First patch fixes a 4-byte stack OOB write, from myself.
This has been broken ever since nftables was switched from 128-bit to
32-bit register addressing in v4.1.
2nd patch fixes an out-of-bounds read.
This has been broken ever since xt_osf got added in 2.6.31; the bug
was then just moved around during refactoring. Fix from Wander Lairson Costa.
3rd patch adds a missing enum description, from Phil Sutter.
4th patch fixes a use-after-free in nftables that occurs when userspace adds
elements with a timeout so small that expiration happens while the
transaction is still in progress. Fix from Pablo Neira Ayuso.
Patch 5 fixes an out-of-bounds memory access; this has been
broken since v4.20. Patch from Kyle Zeng and Jozsef Kadlecsik.
Patch 6 fixes another bogus memory access when building the audit
record. Bug added in the previous pull request, fix from Pablo.
netfilter pull request 2023-09-06
* tag 'nf-23-09-06' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: Unbreak audit log reset
netfilter: ipset: add the missing IP_SET_HASH_WITH_NET0 macro for ip_set_hash_netportnet.c
netfilter: nft_set_rbtree: skip sync GC for new elements in this transaction
netfilter: nf_tables: uapi: Describe NFTA_RULE_CHAIN_ID
netfilter: nfnetlink_osf: avoid OOB read
netfilter: nftables: exthdr: fix 4-byte stack OOB write
====================
Will Deacon [Thu, 7 Sep 2023 08:54:11 +0000 (09:54 +0100)]
arm64: csum: Fix OoB access in IP checksum code for negative lengths
Although commit c2c24edb1d9c ("arm64: csum: Fix pathological zero-length
calls") added an early return for zero-length input, syzkaller has
popped up with an example of a _negative_ length which causes an
undefined shift and an out-of-bounds read.
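A sketch of the guard, assuming a do_csum()-style helper with a signed
length; the essential change is widening the existing zero-length check to
also reject negative values:

/* Sketch: reject zero and negative lengths up front, so the later
 * shifts and loads never operate on a bogus byte count. */
static unsigned int do_csum(const unsigned char *buff, int len)
{
	if (unlikely(len <= 0))		/* was: len == 0 */
		return 0;
	/* ... word-at-a-time summation over len bytes ... */
	return 0;			/* placeholder for the folded sum */
}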
Jie Wang [Wed, 6 Sep 2023 07:20:18 +0000 (15:20 +0800)]
net: hns3: remove GSO partial feature bit
The HNS3 NIC does not support GSO partial packet segmentation. Segment
offload and checksum offload for tunnel packets (for example, NVGRE
packets) are already supported, so there is no need to keep the GSO
partial feature bit. This patch removes it.
Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC")
Signed-off-by: Jie Wang <[email protected]>
Signed-off-by: Jijie Shao <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
net: hns3: fix invalid mutex between tc qdisc and dcb ets command issue
The tc qdisc and dcb ets commands should not be used crosswise: if we want
to use one of the commands to configure tc, we must first use the other
command to clear the existing configuration. However, when we configure a
single tc with tc qdisc, we can still configure it with dcb ets.
This is because we use mqprio_active as the tag of the tc qdisc
configuration, but with dcb ets we do not check mqprio_active.
This patch fixes the issue by checking mqprio_active before
executing the dcb ets command, and adds dcb_ets_active to
replace HCLGE_FLAG_DCB_ENABLE and HCLGE_FLAG_MQPRIO_ENABLE
at the hclge layer.
Fixes: cacde272dd00 ("net: hns3: Add hclge_dcb module for the support of DCB feature")
Signed-off-by: Jijie Shao <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
net: hns3: fix byte order conversion issue in hclge_dbg_fd_tcam_read()
req1->tcam_data is defined as "u8 tcam_data[8]", and we convert it to
(u32 *) without considering byte order conversion,
which may result in printing wrong data for tcam_data.
Currently, the driver knocks the ring doorbell before updating
ring->last_to_use in the tx flow. If hardware packet transmission and
napi poll scheduling are fast enough, the driver's napi poll may see
the old ring->last_to_use.
In this case, the driver will think the tx is not completed, and
return directly without clearing the __QUEUE_STATE_STACK_XOFF flag,
which may cause a tx timeout.
Kailang Yang [Wed, 6 Sep 2023 08:50:41 +0000 (16:50 +0800)]
ALSA: hda/realtek - ALC287 I2S speaker platform support
0x17 was the only speaker pin; its DAC is assigned 0x03, while the
headphone is assigned 0x02.
Playback via headphone will get EQ filter processing, so it needs to
swap DACs.
Steve French [Fri, 1 Sep 2023 06:29:17 +0000 (01:29 -0500)]
smb3: add trace point for queryfs (statfs)
In debugging a recent performance problem with statfs, it would have
been helpful to be able to trace the smb3 query fs info request
more narrowly. Add a trace point "smb3_qfs_done".
1, Enable LSX and LASX.
2, Enable KASLR (CONFIG_RANDOMIZE_BASE).
3, Enable jump label (patching mechanism for static key).
4, Enable LoongArch CRC32(c) Acceleration.
5, Enable Loongson-specific drivers: I2C/RTC/DRM/SOC/CLK/PINCTRL/GPIO/SPI.
6, Enable EXFAT/NTFS3/JFS/GFS2/OCFS2/UBIFS/EROFS/CEPH file systems.
7, Enable WangXun NGBE/TXGBE NIC drivers.
8, Enable some IPVS options.
9, Remove CONFIG_SYSFS_DEPRECATED since it is removed in Kconfig.
10, Remove CONFIG_IP_NF_TARGET_CLUSTERIP since it is removed in Kconfig.
11, Remove CONFIG_NFT_OBJREF since it is removed in Kconfig.
12, Remove CONFIG_R8188EU since it is replaced by CONFIG_RTL8XXXU.
net: phy: Provide Module 4 KSZ9477 errata (DS80000754C)
The KSZ9477 errata points out (in 'Module 4') the link up/down problems
when EEE (Energy Efficient Ethernet) is enabled in the device to which
the KSZ9477 tries to auto-negotiate.
The suggested workaround is to clear advertisement of EEE for PHYs in
this chip driver.
To avoid regressions with other switch ICs the new MICREL_NO_EEE flag
has been introduced.
Moreover, the in-register disablement of the MMD_DEVICE_ID_EEE_ADV.MMD_EEE_ADV
MMD register is removed, as this code is now both executed too late
(after the previous rework of the PHY and DSA for KSZ switches) and not
required, since setting all members of the eee_broken_modes bit field
prevents the KSZ9477 from advertising EEE.
Hamza Mahfooz [Tue, 5 Sep 2023 17:27:22 +0000 (13:27 -0400)]
drm/amd/display: prevent potential division by zero errors
There are two places in apply_below_the_range() where it's possible for
a divide-by-zero error to occur. So, to fix this, make sure the divisor
is non-zero before attempting the computation in both cases.
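A hedged sketch of the pattern (apply_below_the_range() is real; the names
below are illustrative, not the exact DC code):

/* Sketch: only divide when the divisor is known to be non-zero. */
static unsigned int safe_frames_to_insert(unsigned int frames,
					  unsigned int delta_us)
{
	if (delta_us == 0)		/* avoid division by zero */
		return 0;
	return frames / delta_us;
}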
Melissa Wen [Thu, 31 Aug 2023 16:12:28 +0000 (15:12 -0100)]
drm/amd/display: enable cursor degamma for DCN3+ DRM legacy gamma
For DRM legacy gamma, AMD display manager applies implicit sRGB degamma
using a pre-defined sRGB transfer function. It works fine for DCN2
family where degamma ROM and custom curves go to the same color block.
But, on DCN3+, degamma is split into two blocks: degamma ROM for
pre-defined TFs and `gamma correction` for user/custom curves, and
degamma ROM settings don't apply to the cursor plane. To get DRM legacy
gamma working as expected, enable cursor degamma ROM for implicit sRGB
degamma on HW with this configuration.
Hamza Mahfooz [Thu, 31 Aug 2023 19:22:35 +0000 (15:22 -0400)]
drm/amd/display: limit the v_startup workaround to ASICs older than DCN3.1
Since calling dcn20_adjust_freesync_v_startup() on DCN3.1+ ASICs
can cause the display to flicker and underflow to occur, we shouldn't
call it for them. So, ensure that the DCN version is less than
DCN_VERSION_3_1 before calling dcn20_adjust_freesync_v_startup().
Jakub Kicinski [Thu, 7 Sep 2023 01:43:05 +0000 (18:43 -0700)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-09-06
We've added 9 non-merge commits during the last 6 day(s) which contain
a total of 12 files changed, 189 insertions(+), 44 deletions(-).
The main changes are:
1) Fix bpf_sk_storage to address an invalid wait context lockdep
report and another one to address missing omem uncharge,
from Martin KaFai Lau.
2) Two BPF recursion detection related fixes,
from Sebastian Andrzej Siewior.
3) Fix tailcall limit enforcement in trampolines for s390 JIT,
from Ilya Leoshkevich.
4) Fix a sockmap refcount race where skbs in sk_psock_backlog can
be referenced after user space side has already skb_consumed them,
from John Fastabend.
5) Fix BPF CI flake/race wrt sockmap vsock write test where
the transport endpoint is not connected, from Xu Kuohai.
6) Follow-up doc fix to address a cross-link warning,
from Eduard Zingerman.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Check bpf_sk_storage has uncharged sk_omem_alloc
bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc
bpf: bpf_sk_storage: Fix invalid wait context lockdep report
s390/bpf: Pass through tail call counter in trampolines
bpf: Assign bpf_tramp_run_ctx::saved_run_ctx before recursion check.
bpf: Invoke __bpf_prog_exit_sleepable_recur() on recursion in kern_sys_bpf().
bpf, sockmap: Fix skb refcnt race after locking changes
docs/bpf: Fix "file doesn't exist" warnings in {llvm_reloc,btf}.rst
selftests/bpf: Fix a CI failure caused by vsock write
====================
8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves")
The test system has 64GB of enclave memory, and all is assigned to a single VM.
Release of 'vepc' takes a longer time and causes long latencies, which triggers
the softlockup warning.
Add cond_resched() to give other tasks a chance to run and reduce
latencies, which also avoids the softlockup detector.
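The shape of such a fix, sketched with illustrative names (cond_resched()
is the real primitive; the loop body is a stand-in for the actual vEPC
teardown work):

/* Sketch: voluntarily reschedule while releasing many EPC pages, so a
 * large teardown no longer trips the softlockup detector. */
static void release_vepc_pages(struct vepc *vepc)
{
	struct sgx_epc_page *page, *tmp;

	list_for_each_entry_safe(page, tmp, &vepc->page_list, list) {
		zap_one_epc_page(page);		/* illustrative helper */
		cond_resched();			/* let other tasks run */
	}
}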
Thomas Huth [Wed, 6 Sep 2023 16:26:58 +0000 (18:26 +0200)]
x86: Remove the arch_calc_vm_prot_bits() macro from the UAPI
The arch_calc_vm_prot_bits() macro uses VM_PKEY_BIT0 etc. which are
not part of the UAPI, so the macro is completely useless for userspace.
It is also hidden behind the CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
config switch which we shouldn't expose to userspace. Thus let's move
this macro into a new internal header instead.
powercap: intel_rapl: Fix invalid setting of Power Limit 4
The system runs at minimum performance once the powercap RAPL package
domain 'enabled' flag is changed from 1 to 0 and back to 1.
Setting the RAPL package domain enabled flag to 0 results in setting
power limit 4 (PL4) MSR 0x601 to 0. This implies disabling the PL4 limit.
The PL4 limit controls the peak power, so setting 0 results in some
undesirable performance, which depends on the hardware implementation.
Even worse, when the enabled flag is set to 1 again, this will set the
PL4 MSR value to 0x01, which means reducing peak power to 0.125W. This
will force the system to run at the lowest possible performance on every
PL4 supported system.
Setting the enabled flag should only affect the "enable" bit, not other
bits. Here it is changing the power limit.
This is caused by a change which assumes that there is an enable bit in
the PL4 MSR like other power limits. Although PL4 enable/disable bit is
present with TPMI RAPL interface, it is not present with the MSR
interface.
There is a rapl_primitive_info entry defined for the non-existent PL4
enable bit, and since commit 9050a9cd5e4c ("powercap: intel_rapl:
Cleanup Power Limits support") it is used to enable PL4. This is wrong,
hence remove this rapl primitive for PL4. Also, in the function
rapl_detect_powerlimit(), PL_ENABLE is used to check for the presence of
power limits. Replace PL_ENABLE with PL_LIMIT, as PL_LIMIT must be
present. Without this change, PL4 controls would not be available in
sysfs once the rapl primitive for PL4 is removed.
Merge tag 'ceph-for-6.6-rc1' of https://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"Mixed with some fixes and cleanups, this brings in reasonably complete
fscrypt support to CephFS! The list of things which don't work with
encryption should be fairly short, mostly around the edges: fallocate
(not supported well in CephFS to begin with), copy_file_range
(requires re-encryption), non-default striping patterns.
This was a multi-year effort principally by Jeff Layton with
assistance from Xiubo Li, Luís Henriques and others, including several
dependent changes in the MDS, netfs helper library and fscrypt
framework itself"
* tag 'ceph-for-6.6-rc1' of https://github.com/ceph/ceph-client: (53 commits)
ceph: make num_fwd and num_retry to __u32
ceph: make members in struct ceph_mds_request_args_ext a union
rbd: use list_for_each_entry() helper
libceph: do not include crypto/algapi.h
ceph: switch ceph_lookup/atomic_open() to use new fscrypt helper
ceph: fix updating i_truncate_pagecache_size for fscrypt
ceph: wait for OSD requests' callbacks to finish when unmounting
ceph: drop messages from MDS when unmounting
ceph: update documentation regarding snapshot naming limitations
ceph: prevent snapshot creation in encrypted locked directories
ceph: add support for encrypted snapshot names
ceph: invalidate pages when doing direct/sync writes
ceph: plumb in decryption during reads
ceph: add encryption support to writepage and writepages
ceph: add read/modify/write to ceph_sync_write
ceph: align data in pages in ceph_sync_write
ceph: don't use special DIO path for encrypted inodes
ceph: add truncate size handling support for fscrypt
ceph: add object version support for sync read
libceph: allow ceph_osdc_new_request to accept a multi-op read
...
Mostafa reports that commit d232606773a0 ("arm64/sysreg: refactor
deprecated strncpy") breaks our early command-line parsing because the
original code is working on space-delimited substrings rather than
NUL-terminated strings.
Rather than simply reverting the broken conversion patch, replace the
strscpy() with a simple memcpy() with an explicit NUL-termination of the
result.
Merge tag 'input-for-v6.6-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
- a new driver for Azoteq IQS7210A/7211A/E touch controllers
- support for Azoteq IQS7222D variant added to iqs7222 driver
- support for touch keys functionality added to Melfas MMS114 driver
- new hardware IDs added to exc3000 and Goodix drivers
- xpad driver gained support for GameSir T4 Kaleid Controller
- a fix for xpad driver to properly support some third-party
controllers that need a magic packet to start properly
- a fix for psmouse driver to more reliably switch to RMI4 mode on
devices that use native RMI4/SMbus protocol
- a quirk for i8042 for TUXEDO Gemini 17 Gen1/Clevo PD70PN laptops
- multiple drivers have been updated to make use of devm and other
newer APIs such as dev_err_probe(), devm_regulator_get_enable(), and
others.
* tag 'input-for-v6.6-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (83 commits)
Input: goodix - add support for ACPI ID GDX9110
Input: rpckbd - fix the return value handle for platform_get_irq()
Input: tca6416-keypad - switch to using input core's polling features
Input: tca6416-keypad - convert to use devm_* api
Input: tca6416-keypad - fix interrupt enable disbalance
Input: tca6416-keypad - rely on I2C core to set up suspend/resume
Input: tca6416-keypad - always expect proper IRQ number in i2c client
Input: lm8323 - convert to use devm_* api
Input: lm8323 - rely on device core to create kp_disable attribute
Input: qt2160 - convert to use devm_* api
Input: qt2160 - do not hard code interrupt trigger
Input: qt2160 - switch to using threaded interrupt handler
Input: qt2160 - tweak check for i2c adapter functionality
Input: psmouse - add delay when deactivating for SMBus mode
Input: mcs-touchkey - fix uninitialized use of error in mcs_touchkey_probe()
Input: qt1070 - convert to use devm_* api
Input: mcs-touchkey - convert to use devm_* api
Input: amikbd - convert to use devm_* api
Input: lm8333 - convert to use devm_* api
Input: mms114 - add support for touch keys
...
Merge tag 'linux-watchdog-6.6-rc1' of git://www.linux-watchdog.org/linux-watchdog
Pull watchdog updates from Wim Van Sebroeck:
- add marvell GTI watchdog driver
- add support for Amlogic-T7 SoCs
- document the IPQ5018 watchdog compatible
- enable COMPILE_TEST for more watchdog device drivers
- core: stop watchdog when executing poweroff command
- other small improvements and fixes
* tag 'linux-watchdog-6.6-rc1' of git://www.linux-watchdog.org/linux-watchdog: (21 commits)
watchdog: Add support for Amlogic-T7 SoCs
watchdog: Add a new struct for Amlogic-GXBB driver
dt-bindings: watchdog: Add support for Amlogic-T7 SoCs
dt-bindings: watchdog: qcom-wdt: document IPQ5018
watchdog: imx2_wdt: Improve dev_crit() message
watchdog: stm32: Drop unnecessary of_match_ptr()
watchdog: sama5d4: readout initial state
watchdog: intel-mid_wdt: add MODULE_ALIAS() to allow auto-load
watchdog: core: stop watchdog when executing poweroff command
watchdog: pm8916_wdt: Remove redundant of_match_ptr()
watchdog: xilinx_wwdt: Use div_u64() in xilinx_wwdt_start()
watchdog: starfive: Remove #ifdef guards for PM related functions
watchdog: s3c2410: Fix potential deadlock on &wdt->lock
watchdog: rti_wdt: Add support for WDIOF_CARDRESET
dt-bindings: watchdog: ti,rti-wdt: Add support for WDIOF_CARDRESET
watchdog: Enable COMPILE_TEST for more drivers
watchdog: advantech_ec_wdt: fix Kconfig dependencies
watchdog: Explicitly include correct DT includes
Watchdog: Add marvell GTI watchdog driver
dt-bindings: watchdog: marvell GTI system watchdog driver
...
Deliver the audit log from __nf_tables_dump_rules(), as the table
dereference at the end of the table list loop might point to the list
head, leading to this crash.
Deliver audit log only once at the end of the rule dump+reset for
consistency with the set dump+reset.
Ensure audit reset access to table under rcu read side lock. The table
list iteration holds rcu read lock side, but recent audit code
dereferences table object out of the rcu read lock side.
netfilter: ipset: add the missing IP_SET_HASH_WITH_NET0 macro for ip_set_hash_netportnet.c
The missing IP_SET_HASH_WITH_NET0 macro in ip_set_hash_netportnet can
lead to the use of a wrong `CIDR_POS(c)` for calculating array offsets,
which can cause an integer underflow and, as a result, a slab
out-of-bounds access.
This patch adds back the IP_SET_HASH_WITH_NET0 macro to
ip_set_hash_netportnet to address the issue.
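A sketch of what the fix amounts to, assuming the usual ipset pattern in
which per-variant feature macros are defined before including the generic
hash implementation (the surrounding macros are shown as illustrative
context):

/* Sketch of the macro block in ip_set_hash_netportnet.c: the fix adds
 * back IP_SET_HASH_WITH_NET0 so CIDR_POS() computes correct offsets. */
#define IP_SET_HASH_WITH_PROTO
#define IP_SET_HASH_WITH_NETS
#define IPSET_NET_COUNT 2
#define IP_SET_HASH_WITH_NET0	/* the missing macro, restored */

#include "ip_set_hash_gen.h"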
netfilter: nft_set_rbtree: skip sync GC for new elements in this transaction
New elements in this transaction might expire before the transaction
ends. Skip sync GC for such elements, otherwise the commit path might
walk over an already released object. Once the transaction is finished,
async GC will collect such expired elements.
Fixes: f6c383b8c31a ("netfilter: nf_tables: adapt set backend to use GC transaction API")
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
The opt_num field is controlled by user mode and is not currently
validated inside the kernel. An attacker can take advantage of this to
trigger an OOB read and potentially leak information.
BUG: KASAN: slab-out-of-bounds in nf_osf_match_one+0xbed/0xd10 net/netfilter/nfnetlink_osf.c:88
Read of size 2 at addr ffff88804bc64272 by task poc/6431
If priv->len is a multiple of 4, then dst[len / 4] can write past
the destination array which leads to stack corruption.
This construct is necessary to clean the remainder of the register
in case ->len is NOT a multiple of the register size, so make it
conditional just like nft_payload.c does.
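In code, the conditional looks roughly like this (a sketch following the
nft_payload.c pattern the text refers to; dest, opt and offset are
illustrative locals):

/* Only zero the trailing register word when the copy length is not a
 * multiple of the 32-bit register size; when it is an exact multiple,
 * dest[priv->len / NFT_REG32_SIZE] would land past the array. */
if (priv->len % NFT_REG32_SIZE)
	dest[priv->len / NFT_REG32_SIZE] = 0;
memcpy(dest, opt + offset, priv->len);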
The bug was added in the 4.1 cycle and then copied/inherited when
tcp/sctp and ip option support was added.
Bug reported by Zero Day Initiative project (ZDI-CAN-21950,
ZDI-CAN-21951, ZDI-CAN-21961).
Fixes: 49499c3e6e18 ("netfilter: nf_tables: switch registers to 32 bit addressing")
Fixes: 935b7f643018 ("netfilter: nft_exthdr: add TCP option matching")
Fixes: 133dc203d77d ("netfilter: nft_exthdr: Support SCTP chunks")
Fixes: dbb5281a1f84 ("netfilter: nf_tables: add support for matching IPv4 options")
Signed-off-by: Florian Westphal <[email protected]>
Merge tag 'backlight-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight
Pull backlight updates from Lee Jones:
"New Functionality:
- Ensure correct includes are present and remove some that are not
required
- Drop redundant of_match_ptr() call to cast pointer to NULL
Bug Fixes:
- Revert to old (expected) behaviour of initialising PWM state on
first brightness change
- Correctly handle / propagate errors
- Fix 'sometimes-uninitialised' issues"
* tag 'backlight-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
backlight: led_bl: Remove redundant of_match_ptr()
backlight: lp855x: Drop ret variable in brightness change function
backlight: gpio_backlight: Drop output GPIO direction check for initial power state
backlight: lp855x: Catch errors when changing brightness
backlight: lp855x: Initialize PWM state on first brightness change
backlight: qcom-wled: Explicitly include correct DT includes
Commit f56914393537 ("gpio: zynq: fix zynqmp_gpio not an immutable chip
warning") ditched the open-coded resource allocation handlers in favor
of the generic ones. These generic handlers don't maintain the PM
runtime anymore, which causes a regression in that level IRQs are no
longer reported.
Restore the original handlers to fix this.
Signed-off-by: Daniel Mack <[email protected]>
Fixes: f56914393537 ("gpio: zynq: fix zynqmp_gpio not an immutable chip warning")
Cc: [email protected]
Signed-off-by: Bartosz Golaszewski <[email protected]>
LoongArch: Add KASAN (Kernel Address Sanitizer) support
1/8 of kernel addresses is reserved for shadow memory. But for LoongArch,
there are a lot of holes between different segments, and the valid address
space (256T available) is insufficient to map all these segments to kasan
shadow memory with the common formula provided by the kasan core, namely
(addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET
So LoongArch has an arch-specific mapping formula: different segments are
mapped individually, and only limited space lengths of these specific
segments are mapped to shadow.
At the early boot stage the whole shadow region is populated with just one
physical page (kasan_early_shadow_page). Later, this page is reused as a
readonly zero shadow for some memory that kasan currently doesn't track.
After mapping the physical memory, pages for shadow memory are allocated
and mapped.
Functions like memset()/memcpy()/memmove() do a lot of memory accesses.
If a bad pointer is passed to one of these functions, it is important
that it be caught. Compiler instrumentation cannot do this since these
functions are written in assembly.
KASan replaces memory functions with manually instrumented variants.
The original functions are declared as weak symbols so that strong
definitions in mm/kasan/kasan.c can replace them. The original functions
also have aliases with a '__' prefix in their names, so we can call the
non-instrumented variants if needed.
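The interposition pattern, sketched (the instrumented wrapper really does
defer to the '__'-prefixed alias; the range check shown is simplified):

/* Strong, instrumented definition: validate the whole range, then
 * call the uninstrumented original via its double-underscore alias. */
#undef memset
void *memset(void *addr, int c, size_t len)
{
	if (!kasan_check_range(addr, len, true, _RET_IP_))
		return NULL;

	return __memset(addr, c, len);
}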
LoongArch: Simplify the processing of jumping new kernel for KASLR
The modified relocate_kernel() doesn't return the new kernel's entry point
but the random_offset. In this way we share the start_kernel() processing
with the normal kernel, which avoids calling 'jr a0' directly and allows
some other operations (e.g., kasan_early_init) before start_kernel() when
KASLR (CONFIG_RANDOMIZE_BASE) is turned on.
kasan: Add (pmd|pud)_init for LoongArch zero_(pud|p4d)_populate process
LoongArch populates pmd/pud with invalid_pmd_table/invalid_pud_table in
pagetable_init, so pmd_init()/pud_init() are required; define them as
__weak in mm/kasan/init.c, like mm/sparse-vmemmap.c does.
kasan: Add __HAVE_ARCH_SHADOW_MAP to support arch specific mapping
MIPS, LoongArch and some other architectures have many holes between
different segments and the valid address space (256T available) is
insufficient to map all these segments to kasan shadow memory with the
common formula provided by kasan core. So we need architecture specific
mapping formulas to ensure different segments are mapped individually,
and only limited space lengths of those specific segments are mapped to
shadow.
Therefore, when an incoming address is converted to a shadow address, we
need to add a condition to determine whether it is valid.
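Sketched as code (the translation formula is the generic one quoted
earlier; arch_kasan_valid_addr() is an illustrative name for the
per-segment validity hook that __HAVE_ARCH_SHADOW_MAP enables):

/* Sketch: validate the address before applying the generic
 * address-to-shadow translation. */
static inline void *kasan_mem_to_shadow(const void *addr)
{
	if (!arch_kasan_valid_addr(addr))	/* hypothetical arch hook */
		return NULL;
	return (void *)((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
		+ KASAN_SHADOW_OFFSET;
}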
Enze Li [Wed, 6 Sep 2023 14:54:16 +0000 (22:54 +0800)]
LoongArch: Add KFENCE (Kernel Electric-Fence) support
The LoongArch architecture is quite different from other architectures.
When the KFENCE pool itself is allocated, it is mapped into the direct
mapping configuration window [1] by default on LoongArch. This means that
it is not possible to use the page table mapped mode required by the
KFENCE system, and therefore the pool should be remapped to the
appropriate region.
This patch adds architecture specific implementation details for KFENCE.
In particular, this implements the required interface in <asm/kfence.h>.
Tested this patch by running the testcases and all passed.
Enze Li [Wed, 6 Sep 2023 14:54:16 +0000 (22:54 +0800)]
LoongArch: Get partial stack information when providing regs parameter
Currently, arch_stack_walk() can only get the full stack information
including NMI. This is because the implementation of arch_stack_walk()
is forced to ignore the information passed by the regs parameter and use
the current stack information instead.
For some detection systems like KFENCE, only partial stack information
is needed. In particular, the stack frame where the interrupt occurred.
To support KFENCE, this patch modifies the implementation of the
arch_stack_walk() function so that if this function is called with the
regs argument passed, it retains all the stack information in regs and
uses it to provide accurate information.
Enze Li [Wed, 6 Sep 2023 14:53:55 +0000 (22:53 +0800)]
LoongArch: mm: Add page table mapped mode support for virt_to_page()
According to the LoongArch documentation, there are two types of address
translation modes: direct mapped address translation mode (DMW mode) and
page table mapped address translation mode (TLB mode).
Currently, virt_to_page() only supports direct mapped mode. This patch
determines which mode is used, and adds corresponding handling functions
for both modes.
Enze Li [Wed, 6 Sep 2023 14:53:55 +0000 (22:53 +0800)]
kfence: Defer the assignment of the local variable addr
The LoongArch architecture is different from other architectures. It
needs to update __kfence_pool during arch_kfence_init_pool().
This patch modifies the assignment location of the local variable addr
in the kfence_init_pool() function to support the case of updating
__kfence_pool in arch_kfence_init_pool().
KGDB is intended to be used as a source level debugger for the Linux
kernel. It is used along with gdb to debug a Linux kernel. GDB can be
used to "break in" to the kernel to inspect memory, variables and regs
similar to the way an application developer would use GDB to debug an
application. KDB is a frontend of KGDB which is similar to GDB.
By now, in addition to the generic KGDB features, the LoongArch KGDB
implements the following features:
- Hardware breakpoints/watchpoints;
- Software single-step support for KDB.
Qi Hu [Wed, 6 Sep 2023 14:53:55 +0000 (22:53 +0800)]
LoongArch: Add Loongson Binary Translation (LBT) extension support
Loongson Binary Translation (LBT) is used to accelerate binary translation,
which contains 4 scratch registers (scr0 to scr3), x86/ARM eflags (eflags)
and x87 fpu stack pointer (ftop).
This patch supports the kernel saving/restoring these registers, handles
the LBT exception and maintains the sigcontext.
WANG Xuerui [Wed, 6 Sep 2023 14:53:55 +0000 (22:53 +0800)]
raid6: Add LoongArch SIMD recovery implementation
Similar to the syndrome calculation, the recovery algorithms also work
on 64 bytes at a time to align with the L1 cache line size of current
and future LoongArch cores (that we care about), which means
unrolled-by-4 LSX and unrolled-by-2 LASX code.
The assembly is originally based on the x86 SSSE3/AVX2 ports, but
register allocation has been redone to take advantage of LSX/LASX's 32
vector registers, and instruction sequence has been optimized to suit
(e.g. LoongArch can perform per-byte srl and andi on vectors, but x86
cannot).
Performance numbers measured by instrumenting the raid6test code, on a
3A5000 system clocked at 2.5GHz:
Because recovery algorithms are chosen solely based on priority and
availability, lasx is marked as priority 2 and lsx as priority 1. At least
for the current generation of LoongArch micro-architectures, LASX should
always be faster than LSX whenever supported, and have similar power
consumption characteristics (because the only known LASX-capable uarch,
the LA464, always computes the full 256-bit result for vector ops).
WANG Xuerui [Wed, 6 Sep 2023 14:53:55 +0000 (22:53 +0800)]
raid6: Add LoongArch SIMD syndrome calculation
The algorithms work on 64 bytes at a time, which is the L1 cache line
size of all current and future LoongArch cores (that we care about), as
confirmed by Huacai. The code is based on the generic int.uc algorithm,
unrolled 4 times for LSX and 2 times for LASX. Further unrolling does
not meaningfully improve the performance according to experiments.
Performance numbers measured during system boot on a 3A5000 @ 2.5GHz:
WANG Xuerui [Wed, 6 Sep 2023 14:53:55 +0000 (22:53 +0800)]
LoongArch: Add SIMD-optimized XOR routines
Add LSX and LASX implementations of xor operations, operating on 64
bytes (one L1 cache line) at a time, for a balance between memory
utilization and instruction mix. Huacai confirmed that all future
LoongArch implementations by Loongson (that we care) will likely also
feature 64-byte cache lines, and experiments show no throughput
improvement with further unrolling.
Performance numbers measured during system boot on a 3A5000 @ 2.5GHz:
Tiezhu Yang [Wed, 6 Sep 2023 14:53:10 +0000 (22:53 +0800)]
LoongArch: Define symbol 'fault' as a local label in fpu.S
The initial aim is to silence the following objtool warnings:
arch/loongarch/kernel/fpu.o: warning: objtool: _save_fp_context() falls through to next function fault()
arch/loongarch/kernel/fpu.o: warning: objtool: _restore_fp_context() falls through to next function fault()
arch/loongarch/kernel/fpu.o: warning: objtool: _save_lsx_context() falls through to next function fault()
arch/loongarch/kernel/fpu.o: warning: objtool: _restore_lsx_context() falls through to next function fault()
arch/loongarch/kernel/fpu.o: warning: objtool: _save_lasx_context() falls through to next function fault()
arch/loongarch/kernel/fpu.o: warning: objtool: _restore_lasx_context() falls through to next function fault()
Currently, SYM_FUNC_START()/SYM_FUNC_END() define the symbol 'fault' as
SYM_T_FUNC, which is STT_FUNC; the objtool warnings are generated through
the following code:
tools/objtool/include/objtool/check.h:
static inline struct symbol *insn_func(struct instruction *insn)
{
	struct symbol *sym = insn->sym;

	if (sym && sym->type != STT_FUNC)
		sym = NULL;

	return sym;
}
tools/objtool/check.c:
static int validate_branch(struct objtool_file *file, struct symbol *func,
			   struct instruction *insn, struct insn_state state)
{
	...
	if (func && insn_func(insn) && func != insn_func(insn)->pfunc) {
		...
		WARN("%s() falls through to next function %s()",
		     func->name, insn_func(insn)->name);
		return 1;
	}
	...
}
We can see that the fixup can be a local label in the following code:
The {copy, clear}_user function should return the number of bytes that
could not be {copied, cleared}. So, try to {copy, clear} byte by byte
when ld.{d,w,h} and st.{d,w,h} trap into an exception.
Bibo Mao [Wed, 6 Sep 2023 14:53:10 +0000 (22:53 +0800)]
LoongArch: Use static defined zero page rather than allocated
On LoongArch systems, only one page is needed for the zero page (there
are no cache synonyms), and there is no COLOR_ZERO_PAGE, so zero_page_mask
is useless and the macro __HAVE_COLOR_ZERO_PAGE is not necessary.
Like other popular architectures, it is simpler to define the zero page
in the kernel BSS segment rather than allocating it dynamically.
Bibo Mao [Wed, 6 Sep 2023 14:53:09 +0000 (22:53 +0800)]
LoongArch: mm: Introduce unified function populate_kernel_pte()
The functions pcpu_populate_pte() and fixmap_pte() are similar: they
populate one page from the kernel address space. And there is confusion
between pgd and p4d in the function fixmap_pte(), such as pgd_none()
always returning zero. This patch introduces a unified function
populate_kernel_pte() and then replaces pcpu_populate_pte() and
fixmap_pte() with it.
Bibo Mao [Wed, 6 Sep 2023 14:53:09 +0000 (22:53 +0800)]
LoongArch: Code improvements in function pcpu_populate_pte()
Do some code improvements in function pcpu_populate_pte():
1. Add memory allocation failure handling;
2. Replace pgd_populate() with p4d_populate(), it will be useful if
there are four-level page tables.
LoongArch: Remove shm_align_mask and use SHMLBA instead
Both shm_align_mask and SHMLBA want to avoid cache aliasing. But they are
inconsistent: shm_align_mask is (PAGE_SIZE - 1) while SHMLBA is SZ_64K,
and PAGE_SIZE is not always equal to SZ_64K.
This may cause problems when calling shmat() twice. Fix this problem by
removing shm_align_mask and using SHMLBA (strictly SHMLBA - 1) instead.
When building with CONFIG_LTO_CLANG_FULL, there are several errors due
to the way that parse_r is defined with an __asm__ statement in a
header:
ld.lld: error: ld-temp.o <inline asm>:105:1: macro 'parse_r' is already defined
.macro parse_r var r
^
This was an issue for arch/mips as well, which was resolved by commit 67512a8cf5a7 ("MIPS: Avoid macro redefinitions").
However, parse_r is unused in arch/loongarch after commit 83d8b38967d2
("LoongArch: Simplify the invtlb wrappers"), so doing the same change
does not make much sense now. Just remove parse_r (and parse_v, which
is also unused) to resolve the redefinition error. If it needs to be
brought back due to an actual use, it should be brought back with the
same changes as the aforementioned arch/mips commit.
static inline void bvec_set_page(struct bio_vec *bv, struct page *page,
				 unsigned int len, unsigned int offset)
However, the usage in DRBD swaps the len and offset parameters. This
leads to a bvec with length=0 instead of the intended length=4096, which
causes sock_sendmsg to return -EIO.
This leaves DRBD unable to transmit any pages and thus completely
broken.
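The mismatch, as a hedged reconstruction against the signature quoted
above (page, offset and len are the caller's locals):

bvec_set_page(&bvec, page, offset, len);	/* bug: offset 0 becomes len 0 */
bvec_set_page(&bvec, page, len, offset);	/* fix: length is really 4096 */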
block: fix pin count management when merging same-page segments
There is no need to unpin the added page when adding it to the bio fails
as that is done by the loop below. Instead we want to unpin it when adding
a single page to the bio more than once as bio_release_pages will only
unpin it once.
Puranjay Mohan [Thu, 31 Aug 2023 13:12:29 +0000 (13:12 +0000)]
bpf, riscv: use prog pack allocator in the BPF JIT
Use bpf_jit_binary_pack_alloc() for memory management of JIT binaries in
RISCV BPF JIT. The bpf_jit_binary_pack_alloc creates a pair of RW and RX
buffers. The JIT writes the program into the RW buffer. When the JIT is
done, the program is copied to the final RX buffer with
bpf_jit_binary_pack_finalize.
Implement bpf_arch_text_copy() and bpf_arch_text_invalidate() for RISCV
JIT as these functions are required by bpf_jit_binary_pack allocator.
Puranjay Mohan [Thu, 31 Aug 2023 13:12:28 +0000 (13:12 +0000)]
riscv: implement a memset like function for text
The BPF JIT needs to write invalid instructions to RX regions of memory to
invalidate removed BPF programs. This needs a function like memset() that
can work with RX memory.
Implement patch_text_set_nosync() which is similar to text_poke_set() of
x86.
Puranjay Mohan [Thu, 31 Aug 2023 13:12:27 +0000 (13:12 +0000)]
riscv: extend patch_text_nosync() for multiple pages
The patch_insn_write() function currently doesn't work for multiple pages
of instructions, therefore patch_text_nosync() will fail with a page fault
if called with lengths spanning multiple pages.
This commit extends the patch_insn_write() function to support multiple
pages by copying at most 2 pages at a time in a loop. This implementation
is similar to the text_poke_copy() function of x86.
Puranjay Mohan [Thu, 31 Aug 2023 13:12:26 +0000 (13:12 +0000)]
bpf: make bpf_prog_pack allocator portable
The bpf_prog_pack allocator currently uses module_alloc() and
module_memfree() to allocate and free memory. This is not portable
because different architectures use different methods for allocating
memory for BPF programs; for example, ARM64 and riscv use
vmalloc()/vfree().
Use bpf_jit_alloc_exec() and bpf_jit_free_exec() for memory management
in bpf_prog_pack allocator. Other architectures can override these with
their implementation and will be able to use bpf_prog_pack directly.
On architectures that don't override bpf_jit_alloc/free_exec() this is
basically a NOP.
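The override pattern in question, sketched (weak defaults of this shape
are a long-standing kernel idiom; architectures supply strong definitions):

/* Weak defaults: equivalent to the old module_alloc()/module_memfree()
 * behaviour unless an architecture overrides them. */
void *__weak bpf_jit_alloc_exec(unsigned long size)
{
	return module_alloc(size);
}

void __weak bpf_jit_free_exec(void *addr)
{
	module_memfree(addr);
}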
bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc
The commit c83597fa5dc6 ("bpf: Refactor some inode/task/sk storage functions
for reuse") refactored bpf_{sk,task,inode}_storage_free() into
bpf_local_storage_unlink_nolock(), which was later renamed to
bpf_local_storage_destroy(). The commit accidentally passed the
"bool uncharge_mem = false" argument to bpf_selem_unlink_storage_nolock(),
which then stopped the uncharge from happening to sk->sk_omem_alloc.
This missing uncharge only happens when the sk is going away (during
__sk_destruct).
This patch fixes it by always passing "uncharge_mem = true". It is a
noop to the task/inode/cgroup storage because they do not have the
map_local_storage_(un)charge enabled in the map_ops. A followup patch
will be done in bpf-next to remove the uncharge_mem argument.
Lockdep complains about acquiring a local_lock while holding a
raw_spin_lock. It means we should not allocate memory while holding a
raw_spin_lock, since that is not safe for RT.
raw_spin_lock is needed because bpf_local_storage supports tracing
context. In particular for task local storage, it is easy to
get a "current" task PTR_TO_BTF_ID in tracing bpf prog.
However, task (and cgroup) local storage has already been moved to
bpf mem allocator which can be used after raw_spin_lock.
The splat is for the sk storage. For sk (and inode) storage,
it has not been moved to the bpf mem allocator. Using raw_spin_lock or
not, kzalloc(GFP_ATOMIC) could theoretically be unsafe in tracing context.
However, the local storage helper requires a verifier-accepted sk pointer
(PTR_TO_BTF_ID); it is hypothetical whether that (meaning running a bpf
prog in a kzalloc-unsafe context while also holding a verifier-accepted
sk pointer) could happen.
This patch avoids kzalloc after raw_spin_lock to silence the splat.
There is an existing kzalloc before the raw_spin_lock. At that point,
a kzalloc is very likely required because a lookup has just been done
before. Thus, this patch always does the kzalloc before acquiring
the raw_spin_lock and removes the later kzalloc usage after the
raw_spin_lock. After this change, it will have a charge and then
uncharge during the syscall bpf_map_update_elem() code path.
This patch opts for simplicity and does not continue the old
optimization to save one charge and uncharge.
This issue dates back to the very first commit of bpf_sk_storage,
which has been refactored multiple times to create task, inode, and
cgroup storage. This patch uses a Fixes tag with a more recent
commit that should be easier to backport.
s390/bpf: Pass through tail call counter in trampolines
s390x eBPF programs use the following extension to the s390x calling
convention: tail call counter is passed on stack at offset
STK_OFF_TCCNT, which callees otherwise use as scratch space.
Currently the trampoline does not respect this and clobbers the tail call
counter. This breaks enforcing tail call limits in eBPF programs which
have trampolines attached to them.
Fix by forwarding a copy of the tail call counter to the original eBPF
program in the trampoline (for fexit), and by restoring it at the end
of the trampoline (for fentry).
bpf: Assign bpf_tramp_run_ctx::saved_run_ctx before recursion check.
__bpf_prog_enter_recur() assigns bpf_tramp_run_ctx::saved_run_ctx before
performing the recursion check which means in case of a recursion
__bpf_prog_exit_recur() uses the previously set bpf_tramp_run_ctx::saved_run_ctx
value.
__bpf_prog_enter_sleepable_recur() assigns bpf_tramp_run_ctx::saved_run_ctx
after the recursion check which means in case of a recursion
__bpf_prog_exit_sleepable_recur() uses an uninitialized value. This does not
look right. If I read the entry trampoline code right, then bpf_tramp_run_ctx
isn't initialized upfront.
Align __bpf_prog_enter_sleepable_recur() with __bpf_prog_enter_recur() and
set bpf_tramp_run_ctx::saved_run_ctx before the recursion check is made.
Remove the assignment of saved_run_ctx in kern_sys_bpf() since it happens
a few cycles later.
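A sketch of the corrected ordering (simplified from the sleepable enter
helper; the key point is that saved_run_ctx is recorded before the
recursion counter is tested):

/* Save the previous run_ctx first, so the matching exit helper always
 * restores a valid pointer even when recursion is detected. */
u64 notrace __bpf_prog_enter_sleepable_recur(struct bpf_prog *prog,
					     struct bpf_tramp_run_ctx *run_ctx)
{
	rcu_read_lock_trace();
	migrate_disable();
	might_fault();

	run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);

	if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
		bpf_prog_inc_misses_counter(prog);
		return 0;	/* recursion: exit helper still runs */
	}
	return bpf_prog_start_time();
}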
bpf: Invoke __bpf_prog_exit_sleepable_recur() on recursion in kern_sys_bpf().
If __bpf_prog_enter_sleepable_recur() detects recursion then it returns
0 without undoing rcu_read_lock_trace(), migrate_disable() or
decrementing the recursion counter. This is fine in the JIT case because
the JIT code will jump in the 0 case to the end and invoke the matching
exit trampoline (__bpf_prog_exit_sleepable_recur()).
This is not the case in kern_sys_bpf() which returns directly to the
caller with an error code.
Add __bpf_prog_exit_sleepable_recur() as cleanup in the recursion case.
Jakub Kicinski [Tue, 5 Sep 2023 23:42:02 +0000 (16:42 -0700)]
net: phylink: fix sphinx complaint about invalid literal
sphinx complains about the use of "%PHYLINK_PCS_NEG_*":
Documentation/networking/kapi:144: ./include/linux/phylink.h:601: WARNING: Inline literal start-string without end-string.
Documentation/networking/kapi:144: ./include/linux/phylink.h:633: WARNING: Inline literal start-string without end-string.
These are not valid symbols so drop the '%' prefix.
Alternatively we could use %PHYLINK_PCS_NEG_\* (escape the *)
or use normal literal ``PHYLINK_PCS_NEG_*`` but there is already
a handful of un-adorned DEFINE_* in this file.
Vladimir Oltean [Tue, 5 Sep 2023 21:53:38 +0000 (00:53 +0300)]
net: dsa: sja1105: complete tc-cbs offload support on SJA1110
The blamed commit left this delta behind:
struct sja1105_cbs_entry {
- u64 port;
- u64 prio;
+ u64 port; /* Not used for SJA1110 */
+ u64 prio; /* Not used for SJA1110 */
u64 credit_hi;
u64 credit_lo;
u64 send_slope;
u64 idle_slope;
};
but did not actually implement tc-cbs offload fully for the new switch.
The offload is accepted, but it doesn't work.
The difference compared to earlier switch generations is that now, the
table of CBS shapers is sparse, because there are many more shapers, so
the mapping between a {port, prio} and a table index is static, rather
than requiring us to store the port and prio into the sja1105_cbs_entry.
So, the problem is that the code programs the CBS shaper parameters at a
dynamic table index which is incorrect.
All that needs to be done for SJA1110 CBS shapers to work is to bypass
the logic which allocates shapers in a dense manner, as for SJA1105, and
use the fixed mapping instead.
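A hedged sketch of what a static {port, prio} -> shaper-index mapping looks
like (helper name and layout are illustrative; the point is that the index
is a fixed function of port and priority rather than an allocated slot):

/* Sketch: one fixed slot per {port, prio} in the sparse SJA1110 table. */
static int sja1110_cbs_index(int port, int prio, int num_tc)
{
	return port * num_tc + prio;
}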
Fixes: 3e77e59bf8cf ("net: dsa: sja1105: add support for the SJA1110 switch family")
Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Error: Specified device failed to setup cbs hardware offload.
This comes from the fact that ndo_setup_tc(TC_SETUP_QDISC_CBS) presents
the same API for the qdisc create and replace cases, and the sja1105
driver fails to distinguish between the two. Thus, it always thinks that
it must allocate the same shaper for a {port, queue} pair, when it may
instead have to replace an existing one.
Fixes: 4d7525085a9b ("net: dsa: sja1105: offload the Credit-Based Shaper qdisc")
Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Vladimir Oltean [Tue, 5 Sep 2023 21:53:36 +0000 (00:53 +0300)]
net: dsa: sja1105: fix bandwidth discrepancy between tc-cbs software and offload
More careful measurement of the tc-cbs bandwidth shows that as the stream
bandwidth (effectively idleslope) increases, there is a larger and
larger discrepancy between the rate limit obtained by the software
Qdisc and the rate limit obtained by its offloaded counterpart.
The discrepancy becomes so large, that e.g. at an idleslope of 40000
(40Mbps), the offloaded cbs does not actually rate limit anything, and
traffic will pass at line rate through a 100 Mbps port.
The reason for the discrepancy is that the hardware documentation I've
been following is incorrect. UM11040.pdf (for SJA1105P/Q/R/S) states
about IDLE_SLOPE that it is "the rate (in unit of bytes/sec) at which
the credit counter is increased".
Cross-checking with UM10944.pdf (for SJA1105E/T) and UM11107.pdf
(for SJA1110), the wording is different: "This field specifies the
value, in bytes per second times link speed, by which the credit counter
is increased".
So there's an extra scaling for link speed that the driver is currently
not accounting for, and apparently (empirically), that link speed is
expressed in Kbps.
I've pondered whether to pollute the sja1105_mac_link_up()
implementation with CBS shaper reprogramming, but I don't think it is
worth it. IMO, the UAPI exposed by tc-cbs requires user space to
recalculate the sendslope anyway, since the formula for that depends on
port_transmit_rate (see man tc-cbs), which is not an invariant from tc's
perspective.
So we use the offload->sendslope and offload->idleslope to deduce the
original port_transmit_rate from the CBS formula, and use that value to
scale the offload->sendslope and offload->idleslope to values that the
hardware understands.
Some numerical data points:
40Mbps stream, max interfering frame size 1500, port speed 100M
---------------------------------------------------------------
             Before (doesn't work)    After (works)
credit_hi                       77               77
credit_lo                     1444             1444
send_slope                11860000              118
idle_slope                  640000                6
Tested on SJA1105T, SJA1105S and SJA1110A, at 1Gbps and 100Mbps.
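As a quick sanity check of the table, the arithmetic implied above (the
working values are the old bytes/sec slopes divided by the port speed in
Kbps):

#include <stdio.h>

int main(void)
{
	long speed_kbps = 100 * 1000;	/* 100 Mbps port */

	printf("send_slope: %ld\n", 11860000 / speed_kbps);	/* 118 */
	printf("idle_slope: %ld\n", 640000 / speed_kbps);	/* 6 */
	return 0;
}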
Fixes: 4d7525085a9b ("net: dsa: sja1105: offload the Credit-Based Shaper qdisc")
Reported-by: Yanan Yang <[email protected]>
Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Wed, 6 Sep 2023 05:20:33 +0000 (06:20 +0100)]
Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Change MIN_TXD and MIN_RXD to allow set rx/tx value between 64 and 80
Olga Zaborska says:
Change the minimum value of RX/TX descriptors to 64 to enable setting the rx/tx
value between 64 and 80. All igb, igbvf and igc devices can use as low as 64
descriptors.
====================
Bodong Wang [Tue, 5 Sep 2023 17:48:46 +0000 (10:48 -0700)]
mlx5/core: E-Switch, Create ACL FT for eswitch manager in switchdev mode
An ACL flow table is required in switchdev mode when metadata is enabled;
the driver creates such a table when loading each vport. However, not
every vport is loaded in switchdev mode, such as the ECPF if it is the
eswitch manager. In this case, the ACL flow table is still needed.
To make it modularized, create the ACL flow table for the eswitch manager
by default and skip such operations when loading the manager vport.
Also, there is no need to load the eswitch manager vport in switchdev
mode, which means there is no need to load it on regular ConnectX HCAs
where the PF is the eswitch manager. This avoids creating a duplicate
ACL flow table for the host PF vport.
Fixes: 29bcb6e4fe70 ("net/mlx5e: E-Switch, Use metadata for vport matching in send-to-vport rules")
Fixes: eb8e9fae0a22 ("mlx5/core: E-Switch, Allocate ECPF vport if it's an eswitch manager")
Fixes: 5019833d661f ("net/mlx5: E-switch, Introduce helper function to enable/disable vports")
Signed-off-by: Bodong Wang <[email protected]>
Reviewed-by: Mark Bloch <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Jianbo Liu [Tue, 5 Sep 2023 17:48:45 +0000 (10:48 -0700)]
net/mlx5e: Clear mirred devices array if the rule is split
In the cited commit, the mirred devices are recorded and checked while
parsing the actions. In order to avoid system crash, the duplicate
action in a single rule is not allowed.
But the rule is actually break down into several FTEs in different
tables, for either mirroring, or the specified types of actions which
use post action infrastructure.
It will reject certain action list by mistake, for example:
actions:enp8s0f0_1,set(ipv4(ttl=63)),enp8s0f0_0,enp8s0f0_1.
Here the rule is split to two FTEs because of pedit action.
To fix this issue, when parsing the rule actions, reset if_count to
clear the mirred devices array if the rule is split to multiple
FTEs, and then the duplicate checking is restarted.
Fixes: 039f50629b7f ("ip_tunnel: Move stats update to iptunnel_xmit()")
Reported-by: syzbot <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
The team interface has used a dynamic lockdep key to avoid false-positive
lockdep deadlock detection. Virtual interfaces such as team usually
have their own lock for protecting private data.
These interfaces can be nested.
team0
|
team1
Each interface's lock is actually different (team0->lock and team1->lock).
So,
mutex_lock(&team0->lock);
mutex_lock(&team1->lock);
mutex_unlock(&team1->lock);
mutex_unlock(&team0->lock);
The above case is absolutely safe, but lockdep warns about deadlock,
because lockdep thinks these two locks are the same. This is a
false-positive lockdep warning.
So, in order to avoid this problem, the team interfaces started to use
dynamic lockdep keys. The false-positive problem was fixed, but it
introduced a new problem.
When a new team virtual interface is created, it registers (creates) a
dynamic lockdep key and uses it. But there is a limit on the number of
lockdep keys.
So, if too many team interfaces are created, all lockdep keys are
consumed. Then lockdep stops working and warns about it.
In order to fix this problem, team interfaces use a subclass instead
of a dynamic key. So, when a new team interface is created, it doesn't
register (create) a new lockdep key, but uses an existing subclass
instead. This approach is already used by the bonding interface for a
similar case.
As the bonding interface does, the subclass variable is the same as
the 'dev->nested_level'. This variable indicates the depth in the stacked
interface graph.
The 'dev->nested_level' is protected by RTNL and RCU.
So, 'mutex_lock_nested()' for 'team->lock' requires RTNL or RCU.
In the current code, 'team->lock' is usually acquired under RTNL, so
there is no problem with using 'dev->nested_level'. The
'team_nl_team_get()' and 'lb_stats_refresh()' functions acquire
'team->lock' without RTNL, but these don't iterate their own ports in a
nested way, so they don't need the nested lock.
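Sketched, the change replaces the per-device key with a nested-lock
annotation (mutex_lock_nested() and dev->nested_level are real kernel
primitives; the helper below is illustrative):

/* Lock the team with a subclass equal to its depth in the stacked
 * device graph, as bonding does; requires RTNL or RCU so that
 * dev->nested_level is stable. */
static void team_lock(struct team *team)
{
	mutex_lock_nested(&team->lock, team->dev->nested_level);
}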
Reproducer:
for i in {0..1000}
do
ip link add team$i type team
ip link add dummy$i master team$i type dummy
ip link set dummy$i up
ip link set team$i up
done
Splat looks like:
BUG: MAX_LOCKDEP_ENTRIES too low!
turning off the locking correctness validator.
Please attach the output of /proc/lock_stat to the bug report
CPU: 0 PID: 4104 Comm: ip Not tainted 6.5.0-rc7+ #45
Call Trace:
<TASK>
dump_stack_lvl+0x64/0xb0
add_lock_to_list+0x30d/0x5e0
check_prev_add+0x73a/0x23a0
...
sock_def_readable+0xfe/0x4f0
netlink_broadcast+0x76b/0xac0
nlmsg_notify+0x69/0x1d0
dev_open+0xed/0x130
...
Quan Tian [Tue, 5 Sep 2023 10:36:10 +0000 (10:36 +0000)]
net/ipv6: SKB symmetric hash should incorporate transport ports
__skb_get_hash_symmetric() was added to compute a symmetric hash over
the protocol, addresses and transport ports, by commit eb70db875671
("packet: Use symmetric hash for PACKET_FANOUT_HASH."). It uses
flow_keys_dissector_symmetric_keys as the flow_dissector to incorporate
IPv4 addresses, IPv6 addresses and ports. However, it should not specify
the FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL flag, which stops further
dissection when an IPv6 flow label is encountered, causing transport
ports not to be incorporated in that case.
As a consequence, the symmetric hash is based on 5-tuple for IPv4 but
3-tuple for IPv6 when flow label is present. It caused a few problems,
e.g. when nft symhash and openvswitch l4_sym rely on the symmetric hash
to perform load balancing as different L4 flows between two given IPv6
addresses would always get the same symmetric hash, leading to uneven
traffic distribution.
Removing the use of FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL makes sure the
symmetric hash is based on 5-tuple for both IPv4 and IPv6 consistently.
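A sketch of the change inside __skb_get_hash_symmetric() (flow dissector
internals simplified; the essential part is passing no stop-at-flow-label
flag):

/* Dissect with flags = 0 instead of FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL,
 * so IPv6 packets carrying a flow label still get their ports hashed. */
memset(&keys, 0, sizeof(keys));
__skb_flow_dissect(NULL, skb, &flow_keys_dissector_symmetric,
		   &keys, NULL, 0, 0, 0, 0 /* was: ..._STOP_AT_FLOW_LABEL */);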
Fixes: eb70db875671 ("packet: Use symmetric hash for PACKET_FANOUT_HASH.")
Reported-by: Lars Ekman <[email protected]>
Closes: https://github.com/antrea-io/antrea/issues/5457
Signed-off-by: Quan Tian <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
riscv: libstub: Implement KASLR by using generic functions
We can now use arm64 functions to handle the move of the kernel physical
mapping: if KASLR is enabled, we will try to get a random seed from the
firmware; if that is not possible, the kernel will be moved to a location
that suits its alignment constraints.
Fix the following warning, which appears when compiled for rv32, by using
an unsigned long type instead of u64.
../drivers/firmware/efi/libstub/efi-stub-helper.c: In function 'efi_kaslr_relocate_kernel':
../drivers/firmware/efi/libstub/efi-stub-helper.c:846:28: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
846 | (u64)_end < EFI_ALLOC_LIMIT) {
KASLR implementation relies on a relocatable kernel so that we can move
the kernel mapping.
The seed needed to virtually move the kernel is taken from the device
tree, so we rely on the bootloader to provide a correct seed. Zkr could
be used unconditionally instead if implemented, but that's for another
patch.