Lino Sanfilippo [Thu, 24 Nov 2022 13:55:25 +0000 (14:55 +0100)]
tpm, tpm_tis: Avoid cache incoherency in test for interrupts
The interrupt handler that sets the boolean variable irq_tested may run on
another CPU as the thread that checks irq_tested as part of the irq test in
tpm_tis_send().
Since nothing guarantees cache coherency between CPUs for unsynchronized
accesses to boolean variables the testing thread might not perceive the
value change done in the interrupt handler.
Avoid this issue by setting the bit TPM_TIS_IRQ_TESTED in the flags field
of the tpm_tis_data struct and by accessing this field with the bit
manipulating functions that provide cache coherency.
Also convert all other existing sites to use the proper macros when
accessing this bitfield.
Eric Snowberg [Thu, 2 Mar 2023 16:46:52 +0000 (11:46 -0500)]
integrity: machine keyring CA configuration
Add machine keyring CA restriction options to control the type of
keys that may be added to it. The motivation is separation of
certificate signing from code signing keys. Subsquent work will
limit certificates being loaded into the IMA keyring to code
signing keys used for signature verification.
When no restrictions are selected, all Machine Owner Keys (MOK) are added
to the machine keyring. When CONFIG_INTEGRITY_CA_MACHINE_KEYRING is
selected, the CA bit must be true. Also the key usage must contain
keyCertSign, any other usage field may be set as well.
When CONFIG_INTEGRITY_CA_MACHINE_KEYRING_MAX is selected, the CA bit must
be true. Also the key usage must contain keyCertSign and the
digitialSignature usage may not be set.
If the keyCertSign or digitalSignature is set, store it in the
public_key structure. Having the purpose of the key being stored
during parsing, allows enforcement on the usage field in the future.
This will be used in a follow on patch that requires knowing the
certificate key usage type.
Eric Snowberg [Thu, 2 Mar 2023 16:46:47 +0000 (11:46 -0500)]
KEYS: Create static version of public_key_verify_signature
The kernel test robot reports undefined reference to
public_key_verify_signature when CONFIG_ASYMMETRIC_PUBLIC_KEY_SUBTYPE is
not defined. Create a static version in this case and return -EINVAL.
Mark Hasemeyer [Tue, 14 Mar 2023 19:54:04 +0000 (13:54 -0600)]
tpm: cr50: i2c: use jiffies to wait for tpm ready irq
When waiting for a tpm ready completion, the cr50 i2c driver incorrectly
assumes that the value of timeout_a is represented in milliseconds
instead of jiffies.
We started disabling '-Warray-bounds' for gcc-12 originally on s390,
because it resulted in some warnings that weren't realistically fixable
(commit 8b202ee21839: "s390: disable -Warray-bounds").
That s390-specific issue was then found to be less common elsewhere, but
generic (see f0be87c42cbd: "gcc-12: disable '-Warray-bounds' universally
for now"), and then later expanded the version check was expanded to
gcc-11 (5a41237ad1d4: "gcc: disable -Warray-bounds for gcc-11 too").
And it turns out that I was much too optimistic in thinking that it's
all going to go away, and here we are with gcc-13 showing all the same
issues. So instead of expanding this one version at a time, let's just
disable it for gcc-11+, and put an end limit to it only when we actually
find a solution.
Yes, I'm sure some of this is because the kernel just does odd things
(like our "container_of()" use, but also knowingly playing games with
things like linker tables and array layouts).
And yes, some of the warnings are likely signs of real bugs, but when
there are hundreds of false positives, that doesn't really help.
Merge tag 'kbuild-fixes-v6.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Fix the prefix in the kernel source tarball
- Fix a typo in the copyright file in Debian package
* tag 'kbuild-fixes-v6.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: use proper prefix for tarballs to fix rpm-pkg build error
kbuild: deb-pkg: Fix a spell typo in mkdebian script
Merge tag 'irq_urgent_for_v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Borislav Petkov:
- Remove an over-zealous sanity check of the array of MSI-X vectors to
be allocated for a device
* tag 'irq_urgent_for_v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
PCI/MSI: Remove over-zealous hardware size check in pci_msix_validate_entries()
Merge tag 'x86_urgent_for_v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov
- Fix for older binutils which do not support C-syntax constant
suffixes
* tag 'x86_urgent_for_v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/alternatives: Do not use integer constant suffixes in inline asm
Merge tag 'input-for-v6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:
- a check in pegasus-notetaker driver to validate the type of pipe when
probing a new device
- a fix for Cypress touch controller to correctly parse maximum number
of touches.
* tag 'input-for-v6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: cyttsp5 - fix sensing configuration data structure
Input: pegasus-notetaker - check pipe type when probing
kbuild: use proper prefix for tarballs to fix rpm-pkg build error
Since commit f8d94c4e403c ("kbuild: do not create intermediate *.tar
for source tarballs"), 'make rpm-pkg' fails because the prefix of the
source tarball is 'linux.tar/' instead of 'linux/'. $(basename $@)
strips only '.gz' from the filename linux.tar.gz.
You need to strip two suffixes from compressed tarballs and one suffix
from uncompressed tarballs (for example 'perf-6.3.0.tar' generated by
'make perf-tar-src-pkg').
One tricky fix might be --prefix=$(firstword $(subst .tar, ,$@))/
but I think it is better to hard-code the prefix.
Fixes: f8d94c4e403c ("kbuild: do not create intermediate *.tar for source tarballs") Reported-by: Jiwei Sun <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]> Reviewed-by: Nicolas Schier <[email protected]>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Two serious ARM fixes:
- Plug a buffer overflow due to the use of the user-provided register
width for firmware regs. Outright reject accesses where the user
register width does not match the kernel representation.
- Protect non-atomic RMW operations on vCPU flags against preemption,
as an update to the flags by an intervening preemption could be
lost"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: arm64: Fix buffer overflow in kvm_arm_set_fw_reg()
KVM: arm64: Make vcpu flag updates non-preemptible
Merge tag '6.3-rc7-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three small smb3 client fixes:
- two important fixes for unbuffered read regression with the
iov_iter changes (e.g. read soon after mount in some multichannel
scenarios)
- DFS prefix path fix (also for stable)"
* tag '6.3-rc7-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Reapply lost fix from commit 30b2b2196d6e
cifs: Fix unbuffered read
cifs: avoid dup prefix path in dfs_get_automount_devname()
Paolo Bonzini [Fri, 21 Apr 2023 23:19:02 +0000 (19:19 -0400)]
Merge tag 'kvmarm-fixes-6.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.3, part #4
- Plug a buffer overflow due to the use of the user-provided register
width for firmware regs. Outright reject accesses where the
user register width does not match the kernel representation.
- Protect non-atomic RMW operations on vCPU flags against preemption,
as an update to the flags by an intervening preemption could be lost.
Merge tag 'for-6.3-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two patches fixing the problem with aync discard.
The default settings had a low IOPS limit and processing a large batch
to discard would take a long time. On laptops this can cause increased
power consumption due to disk activity.
As async discard has been on by default since 6.2 this likely affects
a lot of users.
Summary:
- increase the default IOPS limit 10x which reportedly helped
- setting the sysfs IOPS value to 0 now does not throttle anymore
allowing the discards to be processed at full speed. Previously
there was an arbitrary 6 hour target for processing the pending
batch"
* tag 'for-6.3-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: reinterpret async discard iops_limit=0 as no delay
btrfs: set default discard iops_limit to 1000
Merge tag 'char-misc-6.3-final' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are some last-minute tiny driver fixes for 6.3-final. They
include fixes for some fpga and iio drivers:
- fpga bridge driver fix
- fpga dfl error reporting fix
- fpga m10bmc driver fix
- fpga xilinx driver fix
- iio light driver fix
- iio dac fwhandle leak fix
- iio adc driver fix
All of these have been in linux-next for a few weeks with no reported
problems"
* tag 'char-misc-6.3-final' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
iio: light: tsl2772: fix reading proximity-diodes from device tree
fpga: bridge: properly initialize bridge device before populating children
iio: dac: ad5755: Add missing fwnode_handle_put()
iio: adc: at91-sama5d2_adc: fix an error code in at91_adc_allocate_trigger()
fpga: xilinx-pr-decoupler: Use readl wrapper instead of pure readl
fpga: dfl-pci: Drop redundant pci_enable_pcie_error_reporting()
fpga: m10bmc-sec: Fix rsu_send_data() to return FW_UPLOAD_ERR_HW_ERROR
Merge tag 'gpio-fixes-for-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- use raw_spinlocks in regmaps that are used in interrupt context in
gpio-104-idi-48 and gpio-104-dio-48e
* tag 'gpio-fixes-for-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: 104-idi-48: Enable use_raw_spinlock for idi48_regmap_config
gpio: 104-dio-48e: Enable use_raw_spinlock for dio48e_regmap_config
Merge tag 'sound-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Just a few fixes: all small and device-specific (ASoC FSL, SOF, and
HD-audio quirks), should be safe to apply at the last minute"
* tag 'sound-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/realtek: fix mute/micmute LEDs for a HP ProBook
ASoC: fsl_asrc_dma: fix potential null-ptr-deref
ASoC: fsl_sai: Fix pins setting for i.MX8QM platform
ALSA: hda/realtek: Remove specific patch for Dell Precision 3260
ASoC: max98373: change power down sequence for smart amp
ASoC: SOF: pm: Tear down pipelines only if DSP was active
ASoC: SOF: ipc4-topology: Clarify bind failure caused by missing fw_module
Merge tag 'drm-fixes-2023-04-21' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"This is the regular and hopefully last round of fixes for 6.3.
Pretty small, a few amdgpu, one i915, one nouveau, one rockchip and
one gpu scheduler fix:
nouveau:
- fix dma-resv timeout
rockchip:
- fix suspend/resume
sched:
- fix timeout handling
i915:
- Fix fast wake AUX sync len
amdgpu:
- GPU reset fix
- DCN 3.1.5 line buffer fix
- Display fix for single channel memory configs
- Fix a possible divide by 0"
* tag 'drm-fixes-2023-04-21' of git://anongit.freedesktop.org/drm/drm:
drm/amd/display: fix a divided-by-zero error
drm/amd/display: limit timing for single dimm memory
drm/amd/display: set dcn315 lb bpp to 48
drm/amdgpu: Fix desktop freezed after gpu-reset
drm/rockchip: vop2: Use regcache_sync() to fix suspend/resume
drm/nouveau: fix incorrect conversion to dma_resv_wait_timeout()
drm/rockchip: vop2: fix suspend/resume
drm/i915: Fix fast wake AUX sync len
drm/sched: Check scheduler ready before calling timeout handling
Boris Burkov [Wed, 5 Apr 2023 19:43:59 +0000 (12:43 -0700)]
btrfs: reinterpret async discard iops_limit=0 as no delay
Currently, a limit of 0 results in a hard coded metering over 6 hours.
Since the default is a set limit, I suspect no one truly depends on this
rather arbitrary setting. Repurpose it for an arguably more useful
"unlimited" mode, where the delay is 0.
Note that if block groups are too new, or go fully empty, there is still
a delay associated with those conditions. Those delays implement
heuristics for not trimming a region we are relatively likely to fully
overwrite soon.
Boris Burkov [Wed, 5 Apr 2023 19:43:58 +0000 (12:43 -0700)]
btrfs: set default discard iops_limit to 1000
Previously, the default was a relatively conservative 10. This results
in a 100ms delay, so with ~300 discards in a commit, it takes the full
30s till the next commit to finish the discards. On a workstation, this
results in the disk never going idle, wasting power/battery, etc.
Set the default to 1000, which results in using the smallest possible
delay, currently, which is 1ms. This has shown to not pathologically
keep the disk busy by the original reporter.
Turns out the channelmap variable is not actually read-only, it's modified
through the MCI_GPM_CLR_CHANNEL_BIT() macro further down in the function,
so making it read-only causes page faults when that code is hit.
* tag 'rust-fixes-6.3' of https://github.com/Rust-for-Linux/linux:
rust: allow to use INIT_STACK_ALL_ZERO
rust: fix regexp in scripts/is_rust_module.sh
rust: build: Fix grep warning
scripts: generate_rust_analyzer: Handle sub-modules with no Makefile
rust: kernel: Mark rust_fmt_argument as extern "C"
rust: sort uml documentation arch support table
rust: str: fix requierments->requirements typo
Rob Herring [Wed, 19 Apr 2023 19:35:13 +0000 (14:35 -0500)]
PCI: Restrict device disabled status check to DT
Commit 6fffbc7ae137 ("PCI: Honor firmware's device disabled status")
checked the firmware device status for both DT and ACPI devices. That
caused a regression in some ACPI systems. The exact reason isn't clear.
It's possibly a firmware bug. For now, at least, refactor the check to
be for DT based systems only.
Note that the original implementation leaked a refcount which is now
correctly handled.
[bhelgaas: Per ACPI r6.5, sec 6.3.7, for devices on an enumerable bus, _STA
must return with bit[0] ("device is present") set]
- eth: cxgb4: fix use after free bugs caused by circular dependency
problem
- eth: mlxsw: pci: fix possible crash during initialization
Previous releases - always broken:
- sched: sch_qfq: prevent slab-out-of-bounds in qfq_activate_agg
- netfilter: validate catch-all set elements
- bridge: don't notify FDB entries with "master dynamic"
- eth: bonding: fix memory leak when changing bond type to ethernet
- eth: i40e: fix accessing vsi->active_filters without holding lock
Misc:
- Mat is back as MPTCP co-maintainer"
* tag 'net-6.3-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (33 commits)
net: bridge: switchdev: don't notify FDB entries with "master dynamic"
Revert "net/mlx5: Enable management PF initialization"
MAINTAINERS: Resume MPTCP co-maintainer role
mailmap: add entries for Mat Martineau
e1000e: Disable TSO on i219-LM card to increase speed
bnxt_en: fix free-runnig PHC mode
net: dsa: microchip: ksz8795: Correctly handle huge frame configuration
bpf: Fix incorrect verifier pruning due to missing register precision taints
hamradio: drop ISA_DMA_API dependency
mlxsw: pci: Fix possible crash during initialization
mptcp: fix accept vs worker race
mptcp: stops worker on unaccepted sockets at listener close
net: rpl: fix rpl header size calculation
net: vmxnet3: Fix NULL pointer dereference in vmxnet3_rq_rx_complete()
bonding: Fix memory leak when changing bond type to Ethernet
veth: take into account peer device for NETDEV_XDP_ACT_NDO_XMIT xdp_features flag
mlxfw: fix null-ptr-deref in mlxfw_mfa2_tlv_next()
bnxt_en: Fix a possible NULL pointer dereference in unload path
bnxt_en: Do not initialize PTP on older P3/P4 chips
netfilter: nf_tables: tighten netlink attribute requirements for catch-all elements
...
blk-mq sched bio merge still needs request to grab queue usage counter,
so we can't simply call blk_mq_attempt_bio_merge() when queue usage
counter isn't held.
Vladimir Oltean [Tue, 18 Apr 2023 15:59:02 +0000 (18:59 +0300)]
net: bridge: switchdev: don't notify FDB entries with "master dynamic"
There is a structural problem in switchdev, where the flag bits in
struct switchdev_notifier_fdb_info (added_by_user, is_local etc) only
represent a simplified / denatured view of what's in struct
net_bridge_fdb_entry :: flags (BR_FDB_ADDED_BY_USER, BR_FDB_LOCAL etc).
Each time we want to pass more information about struct
net_bridge_fdb_entry :: flags to struct switchdev_notifier_fdb_info
(here, BR_FDB_STATIC), we find that FDB entries were already notified to
switchdev with no regard to this flag, and thus, switchdev drivers had
no indication whether the notified entries were static or not.
For example, this command:
ip link add br0 type bridge && ip link set swp0 master br0
bridge fdb add dev swp0 00:01:02:03:04:05 master dynamic
has never worked as intended with switchdev. It causes a struct
net_bridge_fdb_entry to be passed to br_switchdev_fdb_notify() which has
a single flag set: BR_FDB_ADDED_BY_USER.
This is further passed to the switchdev notifier chain, where interested
drivers have no choice but to assume this is a static (does not age) and
sticky (does not migrate) FDB entry. So currently, all drivers offload
it to hardware as such, as can be seen below ("offload" is set).
bridge fdb get 00:01:02:03:04:05 dev swp0 master
00:01:02:03:04:05 dev swp0 offload master br0
The software FDB entry expires $ageing_time centiseconds after the
kernel last sees a packet with this MAC SA, and the bridge notifies its
deletion as well, so it eventually disappears from hardware too.
This is a problem, because it is actually desirable to start offloading
"master dynamic" FDB entries correctly - they should expire $ageing_time
centiseconds after the *hardware* port last sees a packet with this
MAC SA - and this is how the current incorrect behavior was discovered.
With an offloaded data plane, it can be expected that software only sees
exception path packets, so an otherwise active dynamic FDB entry would
be aged out by software sooner than it should.
With the change in place, these FDB entries are no longer offloaded:
bridge fdb get 00:01:02:03:04:05 dev swp0 master
00:01:02:03:04:05 dev swp0 master br0
and this also constitutes a better way (assuming a backport to stable
kernels) for user space to determine whether the kernel has the
capability of doing something sane with these or not.
As opposed to "master dynamic" FDB entries, on the current behavior of
which no one currently depends on (which can be deduced from the lack of
kselftests), Ido Schimmel explains that entries with the "extern_learn"
flag (BR_FDB_ADDED_BY_EXT_LEARN) should still be notified to switchdev,
since the spectrum driver listens to them (and this is kind of okay,
because although they are treated identically to "static", they are
expected to not age, and to roam).
Paul reports that it causes a regression with IB on CX4
and FW 12.18.1000. In addition I think that the concept
of "management PF" is not fully accepted and requires
a discussion.
Jakub Kicinski [Thu, 20 Apr 2023 01:22:18 +0000 (18:22 -0700)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
bpf 2023-04-19
We've added 3 non-merge commits during the last 6 day(s) which contain
a total of 3 files changed, 34 insertions(+), 9 deletions(-).
The main changes are:
1) Fix a crash on s390's bpf_arch_text_poke() under a NULL new_addr,
from Ilya Leoshkevich.
2) Fix a bug in BPF verifier's precision tracker, from Daniel Borkmann
and Andrii Nakryiko.
3) Fix a regression in veth's xdp_features which led to a broken BPF CI
selftest, from Lorenzo Bianconi.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Fix incorrect verifier pruning due to missing register precision taints
veth: take into account peer device for NETDEV_XDP_ACT_NDO_XMIT xdp_features flag
s390/bpf: Fix bpf_arch_text_poke() with new_addr == NULL
====================
Merge tag 'mm-hotfixes-stable-2023-04-19-16-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"22 hotfixes.
19 are cc:stable and the remainder address issues which were
introduced during this merge cycle, or aren't considered suitable for
-stable backporting.
19 are for MM and the remainder are for other subsystems"
* tag 'mm-hotfixes-stable-2023-04-19-16-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
nilfs2: initialize unused bytes in segment summary blocks
mm: page_alloc: skip regions with hugetlbfs pages when allocating 1G pages
mm/mmap: regression fix for unmapped_area{_topdown}
maple_tree: fix mas_empty_area() search
maple_tree: make maple state reusable after mas_empty_area_rev()
mm: kmsan: handle alloc failures in kmsan_ioremap_page_range()
mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush()
tools/Makefile: do missed s/vm/mm/
mm: fix memory leak on mm_init error handling
mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
kernel/sys.c: fix and improve control flow in __sys_setres[ug]id()
Revert "userfaultfd: don't fail on unrecognized features"
writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs
maple_tree: fix a potential memory leak, OOB access, or other unpredictable bug
tools/mm/page_owner_sort.c: fix TGID output when cull=tg is used
mailmap: update jtoppins' entry to reference correct email
mm/mempolicy: fix use-after-free of VMA iterator
mm/huge_memory.c: warn with pr_warn_ratelimited instead of VM_WARN_ON_ONCE_FOLIO
mm/mprotect: fix do_mprotect_pkey() return on error
mm/khugepaged: check again on anon uffd-wp during isolation
...
e1000e: Disable TSO on i219-LM card to increase speed
While using i219-LM card currently it was only possible to achieve
about 60% of maximum speed due to regression introduced in Linux 5.8.
This was caused by TSO not being disabled by default despite commit f29801030ac6 ("e1000e: Disable TSO for buffer overrun workaround").
Fix that by disabling TSO during driver probe.
The patch in fixes changed the way real-time mode is chosen for PHC on
the NIC. Apparently there is one more use case of the check outside of
ptp part of the driver which was not converted to the new macro and is
making a lot of noise in free-running mode.
Merge tag 'spi-fix-v6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fix from Mark Brown:
"A small fix in the error handling for the rockchip driver, ensuring we
don't leak clock enables if we fail to request the interrupt for the
device"
* tag 'spi-fix-v6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: spi-rockchip: Fix missing unwind goto in rockchip_sfc_probe()
Merge tag 'regulator-fix-v6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fixes from Mark Brown:
"A few driver specific fixes, one build coverage issue and a couple of
'someone typed in the wrong number' style errors in describing devices
to the subsystem"
* tag 'regulator-fix-v6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: sm5703: Fix missing n_voltages for fixed regulators
regulator: fan53555: Fix wrong TCS_SLEW_MASK
regulator: fan53555: Explicitly include bits header
Andrea Righi [Fri, 10 Feb 2023 21:51:41 +0000 (22:51 +0100)]
rust: allow to use INIT_STACK_ALL_ZERO
With CONFIG_INIT_STACK_ALL_ZERO enabled, bindgen passes
-ftrivial-auto-var-init=zero to clang, that triggers the following
error:
error: '-ftrivial-auto-var-init=zero' hasn't been enabled; enable it at your own peril for benchmarking purpose only with '-enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang'
However, this additional option that is currently required by clang is
deprecated since clang-16 and going to be removed in the future,
likely with clang-18.
So, make sure bindgen is using this extra option if the major version of
the libclang used by bindgen is < 16.
In this way we can enable CONFIG_INIT_STACK_ALL_ZERO with CONFIG_RUST
without triggering any build error.
Andrea Righi [Fri, 10 Feb 2023 15:26:22 +0000 (16:26 +0100)]
rust: fix regexp in scripts/is_rust_module.sh
nm can use "R" or "r" to show read-only data sections, but
scripts/is_rust_module.sh can only recognize "r", so with some versions
of binutils it can fail to detect if a module is a Rust module or not.
Right now we're using this script only to determine if we need to skip
BTF generation (that is disabled globally if CONFIG_RUST is enabled),
but it's still nice to fix this script to do the proper job.
Moreover, with this patch applied I can also relax the constraint of
"RUST depends on !DEBUG_INFO_BTF" and build a kernel with Rust and BTF
enabled at the same time (of course BTF generation is still skipped for
Rust modules).
[ Miguel: The actual reason is likely to be a change on the Rust
compiler between 1.61.0 and 1.62.0:
from 6 to 9: safe
verification time 110 usec
stack depth 4
processed 36 insns (limit 1000000) max_states_per_insn 0 total_states 3 peak_states 3 mark_read 2
The verifier considers this program as safe by mistakenly pruning unsafe
code paths. In the above func#0, code lines 0-10 are of interest. In line
0-3 registers r6 to r9 are initialized with known scalar values. In line 4
the register r6 is reset to an unknown scalar given the verifier does not
track modulo operations. Due to this, the verifier can also not determine
precisely which branches in line 6 and 9 are taken, therefore it needs to
explore them both.
As can be seen, the verifier starts with exploring the false/fall-through
paths first. The 'from 19 to 21' path has both r6=0 and r9=0 and the pointer
arithmetic on r0 += r6 is therefore considered safe. Given the arithmetic,
r6 is correctly marked for precision tracking where backtracking kicks in
where it walks back the current path all the way where r6 was set to 0 in
the fall-through branch.
Next, the pruning logics pops the path 'from 9 to 11' from the stack. Also
here, the state of the registers is the same, that is, r6=0 and r9=0, so
that at line 19 the path can be pruned as it is considered safe. It is
interesting to note that the conditional in line 9 turned r6 into a more
precise state, that is, in the fall-through path at the beginning of line
10, it is R6=scalar(umin=1), and in the branch-taken path (which is analyzed
here) at the beginning of line 11, r6 turned into a known const r6=0 as
r9=0 prior to that and therefore (unsigned) r6 <= 0 concludes that r6 must
be 0 (**):
from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0
[...]
The next path is 'from 6 to 9'. The verifier considers the old and current
state equivalent, and therefore prunes the search incorrectly. Looking into
the two states which are being compared by the pruning logic at line 9, the
old state consists of R6_rwD=Pscalar() R9_rwD=0 R10=fp0 and the new state
consists of R1=ctx(off=0,imm=0) R6_w=scalar(umax=18446744071562067968)
R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0. While r6 had the reg->precise flag
correctly set in the old state, r9 did not. Both r6'es are considered as
equivalent given the old one is a superset of the current, more precise one,
however, r9's actual values (0 vs 0x80000000) mismatch. Given the old r9
did not have reg->precise flag set, the verifier does not consider the
register as contributing to the precision state of r6, and therefore it
considered both r9 states as equivalent. However, for this specific pruned
path (which is also the actual path taken at runtime), register r6 will be
0x400 and r9 0x80000000 when reaching line 21, thus oob-accessing the map.
The purpose of precision tracking is to initially mark registers (including
spilled ones) as imprecise to help verifier's pruning logic finding equivalent
states it can then prune if they don't contribute to the program's safety
aspects. For example, if registers are used for pointer arithmetic or to pass
constant length to a helper, then the verifier sets reg->precise flag and
backtracks the BPF program instruction sequence and chain of verifier states
to ensure that the given register or stack slot including their dependencies
are marked as precisely tracked scalar. This also includes any other registers
and slots that contribute to a tracked state of given registers/stack slot.
This backtracking relies on recorded jmp_history and is able to traverse
entire chain of parent states. This process ends only when all the necessary
registers/slots and their transitive dependencies are marked as precise.
The backtrack_insn() is called from the current instruction up to the first
instruction, and its purpose is to compute a bitmask of registers and stack
slots that need precision tracking in the parent's verifier state. For example,
if a current instruction is r6 = r7, then r6 needs precision after this
instruction and r7 needs precision before this instruction, that is, in the
parent state. Hence for the latter r7 is marked and r6 unmarked.
For the class of jmp/jmp32 instructions, backtrack_insn() today only looks
at call and exit instructions and for all other conditionals the masks
remain as-is. However, in the given situation register r6 has a dependency
on r9 (as described above in **), so also that one needs to be marked for
precision tracking. In other words, if an imprecise register influences a
precise one, then the imprecise register should also be marked precise.
Meaning, in the parent state both dest and src register need to be tracked
for precision and therefore the marking must be more conservative by setting
reg->precise flag for both. The precision propagation needs to cover both
for the conditional: if the src reg was marked but not the dst reg and vice
versa.
Merge tag 'nfsd-6.3-6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Address two issues with the new GSS krb5 Kunit tests
* tag 'nfsd-6.3-6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
SUNRPC: Fix failures of checksum Kunit tests
sunrpc: Fix RFC6803 encryption test
Merge tag 'loongarch-fixes-6.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:
"Some bug fixes, some build fixes, a comment fix and a trivial cleanup"
* tag 'loongarch-fixes-6.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
tools/loongarch: Use __SIZEOF_LONG__ to define __BITS_PER_LONG
LoongArch: Replace hard-coded values in comments with VALEN
LoongArch: Clean up plat_swiotlb_setup() related code
LoongArch: Check unwind_error() in arch_stack_walk()
LoongArch: Adjust user_regset_copyin parameter to the correct offset
LoongArch: Adjust user_watch_state for explicit alignment
LoongArch: module: set section addresses to 0x0
LoongArch: Mark 3 symbol exports as non-GPL
LoongArch: Enable PG when wakeup from suspend
LoongArch: Fix _CONST64_(x) as unsigned
LoongArch: Fix build error if CONFIG_SUSPEND is not set
LoongArch: Fix probing of the CRC32 feature
LoongArch: Make WriteCombine configurable for ioremap()
dma_request_slave_channel() may return NULL which will lead to
NULL pointer dereference error in 'tmp_chan->private'.
Correct this behaviour by, first, switching from deprecated function
dma_request_slave_channel() to dma_request_chan(). Secondly, enable
sanity check for the resuling value of dma_request_chan().
Also, fix description that follows the enacted changes and that
concerns the use of dma_request_slave_channel().
It looks like the dependency got added accidentally in commit a553260618d8
("[PATCH] ISA DMA Kconfig fixes - part 3"). Unlike the previously removed
dmascc driver, the scc driver never used DMA.
mlxsw: pci: Fix possible crash during initialization
During initialization the driver issues a reset command via its command
interface in order to remove previous configuration from the device.
After issuing the reset, the driver waits for 200ms before polling on
the "system_status" register using memory-mapped IO until the device
reaches a ready state (0x5E). The wait is necessary because the reset
command only triggers the reset, but the reset itself happens
asynchronously. If the driver starts polling too soon, the read of the
"system_status" register will never return and the system will crash
[1].
The issue was discovered when the device was flashed with a development
firmware version where the reset routine took longer to complete. The
issue was fixed in the firmware, but it exposed the fact that the
current wait time is borderline.
Fix by increasing the wait time from 200ms to 400ms. With this patch and
the buggy firmware version, the issue did not reproduce in 10 reboots
whereas without the patch the issue is reproduced quite consistently.
[1]
mce: CPUs not responding to MCE broadcast (may include false positives): 0,4
mce: CPUs not responding to MCE broadcast (may include false positives): 0,4
Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
Shutting down cpus with NMI
Kernel Offset: 0x12000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Fixes: ac004e84164e ("mlxsw: pci: Wait longer before accessing the device after reset") Signed-off-by: Ido Schimmel <[email protected]> Reviewed-by: Petr Machata <[email protected]> Signed-off-by: Petr Machata <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jaroslav Kysela [Wed, 19 Apr 2023 08:11:21 +0000 (10:11 +0200)]
ALSA: hda/realtek: Remove specific patch for Dell Precision 3260
Unfortunately, the tester gave a weak feedback (working/non-working) for
this case. After the double confirmation, this change is not really required.
The standard code with alc269_fallback_pin_fixup_tbl should work on this
hardware.
Fixes: 5911d78fabbb ("ALSA: hda/realtek: Improve support for Dell Precision 3260") Fixes: 5f4efc9dfcfd ("ALSA: hda/realtek: Fix support for Dell Precision 3260") Signed-off-by: Jaroslav Kysela <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Takashi Iwai <[email protected]>
David S. Miller [Wed, 19 Apr 2023 08:08:37 +0000 (09:08 +0100)]
Merge branch 'mptcp-fixes'
Matthieu Baerts says:
====================
mptcp: fixes around listening sockets and the MPTCP worker
Christoph Paasch reported a couple of issues found by syzkaller and
linked to operations done by the MPTCP worker on (un)accepted sockets.
Fixing these issues was not obvious and rather complex but Paolo Abeni
nicely managed to propose these excellent patches that seem to satisfy
syzkaller.
Patch 1 partially reverts a recent fix but while still providing a
solution for the previous issue, it also prevents the MPTCP worker from
running concurrently with inet_csk_listen_stop(). A warning is then
avoided. The partially reverted patch has been introduced in v6.3-rc3,
backported up to v6.1 and fixing an issue visible from v5.18.
Patch 2 prevents the MPTCP worker to race with mptcp_accept() causing a
UaF when a fallback to TCP is done while in parallel, the socket is
being accepted by the userspace. This is also a fix of a previous fix
introduced in v6.3-rc3, backported up to v6.1 but here fixing an issue
that is in theory there from v5.7. There is no need to backport it up
to here as it looks like it is only visible later, around v5.18, see the
previous cover-letter linked to this original fix.
====================
The root cause is that the worker can force fallback to TCP the first
mptcp subflow, actually deleting the unaccepted msk socket.
We can explicitly prevent the race delaying the unaccepted msk deletion
at listener shutdown time. In case the closed subflow is later accepted,
just drop the mptcp context and let the user-space deal with the
paired mptcp socket.
Paolo Abeni [Mon, 17 Apr 2023 14:00:40 +0000 (16:00 +0200)]
mptcp: stops worker on unaccepted sockets at listener close
This is a partial revert of the blamed commit, with a relevant
change: mptcp_subflow_queue_clean() now just change the msk
socket status and stop the worker, so that the UaF issue addressed
by the blamed commit is not re-introduced.
The above prevents the mptcp worker from running concurrently with
inet_csk_listen_stop(), as such race would trigger a warning, as
reported by Christoph:
Alexander Aring [Mon, 17 Apr 2023 13:00:52 +0000 (09:00 -0400)]
net: rpl: fix rpl header size calculation
This patch fixes a missing 8 byte for the header size calculation. The
ipv6_rpl_srh_size() is used to check a skb_pull() on skb->data which
points to skb_transport_header(). Currently we only check on the
calculated addresses fields using CmprI and CmprE fields, see:
https://www.rfc-editor.org/rfc/rfc6554#section-3
there is however a missing 8 byte inside the calculation which stands
for the fields before the addresses field. Those 8 bytes are represented
by sizeof(struct ipv6_rpl_sr_hdr) expression.
net: vmxnet3: Fix NULL pointer dereference in vmxnet3_rq_rx_complete()
When vmxnet3_rq_create() fails to allocate rq->data_ring.base due to page
allocation failure, subsequent call to vmxnet3_rq_rx_complete() can result in
NULL pointer dereference.
To fix this bug, check not only that rxDataRingUsed is true but also that
adapter->rxdataring_enabled is true before calling memcpy() in
vmxnet3_rq_rx_complete().
bonding: Fix memory leak when changing bond type to Ethernet
When a net device is put administratively up, its 'IFF_UP' flag is set
(if not set already) and a 'NETDEV_UP' notification is emitted, which
causes the 8021q driver to add VLAN ID 0 on the device. The reverse
happens when a net device is put administratively down.
When changing the type of a bond to Ethernet, its 'IFF_UP' flag is
incorrectly cleared, resulting in the kernel skipping the above process
and VLAN ID 0 being leaked [1].
Fix by restoring the flag when changing the type to Ethernet, in a
similar fashion to the restoration of the 'IFF_SLAVE' flag.
The issue can be reproduced using the script in [2], with example out
before and after the fix in [3].
[2]
ip link add name t-nlmon type nlmon
ip link add name t-dummy type dummy
ip link add name t-bond type bond mode active-backup
ip link set dev t-bond up
ip link set dev t-nlmon master t-bond
ip link set dev t-nlmon nomaster
ip link show dev t-bond
ip link set dev t-dummy master t-bond
ip link show dev t-bond
ip link del dev t-bond
ip link del dev t-dummy
ip link del dev t-nlmon
[3]
Before:
12: t-bond: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
link/netlink
12: t-bond: <BROADCAST,MULTICAST,MASTER,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 46:57:39:a4:46:a2 brd ff:ff:ff:ff:ff:ff
After:
12: t-bond: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
link/netlink
12: t-bond: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 66:48:7b:74:b6:8a brd ff:ff:ff:ff:ff:ff
Fixes: e36b9d16c6a6 ("bonding: clean muticast addresses when device changes type") Fixes: 75c78500ddad ("bonding: remap muticast addresses without using dev_close() and dev_open()") Fixes: 9ec7eb60dcbc ("bonding: restore IFF_MASTER/SLAVE flags on bond enslave ether type change") Reported-by: Mirsad Goran Todorovac <[email protected]> Link: https://lore.kernel.org/netdev/[email protected]/ Tested-by: Mirsad Goran Todorovac <[email protected]> Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Jay Vosburgh <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Tiezhu Yang [Wed, 19 Apr 2023 04:07:34 +0000 (12:07 +0800)]
tools/loongarch: Use __SIZEOF_LONG__ to define __BITS_PER_LONG
Although __SIZEOF_POINTER__ is equal to _SIZEOF_LONG__ on LoongArch,
it is better to use __SIZEOF_LONG__ to define __BITS_PER_LONG to keep
consistent between arch/loongarch/include/uapi/asm/bitsperlong.h and
tools/arch/loongarch/include/uapi/asm/bitsperlong.h.
Enze Li [Wed, 19 Apr 2023 04:07:27 +0000 (12:07 +0800)]
LoongArch: Replace hard-coded values in comments with VALEN
According to LoongArch documentation [1], CSR.PGDL and CSR.PGDH are
concerned with the VA's MSB which is VALEN-1 instead of always being 47.
Fix comments to avoid misleading others.
Tiezhu Yang [Wed, 19 Apr 2023 04:07:27 +0000 (12:07 +0800)]
LoongArch: Clean up plat_swiotlb_setup() related code
After commit c78c43fe7d42 ("LoongArch: Use acpi_arch_dma_setup() and
remove ARCH_HAS_PHYS_TO_DMA"), plat_swiotlb_setup() has been deleted,
so clean up the related code.
arch_stack_walk() should return immediately if unwind_next_frame()
failed, no need to do the useless loops to increase the value of c->len
in stack_trace_consume_entry(), then we can fix the above problem.
The following patchset contains Netfilter fixes for net:
1) Unbreak br_netfilter physdev match support, from Florian Westphal.
2) Use GFP_KERNEL_ACCOUNT for stateful/policy objects, from Chen Aotian.
3) Use IS_ENABLED() in nf_reset_trace(), from Florian Westphal.
4) Fix validation of catch-all set element.
5) Tighten requirements for catch-all set elements.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: tighten netlink attribute requirements for catch-all elements
netfilter: nf_tables: validate catch-all set elements
netfilter: nf_tables: fix ifdef to also consider nf_tables=m
netfilter: nf_tables: Modify nla_memdup's flag to GFP_KERNEL_ACCOUNT
netfilter: br_netfilter: fix recent physdev match breakage
====================
David Howells [Tue, 18 Apr 2023 22:40:07 +0000 (23:40 +0100)]
cifs: Fix unbuffered read
If read() is done in an unbuffered manner, such that, say,
cifs_strict_readv() goes through cifs_user_readv() and thence
__cifs_readv(), it doesn't recognise the EOF and keeps indicating to
userspace that it returning full buffers of data.
This is due to ctx->iter being advanced in cifs_send_async_read() as the
buffer is split up amongst a number of rdata objects. The iterator count
is then used in collect_uncached_read_data() in the non-DIO case to set the
total length read - and thus the return value of sys_read(). But since the
iterator normally gets used up completely during splitting, ctx->total_len
gets overridden to the full amount.
However, prior to that in collect_uncached_read_data(), we've gone through
the list of rdatas and added up the amount of data we actually received
(which we then throw away).
Fix this by removing the bit that overrides the amount read in the non-DIO
case and just going with the total added up in the aforementioned loop.
This was observed by mounting a cifs share with multiple channels, e.g.:
mount //192.168.6.1/test /test/ -o user=shares,pass=...,max_channels=6
and then reading a 1MiB file on the share:
strace cat /xfstest.test/1M >/dev/null
Through strace, the same data can be seen being read again and again.
nilfs2: initialize unused bytes in segment summary blocks
Syzbot still reports uninit-value in nilfs_add_checksums_on_logs() for
KMSAN enabled kernels after applying commit 7397031622e0 ("nilfs2:
initialize "struct nilfs_binfo_dat"->bi_pad field").
This is because the unused bytes at the end of each block in segment
summaries are not initialized. So this fixes the issue by padding the
unused bytes with null bytes.
mm: page_alloc: skip regions with hugetlbfs pages when allocating 1G pages
A bug was reported by Yuanxi Liu where allocating 1G pages at runtime is
taking an excessive amount of time for large amounts of memory. Further
testing allocating huge pages that the cost is linear i.e. if allocating
1G pages in batches of 10 then the time to allocate nr_hugepages from
10->20->30->etc increases linearly even though 10 pages are allocated at
each step. Profiles indicated that much of the time is spent checking the
validity within already existing huge pages and then attempting a
migration that fails after isolating the range, draining pages and a whole
lot of other useless work.
Commit eb14d4eefdc4 ("mm,page_alloc: drop unnecessary checks from
pfn_range_valid_contig") removed two checks, one which ignored huge pages
for contiguous allocations as huge pages can sometimes migrate. While
there may be value on migrating a 2M page to satisfy a 1G allocation, it's
potentially expensive if the 1G allocation fails and it's pointless to try
moving a 1G page for a new 1G allocation or scan the tail pages for valid
PFNs.
Reintroduce the PageHuge check and assume any contiguous region with
hugetlbfs pages is unsuitable for a new 1G allocation.
The hpagealloc test allocates huge pages in batches and reports the
average latency per page over time. This test happens just after boot
when fragmentation is not an issue. Units are in milliseconds.
The vanilla kernel is poor, taking up to 1.3 second to allocate a huge
page and almost 10 minutes in total to run the test. Reverting the
problematic commit reduces it to 8ms at worst and the patch takes 26ms.
This patch fixes the main issue with skipping huge pages but leaves the
page_count() out because a page with an elevated count potentially can
migrate.
mm/mmap: regression fix for unmapped_area{_topdown}
The maple tree limits the gap returned to a window that specifically fits
what was asked. This may not be optimal in the case of switching search
directions or a gap that does not satisfy the requested space for other
reasons. Fix the search by retrying the operation and limiting the search
window in the rare occasion that a conflict occurs.
The internal function of mas_awalk() was incorrectly skipping the last
entry in a node, which could potentially be NULL. This is only a problem
for the left-most node in the tree - otherwise that NULL would not exist.
Fix mas_awalk() by using the metadata to obtain the end of the node for
the loop and the logical pivot as apposed to the raw pivot value.
maple_tree: make maple state reusable after mas_empty_area_rev()
Stop using maple state min/max for the range by passing through pointers
for those values. This will allow the maple state to be reused without
resetting.
Also add some logic to fail out early on searching with invalid
arguments.
mm: kmsan: handle alloc failures in kmsan_ioremap_page_range()
Similarly to kmsan_vmap_pages_range_noflush(), kmsan_ioremap_page_range()
must also properly handle allocation/mapping failures. In the case of
such, it must clean up the already created metadata mappings and return an
error code, so that the error can be propagated to ioremap_page_range().
Without doing so, KMSAN may silently fail to bring the metadata for the
page range into a consistent state, which will result in user-visible
crashes when trying to access them.
mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush()
As reported by Dipanjan Das, when KMSAN is used together with kernel fault
injection (or, generally, even without the latter), calls to kcalloc() or
__vmap_pages_range_noflush() may fail, leaving the metadata mappings for
the virtual mapping in an inconsistent state. When these metadata
mappings are accessed later, the kernel crashes.
To address the problem, we return a non-zero error code from
kmsan_vmap_pages_range_noflush() in the case of any allocation/mapping
failure inside it, and make vmap_pages_range_noflush() return an error if
KMSAN fails to allocate the metadata.
This patch also removes KMSAN_WARN_ON() from vmap_pages_range_noflush(),
as these allocation failures are not fatal anymore.
SeongJae Park [Sat, 15 Apr 2023 20:31:10 +0000 (20:31 +0000)]
tools/Makefile: do missed s/vm/mm/
Commit 799fb82aa132 ("tools/vm: rename tools/vm to tools/mm") missed
renaming 'vm' in 'tools/Makefile' to 'mm'. As a result, 'make clean'
under 'tools/' directory fails as below:
$ make -C tools clean
DESCEND vm
make[1]: Entering directory '/linux/tools/vm'
make[1]: *** No rule to make target 'clean'. Stop.
make[1]: Leaving directory '/linux/tools/vm'
make: *** [Makefile:173: vm_clean] Error 2
make: Leaving directory '/linux/tools'
commit f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
introduces a memory leak by missing a call to destroy_context() when a
percpu_counter fails to allocate.
Before introducing the per-cpu counter allocations, init_new_context() was
the last call that could fail in mm_init(), and thus there was no need to
ever invoke destroy_context() in the error paths. Adding the following
percpu counter allocations adds error paths after init_new_context(),
which means its associated destroy_context() needs to be called when
percpu counters fail to allocate.
mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
syzbot is reporting circular locking dependency which involves
zonelist_update_seq seqlock [1], for this lock is checked by memory
allocation requests which do not need to be retried.
One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler.
CPU0
----
__build_all_zonelists() {
write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
// e.g. timer interrupt handler runs at this moment
some_timer_func() {
kmalloc(GFP_ATOMIC) {
__alloc_pages_slowpath() {
read_seqbegin(&zonelist_update_seq) {
// spins forever because zonelist_update_seq.seqcount is odd
}
}
}
}
// e.g. timer interrupt handler finishes
write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
}
This deadlock scenario can be easily eliminated by not calling
read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation
requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation
requests. But Michal Hocko does not know whether we should go with this
approach.
Another deadlock scenario which syzbot is reporting is a race between
kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer() with
port->lock held and printk() from __build_all_zonelists() with
zonelist_update_seq held.
preventing interrupt context from calling kmalloc(GFP_ATOMIC)
and
preventing printk() from calling console_flush_all()
while zonelist_update_seq.seqcount is odd.
Since Petr Mladek thinks that __build_all_zonelists() can become a
candidate for deferring printk() [2], let's address this problem by
disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC)
and
disabling synchronous printk() in order to avoid console_flush_all()
.
As a side effect of minimizing duration of zonelist_update_seq.seqcount
being odd by disabling synchronous printk(), latency at
read_seqbegin(&zonelist_update_seq) for both !__GFP_DIRECT_RECLAIM and
__GFP_DIRECT_RECLAIM allocation requests will be reduced. Although, from
lockdep perspective, not calling read_seqbegin(&zonelist_update_seq) (i.e.
do not record unnecessary locking dependency) from interrupt context is
still preferable, even if we don't allow calling kmalloc(GFP_ATOMIC)
inside
write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq)
section...
Ondrej Mosnacek [Fri, 17 Feb 2023 16:21:54 +0000 (17:21 +0100)]
kernel/sys.c: fix and improve control flow in __sys_setres[ug]id()
Linux Security Modules (LSMs) that implement the "capable" hook will
usually emit an access denial message to the audit log whenever they
"block" the current task from using the given capability based on their
security policy.
The occurrence of a denial is used as an indication that the given task
has attempted an operation that requires the given access permission, so
the callers of functions that perform LSM permission checks must take care
to avoid calling them too early (before it is decided if the permission is
actually needed to perform the requested operation).
The __sys_setres[ug]id() functions violate this convention by first
calling ns_capable_setid() and only then checking if the operation
requires the capability or not. It means that any caller that has the
capability granted by DAC (task's capability set) but not by MAC (LSMs)
will generate a "denied" audit record, even if is doing an operation for
which the capability is not required.
Fix this by reordering the checks such that ns_capable_setid() is checked
last and -EPERM is returned immediately if it returns false.
While there, also do two small optimizations:
* move the capability check before prepare_creds() and
* bail out early in case of a no-op.
Daniel Miess [Tue, 4 Apr 2023 18:04:11 +0000 (14:04 -0400)]
drm/amd/display: limit timing for single dimm memory
[Why]
1. It could hit bandwidth limitdation under single dimm
memory when connecting 8K external monitor.
2. IsSupportedVidPn got validation failed with
2K240Hz eDP + 8K24Hz external monitor.
3. It's better to filter out such combination in
EnumVidPnCofuncModality
4. For short term, filter out in dc bandwidth validation.
[How]
Force 2K@240Hz+8K@24Hz timing validation false in dc.
Alan Liu [Fri, 14 Apr 2023 10:39:52 +0000 (18:39 +0800)]
drm/amdgpu: Fix desktop freezed after gpu-reset
[Why]
After gpu-reset, sometimes the driver fails to enable vblank irq,
causing flip_done timed out and the desktop freezed.
During gpu-reset, we disable and enable vblank irq in dm_suspend() and
dm_resume(). Later on in amdgpu_irq_gpu_reset_resume_helper(), we check
irqs' refcount and decide to enable or disable the irqs again.
However, we have 2 sets of API for controling vblank irq, one is
dm_vblank_get/put() and another is amdgpu_irq_get/put(). Each API has
its own refcount and flag to store the state of vblank irq, and they
are not synchronized.
In drm we use the first API to control vblank irq but in
amdgpu_irq_gpu_reset_resume_helper() we use the second set of API.
The failure happens when vblank irq was enabled by dm_vblank_get()
before gpu-reset, we have vblank->enabled true. However, during
gpu-reset, in amdgpu_irq_gpu_reset_resume_helper() vblank irq's state
checked from amdgpu_irq_update() is DISABLED. So finally it disables
vblank irq again. After gpu-reset, if there is a cursor plane commit,
the driver will try to enable vblank irq by calling drm_vblank_enable(),
but the vblank->enabled is still true, so it fails to turn on vblank
irq and causes flip_done can't be completed in vblank irq handler and
desktop become freezed.
[How]
Combining the 2 vblank control APIs by letting drm's API finally calls
amdgpu_irq's API, so the irq's refcount and state of both APIs can be
synchronized. Also add a check to prevent refcount from being less then
0 in amdgpu_irq_put().
v2:
- Add warning in amdgpu_irq_enable() if the irq is already disabled.
- Call dc_interrupt_set() in dm_set_vblank() to avoid refcount change
if it is in gpu-reset.
Lorenzo Bianconi [Mon, 17 Apr 2023 21:53:22 +0000 (23:53 +0200)]
veth: take into account peer device for NETDEV_XDP_ACT_NDO_XMIT xdp_features flag
For veth pairs, NETDEV_XDP_ACT_NDO_XMIT is supported by the current
device if the peer one is running a XDP program or if it has GRO enabled.
Fix the xdp_features flags reporting considering peer device and not
current one for NETDEV_XDP_ACT_NDO_XMIT.
Merge tag 'mmc-v6.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"MMC host:
- sdhci_am654: Fix support for UHS-I SDR12 and SDR25 speed modes
MEMSTICK:
- Fix memory leak if card device never gets registered"
* tag 'mmc-v6.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
memstick: fix memory leak if card device is never registered
mmc: sdhci_am654: Set HIGH_SPEED_ENA for SDR12 and SDR25
Marc Zyngier [Tue, 18 Apr 2023 12:57:37 +0000 (13:57 +0100)]
KVM: arm64: Make vcpu flag updates non-preemptible
Per-vcpu flags are updated using a non-atomic RMW operation.
Which means it is possible to get preempted between the read and
write operations.
Another interesting thing to note is that preemption also updates
flags, as we have some flag manipulation in both the load and put
operations.
It is thus possible to lose information communicated by either
load or put, as the preempted flag update will overwrite the flags
when the thread is resumed. This is specially critical if either
load or put has stored information which depends on the physical
CPU the vcpu runs on.
This results in really elusive bugs, and kudos must be given to
Mostafa for the long hours of debugging, and finally spotting
the problem.
Fix it by disabling preemption during the RMW operation, which
ensures that the state stays consistent. Also upgrade vcpu_get_flag
path to use READ_ONCE() to make sure the field is always atomically
accessed.