Populating memory : 0.684014577s
Writing to populated memory : 0.006230175s
Reading from populated memory : 0.004557805s
==== Test Assertion Failure ====
lib/kvm_util.c:1411: false
pid=125806 tid=125809 errno=4 - Interrupted system call
1 0x0000000000402f7c: addr_gpa2hva at kvm_util.c:1411
2 (inlined by) addr_gpa2hva at kvm_util.c:1405
3 0x0000000000401f52: lookup_pfn at access_tracking_perf_test.c:98
4 (inlined by) mark_vcpu_memory_idle at access_tracking_perf_test.c:152
5 (inlined by) vcpu_thread_main at access_tracking_perf_test.c:232
6 0x00007fefe9ff81ce: ?? ??:0
7 0x00007fefe9c64d82: ?? ??:0
No vm physical memory at 0xffbffff000
I can easily reproduce it with a Intel(R) Xeon(R) CPU E5-2630 with 46 bits
PA.
It turns out that the address translation for clearing idle page tracking
returned a wrong result; addr_gva2gpa()'s last step, which is based on
"pte[index[0]].pfn", did the calculation with 40 bits length and the
high 12 bits got truncated. In above case the GPA address to be returned
should be 0x3fffbffff000 for GVA 0xc0000000, but it got truncated into
0xffbffff000 and the subsequent gpa2hva lookup failed.
The width of operations on bit fields greater than 32-bit is
implementation defined, and differs between GCC (which uses the bitfield
precision) and clang (which uses 64-bit arithmetic), so this is a
potential minefield. Remove the bit fields and using manual masking
instead.
KVM: SEV: add cache flush to solve SEV cache incoherency issues
Flush the CPU caches when memory is reclaimed from an SEV guest (where
reclaim also includes it being unmapped from KVM's memslots). Due to lack
of coherency for SEV encrypted memory, failure to flush results in silent
data corruption if userspace is malicious/broken and doesn't ensure SEV
guest memory is properly pinned and unpinned.
Cache coherency is not enforced across the VM boundary in SEV (AMD APM
vol.2 Section 15.34.7). Confidential cachelines, generated by confidential
VM guests have to be explicitly flushed on the host side. If a memory page
containing dirty confidential cachelines was released by VM and reallocated
to another user, the cachelines may corrupt the new user at a later time.
KVM takes a shortcut by assuming all confidential memory remain pinned
until the end of VM lifetime. Therefore, KVM does not flush cache at
mmu_notifier invalidation events. Because of this incorrect assumption and
the lack of cache flushing, malicous userspace can crash the host kernel:
creating a malicious VM and continuously allocates/releases unpinned
confidential memory pages when the VM is running.
Add cache flush operations to mmu_notifier operations to ensure that any
physical memory leaving the guest VM get flushed. In particular, hook
mmu_notifier_invalidate_range_start and mmu_notifier_release events and
flush cache accordingly. The hook after releasing the mmu lock to avoid
contention with other vCPUs.
Merge tag 'net-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from xfrm and can.
Current release - regressions:
- rxrpc: restore removed timer deletion
Current release - new code bugs:
- gre: fix device lookup for l3mdev use-case
- xfrm: fix egress device lookup for l3mdev use-case
Previous releases - regressions:
- sched: cls_u32: fix netns refcount changes in u32_change()
- smc: fix sock leak when release after smc_shutdown()
- xfrm: limit skb_page_frag_refill use to a single page
- eth: atlantic: invert deep par in pm functions, preventing null
derefs
- eth: stmmac: use readl_poll_timeout_atomic() in atomic state
Previous releases - always broken:
- gre: fix skb_under_panic on xmit
- openvswitch: fix OOB access in reserve_sfa_size()
- dsa: hellcreek: calculate checksums in tagger
- eth: ice: fix crash in switchdev mode
- eth: igc:
- fix infinite loop in release_swfw_sync
- fix scheduling while atomic"
* tag 'net-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits)
drivers: net: hippi: Fix deadlock in rr_close()
selftests: mlxsw: vxlan_flooding_ipv6: Prevent flooding of unwanted packets
selftests: mlxsw: vxlan_flooding: Prevent flooding of unwanted packets
nfc: MAINTAINERS: add Bug entry
net: stmmac: Use readl_poll_timeout_atomic() in atomic state
doc/ip-sysctl: add bc_forwarding
netlink: reset network and mac headers in netlink_dump()
net: mscc: ocelot: fix broken IP multicast flooding
net: dsa: hellcreek: Calculate checksums in tagger
net: atlantic: invert deep par in pm functions, preventing null derefs
can: isotp: stop timeout monitoring when no first frame was sent
bonding: do not discard lowest hash bit for non layer3+4 hashing
net: lan966x: Make sure to release ptp interrupt
ipv6: make ip6_rt_gc_expire an atomic_t
net: Handle l3mdev in ip_tunnel_init_flow
l3mdev: l3mdev_master_upper_ifindex_by_index_rcu should be using netdev_master_upper_dev_get_rcu
net/sched: cls_u32: fix possible leak in u32_init_knode()
net/sched: cls_u32: fix netns refcount changes in u32_change()
powerpc: Update MAINTAINERS for ibmvnic and VAS
net: restore alpha order to Ethernet devices in config
...
Above issue may happen as follows:
write truncate kjournald2
generic_perform_write
ext4_write_begin
ext4_walk_page_buffers
do_journal_get_write_access ->add BJ_Reserved list
ext4_journalled_write_end
ext4_walk_page_buffers
write_end_fn
ext4_handle_dirty_metadata
***************JBD2 ABORT**************
jbd2_journal_dirty_metadata
-> return -EROFS, jh in reserved_list
jbd2_journal_commit_transaction
while (commit_transaction->t_reserved_list)
jh = commit_transaction->t_reserved_list;
truncate_pagecache_range
do_invalidatepage
ext4_journalled_invalidatepage
jbd2_journal_invalidatepage
journal_unmap_buffer
__dispose_buffer
__jbd2_journal_unfile_buffer
jbd2_journal_put_journal_head ->put last ref_count
__journal_remove_journal_head
bh->b_private = NULL;
jh->b_bh = NULL;
jbd2_journal_refile_buffer(journal, jh);
bh = jh2bh(jh);
->bh is NULL, later will trigger null-ptr-deref
journal_free_journal_head(jh);
After commit 96f1e0974575, we no longer hold the j_state_lock while
iterating over the list of reserved handles in
jbd2_journal_commit_transaction(). This potentially allows the
journal_head to be freed by journal_unmap_buffer while the commit
codepath is also trying to free the BJ_Reserved buffers. Keeping
j_state_lock held while trying extends hold time of the lock
minimally, and solves this issue.
Control Flow Integrity (CFI) instrumentation of the kernel noticed that
the caller, dev_attr_show(), and the callback, odvp_show(), did not have
matching function prototypes, which would cause a CFI exception to be
raised. Correct the prototype by using struct device_attribute instead
of struct kobj_attribute.
Ville Syrjälä [Thu, 21 Apr 2022 13:36:34 +0000 (16:36 +0300)]
ACPI: processor: idle: Avoid falling back to C3 type C-states
The "safe state" index is used by acpi_idle_enter_bm() to avoid
entering a C-state that may require bus mastering to be disabled
on entry in the cases when this is not going to happen. For this
reason, it should not be set to point to C3 type of C-states, because
they may require bus mastering to be disabled on entry in principle.
This was broken by commit d6b88ce2eb9d ("ACPI: processor idle: Allow
playing dead in C3 state") which inadvertently allowed the "safe
state" index to point to C3 type of C-states.
This results in a machine that won't boot past the point when it first
enters C3. Restore the correct behaviour (either demote to C1/C2, or
use C3 but also set ARB_DIS=1).
I hit this on a Fujitsu Siemens Lifebook S6010 (P3) machine.
Fixes: d6b88ce2eb9d ("ACPI: processor idle: Allow playing dead in C3 state") Cc: 5.16+ <[email protected]> # 5.16+ Signed-off-by: Ville Syrjälä <[email protected]> Tested-by: Woody Suwalski <[email protected]>
[ rjw: Subject and changelog adjustments ] Signed-off-by: Rafael J. Wysocki <[email protected]>
usb: gadget: configfs: clear deactivation flag in configfs_composite_unbind()
If any function like UVC is deactivating gadget as part of composition
switch which results in not calling pullup enablement, it is not getting
enabled after switch to new composition due to this deactivation flag
not cleared. This results in USB enumeration not happening after switch
to new USB composition. Hence clear deactivation flag inside gadget
structure in configfs_composite_unbind() before switch to new USB
composition.
Tasos Sahanidis [Thu, 31 Mar 2022 21:47:00 +0000 (00:47 +0300)]
usb: core: Don't hold the device lock while sleeping in do_proc_control()
Since commit ae8709b296d8 ("USB: core: Make do_proc_control() and
do_proc_bulk() killable") if a device has the USB_QUIRK_DELAY_CTRL_MSG
quirk set, it will temporarily block all other URBs (e.g. interrupts)
while sleeping due to a control.
This results in noticeable delays when, for example, a userspace usbfs
application is sending URB interrupts at a high rate to a keyboard and
simultaneously updates the lock indicators using controls. Interrupts
with direction set to IN are also affected by this, meaning that
delivery of HID reports (containing scancodes) to the usbfs application
is delayed as well.
This patch fixes the regression by calling msleep() while the device
mutex is unlocked, as was the case originally with usb_control_msg().
KVM: SVM: Flush when freeing encrypted pages even on SME_COHERENT CPUs
Use clflush_cache_range() to flush the confidential memory when
SME_COHERENT is supported in AMD CPU. Cache flush is still needed since
SME_COHERENT only support cache invalidation at CPU side. All confidential
cache lines are still incoherent with DMA devices.
KVM: SVM: Simplify and harden helper to flush SEV guest page(s)
Rework sev_flush_guest_memory() to explicitly handle only a single page,
and harden it to fall back to WBINVD if VM_PAGE_FLUSH fails. Per-page
flushing is currently used only to flush the VMSA, and in its current
form, the helper is completely broken with respect to flushing actual
guest memory, i.e. won't work correctly for an arbitrary memory range.
VM_PAGE_FLUSH takes a host virtual address, and is subject to normal page
walks, i.e. will fault if the address is not present in the host page
tables or does not have the correct permissions. Current AMD CPUs also
do not honor SMAP overrides (undocumented in kernel versions of the APM),
so passing in a userspace address is completely out of the question. In
other words, KVM would need to manually walk the host page tables to get
the pfn, ensure the pfn is stable, and then use the direct map to invoke
VM_PAGE_FLUSH. And the latter might not even work, e.g. if userspace is
particularly evil/clever and backs the guest with Secret Memory (which
unmaps memory from the direct map).
Like Xu [Sat, 9 Apr 2022 01:52:26 +0000 (09:52 +0800)]
KVM: x86/pmu: Update AMD PMC sample period to fix guest NMI-watchdog
NMI-watchdog is one of the favorite features of kernel developers,
but it does not work in AMD guest even with vPMU enabled and worse,
the system misrepresents this capability via /proc.
This is a PMC emulation error. KVM does not pass the latest valid
value to perf_event in time when guest NMI-watchdog is running, thus
the perf_event corresponding to the watchdog counter will enter the
old state at some point after the first guest NMI injection, forcing
the hardware register PMC0 to be constantly written to 0x800000000001.
Meanwhile, the running counter should accurately reflect its new value
based on the latest coordinated pmc->counter (from vPMC's point of view)
rather than the value written directly by the guest.
Wanpeng Li [Mon, 18 Apr 2022 07:42:32 +0000 (00:42 -0700)]
x86/kvm: Preserve BSP MSR_KVM_POLL_CONTROL across suspend/resume
MSR_KVM_POLL_CONTROL is cleared on reset, thus reverting guests to
host-side polling after suspend/resume. Non-bootstrap CPUs are
restored correctly by the haltpoll driver because they are hot-unplugged
during suspend and hot-plugged during resume; however, the BSP
is not hotpluggable and remains in host-sde polling mode after
the guest resume. The makes the guest pay for the cost of vmexits
every time the guest enters idle.
Fix it by recording BSP's haltpoll state and resuming it during guest
resume.
KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabled
Skip the APICv inhibit update for KVM_GUESTDBG_BLOCKIRQ if APICv is
disabled at the module level to avoid having to acquire the mutex and
potentially process all vCPUs. The DISABLE inhibit will (barring bugs)
never be lifted, so piling on more inhibits is unnecessary.
KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a race
Make a KVM_REQ_APICV_UPDATE request when creating a vCPU with an
in-kernel local APIC and APICv enabled at the module level. Consuming
kvm_apicv_activated() and stuffing vcpu->arch.apicv_active directly can
race with __kvm_set_or_clear_apicv_inhibit(), as vCPU creation happens
before the vCPU is fully onlined, i.e. it won't get the request made to
"all" vCPUs. If APICv is globally inhibited between setting apicv_active
and onlining the vCPU, the vCPU will end up running with APICv enabled
and trigger KVM's sanity check.
Mark APICv as active during vCPU creation if APICv is enabled at the
module level, both to be optimistic about it's final state, e.g. to avoid
additional VMWRITEs on VMX, and because there are likely bugs lurking
since KVM checks apicv_active in multiple vCPU creation paths. While
keeping the current behavior of consuming kvm_apicv_activated() is
arguably safer from a regression perspective, force apicv_active so that
vCPU creation runs with deterministic state and so that if there are bugs,
they are found sooner than later, i.e. not when some crazy race condition
is hit.
KVM: nVMX: Defer APICv updates while L2 is active until L1 is active
Defer APICv updates that occur while L2 is active until nested VM-Exit,
i.e. until L1 regains control. vmx_refresh_apicv_exec_ctrl() assumes L1
is active and (a) stomps all over vmcs02 and (b) neglects to ever updated
vmcs01. E.g. if vmcs12 doesn't enable the TPR shadow for L2 (and thus no
APICv controls), L1 performs nested VM-Enter APICv inhibited, and APICv
becomes unhibited while L2 is active, KVM will set various APICv controls
in vmcs02 and trigger a failed VM-Entry. The kicker is that, unless
running with nested_early_check=1, KVM blames L1 and chaos ensues.
In all cases, ignoring vmcs02 and always deferring the inhibition change
to vmcs01 is correct (or at least acceptable). The ABSENT and DISABLE
inhibitions cannot truly change while L2 is active (see below).
IRQ_BLOCKING can change, but it is firmly a best effort debug feature.
Furthermore, only L2's APIC is accelerated/virtualized to the full extent
possible, e.g. even if L1 passes through its APIC to L2, normal MMIO/MSR
interception will apply to the virtual APIC managed by KVM.
The exception is the SELF_IPI register when x2APIC is enabled, but that's
an acceptable hole.
Lastly, Hyper-V's Auto EOI can technically be toggled if L1 exposes the
MSRs to L2, but for that to work in any sane capacity, L1 would need to
pass through IRQs to L2 as well, and IRQs must be intercepted to enable
virtual interrupt delivery. I.e. exposing Auto EOI to L2 and enabling
VID for L2 are, for all intents and purposes, mutually exclusive.
Lack of dynamic toggling is also why this scenario is all but impossible
to encounter in KVM's current form. But a future patch will pend an
APICv update request _during_ vCPU creation to plug a race where a vCPU
that's being created doesn't get included in the "all vCPUs request"
because it's not yet visible to other vCPUs. If userspaces restores L2
after VM creation (hello, KVM selftests), the first KVM_RUN will occur
while L2 is active and thus service the APICv update request made during
VM creation.
KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabled
Set the DISABLE inhibit, not the ABSENT inhibit, if APICv is disabled via
module param. A recent refactoring to add a wrapper for setting/clearing
inhibits unintentionally changed the flag, probably due to a copy+paste
goof.
KVM: Initialize debugfs_dentry when a VM is created to avoid NULL deref
Initialize debugfs_entry to its semi-magical -ENOENT value when the VM
is created. KVM's teardown when VM creation fails is kludgy and calls
kvm_uevent_notify_change() and kvm_destroy_vm_debugfs() even if KVM never
attempted kvm_create_vm_debugfs(). Because debugfs_entry is zero
initialized, the IS_ERR() checks pass and KVM derefs a NULL pointer.
KVM: Add helpers to wrap vcpu->srcu_idx and yell if it's abused
Add wrappers to acquire/release KVM's SRCU lock when stashing the index
in vcpu->src_idx, along with rudimentary detection of illegal usage,
e.g. re-acquiring SRCU and thus overwriting vcpu->src_idx. Because the
SRCU index is (currently) either 0 or 1, illegal nesting bugs can go
unnoticed for quite some time and only cause problems when the nested
lock happens to get a different index.
Wrap the WARNs in PROVE_RCU=y, and make them ONCE, otherwise KVM will
likely yell so loudly that it will bring the kernel to its knees.
KVM: RISC-V: Use kvm_vcpu.srcu_idx, drop RISC-V's unnecessary copy
Use the generic kvm_vcpu's srcu_idx instead of using an indentical field
in RISC-V's version of kvm_vcpu_arch. Generic KVM very intentionally
does not touch vcpu->srcu_idx, i.e. there's zero chance of running afoul
of common code.
KVM: x86: Don't re-acquire SRCU lock in complete_emulated_io()
Don't re-acquire SRCU in complete_emulated_io() now that KVM acquires the
lock in kvm_arch_vcpu_ioctl_run(). More importantly, don't overwrite
vcpu->srcu_idx. If the index acquired by complete_emulated_io() differs
from the one acquired by kvm_arch_vcpu_ioctl_run(), KVM will effectively
leak a lock and hang if/when synchronize_srcu() is invoked for the
relevant grace period.
Sven Peter [Mon, 11 Apr 2022 15:53:00 +0000 (17:53 +0200)]
usb: dwc3: Try usb-role-switch first in dwc3_drd_init
If the PHY controller node has a "port" dwc3 tries to find an
extcon device even when "usb-role-switch" is present. This happens
because dwc3_get_extcon() sees that "port" node and then calls
extcon_find_edev_by_node() which will always return EPROBE_DEFER
in that case.
On the other hand, even if an extcon was present and dwc3_get_extcon()
was successful it would still be ignored in favor of "usb-role-switch".
Let's just first check if "usb-role-switch" is configured in the device
tree and directly use it instead and only try to look for an extcon
device otherwise.
The current driver logic checks against 0 to determine whether the
periodic tx/rx threshold settings are set, but we may get bogus values
from uninitialized variables if no device property is set. Properly
default these variables to 0.
Macpaul Lin [Tue, 19 Apr 2022 08:12:45 +0000 (16:12 +0800)]
usb: mtu3: fix USB 3.0 dual-role-switch from device to host
Issue description:
When an OTG port has been switched to device role and then switch back
to host role again, the USB 3.0 Host (XHCI) will not be able to detect
"plug in event of a connected USB 2.0/1.0 ((Highspeed and Fullspeed)
devices until system reboot.
Root cause and Solution:
There is a condition checking flag "ssusb->otg_switch.is_u3_drd" in
toggle_opstate(). At the end of role switch procedure, toggle_opstate()
will be called to set DC_SESSION and SOFT_CONN bit. If "is_u3_drd" was
set and switched the role to USB host 3.0, bit DC_SESSION and SOFT_CONN
will be skipped hence caused the port cannot detect connected USB 2.0
(Highspeed and Fullspeed) devices. Simply remove the condition check to
solve this issue.
Evan Green [Fri, 8 Apr 2022 18:42:50 +0000 (11:42 -0700)]
xhci: Enable runtime PM on second Alderlake controller
Alderlake has two XHCI controllers with PCI IDs 0x461e and 0x51ed. We
had previously added the quirk to default enable runtime PM for 0x461e,
now add it for 0x51ed as well.
Peter Geis [Sat, 9 Apr 2022 15:21:15 +0000 (11:21 -0400)]
usb: dwc3: fix backwards compat with rockchip devices
Commit 33fb697ec7e5 ("usb: dwc3: Get clocks individually") moved from
the clk_bulk api to individual clocks, following the snps,dwc3.yaml
dt-binding for clock names.
Unfortunately the rk3328 (and upcoming rk356x support) use the
rockchip,dwc3.yaml which has different clock names, which are common on
devices using the glue layer.
The rk3328 does not use a glue layer, but attaches directly to the dwc3
core driver.
The offending patch series failed to account for this, thus dwc3 was
broken on rk3328.
To retain backwards compatibility with rk3328 device trees we must also
check for the alternate clock names.
usb: misc: fix improper handling of refcount in uss720_probe()
usb_put_dev shouldn't be called when uss720_probe succeeds because of
priv->usbdev. At the same time, priv->usbdev shouldn't be set to NULL
before destroy_priv in uss720_disconnect because usb_put_dev is in
destroy_priv.
Fix this by moving priv->usbdev = NULL after usb_put_dev.
Weitao Wango [Thu, 24 Mar 2022 12:17:35 +0000 (20:17 +0800)]
USB: Fix ehci infinite suspend-resume loop issue in zhaoxin
In zhaoxin platform, some ehci projects will latch a wakeup signal
internal when plug in a device on port during system S0. This wakeup
signal will turn on when ehci runtime suspend, which will trigger a
system control interrupt that will resume ehci back to D0. As no
device connect, ehci will be set to runtime suspend and turn on the
internal latched wakeup signal again. It will cause a suspend-resume
loop and generate system control interrupt continuously.
Fixed this issue by clear wakeup signal latched in ehci internal when
ehci resume callback is called.
usb: typec: tcpm: Fix undefined behavior due to shift overflowing the constant
Fix:
drivers/usb/typec/tcpm/tcpm.c: In function ‘run_state_machine’:
drivers/usb/typec/tcpm/tcpm.c:4724:3: error: case label does not reduce to an integer constant
case BDO_MODE_TESTDATA:
^~~~
See https://lore.kernel.org/r/YkwQ6%[email protected] for the gory
details as to why it triggers with older gccs only.
usb: typec: rt1719: Fix build error without CONFIG_POWER_SUPPLY
Building without CONFIG_POWER_SUPPLY will fail:
drivers/usb/typec/rt1719.o: In function `rt1719_psy_set_property':
rt1719.c:(.text+0x10a): undefined reference to `power_supply_get_drvdata'
drivers/usb/typec/rt1719.o: In function `rt1719_psy_get_property':
rt1719.c:(.text+0x2c8): undefined reference to `power_supply_get_drvdata'
drivers/usb/typec/rt1719.o: In function `devm_rt1719_psy_register':
rt1719.c:(.text+0x3e9): undefined reference to `devm_power_supply_register'
drivers/usb/typec/rt1719.o: In function `rt1719_irq_handler':
rt1719.c:(.text+0xf9f): undefined reference to `power_supply_changed'
drivers/usb/typec/rt1719.o: In function `rt1719_update_pwr_opmode.part.9':
rt1719.c:(.text+0x657): undefined reference to `power_supply_changed'
drivers/usb/typec/rt1719.o: In function `rt1719_attach':
rt1719.c:(.text+0x83e): undefined reference to `power_supply_changed'
Heikki Krogerus [Tue, 5 Apr 2022 13:48:24 +0000 (16:48 +0300)]
usb: typec: ucsi: Fix role swapping
All attempts to swap the roles timed out because the
completion was done without releasing the port lock. Fixing
that by releasing the lock before starting to wait for the
completion.
Heikki Krogerus [Tue, 5 Apr 2022 13:48:23 +0000 (16:48 +0300)]
usb: typec: ucsi: Fix reuse of completion structure
The role swapping completion variable is reused, so it needs
to be reinitialised every time. Otherwise it will be marked
as done after the first time it's used and completing
immediately.
zhangqilong [Sat, 19 Mar 2022 02:38:22 +0000 (10:38 +0800)]
usb: xhci: tegra:Fix PM usage reference leak of tegra_xusb_unpowergate_partitions
pm_runtime_get_sync will increment pm usage counter
even it failed. Forgetting to putting operation will
result in reference leak here. We fix it by replacing
it with pm_runtime_resume_and_get to keep usage counter
balanced.
After mnt_hold_writers() has been called we will always have set MNT_WRITE_HOLD
and consequently we always need to pair mnt_hold_writers() with
mnt_unhold_writers(). After the recent cleanup in [1] where Al switched from a
do-while to a for loop the cleanup currently fails to unset MNT_WRITE_HOLD for
the first mount that was changed. Fix this and make sure that the first mount
will be cleaned up and add some comments to make it more obvious.
memory: renesas-rpc-if: Fix HF/OSPI data transfer in Manual Mode
HyperFlash devices fail to probe:
rpc-if-hyperflash rpc-if-hyperflash: probing of hyperbus device failed
In HyperFlash or Octal-SPI Flash mode, the Transfer Data Enable bits
(SPIDE) in the Manual Mode Enable Setting Register (SMENR) are derived
from half of the transfer size, cfr. the rpcif_bits_set() helper
function. However, rpcif_reg_{read,write}() does not take the bus size
into account, and does not double all Manual Mode Data Register access
sizes when communicating with a HyperFlash or Octal-SPI Flash device.
Fix this, and avoid the back-and-forth conversion between transfer size
and Transfer Data Enable bits, by explicitly storing the transfer size
in struct rpcif, and using that value to determine access size in
rpcif_reg_{read,write}().
Enforce that the "high" Manual Mode Read/Write Data Registers
(SM[RW]DR1) are only used for 8-byte data accesses.
While at it, forbid writing to the Manual Mode Read Data Registers,
as they are read-only.
Marek Vasut [Fri, 15 Apr 2022 21:54:10 +0000 (23:54 +0200)]
pinctrl: stm32: Do not call stm32_gpio_get() for edge triggered IRQs in EOI
The stm32_gpio_get() should only be called for LEVEL triggered interrupts,
skip calling it for EDGE triggered interrupts altogether to avoid wasting
CPU cycles in EOI handler. On this platform, EDGE triggered interrupts are
the majority and LEVEL triggered interrupts are the exception no less, and
the CPU cycles are not abundant.
Wells Lu [Fri, 15 Apr 2022 09:41:28 +0000 (17:41 +0800)]
pinctrl: Fix an error in pin-function table of SP7021
The first valid item of pin-function table should
start from the third item. The first two items,
due to historical and compatible reasons, should
be dummy items.
The two dummy items were removed accidentally in
initial submission. This fix adds them back.
btrfs: zoned: use dedicated lock for data relocation
Currently, we use btrfs_inode_{lock,unlock}() to grant an exclusive
writeback of the relocation data inode in
btrfs_zoned_data_reloc_{lock,unlock}(). However, that can cause a deadlock
in the following path.
Thread A takes btrfs_inode_lock() and waits for metadata reservation by
e.g, waiting for writeback:
The deadlock is caused by relying on the vfs_inode's lock. By using it, we
introduced unnecessary exclusion of writeback and
btrfs_prealloc_file_range(). Also, the lock at this point is useless as we
don't have any dirty pages in the inode yet.
Introduce fs_info->zoned_data_reloc_io_lock and use it for the exclusive
writeback.
That assertion is relatively new, introduced with commit d04fbe19aefd2
("btrfs: scrub: cleanup the argument list of scrub_chunk()").
The block group we get at scrub_enumerate_chunks() can actually have a
start address that is smaller then the chunk offset we extracted from a
device extent item we got from the commit root of the device tree.
This is very rare, but it can happen due to a race with block group
removal and allocation. For example, the following steps show how this
can happen:
1) We are at transaction T, and we have the following blocks groups,
sorted by their logical start address:
--> logical address space hole of 256M,
there used to be a 256M metadata block group here
[ bg Y, start address Y, length 256M (metadata) ]
--> Y matches W's end offset + 256M
Block group Y is the block group with the highest logical address in
the whole filesystem;
2) Block group Y is deleted and its extent mapping is removed by the call
to remove_extent_mapping() made from btrfs_remove_block_group().
So after this point, the last element of the mapping red black tree,
its rightmost node, is the mapping for block group W;
3) While still at transaction T, a new data block group is allocated,
with a length of 1G. When creating the block group we do a call to
find_next_chunk(), which returns the logical start address for the
new block group. This calls returns X, which corresponds to the
end offset of the last block group, the rightmost node in the mapping
red black tree (fs_info->mapping_tree), plus one.
So we get a new block group that starts at logical address X and with
a length of 1G. It spans over the whole logical range of the old block
group Y, that was previously removed in the same transaction.
However the device extent allocated to block group X is not the same
device extent that was used by block group Y, and it also does not
overlap that extent, which must be always the case because we allocate
extents by searching through the commit root of the device tree
(otherwise it could corrupt a filesystem after a power failure or
an unclean shutdown in general), so the extent allocator is behaving
as expected;
4) We have a task running scrub, currently at scrub_enumerate_chunks().
There it searches for device extent items in the device tree, using
its commit root. It finds a device extent item that was used by
block group Y, and it extracts the value Y from that item into the
local variable 'chunk_offset', using btrfs_dev_extent_chunk_offset();
It then calls btrfs_lookup_block_group() to find block group for
the logical address Y - since there's currently no block group that
starts at that logical address, it returns block group X, because
its range contains Y.
This results in triggering the assertion:
ASSERT(cache->start == chunk_offset);
right before calling scrub_chunk(), as cache->start is X and
chunk_offset is Y.
This is more likely to happen of filesystems not larger than 50G, because
for these filesystems we use a 256M size for metadata block groups and
a 1G size for data block groups, while for filesystems larger than 50G,
we use a 1G size for both data and metadata block groups (except for
zoned filesystems). It could also happen on any filesystem size due to
the fact that system block groups are always smaller (32M) than both
data and metadata block groups, but these are not frequently deleted, so
much less likely to trigger the race.
So make scrub skip any block group with a start offset that is less than
the value we expect, as that means it's a new block group that was created
in the current transaction. It's pointless to continue and try to scrub
its extents, because scrub searches for extents using the commit root, so
it won't find any. For a device replace, skip it as well for the same
reasons, and we don't need to worry about the possibility of extents of
the new block group not being to the new device, because we have the write
duplication setup done through btrfs_map_block().
Fixes: d04fbe19aefd ("btrfs: scrub: cleanup the argument list of scrub_chunk()") CC: [email protected] # 5.17 Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
The "read_bhrb" global symbol is only called under CONFIG_PPC64 of
arch/powerpc/perf/core-book3s.c but it is compiled for both 32 and 64 bit
anyway (and LLVM fails to link this on 32bit).
When scheduling a group of events, there are constraint checks done to
make sure all events can go in a group. Example, one of the criteria is
that events in a group cannot use the same PMC. But platform specific
PMU supports alternative event for some of the event codes. During
perf_event_open(), if any event group doesn't match constraint check
criteria, further lookup is done to find alternative event.
By current design, the array of alternatives events in PMU code is
expected to be sorted by column 0. This is because in
find_alternative() the return criteria is based on event code
comparison. ie. "event < ev_alt[i][0])". This optimisation is there
since find_alternative() can be called multiple times. In power10 PMU
code, the alternative event array is not sorted properly and hence there
is breakage in finding alternative event.
To work with existing logic, fix the alternative event array to be
sorted by column 0 for power10-pmu.c
Results:
In case where an alternative event is not chosen when we could, events
will be multiplexed. ie, time sliced where it could actually run
concurrently.
Example, in power10 PM_INST_CMPL_ALT(0x00002) has alternative event,
PM_INST_CMPL(0x500fa). Without the fix, if a group of events with PMC1
to PMC4 is used along with PM_INST_CMPL_ALT, it will be time sliced
since all programmable PMC's are consumed already. But with the fix,
when it picks alternative event on PMC5, all events will run
concurrently.
When scheduling a group of events, there are constraint checks done to
make sure all events can go in a group. Example, one of the criteria is
that events in a group cannot use the same PMC. But platform specific
PMU supports alternative event for some of the event codes. During
perf_event_open(), if any event group doesn't match constraint check
criteria, further lookup is done to find alternative event.
By current design, the array of alternatives events in PMU code is
expected to be sorted by column 0. This is because in
find_alternative() the return criteria is based on event code
comparison. ie. "event < ev_alt[i][0])". This optimisation is there
since find_alternative() can be called multiple times. In power9 PMU
code, the alternative event array is not sorted properly and hence there
is breakage in finding alternative events.
To work with existing logic, fix the alternative event array to be
sorted by column 0 for power9-pmu.c
Results:
With alternative events, multiplexing can be avoided. That is, for
example, in power9 PM_LD_MISS_L1 (0x3e054) has alternative event,
PM_LD_MISS_L1_ALT (0x400f0). This is an identical event which can be
programmed in a different PMC.
We are missing some inter dependencies here so re-introduce the lock
until we have figured out what's missing. Just drop/retake it while
adding dependencies.
We hold rrpriv->lock in position (1) of thread 1 and
use del_timer_sync() to wait timer to stop, but timer handler
also need rrpriv->lock in position (2) of thread 2.
As a result, rr_close() will block forever.
This patch extracts del_timer_sync() from the protection of
spin_lock_irqsave(), which could let timer handler to obtain
the needed lock.
The result is technically harmless, as both the source and the
destinations are currently the same allocation size (4 bytes) and don't
use their padding, but if anything were to ever be added after the
"mcr" member in "struct whiteheat_private", it would be overwritten. The
structs both have a single u8 "mcr" member, but are 4 bytes in padded
size. The memcpy() destination was explicitly targeting the u8 member
(size 1) with the length of the whole structure (size 4), triggering
the memcpy buffer overflow warning:
In file included from include/linux/string.h:253,
from include/linux/bitmap.h:11,
from include/linux/cpumask.h:12,
from include/linux/smp.h:13,
from include/linux/lockdep.h:14,
from include/linux/spinlock.h:62,
from include/linux/mmzone.h:8,
from include/linux/gfp.h:6,
from include/linux/slab.h:15,
from drivers/usb/serial/whiteheat.c:17:
In function 'fortify_memcpy_chk',
inlined from 'firm_send_command' at drivers/usb/serial/whiteheat.c:587:4:
include/linux/fortify-string.h:328:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
328 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mtd: rawnand: qcom: fix memory corruption that causes panic
This patch fixes a memory corruption that occurred in the
nand_scan() path for Hynix nand device.
On boot, for Hynix nand device will panic at a weird place:
| Unable to handle kernel NULL pointer dereference at virtual
address 00000070
| [00000070] *pgd=00000000
| Internal error: Oops: 5 [#1] PREEMPT SMP ARM
| Modules linked in:
| CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-01473-g13ae1769cfb0
#38
| Hardware name: Generic DT based system
| PC is at nandc_set_reg+0x8/0x1c
| LR is at qcom_nandc_command+0x20c/0x5d0
| pc : [<c088b74c>] lr : [<c088d9c8>] psr: 00000113
| sp : c14adc50 ip : c14ee208 fp : c0cc970c
| r10: 000000a3 r9 : 00000000 r8 : 00000040
| r7 : c16f6a00 r6 : 00000090 r5 : 00000004 r4 :c14ee040
| r3 : 00000000 r2 : 0000000b r1 : 00000000 r0 :c14ee040
| Flags: nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
| Control: 10c5387d Table: 8020406a DAC: 00000051
| Register r0 information: slab kmalloc-2k start c14ee000 pointer offset
64 size 2048
| Process swapper/0 (pid: 1, stack limit = 0x(ptrval))
| nandc_set_reg from qcom_nandc_command+0x20c/0x5d0
| qcom_nandc_command from nand_readid_op+0x198/0x1e8
| nand_readid_op from hynix_nand_has_valid_jedecid+0x30/0x78
| hynix_nand_has_valid_jedecid from hynix_nand_init+0xb8/0x454
| hynix_nand_init from nand_scan_with_ids+0xa30/0x14a8
| nand_scan_with_ids from qcom_nandc_probe+0x648/0x7b0
| qcom_nandc_probe from platform_probe+0x58/0xac
The problem is that the nand_scan()'s qcom_nand_attach_chip callback
is updating the nandc->max_cwperpage from 1 to 4 or 8 based on page size.
This causes the sg_init_table of clear_bam_transaction() in the driver's
qcom_nandc_command() to memset much more than what was initially
allocated by alloc_bam_transaction().
This patch will update nandc->max_cwperpage 1 to 4 or 8 based on page
size in qcom_nand_attach_chip call back after freeing the previously
allocated memory for bam txn as per nandc->max_cwperpage = 1 and then
again allocating bam txn as per nandc->max_cwperpage = 4 or 8 based on
page size in qcom_nand_attach_chip call back itself.
Commit 46b5889cc2c5 ("mtd: implement proper partition handling")
started using "mtd_get_master_ofs()" in mtd callbacks to determine
memory offsets by means of 'part' field from mtd_info, what previously
was smashed accessing 'master' field in the mtd_set_dev_defaults() method.
That provides wrong offset what causes hardware access errors.
Just make 'part', 'master' as separate fields, rather than using
union type to avoid 'part' data corruption when mtd_set_dev_defaults()
is called.
Miaoqian Lin [Tue, 12 Apr 2022 08:34:31 +0000 (08:34 +0000)]
mtd: rawnand: Fix return value check of wait_for_completion_timeout
wait_for_completion_timeout() returns unsigned long not int.
It returns 0 if timed out, and positive if completed.
The check for <= 0 is ambiguous and should be == 0 here
indicating timeout which is the only error case.
Revert "drm: of: Lookup if child node has panel or bridge"
Commit '80253168dbfd ("drm: of: Lookup if child node has panel or
bridge")' attempted to simplify the case of expressing a simple panel
under a DSI controller, by assuming that the first non-graph child node
was a panel or bridge.
Unfortunately for non-trivial cases the first child node might not be a
panel or bridge. Examples of this can be a aux-bus in the case of
DisplayPort, or an opp-table represented before the panel node.
In these cases the reverted commit prevents the caller from ever finding
a reference to the panel.
This reverts commit '80253168dbfd ("drm: of: Lookup if child node has
panel or bridge")', in favor of using an explicit graph reference to the
panel in the trivial case as well.
Revert "drm: of: Properly try all possible cases for bridge/panel detection"
Commit '80253168dbfd ("drm: of: Lookup if child node has panel or
bridge")' introduced the ability to describe a panel under a display
controller without having to use a graph to connect the controller to
its single child panel (or bridge).
The implementation of this would find the first non-graph node and
attempt to acquire the related panel or bridge. This prevents cases
where any other child node, such as a aux bus for a DisplayPort
controller, or an opp-table to find the referenced panel.
Commit '67bae5f28c89 ("drm: of: Properly try all possible cases for
bridge/panel detection")' attempted to solve this problem by not
bypassing the graph reference lookup before attempting to find the panel
or bridge.
While this does solve the case where a proper graph reference is
present, it does not allow the caller to distinguish between a
yet-to-be-probed panel or bridge and the absence of a reference to a
panel.
One such case is a DisplayPort controller that on some boards have an
explicitly described reference to a panel, but on others have a
discoverable DisplayPort display attached (which doesn't need to be
expressed in DeviceTree).
This reverts commit '67bae5f28c89 ("drm: of: Properly try all possible
cases for bridge/panel detection")', as a step towards reverting commit
'80253168dbfd ("drm: of: Lookup if child node has panel or bridge")'.
Miaoqian Lin [Wed, 20 Apr 2022 13:50:07 +0000 (21:50 +0800)]
drm/vc4: Use pm_runtime_resume_and_get to fix pm_runtime_get_sync() usage
If the device is already in a runtime PM enabled state
pm_runtime_get_sync() will return 1.
Also, we need to call pm_runtime_put_noidle() when pm_runtime_get_sync()
fails, so use pm_runtime_resume_and_get() instead. this function
will handle this.
The LoPAPR spec defines a guest visible IOMMU with a variable page size.
Currently QEMU advertises 4K, 64K, 2M, 16MB pages, a Linux VM picks
the biggest (16MB). In the case of a passed though PCI device, there is
a hardware IOMMU which does not support all pages sizes from the above -
P8 cannot do 2MB and P9 cannot do 16MB. So for each emulated
16M IOMMU page we may create several smaller mappings ("TCEs") in
the hardware IOMMU.
The code wrongly uses the emulated TCE index instead of hardware TCE
index in error handling. The problem is easier to see on POWER8 with
multi-level TCE tables (when only the first level is preallocated)
as hash mode uses real mode TCE hypercalls handlers.
The kernel starts using indirect tables when VMs get bigger than 128GB
(depends on the max page order).
The very first real mode hcall is going to fail with H_TOO_HARD as
in the real mode we cannot allocate memory for TCEs (we can in the virtual
mode) but on the way out the code attempts to clear hardware TCEs using
emulated TCE indexes which corrupts random kernel memory because
it_offset==1<<59 is subtracted from those indexes and the resulting index
is out of the TCE table bounds.
This fixes kvmppc_clear_tce() to use the correct TCE indexes.
While at it, this fixes TCE cache invalidation which uses emulated TCE
indexes instead of the hardware ones. This went unnoticed as 64bit DMA
is used these days and VMs map all RAM in one go and only then do DMA
and this is when the TCE cache gets populated.
Potentially this could slow down mapping, however normally 16MB
emulated pages are backed by 64K hardware pages so it is one write to
the "TCE Kill" per 256 updates which is not that bad considering the size
of the cache (1024 TCEs or so).
pinctrl: samsung: fix missing GPIOLIB on ARM64 Exynos config
The Samsung pinctrl drivers depend on OF_GPIO, which is part of GPIOLIB.
ARMv7 Exynos platform selects GPIOLIB and Samsung pinctrl drivers. ARMv8
Exynos selects only the latter leading to possible wrong configuration
on ARMv8 build:
WARNING: unmet direct dependencies detected for PINCTRL_EXYNOS
Depends on [n]: PINCTRL [=y] && OF_GPIO [=n] && (ARCH_EXYNOS [=y] || ARCH_S5PV210 || COMPILE_TEST [=y])
Selected by [y]:
- ARCH_EXYNOS [=y]
Always select the GPIOLIB from the Samsung pinctrl drivers to fix the
issue. This requires removing of OF_GPIO dependency (to avoid recursive
dependency), so add dependency on OF for COMPILE_TEST cases.
Michael Ellerman [Wed, 20 Apr 2022 14:16:57 +0000 (00:16 +1000)]
powerpc/time: Always set decrementer in timer_interrupt()
This is a partial revert of commit 0faf20a1ad16 ("powerpc/64s/interrupt:
Don't enable MSR[EE] in irq handlers unless perf is in use").
Prior to that commit, we always set the decrementer in
timer_interrupt(), to clear the timer interrupt. Otherwise we could end
up continuously taking timer interrupts.
When high res timers are enabled there is no problem seen with leaving
the decrementer untouched in timer_interrupt(), because it will be
programmed via hrtimer_interrupt() -> tick_program_event() ->
clockevents_program_event() -> decrementer_set_next_event().
However with CONFIG_HIGH_RES_TIMERS=n or booting with highres=off, we
see a stall/lockup, because tick_nohz_handler() does not cause a
reprogram of the decrementer, leading to endless timer interrupts.
Example trace:
Possibly this should be fixed in the lowres timing code, but that would
be a generic change and could take some time and may not backport
easily, so for now make the programming of the decrementer unconditional
again in timer_interrupt() to avoid the stall/lockup.
Ronnie Sahlberg [Thu, 21 Apr 2022 01:15:36 +0000 (11:15 +1000)]
cifs: destage any unwritten data to the server before calling copychunk_write
because the copychunk_write might cover a region of the file that has not yet
been sent to the server and thus fail.
A simple way to reproduce this is:
truncate -s 0 /mnt/testfile; strace -f -o x -ttT xfs_io -i -f -c 'pwrite 0k 128k' -c 'fcollapse 16k 24k' /mnt/testfile
the issue is that the 'pwrite 0k 128k' becomes rearranged on the wire with
the 'fcollapse 16k 24k' due to write-back caching.
fcollapse is implemented in cifs.ko as a SMB2 IOCTL(COPYCHUNK_WRITE) call
and it will fail serverside since the file is still 0b in size serverside
until the writes have been destaged.
To avoid this we must ensure that we destage any unwritten data to the
server before calling COPYCHUNK_WRITE.
Paulo Alcantara [Thu, 21 Apr 2022 00:05:45 +0000 (21:05 -0300)]
cifs: fix NULL ptr dereference in refresh_mounts()
Either mount(2) or automount might not have server->origin_fullpath
set yet while refresh_cache_worker() is attempting to refresh DFS
referrals. Add missing NULL check and locking around it.
drm/vmwgfx: Fix gem refcounting and memory evictions
v2: Add the last part of the ref count fix which was spotted by
Philipp Sieweck where the ref count of cpu writers is off due to
ERESTARTSYS or EBUSY during bo waits.
The initial GEM port broke refcounting on shareable (prime) surfaces and
memory evictions. The prime surfaces broke because the parent surfaces
weren't increasing the ref count on GEM surfaces, which meant that
the memory backing textures could have been deleted while the texture
was still accessible. The evictions broke due to a typo, the code was
supposed to exit if the passed buffers were not vmw_buffer_object
not if they were. They're tied because the evictions depend on having
memory to actually evict.
This fixes crashes with XA state tracker which is used for xrender
acceleration on xf86-video-vmware, apps/tests which use a lot of
memory (a good test being the piglit's streaming-texture-leak) and
desktops.
Damien Le Moal [Tue, 12 Apr 2022 08:41:37 +0000 (17:41 +0900)]
zonefs: Fix management of open zones
The mount option "explicit_open" manages the device open zone
resources to ensure that if an application opens a sequential file for
writing, the file zone can always be written by explicitly opening
the zone and accounting for that state with the s_open_zones counter.
However, if some zones are already open when mounting, the device open
zone resource usage status will be larger than the initial s_open_zones
value of 0. Ensure that this inconsistency does not happen by closing
any sequential zone that is open when mounting.
Furthermore, with ZNS drives, closing an explicitly open zone that has
not been written will change the zone state to "closed", that is, the
zone will remain in an active state. Since this can then cause failures
of explicit open operations on other zones if the drive active zone
resources are exceeded, we need to make sure that the zone is not
active anymore by resetting it instead of closing it. To address this,
zonefs_zone_mgmt() is modified to change a REQ_OP_ZONE_CLOSE request
into a REQ_OP_ZONE_RESET for sequential zones that have not been
written.
Damien Le Moal [Tue, 12 Apr 2022 11:52:35 +0000 (20:52 +0900)]
zonefs: Clear inode information flags on inode creation
Ensure that the i_flags field of struct zonefs_inode_info is cleared to
0 when initializing a zone file inode, avoiding seeing the flag
ZONEFS_ZONE_OPEN being incorrectly set.
If EINT_MTK is m and PINCTRL_MTK_V2 is y, build fails:
drivers/pinctrl/mediatek/pinctrl-moore.o: In function `mtk_gpio_set_config':
pinctrl-moore.c:(.text+0xa6c): undefined reference to `mtk_eint_set_debounce'
drivers/pinctrl/mediatek/pinctrl-moore.o: In function `mtk_gpio_to_irq':
pinctrl-moore.c:(.text+0xacc): undefined reference to `mtk_eint_find_irq'
Dave Chinner [Wed, 20 Apr 2022 22:45:16 +0000 (08:45 +1000)]
xfs: reorder iunlink remove operation in xfs_ifree
The O_TMPFILE creation implementation creates a specific order of
operations for inode allocation/freeing and unlinked list
modification. Currently both are serialised by the AGI, so the order
doesn't strictly matter as long as the are both in the same
transaction.
However, if we want to move the unlinked list insertions largely out
from under the AGI lock, then we have to be concerned about the
order in which we do unlinked list modification operations.
O_TMPFILE creation tells us this order is inode allocation/free,
then unlinked list modification.
Change xfs_ifree() to use this same ordering on unlinked list
removal. This way we always guarantee that when we enter the
iunlinked list removal code from this path, we already have the AGI
locked and we don't have to worry about lock nesting AGI reads
inside unlink list locks because it's already locked and attached to
the transaction.
We can do this safely as the inode freeing and unlinked list removal
are done in the same transaction and hence are atomic operations
with respect to log recovery.
Dave Chinner [Wed, 20 Apr 2022 22:44:59 +0000 (08:44 +1000)]
xfs: convert buffer flags to unsigned.
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned. This manifests as a compiler error such as:
/kisskb/src/fs/xfs/./xfs_trace.h:432:2: note: in expansion of macro 'TP_printk'
TP_printk("dev %d:%d daddr 0x%llx bbcount 0x%x hold %d pincount %d "
^
/kisskb/src/fs/xfs/./xfs_trace.h:440:5: note: in expansion of macro '__print_flags'
__print_flags(__entry->flags, "|", XFS_BUF_FLAGS),
^
/kisskb/src/fs/xfs/xfs_buf.h:67:4: note: in expansion of macro 'XBF_UNMAPPED'
{ XBF_UNMAPPED, "UNMAPPED" }
^
/kisskb/src/fs/xfs/./xfs_trace.h:440:40: note: in expansion of macro 'XFS_BUF_FLAGS'
__print_flags(__entry->flags, "|", XFS_BUF_FLAGS),
^
/kisskb/src/fs/xfs/./xfs_trace.h: In function 'trace_raw_output_xfs_buf_flags_class':
/kisskb/src/fs/xfs/xfs_buf.h:46:23: error: initializer element is not constant
#define XBF_UNMAPPED (1 << 31)/* do not map the buffer */
as __print_flags assigns XFS_BUF_FLAGS to a structure that uses an
unsigned long for the flag. Since this results in the value of
XBF_UNMAPPED causing a signed integer overflow, the result is
technically undefined behavior, which gcc-5 does not accept as an
integer constant.
Merge tag 'xtensa-20220416' of https://github.com/jcmvbkbc/linux-xtensa
Pull xtensa fixes from Max Filippov:
- fix patching CPU selection in patch_text
- fix potential deadlock in ISS platform serial driver
- fix potential register clobbering in coprocessor exception handler
* tag 'xtensa-20220416' of https://github.com/jcmvbkbc/linux-xtensa:
xtensa: fix a7 clobbering in coprocessor context load/store
arch: xtensa: platforms: Fix deadlock in rs_close()
xtensa: patch_text: Fixup last cpu should be master
Merge tag 'erofs-for-5.18-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
"One patch to fix a use-after-free race related to the on-stack
z_erofs_decompressqueue, which happens very rarely but needs to be
fixed properly soon.
The other patch fixes some sysfs Sphinx warnings"
* tag 'erofs-for-5.18-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
Documentation/ABI: sysfs-fs-erofs: Fix Sphinx errors
erofs: fix use-after-free of on-stack io[]
It turns out that making the pipe almost arbitrarily large has some
rather unexpected downsides. The kernel test robot reports a kernel
warning that is due to pipe->max_usage now growing to the point where
the iter_file_splice_write() buffer allocation can no longer be
satisfied as a slab allocation, and the
code sequence there will now always fail as a result.
That code could be modified to use kvcalloc() too, but I feel very
uncomfortable making those kinds of changes for a very niche use case
that really should have other options than make these kinds of
fundamental changes to pipe behavior.
Maybe the CRIU process dumping should be multi-threaded, and use
multiple pipes and multiple cores, rather than try to use one larger
pipe to minimize splice() calls.
x86: __memcpy_flushcache: fix wrong alignment if size > 2^32
The first "if" condition in __memcpy_flushcache is supposed to align the
"dest" variable to 8 bytes and copy data up to this alignment. However,
this condition may misbehave if "size" is greater than 4GiB.
The statement min_t(unsigned, size, ALIGN(dest, 8) - dest); casts both
arguments to unsigned int and selects the smaller one. However, the
cast truncates high bits in "size" and it results in misbehavior.
For example:
suppose that size == 0x100000001, dest == 0x200000002
min_t(unsigned, size, ALIGN(dest, 8) - dest) == min_t(0x1, 0xe) == 0x1;
...
dest += 0x1;
so we copy just one byte "and" dest remains unaligned.
This patch fixes the bug by replacing unsigned with size_t.
Jaegeuk Kim [Tue, 12 Apr 2022 22:01:58 +0000 (15:01 -0700)]
f2fs: keep io_flags to avoid IO split due to different op_flags in two fio holders
Let's attach io_flags to bio only, so that we can merge IOs given original
io_flags only.
Fixes: 64bf0eef0171 ("f2fs: pass the bio operation to bio_alloc_bioset") Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
Darren Hart [Mon, 11 Apr 2022 20:53:34 +0000 (13:53 -0700)]
topology: make core_mask include at least cluster_siblings
Ampere Altra defines CPU clusters in the ACPI PPTT. They share a Snoop
Control Unit, but have no shared CPU-side last level cache.
cpu_coregroup_mask() will return a cpumask with weight 1, while
cpu_clustergroup_mask() will return a cpumask with weight 2.
As a result, build_sched_domain() will BUG() once per CPU with:
BUG: arch topology borken
the CLS domain not a subset of the MC domain
The MC level cpumask is then extended to that of the CLS child, and is
later removed entirely as redundant. This sched domain topology is an
improvement over previous topologies, or those built without
SCHED_CLUSTER, particularly for certain latency sensitive workloads.
With the current scheduler model and heuristics, this is a desirable
default topology for Ampere Altra and Altra Max system.
Rather than create a custom sched domains topology structure and
introduce new logic in arch/arm64 to detect these systems, update the
core_mask so coregroup is never a subset of clustergroup, extending it
to cluster_siblings if necessary. Only do this if CONFIG_SCHED_CLUSTER
is enabled to avoid also changing the topology (MC) when
CONFIG_SCHED_CLUSTER is disabled.
This has the added benefit over a custom topology of working for both
symmetric and asymmetric topologies. It does not address systems where
the CLUSTER topology is above a populated MC topology, but these are not
considered today and can be addressed separately if and when they
appear.
The final sched domain topology for a 2 socket Ampere Altra system is
unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:
For CPU0:
CONFIG_SCHED_CLUSTER=y
CLS [0-1]
DIE [0-79]
NUMA [0-159]
CONFIG_SCHED_CLUSTER is not set
DIE [0-79]
NUMA [0-159]
Daniel Starke [Wed, 20 Apr 2022 10:13:44 +0000 (03:13 -0700)]
tty: n_gsm: fix missing update of modem controls after DLCI open
Currently the peer is not informed about the initial state of the modem
control lines after a new DLCI has been opened.
Fix this by sending the initial modem control line states after DLCI open.
selftests: mlxsw: vxlan_flooding_ipv6: Prevent flooding of unwanted packets
The test verifies that packets are correctly flooded by the bridge and
the VXLAN device by matching on the encapsulated packets at the other
end. However, if packets other than those generated by the test also
ingress the bridge (e.g., MLD packets), they will be flooded as well and
interfere with the expected count.
Make the test more robust by making sure that only the packets generated
by the test can ingress the bridge. Drop all the rest using tc filters
on the egress of 'br0' and 'h1'.
In the software data path, the problem can be solved by matching on the
inner destination MAC or dropping unwanted packets at the egress of the
VXLAN device, but this is not currently supported by mlxsw.
Fixes: d01724dd2a66 ("selftests: mlxsw: spectrum-2: Add a test for VxLAN flooding with IPv6") Signed-off-by: Ido Schimmel <[email protected]> Reviewed-by: Amit Cohen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
selftests: mlxsw: vxlan_flooding: Prevent flooding of unwanted packets
The test verifies that packets are correctly flooded by the bridge and
the VXLAN device by matching on the encapsulated packets at the other
end. However, if packets other than those generated by the test also
ingress the bridge (e.g., MLD packets), they will be flooded as well and
interfere with the expected count.
Make the test more robust by making sure that only the packets generated
by the test can ingress the bridge. Drop all the rest using tc filters
on the egress of 'br0' and 'h1'.
In the software data path, the problem can be solved by matching on the
inner destination MAC or dropping unwanted packets at the egress of the
VXLAN device, but this is not currently supported by mlxsw.
Fixes: 94d302deae25 ("selftests: mlxsw: Add a test for VxLAN flooding") Signed-off-by: Ido Schimmel <[email protected]> Reviewed-by: Amit Cohen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
ALSA: usb-audio: Clear MIDI port active flag after draining
When a rawmidi output stream is closed, it calls the drain at first,
then does trigger-off only when the drain returns -ERESTARTSYS as a
fallback. It implies that each driver should turn off the stream
properly after the drain. Meanwhile, USB-audio MIDI interface didn't
change the port->active flag after the drain. This may leave the
output work picking up the port that is closed right now, which
eventually leads to a use-after-free for the already released rawmidi
object.
This patch fixes the bug by properly clearing the port->active flag
after the output drain.
Dave Jiang [Mon, 11 Apr 2022 22:06:34 +0000 (15:06 -0700)]
dmaengine: idxd: skip clearing device context when device is read-only
If the device shows up as read-only configuration, skip the clearing of the
state as the context must be preserved for device re-enable after being
disabled.
Dave Jiang [Mon, 18 Apr 2022 21:33:21 +0000 (14:33 -0700)]
dmaengine: idxd: fix retry value to be constant for duration of function call
When retries is compared to wq->enqcmds_retries each loop of idxd_enqcmds(),
wq->enqcmds_retries can potentially changed by user. Assign the value
of retries to wq->enqcmds_retries during initialization so it is the
original value set when entering the function.
Dave Jiang [Mon, 18 Apr 2022 21:31:10 +0000 (14:31 -0700)]
dmaengine: idxd: match type for retries var in idxd_enqcmds()
wq->enqcmds_retries is defined as unsigned int. However, retries on the
stack is defined as int. Change retries to unsigned int to compare the same
type.
Kevin Hao [Tue, 19 Apr 2022 08:42:26 +0000 (16:42 +0800)]
net: stmmac: Use readl_poll_timeout_atomic() in atomic state
The init_systime() may be invoked in atomic state. We have observed the
following call trace when running "phc_ctl /dev/ptp0 set" on a Intel
Agilex board.
BUG: sleeping function called from invalid context at drivers/net/ethernet/stmicro/stmmac/stmmac_hwtstamp.c:74
in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 381, name: phc_ctl
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
Preemption disabled at:
[<ffff80000892ef78>] stmmac_set_time+0x34/0x8c
CPU: 2 PID: 381 Comm: phc_ctl Not tainted 5.18.0-rc2-next-20220414-yocto-standard+ #567
Hardware name: SoCFPGA Agilex SoCDK (DT)
Call trace:
dump_backtrace.part.0+0xc4/0xd0
show_stack+0x24/0x40
dump_stack_lvl+0x7c/0xa0
dump_stack+0x18/0x34
__might_resched+0x154/0x1c0
__might_sleep+0x58/0x90
init_systime+0x78/0x120
stmmac_set_time+0x64/0x8c
ptp_clock_settime+0x60/0x9c
pc_clock_settime+0x6c/0xc0
__arm64_sys_clock_settime+0x88/0xf0
invoke_syscall+0x5c/0x130
el0_svc_common.constprop.0+0x4c/0x100
do_el0_svc+0x7c/0xa0
el0_svc+0x58/0xcc
el0t_64_sync_handler+0xa4/0x130
el0t_64_sync+0x18c/0x190
So we should use readl_poll_timeout_atomic() here instead of
readl_poll_timeout().
Also adjust the delay time to 10us to fix a "__bad_udelay" build error
reported by "kernel test robot <[email protected]>". I have tested this on
Intel Agilex and NXP S32G boards, there is no delay needed at all.
So the 10us delay should be long enough for most cases.
Fixes: ff8ed737860e ("net: stmmac: use readl_poll_timeout() function in init_systime()") Signed-off-by: Kevin Hao <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Nicolas Dichtel [Wed, 13 Apr 2022 14:00:00 +0000 (16:00 +0200)]
doc/ip-sysctl: add bc_forwarding
Let's describe this sysctl.
Fixes: 5cbf777cfdf6 ("route: add support for directed broadcast forwarding") Signed-off-by: Nicolas Dichtel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
phy: amlogic: fix error path in phy_g12a_usb3_pcie_probe()
If clk_prepare_enable() fails we call clk_disable_unprepare()
in the error path what results in a warning that the clock
is disabled and unprepared already.
And if we fail later in phy_g12a_usb3_pcie_probe() then we
bail out w/o calling clk_disable_unprepare().
This patch fixes both errors.