Taehee Yoo [Thu, 12 May 2022 05:47:09 +0000 (05:47 +0000)]
net: sfc: ef10: fix memory leak in efx_ef10_mtd_probe()
In the NIC ->probe() callback, ->mtd_probe() callback is called.
If NIC has 2 ports, ->probe() is called twice and ->mtd_probe() too.
In the ->mtd_probe(), which is efx_ef10_mtd_probe() it allocates and
initializes mtd partiion.
But mtd partition for sfc is shared data.
So that allocated mtd partition data from last called
efx_ef10_mtd_probe() will not be used.
Therefore it must be freed.
But it doesn't free a not used mtd partition data in efx_ef10_mtd_probe().
Guangguan Wang [Thu, 12 May 2022 03:08:20 +0000 (11:08 +0800)]
net/smc: non blocking recvmsg() return -EAGAIN when no data and signal_pending
Non blocking sendmsg will return -EAGAIN when any signal pending
and no send space left, while non blocking recvmsg return -EINTR
when signal pending and no data received. This may makes confused.
As TCP returns -EAGAIN in the conditions described above. Align the
behavior of smc with TCP.
Florian Fainelli [Thu, 12 May 2022 02:17:31 +0000 (19:17 -0700)]
net: dsa: bcm_sf2: Fix Wake-on-LAN with mac_link_down()
After commit 2d1f90f9ba83 ("net: dsa/bcm_sf2: fix incorrect usage of
state->link") the interface suspend path would call our mac_link_down()
call back which would forcibly set the link down, thus preventing
Wake-on-LAN packets from reaching our management port.
Fix this by looking at whether the port is enabled for Wake-on-LAN and
not clearing the link status in that case to let packets go through.
Chunfeng Yun [Thu, 12 May 2022 06:49:31 +0000 (14:49 +0800)]
usb: xhci-mtk: remove bandwidth budget table
The bandwidth budget table is introduced to trace ideal bandwidth used
by each INT/ISOC endpoint, but in fact the endpoint may consume more
bandwidth and cause data transfer error, so it's better to leave some
margin. Obviously it's difficult to find the best margin for all cases,
instead take use of the worst-case scenario.
Chunfeng Yun [Thu, 12 May 2022 06:49:30 +0000 (14:49 +0800)]
usb: xhci-mtk: fix fs isoc's transfer error
Due to the scheduler allocates the optimal bandwidth for FS ISOC endpoints,
this may be not enough actually and causes data transfer error, so come up
with an estimate that is no less than the worst case bandwidth used for
any one mframe, but may be an over-estimate.
ChiYuan Huang [Tue, 10 May 2022 05:13:00 +0000 (13:13 +0800)]
usb: typec: tcpci_mt6360: Update for BMC PHY setting
Update MT6360 BMC PHY Tx/Rx setting for the compatibility.
Macpaul reported this CtoDP cable attention message cannot be received from
MT6360 TCPC. But actually, attention message really sent from UFP_D
device.
After RD's comment, there may be BMC PHY Tx/Rx setting causes this issue.
Kai Huang [Tue, 19 Apr 2022 11:17:04 +0000 (23:17 +1200)]
KVM: VMX: Include MKTME KeyID bits in shadow_zero_check
Intel MKTME KeyID bits (including Intel TDX private KeyID bits) should
never be set to SPTE. Set shadow_me_value to 0 and shadow_me_mask to
include all MKTME KeyID bits to include them to shadow_zero_check.
Kai Huang [Tue, 19 Apr 2022 11:17:03 +0000 (23:17 +1200)]
KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask
Intel Multi-Key Total Memory Encryption (MKTME) repurposes couple of
high bits of physical address bits as 'KeyID' bits. Intel Trust Domain
Extentions (TDX) further steals part of MKTME KeyID bits as TDX private
KeyID bits. TDX private KeyID bits cannot be set in any mapping in the
host kernel since they can only be accessed by software running inside a
new CPU isolated mode. And unlike to AMD's SME, host kernel doesn't set
any legacy MKTME KeyID bits to any mapping either. Therefore, it's not
legitimate for KVM to set any KeyID bits in SPTE which maps guest
memory.
KVM maintains shadow_zero_check bits to represent which bits must be
zero for SPTE which maps guest memory. MKTME KeyID bits should be set
to shadow_zero_check. Currently, shadow_me_mask is used by AMD to set
the sme_me_mask to SPTE, and shadow_me_shadow is excluded from
shadow_zero_check. So initializing shadow_me_mask to represent all
MKTME keyID bits doesn't work for VMX (as oppositely, they must be set
to shadow_zero_check).
Introduce a new 'shadow_me_value' to replace existing shadow_me_mask,
and repurpose shadow_me_mask as 'all possible memory encryption bits'.
The new schematic of them will be:
- shadow_me_value: the memory encryption bit(s) that will be set to the
SPTE (the original shadow_me_mask).
- shadow_me_mask: all possible memory encryption bits (which is a super
set of shadow_me_value).
- For now, shadow_me_value is supposed to be set by SVM and VMX
respectively, and it is a constant during KVM's life time. This
perhaps doesn't fit MKTME but for now host kernel doesn't support it
(and perhaps will never do).
- Bits in shadow_me_mask are set to shadow_zero_check, except the bits
in shadow_me_value.
Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them.
Replace shadow_me_mask with shadow_me_value in almost all code paths,
except the one in PT64_PERM_MASK, which is used by need_remote_flush()
to determine whether remote TLB flush is needed. This should still use
shadow_me_mask as any encryption bit change should need a TLB flush.
And for AMD, move initializing shadow_me_value/shadow_me_mask from
kvm_mmu_reset_all_pte_masks() to svm_hardware_setup().
Kai Huang [Tue, 19 Apr 2022 11:17:02 +0000 (23:17 +1200)]
KVM: x86/mmu: Rename reset_rsvds_bits_mask()
Rename reset_rsvds_bits_mask() to reset_guest_rsvds_bits_mask() to make
it clearer that it resets the reserved bits check for guest's page table
entries.
KVM: x86/mmu: Expand and clean up page fault stats
Expand and clean up the page fault stats. The current stats are at best
incomplete, and at worst misleading. Differentiate between faults that
are actually fixed vs those that result in an MMIO SPTE being created,
track faults that are spurious, faults that trigger emulation, faults
that that are fixed in the fast path, and last but not least, track the
number of faults that are taken.
Note, the number of faults that require emulation for write-protected
shadow pages can roughly be calculated by subtracting the number of MMIO
SPTEs created from the overall number of faults that trigger emulation.
KVM: x86/mmu: Use IS_ENABLED() to avoid RETPOLINE for TDP page faults
Use IS_ENABLED() instead of an #ifdef to activate the anti-RETPOLINE fast
path for TDP page faults. The generated code is identical, and the #ifdef
makes it dangerously difficult to extend the logic (guess who forgot to
add an "else" inside the #ifdef and ran through the page fault handler
twice).
KVM: x86/mmu: Make all page fault handlers internal to the MMU
Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all
of the page fault handling collateral that was in mmu.h purely for the
async #PF handler into mmu_internal.h, where it belongs. This will allow
kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to
expose those enums outside of the MMU.
KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns"
Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and
kvm_faultin_pfn() to signal that the page fault handler should continue
doing its thing. Aside from being gross and inefficient, using a boolean
return to signal continue vs. stop makes it extremely difficult to add
more helpers and/or move existing code to a helper.
E.g. hypothetically, if nested MMUs were to gain a separate page fault
handler in the future, everything up to the "is self-modifying PTE" check
can be shared by all shadow MMUs, but communicating up the stack whether
to continue on or stop becomes a nightmare.
More concretely, proposed support for private guest memory ran into a
similar issue, where it'll be forced to forego a helper in order to yield
sane code: https://lore.kernel.org/all/YkJbxiL%[email protected].
KVM: x86/mmu: Drop exec/NX check from "page fault can be fast"
Tweak the "page fault can be fast" logic to explicitly check for !PRESENT
faults in the access tracking case, and drop the exec/NX check that
becomes redundant as a result. No sane hardware will generate an access
that is both an instruct fetch and a write, i.e. it's a waste of cycles.
If hardware goes off the rails, or KVM runs under a misguided hypervisor,
spuriously running throught fast path is benign (KVM has been uknowingly
being doing exactly that for years).
KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use
Check for A/D bits being disabled instead of the access tracking mask
being non-zero when deciding whether or not to attempt to fix a page
fault vian the fast path. Originally, the access tracking mask was
non-zero if and only if A/D bits were disabled by _KVM_ (including not
being supported by hardware), but that hasn't been true since nVMX was
fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause
KVM to not use A/D bits while running L2 despite KVM using them while
running L1.
In other words, don't attempt the fast path just because EPT is enabled.
Note, attempting the fast path for all !PRESENT faults can "fix" a very,
_VERY_ tiny percentage of faults out of mmu_lock by detecting that the
fault is spurious, i.e. has been fixed by a different vCPU, but again the
odds of that happening are vanishingly small. E.g. booting an 8-vCPU VM
gets less than 10 successes out of 30k+ faults, and that's likely one of
the more favorable scenarios. Disabling dirty logging can likely lead to
a rash of collisions between vCPUs for some workloads that operate on a
common set of pages, but penalizing _all_ !PRESENT faults for that one
case is unlikely to be a net positive, not to mention that that problem
is best solved by not zapping in the first place.
The number of spurious faults does scale with the number of vCPUs, e.g. a
255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast
path (again out of 30k), but that's all of 0.2% of faults. Using legacy
shadow paging does get more spurious faults, and a few more detected out
of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring
faults that are reflected into the guest), i.e. the extra detections are
purely due to the sheer number of faults observed.
On the other hand, getting a "negative" in the fast path takes in the
neighborhood of 150-250 cycles. So while it is tempting to keep/extend
the current behavior, such a change needs to come with hard numbers
showing that it's actually a win in the grand scheme, or any scheme for
that matter.
Li RongQing [Wed, 6 Apr 2022 11:25:02 +0000 (19:25 +0800)]
KVM: VMX: clean up pi_wakeup_handler
Passing per_cpu() to list_for_each_entry() causes the macro to be
evaluated N+1 times for N sleeping vCPUs. This is a very small
inefficiency, and the code is cleaner if the address of the per-CPU
variable is loaded earlier. Do this for both the list and the spinlock.
Maxim Levitsky [Thu, 12 May 2022 10:14:20 +0000 (13:14 +0300)]
KVM: x86: fix typo in __try_cmpxchg_user causing non-atomicness
This shows up as a TDP MMU leak when running nested. Non-working cmpxchg on L0
relies makes L1 install two different shadow pages under same spte, and one of
them is leaked.
Shreyas K K [Thu, 12 May 2022 11:01:34 +0000 (16:31 +0530)]
arm64: Enable repeat tlbi workaround on KRYO4XX gold CPUs
Add KRYO4XX gold/big cores to the list of CPUs that need the
repeat TLBI workaround. Apply this to the affected
KRYO4XX cores (rcpe to rfpe).
The variant and revision bits are implementation defined and are
different from the their Cortex CPU counterparts on which they are
based on, i.e., (r0p0 to r3p0) is equivalent to (rcpe to rfpe).
Amit Cohen [Wed, 11 May 2022 11:57:47 +0000 (14:57 +0300)]
mlxsw: Avoid warning during ip6gre device removal
IPv6 addresses which are used for tunnels are stored in a hash table
with reference counting. When a new GRE tunnel is configured, the driver
is notified and configures it in hardware.
Currently, any change in the tunnel is not applied in the driver. It
means that if the remote address is changed, the driver is not aware of
this change and the first address will be used.
This behavior results in a warning [1] in scenarios such as the
following:
# ip link add name gre1 type ip6gre local 2000::3 remote 2000::fffe tos inherit ttl inherit
# ip link set name gre1 type ip6gre local 2000::3 remote 2000::ffff ttl inherit
# ip link delete gre1
The change of the address is not applied in the driver. Currently, the
driver uses the remote address which is stored in the 'parms' of the
overlay device. When the tunnel is removed, the new IPv6 address is
used, the driver tries to release it, but as it is not aware of the
change, this address is not configured and it warns about releasing non
existing IPv6 address.
Fix it by using the IPv6 address which is cached in the IPIP entry, this
address is the last one that the driver used, so even in cases such the
above, the first address will be released, without any warning.
The ID register table should have one entry per ID register but
currently has two entries for ID_AA64ISAR2_EL1. Only one entry has an
override, and get_arm64_ftr_reg() can end up choosing the other, causing
the override to be ignored. Fix this by removing the duplicate entry.
While here, also make the check in sort_ftr_regs() more strict so that
duplicate entries can't be added in the future.
Hui Tang [Tue, 10 May 2022 13:51:48 +0000 (21:51 +0800)]
drm/vc4: hdmi: Fix build error for implicit function declaration
drivers/gpu/drm/vc4/vc4_hdmi.c: In function ‘vc4_hdmi_connector_detect’:
drivers/gpu/drm/vc4/vc4_hdmi.c:228:7: error: implicit declaration of function ‘gpiod_get_value_cansleep’; did you mean ‘gpio_get_value_cansleep’? [-Werror=implicit-function-declaration]
if (gpiod_get_value_cansleep(vc4_hdmi->hpd_gpio))
^~~~~~~~~~~~~~~~~~~~~~~~
gpio_get_value_cansleep
CC [M] drivers/gpu/drm/vc4/vc4_validate.o
CC [M] drivers/gpu/drm/vc4/vc4_v3d.o
CC [M] drivers/gpu/drm/vc4/vc4_validate_shaders.o
CC [M] drivers/gpu/drm/vc4/vc4_debugfs.o
drivers/gpu/drm/vc4/vc4_hdmi.c: In function ‘vc4_hdmi_bind’:
drivers/gpu/drm/vc4/vc4_hdmi.c:2883:23: error: implicit declaration of function ‘devm_gpiod_get_optional’; did you mean ‘devm_clk_get_optional’? [-Werror=implicit-function-declaration]
vc4_hdmi->hpd_gpio = devm_gpiod_get_optional(dev, "hpd", GPIOD_IN);
^~~~~~~~~~~~~~~~~~~~~~~
devm_clk_get_optional
drivers/gpu/drm/vc4/vc4_hdmi.c:2883:59: error: ‘GPIOD_IN’ undeclared (first use in this function); did you mean ‘GPIOF_IN’?
vc4_hdmi->hpd_gpio = devm_gpiod_get_optional(dev, "hpd", GPIOD_IN);
^~~~~~~~
GPIOF_IN
drivers/gpu/drm/vc4/vc4_hdmi.c:2883:59: note: each undeclared identifier is reported only once for each function it appears in
cc1: all warnings being treated as errors
Florian Fainelli [Wed, 11 May 2022 03:17:51 +0000 (20:17 -0700)]
net: bcmgenet: Check for Wake-on-LAN interrupt probe deferral
The interrupt controller supplying the Wake-on-LAN interrupt line maybe
modular on some platforms (irq-bcm7038-l1.c) and might be probed at a
later time than the GENET driver. We need to specifically check for
-EPROBE_DEFER and propagate that error to ensure that we eventually
fetch the interrupt descriptor.
Jakub Kicinski [Thu, 12 May 2022 00:40:39 +0000 (17:40 -0700)]
Merge tag 'for-net-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- Fix the creation of hdev->name when index is greater than 9999
* tag 'for-net-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: Fix the creation of hdev->name
====================
Jakub Kicinski [Thu, 12 May 2022 00:33:01 +0000 (17:33 -0700)]
Merge tag 'wireless-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Kalle Valo says:
====================
wireless fixes for v5.18
Second set of fixes for v5.18 and hopefully the last one. We have a
new iwlwifi maintainer, a fix to rfkill ioctl interface and important
fixes to both stack and two drivers.
* tag 'wireless-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
rfkill: uapi: fix RFKILL_IOCTL_MAX_SIZE ioctl request definition
nl80211: fix locking in nl80211_set_tx_bitrate_mask()
mac80211_hwsim: call ieee80211_tx_prepare_skb under RCU protection
mac80211_hwsim: fix RCU protected chanctx access
mailmap: update Kalle Valo's email
mac80211: Reset MBSSID parameters upon connection
cfg80211: retrieve S1G operating channel number
nl80211: validate S1G channel width
mac80211: fix rx reordering with non explicit / psmp ack policy
ath11k: reduce the wait time of 11d scan and hw scan while add interface
MAINTAINERS: update iwlwifi driver maintainer
iwlwifi: iwl-dbg: Use del_timer_sync() before freeing
====================
Itay Iellin [Sat, 7 May 2022 12:32:48 +0000 (08:32 -0400)]
Bluetooth: Fix the creation of hdev->name
Set a size limit of 8 bytes of the written buffer to "hdev->name"
including the terminating null byte, as the size of "hdev->name" is 8
bytes. If an id value which is greater than 9999 is allocated,
then the "snprintf(hdev->name, sizeof(hdev->name), "hci%d", id)"
function call would lead to a truncation of the id value in decimal
notation.
Set an explicit maximum id parameter in the id allocation function call.
The id allocation function defines the maximum allocated id value as the
maximum id parameter value minus one. Therefore, HCI_MAX_ID is defined
as 10000.
Delyan Kratunov [Wed, 11 May 2022 18:28:36 +0000 (18:28 +0000)]
sched/tracing: Append prev_state to tp args instead
Commit fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting
sched_switch event, 2022-01-20) added a new prev_state argument to the
sched_switch tracepoint, before the prev task_struct pointer.
This reordering of arguments broke BPF programs that use the raw
tracepoint (e.g. tp_btf programs). The type of the second argument has
changed and existing programs that assume a task_struct* argument
(e.g. for bpf_task_storage access) will now fail to verify.
If we instead append the new argument to the end, all existing programs
would continue to work and can conditionally extract the prev_state
argument on supported kernel versions.
Xiaomeng Tong [Tue, 10 May 2022 20:48:46 +0000 (13:48 -0700)]
i40e: i40e_main: fix a missing check on list iterator
The bug is here:
ret = i40e_add_macvlan_filter(hw, ch->seid, vdev->dev_addr, &aq_err);
The list iterator 'ch' will point to a bogus position containing
HEAD if the list is empty or no element is found. This case must
be checked before any use of the iterator, otherwise it will
lead to a invalid memory access.
To fix this bug, use a new variable 'iter' as the list iterator,
while use the origin variable 'ch' as a dedicated pointer to
point to the found element.
Paolo Abeni [Tue, 10 May 2022 14:57:34 +0000 (16:57 +0200)]
net/sched: act_pedit: really ensure the skb is writable
Currently pedit tries to ensure that the accessed skb offset
is writable via skb_unclone(). The action potentially allows
touching any skb bytes, so it may end-up modifying shared data.
The above causes some sporadic MPTCP self-test failures, due to
this code:
The above modifies a data byte outside the skb head and the skb is
a cloned one, carrying a TCP output packet.
This change addresses the issue by keeping track of a rough
over-estimate highest skb offset accessed by the action and ensuring
such offset is really writable.
Note that this may cause performance regressions in some scenarios,
but hopefully pedit is not in the critical path.
Alex Deucher [Tue, 10 May 2022 14:32:26 +0000 (10:32 -0400)]
drm/amdgpu/ctx: only reset stable pstate if the user changed it (v2)
Check if the requested stable pstate matches the current one before
changing it. This avoids changing the stable pstate on context
destroy if the user never changed it in the first place via the
IOCTL.
v2: compare the current and requested rather than setting a flag (Lijo)
Commit ebc002e3ee78 ("drm/amdgpu: don't use BACO for reset in S3")
stops using BACO for reset during suspend, so it's no longer
necessary to leave BACO enabled during suspend. This fixes
resume from suspend on the navy flounder dGPU in the ASUS ROG
Strix G513QY.
Alexander Graf [Tue, 10 May 2022 12:37:17 +0000 (14:37 +0200)]
KVM: PPC: Book3S PR: Enable MSR_DR for switch_mmu_context()
Commit 863771a28e27 ("powerpc/32s: Convert switch_mmu_context() to C")
moved the switch_mmu_context() to C. While in principle a good idea, it
meant that the function now uses the stack. The stack is not accessible
from real mode though.
So to keep calling the function, let's turn on MSR_DR while we call it.
That way, all pointer references to the stack are handled virtually.
In addition, make sure to save/restore r12 on the stack, as it may get
clobbered by the C function.
David S. Miller [Wed, 11 May 2022 11:31:01 +0000 (12:31 +0100)]
Merge branch 's390-net-fixes'
Alexandra Winter says:
====================
s390/net: Cleanup some code checker findings
clean up smatch findings in legacy code. I was not able to provoke
any real failures on my systems, but other hardware reactions,
timing conditions or compiler output, may cause failures.
There are still 2 smatch warnings left in s390/net:
drivers/s390/net/ctcm_main.c:1326 add_channel() warn: missing error code 'rc'
This one is a false positive.
drivers/s390/net/netiucv.c:1355 netiucv_check_user() warn: argument 3 to %02x specifier has type 'char'
Postponing this one, need to better understand string handling in iucv.
There are several sparse warnings left in ctcm, like:
drivers/s390/net/ctcm_fsms.c:573:9: warning: context imbalance in 'ctcm_chx_setmode' - different lock contexts for basic block
Those are mentioned in the source, no plan to rework.
====================
Alexandra Winter [Tue, 10 May 2022 07:05:08 +0000 (09:05 +0200)]
s390/lcs: fix variable dereferenced before check
smatch complains about
drivers/s390/net/lcs.c:1741 lcs_get_control() warn: variable dereferenced before check 'card->dev' (see line 1739)
Fixes: 27eb5ac8f015 ("[PATCH] s390: lcs driver bug fixes and improvements [1/2]") Signed-off-by: Alexandra Winter <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Alexandra Winter [Tue, 10 May 2022 07:05:07 +0000 (09:05 +0200)]
s390/ctcm: fix potential memory leak
smatch complains about
drivers/s390/net/ctcm_mpc.c:1210 ctcmpc_unpack_skb() warn: possible memory leak of 'mpcginfo'
mpc_action_discontact() did not free mpcginfo. Consolidate the freeing in
ctcmpc_unpack_skb().
Fixes: 293d984f0e36 ("ctcm: infrastructure for replaced ctc driver") Signed-off-by: Alexandra Winter <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Alexandra Winter [Tue, 10 May 2022 07:05:06 +0000 (09:05 +0200)]
s390/ctcm: fix variable dereferenced before check
Found by cppcheck and smatch.
smatch complains about
drivers/s390/net/ctcm_sysfs.c:43 ctcm_buffer_write() warn: variable dereferenced before check 'priv' (see line 42)
Fixes: 3c09e2647b5e ("ctcm: rename READ/WRITE defines to avoid redefinitions") Reported-by: Colin Ian King <[email protected]> Signed-off-by: Alexandra Winter <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Wed, 11 May 2022 11:25:07 +0000 (12:25 +0100)]
Merge branch 'atlantic-fixes'
Grant Grundler says:
====================
net: atlantic: more fuzzing fixes
It essentially describes four problems:
1) validate rxd_wb->next_desc_ptr before populating buff->next
2) "frag[0] not initialized" case in aq_ring_rx_clean()
3) limit iterations handling fragments in aq_ring_rx_clean()
4) validate hw_head_ in hw_atl_b0_hw_ring_tx_head_update()
(1) was fixed by Zekun Shen <[email protected]> around the same time with
"atlantic: Fix buff_ring OOB in aq_ring_rx_clean" (SHA1 5f50153288452e10).
I've added one "clean up" contribution:
"net: atlantic: reduce scope of is_rsc_complete"
I tested the "original" patches using chromeos-v5.4 kernel branch:
https://chromium-review.googlesource.com/q/hashtag:pcinet-atlantic-2022q1+(status:open%20OR%20status:merged)
I've forward ported those patches to 5.18-rc2 and compiled them but am
unable to test them on 5.18-rc2 kernel (logistics problems).
Credit largely goes to ChromeOS Fuzzing team members:
Aashay Shringarpure, Yi Chou, Shervin Oloumi
V2 changes:
o drop first patch - was already fixed upstream differently
o reduce (4) "validate hw_head_" to simple bounds checking.
====================
Grant Grundler [Tue, 10 May 2022 02:28:23 +0000 (19:28 -0700)]
net: atlantic: fix "frag[0] not initialized"
In aq_ring_rx_clean(), if buff->is_eop is not set AND
buff->len < AQ_CFG_RX_HDR_SIZE, then hdr_len remains equal to
buff->len and skb_add_rx_frag(xxx, *0*, ...) is not called.
The loop following this code starts calling skb_add_rx_frag() starting
with i=1 and thus frag[0] is never initialized. Since i is initialized
to zero at the top of the primary loop, we can just reference and
post-increment i instead of hardcoding the 0 when calling
skb_add_rx_frag() the first time.
James Smart [Fri, 6 May 2022 20:55:48 +0000 (13:55 -0700)]
scsi: lpfc: Correct BDE DMA address assignment for GEN_REQ_WQE
Garbage FCoE CT frames are transmitted on the wire because of bad DMA ptr
addresses filled in the GEN_REQ_WQE.
The __lpfc_sli_prep_gen_req_s4() routine is using the wrong buffer for the
payload address. Change the DMA buffer assignment from the bmp buffer to
the bpl buffer.
James Smart [Fri, 6 May 2022 20:55:28 +0000 (13:55 -0700)]
scsi: lpfc: Fix split code for FLOGI on FCoE
The refactoring code converted context information from SLI-3 to SLI-4.
The conversion for the SLI-4 bit field tried to use the old (hacky) SLI3
high/low bit settings. Needless to say, it was incorrect.
Explicitly set the context field to type FCFI and set it in the wqe.
SLI-4 is now a proper bit field so no need for the shifting/anding.
Lukas Wunner [Tue, 10 May 2022 07:56:05 +0000 (09:56 +0200)]
genirq: Remove WARN_ON_ONCE() in generic_handle_domain_irq()
Since commit 0953fb263714 ("irq: remove handle_domain_{irq,nmi}()"),
generic_handle_domain_irq() warns if called outside hardirq context, even
though the function calls down to handle_irq_desc(), which warns about the
same, but conditionally on handle_enforce_irqctx().
The newly added warning is a false positive if the interrupt originates
from any other irqchip than x86 APIC or ARM GIC/GICv3. Those are the only
ones for which handle_enforce_irqctx() returns true. Per commit c16816acd086 ("genirq: Add protection against unsafe usage of
generic_handle_irq()"):
"In general calling generic_handle_irq() with interrupts disabled from non
interrupt context is harmless. For some interrupt controllers like the
x86 trainwrecks this is outright dangerous as it might corrupt state if
an interrupt affinity change is pending."
Examples for interrupt chips where the warning is a false positive are
USB-attached GPIO controllers such as drivers/gpio/gpio-dln2.c:
USB gadgets are incapable of directly signaling an interrupt because they
cannot initiate a bus transaction by themselves. All communication on
the bus is initiated by the host controller, which polls a gadget's
Interrupt Endpoint in regular intervals. If an interrupt is pending,
that information is passed up the stack in softirq context, from which a
hardirq is synthesized via generic_handle_domain_irq().
Remove the warning to eliminate such false positives.
Mathieu Poirier [Mon, 9 May 2022 16:35:10 +0000 (10:35 -0600)]
MAINTAINERS: Add James and Mike as Arm64 performance events reviewers
James, Mike and Leo have been doing all the reviews and development work
for the Coresight perf tools for a couple of years now. As such remove
my name and add James and Mike as official reviewers (Leo is already
listed as such).
Jan Kara [Tue, 10 May 2022 10:36:04 +0000 (12:36 +0200)]
udf: Avoid using stale lengthOfImpUse
udf_write_fi() uses lengthOfImpUse of the entry it is writing to.
However this field has not yet been initialized so it either contains
completely bogus value or value from last directory entry at that place.
In either case this is wrong and can lead to filesystem corruption or
kernel crashes.
Joey Gouly [Tue, 10 May 2022 10:27:21 +0000 (11:27 +0100)]
arm64: vdso: fix makefile dependency on vdso.so
There is currently no dependency for vdso*-wrap.S on vdso*.so, which means that
you can get a build that uses a stale vdso*-wrap.o.
In commit a5b8ca97fbf8, the file that includes the vdso.so was moved and renamed
from arch/arm64/kernel/vdso/vdso.S to arch/arm64/kernel/vdso-wrap.S, when this
happened the Makefile was not updated to force the dependcy on vdso.so.
Jing Xia [Tue, 10 May 2022 02:35:14 +0000 (10:35 +0800)]
writeback: Avoid skipping inode writeback
We have run into an issue that a task gets stuck in
balance_dirty_pages_ratelimited() when perform I/O stress testing.
The reason we observed is that an I_DIRTY_PAGES inode with lots
of dirty pages is in b_dirty_time list and standard background
writeback cannot writeback the inode.
After studing the relevant code, the following scenario may lead
to the issue:
This patch moves the dirty inode to b_dirty list when the inode
currently is not queued in b_io or b_more_io list at the end of
writeback_single_inode.
dma-buf: call dma_buf_stats_setup after dmabuf is in valid list
When dma_buf_stats_setup() fails, it closes the dmabuf file which
results into the calling of dma_buf_file_release() where it does
list_del(&dmabuf->list_node) with out first adding it to the proper
list. This is resulting into panic in the below path:
__list_del_entry_valid+0x38/0xac
dma_buf_file_release+0x74/0x158
__fput+0xf4/0x428
____fput+0x14/0x24
task_work_run+0x178/0x24c
do_notify_resume+0x194/0x264
work_pending+0xc/0x5f0
Fix it by moving the dma_buf_stats_setup() after dmabuf is added to the
list.
Manuel Ullmann [Wed, 4 May 2022 19:30:44 +0000 (21:30 +0200)]
net: atlantic: always deep reset on pm op, fixing up my null deref regression
The impact of this regression is the same for resume that I saw on
thaw: the kernel hangs and nothing except SysRq rebooting can be done.
Fixes regression in commit cbe6c3a8f8f4 ("net: atlantic: invert deep
par in pm functions, preventing null derefs"), where I disabled deep
pm resets in suspend and resume, trying to make sense of the
atl_resume_common() deep parameter in the first place.
It turns out, that atlantic always has to deep reset on pm
operations. Even though I expected that and tested resume, I screwed
up by kexec-rebooting into an unpatched kernel, thus missing the
breakage.
This fixup obsoletes the deep parameter of atl_resume_common, but I
leave the cleanup for the maintainers to post to mainline.
Suspend and hibernation were successfully tested by the reporters.
Xiubo Li [Thu, 5 May 2022 10:53:09 +0000 (18:53 +0800)]
ceph: check folio PG_private bit instead of folio->private
The pages in the file mapping maybe reclaimed and reused by other
subsystems and the page->private maybe used as flags field or
something else, if later that pages are used by page caches again
the page->private maybe not cleared as expected.
Here will check the PG_private bit instead of the folio->private.
Jeff Layton [Mon, 25 Apr 2022 19:54:27 +0000 (15:54 -0400)]
ceph: fix setting of xattrs on async created inodes
Currently when we create a file, we spin up an xattr buffer to send
along with the create request. If we end up doing an async create
however, then we currently pass down a zero-length xattr buffer.
Fix the code to send down the xattr buffer in req->r_pagelist. If the
xattrs span more than a page, however give up and don't try to do an
async create.
Jakub Kicinski [Tue, 10 May 2022 01:16:46 +0000 (18:16 -0700)]
Merge tag 'batadv-net-pullrequest-20220508' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here is a batman-adv bugfix:
- Don't skb_split skbuffs with frag_list, by Sven Eckelmann
* tag 'batadv-net-pullrequest-20220508' of git://git.open-mesh.org/linux-merge:
batman-adv: Don't skb_split skbuffs with frag_list
====================
Vladimir Oltean [Sat, 7 May 2022 13:45:50 +0000 (16:45 +0300)]
net: dsa: flush switchdev workqueue on bridge join error path
There is a race between switchdev_bridge_port_offload() and the
dsa_port_switchdev_sync_attrs() call right below it.
When switchdev_bridge_port_offload() finishes, FDB entries have been
replayed by the bridge, but are scheduled for deferred execution later.
However dsa_port_switchdev_sync_attrs -> dsa_port_can_apply_vlan_filtering()
may impose restrictions on the vlan_filtering attribute and refuse
offloading.
When this happens, the delayed FDB entries will dereference dp->bridge,
which is a NULL pointer because we have stopped the process of
offloading this bridge.
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
Workqueue: dsa_ordered dsa_slave_switchdev_event_work
pc : dsa_port_bridge_host_fdb_del+0x64/0x100
lr : dsa_slave_switchdev_event_work+0x130/0x1bc
Call trace:
dsa_port_bridge_host_fdb_del+0x64/0x100
dsa_slave_switchdev_event_work+0x130/0x1bc
process_one_work+0x294/0x670
worker_thread+0x80/0x460
---[ end trace 0000000000000000 ]---
Error: dsa_core: Must first remove VLAN uppers having VIDs also present in bridge.
Fix the bug by doing what we do on the normal bridge leave path as well,
which is to wait until the deferred FDB entries complete executing, then
exit.
The placement of dsa_flush_workqueue() after switchdev_bridge_port_unoffload()
guarantees that both the FDB additions and deletions on rollback are waited for.
Jakub Kicinski [Tue, 10 May 2022 00:54:26 +0000 (17:54 -0700)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2022-05-06
This series contains updates to ice driver only.
Ivan Vecera fixes a race with aux plug/unplug by delaying setting adev
until initialization is complete and adding locking.
Anatolii ensures VF queues are completely disabled before attempting to
reconfigure them.
Michal ensures stale Tx timestamps are cleared from hardware.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
ice: fix PTP stale Tx timestamps cleanup
ice: clear stale Tx queue settings before configuring
ice: Fix race during aux device (un)plugging
====================
net: phy: Fix race condition on link status change
This fixes the following error caused by a race condition between
phydev->adjust_link() and a MDIO transaction in the phy interrupt
handler. The issue was reproduced with the ethernet FEC driver and a
micrel KSZ9031 phy.
[ 146.195696] fec 2188000.ethernet eth0: MDIO read timeout
[ 146.201779] ------------[ cut here ]------------
[ 146.206671] WARNING: CPU: 0 PID: 571 at drivers/net/phy/phy.c:942 phy_error+0x24/0x6c
[ 146.214744] Modules linked in: bnep imx_vdoa imx_sdma evbug
[ 146.220640] CPU: 0 PID: 571 Comm: irq/128-2188000 Not tainted 5.18.0-rc3-00080-gd569e86915b7 #9
[ 146.229563] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[ 146.236257] unwind_backtrace from show_stack+0x10/0x14
[ 146.241640] show_stack from dump_stack_lvl+0x58/0x70
[ 146.246841] dump_stack_lvl from __warn+0xb4/0x24c
[ 146.251772] __warn from warn_slowpath_fmt+0x5c/0xd4
[ 146.256873] warn_slowpath_fmt from phy_error+0x24/0x6c
[ 146.262249] phy_error from kszphy_handle_interrupt+0x40/0x48
[ 146.268159] kszphy_handle_interrupt from irq_thread_fn+0x1c/0x78
[ 146.274417] irq_thread_fn from irq_thread+0xf0/0x1dc
[ 146.279605] irq_thread from kthread+0xe4/0x104
[ 146.284267] kthread from ret_from_fork+0x14/0x28
[ 146.289164] Exception stack(0xe6fa1fb0 to 0xe6fa1ff8)
[ 146.294448] 1fa0: 00000000000000000000000000000000
[ 146.302842] 1fc0: 0000000000000000000000000000000000000000000000000000000000000000
[ 146.311281] 1fe0: 000000000000000000000000000000000000001300000000
[ 146.318262] irq event stamp: 12325
[ 146.321780] hardirqs last enabled at (12333): [<c01984c4>] __up_console_sem+0x50/0x60
[ 146.330013] hardirqs last disabled at (12342): [<c01984b0>] __up_console_sem+0x3c/0x60
[ 146.338259] softirqs last enabled at (12324): [<c01017f0>] __do_softirq+0x2c0/0x624
[ 146.346311] softirqs last disabled at (12319): [<c01300ac>] __irq_exit_rcu+0x138/0x178
[ 146.354447] ---[ end trace 0000000000000000 ]---
With the FEC driver phydev->adjust_link() calls fec_enet_adjust_link()
calls fec_stop()/fec_restart() and both these function reset and
temporary disable the FEC disrupting any MII transaction that
could be happening at the same time.
fec_enet_adjust_link() and phy_read() can be running at the same time
when we have one additional interrupt before the phy_state_machine() is
able to terminate.
Joel Savitz [Tue, 10 May 2022 00:34:29 +0000 (17:34 -0700)]
selftests: vm: Makefile: rename TARGETS to VMTARGETS
The tools/testing/selftests/vm/Makefile uses the variable TARGETS
internally to generate a list of platform-specific binary build targets
suffixed with _{32,64}. When building the selftests using its own
Makefile directly, such as via the following command run in a kernel tree:
One receives an error such as the following:
make: Entering directory '/root/linux/tools/testing/selftests'
make --no-builtin-rules ARCH=x86 -C ../../.. headers_install
make[1]: Entering directory '/root/linux'
INSTALL ./usr/include
make[1]: Leaving directory '/root/linux'
make[1]: Entering directory '/root/linux/tools/testing/selftests/vm'
make[1]: *** No rule to make target 'vm.c', needed by '/root/linux/tools/testing/selftests/vm/vm_64'. Stop.
make[1]: Leaving directory '/root/linux/tools/testing/selftests/vm'
make: *** [Makefile:175: all] Error 2
make: Leaving directory '/root/linux/tools/testing/selftests'
The TARGETS variable passed to tools/testing/selftests/Makefile collides
with the TARGETS used in tools/testing/selftests/vm/Makefile, so rename
the latter to VMTARGETS, eliminating the collision with no functional
change.
Separate linux.intel.com account was created for submitting and reviewing
kernel patches, thus need to map previously used primary Intel e-mail
address.
Mike Rapoport [Tue, 10 May 2022 00:34:28 +0000 (17:34 -0700)]
arm[64]/memremap: don't abuse pfn_valid() to ensure presence of linear map
The semantics of pfn_valid() is to check presence of the memory map for a
PFN and not whether a PFN is covered by the linear map. The memory map
may be present for NOMAP memory regions, but they won't be mapped in the
linear mapping. Accessing such regions via __va() when they are
memremap()'ed will cause a crash.
On v5.4.y the crash happens on qemu-arm with UEFI [1]:
<1>[ 0.084476] 8<--- cut here ---
<1>[ 0.084595] Unable to handle kernel paging request at virtual address dfb76000
<1>[ 0.084938] pgd = (ptrval)
<1>[ 0.085038] [dfb76000] *pgd=5f7fe801, *pte=00000000, *ppte=00000000
...
<4>[ 0.093923] [<c0ed6ce8>] (memcpy) from [<c16a06f8>] (dmi_setup+0x60/0x418)
<4>[ 0.094204] [<c16a06f8>] (dmi_setup) from [<c16a38d4>] (arm_dmi_init+0x8/0x10)
<4>[ 0.094408] [<c16a38d4>] (arm_dmi_init) from [<c0302e9c>] (do_one_initcall+0x50/0x228)
<4>[ 0.094619] [<c0302e9c>] (do_one_initcall) from [<c16011e4>] (kernel_init_freeable+0x15c/0x1f8)
<4>[ 0.094841] [<c16011e4>] (kernel_init_freeable) from [<c0f028cc>] (kernel_init+0x8/0x10c)
<4>[ 0.095057] [<c0f028cc>] (kernel_init) from [<c03010e8>] (ret_from_fork+0x14/0x2c)
On kernels v5.10.y and newer the same crash won't reproduce on ARM because
commit b10d6bca8720 ("arch, drivers: replace for_each_membock() with
for_each_mem_range()") changed the way memory regions are registered in
the resource tree, but that merely covers up the problem.
On ARM64 memory resources registered in yet another way and there the
issue of wrong usage of pfn_valid() to ensure availability of the linear
map is also covered.
Implement arch_memremap_can_ram_remap() on ARM and ARM64 to prevent access
to NOMAP regions via the linear mapping in memremap().
Kalesh Singh [Tue, 10 May 2022 00:34:28 +0000 (17:34 -0700)]
procfs: prevent unprivileged processes accessing fdinfo dir
The file permissions on the fdinfo dir from were changed from
S_IRUSR|S_IXUSR to S_IRUGO|S_IXUGO, and a PTRACE_MODE_READ check was added
for opening the fdinfo files [1]. However, the ptrace permission check
was not added to the directory, allowing anyone to get the open FD numbers
by reading the fdinfo directory.
Add the missing ptrace permission check for opening the fdinfo directory.
Niels Dossche [Tue, 10 May 2022 00:34:28 +0000 (17:34 -0700)]
mm: mremap: fix sign for EFAULT error return value
The mremap syscall is supposed to return a pointer to the new virtual
memory area on success, and a negative value of the error code in case of
failure. Currently, EFAULT is returned when the VMA is not found, instead
of -EFAULT. The users of this syscall will therefore believe the syscall
succeeded in case the VMA didn't exist, as it returns a pointer to address
0xe (0xe being the value of EFAULT). Fix the sign of the error value.
Randy Dunlap [Mon, 9 May 2022 23:47:40 +0000 (16:47 -0700)]
hwmon: (ltq-cputemp) restrict it to SOC_XWAY
Building with SENSORS_LTQ_CPUTEMP=y with SOC_FALCON=y causes build
errors since FALCON does not support the same features as XWAY.
Change this symbol to depend on SOC_XWAY since that provides the
necessary interfaces.
Repairs these build errors:
../drivers/hwmon/ltq-cputemp.c: In function 'ltq_cputemp_enable':
../drivers/hwmon/ltq-cputemp.c:23:9: error: implicit declaration of function 'ltq_cgu_w32'; did you mean 'ltq_ebu_w32'? [-Werror=implicit-function-declaration]
23 | ltq_cgu_w32(ltq_cgu_r32(CGU_GPHY1_CR) | CGU_TEMP_PD, CGU_GPHY1_CR);
../drivers/hwmon/ltq-cputemp.c:23:21: error: implicit declaration of function 'ltq_cgu_r32'; did you mean 'ltq_ebu_r32'? [-Werror=implicit-function-declaration]
23 | ltq_cgu_w32(ltq_cgu_r32(CGU_GPHY1_CR) | CGU_TEMP_PD, CGU_GPHY1_CR);
../drivers/hwmon/ltq-cputemp.c: In function 'ltq_cputemp_probe':
../drivers/hwmon/ltq-cputemp.c:92:31: error: 'SOC_TYPE_VR9_2' undeclared (first use in this function)
92 | if (ltq_soc_type() != SOC_TYPE_VR9_2)
The W=2 build pointed out that the code wasn't initializing all the
variables in the dim_cq_moder declarations with the struct initializers.
The net change here is zero since these structs were already static
const globals and were initialized with zeros by the compiler, but
removing compiler warnings has value in and of itself.
lib/dim/net_dim.c: At top level:
lib/dim/net_dim.c:54:9: warning: missing initializer for field ‘comps’ of ‘const struct dim_cq_moder’ [-Wmissing-field-initializers]
54 | NET_DIM_RX_EQE_PROFILES,
| ^~~~~~~~~~~~~~~~~~~~~~~
In file included from lib/dim/net_dim.c:6:
./include/linux/dim.h:45:13: note: ‘comps’ declared here
45 | u16 comps;
| ^~~~~
and repeats for the tx struct, and once you fix the comps entry then
the cq_period_mode field needs the same treatment.
Use the commonly accepted style to indicate to the compiler that we
know what we're doing, and add a comma at the end of each struct
initializer to clean up the issue, and use explicit initializers
for the fields we are initializing which makes the compiler happy.
While here and fixing these lines, clean up the code slightly with
a fix for the super long lines by removing the word "_MODERATION" from a
couple defines only used in this file.
Jonathan Lemon [Fri, 6 May 2022 22:37:39 +0000 (15:37 -0700)]
ptp: ocp: Use DIV64_U64_ROUND_UP for rounding.
The initial code used roundup() to round the starting time to
a multiple of a period. This generated an error on 32-bit
systems, so was replaced with DIV_ROUND_UP_ULL().
However, this truncates to 32-bits on a 64-bit system. Replace
with DIV64_U64_ROUND_UP() instead.
Dan Aloni [Sun, 8 May 2022 12:54:50 +0000 (15:54 +0300)]
nfs: fix broken handling of the softreval mount option
Turns out that ever since this mount option was added, passing
`softreval` in NFS mount options cancelled all other flags while not
affecting the underlying flag `NFS_MOUNT_SOFTREVAL`.
Fixes: c74dfe97c104 ("NFS: Add mount option 'softreval'") Signed-off-by: Dan Aloni <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
Linus Torvalds [Mon, 9 May 2022 15:48:59 +0000 (08:48 -0700)]
Merge tag 'platform-drivers-x86-v5.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Hans de Goede:
- thinkpad_acpi AMD suspend/resume + fan detection fixes
- two other small fixes
- one hardware-id addition
* tag 'platform-drivers-x86-v5.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/surface: aggregator: Fix initialization order when compiling as builtin module
platform/surface: gpe: Add support for Surface Pro 8
platform/x86/intel: Fix 'rmmod pmt_telemetry' panic
platform/x86: thinkpad_acpi: Correct dual fan probe
platform/x86: thinkpad_acpi: Add a s2idle resume quirk for a number of laptops
platform/x86: thinkpad_acpi: Convert btusb DMI list to quirks
Miaoqian Lin [Fri, 29 Apr 2022 16:49:17 +0000 (17:49 +0100)]
slimbus: qcom: Fix IRQ check in qcom_slim_probe
platform_get_irq() returns non-zero IRQ number on success,
negative error number on failure.
And the doc of platform_get_irq() provides a usage example:
int irq = platform_get_irq(pdev, 0);
if (irq < 0)
return irq;
Fix the check of return value to catch errors correctly.
The definition of RFKILL_IOCTL_MAX_SIZE introduced by commit 54f586a91532 ("rfkill: make new event layout opt-in") is unusable
since it is based on RFKILL_IOC_EXT_SIZE which has not been defined.
Fix that by replacing the undefined constant with the constant which
is intended to be used in this definition.
Johannes Berg [Thu, 5 May 2022 21:04:22 +0000 (23:04 +0200)]
mac80211_hwsim: call ieee80211_tx_prepare_skb under RCU protection
This is needed since it might use (and pass out) pointers to
e.g. keys protected by RCU. Can't really happen here as the
frames aren't encrypted, but we need to still adhere to the
rules.
Amir Goldstein [Sat, 7 May 2022 08:00:28 +0000 (11:00 +0300)]
fanotify: do not allow setting dirent events in mask of non-dir
Dirent events (create/delete/move) are only reported on watched
directory inodes, but in fanotify as well as in legacy inotify, it was
always allowed to set them on non-dir inode, which does not result in
any meaningful outcome.
Until kernel v5.17, dirent events in fanotify also differed from events
"on child" (e.g. FAN_OPEN) in the information provided in the event.
For example, FAN_OPEN could be set in the mask of a non-dir or the mask
of its parent and event would report the fid of the child regardless of
the marked object.
By contrast, FAN_DELETE is not reported if the child is marked and the
child fid was not reported in the events.
Since kernel v5.17, with fanotify group flag FAN_REPORT_TARGET_FID, the
fid of the child is reported with dirent events, like events "on child",
which may create confusion for users expecting the same behavior as
events "on child" when setting events in the mask on a child.
The desired semantics of setting dirent events in the mask of a child
are not clear, so for now, deny this action for a group initialized
with flag FAN_REPORT_TARGET_FID and for the new event FAN_RENAME.
We may relax this restriction in the future if we decide on the
semantics and implement them.
When NET_F_F_GRO_FRAGLIST is enabled and bpf_skb_change_proto is used,
check if udp packets and tcp packets are successfully delivered to user
space. If wrong udp packets are delivered, udpgso_bench_rx will exit
with "Initial byte out of range"
Lina Wang [Thu, 5 May 2022 05:48:49 +0000 (13:48 +0800)]
net: fix wrong network header length
When clatd starts with ebpf offloaing, and NETIF_F_GRO_FRAGLIST is enable,
several skbs are gathered in skb_shinfo(skb)->frag_list. The first skb's
ipv6 header will be changed to ipv4 after bpf_skb_proto_6_to_4,
network_header\transport_header\mac_header have been updated as ipv4 acts,
but other skbs in frag_list didnot update anything, just ipv6 packets.
udp_queue_rcv_skb will call skb_segment_list to traverse other skbs in
frag_list and make sure right udp payload is delivered to user space.
Unfortunately, other skbs in frag_list who are still ipv6 packets are
updated like the first skb and will have wrong transport header length.
e.g.before bpf_skb_proto_6_to_4,the first skb and other skbs in frag_list
has the same network_header(24)& transport_header(64), after
bpf_skb_proto_6_to_4, ipv6 protocol has been changed to ipv4, the first
skb's network_header is 44,transport_header is 64, other skbs in frag_list
didnot change.After skb_segment_list, the other skbs in frag_list has
different network_header(24) and transport_header(44), so there will be 20
bytes different from original,that is difference between ipv6 header and
ipv4 header. Just change transport_header to be the same with original.
Actually, there are two solutions to fix it, one is traversing all skbs
and changing every skb header in bpf_skb_proto_6_to_4, the other is
modifying frag_list skb's header in skb_segment_list. Considering
efficiency, adopt the second one--- when the first skb and other skbs in
frag_list has different network_header length, restore them to make sure
right udp payload is delivered to user space.
Taehee Yoo [Wed, 4 May 2022 12:32:27 +0000 (12:32 +0000)]
net: sfc: fix memory leak due to ptp channel
It fixes memory leak in ring buffer change logic.
When ring buffer size is changed(ethtool -G eth0 rx 4096), sfc driver
works like below.
1. stop all channels and remove ring buffers.
2. allocates new buffer array.
3. allocates rx buffers.
4. start channels.
While the above steps are working, it skips some steps if the channel
doesn't have a ->copy callback function.
Due to ptp channel doesn't have ->copy callback, these above steps are
skipped for ptp channel.
It eventually makes some problems.
a. ptp channel's ring buffer size is not changed, it works only
1024(default).
b. memory leak.
The reason for memory leak is to use the wrong ring buffer values.
There are some values, which is related to ring buffer size.
a. efx->rxq_entries
- This is global value of rx queue size.
b. rx_queue->ptr_mask
- used for access ring buffer as circular ring.
- roundup_pow_of_two(efx->rxq_entries) - 1
c. rx_queue->max_fill
- efx->rxq_entries - EFX_RXD_HEAD_ROOM
These all values should be based on ring buffer size consistently.
But ptp channel's values are not.
a. efx->rxq_entries
- This is global(for sfc) value, always new ring buffer size.
b. rx_queue->ptr_mask
- This is always 1023(default).
c. rx_queue->max_fill
- This is new ring buffer size - EFX_RXD_HEAD_ROOM.
sfc driver allocates rx ring buffers based on these values.
When it allocates ptp channel's ring buffer, 4086 ring buffers are
allocated then, these buffers are attached to the allocated array.
But ptp channel's ring buffer array size is still 1024(default)
and ptr_mask is still 1023 too.
So, 3062 ring buffers will be overwritten to the array.
This is the reason for memory leak.
Test commands:
ethtool -G <interface name> rx 4096
while :
do
ip link set <interface name> up
ip link set <interface name> down
done
In order to avoid this problem, it adds ->copy callback to ptp channel
type.
So that rx_queue->ptr_mask value will be updated correctly.
Fixes: 7c236c43b838 ("sfc: Add support for IEEE-1588 PTP") Signed-off-by: Taehee Yoo <[email protected]> Signed-off-by: David S. Miller <[email protected]>
tools headers UAPI: Sync linux/kvm.h with the kernel sources
To pick the changes in:
d495f942f40aa412 ("KVM: fix bad user ABI for KVM_EXIT_SYSTEM_EVENT")
That just rebuilds perf, as these patches don't add any new KVM ioctl to
be harvested for the the 'perf trace' ioctl syscall argument
beautifiers.
This is also by now used by tools/testing/selftests/kvm/, a simple test
build succeeded.
This silences this perf build warning:
Warning: Kernel ABI header at 'tools/include/uapi/linux/kvm.h' differs from latest version at 'include/uapi/linux/kvm.h'
diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h
Jeremy Linton [Thu, 28 Apr 2022 15:19:47 +0000 (10:19 -0500)]
perf tests: Fix coresight `perf test` failure.
Currently the `perf test` always fails the coresight test like:
89: Check Arm CoreSight trace data recording and synthesized samples: FAILED!
That is because the test_arm_coresight.sh is attempting to SIGINT the
parent but is using $$ rather than $PPID and it sigint's itself when
run under the perf test framework.
Since this is done in a trap clause it ends up returning a non zero
return.
Since $PPID is a bash ism and not all distros are linking /bin/sh to
bash, the alternative parent pid lookups are uglier than just dropping
the kill, and its not strictly needed, lets pick the simple solution and
drop the sigint.
Ian Rogers [Thu, 28 Apr 2022 20:29:11 +0000 (13:29 -0700)]
perf bench: Fix two numa NDEBUG warnings
BUG_ON is a no-op if NDEBUG is defined, otherwise it is an assert.
Compiling with NDEBUG yields:
bench/numa.c: In function ‘bind_to_cpu’:
bench/numa.c:314:1: error: control reaches end of non-void function [-Werror=return-type]
314 | }
| ^
bench/numa.c: In function ‘bind_to_node’:
bench/numa.c:367:1: error: control reaches end of non-void function [-Werror=return-type]
367 | }
| ^
Linus Torvalds [Sun, 8 May 2022 19:42:05 +0000 (12:42 -0700)]
Merge tag 'for-5.18/parisc-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc architecture fixes from Helge Deller:
"Some reverts of existing patches, which were necessary because of boot
issues due to wrong CPU clock handling and cache issues which led to
userspace segfaults with 32bit kernels. Dave has a whole bunch of
upcoming cache fixes which I then plan to push in the next merge
window.
Other than that just small updates and fixes, e.g. defconfig updates,
spelling fixes, a clocksource fix, boot topology fixes and a fix for
/proc/cpuinfo output to satisfy lscpu"
* tag 'for-5.18/parisc-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
Revert "parisc: Increase parisc_cache_flush_threshold setting"
parisc: Mark cr16 clock unstable on all SMP machines
parisc: Fix typos in comments
parisc: Change MAX_ADDRESS to become unsigned long long
parisc: Merge model and model name into one line in /proc/cpuinfo
parisc: Re-enable GENERIC_CPU_DEVICES for !SMP
parisc: Update 32- and 64-bit defconfigs
parisc: Only list existing CPUs in cpu_possible_mask
Revert "parisc: Fix patch code locking and flushing"
Revert "parisc: Mark sched_clock unstable only if clocks are not syncronized"
Revert "parisc: Mark cr16 CPU clocksource unstable on all SMP machines"
Linus Torvalds [Sun, 8 May 2022 18:38:23 +0000 (11:38 -0700)]
Merge tag 'powerpc-5.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix the DWARF CFI in our VDSO time functions, allowing gdb to
backtrace through them correctly.
- Fix a buffer overflow in the papr_scm driver, only triggerable by
hypervisor input.
- A fix in the recently added QoS handling for VAS (used for
communicating with coprocessors).
Thanks to Alan Modra, Haren Myneni, Kajol Jain, and Segher Boessenkool.
* tag 'powerpc-5.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/papr_scm: Fix buffer overflow issue with CONFIG_FORTIFY_SOURCE
powerpc/vdso: Fix incorrect CFI in gettimeofday.S
powerpc/pseries/vas: Use QoS credits from the userspace
Linus Torvalds [Sun, 8 May 2022 18:21:54 +0000 (11:21 -0700)]
Merge tag 'x86-urgent-2022-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Thomas Gleixner:
"A fix and an email address update:
- Prevent FPU state corruption.
The condition in irq_fpu_usable() grants FPU usage when the FPU is
not used in the kernel. That's just wrong as it does not take the
fpregs_lock()'ed regions into account. If FPU usage happens within
such a region from interrupt context, then the FPU state gets
corrupted.
That's a long standing bug, which got unearthed by the recent
changes to the random code.
- Josh wants to use his kernel.org email address"
* tag 'x86-urgent-2022-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Prevent FPU state corruption
MAINTAINERS: Update Josh Poimboeuf's email address