MarkLee [Mon, 14 Oct 2019 07:15:17 +0000 (15:15 +0800)]
net: ethernet: mediatek: Fix MT7629 missing GMII mode support
In the original design, mtk_phy_connect function will set ge_mode=1
if phy-mode is GMII(PHY_INTERFACE_MODE_GMII) and then set the correct
ge_mode to ETHSYS_SYSCFG0 register. This logic was broken after apply
mediatek PHYLINK patch(Fixes tag), the new mtk_mac_config function will
not set ge_mode=1 for GMII mode hence the final ETHSYS_SYSCFG0 setting
will be incorrect for mt7629 GMII mode. This patch add the missing logic
back to fix it.
Fixes: b8fc9f30821e ("net: ethernet: mediatek: Add basic PHYLINK support") Signed-off-by: MarkLee <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Wed, 16 Oct 2019 00:14:48 +0000 (17:14 -0700)]
Merge branch 'mpls-push-pop-fix'
Davide Caratti says:
====================
net/sched: fix wrong behavior of MPLS push/pop action
this series contains two fixes for TC 'act_mpls', that try to address
two problems that can be observed configuring simple 'push' / 'pop'
operations:
- patch 1/2 avoids dropping non-MPLS packets that pass through the MPLS
'pop' action.
- patch 2/2 fixes corruption of the L2 header that occurs when 'push'
or 'pop' actions are configured in TC egress path.
v2: - change commit message in patch 1/2 to better describe that the
patch impacts only TC, thanks to Simon Horman
- fix missing documentation of 'mac_len' in patch 2/2
====================
Davide Caratti [Sat, 12 Oct 2019 11:55:07 +0000 (13:55 +0200)]
net/sched: fix corrupted L2 header with MPLS 'push' and 'pop' actions
the following script:
# tc qdisc add dev eth0 clsact
# tc filter add dev eth0 egress protocol ip matchall \
> action mpls push protocol mpls_uc label 0x355aa bos 1
causes corruption of all IP packets transmitted by eth0. On TC egress, we
can't rely on the value of skb->mac_len, because it's 0 and a MPLS 'push'
operation will result in an overwrite of the first 4 octets in the packet
L2 header (e.g. the Destination Address if eth0 is an Ethernet); the same
error pattern is present also in the MPLS 'pop' operation. Fix this error
in act_mpls data plane, computing 'mac_len' as the difference between the
network header and the mac header (when not at TC ingress), and use it in
MPLS 'push'/'pop' core functions.
v2: unbreak 'make htmldocs' because of missing documentation of 'mac_len'
in skb_mpls_pop(), reported by kbuild test robot
Davide Caratti [Sat, 12 Oct 2019 11:55:06 +0000 (13:55 +0200)]
net: avoid errors when trying to pop MLPS header on non-MPLS packets
the following script:
# tc qdisc add dev eth0 clsact
# tc filter add dev eth0 egress matchall action mpls pop
implicitly makes the kernel drop all packets transmitted by eth0, if they
don't have a MPLS header. This behavior is uncommon: other encapsulations
(like VLAN) just let the packet pass unmodified. Since the result of MPLS
'pop' operation would be the same regardless of the presence / absence of
MPLS header(s) in the original packet, we can let skb_mpls_pop() return 0
when dealing with non-MPLS packets.
For the OVS use-case, this is acceptable because __ovs_nla_copy_actions()
already ensures that MPLS 'pop' operation only occurs with packets having
an MPLS Ethernet type (and there are no other callers in current code, so
the semantic change should be ok).
v2: better documentation of use-cases for skb_mpls_pop(), thanks to Simon
Horman
Nishad Kamdar [Sat, 12 Oct 2019 13:12:28 +0000 (18:42 +0530)]
net: cavium: Use the correct style for SPDX License Identifier
This patch corrects the SPDX License Identifier style
in header files related to Cavium Ethernet drivers.
For C header files Documentation/process/license-rules.rst
mandates C-like comments (opposed to C source files where
C++ style should be used)
Changes made by using a script provided by Joe Perches here:
https://lkml.org/lkml/2019/2/7/46.
Nishad Kamdar [Sat, 12 Oct 2019 12:18:56 +0000 (17:48 +0530)]
net: dsa: microchip: Use the correct style for SPDX License Identifier
This patch corrects the SPDX License Identifier style
in header files related to Distributed Switch Architecture
drivers for Microchip KSZ series switch support.
For C header files Documentation/process/license-rules.rst
mandates C-like comments (opposed to C source files where
C++ style should be used)
Changes made by using a script provided by Joe Perches here:
https://lkml.org/lkml/2019/2/7/46.
There is an arbitrary difference between the system resume and
runtime resume code paths for PCI devices regarding the delay to
apply when switching the devices from D3cold to D0.
Namely, pci_restore_standard_config() used in the runtime resume
code path calls pci_set_power_state() which in turn invokes
__pci_start_power_transition() to power up the device through the
platform firmware and that function applies the transition delay
(as per PCI Express Base Specification Revision 2.0, Section 6.6.1).
However, pci_pm_default_resume_early() used in the system resume
code path calls pci_power_up() which doesn't apply the delay at
all and that causes issues to occur during resume from
suspend-to-idle on some systems where the delay is required.
Since there is no reason for that difference to exist, modify
pci_power_up() to follow pci_set_power_state() more closely and
invoke __pci_start_power_transition() from there to call the
platform firmware to power up the device (in case that's necessary).
Linus Torvalds [Tue, 15 Oct 2019 21:50:10 +0000 (14:50 -0700)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
"Some minor bugfixes"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost/test: stop device before reset
tools/virtio: xen stub
tools/virtio: more stubs
Max Filippov [Tue, 15 Oct 2019 20:52:03 +0000 (13:52 -0700)]
xtensa: virt: fix PCI IO ports mapping
virt device tree incorrectly uses 0xf0000000 on both sides of PCI IO
ports address space mapping. This results in incorrect port address
assignment in PCI IO BARs and subsequent crash on attempt to access
them. Use 0 as base address in PCI IO ports address space.
Dan Williams [Tue, 15 Oct 2019 19:54:17 +0000 (12:54 -0700)]
libata/ahci: Fix PCS quirk application
Commit c312ef176399 "libata/ahci: Drop PCS quirk for Denverton and
beyond" got the polarity wrong on the check for which board-ids should
have the quirk applied. The board type board_ahci_pcs7 is defined at the
end of the list such that "pcs7" boards can be special cased in the
future if they need the quirk. All prior Intel board ids "<
board_ahci_pcs7" should proceed with applying the quirk.
Linus Torvalds [Tue, 15 Oct 2019 19:19:08 +0000 (12:19 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Five changes, two in drivers (qla2xxx, zfcp), one to MAINTAINERS
(qla2xxx) and two in the core.
The last two are mostly about removing incorrect messages from the
kernel log: the resid message is definitely wrong and the sync cache
on protected drive problem is arguably wrong"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: MAINTAINERS: Update qla2xxx driver
scsi: zfcp: fix reaction on bit error threshold notification
scsi: core: save/restore command resid for error handling
scsi: qla2xxx: Remove WARN_ON_ONCE in qla2x00_status_cont_entry()
scsi: sd: Ignore a failure to sync cache due to lack of authorization
Randy Dunlap [Sat, 12 Oct 2019 04:03:33 +0000 (21:03 -0700)]
net: ethernet: broadcom: have drivers select DIMLIB as needed
NET_VENDOR_BROADCOM is intended to control a kconfig menu only.
It should not have anything to do with code generation.
As such, it should not select DIMLIB for all drivers under
NET_VENDOR_BROADCOM. Instead each driver that needs DIMLIB should
select it (being the symbols SYSTEMPORT, BNXT, and BCMGENET).
Florian Fainelli [Fri, 11 Oct 2019 19:53:49 +0000 (12:53 -0700)]
net: bcmgenet: Set phydev->dev_flags only for internal PHYs
phydev->dev_flags is entirely dependent on the PHY device driver which
is going to be used, setting the internal GENET PHY revision in those
bits only makes sense when drivers/net/phy/bcm7xxx.c is the PHY driver
being used.
Fixes: 487320c54143 ("net: bcmgenet: communicate integrated PHY revision to PHY driver") Signed-off-by: Florian Fainelli <[email protected]> Acked-by: Doug Berger <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Mahesh Bandewar [Sat, 12 Oct 2019 01:14:55 +0000 (18:14 -0700)]
blackhole_netdev: fix syzkaller reported issue
While invalidating the dst, we assign backhole_netdev instead of
loopback device. However, this device does not have idev pointer
and hence no ip6_ptr even if IPv6 is enabled. Possibly this has
triggered the syzbot reported crash.
The syzbot report does not have reproducer, however, this is the
only device that doesn't have matching idev created.
Also ipv6 always assumes presence of idev and never checks for it
being NULL (as does the above referenced code). So adding a idev
for the blackhole_netdev to avoid this class of crashes in the future.
Linus Torvalds [Tue, 15 Oct 2019 16:56:36 +0000 (09:56 -0700)]
sparc64: disable fast-GUP due to unexplained oopses
HAVE_FAST_GUP enables the lockless quick page table walker for simple
cases, and is a nice optimization for some random loads that can then
use get_user_pages_fast() rather than the more careful page walker.
However, for some unexplained reason, it seems to be subtly broken on
sparc64. The breakage is only with some compiler versions and some
hardware, and nobody seems to have figured out what triggers it,
although there's a simple reprodicer for the problem when it does
trigger.
The problem was introduced with the conversion to the generic GUP code
in commit 7b9afb86b632 ("sparc64: use the generic get_user_pages_fast
code"), but nothing looks obviously wrong in that conversion. It may be
a compiler bug that just hits us with the code reorganization. Or it
may be something very specific to sparc64.
This disables HAVE_FAST_GUP entirely. That makes things like futexes a
bit slower, but at least they work. If we can figure out the trigger,
that would be lovely, but it's been three months already..
Steven Price [Wed, 9 Oct 2019 09:44:55 +0000 (10:44 +0100)]
drm/panfrost: Handle resetting on timeout better
Panfrost uses multiple schedulers (one for each slot, so 2 in reality),
and on a timeout has to stop all the schedulers to safely perform a
reset. However more than one scheduler can trigger a timeout at the same
time. This race condition results in jobs being freed while they are
still in use.
When stopping other slots use cancel_delayed_work_sync() to ensure that
any timeout started for that slot has completed. Also use
mutex_trylock() to obtain reset_lock. This means that only one thread
attempts the reset, the other threads will simply complete without doing
anything (the first thread will wait for this in the call to
cancel_delayed_work_sync()).
While we're here and since the function is already dependent on
sched_job not being NULL, let's remove the unnecessary checks.
Tejun Heo [Tue, 15 Oct 2019 15:49:27 +0000 (08:49 -0700)]
blk-rq-qos: fix first node deletion of rq_qos_del()
rq_qos_del() incorrectly assigns the node being deleted to the head if
it was the first on the list in the !prev path. Fix it by iterating
with ** instead.
Tejun Heo [Tue, 15 Oct 2019 16:03:47 +0000 (09:03 -0700)]
blkcg: Fix multiple bugs in blkcg_activate_policy()
blkcg_activate_policy() has the following bugs.
* cf09a8ee19ad ("blkcg: pass @q and @blkcg into
blkcg_pol_alloc_pd_fn()") added @blkcg to ->pd_alloc_fn(); however,
blkcg_activate_policy() ends up using pd's allocated for the root
blkcg for all preallocations, so ->pd_init_fn() for non-root blkcgs
can be passed in pd's which are allocated for the root blkcg.
For blk-iocost, this means that ->pd_init_fn() can write beyond the
end of the allocated object as it determines the length of the flex
array at the end based on the blkcg's nesting level.
* Each pd is initialized as they get allocated. If alloc fails, the
policy will get freed with pd's initialized on it.
* After the above partial failure, the partial pds are not freed.
This patch fixes all the above issues by
* Restructuring blkcg_activate_policy() so that alloc and init passes
are separate. Init takes place only after all allocs succeeded and
on failure all allocated pds are freed.
* Unifying and fixing the cleanup of the remaining pd_prealloc.
Signed-off-by: Tejun Heo <[email protected]> Fixes: cf09a8ee19ad ("blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()") Signed-off-by: Jens Axboe <[email protected]>
Darrick J. Wong [Tue, 15 Oct 2019 15:46:07 +0000 (08:46 -0700)]
xfs: change the seconds fields in xfs_bulkstat to signed
64-bit time is a signed quantity in the kernel, so the bulkstat
structure should reflect that. Note that the structure size stays
the same and that we have not yet published userspace headers for this
new ioctl so there are no users to break.
Fixes: 7035f9724f84 ("xfs: introduce new v5 bulkstat structure") Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Carlos Maiolino <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
The reason is that the rbd_add_acquire_lock() is interruptable,
that means, when we kill the waiting on ->acquire_wait, the lock_dwork
could be still running.
Then when we reach the rbd_dev_free(), WARN_ON is triggered because
lock_state is not RBD_LOCK_STATE_UNLOCKED.
To fix it, this commit make sure the lock_dwork was finished before
calling rbd_dev_image_unlock().
On the other hand, this would not happend in do_rbd_remove(), because
after rbd mapped, lock_dwork will only be queued for IO request, and
request will continue unless lock_dwork finished. when we call
rbd_dev_image_unlock() in do_rbd_remove(), all requests are done.
That means, lock_state should not be locked again after
rbd_dev_image_unlock().
[ Cancel lock_dwork in rbd_add_acquire_lock(), only if the wait is
interrupted. ]
Jeff Layton [Thu, 26 Sep 2019 20:05:11 +0000 (16:05 -0400)]
ceph: just skip unrecognized info in ceph_reply_info_extra
In the future, we're going to want to extend the ceph_reply_info_extra
for create replies. Currently though, the kernel code doesn't accept an
extra blob that is larger than the expected data.
Change the code to skip over any unrecognized fields at the end of the
extra blob, rather than returning -EIO.
yangerkun [Tue, 15 Oct 2019 13:59:29 +0000 (21:59 +0800)]
io_uring: consider the overflow of sequence for timeout req
Now we recalculate the sequence of timeout with 'req->sequence =
ctx->cached_sq_head + count - 1', judge the right place to insert
for timeout_list by compare the number of request we still expected for
completion. But we have not consider about the situation of overflow:
1. ctx->cached_sq_head + count - 1 may overflow. And a bigger count for
the new timeout req can have a small req->sequence.
2. cached_sq_head of now may overflow compare with before req. And it
will lead the timeout req with small req->sequence.
This overflow will lead to the misorder of timeout_list, which can lead
to the wrong order of the completion of timeout_list. Fix it by reuse
req->submit.sequence to store the count, and change the logic of
inserting sort in io_timeout.
iommu/amd: Fix incorrect PASID decoding from event log
IOMMU Event Log encodes 20-bit PASID for events:
ILLEGAL_DEV_TABLE_ENTRY
IO_PAGE_FAULT
PAGE_TAB_HARDWARE_ERROR
INVALID_DEVICE_REQUEST
as:
PASID[15:0] = bit 47:32
PASID[19:16] = bit 19:16
Note that INVALID_PPR_REQUEST event has different encoding
from the rest of the events as the following:
PASID[15:0] = bit 31:16
PASID[19:16] = bit 45:42
iommu/rockchip: Don't use platform_get_irq to implicitly count irqs
Till now the Rockchip iommu driver walked through the irq list via
platform_get_irq() until it encountered an ENXIO error. With the
recent change to add a central error message, this always results
in such an error for each iommu on probe and shutdown.
To not confuse people, switch to platform_count_irqs() to get the
actual number of interrupts before walking through them.
Fixes: 7723f4c5ecdb ("driver core: platform: Add an error message to platform_get_irq*()") Signed-off-by: Heiko Stuebner <[email protected]> Tested-by: Enric Balletbo i Serra <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
Pavel Tatashin [Mon, 14 Oct 2019 14:48:24 +0000 (10:48 -0400)]
arm64: hibernate: check pgd table allocation
There is a bug in create_safe_exec_page(), when page table is allocated
it is not checked that table is allocated successfully:
But it is dereferenced in: pgd_none(READ_ONCE(*pgdp)). Check that
allocation was successful.
Fixes: 82869ac57b5d ("arm64: kernel: Add support for hibernate/suspend-to-disk") Reviewed-by: James Morse <[email protected]> Signed-off-by: Pavel Tatashin <[email protected]> Signed-off-by: Will Deacon <[email protected]>
Julien Grall [Mon, 14 Oct 2019 10:21:13 +0000 (11:21 +0100)]
arm64: cpufeature: Treat ID_AA64ZFR0_EL1 as RAZ when SVE is not enabled
If CONFIG_ARM64_SVE=n then we fail to report ID_AA64ZFR0_EL1 as 0 when
read by userspace, despite being required by the architecture. Although
this is theoretically a change in ABI, userspace will first check for
the presence of SVE via the HWCAP or the ID_AA64PFR0_EL1.SVE field
before probing the ID_AA64ZFR0_EL1 register. Given that these are
reported correctly for this configuration, we can safely tighten up the
current behaviour.
Ensure ID_AA64ZFR0_EL1 is treated as RAZ when CONFIG_ARM64_SVE=n.
Dmitry Bogdanov [Fri, 11 Oct 2019 13:45:23 +0000 (13:45 +0000)]
net: aquantia: correctly handle macvlan and multicast coexistence
macvlan and multicast handling is now mixed up.
The explicit issue is that macvlan interface gets broken (no traffic)
after clearing MULTICAST flag on the real interface.
We now do separate logic and consider both ALLMULTI and MULTICAST
flags on the device.
Fixes: 11ba961c9161 ("net: aquantia: Fix IFF_ALLMULTI flag functionality") Signed-off-by: Dmitry Bogdanov <[email protected]> Signed-off-by: Igor Russkikh <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Dmitry Bogdanov [Fri, 11 Oct 2019 13:45:22 +0000 (13:45 +0000)]
net: aquantia: do not pass lro session with invalid tcp checksum
Individual descriptors on LRO TCP session should be checked
for CRC errors. It was discovered that HW recalculates
L4 checksums on LRO session and does not break it up on bad L4
csum.
Thus, driver should aggregate HW LRO L4 statuses from all individual
buffers of LRO session and drop packet if one of the buffers has bad
L4 checksum.
Fixes: f38f1ee8aeb2 ("net: aquantia: check rx csum for all packets in LRO session") Signed-off-by: Dmitry Bogdanov <[email protected]> Signed-off-by: Igor Russkikh <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Igor Russkikh [Fri, 11 Oct 2019 13:45:20 +0000 (13:45 +0000)]
net: aquantia: when cleaning hw cache it should be toggled
>From HW specification to correctly reset HW caches (this is a required
workaround when stopping the device), register bit should actually
be toggled.
It was previosly always just set. Due to the way driver stops HW this
never actually caused any issues, but it still may, so cleaning this up.
Fixes: 7a1bb49461b1 ("net: aquantia: fix potential IOMMU fault after driver unbind") Signed-off-by: Igor Russkikh <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Igor Russkikh [Fri, 11 Oct 2019 13:45:19 +0000 (13:45 +0000)]
net: aquantia: temperature retrieval fix
Chip temperature is a two byte word, colocated internally with cable
length data. We do all readouts from HW memory by dwords, thus
we should clear extra high bytes, otherwise temperature output
gets weird as soon as we attach a cable to the NIC.
Fixes: 8f8940118654 ("net: aquantia: add infrastructure to readout chip temperature") Tested-by: Holger Hoffstätte <[email protected]> Signed-off-by: Igor Russkikh <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Linus Torvalds [Mon, 14 Oct 2019 23:49:59 +0000 (16:49 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge more fixes from Andrew Morton:
"The usual shower of hotfixes and some followups to the recently merged
page_owner enhancements"
* emailed patches from Andrew Morton <[email protected]>:
mm/memory-failure: poison read receives SIGKILL instead of SIGBUS if mmaped more than once
mm/slab.c: fix kernel-doc warning for __ksize()
xarray.h: fix kernel-doc warning
bitmap.h: fix kernel-doc warning and typo
fs/fs-writeback.c: fix kernel-doc warning
fs/libfs.c: fix kernel-doc warning
fs/direct-io.c: fix kernel-doc warning
mm, compaction: fix wrong pfn handling in __reset_isolation_pfn()
mm, hugetlb: allow hugepage allocations to reclaim as needed
lib/test_meminit: add a kmem_cache_alloc_bulk() test
mm/slub.c: init_on_free=1 should wipe freelist ptr for bulk allocations
lib/generic-radix-tree.c: add kmemleak annotations
mm/slub: fix a deadlock in show_slab_objects()
mm, page_owner: rename flag indicating that page is allocated
mm, page_owner: decouple freeing stack trace from debug_pagealloc
mm, page_owner: fix off-by-one error in __set_page_owner_handle()
Andy Shevchenko [Wed, 9 Oct 2019 15:23:27 +0000 (18:23 +0300)]
gpio: merrifield: Move hardware initialization to callback
The driver wants to initialize related registers before IRQ chip will be added.
That's why move it to a corresponding callback. It also fixes the NULL pointer
dereference.
Fixes: 8f86a5b4ad67 ("gpio: merrifield: Pass irqchip when adding gpiochip") Signed-off-by: Andy Shevchenko <[email protected]> Signed-off-by: Linus Walleij <[email protected]>
Andy Shevchenko [Wed, 9 Oct 2019 15:58:45 +0000 (18:58 +0300)]
gpio: lynxpoint: Move hardware initialization to callback
The driver wants to initialize related registers before IRQ chip will be added.
That's why move it to a corresponding callback. It also fixes the NULL pointer
dereference.
Fixes: 7b1e889436a1 ("gpio: lynxpoint: Pass irqchip when adding gpiochip") Signed-off-by: Andy Shevchenko <[email protected]> Signed-off-by: Linus Walleij <[email protected]>
Andy Shevchenko [Wed, 9 Oct 2019 15:23:27 +0000 (18:23 +0300)]
gpio: intel-mid: Move hardware initialization to callback
The driver wants to initialize related registers before IRQ chip will be added.
That's why move it to a corresponding callback. It also fixes the NULL pointer
dereference.
Fixes: 8069e69a9792 ("gpio: intel-mid: Pass irqchip when adding gpiochip") Signed-off-by: Andy Shevchenko <[email protected]> Signed-off-by: Linus Walleij <[email protected]>
Andy Shevchenko [Wed, 9 Oct 2019 14:34:44 +0000 (17:34 +0300)]
gpiolib: Initialize the hardware with a callback
After changing the drivers to use GPIO core to add an IRQ chip
it appears that some of them requires a hardware initialization
before adding the IRQ chip.
Add an optional callback ->init_hw() to allow that drivers
to initialize hardware if needed.
This change is a part of the fix NULL pointer dereference
brought to the several drivers recently.
Andy Shevchenko [Wed, 9 Oct 2019 15:23:27 +0000 (18:23 +0300)]
gpio: merrifield: Restore use of irq_base
During conversion to internal IRQ chip initialization the commit 8f86a5b4ad67 ("gpio: merrifield: Pass irqchip when adding gpiochip")
lost the irq_base assignment.
drivers/gpio/gpio-merrifield.c: In function ‘mrfld_gpio_probe’:
drivers/gpio/gpio-merrifield.c:405:17: warning: variable ‘irq_base’ set but not used [-Wunused-but-set-variable]
Assign the girq->first to it.
Fixes: 8f86a5b4ad67 ("gpio: merrifield: Pass irqchip when adding gpiochip") Signed-off-by: Andy Shevchenko <[email protected]> Signed-off-by: Linus Walleij <[email protected]>
Max Filippov [Mon, 14 Oct 2019 22:48:19 +0000 (15:48 -0700)]
xtensa: drop EXPORT_SYMBOL for outs*/ins*
Custom outs*/ins* implementations are long gone from the xtensa port,
remove matching EXPORT_SYMBOLs.
This fixes the following build warnings issued by modpost since commit 15bfc2348d54 ("modpost: check for static EXPORT_SYMBOL* functions"):
WARNING: "insb" [vmlinux] is a static EXPORT_SYMBOL
WARNING: "insw" [vmlinux] is a static EXPORT_SYMBOL
WARNING: "insl" [vmlinux] is a static EXPORT_SYMBOL
WARNING: "outsb" [vmlinux] is a static EXPORT_SYMBOL
WARNING: "outsw" [vmlinux] is a static EXPORT_SYMBOL
WARNING: "outsl" [vmlinux] is a static EXPORT_SYMBOL
Jane Chu [Mon, 14 Oct 2019 21:12:29 +0000 (14:12 -0700)]
mm/memory-failure: poison read receives SIGKILL instead of SIGBUS if mmaped more than once
Mmap /dev/dax more than once, then read the poison location using
address from one of the mappings. The other mappings due to not having
the page mapped in will cause SIGKILLs delivered to the process.
SIGKILL succeeds over SIGBUS, so user process loses the opportunity to
handle the UE.
Although one may add MAP_POPULATE to mmap(2) to work around the issue,
MAP_POPULATE makes mapping 128GB of pmem several magnitudes slower, so
isn't always an option.
Vlastimil Babka [Mon, 14 Oct 2019 21:12:07 +0000 (14:12 -0700)]
mm, compaction: fix wrong pfn handling in __reset_isolation_pfn()
Florian and Dave reported [1] a NULL pointer dereference in
__reset_isolation_pfn(). While the exact cause is unclear, staring at
the code revealed two bugs, which might be related.
One bug is that if zone starts in the middle of pageblock, block_page
might correspond to different pfn than block_pfn, and then the
pfn_valid_within() checks will check different pfn's than those accessed
via struct page. This might result in acessing an unitialized page in
CONFIG_HOLES_IN_ZONE configs.
The other bug is that end_page refers to the first page of next
pageblock and not last page of current pageblock. The online and valid
check is then wrong and with sections, the while (page < end_page) loop
might wander off actual struct page arrays.
David Rientjes [Mon, 14 Oct 2019 21:12:04 +0000 (14:12 -0700)]
mm, hugetlb: allow hugepage allocations to reclaim as needed
Commit b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when
compaction may not succeed") has chnaged the allocator to bail out from
the allocator early to prevent from a potentially excessive memory
reclaim. __GFP_RETRY_MAYFAIL is designed to retry the allocation,
reclaim and compaction loop as long as there is a reasonable chance to
make forward progress. Neither COMPACT_SKIPPED nor COMPACT_DEFERRED at
the INIT_COMPACT_PRIORITY compaction attempt gives this feedback.
The most obvious affected subsystem is hugetlbfs which allocates huge
pages based on an admin request (or via admin configured overcommit). I
have done a simple test which tries to allocate half of the memory for
hugetlb pages while the memory is full of a clean page cache. This is
not an unusual situation because we try to cache as much of the memory
as possible and sysctl/sysfs interface to allocate huge pages is there
for flexibility to allocate hugetlb pages at any time.
System has 1GB of RAM and we are requesting 515MB worth of hugetlb pages
after the memory is prefilled by a clean page cache:
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out 1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
+ date +%s
+ TS=1569905516
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
11
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out 1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
+ date +%s
+ TS=1569905541
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
12
The success rate went down by factor of 20!
Although hugetlb allocation requests might fail and it is reasonable to
expect them to under extremely fragmented memory or when the memory is
under a heavy pressure but the above situation is not that case.
Fix the regression by reverting back to the previous behavior for
__GFP_RETRY_MAYFAIL requests and disable the beail out heuristic for
those requests.
Mike said:
: hugetlbfs allocations are commonly done via sysctl/sysfs shortly after
: boot where this may not be as much of an issue. However, I am aware of at
: least three use cases where allocations are made after the system has been
: up and running for quite some time:
:
: - DB reconfiguration. If sysctl/sysfs fails to get required number of
: huge pages, system is rebooted to perform allocation after boot.
:
: - VM provisioning. If unable get required number of huge pages, fall
: back to base pages.
:
: - An application that does not preallocate pool, but rather allocates
: pages at fault time for optimal NUMA locality.
:
: In all cases, I would expect b39d0ee2632d to cause regressions and
: noticable behavior changes.
:
: My quick/limited testing in
: https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@oracle.com
: was insufficient. It was also mentioned that if something like
: b39d0ee2632d went forward, I would like exemptions for __GFP_RETRY_MAYFAIL
: requests as in this patch.
mm/slub.c: init_on_free=1 should wipe freelist ptr for bulk allocations
slab_alloc_node() already zeroed out the freelist pointer if
init_on_free was on. Thibaut Sautereau noticed that the same needs to
be done for kmem_cache_alloc_bulk(), which performs the allocations
separately.
kmem_cache_alloc_bulk() is currently used in two places in the kernel,
so this change is unlikely to have a major performance impact.
SLAB doesn't require a similar change, as auto-initialization makes the
allocator store the freelist pointers off-slab.
But it's freed later. Kmemleak misses the allocation because its
pointer is stored in the generic radix tree sctp_stream::out, and the
generic radix tree uses raw pages which aren't tracked by kmemleak.
Fix this by adding the kmemleak hooks to the generic radix tree code.
Qian Cai [Mon, 14 Oct 2019 21:11:51 +0000 (14:11 -0700)]
mm/slub: fix a deadlock in show_slab_objects()
A long time ago we fixed a similar deadlock in show_slab_objects() [1].
However, it is apparently due to the commits like 01fb58bcba63 ("slab:
remove synchronous synchronize_sched() from memcg cache deactivation
path") and 03afc0e25f7f ("slab: get_online_mems for
kmem_cache_{create,destroy,shrink}"), this kind of deadlock is back by
just reading files in /sys/kernel/slab which will generate a lockdep
splat below.
Since the "mem_hotplug_lock" here is only to obtain a stable online node
mask while racing with NUMA node hotplug, in the worst case, the results
may me miscalculated while doing NUMA node hotplug, but they shall be
corrected by later reads of the same files.
WARNING: possible circular locking dependency detected
------------------------------------------------------
cat/5224 is trying to acquire lock: ffff900012ac3120 (mem_hotplug_lock.rw_sem){++++}, at:
show_slab_objects+0x94/0x3a8
but task is already holding lock: b8ff009693eee398 (kn->count#45){++++}, at: kernfs_seq_start+0x44/0xf0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
I think it is important to mention that this doesn't expose the
show_slab_objects to use-after-free. There is only a single path that
might really race here and that is the slab hotplug notifier callback
__kmem_cache_shrink (via slab_mem_going_offline_callback) but that path
doesn't really destroy kmem_cache_node data structures.
Vlastimil Babka [Mon, 14 Oct 2019 21:11:47 +0000 (14:11 -0700)]
mm, page_owner: rename flag indicating that page is allocated
Commit 37389167a281 ("mm, page_owner: keep owner info when freeing the
page") has introduced a flag PAGE_EXT_OWNER_ACTIVE to indicate that page
is tracked as being allocated. Kirril suggested naming it
PAGE_EXT_OWNER_ALLOCATED to make it more clear, as "active is somewhat
loaded term for a page".
Vlastimil Babka [Mon, 14 Oct 2019 21:11:44 +0000 (14:11 -0700)]
mm, page_owner: decouple freeing stack trace from debug_pagealloc
Commit 8974558f49a6 ("mm, page_owner, debug_pagealloc: save and dump
freeing stack trace") enhanced page_owner to also store freeing stack
trace, when debug_pagealloc is also enabled. KASAN would also like to
do this [1] to improve error reports to debug e.g. UAF issues.
Kirill has suggested that the freeing stack trace saving should be also
possible to be enabled separately from KASAN or debug_pagealloc, i.e.
with an extra boot option. Qian argued that we have enough options
already, and avoiding the extra overhead is not worth the complications
in the case of a debugging option. Kirill noted that the extra stack
handle in struct page_owner requires 0.1% of memory.
This patch therefore enables free stack saving whenever page_owner is
enabled, regardless of whether debug_pagealloc or KASAN is also enabled.
KASAN kernels booted with page_owner=on will thus benefit from the
improved error reports.
Vlastimil Babka [Mon, 14 Oct 2019 21:11:40 +0000 (14:11 -0700)]
mm, page_owner: fix off-by-one error in __set_page_owner_handle()
Patch series "followups to debug_pagealloc improvements through
page_owner", v3.
These are followups to [1] which made it to Linus meanwhile. Patches 1
and 3 are based on Kirill's review, patch 2 on KASAN request [2]. It
would be nice if all of this made it to 5.4 with [1] already there (or
at least Patch 1).
This patch (of 3):
As noted by Kirill, commit 7e2f2a0cd17c ("mm, page_owner: record page
owner for each subpage") has introduced an off-by-one error in
__set_page_owner_handle() when looking up page_ext for subpages. As a
result, the head page page_owner info is set twice, while for the last
tail page, it's not set at all.
Fix this and also make the code more efficient by advancing the page_ext
pointer we already have, instead of calling lookup_page_ext() for each
subpage. Since the full size of struct page_ext is not known at compile
time, we can't use a simple page_ext++ statement, so introduce a
page_ext_next() inline function for that.
Max Filippov [Fri, 11 Oct 2019 03:55:35 +0000 (20:55 -0700)]
xtensa: fix type conversion in __get_user_[no]check
__get_user_[no]check uses temporary buffer of type long to store result
of __get_user_size and do sign extension on it when necessary. This
doesn't work correctly for 64-bit data. Fix it by moving temporary
buffer/sign extension logic to __get_user_asm.
Don't do assignment of __get_user_bad result to (x) as it may not always
be integer-compatible now and issue warning even when it's going to be
optimized. Instead do (x) = 0; and call __get_user_bad separately.
Zero initialize __x in __get_user_asm and use '+' constraint for its
assembly argument, so that its value is preserved in error cases. This
may add at most 1 cycle to the fast path, but saves an instruction and
two padding bytes in the fixup section for each use of this macro and
works for both misaligned store and store exception.
Max Filippov [Thu, 10 Oct 2019 02:41:24 +0000 (19:41 -0700)]
xtensa: clean up assembly arguments in uaccess macros
Numeric assembly arguments are hard to understand and assembly code that
uses them is hard to modify. Use named arguments in __check_align_*,
__get_user_asm and __put_user_asm. Modify macro parameter names so that
they don't affect argument names. Use '+' constraint for the [err]
argument instead of having it as both input and output.
Damien Le Moal [Tue, 8 Oct 2019 22:39:54 +0000 (07:39 +0900)]
block: Fix elv_support_iosched()
A BIO based request queue does not have a tag_set, which prevent testing
for the flag BLK_MQ_F_NO_SCHED indicating that the queue does not
require an elevator. This leads to an incorrect initialization of a
default elevator in some cases such as BIO based null_blk
(queue_mode == BIO) with zoned mode enabled as the default elevator in
this case is mq-deadline instead of "none".
Fix this by testing for a NULL queue mq_ops field which indicates that
the queue is BIO based and should not have an elevator.
Sven Schnelle [Tue, 24 Sep 2019 15:01:31 +0000 (17:01 +0200)]
parisc: Remove 32-bit DMA enforcement from sba_iommu
This breaks booting from sata_sil24 with the recent DMA change.
According to James Bottomley this was in to improve performance by
kicking the device into 32 bit descriptors, which are usually more
efficient, especially with older dual descriptor format cards like we
have on parisc systems.
Remove it for now to make DMA working again.
Fixes: dcc02c19cc06 ("sata_sil24: use dma_set_mask_and_coherent") Signed-off-by: Sven Schnelle <[email protected]> Signed-off-by: Helge Deller <[email protected]>
Helge Deller [Fri, 4 Oct 2019 17:23:37 +0000 (19:23 +0200)]
parisc: Fix vmap memory leak in ioremap()/iounmap()
Sven noticed that calling ioremap() and iounmap() multiple times leads
to a vmap memory leak:
vmap allocation for size 4198400 failed:
use vmalloc=<size> to increase size
Jean Delvare [Mon, 14 Oct 2019 19:41:24 +0000 (21:41 +0200)]
firmware: dmi: Fix unlikely out-of-bounds read in save_mem_devices
Before reading the Extended Size field, we should ensure it fits in
the DMI record. There is already a record length check but it does
not cover that field.
It would take a seriously corrupted DMI table to hit that bug, so no
need to worry, but we should still fix it.
Signed-off-by: Jean Delvare <[email protected]> Fixes: 6deae96b42eb ("firmware, DMI: Add function to look up a handle and return DIMM size") Cc: Tony Luck <[email protected]> Cc: Borislav Petkov <[email protected]>
Paul Walmsley [Thu, 10 Oct 2019 22:57:58 +0000 (15:57 -0700)]
riscv: tlbflush: remove confusing comment on local_flush_tlb_all()
Remove a confusing comment on our local_flush_tlb_all()
implementation. Per an internal discussion with Andrew, while it's
true that the fence.i is not necessary, it's not the case that an
sfence.vma implies a fence.i. We also drop the section about
"flush[ing] the entire local TLB" to better align with the language in
section 4.2.1 "Supervisor Memory-Management Fence Instruction" of the
RISC-V Privileged Specification v20190608.
Add a default "stdout-path" to the kernel DTS file, as is present in many
of the board DTS files elsewhere in the kernel tree. With this line
present, earlyconsole can be enabled by simply passing "earlycon" on the
kernel command line. No specific device details are necessary, since the
kernel will use the stdout-path as the default.
Steven Price [Mon, 14 Oct 2019 15:15:15 +0000 (16:15 +0100)]
drm/panfrost: Add missing GPU feature registers
Three feature registers were declared but never actually read from the
GPU. Add THREAD_MAX_THREADS, THREAD_MAX_WORKGROUP_SIZE and
THREAD_MAX_BARRIER_SIZE so that the complete set are available.
Jiri Benc [Wed, 9 Oct 2019 08:31:24 +0000 (10:31 +0200)]
bpf: lwtunnel: Fix reroute supplying invalid dst
The dst in bpf_input() has lwtstate field set. As it is of the
LWTUNNEL_ENCAP_BPF type, lwtstate->data is struct bpf_lwt. When the bpf
program returns BPF_LWT_REROUTE, ip_route_input_noref is directly called on
this skb. This causes invalid memory access, as ip_route_input_slow calls
skb_tunnel_info(skb) that expects the dst->lwstate->data to be
struct ip_tunnel_info. This results to struct bpf_lwt being accessed as
struct ip_tunnel_info.
Drop the dst before calling the IP route input functions (both for IPv4 and
IPv6).
Al Viro [Wed, 9 Oct 2019 19:21:05 +0000 (20:21 +0100)]
xtensa: fix {get,put}_user() for 64bit values
First of all, on short copies __copy_{to,from}_user() return the amount
of bytes left uncopied, *not* -EFAULT. get_user() and put_user() are
expected to return -EFAULT on failure.
Another problem is get_user(v32, (__u64 __user *)p); that should
fetch 64bit value and the assign it to v32, truncating it in process.
Current code, OTOH, reads 8 bytes of data and stores them at the
address of v32, stomping on the 4 bytes that follow v32 itself.
Catalin Marinas [Fri, 4 Oct 2019 13:46:24 +0000 (14:46 +0100)]
kmemleak: Do not corrupt the object_list during clean-up
In case of an error (e.g. memory pool too small), kmemleak disables
itself and cleans up the already allocated metadata objects. However, if
this happens early before the RCU callback mechanism is available,
put_object() skips call_rcu() and frees the object directly. This is not
safe with the RCU list traversal in __kmemleak_do_cleanup().
Change the list traversal in __kmemleak_do_cleanup() to
list_for_each_entry_safe() and remove the rcu_read_{lock,unlock} since
the kmemleak is already disabled at this point. In addition, avoid an
unnecessary metadata object rb-tree look-up since it already has the
struct kmemleak_object pointer.
nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLL
The access to sk->sk_ll_usec should be hidden behind
CONFIG_NET_RX_BUSY_POLL like the definition of sk_ll_usec.
Put access to ->sk_ll_usec behind CONFIG_NET_RX_BUSY_POLL.
Fixes: 1a9460cef5711 ("nvme-tcp: support simple polling") Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Keith Busch <[email protected]>
Keith Busch [Wed, 4 Sep 2019 16:06:11 +0000 (10:06 -0600)]
nvme: Wait for reset state when required
Prevent simultaneous controller disabling/enabling tasks from interfering
with each other through a function to wait until the task successfully
transitioned the controller to the RESETTING state. This ensures disabling
the controller will not be interrupted by another reset path, otherwise
a concurrent reset may leave the controller in the wrong state.
Keith Busch [Fri, 6 Sep 2019 17:23:08 +0000 (11:23 -0600)]
nvme: Prevent resets during paused controller state
A paused controller is doing critical internal activation work in the
background. Prevent subsequent controller resets from occurring during
this period by setting the controller state to RESETTING first. A helper
function, nvme_try_sched_reset_work(), is introduced for these paths so
they may continue with scheduling the reset_work after they've completed
their uninterruptible critical section.
Keith Busch [Thu, 5 Sep 2019 14:09:33 +0000 (08:09 -0600)]
nvme: Restart request timers in resetting state
A controller in the resetting state has not yet completed its recovery
actions. The pci and fc transports were already handling this, so update
the remaining transports to not attempt additional recovery in this
state. Instead, just restart the request timer.
Keith Busch [Tue, 3 Sep 2019 15:22:24 +0000 (09:22 -0600)]
nvme: Remove ADMIN_ONLY state
The admin only state was intended to fence off actions that don't
apply to a non-IO capable controller. The only actual user of this is
the scan_work, and pci was the only transport to ever set this state.
The consequence of having this state is placing an additional burden on
every other action that applies to both live and admin only controllers.
Remove the admin only state and place the admin only burden on the only
place that actually cares: scan_work.
This also prepares to make it easier to temporarily pause a LIVE state
so that we don't need to remember which state the controller had been in
prior to the pause.
Keith Busch [Thu, 5 Sep 2019 13:52:33 +0000 (07:52 -0600)]
nvme-pci: Free tagset if no IO queues
If a controller becomes degraded after a reset, we will not be able to
perform any IO. We currently teardown previously created request
queues and namespaces, but we had kept the unusable tagset. Free
it after all queues using it have been released.
Andy Shevchenko [Fri, 11 Oct 2019 10:22:58 +0000 (13:22 +0300)]
platform/x86: i2c-multi-instantiate: Fail the probe if no IRQ provided
For APIC case of interrupt we don't fail a ->probe() of the driver,
which makes kernel to print a lot of warnings from the children.
We have two options here:
- switch to platform_get_irq_optional(), though it won't stop children
to be probed and failed
- fail the ->probe() of i2c-multi-instantiate
Since the in reality we never had devices in the wild where IRQ resource
is optional, the latter solution suits the best.
Thomas Hellstrom [Thu, 12 Sep 2019 18:38:54 +0000 (20:38 +0200)]
drm/ttm: Restore ttm prefaulting
Commit 4daa4fba3a38 ("gpu: drm: ttm: Adding new return type vm_fault_t")
broke TTM prefaulting. Since vmf_insert_mixed() typically always returns
VM_FAULT_NOPAGE, prefaulting stops after the second PTE.
Restore (almost) the original behaviour. Unfortunately we can no longer
with the new vm_fault_t return type determine whether a prefaulting
PTE insertion hit an already populated PTE, and terminate the insertion
loop. Instead we continue with the pre-determined number of prefaults.
Miaoqing Pan [Thu, 29 Aug 2019 02:45:12 +0000 (10:45 +0800)]
ath10k: fix latency issue for QCA988x
(kvalo: cherry picked from commit 1340cc631bd00431e2f174525c971f119df9efa1 in
wireless-drivers-next to wireless-drivers as this a frequently reported
regression)
Bad latency is found on QCA988x, the issue was introduced by
commit 4504f0e5b571 ("ath10k: sdio: workaround firmware UART
pin configuration bug"). If uart_pin_workaround is false, this
change will set uart pin even if uart_print is false.
Tested HW: QCA9880
Tested FW: 10.2.4-1.0-00037
Fixes: 4504f0e5b571 ("ath10k: sdio: workaround firmware UART pin configuration bug") Signed-off-by: Miaoqing Pan <[email protected]> Signed-off-by: Kalle Valo <[email protected]>
Linus Torvalds [Sun, 13 Oct 2019 21:47:10 +0000 (14:47 -0700)]
Merge tag 'trace-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"A few tracing fixes:
- Remove lockdown from tracefs itself and moved it to the trace
directory. Have the open functions there do the lockdown checks.
- Fix a few races with opening an instance file and the instance
being deleted (Discovered during the lockdown updates). Kept
separate from the clean up code such that they can be backported to
stable easier.
- Clean up and consolidated the checks done when opening a trace
file, as there were multiple checks that need to be done, and it
did not make sense having them done in each open instance.
- Fix a regression in the record mcount code.
- Small hw_lat detector tracer fixes.
- A trace_pipe read fix due to not initializing trace_seq"
* tag 'trace-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Initialize iter->seq after zeroing in tracing_read_pipe()
tracing/hwlat: Don't ignore outer-loop duration when calculating max_latency
tracing/hwlat: Report total time spent in all NMIs during the sample
recordmcount: Fix nop_mcount() function
tracing: Do not create tracefs files if tracefs lockdown is in effect
tracing: Add locked_down checks to the open calls of files created for tracefs
tracing: Add tracing_check_open_get_tr()
tracing: Have trace events system open call tracing_open_generic_tr()
tracing: Get trace_array reference for available_tracers files
ftrace: Get a reference counter for the trace_array on filter files
tracefs: Revert ccbd54ff54e8 ("tracefs: Restrict tracefs when the kernel is locked down")
Reported-by: Hulk Robot <[email protected]> Fixes: 59c84b9fcf42 ("netdevsim: Restore per-network namespace accounting for fib entries") Signed-off-by: YueHaibing <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Cédric Le Goater [Fri, 11 Oct 2019 05:52:54 +0000 (07:52 +0200)]
net/ibmvnic: Fix EOI when running in XIVE mode.
pSeries machines on POWER9 processors can run with the XICS (legacy)
interrupt mode or with the XIVE exploitation interrupt mode. These
interrupt contollers have different interfaces for interrupt
management : XICS uses hcalls and XIVE loads and stores on a page.
H_EOI being a XICS interface the enable_scrq_irq() routine can fail
when the machine runs in XIVE mode.
Fix that by calling the EOI handler of the interrupt chip.
Fixes: f23e0643cd0b ("ibmvnic: Clear pending interrupt after device reset") Signed-off-by: Cédric Le Goater <[email protected]> Signed-off-by: David S. Miller <[email protected]>
====================
tcp: address KCSAN reports in tcp_poll() (part I)
This all started with a KCSAN report (included
in "tcp: annotate tp->rcv_nxt lockless reads" changelog)
tcp_poll() runs in a lockless way. This means that about
all accesses of tcp socket fields done in tcp_poll() context
need annotations otherwise KCSAN will complain about data-races.
While doing this detective work, I found a more serious bug,
addressed by the first patch ("tcp: add rcu protection around
tp->fastopen_rsk").
====================