Dan Williams [Fri, 26 Feb 2021 01:17:08 +0000 (17:17 -0800)]
mm: fix memory_failure() handling of dax-namespace metadata
Given 'struct dev_pagemap' spans both data pages and metadata pages be
careful to consult the altmap if present to delineate metadata. In fact
the pfn_first() helper already identifies the first valid data pfn, so
export that helper for other code paths via pgmap_pfn_valid().
Other usage of get_dev_pagemap() are not a concern because those are
operating on known data pfns having been looked up by get_user_pages().
I.e. metadata pfns are never user mapped.
Dan Williams [Fri, 26 Feb 2021 01:17:05 +0000 (17:17 -0800)]
mm: teach pfn_to_online_page() about ZONE_DEVICE section collisions
While pfn_to_online_page() is able to determine pfn_valid() at subsection
granularity it is not able to reliably determine if a given pfn is also
online if the section is mixes ZONE_{NORMAL,MOVABLE} with ZONE_DEVICE.
This means that pfn_to_online_page() may return invalid @page objects.
For example with a memory map like:
...succeeds when it should fail. When it succeeds it touches an
uninitialized page and may crash or cause other damage (see
dissolve_free_huge_page()).
While the memory map above is contrived via the memmap=ss!nn kernel
command line option, the collision happens in practice on shipping
platforms. The memory controller resources that decode spans of physical
address space are a limited resource. One technique platform-firmware
uses to conserve those resources is to share a decoder across 2 devices to
keep the address range contiguous. Unfortunately the unit of operation of
a decoder is 64MiB while the Linux section size is 128MiB. This results
in situations where, without subsection hotplug memory mappings with
different lifetimes collide into one object that can only express one
lifetime.
Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a
section that mixes ZONE_DEVICE pfns with other online pfns. With
SECTION_TAINT_ZONE_DEVICE to delineate, pfn_to_online_page() can fall back
to a slow-path check for ZONE_DEVICE pfns in an online section. In the
fast path online_section() for a full ZONE_DEVICE section returns false.
Because the collision case is rare, and for simplicity, the
SECTION_TAINT_ZONE_DEVICE flag is never cleared once set.
Dan Williams [Fri, 26 Feb 2021 01:17:01 +0000 (17:17 -0800)]
mm: teach pfn_to_online_page() to consider subsection validity
pfn_to_online_page is primarily used to filter out offline or fully
uninitialized pages. pfn_valid resp. online_section_nr have a coarse
per memory section granularity. If a section shared with a partially
offline memory (e.g. part of ZONE_DEVICE) then pfn_to_online_page
would lead to a false positive on some pfns. Fix this by adding
pfn_section_valid check which is subsection aware.
Dan Williams [Fri, 26 Feb 2021 01:16:57 +0000 (17:16 -0800)]
mm: move pfn_to_online_page() out of line
Patch series "mm: Fix pfn_to_online_page() with respect to ZONE_DEVICE", v4.
A pfn-walker that uses pfn_to_online_page() may inadvertently translate a
pfn as online and in the page allocator, when it is offline managed by a
ZONE_DEVICE mapping (details in Patch 3: ("mm: Teach pfn_to_online_page()
about ZONE_DEVICE section collisions")).
The 2 proposals under consideration are teach pfn_to_online_page() to be
precise in the presence of mixed-zone sections, or teach the memory-add
code to drop the System RAM associated with ZONE_DEVICE collisions. In
order to not regress memory capacity by a few 10s to 100s of MiB the
approach taken in this set is to add precision to pfn_to_online_page().
In the course of validating pfn_to_online_page() a couple other fixes
fell out:
1/ soft_offline_page() fails to drop the reference taken in the
madvise(..., MADV_SOFT_OFFLINE) case.
2/ memory_failure() uses get_dev_pagemap() to lookup ZONE_DEVICE pages,
however that mapping may contain data pages and metadata raw pfns.
Introduce pgmap_pfn_valid() to delineate the 2 types and fail the
handling of raw metadata pfns.
This patch (of 4);
pfn_to_online_page() is already too large to be a macro or an inline
function. In anticipation of further logic changes / growth, move it out
of line.
Jiang Biao [Fri, 26 Feb 2021 01:16:54 +0000 (17:16 -0800)]
mm/vmstat.c: erase latency in vmstat_shepherd
Many 100us+ latencies have been deteceted in vmstat_shepherd() on CPX
platform which has 208 logic cpus. And vmstat_shepherd is queued every
second, which could make the case worse.
Add schedule point in vmstat_shepherd() to erase the latency.
Johannes Weiner [Fri, 26 Feb 2021 01:16:51 +0000 (17:16 -0800)]
mm: vmstat: add some comments on internal storage of byte items
Byte-accounted items are used for slab object accounting at the cgroup
level, because the objects in a slab page can belong to different cgroups.
At the global level these items always change in multiples of whole slab
pages. The vmstat code exploits this and stores these items as pages
internally, which allows for more compact per-cpu data.
This optimization isn't self-evident from the asserts and the division in
the stat update functions. Provide the reader with some context.
Johannes Weiner [Fri, 26 Feb 2021 01:16:47 +0000 (17:16 -0800)]
mm: vmstat: fix NOHZ wakeups for node stat changes
On NOHZ, the periodic vmstat flushers on each CPU can go to sleep and
won't wake up until stat changes are detected in the per-cpu deltas of the
zone vmstat counters.
In commit 75ef71840539 ("mm, vmstat: add infrastructure for per-node
vmstats") per-node counters were introduced, and subsequently most stats
were moved from the zone to the node level. However, the node counters
weren't added to the NOHZ wakeup detection.
In theory this can cause per-cpu errors to remain in the user-reported
stats indefinitely. In practice this only affects a handful of sub
counters (file_mapped, dirty and writeback e.g.) because other page state
changes at the node level likely involve a change at the zone level as
well (alloc and free, lru ops). Also, nobody has complained.
Fix it up for completeness: wake up vmstat refreshing on node changes.
Also remove the BUILD_BUG_ONs that assert counter size; we haven't relied
on it since we added sizeof() to the range calculation in commit 13c9aaf7fa01 ("mm/vmstat.c: fix NUMA statistics updates").
mm/page_alloc: count CMA pages per zone and print them in /proc/zoneinfo
Let's count the number of CMA pages per zone and print them in
/proc/zoneinfo.
Having access to the total number of CMA pages per zone is helpful for
debugging purposes to know where exactly the CMA pages ended up, and to
figure out how many pages of a zone might behave differently, even after
some of these pages might already have been allocated.
As one example, CMA pages part of a kernel zone cannot be used for
ordinary kernel allocations but instead behave more like ZONE_MOVABLE.
For now, we are only able to get the global nr+free cma pages from
/proc/meminfo and the free cma pages per zone from /proc/zoneinfo.
Example after this patch when booting a 6 GiB QEMU VM with
"hugetlb_cma=2G":
# cat /proc/zoneinfo | grep cma
cma 0
nr_free_cma 0
cma 0
nr_free_cma 0
cma 524288
nr_free_cma 493016
cma 0
cma 0
# cat /proc/meminfo | grep Cma
CmaTotal: 2097152 kB
CmaFree: 1972064 kB
Note: We print even without CONFIG_CMA, just like "nr_free_cma"; this way,
one can be sure when spotting "cma 0", that there are definetly no
CMA pages located in a zone.
mm/cma: expose all pages to the buddy if activation of an area fails
Right now, if activation fails, we might already have exposed some pages
to the buddy for CMA use (although they will never get actually used by
CMA), and some pages won't be exposed to the buddy at all.
Let's check for "single zone" early and on error, don't expose any pages
for CMA use - instead, expose them to the buddy available for any use.
Simply call free_reserved_page() on every single page - easier than going
via free_reserved_area(), converting back and forth between pfns and virt
addresses.
In addition, make sure to fixup totalcma_pages properly.
Example: 6 GiB QEMU VM with "... hugetlb_cma=2G movablecore=20% ...":
[ 0.006891] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
[ 0.006893] cma: Reserved 2048 MiB at 0x0000000100000000
[ 0.006893] hugetlb_cma: reserved 2048 MiB on node 0
...
[ 0.175433] cma: CMA area hugetlb0 could not be activated
Roman Gushchin [Fri, 26 Feb 2021 01:16:33 +0000 (17:16 -0800)]
mm: cma: allocate cma areas bottom-up
Currently cma areas without a fixed base are allocated close to the end of
the node. This placement is sub-optimal because of compaction: it brings
pages into the cma area. In particular, it can bring in hot executable
pages, even if there is a plenty of free memory on the machine. This
results in cma allocation failures.
Instead let's place cma areas close to the beginning of a node. In this
case the compaction will help to free cma areas, resulting in better cma
allocation success rates.
If there is enough memory let's try to allocate bottom-up starting with
4GB to exclude any possible interference with DMA32. On smaller machines
or in a case of a failure, stick with the old behavior.
16GB vm, 2GB cma area:
With this patch:
[ 0.000000] Command line: root=/dev/vda3 rootflags=subvol=/root systemd.unified_cgroup_hierarchy=1 enforcing=0 console=ttyS0,115200 hugetlb_cma=2G
[ 0.002928] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
[ 0.002930] cma: Reserved 2048 MiB at 0x0000000100000000
[ 0.002931] hugetlb_cma: reserved 2048 MiB on node 0
Without this patch:
[ 0.000000] Command line: root=/dev/vda3 rootflags=subvol=/root systemd.unified_cgroup_hierarchy=1 enforcing=0 console=ttyS0,115200 hugetlb_cma=2G
[ 0.002930] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
[ 0.002933] cma: Reserved 2048 MiB at 0x00000003c0000000
[ 0.002934] hugetlb_cma: reserved 2048 MiB on node 0
v2:
- switched to memblock_set_bottom_up(true), by Mike
- start with 4GB, by Mike
Rik van Riel [Fri, 26 Feb 2021 01:16:29 +0000 (17:16 -0800)]
mm,shmem,thp: limit shmem THP allocations to requested zones
Hugh pointed out that the gma500 driver uses shmem pages, but needs to
limit them to the DMA32 zone. Ensure the allocations resulting from the
gfp_mask returned by limit_gfp_mask use the zone flags that were
originally passed to shmem_getpage_gfp.
Rik van Riel [Fri, 26 Feb 2021 01:16:25 +0000 (17:16 -0800)]
mm,thp,shmem: make khugepaged obey tmpfs mount flags
Currently if thp enabled=[madvise], mounting a tmpfs filesystem with
huge=always and mmapping files from that tmpfs does not result in
khugepaged collapsing those mappings, despite the mount flag indicating
that it should.
Fix that by breaking up the blocks of tests in hugepage_vma_check a little
bit, and testing things in the correct order.
Rik van Riel [Fri, 26 Feb 2021 01:16:22 +0000 (17:16 -0800)]
mm,thp,shm: limit gfp mask to no more than specified
Matthew Wilcox pointed out that the i915 driver opportunistically
allocates tmpfs memory, but will happily reclaim some of its pool if no
memory is available.
Make sure the gfp mask used to opportunistically allocate a THP is always
at least as restrictive as the original gfp mask.
Rik van Riel [Fri, 26 Feb 2021 01:16:18 +0000 (17:16 -0800)]
mm,thp,shmem: limit shmem THP alloc gfp_mask
Patch series "mm,thp,shm: limit shmem THP alloc gfp_mask", v6.
The allocation flags of anonymous transparent huge pages can be controlled
through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
help the system from getting bogged down in the page reclaim and
compaction code when many THPs are getting allocated simultaneously.
However, the gfp_mask for shmem THP allocations were not limited by those
configuration settings, and some workloads ended up with all CPUs stuck on
the LRU lock in the page reclaim code, trying to allocate dozens of THPs
simultaneously.
This patch applies the same configurated limitation of THPs to shmem
hugepage allocations, to prevent that from happening.
This way a THP defrag setting of "never" or "defer+madvise" will result in
quick allocation failures without direct reclaim when no 2MB free pages
are available.
With this patch applied, THP allocations for tmpfs will be a little more
aggressive than today for files mmapped with MADV_HUGEPAGE, and a little
less aggressive for files that are not mmapped or mapped without that
flag.
This patch (of 4):
The allocation flags of anonymous transparent huge pages can be controlled
through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
help the system from getting bogged down in the page reclaim and
compaction code when many THPs are getting allocated simultaneously.
However, the gfp_mask for shmem THP allocations were not limited by those
configuration settings, and some workloads ended up with all CPUs stuck on
the LRU lock in the page reclaim code, trying to allocate dozens of THPs
simultaneously.
This patch applies the same configurated limitation of THPs to shmem
hugepage allocations, to prevent that from happening.
Controlling the gfp_mask of THP allocations through the knobs in sysfs
allows users to determine the balance between how aggressively the system
tries to allocate THPs at fault time, and how much the application may end
up stalling attempting those allocations.
This way a THP defrag setting of "never" or "defer+madvise" will result in
quick allocation failures without direct reclaim when no 2MB free pages
are available.
With this patch applied, THP allocations for tmpfs will be a little more
aggressive than today for files mmapped with MADV_HUGEPAGE, and a little
less aggressive for files that are not mmapped or mapped without that
flag.
mm: add an 'end' parameter to pagevec_lookup_entries
Simplifies the callers and uses the existing functionality in
find_get_entries(). We can also drop the final argument of
truncate_exceptional_pvec_entries() and simplify the logic in that
function.
We have three functions (shmem_undo_range(), truncate_inode_pages_range()
and invalidate_mapping_pages()) which want exactly this function, so add
it to filemap.c. Before this patch, shmem_undo_range() would split any
compound page which overlaps either end of the range being punched in both
the first and second loops through the address space. After this patch,
that functionality is left for the second loop, which is arguably more
appropriate since the first loop is supposed to run through all the pages
quickly, and splitting a page can sleep.
There is a lot of common code in find_get_entries(),
find_get_pages_range() and find_get_pages_range_tag(). Factor out
find_get_entry() which simplifies all three functions.
The functionality of find_lock_entry() and find_get_entry() can be
provided by pagecache_get_page(), which lets us delete find_lock_entry()
and make find_get_entry() static.
mm/shmem: use pagevec_lookup in shmem_unlock_mapping
The comment shows that the reason for using find_get_entries() is now
stale; find_get_pages() will not return 0 if it hits a consecutive run of
swap entries, and I don't believe it has since 2011. pagevec_lookup() is
a simpler function to use than find_get_pages(), so use it instead.
mm: make pagecache tagged lookups return only head pages
Patch series "Overhaul multi-page lookups for THP", v4.
This THP prep patchset changes several page cache iteration APIs to only
return head pages.
- It's only possible to tag head pages in the page cache, so only
return head pages, not all their subpages.
- Factor a lot of common code out of the various batch lookup routines
- Add mapping_seek_hole_data()
- Unify find_get_entries() and pagevec_lookup_entries()
- Make find_get_entries only return head pages, like find_get_entry().
These are only loosely connected, but they seem to make sense together as
a series.
This patch (of 14):
Pagecache tags are used for dirty page writeback. Since dirtiness is
tracked on a per-THP basis, we only want to return the head page rather
than each subpage of a tagged page. All the filesystems which use huge
pages today are in-memory, so there are no tagged huge pages today.
Linus Torvalds [Fri, 26 Feb 2021 17:17:24 +0000 (09:17 -0800)]
Merge tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS Client Updates from Anna Schumaker:
"New Features:
- Support for eager writes, and the write=eager and write=wait mount
options
- Other Bugfixes and Cleanups:
- Fix typos in some comments
- Fix up fall-through warnings for Clang
- Cleanups to the NFS readpage codepath
- Remove FMR support in rpcrdma_convert_iovs()
- Various other cleanups to xprtrdma
- Fix xprtrdma pad optimization for servers that don't support
RFC 8797
- Improvements to rpcrdma tracepoints
- Fix up nfs4_bitmask_adjust()
- Optimize sparse writes past the end of files"
* tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (27 commits)
NFS: Support the '-owrite=' option in /proc/self/mounts and mountinfo
NFS: Set the stable writes flag when initialising the super block
NFS: Add mount options supporting eager writes
NFS: Add support for eager writes
NFS: 'flags' field should be unsigned in struct nfs_server
NFS: Don't set NFS_INO_INVALID_XATTR if there is no xattr cache
NFS: Always clear an invalid mapping when attempting a buffered write
NFS: Optimise sparse writes past the end of file
NFS: Fix documenting comment for nfs_revalidate_file_size()
NFSv4: Fixes for nfs4_bitmask_adjust()
xprtrdma: Clean up rpcrdma_prepare_readch()
rpcrdma: Capture bytes received in Receive completion tracepoints
xprtrdma: Pad optimization, revisited
rpcrdma: Fix comments about reverse-direction operation
xprtrdma: Refactor invocations of offset_in_page()
xprtrdma: Simplify rpcrdma_convert_kvec() and frwr_map()
xprtrdma: Remove FMR support in rpcrdma_convert_iovs()
NFS: Add nfs_pageio_complete_read() and remove nfs_readpage_async()
NFS: Call readpage_async_filler() from nfs_readpage_async()
NFS: Refactor nfs_readpage() and nfs_readpage_async() to use nfs_readdesc
...
Martin Radev [Tue, 12 Jan 2021 15:07:29 +0000 (16:07 +0100)]
swiotlb: Validate bounce size in the sync/unmap path
The size of the buffer being bounced is not checked if it happens
to be larger than the size of the mapped buffer. Because the size
can be controlled by a device, as it's the case with virtio devices,
this can lead to memory corruption.
This patch saves the remaining buffer memory for each slab and uses
that information for validation in the sync/unmap paths before
swiotlb_bounce is called.
Validating this argument is important under the threat models of
AMD SEV-SNP and Intel TDX, where the HV is considered untrusted.
Jianxiong Gao [Mon, 1 Feb 2021 18:30:17 +0000 (10:30 -0800)]
nvme-pci: set min_align_mask
The PRP addressing scheme requires all PRP entries except for the
first one to have a zero offset into the NVMe controller pages (which
can be different from the Linux PAGE_SIZE). Use the min_align_mask
device parameter to ensure that swiotlb does not change the address
of the buffer modulo the device page size to ensure that the PRPs
won't be malformed.
Respect the min_align_mask in struct device_dma_parameters in swiotlb.
There are two parts to it:
1) for the lower bits of the alignment inside the io tlb slot, just
extent the size of the allocation and leave the start of the slot
empty
2) for the high bits ensure we find a slot that matches the high bits
of the alignment to avoid wasting too much memory
Ira Weiny [Wed, 10 Feb 2021 06:22:20 +0000 (22:22 -0800)]
btrfs: use copy_highpage() instead of 2 kmaps()
There are many places where kmap/memove/kunmap patterns occur.
This pattern exists in the core common function copy_highpage().
Use copy_highpage to avoid open coding the use of kmap and leverages the
core functions use of kmap_local_page().
Development of this patch was aided by the following coccinelle script:
// <smpl>
// SPDX-License-Identifier: GPL-2.0-only
// Find kmap/copypage/kunmap pattern and replace with copy_highpage calls
//
// NOTE: The expressions in the copy page version of this kmap pattern are
// overly complex and so these all need individual attention.
//
// Confidence: Low
// Copyright: (C) 2021 Intel Corporation
// URL: http://coccinelle.lip6.fr/
// Comments:
// Options:
//
// Then a copy_page where we have 2 pages involved.
//
@ copy_page_rule @
expression page, page2, To, From, Size;
identifier ptr, ptr2;
type VP, VP2;
@@
// Remove any pointers left unused
@
depends on copy_page_rule
@
identifier copy_page_rule.ptr;
identifier copy_page_rule.ptr2;
type VP, VP1;
type VP2, VP21;
@@
-VP ptr;
... when != ptr;
? VP1 ptr;
-VP2 ptr2;
... when != ptr2;
? VP21 ptr2;
Ira Weiny [Wed, 10 Feb 2021 06:22:19 +0000 (22:22 -0800)]
btrfs: use memcpy_[to|from]_page() and kmap_local_page()
There are many places where the pattern kmap/memcpy/kunmap occurs.
This pattern was lifted to the core common functions
memcpy_[to|from]_page().
Use these new functions to reduce the code, eliminate direct uses of
kmap, and leverage the new core functions use of kmap_local_page().
Also, there is 1 place where a kmap/memcpy is followed by an
optional memset. Here we leave the kmap open coded to avoid remapping
the page but use kmap_local_page() directly.
Development of this patch was aided by the coccinelle script:
// <smpl>
// SPDX-License-Identifier: GPL-2.0-only
// Find kmap/memcpy/kunmap pattern and replace with memcpy*page calls
//
// NOTE: Offsets and other expressions may be more complex than what the script
// will automatically generate. Therefore a catchall rule is provided to find
// the pattern which then must be evaluated by hand.
//
// Confidence: Low
// Copyright: (C) 2021 Intel Corporation
// URL: http://coccinelle.lip6.fr/
// Comments:
// Options:
//
// simple memcpy version
//
@ memcpy_rule1 @
expression page, T, F, B, Off;
identifier ptr;
type VP;
@@
Mårten Lindahl [Tue, 16 Feb 2021 22:25:38 +0000 (23:25 +0100)]
i2c: exynos5: Preserve high speed master code
When the driver starts to send a message with the MASTER_ID field
set (high speed), the whole I2C_ADDR register is overwritten including
MASTER_ID as the SLV_ADDR_MAS field is set.
This patch preserves already written fields in I2C_ADDR when writing
SLV_ADDR_MAS.
Fixes: 8a73cd4cfa15 ("i2c: exynos5: add High Speed I2C controller driver") Signed-off-by: Mårten Lindahl <[email protected]> Reviewed-by: Krzysztof Kozlowski <[email protected]> Tested-by: Krzysztof Kozlowski <[email protected]> Signed-off-by: Wolfram Sang <[email protected]>
Liguang Zhang [Thu, 25 Feb 2021 14:26:31 +0000 (22:26 +0800)]
i2c: designware: Get right data length
IC_DATA_CMD[11] indicates the first data byte received after the address
phase for receive transfer in Master receiver or Slave receiver mode,
this bit was set in some transfer flow. IC_DATA_CMD[7:0] contains the
data to be transmitted or received on the I2C bus, so we should use the
lower 8 bits to get the real data length.
Maxime Ripard [Thu, 25 Feb 2021 16:11:01 +0000 (17:11 +0100)]
i2c: brcmstb: Fix brcmstd_send_i2c_cmd condition
The brcmstb_send_i2c_cmd currently has a condition that is (CMD_RD ||
CMD_WR) which always evaluates to true, while the obvious fix is to test
whether the cmd variable passed as parameter holds one of these two
values.
KVM: x86/mmu: Set SPTE_AD_WRPROT_ONLY_MASK if and only if PML is enabled
Check that PML is actually enabled before setting the mask to force a
SPTE to be write-protected. The bits used for the !AD_ENABLED case are
in the upper half of the SPTE. With 64-bit paging and EPT, these bits
are ignored, but with 32-bit PAE paging they are reserved. Setting them
for L2 SPTEs without checking PML breaks NPT on 32-bit KVM.
Hyper-V context is lazily allocated until Hyper-V specific MSRs are accessed
or SynIC is enabled. However, the syzkaller testcase sets irq routing table
directly w/o enabling SynIC. This results in null-ptr-deref when accessing
SynIC Hyper-V context. This patch fixes it.
Chenyi Qiang [Fri, 26 Feb 2021 07:55:41 +0000 (15:55 +0800)]
KVM: Documentation: rectify rst markup in kvm_run->flags
Commit c32b1b896d2a ("KVM: X86: Add the Document for
KVM_CAP_X86_BUS_LOCK_EXIT") added a new flag in kvm_run->flags
documentation, and caused warning in make htmldocs:
Paolo Bonzini [Thu, 25 Feb 2021 13:37:01 +0000 (08:37 -0500)]
Documentation: kvm: fix messy conversion from .txt to .rst
Building the documentation gives a warning that the KVM_PPC_RESIZE_HPT_PREPARE
label is defined twice. The root cause is that the KVM_PPC_RESIZE_HPT_PREPARE
API is present twice, the second being a mix of the prepare and commit APIs.
Fix it.
David Howells [Thu, 4 Feb 2021 06:15:21 +0000 (00:15 -0600)]
cifs: use discard iterator to discard unneeded network data more efficiently
The iterator, ITER_DISCARD, that can only be used in READ mode and
just discards any data copied to it, was added to allow a network
filesystem to discard any unwanted data sent by a server.
Convert cifs_discard_from_socket() to use this.
vmlinux.lds.h: Define SANTIZER_DISCARDS with CONFIG_GCOV_KERNEL=y
clang produces .eh_frame sections when CONFIG_GCOV_KERNEL is enabled,
even when -fno-asynchronous-unwind-tables is in KBUILD_CFLAGS:
$ make CC=clang vmlinux
...
ld: warning: orphan section `.eh_frame' from `init/main.o' being placed in section `.eh_frame'
ld: warning: orphan section `.eh_frame' from `init/version.o' being placed in section `.eh_frame'
ld: warning: orphan section `.eh_frame' from `init/do_mounts.o' being placed in section `.eh_frame'
ld: warning: orphan section `.eh_frame' from `init/do_mounts_initrd.o' being placed in section `.eh_frame'
ld: warning: orphan section `.eh_frame' from `init/initramfs.o' being placed in section `.eh_frame'
ld: warning: orphan section `.eh_frame' from `init/calibrate.o' being placed in section `.eh_frame'
ld: warning: orphan section `.eh_frame' from `init/init_task.o' being placed in section `.eh_frame'
...
This was already handled for a couple of other options in
commit d812db78288d ("vmlinux.lds.h: Avoid KASAN and KCSAN's unwanted
sections") and there is an open LLVM bug for this issue. Take advantage
of that section for this config as well so that there are no more orphan
warnings.
Linus Torvalds [Thu, 25 Feb 2021 20:21:08 +0000 (12:21 -0800)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
- new vdpa features to allow creation and deletion of new devices
- virtio-blk support per-device queue depth
- fixes, cleanups all over the place
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (31 commits)
virtio-input: add multi-touch support
virtio_mmio: fix one typo
vdpa/mlx5: fix param validation in mlx5_vdpa_get_config()
virtio_net: Fix fall-through warnings for Clang
virtio_input: Prevent EV_MSC/MSC_TIMESTAMP loop storm for MT.
virtio-blk: support per-device queue depth
virtio_vdpa: don't warn when fail to disable vq
virtio-pci: introduce modern device module
virito-pci-modern: rename map_capability() to vp_modern_map_capability()
virtio-pci-modern: introduce helper to get notification offset
virtio-pci-modern: introduce helper for getting queue nums
virtio-pci-modern: introduce helper for setting/geting queue size
virtio-pci-modern: introduce helper to set/get queue_enable
virtio-pci-modern: introduce vp_modern_queue_address()
virtio-pci-modern: introduce vp_modern_set_queue_vector()
virtio-pci-modern: introduce vp_modern_generation()
virtio-pci-modern: introduce helpers for setting and getting features
virtio-pci-modern: introduce helpers for setting and getting status
virtio-pci-modern: introduce helper to set config vector
virtio-pci-modern: introduce vp_modern_remove()
...
Masahiro Yamada [Thu, 25 Feb 2021 19:39:12 +0000 (04:39 +0900)]
kbuild: Move .thinlto-cache removal to 'make clean'
Instead of 'make distclean', 'make clean' should remove build artifacts
unneeded by external module builds. Obviously, you do not need to keep
this directory.
parisc uses -fpatchable-function-entry with dynamic ftrace, which means we
don't need recordmcount. Select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY
to tell that to the build system.
Linus Torvalds [Thu, 25 Feb 2021 20:18:21 +0000 (12:18 -0800)]
Merge tag 'mips_5.12_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull more MIPS updates from Thomas Bogendoerfer:
- added n64 block driver
- fix for ubsan warnings
- fix for bcm63xx platform
- update of linux-mips mailinglist
* tag 'mips_5.12_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
arch: mips: update references to current linux-mips list
mips: bmips: init clocks earlier
vmlinux.lds.h: catch even more instrumentation symbols into .data
n64: store dev instance into disk private data
n64: cleanup n64cart_probe()
n64: cosmetics changes
n64: remove curly brackets
n64: use sector SECTOR_SHIFT instead 512
n64: use enums for reg
n64: move module param at the top
n64: move module info at the end
n64: use pr_fmt to avoid duplicate string
block: Add n64 cart driver
Linus Torvalds [Thu, 25 Feb 2021 20:10:22 +0000 (12:10 -0800)]
Merge tag 'drm-next-2021-02-26' of git://anongit.freedesktop.org/drm/drm
Pull more drm updates from Dave Airlie:
"This is mostly fixes but I missed msm-next pull last week. It's been
in drm-next.
Otherwise it's a selection of i915, amdgpu and misc fixes, one TTM
memory leak, nothing really major stands out otherwise.
core:
- vblank fence timing improvements
dma-buf:
- improve error handling
ttm:
- memory leak fix
msm:
- a6xx speedbin support
- a508, a509, a512 support
- various a5xx fixes
- various dpu fixes
- qseed3lite support for sm8250
- dsi fix for msm8994
- mdp5 fix for framerate bug with cmd mode panels
- a6xx GMU OOB race fixes that were showing up in CI
- various addition and removal of semicolons
- gem submit fix for legacy userspace relocs path
i915:
- color format fix
- -Wuninitialised reenabled
- GVT ww locking, cmd parser fixes
atyfb:
- fix build
rockchip:
- AFBC modifier fix"
* tag 'drm-next-2021-02-26' of git://anongit.freedesktop.org/drm/drm: (60 commits)
drm/panel: kd35t133: allow using non-continuous dsi clock
drm/rockchip: Require the YTR modifier for AFBC
drm/ttm: Fix a memory leak
drm/drm_vblank: set the dma-fence timestamp during send_vblank_event
dma-fence: allow signaling drivers to set fence timestamp
dma-buf: heaps: Rework heap allocation hooks to return struct dma_buf instead of fd
dma-buf: system_heap: Make sure to return an error if we abort
drm/amd/display: Fix system hang after multiple hotplugs (v3)
drm/amdgpu: fix shutdown and poweroff process failed with s0ix
drm/i915: Nuke INTEL_OUTPUT_FORMAT_INVALID
drm/i915: Enable -Wuninitialized
drm/amd/display: Remove Assert from dcn10_get_dig_frontend
drm/amd/display: Add vupdate_no_lock interrupts for DCN2.1
Revert "drm/amd/display: reuse current context instead of recreating one"
drm/amd/pm/swsmu: Avoid using structure_size uninitialized in smu_cmn_init_soft_gpu_metrics
fbdev: atyfb: add stubs for aty_{ld,st}_lcd()
drm/i915/gvt: Introduce per object locking in GVT scheduler.
drm/i915/gvt: Purge dev_priv->gt
drm/i915/gvt: Parse default state to update reg whitelist
dt-bindings: dp-connector: Drop maxItems from -supply
...
Linus Torvalds [Thu, 25 Feb 2021 20:06:25 +0000 (12:06 -0800)]
Merge tag 'net-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Rather small batch this time.
Current release - regressions:
- bcm63xx_enet: fix sporadic kernel panic due to queue length
mis-accounting
Current release - new code bugs:
- bcm4908_enet: fix RX path possible mem leak
- bcm4908_enet: fix NAPI poll returned value
- stmmac: fix missing spin_lock_init in visconti_eth_dwmac_probe()
- sched: cls_flower: validate ct_state for invalid and reply flags
Previous releases - regressions:
- net: introduce CAN specific pointer in the struct net_device to
prevent mis-interpreting memory
- phy: micrel: set soft_reset callback to genphy_soft_reset for
KSZ8081
- psample: fix netlink skb length with tunnel info
Previous releases - always broken:
- icmp: pass zeroed opts from icmp{,v6}_ndo_send before sending
- wireguard: device: do not generate ICMP for non-IP packets
- mptcp: provide subflow aware release function to avoid a mem leak
- hsr: add support for EntryForgetTime
- r8169: fix jumbo packet handling on RTL8168e
- octeontx2-af: fix an off by one in rvu_dbg_qsize_write()
- i40e: fix flow for IPv6 next header (extension header)
- phy: icplus: call phy_restore_page() when phy_select_page() fails
- dpaa_eth: fix the access method for the dpaa_napi_portal"
* tag 'net-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (55 commits)
r8169: fix jumbo packet handling on RTL8168e
net: phy: micrel: set soft_reset callback to genphy_soft_reset for KSZ8081
net: psample: Fix netlink skb length with tunnel info
net: broadcom: bcm4908_enet: fix NAPI poll returned value
net: broadcom: bcm4908_enet: fix RX path possible mem leak
net: hsr: add support for EntryForgetTime
net: dsa: sja1105: Remove unneeded cast in sja1105_crc32()
ibmvnic: fix a race between open and reset
net: stmmac: Fix missing spin_lock_init in visconti_eth_dwmac_probe()
net: introduce CAN specific pointer in the struct net_device
net: usb: qmi_wwan: support ZTE P685M modem
wireguard: kconfig: use arm chacha even with no neon
wireguard: queueing: get rid of per-peer ring buffers
wireguard: device: do not generate ICMP for non-IP packets
wireguard: peer: put frequently used members above cache lines
wireguard: selftests: test multiple parallel streams
wireguard: socket: remove bogus __be32 annotation
wireguard: avoid double unlikely() notation when using IS_ERR()
net: qrtr: Fix memory leak in qrtr_tun_open
vxlan: move debug check after netdev unregister
...
Andrew Donnellan [Thu, 25 Feb 2021 06:08:57 +0000 (17:08 +1100)]
docs: powerpc: Fix tables in syscall64-abi.rst
Commit 209b44c804c ("docs: powerpc: syscall64-abi.rst: fix a malformed
table") attempted to fix the formatting of tables in syscall64-abi.rst, but
inadvertently changed some register names.
Redo the tables with the correct register names, and while we're here,
clean things up to separate the registers into different rows and add
headings.
Linus Torvalds [Thu, 25 Feb 2021 20:03:13 +0000 (12:03 -0800)]
Merge tag 'acpi-5.12-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more ACPI updates from Rafael Wysocki:
"These make additional changes to the platform profile interface merged
recently and add support for the FPDT ACPI table.
Specifics:
- Rearrange Kconfig handling of ACPI_PLATFORM_PROFILE, add
"balanced-performance" to the list of supported platform profiles
and fix up some file references in a comment (Maximilian Luz).
- Add support for parsing the ACPI Firmware Performance Data Table
(FPDT) and exposing the data from there via sysfs (Zhang Rui)"
* tag 'acpi-5.12-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: platform: Add balanced-performance platform profile
ACPI: platform: Fix file references in comment
ACPI: platform: Hide ACPI_PLATFORM_PROFILE option
ACPI: tables: introduce support for FPDT table
Dave Airlie [Thu, 25 Feb 2021 19:07:24 +0000 (05:07 +1000)]
Merge tag 'drm-intel-next-fixes-2021-02-25' of git://anongit.freedesktop.org/drm/drm-intel into drm-next
A fix for color format check from Ville, plus the re-enable of -Wuninitialized
from Nathan, and the GVT fixes including fixes for ww locking, cmd parser and
a general cleanup of dev_priv->gt.
Dave Airlie [Thu, 25 Feb 2021 18:42:51 +0000 (04:42 +1000)]
Merge tag 'drm-misc-next-fixes-2021-02-25' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
drm-misc-next tasty fixes for v5.12:
- Cherry pick of drm-misc-fixes pull:
"here's this week's PR for drm-misc-fixes. One of the patches is a memory
leak; the rest is for hardware issues."
- Fix dt bindings for dp connector.
- Fix build error in atyfb.
- Improve error handling for dma-buf heaps.
- Make vblank timestamp more correct, by recording timestamp to be set when signaling.
Paulo Alcantara [Wed, 24 Feb 2021 23:59:23 +0000 (20:59 -0300)]
cifs: introduce helper for finding referral server to improve DFS target resolution
Some servers seem to mistakenly report different values for
capabilities and share flags, so we can't always rely on those values
to decide whether the resolved target can handle any new DFS
referrals.
Add a new helper is_referral_server() to check if all resolved targets
can handle new DFS referrals by directly looking at the
GET_DFS_REFERRAL.ReferralHeaderFlags value as specified in MS-DFSC
2.2.4 RESP_GET_DFS_REFERRAL in addition to is_tcon_dfs().
Paulo Alcantara [Wed, 24 Feb 2021 23:59:22 +0000 (20:59 -0300)]
cifs: check all path components in resolved dfs target
Handle the case where a resolved target share is like
//server/users/dir, and the user "foo" has no read permission to
access the parent folder "users" but has access to the final path
component "dir".
is_path_remote() already implements that, so call it directly.
Linus Torvalds [Thu, 25 Feb 2021 18:17:31 +0000 (10:17 -0800)]
Merge tag 'kbuild-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Fix false-positive build warnings for ARCH=ia64 builds
- Optimize dictionary size for module compression with xz
- Check the compiler and linker versions in Kconfig
- Fix misuse of extra-y
- Support DWARF v5 debug info
- Clamp SUBLEVEL to 255 because stable releases 4.4.x and 4.9.x
exceeded the limit
- Add generic syscall{tbl,hdr}.sh for cleanups across arches
- Minor cleanups of genksyms
- Minor cleanups of Kconfig
* tag 'kbuild-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (38 commits)
initramfs: Remove redundant dependency of RD_ZSTD on BLK_DEV_INITRD
kbuild: remove deprecated 'always' and 'hostprogs-y/m'
kbuild: parse C= and M= before changing the working directory
kbuild: reuse this-makefile to define abs_srctree
kconfig: unify rule of config, menuconfig, nconfig, gconfig, xconfig
kconfig: omit --oldaskconfig option for 'make config'
kconfig: fix 'invalid option' for help option
kconfig: remove dead code in conf_askvalue()
kconfig: clean up nested if-conditionals in check_conf()
kconfig: Remove duplicate call to sym_get_string_value()
Makefile: Remove # characters from compiler string
Makefile: reuse CC_VERSION_TEXT
kbuild: check the minimum linker version in Kconfig
kbuild: remove ld-version macro
scripts: add generic syscallhdr.sh
scripts: add generic syscalltbl.sh
arch: syscalls: remove $(srctree)/ prefix from syscall tables
arch: syscalls: add missing FORCE and fix 'targets' to make if_changed work
gen_compile_commands: prune some directories
kbuild: simplify access to the kernel's version
...
Paulo Alcantara [Wed, 24 Feb 2021 23:59:21 +0000 (20:59 -0300)]
cifs: fix DFS failover
In do_dfs_failover(), the mount_get_conns() function requires the full
fs context in order to get new connection to server, so clone the
original context and change it accordingly when retrying the DFS
targets in the referral.
If failover was successful, then update original context with the new
UNC, prefix path and ip address.
Ronnie Sahlberg [Thu, 25 Feb 2021 07:36:27 +0000 (17:36 +1000)]
cifs: fix handling of escaped ',' in the password mount argument
Passwords can contain ',' which are also used as the separator between
mount options. Mount.cifs will escape all ',' characters as the string ",,".
Update parsing of the mount options to detect ",," and treat it as a single
'c' character.
Linus Torvalds [Thu, 25 Feb 2021 18:06:55 +0000 (10:06 -0800)]
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Miscellaneous ext4 cleanups and bug fixes. Pretty boring this cycle..."
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: add .kunitconfig fragment to enable ext4-specific tests
ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it
ext4: reset retry counter when ext4_alloc_file_blocks() makes progress
ext4: fix potential htree index checksum corruption
ext4: factor out htree rep invariant check
ext4: Change list_for_each* to list_for_each_entry*
ext4: don't try to processed freed blocks until mballoc is initialized
ext4: use DEFINE_MUTEX() for mutex lock
Linus Torvalds [Thu, 25 Feb 2021 17:56:08 +0000 (09:56 -0800)]
Merge tag 'pci-v5.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
"Enumeration:
- Remove unnecessary locking around _OSC (Bjorn Helgaas)
- Clarify message about _OSC failure (Bjorn Helgaas)
- Remove notification of PCIe bandwidth changes (Bjorn Helgaas)
- Tidy checking of syscall user config accessors (Heiner Kallweit)
Resource management:
- Decline to resize resources if boot config must be preserved (Ard
Biesheuvel)
- Fix pci_register_io_range() memory leak (Geert Uytterhoeven)
Error handling (Keith Busch):
- Clear error status from the correct device
- Retain error recovery status so drivers can use it after reset
- Log the type of Port (Root or Switch Downstream) that we reset
- Always request a reset for Downstream Ports in frozen state
Endpoint framework and NTB (Kishon Vijay Abraham I):
- Make *_get_first_free_bar() take into account 64 bit BAR
- Add helper API to get the 'next' unreserved BAR
- Make *_free_bar() return error codes on failure
- Remove unused pci_epf_match_device()
- Add support to associate secondary EPC with EPF
- Add support in configfs to associate two EPCs with EPF
- Add pci_epc_ops to map MSI IRQ
- Add pci_epf_ops to expose function-specific attrs
- Allow user to create sub-directory of 'EPF Device' directory
- Implement ->msi_map_irq() ops for cadence
- Configure LM_EP_FUNC_CFG based on epc->function_num_map for cadence
- Add EP function driver to provide NTB functionality
- Add support for EPF PCI Non-Transparent Bridge
- Add specification for PCI NTB function device
- Add PCI endpoint NTB function user guide
- Add configfs binding documentation for pci-ntb endpoint function
Broadcom STB PCIe controller driver:
- Add support for BCM4908 and external PERST# signal controller
(Rafał Miłecki)
Cadence PCIe controller driver:
- Retrain Link to work around Gen2 training defect (Nadeem Athani)
- Fix merge botch in cdns_pcie_host_map_dma_ranges() (Krzysztof
Wilczyński)
Freescale Layerscape PCIe controller driver:
- Add LX2160A rev2 EP mode support (Hou Zhiqiang)
- Convert to builtin_platform_driver() (Michael Walle)
Qualcomm PCIe controller driver:
- Use PHY_REFCLK_USE_PAD only for ipq8064 (Ansuel Smith)
- Add support for ddrss_sf_tbu clock for sm8250 (Dmitry Baryshkov)
Renesas R-Car PCIe controller driver:
- Drop PCIE_RCAR config option (Lad Prabhakar)
- Always allocate MSI addresses in 32bit space (Marek Vasut)
Synopsys DesignWare PCIe controller driver:
- Work around ECRC configuration hardware defect (Vidya Sagar)
- Drop support for config space in DT 'ranges' (Rob Herring)
- Change size to u64 for EP outbound iATU (Shradha Todi)
- Add upper limit address for outbound iATU (Shradha Todi)
- Make dw_pcie ops optional (Jisheng Zhang)
- Remove unnecessary dw_pcie_ops from al driver (Jisheng Zhang)
Miscellaneous:
- Remove tango host controller driver (Arnd Bergmann)
- Remove IRQ handler & data together (altera-msi, brcmstb, dwc)
(Martin Kaiser)
- Fix xgene-msi race in installing chained IRQ handler (Martin
Kaiser)
- Apply CONFIG_PCI_DEBUG to entire drivers/pci hierarchy (Junhao He)
- Fix pci-bridge-emul array overruns (Russell King)
- Remove obsolete uses of WARN_ON(in_interrupt()) (Sebastian Andrzej
Siewior)"
* tag 'pci-v5.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (69 commits)
PCI: qcom: Use PHY_REFCLK_USE_PAD only for ipq8064
PCI: qcom: Add support for ddrss_sf_tbu clock
dt-bindings: PCI: qcom: Document ddrss_sf_tbu clock for sm8250
PCI: al: Remove useless dw_pcie_ops
PCI: dwc: Don't assume the ops in dw_pcie always exist
PCI: dwc: Add upper limit address for outbound iATU
PCI: dwc: Change size to u64 for EP outbound iATU
PCI: dwc: Drop support for config space in 'ranges'
PCI: layerscape: Convert to builtin_platform_driver()
PCI: layerscape: Add LX2160A rev2 EP mode support
dt-bindings: PCI: layerscape: Add LX2160A rev2 compatible strings
PCI: dwc: Work around ECRC configuration issue
PCI/portdrv: Report reset for frozen channel
PCI/AER: Specify the type of Port that was reset
PCI/ERR: Retain status from error notification
PCI/AER: Clear AER status from Root Port when resetting Downstream Port
PCI/ERR: Clear status of the reporting device
dt-bindings: arm: rockchip: Add FriendlyARM NanoPi M4B
PCI: rockchip: Make 'ep-gpios' DT property optional
Documentation: PCI: Add PCI endpoint NTB function user guide
...
Heiner Kallweit [Thu, 25 Feb 2021 15:05:19 +0000 (16:05 +0100)]
r8169: fix jumbo packet handling on RTL8168e
Josef reported [0] that using jumbo packets fails on RTL8168e.
Aligning the values for register MaxTxPacketSize with the
vendor driver fixes the problem.
Colin Ian King [Thu, 25 Feb 2021 16:52:48 +0000 (17:52 +0100)]
tracing/tools: fix a couple of spelling mistakes
There is a spelling mistake in the -g help option, I believe
it should be "graph". There is also a spelling mistake in a
warning message. Fix both mistakes.
Christian Melki [Wed, 24 Feb 2021 20:55:36 +0000 (21:55 +0100)]
net: phy: micrel: set soft_reset callback to genphy_soft_reset for KSZ8081
Following a similar reinstate for the KSZ9031.
Older kernels would use the genphy_soft_reset if the PHY did not implement
a .soft_reset.
Bluntly removing that default may expose a lot of situations where various
PHYs/board implementations won't recover on various changes.
Like with this implementation during a 4.9.x to 5.4.x LTS transition.
I think it's a good thing to remove unwanted soft resets but wonder if it
did open a can of worms?
Linus Torvalds [Thu, 25 Feb 2021 17:50:36 +0000 (09:50 -0800)]
Merge tag 'nds32-for-linux-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/greentime/linux
Pull nds32 updates from Greentime Hu:
"Code clean-up and refinement"
* tag 'nds32-for-linux-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/greentime/linux:
nds32: Fix bogus reference to <asm/procinfo.h>
nds32: use get_kernel_nofault in dump_mem
nds32: remove dump_instr
nds32: configs: Cleanup CONFIG_CROSS_COMPILE
nds32: Replace <linux/clk-provider.h> by <linux/of_clk.h>
mm, tracing: Fix kmem_cache_free trace event to not print stale pointers
The update to kmem_cache_free trace event added printing of the slab name in
the trace event. But it only stores the pointer of the name which will be
printed as a string when the event is read some time in the future. This is
dangerous because the name could be freed in the mean time and when reading
the trace event it would try to dereference the string name by the pointer
to the name that has been freed.
Instead, use the trace event helper macros __string(), __assign_str(), and
__get_str() that are for this very case.
Cc: Jacob Wen <[email protected]> Fixes: 3544de8ee6e4 ("mm, tracing: record slab name for kmem_cache_free()") Signed-off-by: Steven Rostedt (VMware) <[email protected]>
Chris Mi [Thu, 25 Feb 2021 07:51:45 +0000 (15:51 +0800)]
net: psample: Fix netlink skb length with tunnel info
Currently, the psample netlink skb is allocated with a size that does
not account for the nested 'PSAMPLE_ATTR_TUNNEL' attribute and the
padding required for the 64-bit attribute 'PSAMPLE_TUNNEL_KEY_ATTR_ID'.
This can result in failure to add attributes to the netlink skb due
to insufficient tail room. The following error message is printed to
the kernel log: "Could not create psample log message".
Fix this by adjusting the allocation size to take into account the
nested attribute and the padding.
Steve French [Wed, 24 Feb 2021 18:12:53 +0000 (12:12 -0600)]
cifs: Add new parameter "acregmax" for distinct file and directory metadata timeout
The new optional mount parameter "acregmax" allows a different
timeout for file metadata ("acdirmax" now allows controlling timeout
for directory metadata). Setting "actimeo" still works as before,
and changes timeout for both files and directories, but
specifying "acregmax" or "acdirmax" allows overriding the
default more granularly which can be a big performance benefit
on some workloads. "acregmax" is already used by NFS as a mount
parameter (albeit with a larger default and thus looser caching).
Steve French [Tue, 23 Feb 2021 22:16:09 +0000 (16:16 -0600)]
cifs: convert revalidate of directories to using directory metadata cache timeout
The new optional mount parm, "acdirmax" allows caching the metadata
for a directory longer than file metadata, which can be very helpful
for performance. Convert cifs_inode_needs_reval to check acdirmax
for revalidating directory metadata.
Steve French [Tue, 23 Feb 2021 21:50:57 +0000 (15:50 -0600)]
cifs: Add new mount parameter "acdirmax" to allow caching directory metadata
nfs and cifs on Linux currently have a mount parameter "actimeo" to control
metadata (attribute) caching but cifs does not have additional mount
parameters to allow distinguishing between caching directory metadata
(e.g. needed to revalidate paths) and that for files.
Add new mount parameter "acdirmax" to allow caching metadata for
directories more loosely than file data. NFS adjusts metadata
caching from acdirmin to acdirmax (and another two mount parms
for files) but to reduce complexity, it is safer to just introduce
the one mount parm to allow caching directories longer. The
defaults for acdirmax and actimeo (for cifs.ko) are conservative,
1 second (NFS defaults acdirmax to 60 seconds). For many workloads,
setting acdirmax to a higher value is safe and will improve
performance. This patch leaves unchanged the default values
for caching metadata for files and directories but gives the
user more flexibility in adjusting them safely for their workload
via the new mount parm.
Marco Wenzel [Wed, 24 Feb 2021 09:46:49 +0000 (10:46 +0100)]
net: hsr: add support for EntryForgetTime
In IEC 62439-3 EntryForgetTime is defined with a value of 400 ms. When a
node does not send any frame within this time, the sequence number check
for can be ignored. This solves communication issues with Cisco IE 2000
in Redbox mode.
net: dsa: sja1105: Remove unneeded cast in sja1105_crc32()
sja1105_unpack() takes a "const void *buf" as its first parameter, so
there is no need to cast away the "const" of the "buf" variable before
calling it.
Drop the cast, as it prevents the compiler performing some checks.
Jens Axboe [Thu, 25 Feb 2021 17:17:46 +0000 (10:17 -0700)]
io_uring: fix SQPOLL thread handling over exec
Just like the changes for io-wq, ensure that we re-fork the SQPOLL
thread if the owner execs. Mark the ctx sq thread as sqo_exec if
it dies, and the ring as needing a wakeup which will force the task
to enter the kernel. When it does, setup the new thread and proceed
as usual.
Jens Axboe [Thu, 25 Feb 2021 17:17:09 +0000 (10:17 -0700)]
io-wq: improve manager/worker handling over exec
exec will cancel any threads, including the ones that io-wq is using. This
isn't a problem, in fact we'd prefer it to be that way since it means we
know that any async work cancels naturally without having to handle it
proactively.
But it does mean that we need to setup a new manager, as the manager and
workers are gone. Handle this at queue time, and cancel work if we fail.
Since the manager can go away without us noticing, ensure that the manager
itself holds a reference to the 'wq' as well. Rename io_wq_destroy() to
io_wq_put() to reflect that.
In the future we can now simplify exec cancelation handling, for now just
leave it the same.
which is due to exiting after the SQPOLL thread has been created, but
hasn't been started yet. Ensure that we always complete the startup
side when waiting for it to exit.
Jens Axboe [Fri, 19 Feb 2021 19:33:30 +0000 (12:33 -0700)]
io-wq: make buffered file write hashed work map per-ctx
Before the io-wq thread change, we maintained a hash work map and lock
per-node per-ring. That wasn't ideal, as we really wanted it to be per
ring. But now that we have per-task workers, the hash map ends up being
just per-task. That'll work just fine for the normal case of having
one task use a ring, but if you share the ring between tasks, then it's
considerably worse than it was before.
Make the hash map per ctx instead, which provides full per-ctx buffered
write serialization on hashed writes.
Dave Chinner [Tue, 23 Feb 2021 18:26:06 +0000 (10:26 -0800)]
xfs: use current->journal_info for detecting transaction recursion
Because the iomap code using PF_MEMALLOC_NOFS to detect transaction
recursion in XFS is just wrong. Remove it from the iomap code and
replace it with XFS specific internal checks using
current->journal_info instead.
[djwong: This change also realigns the lifetime of NOFS flag changes to
match the incore transaction, instead of the inconsistent scheme we have
now.]
Darrick J. Wong [Fri, 19 Feb 2021 17:18:06 +0000 (09:18 -0800)]
xfs: don't nest transactions when scanning for eofblocks
Brian Foster reported a lockdep warning on xfs/167:
============================================
WARNING: possible recursive locking detected
5.11.0-rc4 #35 Tainted: G W I
--------------------------------------------
fsstress/17733 is trying to acquire lock: ffff8e0fd1d90650 (sb_internal){++++}-{0:0}, at: xfs_free_eofblocks+0x104/0x1d0 [xfs]
but task is already holding lock: ffff8e0fd1d90650 (sb_internal){++++}-{0:0}, at: xfs_trans_alloc_inode+0x5f/0x160 [xfs]
The cause of this is the new code that spurs a scan to garbage collect
speculative preallocations if we fail to reserve enough blocks while
allocating a transaction. While the warning itself is a fairly benign
lockdep complaint, it does expose a potential livelock if the rwsem
behavior ever changes with regards to nesting read locks when someone's
waiting for a write lock.
Fix this by freeing the transaction and jumping back to xfs_trans_alloc
like this patch in the V4 submission[1].
Brian Foster [Tue, 23 Feb 2021 18:22:39 +0000 (10:22 -0800)]
xfs: don't reuse busy extents on extent trim
Freed extents are marked busy from the point the freeing transaction
commits until the associated CIL context is checkpointed to the log.
This prevents reuse and overwrite of recently freed blocks before
the changes are committed to disk, which can lead to corruption
after a crash. The exception to this rule is that metadata
allocation is allowed to reuse busy extents because metadata changes
are also logged.
As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS
has allowed modification or complete invalidation of outstanding
busy extents for metadata allocations. This implementation assumes
that use of the associated extent is imminent, which is not always
the case. For example, the trimmed extent might not satisfy the
minimum length of the allocation request, or the allocation
algorithm might be involved in a search for the optimal result based
on locality.
generic/019 reproduces a corruption caused by this scenario. First,
a metadata block (usually a bmbt or symlink block) is freed from an
inode. A subsequent bmbt split on an unrelated inode attempts a near
mode allocation request that invalidates the busy block during the
search, but does not ultimately allocate it. Due to the busy state
invalidation, the block is no longer considered busy to subsequent
allocation. A direct I/O write request immediately allocates the
block and writes to it. Finally, the filesystem crashes while in a
state where the initial metadata block free had not committed to the
on-disk log. After recovery, the original metadata block is in its
original location as expected, but has been corrupted by the
aforementioned dio.
This demonstrates that it is fundamentally unsafe to modify busy
extent state for extents that are not guaranteed to be allocated.
This applies to pretty much all of the code paths that currently
trim busy extents for one reason or another. Therefore to address
this problem, drop the reuse mechanism from the busy extent trim
path. This code already knows how to return partial non-busy ranges
of the targeted free extent and higher level code tracks the busy
state of the allocation attempt. If a block allocation fails where
one or more candidate extents is busy, we force the log and retry
the allocation.
I ran into a case where the ref resurrect now spins, so revert
this change for now until we can further investigate why it's
broken. The bug seems to indicate spinning on the lock itself,
likely there's some ABBA deadlock involved:
Mark Brown [Wed, 24 Feb 2021 16:50:37 +0000 (16:50 +0000)]
arm64: stacktrace: Report when we reach the end of the stack
Currently the arm64 unwinder code returns -EINVAL whenever it can't find
the next stack frame, not distinguishing between cases where the stack has
been corrupted or is otherwise in a state it shouldn't be and cases
where we have reached the end of the stack. At the minute none of the
callers care what error code is returned but this will be important for
reliable stack trace which needs to be sure that the stack is intact.
Change to return -ENOENT in the case where we reach the bottom of the
stack. The error codes from this function are only used in kernel, this
particular code is chosen as we are indicating that we know there is no
frame there.
arm64: ptrace: Fix seccomp of traced syscall -1 (NO_SYSCALL)
Since commit f086f67485c5 ("arm64: ptrace: add support for syscall
emulation"), if system call number -1 is called and the process is being
traced with PTRACE_SYSCALL, for example by strace, the seccomp check is
skipped and -ENOSYS is returned unconditionally (unless altered by the
tracer) rather than carrying out action specified in the seccomp filter.
The consequence of this is that it is not possible to reliably strace
a seccomp based implementation of a foreign system call interface in
which r7/x8 is permitted to be -1 on entry to a system call.
Also trace_sys_enter and audit_syscall_entry are skipped if a system
call is skipped.
Fix by removing the in_syscall(regs) check restoring the previous
behaviour which is like AArch32, x86 (which uses generic code) and
everything else.
KVM: SVM: Fix nested VM-Exit on #GP interception handling
Fix the interpreation of nested_svm_vmexit()'s return value when
synthesizing a nested VM-Exit after intercepting an SVM instruction while
L2 was running. The helper returns '0' on success, whereas a return
value of '0' in the exit handler path means "exit to userspace". The
incorrect return value causes KVM to exit to userspace without filling
the run state, e.g. QEMU logs "KVM: unknown exit, hardware reason 0".
Wei Yongjun [Wed, 24 Feb 2021 01:38:03 +0000 (01:38 +0000)]
ALSA: n64: Fix return value check in n64audio_probe()
In case of error, the function devm_platform_ioremap_resource()
returns ERR_PTR() and never returns NULL. The NULL test in the
return value check should be replaced with IS_ERR().
Heiko Stuebner [Sat, 6 Feb 2021 13:50:20 +0000 (14:50 +0100)]
drm/panel: kd35t133: allow using non-continuous dsi clock
The panel is able to work when dsi clock is non-continuous, thus
the system power consumption can be reduced using such feature.
Add MIPI_DSI_CLOCK_NON_CONTINUOUS to panel's mode_flags.
Also the flag actually becomes necessary after
commit c6d94e37bdbb ("drm/bridge/synopsys: dsi: add support for non-continuous HS clock")
and without it the panel only emits stripes instead of output.