Jann Horn [Sat, 17 Oct 2020 23:14:12 +0000 (16:14 -0700)]
mm/gup_benchmark: take the mmap lock around GUP
To be safe against concurrent changes to the VMA tree, we must take the
mmap lock around GUP operations (excluding the GUP-fast family of
operations, which will take the mmap lock by themselves if necessary).
This code is only for testing, and it's only reachable by root through
debugfs, so this doesn't really have any impact; however, if we want to
add lockdep asserts into the GUP path, we need to have clean locking here.
Liam R. Howlett [Sat, 17 Oct 2020 23:14:09 +0000 (16:14 -0700)]
mm/mmap: add inline munmap_vma_range() for code readability
There are two locations that have a block of code for munmapping a vma
range. Change those two locations to use a function and add meaningful
comments about what happens to the arguments, which was unclear in the
previous code.
Liam R. Howlett [Sat, 17 Oct 2020 23:14:06 +0000 (16:14 -0700)]
mm/mmap: add inline vma_next() for readability of mmap code
There are three places that the next vma is required which uses the same
block of code. Replace the block with a function and add comments on what
happens in the case where NULL is encountered.
Miaohe Lin [Sat, 17 Oct 2020 23:14:03 +0000 (16:14 -0700)]
mm/migrate: avoid possible unnecessary process right check in kernel_move_pages()
There is no need to check if this process has the right to modify the
specified process when they are same. And we could also skip the security
hook call if a process is modifying its own pages. Add helper function to
handle these.
Joonsoo Kim [Sat, 17 Oct 2020 23:14:00 +0000 (16:14 -0700)]
mm/memory_hotplug: remove a wrapper for alloc_migration_target()
To calculate the correct node to migrate the page for hotplug, we need to
check node id of the page. Wrapper for alloc_migration_target() exists
for this purpose.
However, Vlastimil informs that all migration source pages come from a
single node. In this case, we don't need to check the node id for each
page and we don't need to re-set the target nodemask for each page by
using the wrapper. Set up the migration_target_control once and use it
for all pages.
Roman Gushchin [Sat, 17 Oct 2020 23:13:53 +0000 (16:13 -0700)]
mm: kmem: enable kernel memcg accounting from interrupt contexts
If a memcg to charge can be determined (using remote charging API), there
are no reasons to exclude allocations made from an interrupt context from
the accounting.
Such allocations will pass even if the resulting memcg size will exceed
the hard limit, but it will affect the application of the memory pressure
and an inability to put the workload under the limit will eventually
trigger the OOM.
To use active_memcg() helper, memcg_kmem_bypass() is moved back to
memcontrol.c.
Roman Gushchin [Sat, 17 Oct 2020 23:13:50 +0000 (16:13 -0700)]
mm: kmem: prepare remote memcg charging infra for interrupt contexts
Remote memcg charging API uses current->active_memcg to store the
currently active memory cgroup, which overwrites the memory cgroup of the
current process. It works well for normal contexts, but doesn't work for
interrupt contexts: indeed, if an interrupt occurs during the execution of
a section with an active memcg set, all allocations inside the interrupt
will be charged to the active memcg set (given that we'll enable
accounting for allocations from an interrupt context). But because the
interrupt might have no relation to the active memcg set outside, it's
obviously wrong from the accounting prospective.
To resolve this problem, let's add a global percpu int_active_memcg
variable, which will be used to store an active memory cgroup which will
be used from interrupt contexts. set_active_memcg() will transparently
use current->active_memcg or int_active_memcg depending on the context.
To make the read part simple and transparent for the caller, let's
introduce two new functions:
- struct mem_cgroup *active_memcg(void),
- struct mem_cgroup *get_active_memcg(void).
They are returning the active memcg if it's set, hiding all implementation
details: where to get it depending on the current context.
Roman Gushchin [Sat, 17 Oct 2020 23:13:47 +0000 (16:13 -0700)]
mm: kmem: remove redundant checks from get_obj_cgroup_from_current()
There are checks for current->mm and current->active_memcg in
get_obj_cgroup_from_current(), but these checks are redundant:
memcg_kmem_bypass() called just above performs same checks.
Roman Gushchin [Sat, 17 Oct 2020 23:13:44 +0000 (16:13 -0700)]
mm: kmem: move memcg_kmem_bypass() calls to get_mem/obj_cgroup_from_current()
Patch series "mm: kmem: kernel memory accounting in an interrupt context".
This patchset implements memcg-based memory accounting of allocations made
from an interrupt context.
Historically, such allocations were passed unaccounted mostly because
charging the memory cgroup of the current process wasn't an option. Also
performance reasons were likely a reason too.
The remote charging API allows to temporarily overwrite the currently
active memory cgroup, so that all memory allocations are accounted towards
some specified memory cgroup instead of the memory cgroup of the current
process.
This patchset extends the remote charging API so that it can be used from
an interrupt context. Then it removes the fence that prevented the
accounting of allocations made from an interrupt context. It also
contains a couple of optimizations/code refactorings.
This patchset doesn't directly enable accounting for any specific
allocations, but prepares the code base for it. The bpf memory accounting
will likely be the first user of it: a typical example is a bpf program
parsing an incoming network packet, which allocates an entry in hashmap
map to store some information.
This patch (of 4):
Currently memcg_kmem_bypass() is called before obtaining the current
memory/obj cgroup using get_mem/obj_cgroup_from_current(). Moving
memcg_kmem_bypass() into get_mem/obj_cgroup_from_current() reduces the
number of call sites and allows further code simplifications.
Roman Gushchin [Sat, 17 Oct 2020 23:13:40 +0000 (16:13 -0700)]
mm, memcg: rework remote charging API to support nesting
Currently the remote memcg charging API consists of two functions:
memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear the
memcg value, which overwrites the memcg of the current task.
It works perfectly for allocations performed from a normal context,
however an attempt to call it from an interrupt context or just nest two
remote charging blocks will lead to an incorrect accounting. On exit from
the inner block the active memcg will be cleared instead of being
restored.
Error: allocation here are charged to the memcg of the current
process instead of target_memcg.
memalloc_unuse_memcg();
This patch extends the remote charging API by switching to a single
function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg),
which sets the new value and returns the old one. So a remote charging
block will look like:
Fix linkage error when CONFIG_BINFMT_ELF is selected but CONFIG_COREDUMP
is not:
ia64-linux-ld: arch/ia64/kernel/elfcore.o: in function `elf_core_write_extra_phdrs':
elfcore.c:(.text+0x172): undefined reference to `dump_emit'
ia64-linux-ld: arch/ia64/kernel/elfcore.o: in function `elf_core_write_extra_data':
elfcore.c:(.text+0x2b2): undefined reference to `dump_emit'
Jan Kara [Thu, 15 Oct 2020 11:03:30 +0000 (13:03 +0200)]
ext4: Detect already used quota file early
When we try to use file already used as a quota file again (for the same
or different quota type), strange things can happen. At the very least
lockdep annotations may be wrong but also inode flags may be wrongly set
/ reset. When the file is used for two quota types at once we can even
corrupt the file and likely crash the kernel. Catch all these cases by
checking whether passed file is already used as quota file and bail
early in that case.
This fixes occasional generic/219 failure due to lockdep complaint.
changfengnan [Mon, 12 Oct 2020 16:49:00 +0000 (18:49 +0200)]
jbd2: avoid transaction reuse after reformatting
When ext4 is formatted with lazy_journal_init=1 and transactions from
the previous filesystem are still on disk, it is possible that they are
considered during a recovery after a crash. Because the checksum seed
has changed, the CRC check will fail, and the journal recovery fails
with checksum error although the journal is otherwise perfectly valid.
Fix the problem by checking commit block time stamps to determine
whether the data in the journal block is just stale or whether it is
indeed corrupt.
Kaixu Xia [Sat, 10 Oct 2020 08:10:16 +0000 (16:10 +0800)]
ext4: use the normal helper to get the actual inode
Here we use the READ_ONCE to fix race conditions in ->d_compare() and
->d_hash() when they are called in RCU-walk mode, seems we can use
the normal helper d_inode_rcu() to get the actual inode.
Ritesh Harjani [Thu, 8 Oct 2020 15:02:48 +0000 (20:32 +0530)]
ext4: fix bs < ps issue reported with dioread_nolock mount opt
left shifting m_lblk by blkbits was causing value overflow and hence
it was not able to convert unwritten to written extent.
So, make sure we typecast it to loff_t before do left shift operation.
Also in func ext4_convert_unwritten_io_end_vec(), make sure to initialize
ret variable to avoid accidentally returning an uninitialized ret.
This patch fixes the issue reported in ext4 for bs < ps with
dioread_nolock mount option.
ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()
This implements journal callbacks j_submit|finish_inode_data_buffers()
with different behavior for data=journal: to write-protect pages under
commit, preventing changes to buffers writeably mapped to userspace.
If a buffer's content changes between commit's checksum calculation
and write-out to disk, it can cause journal recovery/mount failures
upon a kernel crash or power loss.
[ 27.334874] EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, and O_DIRECT support!
[ 27.339492] JBD2: Invalid checksum recovering data block 8705 in log
[ 27.342716] JBD2: recovery failed
[ 27.343316] EXT4-fs (loop0): error loading journal
mount: /ext4: can't read superblock on /dev/loop0.
In j_submit_inode_data_buffers() we write-protect the inode's pages
with write_cache_pages() and redirty w/ writepage callback if needed.
In j_finish_inode_data_buffers() there is nothing do to.
And in order to use the callbacks, inodes are added to the inode list
in transaction in __ext4_journalled_writepage() and ext4_page_mkwrite().
In ext4_page_mkwrite() we must make sure that the buffers are attached
to the transaction as jbddirty with write_end_fn(), as already done in
__ext4_journalled_writepage().
These are two fixes for data journalling required by
the next patch, discovered while testing it.
First, the optimization to return early if all buffers
are mapped is not appropriate for the next patch:
The inode _must_ be added to the transaction's list in
data=journal mode (so to write-protect pages on commit)
thus we cannot return early there.
Second, once that optimization to reduce transactions
was disabled for data=journal mode, more transactions
happened, and occasionally hit this warning message:
'JBD2: Spotted dirty metadata buffer'.
Reason is, block_page_mkwrite() will set_buffer_dirty()
before do_journal_get_write_access() that is there to
prevent it. This issue was masked by the optimization.
So, on data=journal use __block_write_begin() instead.
This also requires page locking and len recalculation.
(see block_page_mkwrite() for implementation details.)
Finally, as Jan noted there is little sharing between
data=journal and other modes in ext4_page_mkwrite().
However, a prototype of ext4_journalled_page_mkwrite()
showed there still would be lots of duplicated lines
(tens of) that didn't seem worth it.
Thus this patch ends up with an ugly goto to skip all
non-data journalling code (to avoid long indentations,
but that can be changed..) in the beginning, and just
a conditional in the transaction section.
Well, we skip a common part to data journalling which
is the page truncated check, but we do it again after
ext4_journal_start() when we re-acquire the page lock
(so not to acquire the page lock twice needlessly for
data journalling.)
Introduce journal callbacks to allow different behaviors
for an inode in journal_submit|finish_inode_data_buffers().
The existing users of the current behavior (ext4, ocfs2)
are adapted to use the previously exported functions
that implement the current behavior.
Users are callers of jbd2_journal_inode_ranged_write|wait(),
which adds the inode to the transaction's inode list with
the JI_WRITE|WAIT_DATA flags. Only ext4 and ocfs2 in-tree.
Both CONFIG_EXT4_FS and CONFIG_OCSFS2_FS select CONFIG_JBD2,
which builds fs/jbd2/commit.c and journal.c that define and
export the functions, so we can call directly in ext4/ocfs2.
ext4: introduce ext4_sb_bread_unmovable() to replace sb_bread_unmovable()
Now we only use sb_bread_unmovable() to read superblock and descriptor
block at mount time, so there is no opportunity that we need to clear
buffer verified bit and also handle buffer write_io error bit. But for
the sake of unification, let's introduce ext4_sb_bread_unmovable() to
replace all sb_bread_unmovable(). After this patch, we stop using read
helpers in fs/buffer.c.
We have already remove open codes that invoke helpers provide by
fs/buffer.c in all places reading metadata buffers. This patch switch to
use ext4_sb_bread() to replace all sb_bread() helpers, which is
ext4_read_bh() helper back end.
ext4: introduce ext4_sb_breadahead_unmovable() to replace sb_breadahead_unmovable()
If we readahead inode tables in __ext4_get_inode_loc(), it may bypass
buffer_write_io_error() check, so introduce ext4_sb_breadahead_unmovable()
to handle this special case.
This patch also replace sb_breadahead_unmovable() in ext4_fill_super()
for the sake of unification.
ext4: use ext4_buffer_uptodate() in __ext4_get_inode_loc()
We have already introduced ext4_buffer_uptodate() to re-set the uptodate
bit on buffer which has been failed to write out to disk. Just remove
the redundant codes and switch to use ext4_buffer_uptodate() in
__ext4_get_inode_loc().
The previous patch add clear_buffer_verified() before we read metadata
block from disk again, but it's rather easy to miss clearing of this bit
because currently we read metadata buffer through different open codes
(e.g. ll_rw_block(), bh_submit_read() and invoke submit_bh() directly).
So, it's time to add common helpers to unify in all the places reading
metadata buffers instead. This patch add 3 helpers:
- ext4_read_bh_nowait(): async read metadata buffer if it's actually
not uptodate, clear buffer_verified bit before read from disk.
- ext4_read_bh(): sync version of read metadata buffer, it will wait
until the read operation return and check the return status.
- ext4_read_bh_lock(): try to lock the buffer before read buffer, it
will skip reading if the buffer is already locked.
After this patch, we need to use these helpers in all the places reading
metadata buffer instead of different open codes.
ext4: clear buffer verified flag if read meta block from disk
The metadata buffer is no longer trusted after we read it from disk
again because it is not uptodate for some reasons (e.g. failed to write
back). Otherwise we may get below memory corruption problem in
ext4_ext_split()->memset() if we read stale data from the newly
allocated extent block on disk which has been failed to async write
out but miss verify again since the verified bit has already been set
on the buffer.
Darrick J. Wong [Thu, 1 Oct 2020 22:21:48 +0000 (15:21 -0700)]
ext4: limit entries returned when counting fsmap records
If userspace asked fsmap to try to count the number of entries, we cannot
return more than UINT_MAX entries because fmh_entries is u32.
Therefore, stop counting if we hit this limit or else we will waste time
to return truncated results.
ext4: fix bdev write error check failed when mount fs with ro
Consider a situation when a filesystem was uncleanly shutdown and the
orphan list is not empty and a read-only mount is attempted. The orphan
list cleanup during mount will fail with:
ext4_check_bdev_write_error:193: comm mount: Error while async write back metadata
This happens because sbi->s_bdev_wb_err is not initialized when mounting
the filesystem in read only mode and so ext4_check_bdev_write_error()
falsely triggers.
Initialize sbi->s_bdev_wb_err unconditionally to avoid this problem.
ext4: rename system_blks to s_system_blks inside ext4_sb_info
Rename system_blks to s_system_blks inside ext4_sb_info, keep
the naming rules consistent with other variables, which is
convenient for code reading and writing.
ext4: rename journal_dev to s_journal_dev inside ext4_sb_info
Rename journal_dev to s_journal_dev inside ext4_sb_info, keep
the naming rules consistent with other variables, which is
convenient for code reading and writing.
In case if the file already has underlying blocks/extents allocated
then we don't need to start a journal txn and can directly return
the underlying mapping. Currently ext4_iomap_begin() is used by
both DAX & DIO path. We can check if the write request is an
overwrite & then directly return the mapping information.
This could give a significant perf boost for multi-threaded writes
specially random overwrites.
On PPC64 VM with simulated pmem(DAX) device, ~10x perf improvement
could be seen in random writes (overwrite). Also bcoz this optimizes
away the spinlock contention during jbd2 slab cache allocation
(jbd2_journal_handle). On x86 VM, ~2x perf improvement was observed.
The race condition could cause the persisted superblock checksum
to not match the contents of the superblock, causing the
superblock to be considered corrupt.
An example of the race follows. A first thread is interrupted in the
middle of a checksum calculation. Then, another thread changes the
superblock, calculates a new checksum, and sets it. Then, the first
thread resumes and sets the checksum based on the older superblock.
To fix, serialize the superblock checksum calculation using the buffer
header lock. While a spinlock is sufficient, the buffer header is
already there and there is precedent for locking it (e.g. in
ext4_commit_super).
Tested the patch by booting up a kernel with the patch, creating
a filesystem and some files (including some orphans), and then
unmounting and remounting the file system.
Xiao Yang [Fri, 28 Aug 2020 08:43:30 +0000 (16:43 +0800)]
ext4: disallow modifying DAX inode flag if inline_data has been set
inline_data is mutually exclusive to DAX so enabling both of them triggers
the following issue:
------------------------------------------
# mkfs.ext4 -F -O inline_data /dev/pmem1
...
# mount /dev/pmem1 /mnt
# echo 'test' >/mnt/file
# lsattr -l /mnt/file
/mnt/file Inline_Data
# xfs_io -c "chattr +x" /mnt/file
# xfs_io -c "lsattr -v" /mnt/file
[dax] /mnt/file
# umount /mnt
# mount /dev/pmem1 /mnt
# cat /mnt/file
cat: /mnt/file: Numerical result out of range
------------------------------------------
Petr Malat [Tue, 25 Aug 2020 15:00:16 +0000 (17:00 +0200)]
ext4: do not interpret high bytes if 64bit feature is disabled
Fields s_free_blocks_count_hi, s_r_blocks_count_hi and s_blocks_count_hi
are not valid if EXT4_FEATURE_INCOMPAT_64BIT is not enabled and should be
treated as zeroes.
Jan Kara [Thu, 24 Sep 2020 15:09:59 +0000 (17:09 +0200)]
ext4: discard preallocations before releasing group lock
ext4_mb_discard_group_preallocations() can be releasing group lock with
preallocations accumulated on its local list. Thus although
discard_pa_seq was incremented and concurrent allocating processes will
be retrying allocations, it can happen that premature ENOSPC error is
returned because blocks used for preallocations are not available for
reuse yet. Make sure we always free locally accumulated preallocations
before releasing group lock.
Ye Bin [Wed, 16 Sep 2020 11:38:59 +0000 (19:38 +0800)]
ext4: fix dead loop in ext4_mb_new_blocks
As we test disk offline/online with running fsstress, we find fsstress
process is keeping running state.
kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114
....
kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114
As we see seq_retry is sum of discard_pa_seq every cpu, if
ext4_mb_discard_group_preallocations return zero discard_pa_seq in this
cpu maybe increase one, so condition "seq_retry != *seq" have always
been met.
Ritesh Harjani suggest to in ext4_mb_discard_group_preallocations function we
only increase discard_pa_seq when there is some PA to free.
After moving ext4's bmap to iomap interface, swapon functionality
on files created using fallocate (which creates unwritten extents) are
failing. This is since iomap_bmap interface returns 0 for unwritten
extents and thus generic_swapfile_activate considers this as holes
and hence bail out with below kernel msg :-
[340.915835] swapon: swapfile has holes
To fix this we need to implement ->swap_activate aops in ext4
which will use ext4_iomap_report_ops. Since we only need to return
the list of extents so ext4_iomap_report_ops should be enough.
Jens Axboe [Fri, 16 Oct 2020 15:02:26 +0000 (09:02 -0600)]
task_work: cleanup notification modes
A previous commit changed the notification mode from true/false to an
int, allowing notify-no, notify-yes, or signal-notify. This was
backwards compatible in the sense that any existing true/false user
would translate to either 0 (on notification sent) or 1, the latter
which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2.
Clean this up properly, and define a proper enum for the notification
mode. Now we have:
- TWA_NONE. This is 0, same as before the original change, meaning no
notification requested.
- TWA_RESUME. This is 1, same as before the original change, meaning
that we use TIF_NOTIFY_RESUME.
- TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the
notification.
Clean up all the callers, switching their 0/1/false/true to using the
appropriate TWA_* mode for notifications.
Fixes: e91b48162332 ("task_work: teach task_work_add() to do signal_wake_up()") Reviewed-by: Thomas Gleixner <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
Jens Axboe [Sat, 17 Oct 2020 15:25:52 +0000 (09:25 -0600)]
mm: use limited read-ahead to satisfy read
For the case where read-ahead is disabled on the file, or if the cgroup
is congested, ensure that we can at least do 1 page of read-ahead to
make progress on the read in an async fashion. This could potentially be
larger, but it's not needed in terms of functionality, so let's error on
the side of caution as larger counts of pages may run into reclaim
issues (particularly if we're congested).
This makes sure we're not hitting the potentially sync ->readpage() path
for IO that is marked IOCB_WAITQ, which could cause us to block. It also
means we'll use the same path for IO, regardless of whether or not
read-ahead happens to be disabled on the lower level device.
Jens Axboe [Sat, 17 Oct 2020 14:31:29 +0000 (08:31 -0600)]
mm: mark async iocb read as NOWAIT once some data has been copied
Once we've copied some data for an iocb that is marked with IOCB_WAITQ,
we should no longer attempt to async lock a new page. Instead make sure
we return the copied amount, and let the caller retry, instead of
returning -EIOCBQUEUED for a new page.
This should only be possible with read-ahead disabled on the below
device, and multiple threads racing on the same file. Haven't been able
to reproduce on anything else.
Cc: [email protected] # v5.9 Fixes: 1a0a7853b901 ("mm: support async buffered reads in generic_file_buffered_read()") Reported-by: Kent Overstreet <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
- Fix usage of reloc_sym in 'perf probe' when using both kallsyms and
debuginfo files.
- Do not print 'Metric Groups:' unnecessarily in 'perf list'
- Refcounting fixes in the event parsing code.
- Add expand cgroup event 'perf test' entry.
- Fix out of bounds CPU map access when handling armv8_pmu events in
'perf stat'.
- Add build-id injection 'perf bench' benchmark.
- Enter namespace when reading build-id in 'perf inject'.
- Do not load map/dso when injecting build-id speeding up the 'perf
inject' process.
- Add --buildid-all option to avoid processing all samples, just the
mmap metadata events.
- Add feature test to check if libbfd has buildid support
- Add 'perf test' entry for PE binary format support.
- Fix typos in power8 PMU vendor events JSON files.
- Hide libtraceevent non API functions.
* tag 'perf-tools-for-v5.10-2020-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (113 commits)
perf c2c: Update documentation for metrics reorganization
perf c2c: Add metrics "RMT Load Hit"
perf c2c: Correct LLC load hit metrics
perf c2c: Change header for LLC local hit
perf c2c: Use more explicit headers for HITM
perf c2c: Change header from "LLC Load Hitm" to "Load Hitm"
perf c2c: Organize metrics based on memory hierarchy
perf c2c: Display "Total Stores" as a standalone metrics
perf c2c: Display the total numbers continuously
perf bench: Use condition variables in numa.
perf jevents: Fix event code for events referencing std arch events
perf diff: Support hot streams comparison
perf streams: Report hot streams
perf streams: Calculate the sum of total streams hits
perf streams: Link stream pair
perf streams: Compare two streams
perf streams: Get the evsel_streams by evsel_idx
perf streams: Introduce branch history "streams"
perf intel-pt: Improve PT documentation slightly
perf tools: Add support for exclusive groups/events
...
Linus Torvalds [Sat, 17 Oct 2020 18:18:18 +0000 (11:18 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"A usual cycle for RDMA with a typical mix of driver and core subsystem
updates:
- Driver minor changes and bug fixes for mlx5, efa, rxe, vmw_pvrdma,
hns, usnic, qib, qedr, cxgb4, hns, bnxt_re
- Various rtrs fixes and updates
- Bug fix for mlx4 CM emulation for virtualization scenarios where
MRA wasn't working right
- Use tracepoints instead of pr_debug in the CM code
- Scrub the locking in ucma and cma to close more syzkaller bugs
- Use tasklet_setup in the subsystem
- Revert the idea that 'destroy' operations are not allowed to fail
at the driver level. This proved unworkable from a HW perspective.
- Revise how the umem API works so drivers make fewer mistakes using
it
- XRC support for qedr
- Convert uverbs objects RWQ and MW to new the allocation scheme
- Large queue entry sizes for hns
- Use hmm_range_fault() for mlx5 On Demand Paging
- uverbs APIs to inspect the GID table instead of sysfs
- Move some of the RDMA code for building large page SGLs into
lib/scatterlist"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (191 commits)
RDMA/ucma: Fix use after free in destroy id flow
RDMA/rxe: Handle skb_clone() failure in rxe_recv.c
RDMA/rxe: Move the definitions for rxe_av.network_type to uAPI
RDMA: Explicitly pass in the dma_device to ib_register_device
lib/scatterlist: Do not limit max_segment to PAGE_ALIGNED values
IB/mlx4: Convert rej_tmout radix-tree to XArray
RDMA/rxe: Fix bug rejecting all multicast packets
RDMA/rxe: Fix skb lifetime in rxe_rcv_mcast_pkt()
RDMA/rxe: Remove duplicate entries in struct rxe_mr
IB/hfi,rdmavt,qib,opa_vnic: Update MAINTAINERS
IB/rdmavt: Fix sizeof mismatch
MAINTAINERS: CISCO VIC LOW LATENCY NIC DRIVER
RDMA/bnxt_re: Fix sizeof mismatch for allocation of pbl_tbl.
RDMA/bnxt_re: Use rdma_umem_for_each_dma_block()
RDMA/umem: Move to allocate SG table from pages
lib/scatterlist: Add support in dynamic allocation of SG table from pages
tools/testing/scatterlist: Show errors in human readable form
tools/testing/scatterlist: Rejuvenate bit-rotten test
RDMA/ipoib: Set rtnl_link_ops for ipoib interfaces
RDMA/uverbs: Expose the new GID query API to user space
...
Linus Torvalds [Sat, 17 Oct 2020 18:01:01 +0000 (11:01 -0700)]
Merge tag 'i3c/for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux
Pull i3c updates from Boris Brezillon:
- Fix DAA for the pre-reserved address case
- Fix an error path in the cadence driver
* tag 'i3c/for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux:
i3c: master: Fix error return in cdns_i3c_master_probe()
i3c: master: fix for SETDASA and DAA process
i3c: master add i3c_master_attach_boardinfo to preserve boardinfo
Linus Torvalds [Sat, 17 Oct 2020 17:45:42 +0000 (10:45 -0700)]
Merge tag 'mtd/for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
Pull MTD updates from Richard Weinberger:
"NAND core changes:
- Drop useless 'depends on' in Kconfig
- Add an extra level in the Kconfig hierarchy
- Trivial spellings
- Dynamic allocation of the interface configurations
- Dropping the default ONFI timing mode
- Various cleanup (types, structures, naming, comments)
- Hide the chip->data_interface indirection
- Add the generic rb-gpios property
- Add the ->choose_interface_config() hook
- Introduce nand_choose_best_sdr_timings()
- Use default values for tPROG_max and tBERS_max
- Avoid redefining tR_max and tCCS_min
- Add a helper to find the closest ONFI mode
- bcm63xx MTD parsers: simplify CFE detection
Raw NAND controller drivers changes:
- fsl-upm: Deprecation of specific DT properties
- fsl_upm: Driver rework and cleanup in favor of ->exec_op()
- Ingenic: Cleanup ARRAY_SIZE() vs sizeof() use
- brcmnand: ECC error handling on EDU transfers
- brcmnand: Don't default to EDU transfers
- qcom: Set BAM mode only if not set already
- qcom: Avoid write to unavailable register
- gpio: Driver rework in favor of ->exec_op()
- tango: ->exec_op() conversion
- mtk: ->exec_op() conversion
Raw NAND chip drivers changes:
- toshiba: Implement ->choose_interface_config() for TH58NVG2S3HBAI4
- toshiba: Implement ->choose_interface_config() for TC58NVG0S3E
- toshiba: Implement ->choose_interface_config() for TC58TEG5DCLTA00
- hynix: Implement ->choose_interface_config() for H27UCG8T2ATR-BC
HyperBus changes:
- DMA support for TI's AM654 HyperBus controller driver.
- HyperBus frontend driver for Renesas RPC-IF driver.
SPI NOR core changes:
- Support for Winbond w25q64jwm flash
- Enable 4K sector support for mx25l12805d
SPI NOR controller drivers changes:
- intel-spi Add Alder Lake-S PCI ID
MTD Core changes:
- mtdoops: Don't run panic write twice
- mtdconcat: Correctly handle panic write
- Use DEFINE_SHOW_ATTRIBUTE"
* tag 'mtd/for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux: (76 commits)
mtd: hyperbus: Fix build failure when only RPCIF_HYPERBUS is enabled
mtd: hyperbus: add Renesas RPC-IF driver
Revert "mtd: spi-nor: Prefer asynchronous probe"
mtd: parsers: bcm63xx: Do not make it modular
mtd: spear_smi: Enable compile testing
mtd: maps: vmu-flash: fix typos for struct memcard
mtd: physmap: Add Baikal-T1 physically mapped ROM support
mtd: maps: vmu-flash: simplify the return expression of probe_maple_vmu
mtd: onenand: simplify the return expression of onenand_transfer_auto_oob
mtd: rawnand: cadence: remove a redundant dev_err call
mtd: rawnand: ams-delta: Fix non-OF build warning
mtd: rawnand: Don't overwrite the error code from nand_set_ecc_soft_ops()
mtd: rawnand: Introduce nand_set_ecc_on_host_ops()
mtd: rawnand: atmel: Check return values for nand_read_data_op
mtd: rawnand: vf610: Remove unused function vf610_nfc_transfer_size()
mtd: rawnand: qcom: Simplify with dev_err_probe()
mtd: rawnand: marvell: Fix and update kerneldoc
mtd: rawnand: marvell: Simplify with dev_err_probe()
mtd: rawnand: gpmi: Simplify with dev_err_probe()
mtd: rawnand: atmel: Simplify with dev_err_probe()
...
Pavel Begunkov [Fri, 16 Oct 2020 19:55:56 +0000 (20:55 +0100)]
io_uring: fix double poll mask init
__io_queue_proc() is used by both, poll reqs and apoll. Don't use
req->poll.events to copy poll mask because for apoll it aliases with
private data of the request.
Jens Axboe [Thu, 15 Oct 2020 19:46:44 +0000 (13:46 -0600)]
io-wq: inherit audit loginuid and sessionid
Make sure the async io-wq workers inherit the loginuid and sessionid from
the original task, and restore them to unset once we're done with the
async work item.
While at it, disable the ability for kernel threads to write to their own
loginuid.
Jens Axboe [Thu, 15 Oct 2020 22:24:45 +0000 (16:24 -0600)]
io_uring: use percpu counters to track inflight requests
Even though we place the req_issued and req_complete in separate
cachelines, there's considerable overhead in doing the atomics
particularly on the completion side.
Get rid of having the two counters, and just use a percpu_counter for
this. That's what it was made for, after all. This considerably
reduces the overhead in __io_free_req().
Jens Axboe [Thu, 15 Oct 2020 23:38:03 +0000 (17:38 -0600)]
io_uring: assign new io_identity for task if members have changed
This avoids doing a copy for each new async IO, if some parts of the
io_identity has changed. We avoid reference counting for the normal
fast path of nothing ever changing.
Jens Axboe [Thu, 15 Oct 2020 15:02:33 +0000 (09:02 -0600)]
io_uring: store io_identity in io_uring_task
This is, by definition, a per-task structure. So store it in the
task context, instead of doing carrying it in each io_kiocb. We're being
a bit inefficient if members have changed, as that requires an alloc and
copy of a new io_identity struct. The next patch will fix that up.
Jens Axboe [Thu, 15 Oct 2020 14:46:24 +0000 (08:46 -0600)]
io_uring: COW io_identity on mismatch
If the io_identity doesn't completely match the task, then create a
copy of it and use that. The existing copy remains valid until the last
user of it has gone away.
This also changes the personality lookup to be indexed by io_identity,
instead of creds directly.
Jens Axboe [Wed, 14 Oct 2020 16:48:51 +0000 (10:48 -0600)]
io_uring: move io identity items into separate struct
io-wq contains a pointer to the identity, which we just hold in io_kiocb
for now. This is in preparation for putting this outside io_kiocb. The
only exception is struct files_struct, which we'll need different rules
for to avoid a circular dependency.
Jens Axboe [Wed, 14 Oct 2020 15:23:55 +0000 (09:23 -0600)]
io_uring: pass required context in as flags
We have a number of bits that decide what context to inherit. Set up
io-wq flags for these instead. This is in preparation for always having
the various members set, but not always needing them for all requests.
Jens Axboe [Thu, 15 Oct 2020 16:13:07 +0000 (10:13 -0600)]
io-wq: assign NUMA node locality if appropriate
There was an assumption that kthread_create_on_node() would properly set
NUMA affinities in terms of CPUs allowed, but it doesn't. Make sure we
do this when creating an io-wq context on NUMA.
which is a copy of fget failure condition jumping to cleanup, but the
cleanup requires ctx->file_data to be assigned. Assign it when setup,
and ensure that we clear it again for the error path exit.
Pavel Begunkov [Tue, 13 Oct 2020 08:44:00 +0000 (09:44 +0100)]
io_uring: fix REQ_F_COMP_LOCKED by killing it
REQ_F_COMP_LOCKED is used and implemented in a buggy way. The problem is
that the flag is set before io_put_req() but not cleared after, and if
that wasn't the final reference, the request will be freed with the flag
set from some other context, which may not hold a spinlock. That means
possible races with removing linked timeouts and unsynchronised
completion (e.g. access to CQ).
Instead of fixing REQ_F_COMP_LOCKED, kill the flag and use
task_work_add() to move such requests to a fresh context to free from
it, as was done with __io_free_req_finish().
Pavel Begunkov [Tue, 13 Oct 2020 08:43:59 +0000 (09:43 +0100)]
io_uring: dig out COMP_LOCK from deep call chain
io_req_clean_work() checks REQ_F_COMP_LOCK to pass this two layers up.
Move the check up into __io_free_req(), so at least it doesn't looks so
ugly and would facilitate further changes.
Pavel Begunkov [Tue, 13 Oct 2020 08:43:58 +0000 (09:43 +0100)]
io_uring: don't put a poll req under spinlock
Move io_put_req() in io_poll_task_handler() from under spinlock. This
eliminates the need to use REQ_F_COMP_LOCKED, at the expense of
potentially having to grab the lock again. That's still a better trade
off than relying on the locked flag.
If a request had REQ_F_LINK_TIMEOUT it would've been cleared in
__io_kill_linked_timeout() by the time of __io_fail_links(), so no need
to care about it.
Pavel Begunkov [Tue, 13 Oct 2020 08:43:56 +0000 (09:43 +0100)]
io_uring: don't set COMP_LOCKED if won't put
__io_kill_linked_timeout() sets REQ_F_COMP_LOCKED for a linked timeout
even if it can't cancel it, e.g. it's already running. It not only races
with io_link_timeout_fn() for ->flags field, but also leaves the flag
set and so io_link_timeout_fn() may find it and decide that it holds the
lock. Hopefully, the second problem is potential.
Colin Ian King [Mon, 12 Oct 2020 14:03:41 +0000 (15:03 +0100)]
io_uring: Fix sizeof() mismatch
An incorrect sizeof() is being used, sizeof(file_data->table) is not
correct, it should be sizeof(*file_data->table).
Fixes: 5398ae698525 ("io_uring: clean file_data access in files_register") Signed-off-by: Colin Ian King <[email protected]>
Addresses-Coverity: ("Sizeof not portable (SIZEOF_MISMATCH)") Signed-off-by: Jens Axboe <[email protected]>
Jassi Brar [Fri, 16 Oct 2020 17:20:56 +0000 (12:20 -0500)]
mailbox: avoid timer start from callback
If the txdone is done by polling, it is possible for msg_submit() to start
the timer while txdone_hrtimer() callback is running. If the timer needs
recheduling, it could already be enqueued by the time hrtimer_forward_now()
is called, leading hrtimer to loudly complain.
This can be fixed by not starting the timer from the callback path. Which
requires the timer reloading as long as any message is queued on the
channel, and not just when current tx is not done yet.
Ioana Ciornei [Thu, 15 Oct 2020 20:00:23 +0000 (23:00 +0300)]
net: pcs-xpcs: depend on MDIO_BUS instead of selecting it
The below compile time error can be seen when PHYLIB is configured as a
module.
ld: drivers/net/pcs/pcs-xpcs.o: in function `xpcs_read':
pcs-xpcs.c:(.text+0x29): undefined reference to `mdiobus_read'
ld: drivers/net/pcs/pcs-xpcs.o: in function `xpcs_soft_reset.constprop.7':
pcs-xpcs.c:(.text+0x80): undefined reference to `mdiobus_write'
ld: drivers/net/pcs/pcs-xpcs.o: in function `xpcs_config_aneg':
pcs-xpcs.c:(.text+0x318): undefined reference to `mdiobus_write'
ld: pcs-xpcs.c:(.text+0x38e): undefined reference to `mdiobus_write'
ld: pcs-xpcs.c:(.text+0x3eb): undefined reference to `mdiobus_write'
ld: pcs-xpcs.c:(.text+0x437): undefined reference to `mdiobus_write'
ld: drivers/net/pcs/pcs-xpcs.o:pcs-xpcs.c:(.text+0xb1e): more undefined references to `mdiobus_write' follow
PHYLIB being a module leads to MDIO_BUS being a module as well while the
XPCS is still built-in. What should happen in this configuration is that
PCS_XPCS should be forced to build as module. However, that select only
acts in the opposite way so we should turn it into a depends.
Fix this up by explicitly depending on MDIO_BUS.
Reported-by: Randy Dunlap <[email protected]> Acked-by: Randy Dunlap <[email protected]> # build-tested Fixes: 2fa4e4b799e1 ("net: pcs: Move XPCS into new PCS subdirectory") Signed-off-by: Ioana Ciornei <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
Eric Dumazet [Thu, 15 Oct 2020 18:42:00 +0000 (11:42 -0700)]
icmp: randomize the global rate limiter
Keyu Man reported that the ICMP rate limiter could be used
by attackers to get useful signal. Details will be provided
in an upcoming academic publication.
Our solution is to add some noise, so that the attackers
no longer can get help from the predictable token bucket limiter.
Dylan Hung [Wed, 14 Oct 2020 06:06:32 +0000 (14:06 +0800)]
net: ftgmac100: Fix Aspeed ast2600 TX hang issue
The new HW arbitration feature on Aspeed ast2600 will cause MAC TX to
hang when handling scatter-gather DMA. Disable the problematic feature
by setting MAC register 0x58 bit28 and bit27.
Fixes: 39bfab8844a0 ("net: ftgmac100: Add support for DT phy-handle property") Signed-off-by: Dylan Hung <[email protected]> Reviewed-by: Joel Stanley <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
Darrick J. Wong [Mon, 12 Oct 2020 21:10:03 +0000 (14:10 -0700)]
xfs: fix Kconfig asking about XFS_SUPPORT_V4 when XFS_FS=n
Pavel Machek complained that the question about supporting deprecated
XFS v4 comes up even when XFS is disabled. This clearly makes no sense,
so fix Kconfig.
Darrick J. Wong [Tue, 13 Oct 2020 15:46:27 +0000 (08:46 -0700)]
xfs: fix high key handling in the rt allocator's query_range function
Fix some off-by-one errors in xfs_rtalloc_query_range. The highest key
in the realtime bitmap is always one less than the number of rt extents,
which means that the key clamp at the start of the function is wrong.
The 4th argument to xfs_rtfind_forw is the highest rt extent that we
want to probe, which means that passing 1 less than the high key is
wrong. Finally, drop the rem variable that controls the loop because we
can compare the iteration point (rtstart) against the high key directly.
The sordid history of this function is that the original commit (fb3c3)
incorrectly passed (high_rec->ar_startblock - 1) as the 'limit' parameter
to xfs_rtfind_forw. This was wrong because the "high key" is supposed
to be the largest key for which the caller wants result rows, not the
key for the first row that could possibly be outside the range that the
caller wants to see.
A subsequent attempt (8ad56) to strengthen the parameter checking added
incorrect clamping of the parameters to the number of rt blocks in the
system (despite the bitmap functions all taking units of rt extents) to
avoid querying ranges past the end of rt bitmap file but failed to fix
the incorrect _rtfind_forw parameter. The original _rtfind_forw
parameter error then survived the conversion of the startblock and
blockcount fields to rt extents (a0e5c), and the most recent off-by-one
fix (a3a37) thought it was patching a problem when the end of the rt
volume is not in use, but none of these fixes actually solved the
original problem that the author was confused about the "limit" argument
to xfs_rtfind_forw.
Sadly, all four of these patches were written by this author and even
his own usage of this function and rt testing were inadequate to get
this fixed quickly.
Original-problem: fb3c3de2f65c ("xfs: add a couple of queries to iterate free extents in the rtbitmap") Not-fixed-by: 8ad560d2565e ("xfs: strengthen rtalloc query range checks") Not-fixed-by: a0e5c435babd ("xfs: fix xfs_rtalloc_rec units") Fixes: a3a374bf1889 ("xfs: fix off-by-one error in xfs_rtalloc_query_range") Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Chandan Babu R <[email protected]>
Linus Torvalds [Fri, 16 Oct 2020 22:29:46 +0000 (15:29 -0700)]
Merge tag 'ovl-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
- Improve performance for certain container setups by introducing a
"volatile" mode
- ioctl improvements
- continue preparation for unprivileged overlay mounts
* tag 'ovl-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: use generic vfs_ioc_setflags_prepare() helper
ovl: support [S|G]ETFLAGS and FS[S|G]ETXATTR ioctls for directories
ovl: rearrange ovl_can_list()
ovl: enumerate private xattrs
ovl: pass ovl_fs down to functions accessing private xattrs
ovl: drop flags argument from ovl_do_setxattr()
ovl: adhere to the vfs_ vs. ovl_do_ conventions for xattrs
ovl: use ovl_do_getxattr() for private xattr
ovl: fold ovl_getxattr() into ovl_get_redirect_xattr()
ovl: clean up ovl_getxattr() in copy_up.c
duplicate ovl_getxattr()
ovl: provide a mount option "volatile"
ovl: check for incompatible features in work dir
Linus Torvalds [Fri, 16 Oct 2020 22:22:41 +0000 (15:22 -0700)]
Merge tag 'afs-fixes-20201016' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull afs updates from David Howells:
"A collection of fixes to fix afs_cell struct refcounting, thereby
fixing a slew of related syzbot bugs:
- Fix the cell tree in the netns to use an rwsem rather than RCU.
There seem to be some problems deriving from the use of RCU and a
seqlock to walk the rbtree, but it's not entirely clear what since
there are several different failures being seen.
Changing things to use an rwsem instead makes it more robust. The
extra performance derived from using RCU isn't necessary in this
case since the only time we're looking up a cell is during mount or
when cells are being manually added.
- Fix the refcounting by splitting the usage counter into a memory
refcount and an active users counter. The usage counter was doing
double duty, keeping track of whether a cell is still in use and
keeping track of when it needs to be destroyed - but this makes the
clean up tricky. Separating these out simplifies the logic.
- Fix purging a cell that has an alias. A cell alias pins the cell
it's an alias of, but the alias is always later in the list. Trying
to purge in a single pass causes rmmod to hang in such a case.
- Fix cell removal. If a cell's manager is requeued whilst it's
removing itself, the manager will run again and re-remove itself,
causing problems in various places. Follow Hillf Danton's
suggestion to insert a more terminal state that causes the manager
to do nothing post-removal.
In additional to the above, two other changes:
- Add a tracepoint for the cell refcount and active users count. This
helped with debugging the above and may be useful again in future.
- Downgrade an assertion to a print when a still-active server is
seen during purging. This was happening as a consequence of
incomplete cell removal before the servers were cleaned up"
* tag 'afs-fixes-20201016' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Don't assert on unpurgeable server records
afs: Add tracing for cell refcount and active user count
afs: Fix cell removal
afs: Fix cell purging with aliases
afs: Fix cell refcounting by splitting the usage counter
afs: Fix rapid cell addition/removal by not using RCU on cells tree
Linus Torvalds [Fri, 16 Oct 2020 22:14:43 +0000 (15:14 -0700)]
Merge tag 'f2fs-for-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've added new features such as zone capacity for ZNS
and a new GC policy, ATGC, along with in-memory segment management. In
addition, we could improve the decompression speed significantly by
changing virtual mapping method. Even though we've fixed lots of small
bugs in compression support, I feel that it becomes more stable so
that I could give it a try in production.
Enhancements:
- suport zone capacity in NVMe Zoned Namespace devices
- introduce in-memory current segment management
- add standart casefolding support
- support age threshold based garbage collection
- improve decompression speed by changing virtual mapping method
Bug fixes:
- fix condition checks in some ioctl() such as compression, move_range, etc
- fix 32/64bits support in data structures
- fix memory allocation in zstd decompress
- add some boundary checks to avoid kernel panic on corrupted image
- fix disallowing compression for non-empty file
- fix slab leakage of compressed block writes
In addition, it includes code refactoring for better readability and
minor bug fixes for compression and zoned device support"
* tag 'f2fs-for-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (51 commits)
f2fs: code cleanup by removing unnecessary check
f2fs: wait for sysfs kobject removal before freeing f2fs_sb_info
f2fs: fix writecount false positive in releasing compress blocks
f2fs: introduce check_swap_activate_fast()
f2fs: don't issue flush in f2fs_flush_device_cache() for nobarrier case
f2fs: handle errors of f2fs_get_meta_page_nofail
f2fs: fix to set SBI_NEED_FSCK flag for inconsistent inode
f2fs: reject CASEFOLD inode flag without casefold feature
f2fs: fix memory alignment to support 32bit
f2fs: fix slab leak of rpages pointer
f2fs: compress: fix to disallow enabling compress on non-empty file
f2fs: compress: introduce cic/dic slab cache
f2fs: compress: introduce page array slab cache
f2fs: fix to do sanity check on segment/section count
f2fs: fix to check segment boundary during SIT page readahead
f2fs: fix uninit-value in f2fs_lookup
f2fs: remove unneeded parameter in find_in_block()
f2fs: fix wrong total_sections check and fsmeta check
f2fs: remove duplicated code in sanity_check_area_boundary
f2fs: remove unused check on version_bitmap
...
Linus Torvalds [Fri, 16 Oct 2020 22:02:21 +0000 (15:02 -0700)]
Merge tag 'docs/v5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull documentation updates from Mauro Carvalho Chehab:
"A series of patches addressing warnings produced by make htmldocs.
This includes:
- kernel-doc markup fixes
- ReST fixes
- Updates at the build system in order to support newer versions of
the docs build toolchain (Sphinx)
After this series, the number of html build warnings should reduce
significantly, and building with Sphinx 3.1 or later should now be
supported (although it is still recommended to use Sphinx 2.4.4).
As agreed with Jon, I should be sending you a late pull request by the
end of the merge window addressing remaining issues with docs build,
as there are a number of warning fixes that depends on pull requests
that should be happening along the merge window.
The end goal is to have a clean htmldocs build on Kernel 5.10.
PS. It should be noticed that Sphinx 3.0 is not currently supported,
as it lacks support for C domain namespaces. Such feature, needed in
order to document uAPI system calls with Sphinx 3.x, was added only on
Sphinx 3.1"
* tag 'docs/v5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (75 commits)
PM / devfreq: remove a duplicated kernel-doc markup
mm/doc: fix a literal block markup
workqueue: fix a kernel-doc warning
docs: virt: user_mode_linux_howto_v2.rst: fix a literal block markup
Input: sparse-keymap: add a description for @sw
rcu/tree: docs: document bkvcache new members at struct kfree_rcu_cpu
nl80211: docs: add a description for s1g_cap parameter
usb: docs: document altmode register/unregister functions
kunit: test.h: fix a bad kernel-doc markup
drivers: core: fix kernel-doc markup for dev_err_probe()
docs: bio: fix a kerneldoc markup
kunit: test.h: solve kernel-doc warnings
block: bio: fix a warning at the kernel-doc markups
docs: powerpc: syscall64-abi.rst: fix a malformed table
drivers: net: hamradio: fix document location
net: appletalk: Kconfig: Fix docs location
dt-bindings: fix references to files converted to yaml
memblock: get rid of a :c:type leftover
math64.h: kernel-docs: Convert some markups into normal comments
media: uAPI: buffer.rst: remove a left-over documentation
...
Hoang Huu Le [Fri, 16 Oct 2020 02:31:19 +0000 (09:31 +0700)]
tipc: fix incorrect setting window for bcast link
In commit 16ad3f4022bb
("tipc: introduce variable window congestion control"), we applied
the algorithm to select window size from minimum window to the
configured maximum window for unicast link, and, besides we chose
to keep the window size for broadcast link unchanged and equal (i.e
fix window 50)
However, when setting maximum window variable via command, the window
variable was re-initialized to unexpect value (i.e 32).
We fix this by updating the fix window for broadcast as we stated.
Fixes: 16ad3f4022bb ("tipc: introduce variable window congestion control") Acked-by: Jon Maloy <[email protected]> Signed-off-by: Hoang Huu Le <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
Hoang Huu Le [Fri, 16 Oct 2020 02:31:18 +0000 (09:31 +0700)]
tipc: re-configure queue limit for broadcast link
The queue limit of the broadcast link is being calculated base on initial
MTU. However, when MTU value changed (e.g manual changing MTU on NIC
device, MTU negotiation etc.,) we do not re-calculate queue limit.
This gives throughput does not reflect with the change.
So fix it by calling the function to re-calculate queue limit of the
broadcast link.
Linus Torvalds [Fri, 16 Oct 2020 19:52:37 +0000 (12:52 -0700)]
Merge tag 'printk-for-5.10-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk fix from Petr Mladek:
"Prevent overflow in the new lockless ringbuffer"
* tag 'printk-for-5.10-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk: ringbuffer: Wrong data pointer when appending small string
Linus Torvalds [Fri, 16 Oct 2020 19:47:18 +0000 (12:47 -0700)]
Merge tag 'kgdb-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb updates from Daniel Thompson:
"A fairly modest set of changes for this cycle.
Of particular note are an earlycon fix from Doug Anderson and my own
changes to get kgdb/kdb to honour the kprobe blocklist. The later
creates a safety rail that strongly encourages developers not to place
breakpoints in, for example, arch specific trap handling code.
Also included are a couple of small fixes and tweaks: an API update,
eliminate a coverity dead code warning, improved handling of search
during multi-line printk and a couple of typo corrections"
* tag 'kgdb-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kdb: Fix pager search for multi-line strings
kernel: debug: Centralize dbg_[de]activate_sw_breakpoints
kgdb: Add NOKPROBE labels on the trap handler functions
kgdb: Honour the kprobe blocklist when setting breakpoints
kernel/debug: Fix spelling mistake in debug_core.c
kdb: Use newer api for tasklist scanning
kgdb: Make "kgdbcon" work properly with "kgdb_earlycon"
kdb: remove unnecessary null check of dbg_io_ops
Linus Torvalds [Fri, 16 Oct 2020 19:40:55 +0000 (12:40 -0700)]
Merge tag 'mips_5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS updates from Thomas Bogendoerfer:
- removed support for PNX833x alias NXT_STB22x
- included Ingenic SoC support into generic MIPS kernels
- added support for new Ingenic SoCs
- converted workaround selection to use Kconfig
- replaced old boot mem functions by memblock_*
- enabled COP2 usage in kernel for Loongson64 to make use
of 16byte load/stores possible
- cleanups and fixes
* tag 'mips_5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (92 commits)
MIPS: DEC: Restore bootmem reservation for firmware working memory area
MIPS: dec: fix section mismatch
bcm963xx_tag.h: fix duplicated word
mips: ralink: enable zboot support
MIPS: ingenic: Remove CPU_SUPPORTS_HUGEPAGES
MIPS: cpu-probe: remove MIPS_CPU_BP_GHIST option bit
MIPS: cpu-probe: introduce exclusive R3k CPU probe
MIPS: cpu-probe: move fpu probing/handling into its own file
MIPS: replace add_memory_region with memblock
MIPS: Loongson64: Clean up numa.c
MIPS: Loongson64: Select SMP in Kconfig to avoid build error
mips: octeon: Add Ubiquiti E200 and E220 boards
MIPS: SGI-IP28: disable use of ll/sc in kernel
MIPS: tx49xx: move tx4939_add_memory_regions into only user
MIPS: pgtable: Remove used PAGE_USERIO define
MIPS: alchemy: Share prom_init implementation
MIPS: alchemy: Fix build breakage, if TOUCHSCREEN_WM97XX is disabled
MIPS: process: include exec.h header in process.c
MIPS: process: Add prototype for function arch_dup_task_struct
MIPS: idle: Add prototype for function check_wait
...
Linus Torvalds [Fri, 16 Oct 2020 19:36:38 +0000 (12:36 -0700)]
Merge tag 's390-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Vasily Gorbik:
- Remove address space overrides using set_fs()
- Convert to generic vDSO
- Convert to generic page table dumper
- Add ARCH_HAS_DEBUG_WX support
- Add leap seconds handling support
- Add NVMe firmware-assisted kernel dump support
- Extend NVMe boot support with memory clearing control and addition of
kernel parameters
- AP bus and zcrypt api code rework. Add adapter configure/deconfigure
interface. Extend debug features. Add failure injection support
- Add ECC secure private keys support
- Add KASan support for running protected virtualization host with
4-level paging
- Utilize destroy page ultravisor call to speed up secure guests
shutdown
- Implement ioremap_wc() and ioremap_prot() with MIO in PCI code
- Various checksum improvements
- Other small various fixes and improvements all over the code
* tag 's390-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (85 commits)
s390/uaccess: fix indentation
s390/uaccess: add default cases for __put_user_fn()/__get_user_fn()
s390/zcrypt: fix wrong format specifications
s390/kprobes: move insn_page to text segment
s390/sie: fix typo in SIGP code description
s390/lib: fix kernel doc for memcmp()
s390/zcrypt: Introduce Failure Injection feature
s390/zcrypt: move ap_msg param one level up the call chain
s390/ap/zcrypt: revisit ap and zcrypt error handling
s390/ap: Support AP card SCLP config and deconfig operations
s390/sclp: Add support for SCLP AP adapter config/deconfig
s390/ap: add card/queue deconfig state
s390/ap: add error response code field for ap queue devices
s390/ap: split ap queue state machine state from device state
s390/zcrypt: New config switch CONFIG_ZCRYPT_DEBUG
s390/zcrypt: introduce msg tracking in zcrypt functions
s390/startup: correct early pgm check info formatting
s390: remove orphaned extern variables declarations
s390/kasan: make sure int handler always run with DAT on
s390/ipl: add support to control memory clearing for nvme re-IPL
...
Linus Torvalds [Fri, 16 Oct 2020 19:21:15 +0000 (12:21 -0700)]
Merge tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- A series from Nick adding ARCH_WANT_IRQS_OFF_ACTIVATE_MM & selecting
it for powerpc, as well as a related fix for sparc.
- Remove support for PowerPC 601.
- Some fixes for watchpoints & addition of a new ptrace flag for
detecting ISA v3.1 (Power10) watchpoint features.
- A fix for kernels using 4K pages and the hash MMU on bare metal
Power9 systems with > 16TB of RAM, or RAM on the 2nd node.
- A basic idle driver for shallow stop states on Power10.
- Tweaks to our sched domains code to better inform the scheduler about
the hardware topology on Power9/10, where two SMT4 cores can be
presented by firmware as an SMT8 core.
- A series doing further reworks & cleanups of our EEH code.
- Addition of a filter for RTAS (firmware) calls done via sys_rtas(),
to prevent root from overwriting kernel memory.
- Other smaller features, fixes & cleanups.
Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Athira Rajeev, Biwen Li, Cameron Berkenpas, Cédric Le Goater, Christophe
Leroy, Christoph Hellwig, Colin Ian King, Daniel Axtens, David Dai, Finn
Thain, Frederic Barrat, Gautham R. Shenoy, Greg Kurz, Gustavo Romero,
Ira Weiny, Jason Yan, Joel Stanley, Jordan Niethe, Kajol Jain, Konrad
Rzeszutek Wilk, Laurent Dufour, Leonardo Bras, Liu Shixin, Luca
Ceresoli, Madhavan Srinivasan, Mahesh Salgaonkar, Nathan Lynch, Nicholas
Mc Guire, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Pedro
Miraglia Franco de Carvalho, Pratik Rajesh Sampat, Qian Cai, Qinglang
Miao, Ravi Bangoria, Russell Currey, Satheesh Rajendran, Scott Cheloha,
Segher Boessenkool, Srikar Dronamraju, Stan Johnson, Stephen Kitt,
Stephen Rothwell, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain,
Vaidyanathan Srinivasan, Vasant Hegde, Wang Wensheng, Wolfram Sang, Yang
Yingliang, zhengbin.
* tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (228 commits)
Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed"
selftests/powerpc: Fix eeh-basic.sh exit codes
cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_reboot_notifier
powerpc/time: Make get_tb() common to PPC32 and PPC64
powerpc/time: Make get_tbl() common to PPC32 and PPC64
powerpc/time: Remove get_tbu()
powerpc/time: Avoid using get_tbl() and get_tbu() internally
powerpc/time: Make mftb() common to PPC32 and PPC64
powerpc/time: Rename mftbl() to mftb()
powerpc/32s: Remove #ifdef CONFIG_PPC_BOOK3S_32 in head_book3s_32.S
powerpc/32s: Rename head_32.S to head_book3s_32.S
powerpc/32s: Setup the early hash table at all time.
powerpc/time: Remove ifdef in get_dec() and set_dec()
powerpc: Remove get_tb_or_rtc()
powerpc: Remove __USE_RTC()
powerpc: Tidy up a bit after removal of PowerPC 601.
powerpc: Remove support for PowerPC 601
powerpc: Remove PowerPC 601
powerpc: Drop SYNC_601() ISYNC_601() and SYNC()
powerpc: Remove CONFIG_PPC601_SYNC_FIX
...
Dan Aloni [Fri, 2 Oct 2020 19:33:43 +0000 (22:33 +0300)]
svcrdma: fix bounce buffers for unaligned offsets and multiple pages
This was discovered using O_DIRECT at the client side, with small
unaligned file offsets or IOs that span multiple file pages.
Fixes: e248aa7be86 ("svcrdma: Remove max_sge check at connect time") Signed-off-by: Dan Aloni <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>