Jack Morgenstein [Tue, 12 Mar 2019 15:05:49 +0000 (17:05 +0200)]
net/mlx4_core: Fix qp mtt size calculation
Calculation of qp mtt size (in function mlx4_RST2INIT_wrapper)
ultimately depends on function roundup_pow_of_two.
If the amount of memory required by the QP is less than one page,
roundup_pow_of_two is called with argument zero. In this case, the
roundup_pow_of_two result is undefined.
Calling roundup_pow_of_two with a zero argument resulted in the
following stack trace:
UBSAN: Undefined behaviour in ./include/linux/log2.h:61:13
shift exponent 64 is too large for 64-bit type 'long unsigned int'
CPU: 4 PID: 26939 Comm: rping Tainted: G OE 4.19.0-rc1
Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 3.2a 07/09/2015
Call Trace:
dump_stack+0x9a/0xeb
ubsan_epilogue+0x9/0x7c
__ubsan_handle_shift_out_of_bounds+0x254/0x29d
? __ubsan_handle_load_invalid_value+0x180/0x180
? debug_show_all_locks+0x310/0x310
? sched_clock+0x5/0x10
? sched_clock+0x5/0x10
? sched_clock_cpu+0x18/0x260
? find_held_lock+0x35/0x1e0
? mlx4_RST2INIT_QP_wrapper+0xfb1/0x1440 [mlx4_core]
mlx4_RST2INIT_QP_wrapper+0xfb1/0x1440 [mlx4_core]
Fix this by explicitly testing for zero, and returning one if the
argument is zero (assuming that the next higher power of 2 in this case
should be one).
Fixes: c82e9aa0a8bc ("mlx4_core: resource tracking for HCA resources used by guests") Signed-off-by: Jack Morgenstein <[email protected]> Signed-off-by: Tariq Toukan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jack Morgenstein [Tue, 12 Mar 2019 15:05:48 +0000 (17:05 +0200)]
net/mlx4_core: Fix locking in SRIOV mode when switching between events and polling
In procedures mlx4_cmd_use_events() and mlx4_cmd_use_polling(), we need to
guarantee that there are no FW commands in progress on the comm channel
(for VFs) or wrapped FW commands (on the PF) when SRIOV is active.
We do this by also taking the slave_cmd_mutex when SRIOV is active.
This is especially important when switching from event to polling, since we
free the command-context array during the switch. If there are FW commands
in progress (e.g., waiting for a completion event), the completion event
handler will access freed memory.
Since the decision to use comm_wait or comm_poll is taken before grabbing
the event_sem/poll_sem in mlx4_comm_cmd_wait/poll, we must take the
slave_cmd_mutex as well (to guarantee that the decision to use events or
polling and the call to the appropriate cmd function are atomic).
Fixes: a7e1f04905e5 ("net/mlx4_core: Fix deadlock when switching between polling and event fw commands") Signed-off-by: Jack Morgenstein <[email protected]> Signed-off-by: Tariq Toukan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jack Morgenstein [Tue, 12 Mar 2019 15:05:47 +0000 (17:05 +0200)]
net/mlx4_core: Fix reset flow when in command polling mode
As part of unloading a device, the driver switches from
FW command event mode to FW command polling mode.
Part of switching over to polling mode is freeing the command context array
memory (unfortunately, currently, without NULLing the command context array
pointer).
The reset flow calls "complete" to complete all outstanding fw commands
(if we are in event mode). The check for event vs. polling mode here
is to test if the command context array pointer is NULL.
If the reset flow is activated after the switch to polling mode, it will
attempt (incorrectly) to complete all the commands in the context array --
because the pointer was not NULLed when the driver switched over to polling
mode.
As a result, we have a use-after-free situation, which results in a
kernel crash.
Linus Torvalds [Tue, 12 Mar 2019 21:58:35 +0000 (14:58 -0700)]
Merge tag 'ceph-for-5.1-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The highlights are:
- rbd will now ignore discards that aren't aligned and big enough to
actually free up some space (myself). This is controlled by the new
alloc_size map option and can be disabled if needed.
- support for rbd deep-flatten feature (myself). Deep-flatten allows
"rbd flatten" to fully disconnect the clone image and its snapshots
from the parent and make the parent snapshot removable.
- a new round of cap handling improvements (Zheng Yan). The kernel
client should now be much more prompt about releasing its caps and
it is possible to put a limit on the number of caps held.
- support for getting ceph.dir.pin extended attribute (Zheng Yan)"
* tag 'ceph-for-5.1-rc1' of git://github.com/ceph/ceph-client: (26 commits)
Documentation: modern versions of ceph are not backed by btrfs
rbd: advertise support for RBD_FEATURE_DEEP_FLATTEN
rbd: whole-object write and zeroout should copyup when snapshots exist
rbd: copyup with an empty snapshot context (aka deep-copyup)
rbd: introduce rbd_obj_issue_copyup_ops()
rbd: stop copying num_osd_ops in rbd_obj_issue_copyup()
rbd: factor out __rbd_osd_req_create()
rbd: clear ->xferred on error from rbd_obj_issue_copyup()
rbd: remove experimental designation from kernel layering
ceph: add mount option to limit caps count
ceph: periodically trim stale dentries
ceph: delete stale dentry when last reference is dropped
ceph: remove dentry_lru file from debugfs
ceph: touch existing cap when handling reply
ceph: pass inclusive lend parameter to filemap_write_and_wait_range()
rbd: round off and ignore discards that are too small
rbd: handle DISCARD and WRITE_ZEROES separately
rbd: get rid of obj_req->obj_request_count
libceph: use struct_size() for kmalloc() in crush_decode()
ceph: send cap releases more aggressively
...
David S. Miller [Tue, 12 Mar 2019 21:55:16 +0000 (14:55 -0700)]
Merge branch 'mlxsw-Various-fixes'
Ido Schimmel says:
====================
mlxsw: Various fixes
Patch #1 fixes the recently introduced QSFP thermal zones to correctly
work with split ports, where several ports are mapped to the same
module.
Patch #2 initializes the base MAC in the minimal driver. The driver is
using the base MAC as its parent ID and without initializing it, it is
reported as all zeroes to user space.
====================
Vadim Pasternak [Tue, 12 Mar 2019 08:40:41 +0000 (08:40 +0000)]
mlxsw: core: Prevent duplication during QSFP module initialization
Verify during thermal initialization if QSFP module's entry is already
configured in order to prevent duplication.
Such scenario could happen in case two switch drivers (PCI and I2C
based) coexist and if after boot, splitting configuration is applied
for some ports and then I2C based driver is re-probed.
In such case after reboot same QSFP module, associated with split will
be discovered by I2C based driver few times, and it will cause a crash.
It could happen for example on system equipped with BMC (Baseboard
Management Controller), running I2C based driver, when the next steps
are performed:
- System boot
- Host side configures port spilt.
- BMC side is rebooted.
Fixes: 6a79507cfe94 ("mlxsw: core: Extend thermal module with per QSFP module thermal zones") Signed-off-by: Vadim Pasternak <[email protected]> Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Linus Torvalds [Tue, 12 Mar 2019 21:53:57 +0000 (14:53 -0700)]
Merge tag 'for-5.1-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Correctness and a deadlock fixes"
* tag 'for-5.1-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zstd: ensure reclaim timer is properly cleaned up
btrfs: move ulist allocation out of transaction in quota enable
btrfs: save drop_progress if we drop refs at all
btrfs: check for refs on snapshot delete resume
Btrfs: fix deadlock between clone/dedupe and rename
Btrfs: fix corruption reading shared and compressed extents after hole punching
Linus Torvalds [Tue, 12 Mar 2019 21:50:42 +0000 (14:50 -0700)]
Merge tag 'nfs-for-5.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- Fixes for NFS I/O request leakages
- Fix error handling paths in the NFS I/O recoalescing code
- Reinitialise NFSv4.1 sequence results before retransmitting a
request
- Fix a soft lockup in the delegation recovery code
- Bulk destroy of layouts needs to be safe w.r.t. umount
- Prevent thundering herd issues when the SUNRPC socket is not
connected
- Respect RPC call timeouts when retrying transmission
Features:
- Convert rpc auth layer to use xdr_streams
- Config option to disable insecure RPCSEC_GSS crypto types
- Reduce size of RPC receive buffers
- Readdirplus optimization by cache mechanism
- Convert SUNRPC socket send code to use iov_iter()
- SUNRPC micro-optimisations to avoid indirect calls
- Add support for the pNFS LAYOUTERROR operation and use it with the
pNFS/flexfiles driver
- Add trace events to report non-zero NFS status codes
- Various removals of unnecessary dprintks
Bugfixes and cleanups:
- Fix a number of sparse warnings and documentation format warnings
- Fix nfs_parse_devname to not modify it's argument
- Fix potential corruption of page being written through pNFS/blocks
- fix xfstest generic/099 failures on nfsv3
- Avoid NFSv4.1 "false retries" when RPC calls are interrupted
- Abort I/O early if the pNFS/flexfiles layout segment was
invalidated
- Avoid unnecessary pNFS/flexfiles layout invalidations"
* tag 'nfs-for-5.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (90 commits)
SUNRPC: Take the transport send lock before binding+connecting
SUNRPC: Micro-optimise when the task is known not to be sleeping
SUNRPC: Check whether the task was transmitted before rebind/reconnect
SUNRPC: Remove redundant calls to RPC_IS_QUEUED()
SUNRPC: Clean up
SUNRPC: Respect RPC call timeouts when retrying transmission
SUNRPC: Fix up RPC back channel transmission
SUNRPC: Prevent thundering herd when the socket is not connected
SUNRPC: Allow dynamic allocation of back channel slots
NFSv4.1: Bump the default callback session slot count to 16
SUNRPC: Convert remaining GFP_NOIO, and GFP_NOWAIT sites in sunrpc
NFS/flexfiles: Clean up mirror DS initialisation
NFS/flexfiles: Remove dead code in ff_layout_mirror_valid()
NFS/flexfile: Simplify nfs4_ff_layout_select_ds_stateid()
NFS/flexfile: Simplify nfs4_ff_layout_ds_version()
NFS/flexfiles: Simplify ff_layout_get_ds_cred()
NFS/flexfiles: Simplify nfs4_ff_find_or_create_ds_client()
NFS/flexfiles: Simplify nfs4_ff_layout_select_ds_fh()
NFS/flexfiles: Speed up read failover when DSes are down
NFS/flexfiles: Don't invalidate DS deviceids for being unresponsive
...
Linus Torvalds [Tue, 12 Mar 2019 21:48:52 +0000 (14:48 -0700)]
Merge tag 'ovl-update-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
"Fix copy up of security related xattrs"
* tag 'ovl-update-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: Do not lose security.capability xattr over metadata file copy-up
ovl: During copy up, first copy up data and then xattrs
Linus Torvalds [Tue, 12 Mar 2019 21:46:26 +0000 (14:46 -0700)]
Merge tag 'fuse-update-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
"Scalability and performance improvements, as well as minor bug fixes
and cleanups"
* tag 'fuse-update-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits)
fuse: cache readdir calls if filesystem opts out of opendir
fuse: support clients that don't implement 'opendir'
fuse: lift bad inode checks into callers
fuse: multiplex cached/direct_io file operations
fuse add copy_file_range to direct io fops
fuse: use iov_iter based generic splice helpers
fuse: Switch to using async direct IO for FOPEN_DIRECT_IO
fuse: use atomic64_t for khctr
fuse: clean up aborted
fuse: Protect ff->reserved_req via corresponding fi->lock
fuse: Protect fi->nlookup with fi->lock
fuse: Introduce fi->lock to protect write related fields
fuse: Convert fc->attr_version into atomic64_t
fuse: Add fuse_inode argument to fuse_prepare_release()
fuse: Verify userspace asks to requeue interrupt that we really sent
fuse: Do some refactoring in fuse_dev_do_write()
fuse: Wake up req->waitq of only if not background
fuse: Optimize request_end() by not taking fiq->waitq.lock
fuse: Kill fasync only if interrupt is queued in queue_interrupt()
fuse: Remove stale comment in end_requests()
...
Linus Torvalds [Tue, 12 Mar 2019 21:08:19 +0000 (14:08 -0700)]
Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs mount infrastructure updates from Al Viro:
"The rest of core infrastructure; no new syscalls in that pile, but the
old parts are switched to new infrastructure. At that point
conversions of individual filesystems can happen independently; some
are done here (afs, cgroup, procfs, etc.), there's also a large series
outside of that pile dealing with NFS (quite a bit of option-parsing
stuff is getting used there - it's one of the most convoluted
filesystems in terms of mount-related logics), but NFS bits are the
next cycle fodder.
It got seriously simplified since the last cycle; documentation is
probably the weakest bit at the moment - I considered dropping the
commit introducing Documentation/filesystems/mount_api.txt (cutting
the size increase by quarter ;-), but decided that it would be better
to fix it up after -rc1 instead.
That pile allows to do followup work in independent branches, which
should make life much easier for the next cycle. fs/super.c size
increase is unpleasant; there's a followup series that allows to
shrink it considerably, but I decided to leave that until the next
cycle"
* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits)
afs: Use fs_context to pass parameters over automount
afs: Add fs_context support
vfs: Add some logging to the core users of the fs_context log
vfs: Implement logging through fs_context
vfs: Provide documentation for new mount API
vfs: Remove kern_mount_data()
hugetlbfs: Convert to fs_context
cpuset: Use fs_context
kernfs, sysfs, cgroup, intel_rdt: Support fs_context
cgroup: store a reference to cgroup_ns into cgroup_fs_context
cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helper
cgroup_do_mount(): massage calling conventions
cgroup: stash cgroup_root reference into cgroup_fs_context
cgroup2: switch to option-by-option parsing
cgroup1: switch to option-by-option parsing
cgroup: take options parsing into ->parse_monolithic()
cgroup: fold cgroup1_mount() into cgroup1_get_tree()
cgroup: start switching to fs_context
ipc: Convert mqueue fs to fs_context
proc: Add fs_context support to procfs
...
Linus Torvalds [Tue, 12 Mar 2019 20:43:42 +0000 (13:43 -0700)]
Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
"A couple of iov_iter patches - Christoph's crapectomy (the last
remaining user of iov_for_each() went away with lustre, IIRC) and
Eric'c optimization of sanity checks"
* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
iov_iter: optimize page_copy_sane()
uio: remove the unused iov_for_each macro
Abel Vesa [Tue, 5 Mar 2019 09:49:16 +0000 (09:49 +0000)]
dt-bindings: clock: imx8mq: Fix numbering overlaps and gaps
IMX8MQ_CLK_USB_PHY_REF changes from 163 to 153, this way removing the gap.
All the following clock ids are now decreased by 10 to keep the numbering
right. Doing this, the IMX8MQ_CLK_CSI2_CORE is not overlapped with
IMX8MQ_CLK_GPT1 anymore. IMX8MQ_CLK_GPT1_ROOT changes from 193 to 183 and
all the following ids are updated accordingly.
Trond Myklebust [Tue, 12 Mar 2019 20:04:51 +0000 (16:04 -0400)]
pNFS: Fix a typo in pnfs_update_layout
We're supposed to wait for the outstanding layout count to go to zero,
but that got lost somehow.
Fixes: d03360aaf5cca ("pNFS: Ensure we return the error if someone...") Reported-by: Anna Schumaker <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
Linus Torvalds [Tue, 12 Mar 2019 17:39:53 +0000 (10:39 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
- a few misc things
- the rest of MM
- remove flex_arrays, replace with new simple radix-tree implementation
* emailed patches from Andrew Morton <[email protected]>: (38 commits)
Drop flex_arrays
sctp: convert to genradix
proc: commit to genradix
generic radix trees
selinux: convert to kvmalloc
md: convert to kvmalloc
openvswitch: convert to kvmalloc
of: fix kmemleak crash caused by imbalance in early memory reservation
mm: memblock: update comments and kernel-doc
memblock: split checks whether a region should be skipped to a helper function
memblock: remove memblock_{set,clear}_region_flags
memblock: drop memblock_alloc_*_nopanic() variants
memblock: memblock_alloc_try_nid: don't panic
treewide: add checks for the return value of memblock_alloc*()
swiotlb: add checks for the return value of memblock_alloc*()
init/main: add checks for the return value of memblock_alloc*()
mm/percpu: add checks for the return value of memblock_alloc*()
sparc: add checks for the return value of memblock_alloc*()
ia64: add checks for the return value of memblock_alloc*()
arch: don't memset(0) memory returned by memblock_alloc()
...
Xiao Ni [Fri, 8 Mar 2019 15:52:05 +0000 (23:52 +0800)]
It's wrong to add len to sector_nr in raid10 reshape twice
In reshape_request it already adds len to sector_nr already. It's wrong to add len to
sector_nr again after adding pages to bio. If there is bad block it can't copy one chunk
at a time, it needs to goto read_more. Now the sector_nr is wrong. It can cause data
corruption.
When the Partial Parity Log is enabled, circular buffer is used to store
PPL data. Each write to RAID device causes overwrite of data in this buffer
so some write_hint can be set to those request to help drives handle
garbage collection. This patch adds new sysfs attribute which can be used
to specify which write_hint should be assigned to PPL.
Kent Overstreet [Tue, 12 Mar 2019 06:31:18 +0000 (23:31 -0700)]
proc: commit to genradix
The new generic radix trees have a simpler API and implementation, and
no limitations on number of elements, so all flex_array users are being
converted
Kent Overstreet [Tue, 12 Mar 2019 06:31:14 +0000 (23:31 -0700)]
generic radix trees
Very simple radix tree implementation that supports storing arbitrary
size entries, up to PAGE_SIZE - upcoming patches will convert existing
flex_array users to genradixes. The new genradix code has a much
simpler API and implementation, and doesn't have a hard limit on the
number of elements like flex_array does.
Kent Overstreet [Tue, 12 Mar 2019 06:31:02 +0000 (23:31 -0700)]
openvswitch: convert to kvmalloc
Patch series "generic radix trees; drop flex arrays".
This patch (of 7):
There was no real need for this code to be using flexarrays, it's just
implementing a hash table - ideally it would be using rhashtables, but
that conversion would be significantly more complicated.
Mike Rapoport [Tue, 12 Mar 2019 06:30:58 +0000 (23:30 -0700)]
of: fix kmemleak crash caused by imbalance in early memory reservation
Marc Gonzalez reported the following kmemleak crash:
Unable to handle kernel paging request at virtual address ffffffc021e00000
Mem abort info:
ESR = 0x96000006
Exception class = DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000006
CM = 0, WnR = 0
swapper pgtable: 4k pages, 39-bit VAs, pgdp = (____ptrval____) [ffffffc021e00000] pgd=000000017e3ba803, pud=000000017e3ba803, pmd=0000000000000000
Internal error: Oops: 96000006 [#1] PREEMPT SMP
Modules linked in:
CPU: 6 PID: 523 Comm: kmemleak Tainted: G S W 5.0.0-rc1 #13
Hardware name: Qualcomm Technologies, Inc. MSM8998 v1 MTP (DT)
pstate: 80000085 (Nzcv daIf -PAN -UAO)
pc : scan_block+0x70/0x190
lr : scan_block+0x6c/0x190
Process kmemleak (pid: 523, stack limit = 0x(____ptrval____))
Call trace:
scan_block+0x70/0x190
scan_gray_list+0x108/0x1c0
kmemleak_scan+0x33c/0x7c0
kmemleak_scan_thread+0x98/0xf0
kthread+0x11c/0x120
ret_from_fork+0x10/0x1c
Code: f9000fb4d503201f97ffffd235000580 (f9400260)
The crash happens when a no-map area is allocated in
early_init_dt_alloc_reserved_memory_arch(). The allocated region is
registered with kmemleak, but it is then removed from memblock using
memblock_remove() that is not kmemleak-aware.
Replacing memblock_phys_alloc_range() with memblock_find_in_range()
makes sure that the allocated memory is not added to kmemleak and then
memblock_remove()'ing this memory is safe.
As a bonus, since memblock_find_in_range() ensures the allocation in the
specified range, the bounds check can be removed.
Mike Rapoport [Tue, 12 Mar 2019 06:30:50 +0000 (23:30 -0700)]
memblock: split checks whether a region should be skipped to a helper function
__next_mem_range() and __next_mem_range_rev() duplicate the code that
checks whether a region should be skipped because of node or flags
incompatibility.
The memblock API provides dedicated helpers to set or clear a flag on a
memory region, e.g. memblock_{mark,clear}_hotplug().
The memblock_{set,clear}_region_flags() functions are used only by the
memblock internal function that adjusts the region flags. Drop these
functions and use open-coded implementation instead.
Mike Rapoport [Tue, 12 Mar 2019 06:30:37 +0000 (23:30 -0700)]
memblock: memblock_alloc_try_nid: don't panic
As all the memblock_alloc*() users are now checking the return value and
panic() in case of error, the panic() call can be removed from the core
memblock allocator, namely memblock_alloc_try_nid().
Mike Rapoport [Tue, 12 Mar 2019 06:30:31 +0000 (23:30 -0700)]
treewide: add checks for the return value of memblock_alloc*()
Add check for the return value of memblock_alloc*() functions and call
panic() in case of error. The panic message repeats the one used by
panicing memblock allocators with adjustment of parameters to include
only relevant ones.
The replacement was mostly automated with semantic patches like the one
below with manual massaging of format strings.
Mike Rapoport [Tue, 12 Mar 2019 06:30:26 +0000 (23:30 -0700)]
swiotlb: add checks for the return value of memblock_alloc*()
Add panic() calls if memblock_alloc() returns NULL.
The panic() format duplicates the one used by memblock itself and in
order to avoid explosion with long parameters list replace open coded
allocation size calculations with a local variable.
Mike Rapoport [Tue, 12 Mar 2019 06:30:20 +0000 (23:30 -0700)]
init/main: add checks for the return value of memblock_alloc*()
Add panic() calls if memblock_alloc() returns NULL.
The panic() format duplicates the one used by memblock itself and in
order to avoid explosion with long parameters list replace open coded
allocation size calculations with a local variable.
Mike Rapoport [Tue, 12 Mar 2019 06:30:15 +0000 (23:30 -0700)]
mm/percpu: add checks for the return value of memblock_alloc*()
Add panic() calls if memblock_alloc() returns NULL.
The panic() format duplicates the one used by memblock itself and in
order to avoid explosion with long parameters list replace open coded
allocation size calculations with a local variable.
Mike Rapoport [Tue, 12 Mar 2019 06:29:41 +0000 (23:29 -0700)]
memblock: refactor internal allocation functions
Currently, memblock has several internal functions with overlapping
functionality. They all call memblock_find_in_range_node() to find free
memory and then reserve the allocated range and mark it with kmemleak.
However, there is difference in the allocation constraints and in
fallback strategies.
The allocations returning physical address first attempt to find free
memory on the specified node within mirrored memory regions, then retry
on the same node without the requirement for memory mirroring and
finally fall back to all available memory.
The allocations returning virtual address start with clamping the
allowed range to memblock.current_limit, attempt to allocate from the
specified node from regions with mirroring and with user defined minimal
address. If such allocation fails, next attempt is done with node
restriction lifted. Next, the allocation is retried with minimal
address reset to zero and at last without the requirement for mirrored
regions.
Let's consolidate various fallbacks handling and make them more
consistent for physical and virtual variants. Most of the fallback
handling is moved to memblock_alloc_range_nid() and it now handles node
and mirror fallbacks.
The memblock_alloc_internal() uses memblock_alloc_range_nid() to get a
physical address of the allocated range and converts it to virtual
address.
The fallback for allocation below the specified minimal address remains
in memblock_alloc_internal() because memblock_alloc_range_nid() is used
by CMA with exact requirement for lower bounds.
The memblock_phys_alloc_nid() function is completely dropped as it is not
used anywhere outside memblock and its only usage can be replaced by a
call to memblock_alloc_range_nid().
Mike Rapoport [Tue, 12 Mar 2019 06:29:35 +0000 (23:29 -0700)]
memblock: drop memblock_alloc_base()
The memblock_alloc_base() function tries to allocate a memory up to the
limit specified by its max_addr parameter and panics if the allocation
fails. Replace its usage with memblock_phys_alloc_range() and make the
callers check the return value and panic in case of error.
Mike Rapoport [Tue, 12 Mar 2019 06:29:31 +0000 (23:29 -0700)]
memblock: drop __memblock_alloc_base()
The __memblock_alloc_base() function tries to allocate a memory up to
the limit specified by its max_addr parameter. Depending on the value
of this parameter, the __memblock_alloc_base() can is replaced with the
appropriate memblock_phys_alloc*() variant.
Mike Rapoport [Tue, 12 Mar 2019 06:29:26 +0000 (23:29 -0700)]
memblock: memblock_phys_alloc(): don't panic
Make the memblock_phys_alloc() function an inline wrapper for
memblock_phys_alloc_range() and update the memblock_phys_alloc() callers
to check the returned value and panic in case of error.
The memblock_phys_alloc_try_nid() function tries to allocate memory from
the requested node and then falls back to allocation from any node in
the system. The memblock_alloc_base() fallback used by this function
panics if the allocation fails.
Replace the memblock_alloc_base() fallback with the direct call to
memblock_alloc_range_nid() and update the memblock_phys_alloc_try_nid()
callers to check the returned value and panic in case of error.
Mike Rapoport [Tue, 12 Mar 2019 06:29:16 +0000 (23:29 -0700)]
memblock: emphasize that memblock_alloc_range() returns a physical address
Rename memblock_alloc_range() to memblock_phys_alloc_range() to
emphasize that it returns a physical address.
While on it, remove the 'enum memblock_flags' parameter from this
function as its only user anyway sets it to MEMBLOCK_NONE, which is the
default for the most of memblock allocations.
Mike Rapoport [Tue, 12 Mar 2019 06:29:06 +0000 (23:29 -0700)]
memblock: replace memblock_alloc_base(ANYWHERE) with memblock_phys_alloc
The calls to memblock_alloc_base(size, align, MEMBLOCK_ALLOC_ANYWHERE)
and memblock_phys_alloc(size, align) are equivalent as both try to
allocate 'size' bytes with 'align' alignment anywhere in the memory and
panic if hte allocation fails.
The conversion is done using the following semantic patch:
Current memblock API is quite extensive and, which is more annoying,
duplicated. Except the low-level functions that allow searching for a
free memory region and marking it as reserved, memblock provides three
(well, two and a half) sets of functions to allocate memory.
There are several overlapping functions that return a physical address
and there are functions that return virtual address. Those that return
the virtual address may also clear the allocated memory. And, on top of
all that, some allocators panic and some return NULL in case of error.
This set tries to reduce the mess, and trim down the amount of memblock
allocation methods.
Patches 1-10 consolidate the functions that return physical address of
the allocated memory
Patches 11-13 are some trivial cleanups
Patches 14-19 add checks for the return value of memblock_alloc*() and
panics in case of errors. The patches 14-18 include some minor
refactoring to have better readability of the resulting code and patch
19 is a mechanical addition of
if (!ptr)
panic();
after memblock_alloc*() calls.
And, finally, patches 20 and 21 remove panic() calls memblock and
_nopanic variants from memblock.
This patch (of 21):
The allocation of the page tables memory in openrics uses
memblock_phys_alloc() and then converts the returned physical address to
virtual one. Use memblock_alloc_raw() and add a panic() if the
allocation fails.
Nikolay Borisov [Tue, 12 Mar 2019 06:28:13 +0000 (23:28 -0700)]
mm: refactor readahead defines in mm.h
All users of VM_MAX_READAHEAD actually convert it to kbytes and then to
pages. Define the macro explicitly as (SZ_128K / PAGE_SIZE). This
simplifies the expression in every filesystem. Also rename the macro to
VM_READAHEAD_PAGES to properly convey its meaning. Finally remove unused
VM_MIN_READAHEAD
Souptick Joarder [Tue, 12 Mar 2019 06:28:10 +0000 (23:28 -0700)]
mm/hmm: convert to use vm_fault_t
Convert to use vm_fault_t type as return type for fault handler.
kbuild reported warning during testing of
*mm-create-the-new-vm_fault_t-type.patch* available in below link -
https://patchwork.kernel.org/patch/10752741/
kernel/memremap.c:46:34: warning: incorrect type in return expression
(different base types)
kernel/memremap.c:46:34: expected restricted vm_fault_t
kernel/memremap.c:46:34: got int
This patch has fixed the warnings and also hmm_devmem_fault() is
converted to return vm_fault_t to avoid further warnings.
Zev Weiss [Tue, 12 Mar 2019 06:28:06 +0000 (23:28 -0700)]
kernel/sysctl.c: define minmax conv functions in terms of non-minmax versions
do_proc_do[u]intvec_minmax_conv() had included open-coded versions of
do_proc_do[u]intvec_conv(); the duplication led to buggy inconsistencies
(missing range checks). To reduce the likelihood of such problems in the
future, we can instead refactor both to be defined in terms of their
non-bounded counterparts (plus the added check).
Zev Weiss [Tue, 12 Mar 2019 06:28:02 +0000 (23:28 -0700)]
kernel/sysctl.c: add missing range check in do_proc_dointvec_minmax_conv
This bug has apparently existed since the introduction of this function
in the pre-git era (4500e91754d3 in Thomas Gleixner's history.git,
"[NET]: Add proc_dointvec_userhz_jiffies, use it for proper handling of
neighbour sysctls.").
As a minimal fix we can simply duplicate the corresponding check in
do_proc_dointvec_conv().
Zev Weiss [Tue, 12 Mar 2019 06:27:58 +0000 (23:27 -0700)]
tools/testing/selftests/sysctl/sysctl.sh: add tests for >32-bit values written to 32-bit integers
Patch series "sysctl: fix range-checking in do_proc_dointvec_minmax_conv()", v2.
After being left with an unusable system after a typo executing
something like 'echo $((1<<24)) > /proc/sys/vm/max_map_count', I found
that do_proc_dointvec_minmax_conv() was missing a check to ensure that
the converted value actually fits in an int.
The first of the following patches enhances the sysctl selftest such
that it detects this problem; the second provides a minimal fix
(suitable for -stable) such that the selftest passes. The third patch
then performs a more thorough refactoring to eliminate the code
duplication that led to the bug in the first place (maintaining the
passing status of the selftest).
This patch (of 3):
At present this exposes a bug in do_proc_dointvec_minmax_conv() (it
fails to check for values that are too wide to fit in an int).
Linus Torvalds [Tue, 12 Mar 2019 16:46:32 +0000 (09:46 -0700)]
Merge tag 'tag-chrome-platform-for-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux
Pull chrome platform updates from Benson Leung:
- SPDX identifier cleanup for platform/chrome
- Cleanup series between mfd and chrome/platform, moving cros-ec
attributes from mfd/cros_ec_dev to sub-drivers in platform/chrome
- Wilco EC driver
- Maintainership change to new group repository
* tag 'tag-chrome-platform-for-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux:
platform/chrome: fix wilco-ec dependencies
platform/chrome: wilco_ec: Add RTC driver
platform/chrome: wilco_ec: Add support for raw commands in debugfs
platform/chrome: Add new driver for Wilco EC
platform/chrome: cros_ec: Remove cros_ec dependency in lpc_mec
MAINTAINERS: chrome-platform: change the git tree to a chrome-platform group git tree
platform/chrome: cros_ec_sysfs: remove pr_fmt() define
platform/chrome: cros_ec_lightbar: remove pr_fmt() define
platform/chrome: cros_kbd_led_backlight: switch to SPDX identifier
platform/chrome: cros_ec_spi: switch to SPDX identifier
platform/chrome: cros_ec_proto: switch to SPDX identifier
platform/chrome: cros_ec_lpc: switch to SPDX identifier
platform/chrome: cros_ec_i2c: switch to SPDX identifier
platform/chrome: cros_ec_vbc: switch to SPDX identifier
platform/chrome: cros_ec_sysfs: switch to SPDX identifier
platform/chrome: cros_ec_lightbar: switch to SPDX identifier
platform/chrome: cros_ec_debugfs: switch to SPDX identifier
platform/chrome: cromeos_pstore: switch to SPDX identifier
Linus Torvalds [Tue, 12 Mar 2019 16:02:36 +0000 (09:02 -0700)]
Merge branch 'x86-tsx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 tsx fixes from Thomas Gleixner:
"This update provides kernel side handling for the TSX erratum of Intel
Skylake (and later) CPUs.
On these CPUs Intel Transactional Synchronization Extensions (TSX)
functions can result in unpredictable system behavior under certain
circumstances.
The issue is mitigated with an microcode update which utilizes
Performance Monitoring Counter (PMC) 3 when TSX functions are in use.
This mitigation is enabled unconditionally by the updated microcode.
As a consequence the usage of TSX functions can cause corrupted
performance monitoring results for events which utilize PMC3. The
corruption is silent on kernels which have no update for this issue.
This update makes the kernel aware of the PMC3 utilization by the
microcode:
The microcode offers a possibility to enforce TSX abort which prevents
the malfunction and frees up PMC3. The enforced TSX abort requires the
TSX using application to have a software fallback path implemented;
abort handlers which solely retry the transaction will fail over and
over.
The enforced TSX abort request is issued by the kernel when:
- enforced TSX abort is enabled (PMU attribute)
- A performance monitoring request needs PMC3
When PMC3 is not longer used by the kernel the TSX force abort request
is cleared.
The enforced TSX abort mechanism is enabled by default and can be
controlled by the administrator via the new PMU attribute
'allow_tsx_force_abort'. This attribute is only visible when updated
microcode is detected on affected systems. Writing '0' disables the
enforced TSX abort mechanism, '1' enables it.
As a result of disabling the enforced TSX abort mechanism, PMC3 is
permanentely unavailable for performance monitoring which can cause
performance monitoring requests to fail or switch to multiplexing
mode"
* branch 'x86-tsx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Implement support for TSX Force Abort
x86: Add TSX Force Abort CPUID/MSR
perf/x86/intel: Generalize dynamic constraint creation
perf/x86/intel: Make cpuc allocations consistent
Valdis Klētnieks [Tue, 12 Mar 2019 08:52:58 +0000 (04:52 -0400)]
tracing/probes: Make reserved_field_names static
sparse complains:
CHECK kernel/trace/trace_probe.c
kernel/trace/trace_probe.c:16:12: warning: symbol 'reserved_field_names' was not declared. Should it be static?
Wolfram Sang [Sun, 3 Mar 2019 15:03:14 +0000 (16:03 +0100)]
i2c: rcar: explain the lockless design
To make sure people can understand the lockless design of this driver
without the need to dive into git history, add a comment giving an
overview of the situation.
i2c: rcar: fix concurrency issue related to ICDMAER
This patch fixes the problem that an interrupt may set up a new I2C
message and the DMA callback overwrites this setup.
By disabling the DMA Enable Register(ICDMAER), rcar_i2c_dma_unmap()
enables interrupts for register settings (such as Master Control
Register(ICMCR)) and advances the I2C transfer sequence.
If an interrupt occurs immediately after ICDMAER is disabled, the
callback handler later continues and overwrites the previous settings
from the interrupt. So, disable ICDMAER at the end of the callback to
ensure other interrupts are masked until then.
Note that this driver needs to work lock-free because there are IP cores
with a HW race condition which prevent us from using a spinlock in the
interrupt handler.
Reproduction test:
1. Add a delay after disabling ICDMAER. (It is expected to generate an
interrupt of rcar_i2c_irq())
Fixes: 73e8b0528346 ("i2c: rcar: add DMA support") Signed-off-by: Hiromitsu Yamasaki <[email protected]> Signed-off-by: Wolfram Sang <[email protected]>
[wsa: updated test case to be more reliable, added note to comment] Reviewed-by: Simon Horman <[email protected]> Signed-off-by: Wolfram Sang <[email protected]>
Louis Taylor [Sat, 2 Mar 2019 14:18:36 +0000 (14:18 +0000)]
i2c: sis630: correct format strings
When compiling with -Wformat, clang warns:
drivers/i2c/busses/i2c-sis630.c:482:4: warning: format specifies type
'unsigned short' but the argument has type 'int' [-Wformat]
smbus_base + SMB_STS,
^~~~~~~~~~~~~~~~~~~~
drivers/i2c/busses/i2c-sis630.c:483:4: warning: format specifies type
'unsigned short' but the argument has type 'int' [-Wformat]
smbus_base + SMB_STS + SIS630_SMB_IOREGION - 1);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/i2c/busses/i2c-sis630.c:531:37: warning: format specifies type
'unsigned short' but the argument has type 'int' [-Wformat]
"SMBus SIS630 adapter at %04hx", smbus_base + SMB_STS);
~~~~~ ^~~~~~~~~~~~~~~~~~~~
This patch fixes the format strings to use the format type for int.
Chris Coulson [Mon, 4 Feb 2019 10:21:23 +0000 (10:21 +0000)]
apparmor: delete the dentry in aafs_remove() to avoid a leak
Although the apparmorfs dentries are always dropped from the dentry cache
when the usage count drops to zero, there is no guarantee that this will
happen in aafs_remove(), as another thread might still be using it. In
this scenario, this means that the dentry will temporarily continue to
appear in the results of lookups, even after the call to aafs_remove().
In the case of removal of a profile - it also causes simple_rmdir()
on the profile directory to fail, as the directory won't be empty until
the usage counts of all child dentries have decreased to zero. This
results in the dentry for the profile directory leaking and appearing
empty in the file system tree forever.
ACPI: sysfs: Prevent get_status() from returning acpi_status
The return value of get_status() is passed to user space on errors,
so it should not return acpi_status values then. Make it return
error values that are meaningful for user space instead.
This also makes a Clang warning regarding the initialization of a
local variable in get_status() go away.
Andy Shevchenko [Mon, 11 Mar 2019 16:41:03 +0000 (18:41 +0200)]
ACPI / device_sysfs: Avoid OF modalias creation for removed device
If SSDT overlay is loaded via ConfigFS and then unloaded the device,
we would like to have OF modalias for, already gone. Thus, acpi_get_name()
returns no allocated buffer for such case and kernel crashes afterwards:
Andy Shevchenko [Mon, 11 Mar 2019 14:04:30 +0000 (16:04 +0200)]
ACPI / configfs: Mark local data structures static
There is no need to have non-static local data structures. otherwise
sparse is not happy:
CHECK drivers/acpi/acpi_configfs.c
drivers/acpi/acpi_configfs.c:100:31: warning: symbol 'acpi_table_bin_attrs' was not declared. Should it be static?
drivers/acpi/acpi_configfs.c:196:27: warning: symbol 'acpi_table_attrs' was not declared. Should it be static?
drivers/acpi/acpi_configfs.c:236:34: warning: symbol 'acpi_table_group_ops' was not declared. Should it be static?
cpufreq: intel_pstate: Fix up iowait_boost computation
After commit b8bd1581aa61 ("cpufreq: intel_pstate: Rework iowait
boosting to be less aggressive") the handling of the case when
the SCHED_CPUFREQ_IOWAIT flag is set again after a few iterations of
intel_pstate_update_util() is a bit inconsistent, because the
new value of cpu->iowait_boost may be lower than ONE_EIGHTH_FP
if it was set before, but has not dropped down to zero just yet.
Fix that up by ensuring that the new value of cpu->iowait_boost
will always be at least ONE_EIGHTH_FP then.
Fixes: b8bd1581aa61 ("cpufreq: intel_pstate: Rework iowait boosting to be less aggressive") Signed-off-by: Rafael J. Wysocki <[email protected]>
Viresh Kumar [Tue, 12 Mar 2019 04:57:18 +0000 (10:27 +0530)]
PM / OPP: Update performance state when freq == old_freq
At boot up, CPUFreq core performs a sanity check to see if the system is
running at a frequency defined in the frequency table of the CPU. If so,
we try to find a valid frequency (lowest frequency greater than the
currently programmed frequency) from the table and set it. When the call
reaches dev_pm_opp_set_rate(), it calls _find_freq_ceil(opp_table,
&old_freq) to find the previously configured OPP and this call also
updates the old_freq. This eventually sets the old_freq == freq (new
target requested by cpufreq core) and we skip updating the performance
state in this case.
Fix this by also updating the performance state when the old_freq ==
freq.
After commit d856f39ac1cc ("PM / wakeup: Rework wakeup source timer
cancellation") wakeup_source_drop() is a trivial wrapper around
__pm_relax() and it has no users except for wakeup_source_destroy()
and wakeup_source_trash() which also has no users, so drop it along
with the latter and make wakeup_source_destroy() call __pm_relax()
directly.
If wakeup_source_add() is called right after wakeup_source_remove()
for the same wakeup source, timer_setup() may be called for a
potentially scheduled timer which is incorrect.
To avoid that, move the wakeup source timer cancellation from
wakeup_source_drop() to wakeup_source_remove().
Moreover, make wakeup_source_remove() clear the timer function after
canceling the timer to let wakeup_source_not_registered() treat
unregistered wakeup sources in the same way as the ones that have
never been registered.
Dave Airlie [Tue, 12 Mar 2019 05:11:40 +0000 (15:11 +1000)]
Merge branch 'drm-next-5.1' of git://people.freedesktop.org/~agd5f/linux into drm-next
Fixes for 5.1:
- Powerplay fixes
- DC fixes
- Fix locking around indirect register access in some cases
- KFD MQD fix
- Disable BACO for vega20 for now (fixes pending)
Linus Torvalds [Tue, 12 Mar 2019 03:06:18 +0000 (20:06 -0700)]
Merge tag 'xarray-5.1-rc1' of git://git.infradead.org/users/willy/linux-dax
Pull XArray updates from Matthew Wilcox:
"This pull request changes the xa_alloc() API. I'm only aware of one
subsystem that has started trying to use it, and we agree on the fixup
as part of the merge.
The xa_insert() error code also changed to match xa_alloc() (EEXIST to
EBUSY), and I added xa_alloc_cyclic(). Beyond that, the usual
bugfixes, optimisations and tweaking.
I now have a git tree with all users of the radix tree and IDR
converted over to the XArray that I'll be feeding to maintainers over
the next few weeks"
* tag 'xarray-5.1-rc1' of git://git.infradead.org/users/willy/linux-dax:
XArray: Fix xa_reserve for 2-byte aligned entries
XArray: Fix xa_erase of 2-byte aligned entries
XArray: Use xa_cmpxchg to implement xa_reserve
XArray: Fix xa_release in allocating arrays
XArray: Mark xa_insert and xa_reserve as must_check
XArray: Add cyclic allocation
XArray: Redesign xa_alloc API
XArray: Add support for 1s-based allocation
XArray: Change xa_insert to return -EBUSY
XArray: Update xa_erase family descriptions
XArray tests: RCU lock prohibits GFP_KERNEL