Linus Torvalds [Fri, 25 Jun 2021 16:41:29 +0000 (09:41 -0700)]
Merge tag 'netfs-fixes-20210621' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull netfs fixes from David Howells:
"This contains patches to fix netfs_write_begin() and afs_write_end()
in the following ways:
(1) In netfs_write_begin(), extract the decision about whether to skip
a page out to its own helper and have that clear around the region
to be written, but not clear that region. This requires the
filesystem to patch it up afterwards if the hole doesn't get
completely filled.
(2) Use offset_in_thp() in (1) rather than manually calculating the
offset into the page.
(3) Due to (1), afs_write_end() now needs to handle short data write
into the page by generic_perform_write(). I've adopted an
analogous approach to ceph of just returning 0 in this case and
letting the caller go round again.
It also adds a note that (in the future) the len parameter may extend
beyond the page allocated. This is because the page allocation is
deferred to write_begin() and that gets to decide what size of THP to
allocate."
Jeff Layton points out:
"The netfs fix in particular fixes a data corruption bug in cephfs"
* tag 'netfs-fixes-20210621' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
netfs: fix test for whether we can skip read when writing beyond EOF
afs: Fix afs_write_end() to handle short writes
Linus Torvalds [Fri, 25 Jun 2021 16:32:57 +0000 (09:32 -0700)]
Merge tag 'gpio-fixes-for-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix wake-up interrupt support on gpio-mxc
- zero the padding bytes in a structure passed to user-space in the
GPIO character device
- require HAS_IOPORT_MAP in two drivers that need it to fix a Kbuild
issue
* tag 'gpio-fixes-for-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: AMD8111 and TQMX86 require HAS_IOPORT_MAP
gpiolib: cdev: zero padding during conversion to gpioline_info_changed
gpio: mxc: Fix disabled interrupt wake-up support
Linus Torvalds [Fri, 25 Jun 2021 16:20:22 +0000 (09:20 -0700)]
Merge tag 'sound-5.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Two small changes have been cherry-picked as a last material for 5.13:
a coverage after UMN revert action and a stale MAINTAINERS entry fix"
* tag 'sound-5.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
MAINTAINERS: remove Timur Tabi from Freescale SOC sound drivers
ASoC: rt5645: Avoid upgrading static warnings to errors
Johannes Berg [Fri, 25 Jun 2021 08:37:34 +0000 (10:37 +0200)]
gpio: AMD8111 and TQMX86 require HAS_IOPORT_MAP
Both of these drivers use ioport_map(), so they need to
depend on HAS_IOPORT_MAP. Otherwise, they cannot be built
even with COMPILE_TEST on architectures without an ioport
implementation, such as ARCH=um.
Mel Gorman [Fri, 25 Jun 2021 01:40:07 +0000 (18:40 -0700)]
mm/page_alloc: do bulk array bounds check after checking populated elements
Dan Carpenter reported the following
The patch 0f87d9d30f21: "mm/page_alloc: add an array-based interface
to the bulk page allocator" from Apr 29, 2021, leads to the following
static checker warning:
mm/page_alloc.c:5338 __alloc_pages_bulk()
warn: potentially one past the end of array 'page_array[nr_populated]'
The problem can occur if an array is passed in that is fully populated.
That potentially ends up allocating a single page and storing it past
the end of the array. This patch returns 0 if the array is fully
populated.
Rasmus Villemoes [Fri, 25 Jun 2021 01:40:04 +0000 (18:40 -0700)]
mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing array
In the event that somebody would call this with an already fully
populated page_array, the last loop iteration would do an access beyond
the end of page_array.
It's of course extremely unlikely that would ever be done, but this
triggers my internal static analyzer. Also, if it really is not
supposed to be invoked this way (i.e., with no NULL entries in
page_array), the nr_populated<nr_pages check could simply be removed
instead.
Naoya Horiguchi [Fri, 25 Jun 2021 01:40:01 +0000 (18:40 -0700)]
mm/hwpoison: do not lock page again when me_huge_page() successfully recovers
Currently me_huge_page() temporary unlocks page to perform some actions
then locks it again later. My testcase (which calls hard-offline on
some tail page in a hugetlb, then accesses the address of the hugetlb
range) showed that page allocation code detects this page lock on buddy
page and printed out "BUG: Bad page state" message.
check_new_page_bad() does not consider a page with __PG_HWPOISON as bad
page, so this flag works as kind of filter, but this filtering doesn't
work in this case because the "bad page" is not the actual hwpoisoned
page. So stop locking page again. Actions to be taken depend on the
page type of the error, so page unlocking should be done in ->action()
callbacks. So let's make it assumed and change all existing callbacks
that way.
Aili Yao [Fri, 25 Jun 2021 01:39:58 +0000 (18:39 -0700)]
mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned
When memory_failure() is called with MF_ACTION_REQUIRED on the page that
has already been hwpoisoned, memory_failure() could fail to send SIGBUS
to the affected process, which results in infinite loop of MCEs.
Currently memory_failure() returns 0 if it's called for already
hwpoisoned page, then the caller, kill_me_maybe(), could return without
sending SIGBUS to current process. An action required MCE is raised
when the current process accesses to the broken memory, so no SIGBUS
means that the current process continues to run and access to the error
page again soon, so running into MCE loop.
This issue can arise for example in the following scenarios:
- Two or more threads access to the poisoned page concurrently. If
local MCE is enabled, MCE handler independently handles the MCE
events. So there's a race among MCE events, and the second or latter
threads fall into the situation in question.
- If there was a precedent memory error event and memory_failure() for
the event failed to unmap the error page for some reason, the
subsequent memory access to the error page triggers the MCE loop
situation.
To fix the issue, make memory_failure() return an error code when the
error page has already been hwpoisoned. This allows memory error
handler to control how it sends signals to userspace. And make sure
that any process touching a hwpoisoned page should get a SIGBUS even in
"already hwpoisoned" path of memory_failure() as is done in page fault
path.
Tony Luck [Fri, 25 Jun 2021 01:39:55 +0000 (18:39 -0700)]
mm/memory-failure: use a mutex to avoid memory_failure() races
Patch series "mm,hwpoison: fix sending SIGBUS for Action Required MCE", v5.
I wrote this patchset to materialize what I think is the current
allowable solution mentioned by the previous discussion [1]. I simply
borrowed Tony's mutex patch and Aili's return code patch, then I queued
another one to find error virtual address in the best effort manner. I
know that this is not a perfect solution, but should work for some
typical case.
There can be races when multiple CPUs consume poison from the same page.
The first into memory_failure() atomically sets the HWPoison page flag
and begins hunting for tasks that map this page. Eventually it
invalidates those mappings and may send a SIGBUS to the affected tasks.
But while all that work is going on, other CPUs see a "success" return
code from memory_failure() and so they believe the error has been
handled and continue executing.
Fix by wrapping most of the internal parts of memory_failure() in a
mutex.
Hugh Dickins [Fri, 25 Jun 2021 01:39:52 +0000 (18:39 -0700)]
mm, futex: fix shared futex pgoff on shmem huge page
If more than one futex is placed on a shmem huge page, it can happen
that waking the second wakes the first instead, and leaves the second
waiting: the key's shared.pgoff is wrong.
When 3.11 commit 13d60f4b6ab5 ("futex: Take hugepages into account when
generating futex_key"), the only shared huge pages came from hugetlbfs,
and the code added to deal with its exceptional page->index was put into
hugetlb source. Then that was missed when 4.8 added shmem huge pages.
page_to_pgoff() is what others use for this nowadays: except that, as
currently written, it gives the right answer on hugetlbfs head, but
nonsense on hugetlbfs tails. Fix that by calling hugetlbfs-specific
hugetlb_basepage_index() on PageHuge tails as well as on head.
Yes, it's unconventional to declare hugetlb_basepage_index() there in
pagemap.h, rather than in hugetlb.h; but I do not expect anything but
page_to_pgoff() ever to need it.
[[email protected]: give hugetlb_basepage_index() prototype the correct scope]
BANG: flush_work is not longer linked and will never get proceed.
The problem is that kthread_mod_delayed_work() checks work->canceling
flag before canceling the timer.
A simple solution is to (re)check work->canceling after
__kthread_cancel_work(). But then it is not clear what should be
returned when __kthread_cancel_work() removed the work from the queue
(list) and it can't queue it again with the new @delay.
The return value might be used for reference counting. The caller has
to know whether a new work has been queued or an existing one was
replaced.
The proper solution is that kthread_mod_delayed_work() will remove the
work from the queue (list) _only_ when work->canceling is not set. The
flag must be checked after the timer is stopped and the remaining
operations can be done under worker->lock.
Note that kthread_mod_delayed_work() could remove the timer and then
bail out. It is fine. The other canceling caller needs to cancel the
timer as well. The important thing is that the queue (list)
manipulation is done atomically under worker->lock.
Daniel Axtens [Fri, 25 Jun 2021 01:39:42 +0000 (18:39 -0700)]
mm/vmalloc: unbreak kasan vmalloc support
In commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
__vmalloc_node_range was changed such that __get_vm_area_node was no
longer called with the requested/real size of the vmalloc allocation,
but rather with a rounded-up size.
This means that __get_vm_area_node called kasan_unpoision_vmalloc() with
a rounded up size rather than the real size. This led to it allowing
access to too much memory and so missing vmalloc OOBs and failing the
kasan kunit tests.
Pass the real size and the desired shift into __get_vm_area_node. This
allows it to round up the size for the underlying allocators while still
unpoisioning the correct quantity of shadow memory.
Adjust the other call-sites to pass in PAGE_SHIFT for the shift value.
Claudio Imbrenda [Fri, 25 Jun 2021 01:39:39 +0000 (18:39 -0700)]
KVM: s390: prepare for hugepage vmalloc
The Create Secure Configuration Ultravisor Call does not support using
large pages for the virtual memory area. This is a hardware limitation.
This patch replaces the vzalloc call with an almost equivalent call to
the newly introduced vmalloc_no_huge function, which guarantees that
only small pages will be used for the backing.
The new call will not clear the allocated memory, but that has never
been an actual requirement.
Claudio Imbrenda [Fri, 25 Jun 2021 01:39:36 +0000 (18:39 -0700)]
mm/vmalloc: add vmalloc_no_huge
Patch series "mm: add vmalloc_no_huge and use it", v4.
Add vmalloc_no_huge() and export it, so modules can allocate memory with
small pages.
Use the newly added vmalloc_no_huge() in KVM on s390 to get around a
hardware limitation.
This patch (of 2):
Commit 121e6f3258fe3 ("mm/vmalloc: hugepage vmalloc mappings") added
support for hugepage vmalloc mappings, it also added the flag
VM_NO_HUGE_VMAP for __vmalloc_node_range to request the allocation to be
performed with 0-order non-huge pages.
This flag is not accessible when calling vmalloc, the only option is to
call directly __vmalloc_node_range, which is not exported.
This means that a module can't vmalloc memory with small pages.
Case in point: KVM on s390x needs to vmalloc a large area, and it needs
to be mapped with non-huge pages, because of a hardware limitation.
This patch adds the function vmalloc_no_huge, which works like vmalloc,
but it is guaranteed to always back the mapping using small pages. This
new function is exported, therefore it is usable by modules.
Hugh Dickins [Fri, 25 Jun 2021 01:39:30 +0000 (18:39 -0700)]
mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk()
Aha! Shouldn't that quick scan over pte_none()s make sure that it holds
ptlock in the PVMW_SYNC case? That too might have been responsible for
BUGs or WARNs in split_huge_page_to_list() or its unmap_page(), though
I've never seen any.
Hugh Dickins [Fri, 25 Jun 2021 01:39:26 +0000 (18:39 -0700)]
mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes
Running certain tests with a DEBUG_VM kernel would crash within hours,
on the total_mapcount BUG() in split_huge_page_to_list(), while trying
to free up some memory by punching a hole in a shmem huge page: split's
try_to_unmap() was unable to find all the mappings of the page (which,
on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).
Crash dumps showed two tail pages of a shmem huge page remained mapped
by pte: ptes in a non-huge-aligned vma of a gVisor process, at the end
of a long unmapped range; and no page table had yet been allocated for
the head of the huge page to be mapped into.
Although designed to handle these odd misaligned huge-page-mapped-by-pte
cases, page_vma_mapped_walk() falls short by returning false prematurely
when !pmd_present or !pud_present or !p4d_present or !pgd_present: there
are cases when a huge page may span the boundary, with ptes present in
the next.
Restructure page_vma_mapped_walk() as a loop to continue in these cases,
while keeping its layout much as before. Add a step_forward() helper to
advance pvmw->address across those boundaries: originally I tried to use
mm's standard p?d_addr_end() macros, but hit the same crash 512 times
less often: because of the way redundant levels are folded together, but
folded differently in different configurations, it was just too
difficult to use them correctly; and step_forward() is simpler anyway.
Hugh Dickins [Fri, 25 Jun 2021 01:39:17 +0000 (18:39 -0700)]
mm: page_vma_mapped_walk(): add a level of indentation
page_vma_mapped_walk() cleanup: add a level of indentation to much of
the body, making no functional change in this commit, but reducing the
later diff when this is all converted to a loop.
page_vma_mapped_walk() cleanup: adjust the test for crossing page table
boundary - I believe pvmw->address is always page-aligned, but nothing
else here assumed that; and remember to reset pvmw->pte to NULL after
unmapping the page table, though I never saw any bug from that.
page_vma_mapped_walk() cleanup: rearrange the !pmd_present() block to
follow the same "return not_found, return not_found, return true"
pattern as the block above it (note: returning not_found there is never
premature, since existence or prior existence of huge pmd guarantees
good alignment).
Linus Torvalds [Thu, 24 Jun 2021 20:27:07 +0000 (13:27 -0700)]
Merge tag 'drm-fixes-2021-06-25' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"This is a bit bigger than I'd like at this stage, and I guess last
week was extra quiet, but it's mostly one fix across three drivers to
wait for buffer move pinning to complete.
There was one locking change that got reverted so it's just noise.
Otherwise the amdgpu/nouveau changes are for known regressions, and
otherwise it's just misc changes in kmb/atmel/vc4 drivers.
Summary:
core:
- auth locking change + brown paper bag revert
radeon/nouveau/amdgpu/ttm:
- wait for BO to be pinned after moving it (same fix in three
drivers)
amdgpu:
- Revert GFX9/10 doorbell fixes, we just end up trading one bug for
another
- Potential memory corruption fix in framebuffer handling
* tag 'drm-fixes-2021-06-25' of git://anongit.freedesktop.org/drm/drm:
drm/nouveau: fix dma_address check for CPU/GPU sync
drm/kmb: Fix error return code in kmb_hw_init()
drm/amdgpu: wait for moving fence after pinning
drm/radeon: wait for moving fence after pinning
drm/nouveau: wait for moving fence after pinning v2
Revert "drm: add a locked version of drm_is_current_master"
Revert "drm/amdgpu/gfx9: fix the doorbell missing when in CGPG issue."
Revert "drm/amdgpu/gfx10: enlarge CP_MEC_DOORBELL_RANGE_UPPER to cover full doorbell."
drm/amdgpu: Call drm_framebuffer_init last for framebuffer init
drm: add a locked version of drm_is_current_master
drm/atmel-hlcdc: Allow async page flips
drm/panel: ld9040: reference spi_device_id table
drm: atmel_hlcdc: Enable the crtc vblank prior to crtc usage.
drm/vc4: hdmi: Make sure the controller is powered in detect
drm/vc4: hdmi: Move the HSM clock enable to runtime_pm
The direction of the pipe argument must match the request-type direction
bit or control requests may fail depending on the host-controller-driver
implementation.
Control transfers without a data stage are treated as OUT requests by
the USB stack and should be using usb_sndctrlpipe(). Failing to do so
will now trigger a warning.
Fix the OSIFI2C_SET_BIT_RATE and OSIFI2C_STOP requests which erroneously
used the osif_usb_read() helper and set the IN direction bit.
Dave Airlie [Thu, 24 Jun 2021 19:44:32 +0000 (05:44 +1000)]
Merge tag 'drm-misc-fixes-2021-06-24' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
A DMA address check for nouveau, an error code return fix for kmb, fixes
to wait for a moving fence after pinning the BO for amdgpu, nouveau and
radeon, a crtc and async page flip fix for atmel-hlcdc and a cpu hang
fix for vc4.
Andreas Hecht [Thu, 24 Jun 2021 15:25:35 +0000 (17:25 +0200)]
i2c: dev: Add __user annotation
Fix Sparse warnings:
drivers/i2c/i2c-dev.c:546:19: warning: incorrect type in assignment (different address spaces)
drivers/i2c/i2c-dev.c:549:53: warning: incorrect type in argument 2 (different address spaces)
compat_ptr() returns a pointer tagged __user which gets assigned to a
pointer missing the __user annotation. The same pointer is passed to
copy_from_user() as an argument where it is expected to have the __user
annotation. Fix both by adding the __user annotation to the pointer.
Fixes: 7d5cb45655f2 ("i2c compat ioctls: move to ->compat_ioctl()") Signed-off-by: Andreas Hecht <[email protected]> Signed-off-by: Wolfram Sang <[email protected]>
Ilya Dryomov [Mon, 21 Jun 2021 10:17:40 +0000 (12:17 +0200)]
libceph: set global_id as soon as we get an auth ticket
Commit 61ca49a9105f ("libceph: don't set global_id until we get an
auth ticket") delayed the setting of global_id too much. It is set
only after all tickets are received, but in pre-nautilus clusters an
auth ticket and the service tickets are obtained in separate steps
(for a total of three MAuth replies). When the service tickets are
requested, global_id is used to build an authorizer; if global_id is
still 0 we never get them and fail to establish the session.
Moving the setting of global_id into protocol implementations. This
way global_id can be set exactly when an auth ticket is received, not
sooner nor later.
Fixes: 61ca49a9105f ("libceph: don't set global_id until we get an auth ticket") Signed-off-by: Ilya Dryomov <[email protected]> Reviewed-by: Jeff Layton <[email protected]>
Ilya Dryomov [Mon, 21 Jun 2021 09:53:38 +0000 (11:53 +0200)]
libceph: don't pass result into ac->ops->handle_reply()
There is no result to pass in msgr2 case because authentication
failures are reported through auth_bad_method frame and in MAuth
case an error is returned immediately.
Linus Torvalds [Thu, 24 Jun 2021 15:58:23 +0000 (08:58 -0700)]
Merge tag 'sched-urgent-2021-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"A last minute cgroup bandwidth scheduling fix for a recently
introduced logic fail which triggered a kernel warning by LTP's
cfs_bandwidth01 test"
* tag 'sched-urgent-2021-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Ensure that the CFS parent is added after unthrottling
Nicholas Piggin [Thu, 24 Jun 2021 12:29:04 +0000 (08:29 -0400)]
KVM: do not allow mapping valid but non-reference-counted pages
It's possible to create a region which maps valid but non-refcounted
pages (e.g., tail pages of non-compound higher order allocations). These
host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family
of APIs, which take a reference to the page, which takes it from 0 to 1.
When the reference is dropped, this will free the page incorrectly.
Fix this by only taking a reference on valid pages if it was non-zero,
which indicates it is participating in normal refcounting (and can be
released with put_page).
Juergen Gross [Wed, 23 Jun 2021 13:09:13 +0000 (15:09 +0200)]
xen/events: reset active flag for lateeoi events later
In order to avoid a race condition for user events when changing
cpu affinity reset the active flag only when EOI-ing the event.
This is working fine as all user events are lateeoi events. Note that
lateeoi_ack_mask_dynirq() is not modified as there is no explicit call
to xen_irq_lateeoi() expected later.
Timur Tabi [Sun, 20 Jun 2021 16:01:35 +0000 (11:01 -0500)]
MAINTAINERS: remove Timur Tabi from Freescale SOC sound drivers
I haven't touched these drivers in seven years, and none of the
patches sent to me these days affect code that I wrote. The
other maintainers are doing a very good job without me.
Mark Brown [Tue, 8 Jun 2021 16:07:13 +0000 (17:07 +0100)]
ASoC: rt5645: Avoid upgrading static warnings to errors
One of the fixes reverted as part of the UMN fallout was actually fine,
however rather than undoing the revert the process that handled all this
stuff resulted in a patch which attempted to add extra error checks
instead. Unfortunately this new change wasn't really based on a good
understanding of the subsystem APIs and bypassed the usual patch flow
without ensuring it was reviewed by people with subsystem knowledge and
was merged as a fix rather than during the merge window.
The effect of the new fix is to upgrade what were previously warnings on
static data in the code to hard errors on that data. If this actually
happens then it would break existing systems, if it doesn't happen then
the change has no effect so this was not a safe change to apply as a fix
to the release candidates. Since the new code has not been tested and
doesn't in practice improve error handling revert it instead, and also
drop the original revert since the original fix was fine. This takes
the driver back to what it was in -rc1.
Zenghui Yu [Thu, 24 Jun 2021 07:09:31 +0000 (15:09 +0800)]
KVM: selftests: Fix mapping length truncation in m{,un}map()
max_mem_slots is now declared as uint32_t. The result of (0x200000 * 32767)
is unexpectedly truncated to be 0xffe00000, whilst we actually need to
allocate about, 63GB. Cast max_mem_slots to size_t in both mmap() and
munmap() to fix the length truncation.
We'll otherwise see the failure on arm64 thanks to the access_ok() checking
in __kvm_set_memory_region(), as the unmapped VA happen to go beyond the
task's allowed address space.
# ./set_memory_region_test
Allowed number of memory slots: 32767
Adding slots 0..32766, each memory region with 2048K size
==== Test Assertion Failure ====
set_memory_region_test.c:391: ret == 0
pid=94861 tid=94861 errno=22 - Invalid argument
1 0x00000000004015a7: test_add_max_memory_regions at set_memory_region_test.c:389
2 (inlined by) main at set_memory_region_test.c:426
3 0x0000ffffb8e67bdf: ?? ??:0
4 0x00000000004016db: _start at :?
KVM_SET_USER_MEMORY_REGION IOCTL failed,
rc: -1 errno: 22 slot: 2615
Linus Torvalds [Wed, 23 Jun 2021 18:29:15 +0000 (11:29 -0700)]
Merge tag 'spi-fix-v5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A couple of small, driver specific fixes that arrived in the past few
weeks"
* tag 'spi-fix-v5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: spi-nxp-fspi: move the register operation after the clock enable
spi: tegra20-slink: Ensure SPI controller reset is deasserted
Heikki Krogerus [Wed, 23 Jun 2021 13:14:21 +0000 (16:14 +0300)]
software node: Handle software node injection to an existing device properly
The function software_node_notify() - the function that creates
and removes the symlinks between the node and the device - was
called unconditionally in device_add_software_node() and
device_remove_software_node(), but it needs to be called in
those functions only in the special case where the node is
added to a device that has already been registered.
This fixes NULL pointer dereference that happens if
device_remove_software_node() is used with device that was
never registered.
Fixes: b622b24519f5 ("software node: Allow node addition to already existing device") Reported-and-tested-by: Dominik Brodowski <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Signed-off-by: Heikki Krogerus <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
Linus Torvalds [Wed, 23 Jun 2021 16:40:55 +0000 (09:40 -0700)]
Merge tag 'pm-5.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Revert a recent PCI power management commit that causes initialization
issues to appear on some systems"
* tag 'pm-5.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
Revert "PCI: PM: Do not read power state in pci_enable_device_flags()"
Linus Torvalds [Wed, 23 Jun 2021 16:04:07 +0000 (09:04 -0700)]
Merge branch 'stable/for-linus-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb
Pull swiotlb fix from Konrad Rzeszutek Wilk:
"A fix for the regression for the DMA operations where the offset was
ignored and corruptions would appear.
Going forward there will be a cleanups to make the offset and
alignment logic more clearer and better test-cases to help with this"
* 'stable/for-linus-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
swiotlb: manipulate orig_addr when tlb_addr has offset
scsi: sd: Call sd_revalidate_disk() for ioctl(BLKRRPART)
While the disk state has nothing to do with partitions, BLKRRPART is used
to force a full revalidate after things like a disk format for historical
reasons. Restore that behavior.
Mimi Zohar [Tue, 22 Jun 2021 11:36:41 +0000 (13:36 +0200)]
module: limit enabling module.sig_enforce
Irrespective as to whether CONFIG_MODULE_SIG is configured, specifying
"module.sig_enforce=1" on the boot command line sets "sig_enforce".
Only allow "sig_enforce" to be set when CONFIG_MODULE_SIG is configured.
This patch makes the presence of /sys/module/module/parameters/sig_enforce
dependent on CONFIG_MODULE_SIG=y.
Zhen Lei [Thu, 13 May 2021 13:46:38 +0000 (21:46 +0800)]
drm/kmb: Fix error return code in kmb_hw_init()
When the call to platform_get_irq() to obtain the IRQ of the lcd fails, the
returned error code should be propagated. However, we currently do not
explicitly assign this error code to 'ret'. As a result, 0 was incorrectly
returned.
Revert "PCI: PM: Do not read power state in pci_enable_device_flags()"
Revert commit 4514d991d992 ("PCI: PM: Do not read power state in
pci_enable_device_flags()") that is reported to cause PCI device
initialization issues on some systems.
Thomas Gleixner [Mon, 21 Jun 2021 23:08:30 +0000 (01:08 +0200)]
signal: Prevent sigqueue caching after task got released
syzbot reported a memory leak related to sigqueue caching.
The assumption that a task cannot cache a sigqueue after the signal handler
has been dropped and exit_task_sigqueue_cache() has been invoked turns out
to be wrong.
Such a task can still invoke release_task(other_task), which cleans up the
signals of 'other_task' and ends up in sigqueue_cache_or_free(), which in
turn will cache the signal because task->sigqueue_cache is NULL. That's
obviously bogus because nothing will free the cached signal of that task
anymore, so the cached item is leaked.
This happens when e.g. the last non-leader thread exits and reaps the
zombie leader.
Prevent this by setting tsk::sigqueue_cache to an error pointer value in
exit_task_sigqueue_cache() which forces any subsequent invocation of
sigqueue_cache_or_free() from that task to hand the sigqueue back to the
kmemcache.
Timeline (ish):
- worker3 gets throttled
- level3b is decayed, since it has no more load
- level2 get throttled
- worker3 get unthrottled
- level2 get unthrottled
- worker3 is added to list
- level3b is not added to list, since nr_running==0 and is decayed
[ Vincent Guittot: Rebased and updated to fix for the reported warning. ]
Peter Zijlstra [Mon, 21 Jun 2021 11:12:38 +0000 (13:12 +0200)]
locking/lockdep: Improve noinstr vs errors
Better handle the failure paths.
vmlinux.o: warning: objtool: debug_locks_off()+0x23: call to console_verbose() leaves .noinstr.text section
vmlinux.o: warning: objtool: debug_locks_off()+0x19: call to __kasan_check_write() leaves .noinstr.text section
debug_locks_off+0x19/0x40:
instrument_atomic_write at include/linux/instrumented.h:86
(inlined by) __debug_locks_off at include/linux/debug_locks.h:17
(inlined by) debug_locks_off at lib/debug_locks.c:41
Thomas Gleixner [Fri, 18 Jun 2021 14:18:25 +0000 (16:18 +0200)]
x86/fpu: Make init_fpstate correct with optimized XSAVE
The XSAVE init code initializes all enabled and supported components with
XRSTOR(S) to init state. Then it XSAVEs the state of the components back
into init_fpstate which is used in several places to fill in the init state
of components.
This works correctly with XSAVE, but not with XSAVEOPT and XSAVES because
those use the init optimization and skip writing state of components which
are in init state. So init_fpstate.xsave still contains all zeroes after
this operation.
There are two ways to solve that:
1) Use XSAVE unconditionally, but that requires to reshuffle the buffer when
XSAVES is enabled because XSAVES uses compacted format.
2) Save the components which are known to have a non-zero init state by other
means.
Looking deeper, #2 is the right thing to do because all components the
kernel supports have all-zeroes init state except the legacy features (FP,
SSE). Those cannot be hard coded because the states are not identical on all
CPUs, but they can be saved with FXSAVE which avoids all conditionals.
Use FXSAVE to save the legacy FP/SSE components in init_fpstate along with
a BUILD_BUG_ON() which reminds developers to validate that a newly added
component has all zeroes init state. As a bonus remove the now unused
copy_xregs_to_kernel_booting() crutch.
The XSAVE and reshuffle method can still be implemented in the unlikely
case that components are added which have a non-zero init state and no
other means to save them. For now, FXSAVE is just simple and good enough.
Thomas Gleixner [Fri, 18 Jun 2021 14:18:24 +0000 (16:18 +0200)]
x86/fpu: Preserve supervisor states in sanitize_restored_user_xstate()
sanitize_restored_user_xstate() preserves the supervisor states only
when the fx_only argument is zero, which allows unprivileged user space
to put supervisor states back into init state.
Gabriel Knezek [Mon, 21 Jun 2021 22:28:59 +0000 (15:28 -0700)]
gpiolib: cdev: zero padding during conversion to gpioline_info_changed
When userspace requests a GPIO v1 line info changed event,
lineinfo_watch_read() populates and returns the gpioline_info_changed
structure. It contains 5 words of padding at the end which are not
initialized before being returned to userspace.
Zero the structure in gpio_v2_line_info_change_to_v1() before populating
its contents.
Fixes: aad955842d1c ("gpiolib: cdev: support GPIO_V2_GET_LINEINFO_IOCTL and GPIO_V2_GET_LINEINFO_WATCH_IOCTL") Signed-off-by: Gabriel Knezek <[email protected]> Reviewed-by: Kent Gibson <[email protected]> Signed-off-by: Bartosz Golaszewski <[email protected]>
Michel Dänzer [Wed, 16 Jun 2021 10:46:51 +0000 (12:46 +0200)]
drm/amdgpu: Call drm_framebuffer_init last for framebuffer init
Once drm_framebuffer_init has returned 0, the framebuffer is hooked up
to the reference counting machinery and can no longer be destroyed with
a simple kfree. Therefore, it must be called last.
If drm_framebuffer_init returns 0 but its caller then returns non-0,
there will likely be memory corruption fireworks down the road.
The following lead me to this fix:
Fixes: f258907fdd835e "drm/amdgpu: Verify bo size can fit framebuffer size on init." Signed-off-by: Michel Dänzer <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
Jeff Layton [Sun, 13 Jun 2021 23:33:45 +0000 (19:33 -0400)]
netfs: fix test for whether we can skip read when writing beyond EOF
It's not sufficient to skip reading when the pos is beyond the EOF.
There may be data at the head of the page that we need to fill in
before the write.
Add a new helper function that corrects and clarifies the logic of
when we can skip reads, and have it only zero out the part of the page
that won't have data copied in for the write.
Finally, don't set the page Uptodate after zeroing. It's not up to date
since the write data won't have been copied in yet.
[DH made the following changes:
- Prefixed the new function with "netfs_".
- Don't call zero_user_segments() for a full-page write.
- Altered the beyond-last-page check to avoid a DIV instruction and got
rid of then-redundant zero-length file check.
]
David Howells [Mon, 14 Jun 2021 13:13:41 +0000 (14:13 +0100)]
afs: Fix afs_write_end() to handle short writes
Fix afs_write_end() to correctly handle a short copy into the intended
write region of the page. Two things are necessary:
(1) If the page is not up to date, then we should just return 0
(ie. indicating a zero-length copy). The loop in
generic_perform_write() will go around again, possibly breaking up the
iterator into discrete chunks[1].
(2) The page should not have been set uptodate if it wasn't completely set
up by netfs_write_begin() (this will be fixed in the next patch), so
we need to set uptodate here in such a case.
Also remove the assertion that was checking that the page was set uptodate
since it's now set uptodate if it wasn't already a few lines above. The
assertion was from when uptodate was set elsewhere.
Changes:
v3: Remove the handling of len exceeding the end of the page.
Loic Poulain [Thu, 17 Jun 2021 13:54:13 +0000 (15:54 +0200)]
gpio: mxc: Fix disabled interrupt wake-up support
A disabled/masked interrupt marked as wakeup source must be re-enable
and unmasked in order to be able to wake-up the host. That can be done
by flaging the irqchip with IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND.
Note: It 'sometimes' works without that change, but only thanks to the
lazy generic interrupt disabling (keeping interrupt unmasked).
drm: add a locked version of drm_is_current_master
While checking the master status of the DRM file in
drm_is_current_master(), the device's master mutex should be
held. Without the mutex, the pointer fpriv->master may be freed
concurrently by another process calling drm_setmaster_ioctl(). This
could lead to use-after-free errors when the pointer is subsequently
dereferenced in drm_lease_owner().
The callers of drm_is_current_master() from drm_auth.c hold the
device's master mutex, but external callers do not. Hence, we implement
drm_is_current_master_locked() to be used within drm_auth.c, and
modify drm_is_current_master() to grab the device's master mutex
before checking the master status.
Peter Zijlstra [Mon, 21 Jun 2021 14:13:55 +0000 (16:13 +0200)]
objtool/x86: Ignore __x86_indirect_alt_* symbols
Because the __x86_indirect_alt* symbols are just that, objtool will
try and validate them as regular symbols, instead of the alternative
replacements that they are.
This goes sideways for FRAME_POINTER=y builds; which generate a fair
amount of warnings.
Bumyong Lee [Mon, 10 May 2021 09:10:04 +0000 (18:10 +0900)]
swiotlb: manipulate orig_addr when tlb_addr has offset
in case of driver wants to sync part of ranges with offset,
swiotlb_tbl_sync_single() copies from orig_addr base to tlb_addr with
offset and ends up with data mismatch.
It was removed from
"swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single",
but said logic has to be added back in.
From Linus's email:
"That commit which the removed the offset calculation entirely, because the old
(unsigned long)tlb_addr & (IO_TLB_SIZE - 1)
was wrong, but instead of removing it, I think it should have just
fixed it to be
(tlb_addr - mem->start) & (IO_TLB_SIZE - 1);
instead. That way the slot offset always matches the slot index calculation."
(Unfortunatly that broke NVMe).
The use-case that drivers are hitting is as follow:
[One fix which was Linus's that made more sense to as it created a
symmetry would break NVMe. The reason for that is the:
unsigned int offset = (tlb_addr - mem->start) & (IO_TLB_SIZE - 1);
would come up with the proper offset, but it would lose the
alignment (which this patch contains).]
Heiko Carstens [Fri, 18 Jun 2021 14:58:47 +0000 (16:58 +0200)]
s390/stack: fix possible register corruption with stack switch helper
The CALL_ON_STACK macro is used to call a C function from inline
assembly, and therefore must consider the C ABI, which says that only
registers 6-13, and 15 are non-volatile (restored by the called
function).
The inline assembly incorrectly marks all registers used to pass
parameters to the called function as read-only input operands, instead
of operands that are read and written to. This might result in
register corruption depending on usage, compiler, and compile options.
Fix this by marking all operands used to pass parameters as read/write
operands. To keep the code simple even register 6, if used, is marked
as read-write operand.
Sven Schnelle [Tue, 15 Jun 2021 13:05:22 +0000 (15:05 +0200)]
s390/topology: clear thread/group maps for offline cpus
The current code doesn't clear the thread/group maps for offline
CPUs. This may cause kernel crashes like the one bewlow in common
code that assumes if a CPU has sibblings it is online.
Unable to handle kernel pointer dereference in virtual kernel address space
Tony Krowiak [Wed, 9 Jun 2021 22:46:32 +0000 (18:46 -0400)]
s390/vfio-ap: clean up mdev resources when remove callback invoked
The mdev remove callback for the vfio_ap device driver bails out with
-EBUSY if the mdev is in use by a KVM guest (i.e., the KVM pointer in the
struct ap_matrix_mdev is not NULL). The intended purpose was
to prevent the mdev from being removed while in use. There are two
problems with this scenario:
1. Returning a non-zero return code from the remove callback does not
prevent the removal of the mdev.
2. The KVM pointer in the struct ap_matrix_mdev will always be NULL because
the remove callback will not get invoked until the mdev fd is closed.
When the mdev fd is closed, the mdev release callback is invoked and
clears the KVM pointer from the struct ap_matrix_mdev.
Let's go ahead and remove the check for KVM in the remove callback and
allow the cleanup of mdev resources to proceed.
Sven Schnelle [Fri, 11 Jun 2021 14:08:18 +0000 (16:08 +0200)]
s390: clear pt_regs::flags on irq entry
The current irq entry code doesn't initialize pt_regs::flags. On exit to
user mode arch_do_signal_or_restart() tests whether PIF_SYSCALL is set,
which might yield wrong results.
Fix this by clearing pt_regs::flags in the entry.S irq handler
code.
Sven Schnelle [Fri, 11 Jun 2021 08:27:51 +0000 (10:27 +0200)]
s390: fix system call restart with multiple signals
glibc complained with "The futex facility returned an unexpected error
code.". It turned out that the futex syscall returned -ERESTARTSYS because
a signal is pending. arch_do_signal_or_restart() restored the syscall
parameters (nameley regs->gprs[2]) and set PIF_SYSCALL_RESTART. When
another signal is made pending later in the exit loop
arch_do_signal_or_restart() is called again. This function clears
PIF_SYSCALL_RESTART and checks the return code which is set in
regs->gprs[2]. However, regs->gprs[2] was restored in the previous run
and no longer contains -ERESTARTSYS, so PIF_SYSCALL_RESTART isn't set
again and the syscall is skipped.
Fix this by not clearing PIF_SYSCALL_RESTART - it is already cleared in
__do_syscall() when the syscall is restarted.
Heiner Kallweit [Sun, 6 Jun 2021 13:55:55 +0000 (15:55 +0200)]
i2c: i801: Ensure that SMBHSTSTS_INUSE_STS is cleared when leaving i801_access
As explained in [0] currently we may leave SMBHSTSTS_INUSE_STS set,
thus potentially breaking ACPI/BIOS usage of the SMBUS device.
Seems patch [0] needs a little bit more of review effort, therefore
I'd suggest to apply a part of it as quick win. Just clearing
SMBHSTSTS_INUSE_STS when leaving i801_access() should fix the
referenced issue and leaves more time for discussing a more
sophisticated locking handling.
Linus Torvalds [Sun, 20 Jun 2021 16:44:52 +0000 (09:44 -0700)]
Merge tag 'sched_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Borislav Petkov:
"A single fix to restore fairness between control groups with equal
priority"
* tag 'sched_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Correctly insert cfs_rq's to list on unthrottle
Linus Torvalds [Sun, 20 Jun 2021 16:38:14 +0000 (09:38 -0700)]
Merge tag 'irq_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Borislav Petkov:
"A single fix for GICv3 to not take an interrupt in an NMI context"
* tag 'irq_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3: Workaround inconsistent PMR setting on NMI entry
Linus Torvalds [Sun, 20 Jun 2021 16:09:58 +0000 (09:09 -0700)]
Merge tag 'x86_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"A first set of urgent fixes to the FPU/XSTATE handling mess^W code.
(There's a lot more in the pipe):
- Prevent corruption of the XSTATE buffer in signal handling by
validating what is being copied from userspace first.
- Invalidate other task's preserved FPU registers on XRSTOR failure
(#PF) because latter can still modify some of them.
- Restore the proper PKRU value in case userspace modified it
- Reset FPU state when signal restoring fails
Other:
- Map EFI boot services data memory as encrypted in a SEV guest so
that the guest can access it and actually boot properly
- Two SGX correctness fixes: proper resources freeing and a NUMA fix"
* tag 'x86_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Avoid truncating memblocks for SGX memory
x86/sgx: Add missing xa_destroy() when virtual EPC is destroyed
x86/fpu: Reset state for all signal restore failures
x86/pkru: Write hardware init value to PKRU when xstate is init
x86/process: Check PF_KTHREAD and not current->mm for kernel threads
x86/fpu: Invalidate FPU state after a failed XRSTOR from a user buffer
x86/fpu: Prevent state corruption in __fpu__restore_sig()
x86/ioremap: Map EFI-reserved memory as encrypted for SEV
Linus Torvalds [Sat, 19 Jun 2021 23:50:23 +0000 (16:50 -0700)]
Merge tag 'powerpc-5.13-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Fix initrd corruption caused by our recent change to use relative jump
labels.
Fix a crash using perf record on systems without a hardware PMU
backend.
Rework our 64-bit signal handling slighty to make it more closely
match the old behaviour, after the recent change to use unsafe user
accessors.
Thanks to Anastasia Kovaleva, Athira Rajeev, Christophe Leroy, Daniel
Axtens, Greg Kurz, and Roman Bolshakov"
* tag 'powerpc-5.13-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/perf: Fix crash in perf_instruction_pointer() when ppmu is not set
powerpc: Fix initrd corruption with relative jump labels
powerpc/signal64: Copy siginfo before changing regs->nip
powerpc/mem: Add back missing header to fix 'no previous prototype' error
Linus Torvalds [Sat, 19 Jun 2021 21:50:43 +0000 (14:50 -0700)]
Merge tag 'perf-tools-fixes-for-v5.13-2021-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull perf tools fixes from Arnaldo Carvalho de Melo:
- Fix refcount usage when processing PERF_RECORD_KSYMBOL.
- 'perf stat' metric group fixes.
- Fix 'perf test' non-bash issue with stat bpf counters.
- Update unistd, in.h and socket.h with the kernel sources, silencing
perf build warnings.
* tag 'perf-tools-fixes-for-v5.13-2021-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
tools headers UAPI: Sync linux/in.h copy with the kernel sources
tools headers UAPI: Sync asm-generic/unistd.h with the kernel original
perf beauty: Update copy of linux/socket.h with the kernel sources
perf test: Fix non-bash issue with stat bpf counters
perf machine: Fix refcount usage when processing PERF_RECORD_KSYMBOL
perf metricgroup: Return error code from metricgroup__add_metric_sys_event_iter()
perf metricgroup: Fix find_evsel_group() event selector
Dan Sneddon [Wed, 2 Jun 2021 16:08:45 +0000 (09:08 -0700)]
drm: atmel_hlcdc: Enable the crtc vblank prior to crtc usage.
'commit eec44d44a3d2 ("drm/atmel: Use drm_atomic_helper_commit")'
removed the home-grown handling of atomic commits and exposed an issue
in the crtc atomic commit handling where vblank is expected to be
enabled but hasn't yet, causing kernel warnings during boot. This patch
cleans up the crtc vblank handling thus removing the warning on boot.
Linus Torvalds [Sat, 19 Jun 2021 15:45:34 +0000 (08:45 -0700)]
Merge tag 'riscv-for-linus-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A build fix to always build modules with the 'medany' code model, as
the module loader doesn't support 'medlow'.
- A Kconfig warning fix for the SiFive errata.
- A pair of fixes that for regressions to the recent memory layout
changes.
- A fix for the FU740 device tree.
* tag 'riscv-for-linus-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: dts: fu740: fix cache-controller interrupts
riscv: Ensure BPF_JIT_REGION_START aligned with PMD size
riscv: kasan: Fix MODULES_VADDR evaluation due to local variables' name
riscv: sifive: fix Kconfig errata warning
riscv32: Use medany C model for modules
tools headers UAPI: Sync linux/in.h copy with the kernel sources
To pick the changes in:
321827477360934d ("icmp: don't send out ICMP messages with a source address of 0.0.0.0")
That don't result in any change in tooling, as INADDR_ are not used to
generate id->string tables used by 'perf trace'.
This addresses this build warning:
Warning: Kernel ABI header at 'tools/include/uapi/linux/in.h' differs from latest version at 'include/uapi/linux/in.h'
diff -u tools/include/uapi/linux/in.h include/uapi/linux/in.h