Linus Torvalds [Thu, 7 Dec 2017 02:16:20 +0000 (18:16 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu
Pull m68knommu fixes from Greg Ungerer:
"There are two fixes here. One to add a missing linker section to the
m68k architecture linker scripts, the other to fix a defconfig build
problem"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
m68k/defconfig: fix stmark2 broken local compilation
m68k: add missing SOFTIRQENTRY_TEXT linker section
Linus Torvalds [Thu, 7 Dec 2017 01:47:29 +0000 (17:47 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- make CR4 handling irq-safe, which bug vmware guests ran into
- don't crash on early IRQs in Xen guests
- don't crash secondary CPU bringup if #UD assisted WARN()ings are
triggered
- make X86_BUG_FXSAVE_LEAK optional on newer AMD CPUs that have the fix
- fix AMD Fam17h microcode loading
- fix broadcom_postcore_init() if ACPI is disabled
- fix resume regression in __restore_processor_context()
- fix Sparse warnings
- fix a GCC-8 warning
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vdso: Change time() prototype to match __vdso_time()
x86: Fix Sparse warnings about non-static functions
x86/power: Fix some ordering bugs in __restore_processor_context()
x86/PCI: Make broadcom_postcore_init() check acpi_disabled
x86/microcode/AMD: Add support for fam17h microcode loading
x86/cpufeatures: Make X86_BUG_FXSAVE_LEAK detectable in CPUID on AMD
x86/idt: Load idt early in start_secondary
x86/xen: Support early interrupts in xen pv guests
x86/tlb: Disable interrupts when changing CR4
x86/tlb: Refactor CR4 setting and shadow write
Linus Torvalds [Thu, 7 Dec 2017 01:43:26 +0000 (17:43 -0800)]
Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"This includes a fix for the add_wait_queue() queue ordering brown
paperbag bug, plus PELT accounting fixes for cgroups scheduling
artifacts"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Update and fix the runnable propagation rule
sched/wait: Fix add_wait_queue() behavioral change
Linus Torvalds [Thu, 7 Dec 2017 01:41:24 +0000 (17:41 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"This includes perf namespace support kernel side fixes, plus an
accumulated set of perf tooling fixes - including UAPI header
synchronization that should make the perf build less noisy"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
tooling/headers: Synchronize updated s390 and x86 UAPI headers
tools headers: Syncronize mman.h ABI header
tools headers: Synchronize prctl.h ABI header
tools headers: Synchronize KVM arch ABI headers
tools headers: Synchronize drm/i915_drm.h
tools headers uapi: Synchronize drm/drm.h
tools headers: Synchronize perf_event.h header
tools headers: Synchronize kernel ABI headers wrt SPDX tags
tools/headers: Synchronize kernel x86 UAPI headers
perf intel-pt: Bring instruction decoder files into line with the kernel
perf test: Fix test 21 for s390x
perf bench numa: Fixup discontiguous/sparse numa nodes
perf top: Use signal interface for SIGWINCH handler
perf top: Fix window dimensions change handling
perf: Fix header.size for namespace events
perf top: Ignore kptr_restrict when not sampling the kernel
perf record: Ignore kptr_restrict when not sampling the kernel
perf report: Ignore kptr_restrict when not sampling the kernel
perf evlist: Add helper to check if attr.exclude_kernel is set in all evsels
perf test shell: Fix test case probe libc's inet_pton on s390x
...
Linus Torvalds [Thu, 7 Dec 2017 01:39:44 +0000 (17:39 -0800)]
Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull lockdep fix from Ingo Molnar:
"Fix a possible NULL dereference for the (rare) case when a task
doesn't have ->xhlocks space allocated due to kmalloc() OOM-ing"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/lockdep: Fix possible NULL deref
Marek Szyprowski [Wed, 22 Nov 2017 13:14:47 +0000 (14:14 +0100)]
drm/exynos: gem: Drop NONCONTIG flag for buffers allocated without IOMMU
When no IOMMU is available, all GEM buffers allocated by Exynos DRM driver
are contiguous, because of the underlying dma_alloc_attrs() function
provides only such buffers. In such case it makes no sense to keep
BO_NONCONTIG flag for the allocated GEM buffers. This allows to avoid
failures for buffer contiguity checks in the subsequent operations on GEM
objects.
Marek Szyprowski [Mon, 30 Oct 2017 07:28:09 +0000 (08:28 +0100)]
drm/exynos: Fix dma-buf import
When IOMMU support was enabled, dma-buf import in Exynos DRM was broken
since commit f43c35966a5a ("drm/exynos: use real device for DMA-mapping
operations") due to using wrong struct device in drm_gem_prime_import()
function. This patch fixes following kernel BUG caused by incorrect buffer
mapping to DMA address space:
exynos-sysmmu 14650000.sysmmu: 14450000.mixer: PAGE FAULT occurred at 0xb2e00000
------------[ cut here ]------------
kernel BUG at drivers/iommu/exynos-iommu.c:449!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-rc4-next-20171016-00033-g990d723669fd #3165
Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
task: c0e0b7c0 task.stack: c0e00000
PC is at exynos_sysmmu_irq+0x1d0/0x24c
LR is at exynos_sysmmu_irq+0x154/0x24c
------------[ cut here ]------------
Reported-by: Marian Mihailescu <[email protected]> Fixes: f43c35966a5a ("drm/exynos: use real device for DMA-mapping operations") Signed-off-by: Marek Szyprowski <[email protected]> Reviewed-by: Tobias Jakobi <[email protected]> Signed-off-by: Inki Dae <[email protected]>
Linus Torvalds [Wed, 6 Dec 2017 23:47:51 +0000 (15:47 -0800)]
Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Ingo Molnar:
"Two fixes: use bool type consistently, plus a irq_matrix_available()
bugfix"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqdesc: Use bool return type instead of int
genirq/matrix: Fix the precedence fix for real
Nikolay Borisov [Fri, 1 Dec 2017 09:19:42 +0000 (11:19 +0200)]
btrfs: Fix possible off-by-one in btrfs_search_path_in_tree
The name char array passed to btrfs_search_path_in_tree is of size
BTRFS_INO_LOOKUP_PATH_MAX (4080). So the actual accessible char indexes
are in the range of [0, 4079]. Currently the code uses the define but this
represents an off-by-one.
Implications:
Size of btrfs_ioctl_ino_lookup_args is 4096, so the new byte will be
written to extra space, not some padding that could be provided by the
allocator.
btrfs-progs store the arguments on stack, but kernel does own copy of
the ioctl buffer and the off-by-one overwrite does not affect userspace,
but the ending 0 might be lost.
Kernel ioctl buffer is allocated dynamically so we're overwriting
somebody else's memory, and the ioctl is privileged if args.objectid is
not 256. Which is in most cases, but resolving a subvolume stored in
another directory will trigger that path.
Before this patch the buffer was one byte larger, but then the -1 was
not added.
Fixes: ac8e9819d71f907 ("Btrfs: add search and inode lookup ioctls") Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]>
[ added implications ] Signed-off-by: David Sterba <[email protected]>
Omar Sandoval [Wed, 6 Dec 2017 06:54:02 +0000 (22:54 -0800)]
Btrfs: disable FUA if mounted with nobarrier
I was seeing disk flushes still happening when I mounted a Btrfs
filesystem with nobarrier for testing. This is because we use FUA to
write out the first super block, and on devices without FUA support, the
block layer translates FUA to a flush. Even on devices supporting true
FUA, using FUA when we asked for no barriers is surprising.
Jeff Mahoney [Tue, 21 Nov 2017 18:58:49 +0000 (13:58 -0500)]
btrfs: handle errors while updating refcounts in update_ref_for_cow
Since commit fb235dc06fa (btrfs: qgroup: Move half of the qgroup
accounting time out of commit trans) the assumption that
btrfs_add_delayed_{data,tree}_ref can only return 0 or -ENOMEM has
been false. The qgroup operations call into btrfs_search_slot
and friends and can now return the full spectrum of error codes.
Fortunately, the fix here is easy since update_ref_for_cow failing
is already handled so we just need to bail early with the error
code.
Justin Maggard [Mon, 30 Oct 2017 22:29:10 +0000 (15:29 -0700)]
btrfs: Fix quota reservation leak on preallocated files
Commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to")
changed the behavior of __btrfs_buffered_write() so that it first tries
to get a data space reservation, and then skips the relatively expensive
nocow check if the reservation succeeded.
If we have quotas enabled, the data space reservation also includes a
quota reservation. But in the rewrite case, the space has already been
accounted for in qgroups. So btrfs_check_data_free_space() increases
the quota reservation, but it never gets decreased when the data
actually gets written and overwrites the pre-existing data. So we're
left with both the qgroup and qgroup reservation accounting for the same
space.
This commit adds the missing btrfs_qgroup_free_data() call in the case
of BTRFS_ORDERED_PREALLOC extents.
Fixes: c6887cd11149 ("Btrfs: don't do nocow check unless we have to") Signed-off-by: Justin Maggard <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>
Linus Torvalds [Wed, 6 Dec 2017 23:20:51 +0000 (15:20 -0800)]
Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI fixes from Ingo Molnar:
"Misc fixes: world-readable pointer removal from sysfs, a ESRT kfree()
bug fix and a comment update"
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: Add comment to avoid future expanding of sysfs systab
efi/esrt: Use memunmap() instead of kfree() to free the remapping
efi: Move some sysfs files to be read-only by root
Linus Torvalds [Wed, 6 Dec 2017 22:53:32 +0000 (14:53 -0800)]
Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core fixes from Ingo Molnar:
"Two fixes:
- objtool cross-build fixes
- removal of an obsolete CPU-hotplug state name from comments"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Fix 64-bit build on 32-bit host
cpu/hotplug: Fix state name in takedown_cpu() comment
Daniel Thompson [Mon, 2 Mar 2015 14:13:36 +0000 (14:13 +0000)]
kdb: Fix handling of kallsyms_symbol_next() return value
kallsyms_symbol_next() returns a boolean (true on success). Currently
kdb_read() tests the return value with an inequality that
unconditionally evaluates to true.
This is fixed in the obvious way and, since the conditional branch is
supposed to be unreachable, we also add a WARN_ON().
of: overlay: Fix (un)locking in of_overlay_apply()
The special overlay mutex is taken first, hence it should be released
last in the error path.
of_resolve_phandles() must be called with of_mutex held. Without it, a
node and new phandle could be added via of_attach_node(), making the max
phandle wrong.
free_overlay_changeset() must be called with of_mutex held, if any
non-trivial cleanup is to be done.
Hence move "mutex_lock(&of_mutex)" up, as suggested by Frank, and merge
the two tail statements of the success and error paths, now they became
identical.
Note that while the two mutexes are adjacent, we still need both:
__of_changeset_apply_notify(), which is called by __of_changeset_apply()
unlocks of_mutex, then does notifications then locks of_mutex. So the
mutex get released in the middle of of_overlay_apply()
Fixes: f948d6d8b792bb90 ("of: overlay: avoid race condition between applying multiple overlays") Signed-off-by: Geert Uytterhoeven <[email protected]> Reviewed-by: Frank Rowand <[email protected]> Signed-off-by: Rob Herring <[email protected]>
of: overlay: Fix memory leak in of_overlay_apply() error path
If of_resolve_phandles() fails, free_overlay_changeset() is called in
the error path. However, that function returns early if the list hasn't
been initialized yet, before freeing the object.
Explicitly calling kfree() instead would solve that issue. However, that
complicates matter, by having to consider which of two different methods
to use to dispose of the same object.
Hence make free_overlay_changeset() consider initialization state of the
different parts of the object, making it always safe to call (once!) to
dispose of a (partially) initialized overlay_changeset:
- Only destroy the changeset if the list was initialized,
- Make init_overlay_changeset() store the ID in ovcs->id on success,
to avoid calling idr_remove() with an error value or an already
released ID.
Reported-by: Colin King <[email protected]> Fixes: f948d6d8b792bb90 ("of: overlay: avoid race condition between applying multiple overlays") Signed-off-by: Geert Uytterhoeven <[email protected]> Reviewed-by: Frank Rowand <[email protected]> Signed-off-by: Rob Herring <[email protected]>
Mikulas Patocka [Sat, 2 Dec 2017 22:17:44 +0000 (16:17 -0600)]
objtool: Fix 64-bit build on 32-bit host
The new ORC unwinder breaks the build of a 64-bit kernel on a 32-bit
host. Building the kernel on a i386 or x32 host fails with:
orc_dump.c: In function 'orc_dump':
orc_dump.c:105:26: error: passing argument 2 of 'elf_getshdrnum' from incompatible pointer type [-Werror=incompatible-pointer-types]
if (elf_getshdrnum(elf, &nr_sections)) {
^
In file included from /usr/local/include/gelf.h:32:0,
from elf.h:22,
from warn.h:26,
from orc_dump.c:20:
/usr/local/include/libelf.h:304:12: note: expected 'size_t * {aka unsigned int *}' but argument is of type 'long unsigned int *'
extern int elf_getshdrnum (Elf *__elf, size_t *__dst);
^~~~~~~~~~~~~~
orc_dump.c:190:17: error: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'Elf64_Sxword {aka long long int}' [-Werror=format=]
printf("%s+%lx:", name, rela.r_addend);
~~^ ~~~~~~~~~~~~~
%llx
Fix the build failure.
Another problem is that if the user specifies HOSTCC or HOSTLD
variables, they are ignored in the objtool makefile. Change the
Makefile to respect these variables.
Document the recommended presence of a device-specific compatible value,
and list examples that are already in use or soon will be.
This will allow checkpatch to validate compatible values in DTS.
Update the example to match current best practices (generic node name,
specific compatible value first).
Chris Dion [Wed, 6 Dec 2017 15:50:28 +0000 (10:50 -0500)]
net_sched: use macvlan real dev trans_start in dev_trans_start()
Macvlan devices are similar to vlans and do not update their
own trans_start. In order for arp monitoring to work for a bond device
when the slaves are macvlans, obtain its real device.
Arnd Bergmann [Mon, 4 Dec 2017 15:01:55 +0000 (16:01 +0100)]
x86/vdso: Change time() prototype to match __vdso_time()
gcc-8 warns that time() is an alias for __vdso_time() but the two
have different prototypes:
arch/x86/entry/vdso/vclock_gettime.c:327:5: error: 'time' alias between functions of incompatible types 'int(time_t *)' {aka 'int(long int *)'} and 'time_t(time_t *)' {aka 'long int(long int *)'} [-Werror=attribute-alias]
int time(time_t *t)
^~~~
arch/x86/entry/vdso/vclock_gettime.c:318:16: note: aliased declaration here
I could not figure out whether this is intentional, but I see that
changing it to return time_t avoids the warning.
Returning 'int' from time() is also a bit questionable, as it causes an
overflow in y2038 even on 64-bit architectures that use a 64-bit time_t
type. On 32-bit architecture with 64-bit time_t, time() should always
be implement by the C library by calling a (to be added) clock_gettime()
variant that takes a sufficiently wide argument.
Dave Airlie [Wed, 6 Dec 2017 20:27:13 +0000 (06:27 +1000)]
Merge branch 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
ttm and license fixes
* 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux:
drm/ttm: swap consecutive allocated pooled pages v4
drm/ttm: swap consecutive allocated cached pages v3
drm/ttm: roundup the shrink request to prevent skip huge pool
drm/ttm: add page order support in ttm_pages_put
drm/ttm: add set_pages_wb for handling page order more than zero
drm/ttm: add page order in page pool
drm/ttm: use NUM_PAGES_TO_ALLOC always
drm/amdgpu: add license to files where it was missing
drm/amdgpu: add license to Makefiles
net: thunderx: Fix TCP/UDP checksum offload for IPv4 pkts
Offload IP header checksum to NIC.
This fixes a previous patch which disabled checksum offloading
for both IPv4 and IPv6 packets. So L3 checksum offload was
getting disabled for IPv4 pkts. And HW is dropping these pkts
for some reason.
Without this patch, IPv4 TSO appears to be broken:
WIthout this patch I get ~16kbyte/s, with patch close to 2mbyte/s
when copying files via scp from test box to my home workstation.
Looking at tcpdump on sender it looks like hardware drops IPv4 TSO skbs.
This patch restores performance for me, ipv6 looks good too.
Dave Martin [Wed, 6 Dec 2017 16:45:47 +0000 (16:45 +0000)]
arm64/sve: Avoid dereference of dead task_struct in KVM guest entry
When deciding whether to invalidate FPSIMD state cached in the cpu,
the backend function sve_flush_cpu_state() attempts to dereference
__this_cpu_read(fpsimd_last_state). However, this is not safe:
there is no guarantee that this task_struct pointer is still valid,
because the task could have exited in the meantime.
This means that we need another means to get the appropriate value
of TIF_SVE for the associated task.
This patch solves this issue by adding a cached copy of the TIF_SVE
flag in fpsimd_last_state, which we can check without dereferencing
the task pointer.
In particular, although this patch is not a KVM fix per se, this
means that this check is now done safely in the KVM world switch
path (which is currently the only user of this code).
Linus Torvalds [Wed, 6 Dec 2017 18:49:14 +0000 (10:49 -0800)]
Merge tag 'sound-4.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"All fixes are small and for stable:
- a PCM ioctl race fix
- yet another USB-audio hardening for malicious descriptors
- Realtek ALC257 codec support"
* tag 'sound-4.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: pcm: prevent UAF in snd_pcm_info
ALSA: hda/realtek - New codec support for ALC257
ALSA: usb-audio: Add check return value for usb_string()
ALSA: usb-audio: Fix out-of-bound error
ALSA: seq: Remove spurious WARN_ON() at timer check
Colin Ian King [Wed, 6 Dec 2017 17:33:58 +0000 (17:33 +0000)]
x86: Fix Sparse warnings about non-static functions
Functions x86_vector_debug_show(), uv_handle_nmi() and uv_nmi_setup_common()
are local to the source and do not need to be in global scope, so make them
static.
Dave Young [Wed, 6 Dec 2017 09:50:10 +0000 (09:50 +0000)]
efi: Add comment to avoid future expanding of sysfs systab
/sys/firmware/efi/systab shows several different values, it breaks sysfs
one file one value design. But since there are already userspace tools
depend on it eg. kexec-tools so add code comment to alert future expanding
of this file.
Vincent Guittot [Thu, 16 Nov 2017 14:21:52 +0000 (15:21 +0100)]
sched/fair: Update and fix the runnable propagation rule
Unlike running, the runnable part can't be directly propagated through
the hierarchy when we migrate a task. The main reason is that runnable
time can be shared with other sched_entities that stay on the rq and
this runnable time will also remain on prev cfs_rq and must not be
removed.
Instead, we can estimate what should be the new runnable of the prev
cfs_rq and check that this estimation stay in a possible range. The
prop_runnable_sum is a good estimation when adding runnable_sum but
fails most often when we remove it. Instead, we could use the formula
below instead:
50816c48997a ("sched/wait: Standardize internal naming of wait-queue entries")
... unintentionally changed the behavior of add_wait_queue() from
inserting the wait entry at the head of the wait queue to the tail
of the wait queue.
Beyond a negative performance impact this change in behavior
theoretically also breaks wait queues which mix exclusive and
non-exclusive waiters, as non-exclusive waiters will not be
woken up if they are queued behind enough exclusive waiters.
Brendan Jackman [Wed, 6 Dec 2017 10:59:11 +0000 (10:59 +0000)]
cpu/hotplug: Fix state name in takedown_cpu() comment
CPUHP_AP_SCHED_MIGRATE_DYING doesn't exist, it looks like this was
supposed to refer to CPUHP_AP_SCHED_STARTING's teardown callback,
i.e. sched_cpu_dying().
Will Deacon [Wed, 6 Dec 2017 10:51:12 +0000 (10:51 +0000)]
arm64: SW PAN: Update saved ttbr0 value on enter_lazy_tlb
enter_lazy_tlb is called when a kernel thread rides on the back of
another mm, due to a context switch or an explicit call to unuse_mm
where a call to switch_mm is elided.
In these cases, it's important to keep the saved ttbr value up to date
with the active mm, otherwise we can end up with a stale value which
points to a potentially freed page table.
This patch implements enter_lazy_tlb for arm64, so that the saved ttbr0
is kept up-to-date with the active mm for kernel threads.
Will Deacon [Wed, 6 Dec 2017 10:42:10 +0000 (10:42 +0000)]
arm64: SW PAN: Point saved ttbr0 at the zero page when switching to init_mm
update_saved_ttbr0 mandates that mm->pgd is not swapper, since swapper
contains kernel mappings and should never be installed into ttbr0. However,
this means that callers must avoid passing the init_mm to update_saved_ttbr0
which in turn can cause the saved ttbr0 value to be out-of-date in the context
of the idle thread. For example, EFI runtime services may leave the saved ttbr0
pointing at the EFI page table, and kernel threads may end up with stale
references to freed page tables.
This patch changes update_saved_ttbr0 so that the init_mm points the saved
ttbr0 value to the empty zero page, which always exists and never contains
valid translations. EFI and switch can then call into update_saved_ttbr0
unconditionally.
Dave Martin [Wed, 6 Dec 2017 16:45:46 +0000 (16:45 +0000)]
arm64: fpsimd: Abstract out binding of task's fpsimd context to the cpu.
There is currently some duplicate logic to associate current's
FPSIMD context with the cpu when loading FPSIMD state into the cpu
regs.
Subsequent patches will update that logic, so in order to ensure it
only needs to be done in one place, this patch factors the relevant
code out into a new function fpsimd_bind_to_cpu().
Dave Martin [Tue, 5 Dec 2017 14:56:42 +0000 (14:56 +0000)]
arm64: fpsimd: Prevent registers leaking from dead tasks
Currently, loading of a task's fpsimd state into the CPU registers
is skipped if that task's state is already present in the registers
of that CPU.
However, the code relies on the struct fpsimd_state * (and by
extension struct task_struct *) to unambiguously identify a task.
There is a particular case in which this doesn't work reliably:
when a task exits, its task_struct may be recycled to describe a
new task.
Consider the following scenario:
1) Task P loads its fpsimd state onto cpu C.
per_cpu(fpsimd_last_state, C) := P;
P->thread.fpsimd_state.cpu := C;
2) Task X is scheduled onto C and loads its fpsimd state on C.
per_cpu(fpsimd_last_state, C) := X;
X->thread.fpsimd_state.cpu := C;
3) X exits, causing X's task_struct to be freed.
4) P forks a new child T, which obtains X's recycled task_struct.
T == X.
T->thread.fpsimd_state.cpu == C (inherited from P).
5) T is scheduled on C.
T's fpsimd state is not loaded, because
per_cpu(fpsimd_last_state, C) == T (== X) &&
T->thread.fpsimd_state.cpu == C.
(This is the check performed by fpsimd_thread_switch().)
So, T gets X's registers because the last registers loaded onto C
were those of X, in (2).
This patch fixes the problem by ensuring that the sched-in check
fails in (5): fpsimd_flush_task_state(T) is called when T is
forked, so that T->thread.fpsimd_state.cpu == C cannot be true.
This relies on the fact that T is not schedulable until after
copy_thread() completes.
Once T's fpsimd state has been loaded on some CPU C there may still
be other cpus D for which per_cpu(fpsimd_last_state, D) ==
&X->thread.fpsimd_state. But D is necessarily != C in this case,
and the check in (5) must fail.
An alternative fix would be to do refcounting on task_struct. This
would result in each CPU holding a reference to the last task whose
fpsimd state was loaded there. It's not clear whether this is
preferable, and it involves higher overhead than the fix proposed
in this patch. It would also move all the task_struct freeing
work into the context switch critical section, or otherwise some
deferred cleanup mechanism would need to be introduced, neither of
which seems obviously justified.
Cc: <[email protected]> Fixes: 005f78cd8849 ("arm64: defer reloading a task's FPSIMD state to userland resume") Signed-off-by: Dave Martin <[email protected]> Reviewed-by: Ard Biesheuvel <[email protected]>
[will: word-smithed the comment so it makes more sense] Signed-off-by: Will Deacon <[email protected]>
Radim Krčmář [Thu, 30 Nov 2017 18:05:45 +0000 (19:05 +0100)]
KVM: x86: fix APIC page invalidation
Implementation of the unpinned APIC page didn't update the VMCS address
cache when invalidation was done through range mmu notifiers.
This became a problem when the page notifier was removed.
Re-introduce the arch-specific helper and call it from ...range_start.
Dan Carpenter [Tue, 5 Dec 2017 14:38:54 +0000 (17:38 +0300)]
xen/pvcalls: Fix a check in pvcalls_front_remove()
bedata->ref can't be less than zero because it's unsigned. This affects
certain error paths in probe. We first set ->ref = -1 and then we set
it to a valid value later.
Since commit ad67b74d2469 ("printk: hash addresses printed with %p")
pointers printed with %p are hashed, ie. you don't see the actual
pointer value but rather a cryptographic hash of its value.
In xmon we want to see the actual pointer values, because xmon is a
debugger, so replace %p with %px which prints the actual pointer
value.
We justify doing this in xmon because 1) xmon is a kernel crash
debugger, it's only accessible via the console 2) xmon doesn't print
to dmesg, so the pointers it prints are not able to be leaked that
way.
Nicholas Piggin [Wed, 6 Dec 2017 08:21:14 +0000 (18:21 +1000)]
powerpc/64s: Initialize ISAv3 MMU registers before setting partition table
kexec can leave MMU registers set when booting into a new kernel,
the PIDR (Process Identification Register) in particular. The boot
sequence does not zero PIDR, so it only gets set when CPUs first
switch to a userspace processes (until then it's running a kernel
thread with effective PID = 0).
This leaves a window where a process table entry and page tables are
set up due to user processes running on other CPUs, that happen to
match with a stale PID. The CPU with that PID may cause speculative
accesses that address quadrant 0 (aka userspace addresses), which will
result in cached translations and PWC (Page Walk Cache) for that
process, on a CPU which is not in the mm_cpumask and so they will not
be invalidated properly.
The most common result is the kernel hanging in infinite page fault
loops soon after kexec (usually in schedule_tail, which is usually the
first non-speculative quadrant 0 access to a new PID) due to a stale
PWC. However being a stale translation error, it could result in
anything up to security and data corruption problems.
Fix this by zeroing out PIDR at boot and kexec.
Fixes: 7e381c0ff618 ("powerpc/mm/radix: Add mmu context handling callback for radix") Cc: [email protected] # v4.7+ Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
Andy Lutomirski [Thu, 30 Nov 2017 15:57:57 +0000 (07:57 -0800)]
x86/power: Fix some ordering bugs in __restore_processor_context()
__restore_processor_context() had a couple of ordering bugs. It
restored GSBASE after calling load_gs_index(), and the latter can
call into tracing code. It also tried to restore segment registers
before restoring the LDT, which is straight-up wrong.
Reorder the code so that we restore GSBASE, then the descriptor
tables, then the segments.
This fixes two bugs. First, it fixes a regression that broke resume
under certain configurations due to irqflag tracing in
native_load_gs_index(). Second, it fixes resume when the userspace
process that initiated suspect had funny segments. The latter can be
reproduced by compiling this:
// SPDX-License-Identifier: GPL-2.0
/*
* ldt_echo.c - Echo argv[1] while using an LDT segment
*/
int main(int argc, char **argv)
{
int ret;
size_t len;
char *buf;
x86/PCI: Make broadcom_postcore_init() check acpi_disabled
acpi_os_get_root_pointer() may return a valid address even if acpi_disabled
is set, but the host bridge information from the ACPI tables is not going
to be used in that case and the Broadcom host bridge initialization should
not be skipped then, So make broadcom_postcore_init() check acpi_disabled
too to avoid this issue.
Tom Lendacky [Thu, 30 Nov 2017 22:46:40 +0000 (16:46 -0600)]
x86/microcode/AMD: Add support for fam17h microcode loading
The size for the Microcode Patch Block (MPB) for an AMD family 17h
processor is 3200 bytes. Add a #define for fam17h so that it does
not default to 2048 bytes and fail a microcode load/update.
Rudolf Marek [Tue, 28 Nov 2017 21:01:06 +0000 (22:01 +0100)]
x86/cpufeatures: Make X86_BUG_FXSAVE_LEAK detectable in CPUID on AMD
The latest AMD AMD64 Architecture Programmer's Manual
adds a CPUID feature XSaveErPtr (CPUID_Fn80000008_EBX[2]).
If this feature is set, the FXSAVE, XSAVE, FXSAVEOPT, XSAVEC, XSAVES
/ FXRSTOR, XRSTOR, XRSTORS always save/restore error pointers,
thus making the X86_BUG_FXSAVE_LEAK workaround obsolete on such CPUs.
we've went to extreme lengths to make sure connector iterations works
in any context, without introducing any additional locking context.
This worked, except for a small fumble in the implementation:
When we actually race with a concurrent connector unplug event, and
our temporary connector reference turns out to be the final one, then
everything breaks: We call the connector release function from
whatever context we happen to be in, which can be an irq/atomic
context. And connector freeing grabs all kinds of locks and stuff.
Fix this by creating a specially safe put function for connetor_iter,
which (in this rare case) punts the cleanup to a worker.
Janosch Frank [Mon, 4 Dec 2017 11:19:11 +0000 (12:19 +0100)]
KVM: s390: Fix skey emulation permission check
All skey functions call skey_check_enable at their start, which checks
if we are in the PSTATE and injects a privileged operation exception
if we are.
Unfortunately they continue processing afterwards and perform the
operation anyhow as skey_check_enable does not deliver an error if the
exception injection was successful.
Let's move the PSTATE check into the skey functions and exit them on
such an occasion, also we now do not enable skey handling anymore in
such a case.
Old kernels did not check for zero in the irq_state.flags field and old
QEMUs did not zero the flag/reserved fields when calling
KVM_S390_*_IRQ_STATE. Let's add comments to prevent future uses of
these fields.
Now that the SPDX tag is in all arch/s390/kvm/ files, that identifies
the license in a specific and legally-defined manner. So the extra GPL
text wording can be removed as it is no longer needed at all.
This is done on a quest to remove the 700+ different ways that files in
the kernel describe the GPL license text. And there's unneeded stuff
like the address (sometimes incorrect) for the FSF which is never
needed.
No copyright headers or other non-license-description text was removed.
KVM: s390: add SPDX identifiers to the remaining files
It's good to have SPDX identifiers in all files to make it easier to
audit the kernel tree for correct licenses.
Update the arch/s390/kvm/ files with the correct SPDX license
identifier based on the license text in the file itself. The SPDX
identifier is a legally binding shorthand, which can be used instead of
the full boiler plate text.
This work is based on a script and data from Thomas Gleixner, Philippe
Ombredanne, and Kate Stewart.
Zhenyu Wang [Mon, 4 Dec 2017 02:42:58 +0000 (10:42 +0800)]
drm/i915/gvt: set max priority for gvt context
This is to workaround guest driver hang regression after
preemption enable that gvt hasn't enabled handling of that
for guest workload. So in effect this disables preemption
for gvt context now.
Xiong Zhang [Mon, 6 Nov 2017 21:23:02 +0000 (05:23 +0800)]
drm/i915/gvt: Limit read hw reg to active vgpu
mmio_read_from_hw() let vgpu could read hw reg, if vgpu's workload
is running on hw, things is good. Otherwise vgpu will get other
vgpu's reg val, it is unsafe.
This patch limit such hw access to active vgpu. If vgpu isn't
running on hw, the reg read of this vgpu will get the last active
val which saved at schedule_out.
v2: ring timestamp is walking continuously even if the ring is idle.
so read hw directly. (Zhenyu)
Changbin Du [Thu, 2 Nov 2017 05:33:42 +0000 (13:33 +0800)]
drm/i915/gvt: Emulate PCI expansion ROM base address register
Our vGPU doesn't have a device ROM, we need follow the PCI spec to
report this info to drivers. Otherwise, we would see below errors.
Inspecting possible rom at 0xfe049000 (vd=8086:1912 bdf=00:10.0)
qemu-system-x86_64: vfio-pci: Cannot read device rom at 00000000-0000-0000-0000-000000000001
Device option ROM contents are probably invalid (check dmesg).
Skip option ROM probe with rombar=0, or load from file with romfile=No option rom signature (got 4860)
I will also send a improvement patch to PCI subsystem related to PCI ROM.
But no idea to omit below error, since no pattern to detect vbios shadow
without touch its content.
0000:00:10.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0x0000
Linus Torvalds [Wed, 6 Dec 2017 01:59:29 +0000 (17:59 -0800)]
x86: don't hash faulting address in oops printout
Things like this will probably keep showing up for other architectures
and other special cases.
I actually thought we already used %lx for this, and that is indeed
_historically_ the case, but we moved to %p when merging the 32-bit and
64-bit cases as a convenient way to get the formatting right (ie
automatically picking "%08lx" vs "%016lx" based on register size).
Kees Cook [Tue, 5 Dec 2017 01:24:54 +0000 (17:24 -0800)]
locking/refcounts: Do not force refcount_t usage as GPL-only export
The refcount_t protection on x86 was not intended to use the stricter
GPL export. This adjusts the linkage again to avoid a regression in
the availability of the refcount API.
Al Viro [Tue, 5 Dec 2017 23:29:09 +0000 (23:29 +0000)]
make sock_alloc_file() do sock_release() on failures
This changes calling conventions (and simplifies the hell out
the callers). New rules: once struct socket had been passed
to sock_alloc_file(), it's been consumed either by struct file
or by sock_release() done by sock_alloc_file(). Either way
the caller should not do sock_release() after that point.
Al Viro [Tue, 5 Dec 2017 23:27:57 +0000 (23:27 +0000)]
fix kcm_clone()
1) it's fput() or sock_release(), not both
2) don't do fd_install() until the last failure exit.
3) not a bug per se, but... don't attach socket to struct file
until it's set up.
Take reserving descriptor into the caller, move fd_install() to the
caller, sanitize failure exits and calling conventions.
Eric Dumazet [Tue, 5 Dec 2017 20:45:56 +0000 (12:45 -0800)]
net: remove hlist_nulls_add_tail_rcu()
Alexander Potapenko reported use of uninitialized memory [1]
This happens when inserting a request socket into TCP ehash,
in __sk_nulls_add_node_rcu(), since sk_reuseport is not initialized.
Bug was added by commit d894ba18d4e4 ("soreuseport: fix ordering for
mixed v4/v6 sockets")
Note that d296ba60d8e2 ("soreuseport: Resolve merge conflict for v4/v6
ordering fix") missed the opportunity to get rid of
hlist_nulls_add_tail_rcu() :
Both UDP sockets and TCP/DCCP listeners no longer use
__sk_nulls_add_node_rcu() for their hash insertion.
Since all other sockets have unique 4-tuple, the reuseport status
has no special meaning, so we can always use hlist_nulls_add_head_rcu()
for them and save few cycles/instructions.
====================
net: qualcomm: rmnet: Fix leaks in failure scenarios
Patch 1 fixes a leak in transmit path where a skb cannot be
transmitted due to insufficient headroom to stamp the map header.
Patch 2 fixes a leak in rmnet_newlink() failure because the
rmnet endpoint was never freed
====================
net: qualcomm: rmnet: Fix leak in device creation failure
If the rmnet device creation fails in the newlink either while
registering with the physical device or after subsequent
operations, the rmnet endpoint information is never freed.
Fixes: ceed73a2cf4a ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation") Signed-off-by: Subash Abhinov Kasiviswanathan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Robb Glasser [Tue, 5 Dec 2017 17:16:55 +0000 (09:16 -0800)]
ALSA: pcm: prevent UAF in snd_pcm_info
When the device descriptor is closed, the `substream->runtime` pointer
is freed. But another thread may be in the ioctl handler, case
SNDRV_CTL_IOCTL_PCM_INFO. This case calls snd_pcm_info_user() which
calls snd_pcm_info() which accesses the now freed `substream->runtime`.
George Cherian [Mon, 4 Dec 2017 14:06:54 +0000 (14:06 +0000)]
ACPI / CPPC: Fix KASAN global out of bounds warning
Default value of pcc_subspace_idx is -1.
Make sure to check pcc_subspace_idx before using the same as array index.
This will avoid following KASAN warnings too.
[ 15.116983] The buggy address belongs to the variable:
[ 15.116983] __key.36299+0x38/0x40
[ 15.116983] Memory state around the buggy address:
[ 15.116983] ffffffffb9a5bf80: fa fa fa fa 00 fa fa fa fa fa fa fa 00 fa fa fa
[ 15.116983] ffffffffb9a5c000: fa fa fa fa 00 fa fa fa fa fa fa fa 00 fa fa fa
[ 15.116983] >ffffffffb9a5c080: fa fa fa fa 00 fa fa fa fa fa fa fa 00 00 00 00
[ 15.116983] ^
[ 15.116983] ffffffffb9a5c100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 15.116983] ffffffffb9a5c180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 15.116983] ==================================================================
Fixes: 85b1407bf6d2 (ACPI / CPPC: Make CPPC ACPI driver aware of PCC subspace IDs) Reported-by: Changbin Du <[email protected]> Signed-off-by: George Cherian <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
KVM allows guests to directly access I/O port 0x80 on Intel hosts. If
the guest floods this port with writes it generates exceptions and
instability in the host kernel, leading to a crash. With this change
guest writes to port 0x80 on Intel will behave the same as they
currently behave on AMD systems.
Prevent the flooding by removing the code that sets port 0x80 as a
passthrough port. This is essentially the same as upstream patch 99f85a28a78e96d28907fe036e1671a218fee597, except that patch was
for AMD chipsets and this patch is for Intel.
Rik van Riel [Tue, 14 Nov 2017 21:54:24 +0000 (16:54 -0500)]
x86,kvm: remove KVM emulator get_fpu / put_fpu
Now that get_fpu and put_fpu do nothing, because the scheduler will
automatically load and restore the guest FPU context for us while we
are in this code (deep inside the vcpu_run main loop), we can get rid
of the get_fpu and put_fpu hooks.
Rik van Riel [Tue, 14 Nov 2017 21:54:23 +0000 (16:54 -0500)]
x86,kvm: move qemu/guest FPU switching out to vcpu_run
Currently, every time a VCPU is scheduled out, the host kernel will
first save the guest FPU/xstate context, then load the qemu userspace
FPU context, only to then immediately save the qemu userspace FPU
context back to memory. When scheduling in a VCPU, the same extraneous
FPU loads and saves are done.
This could be avoided by moving from a model where the guest FPU is
loaded and stored with preemption disabled, to a model where the
qemu userspace FPU is swapped out for the guest FPU context for
the duration of the KVM_RUN ioctl.
This is done under the VCPU mutex, which is also taken when other
tasks inspect the VCPU FPU context, so the code should already be
safe for this change. That should come as no surprise, given that
s390 already has this optimization.
This can fix a bug where KVM calls get_user_pages while owning the
FPU, and the file system ends up requesting the FPU again:
No performance changes were detected in quick ping-pong tests on
my 4 socket system, which is expected since an FPU+xstate load is
on the order of 0.1us, while ping-ponging between CPUs is on the
order of 20us, and somewhat noisy.
Cc: [email protected] Signed-off-by: Rik van Riel <[email protected]> Suggested-by: Christian Borntraeger <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
[Fixed a bug where reset_vcpu called put_fpu without preceding load_fpu,
which happened inside from KVM_CREATE_VCPU ioctl. - Radim] Signed-off-by: Radim Krčmář <[email protected]>
Jon Maloy [Mon, 4 Dec 2017 21:00:20 +0000 (22:00 +0100)]
tipc: fix memory leak in tipc_accept_from_sock()
When the function tipc_accept_from_sock() fails to create an instance of
struct tipc_subscriber it omits to free the already created instance of
struct tipc_conn instance before it returns.
David S. Miller [Tue, 5 Dec 2017 19:40:35 +0000 (14:40 -0500)]
Merge branch 'sh_eth-dma-mapping-fixes'
Thomas Petazzoni says:
====================
net: sh_eth: DMA mapping API fixes
Here are two patches that fix how the sh_eth driver is using the DMA
mapping API: a bogus struct device is used in some places, or a NULL
struct device is used.
====================
net: sh_eth: don't use NULL as "struct device" for the DMA mapping API
Using NULL as argument for the DMA mapping API is bogus, as the DMA
mapping API may use information from the "struct device" to perform
the DMA mapping operation. Therefore, pass the appropriate "struct
device".
net: sh_eth: use correct "struct device" when calling DMA mapping functions
There are two types of "struct device": the one representing the
physical device on its physical bus (platform, SPI, PCI, etc.), and
the one representing the logical device in its device class (net,
etc.).
The DMA mapping API expects to receive as argument a "struct device"
representing the physical device, as the "struct device" contains
information about the bus that the DMA API needs.
However, the sh_eth driver mistakenly uses the "struct device"
representing the logical device (embedded in "struct net_device")
rather than the "struct device" representing the physical device on
its bus.
This commit fixes that by adjusting all calls to the DMA mapping API.
Nogah Frankel [Mon, 4 Dec 2017 11:31:11 +0000 (13:31 +0200)]
net_sched: red: Avoid illegal values
Check the qmin & qmax values doesn't overflow for the given Wlog value.
Check that qmin <= qmax.
Fixes: a783474591f2 ("[PKT_SCHED]: Generic RED layer") Signed-off-by: Nogah Frankel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Nogah Frankel [Mon, 4 Dec 2017 11:31:10 +0000 (13:31 +0200)]
net_sched: red: Avoid devision by zero
Do not allow delta value to be zero since it is used as a divisor.
Fixes: 8af2a218de38 ("sch_red: Adaptative RED AQM") Signed-off-by: Nogah Frankel <[email protected]> Signed-off-by: David S. Miller <[email protected]>