Vlastimil Babka [Tue, 19 Dec 2017 21:33:46 +0000 (22:33 +0100)]
x86/dumpstack: Indicate in Oops whether PTI is configured and enabled
CONFIG_PAGE_TABLE_ISOLATION is relatively new and intrusive feature that may
still have some corner cases which could take some time to manifest and be
fixed. It would be useful to have Oops messages indicate whether it was
enabled for building the kernel, and whether it was disabled during boot.
Example of fully enabled:
Oops: 0001 [#1] SMP PTI
Example of enabled during build, but disabled during boot:
Oops: 0001 [#1] SMP NOPTI
We can decide to remove this after the feature has been tested in the field
long enough.
[ tglx: Made it use boot_cpu_has() as requested by Borislav ]
Dave Hansen [Mon, 4 Dec 2017 14:08:01 +0000 (15:08 +0100)]
x86/mm: Use INVPCID for __native_flush_tlb_single()
This uses INVPCID to shoot down individual lines of the user mapping
instead of marking the entire user map as invalid. This
could/might/possibly be faster.
This for sure needs tlb_single_page_flush_ceiling to be redetermined;
esp. since INVPCID is _slow_.
A detailed performance analysis is available here:
Peter Zijlstra [Mon, 4 Dec 2017 14:08:00 +0000 (15:08 +0100)]
x86/mm: Optimize RESTORE_CR3
Most NMI/paranoid exceptions will not in fact change pagetables and would
thus not require TLB flushing, however RESTORE_CR3 uses flushing CR3
writes.
Restores to kernel PCIDs can be NOFLUSH, because we explicitly flush the
kernel mappings and now that we track which user PCIDs need flushing we can
avoid those too when possible.
This does mean RESTORE_CR3 needs an additional scratch_reg, luckily both
sites have plenty available.
Peter Zijlstra [Mon, 4 Dec 2017 14:07:59 +0000 (15:07 +0100)]
x86/mm: Use/Fix PCID to optimize user/kernel switches
We can use PCID to retain the TLBs across CR3 switches; including those now
part of the user/kernel switch. This increases performance of kernel
entry/exit at the cost of more expensive/complicated TLB flushing.
Now that we have two address spaces, one for kernel and one for user space,
we need two PCIDs per mm. We use the top PCID bit to indicate a user PCID
(just like we use the PFN LSB for the PGD). Since we do TLB invalidation
from kernel space, the existing code will only invalidate the kernel PCID,
we augment that by marking the corresponding user PCID invalid, and upon
switching back to userspace, use a flushing CR3 write for the switch.
In order to access the user_pcid_flush_mask we use PER_CPU storage, which
means the previously established SWAPGS vs CR3 ordering is now mandatory
and required.
Having to do this memory access does require additional registers, most
sites have a functioning stack and we can spill one (RAX), sites without
functional stack need to otherwise provide the second scratch register.
Note: PCID is generally available on Intel Sandybridge and later CPUs.
Note: Up until this point TLB flushing was broken in this series.
Dave Hansen [Mon, 4 Dec 2017 14:07:57 +0000 (15:07 +0100)]
x86/mm: Allow flushing for future ASID switches
If changing the page tables in such a way that an invalidation of all
contexts (aka. PCIDs / ASIDs) is required, they can be actively invalidated
by:
1. INVPCID for each PCID (works for single pages too).
2. Load CR3 with each PCID without the NOFLUSH bit set
3. Load CR3 with the NOFLUSH bit set for each and do INVLPG for each address.
But, none of these are really feasible since there are ~6 ASIDs (12 with
PAGE_TABLE_ISOLATION) at the time that invalidation is required.
Instead of actively invalidating them, invalidate the *current* context and
also mark the cpu_tlbstate _quickly_ to indicate future invalidation to be
required.
At the next context-switch, look for this indicator
('invalidate_other' being set) invalidate all of the
cpu_tlbstate.ctxs[] entries.
This ensures that any future context switches will do a full flush
of the TLB, picking up the previous changes.
Andy Lutomirski [Tue, 12 Dec 2017 15:56:45 +0000 (07:56 -0800)]
x86/pti: Put the LDT in its own PGD if PTI is on
With PTI enabled, the LDT must be mapped in the usermode tables somewhere.
The LDT is per process, i.e. per mm.
An earlier approach mapped the LDT on context switch into a fixmap area,
but that's a big overhead and exhausted the fixmap space when NR_CPUS got
big.
Take advantage of the fact that there is an address space hole which
provides a completely unused pgd. Use this pgd to manage per-mm LDT
mappings.
This has a down side: the LDT isn't (currently) randomized, and an attack
that can write the LDT is instant root due to call gates (thanks, AMD, for
leaving call gates in AMD64 but designing them wrong so they're only useful
for exploits). This can be mitigated by making the LDT read-only or
randomizing the mapping, either of which is strightforward on top of this
patch.
This will significantly slow down LDT users, but that shouldn't matter for
important workloads -- the LDT is only used by DOSEMU(2), Wine, and very
old libc implementations.
Hugh Dickins [Mon, 4 Dec 2017 14:07:50 +0000 (15:07 +0100)]
x86/events/intel/ds: Map debug buffers in cpu_entry_area
The BTS and PEBS buffers both have their virtual addresses programmed into
the hardware. This means that any access to them is performed via the page
tables. The times that the hardware accesses these are entirely dependent
on how the performance monitoring hardware events are set up. In other
words, there is no way for the kernel to tell when the hardware might
access these buffers.
To avoid perf crashes, place 'debug_store' allocate pages and map them into
the cpu_entry_area.
The PEBS fixup buffer does not need this treatment.
[ tglx: Got rid of the kaiser_add_mapping() complication ]
Thomas Gleixner [Mon, 4 Dec 2017 14:07:49 +0000 (15:07 +0100)]
x86/cpu_entry_area: Add debugstore entries to cpu_entry_area
The Intel PEBS/BTS debug store is a design trainwreck as it expects virtual
addresses which must be visible in any execution context.
So it is required to make these mappings visible to user space when kernel
page table isolation is active.
Provide enough room for the buffer mappings in the cpu_entry_area so the
buffers are available in the user space visible page tables.
At the point where the kernel side entry area is populated there is no
buffer available yet, but the kernel PMD must be populated. To achieve this
set the entries for these buffers to non present.
Thomas Gleixner [Mon, 4 Dec 2017 14:07:47 +0000 (15:07 +0100)]
x86/mm/pti: Share entry text PMD
Share the entry text PMD of the kernel mapping with the user space
mapping. If large pages are enabled this is a single PMD entry and at the
point where it is copied into the user page table the RW bit has not been
cleared yet. Clear it right away so the user space visible map becomes RX.
Andy Lutomirski [Mon, 4 Dec 2017 14:07:42 +0000 (15:07 +0100)]
x86/mm/pti: Add functions to clone kernel PMDs
Provide infrastructure to:
- find a kernel PMD for a mapping which must be visible to user space for
the entry/exit code to work.
- walk an address range and share the kernel PMD with it.
This reuses a small part of the original KAISER patches to populate the
user space page table.
[ tglx: Made it universally usable so it can be used for any kind of shared
mapping. Add a mechanism to clear specific bits in the user space
visible PMD entry. Folded Andys simplifactions ]
Dave Hansen [Mon, 4 Dec 2017 14:07:40 +0000 (15:07 +0100)]
x86/mm/pti: Populate user PGD
In clone_pgd_range() copy the init user PGDs which cover the kernel half of
the address space, so a process has all the required kernel mappings
visible.
[ tglx: Split out from the big kaiser dump and folded Andys simplification ]
Dave Hansen [Mon, 4 Dec 2017 14:07:39 +0000 (15:07 +0100)]
x86/mm/pti: Allocate a separate user PGD
Kernel page table isolation requires to have two PGDs. One for the kernel,
which contains the full kernel mapping plus the user space mapping and one
for user space which contains the user space mappings and the minimal set
of kernel mappings which are required by the architecture to be able to
transition from and to user space.
Add the necessary preliminaries.
[ tglx: Split out from the big kaiser dump. EFI fixup from Kirill ]
Dave Hansen [Mon, 4 Dec 2017 14:07:38 +0000 (15:07 +0100)]
x86/mm/pti: Allow NX poison to be set in p4d/pgd
With PAGE_TABLE_ISOLATION the user portion of the kernel page tables is
poisoned with the NX bit so if the entry code exits with the kernel page
tables selected in CR3, userspace crashes.
But doing so trips the p4d/pgd_bad() checks. Make sure it does not do
that.
Dave Hansen [Mon, 4 Dec 2017 14:07:35 +0000 (15:07 +0100)]
x86/mm/pti: Prepare the x86/entry assembly code for entry/exit CR3 switching
PAGE_TABLE_ISOLATION needs to switch to a different CR3 value when it
enters the kernel and switch back when it exits. This essentially needs to
be done before leaving assembly code.
This is extra challenging because the switching context is tricky: the
registers that can be clobbered can vary. It is also hard to store things
on the stack because there is an established ABI (ptregs) or the stack is
entirely unsafe to use.
Establish a set of macros that allow changing to the user and kernel CR3
values.
Interactions with SWAPGS:
Previous versions of the PAGE_TABLE_ISOLATION code relied on having
per-CPU scratch space to save/restore a register that can be used for the
CR3 MOV. The %GS register is used to index into our per-CPU space, so
SWAPGS *had* to be done before the CR3 switch. That scratch space is gone
now, but the semantic that SWAPGS must be done before the CR3 MOV is
retained. This is good to keep because it is not that hard to do and it
allows to do things like add per-CPU debugging information.
What this does in the NMI code is worth pointing out. NMIs can interrupt
*any* context and they can also be nested with NMIs interrupting other
NMIs. The comments below ".Lnmi_from_kernel" explain the format of the
stack during this situation. Changing the format of this stack is hard.
Instead of storing the old CR3 value on the stack, this depends on the
*regular* register save/restore mechanism and then uses %r14 to keep CR3
during the NMI. It is callee-saved and will not be clobbered by the C NMI
handlers that get called.
Dave Hansen [Mon, 4 Dec 2017 14:07:34 +0000 (15:07 +0100)]
x86/mm/pti: Disable global pages if PAGE_TABLE_ISOLATION=y
Global pages stay in the TLB across context switches. Since all contexts
share the same kernel mapping, these mappings are marked as global pages
so kernel entries in the TLB are not flushed out on a context switch.
But, even having these entries in the TLB opens up something that an
attacker can use, such as the double-page-fault attack:
That means that even when PAGE_TABLE_ISOLATION switches page tables
on return to user space the global pages would stay in the TLB cache.
Disable global pages so that kernel TLB entries can be flushed before
returning to user space. This way, all accesses to kernel addresses from
userspace result in a TLB miss independent of the existence of a kernel
mapping.
Suppress global pages via the __supported_pte_mask. The user space
mappings set PAGE_GLOBAL for the minimal kernel mappings which are
required for entry/exit. These mappings are set up manually so the
filtering does not take place.
Thomas Gleixner [Mon, 4 Dec 2017 14:07:33 +0000 (15:07 +0100)]
x86/cpufeatures: Add X86_BUG_CPU_INSECURE
Many x86 CPUs leak information to user space due to missing isolation of
user space and kernel space page tables. There are many well documented
ways to exploit that.
The upcoming software migitation of isolating the user and kernel space
page tables needs a misfeature flag so code can be made runtime
conditional.
Add the BUG bits which indicates that the CPU is affected and add a feature
bit which indicates that the software migitation is enabled.
Assume for now that _ALL_ x86 CPUs are affected by this. Exceptions can be
made later.
Linus Torvalds [Sat, 23 Dec 2017 19:53:04 +0000 (11:53 -0800)]
Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 PTI preparatory patches from Thomas Gleixner:
"Todays Advent calendar window contains twentyfour easy to digest
patches. The original plan was to have twenty three matching the date,
but a late fixup made that moot.
- Move the cpu_entry_area mapping out of the fixmap into a separate
address space. That's necessary because the fixmap becomes too big
with NRCPUS=8192 and this caused already subtle and hard to
diagnose failures.
The top most patch is fresh from today and cures a brain slip of
that tall grumpy german greybeard, who ignored the intricacies of
32bit wraparounds.
- Limit the number of CPUs on 32bit to 64. That's insane big already,
but at least it's small enough to prevent address space issues with
the cpu_entry_area map, which have been observed and debugged with
the fixmap code
- A few TLB flush fixes in various places plus documentation which of
the TLB functions should be used for what.
- Rename the SYSENTER stack to CPU_ENTRY_AREA stack as it is used for
more than sysenter now and keeping the name makes backtraces
confusing.
- Prevent LDT inheritance on exec() by moving it to arch_dup_mmap(),
which is only invoked on fork().
- Make vysycall more robust.
- A few fixes and cleanups of the debug_pagetables code. Check
PAGE_PRESENT instead of checking the PTE for 0 and a cleanup of the
C89 initialization of the address hint array which already was out
of sync with the index enums.
- Move the ESPFIX init to a different place to prepare for PTI.
- Several code moves with no functional change to make PTI
integration simpler and header files less convoluted.
- Documentation fixes and clarifications"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86/cpu_entry_area: Prevent wraparound in setup_cpu_entry_area_ptes() on 32bit
init: Invoke init_espfix_bsp() from mm_init()
x86/cpu_entry_area: Move it out of the fixmap
x86/cpu_entry_area: Move it to a separate unit
x86/mm: Create asm/invpcid.h
x86/mm: Put MMU to hardware ASID translation in one place
x86/mm: Remove hard-coded ASID limit checks
x86/mm: Move the CR3 construction functions to tlbflush.h
x86/mm: Add comments to clarify which TLB-flush functions are supposed to flush what
x86/mm: Remove superfluous barriers
x86/mm: Use __flush_tlb_one() for kernel memory
x86/microcode: Dont abuse the TLB-flush interface
x86/uv: Use the right TLB-flush API
x86/entry: Rename SYSENTER_stack to CPU_ENTRY_AREA_entry_stack
x86/doc: Remove obvious weirdnesses from the x86 MM layout documentation
x86/mm/64: Improve the memory map documentation
x86/ldt: Prevent LDT inheritance on exec
x86/ldt: Rework locking
arch, mm: Allow arch_dup_mmap() to fail
x86/vsyscall/64: Warn and fail vsyscall emulation in NATIVE mode
...
Thomas Gleixner [Sat, 23 Dec 2017 18:45:11 +0000 (19:45 +0100)]
x86/cpu_entry_area: Prevent wraparound in setup_cpu_entry_area_ptes() on 32bit
The loop which populates the CPU entry area PMDs can wrap around on 32bit
machines when the number of CPUs is small.
It worked wonderful for NR_CPUS=64 for whatever reason and the moron who
wrote that code did not bother to test it with !SMP.
Check for the wraparound to fix it.
Fixes: 92a0f81d8957 ("x86/cpu_entry_area: Move it out of the fixmap") Reported-by: kernel test robot <[email protected]> Signed-off-by: Thomas "Feels stupid" Gleixner <[email protected]> Tested-by: Borislav Petkov <[email protected]>
nvmem: meson-mx-efuse: fix reading from an offset other than 0
meson_mx_efuse_read calculates the address internal to the eFuse based
on the offset and the word size. This works fine with any given offset.
However, the offset is also included when writing to the output buffer.
This means that reading 4 bytes at offset 500 tries to write beyond the
array allocated by the nvmem core as it wants to write the 4 bytes to
"buffer address + offset (500)".
This issue did not show up in the previous tests since no driver uses
any value from the eFuse yet and reading the eFuse via sysfs simply
reads the whole eFuse, starting at offset 0.
Fix this by only including the offset in the internal address
calculation.
Fixes: 8caef1fa9176 ("nvmem: add a driver for the Amlogic Meson6/Meson8/Meson8b SoCs") Signed-off-by: Martin Blumenstingl <[email protected]> Signed-off-by: Srinivas Kandagatla <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Daniel Borkmann [Sat, 23 Dec 2017 00:09:53 +0000 (01:09 +0100)]
Merge branch 'bpf-bpftool-various-fixes'
Jakub Kicinski says:
====================
Two small fixes here to listing maps and programs. The loop for showing
maps is written slightly differently to programs which was missed in JSON
output support, and output would be broken if any of the system calls
failed. Second fix is in very unlikely case that program or map disappears
after we get its ID we should just skip over that object instead of failing.
====================
Jakub Kicinski [Fri, 22 Dec 2017 19:36:06 +0000 (11:36 -0800)]
tools: bpftool: protect against races with disappearing objects
On program/map show we may get an ID of an object from GETNEXT,
but the object may disappear before we call GET_FD_BY_ID. If
that happens, ignore the object and continue.
Jakub Kicinski [Fri, 22 Dec 2017 19:36:05 +0000 (11:36 -0800)]
tools: bpftool: maps: close json array on error paths of show
We can't return from the middle of do_show(), because
json_array will not be closed. Break out of the loop.
Note that the error handling after the loop depends on
errno, so no need to set err.
Fixes: 831a0aafe5c3 ("tools: bpftool: add JSON output for `bpftool map *` commands") Signed-off-by: Jakub Kicinski <[email protected]> Acked-by: Quentin Monnet <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
Linus Torvalds [Fri, 22 Dec 2017 20:38:30 +0000 (12:38 -0800)]
Merge tag 'powerpc-4.15-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"This is all fairly boring, except that there's two KVM fixes that
you'd normally get via Paul's kvm-ppc tree. He's away so I picked them
up. I was waiting to see if he would apply them, which is why they
have only been in my tree since today. But they were on the list for a
while and have been tested on the relevant hardware.
Of note is two fixes for KVM XIVE (Power9 interrupt controller). These
would normally go via the KVM tree but Paul is away so I've picked
them up.
Other than that, two fixes for error handling in the IMC driver, and
one for a potential oops in the BHRB code if the hardware records a
branch address that has subsequently been unmapped, and finally a
s/%p/%px/ in our oops code.
Thanks to: Anju T Sudhakar, Cédric Le Goater, Laurent Vivier, Madhavan
Srinivasan, Naveen N. Rao, Ravi Bangoria"
* tag 'powerpc-4.15-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
KVM: PPC: Book3S HV: Fix pending_pri value in kvmppc_xive_get_icp()
KVM: PPC: Book3S: fix XIVE migration of pending interrupts
powerpc/kernel: Print actual address of regs when oopsing
powerpc/perf: Fix kfree memory allocated for nest pmus
powerpc/perf/imc: Fix nest-imc cpuhotplug callback failure
powerpc/perf: Dereference BHRB entries safely
Linus Torvalds [Fri, 22 Dec 2017 20:30:10 +0000 (12:30 -0800)]
Merge tag 'for-linus-4.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"This contains two fixes for running under Xen:
- a fix avoiding resource conflicts between adding mmio areas and
memory hotplug
- a fix setting NX bits in page table entries copied from Xen when
running a PV guest"
* tag 'for-linus-4.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/balloon: Mark unallocated host memory as UNUSABLE
x86-64/Xen: eliminate W+X mappings
Linus Torvalds [Fri, 22 Dec 2017 20:27:27 +0000 (12:27 -0800)]
Merge tag 'xfs-4.15-fixes-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Here are some XFS fixes for 4.15-rc5. Apologies for the unusually
large number of patches this late, but I wanted to make sure the
corruption fixes were really ready to go.
Changes since last update:
- Fix a locking problem during xattr block conversion that could lead
to the log checkpointing thread to try to write an incomplete
buffer to disk, which leads to a corruption shutdown
- Fix a null pointer dereference when removing delayed allocation
extents
- Remove post-eof speculative allocations when reflinking a block
past current inode size so that we don't just leave them there and
assert on inode reclaim
- Relax an assert which didn't accurately reflect the way locking
works and would trigger under heavy io load
- Avoid infinite loop when cancelling copy on write extents after a
writeback failure
- Try to avoid copy on write transaction reservation overflows when
remapping after a successful write
- Fix various problems with the copy-on-write reservation automatic
garbage collection not being cleaned up properly during a ro
remount
- Fix problems with rmap log items being processed in the wrong
order, leading to corruption shutdowns
- Fix problems with EFI recovery wherein the "remove any rmapping if
present" mechanism wasn't actually doing anything, which would lead
to corruption problems later when the extent is reallocated,
leading to multiple rmaps for the same extent"
* tag 'xfs-4.15-fixes-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: only skip rmap owner checks for unknown-owner rmap removal
xfs: always honor OWN_UNKNOWN rmap removal requests
xfs: queue deferred rmap ops for cow staging extent alloc/free in the right order
xfs: set cowblocks tag for direct cow writes too
xfs: remove leftover CoW reservations when remounting ro
xfs: don't be so eager to clear the cowblocks tag on truncate
xfs: track cowblocks separately in i_flags
xfs: allow CoW remap transactions to use reserve blocks
xfs: avoid infinite loop when cancelling CoW blocks after writeback failure
xfs: relax is_reflink_inode assert in xfs_reflink_find_cow_mapping
xfs: remove dest file's post-eof preallocations before reflinking
xfs: move xfs_iext_insert tracepoint to report useful information
xfs: account for null transactions in bunmapi
xfs: hold xfs_buf locked between shortform->leaf conversion and the addition of an attribute
xfs: add the ability to join a held buffer to a defer_ops
Linus Torvalds [Fri, 22 Dec 2017 20:22:48 +0000 (12:22 -0800)]
Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
"This fixes the following issues:
- fix chacha20 crash on zero-length input due to unset IV
- fix potential race conditions in mcryptd with spinlock
- only wait once at top of algif recvmsg to avoid inconsistencies
- fix potential use-after-free in algif_aead/algif_skcipher"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: af_alg - fix race accessing cipher request
crypto: mcryptd - protect the per-CPU queue with a lock
crypto: af_alg - wait for data at beginning of recvmsg
crypto: skcipher - set walk.iv for zero-length inputs
Linus Torvalds [Fri, 22 Dec 2017 20:21:12 +0000 (12:21 -0800)]
Merge tag 'pinctrl-v4.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fix from Linus Walleij:
"A single pin control fix for Intel machines, affecting a bunch of
Chromebooks. Nothing else collected up amazingly"
* tag 'pinctrl-v4.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: cherryview: Mask all interrupts on Intel_Strago based systems
I'm not sure I'll see many fixes over next couple of weeks, we'll see
how we go"
* tag 'drm-fixes-for-v4.15-rc5' of git://people.freedesktop.org/~airlied/linux: (27 commits)
drm/syncobj: Stop reusing the same struct file for all syncobj -> fd
drm: move lease init after validation in drm_lease_create
drm/plane: Make framebuffer refcounting the responsibility of setplane_internal callers
drm/sun4i: hdmi: Move the mode_valid callback to the encoder
drm/nouveau: fix obvious memory leak
drm/i915: Protect DDI port to DPLL map from theoretical race.
drm/i915/lpe: Remove double-encapsulation of info string
drm/sun4i: Fix error path handling
drm/nouveau: use alternate memory type for system-memory buffers with kind != 0
drm/nouveau: avoid GPU page sizes > PAGE_SIZE for buffer objects in host memory
drm/nouveau/mmu/gp10b: use correct implementation
drm/nouveau/pci: do a msi rearm on init
drm/nouveau/imem/nv50: fix refcount_t warning
drm/nouveau/bios/dp: support DP Info Table 2.0
drm/nouveau/fbcon: fix NULL pointer access in nouveau_fbcon_destroy
drm/amd/display: Fix rehook MST display not light back on
drm/amd/display: fix missing pixel clock adjustment for dongle
drm/amd/display: set chroma taps to 1 when not scaling
drm/amd/display: add pipe locking before front end programing
drm/sun4i: validate modes for HDMI
...
Linus Torvalds [Fri, 22 Dec 2017 19:48:36 +0000 (11:48 -0800)]
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"Here's a trio of fixes:
- The runtime PM clk patches that landed this merge window forgot to
runtime resume devices that may be off while recalculating and
setting rates of child clks of whatever clk is changing rates.
- We had a NULL pointer deref in an old clk tracepoint when
clk_set_parent() is called with a NULL parent pointer. This
shouldn't really happen, but it's best to avoid this regardless.
- The sun9i-mmc clk driver didn't provide 'reset' support, just
'assert' and 'deassert' so the MMC driver stopped probing when the
probe was changed to do a reset instead of assert/deassert pair.
This implements the reset so things work again"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: sunxi: sun9i-mmc: Implement reset callback for reset controls
clk: fix a panic error caused by accessing NULL pointer
clk: Manage proper runtime PM state in clk_change_rate()
Thomas Gleixner [Sun, 17 Dec 2017 09:56:29 +0000 (10:56 +0100)]
init: Invoke init_espfix_bsp() from mm_init()
init_espfix_bsp() needs to be invoked before the page table isolation
initialization. Move it into mm_init() which is the place where pti_init()
will be added.
While at it get rid of the #ifdeffery and provide proper stub functions.
Thomas Gleixner [Wed, 20 Dec 2017 17:51:31 +0000 (18:51 +0100)]
x86/cpu_entry_area: Move it out of the fixmap
Put the cpu_entry_area into a separate P4D entry. The fixmap gets too big
and 0-day already hit a case where the fixmap PTEs were cleared by
cleanup_highmap().
Aside of that the fixmap API is a pain as it's all backwards.
Dave Hansen [Mon, 4 Dec 2017 14:07:56 +0000 (15:07 +0100)]
x86/mm: Put MMU to hardware ASID translation in one place
There are effectively two ASID types:
1. The one stored in the mmu_context that goes from 0..5
2. The one programmed into the hardware that goes from 1..6
This consolidates the locations where converting between the two (by doing
a +1) to a single place which gives us a nice place to comment.
PAGE_TABLE_ISOLATION will also need to, given an ASID, know which hardware
ASID to flush for the userspace mapping.
Dave Hansen [Mon, 4 Dec 2017 14:07:55 +0000 (15:07 +0100)]
x86/mm: Remove hard-coded ASID limit checks
First, it's nice to remove the magic numbers.
Second, PAGE_TABLE_ISOLATION is going to consume half of the available ASID
space. The space is currently unused, but add a comment to spell out this
new restriction.
Peter Zijlstra [Tue, 5 Dec 2017 12:34:50 +0000 (13:34 +0100)]
x86/uv: Use the right TLB-flush API
Since uv_flush_tlb_others() implements flush_tlb_others() which is
about flushing user mappings, we should use __flush_tlb_single(),
which too is about flushing user mappings.
Thomas Gleixner [Thu, 14 Dec 2017 11:27:31 +0000 (12:27 +0100)]
x86/ldt: Prevent LDT inheritance on exec
The LDT is inherited across fork() or exec(), but that makes no sense
at all because exec() is supposed to start the process clean.
The reason why this happens is that init_new_context_ldt() is called from
init_new_context() which obviously needs to be called for both fork() and
exec().
It would be surprising if anything relies on that behaviour, so it seems to
be safe to remove that misfeature.
Split the context initialization into two parts. Clear the LDT pointer and
initialize the mutex from the general context init and move the LDT
duplication to arch_dup_mmap() which is only called on fork().
Peter Zijlstra [Thu, 14 Dec 2017 11:27:30 +0000 (12:27 +0100)]
x86/ldt: Rework locking
The LDT is duplicated on fork() and on exec(), which is wrong as exec()
should start from a clean state, i.e. without LDT. To fix this the LDT
duplication code will be moved into arch_dup_mmap() which is only called
for fork().
This introduces a locking problem. arch_dup_mmap() holds mmap_sem of the
parent process, but the LDT duplication code needs to acquire
mm->context.lock to access the LDT data safely, which is the reverse lock
order of write_ldt() where mmap_sem nests into context.lock.
Solve this by introducing a new rw semaphore which serializes the
read/write_ldt() syscall operations and use context.lock to protect the
actual installment of the LDT descriptor.
So context.lock stabilizes mm->context.ldt and can nest inside of the new
semaphore or mmap_sem.
Andy Lutomirski [Mon, 11 Dec 2017 06:47:19 +0000 (22:47 -0800)]
x86/vsyscall/64: Explicitly set _PAGE_USER in the pagetable hierarchy
The kernel is very erratic as to which pagetables have _PAGE_USER set. The
vsyscall page gets lucky: it seems that all of the relevant pagetables are
among the apparently arbitrary ones that set _PAGE_USER. Rather than
relying on chance, just explicitly set _PAGE_USER.
This will let us clean up pagetable setup to stop setting _PAGE_USER. The
added code can also be reused by pagetable isolation to manage the
_PAGE_USER bit in the usermode tables.
Thomas Gleixner [Wed, 20 Dec 2017 17:07:42 +0000 (18:07 +0100)]
x86/mm/dump_pagetables: Make the address hints correct and readable
The address hints are a trainwreck. The array entry numbers have to kept
magically in sync with the actual hints, which is doomed as some of the
array members are initialized at runtime via the entry numbers.
Designated initializers have been around before this code was
implemented....
Use the entry numbers to populate the address hints array and add the
missing bits and pieces. Split 32 and 64 bit for readability sake.
Thomas Gleixner [Sat, 16 Dec 2017 00:14:39 +0000 (01:14 +0100)]
x86/mm/dump_pagetables: Check PAGE_PRESENT for real
The check for a present page in printk_prot():
if (!pgprot_val(prot)) {
/* Not present */
is bogus. If a PTE is set to PAGE_NONE then the pgprot_val is not zero and
the entry is decoded in bogus ways, e.g. as RX GLB. That is confusing when
analyzing mapping correctness. Check for the present bit to make an
informed decision.
Michael J. Ruhl [Fri, 22 Dec 2017 16:47:20 +0000 (08:47 -0800)]
IB/hfi: Only read capability registers if the capability exists
During driver init, various registers are saved to allow restoration
after an FLR or gen3 bump. Some of these registers are not available
in some circumstances (i.e. Virtual machines).
This bug makes the driver unusable when the PCI device is passed into
a VM, it fails during probe.
Delete unnecessary register read/write, and only access register if
the capability exists.
Christophe Leroy [Fri, 15 Dec 2017 14:02:33 +0000 (15:02 +0100)]
gpio: fix "gpio-line-names" property retrieval
Following commit 9427ecbed46cc ("gpio: Rework of_gpiochip_set_names()
to use device property accessors"), "gpio-line-names" DT property is
not retrieved anymore when chip->parent is not set by the driver.
This is due to OF based property reads having been replaced by device
based property reads.
This patch fixes that by making use of
fwnode_property_read_string_array() instead of
device_property_read_string_array() and handing over either
of_fwnode_handle(chip->of_node) or dev_fwnode(chip->parent)
to that function.
Takashi Iwai [Fri, 22 Dec 2017 09:45:07 +0000 (10:45 +0100)]
ALSA: hda: Drop useless WARN_ON()
Since the commit 97cc2ed27e5a ("ALSA: hda - Fix yet another i915
pointer leftover in error path") cleared hdac_acomp pointer, the
WARN_ON() non-NULL check in snd_hdac_i915_register_notifier() may give
a false-positive warning, as the function gets called no matter
whether the component is registered or not. For fixing it, let's get
rid of the spurious WARN_ON().
Hui Wang [Fri, 22 Dec 2017 03:17:46 +0000 (11:17 +0800)]
ALSA: hda - change the location for one mic on a Lenovo machine
There are two front mics on this machine, and current driver assign
the same name Mic to both of them, but pulseaudio can't handle them.
As a workaround, we change the location for one of them, then the
driver will assign "Front Mic" and "Mic" for them.
Hui Wang [Fri, 22 Dec 2017 03:17:44 +0000 (11:17 +0800)]
ALSA: hda - Add MIC_NO_PRESENCE fixup for 2 HP machines
There is a headset jack on the front panel, when we plug a headset
into it, the headset mic can't trigger unsol events, and
read_pin_sense() can't detect its presence too. So add this fixup
to fix this issue.
Antoine Ténart [Mon, 11 Dec 2017 11:10:58 +0000 (12:10 +0100)]
crypto: inside-secure - do not use areq->result for partial results
This patches update the SafeXcel driver to stop using the crypto
ahash_request result field for partial results (i.e. on updates).
Instead the driver local safexcel_ahash_req state field is used, and
only on final operations the ahash_request result buffer is updated.
Antoine Ténart [Mon, 11 Dec 2017 11:10:57 +0000 (12:10 +0100)]
crypto: inside-secure - fix request allocations in invalidation path
This patch makes use of the SKCIPHER_REQUEST_ON_STACK and
AHASH_REQUEST_ON_STACK helpers to allocate enough memory to contain both
the crypto request structures and their embedded context (__ctx).
Ofer Heifetz [Mon, 11 Dec 2017 11:10:55 +0000 (12:10 +0100)]
crypto: inside-secure - per request invalidation
When an invalidation request is needed we currently override the context
.send and .handle_result helpers. This is wrong as under high load other
requests can already be queued and overriding the context helpers will
make them execute the wrong .send and .handle_result functions.
This commit fixes this by adding a needs_inv flag in the request to
choose the action to perform when sending requests or handling their
results. This flag will be set when needed (i.e. when the context flag
will be set).
Fixes: 1b44c5a60c13 ("crypto: inside-secure - add SafeXcel EIP197 crypto engine driver") Signed-off-by: Ofer Heifetz <[email protected]>
[Antoine: commit message, and removed non related changes from the
original commit] Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: Herbert Xu <[email protected]>
The present change is a bug fix for AVB link iteratively up/down.
Steps to reproduce:
- start AVB TX stream (Using aplay via MSE),
- disconnect+reconnect the eth cable,
- after a reconnection the eth connection goes iteratively up/down
without user interaction,
- this may heal after some seconds or even stay for minutes.
As the documentation specifies, the "renesas,no-ether-link" option
should be used when a board does not provide a proper AVB_LINK signal.
There is no need for this option enabled on RCAR H3/M3 Salvator-X/XS
and ULCB starter kits since the AVB_LINK is correctly handled by HW.
Choosing to keep or remove the "renesas,no-ether-link" option will
have impact on the code flow in the following ways:
- keeping this option enabled may lead to unexpected behavior since
the RX & TX are enabled/disabled directly from adjust_link function
without any HW interrogation,
- removing this option, the RX & TX will only be enabled/disabled after
HW interrogation. The HW check is made through the LMON pin in PSR
register which specifies AVB_LINK signal value (0 - at low level;
1 - at high level).
In conclusion, the present change is also a safety improvement because
it removes the "renesas,no-ether-link" option leading to a proper way
of detecting the link state based on HW interrogation and not on
software heuristic.
James Hogan [Tue, 5 Dec 2017 23:31:35 +0000 (23:31 +0000)]
lib/mpi: Fix umul_ppmm() for MIPS64r6
Current MIPS64r6 toolchains aren't able to generate efficient
DMULU/DMUHU based code for the C implementation of umul_ppmm(), which
performs an unsigned 64 x 64 bit multiply and returns the upper and
lower 64-bit halves of the 128-bit result. Instead it widens the 64-bit
inputs to 128-bits and emits a __multi3 intrinsic call to perform a 128
x 128 multiply. This is both inefficient, and it results in a link error
since we don't include __multi3 in MIPS linux.
For example commit 90a53e4432b1 ("cfg80211: implement regdb signature
checking") merged in v4.15-rc1 recently broke the 64r6_defconfig and
64r6el_defconfig builds by indirectly selecting MPILIB. The same build
errors can be reproduced on older kernels by enabling e.g. CRYPTO_RSA:
lib/mpi/generic_mpih-mul1.o: In function `mpihelp_mul_1':
lib/mpi/generic_mpih-mul1.c:50: undefined reference to `__multi3'
lib/mpi/generic_mpih-mul2.o: In function `mpihelp_addmul_1':
lib/mpi/generic_mpih-mul2.c:49: undefined reference to `__multi3'
lib/mpi/generic_mpih-mul3.o: In function `mpihelp_submul_1':
lib/mpi/generic_mpih-mul3.c:49: undefined reference to `__multi3'
lib/mpi/mpih-div.o In function `mpihelp_divrem':
lib/mpi/mpih-div.c:205: undefined reference to `__multi3'
lib/mpi/mpih-div.c:142: undefined reference to `__multi3'
Therefore add an efficient MIPS64r6 implementation of umul_ppmm() using
inline assembly and the DMULU/DMUHU instructions, to prevent __multi3
calls being emitted.
The present change is a bug fix for AVB link iteratively up/down.
Steps to reproduce:
- start AVB TX stream (Using aplay via MSE),
- disconnect+reconnect the eth cable,
- after a reconnection the eth connection goes iteratively up/down
without user interaction,
- this may heal after some seconds or even stay for minutes.
As the documentation specifies, the "renesas,no-ether-link" option
should be used when a board does not provide a proper AVB_LINK signal.
There is no need for this option enabled on RCAR H3/M3 Salvator-X/XS
and ULCB starter kits since the AVB_LINK is correctly handled by HW.
Choosing to keep or remove the "renesas,no-ether-link" option will
have impact on the code flow in the following ways:
- keeping this option enabled may lead to unexpected behavior since
the RX & TX are enabled/disabled directly from adjust_link function
without any HW interrogation,
- removing this option, the RX & TX will only be enabled/disabled after
HW interrogation. The HW check is made through the LMON pin in PSR
register which specifies AVB_LINK signal value (0 - at low level;
1 - at high level).
In conclusion, the present change is also a safety improvement because
it removes the "renesas,no-ether-link" option leading to a proper way
of detecting the link state based on HW interrogation and not on
software heuristic.
Eric Biggers [Wed, 20 Dec 2017 22:28:25 +0000 (14:28 -0800)]
crypto: pcrypt - fix freeing pcrypt instances
pcrypt is using the old way of freeing instances, where the ->free()
method specified in the 'struct crypto_template' is passed a pointer to
the 'struct crypto_instance'. But the crypto_instance is being
kfree()'d directly, which is incorrect because the memory was actually
allocated as an aead_instance, which contains the crypto_instance at a
nonzero offset. Thus, the wrong pointer was being kfree()'d.
Fix it by switching to the new way to free aead_instance's where the
->free() method is specified in the aead_instance itself.
Jan Engelhardt [Tue, 19 Dec 2017 18:09:07 +0000 (19:09 +0100)]
crypto: n2 - cure use after free
queue_cache_init is first called for the Control Word Queue
(n2_crypto_probe). At that time, queue_cache[0] is NULL and a new
kmem_cache will be allocated. If the subsequent n2_register_algs call
fails, the kmem_cache will be released in queue_cache_destroy, but
queue_cache_init[0] is not set back to NULL.
So when the Module Arithmetic Unit gets probed next (n2_mau_probe),
queue_cache_init will not allocate a kmem_cache again, but leave it
as its bogus value, causing a BUG() to trigger when queue_cache[0] is
eventually passed to kmem_cache_zalloc:
n2_crypto: Found N2CP at /virtual-devices@100/n2cp@7
n2_crypto: Registered NCS HVAPI version 2.0
called queue_cache_init
n2_crypto: md5 alg registration failed
n2cp f028687c: /virtual-devices@100/n2cp@7: Unable to register algorithms.
called queue_cache_destroy
n2cp: probe of f028687c failed with error -22
n2_crypto: Found NCP at /virtual-devices@100/ncp@6
n2_crypto: Registered NCS HVAPI version 2.0
called queue_cache_init
kernel BUG at mm/slab.c:2993!
Call Trace:
[0000000000604488] kmem_cache_alloc+0x1a8/0x1e0
(inlined) kmem_cache_zalloc
(inlined) new_queue
(inlined) spu_queue_setup
(inlined) handle_exec_unit
[0000000010c61eb4] spu_mdesc_scan+0x1f4/0x460 [n2_crypto]
[0000000010c62b80] n2_mau_probe+0x100/0x220 [n2_crypto]
[000000000084b174] platform_drv_probe+0x34/0xc0
Jonathan Cameron [Tue, 19 Dec 2017 10:27:24 +0000 (10:27 +0000)]
crypto: af_alg - Fix race around ctx->rcvused by making it atomic_t
This variable was increased and decreased without any protection.
Result was an occasional misscount and negative wrap around resulting
in false resource allocation failures.
Fixes: 7d2c3f54e6f6 ("crypto: af_alg - remove locking in async callback") Signed-off-by: Jonathan Cameron <[email protected]> Reviewed-by: Stephan Mueller <[email protected]> Signed-off-by: Herbert Xu <[email protected]>
Eric Biggers [Mon, 11 Dec 2017 20:15:17 +0000 (12:15 -0800)]
crypto: chacha20poly1305 - validate the digest size
If the rfc7539 template was instantiated with a hash algorithm with
digest size larger than 16 bytes (POLY1305_DIGEST_SIZE), then the digest
overran the 'tag' buffer in 'struct chachapoly_req_ctx', corrupting the
subsequent memory, including 'cryptlen'. This caused a crash during
crypto_skcipher_decrypt().
Fix it by, when instantiating the template, requiring that the
underlying hash algorithm has the digest size expected for Poly1305.
dtc points out that the parent node of the interrupt controllers is not
actually an interrupt controller itself, and lacks an #interrupt-cells
property:
arch/arm/boot/dts/tango4-vantage-1172.dtb: Warning (interrupts_property): Missing #interrupt-cells in interrupt-parent /soc/interrupt-controller@6e000
Arnd Bergmann [Thu, 21 Dec 2017 21:35:19 +0000 (22:35 +0100)]
ARM: dts: ls1021a: fix incorrect clock references
dtc warns about two 'clocks' properties that have an extraneous '1'
at the end:
arch/arm/boot/dts/ls1021a-qds.dtb: Warning (clocks_property): arch/arm/boot/dts/ls1021a-twr.dtb: Warning (clocks_property): Property 'clocks', cell 1 is not a phandle reference in /soc/i2c@2180000/mux@77/i2c@4/sgtl5000@2a
arch/arm/boot/dts/ls1021a-qds.dtb: Warning (clocks_property): Missing property '#clock-cells' in node /soc/interrupt-controller@1400000 or bad phandle (referred from /soc/i2c@2180000/mux@77/i2c@4/sgtl5000@2a:clocks[1])
Property 'clocks', cell 1 is not a phandle reference in /soc/i2c@2190000/sgtl5000@a
arch/arm/boot/dts/ls1021a-twr.dtb: Warning (clocks_property): Missing property '#clock-cells' in node /soc/interrupt-controller@1400000 or bad phandle (referred from /soc/i2c@2190000/sgtl5000@a:clocks[1])
The clocks that get referenced here are fixed-rate, so they do not
take any argument, and dtc interprets the next cell as a phandle, which
is invalid.
Laurent Vivier [Tue, 12 Dec 2017 17:23:56 +0000 (18:23 +0100)]
KVM: PPC: Book3S HV: Fix pending_pri value in kvmppc_xive_get_icp()
When we migrate a VM from a POWER8 host (XICS) to a POWER9 host
(XICS-on-XIVE), we have an error:
qemu-kvm: Unable to restore KVM interrupt controller state \
(0xff000000) for CPU 0: Invalid argument
This is because kvmppc_xics_set_icp() checks the new state
is internaly consistent, and especially:
...
1129 if (xisr == 0) {
1130 if (pending_pri != 0xff)
1131 return -EINVAL;
...
On the other side, kvmppc_xive_get_icp() doesn't set
neither the pending_pri value, nor the xisr value (set to 0)
(and kvmppc_xive_set_icp() ignores the pending_pri value)
As xisr is 0, pending_pri must be set to 0xff.
Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller") Cc: [email protected] # v4.12+ Signed-off-by: Laurent Vivier <[email protected]> Acked-by: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
Cédric Le Goater [Tue, 12 Dec 2017 12:02:04 +0000 (12:02 +0000)]
KVM: PPC: Book3S: fix XIVE migration of pending interrupts
When restoring a pending interrupt, we are setting the Q bit to force
a retrigger in xive_finish_unmask(). But we also need to force an EOI
in this case to reach the same initial state : P=1, Q=0.
This can be done by not setting 'old_p' for pending interrupts which
will inform xive_finish_unmask() that an EOI needs to be sent.
Chris Wilson [Tue, 19 Dec 2017 12:07:00 +0000 (12:07 +0000)]
drm/syncobj: Stop reusing the same struct file for all syncobj -> fd
The vk cts test:
dEQP-VK.api.external.semaphore.opaque_fd.export_multiple_times_temporary
triggers a lot of
VFS: Close: file count is 0
Dave pointed out that clearing the syncobj->file from
drm_syncobj_file_release() was sufficient to silence the test, but that
opens a can of worm since we assumed that the syncobj->file was never
unset. Stop trying to reuse the same struct file for every fd pointing
to the drm_syncobj, and allocate one file for each fd instead.
v2: Fixup return handling of drm_syncobj_fd_to_handle
v2.1: [airlied: fix possible syncobj ref race]
Dave Airlie [Fri, 22 Dec 2017 00:00:04 +0000 (10:00 +1000)]
Merge tag 'drm-misc-fixes-2017-12-21' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
drm-misc-fixes before holidays:
- fixup for the lease fixup (Keith)
- fb leak in the ww mutex fallback code (Maarten)
- sun4i fixes (Maxime, Hans)
* tag 'drm-misc-fixes-2017-12-21' of git://anongit.freedesktop.org/drm/drm-misc:
drm: move lease init after validation in drm_lease_create
drm/plane: Make framebuffer refcounting the responsibility of setplane_internal callers
drm/sun4i: hdmi: Move the mode_valid callback to the encoder
drm/sun4i: Fix error path handling
drm/sun4i: validate modes for HDMI
Pull networking fixes from David Miller"
"What's a holiday weekend without some networking bug fixes? [1]
1) Fix some eBPF JIT bugs wrt. SKB pointers across helper function
calls, from Daniel Borkmann.
2) Fix regression from errata limiting change to marvell PHY driver,
from Zhao Qiang.
3) Fix u16 overflow in SCTP, from Xin Long.
4) Fix potential memory leak during bridge newlink, from Nikolay
Aleksandrov.
5) Fix BPF selftest build on s390, from Hendrik Brueckner.
6) Don't append to cfg80211 automatically generated certs file,
always write new ones from scratch. From Thierry Reding.
7) Fix sleep in atomic in mac80211 hwsim, from Jia-Ju Bai.
8) Fix hang on tg3 MTU change with certain chips, from Brian King.
9) Add stall detection to arc emac driver and reset chip when this
happens, from Alexander Kochetkov.
10) Fix MTU limitng in GRE tunnel drivers, from Xin Long.
11) Fix stmmac timestamping bug due to mis-shifting of field. From
Fredrik Hallenberg.
12) Fix metrics match when deleting an ipv4 route. The kernel sets
some internal metrics bits which the user isn't going to set when
it makes the delete request. From Phil Sutter.
13) mvneta driver loop over RX queues limits on "txq_number" :-) Fix
from Yelena Krivosheev.
14) Fix double free and memory corruption in get_net_ns_by_id, from
Eric W. Biederman.
15) Flush ipv4 FIB tables in the reverse order. Some tables can share
their actual backing data, in particular this happens for the MAIN
and LOCAL tables. We have to kill the LOCAL table first, because
it uses MAIN's backing memory. Fix from Ido Schimmel.
16) Several eBPF verifier value tracking fixes, from Edward Cree, Jann
Horn, and Alexei Starovoitov.
17) Make changes to ipv6 autoflowlabel sysctl really propagate to
sockets, unless the socket has set the per-socket value
explicitly. From Shaohua Li.
18) Fix leaks and double callback invocations of zerocopy SKBs, from
Willem de Bruijn"
[1] Is this a trick question? "Relaxing"? "Quiet"? "Fine"? - Linus.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (77 commits)
skbuff: skb_copy_ubufs must release uarg even without user frags
skbuff: orphan frags before zerocopy clone
net: reevalulate autoflowlabel setting after sysctl setting
openvswitch: Fix pop_vlan action for double tagged frames
ipv6: Honor specified parameters in fibmatch lookup
bpf: do not allow root to mangle valid pointers
selftests/bpf: add tests for recent bugfixes
bpf: fix integer overflows
bpf: don't prune branches when a scalar is replaced with a pointer
bpf: force strict alignment checks for stack pointers
bpf: fix missing error return in check_stack_boundary()
bpf: fix 32-bit ALU op verification
bpf: fix incorrect tracking of register size truncation
bpf: fix incorrect sign extension in check_alu_op()
bpf/verifier: fix bounds calculation on BPF_RSH
ipv4: Fix use-after-free when flushing FIB tables
s390/qeth: fix error handling in checksum cmd callback
tipc: remove joining group member from congested list
selftests: net: Adding config fragment CONFIG_NUMA=y
nfp: bpf: keep track of the offloaded program
...
Quentin Monnet [Thu, 21 Dec 2017 16:52:50 +0000 (08:52 -0800)]
selftests/bpf: fix Makefile for passing LLC to the command line
Makefile has a LLC variable that is initialised to "llc", but can
theoretically be overridden from the command line ("make LLC=llc-6.0").
However, this fails because for LLVM probe check, "llc" is called
directly. Use the $(LLC) variable instead to fix this.
Fixes: 22c8852624fc ("bpf: improve selftests and add tests for meta pointer") Signed-off-by: Quentin Monnet <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
Alex Vesker [Thu, 21 Dec 2017 15:38:27 +0000 (17:38 +0200)]
IB/ipoib: Fix lockdep issue found on ipoib_ib_dev_heavy_flush
The locking order of vlan_rwsem (LOCK A) and then rtnl (LOCK B),
contradicts other flows such as ipoib_open possibly causing a deadlock.
To prevent this deadlock heavy flush is called with RTNL locked and
only then tries to acquire vlan_rwsem.
This deadlock is possible only when there are child interfaces.
other info that might help us debug this:
[ 141.088967] Possible unsafe locking scenario:
[ 141.094280] CPU0 CPU1
[ 141.097953] ---- ----
[ 141.101640] lock(&priv->vlan_rwsem);
[ 141.104771] lock(rtnl_mutex);
[ 141.109207] lock(&priv->vlan_rwsem);
[ 141.114032] lock(rtnl_mutex);
[ 141.116800]
*** DEADLOCK ***
Fixes: b4b678b06f6e ("IB/ipoib: Grab rtnl lock on heavy flush when calling ndo_open/stop") Signed-off-by: Alex Vesker <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
Majd Dibbiny [Thu, 21 Dec 2017 15:38:26 +0000 (17:38 +0200)]
IB/mlx5: Fix congestion counters in LAG mode
Congestion counters are counted and queried per physical function.
When working in LAG mode, CNP packets can be sent or received on both
of the functions, thus congestion counters should be aggregated from
the two physical functions.
Steve Wise [Tue, 19 Dec 2017 22:02:10 +0000 (14:02 -0800)]
iw_cxgb4: when flushing, complete all wrs in a chain
If a wr chain was posted and needed to be flushed, only the first
wr in the chain was completed with FLUSHED status. The rest were
never completed. This caused isert to hang on shutdown due to the
missing completions which left iscsi IO commands referenced, stalling
the shutdown.
Steve Wise [Tue, 19 Dec 2017 18:29:25 +0000 (10:29 -0800)]
iw_cxgb4: reflect the original WR opcode in drain cqes
The flush/drain logic was not retaining the original wr opcode in
its completion. This can cause problems if the application uses
the completion opcode to make decisions.
Use bit 10 of the CQE header word to indicate the CQE is a special
drain completion, and save the original WR opcode in the cqe header
opcode field.
Steve Wise [Mon, 18 Dec 2017 21:10:00 +0000 (13:10 -0800)]
iw_cxgb4: Only validate the MSN for successful completions
If the RECV CQE is in error, ignore the MSN check. This was causing
recvs that were flushed into the sw cq to be completed with the wrong
status (BAD_MSN instead of FLUSHED).
Vishal Verma [Mon, 18 Dec 2017 16:28:39 +0000 (09:28 -0700)]
libnvdimm, btt: Fix an incompatibility in the log layout
Due to a spec misinterpretation, the Linux implementation of the BTT log
area had different padding scheme from other implementations, such as
UEFI and NVML.
This fixes the padding scheme, and defaults to it for new BTT layouts.
We attempt to detect the padding scheme in use when probing for an
existing BTT. If we detect the older/incompatible scheme, we continue
using it.