Jeff Layton [Mon, 11 Dec 2017 11:35:11 +0000 (06:35 -0500)]
afs: convert to new i_version API
For AFS, it's generally treated as an opaque value, so we use the
*_raw variants of the API here.
Note that AFS has quite a different definition for this counter. AFS
only increments it on changes to the data to the data in regular files
and contents of the directories. Inode metadata changes do not result
in a version increment.
We'll need to reconcile that somehow if we ever want to present this to
userspace via statx.
Jeff Layton [Mon, 18 Dec 2017 11:25:31 +0000 (06:25 -0500)]
fs: don't take the i_lock in inode_inc_iversion
The rationale for taking the i_lock when incrementing this value is
lost in antiquity. The readers of the field don't take it (at least
not universally), so my assumption is that it was only done here to
serialize incrementors.
If that is indeed the case, then we can drop the i_lock from this
codepath and treat it as a atomic64_t for the purposes of
incrementing it. This allows us to use inode_inc_iversion without
any danger of lock inversion.
Note that the read side is not fetched atomically with this change.
The assumption here is that that is not a critical issue since the
i_version is not fully synchronized with anything else anyway.
Jeff Layton [Mon, 29 Jan 2018 11:41:30 +0000 (06:41 -0500)]
fs: new API for handling inode->i_version
Add a documentation blob that explains what the i_version field is, how
it is expected to work, and how it is currently implemented by various
filesystems.
We already have inode_inc_iversion. Add several other functions for
manipulating and accessing the i_version counter. For now, the
implementation is trivial and basically works the way that all of the
open-coded i_version accesses work today.
Future patches will convert existing users of i_version to use the new
API, and then convert the backend implementation to do things more
efficiently.
Boris Brezillon [Mon, 29 Jan 2018 08:58:36 +0000 (09:58 +0100)]
Merge tag 'nand/for-4.16' of git://git.infradead.org/linux-mtd into mtd/next
Pull NAND changes from Boris Brezillon:
"
Core changes:
* Fix NAND_CMD_NONE handling in nand_command[_lp]() hooks
* Introduce the ->exec_op() infrastructure
* Rework NAND buffers handling
* Fix ECC requirements for K9F4G08U0D
* Fix nand_do_read_oob() to return the number of bitflips
* Mark K9F1G08U0E as not supporting subpage writes
Driver changes:
* MTK: Rework the driver to support new IP versions
* OMAP OneNAND: Full rework to use new APIs (libgpio, dmaengine) and fix
DT support
* Marvell: Add a new driver to replace the pxa3xx one
"
Boris Brezillon [Mon, 29 Jan 2018 08:55:14 +0000 (09:55 +0100)]
Merge tag 'spi-nor/for-4.16' of git://git.infradead.org/linux-mtd into mtd/next
Pull spi-nor changes from Cyrille Pitchen:
"
This pull-request contains the following notable changes:
Core changes:
* Add support to new ISSI and Cypress/Spansion memory parts.
* Fix support of Micron memories by checking error bits in the FSR.
* Fix update of block-protection bits by reading back the SR.
* Restore the internal state of the SPI flash memory when removing the
device.
Driver changes:
* Maintenance for Freescale, Intel and Metiatek drivers.
* Add support of the direct access mode for the Cadence QSPI controller.
"
Corentin Labbe [Sun, 28 Jan 2018 15:08:55 +0000 (15:08 +0000)]
IB/qib: remove qib_keys.c
qib_keys.c was left uncompilable in commit 7c2e11fe2dbe ("IB/qib: Remove qp and mr functionality from qib")
Since nothing need it, remove it from tree.
Leon Romanovsky [Sun, 28 Jan 2018 09:25:33 +0000 (11:25 +0200)]
RDMA/cm: Fix access to uninitialized variable
The ndev will be initialized and held only for successful
ib_get_cached_gid(), otherwise it is garbage stack memory.
Calling dev_put() in failure path is wrong.
Fixes: 16c72e402867 ("IB/cm: Refactor to avoid setting path record software only fields") Signed-off-by: Parav Pandit <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Reviewed-by: Yuval Shaia <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
Parav Pandit [Sun, 28 Jan 2018 09:25:32 +0000 (11:25 +0200)]
RDMA/cma: Use existing netif_is_bond_master function
When checking whatever the current netdev is the bond master interface,
use kernel API netif_is_bond_master() instead of hardcoding the check.
No functionality is changed.
Jack Morgenstein [Sun, 28 Jan 2018 09:25:29 +0000 (11:25 +0200)]
IB/umad: Fix use of unprotected device pointer
The ib_write_umad() is protected by taking the umad file mutex.
However, it accesses file->port->ib_dev -- which is protected only by the
port's mutex (field file_mutex).
The ib_umad_remove_one() calls ib_umad_kill_port() which sets
port->ib_dev to NULL under the port mutex (NOT the file mutex).
It then sets the mad agent to "dead" under the umad file mutex.
This is a race condition -- because there is a window where
port->ib_dev is NULL, while the agent is not "dead".
Ingo Molnar [Sun, 28 Jan 2018 20:58:18 +0000 (21:58 +0100)]
Merge tag 'perf-core-for-mingo-4.16-20180125' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
- Introduce a errno code to string facility to allow tools such as
'perf trace' to support multi-arch perf.data decoding/beautifying,
uses errno header files copied from the kernel, this allows
removing the need for audit-libs in arches generating its syscall
tables from the kernel sources, so far x86 and s/390 (Hendrik Brueckner)
- Intel vendor event JSON updates for the Broadwell, BroadwellDE,
BroadwellX, Goldmont, Haswell, HaswellX, IvyBridge, IvyBridge, IvyTown,
IvyTown, Silvermont, Skylake and SkylakeX architectures. (Andi Kleen)
- Use ui__error() to have --field error handling messages not to
disappear due to TUI exit cleanup (Arnaldo Carvalho de Melo)
- Don't warn about unavailability of builtin clang in 'perf trace', just
continue fallbacking to using the external toolchain (Arnaldo Carvalho de Melo)
- Move conditional O_CLOEXEC define to util.h to keep the build
working on older distros where that is not available now that
this define will be used in more source files (Arnaldo Carvalho de Melo)
- Using O_CLOEXEC in do_open() to avoid having too many open files when
using 'perf script' from the 'perf report' TUI scripts browser (Wang YanQing)
- Add 'perf trace --print-sample' to help in debugging the printing
of timestamp calculations, for instance (Arnaldo Carvalho de Melo)
- Do not print from time delta for interrupted syscall lines in 'perf
trace' (those ending with '...') just print the total syscall duration
at raw_syscalls:sys_exit time (Arnaldo Carvalho de Melo)
- Beautify FUTEX_BITSET_MATCH_ANY in the futex syscall in 'perf trace'
(Arnaldo Carvalho de Melo)
- Display EXTRA features for 'make VF=1' build (Jiri Olsa)
- Add support for CoreSight trace decoding by making the perf tools
use the external openCSD (Mathieu Poirier, Tor Jeremiassen)
Linus Torvalds [Sun, 28 Jan 2018 20:19:23 +0000 (12:19 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A set of small fixes for 4.15:
- Fix vmapped stack synchronization on systems with 4-level paging
and a large amount of memory caused by a missing 5-level folding
which made the pgd synchronization logic to fail and causing double
faults.
- Add a missing sanity check in the vmalloc_fault() logic on 5-level
paging systems.
- Bring back protection against accessing a freed initrd in the
microcode loader which was lost by a wrong merge conflict
resolution.
- Extend the Broadwell micro code loading sanity check.
- Add a missing ENDPROC annotation in ftrace assembly code which
makes ORC unhappy.
- Prevent loading the AMD power module on !AMD platforms. The load
itself is uncritical, but an unload attempt results in a kernel
crash.
- Update Peter Anvins role in the MAINTAINERS file"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ftrace: Add one more ENDPROC annotation
x86: Mark hpa as a "Designated Reviewer" for the time being
x86/mm/64: Tighten up vmalloc_fault() sanity checks on 5-level kernels
x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems
x86/microcode: Fix again accessing initrd after having been freed
x86/microcode/intel: Extend BDW late-loading further with LLC size check
perf/x86/amd/power: Do not load AMD power module on !AMD platforms
Linus Torvalds [Sun, 28 Jan 2018 20:17:35 +0000 (12:17 -0800)]
Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"A single fix for a ~10 years old problem which causes high resolution
timers to stop after a CPU unplug/plug cycle due to a stale flag in
the per CPU hrtimer base struct.
Paul McKenney was hunting this for about a year, but the heisenbug
nature made it resistant against debug attempts for quite some time"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hrtimer: Reset hrtimer cpu base proper on CPU hotplug
David Woodhouse [Sat, 27 Jan 2018 16:24:32 +0000 (16:24 +0000)]
x86/cpufeatures: Clean up Spectre v2 related CPUID flags
We want to expose the hardware features simply in /proc/cpuinfo as "ibrs",
"ibpb" and "stibp". Since AMD has separate CPUID bits for those, use them
as the user-visible bits.
When the Intel SPEC_CTRL bit is set which indicates both IBRS and IBPB
capability, set those (AMD) bits accordingly. Likewise if the Intel STIBP
bit is set, set the AMD STIBP that's used for the generic hardware
capability.
Hide the rest from /proc/cpuinfo by putting "" in the comments. Including
RETPOLINE and RETPOLINE_AMD which shouldn't be visible there. There are
patches to make the sysfs vulnerabilities information non-readable by
non-root, and the same should apply to all information about which
mitigations are actually in use. Those *shouldn't* appear in /proc/cpuinfo.
The feature bit for whether IBPB is actually used, which is needed for
ALTERNATIVEs, is renamed to X86_FEATURE_USE_IBPB.
hwmon: (dell-smm) Disable fan support for Dell Vostro 3360
Calling fan related SMM functions implemented by Dell BIOS firmware on Dell
Vostro 3360 freeze kernel for about 500ms.
Unfortunately, it is unlikely for Dell to fix this since the machine
is pretty old, so this commit just disables fan support to make the
system usable again.
Via "force" module param fan support can be enabled.
Pali Rohár [Sat, 27 Jan 2018 16:28:34 +0000 (17:28 +0100)]
hwmon: (dell-smm) Disable fan support for Dell Inspiron 7720
Calling fan related SMM functions implemented by Dell BIOS firmware on Dell
Inspiron 7720 freeze kernel for about 500ms. Until Dell fixes it we need to
disable fan support for Dell Inspiron 7720 as it makes system unusable.
Via "force" module param fan support can be enabled.
Pali Rohár [Sat, 27 Jan 2018 16:22:01 +0000 (17:22 +0100)]
hwmon: (dell-smm) Enable broken functionality via "force" module param
Some Dell machines are broken and some functionality is disabled. Show
warning into dmesg about this fact and allow user via "force" module param
to override brokenness and enable broken functionality.
Thomas Gleixner [Fri, 26 Jan 2018 13:54:32 +0000 (14:54 +0100)]
hrtimer: Reset hrtimer cpu base proper on CPU hotplug
The hrtimer interrupt code contains a hang detection and mitigation
mechanism, which prevents that a long delayed hrtimer interrupt causes a
continous retriggering of interrupts which prevent the system from making
progress. If a hang is detected then the timer hardware is programmed with
a certain delay into the future and a flag is set in the hrtimer cpu base
which prevents newly enqueued timers from reprogramming the timer hardware
prior to the chosen delay. The subsequent hrtimer interrupt after the delay
clears the flag and resumes normal operation.
If such a hang happens in the last hrtimer interrupt before a CPU is
unplugged then the hang_detected flag is set and stays that way when the
CPU is plugged in again. At that point the timer hardware is not armed and
it cannot be armed because the hang_detected flag is still active, so
nothing clears that flag. As a consequence the CPU does not receive hrtimer
interrupts and no timers expire on that CPU which results in RCU stalls and
other malfunctions.
Clear the flag along with some other less critical members of the hrtimer
cpu base to ensure starting from a clean state when a CPU is plugged in.
Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
root cause of that hard to reproduce heisenbug. Once understood it's
trivial and certainly justifies a brown paperbag.
H. Peter Anvin [Thu, 25 Jan 2018 19:59:34 +0000 (11:59 -0800)]
x86: Mark hpa as a "Designated Reviewer" for the time being
Due to some unfortunate events, I have not been directly involved in
the x86 kernel patch flow for a while now. I have also not been able
to ramp back up by now like I had hoped to, and after reviewing what I
will need to work on both internally at Intel and elsewhere in the near
term, it is clear that I am not going to be able to ramp back up until
late 2018 at the very earliest.
It is not acceptable to not recognize that this load is currently
taken by Ingo and Thomas without my direct participation, so I mark
myself as R: (designated reviewer) rather than M: (maintainer) until
further notice. This is in fact recognizing the de facto situation
for the past few years.
I have obviously no intention of going away, and I will do everything
within my power to improve Linux on x86 and x86 for Linux. This,
however, puts credit where it is due and reflects a change of focus.
This patch also removes stale entries for portions of the x86
architecture which have not been maintained separately from arch/x86
for a long time. If there is a reason to re-introduce them then that
can happen later.
Linus Torvalds [Fri, 26 Jan 2018 23:10:50 +0000 (15:10 -0800)]
Merge tag 'riscv-for-linus-4.15-maintainers' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux
Pull RISC-V update from Palmer Dabbelt:
"RISC-V: We have a new mailing list and git repo!
Sorry to send something essentially as late as possible (Friday after
an rc9), but we managed to get a mailing list for the RISC-V Linux
port. We've been using [email protected] for a while, but that
list has some problems (it's Google Groups and it's shared over all
RISC-V software projects). The new infaread.org list is much better.
We just got it on Wednesday but I used it a bit on Thursday to shake
out all the configuration problems and it appears to be in working
order.
When I updated the mailing list I noticed that the MAINTAINERS file
was pointing to our github repo, but now that we have a kernel.org
repo I'd like to point to that instead so I changed that as well.
We'll be centralizing all RISC-V Linux related development here as
that seems to be the saner way to go about it.
I can understand if it's too late to get this into 4.15, but given
that it's not a code change I was hoping it'd still be OK. It would be
nice to have the new mailing list and git repo in the release tarballs
so when people start to find bugs they'll get to the right place"
* tag 'riscv-for-linus-4.15-maintainers' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux:
Update the RISC-V MAINTAINERS file
Aurelien Aptel [Wed, 24 Jan 2018 12:46:10 +0000 (13:46 +0100)]
CIFS: make IPC a regular tcon
* Remove ses->ipc_tid.
* Make IPC$ regular tcon.
* Add a direct pointer to it in ses->tcon_ipc.
* Distinguish PIPE tcon from IPC tcon by adding a tcon->pipe flag. All
IPC tcons are pipes but not all pipes are IPC.
* All TreeConnect functions now cannot take a NULL tcon object.
The IPC tcon has the same lifetime as the session it belongs to. It is
created when the session is created and destroyed when the session is
destroyed.
Since no mounts directly refer to the IPC tcon, its refcount should
always be set to initialisation value (1). Thus we make sure
cifs_put_tcon() skips it.
If the mount request resulting in a new session being created requires
encryption, try to require it too for IPC.
* set SERVER_NAME_LENGTH to serverName actual size
The maximum length of an ipv6 string representation is defined in
INET6_ADDRSTRLEN as 45+1 for null but lets keep what we know works.
Steve Capper [Wed, 24 Jan 2018 08:27:08 +0000 (08:27 +0000)]
arm64: Fix TTBR + PAN + 52-bit PA logic in cpu_do_switch_mm
In cpu_do_switch_mm(.) with ARM64_SW_TTBR0_PAN=y we apply phys_to_ttbr
to a value that already has an ASID inserted into the upper bits. For
52-bit PA configurations this then can give us TTBR0_EL1 registers that
cause translation table walks to attempt to access non-zero PA[51:48]
spuriously. Ultimately leading to a Synchronous External Abort on level
1 translation.
This patch re-arranges the logic in cpu_do_switch_mm(.) such that
phys_to_ttbr is called before the ASID is inserted into the TTBR0 value.
Fixes: 6b88a32c7af6 ("arm64: kpti: Fix the interaction between ASID switching and software PAN") Acked-by: Suzuki K Poulose <[email protected]> Tested-by: Kristina Martsenko <[email protected]> Reviewed-by: Kristina Martsenko <[email protected]> Signed-off-by: Steve Capper <[email protected]> Signed-off-by: Catalin Marinas <[email protected]>
Jens Axboe [Fri, 26 Jan 2018 18:13:01 +0000 (11:13 -0700)]
Merge branch 'nvme-4.16' of git://git.infradead.org/nvme into for-4.16/block
Pull NVMe fixes from Christoph:
"The additional week before the 4.15 release gave us time for a few more
nvme fixes, as well as the nifty trace points from Johannes"
* 'nvme-4.16' of git://git.infradead.org/nvme:
nvme: add tracepoint for nvme_complete_rq
nvme: add tracepoint for nvme_setup_cmd
nvme-pci: introduce RECONNECTING state to mark initializing procedure
nvme-rdma: remove redundant boolean for inline_data
nvme: don't free uuid pointer before printing it
nvme-pci: Suspend queues after deleting them
nvme-pci: Fix queue double allocations
1) The per-network-namespace loopback device, and thus its namespace,
can have its teardown deferred for a long time if a kernel created
TCP socket closes and the namespace is exiting meanwhile. The kernel
keeps trying to finish the close sequence until it times out (which
takes quite some time).
Fix this by forcing the socket closed in this situation, from Dan
Streetman.
2) Fix regression where we're trying to invoke the update_pmtu method
on route types (in this case metadata tunnel routes) that don't
implement the dst_ops method. Fix from Nicolas Dichtel.
3) Fix long standing memory corruption issues in r8169 driver by
performing the chip statistics DMA programming more correctly. From
Francois Romieu.
4) Handle local broadcast sends over VRF routes properly, from David
Ahern.
5) Don't refire the DCCP CCID2 timer endlessly, otherwise the socket
can never be released. From Alexey Kodanev.
6) Set poll flags properly in VSOCK protocol layer, from Stefan
Hajnoczi.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
VSOCK: set POLLOUT | POLLWRNORM for TCP_CLOSING
dccp: don't restart ccid2_hc_tx_rto_expire() if sk in closed state
net: vrf: Add support for sends to local broadcast address
r8169: fix memory corruption on retrieval of hardware statistics.
net: don't call update_pmtu unconditionally
net: tcp: close sock if net namespace is exiting
Linus Torvalds [Fri, 26 Jan 2018 16:59:57 +0000 (08:59 -0800)]
Merge tag 'drm-fixes-for-v4.15-rc10-2' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"A fairly urgent nouveau regression fix for broken irqs across
suspend/resume came in. This was broken before but a patch in 4.15 has
made it much more obviously broken and now s/r fails a lot more often.
The fix removes freeing the irq across s/r which never should have
been done anyways.
Also two vc4 fixes for a NULL deference and some misrendering /
flickering on screen"
* tag 'drm-fixes-for-v4.15-rc10-2' of git://people.freedesktop.org/~airlied/linux:
drm/nouveau: Move irq setup/teardown to pci ctor/dtor
drm/vc4: Fix NULL pointer dereference in vc4_save_hang_state()
drm/vc4: Flush the caches before the bin jobs, as well.
Stefan Hajnoczi [Fri, 26 Jan 2018 11:48:25 +0000 (11:48 +0000)]
VSOCK: set POLLOUT | POLLWRNORM for TCP_CLOSING
select(2) with wfds but no rfds must return when the socket is shut down
by the peer. This way userspace notices socket activity and gets -EPIPE
from the next write(2).
Currently select(2) does not return for virtio-vsock when a SEND+RCV
shutdown packet is received. This is because vsock_poll() only sets
POLLOUT | POLLWRNORM for TCP_CLOSE, not the TCP_CLOSING state that the
socket is in when the shutdown is received.
Alexey Kodanev [Fri, 26 Jan 2018 12:14:16 +0000 (15:14 +0300)]
dccp: don't restart ccid2_hc_tx_rto_expire() if sk in closed state
ccid2_hc_tx_rto_expire() timer callback always restarts the timer
again and can run indefinitely (unless it is stopped outside), and after
commit 120e9dabaf55 ("dccp: defer ccid_hc_tx_delete() at dismantle time"),
which moved ccid_hc_tx_delete() (also includes sk_stop_timer()) from
dccp_destroy_sock() to sk_destruct(), this started to happen quite often.
The timer prevents releasing the socket, as a result, sk_destruct() won't
be called.
Found with LTP/dccp_ipsec tests running on the bonding device,
which later couldn't be unloaded after the tests were completed:
unregister_netdevice: waiting for bond0 to become free. Usage count = 148
Fixes: 2a91aa396739 ("[DCCP] CCID2: Initial CCID2 (TCP-Like) implementation") Signed-off-by: Alexey Kodanev <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Palmer Dabbelt [Wed, 24 Jan 2018 21:26:11 +0000 (13:26 -0800)]
Update the RISC-V MAINTAINERS file
Now that we're upstream in Linux we've been able to make some
infrastructure changes so our port works a bit more like other ports.
Specifically:
* We now have a mailing list specific to the RISC-V Linux port, hosted
at lists.infreadead.org.
* We now have a kernel.org git tree where work on our port is
coordinated.
This patch changes the RISC-V maintainers entry to reflect these new
bits of infrastructure.
Change _regulator_list_voltage() argument from regulator to
regulator_dev in order to provide better separation of core layers.
Allow calling _regulator_list_voltage() from functions, with
regulator_dev argument. This refactoring is needed in order to
implement setting voltage of coupled regulators.
Maciej Purski [Mon, 22 Jan 2018 14:30:06 +0000 (15:30 +0100)]
regulator: core: Move of_find_regulator_by_node() to of_regulator.c
As of_find_regulator_by_node() is an of function it should be moved from
core.c to of_regulator.c. It provides better separation of device tree
functions from the core and allows other of_functions in of_regulator.c
to resolve device_node to regulator_dev. This will be useful for
implementation of parsing coupled regulators properties.
Declare of_find_regulator_by_node() function in internal.h as well as
regulator_class and dev_to_rdev(), as they are needed by
of_find_regulator_by_node().
Davidlohr Bueso [Thu, 25 Jan 2018 19:27:27 +0000 (11:27 -0800)]
IB/mthca: Fix gup usage in mthca_map_user_db()
get_user_pages() must be called with mmap_sem held, currently
it is not. In fact it is called under the user db_table->mutex.
To fix this we can convert gup to use the fast alternative,
and safely avoid taking mmap_sem, if possible. Furthermore
this is safe wrt to the mutex as other callers that take the
lock (unmap and alloc_db) are not called under mmap_sem
(hence possible deadlock).
Andy Lutomirski [Thu, 25 Jan 2018 21:12:15 +0000 (13:12 -0800)]
x86/mm/64: Tighten up vmalloc_fault() sanity checks on 5-level kernels
On a 5-level kernel, if a non-init mm has a top-level entry, it needs to
match init_mm's, but the vmalloc_fault() code skipped over the BUG_ON()
that would have checked it.
While we're at it, get rid of the rather confusing 4-level folded "pgd"
logic.
Andy Lutomirski [Thu, 25 Jan 2018 21:12:14 +0000 (13:12 -0800)]
x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems
Neil Berrington reported a double-fault on a VM with 768GB of RAM that uses
large amounts of vmalloc space with PTI enabled.
The cause is that load_new_mm_cr3() was never fixed to take the 5-level pgd
folding code into account, so, on a 4-level kernel, the pgd synchronization
logic compiles away to exactly nothing.
Interestingly, the problem doesn't trigger with nopti. I assume this is
because the kernel is mapped with global pages if we boot with nopti. The
sequence of operations when we create a new task is that we first load its
mm while still running on the old stack (which crashes if the old stack is
unmapped in the new mm unless the TLB saves us), then we call
prepare_switch_to(), and then we switch to the new stack.
prepare_switch_to() pokes the new stack directly, which will populate the
mapping through vmalloc_fault(). I assume that we're getting lucky on
non-PTI systems -- the old stack's TLB entry stays alive long enough to
make it all the way through prepare_switch_to() and switch_to() so that we
make it to a valid stack.
Borislav Petkov [Fri, 26 Jan 2018 12:11:36 +0000 (13:11 +0100)]
x86/alternative: Print unadorned pointers
After commit ad67b74d2469 ("printk: hash addresses printed with %p")
pointers are being hashed when printed. However, this makes the alternative
debug output completely useless. Switch to %px in order to see the
unadorned kernel pointers.
David Woodhouse [Thu, 25 Jan 2018 16:14:14 +0000 (16:14 +0000)]
x86/cpufeature: Blacklist SPEC_CTRL/PRED_CMD on early Spectre v2 microcodes
This doesn't refuse to load the affected microcodes; it just refuses to
use the Spectre v2 mitigation features if they're detected, by clearing
the appropriate feature bits.
The AMD CPUID bits are handled here too, because hypervisors *may* have
been exposing those bits even on Intel chips, for fine-grained control
of what's available.
It is non-trivial to use x86_match_cpu() for this table because that
doesn't handle steppings. And the approach taken in commit bd9240a18
almost made me lose my lunch.
David Woodhouse [Thu, 25 Jan 2018 16:14:13 +0000 (16:14 +0000)]
x86/pti: Do not enable PTI on CPUs which are not vulnerable to Meltdown
Also, for CPUs which don't speculate at all, don't report that they're
vulnerable to the Spectre variants either.
Leave the cpu_no_meltdown[] match table with just X86_VENDOR_AMD in it
for now, even though that could be done with a simple comparison, on the
assumption that we'll have more to add.
Based on suggestions from Dave Hansen and Alan Cox.
David Woodhouse [Thu, 25 Jan 2018 16:14:09 +0000 (16:14 +0000)]
x86/cpufeatures: Add CPUID_7_EDX CPUID leaf
This is a pure feature bits leaf. There are two AVX512 feature bits in it
already which were handled as scattered bits, and three more from this leaf
are going to be added for speculation control features.
Chunyan Zhang [Fri, 26 Jan 2018 13:08:47 +0000 (21:08 +0800)]
regulator: add PM suspend and resume hooks
In this patch, consumers are allowed to set suspend voltage, and this
actually just set the "uV" in constraint::regulator_state, when the
regulator_suspend_late() was called by PM core through callback when
the system is entering into suspend, the regulator device would act
suspend activity then.
And it assumes that if any consumer set suspend voltage, the regulator
device should be enabled in the suspend state. And if the suspend
voltage of a regulator device for all consumers was set zero, the
regulator device would be off in the suspend state.
This patch also provides a new function hook to regulator devices for
resuming from suspend states.
Chunyan Zhang [Fri, 26 Jan 2018 13:08:46 +0000 (21:08 +0800)]
regulator: empty the old suspend functions
Regualtor suspend/resume functions should only be called by PM suspend
core via registering dev_pm_ops, and regulator devices should implement
the callback functions. Thus, any regulator consumer shouldn't call
the regulator suspend/resume functions directly.
In order to avoid compile errors, two empty functions with the same name
still be left for the time being.
Chunyan Zhang [Fri, 26 Jan 2018 13:08:45 +0000 (21:08 +0800)]
regulator: leave one item to record whether regulator is enabled
The items "disabled" and "enabled" are a little redundant, since only one
of them would be set to record if the regulator device should keep on
or be switched to off in suspend states.
So in this patch, the "disabled" was removed, only leave the "enabled":
- enabled == 1 for regulator-on-in-suspend
- enabled == 0 for regulator-off-in-suspend
- enabled == -1 means do nothing when entering suspend mode.
Chunyan Zhang [Fri, 26 Jan 2018 13:08:43 +0000 (21:08 +0800)]
regulator: added support for suspend states
Some systems need to set regulators to specific states when they enter
low power modes, especially around CPUs. There are many of these modes
depending on the particular runtime state.
Currently the regulator consumers are not granted permission to change
suspend state of regulator devices, the constraints are configured at
startup. In order to allow changes in a vlotage range, we need to add
new properties for voltage range and a flag to give permission to
change the suspend voltage and suspend on/off in suspend mode.
Andi Kleen [Thu, 25 Jan 2018 23:50:28 +0000 (15:50 -0800)]
module/retpoline: Warn about missing retpoline in module
There's a risk that a kernel which has full retpoline mitigations becomes
vulnerable when a module gets loaded that hasn't been compiled with the
right compiler or the right option.
To enable detection of that mismatch at module load time, add a module info
string "retpoline" at build time when the module was compiled with
retpoline support. This only covers compiled C source, but assembler source
or prebuilt object files are not checked.
If a retpoline enabled kernel detects a non retpoline protected module at
load time, print a warning and report it in the sysfs vulnerability file.
spi: orion: Fix a resource leak if the optional "axi" clk is deferred
If the optional "axi" clk is deferred, we still need to undo some
initialisation. Especially 'master' must be released. It will be
reallocated the next time 'orion_spi_probe()' is called.
Add a new label to clean what needs to be cleaned and rename another
label to improve the names used.
Fixes: 92ae112e477a ("spi: orion: Fix clock resource by adding an optional bus clock") Signed-off-by: Christophe JAILLET <[email protected]> Acked-by: Gregory CLEMENT <[email protected]> Signed-off-by: Mark Brown <[email protected]>
Add tracepoints for nvme_setup_cmd() for tracing admin and/or nvm commands.
Examples of the two tracepoints are as follows for trace_nvme_setup_admin_cmd():
kworker/u8:0-5 [003] .... 2.998792: nvme_setup_admin_cmd: cmdid=14, flags=0x0, meta=0x0, cmd=(nvme_admin_create_cq cqid=1, qsize=1023, cq_flags=0x3, irq_vector=0)
Jianchao Wang [Mon, 22 Jan 2018 14:03:16 +0000 (22:03 +0800)]
nvme-pci: introduce RECONNECTING state to mark initializing procedure
After Sagi's commit (nvme-rdma: fix concurrent reset and reconnect),
both nvme-fc/rdma have following pattern:
RESETTING - quiesce blk-mq queues, teardown and delete queues/
connections, clear out outstanding IO requests...
RECONNECTING - establish new queues/connections and some other
initializing things.
Introduce RECONNECTING to nvme-pci transport to do the same mark.
Then we get a coherent state definition among nvme pci/rdma/fc
transports.
David Ahern [Thu, 25 Jan 2018 03:37:37 +0000 (19:37 -0800)]
net: vrf: Add support for sends to local broadcast address
Sukumar reported that sends to the local broadcast address
(255.255.255.255) are broken. Check for the address in vrf driver
and do not redirect to the VRF device - similar to multicast
packets.
With this change sockets can use SO_BINDTODEVICE to specify an
egress interface and receive responses. Note: the egress interface
can not be a VRF device but needs to be the enslaved device.