There is a small chance that the compiler could generate separate loads
for the dist->propbaser which could be modified from another CPU. As we
want to make sure we atomically update the entire value, and don't race
with other updates, guarantee that the cmpxchg operation compares
against the original value.
This can be reproduced by booting L1 guest w/ 'noapic' grub parameter, which
means that tells the kernel to not make use of any IOAPICs that may be present
in the system.
Actually external_intr variable in nested_vmx_vmexit() is the req_int_win
variable passed from vcpu_enter_guest() which means that the L0's userspace
requests an irq window. I observed the scenario (!kvm_cpu_has_interrupt(vcpu) &&
L0's userspace reqeusts an irq window) is true, so there is no interrupt which
L1 requires to inject to L2, we should not attempt to emualte "Acknowledge
interrupt on exit" for the irq window requirement in this scenario.
This patch fixes it by not attempt to emulate "Acknowledge interrupt on exit"
if there is no L1 requirement to inject an interrupt to L2.
Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Signed-off-by: Wanpeng Li <[email protected]>
[Added code comment to make it obvious that the behavior is not correct.
We should do a userspace exit with open interrupt window instead of the
nested VM exit. This patch still improves the behavior, so it was
accepted as a (temporary) workaround.] Signed-off-by: Radim Krčmář <[email protected]>
pci_device_id are not supposed to change at runtime. All functions
working with pci_device_id provided by <linux/pci.h> work with
const pci_device_id. So mark the non-const structs as const.
hostap: Fix outdated comment about dev->destructor
After commit cf124db566e6 ("net: Fix inconsistent teardown and
release of private netdev state."), setting
'dev->needs_free_netdev' ensures device data is released, and
'dev->destructor' is not used anymore.
Fixes: cf124db566e6 ("net: Fix inconsistent teardown and release of private netdev state.") Signed-off-by: Stefano Brivio <[email protected]> Signed-off-by: Kalle Valo <[email protected]>
pci_device_id are not supposed to change at runtime. All functions
working with pci_device_id provided by <linux/pci.h> work with
const pci_device_id. So mark the non-const structs as const.
File size before:
text data bss dec hex filename
1899 928 0 2827 b0b realtek/rtlwifi/rtl8192ee/sw.o
File size After adding 'const':
text data bss dec hex filename
1963 864 0 2827 b0b realtek/rtlwifi/rtl8192ee/sw.o
pci_device_id are not supposed to change at runtime. All functions
working with pci_device_id provided by <linux/pci.h> work with
const pci_device_id. So mark the non-const structs as const.
File size before:
text data bss dec hex filename
3090 912 0 4002 fa2 realtek/rtlwifi/rtl8188ee/sw.o
File size After adding 'const':
text data bss dec hex filename
3154 848 0 4002 fa2 realtek/rtlwifi/rtl8188ee/sw.o
pci_device_id are not supposed to change at runtime. All functions
working with pci_device_id provided by <linux/pci.h> work with
const pci_device_id. So mark the non-const structs as const.
File size before:
text data bss dec hex filename
3032 912 0 3944 f68 realtek/rtlwifi/rtl8723be/sw.o
File size After adding 'const':
text data bss dec hex filename
3096 848 0 3944 f68 realtek/rtlwifi/rtl8723be/sw.o
pci_device_id are not supposed to change at runtime. All functions
working with pci_device_id provided by <linux/pci.h> work with
const pci_device_id. So mark the non-const structs as const.
File size before:
text data bss dec hex filename
2775 912 0 3687 e67 realtek/rtlwifi/rtl8723ae/sw.o
File size After adding 'const':
text data bss dec hex filename
2839 848 0 3687 e67 realtek/rtlwifi/rtl8723ae/sw.o
pci_device_id are not supposed to change at runtime. All functions
working with pci_device_id provided by <linux/pci.h> work with
const pci_device_id. So mark the non-const structs as const.
File size before:
text data bss dec hex filename
2491 960 0 3451 d7b realtek/rtlwifi/rtl8821ae/sw.o
File size After adding 'const':
text data bss dec hex filename
2587 864 0 3451 d7b realtek/rtlwifi/rtl8821ae/sw.o
pci_device_id are not supposed to change at runtime. All functions
working with pci_device_id provided by <linux/pci.h> work with
const pci_device_id. So mark the non-const structs as const.
File size before:
text data bss dec hex filename
2817 1040 0 3857 f11 realtek/rtlwifi/rtl8192se/sw.o
File size After adding 'const':
text data bss dec hex filename
3009 848 0 3857 f11 realtek/rtlwifi/rtl8192se/sw.o
pci_device_id are not supposed to change at runtime. All functions
working with pci_device_id provided by <linux/pci.h> work with
const pci_device_id. So mark the non-const structs as const.
File size before:
text data bss dec hex filename
2833 945 12 3790 ece realtek/rtlwifi/rtl8192de/sw.o
File size After adding 'const':
text data bss dec hex filename
2929 849 12 3790 ece realtek/rtlwifi/rtl8192de/sw.o
Robin Murphy [Mon, 24 Jul 2017 17:41:30 +0000 (18:41 +0100)]
qtnfmac: Tidy up DMA mask setting
As the only caller of dma_supported() outside of DMA API internals, the
qtfnmac driver stands out and invites scrutiny. Thankfully, it's not
being used for evil, but it is entirely redundant, since it open-codes a
check that the DMA mask setting functions are going to perform anyway.
In fact, the whole qtnf_pcie_init_dma_mask() function is nothing more
than a rather long-winded implementation of dma_set_mask_and_coherent(),
so let's just use that directly.
qtnfmac: fix handling of iftype mask reported by firmware
Firmware sends supported interface type rather than mask. As a result,
types field of ieee80211_iface_limit structure may end up having
multiple iftype bits set. This leads to WARN_ON from
wiphy_verify_combinations.
Userspace tools may hang on scan in the case when scan completion event
is not returned by firmware. This patch implements the scan timeout
to avoid such situation.
This patch implements cfg80211 channel_switch handler enabling CSA
channel-switch procedure.
Driver performs only basic validation of the requested new channel
and then sends command to firmware. Beacon IEs are not sent since
beacon update is handled by firmware.
qtnfmac: move current channel info from vif to mac
Wireless cfg80211 core supplies channel settings in cfg80211_ap_settings
structure for each BSS in multiple BSS configuration. On the other hand
all the virtual interfaces on one radio are using the same PHY settings
including channel.
Move chandef structure from vif to mac structure in order to mantain
the only instance of cfg80211_chan_def structure in qtnf_wmac
rather than its multiple copies in qtnf_vif.
Implement current channel reporting functionality. Current operating
channel can be obtained either directly using cfg80211 get_channel
callback or from stats reported by cfg80211 survey_dump callback.
On startup driver obtains regulatory rules from firmware and
enables them during wiphy registration. Later on regulatory
domain change can be requested by host. In this case firmware
is notified about the upcoming changes. If the change is valid,
then firmware updates hardware channel configuration and host
driver receives updated channel info for each band.
Sven Joachim [Mon, 31 Jul 2017 16:10:45 +0000 (18:10 +0200)]
rtlwifi: Fix fallback firmware loading
Commit f70e4df2b384 ("rtlwifi: Add code to read new versions of
firmware") added code to load an old firmware file if the new one is
not available. Unfortunately that code is never reached because
request_firmware_nowait() does not wait for the firmware to show up
and returns 0 even if the file is not there.
Use the existing fallback mechanism introduced by commit 62009b7f1279
("rtlwifi: rtl8192cu: Add new firmware") instead.
Xinming Hu [Mon, 31 Jul 2017 09:49:24 +0000 (09:49 +0000)]
mwifiex: correct IE parse during association
It is observed that some IEs get missed during association.
This patch correct the old IE parse code. sme->ie will be
store as wpa ie, wps ie, wapi ie and gen ie accordingly.
Shawn Lin [Tue, 25 Jul 2017 01:11:28 +0000 (09:11 +0800)]
mmc: block: bypass the queue even if usage is present for hotplug
The commit 304419d8a7e9 ("mmc: core: Allocate per-request data using the
block layer core") refactored mechanism of queue handling caused
mmc_init_request() can be called just after mmc_cleanup_queue() caused null
pointer dereference.
Another commit bbdc74dc19e0 ("mmc: block: Prevent new req entering queue
after its cleanup") tried to fix the problem. However it actually miss one
corner case.
We could still reproduce the issue mentioned with these steps:
(1) insert a SD card and mount it
(2) hotplug it, so it will leave md->usage still be counted
(3) reboot the system which will sync data and umount the card
So mmc_blk_put wouldn't calling blk_cleanup_queue which actually the
QUEUE_FLAG_DYING and QUEUE_FLAG_BYPASS should stay. Block core expect
blk_queue_bypass_{start, end} internally to bypass/drain the queue before
actually dying the queue, so it didn't expose API to set the queue bypass.
I think we should set QUEUE_FLAG_BYPASS whenever queue is removed, although
the md->usage is still counted, as no dispatch queue could be found then.
Fixes: 304419d8a7e9 ("mmc: core: Allocate per-request data using the block layer core") Signed-off-by: Shawn Lin <[email protected]> Reviewed-by: Linus Walleij <[email protected]> Signed-off-by: Ulf Hansson <[email protected]>
mmc: sdhci-of-at91: force card detect value for non removable devices
When the device is non removable, the card detect signal is often used
for another purpose i.e. muxed to another SoC peripheral or used as a
GPIO. It could lead to wrong behaviors depending the default value of
this signal if not muxed to the SDHCI controller.
Linus Torvalds [Thu, 3 Aug 2017 03:56:44 +0000 (20:56 -0700)]
Merge tag 'nfs-for-4.13-4' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"Two fixes from Trond this time, now that he's back from his vacation.
The first is a stable fix for the EXCHANGE_ID issue on the mailing
list, and the other fixes a double-free situation that he found at the
same time.
Tero Kristo [Wed, 2 Aug 2017 18:32:13 +0000 (21:32 +0300)]
clk: keystone: sci-clk: Fix sci_clk_get
Currently a bug in the sci_clk_get implementation causes it to always
return a clock belonging to the last device in the static list of clock
data. This is due to a bug in the init code that causes the array
used by sci_clk_get to only be populated with the clocks for the last
device, as each device overwrites the entire array with its own clocks.
Fix this by calculating the actual number of clocks for the SoC, and
allocating the whole array in one go. Also, we don't need the handle
to the init data array anymore after doing this, instead we can
just compare the dev_id / clk_id against the registered clocks and
use binary search for speed.
Jan Kara [Wed, 2 Aug 2017 20:32:30 +0000 (13:32 -0700)]
ocfs2: don't clear SGID when inheriting ACLs
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.
Fix the problem by moving posix_acl_update_mode() out of ocfs2_set_acl()
into ocfs2_iop_set_acl(). That way the function will not be called when
inheriting ACLs which is what we want as it prevents SGID bit clearing
and the mode has been properly set by posix_acl_create() anyway. Also
posix_acl_chmod() that is calling ocfs2_set_acl() takes care of updating
mode itself.
Kan Liang [Wed, 2 Aug 2017 20:32:27 +0000 (13:32 -0700)]
mm: allow page_cache_get_speculative in interrupt context
Kernel panic when calling the IRQ-safe __get_user_pages_fast in NMI
handler.
The bug was introduced by commit 2947ba054a4d ("x86/mm/gup: Switch GUP
to the generic get_user_page_fast() implementation").
The original x86 __get_user_page_fast used plain get_page() or
page_ref_add(). However, the generic __get_user_page_fast uses
page_cache_get_speculative(), which has VM_BUG_ON(in_interrupt()).
There is no reason to prevent page_cache_get_speculative from using in
interrupt context. According to the author, putting a BUG_ON there is
just because the code is not verifying correctness of interrupt races.
I did some tests in interrupt context. There is no issue found.
Removing VM_BUG_ON(in_interrupt()) for page_cache_get_speculative().
Mike Rapoport [Wed, 2 Aug 2017 20:32:24 +0000 (13:32 -0700)]
userfaultfd: non-cooperative: flush event_wqh at release time
There may still be threads waiting on event_wqh at the time the
userfault file descriptor is closed. Flush the events wait-queue to
prevent waiting threads from hanging.
Kees Cook [Wed, 2 Aug 2017 20:32:21 +0000 (13:32 -0700)]
ipc: add missing container_of()s for randstruct
When building with the randstruct gcc plugin, the layout of the IPC
structs will be randomized, which requires any sub-structure accesses to
use container_of(). The proc display handlers were missing the needed
container_of()s since the iterator is passing in the top-level struct
kern_ipc_perm.
This would lead to crashes when running the "lsipc" program after the
system had IPC registered (e.g. after starting up Gnome):
Dima Zavin [Wed, 2 Aug 2017 20:32:18 +0000 (13:32 -0700)]
cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
In codepaths that use the begin/retry interface for reading
mems_allowed_seq with irqs disabled, there exists a race condition that
stalls the patch process after only modifying a subset of the
static_branch call sites.
This problem manifested itself as a deadlock in the slub allocator,
inside get_any_partial. The loop reads mems_allowed_seq value (via
read_mems_allowed_begin), performs the defrag operation, and then
verifies the consistency of mem_allowed via the read_mems_allowed_retry
and the cookie returned by xxx_begin.
The issue here is that both begin and retry first check if cpusets are
enabled via cpusets_enabled() static branch. This branch can be
rewritted dynamically (via cpuset_inc) if a new cpuset is created. The
x86 jump label code fully synchronizes across all CPUs for every entry
it rewrites. If it rewrites only one of the callsites (specifically the
one in read_mems_allowed_retry) and then waits for the
smp_call_function(do_sync_core) to complete while a CPU is inside the
begin/retry section with IRQs off and the mems_allowed value is changed,
we can hang.
This is because begin() will always return 0 (since it wasn't patched
yet) while retry() will test the 0 against the actual value of the seq
counter.
The fix is to use two different static keys: one for begin
(pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we
first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
always return a valid seqcount if are enabling cpusets. Similarly, when
disabling cpusets via cpuset_dec(), we first ensure that callers of
cpuset_mems_allowed_retry() will start ignoring the seqcount value
before we let cpuset_mems_allowed_begin() return 0.
The relevant stack traces of the two stuck threads:
Mike Rapoport [Wed, 2 Aug 2017 20:32:15 +0000 (13:32 -0700)]
userfaultfd_zeropage: return -ENOSPC in case mm has gone
In the non-cooperative userfaultfd case, the process exit may race with
outstanding mcopy_atomic called by the uffd monitor. Returning -ENOSPC
instead of -EINVAL when mm is already gone will allow uffd monitor to
distinguish this case from other error conditions.
Unfortunately I overlooked userfaultfd_zeropage when updating
userfaultd_copy().
In commit 3f906ba23689a ("mm/memory-hotplug: switch locking to a percpu
rwsem") memory hotplug locking was changed to fix a potential deadlock.
This also switched the stop_machine() invocation within
build_all_zonelists() to stop_machine_cpuslocked() which now expects
that online cpus are locked when being called.
This assumption is not true if build_all_zonelists() is being called
from numa_zonelist_order_handler().
In order to fix this simply add a mem_hotplug_begin()/mem_hotplug_done()
pair to numa_zonelist_order_handler().
Tetsuo Handa [Wed, 2 Aug 2017 20:32:09 +0000 (13:32 -0700)]
mm/page_io.c: fix oops during block io poll in swapin path
When a thread is OOM-killed during swap_readpage() operation, an oops
occurs because end_swap_bio_read() is calling wake_up_process() based on
an assumption that the thread which called swap_readpage() is still
alive.
David S. Miller [Thu, 3 Aug 2017 00:00:24 +0000 (17:00 -0700)]
Merge branch 'per-nexthop-offload'
Jiri Pirko says:
====================
ipv4: fib: Provide per-nexthop offload indication
Ido says:
Offload indication for IPv4 routes is currently set in the FIB info's
flags. When multipath routes are employed, this can lead to a route being
marked as offloaded although only one of its nexthops is actually
offloaded.
Instead, this patchset aims to proivde a higher resolution for the offload
indication and report it on a per-nexthop basis.
Example output from patched iproute:
$ ip route show 192.168.200.0/24
192.168.200.0/24
nexthop via 192.168.100.2 dev enp3s0np7 weight 1 offload
nexthop via 192.168.101.3 dev enp3s0np8 weight 1
And once the second gateway is resolved:
$ ip route show 192.168.200.0/24
192.168.200.0/24
nexthop via 192.168.100.2 dev enp3s0np7 weight 1 offload
nexthop via 192.168.101.3 dev enp3s0np8 weight 1 offload
First patch teaches the kernel to look for the offload indication in the
nexthop flags. Patches 2-5 adjust current capable drivers to provide
offload indication on a per-nexthop basis. Last patch removes no longer
used functions to set offload indication in the FIB info's flags.
====================
Ido Schimmel [Wed, 2 Aug 2017 07:56:05 +0000 (09:56 +0200)]
mlxsw: spectrum_router: Refresh offload indication upon group refresh
Now that we provide offload indication using the nexthop's flags we must
refresh the offload indication whenever the offload state within the
group changes.
This didn't matter until now, as offload indication was provided using
the FIB info flags and multipath routes were marked as offloaded as long
as one of the nexthops was offloaded.
Ido Schimmel [Wed, 2 Aug 2017 07:56:04 +0000 (09:56 +0200)]
mlxsw: spectrum_router: Don't check state when refreshing offload indication
Previous patch removed the reliance on the counter in the FIB info to
set the offload indication, so we no longer need to keep an offload
state on each FIB entry and can just set or unset the RTNH_F_OFFLOAD
flag in each nexthop.
This is also necessary because we're going to need to refresh the
offload indication whenever the nexthop group associated with the FIB
entry is refreshed. Current check would prevent us from marking a newly
resolved nexthop as offloaded if the FIB entry is already marked as
offloaded.
Ido Schimmel [Wed, 2 Aug 2017 07:56:01 +0000 (09:56 +0200)]
ipv4: fib: Set offload indication according to nexthop flags
We're going to have capable drivers indicate route offload using the
nexthop flags, but for non-multipath routes these flags aren't dumped to
user space.
Instead, set the offload indication in the route message flags.
David S. Miller [Wed, 2 Aug 2017 23:55:34 +0000 (16:55 -0700)]
Merge branch 'netvsc-transparent-VF-support'
Stephen Hemminger says:
====================
netvsc: transparent VF support
This patch set changes how SR-IOV Virtual Function devices are managed
in the Hyper-V network driver. This version is rebased onto current net-next.
Background
In Hyper-V SR-IOV can be enabled (and disabled) by changing guest settings
on host. When SR-IOV is enabled a matching PCI device is hot plugged and
visible on guest. The VF device is an add-on to an existing netvsc
device, and has the same MAC address.
How is this different?
The original support of VF relied on using bonding driver in active
standby mode to handle the VF device.
With the new netvsc VF logic, the Linux hyper-V network
virtual driver will directly manage the link to SR-IOV VF device.
When VF device is detected (hot plug) it is automatically made a
slave device of the netvsc device. The VF device state reflects
the state of the netvsc device; i.e. if netvsc is set down, then
VF is set down. If netvsc is set up, then VF is brought up.
Packet flow is independent of VF status; all packets are sent and
received as if they were associated with the netvsc device. If VF is
removed or link is down then the synthetic VMBUS path is used.
What was wrong with using bonding script?
A lot of work went into getting the bonding script to work on all
distributions, but it was a major struggle. Linux network devices
can be configured many, many ways and there is no one solution from
userspace to make it all work. What is really hard is when
configuration is attached to synthetic device during boot (eth0) and
then the same addresses and firewall rules needs to also work later if
doing bonding. The new code gets around all of this.
How does VF work during initialization?
Since all packets are sent and received through the logical netvsc
device, initialization is much easier. Just configure the regular
netvsc Ethernet device; when/if SR-IOV is enabled it just
works. Provisioning and cloud init only need to worry about setting up
netvsc device (eth0). If SR-IOV is enabled (even as a later step), the
address and rules stay the same.
What devices show up?
Both netvsc and PCI devices are visible in the system. The netvsc
device is active and named in usual manner (eth0). The PCI device is
visible to Linux and gets renamed by udev to a persistent name
(enP2p3s0). The PCI device name is now irrelevant now.
The logic also sets the PCI VF device SLAVE flag on the network
device so network tools can see the relationship if they are smart
enough to understand how layered devices work.
This is a lot like how I see Windows working.
The VF device is visible in Device Manager, but is not configured.
Is there any performance impact?
There is no visible change in performance. The bonding
and netvsc driver both have equivalent steps.
Is it compatible with old bonding script?
It turns out that if you use the old bonding script, then everything
still works but in a sub-optimum manner. What happens is that bonding
is unable to steal the VF from the netvsc device so it creates a one
legged bond. Packet flow then is:
bond0 <--> eth0 <- -> VF (enP2p3s0).
In other words, if you get it wrong it still works, just
awkward and slower.
What if I add address or firewall rule onto the VF?
Same problems occur with now as already occur with bonding, bridging,
teaming on Linux if user incorrectly does configuration onto
an underlying slave device. It will sort of work, packets will come in
and out but the Linux kernel gets confused and things like ARP don’t
work right. There is no way to block manipulation of the slave
device, and I am sure someone will find some special use case where
they want it.
====================
This patch implements transparent fail over from synthetic NIC to
SR-IOV virtual function NIC in Hyper-V environment. It is a better
alternative to using bonding as is done now. Instead, the receive and
transmit fail over is done internally inside the driver.
Using bonding driver has lots of issues because it depends on the
script being run early enough in the boot process and with sufficient
information to make the association. This patch moves all that
functionality into the kernel.
Functions working with attribute_groups provided by <linux/sysfs.h>
work with const attribute_group. These attribute_group structures do not
change at runtime so mark them as const.
File size before:
text data bss dec hex filename
35740 28424 832 64996 fde4 drivers/atm/solos-pci.o
File size after:
text data bss dec hex filename
35932 28232 832 64996 fde4 drivers/atm/solos-pci.o
Functions working with attribute_groups provided by <linux/sysfs.h>
work with const attribute_group. These attribute_group structures do not
change at runtime so mark them as const.
File size before:
text data bss dec hex filename
2033 1448 0 3481 d99 drivers/atm/adummy.o
File size after:
text data bss dec hex filename
2129 1352 0 3481 d99 drivers/atm/adummy.o
Derek Chickles [Tue, 1 Aug 2017 22:05:07 +0000 (15:05 -0700)]
liquidio: set sriov_totalvfs correctly
The file /sys/devices/pci000.../sriov_totalvfs is showing a wrong value.
Fix it by calling pci_sriov_set_totalvfs() to set the total number of VFs
available after calculations for the number of PF and VF queues are made.
DSA slave network devices maintain a pair of bytes and packets counters
for each directions, but these are not 64-bit capable. Re-use
pcpu_sw_netstats which contains exactly what we need for that purpose
and update the code path to report 64-bit capable statistics.
After commit cf8e0fedf078 ("mm/zsmalloc: simplify zs_max_alloc_size
handling"), zram doesn't use double pointer for pool->size_class any
more in zs_create_pool so counter function zs_destroy_pool don't need to
free it, either.
Otherwise, it does kfree wrong address and then, kernel goes Oops.
Arnd Bergmann [Wed, 2 Aug 2017 20:31:58 +0000 (13:31 -0700)]
kasan: avoid -Wmaybe-uninitialized warning
gcc-7 produces this warning:
mm/kasan/report.c: In function 'kasan_report':
mm/kasan/report.c:351:3: error: 'info.first_bad_addr' may be used uninitialized in this function [-Werror=maybe-uninitialized]
print_shadow_for_address(info->first_bad_addr);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/kasan/report.c:360:27: note: 'info.first_bad_addr' was declared here
The code seems fine as we only print info.first_bad_addr when there is a
shadow, and we always initialize it in that case, but this is relatively
hard for gcc to figure out after the latest rework.
Adding an intialization to the most likely value together with the other
struct members shuts up that warning.
Mike Rapoport [Wed, 2 Aug 2017 20:31:55 +0000 (13:31 -0700)]
userfaultfd: non-cooperative: notify about unmap of destination during mremap
When mremap is called with MREMAP_FIXED it unmaps memory at the
destination address without notifying userfaultfd monitor.
If the destination were registered with userfaultfd, the monitor has no
way to distinguish between the old and new ranges and to properly relate
the page faults that would occur in the destination region.
Mel Gorman [Wed, 2 Aug 2017 20:31:52 +0000 (13:31 -0700)]
mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries
Nadav Amit identified a theoritical race between page reclaim and
mprotect due to TLB flushes being batched outside of the PTL being held.
He described the race as follows:
CPU0 CPU1
---- ----
user accesses memory using RW PTE
[PTE now cached in TLB]
try_to_unmap_one()
==> ptep_get_and_clear()
==> set_tlb_ubc_flush_pending()
mprotect(addr, PROT_READ)
==> change_pte_range()
==> [ PTE non-present - no flush ]
user writes using cached RW PTE
...
try_to_unmap_flush()
The same type of race exists for reads when protecting for PROT_NONE and
also exists for operations that can leave an old TLB entry behind such
as munmap, mremap and madvise.
For some operations like mprotect, it's not necessarily a data integrity
issue but it is a correctness issue as there is a window where an
mprotect that limits access still allows access. For munmap, it's
potentially a data integrity issue although the race is massive as an
munmap, mmap and return to userspace must all complete between the
window when reclaim drops the PTL and flushes the TLB. However, it's
theoritically possible so handle this issue by flushing the mm if
reclaim is potentially currently batching TLB flushes.
Other instances where a flush is required for a present pte should be ok
as either the page lock is held preventing parallel reclaim or a page
reference count is elevated preventing a parallel free leading to
corruption. In the case of page_mkclean there isn't an obvious path
that userspace could take advantage of without using the operations that
are guarded by this patch. Other users such as gup as a race with
reclaim looks just at PTEs. huge page variants should be ok as they
don't race with reclaim. mincore only looks at PTEs. userfault also
should be ok as if a parallel reclaim takes place, it will either fault
the page back in or read some of the data before the flush occurs
triggering a fault.
Note that a variant of this patch was acked by Andy Lutomirski but this
was for the x86 parts on top of his PCID work which didn't make the 4.13
merge window as expected. His ack is dropped from this version and
there will be a follow-on patch on top of PCID that will include his
ack.
Daniel Jordan [Wed, 2 Aug 2017 20:31:47 +0000 (13:31 -0700)]
mm/hugetlb.c: __get_user_pages ignores certain follow_hugetlb_page errors
Commit 9a291a7c9428 ("mm/hugetlb: report -EHWPOISON not -EFAULT when
FOLL_HWPOISON is specified") causes __get_user_pages to ignore certain
errors from follow_hugetlb_page. After such error, __get_user_pages
subsequently calls faultin_page on the same VMA and start address that
follow_hugetlb_page failed on instead of returning the error immediately
as it should.
In follow_hugetlb_page, when hugetlb_fault returns a value covered under
VM_FAULT_ERROR, follow_hugetlb_page returns it without setting nr_pages
to 0 as __get_user_pages expects in this case, which causes the
following to happen in __get_user_pages: the "while (nr_pages)" check
succeeds, we skip the "if (!vma..." check because we got a VMA the last
time around, we find no page with follow_page_mask, and we call
faultin_page, which calls hugetlb_fault for the second time.
This issue also slightly changes how __get_user_pages works. Before, it
only returned error if it had made no progress (i = 0). But now,
follow_hugetlb_page can clobber "i" with an error code since its new
return path doesn't check for progress. So if "i" is nonzero before a
failing call to follow_hugetlb_page, that indication of progress is lost
and __get_user_pages can return error even if some pages were
successfully pinned.
To fix this, change follow_hugetlb_page so that it updates nr_pages,
allowing __get_user_pages to fail immediately and restoring the "error
only if no progress" behavior to __get_user_pages.
Tested that __get_user_pages returns when expected on error from
hugetlb_fault in follow_hugetlb_page.
David Matlack [Tue, 1 Aug 2017 21:00:40 +0000 (14:00 -0700)]
KVM: nVMX: mark vmcs12 pages dirty on L2 exit
The host physical addresses of L1's Virtual APIC Page and Posted
Interrupt descriptor are loaded into the VMCS02. The CPU may write
to these pages via their host physical address while L2 is running,
bypassing address-translation-based dirty tracking (e.g. EPT write
protection). Mark them dirty on every exit from L2 to prevent them
from getting out of sync with dirty tracking.
Also mark the virtual APIC page and the posted interrupt descriptor
dirty when KVM is virtualizing posted interrupt processing.
David Matlack [Tue, 1 Aug 2017 21:00:39 +0000 (14:00 -0700)]
kvm: nVMX: don't flush VMCS12 during VMXOFF or VCPU teardown
According to the Intel SDM, software cannot rely on the current VMCS to be
coherent after a VMXOFF or shutdown. So this is a valid way to handle VMCS12
flushes.
24.11.1 Software Use of Virtual-Machine Control Structures
...
If a logical processor leaves VMX operation, any VMCSs active on
that logical processor may be corrupted (see below). To prevent
such corruption of a VMCS that may be used either after a return
to VMX operation or on another logical processor, software should
execute VMCLEAR for that VMCS before executing the VMXOFF instruction
or removing power from the processor (e.g., as part of a transition
to the S3 and S4 power states).
...
This fixes a "suspicious rcu_dereference_check() usage!" warning during
kvm_vm_release() because nested_release_vmcs12() calls
kvm_vcpu_write_guest_page() without holding kvm->srcu.
Paolo Bonzini [Thu, 27 Jul 2017 13:54:46 +0000 (15:54 +0200)]
KVM: nVMX: do not pin the VMCS12
Since the current implementation of VMCS12 does a memcpy in and out
of guest memory, we do not need current_vmcs12 and current_vmcs12_page
anymore. current_vmptr is enough to read and write the VMCS12.
And David Matlack noted:
This patch also fixes dirty tracking (memslot->dirty_bitmap) of the
VMCS12 page by using kvm_write_guest. nested_release_page() only marks
the struct page dirty.
Signed-off-by: Paolo Bonzini <[email protected]> Reviewed-by: David Hildenbrand <[email protected]>
[Added David Matlack's note and nested_release_page_clean() fix.] Signed-off-by: Radim Krčmář <[email protected]>
Paolo Bonzini [Wed, 2 Aug 2017 15:55:54 +0000 (17:55 +0200)]
KVM: avoid using rcu_dereference_protected
During teardown, accesses to memslots and buses are using
rcu_dereference_protected with an always-true condition because
these accesses are done outside the usual mutexes. This
is because the last reference is gone and there cannot be any
concurrent modifications, but rcu_dereference_protected is
ugly and unobvious.
Instead, check the refcount in kvm_get_bus and __kvm_memslots.
Longpeng(Mike) [Wed, 2 Aug 2017 03:20:51 +0000 (11:20 +0800)]
KVM: X86: init irq->level in kvm_pv_kick_cpu_op
'lapic_irq' is a local variable and its 'level' field isn't
initialized, so 'level' is random, it doesn't matter but
makes UBSAN unhappy:
UBSAN: Undefined behaviour in .../lapic.c:...
load of value 10 is not a valid value for type '_Bool'
...
Call Trace:
[<ffffffff81f030b6>] dump_stack+0x1e/0x20
[<ffffffff81f03173>] ubsan_epilogue+0x12/0x55
[<ffffffff81f03b96>] __ubsan_handle_load_invalid_value+0x118/0x162
[<ffffffffa1575173>] kvm_apic_set_irq+0xc3/0xf0 [kvm]
[<ffffffffa1575b20>] kvm_irq_delivery_to_apic_fast+0x450/0x910 [kvm]
[<ffffffffa15858ea>] kvm_irq_delivery_to_apic+0xfa/0x7a0 [kvm]
[<ffffffffa1517f4e>] kvm_emulate_hypercall+0x62e/0x760 [kvm]
[<ffffffffa113141a>] handle_vmcall+0x1a/0x30 [kvm_intel]
[<ffffffffa114e592>] vmx_handle_exit+0x7a2/0x1fa0 [kvm_intel]
...
Felix Kuehling [Wed, 2 Aug 2017 02:34:55 +0000 (22:34 -0400)]
drm/amdgpu: Use list_del_init in amdgpu_mn_unregister
Otherwise bo->shadow_list (which is aliased by bo->mn_list) will not
appear empty in amdgpu_ttm_bo_destroy and cause an oops when freeing
former userptr BOs.
Jean Delvare [Sun, 30 Jul 2017 08:18:25 +0000 (10:18 +0200)]
drm/amdgpu: Fix undue fallthroughs in golden registers initialization
As I was staring at the si_init_golden_registers code, I noticed that
the Pitcairn initialization silently falls through the Cape Verde
initialization, and the Oland initialization falls through the Hainan
initialization. However there is no comment stating that this is
intentional, and the radeon driver doesn't have any such fallthrough,
so I suspect this is not supposed to happen.
Arnd Bergmann [Tue, 1 Aug 2017 11:50:56 +0000 (13:50 +0200)]
net: bcmgenet: drop COMPILE_TEST dependency
The last patch added the dependency on 'OF && HAS_IOMEM' but left
COMPILE_TEST as an alternative, which kind of defeats the purpose
of adding the dependency, we still get randconfig build warnings:
warning: (NET_DSA_BCM_SF2 && BCMGENET) selects MDIO_BCM_UNIMAC which has unmet direct dependencies (NETDEVICES && MDIO_BUS && HAS_IOMEM && OF_MDIO)
For compile-testing purposes, we don't really need this anyway,
as CONFIG_OF can be enabled on all architectures, and HAS_IOMEM
is present on all architectures we do meaningful compile-testing on
(the exception being arch/um).
This makes both OF and HAS_IOMEM hard dependencies.
Fixes: 5af74bb4fcf8 ("net: bcmgenet: Add dependency on HAS_IOMEM && OF") Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Stephen Boyd [Wed, 2 Aug 2017 16:11:44 +0000 (09:11 -0700)]
Merge tag 'sunxi-clk-fixes-for-4.13' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux into clk-fixes
Pull one Allwinner clock fix from Chen-Yu Tsai:
One critical clock fix for sun5i (A10s/A13/R8) which enables propagation
of clock rate changes from the "cpu" clock to it's parent PLL clock.
This fixes cpufreq related crashes that have been observed on KernelCI
with the C.H.I.P. and multi_v7_defconfig.
* tag 'sunxi-clk-fixes-for-4.13' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux:
clk: sunxi-ng: sun5i: Add clk_set_rate_parent to the CPU clock
Takashi Iwai [Wed, 2 Aug 2017 15:11:45 +0000 (17:11 +0200)]
Merge tag 'asoc-fix-v4.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v4.13
Quite a few fixes here that have been sent since the merge window, the
biggest one is the fix from Tony for some confusion with the device
property API which was causing issues with the of-graph card. This is
fixed with some changes in the graph API itself as it seemed very likely
to be error prone.
Gregory CLEMENT [Tue, 1 Aug 2017 16:01:35 +0000 (18:01 +0200)]
ARM64: dts: marvell: armada-37xx: Fix the number of GPIO on south bridge
The number of pins in South Bridge is 30 and not 29. There is a fix for
the driver for the pinctrl, but a fix is also need at device tree level
for the GPIO.
Fixes: afda007feda5 ("ARM64: dts: marvell: Add pinctrl nodes for Armada
3700") Cc: <[email protected]> Signed-off-by: Gregory CLEMENT <[email protected]>
of_irq_to_resource() has recently been fixed to return negative error #'s
along with 0 in case of failure, however the Freescale MPC832x RDB board
code still only regards 0 as a failure indication -- fix it up.
Fixes: 7a4228bbff76 ("of: irq: use of_irq_get() in of_irq_to_resource()") Signed-off-by: Sergei Shtylyov <[email protected]> Acked-by: Scott Wood <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
When more than one GPIO IRQs are triggered simultaneously,
tegra_gpio_irq_handler() called chained_irq_exit() multiple
times for one chained_irq_enter().
Fixes: 3c92db9ac0ca3eee8e46e2424b6c074e2e394ad9 Signed-off-by: Michał Mirosław <[email protected]>
[Also changed the variable to a bool] Signed-off-by: Linus Walleij <[email protected]>
David S. Miller [Wed, 2 Aug 2017 03:09:10 +0000 (20:09 -0700)]
Merge branch 'dsa-rework-EEE-support'
Vivien Didelot says:
====================
net: dsa: rework EEE support
EEE implies configuring the port's PHY and MAC of both ends of the wire.
The current EEE support in DSA mixes PHY and MAC configuration, which is
bad because PHYs must be configured through a proper PHY driver. The DSA
switch operations for EEE are only meant for configuring the port's MAC,
which are integrated in the Ethernet switch device.
This patchset fixes the EEE support in qca8k driver, makes the DSA layer
call phy_init_eee for all drivers, and remove the EEE support from the
mv88e6xxx driver since the Marvell PHY driver should be enough for it.
Changes in v2:
- make PHY device and DSA EEE ops mandatory for slave EEE operations.
- simply return 0 in drivers which don't need to do anything to
configure the port' MAC. Subsequent PHY calls will be enough.
====================
Vivien Didelot [Tue, 1 Aug 2017 20:32:39 +0000 (16:32 -0400)]
net: dsa: remove PHY device argument from .set_eee
The DSA switch operations for EEE are only meant to configure a port's
MAC EEE settings. The port's PHY EEE settings are accessed by the DSA
layer and must be made available via a proper PHY driver.
In order to reduce this confusion, remove the phy_device argument from
the .set_eee operation.
Vivien Didelot [Tue, 1 Aug 2017 20:32:32 +0000 (16:32 -0400)]
net: dsa: qca8k: fix EEE init
The qca8k obviously copied code from the sf2 driver as how to set EEE:
if (e->eee_enabled) {
p->eee_enabled = qca8k_eee_init(ds, port, phydev);
if (!p->eee_enabled)
ret = -EOPNOTSUPP;
}
But it did not use the same logic for the EEE init routine, which is
"Returns 0 if EEE was not enabled, or 1 otherwise". This results in
returning -EOPNOTSUPP on success and caching EEE enabled on failure.
This patch fixes the returned value of qca8k_eee_init.