Git Repo - linux.git/log

Merge tag 'for-linus-20191121' of git://git.kernel.dk/linux-block

Pull block fix from Jens Axboe:
"Just a single fix for an issue in nbd introduced in this cycle"

* tag 'for-linus-20191121' of git://git.kernel.dk/linux-block:
nbd:fix memory leak in nbd_get_socket()

Merge tag 'gpio-v5.4-5' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio

Pull GPIO fixes from Linus Walleij:
"A last set of small fixes for GPIO, this cycle was quite busy.

   - Fix debounce delays on the MAX77620 GPIO expander

   - Use the correct unit for debounce times on the BD70528 GPIO expander

   - Get proper deps for parallel builds of the GPIO tools

   - Add a specific ACPI quirk for the Terra Pad 1061"

* tag 'gpio-v5.4-5' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
  gpiolib: acpi: Add Terra Pad 1061 to the run_edge_events_on_boot_blacklist
  tools: gpio: Correctly add make dependencies for gpio_utils
  gpio: bd70528: Use correct unit for debounce times
  gpio: max77620: Fixup debounce delays

Merge tag 'for-linus-2019-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull pidfd fixlet from Christian Brauner:
"This contains a simple fix for the pidfd poll method. In the original
  patchset pidfd_poll() was made to return an unsigned int. However, the
  poll method is defined to return a __poll_t. While the unsigned int is
  not a huge deal it's just nicer to return a __poll_t.

  I've decided to send it right before the 5.4 release mainly so that
  stable doesn't need to backport it to both 5.4 and 5.3"

* tag 'for-linus-2019-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fork: fix pidfd_poll()'s return type

nfc: port100: handle command failure cleanly

If starting the transfer of a command suceeds but the transfer for the reply
fails, it is not enough to initiate killing the transfer for the
command may still be running. You need to wait for the killing to finish
before you can reuse URB and buffer.

Reported-and-tested-by: [email protected]
Signed-off-by: Oliver Neukum <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

nbd: prevent memory leak

In nbd_add_socket when krealloc succeeds, if nsock's allocation fail the
reallocted memory is leak. The correct behaviour should be assigning the
reallocted memory to config->socks right after success.

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Navid Emamdoost <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

Merge branch 'nvme-5.5' of git://git.infradead.org/nvme into for-5.5/drivers-post

Pull NVMe changes from Keith:

"- The only new feature is the optional hwmon support for nvme (Guenter
   and Akinobu)

- A universal work-around for controllers reading discard payloads
   beyond the range boundary (Eduard)

- Chaitanya graciously agreed to share the target driver maintenance"

* 'nvme-5.5' of git://git.infradead.org/nvme:
  nvme: hwmon: add quirk to avoid changing temperature threshold
  nvme: hwmon: provide temperature min and max values for each sensor
  nvmet: add another maintainer
  nvme: Discard workaround for non-conformant devices
  nvme: Add hardware monitoring support

nvme: hwmon: add quirk to avoid changing temperature threshold

This adds a new quirk NVME_QUIRK_NO_TEMP_THRESH_CHANGE to avoid changing
the value of the temperature threshold feature for specific devices that
show undesirable behavior.

Guenter reported:

"On my Intel NVME drive (SSDPEKKW512G7), writing any minimum limit on the
Composite temperature sensor results in a temperature warning, and that
warning is sticky until I reset the controller.

It doesn't seem to matter which temperature I write; writing -273000 has
the same result."

The Intel NVMe has the latest firmware version installed, so this isn't
a problem that was ever fixed.

Reported-by: Guenter Roeck <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Jean Delvare <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
Tested-by: Guenter Roeck <[email protected]>
Signed-off-by: Akinobu Mita <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

nvme: hwmon: provide temperature min and max values for each sensor

According to the NVMe specification, the over temperature threshold and
under temperature threshold features shall be implemented for Composite
Temperature if a non-zero WCTEMP field value is reported in the Identify
Controller data structure.  The features are also implemented for all
implemented temperature sensors (i.e., all Temperature Sensor fields that
report a non-zero value).

This provides the over temperature threshold and under temperature
threshold for each sensor as temperature min and max values of hwmon
sysfs attributes.

The WCTEMP is already provided as a temperature max value for Composite
Temperature, but this change isn't incompatible.  Because the default
value of the over temperature threshold for Composite Temperature is
the WCTEMP.

Now the alarm attribute for Composite Temperature indicates one of the
temperature is outside of a temperature threshold.  Because there is only
a single bit in Critical Warning field that indicates a temperature is
outside of a threshold.

Example output from the "sensors" command:

nvme-pci-0100
Adapter: PCI adapter
Composite:    +33.9°C  (low  = -273.1°C, high = +69.8°C)
                       (crit = +79.8°C)
Sensor 1:     +34.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +31.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 5:     +47.9°C  (low  = -273.1°C, high = +65261.8°C)

This also adds helper macros for kelvin from/to milli Celsius conversion,
and replaces the repeated code in hwmon.c.

Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Jean Delvare <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
Tested-by: Guenter Roeck <[email protected]>
Signed-off-by: Akinobu Mita <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

nvmet: add another maintainer

Sagi and I have been pretty busy lately, and Chaitanya has been
helping a lot with target work and agreed to share the load.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>

Revert "block: split bio if the only bvec's length is > SZ_4K"

We really don't need this, as the slow path will do the right thing
anyway.

This reverts commit 6952a7f8446ee85ea9d10ab87b64797a031eaae3.

Signed-off-by: Jens Axboe <[email protected]>

block: add iostat counters for flush requests

Requests that triggers flushing volatile writeback cache to disk (barriers)
have significant effect to overall performance.

Block layer has sophisticated engine for combining several flush requests
into one. But there is no statistics for actual flushes executed by disk.
Requests which trigger flushes usually are barriers - zero-size writes.

This patch adds two iostat counters into /sys/class/block/$dev/stat and
/proc/diskstats - count of completed flush requests and their total time.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

KVM: x86: create mmu/ subdirectory

Preparatory work for shattering mmu.c into multiple files. Besides making it easier
to follow, this will also make it possible to write unit tests for various parts.

Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: Remove unnecessary TLB flushes on L1<->L2 switches when L1 use apic-access-page

According to Intel SDM section 28.3.3.3/28.3.3.4 Guidelines for Use
of the INVVPID/INVEPT Instruction, the hypervisor needs to execute
INVVPID/INVEPT X in case CPU executes VMEntry with VPID/EPTP X and
either: "Virtualize APIC accesses" VM-execution control was changed
from 0 to 1, OR the value of apic_access_page was changed.

In the nested case, the burden falls on L1, unless L0 enables EPT in
vmcs02 but L1 enables neither EPT nor VPID in vmcs12.  For this reason
prepare_vmcs02() and load_vmcs12_host_state() have special code to
request a TLB flush in case L1 does not use EPT but it uses
"virtualize APIC accesses".

This special case however is not necessary. On a nested vmentry the
physical TLB will already be flushed except if all the following apply:

* L0 uses VPID

* L1 uses VPID

* L0 can guarantee TLB entries populated while running L1 are tagged
differently than TLB entries populated while running L2.

If the first condition is false, the processor will flush the TLB
on vmentry to L2.  If the second or third condition are false,
prepare_vmcs02() will request KVM_REQ_TLB_FLUSH.  However, even
if both are true, no extra TLB flush is needed to handle the APIC
access page:

* if L1 doesn't use VPID, the second condition doesn't hold and the
TLB will be flushed anyway.

* if L1 uses VPID, it has to flush the TLB itself with INVVPID and
section 28.3.3.3 doesn't apply to L0.

* even INVEPT is not needed because, if L0 uses EPT, it uses different
EPTP when running L2 than L1 (because guest_mode is part of mmu-role).
In this case SDM section 28.3.3.4 doesn't apply.

Similarly, examining nested_vmx_vmexit()->load_vmcs12_host_state(),
one could note that L0 won't flush TLB only in cases where SDM sections
28.3.3.3 and 28.3.3.4 don't apply.  In particular, if L0 uses different
VPIDs for L1 and L2 (i.e. vmx->vpid != vmx->nested.vpid02), section
28.3.3.3 doesn't apply.

Thus, remove this flush from prepare_vmcs02() and nested_vmx_vmexit().

Side-note: This patch can be viewed as removing parts of commit
fb6c81984313 ("kvm: vmx: Flush TLB when the APIC-access address changes”)
that is not relevant anymore since commit
1313cc2bd8f6 ("kvm: mmu: Add guest_mode to kvm_mmu_page_role”).
i.e. The first commit assumes that if L0 use EPT and L1 doesn’t use EPT,
then L0 will use same EPTP for both L0 and L1. Which indeed required
L0 to execute INVEPT before entering L2 guest. This assumption is
not true anymore since when guest_mode was added to mmu-role.

Reviewed-by: Joao Martins <[email protected]>
Signed-off-by: Liran Alon <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: remove set but not used variable 'called'

Fixes gcc '-Wunused-but-set-variable' warning:

arch/x86/kvm/x86.c: In function kvm_make_scan_ioapic_request_mask:
arch/x86/kvm/x86.c:7911:7: warning: variable called set but not
used [-Wunused-but-set-variable]

It is not used since commit 7ee30bc132c6 ("KVM: x86: deliver KVM
IOAPIC scan request to target vCPUs")

Signed-off-by: Mao Wenan <[email protected]>
Fixes: 7ee30bc132c6 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs")
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: Do not mark vmcs02->apic_access_page as dirty when unpinning

vmcs->apic_access_page is simply a token that the hypervisor puts into
the PFN of a 4KB EPTE (or PTE if using shadow-paging) that triggers
APIC-access VMExit or APIC virtualization logic whenever a CPU running
in VMX non-root mode read/write from/to this PFN.

As every write either triggers an APIC-access VMExit or write is
performed on vmcs->virtual_apic_page, the PFN pointed to by
vmcs->apic_access_page should never actually be touched by CPU.

Therefore, there is no need to mark vmcs02->apic_access_page as dirty
after unpin it on L2->L1 emulated VMExit or when L1 exit VMX operation.

Reviewed-by: Krish Sadhukhan <[email protected]>
Reviewed-by: Joao Martins <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Signed-off-by: Liran Alon <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

Merge branch 'kvm-tsx-ctrl' into HEAD

Conflicts:
arch/x86/kvm/vmx/vmx.c

KVM: vmx: use MSR_IA32_TSX_CTRL to hard-disable TSX on guest that lack it

If X86_FEATURE_RTM is disabled, the guest should not be able to access
MSR_IA32_TSX_CTRL. We can therefore use it in KVM to force all
transactions from the guest to abort.

Tested-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: vmx: implement MSR_IA32_TSX_CTRL disable RTM functionality

The current guest mitigation of TAA is both too heavy and not really
sufficient.  It is too heavy because it will cause some affected CPUs
(those that have MDS_NO but lack TAA_NO) to fall back to VERW and
get the corresponding slowdown.  It is not really sufficient because
it will cause the MDS_NO bit to disappear upon microcode update, so
that VMs started before the microcode update will not be runnable
anymore afterwards, even with tsx=on.

Instead, if tsx=on on the host, we can emulate MSR_IA32_TSX_CTRL for
the guest and let it run without the VERW mitigation.  Even though
MSR_IA32_TSX_CTRL is quite heavyweight, and we do not want to write
it on every vmentry, we can use the shared MSR functionality because
the host kernel need not protect itself from TSX-based side-channels.

Tested-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: implement MSR_IA32_TSX_CTRL effect on CPUID

Because KVM always emulates CPUID, the CPUID clear bit
(bit 1) of MSR_IA32_TSX_CTRL must be emulated "manually"
by the hypervisor when performing said emulation.

Right now neither kvm-intel.ko nor kvm-amd.ko implement
MSR_IA32_TSX_CTRL but this will change in the next patch.

Reviewed-by: Jim Mattson <[email protected]>
Tested-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: do not modify masked bits of shared MSRs

"Shared MSRs" are guest MSRs that are written to the host MSRs but
keep their value until the next return to userspace.  They support
a mask, so that some bits keep the host value, but this mask is
only used to skip an unnecessary MSR write and the value written
to the MSR is always the guest MSR.

Fix this and, while at it, do not update smsr->values[slot].curr if
for whatever reason the wrmsr fails.  This should only happen due to
reserved bits, so the value written to smsr->values[slot].curr
will not match when the user-return notifier and the host value will
always be restored.  However, it is untidy and in rare cases this
can actually avoid spurious WRMSRs on return to userspace.

Cc: [email protected]
Reviewed-by: Jim Mattson <[email protected]>
Tested-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES

KVM does not implement MSR_IA32_TSX_CTRL, so it must not be presented
to the guests. It is also confusing to have !ARCH_CAP_TSX_CTRL_MSR &&
!RTM && ARCH_CAP_TAA_NO: lack of MSR_IA32_TSX_CTRL suggests TSX was not
hidden (it actually was), yet the value says that TSX is not vulnerable
to microarchitectural data sampling. Fix both.

Cc: [email protected]
Tested-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

Merge tag 'kvmarm-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm updates for Linux 5.5:

- Allow non-ISV data aborts to be reported to userspace
- Allow injection of data aborts from userspace
- Expose stolen time to guests
- GICv4 performance improvements
- vgic ITS emulation fixes
- Simplify FWB handling
- Enable halt pool counters
- Make the emulated timer PREEMPT_RT compliant

Conflicts:
include/uapi/linux/kvm.h

drm/i915/fbdev: Restore physical addresses for fb_mmap()

fbdev uses the physical address of our framebuffer for its fb_mmap()
routine. While we need to adapt this address for the new io BAR, we have
to fix v5.4 first! The simplest fix is to restore the smem back to v5.3
and we will then probably have to implement our fbops->fb_mmap() callback
to handle local memory.

Reported-by: Neil MacLeod <[email protected]>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=112256
Fixes: 5f889b9a61dd ("drm/i915: Disregard drm_mode_config.fb_base")
Signed-off-by: Chris Wilson <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Maarten Lankhorst <[email protected]>
Tested-by: Neil MacLeod <[email protected]>
Reviewed-by: Ville Syrjälä <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit abc5520704ab438099fe352636b30b05c1253bea)
Signed-off-by: Joonas Lahtinen <[email protected]>
(cherry picked from commit 9faf5fa4d3dad3b0c0fa6e67689c144981a11c27)
Signed-off-by: Rodrigo Vivi <[email protected]>

net-sysfs: fix netdev_queue_add_kobject() breakage

kobject_put() should only be called in error path.

Fixes: b8eb718348b8 ("net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Jouni Hogander <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

KVM: PPC: Book3S HV: XIVE: Fix potential page leak on error path

We need to check the host page size is big enough to accomodate the
EQ. Let's do this before taking a reference on the EQ page to avoid
a potential leak if the check fails.

Cc: [email protected] # v5.2
Fixes: 13ce3297c576 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration")
Signed-off-by: Greg Kurz <[email protected]>
Reviewed-by: Cédric Le Goater <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

KVM: PPC: Book3S HV: XIVE: Free previous EQ page when setting up a new one

The EQ page is allocated by the guest and then passed to the hypervisor
with the H_INT_SET_QUEUE_CONFIG hcall. A reference is taken on the page
before handing it over to the HW. This reference is dropped either when
the guest issues the H_INT_RESET hcall or when the KVM device is released.
But, the guest can legitimately call H_INT_SET_QUEUE_CONFIG several times,
either to reset the EQ (vCPU hot unplug) or to set a new EQ (guest reboot).
In both cases the existing EQ page reference is leaked because we simply
overwrite it in the XIVE queue structure without calling put_page().

This is especially visible when the guest memory is backed with huge pages:
start a VM up to the guest userspace, either reboot it or unplug a vCPU,
quit QEMU. The leak is observed by comparing the value of HugePages_Free in
/proc/meminfo before and after the VM is run.

Ideally we'd want the XIVE code to handle the EQ page de-allocation at the
platform level. This isn't the case right now because the various XIVE
drivers have different allocation needs. It could maybe worth introducing
hooks for this purpose instead of exposing XIVE internals to the drivers,
but this is certainly a huge work to be done later.

In the meantime, for easier backport, fix both vCPU unplug and guest reboot
leaks by introducing a wrapper around xive_native_configure_queue() that
does the necessary cleanup.

Reported-by: Satheesh Rajendran <[email protected]>
Cc: [email protected] # v5.2
Fixes: 13ce3297c576 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration")
Signed-off-by: Cédric Le Goater <[email protected]>
Signed-off-by: Greg Kurz <[email protected]>
Tested-by: Lijun Pan <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>

Merge tag 'drm-fixes-5.4-2019-11-20' of git://people.freedesktop.org/~agd5f/linux into drm-fixes

drm-fixes-5.4-2019-11-20:

amdgpu:
- Remove experimental flag for navi14
- Fix confusing power message failures on older VI parts
- Hang fix for gfxoff when using the read register interface
- Two stability regression fixes for Raven

Signed-off-by: Dave Airlie <[email protected]>
From: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

Revert "drm/amd/display: enable S/G for RAVEN chip"

This reverts commit 1c4259159132ae4ceaf7c6db37a6cf76417f73d9.

S/G display is not stable with the IOMMU enabled on some
platforms.

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=205523
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]

drm/amdgpu: disable gfxoff on original raven

There are still combinations of sbios and firmware that
are not stable.

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=204689
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]

drm/amdgpu: disable gfxoff when using register read interface

When gfxoff is enabled, accessing gfx registers via MMIO
can lead to a hang.

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=205497
Acked-by: Xiaojie Yuan <[email protected]>
Reviewed-by: Evan Quan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]

drm/amd/powerplay: correct fine grained dpm force level setting

For fine grained dpm, there is only two levels supported. However
to reflect correctly the current clock frequency, there is an
intermediate level faked. Thus on forcing level setting, we
need to treat level 2 correctly as level 1.

Signed-off-by: Evan Quan <[email protected]>
Reviewed-by: Kevin Wang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/powerplay: issue no PPSMC_MSG_GetCurrPkgPwr on unsupported ASICs

Otherwise, the error message prompted will confuse user.

Signed-off-by: Evan Quan <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]

drm/amdgpu: remove experimental flag for Navi14

5.4 and newer works fine with navi14.

Reviewed-by: Xiaojie Yuan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

block,bfq: Skip tracing hooks if possible

In most cases blk_tracing is not active, but  bfq_log_bfqq macro
generate pid_str unconditionally, which result in significant overhead.

## Test
modprobe null_blk
echo bfq > /sys/block/nullb0/queue/scheduler
fio --name=t --ioengine=libaio --direct=1 --filename=/dev/nullb0 \
   --runtime=30 --time_based=1 --rw=write --iodepth=128 --bs=4k

# Results
|        | baseline | w/ patch | gain |
| iops   | 113.19K  | 126.42K  | +11% |

Acked-by: Paolo Valente <[email protected]>
Signed-off-by: Dmitry Monakhov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

Revert "dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues"

This reverts commit a1b89132dc4f61071bdeaab92ea958e0953380a1.

Revert required hand-patching due to subsequent changes that were
applied since commit a1b89132dc4f61071bdeaab92ea958e0953380a1.

Requires: ed0302e83098d ("dm crypt: make workqueue names device-specific")
Cc: [email protected]
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=199857
Reported-by: Vito Caputo <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>

Merge tag 'mlx5-fixes-2019-11-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
Mellanox, mlx5 fixes 2019-11-20

This series introduces some fixes to mlx5 driver.

Please pull and let me know if there is any problem.

For -stable v4.9:
('net/mlx5e: Fix set vf link state error flow')

For -stable v4.14
('net/mlxfw: Verify FSM error code translation doesn't exceed array size')

For -stable v4.19
('net/mlx5: Fix auto group size calculation')

For -stable v5.3
('net/mlx5e: Fix error flow cleanup in mlx5e_tc_tun_create_header_ipv4/6')
('net/mlx5e: Do not use non-EXT link modes in EXT mode')
('net/mlx5: Update the list of the PCI supported devices')
====================

Signed-off-by: David S. Miller <[email protected]>

r8152: Re-order napi_disable in rtl8152_close

Both rtl_work_func_t() and rtl8152_close() call napi_disable().
Since the two calls aren't protected by a lock, if the close
function starts executing before the work function, we can get into a
situation where the napi_disable() function is called twice in
succession (first by rtl8152_close(), then by set_carrier()).

In such a situation, the second call would loop indefinitely, since
rtl8152_close() doesn't call napi_enable() to clear the NAPI_STATE_SCHED
bit.

The rtl8152_close() function in turn issues a
cancel_delayed_work_sync(), and so it would wait indefinitely for the
rtl_work_func_t() to complete. Since rtl8152_close() is called by a
process holding rtnl_lock() which is requested by other processes, this
eventually leads to a system deadlock and crash.

Re-order the napi_disable() call to occur after the work function
disabling and urb cancellation calls are issued.

Change-Id: I6ef0b703fc214998a037a68f722f784e1d07815e
Reported-by: http://crbug.com/1017928
Signed-off-by: Prashant Malani <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'qca_spi-fixes'

Stefan Wahren says:

====================
net: qca_spi: Fix receive and reset issues

This small patch series fixes two major issues in the SPI driver for the
QCA700x.

It has been tested on a Charge Control C 300 (NXP i.MX6ULL +
2x QCA7000).
====================

Signed-off-by: David S. Miller <[email protected]>

net: qca_spi: Move reset_count to struct qcaspi

The reset counter is specific for every QCA700x chip. So move this
into the private driver struct. Otherwise we get unpredictable reset
behavior in setups with multiple QCA700x chips.

Fixes: 291ab06ecf67 (net: qualcomm: new Ethernet over SPI driver for QCA7000)
Signed-off-by: Stefan Wahren <[email protected]>
Signed-off-by: Stefan Wahren <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: qca_spi: fix receive buffer size check

When receiving many or larger packets, e.g. when doing a file download,
it was observed that the read buffer size register reports up to 4 bytes
more than the current define allows in the check.
If this is the case, then no data transfer is initiated to receive the
packets (and thus to empty the buffer) which results in a stall of the
interface.

These 4 bytes are a hardware generated frame length which is prepended
to the actual frame, thus we have to respect it during our check.

Fixes: 026b907d58c4 ("net: qca_spi: Add available buffer space verification")
Signed-off-by: Michael Heimpold <[email protected]>
Signed-off-by: Stefan Wahren <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'ibmvnic-regression'

Juliet Kim says:

====================
Support both XIVE and XICS modes in ibmvnic

This series aims to support both XICS and XIVE with avoiding
a regression in behavior when a system runs in XICS mode.

Patch 1 reverts commit 11d49ce9f7946dfed4dcf5dbde865c78058b50ab
(“net/ibmvnic: Fix EOI when running in XIVE mode.”)

Patch 2 Ignore H_FUNCTION return from H_EOI to tolerate XIVE mode
====================

Signed-off-by: David S. Miller <[email protected]>

net/ibmvnic: Ignore H_FUNCTION return from H_EOI to tolerate XIVE mode

Reversion of commit 11d49ce9f7946dfed4dcf5dbde865c78058b50ab
(“net/ibmvnic: Fix EOI when running in XIVE mode.”) leaves us
calling H_EOI even in XIVE mode. That will fail with H_FUNCTION
because H_EOI is not supported in that mode. That failure is
harmless. Ignore it so we can use common code for both XICS and
XIVE.

Signed-off-by: Juliet Kim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Revert "net/ibmvnic: Fix EOI when running in XIVE mode"

This reverts commit 11d49ce9f7946dfed4dcf5dbde865c78058b50ab
(“net/ibmvnic: Fix EOI when running in XIVE mode.”) since that
has the unintended effect of changing the interrupt priority
and emits warning when running in legacy XICS mode.

Signed-off-by: Juliet Kim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlxfw: Verify FSM error code translation doesn't exceed array size

Array mlxfw_fsm_state_err_str contains value to string translation, when
values are provided by mlxfw_dev. If value is larger than
MLXFW_FSM_STATE_ERR_MAX, return "unknown error" as expected instead of
reading an address than exceed array size.

Fixes: 410ed13cae39 ("Add the mlxfw module for Mellanox firmware flash process")
Signed-off-by: Eran Ben Elisha <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: Update the list of the PCI supported devices

Add the upcoming ConnectX-6 LX device ID.

Fixes: 85327a9c4150 ("net/mlx5: Update the list of the PCI supported devices")
Signed-off-by: Shani Shapp <[email protected]>
Reviewed-by: Eran Ben Elisha <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: Fix auto group size calculation

Once all the large flow groups (defined by the user when the flow table
is created - max_num_groups) were created, then all the following new
flow groups will have only one flow table entry, even though the flow table
has place to larger groups.
Fix the condition to prefer large flow group.

Fixes: f0d22d187473 ("net/mlx5_core: Introduce flow steering autogrouped flow table")
Signed-off-by: Maor Gottlieb <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Add missing capability bit check for IP-in-IP

Device that doesn't support IP-in-IP offloads has to filter csum and gso
offload support, otherwise kernel will conclude that device is capable of
offloading csum and gso for IP-in-IP tunnels and that might result in
IP-in-IP tunnel not functioning.

Fixes: 25948b87dda2 ("net/mlx5e: Support TSO and TX checksum offloads for IP-in-IP")
Signed-off-by: Marina Varshaver <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Do not use non-EXT link modes in EXT mode

On some old Firmwares, connector type value was not supported, and value
read from FW was 0. For those, driver used link mode in order to set
connector type in link_ksetting.

After FW exposed the connector type, driver translated the value to ethtool
definitions. However, as 0 is a valid value, before returning PORT_OTHER,
driver run the check of link mode in order to maintain backward
compatibility.

Cited patch added support to EXT mode. With both features (connector type
and EXT link modes) ,if connector_type read from FW is 0 and EXT mode is
set, driver mistakenly compare EXT link modes to non-EXT link mode.
Fixed that by skipping this comparison if we are in EXT mode, as connector
type value is valid in this scenario.

Fixes: 6a897372417e ("net/mlx5: ethtool, Add ethtool support for 50Gbps per lane link modes")
Signed-off-by: Eran Ben Elisha <[email protected]>
Reviewed-by: Aya Levin <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Fix set vf link state error flow

Before this commit the ndo always returned success.
Fix that.

Fixes: 1ab2068a4c66 ("net/mlx5: Implement vports admin state backup/restore")
Signed-off-by: Roi Dayan <[email protected]>
Reviewed-by: Vlad Buslov <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: DR, Limit STE hash table enlarge based on bytemask

When an ste hash table has too many collision we enlarge it
to a bigger hash table (rehash). Rehashing collision improvement
depends on the bytemask value. The more 1 bits we have in bytemask
means better spreading in the table.

Without this fix tables can grow in size without providing any
improvement which can lead to memory depletion and failures.

This patch will limit table rehash to reduce memory and improve
the performance.

Fixes: 41d07074154c ("net/mlx5: DR, Expose steering rule functionality")
Signed-off-by: Alex Vesker <[email protected]>
Reviewed-by: Erez Shitrit <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: DR, Skip rehash for tables with byte mask zero

The byte mask fields affect on the hash index distribution,
when the byte mask is zero, the hash calculation will always
be equal to the same index.

To avoid unneeded rehash of hash tables mark the table to skip
rehash.

This is needed by the next patch which will limit table rehash
to reduce memory consumption.

Fixes: 41d07074154c ("net/mlx5: DR, Expose steering rule functionality")
Signed-off-by: Alex Vesker <[email protected]>
Reviewed-by: Erez Shitrit <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: DR, Fix invalid EQ vector number on CQ creation

When creating a CQ, the CPU id is used for the vector value.
This would fail in-case the CPU id was higher than the maximum
vector value.

Fixes: 297cccebdc5a ("net/mlx5: DR, Expose an internal API to issue RDMA operations")
Signed-off-by: Alex Vesker <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Reviewed-by: Erez Shitrit <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Reorder mirrer action parsing to check for encap first

Mirred action parsing code in parse_tc_fdb_actions() first checks if
out_dev has same parent id, and only verifies that there is a pending encap
action that was parsed before. Recent change in vxlan module made function
netdev_port_same_parent_id() to return true when called for mlx5 eswitch
representor and vxlan device created explicitly on mlx5 representor
device (vxlan devices created with "external" flag without explicitly
specifying parent interface are not affected). With call to
netdev_port_same_parent_id() returning true, incorrect code path is chosen
and encap rules fail to offload because vxlan dev is not a valid eswitch
forwarding dev. Dmesg log of error:

[ 1784.389797] devices ens1f0_0 vxlan1 not on same switch HW, can't offload forwarding

In order to fix the issue, rearrange conditional in parse_tc_fdb_actions()
to check for pending encap action before checking if out_dev has the same
parent id.

Fixes: 0ce1822c2a08 ("vxlan: add adjacent link to limit depth level")
Signed-off-by: Vlad Buslov <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Fix ingress rate configuration for representors

Current code uses the old method of prio encoding in
flow_cls_common_offload. Fix to follow the changes introduced in
commit ef01adae0e43 ("net: sched: use major priority number as hardware priority").

Fixes: fcb64c0f5640 ("net/mlx5: E-Switch, add ingress rate support")
Signed-off-by: Eli Cohen <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Fix error flow cleanup in mlx5e_tc_tun_create_header_ipv4/6

Be sure to release the neighbour in case of failures after successful
route lookup.

Fixes: 101f4de9dd52 ("net/mlx5e: Move TC tunnel offloading code to separate source file")
Signed-off-by: Eli Cohen <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

Merge branch 's390-fixes'

Julian Wiedmann says:

====================
s390/qeth: fixes 2019-11-20

please apply two late qeth fixes to your net tree.

The first fixes a deadlock that can occur if a qeth device is set
offline while in the middle of processing deferred HW events.
The second patch converts the return value of an error path to
use -EIO, so that it can be passed back to userspace.
====================

Signed-off-by: David S. Miller <[email protected]>

s390/qeth: return proper errno on IO error

When propagating IO errors back to userspace, one error path in
qeth_irq() currently returns '1' instead of a proper errno.

Fixes: 54daaca7024d ("s390/qeth: cancel cmd on early error")
Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

s390/qeth: fix potential deadlock on workqueue flush

The L2 bridgeport code uses the coarse 'conf_mutex' for guarding access
to its configuration state.
This can result in a deadlock when qeth_l2_stop_card() - called under the
conf_mutex - blocks on flush_workqueue() to wait for the completion of
pending bridgeport workers. Such workers would also need to aquire
the conf_mutex, stalling indefinitely.

Introduce a lock that specifically guards the bridgeport configuration,
so that the workers no longer need the conf_mutex.
Wrapping qeth_l2_promisc_to_bridge() in this fine-grained lock then also
fixes a theoretical race against a concurrent qeth_bridge_port_role_store()
operation.

Fixes: c0a2e4d10d93 ("s390/qeth: conclude all event processing before offlining a card")
Signed-off-by: Julian Wiedmann <[email protected]>
Reviewed-by: Alexandra Winter <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv6/route: return if there is no fib_nh_gw_family

Previously we will return directly if (!rt || !rt->fib6_nh.fib_nh_gw_family)
in function rt6_probe(), but after commit cc3a86c802f0
("ipv6: Change rt6_probe to take a fib6_nh"), the logic changed to
return if there is fib_nh_gw_family.

Fixes: cc3a86c802f0 ("ipv6: Change rt6_probe to take a fib6_nh")
Signed-off-by: Hangbin Liu <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject

kobject_init_and_add takes reference even when it fails. This has
to be given up by the caller in error handling. Otherwise memory
allocated by kobject_init_and_add is never freed. Originally found
by Syzkaller:

BUG: memory leak
unreferenced object 0xffff8880679f8b08 (size 8):
  comm "netdev_register", pid 269, jiffies 4294693094 (age 12.132s)
  hex dump (first 8 bytes):
    72 78 2d 30 00 36 20 d4                          rx-0.6 .
  backtrace:
    [<000000008c93818e>] __kmalloc_track_caller+0x16e/0x290
    [<000000001f2e4e49>] kvasprintf+0xb1/0x140
    [<000000007f313394>] kvasprintf_const+0x56/0x160
    [<00000000aeca11c8>] kobject_set_name_vargs+0x5b/0x140
    [<0000000073a0367c>] kobject_init_and_add+0xd8/0x170
    [<0000000088838e4b>] net_rx_queue_update_kobjects+0x152/0x560
    [<000000006be5f104>] netdev_register_kobject+0x210/0x380
    [<00000000e31dab9d>] register_netdevice+0xa1b/0xf00
    [<00000000f68b2465>] __tun_chr_ioctl+0x20d5/0x3dd0
    [<000000004c50599f>] tun_chr_ioctl+0x2f/0x40
    [<00000000bbd4c317>] do_vfs_ioctl+0x1c7/0x1510
    [<00000000d4c59e8f>] ksys_ioctl+0x99/0xb0
    [<00000000946aea81>] __x64_sys_ioctl+0x78/0xb0
    [<0000000038d946e5>] do_syscall_64+0x16f/0x580
    [<00000000e0aa5d8f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [<00000000285b3d1a>] 0xffffffffffffffff

Cc: David Miller <[email protected]>
Cc: Lukas Bulwahn <[email protected]>
Signed-off-by: Jouni Hogander <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

arm64: uaccess: Remove uaccess_*_not_uao asm macros

It is safer and simpler to drop the uaccess assembly macros in favour of
inline C functions. Although this bloats the Image size slightly, it
aligns our user copy routines with '{get,put}_user()' and generally
makes the code a lot easier to reason about.

Cc: Catalin Marinas <[email protected]>
Reviewed-by: Mark Rutland <[email protected]>
Tested-by: Mark Rutland <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
[will: tweaked commit message and changed temporary variable names]
Signed-off-by: Will Deacon <[email protected]>

arm64: uaccess: Ensure PAN is re-enabled after unhandled uaccess fault

A number of our uaccess routines ('__arch_clear_user()' and
'__arch_copy_{in,from,to}_user()') fail to re-enable PAN if they
encounter an unhandled fault whilst accessing userspace.

For CPUs implementing both hardware PAN and UAO, this bug has no effect
when both extensions are in use by the kernel.

For CPUs implementing hardware PAN but not UAO, this means that a kernel
using hardware PAN may execute portions of code with PAN inadvertently
disabled, opening us up to potential security vulnerabilities that rely
on userspace access from within the kernel which would usually be
prevented by this mechanism. In other words, parts of the kernel run the
same way as they would on a CPU without PAN implemented/emulated at all.

For CPUs not implementing hardware PAN and instead relying on software
emulation via 'CONFIG_ARM64_SW_TTBR0_PAN=y', the impact is unfortunately
much worse. Calling 'schedule()' with software PAN disabled means that
the next task will execute in the kernel using the page-table and ASID
of the previous process even after 'switch_mm()', since the actual
hardware switch is deferred until return to userspace. At this point, or
if there is a intermediate call to 'uaccess_enable()', the page-table
and ASID of the new process are installed. Sadly, due to the changes
introduced by KPTI, this is not an atomic operation and there is a very
small window (two instructions) where the CPU is configured with the
page-table of the old task and the ASID of the new task; a speculative
access in this state is disastrous because it would corrupt the TLB
entries for the new task with mappings from the previous address space.

As Pavel explains:

  | I was able to reproduce memory corruption problem on Broadcom's SoC
  | ARMv8-A like this:
  |
  | Enable software perf-events with PERF_SAMPLE_CALLCHAIN so userland's
  | stack is accessed and copied.
  |
  | The test program performed the following on every CPU and forking
  | many processes:
  |
  | unsigned long *map = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE,
  |   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  | map[0] = getpid();
  | sched_yield();
  | if (map[0] != getpid()) {
  | fprintf(stderr, "Corruption detected!");
  | }
  | munmap(map, PAGE_SIZE);
  |
  | From time to time I was getting map[0] to contain pid for a
  | different process.

Ensure that PAN is re-enabled when returning after an unhandled user
fault from our uaccess routines.

Cc: Catalin Marinas <[email protected]>
Reviewed-by: Mark Rutland <[email protected]>
Tested-by: Mark Rutland <[email protected]>
Cc: <[email protected]>
Fixes: 338d4f49d6f7 ("arm64: kernel: Add support for Privileged Access Never")
Signed-off-by: Pavel Tatashin <[email protected]>
[will: rewrote commit message]
Signed-off-by: Will Deacon <[email protected]>

s390/cpumf: Adjust registration of s390 PMU device drivers

Linux-next commit titled "perf/core: Optimize perf_init_event()"
changed the semantics of PMU device driver registration.
It was done to speed up the lookup/handling of PMU device driver
specific events. It also enforces that only one PMU device
driver will be registered of type PERF_EVENT_RAW.

This change added these line in function perf_pmu_register():

  ...
  +       ret = idr_alloc(&pmu_idr, pmu, max, 0, GFP_KERNEL);
  +       if (ret < 0)
                goto free_pdc;
  +
  +       WARN_ON(type >= 0 && ret != type);

The warn_on generates a message. We have 3 PMU device drivers,
each registered as type PERF_TYPE_RAW.
The cf_diag device driver (arch/s390/kernel/perf_cpumf_cf_diag.c)
always hits the WARN_ON because it is the second PMU device driver
(after sampling device driver arch/s390/kernel/perf_cpumf_sf.c)
which is registered as type 4 (PERF_TYPE_RAW).
So when the sampling device driver is registered, ret has value 4.
When cf_diag device driver is registered with type 4,
ret has value of 5 and WARN_ON fires.

Adjust the PMU device drivers for s390 to support the new
semantics required by perf_pmu_register().

Signed-off-by: Thomas Richter <[email protected]>
Signed-off-by: Vasily Gorbik <[email protected]>

dm: Fix Kconfig indentation

Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
$ sed -e 's/^ /\t/' -i */Kconfig

Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>

KVM: nVMX: Assume TLB entries of L1 and L2 are tagged differently if L0 use EPT

Since commit 1313cc2bd8f6 ("kvm: mmu: Add guest_mode to kvm_mmu_page_role"),
guest_mode was added to mmu-role and therefore if L0 use EPT, it will
always run L1 and L2 with different EPTP. i.e. EPTP01!=EPTP02.

Because TLB entries are tagged with EP4TA, KVM can assume
TLB entries populated while running L2 are tagged differently
than TLB entries populated while running L1.

Therefore, update nested_has_guest_tlb_tag() to consider if
L0 use EPT instead of if L1 use EPT.

Reviewed-by: Joao Martins <[email protected]>
Reviewed-by: Krish Sadhukhan <[email protected]>
Signed-off-by: Liran Alon <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Unexport kvm_vcpu_reload_apic_access_page()

The function is only used in kvm.ko module.

Reviewed-by: Mark Kanda <[email protected]>
Signed-off-by: Liran Alon <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: add CR4_LA57 bit to nested CR4_FIXED1

When L1 guest uses 5-level paging, it fails vm-entry to L2 due to
invalid host-state. It needs to add CR4_LA57 bit to nested CR4_FIXED1
MSR.

Signed-off-by: Chenyi Qiang <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Reviewed-by: Liran Alon <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: Use semi-colon instead of comma for exit-handlers initialization

Reviewed-by: Mark Kanda <[email protected]>
Signed-off-by: Liran Alon <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Zero the IOAPIC scan request dest vCPUs bitmap

Not zeroing the bitmap used for identifying the destination vCPUs for an
IOAPIC scan request in fixed delivery mode could lead to waking up unwanted
vCPUs. This patch zeroes the vCPU bitmap before passing it to
kvm_bitmap_or_dest_vcpus(), which is responsible for setting the bitmap
with the bits corresponding to the destination vCPUs.

Fixes: 7ee30bc132c6("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs")
Signed-off-by: Nitesh Narayan Lal <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

s390/smp: fix physical to logical CPU map for SMT

If an SMT capable system is not IPL'ed from the first CPU the setup of
the physical to logical CPU mapping is broken: the IPL core gets CPU
number 0, but then the next core gets CPU number 1. Correct would be
that all SMT threads of CPU 0 get the subsequent logical CPU numbers.

This is important since a lot of code (like e.g. the CPU topology
code) assumes that CPU maps are setup like this. If the mapping is
broken the system will not IPL due to broken topology masks:

[    1.716341] BUG: arch topology broken
[    1.716342]      the SMT domain not a subset of the MC domain
[    1.716343] BUG: arch topology broken
[    1.716344]      the MC domain not a subset of the BOOK domain

This scenario can usually not happen since LPARs are always IPL'ed
from CPU 0 and also re-IPL is intiated from CPU 0. However older
kernels did initiate re-IPL on an arbitrary CPU. If therefore a re-IPL
from an old kernel into a new kernel is initiated this may lead to
crash.

Fix this by setting up the physical to logical CPU mapping correctly.

Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Vasily Gorbik <[email protected]>

s390/early: move access registers setup in C code

Reviewed-by: Heiko Carstens <[email protected]>
Signed-off-by: Vasily Gorbik <[email protected]>

s390/head64: remove unnecessary vdso_per_cpu_data setup

vdso_per_cpu_data lowcore value is only needed for fully functional
exception handlers, which are activated in setup_lowcore_dat_off. The
same function does init vdso_per_cpu_data via vdso_alloc_boot_cpu.

Reviewed-by: Heiko Carstens <[email protected]>
Signed-off-by: Vasily Gorbik <[email protected]>

s390/early: move control registers setup in C code

Reviewed-by: Heiko Carstens <[email protected]>
Signed-off-by: Vasily Gorbik <[email protected]>

s390/kasan: support memcpy_real with TRACE_IRQFLAGS

Currently if the kernel is built with CONFIG_TRACE_IRQFLAGS and KASAN
and used as crash kernel it crashes itself due to
trace_hardirqs_off/trace_hardirqs_on being called with DAT off. This
happens because trace_hardirqs_off/trace_hardirqs_on are instrumented and
kasan code tries to perform access to shadow memory to validate memory
accesses. Kasan shadow memory is populated with vmemmap, so all accesses
require DAT on.

memcpy_real could be called with DAT on or off (with kasan enabled DAT
is set even before early code is executed).

Make sure that trace_hardirqs_off/trace_hardirqs_on are called with DAT
on and only actual __memcpy_real is called with DAT off.

Also annotate __memcpy_real and _memcpy_real with __no_sanitize_address
to avoid further problems due to switching DAT off.

Reviewed-by: Philipp Rudo <[email protected]>
Signed-off-by: Vasily Gorbik <[email protected]>

s390/crypto: Fix unsigned variable compared with zero

s390_crypto_shash_parmsize() return type is int, it
should not be stored in a unsigned variable, which
compared with zero.

Reported-by: Hulk Robot <[email protected]>
Fixes: 3c2eb6b76cab ("s390/crypto: Support for SHA3 via CPACF (MSA6)")
Signed-off-by: YueHaibing <[email protected]>
Signed-off-by: Joerg Schmidbauer <[email protected]>
Signed-off-by: Vasily Gorbik <[email protected]>

fork: fix pidfd_poll()'s return type

pidfd_poll() is defined as returning 'unsigned int' but the
.poll method is declared as returning '__poll_t', a bitwise type.

Fix this by using the proper return type and using the EPOLL
constants instead of the POLL ones, as required for __poll_t.

Fixes: b53b0b9d9a61 ("pidfd: add polling support")
Cc: Joel Fernandes (Google) <[email protected]>
Cc: [email protected] # 5.3
Signed-off-by: Luc Van Oostenryck <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

PM: QoS: Invalidate frequency QoS requests after removal

Switching cpufreq drivers (or switching operation modes of the
intel_pstate driver from "active" to "passive" and vice versa)
does not work on some x86 systems with ACPI after commit
3000ce3c52f8 ("cpufreq: Use per-policy frequency QoS"), because
the ACPI _PPC and thermal code uses the same frequency QoS request
object for a given CPU every time a cpufreq driver is registered
and freq_qos_remove_request() does not invalidate the request after
removing it from its QoS list, so freq_qos_add_request() complains
and fails when that request is passed to it again.

Fix the issue by modifying freq_qos_remove_request() to clear the qos
and type fields of the frequency request pointed to by its argument
after removing it from its QoS list so as to invalidate it.

Fixes: 3000ce3c52f8 ("cpufreq: Use per-policy frequency QoS")
Reported-and-tested-by: Doug Smythies <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Viresh Kumar <[email protected]>

virtio_balloon: fix shrinker count

Instead of multiplying by page order, virtio balloon divided by page
order. The result is that it can return 0 if there are a bit less
than MAX_ORDER - 1 pages in use, and then shrinker scan won't be called.

Cc: [email protected]
Fixes: 71994620bb25 ("virtio_balloon: replace oom notifier with shrinker")
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>

virtio_balloon: fix shrinker scan number of pages

virtio_balloon_shrinker_scan should return number of system pages freed,
but because it's calling functions that deal with balloon pages, it gets
confused and sometimes returns the number of balloon pages.

It does not matter practically as the exact number isn't
used, but it seems better to be consistent in case someone
starts using this API.

Further, if we ever tried to iteratively leak pages as
virtio_balloon_shrinker_scan tries to do, we'd run into issues - this is
because freed_pages was accumulating total freed pages, but was also
subtracted on each iteration from pages_to_free, which can result in
either leaking less memory than we were supposed to free, or more if
pages_to_free underruns.

On a system with 4K pages we are lucky that we are never asked to leak
more than 128 pages while we can leak up to 256 at a time,
but it looks like a real issue for systems with page size != 4K.

Fixes: 71994620bb25 ("virtio_balloon: replace oom notifier with shrinker")
Reported-by: Khazhismel Kumykov <[email protected]>
Reviewed-by: Wei Wang <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

mdio_bus: Fix init if CONFIG_RESET_CONTROLLER=n

Commit 1d4639567d97 ("mdio_bus: Fix PTR_ERR applied after initialization
to constant") accidentally changed a check from -ENOTSUPP to -ENOSYS,
causing failures if reset controller support is not enabled.  E.g. on
r7s72100/rskrza1:

    sh-eth e8203000.ethernet: MDIO init failed: -524
    sh-eth: probe of e8203000.ethernet failed with error -524

Seen on r8a7740/armadillo, r7s72100/rskrza1, and r7s9210/rza2mevb.

Fixes: 1d4639567d97 ("mdio_bus: Fix PTR_ERR applied after initialization to constant")
Signed-off-by: Geert Uytterhoeven <[email protected]>
Cc: YueHaibing <[email protected]>
Cc: David S. Miller <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Revert "mdio_bus: fix mdio_register_device when RESET_CONTROLLER is disabled"

This reverts commit 075e238d12c21c8bde700d21fb48be7a3aa80194.

Going to go with Geert's fix instead, which also has a
correct Fixes tag.

Signed-off-by: David S. Miller <[email protected]>

net: hns3: fix a wrong reset interrupt status mask

According to hardware user manual, bits5~7 in register
HCLGE_MISC_VECTOR_INT_STS means reset interrupts status,
but HCLGE_RESET_INT_M is defined as bits0~2 now. So it
will make hclge_reset_err_handle() read the wrong reset
interrupt status.

This patch fixes this wrong bit mask.

Fixes: 2336f19d7892 ("net: hns3: check reset interrupt status when reset fails")
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: fec: fix clock count mis-match

pm_runtime_put_autosuspend in probe will call runtime suspend to
disable clks automatically if CONFIG_PM is defined. (If CONFIG_PM
is not defined, its implementation will be empty, then runtime
suspend will not be called.)

Therefore, we can call pm_runtime_get_sync to runtime resume it
first to enable clks, which matches the runtime suspend. (Only when
CONFIG_PM is defined, otherwise pm_runtime_get_sync will also be
empty, then runtime resume will not be called.)

Then it is fine to disable clks without causing clock count mis-match.

Fixes: c43eab3eddb4 ("net: fec: add missed clk_disable_unprepare in remove")
Signed-off-by: Chuhong Yuan <[email protected]>
Acked-by: Fugang Duan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/sched: act_pedit: fix WARN() in the traffic path

when configuring act_pedit rules, the number of keys is validated only on
addition of a new entry. This is not sufficient to avoid hitting a WARN()
in the traffic path: for example, it is possible to replace a valid entry
with a new one having 0 extended keys, thus causing splats in dmesg like:

pedit BUG: index 42
WARNING: CPU: 2 PID: 4054 at net/sched/act_pedit.c:410 tcf_pedit_act+0xc84/0x1200 [act_pedit]
[...]
RIP: 0010:tcf_pedit_act+0xc84/0x1200 [act_pedit]
Code: 89 fa 48 c1 ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e ac 00 00 00 48 8b 44 24 10 48 c7 c7 a0 c4 e4 c0 8b 70 18 e8 1c 30 95 ea <0f> 0b e9 a0 fa ff ff e8 00 03 f5 ea e9 14 f4 ff ff 48 89 58 40 e9
RSP: 0018:ffff888077c9f320 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffac2983a2
RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff888053927bec
RBP: dffffc0000000000 R08: ffffed100a726209 R09: ffffed100a726209
R10: 0000000000000001 R11: ffffed100a726208 R12: ffff88804beea780
R13: ffff888079a77400 R14: ffff88804beea780 R15: ffff888027ab2000
FS:  00007fdeec9bd740(0000) GS:ffff888053900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffdb3dfd000 CR3: 000000004adb4006 CR4: 00000000001606e0
Call Trace:
  tcf_action_exec+0x105/0x3f0
  tcf_classify+0xf2/0x410
  __dev_queue_xmit+0xcbf/0x2ae0
  ip_finish_output2+0x711/0x1fb0
  ip_output+0x1bf/0x4b0
  ip_send_skb+0x37/0xa0
  raw_sendmsg+0x180c/0x2430
  sock_sendmsg+0xdb/0x110
  __sys_sendto+0x257/0x2b0
  __x64_sys_sendto+0xdd/0x1b0
  do_syscall_64+0xa5/0x4e0
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fdeeb72e993
Code: 48 8b 0d e0 74 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 0d d6 2c 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 4b cc 00 00 48 89 04 24
RSP: 002b:00007ffdb3de8a18 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 000055c81972b700 RCX: 00007fdeeb72e993
RDX: 0000000000000040 RSI: 000055c81972b700 RDI: 0000000000000003
RBP: 00007ffdb3dea130 R08: 000055c819728510 R09: 0000000000000010
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040
R13: 000055c81972b6c0 R14: 000055c81972969c R15: 0000000000000080

Fix this moving the check on 'nkeys' earlier in tcf_pedit_init(), so that
attempts to install rules having 0 keys are always rejected with -EINVAL.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Davide Caratti <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phylink: fix link mode modification in PHY mode

Modifying the link settings via phylink_ethtool_ksettings_set() and
phylink_ethtool_set_pauseparam() didn't always work as intended for
PHY based setups, as calling phylink_mac_config() would result in the
unresolved configuration being committed to the MAC, rather than the
configuration with the speed and duplex setting.

This would work fine if the update caused the link to renegotiate,
but if no settings have changed, phylib won't trigger a renegotiation
cycle, and the MAC will be left incorrectly configured.

Avoid calling phylink_mac_config() unless we are using an inband mode
in phylink_ethtool_ksettings_set(), and use phy_set_asym_pause() as
introduced in 4.20 to set the PHY settings in
phylink_ethtool_set_pauseparam().

Signed-off-by: Russell King <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phylink: update documentation on create and destroy

Update the documentation on phylink's create and destroy functions to
explicitly state that the rtnl lock must not be held while calling
these.

Signed-off-by: Russell King <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

r8169: disable TSO on a single version of RTL8168c to fix performance

During performance testing, I found that one of my r8169 NICs suffered
a major performance loss, a 8168c model.

Running netperf's TCP_STREAM test didn't return the expected
throughput of > 900 Mb/s, but rather only about 22 Mb/s. Strange
enough, running the TCP_MAERTS and UDP_STREAM tests all returned with
throughput > 900 Mb/s, as did TCP_STREAM with the other r8169 NICs I can
test (either one of 8169s, 8168e, 8168f).

Bisecting turned up commit 93681cd7d94f83903cb3f0f95433d10c28a7e9a5,
"r8169: enable HW csum and TSO" as the culprit.

I added my 8168c version, RTL_GIGA_MAC_VER_22, to the code
special-casing the 8168evl as per the patch below. This fixed the
performance problem for me.

Fixes: 93681cd7d94f ("r8169: enable HW csum and TSO")
Signed-off-by: Corinna Vinschen <[email protected]>
Reviewed-by: Heiner Kallweit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

MAINTAINERS: forcedeth: Change Zhu Yanjun's email address

I prefer to use my personal email address for kernel related work.

Signed-off-by: Zhu Yanjun <[email protected]>
Acked-by: Rain River <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

taprio: don't reject same mqprio settings

The taprio qdisc allows to set mqprio setting but only once. In case
if mqprio settings are provided next time the error is returned as
it's not allowed to change traffic class mapping in-flignt and that
is normal. But if configuration is absolutely the same - no need to
return error. It allows to provide same command couple times,
changing only base time for instance, or changing only scheds maps,
but leaving mqprio setting w/o modification. It more corresponds the
message: "Changing the traffic mapping of a running schedule is not
supported", so reject mqprio if it's really changed.

Also corrected TC_BITMASK + 1 for consistency, as proposed.

Fixes: a3d43c0d56f1 ("taprio: Add support adding an admin schedule")
Reviewed-by: Vladimir Oltean <[email protected]>
Tested-by: Vladimir Oltean <[email protected]>
Acked-by: Vinicius Costa Gomes <[email protected]>
Signed-off-by: Ivan Khoronzhuk <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/tls: enable sk_msg redirect to tls socket egress

Bring back tls_sw_sendpage_locked. sk_msg redirection into a socket
with TLS_TX takes the following path:

  tcp_bpf_sendmsg_redir
    tcp_bpf_push_locked
      tcp_bpf_push
        kernel_sendpage_locked
          sock->ops->sendpage_locked

Also update the flags test in tls_sw_sendpage_locked to allow flag
MSG_NO_SHARED_FRAGS. bpf_tcp_sendmsg sets this.

Link: https://lore.kernel.org/netdev/CA+FuTSdaAawmZ2N8nfDDKu3XLpXBbMtcCT0q4FntDD2gn8ASUw@mail.gmail.com/T/#t
Link: https://github.com/wdebruij/kerneltools/commits/icept.2
Fixes: 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect through ULP")
Fixes: f3de19af0f5b ("Revert \"net/tls: remove unused function tls_sw_sendpage_locked\"")
Signed-off-by: Willem de Bruijn <[email protected]>
Acked-by: John Fastabend <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

afs: Fix missing timeout reset

In afs_wait_for_call_to_complete(), rather than immediately aborting an
operation if a signal occurs, the code attempts to wait for it to
complete, using a schedule timeout of 2*RTT (or min 2 jiffies) and a
check that we're still receiving relevant packets from the server before
we consider aborting the call. We may even ping the server to check on
the status of the call.

However, there's a missing timeout reset in the event that we do
actually get a packet to process, such that if we then get a couple of
short stalls, we then time out when progress is actually being made.

Fix this by resetting the timeout any time we get something to process.
If it's the failure of the call then the call state will get changed and
we'll exit the loop shortly thereafter.

A symptom of this is data fetches and stores failing with EINTR when
they really shouldn't.

Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Marc Dionne <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

gve: fix dma sync bug where not all pages synced

The previous commit had a bug where the last page in the memory range
could not be synced. This change fixes the behavior so that all the
required pages are synced.

Fixes: 9cfeeb576d49 ("gve: Fixes DMA synchronization")
Signed-off-by: Adi Suresh <[email protected]>
Reviewed-by: Catherine Sullivan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drm/i915: make pool objects read-only

For our current users we don't expect pool objects to be writable from
the gpu.

Signed-off-by: Matthew Auld <[email protected]>
Cc: Chris Wilson <[email protected]>
Fixes: 4f7af1948abc ("drm/i915: Support ro ppgtt mapped cmdparser shadow buffers")
Reviewed-by: Chris Wilson <[email protected]>
Signed-off-by: Chris Wilson <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit d18580b08b92ec4105eb0ede2d676e8b1f5a66c3)
Signed-off-by: Rodrigo Vivi <[email protected]>

mdio_bus: Fix init if CONFIG_RESET_CONTROLLER=n

Commit 1d4639567d97 ("mdio_bus: Fix PTR_ERR applied after initialization
to constant") accidentally changed a check from -ENOTSUPP to -ENOSYS,
causing failures if reset controller support is not enabled.  E.g. on
r7s72100/rskrza1:

    sh-eth e8203000.ethernet: MDIO init failed: -524
    sh-eth: probe of e8203000.ethernet failed with error -524

Seen on r8a7740/armadillo, r7s72100/rskrza1, and r7s9210/rza2mevb.

Fixes: 1d4639567d97 ("mdio_bus: Fix PTR_ERR applied after initialization to constant")
Signed-off-by: Geert Uytterhoeven <[email protected]>
Cc: YueHaibing <[email protected]>
Cc: David S. Miller <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

nbd:fix memory leak in nbd_get_socket()

Before returning NULL, put the sock first.

Cc: [email protected]
Fixes: cf1b2326b734 ("nbd: verify socket is supported during setup")
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Mike Christie <[email protected]>
Signed-off-by: Sun Ke <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

virtio_console: allocate inbufs in add_port() only if it is needed

When we hot unplug a virtserialport and then try to hot plug again,
it fails:

(qemu) chardev-add socket,id=serial0,path=/tmp/serial0,server,nowait
(qemu) device_add virtserialport,bus=virtio-serial0.0,nr=2,\
                  chardev=serial0,id=serial0,name=serial0
(qemu) device_del serial0
(qemu) device_add virtserialport,bus=virtio-serial0.0,nr=2,\
                  chardev=serial0,id=serial0,name=serial0
kernel error:
  virtio-ports vport2p2: Error allocating inbufs
qemu error:
  virtio-serial-bus: Guest failure in adding port 2 for device \
                     virtio-serial0.0

This happens because buffers for the in_vq are allocated when the port is
added but are not released when the port is unplugged.

They are only released when virtconsole is removed (see a7a69ec0d8e4)

To avoid the problem and to be symmetric, we could allocate all the buffers
in init_vqs() as they are released in remove_vqs(), but it sounds like
a waste of memory.

Rather than that, this patch changes add_port() logic to ignore ENOSPC
error in fill_queue(), which means queue has already been filled.

Fixes: a7a69ec0d8e4 ("virtio_console: free buffers after reset")
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Laurent Vivier <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

virtio_ring: fix return code on DMA mapping fails

Commit 780bc7903a32 ("virtio_ring: Support DMA APIs") makes
virtqueue_add() return -EIO when we fail to map our I/O buffers. This is
a very realistic scenario for guests with encrypted memory, as swiotlb
may run out of space, depending on it's size and the I/O load.

The virtio-blk driver interprets -EIO form virtqueue_add() as an IO
error, despite the fact that swiotlb full is in absence of bugs a
recoverable condition.

Let us change the return code to -ENOMEM, and make the block layer
recover form these failures when virtio-blk encounters the condition
described above.

Cc: [email protected]
Fixes: 780bc7903a32 ("virtio_ring: Support DMA APIs")
Signed-off-by: Halil Pasic <[email protected]>
Tested-by: Michael Mueller <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

mdio_bus: fix mdio_register_device when RESET_CONTROLLER is disabled

When CONFIG_RESET_CONTROLLER is disabled, the
devm_reset_control_get_exclusive function returns -ENOTSUPP. This is not
handled in subsequent check and then the mdio device fails to probe.

When CONFIG_RESET_CONTROLLER is enabled, its code checks in OF for reset
device, and since it is not present, returns -ENOENT. -ENOENT is handled.
Add -ENOTSUPP also.

This happened to me when upgrading kernel on Turris Omnia. You either
have to enable CONFIG_RESET_CONTROLLER or use this patch.

Signed-off-by: Marek Behún <[email protected]>
Fixes: 71dd6c0dff51b ("net: phy: add support for reset-controller")
Cc: Dmitry Torokhov <[email protected]>
Cc: Andrew Lunn <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/ipv4: fix sysctl max for fib_multipath_hash_policy

Commit eec4844fae7c ("proc/sysctl: add shared variables for range
check") did:
- .extra2 = &two,
+ .extra2 = SYSCTL_ONE,
here, which doesn't seem to be intentional, given the changelog.
This patch restores it to the previous, as the value of 2 still makes
sense (used in fib_multipath_hash()).

Fixes: eec4844fae7c ("proc/sysctl: add shared variables for range check")
Cc: Matteo Croce <[email protected]>
Signed-off-by: Marcelo Ricardo Leitner <[email protected]>
Acked-by: Matteo Croce <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

phy: mdio-sun4i: add missed regulator_disable in remove

The driver forgets to disable the regulator in remove like what is done
in probe failure.
Add the missed call to fix it.

Signed-off-by: Chuhong Yuan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>