Git Repo - linux.git/log

tcp: Fix SOF_TIMESTAMPING_RX_HARDWARE to use the latest timestamp during TCP coalescing

During tcp coalescing ensure that the skb hardware timestamp refers to the
highest sequence number data.
Previously only the software timestamp was updated during coalescing.

Signed-off-by: Stephen Mallon <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tg3: Add PHY reset for 5717/5719/5720 in change ring and flow control paths

This patch has the fix to avoid PHY lockup with 5717/5719/5720 in change
ring and flow control paths. This patch solves the RX hang while doing
continuous ring or flow control parameters with heavy traffic from peer.

Signed-off-by: Siva Reddy Kallam <[email protected]>
Acked-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Documentation/security-bugs: Postpone fix publication in exceptional cases

At the request of the reporter, the Linux kernel security team offers to
postpone the publishing of a fix for up to 5 business days from the date
of a report.

While it is generally undesirable to keep a fix private after it has
been developed, this short window is intended to allow distributions to
package the fix into their kernel builds and permits early inclusion of
the security team in the case of a co-ordinated disclosure with other
parties. Unfortunately, discussions with major Linux distributions and
cloud providers has revealed that 5 business days is not sufficient to
achieve either of these two goals.

As an example, cloud providers need to roll out KVM security fixes to a
global fleet of hosts with sufficient early ramp-up and monitoring. An
end-to-end timeline of less than two weeks dramatically cuts into the
amount of early validation and increases the chance of guest-visible
regressions.

The consequence of this timeline mismatch is that security issues are
commonly fixed without the involvement of the Linux kernel security team
and are instead analysed and addressed by an ad-hoc group of developers
across companies contributing to Linux. In some cases, mainline (and
therefore the official stable kernels) can be left to languish for
extended periods of time. This undermines the Linux kernel security
process and puts upstream developers in a difficult position should they
find themselves involved with an undisclosed security problem that they
are unable to report due to restrictions from their employer.

To accommodate the needs of these users of the Linux kernel and
encourage them to engage with the Linux security team when security
issues are first uncovered, extend the maximum period for which fixes
may be delayed to 7 calendar days, or 14 calendar days in exceptional
cases, where the logistics of QA and large scale rollouts specifically
need to be accommodated. This brings parity with the linux-distros@
maximum embargo period of 14 calendar days.

Cc: Paolo Bonzini <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Amit Shah <[email protected]>
Cc: Laura Abbott <[email protected]>
Acked-by: Kees Cook <[email protected]>
Co-developed-by: Thomas Gleixner <[email protected]>
Co-developed-by: David Woodhouse <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Reviewed-by: Tyler Hicks <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

MAINTAINERS: Add Sasha as a stable branch maintainer

Sasha has somehow been convinced into helping me with the stable kernel
maintenance. Codify this slip in good judgement before he realizes what
he really signed up for :)

Signed-off-by: Greg Kroah-Hartman <[email protected]>
Acked-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Merge tag 'media/v4.20-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media fixes from Mauro Carvalho Chehab:

- add a missing include at v4l2-controls uAPI header

- minor kAPI update for the request API

- some fixes at CEC core

- use a lower minimum height for the virtual codec driver

- cleanup a gcc warning due to the lack of a fall though markup

- tc358743: Remove unnecessary self assignment

- fix the V4L event subscription logic

- docs: Document metadata format in struct v4l2_format

- omap3isp and ipu3-cio2: fix unbinding logic

* tag 'media/v4.20-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  media: ipu3-cio2: Use cio2_queues_exit
  media: ipu3-cio2: Unregister device nodes first, then release resources
  media: omap3isp: Unregister media device as first
  media: docs: Document metadata format in struct v4l2_format
  media: v4l: event: Add subscription to list before calling "add" operation
  media: dm365_ipipeif: better annotate a fall though
  media: Rename vb2_m2m_request_queue -> v4l2_m2m_request_queue
  media: cec: increase debug level for 'queue full'
  media: cec: check for non-OK/NACK conditions while claiming a LA
  media: vicodec: lower minimum height to 360
  media: tc358743: Remove unnecessary self assignment
  media: v4l: fix uapi mpeg slice params definition
  v4l2-controls: add a missing include

mtd: spi-nor: fix selection of uniform erase type in flexible conf

There are uniform, non-uniform and flexible erase flash configurations.

The non-uniform erase types, are the erase types that can _not_ erase
the entire flash by their own.

As the code was, in case flashes had flexible erase capabilities
(support both uniform and non-uniform erase types in the same flash
configuration) and supported multiple uniform erase type sizes, the
code did not sort the uniform erase types, and could select a wrong
erase type size.

Sort the uniform erase mask in case of flexible erase flash
configurations, in order to select the best uniform erase type size.

Uniform, non-uniform, and flexible configurations with just a valid
uniform erase type, are not affected by this change.

Uniform erase tested on mx25l3273fm2i-08g and sst26vf064B-104i/sn.
Non uniform erase tested on sst26vf064B-104i/sn.

Fixes: 5390a8df769e ("mtd: spi-nor: add support to non-uniform SFDP SPI NOR flash memories")
Signed-off-by: Tudor Ambarus <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>

RISC-V: recognize S/U mode bits in print_isa

Removes the warning about an unsupported ISA when reading /proc/cpuinfo
on QEMU. The "S" extension is not being returned as it is not accessible
from userspace.

Signed-off-by: Patrick Stählin <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: add asm/unistd.h UAPI header

Marcin Juszkiewicz reported issues while generating syscall table for riscv
using 4.20-rc1. The patch refactors our unistd.h files to match some other
architectures.

- Add asm/unistd.h UAPI header, which has __ARCH_WANT_NEW_STAT only for 64-bit
- Remove asm/syscalls.h UAPI header and merge to asm/unistd.h
- Adjust kernel asm/unistd.h

So now asm/unistd.h UAPI header should show all syscalls for riscv.

Before this, Makefile simply put `#include <asm-generic/unistd.h>` into
generated asm/unistd.h UAPI header thus user didn't see:

- __NR_riscv_flush_icache
- __NR_newfstatat
- __NR_fstat

which are supported by riscv kernel.

Signed-off-by: David Abdurachmanov <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Marcin Juszkiewicz <[email protected]>
Cc: Guenter Roeck <[email protected]>
Fixes: 67314ec7b025 ("RISC-V: Request newstat syscalls")
Signed-off-by: David Abdurachmanov <[email protected]>
Acked-by: Olof Johansson <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: fix warning in arch/riscv/include/asm/module.h

Fixes warning: 'struct module' declared inside parameter list will not be
visible outside of this definition or declaration

Signed-off-by: David Abdurachmanov <[email protected]>
Acked-by: Olof Johansson <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

RISC-V: Build flat and compressed kernel images

This patch extends Linux RISC-V build system to build and install:
Image - Flat uncompressed kernel image
Image.gz - Flat and GZip compressed kernel image

Quiet a few bootloaders (such as Uboot, UEFI, etc) are capable of
booting flat and compressed kernel images. In case of Uboot, booting
Image or Image.gz is achieved using bootm command.

The flat and uncompressed kernel image (i.e. Image) is very useful
in pre-silicon developent and testing because we can create back-door
HEX files for RAM on FPGAs from Image.

Signed-off-by: Anup Patel <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

RISC-V: Fix raw_copy_{to,from}_user()

Sparse highlighted it, and appears to be a pure bug (from vs to).

./arch/riscv/include/asm/uaccess.h:403:35: warning: incorrect type in argument 1 (different address spaces)
./arch/riscv/include/asm/uaccess.h:403:39: warning: incorrect type in argument 2 (different address spaces)
./arch/riscv/include/asm/uaccess.h:409:37: warning: incorrect type in argument 1 (different address spaces)
./arch/riscv/include/asm/uaccess.h:409:41: warning: incorrect type in argument 2 (different address spaces)

Signed-off-by: Olof Johansson <[email protected]>
Cc: [email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

HID: Add quirk for Primax PIXART OEM mice

The PixArt OEM mice are known for disconnecting every minute in
runlevel 1 or 3 if they are not always polled. So add quirk
ALWAYS_POLL for two Primax mice as well.

0x4e22 is the Dell MS111-P and 0x4d0f is the unbranded HP Portia
mouse HP 697738-001. Both were built until approx. 2014.
Those were the standard mice from those vendors and are still
around - even as new old stock.

Reference: https://github.com/sriemer/fix-linux-mouse/issues/11

Signed-off-by: Sebastian Parschauer <[email protected]>
CC: [email protected]
Signed-off-by: Jiri Kosina <[email protected]>

usb: cdc-acm: add entry for Hiro (Conexant) modem

The cdc-acm kernel module currently does not support the Hiro (Conexant)
H05228 USB modem. The patch below adds the device specific information:
idVendor 0x0572
idProduct 0x1349

Signed-off-by: Maarten Jacobs <[email protected]>
Acked-by: Oliver Neukum <[email protected]>
Cc: stable <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

drm/i915: Prevent machine hang from Broxton's vtd w/a and error capture

Since capturing the error state requires fiddling around with the GGTT
to read arbitrary buffers and is itself run under stop_machine(), it
deadlocks the machine (effectively a hard hang) when run in conjunction
with Broxton's VTd workaround to serialize GGTT access.

v2: Store the ERR_PTR in first_error so that the error can be reported
to the user via sysfs.
v3: Mention the quirk in dmesg (using info as per usual)

Fixes: 0ef34ad6222a ("drm/i915: Serialize GTT/Aperture accesses on BXT")
Signed-off-by: Chris Wilson <[email protected]>
Cc: Jon Bloomfield <[email protected]>
Cc: John Harrison <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Cc: Daniel Vetter <[email protected]>
Reviewed-by: Joonas Lahtinen <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit fb6f0b64e455b207a636346588e65bf9598d30eb)
Signed-off-by: Joonas Lahtinen <[email protected]>

Merge tag 'mlx5-fixes-2018-11-19' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
Mellanox, mlx5 fixes 2018-11-19

The following fixes are for mlx5 core and netdev driver.

For -stable v4.16
bc7fda7d4637 ('net/mlx5e: IPoIB, Reset QP after channels are closed')

For -stable v4.17
36917a270395 ('net/mlx5: IPSec, Fix the SA context hash key')

For -stable v4.18
6492a432be3a ('net/mlx5e: Always use the match level enum when parsing TC rule match')
c3f81be236b1 ('net/mlx5e: Removed unnecessary warnings in FEC caps query')
c5ce2e736b64 ('net/mlx5e: Fix selftest for small MTUs')

For -stable v4.19
effcd896b25e ('net/mlx5e: Adjust to max number of channles when re-attaching')
394cbc5acd68 ('net/mlx5e: RX, verify received packet size in Linear Striding RQ')
447cbb3613c8 ('net/mlx5e: Don't match on vlan non-existence if ethertype is wildcarded')
c223c1574612 ('net/mlx5e: Claim TC hw offloads support only under a proper build config')

Please pull and let me know if there's any problem.
====================

Signed-off-by: David S. Miller <[email protected]>

net/ibmnvic: Fix deadlock problem in reset

This patch changes to use rtnl_lock only during a reset to avoid
deadlock that could occur when a thread operating close is holding
rtnl_lock and waiting for reset_lock acquired by another thread,
which is waiting for rtnl_lock in order to set the number of tx/rx
queues during a reset.

Also, we now setting the number of tx/rx queues during a soft reset
for failover or LPM events.

Signed-off-by: Juliet Kim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'qed-Fix-Queue-Manager-getters'

Denis Bolotin says:

====================
qed: Fix Queue Manager getters

This patch series fixes various queue manager getter functions. It is
important to make sure the getter's caller will receive a valid queue even
in error case to prevent more serious bugs.
Please consider applying to net.
====================

Signed-off-by: David S. Miller <[email protected]>

qed: Fix QM getters to always return a valid pq

The getter callers doesn't know the valid Physical Queues (PQ) values.
This patch makes sure that a valid PQ will always be returned.

The patch consists of 3 fixes:

- When qed_init_qm_get_idx_from_flags() receives a disabled flag, it
   returned PQ 0, which can potentially be another function's pq. Verify
   that flag is enabled, otherwise return default start_pq.

- When qed_init_qm_get_idx_from_flags() receives an unknown flag, it
   returned NULL and could lead to a segmentation fault. Return default
   start_pq instead.

- A modulo operation was added to MCOS/VFS PQ getters to make sure the
   PQ returned is in range of the required flag.

Fixes: b5a9ee7cf3be ("qed: Revise QM cofiguration")
Signed-off-by: Denis Bolotin <[email protected]>
Signed-off-by: Michal Kalderon <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qed: Fix bitmap_weight() check

Fix the condition which verifies that only one flag is set. The API
bitmap_weight() should receive size in bits instead of bytes.

Fixes: b5a9ee7cf3be ("qed: Revise QM cofiguration")
Signed-off-by: Denis Bolotin <[email protected]>
Signed-off-by: Michal Kalderon <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

NFSv4: Fix a NFSv4 state manager deadlock

Fix a deadlock whereby the NFSv4 state manager can get stuck in the
delegation return code, waiting for a layout return to complete in
another thread. If the server reboots before that other thread
completes, then we need to be able to start a second state
manager thread in order to perform recovery.

Signed-off-by: Trond Myklebust <[email protected]>

net/mlx5e: Fix failing ethtool query on FEC query error

If FEC caps query fails when executing 'ethtool <interface>'
the whole callback fails unnecessarily, fixed that by replacing the
error return code with debug logging only.

Fixes: 6cfa94605091 ("net/mlx5e: Ethtool driver callback for query/set FEC policy")
Signed-off-by: Shay Agroskin <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Removed unnecessary warnings in FEC caps query

Querying interface FEC caps with 'ethtool [int]' after link reset
throws warning regading link speed.
This warning is not needed as there is already an indication in
user space that the link is not up.

Fixes: 0696d60853d5 ("net/mlx5e: Receive buffer configuration")
Signed-off-by: Shay Agroskin <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Fix wrong field name in FEC related functions

This bug would result in reading wrong FEC capabilities for 10G/40G.

Fixes: 2095b2641477 ("net/mlx5e: Add port FEC get/set functions")
Signed-off-by: Shay Agroskin <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Fix a bug in turning off FEC policy in unsupported speeds

Some speeds don't support turning FEC policy off. In case a requested
FEC policy is not supported for a speed (including current speed), its new
FEC policy would be:
no FEC - if disabling FEC is supported for that speed
unchanged - else

Fixes: 2095b2641477 ("net/mlx5e: Add port FEC get/set functions")
Signed-off-by: Shay Agroskin <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

Merge branch 'ena-hibernation-and-rmmod-bug-fixes'

Arthur Kiyanovski says:

====================
net: ena: hibernation and rmmod bug fixes

This patchset includes 2 bug fixes:
1. A fix to a crash during resume from hibernation.
2. A fix to an illegal memory access during driver removal (e.g. during rmmod)
which might cause a crash in certain systems.

The subminor number in the driver version is also promoted to indicate driver
was changed.
====================

Signed-off-by: David S. Miller <[email protected]>

net: ena: update driver version from 2.0.1 to 2.0.2

Update driver version due to critical bug fixes.

Signed-off-by: Arthur Kiyanovski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ena: fix crash during ena_remove()

In ena_remove() we have the following stack call:
ena_remove()
  unregister_netdev()
  ena_destroy_device()
    netif_carrier_off()

Calling netif_carrier_off() causes linkwatch to try to handle the
link change event on the already unregistered netdev, which leads
to a read from an unreadable memory address.

This patch switches the order of the two functions, so that
netif_carrier_off() is called on a regiestered netdev.

To accomplish this fix we also had to:
1. Remove the set bit ENA_FLAG_TRIGGER_RESET
2. Add a sanitiy check in ena_close()
both to prevent double device reset (when calling unregister_netdev()
ena_close is called, but the device was already deleted in
ena_destroy_device()).
3. Set the admin_queue running state to false to avoid using it after
device was reset (for example when calling ena_destroy_all_io_queues()
right after ena_com_dev_reset() in ena_down)

Fixes: 944b28aa2982 ("net: ena: fix missing lock during device destruction")
Signed-off-by: Arthur Kiyanovski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ena: fix crash during failed resume from hibernation

During resume from hibernation if ena_restore_device fails,
ena_com_dev_reset() is called, and uses the readless read mechanism,
which was already destroyed by the call to
ena_com_mmio_reg_read_request_destroy(). This causes a NULL pointer
reference.

In this commit we switch the call order of the above two functions
to avoid this crash.

Fixes: d7703ddbd7c9 ("net: ena: fix rare bug when failed restart/resume is followed by driver removal")
Signed-off-by: Arthur Kiyanovski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: not increase stream's incnt before sending addstrm_in request

Different from processing the addstrm_out request, The receiver handles
an addstrm_in request by sending back an addstrm_out request to the
sender who will increase its stream's in and incnt later.

Now stream->incnt has been increased since it sent out the addstrm_in
request in sctp_send_add_streams(), with the wrong stream->incnt will
even cause crash when copying stream info from the old stream's in to
the new one's in sctp_process_strreset_addstrm_out().

This patch is to fix it by simply removing the stream->incnt change
from sctp_send_add_streams().

Fixes: 242bd2d519d7 ("sctp: implement sender-side procedures for Add Incoming/Outgoing Streams Request Parameter")
Reported-by: Jianwen Ji <[email protected]>
Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5e: Fix selftest for small MTUs

Loopback test had fixed packet size, which can be bigger than configured
MTU. Shorten the loopback packet size to be bigger than minimal MTU
allowed by the device. Text field removed from struct 'mlx5ehdr'
as redundant to allow send small packets as minimal allowed MTU.

Fixes: d605d66 ("net/mlx5e: Add support for ethtool self diagnostics test")
Signed-off-by: Valentine Fatiev <[email protected]>
Reviewed-by: Eran Ben Elisha <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: RX, verify received packet size in Linear Striding RQ

In case of striding RQ, we use MPWRQ (Multi Packet WQE RQ), which means
that WQE (RX descriptor) can be used for many packets and so the WQE is
much bigger than MTU. In virtualization setups where the port mtu can
be larger than the vf mtu, if received packet is bigger than MTU, it
won't be dropped by HW on too small receive WQE. If we use linear SKB in
striding RQ, since each stride has room for mtu size payload and skb
info, an oversized packet can lead to crash for crossing allocated page
boundary upon the call to build_skb. So driver needs to check packet
size and drop it.

Introduce new SW rx counter, rx_oversize_pkts_sw_drop, which counts the
number of packets dropped by the driver for being too large.

As a new field is added to the RQ struct, re-open the channels whenever
this field is being used in datapath (i.e., in the case of linear
Striding RQ).

Fixes: 619a8f2a42f1 ("net/mlx5e: Use linear SKB in Striding RQ")
Signed-off-by: Moshe Shemesh <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Apply the correct check for supporting TC esw rules split

The mirror and not the output count is the one denoting a split.
Fix to condition the offload attempt on the mirror count being > 0
along the firmware to have the related capability.

Fixes: 592d36515969 ("net/mlx5e: Parse mirroring action for offloaded TC eswitch flows")
Signed-off-by: Roi Dayan <[email protected]>
Reviewed-by: Yossi Kuperman <[email protected]>
Reviewed-by: Chris Mi <[email protected]>
Acked-by: Or Gerlitz <[email protected]>
Reviewed-by: Or Gerlitz <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Adjust to max number of channles when re-attaching

When core driver enters deattach/attach flow after pci reset,
Number of logical CPUs may have changed.
As a result we need to update the cpu affiliated resource tables.
1. indirect rqt list
2. eq table

Reproduction (PowerPC):
echo 1000 > /sys/kernel/debug/powerpc/eeh_max_freezes
ppc64_cpu --smt=on
# Restart driver
modprobe -r ... ; modprobe ...
# Link up
ifconfig ...
# Only physical CPUs
ppc64_cpu --smt=off
# Inject PCI errors so PCI will reset - calling the pci error handler
echo 0x8000000000000000 > /sys/kernel/debug/powerpc/<PCI BUS>/err_injct_inboundA

Call trace when trying to add non-existing rqs to an indirect rqt:
mlx5e_redirect_rqt+0x84/0x260 [mlx5_core] (unreliable)
mlx5e_redirect_rqts+0x188/0x190 [mlx5_core]
mlx5e_activate_priv_channels+0x488/0x570 [mlx5_core]
mlx5e_open_locked+0xbc/0x140 [mlx5_core]
mlx5e_open+0x50/0x130 [mlx5_core]
mlx5e_nic_enable+0x174/0x1b0 [mlx5_core]
mlx5e_attach_netdev+0x154/0x290 [mlx5_core]
mlx5e_attach+0x88/0xd0 [mlx5_core]
mlx5_attach_device+0x168/0x1e0 [mlx5_core]
mlx5_load_one+0x1140/0x1210 [mlx5_core]
mlx5_pci_resume+0x6c/0xf0 [mlx5_core]

Create cq will fail when trying to use non-existing EQ.

Fixes: 89d44f0a6c73 ("net/mlx5_core: Add pci error handlers to mlx5_core driver")
Signed-off-by: Yuval Avnery <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Always use the match level enum when parsing TC rule match

We get the match level (none, l2, l3, l4) while going over the match
dissectors of an offloaded tc rule. When doing this, the match level
enum and the not min inline enum values should be used, fix that.

This worked accidentally b/c both enums have the same numerical values.

Fixes: d708f902989b ('net/mlx5e: Get the required HW match level while parsing TC flow matches')
Signed-off-by: Or Gerlitz <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Claim TC hw offloads support only under a proper build config

Currently, we are only supporting tc hw offloads when the eswitch
support is compiled in, but we are not gating the adevertizment
of the NETIF_F_HW_TC feature on this config being set.

Fix it, and while doing that, also avoid dealing with the feature
on ethtool when the config is not set.

Fixes: e8f887ac6a45 ('net/mlx5e: Introduce tc offload support')
Signed-off-by: Or Gerlitz <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Don't match on vlan non-existence if ethertype is wildcarded

For the "all" ethertype we should not care whether the packet has
vlans. Besides being wrong, the way we did it caused FW error
for rules such as:

tc filter add dev eth0 protocol all parent ffff: \
prio 1 flower skip_sw action drop

b/c the matching meta-data (outer headers bit in struct mlx5_flow_spec)
wasn't set. Fix that by matching on vlan non-existence only if we were
also told to match on the ethertype.

Fixes: cee26487620b ('net/mlx5e: Set vlan masks for all offloaded TC rules')
Signed-off-by: Or Gerlitz <[email protected]>
Reported-by: Slava Ovsiienko <[email protected]>
Reviewed-by: Jianbo Liu <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: IPoIB, Reset QP after channels are closed

The mlx5e channels should be closed before mlx5i_uninit_underlay_qp
puts the QP into RST (reset) state during mlx5i_close. Currently QP
state incorrectly set to RST before channels got deactivated and closed,
since mlx5_post_send request expects QP in RTS (Ready To Send) state.

The fix is to keep QP in RTS state until mlx5e channels get closed
and to reset QP afterwards.

Also this fix is simply correct in order to keep the open/close flow
symmetric, i.e mlx5i_init_underlay_qp() is called first thing at open,
the correct thing to do is to call mlx5i_uninit_underlay_qp() last thing
at close, which is exactly what this patch is doing.

Fixes: dae37456c8ac ("net/mlx5: Support for attaching multiple underlay QPs to root flow table")
Signed-off-by: Denis Drozdov <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: IPSec, Fix the SA context hash key

The commit "net/mlx5: Refactor accel IPSec code" introduced a
bug where asynchronous short time change in hash key value
by create/release SA context might happen during an asynchronous
hash resize operation this could cause a subsequent remove SA
context operation to fail as the key value used during resize is
not the same key value used when remove SA context operation is
invoked.

This commit fixes the bug by defining the SA context hash key
such that it includes only fields that never change during the
lifetime of the SA context object.

Fixes: d6c4f0298cec ("net/mlx5: Refactor accel IPSec code")
Signed-off-by: Raed Salem <[email protected]>
Reviewed-by: Aviad Yehezkel <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

xfs: make xfs_file_remap_range() static

xfs_file_remap_range() is only used in fs/xfs/xfs_file.c, so make it
static.

This addresses a gcc warning when -Wmissing-prototypes is enabled.

Signed-off-by: Eric Biggers <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: fix shared extent data corruption due to missing cow reservation

Page writeback indirectly handles shared extents via the existence
of overlapping COW fork blocks. If COW fork blocks exist, writeback
always performs the associated copy-on-write regardless if the
underlying blocks are actually shared. If the blocks are shared,
then overlapping COW fork blocks must always exist.

fstests shared/010 reproduces a case where a buffered write occurs
over a shared block without performing the requisite COW fork
reservation. This ultimately causes writeback to the shared extent
and data corruption that is detected across md5 checks of the
filesystem across a mount cycle.

The problem occurs when a buffered write lands over a shared extent
that crosses an extent size hint boundary and that also happens to
have a partial COW reservation that doesn't cover the start and end
blocks of the data fork extent.

For example, a buffered write occurs across the file offset (in FSB
units) range of [29, 57]. A shared extent exists at blocks [29, 35]
and COW reservation already exists at blocks [32, 34]. After
accommodating a COW extent size hint of 32 blocks and the existing
reservation at offset 32, xfs_reflink_reserve_cow() allocates 32
blocks of reservation at offset 0 and returns with COW reservation
across the range of [0, 34]. The associated data fork extent is
still [29, 35], however, which isn't fully covered by the COW
reservation.

This leads to a buffered write at file offset 35 over a shared
extent without associated COW reservation. Writeback eventually
kicks in, performs an overwrite of the underlying shared block and
causes the associated data corruption.

Update xfs_reflink_reserve_cow() to accommodate the fact that a
delalloc allocation request may not fully cover the extent in the
data fork. Trim the data fork extent appropriately, just as is done
for shared extent boundaries and/or existing COW reservations that
happen to overlap the start of the data fork extent. This prevents
shared/010 failures due to data corruption on reflink enabled
filesystems.

Signed-off-by: Brian Foster <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

mtd: spi-nor: Fix Cadence QSPI page fault kernel panic

The current Cadence QSPI driver caused a kernel panic sporadically
when writing to QSPI. The problem was caused by writing more bytes
than needed because the QSPI operated on 4 bytes at a time.
<snip>
[   11.202044] Unable to handle kernel paging request at virtual address bffd3000
[   11.209254] pgd = e463054d
[   11.211948] [bffd3000] *pgd=2fffb811, *pte=00000000, *ppte=00000000
[   11.218202] Internal error: Oops: 7 [#1] SMP ARM
[   11.222797] Modules linked in:
[   11.225844] CPU: 1 PID: 1317 Comm: systemd-hwdb Not tainted 4.17.7-d0c45cd44a8f
[   11.235796] Hardware name: Altera SOCFPGA Arria10
[   11.240487] PC is at __raw_writesl+0x70/0xd4
[   11.244741] LR is at cqspi_write+0x1a0/0x2cc
</snip>
On a page boundary limit the number of bytes copied from the tx buffer
to remain within the page.

This patch uses a temporary buffer to hold the 4 bytes to write and then
copies only the bytes required from the tx buffer.

Reported-by: Adrian Amborzewicz <[email protected]>
Signed-off-by: Thor Thayer <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>

drm/amd/pp: handle negative values when reading OD

Reading the sysfs files pp_sclk_od and pp_mclk_od return the
percentage difference between the VBIOS-provided default
frequency and the current (possibly user-set) frequency in
the highest SCLK and MCLK DPM states, respectively.

Writing to these files provides an easy mechanism for
setting a higher-than-default maximum frequency. We
normally only allow values >= 0 to be written here.

However, with the addition of pp_od_clk_voltage, we now
allow users to set custom DPM tables. If they then set
the maximum DPM state to something less than the default,
later reads of pp_*_od should return a negative value.
The highest DPM state is now less than the VBIOS-provided
default, so the percentage is negative.

The math to calculate this was originally performed with
unsigned values, meaning reads that should return negative
values returned meaningless data. This patch corrects that
issue and normalizes how all of the calculations are done
across the various hwmgr types.

Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Joseph Greathouse <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amdgpu: Add missing firmware entry for HAINAN

Due to lack of MODULE_FIRMWARE() with hainan_mc.bin, the driver
doesn't work properly in initrd. Let's add it.

Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1116239
Fixes: 8eaf2b1faaf4 ("drm/amdgpu: switch firmware path for SI parts")
Cc: <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/powerplay: disable Vega20 DS related features

Disable these features on Vega20 for now.

Signed-off-by: Evan Quan <[email protected]>
Acked-by: Feifei Xu<[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amdgpu: Fix oops when pp_funcs->switch_power_profile is unset

On Vega20 and other pre-production GPUs, powerplay is not enabled yet.
Check for NULL pointers before calling pp_funcs function pointers.

Also affects Kaveri.

CC: Joerg Roedel <[email protected]>
Signed-off-by: Felix Kuehling <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Tested-by: Joerg Roedel <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]

Revert "sctp: remove sctp_transport_pmtu_check"

This reverts commit 22d7be267eaa8114dcc28d66c1c347f667d7878a.

The dst's mtu in transport can be updated by a non sctp place like
in xfrm where the MTU information didn't get synced between asoc,
transport and dst, so it is still needed to do the pmtu check
in sctp_packet_config.

Acked-by: Neil Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: not allow to set asoc prsctp_enable by sockopt

As rfc7496#section4.5 says about SCTP_PR_SUPPORTED:

   This socket option allows the enabling or disabling of the
   negotiation of PR-SCTP support for future associations.  For existing
   associations, it allows one to query whether or not PR-SCTP support
   was negotiated on a particular association.

It means only sctp sock's prsctp_enable can be set.

Note that for the limitation of SCTP_{CURRENT|ALL}_ASSOC, we will
add it when introducing SCTP_{FUTURE|CURRENT|ALL}_ASSOC for linux
sctp in another patchset.

v1->v2:
  - drop the params.assoc_id check as Neil suggested.

Fixes: 28aa4c26fce2 ("sctp: add SCTP_PR_SUPPORTED on sctp sockopt")
Reported-by: Ying Xu <[email protected]>
Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: count sk_wmem_alloc by skb truesize in sctp_packet_transmit

Now sctp increases sk_wmem_alloc by 1 when doing set_owner_w for the
skb allocked in sctp_packet_transmit and decreases by 1 when freeing
this skb.

But when this skb goes through networking stack, some subcomponents
might change skb->truesize and add the same amount on sk_wmem_alloc.
However sctp doesn't know the amount to decrease by, it would cause
a leak on sk->sk_wmem_alloc and the sock can never be freed.

Xiumei found this issue when it hit esp_output_head() by using sctp
over ipsec, where skb->truesize is added and so is sk->sk_wmem_alloc.

Since sctp has used sk_wmem_queued to count for writable space since
Commit cd305c74b0f8 ("sctp: use sk_wmem_queued to check for writable
space"), it's ok to fix it by counting sk_wmem_alloc by skb truesize
in sctp_packet_transmit.

Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible")
Reported-by: Xiumei Mu <[email protected]>
Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

MAINTAINERS: Add myself as third phylib maintainer

Add myself as third phylib maintainer.

Signed-off-by: Heiner Kallweit <[email protected]>
Acked-by: Andrew Lunn <[email protected]>
Acked-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Fix some potentially uninitialized variables and use-after-free in
    kvaser_usb can drier, from Jimmy Assarsson.

2) Fix leaks in qed driver, from Denis Bolotin.

3) Socket leak in l2tp, from Xin Long.

4) RSS context allocation fix in bnxt_en from Michael Chan.

5) Fix cxgb4 build errors, from Ganesh Goudar.

6) Route leaks in ipv6 when removing exceptions, from Xin Long.

7) Memory leak in IDR allocation handling of act_pedit, from Davide
    Caratti.

8) Use-after-free of bridge vlan stats, from Nikolay Aleksandrov.

9) When MTU is locked, do not force DF bit on ipv4 tunnels. From
    Sabrina Dubroca.

10) When NAPI cached skb is reused, we must set it to the proper initial
    state which includes skb->pkt_type. From Eric Dumazet.

11) Lockdep and non-linear SKB handling fix in tipc from Jon Maloy.

12) Set RX queue properly in various tuntap receive paths, from Matthew
    Cover.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
  tuntap: fix multiqueue rx
  ipv6: Fix PMTU updates for UDP/raw sockets in presence of VRF
  tipc: don't assume linear buffer when reading ancillary data
  tipc: fix lockdep warning when reinitilaizing sockets
  net-gro: reset skb->pkt_type in napi_reuse_skb()
  tc-testing: tdc.py: Guard against lack of returncode in executed command
  tc-testing: tdc.py: ignore errors when decoding stdout/stderr
  ip_tunnel: don't force DF when MTU is locked
  MAINTAINERS: Add entry for CAKE qdisc
  net: bridge: fix vlan stats use-after-free on destruction
  socket: do a generic_file_splice_read when proto_ops has no splice_read
  net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
  Revert "net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs"
  net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
  net/sched: act_pedit: fix memory leak when IDR allocation fails
  net: lantiq: Fix returned value in case of error in 'xrx200_probe()'
  ipv6: fix a dst leak when removing its exception
  net: mvneta: Don't advertise 2.5G modes
  drivers/net/ethernet/qlogic/qed/qed_rdma.h: fix typo
  net/mlx4: Fix UBSAN warning of signed integer overflow
  ...

libceph: fall back to sendmsg for slab pages

skb_can_coalesce() allows coalescing neighboring slab objects into
a single frag:

  return page == skb_frag_page(frag) &&
         off == frag->page_offset + skb_frag_size(frag);

ceph_tcp_sendpage() can be handed slab pages.  One example of this is
XFS: it passes down sector sized slab objects for its metadata I/O.  If
the kernel client is co-located on the OSD node, the skb may go through
loopback and pop on the receive side with the exact same set of frags.
When tcp_recvmsg() attempts to copy out such a frag, hardened usercopy
complains because the size exceeds the object's allocated size:

  usercopy: kernel memory exposure attempt detected from ffff9ba917f20a00 (kmalloc-512) (1024 bytes)

Although skb_can_coalesce() could be taught to return false if the
resulting frag would cross a slab object boundary, we already have
a fallback for non-refcounted pages.  Utilize it for slab pages too.

Cc: [email protected] # 4.8+
Signed-off-by: Ilya Dryomov <[email protected]>

HID: i2c-hid: Disable runtime PM for LG touchscreen

LG touchscreen (1fd2:8001) stops working after reboot:
[ 4.859153] i2c_hid i2c-SAPS2101:00: i2c_hid_get_input: incomplete report (64/66)
[ 4.936070] i2c_hid i2c-SAPS2101:00: i2c_hid_get_input: incomplete report (64/66)
[ 9.948224] i2c_hid i2c-SAPS2101:00: failed to reset device.

The device in question stops working after receives SLEEP, ON, SLEEP
commands in a short period. The scenario is like this:
- Once the desktop session closes, it also closed the hid device, so the
device gets runtime suspended and receives a SLEEP command.
- Before calling shutdown callback, it gets runtime resumed and received
an ON command.
- In the shutdown callback, it receives another SLEEP command.

I failed to find a reliable interval between ON/SLEEP commands that can
make it work, so let's simply disable runtime PM for the device.

Signed-off-by: Kai-Heng Feng <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>

HID: multitouch: Add pointstick support for Cirque Touchpad

Cirque Touchpad/Pointstick combo is similar to Alps devices, it requires
MT_CLS_WIN_8_DUAL to expose its pointstick as a mouse.

Signed-off-by: Kai-Heng Feng <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>

HID: steam: remove input device when a hid client is running.

Previously, when a HID client such as the Steam Client was running, this
driver disabled its input device to avoid doubling the input events.

While it worked mostly fine, some games got confused by the idle gamepad,
and switched to two player mode, or asked the user to choose which gamepad
to use. Other games just crashed, probably a bug in Unity [1].

With this commit, when a HID client starts, the input device is removed;
when the HID client ends the input device is recreated.

[1]: https://github.com/ValveSoftware/steam-for-linux/issues/5645

Signed-off-by: Rodrigo Rivas Costa <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>

XArray tests: Add missing locking

Lockdep caught me being sloppy in the test suite and failing to lock
the XArray appropriately.

Reported-by: kernel test robot <[email protected]>
Signed-off-by: Matthew Wilcox <[email protected]>

dax: Avoid losing wakeup in dax_lock_mapping_entry

After calling get_unlocked_entry(), you have to call
put_unlocked_entry() to avoid subsequent waiters losing wakeups.

Fixes: c2a7d2a11552 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
Cc: [email protected]
Signed-off-by: Matthew Wilcox <[email protected]>

Revert "HID: uhid: use strlcpy() instead of strncpy()"

This reverts commit 336fd4f5f25157e9e8bd50e898a1bbcd99eaea46.

Please note that `strlcpy()` does *NOT* do what you think it does.
strlcpy() *ALWAYS* reads the full input string, regardless of the
'length' parameter. That is, if the input is not zero-terminated,
strlcpy() will *READ* beyond input boundaries. It does this, because it
always returns the size it *would* copy if the target was big enough,
not the truncated size it actually copied.

The original code was perfectly fine. The hid device is
zero-initialized and the strncpy() functions copied up to n-1
characters. The result is always zero-terminated this way.

This is the third time someone tried to replace strncpy with strlcpy in
this function, and gets it wrong. I now added a comment that should at
least make people reconsider.

Signed-off-by: David Herrmann <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>

HID: uhid: forbid UHID_CREATE under KERNEL_DS or elevated privileges

When a UHID_CREATE command is written to the uhid char device, a
copy_from_user() is done from a user pointer embedded in the command.
When the address limit is KERNEL_DS, e.g. as is the case during
sys_sendfile(), this can read from kernel memory. Alternatively,
information can be leaked from a setuid binary that is tricked to write
to the file descriptor. Therefore, forbid UHID_CREATE in these cases.

No other commands in uhid_char_write() are affected by this bug and
UHID_CREATE is marked as "obsolete", so apply the restriction to
UHID_CREATE only rather than to uhid_char_write() entirely.

Thanks to Dmitry Vyukov for adding uhid definitions to syzkaller and to
Jann Horn for commit 9da3f2b740544 ("x86/fault: BUG() when uaccess
helpers fault on kernel addresses"), allowing this bug to be found.

Reported-by: [email protected]
Fixes: d365c6cfd337 ("HID: uhid: add UHID_CREATE and UHID_DESTROY events")
Cc: <[email protected]> # v3.6+
Cc: Jann Horn <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Signed-off-by: Eric Biggers <[email protected]>
Reviewed-by: Jann Horn <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>

mmc: sdhci-pci: Workaround GLK firmware failing to restore the tuning value

GLK firmware can indicate that the tuning value will be restored after
runtime suspend, but not actually do that. Add a workaround that detects
such cases, and lets the driver do re-tuning instead.

Reported-by: Anisse Astier <[email protected]>
Tested-by: Anisse Astier <[email protected]>
Signed-off-by: Adrian Hunter <[email protected]>
Cc: [email protected]
Signed-off-by: Ulf Hansson <[email protected]>

drm/i915: Disable LP3 watermarks on all SNB machines

I have a Thinkpad X220 Tablet in my hands that is losing vblank
interrupts whenever LP3 watermarks are used.

If I nudge the latency value written to the WM3 register just
by one in either direction the problem disappears. That to me
suggests that the punit will not enter the corrsponding
powersave mode (MPLL shutdown IIRC) unless the latency value
in the register matches exactly what we read from SSKPD. Ie.
it's not really a latency value but rather just a cookie
by which the punit can identify the desired power saving state.
On HSW/BDW this was changed such that we actually just write
the WM level number into those bits, which makes much more
sense given the observed behaviour.

We could try to handle this by disallowing LP3 watermarks
only when vblank interrupts are enabled but we'd first have
to prove that only vblank interrupts are affected, which
seems unlikely. Also we can't grab the wm mutex from the
vblank enable/disable hooks because those are called with
various spinlocks held. Thus we'd have to redesigne the
watermark locking. So to play it safe and keep the code
simple we simply disable LP3 watermarks on all SNB machines.

To do that we simply zero out the latency values for
watermark level 3, and we adjust the watermark computation
to check for that. The behaviour now matches that of the
g4x/vlv/skl wm code in the presence of a zeroed latency
value.

v2: s/USHRT_MAX/U32_MAX/ for consistency with the types (Chris)

Cc: [email protected]
Cc: Chris Wilson <[email protected]>
Acked-by: Chris Wilson <[email protected]>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101269
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103713
Signed-off-by: Ville Syrjälä <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 03981c6ebec4fc7056b9b45f847393aeac90d060)
Signed-off-by: Joonas Lahtinen <[email protected]>

ALSA: hda/ca0132 - fix AE-5 pincfg

This patch fixes the pincfg assignment for the AE-5, which was
previously using the Recon3D pincfg's by mistake.

Fixes: d06feaf02fe6 ("ALSA: hda/ca0132 - Add pincfg for AE-5")
Signed-off-by: Connor McAdams <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>

ALSA: hda/ca0132 - Add new ZxR quirk

This patch adds a new PCI subsys ID for the ZxR, as found and tested by
other users. Without a way to know if any Z's use it as well, it keeps
the quirk of QUIRK_SBZ and goes through the HDA subsys test function.

Signed-off-by: Connor McAdams <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>

mmc: sdhci-pci: Try "cd" for card-detect lookup before using NULL

Problem:

The card detect IRQ does not work with modern BIOS (that want
to use _DSD to provide the card detect GPIO to the driver).

Details:

The mmc core provides the mmc_gpiod_request_cd() API to let host drivers
request the gpio descriptor for the "card detect" pin.
This pin is specified in the ACPI for the SDHC device:

* Either as a resource using _CRS. This is a method used by legacy BIOS.
   (The driver needs to tell which resource index).

* Or as a named property ("cd-gpios"/"cd-gpio") in _DSD (which internally
   points to an entry in _CRS). This way, the driver can lookup using a
   string. This is what modern BIOS prefer to use.

This API finally results in a call to the following code:

struct gpio_desc *acpi_find_gpio(..., const char *con_id,...)
{
...
   /* Lookup gpio (using "<con_id>-gpio") in the _DSD */
...
   if (!acpi_can_fallback_to_crs(adev, con_id))
          return ERR_PTR(-ENOENT);
...
   /* Falling back to _CRS is allowed, Lookup gpio in the _CRS */
...
}

Note that this means that if the ACPI has _DSD properties, the kernel
will never use _CRS for the lookup (Because acpi_can_fallback_to_crs()
will always be false for any device hat has _DSD entries).

The SDHCI driver is thus currently broken on a modern BIOS, even if
BIOS provides both _CRS (for index based lookup) and _DSD entries (for
string based lookup). Ironically, none of these will be used for the
lookup currently because:

* Since the con_id is NULL, acpi_find_gpio() does not find a matching
  entry in DSDT. (The _DSDT entry has the property name = "cd-gpios")

* Because ACPI contains DSDT entries, thus acpi_can_fallback_to_crs()
  returns false (because device properties have been populated from
  _DSD), thus the _CRS is never used for the lookup.

Fix:

Try "cd" for lookup in the _DSD before falling back to using NULL so
as to try looking up in the _CRS.

I've tested this patch successfully with both Legacy BIOS (that
provide only _CRS method) as well as modern BIOS (that provide both
_CRS and _DSD). Also the use of "cd" appears to be fairly consistent
across other users of this API (other MMC host controller drivers).

Link: https://lkml.org/lkml/2018/9/25/1113
Signed-off-by: Rajat Jain <[email protected]>
Acked-by: Adrian Hunter <[email protected]>
Fixes: f10e4bf6632b ("gpio: acpi: Even more tighten up ACPI GPIO lookups")
Cc: [email protected]
Signed-off-by: Ulf Hansson <[email protected]>

exec: make de_thread() freezable

Suspend fails due to the exec family of functions blocking the freezer.
The casue is that de_thread() sleeps in TASK_UNINTERRUPTIBLE waiting for
all sub-threads to die, and we have the deadlock if one of them is frozen.
This also can occur with the schedule() waiting for the group thread leader
to exit if it is frozen.

In our machine, it causes freeze timeout as bellows.

Freezing of tasks failed after 20.010 seconds (1 tasks refusing to freeze, wq_busy=0):
setcpushares-ls D ffffffc00008ed70 0 5817 1483 0x0040000d
Call trace:
[<ffffffc00008ed70>] __switch_to+0x88/0xa0
[<ffffffc000d1c30c>] __schedule+0x1bc/0x720
[<ffffffc000d1ca90>] schedule+0x40/0xa8
[<ffffffc0001cd784>] flush_old_exec+0xdc/0x640
[<ffffffc000220360>] load_elf_binary+0x2a8/0x1090
[<ffffffc0001ccff4>] search_binary_handler+0x9c/0x240
[<ffffffc00021c584>] load_script+0x20c/0x228
[<ffffffc0001ccff4>] search_binary_handler+0x9c/0x240
[<ffffffc0001ce8e0>] do_execveat_common.isra.14+0x4f8/0x6e8
[<ffffffc0001cedd0>] compat_SyS_execve+0x38/0x48
[<ffffffc00008de30>] el0_svc_naked+0x24/0x28

To fix this, make de_thread() freezable. It looks safe and works fine.

Suggested-by: Oleg Nesterov <[email protected]>
Signed-off-by: Chanho Min <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Acked-by: Pavel Machek <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

cpufreq: ti-cpufreq: Only register platform_device when supported

Currently the ti-cpufreq driver blindly registers a 'ti-cpufreq' to force
the driver to probe on any platforms where the driver is built in.
However, this should only happen on platforms that actually can make use
of the driver. There is already functionality in place to match the
SoC compatible so let's factor this out into a separate call and
make sure we find a match before creating the ti-cpufreq platform device.

Reviewed-by: Johan Hovold <[email protected]>
Signed-off-by: Dave Gerlach <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

Merge branch 'opp/fixes-for-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm

Pull operating performance points (OPP) framework fixes for 4.20 from
Viresh Kumar.

* 'opp/fixes-for-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
opp: ti-opp-supply: Correct the supply in _get_optimal_vdd_voltage call
opp: ti-opp-supply: Dynamically update u_volt_min

tuntap: fix multiqueue rx

When writing packets to a descriptor associated with a combined queue, the
packets should end up on that queue.

Before this change all packets written to any descriptor associated with a
tap interface end up on rx-0, even when the descriptor is associated with a
different queue.

The rx traffic can be generated by either of the following.
  1. a simple tap program which spins up multiple queues and writes packets
     to each of the file descriptors
  2. tx from a qemu vm with a tap multiqueue netdev

The queue for rx traffic can be observed by either of the following (done
on the hypervisor in the qemu case).
  1. a simple netmap program which opens and reads from per-queue
     descriptors
  2. configuring RPS and doing per-cpu captures with rxtxcpu

Alternatively, if you printk() the return value of skb_get_rx_queue() just
before each instance of netif_receive_skb() in tun.c, you will get 65535
for every skb.

Calling skb_record_rx_queue() to set the rx queue to the queue_index fixes
the association between descriptor and rx queue.

Signed-off-by: Matthew Cover <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv6: Fix PMTU updates for UDP/raw sockets in presence of VRF

Preethi reported that PMTU discovery for UDP/raw applications is not
working in the presence of VRF when the socket is not bound to a device.
The problem is that ip6_sk_update_pmtu does not consider the L3 domain
of the skb device if the socket is not bound. Update the function to
set oif to the L3 master device if relevant.

Fixes: ca254490c8df ("net: Add VRF support to IPv6 stack")
Reported-by: Preethi Ramachandra <[email protected]>
Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drm/ast: Remove existing framebuffers before loading driver

If vesafb attaches to the AST device, it configures the framebuffer memory
for uncached access by default. When ast.ko later tries to attach itself to
the device, it wants to use write-combining on the framebuffer memory, but
vesefb's existing configuration for uncached access takes precedence. This
results in reduced performance.

Removing the framebuffer's configuration before loding the AST driver fixes
the problem. Other DRM drivers already contain equivalent code.

Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1112963
Signed-off-by: Thomas Zimmermann <[email protected]>
Cc: <[email protected]>
Tested-by: Y.C. Chen <[email protected]>
Reviewed-by: Jean Delvare <[email protected]>
Tested-by: Jean Delvare <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>

Linux 4.20-rc3

Merge tag 'libnvdimm-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm fixes from Dan Williams:
"A small batch of fixes for v4.20-rc3.

  The overflow continuation fix addresses something that has been broken
  for several releases. Arguably it could wait even longer, but it's a
  one line fix and this finishes the last of the known address range
  scrub bug reports. The revert addresses a lockdep regression. The unit
  tests are not critical to fix, but no reason to hold this fix back.

  Summary:

   - Address Range Scrub overflow continuation handling has been broken
     since it was initially merged. It was only recently that error
     injection and platform-BIOS support enabled this corner case to be
     exercised.

   - The recent attempt to provide more isolation for the kernel Address
     Range Scrub state machine from userapace initiated sessions
     triggers a lockdep report. Revert and try again at the next merge
     window.

   - Fix a kasan reported buffer overflow in libnvdimm unit test
     infrastrucutre (nfit_test)"

* tag 'libnvdimm-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  Revert "acpi, nfit: Further restrict userspace ARS start requests"
  acpi, nfit: Fix ARS overflow continuation
  tools/testing/nvdimm: Fix the array size for dimm devices.

Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
"16 fixes"

* emailed patches from Andrew Morton <[email protected]>:
  mm/memblock.c: fix a typo in __next_mem_pfn_range() comments
  mm, page_alloc: check for max order in hot path
  scripts/spdxcheck.py: make python3 compliant
  tmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset
  lib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn
  mm/vmstat.c: fix NUMA statistics updates
  mm/gup.c: fix follow_page_mask() kerneldoc comment
  ocfs2: free up write context when direct IO failed
  scripts/faddr2line: fix location of start_kernel in comment
  mm: don't reclaim inodes with many attached pages
  mm, memory_hotplug: check zone_movable in has_unmovable_pages
  mm/swapfile.c: use kvzalloc for swap_info_struct allocation
  MAINTAINERS: update OMAP MMC entry
  hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
  kernel/sched/psi.c: simplify cgroup_move_task()
  z3fold: fix possible reclaim races

Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Ingo Molnar:
"Fix an exec() related scalability/performance regression, which was
  caused by incorrectly calculating load and migrating tasks on exec()
  when they shouldn't be"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix cpu_util_wake() for 'execl' type workloads

Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Ingo Molnar:
"Fix uncore PMU enumeration for CofeeLake CPUs"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Support CoffeeLake 8th CBOX
perf/x86/intel/uncore: Add more IMC PCI IDs for KabyLake and CoffeeLake CPUs

Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull EFI fixes from Ingo Molnar:
"Misc fixes: two warning splat fixes, a leak fix and persistent memory
  allocation fixes for ARM"

* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi: Permit calling efi_mem_reserve_persistent() from atomic context
  efi/arm: Defer persistent reservations until after paging_init()
  efi/arm/libstub: Pack FDT after populating it
  efi/arm: Revert deferred unmap of early memmap mapping
  efi: Fix debugobjects warning on 'efi_rts_work'

Merge branch 'spectre' of git://git.armlinux.org.uk/~rmk/linux-arm

Pull ARM spectre updates from Russell King:
"These are the currently known final bits that resolve the Spectre
  issues. big.Little systems used to be sufficiently identical in that
  there were no differences between individual CPUs in the system that
  mattered to the kernel. With the advent of the Spectre problem, the
  CPUs now have differences in how the workaround is applied.

  As a result of previous Spectre patches, these systems ended up
  reporting quite a lot of:

     "CPUx: Spectre v2: incorrect context switching function, system vulnerable"

  messages due to the action of the big.Little switcher causing the CPUs
  to be re-initialised regularly. This series resolves that issue by
  making the CPU vtable unique to each CPU.

  However, since this is used very early, before per-cpu is setup,
  per-cpu can't be used. We also have a problem that two of the methods
  are not called from preempt-safe paths, but thankfully these remain
  identical between all CPUs in the system. To make sure, we validate
  that these are identical during boot"

* 'spectre' of git://git.armlinux.org.uk/~rmk/linux-arm:
  ARM: spectre-v2: per-CPU vtables to work around big.Little systems
  ARM: add PROC_VTABLE and PROC_TABLE macros
  ARM: clean up per-processor check_bugs method call
  ARM: split out processor lookup
  ARM: make lookup_processor_type() non-__init

mm/memblock.c: fix a typo in __next_mem_pfn_range() comments

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Chen Chang <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, page_alloc: check for max order in hot path

Konstantin has noticed that kvmalloc might trigger the following
warning:

  WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
  [...]
  Call Trace:
   fragmentation_index+0x76/0x90
   compaction_suitable+0x4f/0xf0
   shrink_node+0x295/0x310
   node_reclaim+0x205/0x250
   get_page_from_freelist+0x649/0xad0
   __alloc_pages_nodemask+0x12a/0x2a0
   kmalloc_large_node+0x47/0x90
   __kmalloc_node+0x22b/0x2e0
   kvmalloc_node+0x3e/0x70
   xt_alloc_table_info+0x3a/0x80 [x_tables]
   do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
   nf_setsockopt+0x44/0x60
   SyS_setsockopt+0x6f/0xc0
   do_syscall_64+0x67/0x120
   entry_SYSCALL_64_after_hwframe+0x3d/0xa2

the problem is that we only check for an out of bound order in the slow
path and the node reclaim might happen from the fast path already.  This
is fixable by making sure that kvmalloc doesn't ever use kmalloc for
requests that are larger than KMALLOC_MAX_SIZE but this also shows that
the code is rather fragile.  A recent UBSAN report just underlines that
by the following report

  UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
  shift exponent 51 is too large for 32-bit type 'int'
  CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0xd2/0x148 lib/dump_stack.c:113
   ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
   __ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
   __zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
   zone_watermark_fast mm/page_alloc.c:3216 [inline]
   get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
   __alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
   alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
   alloc_pages include/linux/gfp.h:509 [inline]
   __get_free_pages+0x12/0x60 mm/page_alloc.c:4414
   dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
   raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
   raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
   fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
   fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
   __blkdev_driver_ioctl block/ioctl.c:303 [inline]
   blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
   block_ioctl+0x105/0x150 fs/block_dev.c:1883
   vfs_ioctl fs/ioctl.c:46 [inline]
   do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
   ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
   __do_sys_ioctl fs/ioctl.c:709 [inline]
   __se_sys_ioctl fs/ioctl.c:707 [inline]
   __x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
   do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Note that this is not a kvmalloc path.  It is just that the fast path
really depends on having sanitzed order as well.  Therefore move the
order check to the fast path.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Reported-by: Konstantin Khlebnikov <[email protected]>
Reported-by: Kyungtae Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Aaron Lu <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Byoungyoung Lee <[email protected]>
Cc: "Dae R. Jeong" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

scripts/spdxcheck.py: make python3 compliant

Without this change the following happens when using Python3 (3.6.6):

$ echo "GPL-2.0" | python3 scripts/spdxcheck.py -
FAIL: 'str' object has no attribute 'decode'
Traceback (most recent call last):
  File "scripts/spdxcheck.py", line 253, in <module>
    parser.parse_lines(sys.stdin, args.maxlines, '-')
  File "scripts/spdxcheck.py", line 171, in parse_lines
    line = line.decode(locale.getpreferredencoding(False), errors='ignore')
AttributeError: 'str' object has no attribute 'decode'

So as the line is already a string, there is no need to decode it and
the line can be dropped.

/usr/bin/python on Arch is Python 3.  So this would indeed be worth
going into 4.19.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Uwe Kleine-König <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

tmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset

Other filesystems such as ext4, f2fs and ubifs all return ENXIO when
lseek (SEEK_DATA or SEEK_HOLE) requests a negative offset.

man 2 lseek says

:      EINVAL whence  is  not  valid.   Or: the resulting file offset would be
:             negative, or beyond the end of a seekable device.
:
:      ENXIO  whence is SEEK_DATA or SEEK_HOLE, and the file offset is  beyond
:             the end of the file.

Make tmpfs return ENXIO under these circumstances as well.  After this,
tmpfs also passes xfstests's generic/448.

[[email protected]: rewrite changelog]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yufen Yu <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: William Kucharski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn

gcc-8 complains about the prototype for this function:

lib/ubsan.c:432:1: error: ignoring attribute 'noreturn' in declaration of a built-in function '__ubsan_handle_builtin_unreachable' because it conflicts with attribute 'const' [-Werror=attributes]

This is actually a GCC's bug. In GCC internals
__ubsan_handle_builtin_unreachable() declared with both 'noreturn' and
'const' attributes instead of only 'noreturn':

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84210

Workaround this by removing the noreturn attribute.

[aryabinin: add information about GCC bug in changelog]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Andrey Ryabinin <[email protected]>
Acked-by: Olof Johansson <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/vmstat.c: fix NUMA statistics updates

Scan through the whole array to see if an update is needed.  While we're
at it, use sizeof() to be safe against any possible type changes in the
future.

The bug here is that we wouldn't sync per-cpu counters into global ones
if there was an update of numa_stats for higher cpus.  Highly
theoretical one though because it is much more probable that zone_stats
are updated so we would refresh anyway.  So I wouldn't bother to mark
this for stable, yet something nice to fix.

[[email protected]: changelog enhancement]
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 1d90ca897cb0 ("mm: update NUMA counter threshold size")
Signed-off-by: Janne Huttunen <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/gup.c: fix follow_page_mask() kerneldoc comment

Commit df06b37ffe5a ("mm/gup: cache dev_pagemap while pinning pages")
modified the signature of follow_page_mask() but left the parameter
description behind.

Update the description to make the code and comments agree again.

While at it, update formatting of the return value description to match
Documentation/doc-guide/kernel-doc.rst guidelines.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: free up write context when direct IO failed

The write context should also be freed even when direct IO failed.
Otherwise a memory leak is introduced and entries remain in
oi->ip_unwritten_list causing the following BUG later in unlink path:

  ERROR: bug expression: !list_empty(&oi->ip_unwritten_list)
  ERROR: Clear inode of 215043, inode has unwritten extents
  ...
  Call Trace:
  ? __set_current_blocked+0x42/0x68
  ocfs2_evict_inode+0x91/0x6a0 [ocfs2]
  ? bit_waitqueue+0x40/0x33
  evict+0xdb/0x1af
  iput+0x1a2/0x1f7
  do_unlinkat+0x194/0x28f
  SyS_unlinkat+0x1b/0x2f
  do_syscall_64+0x79/0x1ae
  entry_SYSCALL_64_after_hwframe+0x151/0x0

This patch also logs, with frequency limit, direct IO failures.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wengang Wang <[email protected]>
Reviewed-by: Junxiao Bi <[email protected]>
Reviewed-by: Changwei Ge <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

scripts/faddr2line: fix location of start_kernel in comment

Fix a source file reference location to the correct path name.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Randy Dunlap <[email protected]>
Acked-by: Josh Poimboeuf <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: don't reclaim inodes with many attached pages

Spock reported that commit 172b06c32b94 ("mm: slowly shrink slabs with a
relatively small number of objects") leads to a regression on his setup:
periodically the majority of the pagecache is evicted without an obvious
reason, while before the change the amount of free memory was balancing
around the watermark.

The reason behind is that the mentioned above change created some
minimal background pressure on the inode cache.  The problem is that if
an inode is considered to be reclaimed, all belonging pagecache page are
stripped, no matter how many of them are there.  So, if a huge
multi-gigabyte file is cached in the memory, and the goal is to reclaim
only few slab objects (unused inodes), we still can eventually evict all
gigabytes of the pagecache at once.

The workload described by Spock has few large non-mapped files in the
pagecache, so it's especially noticeable.

To solve the problem let's postpone the reclaim of inodes, which have
more than 1 attached page.  Let's wait until the pagecache pages will be
evicted naturally by scanning the corresponding LRU lists, and only then
reclaim the inode structure.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Reported-by: Spock <[email protected]>
Tested-by: Spock <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: <[email protected]> [4.19.x]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, memory_hotplug: check zone_movable in has_unmovable_pages

Page state checks are racy.  Under a heavy memory workload (e.g.  stress
-m 200 -t 2h) it is quite easy to hit a race window when the page is
allocated but its state is not fully populated yet.  A debugging patch to
dump the struct page state shows

  has_unmovable_pages: pfn:0x10dfec00, found:0x1, count:0x0
  page:ffffea0437fb0000 count:1 mapcount:1 mapping:ffff880e05239841 index:0x7f26e5000 compound_mapcount: 1
  flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked)

Note that the state has been checked for both PageLRU and PageSwapBacked
already.  Closing this race completely would require some sort of retry
logic.  This can be tricky and error prone (think of potential endless
or long taking loops).

Workaround this problem for movable zones at least.  Such a zone should
only contain movable pages.  Commit 15c30bc09085 ("mm, memory_hotplug:
make has_unmovable_pages more robust") has told us that this is not
strictly true though.  Bootmem pages should be marked reserved though so
we can move the original check after the PageReserved check.  Pages from
other zones are still prone to races but we even do not pretend that
memory hotremove works for those so pre-mature failure doesn't hurt that
much.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 15c30bc09085 ("mm, memory_hotplug: make has_unmovable_pages more robust")
Signed-off-by: Michal Hocko <[email protected]>
Reported-by: Baoquan He <[email protected]>
Tested-by: Baoquan He <[email protected]>
Acked-by: Baoquan He <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/swapfile.c: use kvzalloc for swap_info_struct allocation

Commit a2468cc9bfdf ("swap: choose swap device according to numa node")
changed 'avail_lists' field of 'struct swap_info_struct' to an array.
In popular linux distros it increased size of swap_info_struct up to 40
Kbytes and now swap_info_struct allocation requires order-4 page.
Switch to kvzmalloc allows to avoid unexpected allocation failures.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: a2468cc9bfdf ("swap: choose swap device according to numa node")
Signed-off-by: Vasily Averin <[email protected]>
Acked-by: Aaron Lu <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

MAINTAINERS: update OMAP MMC entry

Jarkko's e-mail address hasn't worked for a long time. We still want to
keep this driver working as it is critical for some of the OMAP boards.
I use and test this driver frequently, so change myself as a maintainer
with "Odd Fixes" status.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Aaro Koskinen <[email protected]>
Acked-by: Tony Lindgren <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!

This bug has been experienced several times by the Oracle DB team.  The
BUG is in remove_inode_hugepages() as follows:

/*
* If page is mapped, it was faulted in after being
* unmapped in caller.  Unmap (again) now after taking
* the fault mutex.  The mutex will prevent faults
* until we finish removing the page.
*
* This race can only happen in the hole punch case.
* Getting here in a truncate operation is a bug.
*/
if (unlikely(page_mapped(page))) {
BUG_ON(truncate_op);

In this case, the elevated map count is not the result of a race.
Rather it was incorrectly incremented as the result of a bug in the huge
pmd sharing code.  Consider the following:

- Process A maps a hugetlbfs file of sufficient size and alignment
   (PUD_SIZE) that a pmd page could be shared.

- Process B maps the same hugetlbfs file with the same size and
   alignment such that a pmd page is shared.

- Process B then calls mprotect() to change protections for the mapping
   with the shared pmd. As a result, the pmd is 'unshared'.

- Process B then calls mprotect() again to chage protections for the
   mapping back to their original value. pmd remains unshared.

- Process B then forks and process C is created. During the fork
   process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
   tables. Copying page tables for hugetlb mappings is done in the
   routine copy_hugetlb_page_range.

In copy_hugetlb_page_range(), the destination pte is obtained by:

dst_pte = huge_pte_alloc(dst, addr, sz);

If pmd sharing is possible, the returned pointer will be to a pte in an
existing page table.  In the situation above, process C could share with
either process A or process B.  Since process A is first in the list,
the returned pte is a pointer to a pte in process A's page table.

However, the check for pmd sharing in copy_hugetlb_page_range is:

/* If the pagetables are shared don't copy or take references */
if (dst_pte == src_pte)
continue;

Since process C is sharing with process A instead of process B, the
above test fails.  The code in copy_hugetlb_page_range which follows
assumes dst_pte points to a huge_pte_none pte.  It copies the pte entry
from src_pte to dst_pte and increments this map count of the associated
page.  This is how we end up with an elevated map count.

To solve, check the dst_pte entry for huge_pte_none.  If !none, this
implies PMD sharing so do not copy.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: c5c99429fa57 ("fix hugepages leak due to pagetable page sharing")
Signed-off-by: Mike Kravetz <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Prakash Sangappa <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kernel/sched/psi.c: simplify cgroup_move_task()

The existing code triggered an invalid warning about 'rq' possibly being
used uninitialized.  Instead of doing the silly warning suppression by
initializa it to NULL, refactor the code to bail out early instead.

Warning was:

  kernel/sched/psi.c: In function `cgroup_move_task':
  kernel/sched/psi.c:639:13: warning: `rq' may be used uninitialized in this function [-Wmaybe-uninitialized]

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 2ce7135adc9ad ("psi: cgroup support")
Signed-off-by: Olof Johansson <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

z3fold: fix possible reclaim races

Reclaim and free can race on an object which is basically fine but in
order for reclaim to be able to map "freed" object we need to encode
object length in the handle. handle_to_chunks() is then introduced to
extract object length from a handle and use it during mapping.

Moreover, to avoid racing on a z3fold "headless" page release, we should
not try to free that page in z3fold_free() if the reclaim bit is set.
Also, in the unlikely case of trying to reclaim a page being freed, we
should not proceed with that page.

While at it, fix the page accounting in reclaim function.

This patch supersedes "[PATCH] z3fold: fix reclaim lock-ups".

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vitaly Wool <[email protected]>
Signed-off-by: Jongseok Kim <[email protected]>
Reported-by-by: Jongseok Kim <[email protected]>
Reviewed-by: Snild Dolkow <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mtd: rawnand: qcom: Namespace prefix some commands

PAGE_READ is used by RISC-V arch code included through mm headers,
and it makes sense to bring in a prefix on these in the driver.

drivers/mtd/nand/raw/qcom_nandc.c:153: warning: "PAGE_READ" redefined
#define PAGE_READ   0x2
In file included from include/linux/memremap.h:7,
                 from include/linux/mm.h:27,
                 from include/linux/scatterlist.h:8,
                 from include/linux/dma-mapping.h:11,
                 from drivers/mtd/nand/raw/qcom_nandc.c:17:
arch/riscv/include/asm/pgtable.h:48: note: this is the location of the previous definition

Caught by riscv allmodconfig.

Signed-off-by: Olof Johansson <[email protected]>
Reviewed-by: Miquel Raynal <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>

mtd: rawnand: atmel: fix OF child-node lookup

Use the new of_get_compatible_child() helper to lookup the nfc child
node instead of using of_find_compatible_node(), which searches the
entire tree from a given start node and thus can return an unrelated
(i.e. non-child) node.

This also addresses a potential use-after-free (e.g. after probe
deferral) as the tree-wide helper drops a reference to its first
argument (i.e. the node of the device being probed).

While at it, also fix a related nfc-node reference leak.

Fixes: f88fc122cc34 ("mtd: nand: Cleanup/rework the atmel_nand driver")
Cc: stable <[email protected]> # 4.11
Cc: Nicolas Ferre <[email protected]>
Cc: Josh Wu <[email protected]>
Cc: Boris Brezillon <[email protected]>
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>

tipc: don't assume linear buffer when reading ancillary data

The code for reading ancillary data from a received buffer is assuming
the buffer is linear. To make this assumption true we have to linearize
the buffer before message data is read.

Signed-off-by: Jon Maloy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tipc: fix lockdep warning when reinitilaizing sockets

We get the following warning:

[   47.926140] 32-bit node address hash set to 2010a0a
[   47.927202]
[   47.927433] ================================
[   47.928050] WARNING: inconsistent lock state
[   47.928661] 4.19.0+ #37 Tainted: G            E
[   47.929346] --------------------------------
[   47.929954] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[   47.930116] swapper/3/0 [HC0[0]:SC1[3]:HE1:SE0] takes:
[   47.930116] 00000000af8bc31e (&(&ht->lock)->rlock){+.?.}, at: rhashtable_walk_enter+0x36/0xb0
[   47.930116] {SOFTIRQ-ON-W} state was registered at:
[   47.930116]   _raw_spin_lock+0x29/0x60
[   47.930116]   rht_deferred_worker+0x556/0x810
[   47.930116]   process_one_work+0x1f5/0x540
[   47.930116]   worker_thread+0x64/0x3e0
[   47.930116]   kthread+0x112/0x150
[   47.930116]   ret_from_fork+0x3a/0x50
[   47.930116] irq event stamp: 14044
[   47.930116] hardirqs last  enabled at (14044): [<ffffffff9a07fbba>] __local_bh_enable_ip+0x7a/0xf0
[   47.938117] hardirqs last disabled at (14043): [<ffffffff9a07fb81>] __local_bh_enable_ip+0x41/0xf0
[   47.938117] softirqs last  enabled at (14028): [<ffffffff9a0803ee>] irq_enter+0x5e/0x60
[   47.938117] softirqs last disabled at (14029): [<ffffffff9a0804a5>] irq_exit+0xb5/0xc0
[   47.938117]
[   47.938117] other info that might help us debug this:
[   47.938117]  Possible unsafe locking scenario:
[   47.938117]
[   47.938117]        CPU0
[   47.938117]        ----
[   47.938117]   lock(&(&ht->lock)->rlock);
[   47.938117]   <Interrupt>
[   47.938117]     lock(&(&ht->lock)->rlock);
[   47.938117]
[   47.938117]  *** DEADLOCK ***
[   47.938117]
[   47.938117] 2 locks held by swapper/3/0:
[   47.938117]  #0: 0000000062c64f90 ((&d->timer)){+.-.}, at: call_timer_fn+0x5/0x280
[   47.938117]  #1: 00000000ee39619c (&(&d->lock)->rlock){+.-.}, at: tipc_disc_timeout+0xc8/0x540 [tipc]
[   47.938117]
[   47.938117] stack backtrace:
[   47.938117] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G            E     4.19.0+ #37
[   47.938117] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   47.938117] Call Trace:
[   47.938117]  <IRQ>
[   47.938117]  dump_stack+0x5e/0x8b
[   47.938117]  print_usage_bug+0x1ed/0x1ff
[   47.938117]  mark_lock+0x5b5/0x630
[   47.938117]  __lock_acquire+0x4c0/0x18f0
[   47.938117]  ? lock_acquire+0xa6/0x180
[   47.938117]  lock_acquire+0xa6/0x180
[   47.938117]  ? rhashtable_walk_enter+0x36/0xb0
[   47.938117]  _raw_spin_lock+0x29/0x60
[   47.938117]  ? rhashtable_walk_enter+0x36/0xb0
[   47.938117]  rhashtable_walk_enter+0x36/0xb0
[   47.938117]  tipc_sk_reinit+0xb0/0x410 [tipc]
[   47.938117]  ? mark_held_locks+0x6f/0x90
[   47.938117]  ? __local_bh_enable_ip+0x7a/0xf0
[   47.938117]  ? lockdep_hardirqs_on+0x20/0x1a0
[   47.938117]  tipc_net_finalize+0xbf/0x180 [tipc]
[   47.938117]  tipc_disc_timeout+0x509/0x540 [tipc]
[   47.938117]  ? call_timer_fn+0x5/0x280
[   47.938117]  ? tipc_disc_msg_xmit.isra.19+0xa0/0xa0 [tipc]
[   47.938117]  ? tipc_disc_msg_xmit.isra.19+0xa0/0xa0 [tipc]
[   47.938117]  call_timer_fn+0xa1/0x280
[   47.938117]  ? tipc_disc_msg_xmit.isra.19+0xa0/0xa0 [tipc]
[   47.938117]  run_timer_softirq+0x1f2/0x4d0
[   47.938117]  __do_softirq+0xfc/0x413
[   47.938117]  irq_exit+0xb5/0xc0
[   47.938117]  smp_apic_timer_interrupt+0xac/0x210
[   47.938117]  apic_timer_interrupt+0xf/0x20
[   47.938117]  </IRQ>
[   47.938117] RIP: 0010:default_idle+0x1c/0x140
[   47.938117] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 54 55 53 65 8b 2d d8 2b 74 65 0f 1f 44 00 00 e8 c6 2c 8b ff fb f4 <65> 8b 2d c5 2b 74 65 0f 1f 44 00 00 5b 5d 41 5c c3 65 8b 05 b4 2b
[   47.938117] RSP: 0018:ffffaf6ac0207ec8 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
[   47.938117] RAX: ffff8f5b3735e200 RBX: 0000000000000003 RCX: 0000000000000001
[   47.938117] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8f5b3735e200
[   47.938117] RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000000000000
[   47.938117] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   47.938117] R13: 0000000000000000 R14: ffff8f5b3735e200 R15: ffff8f5b3735e200
[   47.938117]  ? default_idle+0x1a/0x140
[   47.938117]  do_idle+0x1bc/0x280
[   47.938117]  cpu_startup_entry+0x19/0x20
[   47.938117]  start_secondary+0x187/0x1c0
[   47.938117]  secondary_startup_64+0xa4/0xb0

The reason seems to be that tipc_net_finalize()->tipc_sk_reinit() is
calling the function rhashtable_walk_enter() within a timer interrupt.
We fix this by executing tipc_net_finalize() in work queue context.

Acked-by: Ying Xue <[email protected]>
Signed-off-by: Jon Maloy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net-gro: reset skb->pkt_type in napi_reuse_skb()

eth_type_trans() assumes initial value for skb->pkt_type
is PACKET_HOST.

This is indeed the value right after a fresh skb allocation.

However, it is possible that GRO merged a packet with a different
value (like PACKET_OTHERHOST in case macvlan is used), so
we need to make sure napi->skb will have pkt_type set back to
PACKET_HOST.

Otherwise, valid packets might be dropped by the stack because
their pkt_type is not PACKET_HOST.

napi_reuse_skb() was added in commit 96e93eab2033 ("gro: Add
internal interfaces for VLAN"), but this bug always has
been there.

Fixes: 96e93eab2033 ("gro: Add internal interfaces for VLAN")
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'tdc-fixes'

Lucas Bates says:

====================
Prevent uncaught exceptions in tdc

This patch series addresses two potential bugs in tdc that can
cause exceptions to be raised in certain circumstances. These
exceptions are generally not handled, so instead we will prevent
them from being raised.
====================

Signed-off-by: David S. Miller <[email protected]>

tc-testing: tdc.py: Guard against lack of returncode in executed command

Add some defensive coding in case one of the subprocesses created by tdc
returns nothing. If no object is returned from exec_cmd, then tdc will
halt with an unhandled exception.

Signed-off-by: Brenda J. Butler <[email protected]>
Signed-off-by: Lucas Bates <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tc-testing: tdc.py: ignore errors when decoding stdout/stderr

Prevent exceptions from being raised while decoding output
from an executed command. There is no impact on tdc's
execution and the verify command phase would fail the pattern
match.

Signed-off-by: Lucas Bates <[email protected]>
Signed-off-by: David S. Miller <[email protected]>