]> Git Repo - linux.git/log
linux.git
5 years agousb: don't create dma pools for HCDs with a localmem_pool
Christoph Hellwig [Sun, 11 Aug 2019 08:05:15 +0000 (10:05 +0200)]
usb: don't create dma pools for HCDs with a localmem_pool

If the HCD provides a localmem pool we will never use the DMA pools, so
don't create them.

Fixes: b0310c2f09bb ("USB: use genalloc for USB HCs with local memory")
Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
5 years agousb: chipidea: imx: fix EPROBE_DEFER support during driver probe
André Draszik [Sat, 10 Aug 2019 15:07:58 +0000 (16:07 +0100)]
usb: chipidea: imx: fix EPROBE_DEFER support during driver probe

If driver probe needs to be deferred, e.g. because ci_hdrc_add_device()
isn't ready yet, this driver currently misbehaves badly:
    a) success is still reported to the driver core (meaning a 2nd
       probe attempt will never be done), leaving the driver in
       a dysfunctional state and the hardware unusable

    b) driver remove / shutdown OOPSes:
    [  206.786916] Unable to handle kernel paging request at virtual address fffffdff
    [  206.794148] pgd = 880b9f82
    [  206.796890] [fffffdff] *pgd=abf5e861, *pte=00000000, *ppte=00000000
    [  206.803179] Internal error: Oops: 37 [#1] PREEMPT SMP ARM
    [  206.808581] Modules linked in: wl18xx evbug
    [  206.813308] CPU: 1 PID: 1 Comm: systemd-shutdow Not tainted 4.19.35+gf345c93b4195 #1
    [  206.821053] Hardware name: Freescale i.MX7 Dual (Device Tree)
    [  206.826813] PC is at ci_hdrc_remove_device+0x4/0x20
    [  206.831699] LR is at ci_hdrc_imx_remove+0x20/0xe8
    [  206.836407] pc : [<805cd4b0>]    lr : [<805d62cc>]    psr: 20000013
    [  206.842678] sp : a806be40  ip : 00000001  fp : 80adbd3c
    [  206.847906] r10: 80b1b794  r9 : 80d5dfe0  r8 : a8192c44
    [  206.853136] r7 : 80db93a0  r6 : a8192c10  r5 : a8192c00  r4 : a93a4a00
    [  206.859668] r3 : 00000000  r2 : a8192ce4  r1 : ffffffff  r0 : fffffdfb
    [  206.866201] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
    [  206.873341] Control: 10c5387d  Table: a9e0c06a  DAC: 00000051
    [  206.879092] Process systemd-shutdow (pid: 1, stack limit = 0xb271353c)
    [  206.885624] Stack: (0xa806be40 to 0xa806c000)
    [  206.889992] be40: a93a4a00 805d62cc a8192c1c a8170e10 a8192c10 8049a490 80d04d08 00000000
    [  206.898179] be60: 00000000 80d0da2c fee1dead 00000000 a806a000 00000058 00000000 80148b08
    [  206.906366] be80: 01234567 80148d8c a9858600 00000000 00000000 00000000 00000000 80d04d08
    [  206.914553] bea0: 00000000 00000000 a82741e0 a9858600 00000024 00000002 a9858608 00000005
    [  206.922740] bec0: 0000001e 8022c058 00000000 00000000 a806bf14 a9858600 00000000 a806befc
    [  206.930927] bee0: a806bf78 00000000 7ee12c30 8022c18c a806bef8 a806befc 00000000 00000001
    [  206.939115] bf00: 00000000 00000024 a806bf14 00000005 7ee13b34 7ee12c68 00000004 7ee13f20
    [  206.947302] bf20: 00000010 7ee12c7c 00000005 7ee12d04 0000000a 76e7dc00 00000001 80d0f140
    [  206.955490] bf40: ab637880 a974de40 60000013 80d0f140 ab6378a0 80d04d08 a8080470 a9858600
    [  206.963677] bf60: a9858600 00000000 00000000 8022c24c 00000000 80144310 00000000 00000000
    [  206.971864] bf80: 80101204 80d04d08 00000000 80d04d08 00000000 00000000 00000003 00000058
    [  206.980051] bfa0: 80101204 80101000 00000000 00000000 fee1dead 28121969 01234567 00000000
    [  206.988237] bfc0: 00000000 00000000 00000003 00000058 00000000 00000000 00000000 00000000
    [  206.996425] bfe0: 0049ffb0 7ee13d58 0048a84b 76f245a6 60000030 fee1dead 00000000 00000000
    [  207.004622] [<805cd4b0>] (ci_hdrc_remove_device) from [<805d62cc>] (ci_hdrc_imx_remove+0x20/0xe8)
    [  207.013509] [<805d62cc>] (ci_hdrc_imx_remove) from [<8049a490>] (device_shutdown+0x16c/0x218)
    [  207.022050] [<8049a490>] (device_shutdown) from [<80148b08>] (kernel_restart+0xc/0x50)
    [  207.029980] [<80148b08>] (kernel_restart) from [<80148d8c>] (sys_reboot+0xf4/0x1f0)
    [  207.037648] [<80148d8c>] (sys_reboot) from [<80101000>] (ret_fast_syscall+0x0/0x54)
    [  207.045308] Exception stack(0xa806bfa8 to 0xa806bff0)
    [  207.050368] bfa0:                   00000000 00000000 fee1dead 28121969 01234567 00000000
    [  207.058554] bfc0: 00000000 00000000 00000003 00000058 00000000 00000000 00000000 00000000
    [  207.066737] bfe0: 0049ffb0 7ee13d58 0048a84b 76f245a6
    [  207.071799] Code: ebffffa8 e3a00000 e8bd8010 e92d4010 (e5904004)
    [  207.078021] ---[ end trace be47424e3fd46e9f ]---
    [  207.082647] Kernel panic - not syncing: Fatal exception
    [  207.087894] ---[ end Kernel panic - not syncing: Fatal exception ]---

    c) the error path in combination with driver removal causes
       imbalanced calls to the clk_*() and pm_()* APIs

a) happens because the original intended return value is
   overwritten (with 0) by the return code of
   regulator_disable() in ci_hdrc_imx_probe()'s error path
b) happens because ci_pdev is -EPROBE_DEFER, which causes
   ci_hdrc_remove_device() to OOPS

Fix a) by being more careful in ci_hdrc_imx_probe()'s error
path and not overwriting the real error code

Fix b) by calling the respective cleanup functions during
remove only when needed (when ci_pdev != NULL, i.e. when
everything was initialised correctly). This also has the
side effect of not causing imbalanced clk_*() and pm_*()
API calls as part of the error code path.

Fixes: 7c8e8909417e ("usb: chipidea: imx: add HSIC support")
Signed-off-by: André Draszik <[email protected]>
Cc: stable <[email protected]>
CC: Peter Chen <[email protected]>
CC: Greg Kroah-Hartman <[email protected]>
CC: Shawn Guo <[email protected]>
CC: Sascha Hauer <[email protected]>
CC: Pengutronix Kernel Team <[email protected]>
CC: Fabio Estevam <[email protected]>
CC: NXP Linux Team <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
5 years agousb: host: fotg2: restart hcd after port reset
Hans Ulli Kroll [Sat, 10 Aug 2019 15:04:58 +0000 (17:04 +0200)]
usb: host: fotg2: restart hcd after port reset

On the Gemini SoC the FOTG2 stalls after port reset
so restart the HCD after each port reset.

Signed-off-by: Hans Ulli Kroll <[email protected]>
Signed-off-by: Linus Walleij <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
5 years agoUSB: CDC: fix sanity checks in CDC union parser
Oliver Neukum [Tue, 13 Aug 2019 09:35:41 +0000 (11:35 +0200)]
USB: CDC: fix sanity checks in CDC union parser

A few checks checked for the size of the pointer to a structure
instead of the structure itself. Copy & paste issue presumably.

Fixes: e4c6fb7794982 ("usbnet: move the CDC parser into USB core")
Cc: stable <[email protected]>
Reported-by: [email protected]
Signed-off-by: Oliver Neukum <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
5 years agousb: cdc-acm: make sure a refcount is taken early enough
Oliver Neukum [Thu, 8 Aug 2019 14:21:19 +0000 (16:21 +0200)]
usb: cdc-acm: make sure a refcount is taken early enough

destroy() will decrement the refcount on the interface, so that
it needs to be taken so early that it never undercounts.

Fixes: 7fb57a019f94e ("USB: cdc-acm: Fix potential deadlock (lockdep warning)")
Cc: stable <[email protected]>
Reported-and-tested-by: [email protected]
Signed-off-by: Oliver Neukum <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
5 years agoUSB: serial: option: add the BroadMobi BM818 card
Bob Ham [Wed, 24 Jul 2019 14:52:26 +0000 (07:52 -0700)]
USB: serial: option: add the BroadMobi BM818 card

Add a VID:PID for the BroadMobi BM818 M.2 card

T:  Bus=01 Lev=03 Prnt=40 Port=03 Cnt=01 Dev#= 44 Spd=480 MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=2020 ProdID=2060 Rev=00.00
S:  Manufacturer=Qualcomm, Incorporated
S:  Product=Qualcomm CDMA Technologies MSM
C:  #Ifs= 5 Cfg#= 1 Atr=e0 MxPwr=500mA
I:  If#=0x0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=(none)
I:  If#=0x1 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=(none)
I:  If#=0x2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=(none)
I:  If#=0x3 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=fe Prot=ff Driver=(none)
I:  If#=0x4 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=(none)

Signed-off-by: Bob Ham <[email protected]>
Signed-off-by: Angus Ainslie (Purism) <[email protected]>
Cc: stable <[email protected]>
[ johan: use USB_DEVICE_INTERFACE_CLASS() ]
Signed-off-by: Johan Hovold <[email protected]>
5 years agoUSB: serial: option: Add Motorola modem UARTs
Tony Lindgren [Thu, 15 Aug 2019 08:26:02 +0000 (01:26 -0700)]
USB: serial: option: Add Motorola modem UARTs

On Motorola Mapphone devices such as Droid 4 there are five USB ports
that do not use the same layout as Gobi 1K/2K/etc devices listed in
qcserial.c. So we should use qcaux.c or option.c as noted by
Dan Williams <[email protected]>.

As the Motorola USB serial ports have an interrupt endpoint as shown
with lsusb -v, we should use option.c instead of qcaux.c as pointed out
by Johan Hovold <[email protected]>.

The ff/ff/ff interfaces seem to always be UARTs on Motorola devices.
For the other interfaces, class 0x0a (CDC Data) should not in general
be added as they are typically part of a multi-interface function as
noted earlier by Bjørn Mork <[email protected]>.

However, looking at the Motorola mapphone kernel code, the mdm6600 0x0a
class is only used for flashing the modem firmware, and there are no
other interfaces. So I've added that too with more details below as it
works just fine.

The ttyUSB ports on Droid 4 are:

ttyUSB0 DIAG, CQDM-capable
ttyUSB1 MUX or NMEA, no response
ttyUSB2 MUX or NMEA, no response
ttyUSB3 TCMD
ttyUSB4 AT-capable

The ttyUSB0 is detected as QCDM capable by ModemManager. I think
it's only used for debugging with ModemManager --debug for sending
custom AT commands though. ModemManager already can manage data
connection using the USB QMI ports that are already handled by the
qmi_wwan.c driver.

To enable the MUX or NMEA ports, it seems that something needs to be
done additionally to enable them, maybe via the DIAG or TCMD port.
It might be just a NVRAM setting somewhere, but I have no idea what
NVRAM settings may need changing for that.

The TCMD port seems to be a Motorola custom protocol for testing
the modem and to configure it's NVRAM and seems to work just fine
based on a quick test with a minimal tcmdrw tool I wrote.

The voice modem AT-capable port seems to provide only partial
support, and no PM support compared to the TS 27.010 based UART
wired directly to the modem.

The UARTs added with this change are the same product IDs as the
Motorola Mapphone Android Linux kernel mdm6600_id_table. I don't
have any mdm9600 based devices, so I have only tested these on
mdm6600 based droid 4.

Then for the class 0x0a (CDC Data) mode, the Motorola Mapphone Android
Linux kernel driver moto_flashqsc.c just seems to change the
port->bulk_out_size to 8K from the default. And is only used for
flashing the modem firmware it seems.

I've verified that flashing the modem with signed firmware works just
fine with the option driver after manually toggling the GPIO pins, so
I've added droid 4 modem flashing mode to the option driver. I've not
added the other devices listed in moto_flashqsc.c in case they really
need different port->bulk_out_size. Those can be added as they get
tested to work for flashing the modem.

After this patch the output of /sys/kernel/debug/usb/devices has
the following for normal 22b8:2a70 mode including the related qmi_wwan
interfaces:

T:  Bus=01 Lev=01 Prnt=01 Port=00 Cnt=01 Dev#=  2 Spd=12   MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=22b8 ProdID=2a70 Rev= 0.00
S:  Manufacturer=Motorola, Incorporated
S:  Product=Flash MZ600
C:* #Ifs= 9 Cfg#= 1 Atr=e0 MxPwr=500mA
I:* If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
E:  Ad=81(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=01(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms
I:* If#= 1 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
E:  Ad=82(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=02(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms
I:* If#= 2 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
E:  Ad=83(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=03(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms
I:* If#= 3 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
E:  Ad=84(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=04(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms
I:* If#= 4 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
E:  Ad=85(I) Atr=03(Int.) MxPS=  64 Ivl=5ms
E:  Ad=86(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=05(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms
I:* If#= 5 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=fb Prot=ff Driver=qmi_wwan
E:  Ad=87(I) Atr=03(Int.) MxPS=  64 Ivl=5ms
E:  Ad=88(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=06(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms
I:* If#= 6 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=fb Prot=ff Driver=qmi_wwan
E:  Ad=89(I) Atr=03(Int.) MxPS=  64 Ivl=5ms
E:  Ad=8a(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=07(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms
I:* If#= 7 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=fb Prot=ff Driver=qmi_wwan
E:  Ad=8b(I) Atr=03(Int.) MxPS=  64 Ivl=5ms
E:  Ad=8c(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=08(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms
I:* If#= 8 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=fb Prot=ff Driver=qmi_wwan
E:  Ad=8d(I) Atr=03(Int.) MxPS=  64 Ivl=5ms
E:  Ad=8e(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=09(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms

In 22b8:900e "qc_dload" mode the device shows up as:

T:  Bus=01 Lev=01 Prnt=01 Port=00 Cnt=01 Dev#=  2 Spd=12   MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=22b8 ProdID=900e Rev= 0.00
S:  Manufacturer=Motorola, Incorporated
S:  Product=Flash MZ600
C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=500mA
I:* If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
E:  Ad=81(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=01(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms

And in 22b8:4281 "ram_downloader" mode the device shows up as:

T:  Bus=01 Lev=01 Prnt=01 Port=00 Cnt=01 Dev#=  2 Spd=12   MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=22b8 ProdID=4281 Rev= 0.00
S:  Manufacturer=Motorola, Incorporated
S:  Product=Flash MZ600
C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=500mA
I:* If#= 0 Alt= 0 #EPs= 2 Cls=0a(data ) Sub=00 Prot=fc Driver=option
E:  Ad=81(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=01(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms

Cc: Bjørn Mork <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Lars Melin <[email protected]>
Cc: Marcel Partap <[email protected]>
Cc: Merlijn Wajer <[email protected]>
Cc: Michael Scott <[email protected]>
Cc: NeKit <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Sebastian Reichel <[email protected]>
Tested-by: Pavel Machek <[email protected]>
Signed-off-by: Tony Lindgren <[email protected]>
Cc: stable <[email protected]>
Signed-off-by: Johan Hovold <[email protected]>
5 years agoMAINTAINERS, x86/CPU: Tony Luck will maintain asm/intel-family.h
Tony Luck [Wed, 14 Aug 2019 23:40:30 +0000 (16:40 -0700)]
MAINTAINERS, x86/CPU: Tony Luck will maintain asm/intel-family.h

There are a few different subsystems in the kernel that depend on model
specific behaviour (perf, EDAC, power, ...). Easier for just one person
to have the task to get new model numbers included instead of having
these groups trip over each other to do it.

 [ bp: s/Cpu/CPU/ and add [email protected] so that it gets CCed too as
   FYI. ]

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: x86-ml <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
5 years agoMerge tag 'drm-fixes-5.3-2019-08-14' of git://people.freedesktop.org/~agd5f/linux...
Dave Airlie [Thu, 15 Aug 2019 03:29:18 +0000 (13:29 +1000)]
Merge tag 'drm-fixes-5.3-2019-08-14' of git://people.freedesktop.org/~agd5f/linux into drm-fixes

drm-fixes-5.3-2019-08-14:

amdgpu:
- Use kvalloc for dc_state to avoid allocation
  failures in some cases.
- Fix gfx9 soft recovery

scheduler:
- Fix a race condition when destroying entities

Signed-off-by: Dave Airlie <[email protected]>
From: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
5 years agodrm/nouveau: Only recalculate PBN/VCPI on mode/connector changes
Lyude Paul [Fri, 9 Aug 2019 00:53:05 +0000 (20:53 -0400)]
drm/nouveau: Only recalculate PBN/VCPI on mode/connector changes

I -thought- I had fixed this entirely, but it looks like that I didn't
test this thoroughly enough as we apparently still make one big mistake
with nv50_msto_atomic_check() - we don't handle the following scenario:

* CRTC #1 has n VCPI allocated to it, is attached to connector DP-4
  which is attached to encoder #1. enabled=y active=n
* CRTC #1 is changed from DP-4 to DP-5, causing:
  * DP-4 crtc=#1→NULL (VCPI n→0)
  * DP-5 crtc=NULL→#1
  * CRTC #1 steals encoder #1 back from DP-4 and gives it to DP-5
  * CRTC #1 maintains the same mode as before, just with a different
    connector
* mode_changed=n connectors_changed=y
  (we _SHOULD_ do VCPI 0→n here, but don't)

Once the above scenario is repeated once, we'll attempt freeing VCPI
from the connector that we didn't allocate due to the connectors
changing, but the mode staying the same. Sigh.

Since nv50_msto_atomic_check() has broken a few times now, let's rethink
things a bit to be more careful: limit both VCPI/PBN allocations to
mode_changed || connectors_changed, since neither VCPI or PBN should
ever need to change outside of routing and mode changes.

Changes since v1:
* Fix accidental reversal of clock and bpp arguments in
  drm_dp_calc_pbn_mode() - William Lewis

Signed-off-by: Lyude Paul <[email protected]>
Reported-by: Bohdan Milar <[email protected]>
Tested-by: Bohdan Milar <[email protected]>
Fixes: 232c9eec417a ("drm/nouveau: Use atomic VCPI helpers for MST")
References: 412e85b60531 ("drm/nouveau: Only release VCPI slots on mode changes")
Cc: Lyude Paul <[email protected]>
Cc: Ben Skeggs <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Jerry Zuo <[email protected]>
Cc: Harry Wentland <[email protected]>
Cc: Juston Li <[email protected]>
Cc: Laurent Pinchart <[email protected]>
Cc: Karol Herbst <[email protected]>
Cc: Ilia Mirkin <[email protected]>
Cc: <[email protected]> # v5.1+
Acked-by: Ben Skeggs <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
5 years agodrm/ast: Fixed reboot test may cause system hanged
Y.C. Chen [Wed, 11 Apr 2018 01:27:39 +0000 (09:27 +0800)]
drm/ast: Fixed reboot test may cause system hanged

There is another thread still access standard VGA I/O while loading drm driver.
Disable standard VGA I/O decode to avoid this issue.

Signed-off-by: Y.C. Chen <[email protected]>
Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
5 years agocxgb4: fix a memory leak bug
Wenwen Wang [Tue, 13 Aug 2019 09:18:52 +0000 (04:18 -0500)]
cxgb4: fix a memory leak bug

In blocked_fl_write(), 't' is not deallocated if bitmap_parse_user() fails,
leading to a memory leak bug. To fix this issue, free t before returning
the error.

Signed-off-by: Wenwen Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 years agonet/mvpp2: Replace tasklet with softirq hrtimer
Thomas Gleixner [Tue, 13 Aug 2019 08:00:25 +0000 (10:00 +0200)]
net/mvpp2: Replace tasklet with softirq hrtimer

The tx_done_tasklet tasklet is used in invoke the hrtimer
(mvpp2_hr_timer_cb) in softirq context. This can be also achieved without
the tasklet but with HRTIMER_MODE_SOFT as hrtimer mode.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
David S. Miller [Thu, 15 Aug 2019 02:59:00 +0000 (19:59 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next.
This round addresses fallout from previous pull request:

1) Remove #warning from ipt_LOG.h and ip6t_LOG.h headers,
   from Jeremy Sowden.

2) Incorrect parens in memcmp() in nft_bitwise, from Nathan Chancellor.
====================

Signed-off-by: David S. Miller <[email protected]>
5 years agoof: irq: fix a trivial typo in a doc comment
Lubomir Rintel [Wed, 7 Aug 2019 13:22:31 +0000 (15:22 +0200)]
of: irq: fix a trivial typo in a doc comment

Diverged from what the code does with commit 530210c7814e ("of/irq: Replace
of_irq with of_phandle_args").

Signed-off-by: Lubomir Rintel <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
5 years agodt-bindings: pinctrl: stm32: Fix 'st,syscfg' schema
Rob Herring [Tue, 13 Aug 2019 20:47:54 +0000 (14:47 -0600)]
dt-bindings: pinctrl: stm32: Fix 'st,syscfg' schema

The proper way to add additional contraints to an existing json-schema
is using 'allOf' to reference the base schema. Using just '$ref' doesn't
work. Fix this for the 'st,syscfg' property.

Cc: Mark Rutland <[email protected]>
Cc: Maxime Coquelin <[email protected]>
Cc: Alexandre Torgue <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Reviewed-by: Linus Walleij <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
5 years agoMerge tag 'Wimplicit-fallthrough-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kerne...
Linus Torvalds [Wed, 14 Aug 2019 22:29:53 +0000 (15:29 -0700)]
Merge tag 'Wimplicit-fallthrough-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux

Pull fallthrough fixes from Gustavo A. R. Silva:
 "Fix sh mainline builds:

   - Fix fall-through warning in sh.

   - Fix missing break bug in sh (this is a 10-year-old bug)

  Currently, mainline builds for sh are broken. These patches fix that"

* tag 'Wimplicit-fallthrough-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
  sh: kernel: hw_breakpoint: Fix missing break in switch statement
  sh: kernel: disassemble: Mark expected switch fall-throughs

5 years agonetfilter: nft_bitwise: Adjust parentheses to fix memcmp size argument
Nathan Chancellor [Wed, 14 Aug 2019 16:58:09 +0000 (09:58 -0700)]
netfilter: nft_bitwise: Adjust parentheses to fix memcmp size argument

clang warns:

net/netfilter/nft_bitwise.c:138:50: error: size argument in 'memcmp'
call is a comparison [-Werror,-Wmemsize-comparison]
        if (memcmp(&priv->xor, &zero, sizeof(priv->xor) ||
                                      ~~~~~~~~~~~~~~~~~~^~
net/netfilter/nft_bitwise.c:138:6: note: did you mean to compare the
result of 'memcmp' instead?
        if (memcmp(&priv->xor, &zero, sizeof(priv->xor) ||
            ^
                                                       )
net/netfilter/nft_bitwise.c:138:32: note: explicitly cast the argument
to size_t to silence this warning
        if (memcmp(&priv->xor, &zero, sizeof(priv->xor) ||
                                      ^
                                      (size_t)(
1 error generated.

Adjust the parentheses so that the result of the sizeof is used for the
size argument in memcmp, rather than the result of the comparison (which
would always be true because sizeof is a non-zero number).

Fixes: bd8699e9e292 ("netfilter: nft_bitwise: add offload support")
Link: https://github.com/ClangBuiltLinux/linux/issues/638
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
5 years agonetfilter: remove deprecation warnings from uapi headers.
Jeremy Sowden [Wed, 14 Aug 2019 08:01:28 +0000 (09:01 +0100)]
netfilter: remove deprecation warnings from uapi headers.

There are two netfilter userspace headers which contain deprecation
warnings.  While these headers are not used within the kernel, they are
compiled stand-alone for header-testing.

Pablo informs me that userspace iptables still refer to these headers,
and the intention was to use xt_LOG.h instead and remove these, but
userspace was never updated.

Remove the warnings.

Fixes: 2a475c409fe8 ("kbuild: remove all netfilter headers from header-test blacklist.")
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Jeremy Sowden <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
5 years agoMerge tag 'afs-fixes-20190814' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowe...
Linus Torvalds [Wed, 14 Aug 2019 21:21:14 +0000 (14:21 -0700)]
Merge tag 'afs-fixes-20190814' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull afs fixes from David Howells:

 - Fix the CB.ProbeUuid handler to generate its reply correctly.

 - Fix a mix up in indices when parsing a Volume Location entry record.

 - Fix a potential NULL-pointer deref when cleaning up a read request.

 - Fix the expected data version of the destination directory in
   afs_rename().

 - Fix afs_d_revalidate() to only update d_fsdata if it's not the same
   as the directory data version to reduce the likelihood of overwriting
   the result of a competing operation. (d_fsdata carries the directory
   DV or the least-significant word thereof).

 - Fix the tracking of the data-version on a directory and make sure
   that dentry objects get properly initialised, updated and
   revalidated.

   Also fix rename to update d_fsdata to match the new directory's DV if
   the dentry gets moved over and unhash the dentry to stop
   afs_d_revalidate() from interfering.

* tag 'afs-fixes-20190814' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Fix missing dentry data version updating
  afs: Only update d_fsdata if different in afs_d_revalidate()
  afs: Fix off-by-one in afs_rename() expected data version calculation
  fs: afs: Fix a possible null-pointer dereference in afs_put_read()
  afs: Fix loop index mixup in afs_deliver_vl_get_entry_by_name_u()
  afs: Fix the CB.ProbeUuid service handler to reply correctly

5 years agodrm/scheduler: use job count instead of peek
Christian König [Fri, 9 Aug 2019 15:27:21 +0000 (17:27 +0200)]
drm/scheduler: use job count instead of peek

The spsc_queue_peek function is accessing queue->head which belongs to
the consumer thread and shouldn't be accessed by the producer

This is fixing a rare race condition when destroying entities.

Signed-off-by: Christian König <[email protected]>
Acked-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: [email protected]
Signed-off-by: Alex Deucher <[email protected]>
5 years agoriscv: Make __fstate_clean() work correctly.
Vincent Chen [Wed, 14 Aug 2019 08:23:53 +0000 (16:23 +0800)]
riscv: Make __fstate_clean() work correctly.

Make the __fstate_clean() function correctly set the
state of sstatus.FS in pt_regs to SR_FS_CLEAN.

Fixes: 7db91e57a0acd ("RISC-V: Task implementation")
Cc: linux-stable <[email protected]>
Signed-off-by: Vincent Chen <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
[[email protected]: expanded "Fixes" commit ID]
Signed-off-by: Paul Walmsley <[email protected]>
5 years agoriscv: Correct the initialized flow of FP register
Vincent Chen [Wed, 14 Aug 2019 08:23:52 +0000 (16:23 +0800)]
riscv: Correct the initialized flow of FP register

  The following two reasons cause FP registers are sometimes not
initialized before starting the user program.
1. Currently, the FP context is initialized in flush_thread() function
   and we expect these initial values to be restored to FP register when
   doing FP context switch. However, the FP context switch only occurs in
   switch_to function. Hence, if this process does not be scheduled out
   and scheduled in before entering the user space, the FP registers
   have no chance to initialize.
2. In flush_thread(), the state of reg->sstatus.FS inherits from the
   parent. Hence, the state of reg->sstatus.FS may be dirty. If this
   process is scheduled out during flush_thread() and initializing the
   FP register, the fstate_save() in switch_to will corrupt the FP context
   which has been initialized until flush_thread().

  To solve the 1st case, the initialization of the FP register will be
completed in start_thread(). It makes sure all FP registers are initialized
before starting the user program. For the 2nd case, the state of
reg->sstatus.FS in start_thread will be set to SR_FS_OFF to prevent this
process from corrupting FP context in doing context save. The FP state is
set to SR_FS_INITIAL in start_trhead().

Signed-off-by: Vincent Chen <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Fixes: 7db91e57a0acd ("RISC-V: Task implementation")
Cc: [email protected]
[[email protected]: fixed brace alignment issue reported by
 checkpatch]
Signed-off-by: Paul Walmsley <[email protected]>
5 years agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Linus Torvalds [Wed, 14 Aug 2019 18:10:38 +0000 (11:10 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Doug Ledford:
 "Fairly small pull request for -rc3. I'm out of town the rest of this
  week, so I made sure to clean out as much as possible from patchworks
  in enough time for 0-day to chew through it (Yay! for 0-day being back
  online! :-)). Jason might send through any emergency stuff that could
  pop up, otherwise I'm back next week.

  The only real thing of note is the siw ABI change. Since we just
  merged siw *this* release, there are no prior kernel releases to
  maintain kernel ABI with. I told Bernard that if there is anything
  else about the siw ABI he thinks he might want to change before it
  goes set in stone, he should get it in ASAP. The siw module was around
  for several years outside the kernel tree, and it had to be revamped
  considerably for inclusion upstream, so we are making no attempts to
  be backward compatible with the out of tree version. Once 5.3 is
  actually released, we will have our baseline ABI to maintain.

  Summary:

   - Fix a memory registration release flow issue that was causing a
     WARN_ON (mlx5)

   - If the counters for a port aren't allocated, then we can't do
     operations on the non-existent counters (core)

   - Check the right variable for error code result (mlx5)

   - Fix a use after free issue (mlx5)

   - Fix an off by one memory leak (siw)

   - Actually return an error code on error (core)

   - Allow siw to be built on 32bit arches (siw, ABI change, but OK
     since siw was just merged this merge window and there is no prior
     released kernel to maintain compatibility with and we also updated
     the rdma-core user space package to match)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  RDMA/siw: Change CQ flags from 64->32 bits
  RDMA/core: Fix error code in stat_get_doit_qp()
  RDMA/siw: Fix a memory leak in siw_init_cpulist()
  IB/mlx5: Fix use-after-free error while accessing ev_file pointer
  IB/mlx5: Check the correct variable in error handling code
  RDMA/counter: Prevent QP counter binding if counters unsupported
  IB/mlx5: Fix implicit MR release flow

5 years agoALSA: usb-audio: Fix an OOB bug in parse_audio_mixer_unit
Hui Peng [Wed, 14 Aug 2019 02:34:04 +0000 (22:34 -0400)]
ALSA: usb-audio: Fix an OOB bug in parse_audio_mixer_unit

The `uac_mixer_unit_descriptor` shown as below is read from the
device side. In `parse_audio_mixer_unit`, `baSourceID` field is
accessed from index 0 to `bNrInPins` - 1, the current implementation
assumes that descriptor is always valid (the length  of descriptor
is no shorter than 5 + `bNrInPins`). If a descriptor read from
the device side is invalid, it may trigger out-of-bound memory
access.

```
struct uac_mixer_unit_descriptor {
__u8 bLength;
__u8 bDescriptorType;
__u8 bDescriptorSubtype;
__u8 bUnitID;
__u8 bNrInPins;
__u8 baSourceID[];
}
```

This patch fixes the bug by add a sanity check on the length of
the descriptor.

Reported-by: Hui Peng <[email protected]>
Reported-by: Mathias Payer <[email protected]>
Cc: <[email protected]>
Signed-off-by: Hui Peng <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>
5 years agoMerge tag 'dma-mapping-5.3-4' of git://git.infradead.org/users/hch/dma-mapping
Linus Torvalds [Wed, 14 Aug 2019 17:31:11 +0000 (10:31 -0700)]
Merge tag 'dma-mapping-5.3-4' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:

 - fix the handling of the bus_dma_mask in dma_get_required_mask, which
   caused a regression in this merge window (Lucas Stach)

 - fix a regression in the handling of DMA_ATTR_NO_KERNEL_MAPPING (me)

 - fix dma_mmap_coherent to not cause page attribute mismatches on
   coherent architectures like x86 (me)

* tag 'dma-mapping-5.3-4' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: fix page attributes for dma_mmap_*
  dma-direct: don't truncate dma_required_mask to bus addressing capabilities
  dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPING

5 years agonet: phy: realtek: add NBase-T PHY auto-detection
Heiner Kallweit [Tue, 13 Aug 2019 06:09:32 +0000 (08:09 +0200)]
net: phy: realtek: add NBase-T PHY auto-detection

Realtek provided information on how the new NIC-integrated PHY's
expose whether they support 2.5G/5G/10G. This allows to automatically
differentiate 1Gbps and 2.5Gbps PHY's, and therefore allows to
remove the fake PHY ID mechanism for RTL8125.
So far RTL8125 supports 2.5Gbps only, but register layout for faster
modes has been defined already, so let's use this information to be
future-proof.

Signed-off-by: Heiner Kallweit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 years agoMerge tag 'iommu-fixes-v5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Wed, 14 Aug 2019 17:16:59 +0000 (10:16 -0700)]
Merge tag 'iommu-fixes-v5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull iommu fixes from Joerg Roedel:

 - A couple more fixes for the Intel VT-d driver for bugs introduced
   during the recent conversion of this driver to use IOMMU core default
   domains.

 - Fix for common dma-iommu code to make sure MSI mappings happen in the
   correct domain for a device.

 - Fix a corner case in the handling of sg-lists in dma-iommu code that
   might cause dma_length to be truncated.

 - Mark a switch as fall-through in arm-smmu code.

* tag 'iommu-fixes-v5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/vt-d: Fix possible use-after-free of private domain
  iommu/vt-d: Detach domain before using a private one
  iommu/dma: Handle SG length overflow better
  iommu/vt-d: Correctly check format of page table in debugfs
  iommu/vt-d: Detach domain when move device out of group
  iommu/arm-smmu: Mark expected switch fall-through
  iommu/dma: Handle MSI mappings separately

5 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Wed, 14 Aug 2019 16:53:46 +0000 (09:53 -0700)]
Merge branch 'akpm' (patches from Andrew)

Merge misc VM fixes from Andrew Morton:
 "A bunch of hotfixes, all affecting mm/.

  The two-patch series from Andrea may be controversial. This restores
  patches which were reverted in Dec 2018 due to a regression report [*].

  After extensive discussion it is evident that the problems which these
  patches solved were significantly more serious than the problems they
  introduced. I am told that major distros are already carrying these
  two patches for this reason"

[*] See

      https://lore.kernel.org/lkml/alpine.DEB.2.21.1812061343240[email protected]/
      https://lore.kernel.org/lkml/alpine.DEB.2.21.1812031545560[email protected]/

  for the google-specific issues brought up by David Rijentes. And as
  Andrew says:

    "I'm unaware of anyone else who will be adversely affected by this,
     and google already carries over a thousand kernel patches - another
     won't kill them.

     There has been sporadic discussion about fixing these things for
     real but it's clear that nobody apart from David is particularly
     motivated"

* emailed patches from Andrew Morton <[email protected]>:
  hugetlbfs: fix hugetlb page migration/fault race causing SIGBUS
  mm, vmscan: do not special-case slab reclaim when watermarks are boosted
  Revert "mm, thp: restore node-local hugepage allocations"
  Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
  include/asm-generic/5level-fixup.h: fix variable 'p4d' set but not used
  seq_file: fix problem when seeking mid-record
  mm: workingset: fix vmstat counters for shadow nodes
  mm/usercopy: use memory range to be accessed for wraparound check
  mm: kmemleak: disable early logging in case of error
  mm/vmalloc.c: fix percpu free VM area search criteria
  mm/memcontrol.c: fix use after free in mem_cgroup_iter()
  mm/z3fold.c: fix z3fold_destroy_pool() race condition
  mm/z3fold.c: fix z3fold_destroy_pool() ordering
  mm: mempolicy: handle vma with unmovable pages mapped correctly in mbind
  mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified
  mm/hmm: fix bad subpage pointer in try_to_unmap_one
  mm/hmm: fix ZONE_DEVICE anon page mapping reuse
  mm: document zone device struct page field usage

5 years agor8169: fix sporadic transmit timeout issue
Heiner Kallweit [Mon, 12 Aug 2019 18:47:40 +0000 (20:47 +0200)]
r8169: fix sporadic transmit timeout issue

Holger reported sporadic transmit timeouts and it turned out that one
path misses ringing the doorbell. Fix was suggested by Eric.

Fixes: ef14358546b1 ("r8169: make use of xmit_more")
Suggested-by: Eric Dumazet <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Signed-off-by: Heiner Kallweit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
5 years agoBluetooth: hci_qca: Skip 1 error print in device_want_to_sleep()
Rocky Liao [Wed, 14 Aug 2019 07:42:39 +0000 (15:42 +0800)]
Bluetooth: hci_qca: Skip 1 error print in device_want_to_sleep()

Don't fall through to print error message when receive sleep indication
in HCI_IBS_RX_ASLEEP state, this is allowed behavior.

Signed-off-by: Rocky Liao <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>
5 years agoi2c: stm32: Use the correct style for SPDX License Identifier
Nishad Kamdar [Sat, 3 Aug 2019 14:13:35 +0000 (19:43 +0530)]
i2c: stm32: Use the correct style for SPDX License Identifier

This patch corrects the SPDX License Identifier style
in header file related to STM32 Driver for I2C hardware
bus support.
For C header files Documentation/process/license-rules.rst
mandates C-like comments (opposed to C source files where
C++ style should be used)

Changes made by using a script provided by Joe Perches here:
https://lkml.org/lkml/2019/2/7/46

Suggested-by: Joe Perches <[email protected]>
Signed-off-by: Nishad Kamdar <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>
5 years agoi2c: emev2: avoid race when unregistering slave client
Wolfram Sang [Thu, 8 Aug 2019 19:54:17 +0000 (21:54 +0200)]
i2c: emev2: avoid race when unregistering slave client

After we disabled interrupts, there might still be an active one
running. Sync before clearing the pointer to the slave device.

Fixes: c31d0a00021d ("i2c: emev2: add slave support")
Reported-by: Krzysztof Adamski <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>
Reviewed-by: Krzysztof Adamski <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>
5 years agoi2c: rcar: avoid race when unregistering slave client
Wolfram Sang [Thu, 8 Aug 2019 19:39:10 +0000 (21:39 +0200)]
i2c: rcar: avoid race when unregistering slave client

After we disabled interrupts, there might still be an active one
running. Sync before clearing the pointer to the slave device.

Fixes: de20d1857dd6 ("i2c: rcar: add slave support")
Reported-by: Krzysztof Adamski <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>
Reviewed-by: Krzysztof Adamski <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>
5 years agorxrpc: Fix read-after-free in rxrpc_queue_local()
David Howells [Tue, 13 Aug 2019 21:26:36 +0000 (22:26 +0100)]
rxrpc: Fix read-after-free in rxrpc_queue_local()

rxrpc_queue_local() attempts to queue the local endpoint it is given and
then, if successful, prints a trace line.  The trace line includes the
current usage count - but we're not allowed to look at the local endpoint
at this point as we passed our ref on it to the workqueue.

Fix this by reading the usage count before queuing the work item.

Also fix the reading of local->debug_id for trace lines, which must be done
with the same consideration as reading the usage count.

Fixes: 09d2bf595db4 ("rxrpc: Add a tracepoint to track rxrpc_local refcounting")
Reported-by: [email protected]
Signed-off-by: David Howells <[email protected]>
5 years agorxrpc: Fix local endpoint replacement
David Howells [Mon, 12 Aug 2019 22:30:06 +0000 (23:30 +0100)]
rxrpc: Fix local endpoint replacement

When a local endpoint (struct rxrpc_local) ceases to be in use by any
AF_RXRPC sockets, it starts the process of being destroyed, but this
doesn't cause it to be removed from the namespace endpoint list immediately
as tearing it down isn't trivial and can't be done in softirq context, so
it gets deferred.

If a new socket comes along that wants to bind to the same endpoint, a new
rxrpc_local object will be allocated and rxrpc_lookup_local() will use
list_replace() to substitute the new one for the old.

Then, when the dying object gets to rxrpc_local_destroyer(), it is removed
unconditionally from whatever list it is on by calling list_del_init().

However, list_replace() doesn't reset the pointers in the replaced
list_head and so the list_del_init() will likely corrupt the local
endpoints list.

Fix this by using list_replace_init() instead.

Fixes: 730c5fd42c1e ("rxrpc: Fix local endpoint refcounting")
Reported-by: [email protected]
Signed-off-by: David Howells <[email protected]>
5 years agoMAINTAINERS: i2c-imx: take over maintainership
Oleksij Rempel [Mon, 12 Aug 2019 05:08:17 +0000 (07:08 +0200)]
MAINTAINERS: i2c-imx: take over maintainership

I would like to maintain the i2c-imx driver. Since I work with
different i.MX variants and have access to the hardware, I can spend
some time on the reviewing of this driver.

Signed-off-by: Oleksij Rempel <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>
5 years agoRevert "i2c: imx: improve the error handling in i2c_imx_dma_request()"
Fabio Estevam [Thu, 8 Aug 2019 21:01:36 +0000 (18:01 -0300)]
Revert "i2c: imx: improve the error handling in i2c_imx_dma_request()"

Since commit e1ab9a468e3b ("i2c: imx: improve the error handling in
i2c_imx_dma_request()") when booting with the DMA driver as module (such
as CONFIG_FSL_EDMA=m) the following endless clk warnings are seen:

[  153.077831] ------------[ cut here ]------------
[  153.082528] WARNING: CPU: 0 PID: 15 at drivers/clk/clk.c:924 clk_core_disable_lock+0x18/0x24
[  153.093077] i2c0 already disabled
[  153.096416] Modules linked in:
[  153.099521] CPU: 0 PID: 15 Comm: kworker/0:1 Tainted: G        W         5.2.0+ #321
[  153.107290] Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
[  153.113772] Workqueue: events deferred_probe_work_func
[  153.118979] [<c0019560>] (unwind_backtrace) from [<c0014734>] (show_stack+0x10/0x14)
[  153.126778] [<c0014734>] (show_stack) from [<c083f8dc>] (dump_stack+0x9c/0xd4)
[  153.134051] [<c083f8dc>] (dump_stack) from [<c0031154>] (__warn+0xf8/0x124)
[  153.141056] [<c0031154>] (__warn) from [<c0031248>] (warn_slowpath_fmt+0x38/0x48)
[  153.148580] [<c0031248>] (warn_slowpath_fmt) from [<c040fde0>] (clk_core_disable_lock+0x18/0x24)
[  153.157413] [<c040fde0>] (clk_core_disable_lock) from [<c058f520>] (i2c_imx_probe+0x554/0x6ec)
[  153.166076] [<c058f520>] (i2c_imx_probe) from [<c04b9178>] (platform_drv_probe+0x48/0x98)
[  153.174297] [<c04b9178>] (platform_drv_probe) from [<c04b7298>] (really_probe+0x1d8/0x2c0)
[  153.182605] [<c04b7298>] (really_probe) from [<c04b7554>] (driver_probe_device+0x5c/0x174)
[  153.190909] [<c04b7554>] (driver_probe_device) from [<c04b58c8>] (bus_for_each_drv+0x44/0x8c)
[  153.199480] [<c04b58c8>] (bus_for_each_drv) from [<c04b746c>] (__device_attach+0xa0/0x108)
[  153.207782] [<c04b746c>] (__device_attach) from [<c04b65a4>] (bus_probe_device+0x88/0x90)
[  153.215999] [<c04b65a4>] (bus_probe_device) from [<c04b6a04>] (deferred_probe_work_func+0x60/0x90)
[  153.225003] [<c04b6a04>] (deferred_probe_work_func) from [<c004f190>] (process_one_work+0x204/0x634)
[  153.234178] [<c004f190>] (process_one_work) from [<c004f618>] (worker_thread+0x20/0x484)
[  153.242315] [<c004f618>] (worker_thread) from [<c0055c2c>] (kthread+0x118/0x150)
[  153.249758] [<c0055c2c>] (kthread) from [<c00090b4>] (ret_from_fork+0x14/0x20)
[  153.257006] Exception stack(0xdde43fb0 to 0xdde43ff8)
[  153.262095] 3fa0:                                     00000000 00000000 00000000 00000000
[  153.270306] 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  153.278520] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[  153.285159] irq event stamp: 3323022
[  153.288787] hardirqs last  enabled at (3323021): [<c0861c4c>] _raw_spin_unlock_irq+0x24/0x2c
[  153.297261] hardirqs last disabled at (3323022): [<c040d7a0>] clk_enable_lock+0x10/0x124
[  153.305392] softirqs last  enabled at (3322092): [<c000a504>] __do_softirq+0x344/0x540
[  153.313352] softirqs last disabled at (3322081): [<c00385c0>] irq_exit+0x10c/0x128
[  153.320946] ---[ end trace a506731ccd9bd703 ]---

This endless clk warnings behaviour is well explained by Andrey Smirnov:

"Allocating DMA after registering I2C adapter can lead to infinite
probing loop, for example, consider the following scenario:

    1. i2c_imx_probe() is called and successfully registers an I2C
       adapter via i2c_add_numbered_adapter()

    2. As a part of i2c_add_numbered_adapter() new I2C slave devices
       are added from DT which results in a call to
       driver_deferred_probe_trigger()

    3. i2c_imx_probe() continues and calls i2c_imx_dma_request() which
       due to lack of proper DMA driver returns -EPROBE_DEFER

    4. i2c_imx_probe() fails, removes I2C adapter and returns
       -EPROBE_DEFER, which places it into deferred probe list

    5. Deferred probe work triggered in #2 above kicks in and calls
       i2c_imx_probe() again thus bringing us to step #1"

So revert commit e1ab9a468e3b ("i2c: imx: improve the error handling in
i2c_imx_dma_request()") and restore the old behaviour, in order to
avoid regressions on existing setups.

Cc: <[email protected]>
Reported-by: Andrey Smirnov <[email protected]>
Reported-by: Russell King <[email protected]>
Fixes: e1ab9a468e3b ("i2c: imx: improve the error handling in i2c_imx_dma_request()")
Signed-off-by: Fabio Estevam <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>
5 years agonetfilter: nft_flow_offload: skip tcp rst and fin packets
Pablo Neira Ayuso [Tue, 13 Aug 2019 15:41:13 +0000 (17:41 +0200)]
netfilter: nft_flow_offload: skip tcp rst and fin packets

TCP rst and fin packets do not qualify to place a flow into the
flowtable. Most likely there will be no more packets after connection
closure. Without this patch, this flow entry expires and connection
tracking picks up the entry in ESTABLISHED state using the fixup
timeout, which makes this look inconsistent to the user for a connection
that is actually already closed.

Signed-off-by: Pablo Neira Ayuso <[email protected]>
5 years agoALSA: hda - Add a generic reboot_notify
Hui Wang [Wed, 14 Aug 2019 04:09:08 +0000 (12:09 +0800)]
ALSA: hda - Add a generic reboot_notify

Make codec enter D3 before rebooting or poweroff can fix the noise
issue on some laptops. And in theory it is harmless for all codecs
to enter D3 before rebooting or poweroff, let us add a generic
reboot_notify, then realtek and conexant drivers can call this
function.

Cc: [email protected]
Signed-off-by: Hui Wang <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>
5 years agoALSA: hda - Let all conexant codec enter D3 when rebooting
Hui Wang [Wed, 14 Aug 2019 04:09:07 +0000 (12:09 +0800)]
ALSA: hda - Let all conexant codec enter D3 when rebooting

We have 3 new lenovo laptops which have conexant codec 0x14f11f86,
these 3 laptops also have the noise issue when rebooting, after
letting the codec enter D3 before rebooting or poweroff, the noise
disappers.

Instead of adding a new ID again in the reboot_notify(), let us make
this function apply to all conexant codec. In theory make codec enter
D3 before rebooting or poweroff is harmless, and I tested this change
on a couple of other Lenovo laptops which have different conexant
codecs, there is no side effect so far.

Cc: [email protected]
Signed-off-by: Hui Wang <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>
5 years agosctp: fix memleak in sctp_send_reset_streams
zhengbin [Tue, 13 Aug 2019 14:05:50 +0000 (22:05 +0800)]
sctp: fix memleak in sctp_send_reset_streams

If the stream outq is not empty, need to kfree nstr_list.

Fixes: d570a59c5b5f ("sctp: only allow the out stream reset when the stream outq is empty")
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: zhengbin <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Acked-by: Neil Horman <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
5 years agonetlink: Fix nlmsg_parse as a wrapper for strict message parsing
David Ahern [Mon, 12 Aug 2019 20:07:07 +0000 (13:07 -0700)]
netlink: Fix nlmsg_parse as a wrapper for strict message parsing

Eric reported a syzbot warning:

BUG: KMSAN: uninit-value in nh_valid_get_del_req+0x6f1/0x8c0 net/ipv4/nexthop.c:1510
CPU: 0 PID: 11812 Comm: syz-executor444 Not tainted 5.3.0-rc3+ #17
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x191/0x1f0 lib/dump_stack.c:113
 kmsan_report+0x162/0x2d0 mm/kmsan/kmsan_report.c:109
 __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:294
 nh_valid_get_del_req+0x6f1/0x8c0 net/ipv4/nexthop.c:1510
 rtm_del_nexthop+0x1b1/0x610 net/ipv4/nexthop.c:1543
 rtnetlink_rcv_msg+0x115a/0x1580 net/core/rtnetlink.c:5223
 netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2477
 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5241
 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
 netlink_unicast+0xf6c/0x1050 net/netlink/af_netlink.c:1328
 netlink_sendmsg+0x110f/0x1330 net/netlink/af_netlink.c:1917
 sock_sendmsg_nosec net/socket.c:637 [inline]
 sock_sendmsg net/socket.c:657 [inline]
 ___sys_sendmsg+0x14ff/0x1590 net/socket.c:2311
 __sys_sendmmsg+0x53a/0xae0 net/socket.c:2413
 __do_sys_sendmmsg net/socket.c:2442 [inline]
 __se_sys_sendmmsg+0xbd/0xe0 net/socket.c:2439
 __x64_sys_sendmmsg+0x56/0x70 net/socket.c:2439
 do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:297
 entry_SYSCALL_64_after_hwframe+0x63/0xe7

The root cause is nlmsg_parse calling __nla_parse which means the
header struct size is not checked.

nlmsg_parse should be a wrapper around __nlmsg_parse with
NL_VALIDATE_STRICT for the validate argument very much like
nlmsg_parse_deprecated is for NL_VALIDATE_LIBERAL.

Fixes: 3de6440354465 ("netlink: re-add parse/validate functions in strict mode")
Reported-by: Eric Dumazet <[email protected]>
Reported-by: syzbot <[email protected]>
Signed-off-by: David Ahern <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
5 years agonet: phy: consider AN_RESTART status when reading link status
Heiner Kallweit [Mon, 12 Aug 2019 19:20:02 +0000 (21:20 +0200)]
net: phy: consider AN_RESTART status when reading link status

After configuring and restarting aneg we immediately try to read the
link status. On some systems the PHY may not yet have cleared the
"aneg complete" and "link up" bits, resulting in a false link-up
signal. See [0] for a report.
Clause 22 and 45 both require the PHY to keep the AN_RESTART
bit set until the PHY actually starts auto-negotiation.
Let's consider this in the generic functions for reading link status.
The commit marked as fixed is the first one where the patch applies
cleanly.

[0] https://marc.info/?t=156518400300003&r=1&w=2

Fixes: c1164bb1a631 ("net: phy: check PMAPMD link status only in genphy_c45_read_link")
Tested-by: Yonglong Liu <[email protected]>
Signed-off-by: Heiner Kallweit <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
5 years agonet/mlx4_en: fix a memory leak bug
Wenwen Wang [Mon, 12 Aug 2019 19:11:35 +0000 (14:11 -0500)]
net/mlx4_en: fix a memory leak bug

In mlx4_en_config_rss_steer(), 'rss_map->indir_qp' is allocated through
kzalloc(). After that, mlx4_qp_alloc() is invoked to configure RSS
indirection. However, if mlx4_qp_alloc() fails, the allocated
'rss_map->indir_qp' is not deallocated, leading to a memory leak bug.

To fix the above issue, add the 'qp_alloc_err' label to free
'rss_map->indir_qp'.

Fixes: 4931c6ef04b4 ("net/mlx4_en: Optimized single ring steering")
Signed-off-by: Wenwen Wang <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
5 years agoibmveth: Convert multicast list size for little-endian system
Thomas Falcon [Mon, 12 Aug 2019 21:13:06 +0000 (16:13 -0500)]
ibmveth: Convert multicast list size for little-endian system

The ibm,mac-address-filters property defines the maximum number of
addresses the hypervisor's multicast filter list can support. It is
encoded as a big-endian integer in the OF device tree, but the virtual
ethernet driver does not convert it for use by little-endian systems.
As a result, the driver is not behaving as it should on affected systems
when a large number of multicast addresses are assigned to the device.

Reported-by: Hangbin Liu <[email protected]>
Signed-off-by: Thomas Falcon <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
5 years agos390/qeth: serialize cmd reply with concurrent timeout
Julian Wiedmann [Mon, 12 Aug 2019 14:44:35 +0000 (16:44 +0200)]
s390/qeth: serialize cmd reply with concurrent timeout

Callbacks for a cmd reply run outside the protection of card->lock, to
allow for additional cmds to be issued & enqueued in parallel.

When qeth_send_control_data() bails out for a cmd without having
received a reply (eg. due to timeout), its callback may concurrently be
processing a reply that just arrived. In this case, the callback
potentially accesses a stale reply->reply_param area that eg. was
on-stack and has already been released.

To avoid this race, add some locking so that qeth_send_control_data()
can (1) wait for a concurrently running callback, and (2) zap any
pending callback that still wants to run.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
5 years agoriscv: defconfig: Update the defconfig
Alistair Francis [Tue, 13 Aug 2019 23:32:30 +0000 (16:32 -0700)]
riscv: defconfig: Update the defconfig

Update the defconfig:
 - Add CONFIG_HW_RANDOM=y and CONFIG_HW_RANDOM_VIRTIO=y to enable
   VirtIORNG when running on QEMU

Signed-off-by: Alistair Francis <[email protected]>
Signed-off-by: Paul Walmsley <[email protected]>
5 years agoriscv: rv32_defconfig: Update the defconfig
Alistair Francis [Tue, 13 Aug 2019 23:32:29 +0000 (16:32 -0700)]
riscv: rv32_defconfig: Update the defconfig

Update the rv32_defconfig:
 - Add 'CONFIG_DEVTMPFS_MOUNT=y' to match the RISC-V defconfig
 - Add CONFIG_HW_RANDOM=y and CONFIG_HW_RANDOM_VIRTIO=y to enable
   VirtIORNG when running on QEMU

Signed-off-by: Alistair Francis <[email protected]>
Signed-off-by: Paul Walmsley <[email protected]>
5 years agosctp: fix the transport error_count check
Xin Long [Mon, 12 Aug 2019 12:49:12 +0000 (20:49 +0800)]
sctp: fix the transport error_count check

As the annotation says in sctp_do_8_2_transport_strike():

  "If the transport error count is greater than the pf_retrans
   threshold, and less than pathmaxrtx ..."

It should be transport->error_count checked with pathmaxrxt,
instead of asoc->pf_retrans.

Fixes: 5aa93bcf66f4 ("sctp: Implement quick failover draft from tsvwg")
Signed-off-by: Xin Long <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
5 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Jakub Kicinski [Wed, 14 Aug 2019 01:22:57 +0000 (18:22 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next

Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next:

1) Rename mss field to mss_option field in synproxy, from Fernando Mancera.

2) Use SYSCTL_{ZERO,ONE} definitions in conntrack, from Matteo Croce.

3) More strict validation of IPVS sysctl values, from Junwei Hu.

4) Remove unnecessary spaces after on the right hand side of assignments,
   from yangxingwu.

5) Add offload support for bitwise operation.

6) Extend the nft_offload_reg structure to store immediate date.

7) Collapse several ip_set header files into ip_set.h, from
   Jeremy Sowden.

8) Make netfilter headers compile with CONFIG_KERNEL_HEADER_TEST=y,
   from Jeremy Sowden.

9) Fix several sparse warnings due to missing prototypes, from
   Valdis Kletnieks.

10) Use static lock initialiser to ensure connlabel spinlock is
    initialized on boot time to fix sched/act_ct.c, patch
    from Florian Westphal.
====================

Signed-off-by: Jakub Kicinski <[email protected]>
5 years agoMerge branch 'r8152-RX-improve'
Jakub Kicinski [Wed, 14 Aug 2019 01:12:45 +0000 (18:12 -0700)]
Merge branch 'r8152-RX-improve'

Hayes says:

====================
v2:
For patch #2, replace list_for_each_safe with list_for_each_entry_safe.
Remove unlikely in WARN_ON. Adjust the coding style.

For patch #4, replace list_for_each_safe with list_for_each_entry_safe.
Remove "else" after "continue".

For patch #5. replace sysfs with ethtool to modify rx_copybreak and
rx_pending.

v1:
The different chips use different rx buffer size.

Use skb_add_rx_frag() to reduce memory copy for RX.
====================

Signed-off-by: Jakub Kicinski <[email protected]>
5 years agor8152: change rx_copybreak and rx_pending through ethtool
Hayes Wang [Tue, 13 Aug 2019 03:42:09 +0000 (11:42 +0800)]
r8152: change rx_copybreak and rx_pending through ethtool

Let the rx_copybreak and rx_pending could be modified by
ethtool.

Signed-off-by: Hayes Wang <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
5 years agor8152: support skb_add_rx_frag
Hayes Wang [Tue, 13 Aug 2019 03:42:08 +0000 (11:42 +0800)]
r8152: support skb_add_rx_frag

Use skb_add_rx_frag() to reduce the memory copy for rx data.

Use a new list of rx_used to store the rx buffer which couldn't be
reused yet.

Besides, the total number of rx buffer may be increased or decreased
dynamically. And it is limited by RTL8152_MAX_RX_AGG.

Signed-off-by: Hayes Wang <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
5 years agor8152: use alloc_pages for rx buffer
Hayes Wang [Tue, 13 Aug 2019 03:42:07 +0000 (11:42 +0800)]
r8152: use alloc_pages for rx buffer

Replace kmalloc_node() with alloc_pages() for rx buffer.

Signed-off-by: Hayes Wang <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
5 years agor8152: replace array with linking list for rx information
Hayes Wang [Tue, 13 Aug 2019 03:42:06 +0000 (11:42 +0800)]
r8152: replace array with linking list for rx information

The original method uses an array to store the rx information. The
new one uses a list to link each rx structure. Then, it is possible
to increase/decrease the number of rx structure dynamically.

Signed-off-by: Hayes Wang <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
5 years agor8152: separate the rx buffer size
Hayes Wang [Tue, 13 Aug 2019 03:42:05 +0000 (11:42 +0800)]
r8152: separate the rx buffer size

The different chips may accept different rx buffer sizes. The RTL8152
supports 16K bytes, and RTL8153 support 32K bytes.

Signed-off-by: Hayes Wang <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
5 years agoMerge branch 'net-phy-let-phy_speed_down-up-support-speeds-1Gbps'
Jakub Kicinski [Wed, 14 Aug 2019 00:16:11 +0000 (17:16 -0700)]
Merge branch 'net-phy-let-phy_speed_down-up-support-speeds-1Gbps'

Heiner says:

====================
So far phy_speed_down/up can be used up to 1Gbps only. Remove this
restriction and add needed helpers to phy-core.c

v2:
- remove unused parameter in patch 1
- rename __phy_speed_down to phy_speed_down_core in patch 2
====================

Signed-off-by: Jakub Kicinski <[email protected]>
5 years agonet: phy: let phy_speed_down/up support speeds >1Gbps
Heiner Kallweit [Mon, 12 Aug 2019 21:52:19 +0000 (23:52 +0200)]
net: phy: let phy_speed_down/up support speeds >1Gbps

So far phy_speed_down/up can be used up to 1Gbps only. Remove this
restriction by using new helper __phy_speed_down. New member adv_old
in struct phy_device is used by phy_speed_up to restore the advertised
modes before calling phy_speed_down. Don't simply advertise what is
supported because a user may have intentionally removed modes from
advertisement.

Signed-off-by: Heiner Kallweit <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
5 years agonet: phy: add phy_speed_down_core and phy_resolve_min_speed
Heiner Kallweit [Mon, 12 Aug 2019 21:51:27 +0000 (23:51 +0200)]
net: phy: add phy_speed_down_core and phy_resolve_min_speed

phy_speed_down_core provides most of the functionality for
phy_speed_down. It makes use of new helper phy_resolve_min_speed that is
based on the sorting of the settings[] array. In certain cases it may be
helpful to be able to exclude legacy half duplex modes, therefore
prepare phy_resolve_min_speed() for it.

v2:
- rename __phy_speed_down to phy_speed_down_core

Signed-off-by: Heiner Kallweit <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
5 years agonet: phy: add __set_linkmode_max_speed
Heiner Kallweit [Mon, 12 Aug 2019 21:50:30 +0000 (23:50 +0200)]
net: phy: add __set_linkmode_max_speed

We will need the functionality of __set_linkmode_max_speed also for
linkmode bitmaps other than phydev->supported. Therefore split it.

v2:
- remove unused parameter from __set_linkmode_max_speed

Signed-off-by: Heiner Kallweit <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
5 years agonet: devlink: remove redundant rtnl lock assert
Vlad Buslov [Mon, 12 Aug 2019 17:02:02 +0000 (20:02 +0300)]
net: devlink: remove redundant rtnl lock assert

It is enough for caller of devlink_compat_switch_id_get() to hold the net
device to guarantee that devlink port is not destroyed concurrently. Remove
rtnl lock assertion and modify comment to warn user that they must hold
either rtnl lock or reference to net device. This is necessary to
accommodate future implementation of rtnl-unlocked TC offloads driver
callbacks.

Signed-off-by: Vlad Buslov <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
5 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Jakub Kicinski [Tue, 13 Aug 2019 23:24:57 +0000 (16:24 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
The following pull-request contains BPF updates for your *net-next* tree.

There is a small merge conflict in libbpf (Cc Andrii so he's in the loop
as well):

        for (i = 1; i <= btf__get_nr_types(btf); i++) {
                t = (struct btf_type *)btf__type_by_id(btf, i);

                if (!has_datasec && btf_is_var(t)) {
                        /* replace VAR with INT */
                        t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
  <<<<<<< HEAD
                        /*
                         * using size = 1 is the safest choice, 4 will be too
                         * big and cause kernel BTF validation failure if
                         * original variable took less than 4 bytes
                         */
                        t->size = 1;
                        *(int *)(t+1) = BTF_INT_ENC(0, 0, 8);
                } else if (!has_datasec && kind == BTF_KIND_DATASEC) {
  =======
                        t->size = sizeof(int);
                        *(int *)(t + 1) = BTF_INT_ENC(0, 0, 32);
                } else if (!has_datasec && btf_is_datasec(t)) {
  >>>>>>> 72ef80b5ee131e96172f19e74b4f98fa3404efe8
                        /* replace DATASEC with STRUCT */

Conflict is between the two commits 1d4126c4e119 ("libbpf: sanitize VAR to
conservative 1-byte INT") and b03bc6853c0e ("libbpf: convert libbpf code to
use new btf helpers"), so we need to pick the sanitation fixup as well as
use the new btf_is_datasec() helper and the whitespace cleanup. Looks like
the following:

  [...]
                if (!has_datasec && btf_is_var(t)) {
                        /* replace VAR with INT */
                        t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
                        /*
                         * using size = 1 is the safest choice, 4 will be too
                         * big and cause kernel BTF validation failure if
                         * original variable took less than 4 bytes
                         */
                        t->size = 1;
                        *(int *)(t + 1) = BTF_INT_ENC(0, 0, 8);
                } else if (!has_datasec && btf_is_datasec(t)) {
                        /* replace DATASEC with STRUCT */
  [...]

The main changes are:

1) Addition of core parts of compile once - run everywhere (co-re) effort,
   that is, relocation of fields offsets in libbpf as well as exposure of
   kernel's own BTF via sysfs and loading through libbpf, from Andrii.

   More info on co-re: http://vger.kernel.org/bpfconf2019.html#session-2
   and http://vger.kernel.org/lpc-bpf2018.html#session-2

2) Enable passing input flags to the BPF flow dissector to customize parsing
   and allowing it to stop early similar to the C based one, from Stanislav.

3) Add a BPF helper function that allows generating SYN cookies from XDP and
   tc BPF, from Petar.

4) Add devmap hash-based map type for more flexibility in device lookup for
   redirects, from Toke.

5) Improvements to XDP forwarding sample code now utilizing recently enabled
   devmap lookups, from Jesper.

6) Add support for reporting the effective cgroup progs in bpftool, from Jakub
   and Takshak.

7) Fix reading kernel config from bpftool via /proc/config.gz, from Peter.

8) Fix AF_XDP umem pages mapping for 32 bit architectures, from Ivan.

9) Follow-up to add two more BPF loop tests for the selftest suite, from Alexei.

10) Add perf event output helper also for other skb-based program types, from Allan.

11) Fix a co-re related compilation error in selftests, from Yonghong.
====================

Signed-off-by: Jakub Kicinski <[email protected]>
5 years agonet: hns3: Make hclge_func_reset_sync_vf static
YueHaibing [Mon, 12 Aug 2019 14:41:56 +0000 (22:41 +0800)]
net: hns3: Make hclge_func_reset_sync_vf static

Fix sparse warning:

drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:3190:5:
 warning: symbol 'hclge_func_reset_sync_vf' was not declared. Should it be static?

Reported-by: Hulk Robot <[email protected]>
Signed-off-by: YueHaibing <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
5 years agohugetlbfs: fix hugetlb page migration/fault race causing SIGBUS
Mike Kravetz [Tue, 13 Aug 2019 22:38:00 +0000 (15:38 -0700)]
hugetlbfs: fix hugetlb page migration/fault race causing SIGBUS

Li Wang discovered that LTP/move_page12 V2 sometimes triggers SIGBUS in
the kernel-v5.2.3 testing.  This is caused by a race between hugetlb
page migration and page fault.

If a hugetlb page can not be allocated to satisfy a page fault, the task
is sent SIGBUS.  This is normal hugetlbfs behavior.  A hugetlb fault
mutex exists to prevent two tasks from trying to instantiate the same
page.  This protects against the situation where there is only one
hugetlb page, and both tasks would try to allocate.  Without the mutex,
one would fail and SIGBUS even though the other fault would be
successful.

There is a similar race between hugetlb page migration and fault.
Migration code will allocate a page for the target of the migration.  It
will then unmap the original page from all page tables.  It does this
unmap by first clearing the pte and then writing a migration entry.  The
page table lock is held for the duration of this clear and write
operation.  However, the beginnings of the hugetlb page fault code
optimistically checks the pte without taking the page table lock.  If
clear (as it can be during the migration unmap operation), a hugetlb
page allocation is attempted to satisfy the fault.  Note that the page
which will eventually satisfy this fault was already allocated by the
migration code.  However, the allocation within the fault path could
fail which would result in the task incorrectly being sent SIGBUS.

Ideally, we could take the hugetlb fault mutex in the migration code
when modifying the page tables.  However, locks must be taken in the
order of hugetlb fault mutex, page lock, page table lock.  This would
require significant rework of the migration code.  Instead, the issue is
addressed in the hugetlb fault code.  After failing to allocate a huge
page, take the page table lock and check for huge_pte_none before
returning an error.  This is the same check that must be made further in
the code even if page allocation is successful.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
Signed-off-by: Mike Kravetz <[email protected]>
Reported-by: Li Wang <[email protected]>
Tested-by: Li Wang <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Cyril Hrubis <[email protected]>
Cc: Xishi Qiu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
5 years agomm, vmscan: do not special-case slab reclaim when watermarks are boosted
Mel Gorman [Tue, 13 Aug 2019 22:37:57 +0000 (15:37 -0700)]
mm, vmscan: do not special-case slab reclaim when watermarks are boosted

Dave Chinner reported a problem pointing a finger at commit 1c30844d2dfe
("mm: reclaim small amounts of memory when an external fragmentation
event occurs").

The report is extensive:

  https://lore.kernel.org/linux-mm/20190807091858[email protected]/

and it's worth recording the most relevant parts (colorful language and
typos included).

When running a simple, steady state 4kB file creation test to
simulate extracting tarballs larger than memory full of small
files into the filesystem, I noticed that once memory fills up
the cache balance goes to hell.

The workload is creating one dirty cached inode for every dirty
page, both of which should require a single IO each to clean and
reclaim, and creation of inodes is throttled by the rate at which
dirty writeback runs at (via balance dirty pages). Hence the ingest
rate of new cached inodes and page cache pages is identical and
steady. As a result, memory reclaim should quickly find a steady
balance between page cache and inode caches.

The moment memory fills, the page cache is reclaimed at a much
faster rate than the inode cache, and evidence suggests that
the inode cache shrinker is not being called when large batches
of pages are being reclaimed. In roughly the same time period
that it takes to fill memory with 50% pages and 50% slab caches,
memory reclaim reduces the page cache down to just dirty pages
and slab caches fill the entirety of memory.

The LRU is largely full of dirty pages, and we're getting spikes
of random writeback from memory reclaim so it's all going to shit.
Behaviour never recovers, the page cache remains pinned at just
dirty pages, and nothing I could tune would make any difference.
vfs_cache_pressure makes no difference - I would set it so high
it should trim the entire inode caches in a single pass, yet it
didn't do anything. It was clear from tracing and live telemetry
that the shrinkers were pretty much not running except when
there was absolutely no memory free at all, and then they did
the minimum necessary to free memory to make progress.

So I went looking at the code, trying to find places where pages
got reclaimed and the shrinkers weren't called. There's only one
- kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm:
reclaim small amounts of memory when an external fragmentation
event occurs").

The watermark boosting introduced by the commit is triggered in response
to an allocation "fragmentation event".  The boosting was not intended
to target THP specifically and triggers even if THP is disabled.
However, with Dave's perfectly reasonable workload, fragmentation events
can be very common given the ratio of slab to page cache allocations so
boosting remains active for long periods of time.

As high-order allocations might use compaction and compaction cannot
move slab pages the decision was made in the commit to special-case
kswapd when watermarks are boosted -- kswapd avoids reclaiming slab as
reclaiming slab does not directly help compaction.

As Dave notes, this decision means that slab can be artificially
protected for long periods of time and messes up the balance with slab
and page caches.

Removing the special casing can still indirectly help avoid
fragmentation by avoiding fragmentation-causing events due to slab
allocation as pages from a slab pageblock will have some slab objects
freed.  Furthermore, with the special casing, reclaim behaviour is
unpredictable as kswapd sometimes examines slab and sometimes does not
in a manner that is tricky to tune or analyse.

This patch removes the special casing.  The downside is that this is not
a universal performance win.  Some benchmarks that depend on the
residency of data when rereading metadata may see a regression when slab
reclaim is restored to its original behaviour.  Similarly, some
benchmarks that only read-once or write-once may perform better when
page reclaim is too aggressive.  The primary upside is that slab
shrinker is less surprising (arguably more sane but that's a matter of
opinion), behaves consistently regardless of the fragmentation state of
the system and properly obeys VM sysctls.

A fsmark benchmark configuration was constructed similar to what Dave
reported and is codified by the mmtest configuration
config-io-fsmark-small-file-stream.  It was evaluated on a 1-socket
machine to avoid dealing with NUMA-related issues and the timing of
reclaim.  The storage was an SSD Samsung Evo and a fresh trimmed XFS
filesystem was used for the test data.

This is not an exact replication of Dave's setup.  The configuration
scales its parameters depending on the memory size of the SUT to behave
similarly across machines.  The parameters mean the first sample
reported by fs_mark is using 50% of RAM which will barely be throttled
and look like a big outlier.  Dave used fake NUMA to have multiple
kswapd instances which I didn't replicate.  Finally, the number of
iterations differ from Dave's test as the target disk was not large
enough.  While not identical, it should be representative.

  fsmark
                                     5.3.0-rc3              5.3.0-rc3
                                       vanilla          shrinker-v1r1
  Min       1-files/sec     4444.80 (   0.00%)     4765.60 (   7.22%)
  1st-qrtle 1-files/sec     5005.10 (   0.00%)     5091.70 (   1.73%)
  2nd-qrtle 1-files/sec     4917.80 (   0.00%)     4855.60 (  -1.26%)
  3rd-qrtle 1-files/sec     4667.40 (   0.00%)     4831.20 (   3.51%)
  Max-1     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
  Max-5     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
  Max-10    1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
  Max-90    1-files/sec     4649.60 (   0.00%)     4780.70 (   2.82%)
  Max-95    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
  Max-99    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
  Max       1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
  Hmean     1-files/sec     5004.75 (   0.00%)     5075.96 (   1.42%)
  Stddev    1-files/sec     1778.70 (   0.00%)     1369.66 (  23.00%)
  CoeffVar  1-files/sec       33.70 (   0.00%)       26.05 (  22.71%)
  BHmean-99 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
  BHmean-95 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
  BHmean-90 1-files/sec     5107.05 (   0.00%)     5131.41 (   0.48%)
  BHmean-75 1-files/sec     5208.45 (   0.00%)     5206.68 (  -0.03%)
  BHmean-50 1-files/sec     5405.53 (   0.00%)     5381.62 (  -0.44%)
  BHmean-25 1-files/sec     6179.75 (   0.00%)     6095.14 (  -1.37%)

                     5.3.0-rc3   5.3.0-rc3
                       vanillashrinker-v1r1
  Duration User         501.82      497.29
  Duration System      4401.44     4424.08
  Duration Elapsed     8124.76     8358.05

This is showing a slight skew for the max result representing a large
outlier for the 1st, 2nd and 3rd quartile are similar indicating that
the bulk of the results show little difference.  Note that an earlier
version of the fsmark configuration showed a regression but that
included more samples taken while memory was still filling.

Note that the elapsed time is higher.  Part of this is that the
configuration included time to delete all the test files when the test
completes -- the test automation handles the possibility of testing
fsmark with multiple thread counts.  Without the patch, many of these
objects would be memory resident which is part of what the patch is
addressing.

There are other important observations that justify the patch.

1. With the vanilla kernel, the number of dirty pages in the system is
   very low for much of the test. With this patch, dirty pages is
   generally kept at 10% which matches vm.dirty_background_ratio which
   is normal expected historical behaviour.

2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
   0.95 for much of the test i.e. Slab is being left alone and
   dominating memory consumption. With the patch applied, the ratio
   varies between 0.35 and 0.45 with the bulk of the measured ratios
   roughly half way between those values. This is a different balance to
   what Dave reported but it was at least consistent.

3. Slabs are scanned throughout the entire test with the patch applied.
   The vanille kernel has periods with no scan activity and then
   relatively massive spikes.

4. Without the patch, kswapd scan rates are very variable. With the
   patch, the scan rates remain quite steady.

4. Overall vmstats are closer to normal expectations

                                5.3.0-rc3      5.3.0-rc3
                                  vanilla  shrinker-v1r1
    Ops Direct pages scanned             99388.00      328410.00
    Ops Kswapd pages scanned          45382917.00    33451026.00
    Ops Kswapd pages reclaimed        30869570.00    25239655.00
    Ops Direct pages reclaimed           74131.00        5830.00
    Ops Kswapd efficiency %                 68.02          75.45
    Ops Kswapd velocity                   5585.75        4002.25
    Ops Page reclaim immediate         1179721.00      430927.00
    Ops Slabs scanned                 62367361.00    73581394.00
    Ops Direct inode steals               2103.00        1002.00
    Ops Kswapd inode steals             570180.00     5183206.00

o Vanilla kernel is hitting direct reclaim more frequently,
  not very much in absolute terms but the fact the patch
  reduces it is interesting
o "Page reclaim immediate" in the vanilla kernel indicates
  dirty pages are being encountered at the tail of the LRU.
  This is generally bad and means in this case that the LRU
  is not long enough for dirty pages to be cleaned by the
  background flush in time. This is much reduced by the
  patch.
o With the patch, kswapd is reclaiming 10 times more slab
  pages than with the vanilla kernel. This is indicative
  of the watermark boosting over-protecting slab

A more complete set of tests were run that were part of the basis for
introducing boosting and while there are some differences, they are well
within tolerances.

Bottom line, the special casing kswapd to avoid slab behaviour is
unpredictable and can lead to abnormal results for normal workloads.

This patch restores the expected behaviour that slab and page cache is
balanced consistently for a workload with a steady allocation ratio of
slab/pagecache pages.  It also means that if there are workloads that
favour the preservation of slab over pagecache that it can be tuned via
vm.vfs_cache_pressure where as the vanilla kernel effectively ignores
the parameter when boosting is active.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: <[email protected]> [5.0+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
5 years agoRevert "mm, thp: restore node-local hugepage allocations"
Andrea Arcangeli [Tue, 13 Aug 2019 22:37:53 +0000 (15:37 -0700)]
Revert "mm, thp: restore node-local hugepage allocations"

This reverts commit 2f0799a0ffc033b ("mm, thp: restore node-local
hugepage allocations").

commit 2f0799a0ffc033b was rightfully applied to avoid the risk of a
severe regression that was reported by the kernel test robot at the end
of the merge window.  Now we understood the regression was a false
positive and was caused by a significant increase in fairness during a
swap trashing benchmark.  So it's safe to re-apply the fix and continue
improving the code from there.  The benchmark that reported the
regression is very useful, but it provides a meaningful result only when
there is no significant alteration in fairness during the workload.  The
removal of __GFP_THISNODE increased fairness.

__GFP_THISNODE cannot be used in the generic page faults path for new
memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
behavior significantly deviates from what the MPOL_DEFAULT semantics are
supposed to be for THP and 4k allocations alike.

Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
set to "madvise") has never meant to provide an implicit MPOL_BIND on
the "current" node the task is running on, causing swap storms and
providing a much more aggressive behavior than even zone_reclaim_node =
3.

Any workload who could have benefited from __GFP_THISNODE has now to
enable zone_reclaim_mode=1||2||3.  __GFP_THISNODE implicitly provided
the zone_reclaim_mode behavior, but it only did so if THP was enabled:
if THP was disabled, there would have been no chance to get any 4k page
from the current node if the current node was full of pagecache, which
further shows how this __GFP_THISNODE was misplaced in MADV_HUGEPAGE.
MADV_HUGEPAGE has never been intended to provide any zone_reclaim_mode
semantics, in fact the two are orthogonal, zone_reclaim_mode = 1|2|3
must work exactly the same with MADV_HUGEPAGE set or not.

The performance characteristic of memory depends on the hardware
details.  The numbers below are obtained on Naples/EPYC architecture and
the N/A projection extends them to show what we should aim for in the
future as a good THP NUMA locality default.  The benchmark used
exercises random memory seeks (note: the cost of the page faults is not
part of the measurement).

  D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
  0%     | +43%  | +45%   | +106% | +131%  | +224% | N/A    | N/A

D0 means distance zero (i.e.  local memory), D1 means distance one (i.e.
intra socket memory), D2 means distance two (i.e.  inter socket memory),
etc...

For the guest physical memory allocated by qemu and for guest mode
kernel the performance characteristic of RAM is more complex and an
ideal default could be:

  D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
  0%     | +58%   | +101% | N/A    | +222% | N/A    | N/A   | N/A

NOTE: the N/A are projections and haven't been measured yet, the
measurement in this case is done on a 1950x with only two NUMA nodes.
The THP case here means THP was used both in the host and in the guest.

After applying this commit the THP NUMA locality order that we'll get
out of MADV_HUGEPAGE is this:

  D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...

Before this commit it was:

  D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...

Even if we ignore the breakage of large workloads that can't fit in a
single node that the __GFP_THISNODE implicit "current node" mbind
caused, the THP NUMA locality order provided by __GFP_THISNODE was still
not the one we shall aim for in the long term (i.e.  the first one at
the top).

After this commit is applied, we can introduce a new allocator multi
order API and to replace those two alloc_pages_vmas calls in the page
fault path, with a single multi order call:

        unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
        page = alloc_pages_multi_order(..., &order);
        if (!page)
         goto out;
        if (!(order & (1 << 0))) {
         VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
         /* THP fault */
        } else {
         VM_WARN_ON(order != 1 << 0);
         /* 4k fallback */
        }

The page allocator logic has to be altered so that when it fails on any
zone with order 9, it has to try again with a order 0 before falling
back to the next zone in the zonelist.

After that we need to do more measurements and evaluate if adding an
opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP"
with "DN+1 THP | DN 4k" at every NUMA distance crossing.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrea Arcangeli <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Stefan Priebe - Profihost AG <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
5 years agoRevert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpm...
Andrea Arcangeli [Tue, 13 Aug 2019 22:37:50 +0000 (15:37 -0700)]
Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""

Patch series "reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings".

The fixes for what was originally reported as "pathological THP
behavior" we rightfully reverted to be sure not to introduced
regressions at end of a merge window after a severe regression report
from the kernel bot.  We can safely re-apply them now that we had time
to analyze the problem.

The mm process worked fine, because the good fixes were eventually
committed upstream without excessive delay.

The regression reported by the kernel bot however forced us to revert
the good fixes to be sure not to introduce regressions and to give us
the time to analyze the issue further.  The silver lining is that this
extra time allowed to think more at this issue and also plan for a
future direction to improve things further in terms of THP NUMA
locality.

This patch (of 2):

This reverts commit 356ff8a9a78fb35d ("Revert "mm, thp: consolidate THP
gfp handling into alloc_hugepage_direct_gfpmask").  So it reapplies
89c83fb539f954 ("mm, thp: consolidate THP gfp handling into
alloc_hugepage_direct_gfpmask").

Consolidation of the THP allocation flags at the same place was meant to
be a clean up to easier handle otherwise scattered code which is
imposing a maintenance burden.  There were no real problems observed
with the gfp mask consolidation but the reversion was rushed through
without a larger consensus regardless.

This patch brings the consolidation back because this should make the
long term maintainability easier as well as it should allow future
changes to be less error prone.

[[email protected]: changelog additions]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrea Arcangeli <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Stefan Priebe - Profihost AG <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
5 years agoinclude/asm-generic/5level-fixup.h: fix variable 'p4d' set but not used
Qian Cai [Tue, 13 Aug 2019 22:37:47 +0000 (15:37 -0700)]
include/asm-generic/5level-fixup.h: fix variable 'p4d' set but not used

A compiler throws a warning on an arm64 system since commit 9849a5697d3d
("arch, mm: convert all architectures to use 5level-fixup.h"),

  mm/kasan/init.c: In function 'kasan_free_p4d':
  mm/kasan/init.c:344:9: warning: variable 'p4d' set but not used [-Wunused-but-set-variable]
   p4d_t *p4d;
          ^~~

because p4d_none() in "5level-fixup.h" is compiled away while it is a
static inline function in "pgtable-nopud.h".

However, if converted p4d_none() to a static inline there, powerpc would
be unhappy as it reads those in assembler language in
"arch/powerpc/include/asm/book3s/64/pgtable.h", so it needs to skip
assembly include for the static inline C function.

While at it, converted a few similar functions to be consistent with the
ones in "pgtable-nopud.h".

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Qian Cai <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
5 years agoseq_file: fix problem when seeking mid-record
NeilBrown [Tue, 13 Aug 2019 22:37:44 +0000 (15:37 -0700)]
seq_file: fix problem when seeking mid-record

If you use lseek or similar (e.g.  pread) to access a location in a
seq_file file that is within a record, rather than at a record boundary,
then the first read will return the remainder of the record, and the
second read will return the whole of that same record (instead of the
next record).  When seeking to a record boundary, the next record is
correctly returned.

This bug was introduced by a recent patch (identified below).  Before
that patch, seq_read() would increment m->index when the last of the
buffer was returned (m->count == 0).  After that patch, we rely on
->next to increment m->index after filling the buffer - but there was
one place where that didn't happen.

Link: https://lkml.kernel.org/lkml/[email protected]/
Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface")
Signed-off-by: NeilBrown <[email protected]>
Reported-by: Sergei Turchanov <[email protected]>
Tested-by: Sergei Turchanov <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Markus Elfring <[email protected]>
Cc: <[email protected]> [4.19+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
5 years agomm: workingset: fix vmstat counters for shadow nodes
Roman Gushchin [Tue, 13 Aug 2019 22:37:41 +0000 (15:37 -0700)]
mm: workingset: fix vmstat counters for shadow nodes

Memcg counters for shadow nodes are broken because the memcg pointer is
obtained in a wrong way. The following approach is used:
        virt_to_page(xa_node)->mem_cgroup

Since commit 4d96ba353075 ("mm: memcg/slab: stop setting
page->mem_cgroup pointer for slab pages") page->mem_cgroup pointer isn't
set for slab pages, so memcg_from_slab_page() should be used instead.

Also I doubt that it ever worked correctly: virt_to_head_page() should
be used instead of virt_to_page().  Otherwise objects residing on tail
pages are not accounted, because only the head page contains a valid
mem_cgroup pointer.  That was a case since the introduction of these
counters by the commit 68d48e6a2df5 ("mm: workingset: add vmstat counter
for shadow nodes").

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
5 years agomm/usercopy: use memory range to be accessed for wraparound check
Isaac J. Manjarres [Tue, 13 Aug 2019 22:37:37 +0000 (15:37 -0700)]
mm/usercopy: use memory range to be accessed for wraparound check

Currently, when checking to see if accessing n bytes starting at address
"ptr" will cause a wraparound in the memory addresses, the check in
check_bogus_address() adds an extra byte, which is incorrect, as the
range of addresses that will be accessed is [ptr, ptr + (n - 1)].

This can lead to incorrectly detecting a wraparound in the memory
address, when trying to read 4 KB from memory that is mapped to the the
last possible page in the virtual address space, when in fact, accessing
that range of memory would not cause a wraparound to occur.

Use the memory range that will actually be accessed when considering if
accessing a certain amount of bytes will cause the memory address to
wrap around.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: f5509cc18daa ("mm: Hardened usercopy")
Signed-off-by: Prasad Sodagudi <[email protected]>
Signed-off-by: Isaac J. Manjarres <[email protected]>
Co-developed-by: Prasad Sodagudi <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Trilok Soni <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
5 years agomm: kmemleak: disable early logging in case of error
Catalin Marinas [Tue, 13 Aug 2019 22:37:34 +0000 (15:37 -0700)]
mm: kmemleak: disable early logging in case of error

If an error occurs during kmemleak_init() (e.g.  kmem cache cannot be
created), kmemleak is disabled but kmemleak_early_log remains enabled.
Subsequently, when the .init.text section is freed, the log_early()
function no longer exists.  To avoid a page fault in such scenario,
ensure that kmemleak_disable() also disables early logging.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>
Reported-by: Qian Cai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
5 years agomm/vmalloc.c: fix percpu free VM area search criteria
Kuppuswamy Sathyanarayanan [Tue, 13 Aug 2019 22:37:31 +0000 (15:37 -0700)]
mm/vmalloc.c: fix percpu free VM area search criteria

Recent changes to the vmalloc code by commit 68ad4a330433
("mm/vmalloc.c: keep track of free blocks for vmap allocation") can
cause spurious percpu allocation failures.  These, in turn, can result
in panic()s in the slub code.  One such possible panic was reported by
Dave Hansen in following link https://lkml.org/lkml/2019/6/19/939.
Another related panic observed is,

 RIP: 0033:0x7f46f7441b9b
 Call Trace:
  dump_stack+0x61/0x80
  pcpu_alloc.cold.30+0x22/0x4f
  mem_cgroup_css_alloc+0x110/0x650
  cgroup_apply_control_enable+0x133/0x330
  cgroup_mkdir+0x41b/0x500
  kernfs_iop_mkdir+0x5a/0x90
  vfs_mkdir+0x102/0x1b0
  do_mkdirat+0x7d/0xf0
  do_syscall_64+0x5b/0x180
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

VMALLOC memory manager divides the entire VMALLOC space (VMALLOC_START
to VMALLOC_END) into multiple VM areas (struct vm_areas), and it mainly
uses two lists (vmap_area_list & free_vmap_area_list) to track the used
and free VM areas in VMALLOC space.  And pcpu_get_vm_areas(offsets[],
sizes[], nr_vms, align) function is used for allocating congruent VM
areas for percpu memory allocator.  In order to not conflict with
VMALLOC users, pcpu_get_vm_areas allocates VM areas near the end of the
VMALLOC space.  So the search for free vm_area for the given requirement
starts near VMALLOC_END and moves upwards towards VMALLOC_START.

Prior to commit 68ad4a330433, the search for free vm_area in
pcpu_get_vm_areas() involves following two main steps.

Step 1:
    Find a aligned "base" adress near VMALLOC_END.
    va = free vm area near VMALLOC_END
Step 2:
    Loop through number of requested vm_areas and check,
        Step 2.1:
           if (base < VMALLOC_START)
              1. fail with error
        Step 2.2:
           // end is offsets[area] + sizes[area]
           if (base + end > va->vm_end)
               1. Move the base downwards and repeat Step 2
        Step 2.3:
           if (base + start < va->vm_start)
              1. Move to previous free vm_area node, find aligned
                 base address and repeat Step 2

But Commit 68ad4a330433 removed Step 2.2 and modified Step 2.3 as below:

        Step 2.3:
           if (base + start < va->vm_start || base + end > va->vm_end)
              1. Move to previous free vm_area node, find aligned
                 base address and repeat Step 2

Above change is the root cause of spurious percpu memory allocation
failures.  For example, consider a case where a relatively large vm_area
(~ 30 TB) was ignored in free vm_area search because it did not pass the
base + end < vm->vm_end boundary check.  Ignoring such large free
vm_area's would lead to not finding free vm_area within boundary of
VMALLOC_start to VMALLOC_END which in turn leads to allocation failures.

So modify the search algorithm to include Step 2.2.

Link: http://lkml.kernel.org/r/20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com
Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reported-by: Dave Hansen <[email protected]>
Acked-by: Dennis Zhou <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: sathyanarayanan kuppuswamy <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
5 years agomm/memcontrol.c: fix use after free in mem_cgroup_iter()
Miles Chen [Tue, 13 Aug 2019 22:37:28 +0000 (15:37 -0700)]
mm/memcontrol.c: fix use after free in mem_cgroup_iter()

This patch is sent to report an use after free in mem_cgroup_iter()
after merging commit be2657752e9e ("mm: memcg: fix use after free in
mem_cgroup_iter()").

I work with android kernel tree (4.9 & 4.14), and commit be2657752e9e
("mm: memcg: fix use after free in mem_cgroup_iter()") has been merged
to the trees.  However, I can still observe use after free issues
addressed in the commit be2657752e9e.  (on low-end devices, a few times
this month)

backtrace:
        css_tryget <- crash here
        mem_cgroup_iter
        shrink_node
        shrink_zones
        do_try_to_free_pages
        try_to_free_pages
        __perform_reclaim
        __alloc_pages_direct_reclaim
        __alloc_pages_slowpath
        __alloc_pages_nodemask

To debug, I poisoned mem_cgroup before freeing it:

  static void __mem_cgroup_free(struct mem_cgroup *memcg)
        for_each_node(node)
        free_mem_cgroup_per_node_info(memcg, node);
        free_percpu(memcg->stat);
  +     /* poison memcg before freeing it */
  +     memset(memcg, 0x78, sizeof(struct mem_cgroup));
        kfree(memcg);
  }

The coredump shows the position=0xdbbc2a00 is freed.

  (gdb) p/x ((struct mem_cgroup_per_node *)0xe5009e00)->iter[8]
  $13 = {position = 0xdbbc2a00, generation = 0x2efd}

  0xdbbc2a00:     0xdbbc2e00      0x00000000      0xdbbc2800      0x00000100
  0xdbbc2a10:     0x00000200      0x78787878      0x00026218      0x00000000
  0xdbbc2a20:     0xdcad6000      0x00000001      0x78787800      0x00000000
  0xdbbc2a30:     0x78780000      0x00000000      0x0068fb84      0x78787878
  0xdbbc2a40:     0x78787878      0x78787878      0x78787878      0xe3fa5cc0
  0xdbbc2a50:     0x78787878      0x78787878      0x00000000      0x00000000
  0xdbbc2a60:     0x00000000      0x00000000      0x00000000      0x00000000
  0xdbbc2a70:     0x00000000      0x00000000      0x00000000      0x00000000
  0xdbbc2a80:     0x00000000      0x00000000      0x00000000      0x00000000
  0xdbbc2a90:     0x00000001      0x00000000      0x00000000      0x00100000
  0xdbbc2aa0:     0x00000001      0xdbbc2ac8      0x00000000      0x00000000
  0xdbbc2ab0:     0x00000000      0x00000000      0x00000000      0x00000000
  0xdbbc2ac0:     0x00000000      0x00000000      0xe5b02618      0x00001000
  0xdbbc2ad0:     0x00000000      0x78787878      0x78787878      0x78787878
  0xdbbc2ae0:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2af0:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2b00:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2b10:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2b20:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2b30:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2b40:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2b50:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2b60:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2b70:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2b80:     0x78787878      0x78787878      0x00000000      0x78787878
  0xdbbc2b90:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2ba0:     0x78787878      0x78787878      0x78787878      0x78787878

In the reclaim path, try_to_free_pages() does not setup
sc.target_mem_cgroup and sc is passed to do_try_to_free_pages(), ...,
shrink_node().

In mem_cgroup_iter(), root is set to root_mem_cgroup because
sc->target_mem_cgroup is NULL.  It is possible to assign a memcg to
root_mem_cgroup.nodeinfo.iter in mem_cgroup_iter().

        try_to_free_pages
         struct scan_control sc = {...}, target_mem_cgroup is 0x0;
        do_try_to_free_pages
        shrink_zones
        shrink_node
          mem_cgroup *root = sc->target_mem_cgroup;
          memcg = mem_cgroup_iter(root, NULL, &reclaim);
        mem_cgroup_iter()
         if (!root)
         root = root_mem_cgroup;
         ...

         css = css_next_descendant_pre(css, &root->css);
         memcg = mem_cgroup_from_css(css);
         cmpxchg(&iter->position, pos, memcg);

My device uses memcg non-hierarchical mode.  When we release a memcg:
invalidate_reclaim_iterators() reaches only dead_memcg and its parents.
If non-hierarchical mode is used, invalidate_reclaim_iterators() never
reaches root_mem_cgroup.

  static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
  {
        struct mem_cgroup *memcg = dead_memcg;

        for (; memcg; memcg = parent_mem_cgroup(memcg)
        ...
  }

So the use after free scenario looks like:

  CPU1 CPU2

  try_to_free_pages
  do_try_to_free_pages
  shrink_zones
  shrink_node
  mem_cgroup_iter()
      if (!root)
       root = root_mem_cgroup;
      ...
      css = css_next_descendant_pre(css, &root->css);
      memcg = mem_cgroup_from_css(css);
      cmpxchg(&iter->position, pos, memcg);

         invalidate_reclaim_iterators(memcg);
         ...
         __mem_cgroup_free()
         kfree(memcg);

  try_to_free_pages
  do_try_to_free_pages
  shrink_zones
  shrink_node
  mem_cgroup_iter()
      if (!root)
       root = root_mem_cgroup;
      ...
      mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id);
      iter = &mz->iter[reclaim->priority];
      pos = READ_ONCE(iter->position);
      css_tryget(&pos->css) <- use after free

To avoid this, we should also invalidate root_mem_cgroup.nodeinfo.iter
in invalidate_reclaim_iterators().

[[email protected]: fix -Wparentheses compilation warning]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 5ac8fb31ad2e ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
Signed-off-by: Miles Chen <[email protected]>
Signed-off-by: Qian Cai <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
5 years agomm/z3fold.c: fix z3fold_destroy_pool() race condition
Henry Burns [Tue, 13 Aug 2019 22:37:25 +0000 (15:37 -0700)]
mm/z3fold.c: fix z3fold_destroy_pool() race condition

The constraint from the zpool use of z3fold_destroy_pool() is there are
no outstanding handles to memory (so no active allocations), but it is
possible for there to be outstanding work on either of the two wqs in
the pool.

Calling z3fold_deregister_migration() before the workqueues are drained
means that there can be allocated pages referencing a freed inode,
causing any thread in compaction to be able to trip over the bad pointer
in PageMovable().

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 1f862989b04a ("mm/z3fold.c: support page migration")
Signed-off-by: Henry Burns <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Jonathan Adams <[email protected]>
Cc: Vitaly Vul <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: David Howells <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Henry Burns <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
5 years agomm/z3fold.c: fix z3fold_destroy_pool() ordering
Henry Burns [Tue, 13 Aug 2019 22:37:21 +0000 (15:37 -0700)]
mm/z3fold.c: fix z3fold_destroy_pool() ordering

The constraint from the zpool use of z3fold_destroy_pool() is there are
no outstanding handles to memory (so no active allocations), but it is
possible for there to be outstanding work on either of the two wqs in
the pool.

If there is work queued on pool->compact_workqueue when it is called,
z3fold_destroy_pool() will do:

   z3fold_destroy_pool()
     destroy_workqueue(pool->release_wq)
     destroy_workqueue(pool->compact_wq)
       drain_workqueue(pool->compact_wq)
         do_compact_page(zhdr)
           kref_put(&zhdr->refcount)
             __release_z3fold_page(zhdr, ...)
               queue_work_on(pool->release_wq, &pool->work) *BOOM*

So compact_wq needs to be destroyed before release_wq.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 5d03a6613957 ("mm/z3fold.c: use kref to prevent page free/compact race")
Signed-off-by: Henry Burns <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Jonathan Adams <[email protected]>
Cc: Vitaly Vul <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: David Howells <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Al Viro <[email protected]
Cc: Henry Burns <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
5 years agomm: mempolicy: handle vma with unmovable pages mapped correctly in mbind
Yang Shi [Tue, 13 Aug 2019 22:37:18 +0000 (15:37 -0700)]
mm: mempolicy: handle vma with unmovable pages mapped correctly in mbind

When running syzkaller internally, we ran into the below bug on 4.9.x
kernel:

  kernel BUG at mm/huge_memory.c:2124!
  invalid opcode: 0000 [#1] SMP KASAN
  CPU: 0 PID: 1518 Comm: syz-executor107 Not tainted 4.9.168+ #2
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
  task: ffff880067b34900 task.stack: ffff880068998000
  RIP: split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
  Call Trace:
    split_huge_page include/linux/huge_mm.h:100 [inline]
    queue_pages_pte_range+0x7e1/0x1480 mm/mempolicy.c:538
    walk_pmd_range mm/pagewalk.c:50 [inline]
    walk_pud_range mm/pagewalk.c:90 [inline]
    walk_pgd_range mm/pagewalk.c:116 [inline]
    __walk_page_range+0x44a/0xdb0 mm/pagewalk.c:208
    walk_page_range+0x154/0x370 mm/pagewalk.c:285
    queue_pages_range+0x115/0x150 mm/mempolicy.c:694
    do_mbind mm/mempolicy.c:1241 [inline]
    SYSC_mbind+0x3c3/0x1030 mm/mempolicy.c:1370
    SyS_mbind+0x46/0x60 mm/mempolicy.c:1352
    do_syscall_64+0x1d2/0x600 arch/x86/entry/common.c:282
    entry_SYSCALL_64_after_swapgs+0x5d/0xdb
  Code: c7 80 1c 02 00 e8 26 0a 76 01 <0f> 0b 48 c7 c7 40 46 45 84 e8 4c
  RIP  [<ffffffff81895d6b>] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
   RSP <ffff88006899f980>

with the below test:

  uint64_t r[1] = {0xffffffffffffffff};

  int main(void)
  {
        syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
                                intptr_t res = 0;
        res = syscall(__NR_socket, 0x11, 3, 0x300);
        if (res != -1)
                r[0] = res;
        *(uint32_t*)0x20000040 = 0x10000;
        *(uint32_t*)0x20000044 = 1;
        *(uint32_t*)0x20000048 = 0xc520;
        *(uint32_t*)0x2000004c = 1;
        syscall(__NR_setsockopt, r[0], 0x107, 0xd, 0x20000040, 0x10);
        syscall(__NR_mmap, 0x20fed000, 0x10000, 0, 0x8811, r[0], 0);
        *(uint64_t*)0x20000340 = 2;
        syscall(__NR_mbind, 0x20ff9000, 0x4000, 0x4002, 0x20000340, 0x45d4, 3);
        return 0;
  }

Actually the test does:

  mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
  socket(AF_PACKET, SOCK_RAW, 768)        = 3
  setsockopt(3, SOL_PACKET, PACKET_TX_RING, {block_size=65536, block_nr=1, frame_size=50464, frame_nr=1}, 16) = 0
  mmap(0x20fed000, 65536, PROT_NONE, MAP_SHARED|MAP_FIXED|MAP_POPULATE|MAP_DENYWRITE, 3, 0) = 0x20fed000
  mbind(..., MPOL_MF_STRICT|MPOL_MF_MOVE) = 0

The setsockopt() would allocate compound pages (16 pages in this test)
for packet tx ring, then the mmap() would call packet_mmap() to map the
pages into the user address space specified by the mmap() call.

When calling mbind(), it would scan the vma to queue the pages for
migration to the new node.  It would split any huge page since 4.9
doesn't support THP migration, however, the packet tx ring compound
pages are not THP and even not movable.  So, the above bug is triggered.

However, the later kernel is not hit by this issue due to commit
d44d363f6578 ("mm: don't assume anonymous pages have SwapBacked flag"),
which just removes the PageSwapBacked check for a different reason.

But, there is a deeper issue.  According to the semantic of mbind(), it
should return -EIO if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was specified and
MPOL_MF_STRICT was also specified, but the kernel was unable to move all
existing pages in the range.  The tx ring of the packet socket is
definitely not movable, however, mbind() returns success for this case.

Although the most socket file associates with non-movable pages, but XDP
may have movable pages from gup.  So, it sounds not fine to just check
the underlying file type of vma in vma_migratable().

Change migrate_page_add() to check if the page is movable or not, if it
is unmovable, just return -EIO.  But do not abort pte walk immediately,
since there may be pages off LRU temporarily.  We should migrate other
pages if MPOL_MF_MOVE* is specified.  Set has_unmovable flag if some
paged could not be not moved, then return -EIO for mbind() eventually.

With this change the above test would return -EIO as expected.

[[email protected]: fix review comments from Vlastimil]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
5 years agomm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT...
Yang Shi [Tue, 13 Aug 2019 22:37:15 +0000 (15:37 -0700)]
mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified

When both MPOL_MF_MOVE* and MPOL_MF_STRICT was specified, mbind() should
try best to migrate misplaced pages, if some of the pages could not be
migrated, then return -EIO.

There are three different sub-cases:
 1. vma is not migratable
 2. vma is migratable, but there are unmovable pages
 3. vma is migratable, pages are movable, but migrate_pages() fails

If #1 happens, kernel would just abort immediately, then return -EIO,
after a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when
MPOL_MF_STRICT is specified").

If #3 happens, kernel would set policy and migrate pages with
best-effort, but won't rollback the migrated pages and reset the policy
back.

Before that commit, they behaves in the same way.  It'd better to keep
their behavior consistent.  But, rolling back the migrated pages and
resetting the policy back sounds not feasible, so just make #1 behave as
same as #3.

Userspace will know that not everything was successfully migrated (via
-EIO), and can take whatever steps it deems necessary - attempt
rollback, determine which exact page(s) are violating the policy, etc.

Make queue_pages_range() return 1 to indicate there are unmovable pages
or vma is not migratable.

The #2 is not handled correctly in the current kernel, the following
patch will fix it.

[[email protected]: fix review comments from Vlastimil]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Shi <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
5 years agomm/hmm: fix bad subpage pointer in try_to_unmap_one
Ralph Campbell [Tue, 13 Aug 2019 22:37:11 +0000 (15:37 -0700)]
mm/hmm: fix bad subpage pointer in try_to_unmap_one

When migrating an anonymous private page to a ZONE_DEVICE private page,
the source page->mapping and page->index fields are copied to the
destination ZONE_DEVICE struct page and the page_mapcount() is
increased.  This is so rmap_walk() can be used to unmap and migrate the
page back to system memory.

However, try_to_unmap_one() computes the subpage pointer from a swap pte
which computes an invalid page pointer and a kernel panic results such
as:

  BUG: unable to handle page fault for address: ffffea1fffffffc8

Currently, only single pages can be migrated to device private memory so
no subpage computation is needed and it can be set to "page".

[[email protected]: add comment]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Fixes: a5430dda8a3a1c ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
Signed-off-by: Ralph Campbell <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
5 years agomm/hmm: fix ZONE_DEVICE anon page mapping reuse
Ralph Campbell [Tue, 13 Aug 2019 22:37:07 +0000 (15:37 -0700)]
mm/hmm: fix ZONE_DEVICE anon page mapping reuse

When a ZONE_DEVICE private page is freed, the page->mapping field can be
set.  If this page is reused as an anonymous page, the previous value
can prevent the page from being inserted into the CPU's anon rmap table.
For example, when migrating a pte_none() page to device memory:

  migrate_vma(ops, vma, start, end, src, dst, private)
    migrate_vma_collect()
      src[] = MIGRATE_PFN_MIGRATE
    migrate_vma_prepare()
      /* no page to lock or isolate so OK */
    migrate_vma_unmap()
      /* no page to unmap so OK */
    ops->alloc_and_copy()
      /* driver allocates ZONE_DEVICE page for dst[] */
    migrate_vma_pages()
      migrate_vma_insert_page()
        page_add_new_anon_rmap()
          __page_set_anon_rmap()
            /* This check sees the page's stale mapping field */
            if (PageAnon(page))
              return
            /* page->mapping is not updated */

The result is that the migration appears to succeed but a subsequent CPU
fault will be unable to migrate the page back to system memory or worse.

Clear the page->mapping field when freeing the ZONE_DEVICE page so stale
pointer data doesn't affect future page use.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: b7a523109fb5c9d2d6dd ("mm: don't clear ->mapping in hmm_devmem_free")
Signed-off-by: Ralph Campbell <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
5 years agomm: document zone device struct page field usage
Ralph Campbell [Tue, 13 Aug 2019 22:37:04 +0000 (15:37 -0700)]
mm: document zone device struct page field usage

Patch series "mm/hmm: fixes for device private page migration", v3.

Testing the latest linux git tree turned up a few bugs with page
migration to and from ZONE_DEVICE private and anonymous pages.
Hopefully it clarifies how ZONE_DEVICE private struct page uses the same
mapping and index fields from the source anonymous page mapping.

This patch (of 3):

Struct page for ZONE_DEVICE private pages uses the page->mapping and and
page->index fields while the source anonymous pages are migrated to
device private memory.  This is so rmap_walk() can find the page when
migrating the ZONE_DEVICE private page back to system memory.
ZONE_DEVICE pmem backed fsdax pages also use the page->mapping and
page->index fields when files are mapped into a process address space.

Add comments to struct page and remove the unused "_zd_pad_1" field to
make this more clear.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ralph Campbell <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
5 years agodevlink: send notifications for deleted snapshots on region destroy
Jiri Pirko [Mon, 12 Aug 2019 12:28:31 +0000 (14:28 +0200)]
devlink: send notifications for deleted snapshots on region destroy

Currently the notifications for deleted snapshots are sent only in case
user deletes a snapshot manually. Send the notifications in case region
is destroyed too.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
5 years agoMerge branch 'bpf-libbpf-read-sysfs-btf'
Daniel Borkmann [Tue, 13 Aug 2019 21:19:42 +0000 (23:19 +0200)]
Merge branch 'bpf-libbpf-read-sysfs-btf'

Andrii Nakryiko says:

====================
Now that kernel's BTF is exposed through sysfs at well-known location, attempt
to load it first as a target BTF for the purpose of BPF CO-RE relocations.

Patch #1 is a follow-up patch to rename /sys/kernel/btf/kernel into
/sys/kernel/btf/vmlinux.

Patch #2 adds ability to load raw BTF contents from sysfs and expands the list
of locations libbpf attempts to load vmlinux BTF from.
====================

Signed-off-by: Daniel Borkmann <[email protected]>
5 years agolibbpf: attempt to load kernel BTF from sysfs first
Andrii Nakryiko [Tue, 13 Aug 2019 18:54:43 +0000 (11:54 -0700)]
libbpf: attempt to load kernel BTF from sysfs first

Add support for loading kernel BTF from sysfs (/sys/kernel/btf/vmlinux)
as a target BTF. Also extend the list of on disk search paths for
vmlinux ELF image with entries that perf is searching for.

Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
5 years agobtf: rename /sys/kernel/btf/kernel into /sys/kernel/btf/vmlinux
Andrii Nakryiko [Tue, 13 Aug 2019 18:54:42 +0000 (11:54 -0700)]
btf: rename /sys/kernel/btf/kernel into /sys/kernel/btf/vmlinux

Expose kernel's BTF under the name vmlinux to be more uniform with using
kernel module names as file names in the future.

Fixes: 341dfcf8d78e ("btf: expose BTF info through sysfs")
Suggested-by: Daniel Borkmann <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
5 years agoriscv: fix flush_tlb_range() end address for flush_tlb_page()
Paul Walmsley [Thu, 8 Aug 2019 02:07:34 +0000 (19:07 -0700)]
riscv: fix flush_tlb_range() end address for flush_tlb_page()

The RISC-V kernel implementation of flush_tlb_page() when CONFIG_SMP
is set is wrong.  It passes zero to flush_tlb_range() as the final
address to flush, but it should be at least 'addr'.

Some other Linux architecture ports use the beginning address to
flush, plus PAGE_SIZE, as the final address to flush.  This might
flush slightly more than what's needed, but it seems unlikely that
being more clever would improve anything.  So let's just take that
implementation for now.

While here, convert the macro into a static inline function, primarily
to avoid unintentional multiple evaluations of 'addr'.

This second version of the patch fixes a coding style issue found by
Christoph Hellwig <[email protected]>.

Reported-by: Andreas Schwab <[email protected]>
Signed-off-by: Paul Walmsley <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
5 years agoMerge tag 'tpmdd-next-20190813' of git://git.infradead.org/users/jjs/linux-tpmdd
Linus Torvalds [Tue, 13 Aug 2019 18:46:24 +0000 (11:46 -0700)]
Merge tag 'tpmdd-next-20190813' of git://git.infradead.org/users/jjs/linux-tpmdd

Pull tpm fixes from Jarkko Sakkinen:
 "One more bug fix for the next release"

* tag 'tpmdd-next-20190813' of git://git.infradead.org/users/jjs/linux-tpmdd:
  KEYS: trusted: allow module init if TPM is inactive or deactivated

5 years agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Tue, 13 Aug 2019 17:31:31 +0000 (10:31 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fix from James Bottomley:
 "Single lpfc fix, for a single-cpu corner case"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: lpfc: Fix crash when cpu count is 1 and null irq affinity mask

5 years agoKEYS: trusted: allow module init if TPM is inactive or deactivated
Roberto Sassu [Mon, 5 Aug 2019 16:44:27 +0000 (18:44 +0200)]
KEYS: trusted: allow module init if TPM is inactive or deactivated

Commit c78719203fc6 ("KEYS: trusted: allow trusted.ko to initialize w/o a
TPM") allows the trusted module to be loaded even if a TPM is not found, to
avoid module dependency problems.

However, trusted module initialization can still fail if the TPM is
inactive or deactivated. tpm_get_random() returns an error.

This patch removes the call to tpm_get_random() and instead extends the PCR
specified by the user with zeros. The security of this alternative is
equivalent to the previous one, as either option prevents with a PCR update
unsealing and misuse of sealed data by a user space process.

Even if a PCR is extended with zeros, instead of random data, it is still
computationally infeasible to find a value as input for a new PCR extend
operation, to obtain again the PCR value that would allow unsealing.

Cc: [email protected]
Fixes: 240730437deb ("KEYS: trusted: explicitly use tpm_chip structure...")
Signed-off-by: Roberto Sassu <[email protected]>
Reviewed-by: Tyler Hicks <[email protected]>
Suggested-by: Mimi Zohar <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Signed-off-by: Jarkko Sakkinen <[email protected]>
5 years agoRDMA/siw: Change CQ flags from 64->32 bits
Bernard Metzler [Fri, 9 Aug 2019 15:18:16 +0000 (17:18 +0200)]
RDMA/siw: Change CQ flags from 64->32 bits

This patch changes the driver/user shared (mmapped) CQ notification
flags field from unsigned 64-bits size to unsigned 32-bits size. This
enables building siw on 32-bit architectures.

This patch changes the siw-abi, but as siw was only just merged in
this merge window cycle, there are no released kernels with the prior
abi.  We are making no attempt to be binary compatible with siw user
space libraries prior to the merge of siw into the upstream kernel,
only moving forward with upstream kernels and upstream rdma-core
provided siw libraries are we guaranteeing compatibility.

Signed-off-by: Bernard Metzler <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Doug Ledford <[email protected]>
5 years agonetfilter: conntrack: Use consistent ct id hash calculation
Dirk Morris [Thu, 8 Aug 2019 20:57:51 +0000 (13:57 -0700)]
netfilter: conntrack: Use consistent ct id hash calculation

Change ct id hash calculation to only use invariants.

Currently the ct id hash calculation is based on some fields that can
change in the lifetime on a conntrack entry in some corner cases. The
current hash uses the whole tuple which contains an hlist pointer which
will change when the conntrack is placed on the dying list resulting in
a ct id change.

This patch also removes the reply-side tuple and extension pointer from
the hash calculation so that the ct id will will not change from
initialization until confirmation.

Fixes: 3c79107631db1f7 ("netfilter: ctnetlink: don't use conntrack/expect object addresses as id")
Signed-off-by: Dirk Morris <[email protected]>
Acked-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
5 years agoALSA: hda/realtek - Add quirk for HP Envy x360
Takashi Iwai [Tue, 13 Aug 2019 15:39:56 +0000 (17:39 +0200)]
ALSA: hda/realtek - Add quirk for HP Envy x360

HP Envy x360 (AMD Ryzen-based model) with 103c:8497 needs the same
quirk like HP Spectre x360 for enabling the mute LED over Mic3 pin.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=204373
Cc: <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>
5 years agocan: netlink: fix documentation typos
Andre Hartmann [Sat, 23 Mar 2019 15:04:19 +0000 (16:04 +0100)]
can: netlink: fix documentation typos

This patch fixes some documentation typos in struct can_bittiming_const.

Signed-off-by: Andre Hartmann <[email protected]>
Signed-off-by: Marc Kleine-Budde <[email protected]>
5 years agocan: vcan: introduce pr_fmt and make use of it
Marc Kleine-Budde [Fri, 26 Jul 2019 07:35:43 +0000 (09:35 +0200)]
can: vcan: introduce pr_fmt and make use of it

This patch introduces pr_fmt and makes use of it, also it converts a
printk() to pr_info().

Signed-off-by: Marc Kleine-Budde <[email protected]>
5 years agocan: vcan: remove unnecessary blank lines
Marc Kleine-Budde [Wed, 24 Jul 2019 12:28:21 +0000 (14:28 +0200)]
can: vcan: remove unnecessary blank lines

This patch removes unnecessary blank lines, so that checkpatch doesn't
complain anymore.

Signed-off-by: Marc Kleine-Budde <[email protected]>
5 years agocan: vcan: convert block comments to network style comments
Marc Kleine-Budde [Wed, 24 Jul 2019 12:16:29 +0000 (14:16 +0200)]
can: vcan: convert block comments to network style comments

This patch converts all block comments to network subsystem style block
comments.

Signed-off-by: Marc Kleine-Budde <[email protected]>
5 years agocan: gw: add support for CAN FD frames
Oliver Hartkopp [Sat, 10 Aug 2019 19:18:10 +0000 (21:18 +0200)]
can: gw: add support for CAN FD frames

Introduce CAN FD support which needs an extension of the netlink API to
pass CAN FD type content to the kernel which has a different size to
Classic CAN. Additionally the struct canfd_frame has a new 'flags' element
that can now be modified with can-gw.

The new CGW_FLAGS_CAN_FD option flag defines whether the routing job
handles Classic CAN or CAN FD frames. This setting is very strict at
reception time and enables the new possibilities, e.g. CGW_FDMOD_* and
modifying the flags element of struct canfd_frame, only when
CGW_FLAGS_CAN_FD is set.

Signed-off-by: Oliver Hartkopp <[email protected]>
Signed-off-by: Marc Kleine-Budde <[email protected]>
5 years agocan: gw: use struct canfd_frame as internal data structure
Oliver Hartkopp [Sat, 10 Aug 2019 19:18:09 +0000 (21:18 +0200)]
can: gw: use struct canfd_frame as internal data structure

To prepare the CAN FD support this patch implements the first adaptions in
data structures for CAN FD without changing the current functionality.

Additionally some code at the end of this patch is moved or indented to
simplify the review of the next implementation step.

Signed-off-by: Oliver Hartkopp <[email protected]>
Signed-off-by: Marc Kleine-Budde <[email protected]>
5 years agocan: gw: cgw_parse_attr(): remove unnecessary braces for single statement block
Marc Kleine-Budde [Wed, 24 Jul 2019 12:48:14 +0000 (14:48 +0200)]
can: gw: cgw_parse_attr(): remove unnecessary braces for single statement block

This patch removes some unnecessary braces for a single statement block.

Signed-off-by: Marc Kleine-Budde <[email protected]>
This page took 0.165976 seconds and 4 git commands to generate.