Git Repo - linux.git/log

staging: vboxsf: Remove unused including <linux/version.h>

Remove including <linux/version.h> that don't need it.

Signed-off-by: YueHaibing <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Merge tag 'asoc-fix-v5.4-rc6' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus

ASoC: Fixes for v5.4

These are a collection of fixes since v5.4-rc4 that have accumilated,
they're all driver specific and there's nothing major in here so it's
probably not essential to actually send them but I'll leave that call to
you.

pinctrl: stmfx: fix valid_mask init sequence

With stmfx_pinctrl_gpio_init_valid_mask callback, gpio_valid_mask was used
to initialize gpiochip valid_mask for gpiolib. But gpio_valid_mask was not
yet initialized. gpio_valid_mask required gpio-ranges to be registered,
this is the case after gpiochip_add_data call. But init_valid_mask
callback is also called under gpiochip_add_data. gpio_valid_mask
initialization cannot be moved before gpiochip_add_data because
gpio-ranges are not registered.
So, it is not possible to use init_valid_mask callback.
To avoid this issue, get rid of valid_mask and rely on ranges.

Fixes: da9b142ab2c5 ("pinctrl: stmfx: Use the callback to populate valid_mask")
Signed-off-by: Amelie Delaunay <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Linus Walleij <[email protected]>

NFC: st21nfca: fix double free

The variable nfcid_skb is not changed in the callee nfc_hci_get_param()
if error occurs. Consequently, the freed variable nfcid_skb will be
freed again, resulting in a double free bug. Set nfcid_skb to NULL after
releasing it to fix the bug.

Signed-off-by: Pan Bian <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: add compatible handling for command HCLGE_OPC_PF_RST_DONE

Since old firmware does not support HCLGE_OPC_PF_RST_DONE, it will
return -EOPNOTSUPP to the driver when received this command. So
for this case, it should just print a warning and return success
to the caller.

Fixes: 72e2fb07997c ("net: hns3: clear reset interrupt status in hclge_irq_handle()")
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'mlx5-fixes-2019-11-06' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahamees says:

====================
Mellanox, mlx5 fixes 2019-11-06

This series introduces some fixes to mlx5 driver.

Please pull and let me know if there is any problem.

No -stable this time.
====================

Signed-off-by: David S. Miller <[email protected]>

r8169: fix page read in r8168g_mdio_read

Functions like phy_modify_paged() read the current page, on Realtek
PHY's this means reading the value of register 0x1f. Add special
handling for reading this register, similar to what we do already
in r8168g_mdio_write(). Currently we read a random value that by
chance seems to be 0 always.

Fixes: a2928d28643e ("r8169: use paged versions of phylib MDIO access functions")
Signed-off-by: Heiner Kallweit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'stmmac-fixes'

Jose Abreu says:

====================
net: stmmac: Fixes for -net

Misc fixes for stmmac.

Patch 1/11 and 2/11, use the correct variable type for bitrev32() calls.

Patch 3/11, fixes the random failures the we were seing when running selftests.

Patch 4/11, prevents a crash that can occur when receiving AVB packets and with
SPH feature enabled on XGMAC.

Patch 5/11, fixes the correct settings for CBS on XGMAC.

Patch 6/11, corrects the interpretation of AVB feature on XGMAC.

Patch 7/11, disables Flow Control for AVB enabled queues on XGMAC.

Patch 8/11, disables MMC interrupts on XGMAC, preventing a storm of interrupts.

Patch 9/11, fixes the number of packets that were being taken into account in
the RX path cleaning function.

Patch 10/11, fixes an incorrect descriptor setting that could cause IP
misbehavior.

Patch 11/11, fixes the IOC generation mechanism when multiple descriptors
are used.
====================

Signed-off-by: David S. Miller <[email protected]>

net: stmmac: Fix the TX IOC in xmit path

IOC bit must be only set in the last descriptor. Move the logic up a
little bit to make sure it's set in the correct descriptor.

Signed-off-by: Jose Abreu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: Fix TSO descriptor with Enhanced Addressing

When using addressing > 32 bits the TSO first descriptor only has the
header so we can't set the payload field for this descriptor. Let's
reset the variable so that buffer 2 value is zero.

Fixes: a993db88d17d ("net: stmmac: Enable support for > 32 Bits addressing in XGMAC")
Signed-off-by: Jose Abreu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: Fix the packet count in stmmac_rx()

Currently, stmmac_rx() is counting the number of descriptors but it
should count the number of packets as specified by the NAPI limit.

Fix this.

Fixes: ec222003bd94 ("net: stmmac: Prepare to add Split Header support")
Signed-off-by: Jose Abreu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: xgmac: Disable MMC interrupts by default

MMC interrupts were being enabled, which is not what we want because it
will lead to a storm of interrupts that are not handled at all. Fix it
by disabling all MMC interrupts for XGMAC.

Fixes: b6cdf09f51c2 ("net: stmmac: xgmac: Implement MMC counters")
Signed-off-by: Jose Abreu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: xgmac: Disable Flow Control when 1 or more queues are in AV

When in AVB mode we need to disable flow control to prevent MAC from
pausing in TX side.

Fixes: ec6ea8e3eee9 ("net: stmmac: Add CBS support in XGMAC2")
Signed-off-by: Jose Abreu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: xgmac: Fix AV Feature detection

Fix incorrect precedence of operators. For reference: AV implies AV
Feature but RAV implies only RX side AV Feature. As we want full AV
features we need to check RAV.

Fixes: c2b69474d63b ("net: stmmac: xgmac: Correct RAVSEL field interpretation")
Signed-off-by: Jose Abreu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: xgmac: Fix TSA selection

When we change between Transmission Scheduling Algorithms, we need to
clear previous values so that the new chosen algorithm is correctly
selected.

Fixes: ec6ea8e3eee9 ("net: stmmac: Add CBS support in XGMAC2")
Signed-off-by: Jose Abreu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: xgmac: Only get SPH header len if available

Split Header length is only available when L34T == 0. Fix this by
correctly checking if L34T is zero before trying to get Header length.

Fixes: 67afd6d1cfdf ("net: stmmac: Add Split Header support and enable it in XGMAC cores")
Signed-off-by: Jose Abreu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: selftests: Prevent false positives in filter tests

In L2 tests that filter packets by destination MAC address we need to
prevent false positives that can occur if we add an address that
collides with the existing ones.

To fix this, lets manually check if the new address to be added is
already present in the NIC and use a different one if so. For Hash
filtering this also envolves converting the address to the hash.

Fixes: 091810dbded9 ("net: stmmac: Introduce selftests support")
Signed-off-by: Jose Abreu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: xgmac: bitrev32 returns u32

The bitrev32 function returns an u32 var, not an int. Fix it.

Fixes: 0efedbf11f07 ("net: stmmac: xgmac: Fix XGMAC selftests")
Signed-off-by: Jose Abreu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: gmac4: bitrev32 returns u32

The bitrev32 function returns an u32 var, not an int. Fix it.

Fixes: 477286b53f55 ("stmmac: add GMAC4 core support")
Signed-off-by: Jose Abreu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Missing register size validation in bitwise and cmp offloads.

2) Fix error code in ip_set_sockfn_get() when copy_to_user() fails,
   from Dan Carpenter.

3) Oneliner to copy MAC address in IPv6 hash:ip,mac sets, from
   Stefano Brivio.

4) Missing policy validation in ipset with NL_VALIDATE_STRICT,
   from Jozsef Kadlecsik.

5) Fix unaligned access to private data area of nf_tables instructions,
   from Lukas Wunner.

6) Relax check for object updates, reported as a regression by
   Eric Garver, patch from Fernando Fernandez Mancera.

7) Crash on ebtables dnat extension when used from the output path.
   From Florian Westphal.

8) Fix bogus EOPNOTSUPP when updating basechain flags.

9) Fix bogus EBUSY when updating a basechain that is already offloaded.
====================

Signed-off-by: David S. Miller <[email protected]>

SMB3: Fix persistent handles reconnect

When the client hits a network reconnect, it re-opens every open
file with a create context to reconnect a persistent handle. All
create context types should be 8-bytes aligned but the padding
was missed for that one. As a result, some servers don't allow
us to reconnect handles and return an error. The problem occurs
when the problematic context is not at the end of the create
request packet. Fix this by adding a proper padding at the end
of the reconnect persistent handle context.

Cc: Stable <[email protected]> # 4.19.x
Signed-off-by: Pavel Shilovsky <[email protected]>
Signed-off-by: Steve French <[email protected]>

drm/radeon: fix si_enable_smc_cac() failed issue

Need to set the dte flag on this asic.

Port the fix from amdgpu:
5cb818b861be114 ("drm/amd/amdgpu: fix si_enable_smc_cac() failed issue")

Reviewed-by: Yong Zhao <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]

drm/amdgpu/renoir: move gfxoff handling into gfx9 module

To properly handle the option parsing ordering.

Reviewed-by: Yong Zhao <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amdgpu: add warning for GRBM 1-cycle delay issue in gfx9

It needs to add warning to update firmware in gfx9
in case that firmware is too old to have function to
realize dummy read in cp firmware.

Signed-off-by: changzhu <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amdgpu: add dummy read by engines for some GCVM status registers in gfx10

The GRBM register interface is now capable of bursting 1 cycle per
register wr->wr, wr->rd much faster than previous muticycle per
transaction done interface. This has caused a problem where
status registers requiring HW to update have a 1 cycle delay, due
to the register update having to go through GRBM.

For cp ucode, it has realized dummy read in cp firmware.It covers
the use of WAIT_REG_MEM operation 1 case only.So it needs to call
gfx_v10_0_wait_reg_mem in gfx10. Besides it also needs to add warning to
update firmware in case firmware is too old to have function to realize
dummy read in cp firmware.

For sdma ucode, it hasn't realized dummy read in sdma firmware. sdma is
moved to gfxhub in gfx10. So it needs to add dummy read in driver
between amdgpu_ring_emit_wreg and amdgpu_ring_emit_reg_wait for sdma_v5_0.

Signed-off-by: changzhu <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amdgpu: register gpu instance before fan boost feature enablment

Otherwise, the feature enablement will be skipped due to wrong count.

Fixes: beff74bc6e0fa91 ("drm/amdgpu: fix a race in GPU reset with IB test (v2)")
Signed-off-by: Evan Quan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/swSMU: fix smu workload bit map error

fix workload bit (WORKLOAD_PPLIB_COMPUTE_BIT) map error
on vega20 and navi asic.

fix commit:
drm/amd/powerplay: add function get_workload_type_map for swsmu

Signed-off-by: Kevin Wang <[email protected]>
Reviewed-by: Kenneth Feng <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

net/smc: fix ethernet interface refcounting

If a pnet table entry is to be added mentioning a valid ethernet
interface, but an invalid infiniband or ISM device, the dev_put()
operation for the ethernet interface is called twice, resulting
in a negative refcount for the ethernet interface, which disables
removal of such a network interface.

This patch removes one of the dev_put() calls.

Fixes: 890a2cb4a966 ("net/smc: rework pnet table")
Signed-off-by: Ursula Braun <[email protected]>
Signed-off-by: Karsten Graul <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net-tls-add-a-TX-lock'

Jakub Kicinski says:

====================
net/tls: add a TX lock

Some time ago Pooja and Mallesham started reporting crashes with
an async accelerator. After trying to poke the existing logic into
shape I came to the conclusion that it can't be trusted, and to
preserve our sanity we should just add a lock around the TX side.

First patch removes the sk_write_pending checks from the write
space callbacks. Those don't seem to have a logical justification.

Patch 2 adds the TX lock and patch 3 associated test (which should
hang with current net).

Mallesham reports that even with these fixes applied the async
accelerator workload still occasionally hangs waiting for socket
memory. I suspect that's strictly related to the way async crypto
is integrated in TLS, so I think we should get these into net or
net-next and move from there.
====================

Signed-off-by: David S. Miller <[email protected]>

selftests/tls: add test for concurrent recv and send

Add a test which spawns 16 threads and performs concurrent
send and recv calls on the same socket.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/tls: add a TX lock

TLS TX needs to release and re-acquire the socket lock if send buffer
fills up.

TLS SW TX path currently depends on only allowing one thread to enter
the function by the abuse of sk_write_pending. If another writer is
already waiting for memory no new ones are allowed in.

This has two problems:
- writers don't wake other threads up when they leave the kernel;
   meaning that this scheme works for single extra thread (second
   application thread or delayed work) because memory becoming
   available will send a wake up request, but as Mallesham and
   Pooja report with larger number of threads it leads to threads
   being put to sleep indefinitely;
- the delayed work does not get _scheduled_ but it may _run_ when
   other writers are present leading to crashes as writers don't
   expect state to change under their feet (same records get pushed
   and freed multiple times); it's hard to reliably bail from the
   work, however, because the mere presence of a writer does not
   guarantee that the writer will push pending records before exiting.

Ensuring wakeups always happen will make the code basically open
code a mutex. Just use a mutex.

The TLS HW TX path does not have any locking (not even the
sk_write_pending hack), yet it uses a per-socket sg_tx_data
array to push records.

Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance")
Reported-by: Mallesham Jatharakonda <[email protected]>
Reported-by: Pooja Trivedi <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/tls: don't pay attention to sk_write_pending when pushing partial records

sk_write_pending being not zero does not guarantee that partial
record will be pushed. If the thread waiting for memory times out
the pending record may get stuck.

In case of tls_device there is no path where parial record is
set and writer present in the first place. Partial record is
set only in tls_push_sg() and tls_push_sg() will return an
error immediately. All tls_device callers of tls_push_sg()
will return (and not wait for memory) if it failed.

Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance")
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

blkcg: make blkcg_print_stat() print stats only for online blkgs

blkcg_print_stat() iterates blkgs under RCU and doesn't test whether
the blkg is online.  This can call into pd_stat_fn() on a pd which is
still being initialized leading to an oops.

The heaviest operation - recursively summing up rwstat counters - is
already done while holding the queue_lock.  Expand queue_lock to cover
the other operations and skip the blkg if it isn't online yet.  The
online state is protected by both blkcg and queue locks, so this
guarantees that only online blkgs are processed.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Roman Gushchin <[email protected]>
Cc: Josef Bacik <[email protected]>
Fixes: 903d23f0a354 ("blk-cgroup: allow controllers to output their own stats")
Cc: [email protected] # v4.19+
Signed-off-by: Jens Axboe <[email protected]>

drm/shmem: Add docbook comments for drm_gem_shmem_object madvise fields

Add missing docbook comments to madvise fields in struct
drm_gem_shmem_object which fixes these warnings:

include/drm/drm_gem_shmem_helper.h:87: warning: Function parameter or member 'madv' not described in 'drm_gem_shmem_object'
include/drm/drm_gem_shmem_helper.h:87: warning: Function parameter or member 'madv_list' not described in 'drm_gem_shmem_object'

Fixes: 17acb9f35ed7 ("drm/shmem: Add madvise state and purge helpers")
Reported-by: Sean Paul <[email protected]>
Cc: Maarten Lankhorst <[email protected]>
Cc: Maxime Ripard <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Daniel Vetter <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
Reviewed-by: Sean Paul <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

net: mscc: ocelot: fix __ocelot_rmw_ix prototype

The "read-modify-write register index" function is declared with a
confusing prototype: the "mask" and "reg" arguments are swapped.

Fortunately, this does not affect callers so far. Both arguments are
u32, and the wrapper macros (ocelot_rmw_ix etc) have the arguments in
the correct order (the one from ocelot_io.c).

Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'Bonding-fixes-for-Ocelot-switch'

Vladimir Oltean says:

====================
Bonding fixes for Ocelot switch

This series fixes 2 issues with bonding in a system that integrates the
ocelot driver, but the ports that are bonded do not actually belong to
ocelot.
====================

Signed-off-by: David S. Miller <[email protected]>

net: mscc: ocelot: fix NULL pointer on LAG slave removal

lag_upper_info may be NULL on slave removal.

Fixes: dc96ee3730fc ("net: mscc: ocelot: add bonding support")
Signed-off-by: Claudiu Manoil <[email protected]>
Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: mscc: ocelot: don't handle netdev events for other netdevs

The check that the event is actually for this device should be moved
from the "port" handler to the net device handler.

Otherwise the port handler will deny bonding configuration for other
net devices in the same system (like enetc in the LS1028A) that don't
have the lag_upper_info->tx_type restriction that ocelot has.

Fixes: dc96ee3730fc ("net: mscc: ocelot: add bonding support")
Signed-off-by: Claudiu Manoil <[email protected]>
Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5e: Use correct enum to determine uplink port

For vlan push action, if eswitch flow source capability is enabled, flow
source value compared with MLX5_VPORT_UPLINK enum, to determine uplink
port. This lead to syndrome in dmesg if try to add vlan push action.
For example:
$ tc filter add dev vxlan0 ingress protocol ip prio 1 flower \
       enc_dst_port 4789 \
       action tunnel_key unset pipe \
       action vlan push id 20 pipe \
       action mirred egress redirect dev ens1f0_0
$ dmesg
...
[ 2456.883693] mlx5_core 0000:82:00.0: mlx5_cmd_check:756:(pid 5273): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0xa9c090)
Use the correct enum value MLX5_FLOW_CONTEXT_FLOW_SOURCE_UPLINK.

Fixes: bb204dcf39fe ("net/mlx5e: Determine source port properly for vlan push action")
Signed-off-by: Dmytro Linkin <[email protected]>
Reviewed-by: Vlad Buslov <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: DR, Fix memory leak during rule creation

During rule creation hw_ste_arr was not freed.

Fixes: 41d07074154c ("net/mlx5: DR, Expose steering rule functionality")
Signed-off-by: Alex Vesker <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: DR, Fix memory leak in modify action destroy

The rewrite data was no freed.

Fixes: 9db810ed2d37 ("net/mlx5: DR, Expose steering action functionality")
Signed-off-by: Alex Vesker <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: Fix eswitch debug print of max fdb flow

The value is already the calculation so remove the log prefix.

Fixes: e52c28024008 ("net/mlx5: E-Switch, Add chains and priorities")
Signed-off-by: Roi Dayan <[email protected]>
Reviewed-by: Eli Britstein <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

HID: wacom: generic: Treat serial number and related fields as unsigned

The HID descriptors for most Wacom devices oddly declare the serial
number and other related fields as signed integers. When these numbers
are ingested by the HID subsystem, they are automatically sign-extended
into 32-bit integers. We treat the fields as unsigned elsewhere in the
kernel and userspace, however, so this sign-extension causes problems.
In particular, the sign-extended tool ID sent to userspace as ABS_MISC
does not properly match unsigned IDs used by xf86-input-wacom and libwacom.

We introduce a function 'wacom_s32tou' that can undo the automatic sign
extension performed by 'hid_snto32'. We call this function when processing
the serial number and related fields to ensure that we are dealing with
and reporting the unsigned form. We opt to use this method rather than
adding a descriptor fixup in 'wacom_hid_usage_quirk' since it should be
more robust in the face of future devices.

Ref: https://github.com/linuxwacom/input-wacom/issues/134
Fixes: f85c9dc678 ("HID: wacom: generic: Support tool ID and additional tool types")
CC: <[email protected]> # v4.10+
Signed-off-by: Jason Gerecke <[email protected]>
Reviewed-by: Aaron Armstrong Skomra <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>

drm/amdgpu: add navi14 PCI ID

Add the navi14 PCI device id.

Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Tianci.Yin <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Revert "drm/amd/display: setting the DIG_MODE to the correct value."

This reverts commit 385857adb8154563840e5b0f200254126618f464.

Reason for revert: Root cause of this issue is found. The workaround is not needed anymore.

Signed-off-by: Zhan Liu <[email protected]>
Reviewed-by: Hersen Wu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/display: Add ENGINE_ID_DIGD condition check for Navi14

[Why]
Navi10 has 6 PHY, but Navi14 only has 5 PHY, that is
because there is no ENGINE_ID_DIGD in Navi14. Without
this patch, many HDMI related issues (e.g. HDMI S3
resume failure, HDMI pink screen on boot) will be
observed.

[How]
If "eng_id" is larger than ENGINE_ID_DIGD, then
add "eng_id" by 1.

Signed-off-by: Zhan Liu <[email protected]>
Reviewed-by: Hersen Wu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amdgpu: dont schedule jobs while in reset

[Why]

doing kthread_park()/unpark() from drm_sched_entity_fini
while GPU reset is in progress defeats all the purpose of
drm_sched_stop->kthread_park.
If drm_sched_entity_fini->kthread_unpark() happens AFTER
drm_sched_stop->kthread_park nothing prevents from another
(third) thread to keep submitting job to HW which will be
picked up by the unparked scheduler thread and try to submit
to HW but fail because the HW ring is deactivated.

[How]
grab the reset lock before calling drm_sched_entity_fini()

Signed-off-by: Shirish S <[email protected]>
Suggested-by: Christian König <[email protected]>
Reviewed-by: Christian König <[email protected]>
Reviewed-by: Andrey Grodzovsky <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amdgpu/arcturus: properly set BANK_SELECT and FRAGMENT_SIZE

These were not aligned for optimal performance for GPUVM.

Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Merge branch 'akpm' (patches from Andrew)

Merge more fixes from Andrew Morton:
"17 fixes"

Mostly mm fixes and one ocfs2 locking fix.

* emailed patches from Andrew Morton <[email protected]>:
  mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges
  mm/memory_hotplug: fix updating the node span
  scripts/gdb: fix debugging modules compiled with hot/cold partitioning
  mm: slab: make page_cgroup_ino() to recognize non-compound slab pages properly
  MAINTAINERS: update information for "MEMORY MANAGEMENT"
  dump_stack: avoid the livelock of the dump_lock
  zswap: add Vitaly to the maintainers list
  mm/page_alloc.c: ratelimit allocation failure warnings more aggressively
  mm/khugepaged: fix might_sleep() warn with CONFIG_HIGHPTE=y
  mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
  mm, vmstat: hide /proc/pagetypeinfo from normal users
  mm/mmu_notifiers: use the right return code for WARN_ON
  ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()
  mm: thp: handle page cache THP correctly in PageTransCompoundMap
  mm, meminit: recalculate pcpu batch and high limits after init completes
  mm/gup_benchmark: fix MAP_HUGETLB case
  mm: memcontrol: fix NULL-ptr deref in percpu stats flush

arm64: Do not mask out PTE_RDONLY in pte_same()

Following commit 73e86cb03cf2 ("arm64: Move PTE_RDONLY bit handling out
of set_pte_at()"), the PTE_RDONLY bit is no longer managed by
set_pte_at() but built into the PAGE_* attribute definitions.
Consequently, pte_same() must include this bit when checking two PTEs
for equality.

Remove the arm64-specific pte_same() function, practically reverting
commit 747a70e60b72 ("arm64: Fix copy-on-write referencing in HugeTLB")

Fixes: 73e86cb03cf2 ("arm64: Move PTE_RDONLY bit handling out of set_pte_at()")
Cc: <[email protected]> # 4.14.x-
Cc: Will Deacon <[email protected]>
Cc: Steve Capper <[email protected]>
Reported-by: John Stultz <[email protected]>
Signed-off-by: Catalin Marinas <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

Merge branch 'net-bcmgenet-restore-internal-EPHY-support'

Doug Berger says:

====================
net: bcmgenet: restore internal EPHY support (part 2)

This is a follow up to my previous submission (see [1]).

The first commit provides what is intended to be a complete solution
for the issues that can result from insufficient clocking of the MAC
during reset of its state machines. It should be backported to the
stable releases.

It is intended to replace the partial solution of commit 1f515486275a
("net: bcmgenet: soft reset 40nm EPHYs before MAC init") which is
reverted by the second commit of this series and should not be back-
ported as noted in [2].

The third commit corrects a timing hazard with a polled PHY that can
occur when the MAC resumes and also when a v3 internal EPHY is reset
by the change in commit 25382b991d25 ("net: bcmgenet: reset 40nm EPHY
on energy detect"). It is expected that commit 25382b991d25 be back-
ported to stable first before backporting this commit.

[1] https://lkml.org/lkml/2019/10/16/1706
[2] https://lkml.org/lkml/2019/10/31/749
====================

Signed-off-by: David S. Miller <[email protected]>

net: bcmgenet: reapply manual settings to the PHY

The phy_init_hw() function may reset the PHY to a configuration
that does not match manual network settings stored in the phydev
structure. If the phy state machine is polled rather than event
driven this can create a timing hazard where the phy state machine
might alter the settings stored in the phydev structure from the
value read from the BMCR.

This commit follows invocations of phy_init_hw() by the bcmgenet
driver with invocations of the genphy_config_aneg() function to
ensure that the BMCR is written to match the settings held in the
phydev structure. This prevents the risk of manual settings being
accidentally altered.

Fixes: 1c1008c793fa ("net: bcmgenet: add main driver file")
Signed-off-by: Doug Berger <[email protected]>
Acked-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Revert "net: bcmgenet: soft reset 40nm EPHYs before MAC init"

This reverts commit 1f515486275a08a17a2c806b844cca18f7de5b34.

This commit improved the chances of the umac resetting cleanly by
ensuring that the PHY was restored to its normal operation prior
to resetting the umac. However, there were still cases when the
PHY might not be driving a Tx clock to the umac during this window
(e.g. when the PHY detects no link).

The previous commit now ensures that the unimac receives clocks
from the MAC during its reset window so this commit is no longer
needed. This commit also has an unintended negative impact on the
MDIO performance of the UniMAC MDIO interface because it is used
before the MDIO interrupts are reenabled, so it should be removed.

Signed-off-by: Doug Berger <[email protected]>
Acked-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: bcmgenet: use RGMII loopback for MAC reset

As noted in commit 28c2d1a7a0bf ("net: bcmgenet: enable loopback
during UniMAC sw_reset") the UniMAC must be clocked while sw_reset
is asserted for its state machines to reset cleanly.

The transmit and receive clocks used by the UniMAC are derived from
the signals used on its PHY interface. The bcmgenet MAC can be
configured to work with different PHY interfaces including MII,
GMII, RGMII, and Reverse MII on internal and external interfaces.
Unfortunately for the UniMAC, when configured for MII the Tx clock
is always driven from the PHY which places it outside of the direct
control of the MAC.

The earlier commit enabled a local loopback mode within the UniMAC
so that the receive clock would be derived from the transmit clock
which addressed the observed issue with an external GPHY disabling
it's Rx clock. However, when a Tx clock is not available this
loopback is insufficient.

This commit implements a workaround that leverages the fact that
the MAC can reliably generate all of its necessary clocking by
enterring the external GPHY RGMII interface mode with the UniMAC in
local loopback during the sw_reset interval. Unfortunately, this
has the undesirable side efect of the RGMII GTXCLK signal being
driven during the same window.

In most configurations this is a benign side effect as the signal
is either not routed to a pin or is already expected to drive the
pin. The one exception is when an external MII PHY is expected to
drive the same pin with its TX_CLK output creating output driver
contention.

This commit exploits the IEEE 802.3 clause 22 standard defined
isolate mode to force an external MII PHY to present a high
impedance on its TX_CLK output during the window to prevent any
contention at the pin.

The MII interface is used internally with the 40nm internal EPHY
which agressively disables its clocks for power savings leading to
incomplete resets of the UniMAC and many instabilities observed
over the years. The workaround of this commit is expected to put
an end to those problems.

Fixes: 1c1008c793fa ("net: bcmgenet: add main driver file")
Signed-off-by: Doug Berger <[email protected]>
Acked-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'perf-urgent-for-mingo-5.4-20191105' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent

Pull perf fixes from Arnaldo Carvalho de Melo:

perf report/top:

  Jiri Olsa:

  - Fix time sorting for big numbers, i.e.:

    perf report -s time -F time,overhead --stdio

  was failing because the sort comparision routine was returning 'int' while
  that particular -s key was int64_t, fix it.

perf scripting engines:

  Steven Rostedt (VMware):

  - Iterate on tep event arrays directly, fixing a bug when generating python/perl
    source code from a perf.data file with more than one tracepoint event.

Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

drm/atomic: fix self-refresh helpers crtc state dereference

drm_self_refresh_helper_update_avg_times() was incorrectly accessing the
new incoming state after drm_atomic_helper_commit_hw_done().  But this
state might have already been superceeded by an !nonblock atomic update
resulting in dereferencing an already free'd crtc_state.

TODO I *think* this will more or less do the right thing.. althought I'm
not 100% sure if, for example, we enter psr in a nonblock commit, and
then leave psr in a !nonblock commit that overtakes the completion of
the nonblock commit.  Not sure if this sort of scenario can happen in
practice.  But not crashing is better than crashing, so I guess we
should either take this patch or rever the self-refresh helpers until
Sean can figure out a better solution.

Fixes: d4da4e33341c ("drm: Measure Self Refresh Entry/Exit times to avoid thrashing")
Cc: Sean Paul <[email protected]>
Signed-off-by: Rob Clark <[email protected]>
[seanpaul fixed up some checkpatch warns]
Signed-off-by: Sean Paul <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

configfs: calculate the depth of parent item

When create symbolic link, create_link should calculate the depth
of the parent item. However, both the first and second parameters
of configfs_get_target_path had been set to the target. Broken
symbolic link created.

$ targetcli ls /
o- / ............................................................. [...]
  o- backstores .................................................. [...]
  | o- block ...................................... [Storage Objects: 0]
  | o- fileio ..................................... [Storage Objects: 2]
  | | o- vdev0 .......... [/dev/ramdisk1 (16.0MiB) write-thru activated]
  | | | o- alua ....................................... [ALUA Groups: 1]
  | | |   o- default_tg_pt_gp ........... [ALUA state: Active/optimized]
  | | o- vdev1 .......... [/dev/ramdisk2 (16.0MiB) write-thru activated]
  | |   o- alua ....................................... [ALUA Groups: 1]
  | |     o- default_tg_pt_gp ........... [ALUA state: Active/optimized]
  | o- pscsi ...................................... [Storage Objects: 0]
  | o- ramdisk .................................... [Storage Objects: 0]
  o- iscsi ................................................ [Targets: 0]
  o- loopback ............................................. [Targets: 0]
  o- srpt ................................................. [Targets: 2]
  | o- ib.e89a8f91cb3200000000000000000000 ............... [no-gen-acls]
  | | o- acls ................................................ [ACLs: 2]
  | | | o- ib.e89a8f91cb3200000000000000000000 ........ [Mapped LUNs: 2]
  | | | | o- mapped_lun0 ............................. [BROKEN LUN LINK]
  | | | | o- mapped_lun1 ............................. [BROKEN LUN LINK]
  | | | o- ib.e89a8f91cb3300000000000000000000 ........ [Mapped LUNs: 2]
  | | |   o- mapped_lun0 ............................. [BROKEN LUN LINK]
  | | |   o- mapped_lun1 ............................. [BROKEN LUN LINK]
  | | o- luns ................................................ [LUNs: 2]
  | |   o- lun0 ...... [fileio/vdev0 (/dev/ramdisk1) (default_tg_pt_gp)]
  | |   o- lun1 ...... [fileio/vdev1 (/dev/ramdisk2) (default_tg_pt_gp)]
  | o- ib.e89a8f91cb3300000000000000000000 ............... [no-gen-acls]
  |   o- acls ................................................ [ACLs: 0]
  |   o- luns ................................................ [LUNs: 0]
  o- vhost ................................................ [Targets: 0]

Fixes: e9c03af21cc7 ("configfs: calculate the symlink target only once")
Signed-off-by: Honggang Li <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

ALSA: timer: Fix incorrectly assigned timer instance

The clean up commit 41672c0c24a6 ("ALSA: timer: Simplify error path in
snd_timer_open()") unified the error handling code paths with the
standard goto, but it introduced a subtle bug: the timer instance is
stored in snd_timer_open() incorrectly even if it returns an error.
This may eventually lead to UAF, as spotted by fuzzer.

The culprit is the snd_timer_open() code checks the
SNDRV_TIMER_IFLG_EXCLUSIVE flag with the common variable timeri.
This variable is supposed to be the newly created instance, but we
(ab-)used it for a temporary check before the actual creation of a
timer instance.  After that point, there is another check for the max
number of instances, and it bails out if over the threshold.  Before
the refactoring above, it worked fine because the code returned
directly from that point.  After the refactoring, however, it jumps to
the unified error path that stores the timeri variable in return --
even if it returns an error.  Unfortunately this stored value is kept
in the caller side (snd_timer_user_tselect()) in tu->timeri.  This
causes inconsistency later, as if the timer was successfully
assigned.

In this patch, we fix it by not re-using timeri variable but a
temporary variable for testing the exclusive connection, so timeri
remains NULL at that point.

Fixes: 41672c0c24a6 ("ALSA: timer: Simplify error path in snd_timer_open()")
Reported-and-tested-by: Tristan Madani <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>

mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges

While upgrading from 4.16 to 5.2, we noticed these allocation errors in
the log of the new kernel:

  SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
    cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
    node 0: slabs: 5, objs: 170, free: 0

        slab_out_of_memory+1
        ___slab_alloc+969
        __slab_alloc+14
        kmem_cache_alloc+346
        inet_twsk_alloc+60
        tcp_time_wait+46
        tcp_fin+206
        tcp_data_queue+2034
        tcp_rcv_state_process+784
        tcp_v6_do_rcv+405
        __release_sock+118
        tcp_close+385
        inet_release+46
        __sock_release+55
        sock_close+17
        __fput+170
        task_work_run+127
        exit_to_usermode_loop+191
        do_syscall_64+212
        entry_SYSCALL_64_after_hwframe+68

accompanied by an increase in machines going completely radio silent
under memory pressure.

One thing that changed since 4.16 is e699e2c6a654 ("net, mm: account
sock objects to kmemcg"), which made these slab caches subject to cgroup
memory accounting and control.

The problem with that is that cgroups, unlike the page allocator, do not
maintain dedicated atomic reserves.  As a cgroup's usage hovers at its
limit, atomic allocations - such as done during network rx - can fail
consistently for extended periods of time.  The kernel is not able to
operate under these conditions.

We don't want to revert the culprit patch, because it indeed tracks a
potentially substantial amount of memory used by a cgroup.

We also don't want to implement dedicated atomic reserves for cgroups.
There is no point in keeping a fixed margin of unused bytes in the
cgroup's memory budget to accomodate a consumer that is impossible to
predict - we'd be wasting memory and get into configuration headaches,
not unlike what we have going with min_free_kbytes.  We do this for
physical mem because we have to, but cgroups are an accounting game.

Instead, account these privileged allocations to the cgroup, but let
them bypass the configured limit if they have to.  This way, we get the
benefits of accounting the consumed memory and have it exert pressure on
the rest of the cgroup, but like with the page allocator, we shift the
burden of reclaimining on behalf of atomic allocations onto the regular
allocations that can block.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: e699e2c6a654 ("net, mm: account sock objects to kmemcg")
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: <[email protected]> [4.18+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory_hotplug: fix updating the node span

We recently started updating the node span based on the zone span to
avoid touching uninitialized memmaps.

Currently, we will always detect the node span to start at 0, meaning a
node can easily span too many pages.  pgdat_is_empty() will still work
correctly if all zones span no pages.  We should skip over all zones
without spanned pages and properly handle the first detected zone that
spans pages.

Unfortunately, in contrast to the zone span (/proc/zoneinfo), the node
span cannot easily be inspected and tested.  The node span gives no real
guarantees when an architecture supports memory hotplug, meaning it can
easily contain holes or span pages of different nodes.

The node span is not really used after init on architectures that
support memory hotplug.

E.g., we use it in mm/memory_hotplug.c:try_offline_node() and in
mm/kmemleak.c:kmemleak_scan().  These users seem to be fine.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 00d6c019b5bc ("mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span()")
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

scripts/gdb: fix debugging modules compiled with hot/cold partitioning

gcc's -freorder-blocks-and-partition option makes it group frequently
and infrequently used code in .text.hot and .text.unlikely sections
respectively.  At least when building modules on s390, this option is
used by default.

gdb assumes that all code is located in .text section, and that .text
section is located at module load address.  With such modules this is no
longer the case: there is code in .text.hot and .text.unlikely, and
either of them might precede .text.

Fix by explicitly telling gdb the addresses of code sections.

It might be tempting to do this for all sections, not only the ones in
the white list.  Unfortunately, gdb appears to have an issue, when
telling it about e.g. loadable .note.gnu.build-id section causes it to
think that non-loadable .note.Linux section is loaded at address 0,
which in turn causes NULL pointers to be resolved to bogus symbols.  So
keep using the white list approach for the time being.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ilya Leoshkevich <[email protected]>
Reviewed-by: Jan Kiszka <[email protected]>
Cc: Kieran Bingham <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: slab: make page_cgroup_ino() to recognize non-compound slab pages properly

page_cgroup_ino() doesn't return a valid memcg pointer for non-compound
slab pages, because it depends on PgHead AND PgSlab flags to be set to
determine the memory cgroup from the kmem_cache.  It's correct for
compound pages, but not for generic small pages.  Those don't have PgHead
set, so it ends up returning zero.

Fix this by replacing the condition to PageSlab() && !PageTail().

Before this patch:
  [root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/[email protected]/ | grep slab
  0x0000000000000080         38        0  _______S___________________________________ slab

After this patch:
  [root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/[email protected]/ | grep slab
  0x0000000000000080        147        0  _______S___________________________________ slab

Also, hwpoison_filter_task() uses output of page_cgroup_ino() in order
to filter error injection events based on memcg.  So if
page_cgroup_ino() fails to return memcg pointer, we just fail to inject
memory error.  Considering that hwpoison filter is for testing, affected
users are limited and the impact should be marginal.

[[email protected]: changelog additions]
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

MAINTAINERS: update information for "MEMORY MANAGEMENT"

I was trying to find the mm tree in MAINTAINERS by searching "Morton".
Unfortunately, I didn't find one. And I didn't even locate the MEMORY
MANAGEMENT section quickly, because Andrew's name was not listed there.

Thanks to Johannes who helped me find the mm tree.

Let save other's time searching around by adding:

M: Andrew Morton <[email protected]>
T: git git://github.com/hnaz/linux-mm.git

[[email protected]: add ozlabs.org quilt trees]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
Acked-by: Andrew Morton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

dump_stack: avoid the livelock of the dump_lock

In the current code, we use the atomic_cmpxchg() to serialize the output
of the dump_stack(), but this implementation suffers the thundering herd
problem.  We have observed such kind of livelock on a Marvell cn96xx
board(24 cpus) when heavily using the dump_stack() in a kprobe handler.
Actually we can let the competitors to wait for the releasing of the
lock before jumping to atomic_cmpxchg().  This will definitely mitigate
the thundering herd problem.  Thanks Linus for the suggestion.

[[email protected]: fix comment]
Link: http://lkml.kernel.org/r/[email protected]
Fixes: b58d977432c8 ("dump_stack: serialize the output from dump_stack()")
Signed-off-by: Kevin Hao <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

zswap: add Vitaly to the maintainers list

Per conversation with Dan, add myself to the zswap MAINTAINERS list.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vitaly Wool <[email protected]>
Acked-by: Dan Streetman <[email protected]>
Acked-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/page_alloc.c: ratelimit allocation failure warnings more aggressively

While investigating a bug related to higher atomic allocation failures,
we noticed the failure warnings positively drowning the console, and in
our case trigger lockup warnings because of a serial console too slow to
handle all that output.

But even if we had a faster console, it's unclear what additional
information the current level of repetition provides.

Allocation failures happen for three reasons: The machine is OOM, the VM
is failing to handle reasonable requests, or somebody is making
unreasonable requests (and didn't acknowledge their opportunism with
__GFP_NOWARN). Having the memory dump, a callstack, and the ratelimit
stats on skipped failure warnings should provide enough information to
let users/admins/developers know whether something is wrong and point
them in the right direction for debugging, bpftracing etc.

Limit allocation failure warnings to one spew every ten seconds.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/khugepaged: fix might_sleep() warn with CONFIG_HIGHPTE=y

I got some khugepaged spew on a 32bit x86:

  BUG: sleeping function called from invalid context at include/linux/mmu_notifier.h:346
  in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 25, name: khugepaged
  INFO: lockdep is turned off.
  CPU: 1 PID: 25 Comm: khugepaged Not tainted 5.4.0-rc5-elk+ #206
  Hardware name: System manufacturer P5Q-EM/P5Q-EM, BIOS 2203    07/08/2009
  Call Trace:
   dump_stack+0x66/0x8e
   ___might_sleep.cold.96+0x95/0xa6
   __might_sleep+0x2e/0x80
   collapse_huge_page.isra.51+0x5ac/0x1360
   khugepaged+0x9a9/0x20f0
   kthread+0xf5/0x110
   ret_from_fork+0x2e/0x38

Looks like it's due to CONFIG_HIGHPTE=y pte_offset_map()->kmap_atomic()
vs.  mmu_notifier_invalidate_range_start().  Let's do the naive approach
and just reorder the two operations.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 810e24e009cf71 ("mm/mmu_notifiers: annotate with might_sleep()")
Signed-off-by: Ville Syrjl <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

pagetypeinfo_showfree_print is called by zone->lock held in irq mode.
This is not really nice because it blocks both any interrupts on that
cpu and the page allocator.  On large machines this might even trigger
the hard lockup detector.

Considering the pagetypeinfo is a debugging tool we do not really need
exact numbers here.  The primary reason to look at the outuput is to see
how pageblocks are spread among different migratetypes and low number of
pages is much more interesting therefore putting a bound on the number
of pages on the free_list sounds like a reasonable tradeoff.

The new output will simply tell
  [...]
  Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648

instead of
  Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648

The limit has been chosen arbitrary and it is a subject of a future
change should there be a need for that.

While we are at it, also drop the zone lock after each free_list
iteration which will help with the IRQ and page allocator responsiveness
even further as the IRQ lock held time is always bound to those 100k
pages.

[[email protected]: tweak comment text, per David Hildenbrand]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Suggested-by: Andrew Morton <[email protected]>
Reviewed-by: Waiman Long <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Rafael Aquini <[email protected]>
Acked-by: David Rientjes <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Song Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, vmstat: hide /proc/pagetypeinfo from normal users

/proc/pagetypeinfo is a debugging tool to examine internal page
allocator state wrt to fragmentation. It is not very useful for any
other use so normal users really do not need to read this file.

Waiman Long has noticed that reading this file can have negative side
effects because zone->lock is necessary for gathering data and that a)
interferes with the page allocator and its users and b) can lead to hard
lockups on large machines which have very long free_list.

Reduce both issues by simply not exporting the file to regular users.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
Signed-off-by: Michal Hocko <[email protected]>
Reported-by: Waiman Long <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Waiman Long <[email protected]>
Acked-by: Rafael Aquini <[email protected]>
Acked-by: David Rientjes <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/mmu_notifiers: use the right return code for WARN_ON

The return code from the op callback is actually in _ret, while the
WARN_ON was checking ret which causes it to misfire.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 8402ce61bec2 ("mm/mmu_notifiers: check if mmu notifier callbacks are allowed to fail")
Signed-off-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Daniel Vetter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()

When the extent tree is modified, it should be protected by inode
cluster lock and ip_alloc_sem.

The extent tree is accessed and modified in the
ocfs2_prepare_inode_for_write, but isn't protected by ip_alloc_sem.

The following is a case.  The function ocfs2_fiemap is accessing the
extent tree, which is modified at the same time.

  kernel BUG at fs/ocfs2/extent_map.c:475!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: tun ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue [...]
  CPU: 16 PID: 14047 Comm: o2info Not tainted 4.1.12-124.23.1.el6uek.x86_64 #2
  Hardware name: Oracle Corporation ORACLE SERVER X7-2L/ASM, MB MECH, X7-2L, BIOS 42040600 10/19/2018
  task: ffff88019487e200 ti: ffff88003daa4000 task.ti: ffff88003daa4000
  RIP: ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
  Call Trace:
    ocfs2_fiemap+0x1e3/0x430 [ocfs2]
    do_vfs_ioctl+0x155/0x510
    SyS_ioctl+0x81/0xa0
    system_call_fastpath+0x18/0xd8
  Code: 18 48 c7 c6 60 7f 65 a0 31 c0 bb e2 ff ff ff 48 8b 4a 40 48 8b 7a 28 48 c7 c2 78 2d 66 a0 e8 38 4f 05 00 e9 28 fe ff ff 0f 1f 00 <0f> 0b 66 0f 1f 44 00 00 bb 86 ff ff ff e9 13 fe ff ff 66 0f 1f
  RIP  ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
  ---[ end trace c8aa0c8180e869dc ]---
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: disabled

This issue can be reproduced every week in a production environment.

This issue is related to the usage mode.  If others use ocfs2 in this
mode, the kernel will panic frequently.

[[email protected]: coding style fixes]
[Fix new warning due to unused function by removing said function - Linus ]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Shuning Zhang <[email protected]>
Reviewed-by: Junxiao Bi <[email protected]>
Reviewed-by: Gang He <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Joseph Qi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Jun Piao <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: thp: handle page cache THP correctly in PageTransCompoundMap

We have a usecase to use tmpfs as QEMU memory backend and we would like
to take the advantage of THP as well.  But, our test shows the EPT is
not PMD mapped even though the underlying THP are PMD mapped on host.
The number showed by /sys/kernel/debug/kvm/largepage is much less than
the number of PMD mapped shmem pages as the below:

  7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
  Size:            4194304 kB
  [snip]
  AnonHugePages:         0 kB
  ShmemPmdMapped:   579584 kB
  [snip]
  Locked:                0 kB

  cat /sys/kernel/debug/kvm/largepages
  12

And some benchmarks do worse than with anonymous THPs.

By digging into the code we figured out that commit 127393fbe597 ("mm:
thp: kvm: fix memory corruption in KVM with THP enabled") checks if
there is a single PTE mapping on the page for anonymous THP when setting
up EPT map.  But the _mapcount < 0 check doesn't work for page cache THP
since every subpage of page cache THP would get _mapcount inc'ed once it
is PMD mapped, so PageTransCompoundMap() always returns false for page
cache THP.  This would prevent KVM from setting up PMD mapped EPT entry.

So we need handle page cache THP correctly.  However, when page cache
THP's PMD gets split, kernel just remove the map instead of setting up
PTE map like what anonymous THP does.  Before KVM calls get_user_pages()
the subpages may get PTE mapped even though it is still a THP since the
page cache THP may be mapped by other processes at the mean time.

Checking its _mapcount and whether the THP has PTE mapped or not.
Although this may report some false negative cases (PTE mapped by other
processes), it looks not trivial to make this accurate.

With this fix /sys/kernel/debug/kvm/largepage would show reasonable
pages are PMD mapped by EPT as the below:

  7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted)
  Size:            4194304 kB
  [snip]
  AnonHugePages:         0 kB
  ShmemPmdMapped:   557056 kB
  [snip]
  Locked:                0 kB

  cat /sys/kernel/debug/kvm/largepages
  271

And the benchmarks are as same as anonymous THPs.

[[email protected]: v4]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Fixes: dd78fedde4b9 ("rmap: support file thp")
Signed-off-by: Yang Shi <[email protected]>
Reported-by: Gang Deng <[email protected]>
Tested-by: Gang Deng <[email protected]>
Suggested-by: Hugh Dickins <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: <[email protected]> [4.8+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm, meminit: recalculate pcpu batch and high limits after init completes

Deferred memory initialisation updates zone->managed_pages during the
initialisation phase but before that finishes, the per-cpu page
allocator (pcpu) calculates the number of pages allocated/freed in
batches as well as the maximum number of pages allowed on a per-cpu
list.  As zone->managed_pages is not up to date yet, the pcpu
initialisation calculates inappropriately low batch and high values.

This increases zone lock contention quite severely in some cases with
the degree of severity depending on how many CPUs share a local zone and
the size of the zone.  A private report indicated that kernel build
times were excessive with extremely high system CPU usage.  A perf
profile indicated that a large chunk of time was lost on zone->lock
contention.

This patch recalculates the pcpu batch and high values after deferred
initialisation completes for every populated zone in the system.  It was
tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
workload -- allmodconfig and all available CPUs.

mmtests configuration: config-workload-kernbench-max Configuration was
modified to build on a fresh XFS partition.

kernbench
                                5.4.0-rc3              5.4.0-rc3
                                  vanilla           resetpcpu-v2
Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)

                   5.4.0-rc3    5.4.0-rc3
                     vanilla resetpcpu-v2
Duration User       39766.24     49221.79
Duration System     44298.10     13361.67
Duration Elapsed      519.11       388.87

The patch reduces system CPU usage by 69.86% and total build time by
26.65%.  The variance of system CPU usage is also much reduced.

Before, this was the breakdown of batch and high values over all zones
was:

    256               batch: 1
    256               batch: 63
    512               batch: 7
    256               high:  0
    256               high:  378
    512               high:  42

512 pcpu pagesets had a batch limit of 7 and a high limit of 42.  After
the patch:

    256               batch: 1
    768               batch: 63
    256               high:  0
    768               high:  378

[[email protected]: fix merge/linkage snafu]
Link: http://lkml.kernel.org/r/[email protected]:
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Matt Fleming <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: <[email protected]> [4.1+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/gup_benchmark: fix MAP_HUGETLB case

The MAP_HUGETLB ("-H" option) of gup_benchmark fails:

  $ sudo ./gup_benchmark -H
  mmap: Invalid argument

This is because gup_benchmark.c is passing in a file descriptor to
mmap(), but the fd came from opening up the /dev/zero file.  This
confuses the mmap syscall implementation, which thinks that, if the
caller did not specify MAP_ANONYMOUS, then the file must be a huge page
file.  So it attempts to verify that the file really is a huge page
file, as you can see here:

ksys_mmap_pgoff()
{
    if (!(flags & MAP_ANONYMOUS)) {
        retval = -EINVAL;
        if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
            goto out_fput; /* THIS IS WHERE WE END UP */

    else if (flags & MAP_HUGETLB) {
        ...proceed normally, /dev/zero is ok here...

...and of course is_file_hugepages() returns "false" for the /dev/zero
file.

The problem is that the user space program, gup_benchmark.c, really just
wants anonymous memory here.  The simplest way to get that is to pass
MAP_ANONYMOUS whenever MAP_HUGETLB is specified, so that's what this
patch does.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: Jérôme Glisse <[email protected]>
Cc: Keith Busch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcontrol: fix NULL-ptr deref in percpu stats flush

__mem_cgroup_free() can be called on the failure path in
mem_cgroup_alloc().  However memcg_flush_percpu_vmstats() and
memcg_flush_percpu_vmevents() which are called from __mem_cgroup_free()
access the fields of memcg which can potentially be null if called from
failure path from mem_cgroup_alloc().  Indeed syzbot has reported the
following crash:

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 30393 Comm: syz-executor.1 Not tainted 5.4.0-rc2+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:memcg_flush_percpu_vmstats+0x4ae/0x930 mm/memcontrol.c:3436
Code: 05 41 89 c0 41 0f b6 04 24 41 38 c7 7c 08 84 c0 0f 85 5d 03 00 00 44 3b 05 33 d5 12 08 0f 83 e2 00 00 00 4c 89 f0 48 c1 e8 03 <42> 80 3c 28 00 0f 85 91 03 00 00 48 8b 85 10 fe ff ff 48 8b b0 90
RSP: 0018:ffff888095c27980 EFLAGS: 00010206
RAX: 0000000000000012 RBX: ffff888095c27b28 RCX: ffffc90008192000
RDX: 0000000000040000 RSI: ffffffff8340fae7 RDI: 0000000000000007
RBP: ffff888095c27be0 R08: 0000000000000000 R09: ffffed1013f0da33
R10: ffffed1013f0da32 R11: ffff88809f86d197 R12: fffffbfff138b760
R13: dffffc0000000000 R14: 0000000000000090 R15: 0000000000000007
FS:  00007f5027170700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000710158 CR3: 00000000a7b18000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__mem_cgroup_free+0x1a/0x190 mm/memcontrol.c:5021
mem_cgroup_free mm/memcontrol.c:5033 [inline]
mem_cgroup_css_alloc+0x3a1/0x1ae0 mm/memcontrol.c:5160
css_create kernel/cgroup/cgroup.c:5156 [inline]
cgroup_apply_control_enable+0x44d/0xc40 kernel/cgroup/cgroup.c:3119
cgroup_mkdir+0x899/0x11b0 kernel/cgroup/cgroup.c:5401
kernfs_iop_mkdir+0x14d/0x1d0 fs/kernfs/dir.c:1124
vfs_mkdir+0x42e/0x670 fs/namei.c:3807
do_mkdirat+0x234/0x2a0 fs/namei.c:3830
__do_sys_mkdir fs/namei.c:3846 [inline]
__se_sys_mkdir fs/namei.c:3844 [inline]
__x64_sys_mkdir+0x5c/0x80 fs/namei.c:3844
do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixing this by moving the flush to mem_cgroup_free as there is no need
to flush anything if we see failure in mem_cgroup_alloc().

Link: http://lkml.kernel.org/r/[email protected]
Fixes: bb65f89b7d3d ("mm: memcontrol: flush percpu vmevents before releasing memcg")
Fixes: c350a99ea2b1 ("mm: memcontrol: flush percpu vmstats before releasing memcg")
Signed-off-by: Shakeel Butt <[email protected]>
Reported-by: [email protected]
Reviewed-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

MAINTAINERS: update Cavium ThunderX2 maintainers

jnair is no longer at caviumnetworks.com (or at marvell.com). This also
means that Cavium ThunderX2 will now be maintained by Robert.

This is probably a good time to map various email addresses used for
my patches to my personal email ID, update .mailmap to do this.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jayachandran C <[email protected]>
Acked-by: Robert Richter <[email protected]>
Signed-off-by: Olof Johansson <[email protected]>

Merge tag 'stm32-dt-for-v5.4-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/atorgue/stm32 into arm/fixes

STM32 DT fixes for v5.4, round 2

Highlights:
-----------

Fixes for STM32MP157:
-Fix CAN RAM mapping
-Change stmfx pinctrl definition for joystick and camera. Due to
stmfx pinctrl fix done in v5.4-rc cycle, camera and joystick were no
longer functional.

* tag 'stm32-dt-for-v5.4-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/atorgue/stm32:
  ARM: dts: stm32: change joystick pinctrl definition on stm32mp157c-ev1
  ARM: dts: stm32: remove OV5640 pinctrl definition on stm32mp157c-ev1
  ARM: dts: stm32: Fix CAN RAM mapping on stm32mp157c
  ARM: dts: stm32: relax qspi pins slew-rate for stm32mp157

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Olof Johansson <[email protected]>

ASoC: SOF: topology: Fix bytes control size checks

When using the example SOF amp widget topology, KASAN dumps this
when the AMP bytes kcontrol gets loaded:

[ 9.579548] BUG: KASAN: slab-out-of-bounds in
sof_control_load+0x8cc/0xac0 [snd_sof]
[ 9.588194] Write of size 40 at addr ffff8882314559dc by task
systemd-udevd/2411

Fix that by rejecting the topology if the bytes data size > max_size

Fixes: 311ce4fe7637d ("ASoC: SOF: Add support for loading topologies")
Reviewed-by: Jaska Uimonen <[email protected]>
Reviewed-by: Ranjani Sridharan <[email protected]>
Signed-off-by: Dragos Tarcatu <[email protected]>
Signed-off-by: Pierre-Louis Bossart <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

ARM: dts: stm32: change joystick pinctrl definition on stm32mp157c-ev1

Pins used for joystick are all configured as input. "push-pull" is not a
valid setting for an input pin.

Fixes: a502b343ebd0 ("pinctrl: stmfx: update pinconf settings")
Signed-off-by: Alexandre Torgue <[email protected]>
Signed-off-by: Amelie Delaunay <[email protected]>
Signed-off-by: Alexandre Torgue <[email protected]>

ARM: dts: stm32: remove OV5640 pinctrl definition on stm32mp157c-ev1

"push-pull" configuration is now fully handled by the gpiolib and the
STMFX pinctrl driver. There is no longer need to declare a pinctrl group
to only configure "push-pull" setting for the line. It is done directly by
the gpiolib.

Fixes: a502b343ebd0 ("pinctrl: stmfx: update pinconf settings")
Signed-off-by: Alexandre Torgue <[email protected]>
Signed-off-by: Amelie Delaunay <[email protected]>
Signed-off-by: Alexandre Torgue <[email protected]>

ARM: dts: stm32: Fix CAN RAM mapping on stm32mp157c

Split the 10Kbytes CAN message RAM to be able to use simultaneously
FDCAN1 and FDCAN2 instances.
First 5Kbytes are allocated to FDCAN1 and last 5Kbytes are used for
FDCAN2. To do so, set the offset to 0x1400 in mram-cfg for FDCAN2.

Fixes: d44d6e021301 ("ARM: dts: stm32: change CAN RAM mapping on stm32mp157c")
Signed-off-by: Christophe Roullier <[email protected]>
Signed-off-by: Alexandre Torgue <[email protected]>

ARM: dts: stm32: relax qspi pins slew-rate for stm32mp157

Relax qspi pins slew-rate to minimize peak currents.

Fixes: 844030057339 ("ARM: dts: stm32: add flash nor support on stm32mp157c eval board")
Signed-off-by: Patrice Chotard <[email protected]>
Signed-off-by: Alexandre Torgue <[email protected]>

Documentation: TLS: Add missing counter description

Add TLS TX counter description for the handshake retransmitted
packets that triggers the resync procedure then skip it, going
into the regular transmit flow.

Fixes: 46a3ea98074e ("net/mlx5e: kTLS, Enhance TX resync flow")
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

NFC: fdp: fix incorrect free object

The address of fw_vsc_cfg is on stack. Releasing it with devm_kfree() is
incorrect, which may result in a system crash or other security impacts.
The expected object to free is *fw_vsc_cfg.

Signed-off-by: Pan Bian <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: prevent load/store tearing on sk->sk_stamp

Add a couple of READ_ONCE() and WRITE_ONCE() to prevent
load-tearing and store-tearing in sock_read_timestamp()
and sock_write_timestamp()

This might prevent another KCSAN report.

Fixes: 3a0ed3e96197 ("sock: Make sock->sk_stamp thread-safe")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Deepa Dinamani <[email protected]>
Acked-by: Deepa Dinamani <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: qualcomm: rmnet: Fix potential UAF when unregistering

During the exit/unregistration process of the RmNet driver, the function
rmnet_unregister_real_device() is called to handle freeing the driver's
internal state and removing the RX handler on the underlying physical
device. However, the order of operations this function performs is wrong
and can lead to a use after free of the rmnet_port structure.

Before calling netdev_rx_handler_unregister(), this port structure is
freed with kfree(). If packets are received on any RmNet devices before
synchronize_net() completes, they will attempt to use this already-freed
port structure when processing the packet. As such, before cleaning up any
other internal state, the RX handler must be unregistered in order to
guarantee that no further packets will arrive on the device.

Fixes: ceed73a2cf4a ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation")
Signed-off-by: Sean Tranchetti <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/tls: fix sk_msg trim on fallback to copy mode

sk_msg_trim() tries to only update curr pointer if it falls into
the trimmed region. The logic, however, does not take into the
account pointer wrapping that sk_msg_iter_var_prev() does nor
(as John points out) the fact that msg->sg is a ring buffer.

This means that when the message was trimmed completely, the new
curr pointer would have the value of MAX_MSG_FRAGS - 1, which is
neither smaller than any other value, nor would it actually be
correct.

Special case the trimming to 0 length a little bit and rework
the comparison between curr and end to take into account wrapping.

This bug caused the TLS code to not copy all of the message, if
zero copy filled in fewer sg entries than memcopy would need.

Big thanks to Alexander Potapenko for the non-KMSAN reproducer.

v2:
- take into account that msg->sg is a ring buffer (John).

Link: https://lore.kernel.org/netdev/[email protected]/
Fixes: d829e9c4112b ("tls: convert to generic sk_msg interface")
Reported-by: [email protected]
Reported-by: [email protected]
Co-developed-by: John Fastabend <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: John Fastabend <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mlx4_core: fix wrong comment about the reason of subtract one from the max_cqes

The reason for the pre-allocation of one CQE is to enable resizing of
the CQ.
Fix comment accordingly.

Signed-off-by: Dotan Barak <[email protected]>
Signed-off-by: Eli Cohen <[email protected]>
Signed-off-by: Vladimir Sokolovsky <[email protected]>
Signed-off-by: Yuval Shaia <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: bcm_sf2: Fix driver removal

With the DSA core doing the call to dsa_port_disable() we do not need to
do that within the driver itself. This could cause an use after free
since past dsa_unregister_switch() we should not be accessing any
dsa_switch internal structures.

Fixes: 0394a63acfe2 ("net: dsa: enable and disable all ports")
Signed-off-by: Florian Fainelli <[email protected]>
Reviewed-by: Vivien Didelot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: prevent duplicate flower rules from tcf_proto destroy race

When a new filter is added to cls_api, the function
tcf_chain_tp_insert_unique() looks up the protocol/priority/chain to
determine if the tcf_proto is duplicated in the chain's hashtable. It then
creates a new entry or continues with an existing one. In cls_flower, this
allows the function fl_ht_insert_unque to determine if a filter is a
duplicate and reject appropriately, meaning that the duplicate will not be
passed to drivers via the offload hooks. However, when a tcf_proto is
destroyed it is removed from its chain before a hardware remove hook is
hit. This can lead to a race whereby the driver has not received the
remove message but duplicate flows can be accepted. This, in turn, can
lead to the offload driver receiving incorrect duplicate flows and out of
order add/delete messages.

Prevent duplicates by utilising an approach suggested by Vlad Buslov. A
hash table per block stores each unique chain/protocol/prio being
destroyed. This entry is only removed when the full destroy (and hardware
offload) has completed. If a new flow is being added with the same
identiers as a tc_proto being detroyed, then the add request is replayed
until the destroy is complete.

Fixes: 8b64678e0af8 ("net: sched: refactor tp insert/delete for concurrent execution")
Signed-off-by: John Hurley <[email protected]>
Signed-off-by: Vlad Buslov <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Reported-by: Louis Peens <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: Use the correct style for SPDX License Identifier

This patch corrects the SPDX License Identifier style in
header files related to Hisilicon network devices. For C header files
Documentation/process/license-rules.rst mandates C-like comments
(opposed to C source files where C++ style should be used)

Changes made by using a script provided by Joe Perches here:
https://lkml.org/lkml/2019/2/7/46.

Suggested-by: Joe Perches <[email protected]>
Signed-off-by: Nishad Kamdar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bonding: fix state transition issue in link monitoring

Since de77ecd4ef02 ("bonding: improve link-status update in
mii-monitoring"), the bonding driver has utilized two separate variables
to indicate the next link state a particular slave should transition to.
Each is used to communicate to a different portion of the link state
change commit logic; one to the bond_miimon_commit function itself, and
another to the state transition logic.

Unfortunately, the two variables can become unsynchronized,
resulting in incorrect link state transitions within bonding.  This can
cause slaves to become stuck in an incorrect link state until a
subsequent carrier state transition.

The issue occurs when a special case in bond_slave_netdev_event
sets slave->link directly to BOND_LINK_FAIL.  On the next pass through
bond_miimon_inspect after the slave goes carrier up, the BOND_LINK_FAIL
case will set the proposed next state (link_new_state) to BOND_LINK_UP,
but the new_link to BOND_LINK_DOWN.  The setting of the final link state
from new_link comes after that from link_new_state, and so the slave
will end up incorrectly in _DOWN state.

Resolve this by combining the two variables into one.

Reported-by: Aleksei Zakharov <[email protected]>
Reported-by: Sha Zhang <[email protected]>
Cc: Mahesh Bandewar <[email protected]>
Fixes: de77ecd4ef02 ("bonding: improve link-status update in mii-monitoring")
Signed-off-by: Jay Vosburgh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2019-11-02

The following pull-request contains BPF updates for your *net* tree.

We've added 6 non-merge commits during the last 6 day(s) which contain
a total of 8 files changed, 35 insertions(+), 9 deletions(-).

The main changes are:

1) Fix ppc BPF JIT's tail call implementation by performing a second pass
   to gather a stable JIT context before opcode emission, from Eric Dumazet.

2) Fix build of BPF samples sys_perf_event_open() usage to compiled out
   unavailable test_attr__{enabled,open} checks. Also fix potential overflows
   in bpf_map_{area_alloc,charge_init} on 32 bit archs, from Björn Töpel.

3) Fix narrow loads of bpf_sysctl context fields with offset > 0 on big endian
   archs like s390x and also improve the test coverage, from Ilya Leoshkevich.
====================

Signed-off-by: David S. Miller <[email protected]>

Merge branch 'nvme-5.4-rc7' of git://git.infradead.org/nvme into for-linus

Pull NVMe fixes from Keith:

"We have a few late nvme fixes for a couple device removal kernel
crashes, and a compat fix for a new ioctl introduced during this merge
window."

* 'nvme-5.4-rc7' of git://git.infradead.org/nvme:
  nvme: change nvme_passthru_cmd64 to explicitly mark rsvd
  nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths
  nvme-rdma: fix a segmentation fault during module unload

taprio: fix panic while hw offload sched list swap

Don't swap oper and admin schedules too early, it's not correct and
causes crash.

Steps to reproduce:

1)
tc qdisc replace dev eth0 parent root handle 100 taprio \
    num_tc 3 \
    map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
    queues 1@0 1@1 1@2 \
    base-time $SOME_BASE_TIME \
    sched-entry S 01 80000 \
    sched-entry S 02 15000 \
    sched-entry S 04 40000 \
    flags 2

2)
tc qdisc replace dev eth0 parent root handle 100 taprio \
    base-time $SOME_BASE_TIME \
    sched-entry S 01 90000 \
    sched-entry S 02 20000 \
    sched-entry S 04 40000 \
    flags 2

3)
tc qdisc replace dev eth0 parent root handle 100 taprio \
    base-time $SOME_BASE_TIME \
    sched-entry S 01 150000 \
    sched-entry S 02 200000 \
    sched-entry S 04 40000 \
    flags 2

Do 2 3 2 .. steps  more times if not happens and observe:

[  305.832319] Unable to handle kernel write to read-only memory at
virtual address ffff0000087ce7f0
[  305.910887] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
[  305.919306] Hardware name: Texas Instruments AM654 Base Board (DT)

[...]

[  306.017119] x1 : ffff800848031d88 x0 : ffff800848031d80
[  306.022422] Call trace:
[  306.024866]  taprio_free_sched_cb+0x4c/0x98
[  306.029040]  rcu_process_callbacks+0x25c/0x410
[  306.033476]  __do_softirq+0x10c/0x208
[  306.037132]  irq_exit+0xb8/0xc8
[  306.040267]  __handle_domain_irq+0x64/0xb8
[  306.044352]  gic_handle_irq+0x7c/0x178
[  306.048092]  el1_irq+0xb0/0x128
[  306.051227]  arch_cpu_idle+0x10/0x18
[  306.054795]  do_idle+0x120/0x138
[  306.058015]  cpu_startup_entry+0x20/0x28
[  306.061931]  rest_init+0xcc/0xd8
[  306.065154]  start_kernel+0x3bc/0x3e4
[  306.068810] Code: f2fbd5b7 f2fbd5b6 d503201f f9400422 (f9000662)
[  306.074900] ---[ end trace 96c8e2284a9d9d6e ]---
[  306.079507] Kernel panic - not syncing: Fatal exception in interrupt
[  306.085847] SMP: stopping secondary CPUs
[  306.089765] Kernel Offset: disabled

Try to explain one of the possible crash cases:

The "real" admin list is assigned when admin_sched is set to
new_admin, it happens after "swap", that assigns to oper_sched NULL.
Thus if call qdisc show it can crash.

Farther, next second time, when sched list is updated, the admin_sched
is not NULL and becomes the oper_sched, previous oper_sched was NULL so
just skipped. But then admin_sched is assigned new_admin, but schedules
to free previous assigned admin_sched (that already became oper_sched).

Farther, next third time, when sched list is updated,
while one more swap, oper_sched is not null, but it was happy to be
freed already (while prev. admin update), so while try to free
oper_sched the kernel panic happens at taprio_free_sched_cb().

So, move the "swap emulation" where it should be according to function
comment from code.

Fixes: 9c66d15646760e ("taprio: Add support for hardware offloading")
Signed-off-by: Ivan Khoronzhuk <[email protected]>
Acked-by: Vinicius Costa Gomes <[email protected]>
Tested-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'linux-can-fixes-for-5.4-20191105' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2019-11-05

this is a pull request of 33 patches for net/master.

In the first patch Wen Yang's patch adds a missing of_node_put() to CAN device
infrastructure.

Navid Emamdoost's patch for the gs_usb driver fixes a memory leak in the
gs_can_open() error path.

Johan Hovold provides two patches, one for the mcba_usb, the other for the
usb_8dev driver. Both fix a use-after-free after USB-disconnect.

Joakim Zhang's patch improves the flexcan driver, the ECC mechanism is now
completely disabled instead of masking the interrupts.

The next three patches all target the peak_usb driver. Stephane Grosjean's
patch fixes a potential out-of-sync while decoding packets, Johan Hovold's
patch fixes a slab info leak, Jeroen Hofstee's patch adds missing reporting of
bus off recovery events.

Followed by three patches for the c_can driver. Kurt Van Dijck's patch fixes
detection of potential missing status IRQs, Jeroen Hofstee's patches add a chip
reset on open and add missing reporting of bus off recovery events.

Appana Durga Kedareswara rao's patch for the xilinx driver fixes the flags
field initialization for axi CAN.

The next seven patches target the rx-offload helper, they are by me and Jeroen
Hofstee. The error handling in case of a queue overflow is fixed removing a
memory leak. Further the error handling in case of queue overflow and skb OOM
is cleaned up.

The next two patches are by me and target the flexcan and ti_hecc driver. In
case of a error during can_rx_offload_queue_sorted() the error counters in the
drivers are incremented.

Jeroen Hofstee provides 6 patches for the ti_hecc driver, which properly stop
the device in ifdown, improve the rx-offload support (which hit mainline in
v5.4-rc1), and add missing FIFO overflow and state change reporting.

The following four patches target the j1939 protocol. Colin Ian King's patch
fixes a memory leak in the j1939_sk_errqueue() handling. Three patches by
Oleksij Rempel fix a memory leak on socket release and fix the EOMA packet in
the transport protocol.

Timo Schlüßler's patch fixes a potential race condition in the mcp251x driver
on after suspend.

The last patch is by Yegor Yefremov and updates the SPDX-License-Identifier to
v3.0.
====================

Signed-off-by: David S. Miller <[email protected]>

nvme: change nvme_passthru_cmd64 to explicitly mark rsvd

Changing nvme_passthru_cmd64 to add a field: rsvd2. This field is an explicit
marker for the padding space added on certain platforms as a result of the
enlargement of the result field from 32 bit to 64 bits in size, and
fixes differences in struct size when using compat ioctl for 32-bit
binaries on 64-bit architecture.

Fixes: 65e68edce0db ("nvme: allow 64-bit results in passthru commands")
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Charles Machalow <[email protected]>
[changelog]
Signed-off-by: Keith Busch <[email protected]>

ALSA: hda: hdmi - add Tigerlake support

Add Tigerlake HDMI codec support.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=205379
BugLink: https://bugs.freedesktop.org/show_bug.cgi?id=112171
Cc: Pan Xiuli <[email protected]>
Signed-off-by: Kai Vehmanen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>

ASoC: max98373: replace gpio_request with devm_gpio_request

Use devm_gpio_request() to automatic unroll when fails and avoid
resource leaks at error paths.

Signed-off-by: Yong Zhi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

ASoC: stm32: sai: add restriction on mmap support

Do not support mmap in S/PDIF mode. In S/PDIF mode
the buffer has to be copied, to allow the channel status
bits insertion.

Signed-off-by: Olivier Moysan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>