Git Repo - linux.git/log

bpf: Fix out of bounds memory access in bpf_sk_storage

bpf_sk_storage maps use multiple spin locks to reduce contention.
The number of locks to use is determined by the number of possible CPUs.
With only 1 possible CPU, bucket_log == 0, and 2^0 = 1 locks are used.

When updating elements, the correct lock is determined with hash_ptr().
Calling hash_ptr() with 0 bits is undefined behavior, as it does:

x >> (64 - bits)

Using the value results in an out of bounds memory access.
In my case, this manifested itself as a page fault when raw_spin_lock_bh()
is called later, when running the self tests:

./tools/testing/selftests/bpf/test_verifier 773 775
[ 16.366342] BUG: unable to handle page fault for address: ffff8fe7a66f93f8

Force the minimum number of locks to two.

Signed-off-by: Arthur Fabre <[email protected]>
Fixes: 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage")
Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

net: sched: remove NET_CLS_IND config option

This config option makes only couple of lines optional.
Two small helpers and an int in couple of cls structs.

Remove the config option and always compile this in.
This saves the user from unexpected surprises when he adds
a filter with ingress device match which is silently ignored
in case the config option is not set.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

r8169: improve handling of Abit Fatal1ty F-190HD

The Abit Fatal1ty F-190HD has a PCI ID quirk and the entry marks this
board as not GBit-capable, what is wrong. According to [0] the board
has a RTL8111B that is GBit-capable, therefore remove the
RTL_CFG_NO_GBIT flag.

[0] https://www.centos.org/forums/viewtopic.php?t=23390

Signed-off-by: Heiner Kallweit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

vsock/virtio: set SOCK_DONE on peer shutdown

Set the SOCK_DONE flag to match the TCP_CLOSING state when a peer has
shut down and there is nothing left to read.

This fixes the following bug:
1) Peer sends SHUTDOWN(RDWR).
2) Socket enters TCP_CLOSING but SOCK_DONE is not set.
3) read() returns -ENOTCONN until close() is called, then returns 0.

Signed-off-by: Stephen Barber <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: Fix wrapper drivers not detecting PHY

Because of PHYLINK conversion we stopped parsing the phy-handle property
from DT. Unfortunatelly, some wrapper drivers still rely on this phy
node to configure the PHY.

Let's restore the parsing of PHY handle while these wrapper drivers are
not fully converted to PHYLINK.

Fixes: 74371272f97f ("net: stmmac: Convert to phylink and remove phylib logic")
Reported-by: Corentin Labbe <[email protected]>
Signed-off-by: Jose Abreu <[email protected]>
Cc: Joao Pinto <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Giuseppe Cavallaro <[email protected]>
Cc: Alexandre Torgue <[email protected]>
Tested-by: Corentin Labbe <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'Reuse-ptp_qoriq-driver-for-dpaa2-ptp'

Yangbo Lu says:

====================
Reuse ptp_qoriq driver for dpaa2-ptp

Although dpaa2-ptp.c driver is a fsl_mc_driver which
is using MC APIs for register accessing, it's same IP
block with eTSEC/DPAA/ENETC 1588 timer.
This patch-set is to convert to reuse ptp_qoriq driver by
using register ioremap and dropping related MC APIs.
However the interrupts could only be handled by MC which
fires MSIs to ARM cores. So the interrupt enabling and
handling still rely on MC APIs. MC APIs for interrupt
and PPS event support are also added by this patch-set.

---
Changes for v2:
- Allowed to compile with COMPILE_TEST.
====================

Signed-off-by: David S. Miller <[email protected]>

MAINTAINERS: maintain DPAA2 PTP driver in QorIQ PTP entry

Maintain DPAA2 PTP driver in QorIQ PTP entry.

Signed-off-by: Yangbo Lu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

dpaa2-ptp: add interrupt support

This patch is to add interrupt support for dpaa2 ptp clock,
including MC APIs and PPS interrupt support. Other events
haven't been supported in MC by now.

Signed-off-by: Yangbo Lu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

arm64: dts: fsl: add ptp timer node for dpaa2 platforms

This patch is to add ptp timer device tree node for dpaa2
platforms(ls1088a/ls208xa/lx2160a).

Signed-off-by: Yangbo Lu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

dt-binding: ptp_qoriq: support DPAA2 PTP compatible

Add a new compatible for DPAA2 PTP.

Signed-off-by: Yangbo Lu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

dpaa2-ptp: reuse ptp_qoriq driver

Although dpaa2-ptp.c driver is a fsl_mc_driver which
is using MC APIs for register accessing, it's same IP
block with eTSEC/DPAA/ENETC 1588 timer.
This patch is to convert to reuse ptp_qoriq driver by
using register ioremap and dropping related MC APIs.
However the interrupts could only be handled by MC which
fires MSIs to ARM cores. So the interrupt enabling and
handling still rely on MC APIs.

Signed-off-by: Yangbo Lu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ptp: add QorIQ PTP support for DPAA2

This patch is to add QorIQ PTP support for DPAA2.
Although dpaa2-ptp.c driver is a fsl_mc_driver which
is using MC APIs for register accessing, it's same
IP block with eTSEC/DPAA/ENETC 1588 timer. We will
convert to reuse ptp_qoriq driver by using register
ioremap and dropping related MC APIs.
Also allow to compile ptp_qoriq with COMPILE_TEST.

Signed-off-by: Yangbo Lu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: rtl8366: Fix up VLAN filtering

We get this regression when using RTL8366RB as part of a bridge
with OpenWrt:

WARNING: CPU: 0 PID: 1347 at net/switchdev/switchdev.c:291
switchdev_port_attr_set_now+0x80/0xa4
lan0: Commit of attribute (id=7) failed.
(...)
realtek-smi switch lan0: failed to initialize vlan filtering on this port

This is because it is trying to disable VLAN filtering
on VLAN0, as we have forgot to add 1 to the port number
to get the right VLAN in rtl8366_vlan_filtering(): when
we initialize the VLAN we associate VLAN1 with port 0,
VLAN2 with port 1 etc, so we need to add 1 to the port
offset.

Fixes: d8652956cf37 ("net: dsa: realtek-smi: Add Realtek SMI driver")
Signed-off-by: Linus Walleij <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

hinic: Use devm_kasprintf instead of hard coding it

'devm_kasprintf' is less verbose than:
   snprintf(NULL, 0, ...);
   devm_kzalloc(...);
   sprintf
so use it instead.

Signed-off-by: Christophe JAILLET <[email protected]>
Signed-off-by: Zhao Chen <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Revert "net: dsa: mv88e6xxx: do not flood CPU with unknown multicast"

This reverts commit 422efd032775757c41e9579facd9656a87bf4f00.

It breaks ipv6.

Signed-off-by: David S. Miller <[email protected]>

net: phylink: set the autoneg state in phylink_phy_change

The phy_state field of phylink should carry only valid information
especially when this can be passed to the .mac_config callback.
Update the an_enabled field with the autoneg state in the
phylink_phy_change function.

Fixes: 9525ae83959b ("phylink: add phylink infrastructure")
Signed-off-by: Ioana Ciornei <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: mv88e6xxx: do not flood CPU with unknown multicast

The DSA ports must flood unknown unicast and multicast, but the switch
must not flood the CPU ports with unknown multicast, as this results
in a lot of undesirable traffic that the network stack needs to filter
in software.

Signed-off-by: Vivien Didelot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'platform-drivers-x86-v5.2-3' of git://git.infradead.org/linux-platform-drivers-x86

Pull x86 platform driver fixes from Andy Shevchenko:

- fix a couple of Mellanox driver enumeration issues

- fix ASUS laptop regression with backlight

- fix Dell computers that got a wrong mode (tablet versus laptop) after
   resume

* tag 'platform-drivers-x86-v5.2-3' of git://git.infradead.org/linux-platform-drivers-x86:
  platform/mellanox: mlxreg-hotplug: Add devm_free_irq call to remove flow
  platform/x86: mlx-platform: Fix parent device in i2c-mux-reg device registration
  platform/x86: intel-vbtn: Report switch events when event wakes device
  platform/x86: asus-wmi: Only Tell EC the OS will handle display hotkeys from asus_nb_wmi

Merge tag 'usb-5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
"Here are some small USB driver fixes for 5.2-rc5

  Nothing major, just some small gadget fixes, usb-serial new device
  ids, a few new quirks, and some small fixes for some regressions that
  have been found after the big 5.2-rc1 merge.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'usb-5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  usb: typec: Make sure an alt mode exist before getting its partner
  usb: gadget: udc: lpc32xx: fix return value check in lpc32xx_udc_probe()
  usb: gadget: dwc2: fix zlp handling
  usb: dwc2: Set actual frame number for completed ISOC transfer for none DDMA
  usb: gadget: udc: lpc32xx: allocate descriptor with GFP_ATOMIC
  usb: gadget: fusb300_udc: Fix memory leak of fusb300->ep[i]
  usb: phy: mxs: Disable external charger detect in mxs_phy_hw_init()
  usb: dwc2: Fix DMA cache alignment issues
  usb: dwc2: host: Fix wMaxPacketSize handling (fix webcam regression)
  USB: Fix chipmunk-like voice when using Logitech C270 for recording audio.
  USB: usb-storage: Add new ID to ums-realtek
  usb: typec: ucsi: ccg: fix memory leak in do_flash
  USB: serial: option: add Telit 0x1260 and 0x1261 compositions
  USB: serial: pl2303: add Allied Telesis VT-Kit3
  USB: serial: option: add support for Simcom SIM7500/SIM7600 RNDIS mode

Merge tag 'powerpc-5.2-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
"One fix for a regression introduced by our 32-bit KASAN support, which
  broke booting on machines with "bootx" early debugging enabled.

  A fix for a bug which broke kexec on 32-bit, introduced by changes to
  the 32-bit STRICT_KERNEL_RWX support in v5.1.

  Finally two fixes going to stable for our THP split/collapse handling,
  discovered by Nick. The first fixes random crashes and/or corruption
  in guests under sufficient load.

  Thanks to: Nicholas Piggin, Christophe Leroy, Aaro Koskinen, Mathieu
  Malaterre"

* tag 'powerpc-5.2-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/32s: fix booting with CONFIG_PPC_EARLY_DEBUG_BOOTX
  powerpc/64s: __find_linux_pte() synchronization vs pmdp_invalidate()
  powerpc/64s: Fix THP PMD collapse serialisation
  powerpc: Fix kexec failure on book3s/32

Merge tag 'trace-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

- Out of range read of stack trace output

- Fix for NULL pointer dereference in trace_uprobe_create()

- Fix to a livepatching / ftrace permission race in the module code

- Fix for NULL pointer dereference in free_ftrace_func_mapper()

- A couple of build warning clean ups

* tag 'trace-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix NULL pointer dereference in free_ftrace_func_mapper()
  module: Fix livepatch/ftrace module text permissions race
  tracing/uprobe: Fix obsolete comment on trace_uprobe_create()
  tracing/uprobe: Fix NULL pointer dereference in trace_uprobe_create()
  tracing: Make two symbols static
  tracing: avoid build warning with HAVE_NOP_MCOUNT
  tracing: Fix out-of-range read in trace_stack_print()

x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback

Adric Blake reported the following warning during suspend-resume:

  Enabling non-boot CPUs ...
  x86: Booting SMP configuration:
  smpboot: Booting Node 0 Processor 1 APIC 0x2
  unchecked MSR access error: WRMSR to 0x10f (tried to write 0x0000000000000000) \
   at rIP: 0xffffffff8d267924 (native_write_msr+0x4/0x20)
  Call Trace:
   intel_set_tfa
   intel_pmu_cpu_starting
   ? x86_pmu_dead_cpu
   x86_pmu_starting_cpu
   cpuhp_invoke_callback
   ? _raw_spin_lock_irqsave
   notify_cpu_starting
   start_secondary
   secondary_startup_64
  microcode: sig=0x806ea, pf=0x80, revision=0x96
  microcode: updated to revision 0xb4, date = 2019-04-01
  CPU1 is up

The MSR in question is MSR_TFA_RTM_FORCE_ABORT and that MSR is emulated
by microcode. The log above shows that the microcode loader callback
happens after the PMU restoration, leading to the conjecture that
because the microcode hasn't been updated yet, that MSR is not present
yet, leading to the #GP.

Add a microcode loader-specific hotplug vector which comes before
the PERF vectors and thus executes earlier and makes sure the MSR is
present.

Fixes: 400816f60c54 ("perf/x86/intel: Implement support for TSX Force Abort")
Reported-by: Adric Blake <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: <[email protected]>
Cc: [email protected]
Link: https://bugzilla.kernel.org/show_bug.cgi?id=203637

Merge branch 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:
"This has an unusually high density of tricky fixes:

   - task_get_css() could deadlock when it races against a dying cgroup.

   - cgroup.procs didn't list thread group leaders with live threads.

     This could mislead readers to think that a cgroup is empty when
     it's not. Fixed by making PROCS iterator include dead tasks. I made
     a couple mistakes making this change and this pull request contains
     a couple follow-up patches.

   - When cpusets run out of online cpus, it updates cpusmasks of member
     tasks in bizarre ways. Joel improved the behavior significantly"

* 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset: restore sanity to cpuset_cpus_allowed_fallback()
  cgroup: Fix css_task_iter_advance_css_set() cset skip condition
  cgroup: css_task_iter_skip()'d iterators must be advanced before accessed
  cgroup: Include dying leaders with live threads in PROCS iterations
  cgroup: Implement css_task_iter_skip()
  cgroup: Call cgroup_release() before __exit_signal()
  docs cgroups: add another example size for hugetlb
  cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()

Merge tag 'drm-fixes-2019-06-14' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Daniel Vetter:
"Nothing unsettling here, also not aware of anything serious still
  pending.

  The edid override regression fix took a bit longer since this seems to
  be an area with an overabundance of bad options. But the fix we have
  now seems like a good path forward.

  Next week it should be back to Dave.

  Summary:

   - fix regression on amdgpu on SI

   - fix edid override regression

   - driver fixes: amdgpu, i915, mediatek, meson, panfrost

   - fix writecombine for vmap in gem-shmem helper (used by panfrost)

   - add more panel quirks"

* tag 'drm-fixes-2019-06-14' of git://anongit.freedesktop.org/drm/drm: (25 commits)
  drm/amdgpu: return 0 by default in amdgpu_pm_load_smu_firmware
  drm/amdgpu: Fix bounds checking in amdgpu_ras_is_supported()
  drm: add fallback override/firmware EDID modes workaround
  drm/edid: abstract override/firmware EDID retrieval
  drm/i915/perf: fix whitelist on Gen10+
  drm/i915/sdvo: Implement proper HDMI audio support for SDVO
  drm/i915: Fix per-pixel alpha with CCS
  drm/i915/dmc: protect against reading random memory
  drm/i915/dsi: Use a fuzzy check for burst mode clock check
  drm/amdgpu/{uvd,vcn}: fetch ring's read_ptr after alloc
  drm/panfrost: Require the simple_ondemand governor
  drm/panfrost: make devfreq optional again
  drm/gem_shmem: Use a writecombine mapping for ->vaddr
  drm: panel-orientation-quirks: Add quirk for GPD MicroPC
  drm: panel-orientation-quirks: Add quirk for GPD pocket2
  drm/meson: fix G12A primary plane disabling
  drm/meson: fix primary plane disabling
  drm/meson: fix G12A HDMI PLL settings for 4K60 1000/1001 variations
  drm/mediatek: call mtk_dsi_stop() after mtk_drm_crtc_atomic_disable()
  drm/mediatek: clear num_pipes when unbind driver
  ...

Merge tag 'gfs2-v5.2.fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fix from Andreas Gruenbacher:
"Fix rounding error in gfs2_iomap_page_prepare"

* tag 'gfs2-v5.2.fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix rounding error in gfs2_iomap_page_prepare

Merge branch 'net-dsa-use-switchdev-attr-and-obj-handlers'

Vivien Didelot says:

====================
net: dsa: use switchdev attr and obj handlers

This series reduces boilerplate in the handling of switchdev attribute and
object operations by using the switchdev_handle_* helpers, which check the
targeted devices and recurse into their lower devices.

This also brings back the ability to inspect operations targeting the bridge
device itself (where .orig_dev and .dev were originally the bridge device),
even though that is of no use yet and skipped by this series.

Changes in v2: Only VLAN and (non-host) MDB objects not directly targeting
the slave device are unsupported at the moment, so only skip these cases.
====================

Signed-off-by: David S. Miller <[email protected]>

net: dsa: use switchdev handle helpers

Get rid of the dsa_slave_switchdev_port_{attr_set,obj}_event functions
in favor of the switchdev_handle_port_{attr_set,obj_add,obj_del}
helpers which recurse into the lower devices of the target interface.

This has the benefit of being aware of the operations made on the
bridge device itself, where orig_dev is the bridge, and dev is the
slave. This can be used later to configure the hardware switches.

Only VLAN and (port) MDB objects not directly targeting the slave
device are unsupported at the moment, so skip this case in their
respective case statements.

Signed-off-by: Vivien Didelot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: make dsa_slave_dev_check use const

The switchdev handle helpers make use of a device checking helper
requiring a const net_device. Make dsa_slave_dev_check compliant
to this.

Signed-off-by: Vivien Didelot <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: make cpu_dp non const

A port may trigger operations on its dedicated CPU port, so using
cpu_dp as const will raise warnings. Make cpu_dp non const.

Signed-off-by: Vivien Didelot <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: do not check orig_dev in vlan del

The current DSA code handling switchdev objects does not recurse into
the lower devices thus is never called with an orig_dev member being
a bridge device, hence remove this useless check.

At the same time, remove the comments about the callers, which is
unlikely to be updated if the code changes and thus will be confusing.

Signed-off-by: Vivien Didelot <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'tcp-add-three-static-keys'

Eric Dumazet says:

====================
tcp: add three static keys

Recent addition of per TCP socket rx/tx cache brought
regressions for some workloads, as reported by Feng Tang.

It seems better to make them opt-in, before we adopt better
heuristics.

The last patch adds high_order_alloc_disable sysctl
to ask TCP sendmsg() to exclusively use order-0 allocations,
as mm layer has specific optimizations.
====================

Signed-off-by: David S. Miller <[email protected]>

net: add high_order_alloc_disable sysctl/static key

>From linux-3.7, (commit 5640f7685831 "net: use a per task frag
allocator") TCP sendmsg() has preferred using order-3 allocations.

While it gives good results for most cases, we had reports
that heavy uses of TCP over loopback were hitting a spinlock
contention in page allocations/freeing.

This commits adds a sysctl so that admins can opt-in
for order-0 allocations. Hopefully mm layer might optimize
order-3 allocations in the future since it could give us
a nice boost (see 8 lines of following benchmark)

The following benchmark shows a win when more than 8 TCP_STREAM
threads are running (56 x86 cores server in my tests)

for thr in {1..30}
do
sysctl -wq net.core.high_order_alloc_disable=0
T0=`./super_netperf $thr -H 127.0.0.1 -l 15`
sysctl -wq net.core.high_order_alloc_disable=1
T1=`./super_netperf $thr -H 127.0.0.1 -l 15`
echo $thr:$T0:$T1
done

1: 49979: 37267
2: 98745: 76286
3: 141088: 110051
4: 177414: 144772
5: 197587: 173563
6: 215377: 208448
7: 241061: 234087
8: 267155: 263373
9: 295069: 297402
10: 312393: 335213
11: 340462: 368778
12: 371366: 403954
13: 412344: 443713
14: 426617: 473580
15: 474418: 507861
16: 503261: 538539
17: 522331: 563096
18: 532409: 567084
19: 550824: 605240
20: 525493: 641988
21: 564574: 665843
22: 567349: 690868
23: 583846: 710917
24: 588715: 736306
25: 603212: 763494
26: 604083: 792654
27: 602241: 796450
28: 604291: 797993
29: 611610: 833249
30: 577356: 841062

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: add tcp_tx_skb_cache sysctl

Feng Tang reported a performance regression after introduction
of per TCP socket tx/rx caches, for TCP over loopback (netperf)

There is high chance the regression is caused by a change on
how well the 32 KB per-thread page (current->task_frag) can
be recycled, and lack of pcp caches for order-3 pages.

I could not reproduce the regression myself, cpus all being
spinning on the mm spinlocks for page allocs/freeing, regardless
of enabling or disabling the per tcp socket caches.

It seems best to disable the feature by default, and let
admins enabling it.

MM layer either needs to provide scalable order-3 pages
allocations, or could attempt a trylock on zone->lock if
the caller only attempts to get a high-order page and is
able to fallback to order-0 ones in case of pressure.

Tests run on a 56 cores host (112 hyper threads)

- 35.49% netperf [kernel.vmlinux]   [k] queued_spin_lock_slowpath
   - 35.49% queued_spin_lock_slowpath
  - 18.18% get_page_from_freelist
- __alloc_pages_nodemask
- 18.18% alloc_pages_current
skb_page_frag_refill
sk_page_frag_refill
tcp_sendmsg_locked
tcp_sendmsg
inet_sendmsg
sock_sendmsg
__sys_sendto
__x64_sys_sendto
do_syscall_64
entry_SYSCALL_64_after_hwframe
__libc_send
  + 17.31% __free_pages_ok
+ 31.43% swapper [kernel.vmlinux]   [k] intel_idle
+ 9.12% netperf [kernel.vmlinux]   [k] copy_user_enhanced_fast_string
+ 6.53% netserver [kernel.vmlinux]   [k] copy_user_enhanced_fast_string
+ 0.69% netserver [kernel.vmlinux]   [k] queued_spin_lock_slowpath
+ 0.68% netperf [kernel.vmlinux]   [k] skb_release_data
+ 0.52% netperf [kernel.vmlinux]   [k] tcp_sendmsg_locked
0.46% netperf [kernel.vmlinux]   [k] _raw_spin_lock_irqsave

Fixes: 472c2e07eef0 ("tcp: add one skb cache for tx")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: Feng Tang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: add tcp_rx_skb_cache sysctl

Instead of relying on rps_needed, it is safer to use a separate
static key, since we do not want to enable TCP rx_skb_cache
by default. This feature can cause huge increase of memory
usage on hosts with millions of sockets.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sysctl: define proc_do_static_key()

Convert proc_dointvec_minmax_bpf_stats() into a more generic
helper, since we are going to use jump labels more often.

Note that sysctl_bpf_stats_enabled is removed, since
it is no longer needed/used.

Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2019-06-14

This series contains updates to i40e only.

Aleksandr adds stub functions for Energy Efficient Ethernet (EEE) to
currently report that it is not supported in i40e.  Fixed up the Link
Layer Detection Protocol (LLDP) code to ensure we do not set the LLDP
flag too early before we ensure that we have a successful start.  This
also will prevent needles restarting of the device if LLDP did not
change its state with an unsuccessful start.

Piotr bumps up the amount of VLANs that an untrusted VF can implement,
from 8 VLANs to 16.  Adds checks to the Virtual Embedded Bridge (VEB)
and channel arrays so access does not exceed the boundary and ensure the
index is below the maximum.  Fixed an issue in the driver where we were
not checking the response from the LLDP flag and were returned success
no matter what the value of the response was.

Mitch fixes a variable counter, which can be negative in value so make
it an integer instead of an unsigned-integer.

Doug improves the admin queue log granularity by making it possible to
log only the admin queue descriptors without the entire admin queue
message buffers.

Sergey fixes up the virtchnl code by removing duplicate checks, ensure
the variable type is correct when comparing integers, enhance error and
warning messages to include useful information.

Adam fixes a potential kernel panic when the i40e driver was being bound
to a non-i40e port by adding a check on the BAR size to ensure it is
large enough by reading the highest register.

Jake fixes a statistics error in the "transmit errors" stat, which was
being calculated twice.

Gustavo A. R. Silva adds a fall-through code comment to help with
compiler checks.

v2: Fixed the return values wrapped in parenthesis in patch 8 and
    cleaned up the commit message in patch 12 so the Gustavo does
    not repeat himself.
====================

Signed-off-by: David S. Miller <[email protected]>

udp: Remove unused variable/function (exact_dif)

This was originally passed through to the VRF logic in compute_score().
But that logic has now been replaced by udp_sk_bound_dev_eq() and so
this code is no longer used or needed.

Signed-off-by: Tim Beale <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

udp: Remove unused parameter (exact_dif)

Originally this was used by the VRF logic in compute_score(), but that
was later replaced by udp_sk_bound_dev_eq() and the parameter became
unused.

Note this change adds an 'unused variable' compiler warning that will be
removed in the next patch (I've split the removal in two to make review
slightly easier).

Signed-off-by: Tim Beale <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv4: tcp: fix ACK/RST sent with a transmit delay

If we want to set a EDT time for the skb we want to send
via ip_send_unicast_reply(), we have to pass a new parameter
and initialize ipc.sockc.transmit_time with it.

This fixes the EDT time for ACK/RST packets sent on behalf of
a TIME_WAIT socket.

Fixes: a842fe1425cb ("tcp: add optional per socket transmit delay")
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: remove empty netlink_tap_exit_net

Pointer members of an object with static storage duration, if not
explicitly initialized, will be initialized to a NULL pointer. The
net namespace API checks if this pointer is not NULL before using it,
it are safe to remove the function.

Signed-off-by: Li RongQing <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'nfp-flower-loosen-L4-checks-and-add-extack-to-flower-offload'

Jakub Kicinski says:

====================
nfp: flower: loosen L4 checks and add extack to flower offload

Pieter says:

This set allows the offload of filters that make use of an unknown
ip protocol, given that layer 4 is being wildcarded. The set then
aims to make use of extack messaging for flower offloads. It adds
about 70 extack messages to the driver.
====================

Signed-off-by: David S. Miller <[email protected]>

nfp: flower: extend extack messaging for flower match and actions

Use extack messages in flower offload when compiling match and actions
messages that will configure hardware.

Signed-off-by: Pieter Jansen van Vuuren <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

nfp: flower: use extack messages in flower offload

Use extack messages in flower offload, specifically focusing on
the extack use in add offload, remove offload and get stats paths.

Signed-off-by: Pieter Jansen van Vuuren <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

nfp: flower: check L4 matches on unknown IP protocols

Matching on fields with a protocol that is unknown to hardware
is not strictly unsupported. Determine if hardware can offload
a filter with an unknown protocol by checking if any L4 fields
are being matched as well.

Signed-off-by: Pieter Jansen van Vuuren <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

hv_netvsc: Set probe mode to sync

For better consistency of synthetic NIC names, we set the probe mode to
PROBE_FORCE_SYNCHRONOUS. So the names can be aligned with the vmbus
channel offer sequence.

Fixes: af0a5646cb8d ("use the new async probing feature for the hyperv drivers")
Signed-off-by: Haiyang Zhang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'mlx5-updates-2019-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2019-06-13

Mlx5 devlink health fw reporters and sw reset support

This series provides mlx5 firmware reset support and firmware devlink health
reporters.

1) Add initial mlx5 kernel documentation and include devlink health reporters

2) Add CR-Space access and FW Crdump snapshot support via devlink region_snapshot

3) Issue software reset upon FW asserts

4) Add fw and fw_fatal devlink heath reporters to follow fw errors indication by
dump and recover procedures and enable trigger these functionality by user.

4.1) fw reporter:
The fw reporter implements diagnose and dump callbacks.
It follows symptoms of fw error such as fw syndrome by triggering
fw core dump and storing it and any other fw trace into the dump buffer.
The fw reporter diagnose command can be triggered any time by the user to check
current fw status.

4.2) fw_fatal repoter:
The fw_fatal reporter implements dump and recover callbacks.
It follows fatal errors indications by CR-space dump and recover flow.
The CR-space dump uses vsc interface which is valid even if the FW command
interface is not functional, which is the case in most FW fatal errors. The
CR-space dump is stored as a memory region snapshot to ease read by address.
The recover function runs recover flow which reloads the driver and triggers fw
reset if needed.
====================

Signed-off-by: David S. Miller <[email protected]>

ipv4: Support multipath hashing on inner IP pkts for GRE tunnel

Multipath hash policy value of 0 isn't distributing since the outer IP
dest and src aren't varied eventhough the inner ones are. Since the flow
is on the inner ones in the case of tunneled traffic, hashing on them is
desired.

This is done mainly for IP over GRE, hence only tested for that. But
anything else supported by flow dissection should work.

v2: Use skb_flow_dissect_flow_keys() directly so that other tunneling
can be supported through flow dissection (per Nikolay Aleksandrov).
v3: Remove accidental inclusion of ports in the hash keys and clarify
the documentation (Nikolay Alexandrov).
Signed-off-by: Stephen Suryaputra <[email protected]>
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

virtio_net: enable napi_tx by default

NAPI tx mode improves TCP behavior by enabling TCP small queues (TSQ).
TSQ reduces queuing ("bufferbloat") and burstiness.

Previous measurements have shown significant improvement for
TCP_STREAM style workloads. Such as those in commit 86a5df1495cc
("Merge branch 'virtio-net-tx-napi'").

There has been uncertainty about smaller possible regressions in
latency due to increased reliance on tx interrupts.

The above results did not show that, nor did I observe this when
rerunning TCP_RR on Linux 5.1 this week on a pair of guests in the
same rack. This may be subject to other settings, notably interrupt
coalescing.

In the unlikely case of regression, we have landed a credible runtime
solution. Ethtool can configure it with -C tx-frames [0|1] as of
commit 0c465be183c7 ("virtio_net: ethtool tx napi configuration").

NAPI tx mode has been the default in Google Container-Optimized OS
(COS) for over half a year, as of release M70 in October 2018,
without any negative reports.

Link: https://marc.info/?l=linux-netdev&m=149305618416472
Link: https://lwn.net/Articles/507065/
Signed-off-by: Willem de Bruijn <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Acked-by: Jason Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops

To remove rtnl lock dependency in tc filter update API when using clsact
Qdisc, set QDISC_CLASS_OPS_DOIT_UNLOCKED flag in clsact Qdisc_class_ops.

Clsact Qdisc ops don't require any modifications to be used without rtnl
lock on tc filter update path. Implementation never changes its q->block
and only releases it when Qdisc is being destroyed. This means it is enough
for RTM_{NEWTFILTER|DELTFILTER|GETTFILTER} message handlers to hold clsact
Qdisc reference while using it without relying on rtnl lock protection.
Unlocked Qdisc ops support is already implemented in filter update path by
unlocked cls API patch set.

Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'enable-and-use-static_branch_deferred_inc'

Willem de Bruijn says:

====================
enable and use static_branch_deferred_inc

1. make static_branch_deferred_inc available if !CONFIG_JUMP_LABEL
2. convert the existing STATIC_KEY_DEFERRED_FALSE user to this api
====================

Signed-off-by: David S. Miller <[email protected]>

tcp: use static_branch_deferred_inc for clean_acked_data_enabled

Deferred static key clean_acked_data_enabled uses the deferred
variants of dec and flush. Do the same for inc.

Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

locking/static_key: always define static_branch_deferred_inc

This interface is currently only defined if CONFIG_JUMP_LABEL. Make it
available also when jump labels are off.

Signed-off-by: Willem de Bruijn <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: flower: don't call synchronize_rcu() on mask creation

Current flower mask creating code assumes that temporary mask that is used
when inserting new filter is stack allocated. To prevent race condition
with data patch synchronize_rcu() is called every time fl_create_new_mask()
replaces temporary stack allocated mask. As reported by Jiri, this
increases runtime of creating 20000 flower classifiers from 4 seconds to
163 seconds. However, this design is no longer necessary since temporary
mask was converted to be dynamically allocated by commit 2cddd2014782
("net/sched: cls_flower: allocate mask dynamically in fl_change()").

Remove synchronize_rcu() calls from mask creation code. Instead, refactor
fl_change() to always deallocate temporary mask with rcu grace period.

Fixes: 195c234d15c9 ("net: sched: flower: handle concurrent mask insertion")
Reported-by: Jiri Pirko <[email protected]>
Signed-off-by: Vlad Buslov <[email protected]>
Tested-by: Jiri Pirko <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: fix warning same module names

When building with CONFIG_NET_DSA_REALTEK_SMI and CONFIG_REALTEK_PHY
enabled as loadable modules, we see the following warning:

warning: same module names found:
drivers/net/phy/realtek.ko
drivers/net/dsa/realtek.ko

Rework so the driver name is realtek-smi instead of realtek.

Reviewed-by: Linus Walleij <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: Anders Roxell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: Free cookie before we memdup a new one

Based on comments from Xin, even after fixes for our recent syzbot
report of cookie memory leaks, its possible to get a resend of an INIT
chunk which would lead to us leaking cookie memory.

To ensure that we don't leak cookie memory, free any previously
allocated cookie first.

Change notes
v1->v2
update subsystem tag in subject (davem)
repeat kfree check for peer_random and peer_hmacs (xin)

v2->v3
net->sctp
also free peer_chunks

v3->v4
fix subject tags

v4->v5
remove cut line

Signed-off-by: Neil Horman <[email protected]>
Reported-by: [email protected]
CC: Marcelo Ricardo Leitner <[email protected]>
CC: Xin Long <[email protected]>
CC: "David S. Miller" <[email protected]>
CC: [email protected]
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'hns3-next'

Huazhong Tan says:

====================
net: hns3: some code optimizations & cleanups & bugfixes

This patch-set includes code optimizations, cleanups and bugfixes for
the HNS3 ethernet controller driver.

[patch 1/12 - 6/12] adds some code optimizations and bugfixes about RAS
and MSI-X HW error.

[patch 7/12] fixes a loading issue.

[patch 8/12 - 11/12] adds some bugfixes.

[patch 12/12] adds some cleanups, which does not change the logic of code.
====================

Signed-off-by: David S. Miller <[email protected]>

net: hns3: some variable modification

This patch does following things:
1. add the keyword const before some variables which won't be modified
   in functions.
2. changes some variables from signed to unsigned to avoid bitwise
   operation on signed variables.
3. adds or removes initialization of some variables.
4. defines a new structure to help parsing mailbox messages instead of
   using an array which is harder to get the meaning of each element.

Signed-off-by: Weihang Li <[email protected]>
Signed-off-by: Peng Li <[email protected]>
Signed-off-by: Yunsheng Lin <[email protected]>
Signed-off-by: Yufeng Mo <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: delay ring buffer clearing during reset

The driver may not be able to disable the ring through firmware
when downing the netdev during reset process, which may cause
hardware accessing freed buffer problem.

This patch delays the ring buffer clearing to reset uninit
process because hardware will not access the ring buffer after
hardware reset is completed.

Fixes: bb6b94a896d4 ("net: hns3: Add reset interface implementation in client")
Signed-off-by: Yunsheng Lin <[email protected]>
Signed-off-by: Peng Li <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: fix for skb leak when doing selftest

If hns3_nic_net_xmit does not return NETDEV_TX_BUSY when doing
a loopback selftest, the skb is not freed in hns3_clean_tx_ring
or hns3_nic_net_xmit, which causes skb not freed problem.

This patch fixes it by freeing skb when hns3_nic_net_xmit does
not return NETDEV_TX_OK.

Fixes: c39c4d98dc65 ("net: hns3: Add mac loopback selftest support in hns3 driver")
Signed-off-by: Yunsheng Lin <[email protected]>
Signed-off-by: Peng Li <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: fix for dereferencing before null checking

The netdev is dereferenced before null checking in the function
hns3_setup_tc.

This patch moves the dereferencing after the null checking.

Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC")
Signed-off-by: Yunsheng Lin <[email protected]>
Signed-off-by: Peng Li <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: free irq when exit from abnormal branch

In hns3_nic_init_irq(), if request irq fail at index i,
the function return directly without releasing irq resources
that already requested, and nowhere else will release them.

Signed-off-by: Yonglong Liu <[email protected]>
Signed-off-by: Peng Li <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: clear restting state when initializing HW device

IMP will set restting state for all function when PF FLR, driver
just clear the restting state in resetting progress, but don't do
it in initializing progress. As FLR is not created by driver,
it is necessary to clear restting state when initializing HW device.

Signed-off-by: Peng Li <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: extract handling of mpf/pf msi-x errors into functions

Function hclge_handle_all_hw_msix_error() contains four parts:
1. Query buffer descriptors for MSI-X errors.
2. Query and clear all main PF MSI-X errors.
3. Query and clear all PF MSI-X errors.
4. Handle mac tunnel interrupts.
Part 2 and part 3 handle errors of some different modules respectively,
this patch extracts them into dividual functions, which makes the logic
clearer.

Signed-off-by: Weihang Li <[email protected]>
Signed-off-by: Peng Li <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: some changes of MSI-X bits in PPU(RCB)

This patch modifies print message of rx_q_search_miss from error to dfx to
prevent misleading users, because this interrupt may occur if we receive
packets during initialization of HNS3 driver.
Otherwise, this patch masks 28th bit of PPU_MPF_ABNORMAL_SRC2 which is now
meaningless.

Signed-off-by: Weihang Li <[email protected]>
Signed-off-by: Peng Li <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: add recovery for the H/W errors occurred before the HNS dev initialization

This patch adds the recovery for the HNS H/W errors which occurred
before the driver initialization.

Reported-by: Salil Mehta <[email protected]>
Signed-off-by: Shiju Jose <[email protected]>
Signed-off-by: Peng Li <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: process H/W errors occurred before HNS dev initialization

Presently the HNS driver enables the HNS H/W error interrupts after
the dev initialization is completed. However some exceptions such as
NCSI errors can occur when the network port driver is not loaded
and those errors required reporting to the BMC.
Therefore the firmware enabled all the HNS ras error interrupts
before the driver is loaded. And in some cases, there will be some
H/W errors remained unclear before reboot. Thus the HNS driver needs
to process and recover those hw errors occurred before HNS driver is
initialized.

This patch adds processing of the HNS hw errors(RAS and MSI-X)
which occurred before the driver initialization. For RAS, because
they are enabled by firmware, so we can detect specific bits, then
log and clear them. But for MSI-X which can not be enabled before
open vector0 irq, we can't detect the specific error bits, so we
just write 1 to all interrupt source registers to clear.

Signed-off-by: Shiju Jose <[email protected]>
Signed-off-by: Weihang Li <[email protected]>
Signed-off-by: Peng Li <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: fix avoid unnecessary resetting for the H/W errors which do not require reset

HNS does not need to be reset when errors occur in some bits.
However presently the HNAE3_FUNC_RESET is set in this case and
as a result the default_reset is done when these errors are reported.
This patch fix this issue. Also patch does some optimization
in setting the reset level for the error recovery.

Reported-by: Weihang Li <[email protected]>
Signed-off-by: Shiju Jose <[email protected]>
Signed-off-by: Peng Li <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: delay setting of reset level for hw errors until slot_reset is called

Presently the error handling code sets the reset level required
for the recovery of the hw errors to the reset framework in the
error_detected AER callback. However the rest_event would be
called later from the slot_reset callback. This can cause issue
of using the wrong reset_level if a high priority reset request
occur before the slot_reset is called.

This patch delays setting of the reset level, required
for the hw errors, to the reset framework until the
slot_reset is called.

Reported-by: Salil Mehta <[email protected]>
Signed-off-by: Shiju Jose <[email protected]>
Signed-off-by: Weihang Li <[email protected]>
Signed-off-by: Peng Li <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'qed-iWARP-fixes'

Michal Kalderon says:

====================
qed: iWARP fixes

This series contains a few small fixes related to iWARP.
====================

Signed-off-by: David S. Miller <[email protected]>

qed: iWARP - Fix default window size to be based on chip

The default window size is calculated for best performance based
on internal hw buffer sizes. The size differs between the
different chips and modes.

Fixes: 67b40dccc45f ("qed: Implement iWARP initialization, teardown and qp operations")
Signed-off-by: Ariel Elior <[email protected]>
Signed-off-by: Michal Kalderon <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qed: iWARP - Fix tc for MPA ll2 connection

The driver needs to assign a lossless traffic class for the MPA ll2
connection to ensure no packets are dropped when returning from the
driver as they will never be re-transmitted by the peer.

Fixes: ae3488ff37dc ("qed: Add ll2 connection for processing unaligned MPA packets")
Signed-off-by: Ariel Elior <[email protected]>
Signed-off-by: Michal Kalderon <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qed: iWARP - fix uninitialized callback

Fix uninitialized variable warning by static checker.

Fixes: ae3488ff37dc ("qed: Add ll2 connection for processing unaligned MPA packets")
Signed-off-by: Ariel Elior <[email protected]>
Signed-off-by: Michal Kalderon <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qed: iWARP - Use READ_ONCE and smp_store_release to access ep->state

Destroy QP waits for it's ep object state to be set to CLOSED
before proceeding. ep->state can be updated from a different
context. Add smp_store_release/READ_ONCE to synchronize.

Fixes: fc4c6065e661 ("qed: iWARP implement disconnect flows")
Signed-off-by: Ariel Elior <[email protected]>
Signed-off-by: Michal Kalderon <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: sfp: clean up a condition

The acpi_node_get_property_reference() doesn't return ACPI error codes,
it just returns regular negative kernel error codes. This patch doesn't
affect run time, it's just a clean up.

Signed-off-by: Dan Carpenter <[email protected]>
Reviewed-by: Ruslan Babayev <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

vsock: correct removal of socket from the list

The current vsock code for removal of socket from the list is both
subject to race and inefficient. It takes the lock, checks whether
the socket is in the list, drops the lock and if the socket was on the
list, deletes it from the list. This is subject to race because as soon
as the lock is dropped once it is checked for presence, that condition
cannot be relied upon for any decision. It is also inefficient because
if the socket is present in the list, it takes the lock twice.

Signed-off-by: Sunil Muthuswamy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'nfp-add-two-user-friendly-errors'

Jakub Kicinski says:

====================
nfp: add two user friendly errors

This small series adds two error messages based on recent
bug reports which turned out not to be bugs..
====================

Signed-off-by: David S. Miller <[email protected]>

nfp: print a warning when binding VFs to PF driver

Users sometimes mistakenly try to manually bind the PF driver
to the VFs, print a warning message in that case.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Dirk van der Merwe <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

nfp: update the old flash error message

Apparently there are still cards in the wild with a very old
management FW. Let's make the error message in that case
indicate more clearly that management firmware has to be
updated.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Dirk van der Merwe <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'Microchip-KSZ-driver-enhancements'

Robert Hancock says:

====================
Microchip KSZ driver enhancements

A couple of enhancements to the Microchip KSZ switch driver: one to add
PHY register settings for errata workarounds for more stable operation, and
another to add a device tree option to change the output clock rate as
required by some board designs.
====================

Signed-off-by: David S. Miller <[email protected]>

net: dsa: microchip: Support optional 125MHz SYNCLKO output

The KSZ9477 series chips have a SYNCLKO pin which by default outputs a
25MHz clock, but some board setups require a 125MHz clock instead. Added
a microchip,synclko-125 device tree property to allow indicating a
125MHz clock output is required.

Signed-off-by: Robert Hancock <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: microchip: Add PHY errata workarounds

The Silicon Errata and Data Sheet Clarification documents for the
KSZ9477 series of chips describe a number of otherwise undocumented PHY
register settings which are required to work around various chip errata.
Apply these settings when initializing the PHY ports on these chips.

Signed-off-by: Robert Hancock <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: microchip: Don't try to read stats for unused ports

If some of the switch ports were not listed in the device tree, due to
being unused, the ksz_mib_read_work function ended up accessing a NULL
dp->slave pointer and causing an oops. Skip checking statistics for any
unused ports.

Fixes: 7c6ff470aa867f53 ("net: dsa: microchip: add MIB counter reading support")
Signed-off-by: Robert Hancock <[email protected]>
Reviewed-by: Vivien Didelot <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: use GPIO descriptors in stmmac_mdio_reset

Switch stmmac_mdio_reset to use GPIO descriptors. GPIO core handles the
"snps,reset-gpio" for GPIO descriptors so we don't need to take care of
it inside the driver anymore.

The advantage of this is that we now preserve the GPIO flags which are
passed via devicetree. This is required on some newer Amlogic boards
which use an Open Drain pin for the reset GPIO. This pin can only output
a LOW signal or switch to input mode but it cannot output a HIGH signal.
There are already devicetree bindings for these special cases and GPIO
core already takes care of them but only if we use GPIO descriptors
instead of GPIO numbers.

Signed-off-by: Martin Blumenstingl <[email protected]>
Reviewed-by: Linus Walleij <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'qmi_wwan-fix-QMAP-handling'

Reinhard Speyerer says:

====================
qmi_wwan: fix QMAP handling

This series addresses the following issues observed when using the
QMAP support of the qmi_wwan driver:

1. The QMAP code in the qmi_wwan driver is based on the CodeAurora
   GobiNet driver ([1], [2]) which does not process QMAP padding
   in the RX path correctly. This causes qmimux_rx_fixup() to pass
   incorrect data to the IP stack when padding is used.

2. qmimux devices currently lack proper network device usage statistics.

3. RCU stalls on device disconnect with QMAP activated like this

   # echo Y > /sys/class/net/wwan0/qmi/raw_ip
   # echo 1 > /sys/class/net/wwan0/qmi/add_mux
   # echo 2 > /sys/class/net/wwan0/qmi/add_mux
   # echo 3 > /sys/class/net/wwan0/qmi/add_mux

   have been observed in certain setups:

   [ 2273.676593] option1 ttyUSB16: GSM modem (1-port) converter now disconnected from ttyUSB16
   [ 2273.676617] option 6-1.2:1.0: device disconnected
   [ 2273.676774] WARNING: CPU: 1 PID: 141 at kernel/rcu/tree_plugin.h:342 rcu_note_context_switch+0x2a/0x3d0
   [ 2273.676776] Modules linked in: option qmi_wwan cdc_mbim cdc_ncm qcserial cdc_wdm usb_wwan sierra sierra_net usbnet mii edd coretemp iptable_mangle ip6_tables iptable_filter ip_tables cdc_acm dm_mod dax iTCO_wdt evdev iTCO_vendor_support sg ftdi_sio usbserial e1000e ptp pps_core i2c_i801 ehci_pci button lpc_ich i2c_core mfd_core uhci_hcd ehci_hcd rtc_cmos usbcore usb_common sd_mod fan ata_piix thermal
   [ 2273.676817] CPU: 1 PID: 141 Comm: kworker/1:1 Not tainted 4.19.38-rsp-1 #1
   [ 2273.676819] Hardware name: Not Applicable   Not Applicable  /CX-GS/GM45-GL40             , BIOS V1.11      03/23/2011
   [ 2273.676828] Workqueue: usb_hub_wq hub_event [usbcore]
   [ 2273.676832] EIP: rcu_note_context_switch+0x2a/0x3d0
   [ 2273.676834] Code: 55 89 e5 57 56 89 c6 53 83 ec 14 89 45 f0 e8 5d ff ff ff 89 f0 64 8b 3d 24 a6 86 c0 84 c0 8b 87 04 02 00 00 75 7a 85 c0 7e 7a <0f> 0b 80 bf 08 02 00 00 00 0f 84 87 00 00 00 e8 b2 e2 ff ff bb dc
   [ 2273.676836] EAX: 00000001 EBX: f614bc00 ECX: 00000001 EDX: c0715b81
   [ 2273.676838] ESI: 00000000 EDI: f18beb40 EBP: f1a3dc20 ESP: f1a3dc00
   [ 2273.676840] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010002
   [ 2273.676842] CR0: 80050033 CR2: b7e97230 CR3: 2f9c4000 CR4: 000406b0
   [ 2273.676843] Call Trace:
   [ 2273.676847]  ? preempt_count_add+0xa5/0xc0
   [ 2273.676852]  __schedule+0x4e/0x4f0
   [ 2273.676855]  ? __queue_work+0xf1/0x2a0
   [ 2273.676858]  ? _raw_spin_lock_irqsave+0x14/0x40
   [ 2273.676860]  ? preempt_count_add+0x52/0xc0
   [ 2273.676862]  schedule+0x33/0x80
   [ 2273.676865]  _synchronize_rcu_expedited+0x24e/0x280
   [ 2273.676867]  ? rcu_accelerate_cbs_unlocked+0x70/0x70
   [ 2273.676871]  ? wait_woken+0x70/0x70
   [ 2273.676873]  ? rcu_accelerate_cbs_unlocked+0x70/0x70
   [ 2273.676875]  ? _synchronize_rcu_expedited+0x280/0x280
   [ 2273.676877]  synchronize_rcu_expedited+0x22/0x30
   [ 2273.676881]  synchronize_net+0x25/0x30
   [ 2273.676885]  dev_deactivate_many+0x133/0x230
   [ 2273.676887]  ? preempt_count_add+0xa5/0xc0
   [ 2273.676890]  __dev_close_many+0x4d/0xc0
   [ 2273.676892]  ? skb_dequeue+0x40/0x50
   [ 2273.676895]  dev_close_many+0x5d/0xd0
   [ 2273.676898]  rollback_registered_many+0xbf/0x4c0
   [ 2273.676901]  ? raw_notifier_call_chain+0x1a/0x20
   [ 2273.676904]  ? call_netdevice_notifiers_info+0x23/0x60
   [ 2273.676906]  ? netdev_master_upper_dev_get+0xe/0x70
   [ 2273.676908]  rollback_registered+0x1f/0x30
   [ 2273.676911]  unregister_netdevice_queue+0x47/0xb0
   [ 2273.676915]  qmimux_unregister_device+0x1f/0x30 [qmi_wwan]
   [ 2273.676917]  qmi_wwan_disconnect+0x5d/0x90 [qmi_wwan]
   ...
   [ 2273.677001] ---[ end trace 0fcc5f88496b485a ]---
   [ 2294.679136] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
   [ 2294.679140] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-1): P141
   [ 2294.679144] rcu: (detected by 0, t=21002 jiffies, g=265857, q=8446)
   [ 2294.679148] kworker/1:1     D    0   141      2 0x80000000

In addition the permitted QMAP mux_id value range is extended for
compatibility with ip(8) and the rmnet driver.

Reinhard

[1]: https://portland.source.codeaurora.org/patches/quic/gobi
[2]: https://portland.source.codeaurora.org/quic/qsdk/oss/lklm/gobinet/
====================

Tested-by: Daniele Palmas <[email protected]>
Acked-by: Bjørn Mork <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qmi_wwan: extend permitted QMAP mux_id value range

Permit mux_id values up to 254 to be used in qmimux_register_device()
for compatibility with ip(8) and the rmnet driver.

Fixes: c6adf77953bc ("net: usb: qmi_wwan: add qmap mux protocol support")
Cc: Daniele Palmas <[email protected]>
Signed-off-by: Reinhard Speyerer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qmi_wwan: avoid RCU stalls on device disconnect when in QMAP mode

Switch qmimux_unregister_device() and qmi_wwan_disconnect() to
use unregister_netdevice_queue() and unregister_netdevice_many()
instead of unregister_netdevice(). This avoids RCU stalls which
have been observed on device disconnect in certain setups otherwise.

Fixes: c6adf77953bc ("net: usb: qmi_wwan: add qmap mux protocol support")
Cc: Daniele Palmas <[email protected]>
Signed-off-by: Reinhard Speyerer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qmi_wwan: add network device usage statistics for qmimux devices

Add proper network device usage statistics for qmimux devices
instead of reporting all-zero values for them.

Fixes: c6adf77953bc ("net: usb: qmi_wwan: add qmap mux protocol support")
Cc: Daniele Palmas <[email protected]>
Signed-off-by: Reinhard Speyerer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qmi_wwan: add support for QMAP padding in the RX path

The QMAP code in the qmi_wwan driver is based on the CodeAurora GobiNet
driver which does not process QMAP padding in the RX path correctly.
Add support for QMAP padding to qmimux_rx_fixup() according to the
description of the rmnet driver.

Fixes: c6adf77953bc ("net: usb: qmi_wwan: add qmap mux protocol support")
Cc: Daniele Palmas <[email protected]>
Signed-off-by: Reinhard Speyerer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fix from James Bottomley:
"A single bug fix for hpsa.

  The user visible consequences aren't clear, but the ioaccel2 raid
  acceleration may misfire on the malformed request assuming the payload
  is big enough to require chaining (more than 31 sg entries)"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: hpsa: correct ioaccel2 chaining

Merge branch 'packet-DDOS'

Eric Dumazet says:

====================
net/packet: better behavior under DDOS

Using tcpdump (or other af_packet user) on a busy host can lead to
catastrophic consequences, because suddenly, potentially all cpus
are spinning on a contended spinlock.

Both packet_rcv() and tpacket_rcv() grab the spinlock
to eventually find there is no room for an additional packet.

This patch series align packet_rcv() and tpacket_rcv() to both
check if the queue is full before grabbing the spinlock.

If the queue is full, they both increment a new atomic counter
placed on a separate cache line to let readers drain the queue faster.

There is still false sharing on this new atomic counter,
we might in the future make it per cpu if there is interest.
====================

Acked-by: Willem de Bruijn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/packet: introduce packet_rcv_try_clear_pressure() helper

There are two places where we want to clear the pressure
if possible, add a helper to make it more obvious.

Signed-off-by: Eric Dumazet <[email protected]>
Suggested-by: Willem de Bruijn <[email protected]>
Acked-by: Vinicius Costa Gomes <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/packet: remove locking from packet_rcv_has_room()

__packet_rcv_has_room() can now be run without lock being held.

po->pressure is only a non persistent hint, we can mark
all read/write accesses with READ_ONCE()/WRITE_ONCE()
to document the fact that the field could be written
without any synchronization.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/packet: implement shortcut in tpacket_rcv()

tpacket_rcv() can be hit under DDOS quite hard, since
it will always grab a socket spinlock, to eventually find
there is no room for an additional packet.

Using tcpdump [1] on a busy host can lead to catastrophic consequences,
because of all cpus spinning on a contended spinlock.

This replicates a similar strategy used in packet_rcv()

[1] Also some applications mistakenly use af_packet socket
bound to ETH_P_ALL only to send packets.
Receive queue is never drained and immediately full.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/packet: make tp_drops atomic

Under DDOS, we want to be able to increment tp_drops without
touching the spinlock. This will help readers to drain
the receive queue slightly faster :/

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/packet: constify __packet_rcv_has_room()

Goal is use the helper without lock being held.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/packet: constify prb_lookup_block() and __tpacket_v3_has_room()

Goal is to be able to use __tpacket_v3_has_room() without holding
a lock.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/packet: constify packet_lookup_frame() and __tpacket_has_room()

Goal is to be able to use __tpacket_has_room() without holding a lock.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/packet: constify __packet_get_status() argument

struct packet_sock is only read.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: Add more 1000BaseX support detection

Commit "net: phy: Add detection of 1000BaseX link mode support" added
support for not filtering out 1000BaseX mode from the PHY's supported
modes in genphy_config_init, but we have to make a similar change in
genphy_read_abilities in order to actually detect it as a supported mode
in the first place. Add this in.

Signed-off-by: Robert Hancock <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ethernet: ti: cpsw_ethtool: simplify slave loops

Only for consistency reasons, do it like in main cpsw.c module
and use ndev reference but not by means of slave.

Signed-off-by: Ivan Khoronzhuk <[email protected]>
Signed-off-by: David S. Miller <[email protected]>