Git Repo - linux.git/log

x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM

KVM has an optmization to avoid expensive MRS read/writes on
VMENTER/EXIT. It caches the MSR values and restores them either when
leaving the run loop, on preemption or when going out to user space.

The affected MSRs are not required for kernel context operations. This
changed with the recently introduced mechanism to handle FSGSBASE in the
paranoid entry code which has to retrieve the kernel GSBASE value by
accessing per CPU memory. The mechanism needs to retrieve the CPU number
and uses either LSL or RDPID if the processor supports it.

Unfortunately RDPID uses MSR_TSC_AUX which is in the list of cached and
lazily restored MSRs, which means between the point where the guest value
is written and the point of restore, MSR_TSC_AUX contains a random number.

If an NMI or any other exception which uses the paranoid entry path happens
in such a context, then RDPID returns the random guest MSR_TSC_AUX value.

As a consequence this reads from the wrong memory location to retrieve the
kernel GSBASE value. Kernel GS is used to for all regular this_cpu_*()
operations. If the GSBASE in the exception handler points to the per CPU
memory of a different CPU then this has the obvious consequences of data
corruption and crashes.

As the paranoid entry path is the only place which accesses MSR_TSX_AUX
(via RDPID) and the fallback via LSL is not significantly slower, remove
the RDPID alternative from the entry path and always use LSL.

The alternative would be to write MSR_TSC_AUX on every VMENTER and VMEXIT
which would be inflicting massive overhead on that code path.

[ tglx: Rewrote changelog ]

Fixes: eaad981291ee3 ("x86/entry/64: Introduce the FIND_PERCPU_BASE macro")
Reported-by: Tom Lendacky <[email protected]>
Debugged-by: Tom Lendacky <[email protected]>
Suggested-by: Andy Lutomirski <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

MAINTAINERS: Update Mellanox and Cumulus Network addresses to new domain

Mellanox and Cumulus Network were acquired by Nvidia, so change the
maintainers emails to new domain name.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Leon Romanovsky <[email protected]>
Acked-by: Andy Shevchenko <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>

powerpc/perf/hv-24x7: Move cpumask file to top folder of hv-24x7 driver

Commit 792f73f747b8 ("powerpc/hv-24x7: Add sysfs files inside hv-24x7
device to show cpumask") added cpumask file as part of hv-24x7 driver
inside the interface folder. The cpumask file is supposed to be in the
top folder of the PMU driver in order to make hotplug work.

This patch fixes that issue and creates new group 'cpumask_attr_group'
to add cpumask file and make sure it added in top folder.

command:# cat /sys/devices/hv_24x7/cpumask
0

Fixes: 792f73f747b8 ("powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask")
Signed-off-by: Kajol Jain <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

powerpc/32s: Fix module loading failure when VMALLOC_END is over 0xf0000000

In is_module_segment(), when VMALLOC_END is over 0xf0000000,
ALIGN(VMALLOC_END, SZ_256M) has value 0.

In that case, addr >= ALIGN(VMALLOC_END, SZ_256M) is always
true then is_module_segment() always returns false.

Use (ALIGN(VMALLOC_END, SZ_256M) - 1) which will have
value 0xffffffff and will be suitable for the comparison.

Fixes: c49643319715 ("powerpc/32s: Only leave NX unset on segments used for modules")
Reported-by: Andreas Schwab <[email protected]>
Signed-off-by: Christophe Leroy <[email protected]>
Tested-by: Andreas Schwab <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/09fc73fe9c7423c6b4cf93f93df9bb0ed8eefab5.1597994047.git.christophe.leroy@csgroup.eu

KVM: arm64: Print warning when cpu erratum can cause guests to deadlock

If guests don't have certain CPU erratum workarounds implemented, then
there is a possibility a guest can deadlock the system. IOW, only trusted
guests should be used on systems with the erratum.

This is the case for Cortex-A57 erratum 832075.

Signed-off-by: Rob Herring <[email protected]>
Acked-by: Will Deacon <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: James Morse <[email protected]>
Cc: Julien Thierry <[email protected]>
Cc: Suzuki K Poulose <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>

arm64: Allow booting of late CPUs affected by erratum 1418040

As we can now switch from a system that isn't affected by 1418040
to a system that globally is affected, let's allow affected CPUs
to come in at a later time.

Signed-off-by: Marc Zyngier <[email protected]>
Tested-by: Sai Prakash Ranjan <[email protected]>
Reviewed-by: Stephen Boyd <[email protected]>
Reviewed-by: Suzuki K Poulose <[email protected]>
Acked-by: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>

arm64: Move handling of erratum 1418040 into C code

Instead of dealing with erratum 1418040 on each entry and exit,
let's move the handling to __switch_to() instead, which has
several advantages:

- It can be applied when it matters (switching between 32 and 64
bit tasks).
- It is written in C (yay!)
- It can rely on static keys rather than alternatives

Signed-off-by: Marc Zyngier <[email protected]>
Tested-by: Sai Prakash Ranjan <[email protected]>
Reviewed-by: Stephen Boyd <[email protected]>
Acked-by: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>

kconfig: qconf: replace deprecated QString::sprintf() with QTextStream

QString::sprintf() is deprecated in the latest Qt version, and spawns
a lot of warnings:

  HOSTCXX scripts/kconfig/qconf.o
scripts/kconfig/qconf.cc: In member function ‘void ConfigInfoView::menuInfo()’:
scripts/kconfig/qconf.cc:1090:61: warning: ‘QString& QString::sprintf(const char*, ...)’ is deprecated: Use asprintf(), arg() or QTextStream instead [-Wdeprecated-declarations]
1090 |      head += QString().sprintf("<a href=\"s%s\">", sym->name);
      |                                                             ^
In file included from /usr/include/qt5/QtGui/qkeysequence.h:44,
                 from /usr/include/qt5/QtWidgets/qaction.h:44,
                 from /usr/include/qt5/QtWidgets/QAction:1,
                 from scripts/kconfig/qconf.cc:7:
/usr/include/qt5/QtCore/qstring.h:382:14: note: declared here
  382 |     QString &sprintf(const char *format, ...) Q_ATTRIBUTE_FORMAT_PRINTF(2, 3);
      |              ^~~~~~~
scripts/kconfig/qconf.cc:1099:60: warning: ‘QString& QString::sprintf(const char*, ...)’ is deprecated: Use asprintf(), arg() or QTextStream instead [-Wdeprecated-declarations]
1099 |     head += QString().sprintf("<a href=\"s%s\">", sym->name);
      |                                                            ^
In file included from /usr/include/qt5/QtGui/qkeysequence.h:44,
                 from /usr/include/qt5/QtWidgets/qaction.h:44,
                 from /usr/include/qt5/QtWidgets/QAction:1,
                 from scripts/kconfig/qconf.cc:7:
/usr/include/qt5/QtCore/qstring.h:382:14: note: declared here
  382 |     QString &sprintf(const char *format, ...) Q_ATTRIBUTE_FORMAT_PRINTF(2, 3);
      |              ^~~~~~~
scripts/kconfig/qconf.cc:1127:90: warning: ‘QString& QString::sprintf(const char*, ...)’ is deprecated: Use asprintf(), arg() or QTextStream instead [-Wdeprecated-declarations]
1127 |   debug += QString().sprintf("defined at %s:%d<br><br>", _menu->file->name, _menu->lineno);
      |                                                                                          ^
In file included from /usr/include/qt5/QtGui/qkeysequence.h:44,
                 from /usr/include/qt5/QtWidgets/qaction.h:44,
                 from /usr/include/qt5/QtWidgets/QAction:1,
                 from scripts/kconfig/qconf.cc:7:
/usr/include/qt5/QtCore/qstring.h:382:14: note: declared here
  382 |     QString &sprintf(const char *format, ...) Q_ATTRIBUTE_FORMAT_PRINTF(2, 3);
      |              ^~~~~~~
scripts/kconfig/qconf.cc: In member function ‘QString ConfigInfoView::debug_info(symbol*)’:
scripts/kconfig/qconf.cc:1150:68: warning: ‘QString& QString::sprintf(const char*, ...)’ is deprecated: Use asprintf(), arg() or QTextStream instead [-Wdeprecated-declarations]
1150 |    debug += QString().sprintf("prompt: <a href=\"m%s\">", sym->name);
      |                                                                    ^
In file included from /usr/include/qt5/QtGui/qkeysequence.h:44,
                 from /usr/include/qt5/QtWidgets/qaction.h:44,
                 from /usr/include/qt5/QtWidgets/QAction:1,
                 from scripts/kconfig/qconf.cc:7:
/usr/include/qt5/QtCore/qstring.h:382:14: note: declared here
  382 |     QString &sprintf(const char *format, ...) Q_ATTRIBUTE_FORMAT_PRINTF(2, 3);
      |              ^~~~~~~
scripts/kconfig/qconf.cc: In static member function ‘static void ConfigInfoView::expr_print_help(void*, symbol*, const char*)’:
scripts/kconfig/qconf.cc:1225:59: warning: ‘QString& QString::sprintf(const char*, ...)’ is deprecated: Use asprintf(), arg() or QTextStream instead [-Wdeprecated-declarations]
1225 |   *text += QString().sprintf("<a href=\"s%s\">", sym->name);
      |                                                           ^
In file included from /usr/include/qt5/QtGui/qkeysequence.h:44,
                 from /usr/include/qt5/QtWidgets/qaction.h:44,
                 from /usr/include/qt5/QtWidgets/QAction:1,
                 from scripts/kconfig/qconf.cc:7:
/usr/include/qt5/QtCore/qstring.h:382:14: note: declared here
  382 |     QString &sprintf(const char *format, ...) Q_ATTRIBUTE_FORMAT_PRINTF(2, 3);
      |              ^~~~~~~

The documentation also says:
"Warning: We do not recommend using QString::asprintf() in new Qt code.
Instead, consider using QTextStream or arg(), both of which support
Unicode strings seamlessly and are type-safe."

Use QTextStream as suggested.

Reported-by: Robert Crawford <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>

kconfig: qconf: remove redundant help in the info view

The same information is repeated in the info view.

Remove the second one.

Signed-off-by: Masahiro Yamada <[email protected]>

kconfig: qconf: remove qInfo() to get back Qt4 support

qconf is supposed to work with Qt4 and Qt5, but since commit
c4f7398bee9c ("kconfig: qconf: make debug links work again"),
building with Qt4 fails as follows:

  HOSTCXX scripts/kconfig/qconf.o
scripts/kconfig/qconf.cc: In member function ‘void ConfigInfoView::clicked(const QUrl&)’:
scripts/kconfig/qconf.cc:1241:3: error: ‘qInfo’ was not declared in this scope; did you mean ‘setInfo’?
1241 |   qInfo() << "Clicked link is empty";
      |   ^~~~~
      |   setInfo
scripts/kconfig/qconf.cc:1254:3: error: ‘qInfo’ was not declared in this scope; did you mean ‘setInfo’?
1254 |   qInfo() << "Clicked symbol is invalid:" << data;
      |   ^~~~~
      |   setInfo
make[1]: *** [scripts/Makefile.host:129: scripts/kconfig/qconf.o] Error 1
make: *** [Makefile:606: xconfig] Error 2

qInfo() does not exist in Qt4. In my understanding, these call-sites
should be unreachable. Perhaps, qWarning(), assertion, or something
is better, but qInfo() is not the right one to use here, I think.

Fixes: c4f7398bee9c ("kconfig: qconf: make debug links work again")
Reported-by: Ronald Warsow <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>

Merge tag 'drm-intel-fixes-2020-08-20' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes

drm/i915 fixes for v5.9-rc2:
- GVT fixes
- Fix device parameter usage for selftest mock i915 device
- Fix LPSP capability debugfs NULL dereference
- Fix buddy register pagemask table
- Fix intel_atomic_check() non-negative return value
- Fix selftests passing a random 0 into ilog2()
- Fix TGL power well enable/disable ordering
- Switch to PMU module refcounting

Signed-off-by: Dave Airlie <[email protected]>
From: Jani Nikula <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

Merge tag 'amd-drm-fixes-5.9-2020-08-20' of git://people.freedesktop.org/~agd5f/linux into drm-fixes

amd-drm-fixes-5.9-2020-08-20:

amdgpu:
- Fixes for Navy Flounder
- Misc display fixes
- RAS fix

amdkfd:
- SDMA fix for renoir

Signed-off-by: Dave Airlie <[email protected]>
From: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

tipc: call rcu_read_lock() in tipc_aead_encrypt_done()

b->media->send_msg() requires rcu_read_lock(), as we can see
elsewhere in tipc,  tipc_bearer_xmit, tipc_bearer_xmit_skb
and tipc_bearer_bc_xmit().

Syzbot has reported this issue as:

  net/tipc/bearer.c:466 suspicious rcu_dereference_check() usage!
  Workqueue: cryptd cryptd_queue_worker
  Call Trace:
   tipc_l2_send_msg+0x354/0x420 net/tipc/bearer.c:466
   tipc_aead_encrypt_done+0x204/0x3a0 net/tipc/crypto.c:761
   cryptd_aead_crypt+0xe8/0x1d0 crypto/cryptd.c:739
   cryptd_queue_worker+0x118/0x1b0 crypto/cryptd.c:181
   process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
   worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
   kthread+0x3b5/0x4a0 kernel/kthread.c:291
   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:293

So fix it by calling rcu_read_lock() in tipc_aead_encrypt_done()
for b->media->send_msg().

Fixes: fc1b6d6de220 ("tipc: introduce TIPC encryption & authentication")
Reported-by: [email protected]
Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/sched: act_ct: Fix skb double-free in tcf_ct_handle_fragments() error flow

tcf_ct_handle_fragments() shouldn't free the skb when ip_defrag() call
fails. Otherwise, we will cause a double-free bug.
In such cases, just return the error to the caller.

Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
Signed-off-by: Alaa Hleihel <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sctp: Fix negotiation of the number of data streams.

The number of output and input streams was never being reduced, eg when
processing received INIT or INIT_ACK chunks.
The effect is that DATA chunks can be sent with invalid stream ids
and then discarded by the remote system.

Fixes: 2075e50caf5ea ("sctp: convert to genradix")
Signed-off-by: David Laight <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

dt-bindings: net: renesas, ether: Improve schema validation

- Remove pinctrl consumer properties, as they are handled by core
    dt-schema,
  - Document missing properties,
  - Document missing PHY child node,
  - Add "additionalProperties: false".

Signed-off-by: Geert Uytterhoeven <[email protected]>
Reviewed-by: Rob Herring <[email protected]>
Reviewed-by: Sergei Shtylyov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

gre6: Fix reception with IP6_TNL_F_RCV_DSCP_COPY

When receiving an IPv4 packet inside an IPv6 GRE packet, and the
IP6_TNL_F_RCV_DSCP_COPY flag is set on the tunnel, the IPv4 header would
get corrupted. This is due to the common ip6_tnl_rcv() function assuming
that the inner header is always IPv6. This patch checks the tunnel
protocol for IPv4 inner packets, but still defaults to IPv6.

Fixes: 308edfdf1563 ("gre6: Cleanup GREv6 receive path, call common GRE functions")
Signed-off-by: Mark Tomlinson <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'hv_netvsc-Some-fixes-for-the-select_queue'

Haiyang Zhang says:

====================
hv_netvsc: Some fixes for the select_queue

This patch set includes two fixes for the select_queue process.
====================

Signed-off-by: David S. Miller <[email protected]>

hv_netvsc: Fix the queue_mapping in netvsc_vf_xmit()

netvsc_vf_xmit() / dev_queue_xmit() will call VF NIC’s ndo_select_queue
or netdev_pick_tx() again. They will use skb_get_rx_queue() to get the
queue number, so the “skb->queue_mapping - 1” will be used. This may
cause the last queue of VF not been used.

Use skb_record_rx_queue() here, so that the skb_get_rx_queue() called
later will get the correct queue number, and VF will be able to use
all queues.

Fixes: b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
Signed-off-by: Haiyang Zhang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

hv_netvsc: Remove "unlikely" from netvsc_select_queue

When using vf_ops->ndo_select_queue, the number of queues of VF is
usually bigger than the synthetic NIC. This condition may happen
often.
Remove "unlikely" from the comparison of ndev->real_num_tx_queues.

Fixes: b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
Signed-off-by: Haiyang Zhang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bpf: selftests: global_funcs: Check err_str before strstr

The error path in libbpf.c:load_program() has calls to pr_warn()
which ends up for global_funcs tests to
test_global_funcs.c:libbpf_debug_print().

For the tests with no struct test_def::err_str initialized with a
string, it causes call of strstr() with NULL as the second argument
and it segfaults.

Fix it by calling strstr() only for non-NULL err_str.

Signed-off-by: Yauheni Kaliuta <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

bpf: xdp: Fix XDP mode when no mode flags specified

7f0a838254bd ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device")
inadvertently changed which XDP mode is assumed when no mode flags are
specified explicitly. Previously, driver mode was preferred, if driver
supported it. If not, generic SKB mode was chosen. That commit changed default
to SKB mode always. This patch fixes the issue and restores the original
logic.

Fixes: 7f0a838254bd ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device")
Reported-by: Lorenzo Bianconi <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Tested-by: Lorenzo Bianconi <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

selftests/bpf: Remove test_align leftovers

Calling generic selftests "make install" fails as rsync expects all
files from TEST_GEN_PROGS to be present. The binary is not generated
anymore (commit 3b09d27cc93d) so we can safely remove it from there
and also from gitignore.

Fixes: 3b09d27cc93d ("selftests/bpf: Move test_align under test_progs")
Signed-off-by: Veronika Kabatova <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

tools/resolve_btfids: Fix sections with wrong alignment

The data of compressed section should be aligned to 4
(for 32bit) or 8 (for 64 bit) bytes.

The binutils ld sets sh_addralign to 1, which makes libelf
fail with misaligned section error during the update as
reported by Jesper:

   FAILED elf_update(WRITE): invalid section alignment

While waiting for ld fix, we can fix compressed sections
sh_addralign value manually.

Adding warning in -vv mode when the fix is triggered:

  $ ./tools/bpf/resolve_btfids/resolve_btfids -vv vmlinux
  ...
  section(36) .comment, size 44, link 0, flags 30, type=1
  section(37) .debug_aranges, size 45684, link 0, flags 800, type=1
   - fixing wrong alignment sh_addralign 16, expected 8
  section(38) .debug_info, size 129104957, link 0, flags 800, type=1
   - fixing wrong alignment sh_addralign 1, expected 8
  section(39) .debug_abbrev, size 1152583, link 0, flags 800, type=1
   - fixing wrong alignment sh_addralign 1, expected 8
  section(40) .debug_line, size 7374522, link 0, flags 800, type=1
   - fixing wrong alignment sh_addralign 1, expected 8
  section(41) .debug_frame, size 702463, link 0, flags 800, type=1
  section(42) .debug_str, size 1017571, link 0, flags 830, type=1
   - fixing wrong alignment sh_addralign 1, expected 8
  section(43) .debug_loc, size 3019453, link 0, flags 800, type=1
   - fixing wrong alignment sh_addralign 1, expected 8
  section(44) .debug_ranges, size 1744583, link 0, flags 800, type=1
   - fixing wrong alignment sh_addralign 16, expected 8
  section(45) .symtab, size 2955888, link 46, flags 0, type=2
  section(46) .strtab, size 2613072, link 0, flags 0, type=3
  ...
  update ok for vmlinux

Another workaround is to disable compressed debug info data
CONFIG_DEBUG_INFO_COMPRESSED kernel option.

Fixes: fbbb68de80a4 ("bpf: Add resolve_btfids tool to resolve BTF IDs in ELF object")
Reported-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Cc: Mark Wielaard <[email protected]>
Cc: Nick Clifton <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

Merge tag 'pci-v5.9-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI fix from Bjorn Helgaas:
"Fix P2PDMA build issue (Christoph Hellwig)"

* tag 'pci-v5.9-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI/P2PDMA: Fix build without DMA ops

net/smc: Prevent kernel-infoleak in __smc_diag_dump()

__smc_diag_dump() is potentially copying uninitialized kernel stack memory
into socket buffers, since the compiler may leave a 4-byte hole near the
beginning of `struct smcd_diag_dmbinfo`. Fix it by initializing `dinfo`
with memset().

Fixes: 4b1b7d3b30a6 ("net/smc: add SMC-D diag support")
Suggested-by: Dan Carpenter <[email protected]>
Signed-off-by: Peilin Ye <[email protected]>
Signed-off-by: Ursula Braun <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sfc: fix build warnings on 32-bit

Truncation of DMA_BIT_MASK to 32-bit dma_addr_t is semantically safe,
but the compiler was warning because it was happening implicitly.
Insert explicit casts to suppress the warnings.

Reported-by: Randy Dunlap <[email protected]>
Signed-off-by: Edward Cree <[email protected]>
Acked-by: Randy Dunlap <[email protected]> # build-tested
Signed-off-by: David S. Miller <[email protected]>

net: phy: mscc: Fix a couple of spelling mistakes "spcified" -> "specified"

There are a couple of spelling mistakes in comment text. Fix these.

Signed-off-by: Kaige Li <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

riscv: Add SiFive drivers to rv32_defconfig

This adds SiFive drivers to rv32_defconfig, to keep in sync with the
64-bit config. This is useful when testing 32-bit kernel with QEMU
'sifive_u' 32-bit machine.

Signed-off-by: Bin Meng <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
Reviewed-by: Alistair Francis <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

dt-bindings: timer: Add CLINT bindings

We add DT bindings documentation for CLINT device.

Signed-off-by: Anup Patel <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
Tested-by: Emil Renner Berhing <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
Reviewed-by: Rob Herring <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

RISC-V: Remove CLINT related code from timer and arch

Right now the RISC-V timer driver is convoluted to support:
1. Linux RISC-V S-mode (with MMU) where it will use TIME CSR for
   clocksource and SBI timer calls for clockevent device.
2. Linux RISC-V M-mode (without MMU) where it will use CLINT MMIO
   counter register for clocksource and CLINT MMIO compare register
   for clockevent device.

We now have a separate CLINT timer driver which also provide CLINT
based IPI operations so let's remove CLINT MMIO related code from
arch/riscv directory and RISC-V timer driver.

Signed-off-by: Anup Patel <[email protected]>
Tested-by: Emil Renner Berhing <[email protected]>
Acked-by: Daniel Lezcano <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

clocksource/drivers: Add CLINT timer driver

We add a separate CLINT timer driver for Linux RISC-V M-mode (i.e.
RISC-V NoMMU kernel).

The CLINT MMIO device provides three things:
1. 64bit free running counter register
2. 64bit per-CPU time compare registers
3. 32bit per-CPU inter-processor interrupt registers

Unlike other timer devices, CLINT provides IPI registers along with
timer registers. To use CLINT IPI registers, the CLINT timer driver
provides IPI related callbacks to arch/riscv.

Signed-off-by: Anup Patel <[email protected]>
Tested-by: Emil Renner Berhing <[email protected]>
Acked-by: Daniel Lezcano <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

RISC-V: Add mechanism to provide custom IPI operations

We add mechanism to set custom IPI operations so that CLINT driver
from drivers directory can provide custom IPI operations.

Signed-off-by: Anup Patel <[email protected]>
Tested-by: Emil Renner Berhing <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

Merge tag 'dma-mapping-5.9-1' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:
"Fix more fallout from the dma-pool changes (Nicolas Saenz Julienne,
  me)"

* tag 'dma-mapping-5.9-1' of git://git.infradead.org/users/hch/dma-mapping:
  dma-pool: Only allocate from CMA when in same memory zone
  dma-pool: fix coherent pool allocations for IOMMU mappings

afs: Fix key ref leak in afs_put_operation()

The afs_put_operation() function needs to put the reference to the key
that's authenticating the operation.

Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept")
Reported-by: Dave Botsch <[email protected]>
Signed-off-by: David Howells <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge branch 'opp/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm

Pull operating performance points (OPP) framework fixes for 5.9-rc2
from Viresh Kumar:

"This contains the following fixes for 5.9:

- Fix re-enabling of resources (Rajendra Nayak).

- Put OPP table references (Stephen Boyd)."

* 'opp/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
  opp: Enable resources again if they were disabled earlier
  opp: Put opp table in dev_pm_opp_set_rate() if _set_opp_bw() fails
  opp: Put opp table in dev_pm_opp_set_rate() for empty tables

libbpf: Fix map index used in error message

The error message emitted by bpf_object__init_user_btf_maps() was using the
wrong section ID.

Signed-off-by: Toke Høiland-Jørgensen <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

powerpc/pseries: Do not initiate shutdown when system is running on UPS

As per PAPR we have to look for both EPOW sensor value and event
modifier to identify the type of event and take appropriate action.

In LoPAPR v1.1 section 10.2.2 includes table 136 "EPOW Action Codes":

  SYSTEM_SHUTDOWN 3

  The system must be shut down. An EPOW-aware OS logs the EPOW error
  log information, then schedules the system to be shut down to begin
  after an OS defined delay internal (default is 10 minutes.)

Then in section 10.3.2.2.8 there is table 146 "Platform Event Log
Format, Version 6, EPOW Section", which includes the "EPOW Event
Modifier":

  For EPOW sensor value = 3
  0x01 = Normal system shutdown with no additional delay
  0x02 = Loss of utility power, system is running on UPS/Battery
  0x03 = Loss of system critical functions, system should be shutdown
  0x04 = Ambient temperature too high
  All other values = reserved

We have a user space tool (rtas_errd) on LPAR to monitor for
EPOW_SHUTDOWN_ON_UPS. Once it gets an event it initiates shutdown
after predefined time. It also starts monitoring for any new EPOW
events. If it receives "Power restored" event before predefined time
it will cancel the shutdown. Otherwise after predefined time it will
shutdown the system.

Commit 79872e35469b ("powerpc/pseries: All events of
EPOW_SYSTEM_SHUTDOWN must initiate shutdown") changed our handling of
the "on UPS/Battery" case, to immediately shutdown the system. This
breaks existing setups that rely on the userspace tool to delay
shutdown and let the system run on the UPS.

Fixes: 79872e35469b ("powerpc/pseries: All events of EPOW_SYSTEM_SHUTDOWN must initiate shutdown")
Cc: [email protected] # v4.0+
Signed-off-by: Vasant Hegde <[email protected]>
[mpe: Massage change log and add PAPR references]
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

netfilter: conntrack: allow sctp hearbeat after connection re-use

If an sctp connection gets re-used, heartbeats are flagged as invalid
because their vtag doesn't match.

Handle this in a similar way as TCP conntrack when it suspects that the
endpoints and conntrack are out-of-sync.

When a HEARTBEAT request fails its vtag validation, flag this in the
conntrack state and accept the packet.

When a HEARTBEAT_ACK is received with an invalid vtag in the reverse
direction after we allowed such a HEARTBEAT through, assume we are
out-of-sync and re-set the vtag info.

v2: remove left-over snippet from an older incarnation that moved
new_state/old_state assignments, thats not needed so keep that
as-is.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

io_uring: kill extra iovec=NULL in import_iovec()

If io_import_iovec() returns an error, return iovec is undefined and
must not be used, so don't set it to NULL when failing.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

io_uring: comment on kfree(iovec) checks

kfree() handles NULL pointers well, but io_{read,write}() checks it
because of performance reasons. Leave a comment there for those who are
tempted to patch it.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

io_uring: fix racy req->flags modification

Setting and clearing REQ_F_OVERFLOW in io_uring_cancel_files() and
io_cqring_overflow_flush() are racy, because they might be called
asynchronously.

REQ_F_OVERFLOW flag in only needed for files cancellation, so if it can
be guaranteed that requests _currently_ marked inflight can't be
overflown, the problem will be solved with removing the flag
altogether.

That's how the patch works, it removes inflight status of a request
in io_cqring_fill_event() whenever it should be thrown into CQ-overflow
list. That's Ok to do, because no opcode specific handling can be done
after io_cqring_fill_event(), the same assumption as with "struct
io_completion" patches.
And it already have a good place for such cleanups, which is
io_clean_op(). A nice side effect of this is removing this inflight
check from the hot path.

note on synchronisation: now __io_cqring_fill_event() may be taking two
spinlocks simultaneously, completion_lock and inflight_lock. It's fine,
because we never do that in reverse order, and CQ-overflow of inflight
requests shouldn't happen often.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

Revert "RDMA/hns: Reserve one sge in order to avoid local length error"

This patch caused some issues on SEND operation, and it should be reverted
to make the drivers work correctly. There will be a better solution that
has been tested carefully to solve the original problem.

This reverts commit 711195e57d341e58133d92cf8aaab1db24e4768d.

Fixes: 711195e57d34 ("RDMA/hns: Reserve one sge in order to avoid local length error")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Weihang Li <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>

RDMA/hfi1: Correct an interlock issue for TID RDMA WRITE request

The following message occurs when running an AI application with TID RDMA
enabled:

hfi1 0000:7f:00.0: hfi1_0: [QP74] hfi1_tid_timeout 4084
hfi1 0000:7f:00.0: hfi1_0: [QP70] hfi1_tid_timeout 4084

The issue happens when TID RDMA WRITE request is followed by an
IB_WR_RDMA_WRITE_WITH_IMM request, the latter could be completed first on
the responder side. As a result, no ACK packet for the latter could be
sent because the TID RDMA WRITE request is still being processed on the
responder side.

When the TID RDMA WRITE request is eventually completed, the requester
will wait for the IB_WR_RDMA_WRITE_WITH_IMM request to be acknowledged.

If the next request is another TID RDMA WRITE request, no TID RDMA WRITE
DATA packet could be sent because the preceding IB_WR_RDMA_WRITE_WITH_IMM
request is not completed yet.

Consequently the IB_WR_RDMA_WRITE_WITH_IMM will be retried but it will be
ignored on the responder side because the responder thinks it has already
been completed. Eventually the retry will be exhausted and the qp will be
put into error state on the requester side. On the responder side, the TID
resource timer will eventually expire because no TID RDMA WRITE DATA
packets will be received for the second TID RDMA WRITE request. There is
also risk of a write-after-write memory corruption due to the issue.

Fix by adding a requester side interlock to prevent any potential data
corruption and TID RDMA protocol error.

Fixes: a0b34f75ec20 ("IB/hfi1: Add interlock between a TID RDMA request and other requests")
Link: https://lore.kernel.org/r/[email protected]
Cc: <[email protected]> # 5.4.x+
Reviewed-by: Mike Marciniszyn <[email protected]>
Reviewed-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Kaike Wan <[email protected]>
Signed-off-by: Mike Marciniszyn <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>

RDMA/bnxt_re: Do not add user qps to flushlist

Driver shall add only the kernel qps to the flush list for clean up.
During async error events from the HW, driver is adding qps to this list
without checking if the qp is kernel qp or not.

Add a check to avoid user qp addition to the flush list.

Fixes: 942c9b6ca8de ("RDMA/bnxt_re: Avoid Hard lockup during error CQE processing")
Fixes: c50866e2853a ("bnxt_re: fix the regression due to changes in alloc_pbl")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Selvin Xavier <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>

RDMA/core: Fix spelling mistake "Could't" -> "Couldn't"

There is a spelling mistake in a pr_warn message. Fix it.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>

powerpc/perf: Fix soft lockups due to missed interrupt accounting

Performance monitor interrupt handler checks if any counter has
overflown and calls record_and_restart() in core-book3s which invokes
perf_event_overflow() to record the sample information. Apart from
creating sample, perf_event_overflow() also does the interrupt and
period checks via perf_event_account_interrupt().

Currently we record information only if the SIAR (Sampled Instruction
Address Register) valid bit is set (using siar_valid() check) and
hence the interrupt check.

But it is possible that we do sampling for some events that are not
generating valid SIAR, and hence there is no chance to disable the
event if interrupts are more than max_samples_per_tick. This leads to
soft lockup.

Fix this by adding perf_event_account_interrupt() in the invalid SIAR
code path for a sampling event. ie if SIAR is invalid, just do
interrupt check and don't record the sample information.

Reported-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Athira Rajeev <[email protected]>
Tested-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Documentation: efi: remove description of efi=old_map

The old EFI runtime region mapping logic that was kept around for some
time has finally been removed entirely, along with the SGI UV1 support
code that was its last remaining user. So remove any mention of the
efi=old_map command line parameter from the docs.

Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Signed-off-by: Ard Biesheuvel <[email protected]>

efi/x86: Move 32-bit code into efi_32.c

Now that the old memmap code has been removed, some code that was left
behind in arch/x86/platform/efi/efi.c is only used for 32-bit builds,
which means it can live in efi_32.c as well. So move it over.

Signed-off-by: Ard Biesheuvel <[email protected]>

efi/libstub: Handle unterminated cmdline

Make the command line parsing more robust, by handling the case it is
not NUL-terminated.

Use strnlen instead of strlen, and make sure that the temporary copy is
NUL-terminated before parsing.

Cc: <[email protected]>
Signed-off-by: Arvind Sankar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ard Biesheuvel <[email protected]>

efi/libstub: Handle NULL cmdline

Treat a NULL cmdline the same as empty. Although this is unlikely to
happen in practice, the x86 kernel entry does check for NULL cmdline and
handles it, so do it here as well.

Cc: <[email protected]>
Signed-off-by: Arvind Sankar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ard Biesheuvel <[email protected]>

efi/libstub: Stop parsing arguments at "--"

Arguments after "--" are arguments for init, not for the kernel.

Cc: <[email protected]>
Signed-off-by: Arvind Sankar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ard Biesheuvel <[email protected]>

efi: add missed destroy_workqueue when efisubsys_init fails

destroy_workqueue() should be called to destroy efi_rts_wq
when efisubsys_init() init resources fails.

Cc: <[email protected]>
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Li Heng <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ard Biesheuvel <[email protected]>

efi/x86: Mark kernel rodata non-executable for mixed mode

When remapping the kernel rodata section RO in the EFI pagetables, the
protection flags that were used for the text section are being reused,
but the rodata section should not be marked executable.

Cc: <[email protected]>
Signed-off-by: Arvind Sankar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ard Biesheuvel <[email protected]>

opp: Enable resources again if they were disabled earlier

dev_pm_opp_set_rate() can now be called with freq = 0 in order
to either drop performance or bandwidth votes or to disable
regulators on platforms which support them.

In such cases, a subsequent call to dev_pm_opp_set_rate() with
the same frequency ends up returning early because 'old_freq == freq'

Instead make it fall through and put back the dropped performance
and bandwidth votes and/or enable back the regulators.

Cc: v5.3+ <[email protected]> # v5.3+
Fixes: cd7ea582866f ("opp: Make dev_pm_opp_set_rate() handle freq = 0 to drop performance votes")
Reported-by: Sajida Bhanu <[email protected]>
Reviewed-by: Sibi Sankar <[email protected]>
Reported-by: Matthias Kaehlcke <[email protected]>
Tested-by: Matthias Kaehlcke <[email protected]>
Reviewed-by: Stephen Boyd <[email protected]>
Signed-off-by: Rajendra Nayak <[email protected]>
[ Viresh: Don't skip clk_set_rate() and massaged changelog ]
Signed-off-by: Viresh Kumar <[email protected]>

Fix build error when CONFIG_ACPI is not set/enabled:

../arch/x86/pci/xen.c: In function ‘pci_xen_init’:
../arch/x86/pci/xen.c:410:2: error: implicit declaration of function ‘acpi_noirq_set’; did you mean ‘acpi_irq_get’? [-Werror=implicit-function-declaration]
acpi_noirq_set();

Fixes: 88e9ca161c13 ("xen/pci: Use acpi_noirq_set() helper to avoid #ifdef")
Signed-off-by: Randy Dunlap <[email protected]>
Reviewed-by: Juergen Gross <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Juergen Gross <[email protected]>

efi: avoid error message when booting under Xen

efifb_probe() will issue an error message in case the kernel is booted
as Xen dom0 from UEFI as EFI_MEMMAP won't be set in this case. Avoid
that message by calling efi_mem_desc_lookup() only if EFI_MEMMAP is set.

Fixes: 38ac0287b7f4 ("fbdev/efifb: Honour UEFI memory map attributes when mapping the FB")
Signed-off-by: Juergen Gross <[email protected]>
Acked-by: Ard Biesheuvel <[email protected]>
Acked-by: Bartlomiej Zolnierkiewicz <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>

Merge tag 'vfio-v5.9-rc2' of git://github.com/awilliam/linux-vfio

Pull VFIO fixes from Alex Williamson:

- Fix lockdep issue reported for recursive read-lock (Alex Williamson)

- Fix missing unwind in type1 replay function (Alex Williamson)

* tag 'vfio-v5.9-rc2' of git://github.com/awilliam/linux-vfio:
vfio/type1: Add proper error unwind for vfio_iommu_replay()
vfio-pci: Avoid recursive read-lock usage

Revert "drm/amdgpu: disable gfxoff for navy_flounder"

This reverts commit 9c9b17a7d19a8e21db2e378784fff1128b46c9d3.
Newly released sdma fw (51.52) provides a fix for the issue.

Signed-off-by: Jiansong Chen <[email protected]>
Reviewed-by: Kenneth Feng <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

powerpc/powernv/pci: Fix possible crash when releasing DMA resources

Fix a typo introduced during recent code cleanup, which could lead to
silently not freeing resources or an oops message (on PCI hotplug or
CAPI reset).

Only impacts ioda2, the code path for ioda1 is correct.

Fixes: 01e12629af4e ("powerpc/powernv/pci: Add explicit tracking of the DMA setup state")
Signed-off-by: Frederic Barrat <[email protected]>
Reviewed-by: Oliver O'Halloran <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

net: gemini: Fix missing free_netdev() in error path of gemini_ethernet_port_probe()

Replace alloc_etherdev_mq with devm_alloc_etherdev_mqs. In this way,
when probe fails, netdev can be freed automatically.

Fixes: 4d5ae32f5e1e ("net: ethernet: Add a driver for Gemini gigabit ethernet")
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Wang Hai <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: atlantic: Use readx_poll_timeout() for large timeout

Commit
8dcf2ad39fdb2 ("net: atlantic: add hwmon getter for MAC temperature")

implemented a read callback with an udelay(10000U). This fails to
compile on ARM because the delay is >1ms. I doubt that it is needed to
spin for 10ms even if possible on x86.

>From looking at the code, the context appears to be preemptible so using
usleep() should work and avoid busy spinning.

Use readx_poll_timeout() in the poll loop.

Fixes: 8dcf2ad39fdb2 ("net: atlantic: add hwmon getter for MAC temperature")
Cc: Mark Starovoytov <[email protected]>
Cc: Igor Russkikh <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Guenter Roeck <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ptp: ptp_clockmatrix: use i2c_master_send for i2c write

The old code for i2c write would break on some controllers, which fails
at handling Repeated Start Condition. So we will just use i2c_master_send
to handle write in one transanction.

Changes since v1:
- Remove indentation change

Signed-off-by: Min Li <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

netlink: fix state reallocation in policy export

Evidently, when I did this previously, we didn't have more than
10 policies and didn't run into the reallocation path, because
it's missing a memset() for the unused policies. Fix that.

Fixes: d07dcf9aadd6 ("netlink: add infrastructure to expose policies to userspace")
Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'Bug-fixes-for-ENA-ethernet-driver'

Shay Agroskin says:

====================
Bug fixes for ENA ethernet driver

This series adds the following:
- Fix undesired call to ena_restore after returning from suspend
- Fix condition inside a WARN_ON
- Fix overriding previous value when updating missed_tx statistic

v1->v2:
- fix bug when calling reset routine after device resources are freed (Jakub)

v2->v3:
- fix wrong hash in 'Fixes' tag
====================

Signed-off-by: David S. Miller <[email protected]>

net: ena: Make missed_tx stat incremental

Most statistics in ena driver are incremented, meaning that a stat's
value is a sum of all increases done to it since driver/queue
initialization.

This patch makes all statistics this way, effectively making missed_tx
statistic incremental.
Also added a comment regarding rx_drops and tx_drops to make it
clearer how these counters are calculated.

Fixes: 11095fdb712b ("net: ena: add statistics for missed tx packets")
Signed-off-by: Shay Agroskin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ena: Change WARN_ON expression in ena_del_napi_in_range()

The ena_del_napi_in_range() function unregisters the napi handler for
rings in a given range.
This function had the following WARN_ON macro:

    WARN_ON(ENA_IS_XDP_INDEX(adapter, i) &&
    adapter->ena_napi[i].xdp_ring);

This macro prints the call stack if the expression inside of it is
true [1], but the expression inside of it is the wanted situation.
The expression checks whether the ring has an XDP queue and its index
corresponds to a XDP one.

This patch changes the expression to
    !ENA_IS_XDP_INDEX(adapter, i) && adapter->ena_napi[i].xdp_ring
which indicates an unwanted situation.

Also, change the structure of the function. The napi handler is
unregistered for all rings, and so there's no need to check whether the
index is an XDP index or not. By removing this check the code becomes
much more readable.

Fixes: 548c4940b9f1 ("net: ena: Implement XDP_TX action")
Signed-off-by: Shay Agroskin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ena: Prevent reset after device destruction

The reset work is scheduled by the timer routine whenever it
detects that a device reset is required (e.g. when a keep_alive signal
is missing).
When releasing device resources in ena_destroy_device() the driver
cancels the scheduling of the timer routine without destroying the reset
work explicitly.

This creates the following bug:
    The driver is suspended and the ena_suspend() function is called
-> This function calls ena_destroy_device() to free the net device
   resources
    -> The driver waits for the timer routine to finish
    its execution and then cancels it, thus preventing from it
    to be called again.

    If, in its final execution, the timer routine schedules a reset,
    the reset routine might be called afterwards,and a redundant call to
    ena_restore_device() would be made.

By changing the reset routine we allow it to read the device's state
accurately.
This is achieved by checking whether ENA_FLAG_TRIGGER_RESET flag is set
before resetting the device and making both the destruction function and
the flag check are under rtnl lock.
The ENA_FLAG_TRIGGER_RESET is cleared at the end of the destruction
routine. Also surround the flag check with 'likely' because
we expect that the reset routine would be called only when
ENA_FLAG_TRIGGER_RESET flag is set.

The destruction of the timer and reset services in __ena_shutoff() have to
stay, even though the timer routine is destroyed in ena_destroy_device().
This is to avoid a case in which the reset routine is scheduled after
free_netdev() in __ena_shutoff(), which would create an access to freed
memory in adapter->flags.

Fixes: 8c5c7abdeb2d ("net: ena: add power management ops to the ENA driver")
Signed-off-by: Shay Agroskin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

of: address: Work around missing device_type property in pcie nodes

Recent changes to the DT PCI bus parsing made it mandatory for
device tree nodes describing a PCI controller to have the
'device_type = "pci"' property for the node to be matched.

Although this follows the letter of the specification, it
breaks existing device-trees that have been working fine
for years. Rockchip rk3399-based systems are a prime example
of such collateral damage, and have stopped discovering their
PCI bus.

In order to paper over it, let's add a workaround to the code
matching the device type, and accept as PCI any node that is
named "pcie",

A warning will hopefully nudge the user into updating their
DT to a fixed version if they can, but the incentive is
obviously pretty small.

Fixes: 2f96593ecc37 ("of_address: Add bus type match for pci ranges parser")
Suggested-by: Rob Herring <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Rob Herring <[email protected]>

dt: writing-schema: Miscellaneous grammar fixes

- Add missing verb,
- Fix accidental plural.

Signed-off-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

lib/string.c: Use freestanding environment

gcc can transform the loop in a naive implementation of memset/memcpy
etc into a call to the function itself. This optimization is enabled by
-ftree-loop-distribute-patterns.

This has been the case for a while, but gcc-10.x enables this option at
-O2 rather than -O3 as in previous versions.

Add -ffreestanding, which implicitly disables this optimization with
gcc. It is unclear whether clang performs such optimizations, but
hopefully it will also not do so in a freestanding environment.

Signed-off-by: Arvind Sankar <[email protected]>
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888
Signed-off-by: Linus Torvalds <[email protected]>

x86/boot/compressed: Use builtin mem functions for decompressor

Since commits

  c041b5ad8640 ("x86, boot: Create a separate string.h file to provide standard string functions")
  fb4cac573ef6 ("x86, boot: Move memcmp() into string.h and string.c")

the decompressor stub has been using the compiler's builtin memcpy,
memset and memcmp functions, _except_ where it would likely have the
largest impact, in the decompression code itself.

Remove the #undef's of memcpy and memset in misc.c so that the
decompressor code also uses the compiler builtins.

The rationale given in the comment doesn't really apply: just because
some functions use the out-of-line version is no reason to not use the
builtin version in the rest.

Replace the comment with an explanation of why memzero and memmove are
being #define'd.

Drop the suggestion to #undef in boot/string.h as well: the out-of-line
versions are not really optimized versions, they're generic code that's
good enough for the preboot environment. The compiler will likely
generate better code for constant-size memcpy/memset/memcmp if it is
allowed to.

Most decompressors' performance is unchanged, with the exception of LZ4
and 64-bit ZSTD.

Before After ARCH
LZ4   73ms 10ms   32
LZ4 120ms 10ms 64
ZSTD   90ms 74ms 64

Measurements on QEMU on 2.2GHz Broadwell Xeon, using defconfig kernels.

Decompressor code size has small differences, with the largest being
that 64-bit ZSTD decreases just over 2k. The largest code size increase
was on 64-bit XZ, of about 400 bytes.

Signed-off-by: Arvind Sankar <[email protected]>
Suggested-by: Nick Terrell <[email protected]>
Tested-by: Nick Terrell <[email protected]>
Acked-by: Kees Cook <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

io_uring: use system_unbound_wq for ring exit work

We currently use system_wq, which is unbounded in terms of number of
workers. This means that if we're exiting tons of rings at the same
time, then we'll briefly spawn tons of event kworkers just for a very
short blocking time as the rings exit.

Use system_unbound_wq instead, which has a sane cap on the concurrency
level.

Signed-off-by: Jens Axboe <[email protected]>

ext4: limit the length of per-inode prealloc list

In the scenario of writing sparse files, the per-inode prealloc list may
be very long, resulting in high overhead for ext4_mb_use_preallocated().
To circumvent this problem, we limit the maximum length of per-inode
prealloc list to 512 and allow users to modify it.

After patching, we observed that the sys ratio of cpu has dropped, and
the system throughput has increased significantly. We created a process
to write the sparse file, and the running time of the process on the
fixed kernel was significantly reduced, as follows:

Running time on unfixed kernel：
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real    0m2.051s
user    0m0.008s
sys     0m2.026s

Running time on fixed kernel：
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real    0m0.471s
user    0m0.004s
sys     0m0.395s

Signed-off-by: Chunguang Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>

ext4: reorganize if statement of ext4_mb_release_context()

Reorganize the if statement of ext4_mb_release_context(), make it
easier to read.

Signed-off-by: Chunguang Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Ritesh Harjani <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>

ext4: add mb_debug logging when there are lost chunks

Lost chunks are when some other process raced with the current thread
to grab a particular block allocation. Add mb_debug log for
developers who wants to see how often this is happening for a
particular workload.

Signed-off-by: Chunguang Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>

ext4: Fix comment typo "the the".

I have found double typed comments "the the". So i modified it to
one "the"

Signed-off-by: kyoungho koo <[email protected]>
Link: https://lore.kernel.org/r/20200424171620.GA11943@koo-Z370-HD3
Signed-off-by: Theodore Ts'o <[email protected]>

jbd2: clean up checksum verification in do_one_pass()

Remove the unnecessary chksum_err and checksum_seen variables as well as
some redundant code to make the function easier to understand.

[ With changes suggested by jack@ and tytso@ ]

Signed-off-by: Shijie Luo <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>

ALSA: hda: avoid reset of sdo_limit

By default 'sdo_limit' is initialized with a default value of '8'
as per spec. This is overridden in cases where a different value is
required. However this is getting reset when snd_hdac_bus_init_chip()
is called again, which happens during runtime PM cycle.

Avoid this reset by moving 'sdo_limit' setup to 'snd_hdac_bus_init()'
function which would be called only once.

Fixes: 67ae482a59e9 ("ALSA: hda: add member to store ratio for stripe control")
Cc: <[email protected]>
Signed-off-by: Sameer Pujar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>

drm/i915/tgl: Make sure TC-cold is blocked before enabling TC AUX power wells

The dependency between power wells is determined by the ordering of the
power well list: when enabling the power wells for a domain, this
happens walking the power well list forward, while disabling them
happens in the reverse direction. Accordingly a power well on the list
must follow any other power well it depends on.

Since the TC AUX power wells depend on TC-cold being blocked, move the
TC-cold off power well before all AUX power wells.

Fixes: 3c02934b24e3 ("drm/i915/tc/tgl: Implement TC cold sequences")
Cc: José Roberto de Souza <[email protected]>
Signed-off-by: Imre Deak <[email protected]>
Reviewed-by: José Roberto de Souza <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Rodrigo Vivi <[email protected]>
(cherry picked from commit b302a2e68807604af2a5015816c1d117747989b6)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915/selftests: Avoid passing a random 0 into ilog2

igt_mm_config() calls ilog2() on the (pseudo)random 21-bit number
s>>12. Once in 2 million seeds, this is zero and ilog2 summons
the nasal demons.

There was an attempt to handle this case with a max(), but that's
too late; ms could already be something bizarre.

Given that the low 12 bits of s and ms are always zero, it's a lot
simpler just to divide them by 4096, then everything fits into 32
bits, and we can easily generate a random number 1 <= s <= 0x1fffff.

Fixes: 14d1b9a6247c ("drm/i915: buddy allocator")
Signed-off-by: George Spelvin <[email protected]>
Cc: Matthew Auld <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: [email protected]
Reviewed-by: Matthew Auld <[email protected]>
Signed-off-by: Chris Wilson <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Rodrigo Vivi <[email protected]>
(cherry picked from commit 21118e8e56479ef33460fbd63a5ad0535843b666)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915: Fix wrong return value in intel_atomic_check()

In the case of calling check_digital_port_conflicts() failed, a
negative error code -EINVAL should be returned.

Fixes: bf5da83e4bd80 ("drm/i915: Move check_digital_port_conflicts() earier")
Cc: Ville Syrjälä <[email protected]>
Signed-off-by: Tianjia Zhang <[email protected]>
Reviewed-by: José Roberto de Souza <[email protected]>
Signed-off-by: José Roberto de Souza <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Rodrigo Vivi <[email protected]>
(cherry picked from commit 66b51b801d05ee54a0f23628cb8220189adb715e)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915: Update bw_buddy pagemask table

A recent bspec update removed the LPDDR4 single channel entry from the
buddy register table, but added a new four-channel entry.

Workaround 1409767108 hasn't been updated with any guidance for four
channel configurations, so we leave that alternate table unchanged for
now.

Bspec 49218
Fixes: 3fa01d642fa7 ("drm/i915/tgl: Program BW_BUDDY registers during display init")
Signed-off-by: Matt Roper <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Reviewed-by: Lucas De Marchi <[email protected]>
Signed-off-by: Rodrigo Vivi <[email protected]>
(cherry picked from commit ecb40d0826fda213ebb58d49e7d5b4752480e130)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915/display: Check for an LPSP encoder before dereferencing

Avoid a GPF at

<1>[   20.177320] BUG: kernel NULL pointer dereference, address: 000000000000007c
<1>[   20.177322] #PF: supervisor read access in kernel mode
<1>[   20.177323] #PF: error_code(0x0000) - not-present page
<6>[   20.177324] PGD 0 P4D 0
<4>[   20.177327] Oops: 0000 [#1] PREEMPT SMP PTI
<4>[   20.177328] CPU: 1 PID: 944 Comm: debugfs_test Not tainted 5.8.0-rc7-CI-CI_DRM_8814+ #1
<4>[   20.177330] Hardware name: Dell Inc. XPS 13 9360/0823VW, BIOS 2.9.0 07/09/2018
<4>[   20.177372] RIP: 0010:i915_lpsp_capability_show+0x44/0xc0 [i915]
<4>[   20.177374] Code: 0f b6 81 ca 0d 00 00 3c 0b 74 77 76 19 3c 0c 75 44 83 7e 7c 01 7e 2f 48 c7 c6 d7 b9 47 a0 e8 43 df 06 e1 31 c0 c3 3c 09 72 2b <8b> 46 7c 85 c0 75 e6 8b 82 e4 00 00 00 89 c2 83 e2 fb 83 fa 0a 74
<4>[   20.177376] RSP: 0018:ffffc90000cebe38 EFLAGS: 00010246
<4>[   20.177377] RAX: 0000000000000009 RBX: ffff888267fe6a58 RCX: ffff888252d10000
<4>[   20.177378] RDX: ffff88824a9a4000 RSI: 0000000000000000 RDI: ffff888267fe6a30
<4>[   20.177379] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
<4>[   20.177380] R10: 0000000000000001 R11: 0000000000000000 R12: ffffc90000cebf08
<4>[   20.177381] R13: 00000000ffffffff R14: 0000000000000001 R15: ffff888267fe6a30
<4>[   20.177383] FS:  00007f6f9c6b5e40(0000) GS:ffff888276480000(0000) knlGS:0000000000000000
<4>[   20.177384] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   20.177385] CR2: 000000000000007c CR3: 0000000255f04006 CR4: 00000000003606e0
<4>[   20.177386] Call Trace:
<4>[   20.177390]  seq_read+0xcb/0x420

which is presumably from having no encoder attached at that time.

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2175
Fixes: 8806211fe7b3 ("drm/i915: Add i915_lpsp_capability debugfs")
Signed-off-by: Chris Wilson <[email protected]>
Cc: Animesh Manna <[email protected]>
Cc: Anshuman Gupta <[email protected]>
Cc: Uma Shankar <[email protected]>
Cc: "Ville Syrjälä" <[email protected]>
Reviewed-by: Anshuman Gupta <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Rodrigo Vivi <[email protected]>
(cherry picked from commit a22b1a9bb0d72a58d5b836653f28d97ee8fea1c4)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915: Copy default modparams to mock i915_device

Since we use the module parameters stored inside the drm_i915_device
itself, we need to ensure the mock i915_device also sets up the right
defaults.

Fixes: 8a25c4be583d ("drm/i915/params: switch to device specific parameters")
Signed-off-by: Chris Wilson <[email protected]>
Cc: Jani Nikula <[email protected]>
Reviewed-by: Tvrtko Ursulin <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Rodrigo Vivi <[email protected]>
(cherry picked from commit 98ef067453709444a264939940f7b3a5dfdfa09e)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915: Provide the perf pmu.module

Rather than manually implement our own module reference counting for perf
pmu events, finally realise that there is a module parameter to struct
pmu for this very purpose.

Signed-off-by: Chris Wilson <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: [email protected]
Reviewed-by: Tvrtko Ursulin <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Rodrigo Vivi <[email protected]>
(cherry picked from commit 27e897beec1c59861f15d4d3562c39ad1143620f)
Signed-off-by: Jani Nikula <[email protected]>

Merge tag 'gvt-next-fixes-2020-08-05' of https://github.com/intel/gvt-linux into drm-intel-fixes

gvt-next-fixes-2020-08-05

- Fix guest suspend/resume low performance handling of shadow ppgtt (Colin)
- Fix PV notifier handling for guest suspend/resume (Colin)

Signed-off-by: Jani Nikula <[email protected]>
From: Zhenyu Wang <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

ALSA: hda/realtek: Add quirk for Samsung Galaxy Book Ion

The Galaxy Book Ion uses the same ALC298 codec as other Samsung laptops
which have the no headphone sound bug, like my Samsung Notebook. The
Galaxy Book owner confirmed that this patch fixes the bug.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=207423
Signed-off-by: Mike Pozulp <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>

Merge tag 'asoc-fix-v5.9-rc1' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus

ASoC: Fixes for v5.9

A bunch of fixes that came in during the merge window, mostly for issues
that were uncovered by the changes to report errors on invalid register
access plus one important fix in that code itself.

Merge tag 'amd-drm-fixes-5.9-2020-08-12' of git://people.freedesktop.org/~agd5f/linux into drm-fixes

amd-drm-fixes-5.9-2020-08-12:

amdgpu:
- Fix allocation size
- SR-IOV fixes
- Vega20 SMU feature state caching fix
- Fix custom pptable handling
- Arcturus golden settings update
- Several display fixes

Signed-off-by: Dave Airlie <[email protected]>
From: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

Merge tag 'drm-misc-fixes-2020-08-12' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

drm-misc-fixes for v5.9-rc1:
- Add missing dma_fence_put() in virtio_gpu_execbuffer_ioctl().
- Fix memory leak in virtio_gpu_cleanup_object().

Signed-off-by: Dave Airlie <[email protected]>
From: Maarten Lankhorst <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

bpftool: Handle EAGAIN error code properly in pids collection

When the error code is EAGAIN, the kernel signals the user
space should retry the read() operation for bpf iterators.
Let us do it.

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

bpf: Avoid visit same object multiple times

Currently when traversing all tasks, the next tid
is always increased by one. This may result in
visiting the same task multiple times in a
pid namespace.

This patch fixed the issue by seting the next
tid as pid_nr_ns(pid, ns) + 1, similar to
funciton next_tgid().

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Cc: Rik van Riel <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

bpf: Fix a rcu_sched stall issue with bpf task/task_file iterator

In our production system, we observed rcu stalls when
'bpftool prog` is running.
  rcu: INFO: rcu_sched self-detected stall on CPU
  rcu: \x097-....: (20999 ticks this GP) idle=302/1/0x4000000000000000 softirq=1508852/1508852 fqs=4913
  \x09(t=21031 jiffies g=2534773 q=179750)
  NMI backtrace for cpu 7
  CPU: 7 PID: 184195 Comm: bpftool Kdump: loaded Tainted: G        W         5.8.0-00004-g68bfc7f8c1b4 #6
  Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019
  Call Trace:
  <IRQ>
  dump_stack+0x57/0x70
  nmi_cpu_backtrace.cold+0x14/0x53
  ? lapic_can_unplug_cpu.cold+0x39/0x39
  nmi_trigger_cpumask_backtrace+0xb7/0xc7
  rcu_dump_cpu_stacks+0xa2/0xd0
  rcu_sched_clock_irq.cold+0x1ff/0x3d9
  ? tick_nohz_handler+0x100/0x100
  update_process_times+0x5b/0x90
  tick_sched_timer+0x5e/0xf0
  __hrtimer_run_queues+0x12a/0x2a0
  hrtimer_interrupt+0x10e/0x280
  __sysvec_apic_timer_interrupt+0x51/0xe0
  asm_call_on_stack+0xf/0x20
  </IRQ>
  sysvec_apic_timer_interrupt+0x6f/0x80
  asm_sysvec_apic_timer_interrupt+0x12/0x20
  RIP: 0010:task_file_seq_get_next+0x71/0x220
  Code: 00 00 8b 53 1c 49 8b 7d 00 89 d6 48 8b 47 20 44 8b 18 41 39 d3 76 75 48 8b 4f 20 8b 01 39 d0 76 61 41 89 d1 49 39 c1 48 19 c0 <48> 8b 49 08 21 d0 48 8d 04 c1 4c 8b 08 4d 85 c9 74 46 49 8b 41 38
  RSP: 0018:ffffc90006223e10 EFLAGS: 00000297
  RAX: ffffffffffffffff RBX: ffff888f0d172388 RCX: ffff888c8c07c1c0
  RDX: 00000000000f017b RSI: 00000000000f017b RDI: ffff888c254702c0
  RBP: ffffc90006223e68 R08: ffff888be2a1c140 R09: 00000000000f017b
  R10: 0000000000000002 R11: 0000000000100000 R12: ffff888f23c24118
  R13: ffffc90006223e60 R14: ffffffff828509a0 R15: 00000000ffffffff
  task_file_seq_next+0x52/0xa0
  bpf_seq_read+0xb9/0x320
  vfs_read+0x9d/0x180
  ksys_read+0x5f/0xe0
  do_syscall_64+0x38/0x60
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f8815f4f76e
  Code: c0 e9 f6 fe ff ff 55 48 8d 3d 76 70 0a 00 48 89 e5 e8 36 06 02 00 66 0f 1f 44 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 52 c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5
  RSP: 002b:00007fff8f9df578 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
  RAX: ffffffffffffffda RBX: 000000000170b9c0 RCX: 00007f8815f4f76e
  RDX: 0000000000001000 RSI: 00007fff8f9df5b0 RDI: 0000000000000007
  RBP: 00007fff8f9e05f0 R08: 0000000000000049 R09: 0000000000000010
  R10: 00007f881601fa40 R11: 0000000000000246 R12: 00007fff8f9e05a8
  R13: 00007fff8f9e05a8 R14: 0000000001917f90 R15: 000000000000e22e

Note that `bpftool prog` actually calls a task_file bpf iterator
program to establish an association between prog/map/link/btf anon
files and processes.

In the case where the above rcu stall occured, we had a process
having 1587 tasks and each task having roughly 81305 files.
This implied 129 million bpf prog invocations. Unfortunwtely none of
these files are prog/map/link/btf files so bpf iterator/prog needs
to traverse all these files and not able to return to user space
since there are no seq_file buffer overflow.

This patch fixed the issue in bpf_seq_read() to limit the number
of visited objects. If the maximum number of visited objects is
reached, no more objects will be visited in the current syscall.
If there is nothing written in the seq_file buffer, -EAGAIN will
return to the user so user can try again.

The maximum number of visited objects is set at 1 million.
In our Intel Xeon D-2191 2.3GHZ 18-core server, bpf_seq_read()
visiting 1 million files takes around 0.18 seconds.

We did not use cond_resched() since for some iterators, e.g.,
netlink iterator, where rcu read_lock critical section spans between
consecutive seq_ops->next(), which makes impossible to do cond_resched()
in the key while loop of function bpf_seq_read().

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

net: ipv4: remove duplicate "the the" phrase in Kconfig text

The Kconfig help text contains the phrase "the the" in the help
text. Fix this and reformat the block of help text.

Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: mscc: ocelot: remove duplicate "the the" phrase in Kconfig text

The Kconfig help text contains the phrase "the the" in the help
text. Fix this.

Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'ethtool-netlink-bug-fixes'

Maxim Mikityanskiy says:

====================
ethtool-netlink bug fixes

This series contains a few bug fixes for ethtool-netlink. These bugs are
specific for the netlink interface, and the legacy ioctl interface is
not affected. These patches aim to have the same behavior in
ethtool-netlink as in the legacy ethtool.

Please also see the sibling series for the userspace tool.

v2 changes: Added Fixes tags.
====================

Signed-off-by: David S. Miller <[email protected]>

ethtool: Don't omit the netlink reply if no features were changed

The legacy ethtool userspace tool shows an error when no features could
be changed. It's useful to have a netlink reply to be able to show this
error when __netdev_update_features wasn't called, for example:

1. ethtool -k eth0
   large-receive-offload: off
2. ethtool -K eth0 rx-fcs on
3. ethtool -K eth0 lro on
   Could not change any device features
   rx-lro: off [requested on]
4. ethtool -K eth0 lro on
   # The output should be the same, but without this patch the kernel
   # doesn't send the reply, and ethtool is unable to detect the error.

This commit makes ethtool-netlink always return a reply when requested,
and it still avoids unnecessary calls to __netdev_update_features if the
wanted features haven't changed.

Fixes: 0980bfcd6954 ("ethtool: set netdev features with FEATURES_SET request")
Signed-off-by: Maxim Mikityanskiy <[email protected]>
Reviewed-by: Michal Kubecek <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ethtool: Account for hw_features in netlink interface

ethtool-netlink ignores dev->hw_features and may confuse the drivers by
asking them to enable features not in the hw_features bitmask. For
example:

1. ethtool -k eth0
   tls-hw-tx-offload: off [fixed]
2. ethtool -K eth0 tls-hw-tx-offload on
   tls-hw-tx-offload: on
3. ethtool -k eth0
   tls-hw-tx-offload: on [fixed]

Fitler out dev->hw_features from req_wanted to fix it and to resemble
the legacy ethtool behavior.

Fixes: 0980bfcd6954 ("ethtool: set netdev features with FEATURES_SET request")
Signed-off-by: Maxim Mikityanskiy <[email protected]>
Reviewed-by: Michal Kubecek <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ethtool: Fix preserving of wanted feature bits in netlink interface

Currently, ethtool-netlink calculates new wanted bits as:
(req_wanted & req_mask) | (old_active & ~req_mask)

It completely discards the old wanted bits, so they are forgotten with
the next ethtool command. Sample steps to reproduce:

1. ethtool -k eth0
   tx-tcp-segmentation: on # TSO is on from the beginning
2. ethtool -K eth0 tx off
   tx-tcp-segmentation: off [not requested]
3. ethtool -k eth0
   tx-tcp-segmentation: off [requested on]
4. ethtool -K eth0 rx off # Some change unrelated to TSO
5. ethtool -k eth0
   tx-tcp-segmentation: off # "Wanted on" is forgotten

This commit fixes it by changing the formula to:
(req_wanted & req_mask) | (old_wanted & ~req_mask),
where old_active was replaced by old_wanted to account for the wanted
bits.

The shortcut condition for the case where nothing was changed now
compares wanted bitmasks, instead of wanted to active.

Fixes: 0980bfcd6954 ("ethtool: set netdev features with FEATURES_SET request")
Signed-off-by: Maxim Mikityanskiy <[email protected]>
Reviewed-by: Michal Kubecek <[email protected]>
Signed-off-by: David S. Miller <[email protected]>