Git Repo - linux.git/log

net: switch copy_bpf_fprog_from_user to sockptr_t

Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: add a new sockptr_t type

Add a uptr_t type that can hold a pointer to either a user or kernel
memory region, and simply helpers to copy to and from it.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bpfilter: reject kernel addresses

The bpfilter user mode helper processes the optval address using
process_vm_readv. Don't send it kernel addresses fed under
set_fs(KERNEL_DS) as that won't work.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/bpfilter: split __bpfilter_process_sockopt

Split __bpfilter_process_sockopt into a low-level send request routine and
the actual setsockopt hook to split the init time ping from the actual
setsockopt processing.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bpfilter: fix up a sparse annotation

The __user doesn't make sense when casting to an integer type, just
switch to a uintptr_t cast which also removes the need for the __force.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Luc Van Oostenryck <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'TC-datapath-hash-api'

Ariel Levkovich says:

====================
TC datapath hash api

Hash based packet classification allows user to set up rules that
provide load balancing of traffic across multiple vports and
for ECMP path selection while keeping the number of rule at minimum.

Instead of matching on exact flow spec, which requires a rule per
flow, user can define rules based on a their hash value and distribute
the flows to different buckets. The number of rules
in this case will be constant and equal to the number of buckets.

The series introduces an extention to the cls flower classifier
and allows user to add rules that match on the hash value that
is stored in skb->hash while assuming the value was set prior to
the classification.

Setting the skb->hash can be done in various ways and is not defined
in this series - for example:
1. By the device driver upon processing an rx packet.
2. Using tc action bpf with a program which computes and sets the
skb->hash value.

$ tc filter add dev ens1f0_0 ingress \
prio 1 chain 2 proto ip \
flower hash 0x0/0xf  \
action mirred egress redirect dev ens1f0_1

$ tc filter add dev ens1f0_0 ingress \
prio 1 chain 2 proto ip \
flower hash 0x1/0xf  \
action mirred egress redirect dev ens1f0_2

v3 -> v4:
*Drop hash setting code leaving only the classidication parts.
  Setting the hash will be possible via existing tc action bpf.

v2 -> v3:
*Split hash algorithm option into 2 different actions.
  Asym_l4 available via act_skbedit and bpf via new act_hash.
====================

Signed-off-by: David S. Miller <[email protected]>

net/sched: cls_flower: Add hash info to flow classification

Adding new cls flower keys for hash value and hash
mask and dissect the hash info from the skb into
the flow key towards flow classication.

Signed-off-by: Ariel Levkovich <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/flow_dissector: add packet hash dissection

Retreive a hash value from the SKB and store it
in the dissector key for future matching.

Signed-off-by: Ariel Levkovich <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hyperv: dump TX indirection table to ethtool regs

An imbalanced TX indirection table causes netvsc to have low
performance. This table is created and managed during runtime. To help
better diagnose performance issues caused by imbalanced tables, it needs
make TX indirection tables visible.

Because TX indirection table is driver specified information, so
display it via ethtool register dump.

Signed-off-by: Chi Song <[email protected]>
Reviewed-by: Haiyang Zhang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

flow_offload: Move rhashtable inclusion to the source file

I noticed that touching linux/rhashtable.h causes lib/vsprintf.c to
be rebuilt. This dependency came through a bogus inclusion in the
file net/flow_offload.h. This patch moves it to the right place.

This patch also removes a lingering rhashtable inclusion in cls_api
created by the same commit.

Fixes: 4e481908c51b ("flow_offload: move tc indirect block to...")
Signed-off-by: Herbert Xu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'akpm' into master (patches from Andrew)

Merge misc fixes from Andrew Morton:
"Subsystems affected by this patch series: mm/pagemap, mm/shmem,
  mm/hotfixes, mm/memcg, mm/hugetlb, mailmap, squashfs, scripts,
  io-mapping, MAINTAINERS, and gdb"

* emailed patches from Andrew Morton <[email protected]>:
  scripts/gdb: fix lx-symbols 'gdb.error' while loading modules
  MAINTAINERS: add KCOV section
  io-mapping: indicate mapping failure
  scripts/decode_stacktrace: strip basepath from all paths
  squashfs: fix length field overlap check in metadata reading
  mailmap: add entry for Mike Rapoport
  khugepaged: fix null-pointer dereference due to race
  mm/hugetlb: avoid hardcoding while checking if cma is enabled
  mm: memcg/slab: fix memory leak at non-root kmem_cache destroy
  mm/memcg: fix refcount error while moving and swapping
  mm/memcontrol: fix OOPS inside mem_cgroup_get_nr_swap_pages()
  mm: initialize return of vm_insert_pages
  vfs/xattr: mm/shmem: kernfs: release simple xattr entry in a right way
  mm/mmap.c: close race between munmap() and expand_upwards()/downwards()

Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into master

Pull xtensa csum regression fix from Al Viro:
"Max Filippov caught a breakage introduced in xtensa this cycle
  by the csum_and_copy_..._user() series.

  Cut'n'paste from the wrong source - the check that belongs
  in csum_and_copy_to_user() ended up both there and in
  csum_and_copy_from_user()"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  xtensa: fix access check in csum_and_copy_from_user

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux into master

Pull arm64 fix from Will Deacon:
"Fix compat vDSO build flags for recent versions of clang to tell it
where to find the assembler"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: vdso32: Fix '--prefix=' value for newer versions of clang

Merge tag 'for-5.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into master

Pull btrfs fixes from David Sterba:
"A few resouce leak fixes from recent patches, all are stable material.

  The problems have been observed during testing or have a reproducer"

* tag 'for-5.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix mount failure caused by race with umount
  btrfs: fix page leaks after failure to lock page for delalloc
  btrfs: qgroup: fix data leak caused by race between writeback and truncate
  btrfs: fix double free on ulist after backref resolution failure

Merge tag 'zonefs-5.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs into master

Pull zonefs fixes from Damien Le Moal:
"Two fixes, the first one to remove compilation warnings and the second
  to avoid potentially inefficient allocation of BIOs for direct writes
  into sequential zones"

* tag 'zonefs-5.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: count pages after truncating the iterator
  zonefs: Fix compilation warning

Merge tag 'io_uring-5.8-2020-07-24' of git://git.kernel.dk/linux-block into master

Pull io_uring fixes from Jens Axboe:

- Fix discrepancy in how sqe->flags are treated for a few requests,
   this makes it consistent (Daniele)

- Ensure that poll driven retry works with double waitqueue poll users

- Fix a missing io_req_init_async() (Pavel)

* tag 'io_uring-5.8-2020-07-24' of git://git.kernel.dk/linux-block:
  io_uring: missed req_init_async() for IOSQE_ASYNC
  io_uring: always allow drain/link/hardlink/async sqe flags
  io_uring: ensure double poll additions work with both request types

Merge tag 'iommu-fix-v5.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu into master

Pull iommu fix from Joerg Roedel:
"Fix a NULL-ptr dereference in the QCOM IOMMU driver"

* tag 'iommu-fix-v5.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/qcom: Use domain rather than dev as tlb cookie

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma into master

Pull rdma fixes from Jason Gunthorpe:
"One merge window regression, some corruption bugs in HNS and a few
  more syzkaller fixes:

   - Two long standing syzkaller races

   - Fix incorrect HW configuration in HNS

   - Restore accidentally dropped locking in IB CM

   - Fix ODP prefetch bug added in the big rework several versions ago"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  RDMA/mlx5: Prevent prefetch from racing with implicit destruction
  RDMA/cm: Protect access to remote_sidr_table
  RDMA/core: Fix race in rdma_alloc_commit_uobject()
  RDMA/hns: Fix wrong PBL offset when VA is not aligned to PAGE_SIZE
  RDMA/hns: Fix wrong assignment of lp_pktn_ini in QPC
  RDMA/mlx5: Use xa_lock_irq when access to SRQ table

Merge tag 'for-5.8/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm into master

Pull device mapper fix from Mike Snitzer:
"A stable fix for DM integrity target's integrity recalculation that
  gets skipped when resuming a device. This is a fix for a previous
  stable@ fix"

* tag 'for-5.8/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm integrity: fix integrity recalculation that is improperly skipped

Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux into master

Pull i2c fixes from Wolfram Sang:
"Again some driver bugfixes and some documentation fixes"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: i2c-qcom-geni: Fix DMA transfer race
  i2c: rcar: always clear ICSAR to avoid side effects
  MAINTAINERS: i2c: at91: handover maintenance to Codrin Ciubotariu
  i2c: drop duplicated word in the header file
  i2c: cadence: Clear HOLD bit at correct time in Rx path
  Revert "i2c: cadence: Fix the hold bit setting"

Merge tag 'mmc-v5.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc into master

Pull MMC fix from Ulf Hansson:
"Fix clock divider calculation in the ASPEED SDHCI controller"

* tag 'mmc-v5.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: sdhci-of-aspeed: Fix clock divider calculation

Merge tag 'drm-fixes-2020-07-24' of git://anongit.freedesktop.org/drm/drm into master

Pull drm fixes from Dave Airlie:
"Quiet fixes, I may have a single regression fix follow up to this for
  nouveau, but it might be next week, Ben was testing it a bit more .

  Otherwise two amdgpu fixes, one lima and one sun4i:

  amdgpu:
    - Fix crash when overclocking VegaM
    - Fix possible crash when editing dpm levels

  sun4i:
    - Fix inverted HPD result; fixes an earlier fix

  lima:
    - fix timeout during reset"

* tag 'drm-fixes-2020-07-24' of git://anongit.freedesktop.org/drm/drm:
  drm/amdgpu: Fix NULL dereference in dpm sysfs handlers
  drm/amd/powerplay: fix a crash when overclocking Vega M
  drm/lima: fix wait pp reset timeout
  drm: sun4i: hdmi: Fix inverted HPD result

scripts/gdb: fix lx-symbols 'gdb.error' while loading modules

Commit ed66f991bb19 ("module: Refactor section attr into bin attribute")
removed the 'name' field from 'struct module_sect_attr' triggering the
following error when invoking lx-symbols:

  (gdb) lx-symbols
  loading vmlinux
  scanning for modules in linux/build
  loading @0xffffffffc014f000: linux/build/drivers/net/tun.ko
  Python Exception <class 'gdb.error'> There is no member named name.:
  Error occurred in Python: There is no member named name.

This patch fixes the issue taking the module name from the 'struct
attribute'.

Fixes: ed66f991bb19 ("module: Refactor section attr into bin attribute")
Signed-off-by: Stefano Garzarella <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Jan Kiszka <[email protected]>
Reviewed-by: Kieran Bingham <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

MAINTAINERS: add KCOV section

To link KCOV to the kasan-dev@ mailing list.

Signed-off-by: Andrey Konovalov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Dmitry Vyukov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Marco Elver <[email protected]>
Link: http://lkml.kernel.org/r/5fa344db7ac4af2213049e5656c0f43d6ecaa379.1595331682.git.andreyknvl@google.com
Signed-off-by: Linus Torvalds <[email protected]>

io-mapping: indicate mapping failure

The !ATOMIC_IOMAP version of io_maping_init_wc will always return
success, even when the ioremap fails.

Since the ATOMIC_IOMAP version returns NULL when the init fails, and
callers check for a NULL return on error this is unexpected.

During a device probe, where the ioremap failed, a crash can look like
this:

    BUG: unable to handle page fault for address: 0000000000210000
     #PF: supervisor write access in kernel mode
     #PF: error_code(0x0002) - not-present page
     Oops: 0002 [#1] PREEMPT SMP
     CPU: 0 PID: 177 Comm:
     RIP: 0010:fill_page_dma [i915]
       gen8_ppgtt_create [i915]
       i915_ppgtt_create [i915]
       intel_gt_init [i915]
       i915_gem_init [i915]
       i915_driver_probe [i915]
       pci_device_probe
       really_probe
       driver_probe_device

The remap failure occurred much earlier in the probe.  If it had been
propagated, the driver would have exited with an error.

Return NULL on ioremap failure.

[[email protected]: detect ioremap_wc() errors earlier]

Fixes: cafaf14a5d8f ("io-mapping: Always create a struct to hold metadata about the io-mapping")
Signed-off-by: Michael J. Ruhl <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

scripts/decode_stacktrace: strip basepath from all paths

Currently the basepath is removed only from the beginning of the string.
When the symbol is inlined and there's multiple line outputs of
addr2line, only the first line would have basepath removed.

Change to remove the basepath prefix from all lines.

Fixes: 31013836a71e ("scripts/decode_stacktrace: match basepath using shell prefix operator, not regex")
Co-developed-by: Shik Chen <[email protected]>
Signed-off-by: Pi-Hsun Shih <[email protected]>
Signed-off-by: Shik Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Stephen Boyd <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Nicolas Boichat <[email protected]>
Cc: Jiri Slaby <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

squashfs: fix length field overlap check in metadata reading

This is a regression introduced by the "migrate from ll_rw_block usage
to BIO" patch.

Squashfs packs structures on byte boundaries, and due to that the length
field (of the metadata block) may not be fully in the current block.
The new code rewrote and introduced a faulty check for that edge case.

Fixes: 93e72b3c612adcaca1 ("squashfs: migrate from ll_rw_block usage to BIO")
Reported-by: Bernd Amend <[email protected]>
Signed-off-by: Phillip Lougher <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Adrien Schildknecht <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Daniel Rosenberg <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

mailmap: add entry for Mike Rapoport

Add an entry to correct my email addresses.

Signed-off-by: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

khugepaged: fix null-pointer dereference due to race

khugepaged has to drop mmap lock several times while collapsing a page.
The situation can change while the lock is dropped and we need to
re-validate that the VMA is still in place and the PMD is still subject
for collapse.

But we miss one corner case: while collapsing an anonymous pages the VMA
could be replaced with file VMA. If the file VMA doesn't have any
private pages we get NULL pointer dereference:

general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
anon_vma_lock_write include/linux/rmap.h:120 [inline]
collapse_huge_page mm/khugepaged.c:1110 [inline]
khugepaged_scan_pmd mm/khugepaged.c:1349 [inline]
khugepaged_scan_mm_slot mm/khugepaged.c:2110 [inline]
khugepaged_do_scan mm/khugepaged.c:2193 [inline]
khugepaged+0x3bba/0x5a10 mm/khugepaged.c:2238

The fix is to make sure that the VMA is anonymous in
hugepage_vma_revalidate(). The helper is only used for collapsing
anonymous pages.

Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
Reported-by: [email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Yang Shi <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

mm/hugetlb: avoid hardcoding while checking if cma is enabled

hugetlb_cma[0] can be NULL due to various reasons, for example, node0
has no memory. so NULL hugetlb_cma[0] doesn't necessarily mean cma is
not enabled. gigantic pages might have been reserved on other nodes.
This patch fixes possible double reservation and CMA leak.

[[email protected]: fix CONFIG_CMA=n warning]
[[email protected]: better checks before using hugetlb_cma]
Link: http://lkml.kernel.org/r/[email protected]
Fixes: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic hugepages using cma")
Signed-off-by: Barry Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

mm: memcg/slab: fix memory leak at non-root kmem_cache destroy

If the kmem_cache refcount is greater than one, we should not mark the
root kmem_cache as dying.  If we mark the root kmem_cache dying
incorrectly, the non-root kmem_cache can never be destroyed.  It
resulted in memory leak when memcg was destroyed.  We can use the
following steps to reproduce.

  1) Use kmem_cache_create() to create a new kmem_cache named A.
  2) Coincidentally, the kmem_cache A is an alias for kmem_cache B,
     so the refcount of B is just increased.
  3) Use kmem_cache_destroy() to destroy the kmem_cache A, just
     decrease the B's refcount but mark the B as dying.
  4) Create a new memory cgroup and alloc memory from the kmem_cache
     B. It leads to create a non-root kmem_cache for allocating memory.
  5) When destroy the memory cgroup created in the step 4), the
     non-root kmem_cache can never be destroyed.

If we repeat steps 4) and 5), this will cause a lot of memory leak.  So
only when refcount reach zero, we mark the root kmem_cache as dying.

Fixes: 92ee383f6daa ("mm: fix race between kmem_cache destroy, create and deactivate")
Signed-off-by: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

mm/memcg: fix refcount error while moving and swapping

It was hard to keep a test running, moving tasks between memcgs with
move_charge_at_immigrate, while swapping: mem_cgroup_id_get_many()'s
refcount is discovered to be 0 (supposedly impossible), so it is then
forced to REFCOUNT_SATURATED, and after thousands of warnings in quick
succession, the test is at last put out of misery by being OOM killed.

This is because of the way moved_swap accounting was saved up until the
task move gets completed in __mem_cgroup_clear_mc(), deferred from when
mem_cgroup_move_swap_account() actually exchanged old and new ids.
Concurrent activity can free up swap quicker than the task is scanned,
bringing id refcount down 0 (which should only be possible when
offlining).

Just skip that optimization: do that part of the accounting immediately.

Fixes: 615d66c37c75 ("mm: memcontrol: fix memcg id ref counter on swap charge move")
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

mm/memcontrol: fix OOPS inside mem_cgroup_get_nr_swap_pages()

Prabhakar reported an OOPS inside mem_cgroup_get_nr_swap_pages()
function in a corner case seen on some arm64 boards when kdump kernel
runs with "cgroup_disable=memory" passed to the kdump kernel via
bootargs.

The root-cause behind the same is that currently mem_cgroup_swap_init()
function is implemented as a subsys_initcall() call instead of a
core_initcall(), this means 'cgroup_memory_noswap' still remains set to
the default value (false) even when memcg is disabled via
"cgroup_disable=memory" boot parameter.

This may result in premature OOPS inside mem_cgroup_get_nr_swap_pages()
function in corner cases:

  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000188
  Mem abort info:
    ESR = 0x96000006
    EC = 0x25: DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
  Data abort info:
    ISV = 0, ISS = 0x00000006
    CM = 0, WnR = 0
  [0000000000000188] user address but active_mm is swapper
  Internal error: Oops: 96000006 [#1] SMP
  Modules linked in:
  <..snip..>
  Call trace:
    mem_cgroup_get_nr_swap_pages+0x9c/0xf4
    shrink_lruvec+0x404/0x4f8
    shrink_node+0x1a8/0x688
    do_try_to_free_pages+0xe8/0x448
    try_to_free_pages+0x110/0x230
    __alloc_pages_slowpath.constprop.106+0x2b8/0xb48
    __alloc_pages_nodemask+0x2ac/0x2f8
    alloc_page_interleave+0x20/0x90
    alloc_pages_current+0xdc/0xf8
    atomic_pool_expand+0x60/0x210
    __dma_atomic_pool_init+0x50/0xa4
    dma_atomic_pool_init+0xac/0x158
    do_one_initcall+0x50/0x218
    kernel_init_freeable+0x22c/0x2d0
    kernel_init+0x18/0x110
    ret_from_fork+0x10/0x18
  Code: aa1403e3 91106000 97f82a27 14000011 (f940c663)
  ---[ end trace 9795948475817de4 ]---
  Kernel panic - not syncing: Fatal exception
  Rebooting in 10 seconds..

Fixes: eccb52e78809 ("mm: memcontrol: prepare swap controller setup for integration")
Reported-by: Prabhakar Kushwaha <[email protected]>
Signed-off-by: Bhupesh Sharma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: James Morse <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Catalin Marinas <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

mm: initialize return of vm_insert_pages

clang static analysis reports a garbage return

  In file included from mm/memory.c:84:
  mm/memory.c:1612:2: warning: Undefined or garbage value returned to caller [core.uninitialized.UndefReturn]
          return err;
          ^~~~~~~~~~

The setting of err depends on a loop executing.  So initialize err.

Signed-off-by: Tom Rix <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

vfs/xattr: mm/shmem: kernfs: release simple xattr entry in a right way

After commit fdc85222d58e ("kernfs: kvmalloc xattr value instead of
kmalloc"), simple xattr entry is allocated with kvmalloc() instead of
kmalloc(), so we should release it with kvfree() instead of kfree().

Fixes: fdc85222d58e ("kernfs: kvmalloc xattr value instead of kmalloc")
Signed-off-by: Chengguang Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Daniel Xu <[email protected]>
Cc: Chris Down <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Al Viro <[email protected]>
Cc: <[email protected]> [5.7]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

mm/mmap.c: close race between munmap() and expand_upwards()/downwards()

VMA with VM_GROWSDOWN or VM_GROWSUP flag set can change their size under
mmap_read_lock().  It can lead to race with __do_munmap():

Thread A Thread B
__do_munmap()
  detach_vmas_to_be_unmapped()
  mmap_write_downgrade()
expand_downwards()
  vma->vm_start = address;
  // The VMA now overlaps with
  // VMAs detached by the Thread A
// page fault populates expanded part
// of the VMA
  unmap_region()
    // Zaps pagetables partly
    // populated by Thread B

Similar race exists for expand_upwards().

The fix is to avoid downgrading mmap_lock in __do_munmap() if detached
VMAs are next to VM_GROWSDOWN or VM_GROWSUP VMA.

[[email protected]: s/mmap_sem/mmap_lock/ in comment]

Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Reported-by: Jann Horn <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: <[email protected]> [4.20+]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

uprobes: Change handle_swbp() to send SIGTRAP with si_code=SI_KERNEL, to fix GDB regression

If a tracee is uprobed and it hits int3 inserted by debugger, handle_swbp()
does send_sig(SIGTRAP, current, 0) which means si_code == SI_USER. This used
to work when this code was written, but then GDB started to validate si_code
and now it simply can't use breakpoints if the tracee has an active uprobe:

# cat test.c
void unused_func(void)
{
}
int main(void)
{
return 0;
}

# gcc -g test.c -o test
# perf probe -x ./test -a unused_func
# perf record -e probe_test:unused_func gdb ./test -ex run
GNU gdb (GDB) 10.0.50.20200714-git
...
Program received signal SIGTRAP, Trace/breakpoint trap.
0x00007ffff7ddf909 in dl_main () from /lib64/ld-linux-x86-64.so.2
(gdb)

The tracee hits the internal breakpoint inserted by GDB to monitor shared
library events but GDB misinterprets this SIGTRAP and reports a signal.

Change handle_swbp() to use force_sig(SIGTRAP), this matches do_int3_user()
and fixes the problem.

This is the minimal fix for -stable, arch/x86/kernel/uprobes.c is equally
wrong; it should use send_sigtrap(TRAP_TRACE) instead of send_sig(SIGTRAP),
but this doesn't confuse GDB and needs another x86-specific patch.

Reported-by: Aaron Merey <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]

sched: Warn if garbage is passed to default_wake_function()

Since the default_wake_function() passes its flags onto
try_to_wake_up(), warn if those flags collide with internal values.

Given that the supplied flags are garbage, no repair can be done but at
least alert the user to the damage they are causing.

In the belief that these errors should be picked up during testing, the
warning is only compiled in under CONFIG_SCHED_DEBUG.

Signed-off-by: Chris Wilson <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

vrf: Handle CONFIG_SYSCTL not set

Randy reported compile failure when CONFIG_SYSCTL is not set/enabled:

ERROR: modpost: "sysctl_vals" [drivers/net/vrf.ko] undefined!

Fix by splitting out the sysctl init and cleanup into helpers that
can be set to do nothing when CONFIG_SYSCTL is disabled. In addition,
move vrf_strict_mode and vrf_strict_mode_change to above
vrf_shared_table_handler (code move only) and wrap all of it
in the ifdef CONFIG_SYSCTL.

Update the strict mode tests to check for the existence of the
/proc/sys entry.

Fixes: 33306f1aaf82 ("vrf: add sysctl parameter for strict mode")
Cc: Andrea Mayer <[email protected]>
Reported-by: Randy Dunlap <[email protected]>
Signed-off-by: David Ahern <[email protected]>
Acked-by: Randy Dunlap <[email protected]> # build-tested
Signed-off-by: David S. Miller <[email protected]>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes for net:

1) Fix NAT hook deletion when table is dormant, from Florian Westphal.

2) Fix IPVS sync stalls, from guodeqing.
====================

Signed-off-by: David S. Miller <[email protected]>

ice: add 1G SGMII PHY type

There isn't a case for 1G SGMII in ice_get_media_type() so add
the handling for it.

Also handle the special case where some direct attach
cables may report that they support 1G SGMII, but
that is erroneous since SGMII is supposed to be a
backplane media type (between a MAC and a PHY). If
the driver doesn't handle this special case then a
user could see the 'Port' in ethtool change from
'Direct attach Copper' to 'Backplane' when they have
forced the speed to 1G, but the cable hasn't changed.

Lastly, change ice_aq_get_phy_caps() to save the
module_type info if the function was called with
ICE_AQC_REPORT_TOPO_CAP. This call uses the media
information to populate the module_type. If no
media is present then the values in module_type
will be 0.

Signed-off-by: Paul M Stillwell Jr <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: Report AOC PHY Types as Fiber

Report AOC types as fiber instead of unknown.

Signed-off-by: Doug Dziggel <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: add AQC get link topology handle support

Add AQC get link topology handle support. This is needed to determine
Direct Attach (DA) or backplane media type for PHY types that support
either. Get link topology handle cage node type request can be used to
determine if a cage is present or not. If a cage is present for PHY
types that supports both DA and backplane media type, then the media
type is DA, else the media type is backplane.

Signed-off-by: Paul Greenwalt <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: Rename low_power_ctrl

Rename the low_power_ctrl field to low_power_ctrl_an to be properly
descriptive of it being an autoneg field.

Signed-off-by: Lev Faerman <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: update reporting of autoneg capabilities

Firmware now reports AN28, AN32, and AN73. Add a helper and check these new
values and report PHY autoneg capability.

Signed-off-by: Paul Greenwalt <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: add ice_aq_get_phy_caps() debug logs

Add debug logs for ice_aq_get_phy_caps(), and format
ice_aq_set_phy_cfg() and ice_aq_get_link_info() debug logs to make them
more readable.

Signed-off-by: Paul Greenwalt <[email protected]>
Signed-off-by: Paul M Stillwell Jr <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: support Total Port Shutdown on devices that support it

When the Port Disable bit is set in the Link Default Override Mask TLV PFA
module in the NVM, Total Port Shutdown mode is supported and enabled. In
this mode, the driver should act as if the link-down-on-close ethtool
private flag is always enabled and dis-allow any change to that flag.

Signed-off-by: Bruce Allan <[email protected]>
Signed-off-by: Paul Greenwalt <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: add link lenient and default override support

Adds functions to check for link override firmware support and get
the override settings for a port. The previously supported/default link
mode was strict mode.

In strict mode link is configured based on get PHY capabilities PHY types
with media.

Lenient mode is now the default link mode. In lenient mode the link is
configured based on get PHY capabilities PHY types without media. This
allows the user to configure link that the media does not report. Limit the
minimum supported link mode to 25G for devices that support 100G, and 1G
for devices that support less than 100G.

Default override is only supported in lenient mode. If default override
is supported and enabled, then default override values are used for
configuring speed and FEC. Default override provide persistent link
settings in the NVM.

Signed-off-by: Paul Greenwalt <[email protected]>
Signed-off-by: Evan Swanson <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

geneve: fix an uninitialized value in geneve_changelink()

geneve_nl2info() sets 'df' conditionally, so we have to
initialize it by copying the value from existing geneve
device in geneve_changelink().

Fixes: 56c09de347e4 ("geneve: allow changing DF behavior after creation")
Reported-by: [email protected]
Cc: Sabrina Dubroca <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Reviewed-by: Sabrina Dubroca <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bonding: check return value of register_netdevice() in bond_newlink()

Very similar to commit 544f287b8495
("bonding: check error value of register_netdevice() immediately"),
we should immediately check the return value of register_netdevice()
before doing anything else.

Fixes: 005db31d5f5f ("bonding: set carrier off for devices created through netlink")
Reported-and-tested-by: [email protected]
Cc: Beniamino Galvani <[email protected]>
Cc: Taehee Yoo <[email protected]>
Cc: Jay Vosburgh <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ice: restore PHY settings on media insertion

After the transition from no media to media FW will clear the
set-phy-cfg data set by the user. Save initial PHY settings and any
settings later requested by the user and use that data to restore PHY
settings on media insertion. Since PHY configuration is now being stored,
replace calls that were calling FW to get the configuration with the saved
copy.

Signed-off-by: Paul Greenwalt <[email protected]>
Signed-off-by: Chinh T Cao <[email protected]>
Signed-off-by: Paul M Stillwell Jr <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

net: dsa: stop overriding master's ndo_get_phys_port_name

The purpose of this override is to give the user an indication of what
the number of the CPU port is (in DSA, the CPU port is a hardware
implementation detail and not a network interface capable of traffic).

However, it has always failed (by design) at providing this information
to the user in a reliable fashion.

Prior to commit 3369afba1e46 ("net: Call into DSA netdevice_ops
wrappers"), the behavior was to only override this callback if it was
not provided by the DSA master.

That was its first failure: if the DSA master itself was a DSA port or a
switchdev, then the user would not see the number of the CPU port in
/sys/class/net/eth0/phys_port_name, but the number of the DSA master
port within its respective physical switch.

But that was actually ok in a way. The commit mentioned above changed
that behavior, and now overrides the master's ndo_get_phys_port_name
unconditionally. That comes with problems of its own, which are worse in
a way.

The idea is that it's typical for switchdev users to have udev rules for
consistent interface naming. These are based, among other things, on
the phys_port_name attribute. If we let the DSA switch at the bottom
to start randomly overriding ndo_get_phys_port_name with its own CPU
port, we basically lose any predictability in interface naming, or even
uniqueness, for that matter.

So, there are reasons to let DSA override the master's callback (to
provide a consistent interface, a number which has a clear meaning and
must not be interpreted according to context), and there are reasons to
not let DSA override it (it breaks udev matching for the DSA master).

But, there is an alternative method for users to retrieve the number of
the CPU port of each DSA switch in the system:

  $ devlink port
  pci/0000:00:00.5/0: type eth netdev swp0 flavour physical port 0
  pci/0000:00:00.5/2: type eth netdev swp2 flavour physical port 2
  pci/0000:00:00.5/4: type notset flavour cpu port 4
  spi/spi2.0/0: type eth netdev sw0p0 flavour physical port 0
  spi/spi2.0/1: type eth netdev sw0p1 flavour physical port 1
  spi/spi2.0/2: type eth netdev sw0p2 flavour physical port 2
  spi/spi2.0/4: type notset flavour cpu port 4
  spi/spi2.1/0: type eth netdev sw1p0 flavour physical port 0
  spi/spi2.1/1: type eth netdev sw1p1 flavour physical port 1
  spi/spi2.1/2: type eth netdev sw1p2 flavour physical port 2
  spi/spi2.1/3: type eth netdev sw1p3 flavour physical port 3
  spi/spi2.1/4: type notset flavour cpu port 4

So remove this duplicated, unreliable and troublesome method. From this
patch on, the phys_port_name attribute of the DSA master will only
contain information about itself (if at all). If the users need reliable
information about the CPU port they're probably using devlink anyway.

Signed-off-by: Vladimir Oltean <[email protected]>
Acked-by: florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ice: move auto FEC checks into ice_cfg_phy_fec()

The call to ice_cfg_phy_fec() requires the caller to perform certain
actions before calling it. Instead of imposing these preconditions move
the operations into the function and perform them ourselves.

Also, fix some style issues in nearby touched code.

Signed-off-by: Paul Greenwalt <[email protected]>
Signed-off-by: Chinh T Cao <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: refactor FC functions

Create a helper function for configuring requested flow control so that it
can be utilized by other functions looking to configure flow control
settings. Utilize the existing helper ice_copy_phy_caps_to_cfg() to copy a
PHY capability to configuration instead duplicating the code for it.

Signed-off-by: Paul Greenwalt <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: Add advanced power mgmt for WoL

Add callbacks needed to support advanced power management for Wake on LAN.
Also make ice_pf_state_is_nominal function available for all configurations
not just CONFIG_PCI_IOV.

Signed-off-by: Akeem G Abodunrin <[email protected]>
Signed-off-by: Jesse Brandeburg <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: split ice_discover_caps into two functions

Using the new ice_aq_list_caps and ice_parse_(dev|func)_caps functions,
replace ice_discover_caps with two functions that each take a pointer to
the dev_caps and func_caps structures respectively.

This makes the side effect of updating the hw->dev_caps and
hw->func_caps obvious from reading the implementation of the function.
Additionally, it opens the way for enabling reading of device
capabilities outside of the initialization flow. By passing in
a pointer, another caller will be able to read the capabilities without
modifying the HW capabilities structures.

As there are no other callers, it is safe to now remove
ice_aq_discover_caps and ice_parse_caps.

Signed-off-by: Jacob Keller <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: split ice_parse_caps into separate functions

The ice_parse_caps function is used to convert the capability block data
coming from firmware into a structured format used by other parts of the
code.

The current implementation directly updates the hw->func_caps and
hw->dev_caps structures. It is directly called from within
ice_aq_discover_caps. This causes the discover_caps function to have the
side effect of modifying the HW capability structures, which is not
intuitive.

Split this function into ice_parse_dev_caps and ice_parse_func_caps.
These functions will take a pointer to the dev_caps and func_caps
respectively. Also create an ice_parse_common_caps for sharing the
capability logic that is common to device and function.

Doing so enables a future refactor to allow reading and parsing
capabilities into a local caps structure instead of modifying the
members of the HW structure directly.

Signed-off-by: Jacob Keller <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: refactor ice_discover_caps to avoid need to retry

The ice_discover_caps function is used to read the device and function
capabilities, updating the hardware capabilities structures with
relevant data.

The exact number of capabilities returned by the hardware is unknown
ahead of time. The AdminQ command will report the total number of
capabilities in the return buffer.

The current implementation involves requesting capabilities once,
reading this returned size, and then re-requested with that size.

This isn't really necessary. The firmware interface has a maximum size
of ICE_AQ_MAX_BUF_LEN. Firmware can never return more than
ICE_AQ_MAX_BUF_LEN / sizeof(struct ice_aqc_list_caps_elem) capabilities.

Avoid the retry loop by simply allocating a buffer of size
ICE_AQ_MAX_BUF_LEN. This is significantly simpler than retrying. The
extra allocation isn't a big deal, as it will be released after we
finish parsing the capabilities.

Signed-off-by: Jacob Keller <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

Revert "cifs: Fix the target file was deleted when rename failed."

This reverts commit 9ffad9263b467efd8f8dc7ae1941a0a655a2bab2.

Upon additional testing with older servers, it was found that
the original commit introduced a regression when using the old SMB1
dialect and rsyncing over an existing file.

The patch will need to be respun to address this, likely including
a larger refactoring of the SMB1 and SMB3 rename code paths to make
it less confusing and also to address some additional rename error
cases that SMB3 may be able to workaround.

Signed-off-by: Steve French <[email protected]>
Reported-by: Patrick Fernie <[email protected]>
CC: Stable <[email protected]>
Acked-by: Ronnie Sahlberg <[email protected]>
Acked-by: Pavel Shilovsky <[email protected]>
Acked-by: Zhang Xiaoxu <[email protected]>

Merge tag 's390-5.8-6' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux into master

Pull s390 fixes from Heiko Carstens:

- Change cpum_cf/perf counter name from DFLT_CCERROR to DFLT_CCFINISH
   to reflect reality and avoid further confusion. This is a user space
   visible change therefore the commit has also a stable tag for 5.7,
   where this counter was introduced.

- Add Matthew Rosato as s390 IOMMU maintainer.

* tag 's390-5.8-6' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  MAINTAINERS: add Matthew for s390 IOMMU
  s390/cpum_cf,perf: change DFLT_CCERROR counter name

i2c: i2c-qcom-geni: Fix DMA transfer race

When I have KASAN enabled on my kernel and I start stressing the
touchscreen my system tends to hang.  The touchscreen is one of the
only things that does a lot of big i2c transfers and ends up hitting
the DMA paths in the geni i2c driver.  It appears that KASAN adds
enough delay in my system to tickle a race condition in the DMA setup
code.

When the system hangs, I found that it was running the geni_i2c_irq()
over and over again.  It had these:

m_stat   = 0x04000080
rx_st    = 0x30000011
dm_tx_st = 0x00000000
dm_rx_st = 0x00000000
dma      = 0x00000001

Notably we're in DMA mode but are getting M_RX_IRQ_EN and
M_RX_FIFO_WATERMARK_EN over and over again.

Putting some traces in geni_i2c_rx_one_msg() showed that when we
failed we were getting to the start of geni_i2c_rx_one_msg() but were
never executing geni_se_rx_dma_prep().

I believe that the problem here is that we are starting the geni
command before we run geni_se_rx_dma_prep().  If a transfer makes it
far enough before we do that then we get into the state I have
observed.  Let's change the order, which seems to work fine.

Although problems were seen on the RX path, code inspection suggests
that the TX should be changed too.  Change it as well.

Fixes: 37692de5d523 ("i2c: i2c-qcom-geni: Add bus driver for the Qualcomm GENI I2C controller")
Signed-off-by: Douglas Anderson <[email protected]>
Tested-by: Sai Prakash Ranjan <[email protected]>
Reviewed-by: Akash Asthana <[email protected]>
Reviewed-by: Stephen Boyd <[email protected]>
Reviewed-by: Mukesh Kumar Savaliya <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

i2c: rcar: always clear ICSAR to avoid side effects

On R-Car Gen2, we get a timeout when reading from the address set in
ICSAR, even though the slave interface is disabled. Clearing it fixes
this situation. Note that Gen3 is not affected.

To reproduce: bind and undbind an I2C slave on some bus, run
'i2cdetect' on that bus.

Fixes: de20d1857dd6 ("i2c: rcar: add slave support")
Signed-off-by: Wolfram Sang <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

tcp: allow at most one TLP probe per flight

Previously TLP may send multiple probes of new data in one
flight. This happens when the sender is cwnd limited. After the
initial TLP containing new data is sent, the sender receives another
ACK that acks partial inflight. It may re-arm another TLP timer
to send more, if no further ACK returns before the next TLP timeout
(PTO) expires. The sender may send in theory a large amount of TLP
until send queue is depleted. This only happens if the sender sees
such irregular uncommon ACK pattern. But it is generally undesirable
behavior during congestion especially.

The original TLP design restrict only one TLP probe per inflight as
published in "Reducing Web Latency: the Virtue of Gentle Aggression",
SIGCOMM 2013. This patch changes TLP to send at most one probe
per inflight.

Note that if the sender is app-limited, TLP retransmits old data
and did not have this issue.

Signed-off-by: Yuchung Cheng <[email protected]>
Signed-off-by: Neal Cardwell <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

AX.25: Prevent integer overflows in connect and sendmsg

We recently added some bounds checking in ax25_connect() and
ax25_sendmsg() and we so we removed the AX25_MAX_DIGIS checks because
they were no longer required.

Unfortunately, I believe they are required to prevent integer overflows
so I have added them back.

Fixes: 8885bb0621f0 ("AX.25: Prevent out-of-bounds read in ax25_sendmsg()")
Fixes: 2f2a7ffad5c6 ("AX.25: Fix out-of-bounds read in ax25_connect()")
Signed-off-by: Dan Carpenter <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

cxgb4: add loopback ethtool self-test

In this test, loopback pkt is created and sent on default queue.
The packet goes until the Multi Port Switch (MPS) just before
the MAC and based on the specified channel number, it either
goes outside the wire on one of the physical ports or looped
back to Rx path by MPS. In this case, we're specifying loopback
channel, instead of physical ports, so the packet gets looped
back to Rx path, instead of getting transmitted on the wire.

v3:
- Modify commit message to include test details.
v2:
- Add only loopback self-test.

Signed-off-by: Vishal Kulkarni <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'l2tp-further-checkpatch-pl-cleanups'

Tom Parkin says:

====================
l2tp: further checkpatch.pl cleanups

l2tp hasn't been kept up to date with the static analysis checks offered
by checkpatch.pl.

This patchset builds on the series "l2tp: cleanup checkpatch.pl
warnings". It includes small refactoring changes which improve code
quality and resolve a subset of the checkpatch warnings for the l2tp
codebase.
====================

Reviewed-by: James Chapman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

l2tp: cleanup kzalloc calls

Passing "sizeof(struct blah)" in kzalloc calls is less readable,
potentially prone to future bugs if the type of the pointer is changed,
and triggers checkpatch warnings.

Tweak the kzalloc calls in l2tp which use this form to avoid the
warning.

Signed-off-by: Tom Parkin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

l2tp: cleanup netlink tunnel create address handling

When creating an L2TP tunnel using the netlink API, userspace must
either pass a socket FD for the tunnel to use (for managed tunnels),
or specify the tunnel source/destination address (for unmanaged
tunnels).

Since source/destination addresses may be AF_INET or AF_INET6, the l2tp
netlink code has conditionally compiled blocks to support IPv6.

Rather than embedding these directly into l2tp_nl_cmd_tunnel_create
(where it makes the code difficult to read and confuses checkpatch to
boot) split the handling of address-related attributes into a separate
function.

Signed-off-by: Tom Parkin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

l2tp: cleanup netlink send of tunnel address information

l2tp_nl_tunnel_send has conditionally compiled code to support AF_INET6,
which makes the code difficult to follow and triggers checkpatch
warnings.

Split the code out into functions to handle the AF_INET v.s. AF_INET6
cases, which both improves readability and resolves the checkpatch
warnings.

Signed-off-by: Tom Parkin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

l2tp: check socket address type in l2tp_dfs_seq_tunnel_show

checkpatch warns about indentation and brace balancing around the
conditionally compiled code for AF_INET6 support in
l2tp_dfs_seq_tunnel_show.

By adding another check on the socket address type we can make the code
more readable while removing the checkpatch warning.

Signed-off-by: Tom Parkin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

l2tp: cleanup unnecessary braces in if statements

These checks are all simple and don't benefit from extra braces to
clarify intent. Remove them for easier-reading code.

Signed-off-by: Tom Parkin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

l2tp: cleanup comparisons to NULL

checkpatch warns about comparisons to NULL, e.g.

        CHECK: Comparison to NULL could be written "!rt"
        #474: FILE: net/l2tp/l2tp_ip.c:474:
        +       if (rt == NULL) {

These sort of comparisons are generally clearer and more readable
the way checkpatch suggests, so update l2tp accordingly.

Signed-off-by: Tom Parkin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/ncsi: use eth_zero_addr() to clear mac address

Use eth_zero_addr() to clear mac address insetad of memset().

Signed-off-by: Miaohe Lin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

cxgb4: use eth_zero_addr() to clear mac address

Use eth_zero_addr() to clear mac address insetad of memset().

Signed-off-by: Miaohe Lin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'mptcp-non-backup-subflows-pre-reqs'

Paolo Abeni says:

====================
mptcp: non backup subflows pre-reqs

This series contains a bunch of MPTCP improvements loosely related to
concurrent subflows xmit usage, currently under development.

The first 3 patches are actually bugfixes for issues that will become apparent
as soon as we will enable the above feature.

The later patches improve the handling of incoming additional subflows,
improving significantly the performances in stress tests based on a high new
connection rate.
====================

Signed-off-by: David S. Miller <[email protected]>

subflow: introduce and use mptcp_can_accept_new_subflow()

So that we can easily perform some basic PM-related
adimission checks before creating the child socket.

Reviewed-by: Mat Martineau <[email protected]>
Tested-by: Christoph Paasch <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

subflow: use rsk_ops->send_reset()

tcp_send_active_reset() is more prone to transient errors
(memory allocation or xmit queue full): in stress conditions
the kernel may drop the egress packet, and the client will be
stuck.

Reviewed-by: Mat Martineau <[email protected]>
Tested-by: Christoph Paasch <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

subflow: explicitly check for plain tcp rsk

When syncookie are in use, the TCP stack may feed into
subflow_syn_recv_sock() plain TCP request sockets. We can't
access mptcp_subflow_request_sock-specific fields on such
sockets. Explicitly check the rsk ops to do safe accesses.

Reviewed-by: Mat Martineau <[email protected]>
Tested-by: Christoph Paasch <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mptcp: cleanup subflow_finish_connect()

The mentioned function has several unneeded branches,
handle each case - MP_CAPABLE, MP_JOIN, fallback -
under a single conditional and drop quite a bit of
duplicate code.

Reviewed-by: Mat Martineau <[email protected]>
Tested-by: Christoph Paasch <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mptcp: explicitly track the fully established status

Currently accepted msk sockets become established only after
accept() returns the new sk to user-space.

As MP_JOIN request are refused as per RFC spec on non fully
established socket, the above causes mp_join self-tests
instabilities.

This change lets the msk entering the established status
as soon as it receives the 3rd ack and propagates the first
subflow fully established status on the msk socket.

Finally we can change the subflow acceptance condition to
take in account both the sock state and the msk fully
established flag.

Reviewed-by: Mat Martineau <[email protected]>
Tested-by: Christoph Paasch <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mptcp: mark as fallback even early ones

In the unlikely event of a failure at connect time,
we currently clear the request_mptcp flag - so that
the MPC handshake is not started at all, but the msk
is not explicitly marked as fallback.

This would lead to later insertion of wrong DSS options
in the xmitted packets, in violation of RFC specs and
possibly fooling the peer.

Fixes: e1ff9e82e2ea ("net: mptcp: improve fallback to TCP")
Reviewed-by: Mat Martineau <[email protected]>
Tested-by: Christoph Paasch <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mptcp: avoid data corruption on reinsert

When updating a partially acked data fragment, we
actually corrupt it. This is irrelevant till we send
data on a single subflow, as retransmitted data, if
any are discarded by the peer as duplicate, but it
will cause data corruption as soon as we will start
creating non backup subflows.

Reviewed-by: Mat Martineau <[email protected]>
Tested-by: Christoph Paasch <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

subflow: always init 'rel_write_seq'

Currently we do not init the subflow write sequence for
MP_JOIN subflows. This will cause bad mapping being
generated as soon as we will use non backup subflow.

Reviewed-by: Mat Martineau <[email protected]>
Tested-by: Christoph Paasch <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

dm integrity: fix integrity recalculation that is improperly skipped

Commit adc0daad366b62ca1bce3e2958a40b0b71a8b8b3 ("dm: report suspended
device during destroy") broke integrity recalculation.

The problem is dm_suspended() returns true not only during suspend,
but also during resume. So this race condition could occur:
1. dm_integrity_resume calls queue_work(ic->recalc_wq, &ic->recalc_work)
2. integrity_recalc (&ic->recalc_work) preempts the current thread
3. integrity_recalc calls if (unlikely(dm_suspended(ic->ti))) goto unlock_ret;
4. integrity_recalc exits and no recalculating is done.

To fix this race condition, add a function dm_post_suspending that is
only true during the postsuspend phase and use it instead of
dm_suspended().

Signed-off-by: Mikulas Patocka <mpatocka redhat com>
Fixes: adc0daad366b ("dm: report suspended device during destroy")
Cc: stable vger kernel org # v4.18+
Signed-off-by: Mike Snitzer <[email protected]>

sfc: convert to new udp_tunnel infrastructure

Check MC_CMD_DRV_ATTACH_EXT_OUT_FLAG_TRUSTED, before setting
the info, which will hopefully protect us from -EPERM errors
the previous code was gracefully ignoring. Ed reports this
is not the 100% correct bit, but it's the best approximation
we have. Shared code reports the port information back to user
space, so we really want to know what was added and what failed.
Ignoring -EPERM is not an option.

The driver does not call udp_tunnel_get_rx_info(), so its own
management of table state is not really all that problematic,
we can leave it be. This allows the driver to continue with its
copious table syncing, and matching the ports to TX frames,
which it will reportedly do one day.

Leave the feature checking in the callbacks, as the device may
remove the capabilities on reset.

Inline the loop from __efx_ef10_udp_tnl_lookup_port() into
efx_ef10_udp_tnl_has_port(), since it's the only caller now.

With new infra this driver gains port replace - when space frees
up in a full table a new port will be selected for offload.
Plus efx will no longer sleep in an atomic context.

v2:
- amend the commit message about TRUSTED not being 100%
- add TUNNEL_ENCAP_UDP_PORT_ENTRY_INVALID to mark unsed
entries

Signed-off-by: Jakub Kicinski <[email protected]>
Acked-By: Edward Cree <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

io_uring: missed req_init_async() for IOSQE_ASYNC

IOSQE_ASYNC branch of io_queue_sqe() is another place where an
unitialised req->work can be accessed (i.e. prior io_req_init_async()).
Nothing really bad though, it just looses IO_WQ_WORK_CONCURRENT flag.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

arm64: vdso32: Fix '--prefix=' value for newer versions of clang

Newer versions of clang only look for $(COMPAT_GCC_TOOLCHAIN_DIR)as [1],
rather than $(COMPAT_GCC_TOOLCHAIN_DIR)$(CROSS_COMPILE_COMPAT)as,
resulting in the following build error:

$ make -skj"$(nproc)" ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- \
CROSS_COMPILE_COMPAT=arm-linux-gnueabi- LLVM=1 O=out/aarch64 distclean \
defconfig arch/arm64/kernel/vdso32/
...
/home/nathan/cbl/toolchains/llvm-binutils/bin/as: unrecognized option '-EL'
clang-12: error: assembler command failed with exit code 1 (use -v to see invocation)
make[3]: *** [arch/arm64/kernel/vdso32/Makefile:181: arch/arm64/kernel/vdso32/note.o] Error 1
...

Adding the value of CROSS_COMPILE_COMPAT (adding notdir to account for a
full path for CROSS_COMPILE_COMPAT) fixes this issue, which matches the
solution done for the main Makefile [2].

[1]: https://github.com/llvm/llvm-project/commit/3452a0d8c17f7166f479706b293caf6ac76ffd90
[2]: https://lore.kernel.org/lkml/20200721173125.1273884 [email protected]/

Signed-off-by: Nathan Chancellor <[email protected]>
Cc: [email protected]
Link: https://github.com/ClangBuiltLinux/linux/issues/1099
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>

Merge tag 'amd-drm-fixes-5.8-2020-07-22' of git://people.freedesktop.org/~agd5f/linux into drm-fixes

amd-drm-fixes-5.8-2020-07-22:

amdgpu:
- Fix crash when overclocking VegaM
- Fix possible crash when editing dpm levels

Signed-off-by: Dave Airlie <[email protected]>
From: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

Merge tag 'drm-misc-fixes-2020-07-22' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

* sun4i: Fix inverted HPD result; fixes an earlier fix
* lima: fix timeout during reset

Signed-off-by: Dave Airlie <[email protected]>
From: Thomas Zimmermann <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/20200722070321.GA29190@linux-uq9g

cxgb4: add missing release on skb in uld_send()

In the implementation of uld_send(), the skb is consumed on all
execution paths except one. Release skb when returning NET_XMIT_DROP.

Signed-off-by: Navid Emamdoost <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'qed-qede-improve-chain-API-and-add-XDP_REDIRECT-support'

Alexander Lobakin says:

====================
qed, qede: improve chain API and add XDP_REDIRECT support

This series adds missing XDP_REDIRECT case handling in QLogic Everest
Ethernet driver with all necessary prerequisites and ops.
QEDE Tx relies heavily on chain API, so make sure it is in its best
at first.

v2 (from [1]):
- add missing includes to #003 to pass the build on Alpha;
- no functional changes.

[1] https://lore.kernel.org/netdev/20200722155349 [email protected]/
====================

Signed-off-by: David S. Miller <[email protected]>

qede: add .ndo_xdp_xmit() and XDP_REDIRECT support

Add XDP_REDIRECT case handling and the corresponding NDO to support
redirecting XDP frames. This also includes registering driver memory
model (currently order-0 page mode) in BPF subsystem.
The total number of XDP queues is usually 1:1 with Rx ones.

Signed-off-by: Alexander Lobakin <[email protected]>
Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: Michal Kalderon <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qede: refactor XDP Tx processing

Current XDP Tx logic is suboptimal and can't be reused for XDP_REDIRECT
path.
Make qede_xdp_{tx_int,xmit}() more universal and effective in general to
allow future expanding.

Misc: use unlikely() hints where appropriate and replace "fallthrough"
comments with pseudo-keywords.

Signed-off-by: Alexander Lobakin <[email protected]>
Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: Michal Kalderon <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qede: reformat net_device_ops declarations

Correct the indentation of net_device_ops declarations for fancier look.

Signed-off-by: Alexander Lobakin <[email protected]>
Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: Michal Kalderon <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qede: reformat several structures in "qede.h"

Make the file more readable and easier for adding new fields.

Misc: use IFNAMSIZ and netdev_name() instead of sizeof_field()
and direct net_device::name dereferencing.

Signed-off-by: Alexander Lobakin <[email protected]>
Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: Michal Kalderon <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qed: introduce qed_chain_get_elem_used{,u32}()

Add reverse-variants of qed_chain_get_elem_left{,u32}() to be able to
know current chain occupation. They will be used in the upcoming qede
XDP_REDIRECT code.
They share most of the logics with the mentioned ones, so were reused
to collapse the latters.

Signed-off-by: Alexander Lobakin <[email protected]>
Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: Michal Kalderon <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qed: optimize common chain accessors

Constify chain pointers and refactor qed_chain_get_elem_left{,u32}() a bit.

Signed-off-by: Alexander Lobakin <[email protected]>
Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: Michal Kalderon <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qed: add support for different page sizes for chains

Extend current infrastructure to store chain page size in a struct
and use it in all functions instead of fixed QED_CHAIN_PAGE_SIZE.
Its value remains the default one, but can be overridden in
qed_chain_init_params before chain allocation.

Signed-off-by: Alexander Lobakin <[email protected]>
Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: Michal Kalderon <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qed: simplify chain allocation with init params struct

To simplify qed_chain_alloc() prototype and call sites, introduce struct
qed_chain_init_params to specify chain params, and pass a pointer to
filled struct to the actual qed_chain_alloc() instead of a long list
of separate arguments.

Signed-off-by: Alexander Lobakin <[email protected]>
Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: Michal Kalderon <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

qed: simplify initialization of the chains with an external PBL

Fill PBL table parameters for chains with an external PBL data earlier on
qed_chain_init_params() rather than on allocation itself. This simplifies
allocation code and allows to extend struct ext_pbl for other chain types.

Signed-off-by: Alexander Lobakin <[email protected]>
Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: Michal Kalderon <[email protected]>
Signed-off-by: David S. Miller <[email protected]>