Git Repo - linux.git/log

i2c: virtio: disable timeout handling

If a timeout is hit, it can result is incorrect data on the I2C bus
and/or memory corruptions in the guest since the device can still be
operating on the buffers it was given while the guest has freed them.

Here is, for example, the start of a slub_debug splat which was
triggered on the next transfer after one transfer was forced to timeout
by setting a breakpoint in the backend (rust-vmm/vhost-device):

BUG kmalloc-1k (Not tainted): Poison overwritten
First byte 0x1 instead of 0x6b
Allocated in virtio_i2c_xfer+0x65/0x35c age=350 cpu=0 pid=29
__kmalloc+0xc2/0x1c9
virtio_i2c_xfer+0x65/0x35c
__i2c_transfer+0x429/0x57d
i2c_transfer+0x115/0x134
i2cdev_ioctl_rdwr+0x16a/0x1de
i2cdev_ioctl+0x247/0x2ed
vfs_ioctl+0x21/0x30
sys_ioctl+0xb18/0xb41
Freed in virtio_i2c_xfer+0x32e/0x35c age=244 cpu=0 pid=29
kfree+0x1bd/0x1cc
virtio_i2c_xfer+0x32e/0x35c
__i2c_transfer+0x429/0x57d
i2c_transfer+0x115/0x134
i2cdev_ioctl_rdwr+0x16a/0x1de
i2cdev_ioctl+0x247/0x2ed
vfs_ioctl+0x21/0x30
sys_ioctl+0xb18/0xb41

There is no simple fix for this (the driver would have to always create
bounce buffers and hold on to them until the device eventually returns
the buffers), so just disable the timeout support for now.

Fixes: 3cfc88380413d20f ("i2c: virtio: add a virtio i2c frontend driver")
Acked-by: Jie Deng <[email protected]>
Signed-off-by: Vincent Whitchurch <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Viresh Kumar <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

USB: serial: pl2303: fix GC type detection

At least some PL2303GC have a bcdDevice of 0x105 instead of 0x100 as the
datasheet claims. Add it to the list of known release numbers for the
HXN (G) type.

Note the chip type could only be determined indirectly based on its
package being of QFP type, which appears to only be available for
PL2303GC.

Fixes: 894758d0571d ("USB: serial: pl2303: tighten type HXN (G) detection")
Cc: [email protected] # 5.13
Reported-by: Anton Lundin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Johan Hovold <[email protected]>

i2c: i801: Fix interrupt storm from SMB_ALERT signal

Currently interrupt storm will occur from i2c-i801 after first
transaction if SMB_ALERT signal is enabled and ever asserted. It is
enough if the signal is asserted once even before the driver is loaded
and does not recover because that interrupt is not acknowledged.

This fix aims to fix it by two ways:
- Add acknowledging for the SMB_ALERT interrupt status
- Disable the SMB_ALERT interrupt on platforms where possible since the
driver currently does not make use for it

Acknowledging resets the SMB_ALERT interrupt status on all platforms and
also should help to avoid interrupt storm on older platforms where the
SMB_ALERT interrupt disabling is not available.

For simplicity this fix reuses the host notify feature for disabling and
restoring original register value.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=177311
Reported-by: [email protected]
Reported-by: [email protected]
Signed-off-by: Jarkko Nikula <[email protected]>
Reviewed-by: Jean Delvare <[email protected]>
Tested-by: Jean Delvare <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

i2c: i801: Restore INTREN on unload

If driver interrupts are enabled, SMBHSTCNT_INTREN will be 1 after
the first transaction, and will stay to that value forever. This
means that interrupts will be generated for both host-initiated
transactions and also SMBus Alert events even after the driver is
unloaded. To be on the safe side, we should restore the initial state
of this bit at suspend and reboot time, as we do for several other
configuration bits already and for the same reason: the BIOS should
be handed the device in the same configuration state in which we
received it. Otherwise interrupts may be generated which nobody
will process.

Signed-off-by: Jean Delvare <[email protected]>
Tested-by: Jarkko Nikula <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

dt-bindings: i2c: imx-lpi2c: Fix i.MX 8QM compatible matching

The i.MX 8QM DTS files use two compatibles, so update the binding to fix
dtbs_check warnings like:

arch/arm64/boot/dts/freescale/imx8qm-mek.dt.yaml: i2c@5a800000:
compatible: ['fsl,imx8qm-lpi2c', 'fsl,imx7ulp-lpi2c'] is too long

Signed-off-by: Abel Vesa <[email protected]>
Acked-by: Rob Herring <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

perf: Ignore sigtrap for tracepoints destined for other tasks

syzbot reported that the warning in perf_sigtrap() fires, saying that
the event's task does not match current:

| WARNING: CPU: 0 PID: 9090 at kernel/events/core.c:6446 perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513
| Modules linked in:
| CPU: 0 PID: 9090 Comm: syz-executor.1 Not tainted 5.15.0-syzkaller #0
| Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
| RIP: 0010:perf_sigtrap kernel/events/core.c:6446 [inline]
| RIP: 0010:perf_pending_event_disable kernel/events/core.c:6470 [inline]
| RIP: 0010:perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513
| ...
| Call Trace:
|  <IRQ>
|  irq_work_single+0x106/0x220 kernel/irq_work.c:211
|  irq_work_run_list+0x6a/0x90 kernel/irq_work.c:242
|  irq_work_run+0x4f/0xd0 kernel/irq_work.c:251
|  __sysvec_irq_work+0x95/0x3d0 arch/x86/kernel/irq_work.c:22
|  sysvec_irq_work+0x8e/0xc0 arch/x86/kernel/irq_work.c:17
|  </IRQ>
|  <TASK>
|  asm_sysvec_irq_work+0x12/0x20 arch/x86/include/asm/idtentry.h:664
| RIP: 0010:__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline]
| RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70 kernel/locking/spinlock.c:194
| ...
|  coredump_task_exit kernel/exit.c:371 [inline]
|  do_exit+0x1865/0x25c0 kernel/exit.c:771
|  do_group_exit+0xe7/0x290 kernel/exit.c:929
|  get_signal+0x3b0/0x1ce0 kernel/signal.c:2820
|  arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
|  handle_signal_work kernel/entry/common.c:148 [inline]
|  exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
|  exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
|  __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
|  syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
|  do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
|  entry_SYSCALL_64_after_hwframe+0x44/0xae

On x86 this shouldn't happen, which has arch_irq_work_raise().

The test program sets up a perf event with sigtrap set to fire on the
'sched_wakeup' tracepoint, which fired in ttwu_do_wakeup().

This happened because the 'sched_wakeup' tracepoint also takes a task
argument passed on to perf_tp_event(), which is used to deliver the
event to that other task.

Since we cannot deliver synchronous signals to other tasks, skip an event if
perf_tp_event() is targeted at another task and perf_event_attr::sigtrap is
set, which will avoid ever entering perf_sigtrap() for such events.

Fixes: 97ba62b27867 ("perf: Add support for SIGTRAP on perf events")
Reported-by: [email protected]
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/YYpoCOBmC/[email protected]

locking/rwsem: Optimize down_read_trylock() under highly contended case

We found that a process with 10 thousnads threads has been encountered
a regression problem from Linux-v4.14 to Linux-v5.4. It is a kind of
workload which will concurrently allocate lots of memory in different
threads sometimes. In this case, we will see the down_read_trylock()
with a high hotspot. Therefore, we suppose that rwsem has a regression
at least since Linux-v5.4. In order to easily debug this problem, we
write a simply benchmark to create the similar situation lile the
following.

  ```c++
  #include <sys/mman.h>
  #include <sys/time.h>
  #include <sys/resource.h>
  #include <sched.h>

  #include <cstdio>
  #include <cassert>
  #include <thread>
  #include <vector>
  #include <chrono>

  volatile int mutex;

  void trigger(int cpu, char* ptr, std::size_t sz)
  {
   cpu_set_t set;
   CPU_ZERO(&set);
   CPU_SET(cpu, &set);
   assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);

   while (mutex);

   for (std::size_t i = 0; i < sz; i += 4096) {
   *ptr = '\0';
   ptr += 4096;
   }
  }

  int main(int argc, char* argv[])
  {
   std::size_t sz = 100;

   if (argc > 1)
   sz = atoi(argv[1]);

   auto nproc = std::thread::hardware_concurrency();
   std::vector<std::thread> thr;
   sz <<= 30;
   auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_ANON |
MAP_PRIVATE, -1, 0);
   assert(ptr != MAP_FAILED);
   char* cptr = static_cast<char*>(ptr);
   auto run = sz / nproc;
   run = (run >> 12) << 12;

   mutex = 1;

   for (auto i = 0U; i < nproc; ++i) {
   thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); }));
   cptr += run;
   }

   rusage usage_start;
   getrusage(RUSAGE_SELF, &usage_start);
   auto start = std::chrono::system_clock::now();

   mutex = 0;

   for (auto& t : thr)
   t.join();

   rusage usage_end;
   getrusage(RUSAGE_SELF, &usage_end);
   auto end = std::chrono::system_clock::now();
   timeval utime;
   timeval stime;
   timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
   timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
   printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
   printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
   printf("real: %lu\n",
          std::chrono::duration_cast<std::chrono::milliseconds>(end -
          start).count());

   return 0;
  }
  ```

The functionality of above program is simply which creates `nproc`
threads and each of them are trying to touch memory (trigger page
fault) on different CPU. Then we will see the similar profile by
`perf top`.

  25.55%  [kernel]                  [k] down_read_trylock
  14.78%  [kernel]                  [k] handle_mm_fault
  13.45%  [kernel]                  [k] up_read
   8.61%  [kernel]                  [k] clear_page_erms
   3.89%  [kernel]                  [k] __do_page_fault

The highest hot instruction, which accounts for about 92%, in
down_read_trylock() is cmpxchg like the following.

  91.89 │      lock   cmpxchg %rdx,(%rdi)

Sice the problem is found by migrating from Linux-v4.14 to Linux-v5.4,
so we easily found that the commit ddb20d1d3aed ("locking/rwsem: Optimize
down_read_trylock()") caused the regression. The reason is that the
commit assumes the rwsem is not contended at all. But it is not always
true for mmap lock which could be contended with thousands threads.
So most threads almost need to run at least 2 times of "cmpxchg" to
acquire the lock. The overhead of atomic operation is higher than
non-atomic instructions, which caused the regression.

By using the above benchmark, the real executing time on a x86-64 system
before and after the patch were:

                  Before Patch  After Patch
   # of Threads      real          real     reduced by
   ------------     ------        ------    ----------
         1          65,373        65,206       ~0.0%
         4          15,467        15,378       ~0.5%
        40           6,214         5,528      ~11.0%

For the uncontended case, the new down_read_trylock() is the same as
before. For the contended cases, the new down_read_trylock() is faster
than before. The more contended, the more fast.

Signed-off-by: Muchun Song <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Waiman Long <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

locking/rwsem: Make handoff bit handling more consistent

There are some inconsistency in the way that the handoff bit is being
handled in readers and writers that lead to a race condition.

Firstly, when a queue head writer set the handoff bit, it will clear
it when the writer is being killed or interrupted on its way out
without acquiring the lock. That is not the case for a queue head
reader. The handoff bit will simply be inherited by the next waiter.

Secondly, in the out_nolock path of rwsem_down_read_slowpath(), both
the waiter and handoff bits are cleared if the wait queue becomes
empty. For rwsem_down_write_slowpath(), however, the handoff bit is
not checked and cleared if the wait queue is empty. This can
potentially make the handoff bit set with empty wait queue.

Worse, the situation in rwsem_down_write_slowpath() relies on wstate,
a variable set outside of the critical section containing the ->count
manipulation, this leads to race condition where RWSEM_FLAG_HANDOFF
can be double subtracted, corrupting ->count.

To make the handoff bit handling more consistent and robust, extract
out handoff bit clearing code into the new rwsem_del_waiter() helper
function. Also, completely eradicate wstate; always evaluate
everything inside the same critical section.

The common function will only use atomic_long_andnot() to clear bits
when the wait queue is empty to avoid possible race condition. If the
first waiter with handoff bit set is killed or interrupted to exit the
slowpath without acquiring the lock, the next waiter will inherit the
handoff bit.

While at it, simplify the trylock for loop in
rwsem_down_write_slowpath() to make it easier to read.

Fixes: 4f23dbc1e657 ("locking/rwsem: Implement lock handoff to prevent lock starvation")
Reported-by: Zhenhua Ma <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

erofs: fix deadlock when shrink erofs slab

We observed the following deadlock in the stress test under low
memory scenario:

Thread A                               Thread B
- erofs_shrink_scan
- erofs_try_to_release_workgroup
  - erofs_workgroup_try_to_freeze -- A
                                       - z_erofs_do_read_page
                                        - z_erofs_collection_begin
                                         - z_erofs_register_collection
                                          - erofs_insert_workgroup
                                           - xa_lock(&sbi->managed_pslots) -- B
                                           - erofs_workgroup_get
                                            - erofs_wait_on_workgroup_freezed -- A
  - xa_erase
   - xa_lock(&sbi->managed_pslots) -- B

To fix this, it needs to hold xa_lock before freezing the workgroup
since xarray will be touched then. So let's hold the lock before
accessing each workgroup, just like what we did with the radix tree
before.

[ Gao Xiang: Jianhua Hao also reports this issue at
  https://lore.kernel.org/r/b10b85df30694bac8aadfe43537c897a@xiaomi.com ]

Link: https://lore.kernel.org/r/[email protected]
Fixes: 64094a04414f ("erofs: convert workstn to XArray")
Reviewed-by: Chao Yu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Signed-off-by: Huang Jianan <[email protected]>
Reported-by: Jianhua Hao <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>

scsi: scsi_debug: Zero clear zones at reset write pointer

When a reset is requested the position of the write pointer is updated but
the data in the corresponding zone is not cleared. Instead scsi_debug
returns any data written before the write pointer was reset. This is an
error and prevents using scsi_debug for stale page cache testing of the
BLKRESETZONE ioctl.

Zero written data in the zone when resetting the write pointer.

Link: https://lore.kernel.org/r/[email protected]
Fixes: f0d1cf9378bd ("scsi: scsi_debug: Add ZBC zone commands")
Reviewed-by: Damien Le Moal <[email protected]>
Acked-by: Douglas Gilbert <[email protected]>
Signed-off-by: Shin'ichiro Kawasaki <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: core: sysfs: Fix setting device state to SDEV_RUNNING

This fixes an issue added in commit 4edd8cd4e86d ("scsi: core: sysfs: Fix
hang when device state is set via sysfs") where if userspace is requesting
to set the device state to SDEV_RUNNING when the state is already
SDEV_RUNNING, we return -EINVAL instead of count. The commmit above set ret
to count for this case, when it should have set it to 0.

Link: https://lore.kernel.org/r/[email protected]
Fixes: 4edd8cd4e86d ("scsi: core: sysfs: Fix hang when device state is set via sysfs")
Reviewed-by: Lee Duncan <[email protected]>
Signed-off-by: Mike Christie <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

scsi: scsi_debug: Sanity check block descriptor length in resp_mode_select()

In resp_mode_select() sanity check the block descriptor len to avoid UAF.

BUG: KASAN: use-after-free in resp_mode_select+0xa4c/0xb40 drivers/scsi/scsi_debug.c:2509
Read of size 1 at addr ffff888026670f50 by task scsicmd/15032

CPU: 1 PID: 15032 Comm: scsicmd Not tainted 5.15.0-01d0625 #15
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Call Trace:
<TASK>
dump_stack_lvl+0x89/0xb5 lib/dump_stack.c:107
print_address_description.constprop.9+0x28/0x160 mm/kasan/report.c:257
kasan_report.cold.14+0x7d/0x117 mm/kasan/report.c:443
__asan_report_load1_noabort+0x14/0x20 mm/kasan/report_generic.c:306
resp_mode_select+0xa4c/0xb40 drivers/scsi/scsi_debug.c:2509
schedule_resp+0x4af/0x1a10 drivers/scsi/scsi_debug.c:5483
scsi_debug_queuecommand+0x8c9/0x1e70 drivers/scsi/scsi_debug.c:7537
scsi_queue_rq+0x16b4/0x2d10 drivers/scsi/scsi_lib.c:1521
blk_mq_dispatch_rq_list+0xb9b/0x2700 block/blk-mq.c:1640
__blk_mq_sched_dispatch_requests+0x28f/0x590 block/blk-mq-sched.c:325
blk_mq_sched_dispatch_requests+0x105/0x190 block/blk-mq-sched.c:358
__blk_mq_run_hw_queue+0xe5/0x150 block/blk-mq.c:1762
__blk_mq_delay_run_hw_queue+0x4f8/0x5c0 block/blk-mq.c:1839
blk_mq_run_hw_queue+0x18d/0x350 block/blk-mq.c:1891
blk_mq_sched_insert_request+0x3db/0x4e0 block/blk-mq-sched.c:474
blk_execute_rq_nowait+0x16b/0x1c0 block/blk-exec.c:63
sg_common_write.isra.18+0xeb3/0x2000 drivers/scsi/sg.c:837
sg_new_write.isra.19+0x570/0x8c0 drivers/scsi/sg.c:775
sg_ioctl_common+0x14d6/0x2710 drivers/scsi/sg.c:941
sg_ioctl+0xa2/0x180 drivers/scsi/sg.c:1166
__x64_sys_ioctl+0x19d/0x220 fs/ioctl.c:52
do_syscall_64+0x3a/0x80 arch/x86/entry/common.c:50
entry_SYSCALL_64_after_hwframe+0x44/0xae arch/x86/entry/entry_64.S:113

Link: https://lore.kernel.org/r/[email protected]
Reported-by: syzkaller <[email protected]>
Acked-by: Douglas Gilbert <[email protected]>
Signed-off-by: George Kennedy <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

io_uring: correct link-list traversal locking

As io_remove_next_linked() is now under ->timeout_lock (see
io_link_timeout_fn), we should update locking around io_for_each_link()
and io_match_task() to use the new lock.

Cc: [email protected] # 5.15+
Fixes: 89850fce16a1a ("io_uring: run timeouts from task_work")
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/b54541cedf7de59cb5ae36109e58529ca16e66aa.1637631883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

block: avoid to touch unloaded module instance when opening bdev

disk->fops->owner is grabbed in blkdev_get_no_open() after the disk
kobject refcount is increased. This way can't make sure that
disk->fops->owner is still alive since del_gendisk() still can move
on if the kobject refcount of disk is grabbed by open() and
disk->fops->open() isn't called yet.

Fixes the issue by moving try_module_get() into blkdev_get_by_dev()
with ->open_mutex() held, then we can drain the in-progress open()
in del_gendisk(). Meantime new open() won't succeed because disk
becomes not alive.

This way is reasonable because blkdev_get_no_open() needn't to touch
disk->fops or defined callbacks.

Cc: Christoph Hellwig <[email protected]>
Cc: [email protected]
Signed-off-by: Ming Lei <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

Merge tag 'media/v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media fixes from Mauro Carvalho Chehab:

- fix VIDIOC_DQEVENT ioctl handling for 32-bit userspace with a 64-bit
   kernel

- regression fix for videobuf2 core

- fix for CEC core when handling non-block transmit

- hi846: fix a clang warning

* tag 'media/v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  media: hi846: remove the of_match_ptr macro
  media: hi846: include property.h instead of of_graph.h
  media: cec: copy sequence field for the reply
  media: videobuf2-dma-sg: Fix buf->vb NULL pointer dereference
  media: v4l2-core: fix VIDIOC_DQEVENT handling on non-x86

SUNRPC: use different lock keys for INET6 and LOCAL

xprtsock.c reclassifies sock locks based on the protocol.
However there are 3 protocols and only 2 classification keys.
The same key is used for both INET6 and LOCAL.

This causes lockdep complaints. The complaints started since Commit
ea9afca88bbe ("SUNRPC: Replace use of socket sk_callback_lock with
sock_lock") which resulted in the sock locks beings used more.

So add another key, and renumber them slightly.

Fixes: ea9afca88bbe ("SUNRPC: Replace use of socket sk_callback_lock with sock_lock")
Fixes: 176e21ee2ec8 ("SUNRPC: Support for RPC over AF_LOCAL transports")
Signed-off-by: NeilBrown <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>

hugetlbfs: flush before unlock on move_hugetlb_page_tables()

We must flush the TLB before releasing i_mmap_rwsem to avoid the
potential reuse of an unshared PMDs page. This is not true in the case
of move_hugetlb_page_tables(). The last reference on the page table can
therefore be dropped before the TLB flush took place.

Prevent it by reordering the operations and flushing the TLB before
releasing i_mmap_rwsem.

Fixes: 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed vma")
Signed-off-by: Nadav Amit <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Mina Almasry <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlbfs: flush TLBs correctly after huge_pmd_unshare

When __unmap_hugepage_range() calls to huge_pmd_unshare() succeed, a TLB
flush is missing. This TLB flush must be performed before releasing the
i_mmap_rwsem, in order to prevent an unshared PMDs page from being
released and reused before the TLB flush took place.

Arguably, a comprehensive solution would use mmu_gather interface to
batch the TLB flushes and the PMDs page release, however it is not an
easy solution: (1) try_to_unmap_one() and try_to_migrate_one() also call
huge_pmd_unshare() and they cannot use the mmu_gather interface; and (2)
deferring the release of the page reference for the PMDs page until
after i_mmap_rwsem is dropeed can confuse huge_pmd_unshare() into
thinking PMDs are shared when they are not.

Fix __unmap_hugepage_range() by adding the missing TLB flush, and
forcing a flush when unshare is successful.

Fixes: 24669e58477e ("hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages)" # 3.6
Signed-off-by: Nadav Amit <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ice: avoid bpf_prog refcount underflow

Ice driver has the routines for managing XDP resources that are shared
between ndo_bpf op and VSI rebuild flow. The latter takes place for
example when user changes queue count on an interface via ethtool's
set_channels().

There is an issue around the bpf_prog refcounting when VSI is being
rebuilt - since ice_prepare_xdp_rings() is called with vsi->xdp_prog as
an argument that is used later on by ice_vsi_assign_bpf_prog(), same
bpf_prog pointers are swapped with each other. Then it is also
interpreted as an 'old_prog' which in turn causes us to call
bpf_prog_put on it that will decrement its refcount.

Below splat can be interpreted in a way that due to zero refcount of a
bpf_prog it is wiped out from the system while kernel still tries to
refer to it:

[  481.069429] BUG: unable to handle page fault for address: ffffc9000640f038
[  481.077390] #PF: supervisor read access in kernel mode
[  481.083335] #PF: error_code(0x0000) - not-present page
[  481.089276] PGD 100000067 P4D 100000067 PUD 1001cb067 PMD 106d2b067 PTE 0
[  481.097141] Oops: 0000 [#1] PREEMPT SMP PTI
[  481.101980] CPU: 12 PID: 3339 Comm: sudo Tainted: G           OE     5.15.0-rc5+ #1
[  481.110840] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016
[  481.122021] RIP: 0010:dev_xdp_prog_id+0x25/0x40
[  481.127265] Code: 80 00 00 00 00 0f 1f 44 00 00 89 f6 48 c1 e6 04 48 01 fe 48 8b 86 98 08 00 00 48 85 c0 74 13 48 8b 50 18 31 c0 48 85 d2 74 07 <48> 8b 42 38 8b 40 20 c3 48 8b 96 90 08 00 00 eb e8 66 2e 0f 1f 84
[  481.148991] RSP: 0018:ffffc90007b63868 EFLAGS: 00010286
[  481.155034] RAX: 0000000000000000 RBX: ffff889080824000 RCX: 0000000000000000
[  481.163278] RDX: ffffc9000640f000 RSI: ffff889080824010 RDI: ffff889080824000
[  481.171527] RBP: ffff888107af7d00 R08: 0000000000000000 R09: ffff88810db5f6e0
[  481.179776] R10: 0000000000000000 R11: ffff8890885b9988 R12: ffff88810db5f4bc
[  481.188026] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  481.196276] FS:  00007f5466d5bec0(0000) GS:ffff88903fb00000(0000) knlGS:0000000000000000
[  481.205633] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  481.212279] CR2: ffffc9000640f038 CR3: 000000014429c006 CR4: 00000000003706e0
[  481.220530] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  481.228771] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  481.237029] Call Trace:
[  481.239856]  rtnl_fill_ifinfo+0x768/0x12e0
[  481.244602]  rtnl_dump_ifinfo+0x525/0x650
[  481.249246]  ? __alloc_skb+0xa5/0x280
[  481.253484]  netlink_dump+0x168/0x3c0
[  481.257725]  netlink_recvmsg+0x21e/0x3e0
[  481.262263]  ____sys_recvmsg+0x87/0x170
[  481.266707]  ? __might_fault+0x20/0x30
[  481.271046]  ? _copy_from_user+0x66/0xa0
[  481.275591]  ? iovec_from_user+0xf6/0x1c0
[  481.280226]  ___sys_recvmsg+0x82/0x100
[  481.284566]  ? sock_sendmsg+0x5e/0x60
[  481.288791]  ? __sys_sendto+0xee/0x150
[  481.293129]  __sys_recvmsg+0x56/0xa0
[  481.297267]  do_syscall_64+0x3b/0xc0
[  481.301395]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  481.307238] RIP: 0033:0x7f5466f39617
[  481.311373] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb bd 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2f 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
[  481.342944] RSP: 002b:00007ffedc7f4308 EFLAGS: 00000246 ORIG_RAX: 000000000000002f
[  481.361783] RAX: ffffffffffffffda RBX: 00007ffedc7f5460 RCX: 00007f5466f39617
[  481.380278] RDX: 0000000000000000 RSI: 00007ffedc7f5360 RDI: 0000000000000003
[  481.398500] RBP: 00007ffedc7f53f0 R08: 0000000000000000 R09: 000055d556f04d50
[  481.416463] R10: 0000000000000077 R11: 0000000000000246 R12: 00007ffedc7f5360
[  481.434131] R13: 00007ffedc7f5350 R14: 00007ffedc7f5344 R15: 0000000000000e98
[  481.451520] Modules linked in: ice(OE) af_packet binfmt_misc nls_iso8859_1 ipmi_ssif intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp mxm_wmi mei_me coretemp mei ipmi_si ipmi_msghandler wmi acpi_pad acpi_power_meter ip_tables x_tables autofs4 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel ahci crypto_simd cryptd libahci lpc_ich [last unloaded: ice]
[  481.528558] CR2: ffffc9000640f038
[  481.542041] ---[ end trace d1f24c9ecf5b61c1 ]---

Fix this by only calling ice_vsi_assign_bpf_prog() inside
ice_prepare_xdp_rings() when current vsi->xdp_prog pointer is NULL.
This way set_channels() flow will not attempt to swap the vsi->xdp_prog
pointers with itself.

Also, sprinkle around some comments that provide a reasoning about
correlation between driver and kernel in terms of bpf_prog refcount.

Fixes: efc2214b6047 ("ice: Add support for XDP")
Reviewed-by: Alexander Lobakin <[email protected]>
Signed-off-by: Marta Plantykow <[email protected]>
Co-developed-by: Maciej Fijalkowski <[email protected]>
Signed-off-by: Maciej Fijalkowski <[email protected]>
Tested-by: Kiran Bhandare <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

ice: fix vsi->txq_map sizing

The approach of having XDP queue per CPU regardless of user's setting
exposed a hidden bug that could occur in case when Rx queue count differ
from Tx queue count. Currently vsi->txq_map's size is equal to the
doubled vsi->alloc_txq, which is not correct due to the fact that XDP
rings were previously based on the Rx queue count. Below splat can be
seen when ethtool -L is used and XDP rings are configured:

[  682.875339] BUG: kernel NULL pointer dereference, address: 000000000000000f
[  682.883403] #PF: supervisor read access in kernel mode
[  682.889345] #PF: error_code(0x0000) - not-present page
[  682.895289] PGD 0 P4D 0
[  682.898218] Oops: 0000 [#1] PREEMPT SMP PTI
[  682.903055] CPU: 42 PID: 2878 Comm: ethtool Tainted: G           OE     5.15.0-rc5+ #1
[  682.912214] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016
[  682.923380] RIP: 0010:devres_remove+0x44/0x130
[  682.928527] Code: 49 89 f4 55 48 89 fd 4c 89 ff 53 48 83 ec 10 e8 92 b9 49 00 48 8b 9d a8 02 00 00 48 8d 8d a0 02 00 00 49 89 c2 48 39 cb 74 0f <4c> 3b 63 10 74 25 48 8b 5b 08 48 39 cb 75 f1 4c 89 ff 4c 89 d6 e8
[  682.950237] RSP: 0018:ffffc90006a679f0 EFLAGS: 00010002
[  682.956285] RAX: 0000000000000286 RBX: ffffffffffffffff RCX: ffff88908343a370
[  682.964538] RDX: 0000000000000001 RSI: ffffffff81690d60 RDI: 0000000000000000
[  682.972789] RBP: ffff88908343a0d0 R08: 0000000000000000 R09: 0000000000000000
[  682.981040] R10: 0000000000000286 R11: 3fffffffffffffff R12: ffffffff81690d60
[  682.989282] R13: ffffffff81690a00 R14: ffff8890819807a8 R15: ffff88908343a36c
[  682.997535] FS:  00007f08c7bfa740(0000) GS:ffff88a03fd00000(0000) knlGS:0000000000000000
[  683.006910] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  683.013557] CR2: 000000000000000f CR3: 0000001080a66003 CR4: 00000000003706e0
[  683.021819] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  683.030075] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  683.038336] Call Trace:
[  683.041167]  devm_kfree+0x33/0x50
[  683.045004]  ice_vsi_free_arrays+0x5e/0xc0 [ice]
[  683.050380]  ice_vsi_rebuild+0x4c8/0x750 [ice]
[  683.055543]  ice_vsi_recfg_qs+0x9a/0x110 [ice]
[  683.060697]  ice_set_channels+0x14f/0x290 [ice]
[  683.065962]  ethnl_set_channels+0x333/0x3f0
[  683.070807]  genl_family_rcv_msg_doit+0xea/0x150
[  683.076152]  genl_rcv_msg+0xde/0x1d0
[  683.080289]  ? channels_prepare_data+0x60/0x60
[  683.085432]  ? genl_get_cmd+0xd0/0xd0
[  683.089667]  netlink_rcv_skb+0x50/0xf0
[  683.094006]  genl_rcv+0x24/0x40
[  683.097638]  netlink_unicast+0x239/0x340
[  683.102177]  netlink_sendmsg+0x22e/0x470
[  683.106717]  sock_sendmsg+0x5e/0x60
[  683.110756]  __sys_sendto+0xee/0x150
[  683.114894]  ? handle_mm_fault+0xd0/0x2a0
[  683.119535]  ? do_user_addr_fault+0x1f3/0x690
[  683.134173]  __x64_sys_sendto+0x25/0x30
[  683.148231]  do_syscall_64+0x3b/0xc0
[  683.161992]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Fix this by taking into account the value that num_possible_cpus()
yields in addition to vsi->alloc_txq instead of doubling the latter.

Fixes: efc2214b6047 ("ice: Add support for XDP")
Fixes: 22bf877e528f ("ice: introduce XDP_TX fallback path")
Reviewed-by: Alexander Lobakin <[email protected]>
Signed-off-by: Maciej Fijalkowski <[email protected]>
Tested-by: Kiran Bhandare <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

Merge branch 'nh-group-refcnt'

Nikolay Aleksandrov says:

====================
net: nexthop: fix refcount issues when replacing groups

This set fixes a refcount bug when replacing nexthop groups and
modifying routes. It is complex because the objects look valid when
debugging memory dumps, but we end up having refcount dependency between
unlinked objects which can never be released, so in turn they cannot
free their resources and refcounts. The problem happens because we can
have stale IPv6 per-cpu dsts in nexthops which were removed from a
group. Even though the IPv6 gen is bumped, the dsts won't be released
until traffic passes through them or the nexthop is freed, that can take
arbitrarily long time, and even worse we can create a scenario[1] where it
can never be released. The fix is to release the IPv6 per-cpu dsts of
replaced nexthops after an RCU grace period so no new ones can be
created. To do that we add a new IPv6 stub - fib6_nh_release_dsts, which
is used by the nexthop code only when necessary. We can further optimize
group replacement, but that is more suited for net-next as these patches
would have to be backported to stable releases.

v2: patch 02: update commit msg
    patch 03: check for mausezahn before testing and make a few comments
              more verbose

[1]
This info is also present in patch 02's commit message.
Initial state:
$ ip nexthop list
  id 200 via 2002:db8::2 dev bridge.10 scope link onlink
  id 201 via 2002:db8::3 dev bridge scope link onlink
  id 203 group 201/200
$ ip -6 route
  2001:db8::10 nhid 203 metric 1024 pref medium
     nexthop via 2002:db8::3 dev bridge weight 1 onlink
     nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink

Create rt6_info through one of the multipath legs, e.g.:
$ taskset -a -c 1  ./pkt_inj 24 bridge.10 2001:db8::10
(pkt_inj is just a custom packet generator, nothing special)

Then remove that leg from the group by replace (let's assume it is id
200 in this case):
$ ip nexthop replace id 203 group 201

Now remove the IPv6 route:
$ ip -6 route del 2001:db8::10/128

The route won't be really deleted due to the stale rt6_info holding 1
refcnt in nexthop id 200.
At this point we have the following reference count dependency:
(deleted) IPv6 route holds 1 reference over nhid 203
nh 203 holds 1 ref over id 201
nh 200 holds 1 ref over the net device and the route due to the stale
rt6_info

Now to create circular dependency between nh 200 and the IPv6 route, and
also to get a reference over nh 200, restore nhid 200 in the group:
$ ip nexthop replace id 203 group 201/200

And now we have a permanent circular dependncy because nhid 203 holds a
reference over nh 200 and 201, but the route holds a ref over nh 203 and
is deleted.

To trigger the bug just delete the group (nhid 203):
$ ip nexthop del id 203

It won't really be deleted due to the IPv6 route dependency, and now we
have 2 unlinked and deleted objects that reference each other: the group
and the IPv6 route. Since the group drops the reference it holds over its
entries at free time (i.e. its own refcount needs to drop to 0) that will
never happen and we get a permanent ref on them, since one of the entries
holds a reference over the IPv6 route it will also never be released.

At this point the dependencies are:
(deleted, only unlinked) IPv6 route holds reference over group nh 203
(deleted, only unlinked) group nh 203 holds reference over nh 201 and 200
nh 200 holds 1 ref over the net device and the route due to the stale
rt6_info

This is the last point where it can be fixed by running traffic through
nh 200, and specifically through the same CPU so the rt6_info (dst) will
get released due to the IPv6 genid, that in turn will free the IPv6
route, which in turn will free the ref count over the group nh 203.

If nh 200 is deleted at this point, it will never be released due to the
ref from the unlinked group 203, it will only be unlinked:
$ ip nexthop del id 200
$ ip nexthop
$

Now we can never release that stale rt6_info, we have IPv6 route with ref
over group nh 203, group nh 203 with ref over nh 200 and 201, nh 200 with
rt6_info (dst) with ref over the net device and the IPv6 route. All of
these objects are only unlinked, and cannot be released, thus they can't
release their ref counts.

Message from syslogd@dev at Nov 19 14:04:10 ...
  kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
Message from syslogd@dev at Nov 19 14:04:20 ...
  kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3

====================

Signed-off-by: David S. Miller <[email protected]>

selftests: net: fib_nexthops: add test for group refcount imbalance bug

The new selftest runs a sequence which causes circular refcount
dependency between deleted objects which cannot be released and results
in a netdevice refcount imbalance.

Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group

When replacing a nexthop group, we must release the IPv6 per-cpu dsts of
the removed nexthop entries after an RCU grace period because they
contain references to the nexthop's net device and to the fib6 info.
With specific series of events[1] we can reach net device refcount
imbalance which is unrecoverable. IPv4 is not affected because dsts
don't take a refcount on the route.

[1]
$ ip nexthop list
  id 200 via 2002:db8::2 dev bridge.10 scope link onlink
  id 201 via 2002:db8::3 dev bridge scope link onlink
  id 203 group 201/200
$ ip -6 route
  2001:db8::10 nhid 203 metric 1024 pref medium
     nexthop via 2002:db8::3 dev bridge weight 1 onlink
     nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink

Create rt6_info through one of the multipath legs, e.g.:
$ taskset -a -c 1  ./pkt_inj 24 bridge.10 2001:db8::10
(pkt_inj is just a custom packet generator, nothing special)

Then remove that leg from the group by replace (let's assume it is id
200 in this case):
$ ip nexthop replace id 203 group 201

Now remove the IPv6 route:
$ ip -6 route del 2001:db8::10/128

The route won't be really deleted due to the stale rt6_info holding 1
refcnt in nexthop id 200.
At this point we have the following reference count dependency:
(deleted) IPv6 route holds 1 reference over nhid 203
nh 203 holds 1 ref over id 201
nh 200 holds 1 ref over the net device and the route due to the stale
rt6_info

Now to create circular dependency between nh 200 and the IPv6 route, and
also to get a reference over nh 200, restore nhid 200 in the group:
$ ip nexthop replace id 203 group 201/200

And now we have a permanent circular dependncy because nhid 203 holds a
reference over nh 200 and 201, but the route holds a ref over nh 203 and
is deleted.

To trigger the bug just delete the group (nhid 203):
$ ip nexthop del id 203

It won't really be deleted due to the IPv6 route dependency, and now we
have 2 unlinked and deleted objects that reference each other: the group
and the IPv6 route. Since the group drops the reference it holds over its
entries at free time (i.e. its own refcount needs to drop to 0) that will
never happen and we get a permanent ref on them, since one of the entries
holds a reference over the IPv6 route it will also never be released.

At this point the dependencies are:
(deleted, only unlinked) IPv6 route holds reference over group nh 203
(deleted, only unlinked) group nh 203 holds reference over nh 201 and 200
nh 200 holds 1 ref over the net device and the route due to the stale
rt6_info

This is the last point where it can be fixed by running traffic through
nh 200, and specifically through the same CPU so the rt6_info (dst) will
get released due to the IPv6 genid, that in turn will free the IPv6
route, which in turn will free the ref count over the group nh 203.

If nh 200 is deleted at this point, it will never be released due to the
ref from the unlinked group 203, it will only be unlinked:
$ ip nexthop del id 200
$ ip nexthop
$

Now we can never release that stale rt6_info, we have IPv6 route with ref
over group nh 203, group nh 203 with ref over nh 200 and 201, nh 200 with
rt6_info (dst) with ref over the net device and the IPv6 route. All of
these objects are only unlinked, and cannot be released, thus they can't
release their ref counts.

Message from syslogd@dev at Nov 19 14:04:10 ...
  kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
Message from syslogd@dev at Nov 19 14:04:20 ...
  kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3

Fixes: 7bf4796dd099 ("nexthops: add support for replace")
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ipv6: add fib6_nh_release_dsts stub

We need a way to release a fib6_nh's per-cpu dsts when replacing
nexthops otherwise we can end up with stale per-cpu dsts which hold net
device references, so add a new IPv6 stub called fib6_nh_release_dsts.
It must be used after an RCU grace period, so no new dsts can be created
through a group's nexthop entry.
Similar to fib6_nh_release it shouldn't be used if fib6_nh_init has failed
so it doesn't need a dummy stub when IPv6 is not enabled.

Fixes: 7bf4796dd099 ("nexthops: add support for replace")
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net, neigh: Fix crash in v6 module initialization error path

When IPv6 module gets initialized, but it's hitting an error in inet6_init()
where it then needs to undo all the prior initialization work, it also might
do a call to ndisc_cleanup() which then calls neigh_table_clear(). In there
is a missing timer cancellation of the table's managed_work item.

The kernel test robot explicitly triggered this error path and caused a UAF
crash similar to the below:

  [...]
  [   28.833183][    C0] BUG: unable to handle page fault for address: f7a43288
  [   28.833973][    C0] #PF: supervisor write access in kernel mode
  [   28.834660][    C0] #PF: error_code(0x0002) - not-present page
  [   28.835319][    C0] *pde = 06b2c067 *pte = 00000000
  [   28.835853][    C0] Oops: 0002 [#1] PREEMPT
  [   28.836367][    C0] CPU: 0 PID: 303 Comm: sed Not tainted 5.16.0-rc1-00233-g83ff5faa0d3b #7
  [   28.837293][    C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014
  [   28.838338][    C0] EIP: __run_timers.constprop.0+0x82/0x440
  [...]
  [   28.845607][    C0] Call Trace:
  [   28.845942][    C0]  <SOFTIRQ>
  [   28.846333][    C0]  ? check_preemption_disabled.isra.0+0x2a/0x80
  [   28.846975][    C0]  ? __this_cpu_preempt_check+0x8/0xa
  [   28.847570][    C0]  run_timer_softirq+0xd/0x40
  [   28.848050][    C0]  __do_softirq+0xf5/0x576
  [   28.848547][    C0]  ? __softirqentry_text_start+0x10/0x10
  [   28.849127][    C0]  do_softirq_own_stack+0x2b/0x40
  [   28.849749][    C0]  </SOFTIRQ>
  [   28.850087][    C0]  irq_exit_rcu+0x7d/0xc0
  [   28.850587][    C0]  common_interrupt+0x2a/0x40
  [   28.851068][    C0]  asm_common_interrupt+0x119/0x120
  [...]

Note that IPv6 module cannot be unloaded as per 8ce440610357 ("ipv6: do not
allow ipv6 module to be removed") hence this can only be seen during module
initialization error. Tested with kernel test robot's reproducer.

Fixes: 7482e3841d52 ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Cc: Li Zhijian <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

nixge: fix mac address error handling again

The change to eth_hw_addr_set() caused gcc to correctly spot a
bug that was introduced in an earlier incorrect fix:

In file included from include/linux/etherdevice.h:21,
                 from drivers/net/ethernet/ni/nixge.c:7:
In function '__dev_addr_set',
    inlined from 'eth_hw_addr_set' at include/linux/etherdevice.h:319:2,
    inlined from 'nixge_probe' at drivers/net/ethernet/ni/nixge.c:1286:3:
include/linux/netdevice.h:4648:9: error: 'memcpy' reading 6 bytes from a region of size 0 [-Werror=stringop-overread]
4648 |         memcpy(dev->dev_addr, addr, len);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As nixge_get_nvmem_address() can return either NULL or an error
pointer, the NULL check is wrong, and we can end up reading from
ERR_PTR(-EOPNOTSUPP), which gcc knows to contain zero readable
bytes.

Make the function always return an error pointer again but fix
the check to match that.

Fixes: f3956ebb3bf0 ("ethernet: use eth_hw_addr_set() instead of ether_addr_copy()")
Fixes: abcd3d6fc640 ("net: nixge: Fix error path for obtaining mac address")
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/smc: Avoid warning of possible recursive locking

Possible recursive locking is detected by lockdep when SMC
falls back to TCP. The corresponding warnings are as follows:

============================================
WARNING: possible recursive locking detected
5.16.0-rc1+ #18 Tainted: G            E
--------------------------------------------
wrk/1391 is trying to acquire lock:
ffff975246c8e7d8 (&ei->socket.wq.wait){..-.}-{3:3}, at: smc_switch_to_fallback+0x109/0x250 [smc]

but task is already holding lock:
ffff975246c8f918 (&ei->socket.wq.wait){..-.}-{3:3}, at: smc_switch_to_fallback+0xfe/0x250 [smc]

other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&ei->socket.wq.wait);
   lock(&ei->socket.wq.wait);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

2 locks held by wrk/1391:
  #0: ffff975246040130 (sk_lock-AF_SMC){+.+.}-{0:0}, at: smc_connect+0x43/0x150 [smc]
  #1: ffff975246c8f918 (&ei->socket.wq.wait){..-.}-{3:3}, at: smc_switch_to_fallback+0xfe/0x250 [smc]

stack backtrace:
Call Trace:
  <TASK>
  dump_stack_lvl+0x56/0x7b
  __lock_acquire+0x951/0x11f0
  lock_acquire+0x27a/0x320
  ? smc_switch_to_fallback+0x109/0x250 [smc]
  ? smc_switch_to_fallback+0xfe/0x250 [smc]
  _raw_spin_lock_irq+0x3b/0x80
  ? smc_switch_to_fallback+0x109/0x250 [smc]
  smc_switch_to_fallback+0x109/0x250 [smc]
  smc_connect_fallback+0xe/0x30 [smc]
  __smc_connect+0xcf/0x1090 [smc]
  ? mark_held_locks+0x61/0x80
  ? __local_bh_enable_ip+0x77/0xe0
  ? lockdep_hardirqs_on+0xbf/0x130
  ? smc_connect+0x12a/0x150 [smc]
  smc_connect+0x12a/0x150 [smc]
  __sys_connect+0x8a/0xc0
  ? syscall_enter_from_user_mode+0x20/0x70
  __x64_sys_connect+0x16/0x20
  do_syscall_64+0x34/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae

The nested locking in smc_switch_to_fallback() is considered to
possibly cause a deadlock because smc_wait->lock and clc_wait->lock
are the same type of lock. But actually it is safe so far since
there is no other place trying to obtain smc_wait->lock when
clc_wait->lock is held. So the patch replaces spin_lock() with
spin_lock_nested() to avoid false report by lockdep.

Link: https://lkml.org/lkml/2021/11/19/962
Fixes: 2153bd1e3d3d ("Transfer remaining wait queue entries during fallback")
Reported-by: [email protected]
Signed-off-by: Wen Gu <[email protected]>
Acked-by: Tony Lu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

vsock/virtio: suppress used length validation

It turns out that vhost vsock violates the virtio spec
by supplying the out buffer length in the used length
(should just be the in length).
As a result, attempts to validate the used length fail with:
vmw_vsock_virtio_transport virtio1: tx: used len 44 is larger than in buflen 0

Since vsock driver does not use the length fox tx and
validates the length before use for rx, it is safe to
suppress the validation in virtio core for this driver.

Reported-by: Halil Pasic <[email protected]>
Fixes: 939779f5152d ("virtio_ring: validate used buffer length")
Cc: "Jason Wang" <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ax88796c: do not receive data in pointer

Function axspi_read_status calls:

    ret = spi_write_then_read(ax_spi->spi, ax_spi->cmd_buf, 1,
                              (u8 *)&status, 3);

status is a pointer to a struct spi_status, which is 3-byte wide:

    struct spi_status {
        u16 isr;
        u8 status;
    };

But &status is the pointer to this pointer, and spi_write_then_read does
not dereference this parameter:

    int spi_write_then_read(struct spi_device *spi,
                            const void *txbuf, unsigned n_tx,
                            void *rxbuf, unsigned n_rx)

Therefore axspi_read_status currently receive a SPI response in the
pointer status, which overwrites 24 bits of the pointer.

Thankfully, on Little-Endian systems, the pointer is only used in

    le16_to_cpus(&status->isr);

... which is a no-operation. So there, the overwritten pointer is not
dereferenced. Nevertheless on Big-Endian systems, this can lead to
dereferencing pointers after their 24 most significant bits were
overwritten. And in all systems this leads to possible use of
uninitialized value in functions calling spi_write_then_read which
expect status to be initialized when the function returns.

Moreover function axspi_read_status (and macro AX_READ_STATUS) do not
seem to be used anywhere. So currently this seems to be dead code. Fix
the issue anyway so that future code works properly when using function
axspi_read_status.

Fixes: a97c69ba4f30 ("net: ax88796c: ASIX AX88796C SPI Ethernet Adapter Driver")
Signed-off-by: Nicolas Iooss <[email protected]>
Acked-by: Łukasz Stelmach <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: retain PTP clock time during SIOCSHWTSTAMP ioctls

Currently, when user space emits SIOCSHWTSTAMP ioctl calls such as
enabling/disabling timestamping or changing filter settings, the driver
reads the current CLOCK_REALTIME value and programming this into the
NIC's hardware clock. This might be necessary during system
initialization, but at runtime, when the PTP clock has already been
synchronized to a grandmaster, a reset of the timestamp settings might
result in a clock jump. Furthermore, if the clock is also controlled by
phc2sys in automatic mode (where the UTC offset is queried from ptp4l),
that UTC-to-TAI offset (currently 37 seconds in 2021) would be
temporarily reset to 0, and it would take a long time for phc2sys to
readjust so that CLOCK_REALTIME and the PHC are apart by 37 seconds
again.

To address the issue, we introduce a new function called
stmmac_init_tstamp_counter(), which gets called during ndo_open().
It contains the code snippet moved from stmmac_hwtstamp_set() that
manages the time synchronization. Besides, the sub second increment
configuration is also moved here since the related values are hardware
dependent and runtime invariant.

Furthermore, the hardware clock must be kept running even when no time
stamping mode is selected in order to retain the synchronized time base.
That way, timestamping can be enabled again at any time only with the
need to compensate the clock's natural drifting.

As a side effect, this patch fixes the issue that ptp_clock_info::enable
can be called before SIOCSHWTSTAMP and the driver (which looks at
priv->systime_flags) was not prepared to handle that ordering.

Fixes: 92ba6888510c ("stmmac: add the support for PTP hw clock driver")
Reported-by: Michael Olbrich <[email protected]>
Signed-off-by: Ahmad Fatoum <[email protected]>
Signed-off-by: Holger Assmann <[email protected]>
Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

MAINTAINERS: Add entry to MAINTAINERS for Milbeaut

Add entry to MAINTAINERS for Milbeaut that supported minimal drivers.

Signed-off-by: Sugaya Taichi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]'
Signed-off-by: Arnd Bergmann <[email protected]>

nfp: checking parameter process for rx-usecs/tx-usecs is invalid

Use nn->tlv_caps.me_freq_mhz instead of nn->me_freq_mhz to check whether
rx-usecs/tx-usecs is valid.

This is because nn->tlv_caps.me_freq_mhz represents the clock_freq (MHz) of
the flow processing cores (FPC) on the NIC. While nn->me_freq_mhz is not
be set.

Fixes: ce991ab6662a ("nfp: read ME frequency from vNIC ctrl memory")
Signed-off-by: Diana Wang <[email protected]>
Signed-off-by: Simon Horman <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv6: fix typos in __ip6_finish_output()

We deal with IPv6 packets, so we need to use IP6CB(skb)->flags and
IP6SKB_REROUTED, instead of IPCB(skb)->flags and IPSKB_REROUTED

Found by code inspection, please double check that fixing this bug
does not surface other bugs.

Fixes: 09ee9dba9611 ("ipv6: Reinject IPv6 packets if IPsec policy matches after SNAT")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Tobias Brunner <[email protected]>
Cc: Steffen Klassert <[email protected]>
Cc: David Ahern <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Tested-by: Tobias Brunner <[email protected]>
Acked-by: Tobias Brunner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

selftests/tc-testings: Be compatible with newer tc output

old tc(iproute2-5.9.0) output:
action order 1: bpf action.o:[action-ok] id 60 tag bcf7977d3b93787c jited default-action pipe
newer tc(iproute2-5.14.0) output:
action order 1: bpf action.o:[action-ok] id 64 name tag bcf7977d3b93787c jited default-action pipe

It can fix below errors:
# ok 260 f84a - Add cBPF action with invalid bytecode
# not ok 261 e939 - Add eBPF action with valid object-file
#       Could not match regex pattern. Verify command output:
# total acts 0
#
#       action order 1: bpf action.o:[action-ok] id 42 name  tag bcf7977d3b93787c jited default-action pipe
#        index 667 ref 1 bind 0

Signed-off-by: Li Zhijian <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

selftests/tc-testing: match any qdisc type

We should not always presume all kernels use pfifo_fast as the default qdisc.

For example, a fq_codel qdisk could have below output:
qdisc fq_codel 0: parent 1:4 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64

Reported-by: kernel test robot <[email protected]>
Suggested-by: Peilin Ye <[email protected]>
Signed-off-by: Li Zhijian <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: qca8k: fix MTU calculation

qca8k has a global MTU, so its tracking the MTU per port to make sure
that the largest MTU gets applied.
Since it uses the frame size instead of MTU the driver MTU change function
will then add the size of Ethernet header and checksum on top of MTU.

The driver currently populates the per port MTU size as Ethernet frame
length + checksum which equals 1518.

The issue is that then MTU change function will go through all of the
ports, find the largest MTU and apply the Ethernet header + checksum on
top of it again, so for a desired MTU of 1500 you will end up with 1536.

This is obviously incorrect, so to correct it populate the per port struct
MTU with just the MTU and not include the Ethernet header + checksum size
as those will be added by the MTU change function.

Fixes: f58d2598cf70 ("net: dsa: qca8k: implement the port MTU callbacks")
Signed-off-by: Robert Marko <[email protected]>
Signed-off-by: Ansuel Smith <[email protected]>
Reviewed-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: qca8k: fix internal delay applied to the wrong PAD config

With SGMII phy the internal delay is always applied to the PAD0 config.
This is caused by the falling edge configuration that hardcode the reg
to PAD0 (as the falling edge bits are present only in PAD0 reg)
Move the delay configuration before the reg overwrite to correctly apply
the delay.

Fixes: cef08115846e ("net: dsa: qca8k: set internal delay also for sgmii")
Signed-off-by: Ansuel Smith <[email protected]>
Reviewed-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

firmware: smccc: Fix check for ARCH_SOC_ID not implemented

The ARCH_FEATURES function ID is a 32-bit SMC call, which returns
a 32-bit result per the SMCCC spec. Current code is doing a 64-bit
comparison against -1 (SMCCC_RET_NOT_SUPPORTED) to detect that the
feature is unimplemented. That check doesn't work in a Hyper-V VM,
where the upper 32-bits are zero as allowed by the spec.

Cast the result as an 'int' so the comparison works. The change also
makes the code consistent with other similar checks in this file.

Fixes: 821b67fa4639 ("firmware: smccc: Add ARCH_SOC_ID support")
Signed-off-by: Michael Kelley <[email protected]>
Reviewed-by: Sudeep Holla <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>

Merge tag 'socfpga_fix_for_v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux into arm/fixes

SoCFPGA fix for v5.16
- Fix crash when CONFIG_FORTIRY_SOURCE is enabled

* tag 'socfpga_fix_for_v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux:
ARM: socfpga: Fix crash with CONFIG_FORTIRY_SOURCE

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>

Merge tag 'scmi-fixes-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into arm/fixes

Arm SCMI fixes for v5.16

Couple of fixes for sparse warnings(type error assignment in voltage and
sensor protocols), add proper propagation of error from scmi_pm_domain_probe
handling agent discovery response in base protocol correctly and a fix
to avoid null pointer de-reference in the error path.

* tag 'scmi-fixes-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
  firmware: arm_scmi: Fix type error assignment in voltage protocol
  firmware: arm_scmi: Fix type error in sensor protocol
  firmware: arm_scmi: pm: Propagate return value to caller
  firmware: arm_scmi: Fix base agent discover response
  firmware: arm_scmi: Fix null de-reference on error path

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>

Merge tag 'optee-fix-for-v5.16' of git://git.linaro.org/people/jens.wiklander/linux-tee into arm/fixes

Fix possible NULL pointer dereference in OP-TEE driver

* tag 'optee-fix-for-v5.16' of git://git.linaro.org/people/jens.wiklander/linux-tee:
optee: fix kfree NULL pointer

Link: https://lore.kernel.org/r/20211117125747.GA2896197@jade
Signed-off-by: Arnd Bergmann <[email protected]>

Merge tag 'arm-soc/for-5.16/devicetree-fixes' of https://github.com/Broadcom/stblinux into arm/fixes

This pull request contains Broadcom ARM-based SoCs Device Tree fixes for
5.16, please pull the following:

- Florian fixes the BCM5310x DTS include file to have the appropriate
  I2C controller interrupt line, and allows the BCMA GPIO controller to
  be used as an interrupt controller. Finally, the BCM2711 (Raspberry Pi
  4) PCIe Device Tree node interrupts are fixed to list the correct
  interrupt output as well as the INTB/C/D lines.

* tag 'arm-soc/for-5.16/devicetree-fixes' of https://github.com/Broadcom/stblinux:
  ARM: dts: bcm2711: Fix PCIe interrupts
  ARM: dts: BCM5301X: Add interrupt properties to GPIO node
  ARM: dts: BCM5301X: Fix I2C controller interrupt

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>

USB: serial: option: add Telit LE910S1 0x9200 composition

Add the following Telit LE910S1 composition:

0x9200: tty

Signed-off-by: Daniele Palmas <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Cc: [email protected]
Signed-off-by: Johan Hovold <[email protected]>

Revert "parisc: Fix backtrace to always include init funtion names"

This reverts commit 279917e27edc293eb645a25428c6ab3f3bca3f86.

With the CONFIG_HARDENED_USERCOPY option enabled, this patch triggers
kernel bugs at runtime:

  usercopy: Kernel memory overwrite attempt detected to kernel text (offset 2084839, size 6)!
  kernel BUG at mm/usercopy.c:99!
Backtrace:
  IAOQ[0]: usercopy_abort+0xc4/0xe8
  [<00000000406ed1c8>] __check_object_size+0x174/0x238
  [<00000000407086d4>] copy_strings.isra.0+0x3e8/0x708
  [<0000000040709a20>] do_execveat_common.isra.0+0x1bc/0x328
  [<000000004070b760>] compat_sys_execve+0x7c/0xb8
  [<0000000040303eb8>] syscall_exit+0x0/0x14

The problem is, that we have an init section of at least 2MB size which
starts at _stext and is freed after bootup.

If then later some kernel data is (temporarily) stored in this free
memory, check_kernel_text_object() will trigger a bug since the data
appears to be inside the kernel text (>=_stext) area:
        if (overlaps(ptr, len, _stext, _etext))
                usercopy_abort("kernel text");

Signed-off-by: Helge Deller <[email protected]>
Cc: [email protected] # 5.4+

parisc: Convert PTE lookup to use extru_safe() macro

Convert the PTE lookup functions to use the safer extru_safe macro.

Signed-off-by: Helge Deller <[email protected]>

parisc: Fix extraction of hash lock bits in syscall.S

The extru instruction leaves the most significant 32 bits of the target
register in an undefined state on PA 2.0 systems. If any of these bits
are nonzero, this will break the calculation of the lock pointer.

Fix by using extrd,u instruction via extru_safe macro on 64-bit kernels.

Signed-off-by: John David Anglin <[email protected]>
Signed-off-by: Helge Deller <[email protected]>

parisc: Provide an extru_safe() macro to extract unsigned bits

The extru instruction leaves the most significant 32 bits of the
target register in an undefined state on PA 2.0 systems.
Provide a macro to safely use extru on 32- and 64-bit machines.

Suggested-by: John David Anglin <[email protected]>
Signed-off-by: Helge Deller <[email protected]>

parisc: Increase FRAME_WARN to 2048 bytes on parisc

PA-RISC uses a much bigger frame size for functions than other
architectures. So increase it to 2048 for 32- and 64-bit kernels.
This fixes e.g. a warning in lib/xxhash.c.

Reported-by: kernel test robot <[email protected]>
Signed-off-by: Helge Deller <[email protected]>

iomap: Fix inline extent handling in iomap_readpage

Before commit 740499c78408 ("iomap: fix the iomap_readpage_actor return
value for inline data"), when hitting an IOMAP_INLINE extent,
iomap_readpage_actor would report having read the entire page. Since
then, it only reports having read the inline data (iomap->length).

This will force iomap_readpage into another iteration, and the
filesystem will report an unaligned hole after the IOMAP_INLINE extent.
But iomap_readpage_actor (now iomap_readpage_iter) isn't prepared to
deal with unaligned extents, it will get things wrong on filesystems
with a block size smaller than the page size, and we'll eventually run
into the following warning in iomap_iter_advance:

WARN_ON_ONCE(iter->processed > iomap_length(iter));

Fix that by changing iomap_readpage_iter to return 0 when hitting an
inline extent; this will cause iomap_iter to stop immediately.

To fix readahead as well, change iomap_readahead_iter to pass on
iomap_readpage_iter return values less than or equal to zero.

Fixes: 740499c78408 ("iomap: fix the iomap_readpage_actor return value for inline data")
Cc: [email protected] # v5.15+
Signed-off-by: Andreas Gruenbacher <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

Linux 5.16-rc2

Merge tag 'x86-urgent-2021-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:

- Move the command line preparation and the early command line parsing
   earlier so that the command line parameters which affect
   early_reserve_memory(), e.g. efi=nosftreserve, are taken into
   account. This was broken when the invocation of
   early_reserve_memory() was moved recently.

- Use an atomic type for the SGX page accounting, which is read and
   written locklessly, to plug various race conditions related to it.

* tag 'x86-urgent-2021-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sgx: Fix free page accounting
  x86/boot: Pull up cmdline preparation and early param parsing

Merge tag 'perf-urgent-2021-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 perf fixes from Thomas Gleixner:

- Remove unneded PEBS disabling when taking LBR snapshots to prevent an
   unchecked MSR access error.

- Fix IIO event constraints for Snowridge and Skylake server chips.

* tag 'perf-urgent-2021-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/perf: Fix snapshot_branch_stack warning in VM
  perf/x86/intel/uncore: Fix IIO event constraints for Snowridge
  perf/x86/intel/uncore: Fix IIO event constraints for Skylake Server
  perf/x86/intel/uncore: Fix filter_tid mask for CHA events on Skylake Server

Merge tag 'powerpc-5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull more powerpc fixes from Michael Ellerman:

- Fix a bug in copying of sigset_t for 32-bit systems, which caused X
   to not start.

- Fix handling of shared LSIs (rare) with the xive interrupt controller
   (Power9/10).

- Fix missing TOC setup in some KVM code, which could result in oopses
   depending on kernel data layout.

- Fix DMA mapping when we have persistent memory and only one DMA
   window available.

- Fix further problems with STRICT_KERNEL_RWX on 8xx, exposed by a
   recent fix.

- A couple of other minor fixes.

Thanks to Alexey Kardashevskiy, Aneesh Kumar K.V, Cédric Le Goater,
Christian Zigotzky, Christophe Leroy, Daniel Axtens, Finn Thain, Greg
Kurz, Masahiro Yamada, Nicholas Piggin, and Uwe Kleine-König.

* tag 'powerpc-5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/xive: Change IRQ domain to a tree domain
  powerpc/8xx: Fix pinned TLBs with CONFIG_STRICT_KERNEL_RWX
  powerpc/signal32: Fix sigset_t copy
  powerpc/book3e: Fix TLBCAM preset at boot
  powerpc/pseries/ddw: Do not try direct mapping with persistent memory and one window
  powerpc/pseries/ddw: simplify enable_ddw()
  powerpc/pseries/ddw: Revert "Extend upper limit for huge DMA window for persistent memory"
  powerpc/pseries: Fix numa FORM2 parsing fallback code
  powerpc/pseries: rename numa_dist_table to form2_distances
  powerpc: clean vdso32 and vdso64 directories
  powerpc/83xx/mpc8349emitx: Drop unused variable
  KVM: PPC: Book3S HV: Use GLOBAL_TOC for kvmppc_h_set_dabr/xdabr()

pstore/blk: Use "%lu" to format unsigned long

On 32-bit:

    fs/pstore/blk.c: In function ‘__best_effort_init’:
    include/linux/kern_levels.h:5:18: warning: format ‘%zu’ expects argument of type ‘size_t’, but argument 3 has type ‘long unsigned int’ [-Wformat=]
5 | #define KERN_SOH "\001"  /* ASCII Start Of Header */
  |                  ^~~~~~
    include/linux/kern_levels.h:14:19: note: in expansion of macro ‘KERN_SOH’
       14 | #define KERN_INFO KERN_SOH "6" /* informational */
  |                   ^~~~~~~~
    include/linux/printk.h:373:9: note: in expansion of macro ‘KERN_INFO’
      373 |  printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
  |         ^~~~~~~~~
    fs/pstore/blk.c:314:3: note: in expansion of macro ‘pr_info’
      314 |   pr_info("attached %s (%zu) (no dedicated panic_write!)\n",
  |   ^~~~~~~

Cc: [email protected]
Fixes: 7bb9557b48fcabaa ("pstore/blk: Use the normal block device I/O path")
Signed-off-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Cc: Jens Axboe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
"15 patches.

  Subsystems affected by this patch series: ipc, hexagon, mm (swap,
  slab-generic, kmemleak, hugetlb, kasan, damon, and highmem), and proc"

* emailed patches from Andrew Morton <[email protected]>:
  proc/vmcore: fix clearing user buffer by properly using clear_user()
  kmap_local: don't assume kmap PTEs are linear arrays in memory
  mm/damon/dbgfs: fix missed use of damon_dbgfs_lock
  mm/damon/dbgfs: use '__GFP_NOWARN' for user-specified size buffer allocation
  kasan: test: silence intentional read overflow warnings
  hugetlb, userfaultfd: fix reservation restore on userfaultfd error
  hugetlb: fix hugetlb cgroup refcounting during mremap
  mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag
  hexagon: ignore vmlinux.lds
  hexagon: clean up timer-regs.h
  hexagon: export raw I/O routines for modules
  mm: emit the "free" trace report before freeing memory in kmem_cache_free()
  shm: extend forced shm destroy to support objects from several IPC nses
  ipc: WARN if trying to remove ipc object which is absent
  mm/swap.c:put_pages_list(): reinitialise the page list

Merge tag 'block-5.16-2021-11-19' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- Flip a cap check to avoid a selinux error (Alistair)

- Fix for a regression this merge window where we can miss a queue ref
   put (me)

- Un-mark pstore-blk as broken, as the condition that triggered that
   change has been rectified (Kees)

- Queue quiesce and sync fixes (Ming)

- FUA insertion fix (Ming)

- blk-cgroup error path put fix (Yu)

* tag 'block-5.16-2021-11-19' of git://git.kernel.dk/linux-block:
  blk-mq: don't insert FUA request with data into scheduler queue
  blk-cgroup: fix missing put device in error path from blkg_conf_pref()
  block: avoid to quiesce queue in elevator_init_mq
  Revert "mark pstore-blk as broken"
  blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release()
  block: fix missing queue put in error path
  block: Check ADMIN before NICE for IOPRIO_CLASS_RT

Merge tag 'pinctrl-v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

Pull pin control fixes from Linus Walleij:
"There is an ACPI stubs fix which is ACKed by the ACPI maintainer for
  merging through my tree.

  One item stand out and that is that I delete the <linux/sdb.h> header
  that is used by nothing. I deleted this subsystem (through the GPIO
  tree) a while back so I feel responsible for tidying up the floor.

  Other than that it is the usual mistakes, a bit noisy around build
  issue and Kconfig then driver fixes.

  Specifics:

   - Fix some stubs causing compile issues for ACPI.

   - Fix some wakeups on AMD IRQs shared between GPIO and SCI.

   - Fix a build warning in the Tegra driver.

   - Fix a Kconfig issue in the Qualcomm driver.

   - Add a missing include the RALink driver.

   - Return a valid type for the Apple pinctrl IRQs.

   - Implement some Qualcomm SDM845 dual-edge errata.

   - Remove the unused <linux/sdb.h> header. (The subsystem was once
     deleted by the pinctrl maintainer...)

   - Fix a duplicate initialized in the Tegra driver.

   - Fix register offsets for UFS and SDC in the Qualcomm SM8350 driver"

* tag 'pinctrl-v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: qcom: sm8350: Correct UFS and SDC offsets
  pinctrl: tegra194: remove duplicate initializer again
  Remove unused header <linux/sdb.h>
  pinctrl: qcom: sdm845: Enable dual edge errata
  pinctrl: apple: Always return valid type in apple_gpio_irq_type
  pinctrl: ralink: include 'ralink_regs.h' in 'pinctrl-mt7620.c'
  pinctrl: qcom: fix unmet dependencies on GPIOLIB for GPIOLIB_IRQCHIP
  pinctrl: tegra: Return const pointer from tegra_pinctrl_get_group()
  pinctrl: amd: Fix wakeups when IRQ is shared with SCI
  ACPI: Add stubs for wakeup handler functions

Merge tag 's390-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Heiko Carstens:

- Add missing Kconfig option for ftrace direct multi sample, so it can
   be compiled again, and also add s390 support for this sample.

- Update Christian Borntraeger's email address.

- Various fixes for memory layout setup. Besides other this makes it
   possible to load shared DCSS segments again.

- Fix copy to user space of swapped kdump oldmem.

- Remove -mstack-guard and -mstack-size compile options when building
   vdso binaries. This can happen when CONFIG_VMAP_STACK is disabled and
   results in broken vdso code which causes more or less random
   exceptions. Also remove the not needed -nostdlib option.

- Fix memory leak on cpu hotplug and return code handling in kexec
   code.

- Wire up futex_waitv system call.

- Replace snprintf with sysfs_emit where appropriate.

* tag 's390-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  ftrace/samples: add s390 support for ftrace direct multi sample
  ftrace/samples: add missing Kconfig option for ftrace direct multi sample
  MAINTAINERS: update email address of Christian Borntraeger
  s390/kexec: fix memory leak of ipl report buffer
  s390/kexec: fix return code handling
  s390/dump: fix copying to user-space of swapped kdump oldmem
  s390: wire up sys_futex_waitv system call
  s390/vdso: filter out -mstack-guard and -mstack-size
  s390/vdso: remove -nostdlib compiler flag
  s390: replace snprintf in show functions with sysfs_emit
  s390/boot: simplify and fix kernel memory layout setup
  s390/setup: re-arrange memblock setup
  s390/setup: avoid using memblock_enforce_memory_limit
  s390/setup: avoid reserving memory above identity mapping

Merge tag '5.16-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
"Three small cifs/smb3 fixes: two to address minor coverity issues and
  one cleanup"

* tag '5.16-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: introduce cifs_ses_mark_for_reconnect() helper
  cifs: protect srv_count with cifs_tcp_ses_lock
  cifs: move debug print out of spinlock

proc/vmcore: fix clearing user buffer by properly using clear_user()

To clear a user buffer we cannot simply use memset, we have to use
clear_user().  With a virtio-mem device that registers a vmcore_cb and
has some logically unplugged memory inside an added Linux memory block,
I can easily trigger a BUG by copying the vmcore via "cp":

  systemd[1]: Starting Kdump Vmcore Save Service...
  kdump[420]: Kdump is using the default log level(3).
  kdump[453]: saving to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
  kdump[458]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
  kdump[465]: saving vmcore-dmesg.txt complete
  kdump[467]: saving vmcore
  BUG: unable to handle page fault for address: 00007f2374e01000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0003) - permissions violation
  PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 800000007048f867
  Oops: 0003 [#1] PREEMPT SMP NOPTI
  CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014
  RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86
  Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81
  RSP: 0018:ffffc9000073be08 EFLAGS: 00010212
  RAX: 0000000000001000 RBX: 00000000002fd000 RCX: 00007f2374e01000
  RDX: 0000000000000001 RSI: 00000000ffffdfff RDI: 00007f2374e01008
  RBP: 0000000000001000 R08: 0000000000000000 R09: ffffc9000073bc50
  R10: ffffc9000073bc48 R11: ffffffff829461a8 R12: 000000000000f000
  R13: 00007f2374e01000 R14: 0000000000000000 R15: ffff88807bd421e8
  FS:  00007f2374e12140(0000) GS:ffff88807f000000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f2374e01000 CR3: 000000007a4aa000 CR4: 0000000000350eb0
  Call Trace:
   read_vmcore+0x236/0x2c0
   proc_reg_read+0x55/0xa0
   vfs_read+0x95/0x190
   ksys_read+0x4f/0xc0
   do_syscall_64+0x3b/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Some x86-64 CPUs have a CPU feature called "Supervisor Mode Access
Prevention (SMAP)", which is used to detect wrong access from the kernel
to user buffers like this: SMAP triggers a permissions violation on
wrong access.  In the x86-64 variant of clear_user(), SMAP is properly
handled via clac()+stac().

To fix, properly use clear_user() when we're dealing with a user buffer.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 997c136f518c ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages")
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Baoquan He <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Philipp Rudo <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kmap_local: don't assume kmap PTEs are linear arrays in memory

The kmap_local conversion broke the ARM architecture, because the new
code assumes that all PTEs used for creating kmaps form a linear array
in memory, and uses array indexing to look up the kmap PTE belonging to
a certain kmap index.

On ARM, this cannot work, not only because the PTE pages may be
non-adjacent in memory, but also because ARM/!LPAE interleaves hardware
entries and extended entries (carrying software-only bits) in a way that
is not compatible with array indexing.

Fortunately, this only seems to affect configurations with more than 8
CPUs, due to the way the per-CPU kmap slots are organized in memory.

Work around this by permitting an architecture to set a Kconfig symbol
that signifies that the kmap PTEs do not form a lineary array in memory,
and so the only way to locate the appropriate one is to walk the page
tables.

Link: https://lore.kernel.org/linux-arm-kernel/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 2a15ba82fa6c ("ARM: highmem: Switch to generic kmap atomic")
Signed-off-by: Ard Biesheuvel <[email protected]>
Reported-by: Quanyang Wang <[email protected]>
Reviewed-by: Linus Walleij <[email protected]>
Acked-by: Russell King (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/damon/dbgfs: fix missed use of damon_dbgfs_lock

DAMON debugfs is supposed to protect dbgfs_ctxs, dbgfs_nr_ctxs, and
dbgfs_dirs using damon_dbgfs_lock. However, some of the code is
accessing the variables without the protection. This fixes it by
protecting all such accesses.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 75c1c2b53c78 ("mm/damon/dbgfs: support multiple contexts")
Signed-off-by: SeongJae Park <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/damon/dbgfs: use '__GFP_NOWARN' for user-specified size buffer allocation

Patch series "DAMON fixes".

This patch (of 2):

DAMON users can trigger below warning in '__alloc_pages()' by invoking
write() to some DAMON debugfs files with arbitrarily high count
argument, because DAMON debugfs interface allocates some buffers based
on the user-specified 'count'.

        if (unlikely(order >= MAX_ORDER)) {
                WARN_ON_ONCE(!(gfp & __GFP_NOWARN));
                return NULL;
        }

Because the DAMON debugfs interface code checks failure of the
'kmalloc()', this commit simply suppresses the warnings by adding
'__GFP_NOWARN' flag.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 4bc05954d007 ("mm/damon: implement a debugfs-based user space interface")
Signed-off-by: SeongJae Park <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kasan: test: silence intentional read overflow warnings

As done in commit d73dad4eb5ad ("kasan: test: bypass __alloc_size
checks") for __write_overflow warnings, also silence some more cases
that trip the __read_overflow warnings seen in 5.16-rc1[1]:

  In file included from include/linux/string.h:253,
                   from include/linux/bitmap.h:10,
                   from include/linux/cpumask.h:12,
                   from include/linux/mm_types_task.h:14,
                   from include/linux/mm_types.h:5,
                   from include/linux/page-flags.h:13,
                   from arch/arm64/include/asm/mte.h:14,
                   from arch/arm64/include/asm/pgtable.h:12,
                   from include/linux/pgtable.h:6,
                   from include/linux/kasan.h:29,
                   from lib/test_kasan.c:10:
  In function 'memcmp',
      inlined from 'kasan_memcmp' at lib/test_kasan.c:897:2:
  include/linux/fortify-string.h:263:25: error: call to '__read_overflow' declared with attribute error: detected read beyond size of object (1st parameter)
    263 |                         __read_overflow();
        |                         ^~~~~~~~~~~~~~~~~
  In function 'memchr',
      inlined from 'kasan_memchr' at lib/test_kasan.c:872:2:
  include/linux/fortify-string.h:277:17: error: call to '__read_overflow' declared with attribute error: detected read beyond size of object (1st parameter)
    277 |                 __read_overflow();
        |                 ^~~~~~~~~~~~~~~~~

[1] http://kisskb.ellerman.id.au/kisskb/buildresult/14660585/log/

Link: https://lkml.kernel.org/r/[email protected]
Fixes: d73dad4eb5ad ("kasan: test: bypass __alloc_size checks")
Signed-off-by: Kees Cook <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Acked-by: Marco Elver <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb, userfaultfd: fix reservation restore on userfaultfd error

Currently in the is_continue case in hugetlb_mcopy_atomic_pte(), if we
bail out using "goto out_release_unlock;" in the cases where idx >=
size, or !huge_pte_none(), the code will detect that new_pagecache_page
== false, and so call restore_reserve_on_error(). In this case I see
restore_reserve_on_error() delete the reservation, and the following
call to remove_inode_hugepages() will increment h->resv_hugepages
causing a 100% reproducible leak.

We should treat the is_continue case similar to adding a page into the
pagecache and set new_pagecache_page to true, to indicate that there is
no reservation to restore on the error path, and we need not call
restore_reserve_on_error(). Rename new_pagecache_page to
page_in_pagecache to make that clear.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: c7b1850dfb41 ("hugetlb: don't pass page cache pages to restore_reserve_on_error")
Signed-off-by: Mina Almasry <[email protected]>
Reported-by: James Houghton <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb: fix hugetlb cgroup refcounting during mremap

When hugetlb_vm_op_open() is called during copy_vma(), we may take the
reference to resv_map->css. Later, when clearing the reservation
pointer of old_vma after transferring it to new_vma, we forget to drop
the reference to resv_map->css. This leads to a reference leak of css.

Fixes this by adding a check to drop reservation css reference in
clear_vma_resv_huge_pages()

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 550a7d60bd5e35 ("mm, hugepages: add mremap() support for hugepage backed vma")
Signed-off-by: Bui Quang Minh <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Reviewed-by: Mina Almasry <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag

When kmemleak is enabled for SLOB, system does not boot and does not
print anything to the console. At the very early stage in the boot
process we hit infinite recursion from kmemleak_init() and eventually
kernel crashes.

kmemleak_init() specifies SLAB_NOLEAKTRACE for KMEM_CACHE(), but
kmem_cache_create_usercopy() removes it because CACHE_CREATE_MASK is not
valid for SLOB.

Let's fix CACHE_CREATE_MASK and make kmemleak work with SLOB

Link: https://lkml.kernel.org/r/[email protected]
Fixes: d8843922fba4 ("slab: Ignore internal flags in cache creation")
Signed-off-by: Rustam Kovhaev <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Glauber Costa <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hexagon: ignore vmlinux.lds

After building allmodconfig, there is an untracked vmlinux.lds file in
arch/hexagon/kernel:

$ git ls-files . --exclude-standard --others
arch/hexagon/kernel/vmlinux.lds

Ignore it as all other architectures have.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nathan Chancellor <[email protected]>
Cc: Brian Cain <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hexagon: clean up timer-regs.h

When building allmodconfig, there is a warning about TIMER_ENABLE being
redefined:

  drivers/clocksource/timer-oxnas-rps.c:39:9: error: 'TIMER_ENABLE' macro redefined [-Werror,-Wmacro-redefined]
  #define TIMER_ENABLE            BIT(7)
          ^
  arch/hexagon/include/asm/timer-regs.h:13:9: note: previous definition is here
  #define TIMER_ENABLE            0
           ^
  1 error generated.

The values in this header are only used in one file each, if they are
used at all.  Remove the header and sink all of the constants into their
respective files.

TCX0_CLK_RATE is only used in arch/hexagon/include/asm/timex.h

TIMER_ENABLE, RTOS_TIMER_INT, RTOS_TIMER_REGS_ADDR are only used in
arch/hexagon/kernel/time.c.

SLEEP_CLK_RATE and TIMER_CLR_ON_MATCH have both been unused since the
file's introduction in commit 71e4a47f32f4 ("Hexagon: Add time and timer
functions").

TIMER_ENABLE is redefined as BIT(0) so the shift is moved into the
definition, rather than its use.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nathan Chancellor <[email protected]>
Acked-by: Brian Cain <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

hexagon: export raw I/O routines for modules

Patch series "Fixes for ARCH=hexagon allmodconfig", v2.

This series fixes some issues noticed with ARCH=hexagon allmodconfig.

This patch (of 3):

When building ARCH=hexagon allmodconfig, the following errors occur:

  ERROR: modpost: "__raw_readsl" [drivers/i3c/master/svc-i3c-master.ko] undefined!
  ERROR: modpost: "__raw_writesl" [drivers/i3c/master/dw-i3c-master.ko] undefined!
  ERROR: modpost: "__raw_readsl" [drivers/i3c/master/dw-i3c-master.ko] undefined!
  ERROR: modpost: "__raw_writesl" [drivers/i3c/master/i3c-master-cdns.ko] undefined!
  ERROR: modpost: "__raw_readsl" [drivers/i3c/master/i3c-master-cdns.ko] undefined!

Export these symbols so that modules can use them without any errors.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 013bf24c3829 ("Hexagon: Provide basic implementation and/or stubs for I/O routines.")
Signed-off-by: Nathan Chancellor <[email protected]>
Acked-by: Brian Cain <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: emit the "free" trace report before freeing memory in kmem_cache_free()

After the memory is freed, it can be immediately allocated by other
CPUs, before the "free" trace report has been emitted.  This causes
inaccurate traces.

For example, if the following sequence of events occurs:

    CPU 0                 CPU 1

  (1) alloc xxxxxx
  (2) free  xxxxxx
                         (3) alloc xxxxxx
                         (4) free  xxxxxx

Then they will be inaccurately reported via tracing, so that they appear
to have happened in this order:

    CPU 0                 CPU 1

  (1) alloc xxxxxx
                         (2) alloc xxxxxx
  (3) free  xxxxxx
                         (4) free  xxxxxx

This makes it look like CPU 1 somehow managed to allocate memory that
CPU 0 still had allocated for itself.

In order to avoid this, emit the "free xxxxxx" tracing report just
before the actual call to free the memory, instead of just after it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yunfeng Ye <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

shm: extend forced shm destroy to support objects from several IPC nses

Currently, the exit_shm() function not designed to work properly when
task->sysvshm.shm_clist holds shm objects from different IPC namespaces.

This is a real pain when sysctl kernel.shm_rmid_forced = 1, because it
leads to use-after-free (reproducer exists).

This is an attempt to fix the problem by extending exit_shm mechanism to
handle shm's destroy from several IPC ns'es.

To achieve that we do several things:

1. add a namespace (non-refcounted) pointer to the struct shmid_kernel

2. during new shm object creation (newseg()/shmget syscall) we
   initialize this pointer by current task IPC ns

3. exit_shm() fully reworked such that it traverses over all shp's in
   task->sysvshm.shm_clist and gets IPC namespace not from current task
   as it was before but from shp's object itself, then call
   shm_destroy(shp, ns).

Note: We need to be really careful here, because as it was said before
(1), our pointer to IPC ns non-refcnt'ed.  To be on the safe side we
using special helper get_ipc_ns_not_zero() which allows to get IPC ns
refcounter only if IPC ns not in the "state of destruction".

Q/A

Q: Why can we access shp->ns memory using non-refcounted pointer?
A: Because shp object lifetime is always shorther than IPC namespace
   lifetime, so, if we get shp object from the task->sysvshm.shm_clist
   while holding task_lock(task) nobody can steal our namespace.

Q: Does this patch change semantics of unshare/setns/clone syscalls?
A: No. It's just fixes non-covered case when process may leave IPC
   namespace without getting task->sysvshm.shm_clist list cleaned up.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: ab602f79915 ("shm: make exit_shm work proportional to task activity")
Co-developed-by: Manfred Spraul <[email protected]>
Signed-off-by: Manfred Spraul <[email protected]>
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: Vasily Averin <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ipc: WARN if trying to remove ipc object which is absent

Patch series "shm: shm_rmid_forced feature fixes".

Some time ago I met kernel crash after CRIU restore procedure,
fortunately, it was CRIU restore, so, I had dump files and could do
restore many times and crash reproduced easily.  After some
investigation I've constructed the minimal reproducer.  It was found
that it's use-after-free and it happens only if sysctl
kernel.shm_rmid_forced = 1.

The key of the problem is that the exit_shm() function not handles shp's
object destroy when task->sysvshm.shm_clist contains items from
different IPC namespaces.  In most cases this list will contain only
items from one IPC namespace.

How can this list contain object from different namespaces? The
exit_shm() function is designed to clean up this list always when
process leaves IPC namespace.  But we made a mistake a long time ago and
did not add a exit_shm() call into the setns() syscall procedures.

The first idea was just to add this call to setns() syscall but it
obviously changes semantics of setns() syscall and that's
userspace-visible change.  So, I gave up on this idea.

The first real attempt to address the issue was just to omit forced
destroy if we meet shp object not from current task IPC namespace [1].
But that was not the best idea because task->sysvshm.shm_clist was
protected by rwsem which belongs to current task IPC namespace.  It
means that list corruption may occur.

Second approach is just extend exit_shm() to properly handle shp's from
different IPC namespaces [2].  This is really non-trivial thing, I've
put a lot of effort into that but not believed that it's possible to
make it fully safe, clean and clear.

Thanks to the efforts of Manfred Spraul working an elegant solution was
designed.  Thanks a lot, Manfred!

Eric also suggested the way to address the issue in ("[RFC][PATCH] shm:
In shm_exit destroy all created and never attached segments") Eric's
idea was to maintain a list of shm_clists one per IPC namespace, use
lock-less lists.  But there is some extra memory consumption-related
concerns.

An alternative solution which was suggested by me was implemented in
("shm: reset shm_clist on setns but omit forced shm destroy").  The idea
is pretty simple, we add exit_shm() syscall to setns() but DO NOT
destroy shm segments even if sysctl kernel.shm_rmid_forced = 1, we just
clean up the task->sysvshm.shm_clist list.

This chages semantics of setns() syscall a little bit but in comparision
to the "naive" solution when we just add exit_shm() without any special
exclusions this looks like a safer option.

[1] https://lkml.org/lkml/2021/7/6/1108
[2] https://lkml.org/lkml/2021/7/14/736

This patch (of 2):

Let's produce a warning if we trying to remove non-existing IPC object
from IPC namespace kht/idr structures.

This allows us to catch possible bugs when the ipc_rmid() function was
called with inconsistent struct ipc_ids*, struct kern_ipc_perm*
arguments.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Co-developed-by: Manfred Spraul <[email protected]>
Signed-off-by: Manfred Spraul <[email protected]>
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: Vasily Averin <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/swap.c:put_pages_list(): reinitialise the page list

While free_unref_page_list() puts pages onto the CPU local LRU list, it
does not remove them from the list they were passed in on. That makes
the list_head appear to be non-empty, and would lead to various
corruption problems if we didn't have an assertion that the list was
empty.

Reinitialise the list after calling free_unref_page_list() to avoid this
problem.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 988c69f1bc23 ("mm: optimise put_pages_list()")
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Steve French <[email protected]>
Reported-by: Namjae Jeon <[email protected]>
Tested-by: Steve French <[email protected]>
Tested-by: Namjae Jeon <[email protected]>
Cc: Steve French <[email protected]>
Cc: Hyeoncheol Lee <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

af_unix: fix regression in read after shutdown

On kernels before v5.15, calling read() on a unix socket after
shutdown(SHUT_RD) or shutdown(SHUT_RDWR) would return the data
previously written or EOF.  But now, while read() after
shutdown(SHUT_RD) still behaves the same way, read() after
shutdown(SHUT_RDWR) always fails with -EINVAL.

This behaviour change was apparently inadvertently introduced as part of
a bug fix for a different regression caused by the commit adding sockmap
support to af_unix, commit 94531cfcbe79c359 ("af_unix: Add
unix_stream_proto for sockmap").  Those commits, for unclear reasons,
started setting the socket state to TCP_CLOSE on shutdown(SHUT_RDWR),
while this state change had previously only been done in
unix_release_sock().

Restore the original behaviour.  The sockmap tests in
tests/selftests/bpf continue to pass after this patch.

Fixes: d0c6416bd7091647f60 ("unix: Fix an issue in unix_shutdown causing the other end read/write failures")
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Vincent Whitchurch <[email protected]>
Tested-by: Casey Schaufler <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'mptcp-rtx-timer'

Paolo Abeni says:

====================
mptcp: fix 3rd ack rtx timer

Eric noted that the MPTCP code do the wrong thing to schedule
the MPJ 3rd ack timer. He also provided a patch to address the
issues (patch 1/2).

To fix for good the MPJ 3rd ack retransmission timer, we additionally
need to set it after the current ack is transmitted (patch 2/2)

Note that the bug went unnotice so far because all the related
tests required some running data transfer, and that causes
MPTCP-level ack even on the opening MPJ subflow. We now have
explicit packet drill coverage for this code path.
====================

Signed-off-by: David S. Miller <[email protected]>

mptcp: use delegate action to schedule 3rd ack retrans

Scheduling a delack in mptcp_established_options_mp() is
not a good idea: such function is called by tcp_send_ack() and
the pending delayed ack will be cleared shortly after by the
tcp_event_ack_sent() call in __tcp_transmit_skb().

Instead use the mptcp delegated action infrastructure to
schedule the delayed ack after the current bh processing completes.

Additionally moves the schedule_3rdack_retransmission() helper
into protocol.c to avoid making it visible in a different compilation
unit.

Fixes: ec3edaa7ca6ce02f ("mptcp: Add handling of outgoing MP_JOIN requests")
Reviewed-by: Mat Martineau <mathew.j.martineau>@linux.intel.com>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mptcp: fix delack timer

To compute the rtx timeout schedule_3rdack_retransmission() does multiple
things in the wrong way: srtt_us is measured in usec/8 and the timeout
itself is an absolute value.

Fixes: ec3edaa7ca6ce02f ("mptcp: Add handling of outgoing MP_JOIN requests")
Acked-by: Paolo Abeni <[email protected]>
Reviewed-by: Mat Martineau <mathew.j.martineau>@linux.intel.com>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-
queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2021-11-19

This series contains updates to iavf driver only.

Nitesh prevents user from changing interrupt settings when adaptive
interrupt moderation is on.

Jedrzej resolves a hang that occurred when interface was closed while a
reset was occurring and fixes statistics to be updated when requested to
prevent stale values.

Brett adjusts driver to accommodate changes in supported VLAN features
that could occur during reset and cause various errors to be reported.
====================

Signed-off-by: David S. Miller <[email protected]>

ALSA: intel-dsp-config: add quirk for JSL devices based on ES8336 codec

These devices are based on an I2C/I2S device, we need to force the use
of the SOF driver otherwise the legacy HDaudio driver will be loaded -
only HDMI will be supported.

We previously added support for other Intel platforms but missed
JasperLake.

BugLink: https://github.com/thesofproject/linux/issues/3210
Fixes: 9d36ceab9415 ('ALSA: intel-dsp-config: add quirk for APL/GLK/TGL devices based on ES8336 codec')
Signed-off-by: Pierre-Louis Bossart <[email protected]>
Reviewed-by: Kai Vehmanen <[email protected]>
Signed-off-by: Bard Liao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>

xen/pvh: add missing prototype to header

The prototype of mem_map_via_hcall() is missing in its header, so add
it.

Reported-by: kernel test robot <[email protected]>
Fixes: a43fb7da53007e67ad ("xen/pvh: Move Xen code for getting mem map via hcall out of common file")
Signed-off-by: Juergen Gross <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Boris Ostrovsky <[email protected]>
Signed-off-by: Boris Ostrovsky <[email protected]>

Merge tag 'libata-5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata

Pull libata fixes from Damien Le Moal:

- Prevent accesses to unsupported log pages as that causes device scan
   failures with LLDDs using libsas (from me).

- A couple of fixes for AMD AHCI adapters handling of low power modes
   and resume (from Mario).

- Fix a compilation warning (from me).

* tag 'libata-5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
  ata: libata-sata: Declare ata_ncq_sdev_attrs static
  ata: libahci: Adjust behavior when StorageD3Enable _DSD is set
  ata: ahci: Add Green Sardine vendor ID as board_ahci_mobile
  ata: libata: add missing ata_identify_page_supported() calls
  ata: libata: improve ata_read_log_page() error message

Merge tag 'trace-v5.16-6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

- Fix double free in destroy_hist_field

- Harden memset() of trace_iterator structure

- Do not warn in trace printk check when test buffer fills up

* tag 'trace-v5.16-6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Don't use out-of-sync va_list in event printing
  tracing: Use memset_startat() to zero struct trace_iterator
  tracing/histogram: Fix UAF in destroy_hist_field()

selinux: fix NULL-pointer dereference when hashtab allocation fails

When the hash table slot array allocation fails in hashtab_init(),
h->size is left initialized with a non-zero value, but the h->htable
pointer is NULL. This may then cause a NULL pointer dereference, since
the policydb code relies on the assumption that even after a failed
hashtab_init(), hashtab_map() and hashtab_destroy() can be safely called
on it. Yet, these detect an empty hashtab only by looking at the size.

Fix this by making sure that hashtab_init() always leaves behind a valid
empty hashtab when the allocation fails.

Cc: [email protected]
Fixes: 03414a49ad5f ("selinux: do not allocate hashtabs dynamically")
Signed-off-by: Ondrej Mosnacek <[email protected]>
Signed-off-by: Paul Moore <[email protected]>

Merge tag 'perf-tools-fixes-for-v5.16-2021-11-19' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux

Pull perf tools fixes from Arnaldo Carvalho de Melo:

- Fix the 'local_weight', 'weight' (memory access latency),
   'local_ins_lat', 'ins_lat' (instruction latency) and 'pstage_cyc'
   (pipeline stage cycles) sort key sample aggregation.

- Fix 'perf test' entry for watchpoints on s/390.

- Fix branch_stack entry endianness check in the 'perf test' sample
   parsing test.

- Fix ARM SPE handling on 'perf inject'.

- Fix memory leaks detected with ASan.

- Fix build on arm64 related to reallocarray() availability.

- Sync copies of kernel headers: cpufeatures, kvm, MIPS syscalltable
   (futex_waitv).

* tag 'perf-tools-fixes-for-v5.16-2021-11-19' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
  perf evsel: Fix memory leaks relating to unit
  perf report: Fix memory leaks around perf_tip()
  perf hist: Fix memory leak of a perf_hpp_fmt
  tools headers UAPI: Sync MIPS syscall table file changed by new futex_waitv syscall
  tools build: Fix removal of feature-sync-compare-and-swap feature detection
  perf inject: Fix ARM SPE handling
  perf bench: Fix two memory leaks detected with ASan
  perf test sample-parsing: Fix branch_stack entry endianness check
  tools headers UAPI: Sync x86's asm/kvm.h with the kernel sources
  perf sort: Fix the 'p_stage_cyc' sort key behavior
  perf sort: Fix the 'ins_lat' sort key behavior
  perf sort: Fix the 'weight' sort key behavior
  perf tools: Set COMPAT_NEED_REALLOCARRAY for CONFIG_AUXTRACE=1
  perf tests wp: Remove unused functions on s390
  tools headers UAPI: Sync linux/kvm.h with the kernel sources
  tools headers cpufeatures: Sync with the kernel sources

Merge tag 'riscv-for-linus-5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:
"I have two patches for 5.16:

   - allow external modules to be built against read-only source trees

   - turn KVM on in the defconfigs

  The second one isn't technically a fix, but it got tied up pending
  some defconfig cleanups that ended up finding some larger issues. I
  figured it'd be better to get the config changes some more testing,
  but didn't want to hold up turning KVM on for that"

* tag 'riscv-for-linus-5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: fix building external modules
  RISC-V: Enable KVM in RV64 and RV32 defconfigs as a module

Merge branch 'SA_IMMUTABLE-fixes-for-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull exit-vs-signal handling fixes from Eric Biederman:
"This is a small set of changes where debuggers were no longer able to
  intercept synchronous SIGTRAP and SIGSEGV, introduced by the exit
  cleanups.

  This is essentially the change you suggested with all of i's dotted
  and the t's crossed so that ptrace can intercept all of the cases it
  has been able to intercept the past, and all of the cases that made it
  to exit without giving ptrace a chance still don't give ptrace a
  chance"

* 'SA_IMMUTABLE-fixes-for-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  signal: Replace force_fatal_sig with force_exit_sig when in doubt
  signal: Don't always set SA_IMMUTABLE for forced signals

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"Six fixes, five in drivers (ufs, qla2xxx, iscsi) and one core change
  to fix a regression in user space device state setting, which is used
  by the iscsi daemons to effect device recovery"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: qla2xxx: Fix mailbox direction flags in qla2xxx_get_adapter_id()
  scsi: ufs: core: Fix another task management completion race
  scsi: ufs: core: Fix task management completion timeout race
  scsi: core: sysfs: Fix hang when device state is set via sysfs
  scsi: iscsi: Unblock session then wake up error handler
  scsi: ufs: core: Improve SCSI abort handling

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Jason Gunthorpe:
"There are a few big regression items from the merge window suggesting
  that people are testing rc1's but not testing the for-next branches:

   - Warnings fixes

   - Crash in hf1 when creating QPs and setting counters

   - Some old mlx4 cards fail to probe due to missing counters

   - Syzkaller crash in the new counters code"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  MAINTAINERS: Update for VMware PVRDMA driver
  RDMA/nldev: Check stat attribute before accessing it
  RDMA/mlx4: Do not fail the registration on port stats
  IB/hfi1: Properly allocate rdma counter desc memory
  RDMA/core: Set send and receive CQ before forwarding to the driver
  RDMA/netlink: Add __maybe_unused to static inline in C file

Merge tag 'gpio-fixes-for-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

Pull gpio fixes from Bartosz Golaszewski:

- fix a coccicheck warning in gpio-virtio

- fix gpio selftests build issues

- fix a Kconfig issue in gpio-rockchip

* tag 'gpio-fixes-for-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  gpio: rockchip: needs GENERIC_IRQ_CHIP to fix build errors
  selftests: gpio: restore CFLAGS options
  selftests: gpio: fix uninitialised variable warning
  selftests: gpio: fix gpio compiling error
  gpio: virtio: remove unneeded semicolon

Merge tag 'drm-fixes-2021-11-19' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
"This week's fixes, pretty quiet, about right for rc2. amdgpu is the
  bulk of them but the scheduler ones have been reported in a few places
  I think.

  Otherwise just some minor i915 fixes and a few other scattered around:

  scheduler:
   - two refcounting fixes

  cma-helper:
   - use correct free path for noncoherent

  efifb:
   - probing fix

  amdgpu:
   - Better debugging info for SMU msgs
   - Better error reporting when adding IP blocks
   - Fix UVD powergating regression on CZ
   - Clock reporting fix for navi1x
   - OLED panel backlight fix
   - Fix scaling on VGA/DVI for non-DC display code
   - Fix GLFCLK handling for RGP on some APUs
   - fix potential memory leak

  amdkfd:
   - GPU reset fix

  i915:
   - return error handling fix
   - ADL-P display fix
   - TGL DSI display clocks fix

  nouveau:
   - infoframe corruption fix

  sun4i:
   - Kconfig fix"

* tag 'drm-fixes-2021-11-19' of git://anongit.freedesktop.org/drm/drm:
  drm/amd/amdgpu: fix potential memleak
  drm/amd/amdkfd: Fix kernel panic when reset failed and been triggered again
  drm/amd/pm: add GFXCLK/SCLK clocks level print support for APUs
  drm/amdgpu: fix set scaling mode Full/Full aspect/Center not works on vga and dvi connectors
  drm/amd/display: Fix OLED brightness control on eDP
  drm/amd/pm: Remove artificial freq level on Navi1x
  drm/amd/pm: avoid duplicate powergate/ungate setting
  drm/amdgpu: add error print when failing to add IP block(v2)
  drm/amd/pm: Enhanced reporting also for a stuck command
  drm/i915/guc: fix NULL vs IS_ERR() checking
  drm/i915/dsi/xelpd: Fix the bit mask for wakeup GB
  Revert "drm/i915/tgl/dsi: Gate the ddi clocks after pll mapping"
  fbdev: Prevent probing generic drivers if a FB is already registered
  drm/scheduler: fix drm_sched_job_add_implicit_dependencies harder
  drm/scheduler: fix drm_sched_job_add_implicit_dependencies
  drm/sun4i: fix unmet dependency on RESET_CONTROLLER for PHY_SUN6I_MIPI_DPHY
  drm/cma-helper: Release non-coherent memory with dma_free_noncoherent()
  drm/nouveau: hdmigv100.c: fix corrupted HDMI Vendor InfoFrame

x86: Pin task-stack in __get_wchan()

When commit 5d1ceb3969b6 ("x86: Fix __get_wchan() for !STACKTRACE")
moved from stacktrace to native unwind_*() usage, the
try_get_task_stack() got lost, leading to use-after-free issues for
dying tasks.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Fixes: 5d1ceb3969b6 ("x86: Fix __get_wchan() for !STACKTRACE")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215031
Link: https://lore.kernel.org/stable/[email protected]/
Reported-by: Justin Forbes <[email protected]>
Reported-by: Holger Hoffstätte <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

iavf: Fix VLAN feature flags after VFR

When a VF goes through a reset, it's possible for the VF's feature set
to change. For example it may lose the VIRTCHNL_VF_OFFLOAD_VLAN
capability after VF reset. Unfortunately, the driver doesn't correctly
deal with this situation and errors are seen from downing/upping the
interface and/or moving the interface in/out of a network namespace.

When setting the interface down/up we see the following errors after the
VIRTCHNL_VF_OFFLOAD_VLAN capability was taken away from the VF:

ice 0000:51:00.1: VF 1 failed opcode 12, retval: -64 iavf 0000:51:09.1:
Failed to add VLAN filter, error IAVF_NOT_SUPPORTED ice 0000:51:00.1: VF
1 failed opcode 13, retval: -64 iavf 0000:51:09.1: Failed to delete VLAN
filter, error IAVF_NOT_SUPPORTED

These add/delete errors are happening because the VLAN filters are
tracked internally to the driver and regardless of the VLAN_ALLOWED()
setting the driver tries to delete/re-add them over virtchnl.

Fix the delete failure by making sure to delete any VLAN filter tracking
in the driver when a removal request is made, while preventing the
virtchnl request. This makes it so the driver's VLAN list is up to date
and the errors are

Fix the add failure by making sure the check for VLAN_ALLOWED() during
reset is done after the VF receives its capability list from the PF via
VIRTCHNL_OP_GET_VF_RESOURCES. If VLAN functionality is not allowed, then
prevent requesting re-adding the filters over virtchnl.

When moving the interface into a network namespace we see the following
errors after the VIRTCHNL_VF_OFFLOAD_VLAN capability was taken away from
the VF:

iavf 0000:51:09.1 enp81s0f1v1: NIC Link is Up Speed is 25 Gbps Full Duplex
iavf 0000:51:09.1 temp_27: renamed from enp81s0f1v1
iavf 0000:51:09.1 mgmt: renamed from temp_27
iavf 0000:51:09.1 dev27: set_features() failed (-22); wanted 0x020190001fd54833, left 0x020190001fd54bb3

These errors are happening because we aren't correctly updating the
netdev capabilities and dealing with ndo_fix_features() and
ndo_set_features() correctly.

Fix this by only reporting errors in the driver's ndo_set_features()
callback when VIRTCHNL_VF_OFFLOAD_VLAN is not allowed and any attempt to
enable the VLAN features is made. Also, make sure to disable VLAN
insertion, filtering, and stripping since the VIRTCHNL_VF_OFFLOAD_VLAN
flag applies to all of them and not just VLAN stripping.

Also, after we process the capabilities in the VF reset path, make sure
to call netdev_update_features() in case the capabilities have changed
in order to update the netdev's feature set to match the VF's actual
capabilities.

Lastly, make sure to always report success on VLAN filter delete when
VIRTCHNL_VF_OFFLOAD_VLAN is not supported. The changed flow in
iavf_del_vlans() allows the stack to delete previosly existing VLAN
filters even if VLAN filtering is not allowed. This makes it so the VLAN
filter list is up to date.

Fixes: 8774370d268f ("i40e/i40evf: support for VF VLAN tag stripping control")
Signed-off-by: Brett Creeley <[email protected]>
Tested-by: Konrad Jankowski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

iavf: Fix refreshing iavf adapter stats on ethtool request

Currently iavf adapter statistics are refreshed only in a
watchdog task, triggered approximately every two seconds,
which causes some ethtool requests to return outdated values.

Add explicit statistics refresh when requested by ethtool -S.

Fixes: b476b0030e61 ("iavf: Move commands processing to the separate function")
Signed-off-by: Jan Sokolowski <[email protected]>
Signed-off-by: Jedrzej Jagielski <[email protected]>
Tested-by: Konrad Jankowski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

iavf: Fix deadlock occurrence during resetting VF interface

System hangs if close the interface is called from the kernel during
the interface is in resetting state.
During resetting operation the link is closing but kernel didn't
know it and it tried to close this interface again what sometimes
led to deadlock.
Inform kernel about current state of interface
and turn off the flag IFF_UP when interface is closing until reset
is finished.
Previously it was most likely to hang the system when kernel
(network manager) tried to close the interface in the same time
when interface was in resetting state because of deadlock.

Fixes: 3c8e0b989aa1 ("i40vf: don't stop me now")
Signed-off-by: Jaroslaw Gawin <[email protected]>
Signed-off-by: Jedrzej Jagielski <[email protected]>
Tested-by: Konrad Jankowski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

iavf: Prevent changing static ITR values if adaptive moderation is on

Resolve being able to change static values on VF when adaptive interrupt
moderation is enabled.

This problem is fixed by checking the interrupt settings is not
a combination of change of static value while adaptive interrupt
moderation is turned on.

Without this fix, the user would be able to change static values
on VF with adaptive moderation enabled.

Fixes: 65e87c0398f5 ("i40evf: support queue-specific settings for interrupt moderation")
Signed-off-by: Nitesh B Venkatesh <[email protected]>
Tested-by: George Kuruvinakunnel <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>

signal: Replace force_fatal_sig with force_exit_sig when in doubt

Recently to prevent issues with SECCOMP_RET_KILL and similar signals
being changed before they are delivered SA_IMMUTABLE was added.

Unfortunately this broke debuggers[1][2] which reasonably expect
to be able to trap synchronous SIGTRAP and SIGSEGV even when
the target process is not configured to handle those signals.

Add force_exit_sig and use it instead of force_fatal_sig where
historically the code has directly called do_exit. This has the
implementation benefits of going through the signal exit path
(including generating core dumps) without the danger of allowing
userspace to ignore or change these signals.

This avoids userspace regressions as older kernels exited with do_exit
which debuggers also can not intercept.

In the future is should be possible to improve the quality of
implementation of the kernel by changing some of these force_exit_sig
calls to force_fatal_sig. That can be done where it matters on
a case-by-case basis with careful analysis.

Reported-by: Kyle Huey <[email protected]>
Reported-by: kernel test robot <[email protected]>
[1] https://lkml.kernel.org/r/CAP045AoMY4xf8aC_4QU_-j7obuEPYgTcnQQP3Yxk=2X90jtpjw@mail.gmail.com
[2] https://lkml.kernel.org/r/20211117150258.GB5403@xsang-OptiPlex-9020
Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
Fixes: a3616a3c0272 ("signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die")
Fixes: 83a1f27ad773 ("signal/powerpc: On swapcontext failure force SIGSEGV")
Fixes: 9bc508cf0791 ("signal/s390: Use force_sigsegv in default_trap_handler")
Fixes: 086ec444f866 ("signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig")
Fixes: c317d306d550 ("signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails")
Fixes: 695dd0d634df ("signal/x86: In emulate_vsyscall force a signal instead of calling do_exit")
Fixes: 1fbd60df8a85 ("signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.")
Fixes: 941edc5bf174 ("exit/syscall_user_dispatch: Send ordinary signals on failure")
Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Kees Cook <[email protected]>
Tested-by: Kees Cook <[email protected]>
Tested-by: Kyle Huey <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>

signal: Don't always set SA_IMMUTABLE for forced signals

Recently to prevent issues with SECCOMP_RET_KILL and similar signals
being changed before they are delivered SA_IMMUTABLE was added.

Unfortunately this broke debuggers[1][2] which reasonably expect to be
able to trap synchronous SIGTRAP and SIGSEGV even when the target
process is not configured to handle those signals.

Update force_sig_to_task to support both the case when we can allow
the debugger to intercept and possibly ignore the signal and the case
when it is not safe to let userspace know about the signal until the
process has exited.

Suggested-by: Linus Torvalds <[email protected]>
Reported-by: Kyle Huey <[email protected]>
Reported-by: kernel test robot <[email protected]>
Cc: [email protected]
[1] https://lkml.kernel.org/r/CAP045AoMY4xf8aC_4QU_-j7obuEPYgTcnQQP3Yxk=2X90jtpjw@mail.gmail.com
[2] https://lkml.kernel.org/r/20211117150258.GB5403@xsang-OptiPlex-9020
Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Kees Cook <[email protected]>
Tested-by: Kees Cook <[email protected]>
Tested-by: Kyle Huey <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>

HID: multitouch: Fix Iiyama ProLite T1931SAW (0eef:0001 again!)

Iiyama ProLite T1931SAW does not work with Linux - input devices are
created but cursor does not move.

It has the infamous 0eef:0001 ID which has been reused for various
devices before.

It seems to require export_all_inputs = true.

Hopefully there are no HID devices using this ID that will break.
It should not break non-HID devices (handled by usbtouchscreen).

Signed-off-by: Ondrej Zary <[email protected]>
Reviewed-by: Benjamin Tissoires <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>

HID: nintendo: eliminate dead datastructures in !CONFIG_NINTENDO_FF case

The rumbling-related identifiers are never used in !CONFIG_NINTENDO_FF
case, so let's hide them in order to avoid unused warnings.

Reported-by: kernel test robot <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>