Alaa Hleihel [Wed, 19 Aug 2020 15:24:10 +0000 (18:24 +0300)]
net/sched: act_ct: Fix skb double-free in tcf_ct_handle_fragments() error flow
tcf_ct_handle_fragments() shouldn't free the skb when ip_defrag() call
fails. Otherwise, we will cause a double-free bug.
In such cases, just return the error to the caller.
David Laight [Wed, 19 Aug 2020 14:40:52 +0000 (14:40 +0000)]
net: sctp: Fix negotiation of the number of data streams.
The number of output and input streams was never being reduced, eg when
processing received INIT or INIT_ACK chunks.
The effect is that DATA chunks can be sent with invalid stream ids
and then discarded by the remote system.
Mark Tomlinson [Wed, 19 Aug 2020 01:53:58 +0000 (13:53 +1200)]
gre6: Fix reception with IP6_TNL_F_RCV_DSCP_COPY
When receiving an IPv4 packet inside an IPv6 GRE packet, and the
IP6_TNL_F_RCV_DSCP_COPY flag is set on the tunnel, the IPv4 header would
get corrupted. This is due to the common ip6_tnl_rcv() function assuming
that the inner header is always IPv6. This patch checks the tunnel
protocol for IPv4 inner packets, but still defaults to IPv6.
Fixes: 308edfdf1563 ("gre6: Cleanup GREv6 receive path, call common GRE functions") Signed-off-by: Mark Tomlinson <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Haiyang Zhang [Thu, 20 Aug 2020 21:53:15 +0000 (14:53 -0700)]
hv_netvsc: Fix the queue_mapping in netvsc_vf_xmit()
netvsc_vf_xmit() / dev_queue_xmit() will call VF NIC’s ndo_select_queue
or netdev_pick_tx() again. They will use skb_get_rx_queue() to get the
queue number, so the “skb->queue_mapping - 1” will be used. This may
cause the last queue of VF not been used.
Use skb_record_rx_queue() here, so that the skb_get_rx_queue() called
later will get the correct queue number, and VF will be able to use
all queues.
Fixes: b3bf5666a510 ("hv_netvsc: defer queue selection to VF") Signed-off-by: Haiyang Zhang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Haiyang Zhang [Thu, 20 Aug 2020 21:53:14 +0000 (14:53 -0700)]
hv_netvsc: Remove "unlikely" from netvsc_select_queue
When using vf_ops->ndo_select_queue, the number of queues of VF is
usually bigger than the synthetic NIC. This condition may happen
often.
Remove "unlikely" from the comparison of ndev->real_num_tx_queues.
Fixes: b3bf5666a510 ("hv_netvsc: defer queue selection to VF") Signed-off-by: Haiyang Zhang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Yauheni Kaliuta [Thu, 20 Aug 2020 11:58:43 +0000 (14:58 +0300)]
bpf: selftests: global_funcs: Check err_str before strstr
The error path in libbpf.c:load_program() has calls to pr_warn()
which ends up for global_funcs tests to
test_global_funcs.c:libbpf_debug_print().
For the tests with no struct test_def::err_str initialized with a
string, it causes call of strstr() with NULL as the second argument
and it segfaults.
Fix it by calling strstr() only for non-NULL err_str.
Andrii Nakryiko [Thu, 20 Aug 2020 05:28:41 +0000 (22:28 -0700)]
bpf: xdp: Fix XDP mode when no mode flags specified
7f0a838254bd ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device")
inadvertently changed which XDP mode is assumed when no mode flags are
specified explicitly. Previously, driver mode was preferred, if driver
supported it. If not, generic SKB mode was chosen. That commit changed default
to SKB mode always. This patch fixes the issue and restores the original
logic.
Calling generic selftests "make install" fails as rsync expects all
files from TEST_GEN_PROGS to be present. The binary is not generated
anymore (commit 3b09d27cc93d) so we can safely remove it from there
and also from gitignore.
Peilin Ye [Thu, 20 Aug 2020 14:30:52 +0000 (16:30 +0200)]
net/smc: Prevent kernel-infoleak in __smc_diag_dump()
__smc_diag_dump() is potentially copying uninitialized kernel stack memory
into socket buffers, since the compiler may leave a 4-byte hole near the
beginning of `struct smcd_diag_dmbinfo`. Fix it by initializing `dinfo`
with memset().
Edward Cree [Thu, 20 Aug 2020 10:47:19 +0000 (11:47 +0100)]
sfc: fix build warnings on 32-bit
Truncation of DMA_BIT_MASK to 32-bit dma_addr_t is semantically safe,
but the compiler was warning because it was happening implicitly.
Insert explicit casts to suppress the warnings.
Bin Meng [Thu, 16 Jul 2020 04:39:53 +0000 (21:39 -0700)]
riscv: Add SiFive drivers to rv32_defconfig
This adds SiFive drivers to rv32_defconfig, to keep in sync with the
64-bit config. This is useful when testing 32-bit kernel with QEMU
'sifive_u' 32-bit machine.
Anup Patel [Mon, 17 Aug 2020 12:42:50 +0000 (18:12 +0530)]
RISC-V: Remove CLINT related code from timer and arch
Right now the RISC-V timer driver is convoluted to support:
1. Linux RISC-V S-mode (with MMU) where it will use TIME CSR for
clocksource and SBI timer calls for clockevent device.
2. Linux RISC-V M-mode (without MMU) where it will use CLINT MMIO
counter register for clocksource and CLINT MMIO compare register
for clockevent device.
We now have a separate CLINT timer driver which also provide CLINT
based IPI operations so let's remove CLINT MMIO related code from
arch/riscv directory and RISC-V timer driver.
Anup Patel [Mon, 17 Aug 2020 12:42:49 +0000 (18:12 +0530)]
clocksource/drivers: Add CLINT timer driver
We add a separate CLINT timer driver for Linux RISC-V M-mode (i.e.
RISC-V NoMMU kernel).
The CLINT MMIO device provides three things:
1. 64bit free running counter register
2. 64bit per-CPU time compare registers
3. 32bit per-CPU inter-processor interrupt registers
Unlike other timer devices, CLINT provides IPI registers along with
timer registers. To use CLINT IPI registers, the CLINT timer driver
provides IPI related callbacks to arch/riscv.
Linus Torvalds [Thu, 20 Aug 2020 17:48:17 +0000 (10:48 -0700)]
Merge tag 'dma-mapping-5.9-1' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
"Fix more fallout from the dma-pool changes (Nicolas Saenz Julienne,
me)"
* tag 'dma-mapping-5.9-1' of git://git.infradead.org/users/hch/dma-mapping:
dma-pool: Only allocate from CMA when in same memory zone
dma-pool: fix coherent pool allocations for IOMMU mappings
Adrian Huang [Wed, 19 Aug 2020 15:42:36 +0000 (23:42 +0800)]
dax: do not print error message for non-persistent memory block device
Commit 231609785cbf ("dax: print error message by pr_info()
in __generic_fsdax_supported()") happens to print the following
error message during booting when the non-persistent memory block
devices are configured by device mapper. Those error messages are
caused by the variable 'dax_dev' is NULL. Users might be confused
with those error messages since they do not use the persistent
memory device. Moreover, users might scare about "what's wrong
with my disks" because they see the 'error' and 'failed' keywords.
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 1.1T 0 disk
├─sda1 8:1 0 156M 0 part
├─sda2 8:2 0 40G 0 part
└─sda3 8:3 0 1.1T 0 part
sdb 8:16 0 1.1T 0 disk
├─sdb1 8:17 0 600M 0 part
├─sdb2 8:18 0 1G 0 part
└─sdb3 8:19 0 1.1T 0 part
├─rhel00-swap 254:3 0 4G 0 lvm
├─rhel00-home 254:4 0 1T 0 lvm
└─rhel00-root 254:5 0 50G 0 lvm
sdc 8:32 0 1.1T 0 disk
sdd 8:48 0 1.1T 0 disk
sde 8:64 0 1.1T 0 disk
sdf 8:80 0 1.1T 0 disk
sdg 8:96 0 1.1T 0 disk
sdh 8:112 0 3.3T 0 disk
├─sdh1 8:113 0 500M 0 part /boot/efi
├─sdh2 8:114 0 40G 0 part /
├─sdh3 8:115 0 2.9T 0 part /home
└─sdh4 8:116 0 314.6G 0 part [SWAP]
sdi 8:128 0 1.1T 0 disk
sdj 8:144 0 3.3T 0 disk
├─sdj1 8:145 0 512M 0 part
└─sdj2 8:146 0 3.3T 0 part
sdk 8:160 0 119.2G 0 disk
├─sdk1 8:161 0 200M 0 part
├─sdk2 8:162 0 1G 0 part
└─sdk3 8:163 0 118G 0 part
├─rhel-swap 254:0 0 4G 0 lvm
├─rhel-home 254:1 0 64G 0 lvm
└─rhel-root 254:2 0 50G 0 lvm
sdl 8:176 0 119.2G 0 disk
The call path is shown as follows:
dm_table_determine_type
dm_table_supports_dax
device_supports_dax
generic_fsdax_supported
__generic_fsdax_supported
With the disk configuration listing from the command 'lsblk',
the member 'dev->dax_dev' of the block devices 'sdb3' and 'sdk3'
(configured by device mapper) is NULL in function
generic_fsdax_supported() because the member is configured in
function open_table_device().
To prevent the confusing error messages in this scenario (this is
normal behavior), just print those error messages by pr_debug()
by checking if dax_dev is NULL and the block device does not support
DAX.
David Howells [Thu, 20 Aug 2020 13:37:12 +0000 (14:37 +0100)]
afs: Fix key ref leak in afs_put_operation()
The afs_put_operation() function needs to put the reference to the key
that's authenticating the operation.
Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept") Reported-by: Dave Botsch <[email protected]> Signed-off-by: David Howells <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Merge branch 'opp/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm
Pull operating performance points (OPP) framework fixes for 5.9-rc2
from Viresh Kumar:
"This contains the following fixes for 5.9:
- Fix re-enabling of resources (Rajendra Nayak).
- Put OPP table references (Stephen Boyd)."
* 'opp/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
opp: Enable resources again if they were disabled earlier
opp: Put opp table in dev_pm_opp_set_rate() if _set_opp_bw() fails
opp: Put opp table in dev_pm_opp_set_rate() for empty tables
Vasant Hegde [Thu, 20 Aug 2020 06:18:44 +0000 (11:48 +0530)]
powerpc/pseries: Do not initiate shutdown when system is running on UPS
As per PAPR we have to look for both EPOW sensor value and event
modifier to identify the type of event and take appropriate action.
In LoPAPR v1.1 section 10.2.2 includes table 136 "EPOW Action Codes":
SYSTEM_SHUTDOWN 3
The system must be shut down. An EPOW-aware OS logs the EPOW error
log information, then schedules the system to be shut down to begin
after an OS defined delay internal (default is 10 minutes.)
Then in section 10.3.2.2.8 there is table 146 "Platform Event Log
Format, Version 6, EPOW Section", which includes the "EPOW Event
Modifier":
For EPOW sensor value = 3
0x01 = Normal system shutdown with no additional delay
0x02 = Loss of utility power, system is running on UPS/Battery
0x03 = Loss of system critical functions, system should be shutdown
0x04 = Ambient temperature too high
All other values = reserved
We have a user space tool (rtas_errd) on LPAR to monitor for
EPOW_SHUTDOWN_ON_UPS. Once it gets an event it initiates shutdown
after predefined time. It also starts monitoring for any new EPOW
events. If it receives "Power restored" event before predefined time
it will cancel the shutdown. Otherwise after predefined time it will
shutdown the system.
Commit 79872e35469b ("powerpc/pseries: All events of
EPOW_SYSTEM_SHUTDOWN must initiate shutdown") changed our handling of
the "on UPS/Battery" case, to immediately shutdown the system. This
breaks existing setups that rely on the userspace tool to delay
shutdown and let the system run on the UPS.
Pavel Begunkov [Thu, 20 Aug 2020 08:34:10 +0000 (11:34 +0300)]
io_uring: comment on kfree(iovec) checks
kfree() handles NULL pointers well, but io_{read,write}() checks it
because of performance reasons. Leave a comment there for those who are
tempted to patch it.
Pavel Begunkov [Thu, 20 Aug 2020 08:33:35 +0000 (11:33 +0300)]
io_uring: fix racy req->flags modification
Setting and clearing REQ_F_OVERFLOW in io_uring_cancel_files() and
io_cqring_overflow_flush() are racy, because they might be called
asynchronously.
REQ_F_OVERFLOW flag in only needed for files cancellation, so if it can
be guaranteed that requests _currently_ marked inflight can't be
overflown, the problem will be solved with removing the flag
altogether.
That's how the patch works, it removes inflight status of a request
in io_cqring_fill_event() whenever it should be thrown into CQ-overflow
list. That's Ok to do, because no opcode specific handling can be done
after io_cqring_fill_event(), the same assumption as with "struct
io_completion" patches.
And it already have a good place for such cleanups, which is
io_clean_op(). A nice side effect of this is removing this inflight
check from the hot path.
note on synchronisation: now __io_cqring_fill_event() may be taking two
spinlocks simultaneously, completion_lock and inflight_lock. It's fine,
because we never do that in reverse order, and CQ-overflow of inflight
requests shouldn't happen often.
Weihang Li [Wed, 19 Aug 2020 09:39:44 +0000 (17:39 +0800)]
Revert "RDMA/hns: Reserve one sge in order to avoid local length error"
This patch caused some issues on SEND operation, and it should be reverted
to make the drivers work correctly. There will be a better solution that
has been tested carefully to solve the original problem.
The issue happens when TID RDMA WRITE request is followed by an
IB_WR_RDMA_WRITE_WITH_IMM request, the latter could be completed first on
the responder side. As a result, no ACK packet for the latter could be
sent because the TID RDMA WRITE request is still being processed on the
responder side.
When the TID RDMA WRITE request is eventually completed, the requester
will wait for the IB_WR_RDMA_WRITE_WITH_IMM request to be acknowledged.
If the next request is another TID RDMA WRITE request, no TID RDMA WRITE
DATA packet could be sent because the preceding IB_WR_RDMA_WRITE_WITH_IMM
request is not completed yet.
Consequently the IB_WR_RDMA_WRITE_WITH_IMM will be retried but it will be
ignored on the responder side because the responder thinks it has already
been completed. Eventually the retry will be exhausted and the qp will be
put into error state on the requester side. On the responder side, the TID
resource timer will eventually expire because no TID RDMA WRITE DATA
packets will be received for the second TID RDMA WRITE request. There is
also risk of a write-after-write memory corruption due to the issue.
Fix by adding a requester side interlock to prevent any potential data
corruption and TID RDMA protocol error.
Selvin Xavier [Thu, 6 Aug 2020 04:45:48 +0000 (21:45 -0700)]
RDMA/bnxt_re: Do not add user qps to flushlist
Driver shall add only the kernel qps to the flush list for clean up.
During async error events from the HW, driver is adding qps to this list
without checking if the qp is kernel qp or not.
Add a check to avoid user qp addition to the flush list.
Fixes: 942c9b6ca8de ("RDMA/bnxt_re: Avoid Hard lockup during error CQE processing") Fixes: c50866e2853a ("bnxt_re: fix the regression due to changes in alloc_pbl") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Selvin Xavier <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
Athira Rajeev [Thu, 6 Aug 2020 12:46:32 +0000 (08:46 -0400)]
powerpc/perf: Fix soft lockups due to missed interrupt accounting
Performance monitor interrupt handler checks if any counter has
overflown and calls record_and_restart() in core-book3s which invokes
perf_event_overflow() to record the sample information. Apart from
creating sample, perf_event_overflow() also does the interrupt and
period checks via perf_event_account_interrupt().
Currently we record information only if the SIAR (Sampled Instruction
Address Register) valid bit is set (using siar_valid() check) and
hence the interrupt check.
But it is possible that we do sampling for some events that are not
generating valid SIAR, and hence there is no chance to disable the
event if interrupts are more than max_samples_per_tick. This leads to
soft lockup.
Fix this by adding perf_event_account_interrupt() in the invalid SIAR
code path for a sampling event. ie if SIAR is invalid, just do
interrupt check and don't record the sample information.
Ard Biesheuvel [Mon, 17 Aug 2020 10:00:17 +0000 (12:00 +0200)]
Documentation: efi: remove description of efi=old_map
The old EFI runtime region mapping logic that was kept around for some
time has finally been removed entirely, along with the SGI UV1 support
code that was its last remaining user. So remove any mention of the
efi=old_map command line parameter from the docs.
Ard Biesheuvel [Thu, 13 Aug 2020 17:38:17 +0000 (19:38 +0200)]
efi/x86: Move 32-bit code into efi_32.c
Now that the old memmap code has been removed, some code that was left
behind in arch/x86/platform/efi/efi.c is only used for 32-bit builds,
which means it can live in efi_32.c as well. So move it over.
Treat a NULL cmdline the same as empty. Although this is unlikely to
happen in practice, the x86 kernel entry does check for NULL cmdline and
handles it, so do it here as well.
efi/x86: Mark kernel rodata non-executable for mixed mode
When remapping the kernel rodata section RO in the EFI pagetables, the
protection flags that were used for the text section are being reused,
but the rodata section should not be marked executable.
Rajendra Nayak [Mon, 10 Aug 2020 07:06:19 +0000 (12:36 +0530)]
opp: Enable resources again if they were disabled earlier
dev_pm_opp_set_rate() can now be called with freq = 0 in order
to either drop performance or bandwidth votes or to disable
regulators on platforms which support them.
In such cases, a subsequent call to dev_pm_opp_set_rate() with
the same frequency ends up returning early because 'old_freq == freq'
Instead make it fall through and put back the dropped performance
and bandwidth votes and/or enable back the regulators.
Randy Dunlap [Thu, 20 Aug 2020 04:30:47 +0000 (06:30 +0200)]
Fix build error when CONFIG_ACPI is not set/enabled:
../arch/x86/pci/xen.c: In function ‘pci_xen_init’:
../arch/x86/pci/xen.c:410:2: error: implicit declaration of function ‘acpi_noirq_set’; did you mean ‘acpi_irq_get’? [-Werror=implicit-function-declaration]
acpi_noirq_set();
efifb_probe() will issue an error message in case the kernel is booted
as Xen dom0 from UEFI as EFI_MEMMAP won't be set in this case. Avoid
that message by calling efi_mem_desc_lookup() only if EFI_MEMMAP is set.
Frederic Barrat [Wed, 19 Aug 2020 13:07:41 +0000 (15:07 +0200)]
powerpc/powernv/pci: Fix possible crash when releasing DMA resources
Fix a typo introduced during recent code cleanup, which could lead to
silently not freeing resources or an oops message (on PCI hotplug or
CAPI reset).
Only impacts ioda2, the code path for ioda1 is correct.
Wang Hai [Wed, 19 Aug 2020 02:33:09 +0000 (10:33 +0800)]
net: gemini: Fix missing free_netdev() in error path of gemini_ethernet_port_probe()
Replace alloc_etherdev_mq with devm_alloc_etherdev_mqs. In this way,
when probe fails, netdev can be freed automatically.
Fixes: 4d5ae32f5e1e ("net: ethernet: Add a driver for Gemini gigabit ethernet") Reported-by: Hulk Robot <[email protected]> Signed-off-by: Wang Hai <[email protected]> Signed-off-by: David S. Miller <[email protected]>
net: atlantic: Use readx_poll_timeout() for large timeout
Commit 8dcf2ad39fdb2 ("net: atlantic: add hwmon getter for MAC temperature")
implemented a read callback with an udelay(10000U). This fails to
compile on ARM because the delay is >1ms. I doubt that it is needed to
spin for 10ms even if possible on x86.
>From looking at the code, the context appears to be preemptible so using
usleep() should work and avoid busy spinning.
Min Li [Tue, 18 Aug 2020 14:41:22 +0000 (10:41 -0400)]
ptp: ptp_clockmatrix: use i2c_master_send for i2c write
The old code for i2c write would break on some controllers, which fails
at handling Repeated Start Condition. So we will just use i2c_master_send
to handle write in one transanction.
Johannes Berg [Wed, 19 Aug 2020 19:52:38 +0000 (21:52 +0200)]
netlink: fix state reallocation in policy export
Evidently, when I did this previously, we didn't have more than
10 policies and didn't run into the reallocation path, because
it's missing a memset() for the unused policies. Fix that.
Fixes: d07dcf9aadd6 ("netlink: add infrastructure to expose policies to userspace") Signed-off-by: Johannes Berg <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Wed, 19 Aug 2020 22:32:58 +0000 (15:32 -0700)]
Merge branch 'Bug-fixes-for-ENA-ethernet-driver'
Shay Agroskin says:
====================
Bug fixes for ENA ethernet driver
This series adds the following:
- Fix undesired call to ena_restore after returning from suspend
- Fix condition inside a WARN_ON
- Fix overriding previous value when updating missed_tx statistic
v1->v2:
- fix bug when calling reset routine after device resources are freed (Jakub)
v2->v3:
- fix wrong hash in 'Fixes' tag
====================
Shay Agroskin [Wed, 19 Aug 2020 17:28:38 +0000 (20:28 +0300)]
net: ena: Make missed_tx stat incremental
Most statistics in ena driver are incremented, meaning that a stat's
value is a sum of all increases done to it since driver/queue
initialization.
This patch makes all statistics this way, effectively making missed_tx
statistic incremental.
Also added a comment regarding rx_drops and tx_drops to make it
clearer how these counters are calculated.
Fixes: 11095fdb712b ("net: ena: add statistics for missed tx packets") Signed-off-by: Shay Agroskin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
This macro prints the call stack if the expression inside of it is
true [1], but the expression inside of it is the wanted situation.
The expression checks whether the ring has an XDP queue and its index
corresponds to a XDP one.
This patch changes the expression to
!ENA_IS_XDP_INDEX(adapter, i) && adapter->ena_napi[i].xdp_ring
which indicates an unwanted situation.
Also, change the structure of the function. The napi handler is
unregistered for all rings, and so there's no need to check whether the
index is an XDP index or not. By removing this check the code becomes
much more readable.
Fixes: 548c4940b9f1 ("net: ena: Implement XDP_TX action") Signed-off-by: Shay Agroskin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Shay Agroskin [Wed, 19 Aug 2020 17:28:36 +0000 (20:28 +0300)]
net: ena: Prevent reset after device destruction
The reset work is scheduled by the timer routine whenever it
detects that a device reset is required (e.g. when a keep_alive signal
is missing).
When releasing device resources in ena_destroy_device() the driver
cancels the scheduling of the timer routine without destroying the reset
work explicitly.
This creates the following bug:
The driver is suspended and the ena_suspend() function is called
-> This function calls ena_destroy_device() to free the net device
resources
-> The driver waits for the timer routine to finish
its execution and then cancels it, thus preventing from it
to be called again.
If, in its final execution, the timer routine schedules a reset,
the reset routine might be called afterwards,and a redundant call to
ena_restore_device() would be made.
By changing the reset routine we allow it to read the device's state
accurately.
This is achieved by checking whether ENA_FLAG_TRIGGER_RESET flag is set
before resetting the device and making both the destruction function and
the flag check are under rtnl lock.
The ENA_FLAG_TRIGGER_RESET is cleared at the end of the destruction
routine. Also surround the flag check with 'likely' because
we expect that the reset routine would be called only when
ENA_FLAG_TRIGGER_RESET flag is set.
The destruction of the timer and reset services in __ena_shutoff() have to
stay, even though the timer routine is destroyed in ena_destroy_device().
This is to avoid a case in which the reset routine is scheduled after
free_netdev() in __ena_shutoff(), which would create an access to freed
memory in adapter->flags.
Fixes: 8c5c7abdeb2d ("net: ena: add power management ops to the ENA driver") Signed-off-by: Shay Agroskin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Marc Zyngier [Wed, 19 Aug 2020 09:42:55 +0000 (10:42 +0100)]
of: address: Work around missing device_type property in pcie nodes
Recent changes to the DT PCI bus parsing made it mandatory for
device tree nodes describing a PCI controller to have the
'device_type = "pci"' property for the node to be matched.
Although this follows the letter of the specification, it
breaks existing device-trees that have been working fine
for years. Rockchip rk3399-based systems are a prime example
of such collateral damage, and have stopped discovering their
PCI bus.
In order to paper over it, let's add a workaround to the code
matching the device type, and accept as PCI any node that is
named "pcie",
A warning will hopefully nudge the user into updating their
DT to a fixed version if they can, but the incentive is
obviously pretty small.
Arvind Sankar [Wed, 19 Aug 2020 14:08:16 +0000 (10:08 -0400)]
lib/string.c: Use freestanding environment
gcc can transform the loop in a naive implementation of memset/memcpy
etc into a call to the function itself. This optimization is enabled by
-ftree-loop-distribute-patterns.
This has been the case for a while, but gcc-10.x enables this option at
-O2 rather than -O3 as in previous versions.
Add -ffreestanding, which implicitly disables this optimization with
gcc. It is unclear whether clang performs such optimizations, but
hopefully it will also not do so in a freestanding environment.
Arvind Sankar [Tue, 4 Aug 2020 23:48:17 +0000 (19:48 -0400)]
x86/boot/compressed: Use builtin mem functions for decompressor
Since commits
c041b5ad8640 ("x86, boot: Create a separate string.h file to provide standard string functions") fb4cac573ef6 ("x86, boot: Move memcmp() into string.h and string.c")
the decompressor stub has been using the compiler's builtin memcpy,
memset and memcmp functions, _except_ where it would likely have the
largest impact, in the decompression code itself.
Remove the #undef's of memcpy and memset in misc.c so that the
decompressor code also uses the compiler builtins.
The rationale given in the comment doesn't really apply: just because
some functions use the out-of-line version is no reason to not use the
builtin version in the rest.
Replace the comment with an explanation of why memzero and memmove are
being #define'd.
Drop the suggestion to #undef in boot/string.h as well: the out-of-line
versions are not really optimized versions, they're generic code that's
good enough for the preboot environment. The compiler will likely
generate better code for constant-size memcpy/memset/memcmp if it is
allowed to.
Most decompressors' performance is unchanged, with the exception of LZ4
and 64-bit ZSTD.
Before After ARCH
LZ4 73ms 10ms 32
LZ4 120ms 10ms 64
ZSTD 90ms 74ms 64
Measurements on QEMU on 2.2GHz Broadwell Xeon, using defconfig kernels.
Decompressor code size has small differences, with the largest being
that 64-bit ZSTD decreases just over 2k. The largest code size increase
was on 64-bit XZ, of about 400 bytes.
Jens Axboe [Wed, 19 Aug 2020 17:10:51 +0000 (11:10 -0600)]
io_uring: use system_unbound_wq for ring exit work
We currently use system_wq, which is unbounded in terms of number of
workers. This means that if we're exiting tons of rings at the same
time, then we'll briefly spawn tons of event kworkers just for a very
short blocking time as the rings exit.
Use system_unbound_wq instead, which has a sane cap on the concurrency
level.
Filipe Manana [Fri, 14 Aug 2020 10:04:09 +0000 (11:04 +0100)]
btrfs: fix space cache memory leak after transaction abort
If a transaction aborts it can cause a memory leak of the pages array of
a block group's io_ctl structure. The following steps explain how that can
happen:
1) Transaction N is committing, currently in state TRANS_STATE_UNBLOCKED
and it's about to start writing out dirty extent buffers;
2) Transaction N + 1 already started and another task, task A, just called
btrfs_commit_transaction() on it;
3) Block group B was dirtied (extents allocated from it) by transaction
N + 1, so when task A calls btrfs_start_dirty_block_groups(), at the
very beginning of the transaction commit, it starts writeback for the
block group's space cache by calling btrfs_write_out_cache(), which
allocates the pages array for the block group's io_ctl with a call to
io_ctl_init(). Block group A is added to the io_list of transaction
N + 1 by btrfs_start_dirty_block_groups();
4) While transaction N's commit is writing out the extent buffers, it gets
an IO error and aborts transaction N, also setting the file system to
RO mode;
5) Task A has already returned from btrfs_start_dirty_block_groups(), is at
btrfs_commit_transaction() and has set transaction N + 1 state to
TRANS_STATE_COMMIT_START. Immediately after that it checks that the
filesystem was turned to RO mode, due to transaction N's abort, and
jumps to the "cleanup_transaction" label. After that we end up at
btrfs_cleanup_one_transaction() which calls btrfs_cleanup_dirty_bgs().
That helper finds block group B in the transaction's io_list but it
never releases the pages array of the block group's io_ctl, resulting in
a memory leak.
In fact at the point when we are at btrfs_cleanup_dirty_bgs(), the pages
array points to pages that were already released by us at
__btrfs_write_out_cache() through the call to io_ctl_drop_pages(). We end
up freeing the pages array only after waiting for the ordered extent to
complete through btrfs_wait_cache_io(), which calls io_ctl_free() to do
that. But in the transaction abort case we don't wait for the space cache's
ordered extent to complete through a call to btrfs_wait_cache_io(), so
that's why we end up with a memory leak - we wait for the ordered extent
to complete indirectly by shutting down the work queues and waiting for
any jobs in them to complete before returning from close_ctree().
We can solve the leak simply by freeing the pages array right after
releasing the pages (with the call to io_ctl_drop_pages()) at
__btrfs_write_out_cache(), since we will never use it anymore after that
and the pages array points to already released pages at that point, which
is currently not a problem since no one will use it after that, but not a
good practice anyway since it can easily lead to use-after-free issues.
So fix this by freeing the pages array right after releasing the pages at
__btrfs_write_out_cache().
This issue can often be reproduced with test case generic/475 from fstests
and kmemleak can detect it and reports it with the following trace:
David Sterba [Mon, 27 Jul 2020 15:38:19 +0000 (17:38 +0200)]
btrfs: use the correct const function attribute for btrfs_get_num_csums
The build robot reports
compiler: h8300-linux-gcc (GCC) 9.3.0
In file included from fs/btrfs/tests/extent-map-tests.c:8:
>> fs/btrfs/tests/../ctree.h:2166:8: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
2166 | size_t __const btrfs_get_num_csums(void);
| ^~~~~~~
The function attribute for const does not follow the expected scheme and
in this case is confused with a const type qualifier.
Btrfs' async submit mechanism is able to handle errors in the submission
path and the meta-data async submit function correctly passes the error
code to the caller.
In btrfs_submit_bio_start() and btrfs_submit_bio_start_direct_io() we're
not handling the errors returned by btrfs_csum_one_bio() correctly though
and simply call BUG_ON(). This is unnecessary as the caller of these two
functions - run_one_async_start - correctly checks for the return values
and sets the status of the async_submit_bio. The actual bio submission
will be handled later on by run_one_async_done only if
async_submit_bio::status is 0, so the data won't be written if we
encountered an error in the checksum process.
Simply return the error from btrfs_csum_one_bio() to the async submitters,
like it's done in btree_submit_bio_start().
brookxu [Mon, 17 Aug 2020 07:36:15 +0000 (15:36 +0800)]
ext4: limit the length of per-inode prealloc list
In the scenario of writing sparse files, the per-inode prealloc list may
be very long, resulting in high overhead for ext4_mb_use_preallocated().
To circumvent this problem, we limit the maximum length of per-inode
prealloc list to 512 and allow users to modify it.
After patching, we observed that the sys ratio of cpu has dropped, and
the system throughput has increased significantly. We created a process
to write the sparse file, and the running time of the process on the
fixed kernel was significantly reduced, as follows:
Running time on unfixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real 0m2.051s
user 0m0.008s
sys 0m2.026s
Running time on fixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real 0m0.471s
user 0m0.004s
sys 0m0.395s
brookxu [Sat, 15 Aug 2020 00:10:44 +0000 (08:10 +0800)]
ext4: add mb_debug logging when there are lost chunks
Lost chunks are when some other process raced with the current thread
to grab a particular block allocation. Add mb_debug log for
developers who wants to see how often this is happening for a
particular workload.
Sameer Pujar [Wed, 19 Aug 2020 15:32:10 +0000 (21:02 +0530)]
ALSA: hda: avoid reset of sdo_limit
By default 'sdo_limit' is initialized with a default value of '8'
as per spec. This is overridden in cases where a different value is
required. However this is getting reset when snd_hdac_bus_init_chip()
is called again, which happens during runtime PM cycle.
Avoid this reset by moving 'sdo_limit' setup to 'snd_hdac_bus_init()'
function which would be called only once.
Imre Deak [Mon, 20 Jul 2020 23:29:52 +0000 (02:29 +0300)]
drm/i915/tgl: Make sure TC-cold is blocked before enabling TC AUX power wells
The dependency between power wells is determined by the ordering of the
power well list: when enabling the power wells for a domain, this
happens walking the power well list forward, while disabling them
happens in the reverse direction. Accordingly a power well on the list
must follow any other power well it depends on.
Since the TC AUX power wells depend on TC-cold being blocked, move the
TC-cold off power well before all AUX power wells.
George Spelvin [Wed, 25 Mar 2020 19:24:29 +0000 (19:24 +0000)]
drm/i915/selftests: Avoid passing a random 0 into ilog2
igt_mm_config() calls ilog2() on the (pseudo)random 21-bit number
s>>12. Once in 2 million seeds, this is zero and ilog2 summons
the nasal demons.
There was an attempt to handle this case with a max(), but that's
too late; ms could already be something bizarre.
Given that the low 12 bits of s and ms are always zero, it's a lot
simpler just to divide them by 4096, then everything fits into 32
bits, and we can easily generate a random number 1 <= s <= 0x1fffff.
Chris Wilson [Thu, 16 Jul 2020 09:46:43 +0000 (10:46 +0100)]
drm/i915: Provide the perf pmu.module
Rather than manually implement our own module reference counting for perf
pmu events, finally realise that there is a module parameter to struct
pmu for this very purpose.
Thinh Nguyen [Wed, 19 Aug 2020 02:27:47 +0000 (19:27 -0700)]
usb: uas: Add quirk for PNY Pro Elite
PNY Pro Elite USB 3.1 Gen 2 device (SSD) doesn't respond to ATA_12
pass-through command (i.e. it just hangs). If it doesn't support this
command, it should respond properly to the host. Let's just add a quirk
to be able to move forward with other operations.
Merge tag 'fixes-for-v5.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb into usb-linus
Felipe writes:
USB: fixes for v5.9-rc
Three ZLP fixes on dwc3 and a resource leak fix on the TCM gadget
Signed-off-by: Felipe Balbi <[email protected]>
* tag 'fixes-for-v5.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb:
usb: dwc3: gadget: Handle ZLP for sg requests
usb: dwc3: gadget: Fix handling ZLP
usb: dwc3: gadget: Don't setup more than requested
usb: gadget: f_tcm: Fix some resource leaks in some error paths
Mike Pozulp [Tue, 18 Aug 2020 16:54:44 +0000 (09:54 -0700)]
ALSA: hda/realtek: Add quirk for Samsung Galaxy Book Ion
The Galaxy Book Ion uses the same ALC298 codec as other Samsung laptops
which have the no headphone sound bug, like my Samsung Notebook. The
Galaxy Book owner confirmed that this patch fixes the bug.
Takashi Iwai [Wed, 19 Aug 2020 06:03:04 +0000 (08:03 +0200)]
Merge tag 'asoc-fix-v5.9-rc1' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v5.9
A bunch of fixes that came in during the merge window, mostly for issues
that were uncovered by the changes to report errors on invalid register
access plus one important fix in that code itself.
Yonghong Song [Tue, 18 Aug 2020 22:23:10 +0000 (15:23 -0700)]
bpf: Avoid visit same object multiple times
Currently when traversing all tasks, the next tid
is always increased by one. This may result in
visiting the same task multiple times in a
pid namespace.
This patch fixed the issue by seting the next
tid as pid_nr_ns(pid, ns) + 1, similar to
funciton next_tgid().
Note that `bpftool prog` actually calls a task_file bpf iterator
program to establish an association between prog/map/link/btf anon
files and processes.
In the case where the above rcu stall occured, we had a process
having 1587 tasks and each task having roughly 81305 files.
This implied 129 million bpf prog invocations. Unfortunwtely none of
these files are prog/map/link/btf files so bpf iterator/prog needs
to traverse all these files and not able to return to user space
since there are no seq_file buffer overflow.
This patch fixed the issue in bpf_seq_read() to limit the number
of visited objects. If the maximum number of visited objects is
reached, no more objects will be visited in the current syscall.
If there is nothing written in the seq_file buffer, -EAGAIN will
return to the user so user can try again.
The maximum number of visited objects is set at 1 million.
In our Intel Xeon D-2191 2.3GHZ 18-core server, bpf_seq_read()
visiting 1 million files takes around 0.18 seconds.
We did not use cond_resched() since for some iterators, e.g.,
netlink iterator, where rcu read_lock critical section spans between
consecutive seq_ops->next(), which makes impossible to do cond_resched()
in the key while loop of function bpf_seq_read().
David S. Miller [Tue, 18 Aug 2020 23:00:24 +0000 (16:00 -0700)]
Merge branch 'ethtool-netlink-bug-fixes'
Maxim Mikityanskiy says:
====================
ethtool-netlink bug fixes
This series contains a few bug fixes for ethtool-netlink. These bugs are
specific for the netlink interface, and the legacy ioctl interface is
not affected. These patches aim to have the same behavior in
ethtool-netlink as in the legacy ethtool.
Please also see the sibling series for the userspace tool.
ethtool: Don't omit the netlink reply if no features were changed
The legacy ethtool userspace tool shows an error when no features could
be changed. It's useful to have a netlink reply to be able to show this
error when __netdev_update_features wasn't called, for example:
1. ethtool -k eth0
large-receive-offload: off
2. ethtool -K eth0 rx-fcs on
3. ethtool -K eth0 lro on
Could not change any device features
rx-lro: off [requested on]
4. ethtool -K eth0 lro on
# The output should be the same, but without this patch the kernel
# doesn't send the reply, and ethtool is unable to detect the error.
This commit makes ethtool-netlink always return a reply when requested,
and it still avoids unnecessary calls to __netdev_update_features if the
wanted features haven't changed.
Fixes: 0980bfcd6954 ("ethtool: set netdev features with FEATURES_SET request") Signed-off-by: Maxim Mikityanskiy <[email protected]> Reviewed-by: Michal Kubecek <[email protected]> Signed-off-by: David S. Miller <[email protected]>
ethtool: Account for hw_features in netlink interface
ethtool-netlink ignores dev->hw_features and may confuse the drivers by
asking them to enable features not in the hw_features bitmask. For
example:
1. ethtool -k eth0
tls-hw-tx-offload: off [fixed]
2. ethtool -K eth0 tls-hw-tx-offload on
tls-hw-tx-offload: on
3. ethtool -k eth0
tls-hw-tx-offload: on [fixed]
Fitler out dev->hw_features from req_wanted to fix it and to resemble
the legacy ethtool behavior.
Fixes: 0980bfcd6954 ("ethtool: set netdev features with FEATURES_SET request") Signed-off-by: Maxim Mikityanskiy <[email protected]> Reviewed-by: Michal Kubecek <[email protected]> Signed-off-by: David S. Miller <[email protected]>
It completely discards the old wanted bits, so they are forgotten with
the next ethtool command. Sample steps to reproduce:
1. ethtool -k eth0
tx-tcp-segmentation: on # TSO is on from the beginning
2. ethtool -K eth0 tx off
tx-tcp-segmentation: off [not requested]
3. ethtool -k eth0
tx-tcp-segmentation: off [requested on]
4. ethtool -K eth0 rx off # Some change unrelated to TSO
5. ethtool -k eth0
tx-tcp-segmentation: off # "Wanted on" is forgotten
This commit fixes it by changing the formula to:
(req_wanted & req_mask) | (old_wanted & ~req_mask),
where old_active was replaced by old_wanted to account for the wanted
bits.
The shortcut condition for the case where nothing was changed now
compares wanted bitmasks, instead of wanted to active.
Fixes: 0980bfcd6954 ("ethtool: set netdev features with FEATURES_SET request") Signed-off-by: Maxim Mikityanskiy <[email protected]> Reviewed-by: Michal Kubecek <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Xin Long [Mon, 17 Aug 2020 06:30:49 +0000 (14:30 +0800)]
ipv6: some fixes for ipv6_dev_find()
This patch is to do 3 things for ipv6_dev_find():
As David A. noticed,
- rt6_lookup() is not really needed. Different from __ip_dev_find(),
ipv6_dev_find() doesn't have a compatibility problem, so remove it.
As Hideaki suggested,
- "valid" (non-tentative) check for the address is also needed.
ipv6_chk_addr() calls ipv6_chk_addr_and_flags(), which will
traverse the address hash list, but it's heavy to be called
inside ipv6_dev_find(). This patch is to reuse the code of
ipv6_chk_addr_and_flags() for ipv6_dev_find().
- dev parameter is passed into ipv6_dev_find(), as link-local
addresses from user space has sin6_scope_id set and the dev
lookup needs it.
Jiri Wiesner [Sun, 16 Aug 2020 18:52:44 +0000 (20:52 +0200)]
bonding: fix active-backup failover for current ARP slave
When the ARP monitor is used for link detection, ARP replies are
validated for all slaves (arp_validate=3) and fail_over_mac is set to
active, two slaves of an active-backup bond may get stuck in a state
where both of them are active and pass packets that they receive to
the bond. This state makes IPv6 duplicate address detection fail. The
state is reached thus:
1. The current active slave goes down because the ARP target
is not reachable.
2. The current ARP slave is chosen and made active.
3. A new slave is enslaved. This new slave becomes the current active
slave and can reach the ARP target.
As a result, the current ARP slave stays active after the enslave
action has finished and the log is littered with "PROBE BAD" messages:
> bond0: PROBE: c_arp ens10 && cas ens11 BAD
The workaround is to remove the slave with "going back" status from
the bond and re-enslave it. This issue was encountered when DPDK PMD
interfaces were being enslaved to an active-backup bond.
I would be possible to fix the issue in bond_enslave() or
bond_change_active_slave() but the ARP monitor was fixed instead to
keep most of the actions changing the current ARP slave in the ARP
monitor code. The current ARP slave is set as inactive and backup
during the commit phase. A new state, BOND_LINK_FAIL, has been
introduced for slaves in the context of the ARP monitor. This allows
administrators to see how slaves are rotated for sending ARP requests
and attempts are made to find a new active slave.
Fixes: b2220cad583c9 ("bonding: refactor ARP active-backup monitor") Signed-off-by: Jiri Wiesner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Miaohe Lin [Sat, 15 Aug 2020 08:46:41 +0000 (04:46 -0400)]
net: handle the return value of pskb_carve_frag_list() correctly
pskb_carve_frag_list() may return -ENOMEM in pskb_carve_inside_nonlinear().
we should handle this correctly or we would get wrong sk_buff.
Fixes: 6fa01ccd8830 ("skbuff: Add pskb_extract() helper function") Signed-off-by: Miaohe Lin <[email protected]> Signed-off-by: David S. Miller <[email protected]>