a refcount leak in the error path of u32_change() has been recently
introduced. It can be observed with the following commands:
[root@f31 ~]# tc filter replace dev eth0 ingress protocol ip prio 97 \
> u32 match ip src 127.0.0.1/32 indev notexist20 flowid 1:1 action drop
RTNETLINK answers: Invalid argument
We have an error talking to the kernel
[root@f31 ~]# tc filter replace dev eth0 ingress protocol ip prio 98 \
> handle 42:42 u32 divisor 256
Error: cls_u32: Divisor can only be used on a hash table.
We have an error talking to the kernel
[root@f31 ~]# tc filter replace dev eth0 ingress protocol ip prio 99 \
> u32 ht 47:47
Error: cls_u32: Specified hash table not found.
We have an error talking to the kernel
they all legitimately return -EINVAL; however, they leave semi-configured
filters at eth0 tc ingress:
[root@f31 ~]# tc filter show dev eth0 ingress
filter protocol ip pref 97 u32 chain 0
filter protocol ip pref 97 u32 chain 0 fh 800: ht divisor 1
filter protocol ip pref 98 u32 chain 0
filter protocol ip pref 98 u32 chain 0 fh 801: ht divisor 1
filter protocol ip pref 99 u32 chain 0
filter protocol ip pref 99 u32 chain 0 fh 802: ht divisor 1
With older kernels, filters were unconditionally considered empty (and
thus de-refcounted) on the error path of ->change().
After commit 8b64678e0af8 ("net: sched: refactor tp insert/delete for
concurrent execution"), filters were considered empty when the walk()
function didn't set 'walker.stop' to 1.
Finally, with commit 6676d5e416ee ("net: sched: set dedicated tcf_walker
flag when tp is empty"), tc filters are considered empty unless the walker
function is called with a non-NULL handle. This last change doesn't fit
cls_u32 design, because at least the "root hnode" is (almost) always
non-NULL, as it's allocated in u32_init().
- patch 1/2 is a proposal to restore the original kernel behavior, where
no filter was installed in the error path of u32_change().
- patch 2/2 adds tdc selftests that can be ued to verify the correct
behavior of u32 in the error path of ->change().
====================
Davide Caratti [Tue, 17 Dec 2019 23:00:05 +0000 (00:00 +0100)]
tc-testing: initial tdc selftests for cls_u32
- move test "e9a3 - Add u32 with source match" to u32.json, and change the
match pattern to catch all hnodes
- add testcases for relevant error paths of cls_u32 module
Davide Caratti [Tue, 17 Dec 2019 23:00:04 +0000 (00:00 +0100)]
net/sched: cls_u32: fix refcount leak in the error path of u32_change()
when users replace cls_u32 filters with new ones having wrong parameters,
so that u32_change() fails to validate them, the kernel doesn't roll-back
correctly, and leaves semi-configured rules.
Fix this in u32_walk(), avoiding a call to the walker function on filters
that don't have a match rule connected. The side effect is, these "empty"
filters are not even dumped when present; but that shouldn't be a problem
as long as we are restoring the original behaviour, where semi-configured
filters were not even added in the error path of u32_change().
Fixes: 6676d5e416ee ("net: sched: set dedicated tcf_walker flag when tp is empty") Signed-off-by: Davide Caratti <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Aditya Pakki [Tue, 17 Dec 2019 20:43:00 +0000 (14:43 -0600)]
nfc: s3fwrn5: replace the assertion with a WARN_ON
In s3fwrn5_fw_recv_frame, if fw_info->rsp is not empty, the
current code causes a crash via BUG_ON. However, s3fwrn5_fw_send_msg
does not crash in such a scenario. The patch replaces the BUG_ON
by returning the error to the callers and frees up skb.
====================
net: macb: fix probing of PHY not described in the dt
The macb Ethernet driver supports various ways of referencing its
network PHY. When a device tree is used the PHY can be referenced with
a phy-handle or, if connected to its internal MDIO bus, described in
a child node. Some platforms omitted the PHY description while
connecting the PHY to the internal MDIO bus and in such cases the MDIO
bus has to be scanned "manually" by the macb driver.
Prior to the phylink conversion the driver registered the MDIO bus with
of_mdiobus_register and then in case the PHY couldn't be retrieved
using dt or using phy_find_first (because registering an MDIO bus with
of_mdiobus_register masks all PHYs) the macb driver was "manually"
scanning the MDIO bus (like mdiobus_register does). The phylink
conversion did break this particular case but reimplementing the manual
scan of the bus in the macb driver wouldn't be very clean. The solution
seems to be registering the MDIO bus based on if the PHYs are described
in the device tree or not.
There are multiple ways to do this, none is perfect. I chose to check if
any of the child nodes of the macb node was a network PHY and based on
this to register the MDIO bus with the of_ helper or not. The drawback
is boards referencing the PHY through phy-handle, would scan the entire
MDIO bus of the macb at boot time (as the MDIO bus would be registered
with mdiobus_register). For this solution to work properly
of_mdiobus_child_is_phy has to be exported, which means the patch doing
so has to be backported to -stable as well.
Another possible solution could have been to simply check if the macb
node has a child node by counting its sub-nodes. This isn't techically
perfect, as there could be other sub-nodes (in practice this should be
fine, fixed-link being taken care of in the driver). We could also
simply s/of_mdiobus_register/mdiobus_register/ but that could break
boards using the PHY description in child node as a selector (which
really would be not a proper way to do this...).
The real issue here being having PHYs not described in the dt but we
have dt backward compatibility, so we have to live with that.
====================
Antoine Tenart [Tue, 17 Dec 2019 17:07:42 +0000 (18:07 +0100)]
net: macb: fix probing of PHY not described in the dt
This patch fixes the case where the PHY isn't described in the device
tree. This is due to the way the MDIO bus is registered in the driver:
whether the PHY is described in the device tree or not, the bus is
registered through of_mdiobus_register. The function masks all the PHYs
and only allow probing the ones described in the device tree. Prior to
the Phylink conversion this was also done but later on in the driver
the MDIO bus was manually scanned to circumvent the fact that the PHY
wasn't described.
This patch fixes it in a proper way, by registering the MDIO bus based
on if the PHY attached to a given interface is described in the device
tree or not.
Fixes: 7897b071ac3b ("net: macb: convert to phylink") Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: David S. Miller <[email protected]>
tracing: Have the histogram compare functions convert to u64 first
The compare functions of the histogram code would be specific for the size
of the value being compared (byte, short, int, long long). It would
reference the value from the array via the type of the compare, but the
value was stored in a 64 bit number. This is fine for little endian
machines, but for big endian machines, it would end up comparing zeros or
all ones (depending on the sign) for anything but 64 bit numbers.
To fix this, first derference the value as a u64 then convert it to the type
being compared.
Keita Suzuki [Wed, 11 Dec 2019 09:12:58 +0000 (09:12 +0000)]
tracing: Avoid memory leak in process_system_preds()
When failing in the allocation of filter_item, process_system_preds()
goes to fail_mem, where the allocated filter is freed.
However, this leads to memory leak of filter->filter_string and
filter->prog, which is allocated before and in process_preds().
This bug has been detected by kmemleak as well.
Daniel Borkmann [Thu, 19 Dec 2019 21:19:51 +0000 (22:19 +0100)]
bpf: Add further test_verifier cases for record_func_key
Expand dummy prog generation such that we can easily check on return
codes and add few more test cases to make sure we keep on tracking
pruning behavior.
# ./test_verifier
[...]
#1066/p XDP pkt read, pkt_data <= pkt_meta', bad access 1 OK
#1067/p XDP pkt read, pkt_data <= pkt_meta', bad access 2 OK
Summary: 1580 PASSED, 0 SKIPPED, 0 FAILED
Also verified that JIT dump of added test cases looks good.
Daniel Borkmann [Thu, 19 Dec 2019 21:19:50 +0000 (22:19 +0100)]
bpf: Fix record_func_key to perform backtracking on r3
While testing Cilium with /unreleased/ Linus' tree under BPF-based NodePort
implementation, I noticed a strange BPF SNAT engine behavior from time to
time. In some cases it would do the correct SNAT/DNAT service translation,
but at a random point in time it would just stop and perform an unexpected
translation after SYN, SYN/ACK and stack would send a RST back. While initially
assuming that there is some sort of a race condition in BPF code, adding
trace_printk()s for debugging purposes at some point seemed to have resolved
the issue auto-magically.
Digging deeper on this Heisenbug and reducing the trace_printk() calls to
an absolute minimum, it turns out that a single call would suffice to
trigger / not trigger the seen RST issue, even though the logic of the
program itself remains unchanged. Turns out the single call changed verifier
pruning behavior to get everything to work. Reconstructing a minimal test
case, the incorrect JIT dump looked as follows:
Hence, unexpectedly, JIT emitted a direct jump even though retpoline based
one would have been needed since in line 2c and 3a we have different slot
keys in BPF reg r3. Verifier log of the test case reveals what happened:
Second branch is pruned by verifier since considered safe, but issue is that
record_func_key() couldn't have seen the index in line 3a and therefore
decided that emitting a direct jump at this location was okay.
Fix this by reusing our backtracking logic for precise scalar verification
in order to prevent pruning on the slot key. This means verifier will track
content of r3 all the way backwards and only prune if both scalars were
unknown in state equivalence check and therefore poisoned in the first place
in record_func_key(). The range is [x,x] in record_func_key() case since
the slot always would have to be constant immediate. Correct verification
after fix:
Also explicitly adding explicit env->allow_ptr_leaks to fixup_bpf_calls() since
backtracking is enabled under former (direct jumps as well, but use different
test). In case of only tracking different map pointers as in c93552c443eb ("bpf:
properly enforce index mask to prevent out-of-bounds speculation"), pruning
cannot make such short-cuts, neither if there are paths with scalar and non-scalar
types as r3. mark_chain_precision() is only needed after we know that
register_is_const(). If it was not the case, we already poison the key on first
path and non-const key in later paths are not matching the scalar range in regsafe()
either. Cilium NodePort testing passes fine as well now. Note, released kernels
not affected.
net, sysctl: Fix compiler warning when only cBPF is present
proc_dointvec_minmax_bpf_restricted() has been firstly introduced
in commit 2e4a30983b0f ("bpf: restrict access to core bpf sysctls")
under CONFIG_HAVE_EBPF_JIT. Then, this ifdef has been removed in ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv
allocations"), because a new sysctl, bpf_jit_limit, made use of it.
Finally, this parameter has become long instead of integer with fdadd04931c2 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K")
and thus, a new proc_dolongvec_minmax_bpf_restricted() has been
added.
With this last change, we got back to that
proc_dointvec_minmax_bpf_restricted() is used only under
CONFIG_HAVE_EBPF_JIT, but the corresponding ifdef has not been
brought back.
So, in configurations like CONFIG_BPF_JIT=y && CONFIG_HAVE_EBPF_JIT=n
since v4.20 we have:
CC net/core/sysctl_net_core.o
net/core/sysctl_net_core.c:292:1: warning: ‘proc_dointvec_minmax_bpf_restricted’ defined but not used [-Wunused-function]
292 | proc_dointvec_minmax_bpf_restricted(struct ctl_table *table, int write,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Suppress this by guarding it with CONFIG_HAVE_EBPF_JIT again.
Linus Torvalds [Thu, 19 Dec 2019 16:13:04 +0000 (08:13 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
"6 fixes"
* emailed patches from Andrew Morton <[email protected]>:
lib/Kconfig.debug: fix some messed up configurations
mm: vmscan: protect shrinker idr replace with CONFIG_MEMCG
kasan: don't assume percpu shadow allocations will succeed
kasan: use apply_to_existing_page_range() for releasing vmalloc shadow
mm/memory.c: add apply_to_existing_page_range() helper
kasan: fix crashes on access to memory mapped by vm_map_ram()
Linus Torvalds [Thu, 19 Dec 2019 16:09:43 +0000 (08:09 -0800)]
Merge tag 'pm-5.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Fix a problem related to CPU offline/online and cpufreq governors that
in some system configurations may lead to a system-wide deadlock
during CPU online"
* tag 'pm-5.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: Avoid leaving stale IRQ work items during CPU offline
Darrick J. Wong [Wed, 11 Dec 2019 21:19:06 +0000 (13:19 -0800)]
xfs: don't commit sunit/swidth updates to disk if that would cause repair failures
Alex Lyakas reported[1] that mounting an xfs filesystem with new sunit
and swidth values could cause xfs_repair to fail loudly. The problem
here is that repair calculates the where mkfs should have allocated the
root inode, based on the superblock geometry. The allocation decisions
depend on sunit, which means that we really can't go updating sunit if
it would lead to a subsequent repair failure on an otherwise correct
filesystem.
Port from xfs_repair some code that computes the location of the root
inode and teach mount to skip the ondisk update if it would cause
problems for repair. Along the way we'll update the documentation,
provide a function for computing the minimum AGFL size instead of
open-coding it, and cut down some indenting in the mount code.
Note that we allow the mount to proceed (and new allocations will
reflect this new geometry) because we've never screened this kind of
thing before. We'll have to wait for a new future incompat feature to
enforce correct behavior, alas.
Note that the geometry reporting always uses the superblock values, not
the incore ones, so that is what xfs_info and xfs_growfs will report.
Darrick J. Wong [Wed, 18 Dec 2019 19:13:16 +0000 (11:13 -0800)]
xfs: split the sunit parameter update into two parts
If the administrator provided a sunit= mount option, we need to validate
the raw parameter, convert the mount option units (512b blocks) into the
internal unit (fs blocks), and then validate that the (now cooked)
parameter doesn't screw anything up on disk. The incore inode geometry
computation can depend on the new sunit option, but a subsequent patch
will make validating the cooked value depends on the computed inode
geometry, so break the sunit update into two steps.
Darrick J. Wong [Wed, 18 Dec 2019 19:09:55 +0000 (11:09 -0800)]
xfs: refactor agfl length computation function
Refactor xfs_alloc_min_freelist to accept a NULL @pag argument, in which
case it returns the largest possible minimum length. This will be used
in an upcoming patch to compute the length of the AGFL at mkfs time.
Darrick J. Wong [Mon, 16 Dec 2019 19:14:09 +0000 (11:14 -0800)]
libxfs: resync with the userspace libxfs
Prepare to resync the userspace libxfs with the kernel libxfs. There
were a few things I missed -- a couple of static inline directory
functions that have to be exported for xfs_repair; a couple of directory
naming functions that make porting much easier if they're /not/ static
inline; and a u16 usage that should have been uint16_t.
None of these things are bugs in their own right; this just makes
porting xfsprogs easier.
Brian Foster [Tue, 17 Dec 2019 21:50:26 +0000 (13:50 -0800)]
xfs: use bitops interface for buf log item AIL flag check
The xfs_log_item flags were converted to atomic bitops as of commit 22525c17ed ("xfs: log item flags are racy"). The assert check for
AIL presence in xfs_buf_item_relse() still uses the old value based
check. This likely went unnoticed as XFS_LI_IN_AIL evaluates to 0
and causes the assert to unconditionally pass. Fix up the check.
Daniel Borkmann [Thu, 19 Dec 2019 15:20:49 +0000 (16:20 +0100)]
Merge branch 'bpf-fix-xsk-wakeup'
Maxim Mikityanskiy says:
====================
This series addresses the issue described in the commit message of the
first patch: lack of synchronization between XSK wakeup and destroying
the resources used by XSK wakeup. The idea is similar to napi_synchronize.
The series contains fixes for the drivers that implement XSK.
v2 incorporates changes suggested by Björn:
1. Call synchronize_rcu in Intel drivers only if the XDP program is
being unloaded.
2. Don't forget rcu_read_lock when wakeup is called from xsk_poll.
3. Use xs->zc as the condition to call ndo_xsk_wakeup.
====================
net/ixgbe: Fix concurrency issues between config flow and XSK
Use synchronize_rcu to wait until the XSK wakeup function finishes
before destroying the resources it uses:
1. ixgbe_down already calls synchronize_rcu after setting __IXGBE_DOWN.
2. After switching the XDP program, call synchronize_rcu to let
ixgbe_xsk_wakeup exit before the XDP program is freed.
3. Changing the number of channels brings the interface down.
4. Disabling UMEM sets __IXGBE_TX_DISABLED before closing hardware
resources and resetting xsk_umem. Check that bit in ixgbe_xsk_wakeup to
avoid using the XDP ring when it's already destroyed. synchronize_rcu is
called from ixgbe_txrx_ring_disable.
net/i40e: Fix concurrency issues between config flow and XSK
Use synchronize_rcu to wait until the XSK wakeup function finishes
before destroying the resources it uses:
1. i40e_down already calls synchronize_rcu. On i40e_down either
__I40E_VSI_DOWN or __I40E_CONFIG_BUSY is set. Check the latter in
i40e_xsk_wakeup (the former is already checked there).
2. After switching the XDP program, call synchronize_rcu to let
i40e_xsk_wakeup exit before the XDP program is freed.
3. Changing the number of channels brings the interface down (see
i40e_prep_for_reset and i40e_pf_quiesce_all_vsi).
net/mlx5e: Fix concurrency issues between config flow and XSK
After disabling resources necessary for XSK (the XDP program, channels,
XSK queues), use synchronize_rcu to wait until the XSK wakeup function
finishes, before freeing the resources.
Suspend XSK wakeups during switching channels. If the XDP program is
being removed, synchronize_rcu before closing the old channels to allow
XSK wakeup to complete.
The XSK wakeup callback in drivers makes some sanity checks before
triggering NAPI. However, some configuration changes may occur during
this function that affect the result of those checks. For example, the
interface can go down, and all the resources will be destroyed after the
checks in the wakeup function, but before it attempts to use these
resources. Wrap this callback in rcu_read_lock to allow driver to
synchronize_rcu before actually destroying the resources.
xsk_wakeup is a new function that encapsulates calling ndo_xsk_wakeup
wrapped into the RCU lock. After this commit, xsk_poll starts using
xsk_wakeup and checks xs->zc instead of ndo_xsk_wakeup != NULL to decide
ndo_xsk_wakeup should be called. It also fixes a bug introduced with the
need_wakeup feature: a non-zero-copy socket may be used with a driver
supporting zero-copy, and in this case ndo_xsk_wakeup should not be
called, so the xs->zc check is the correct one.
clk: qcom: gcc-sc7180: Fix setting flag for votable GDSCs
Commit 17269568f7267 ("clk: qcom: Add Global Clock controller (GCC)
driver for SC7180") sets the VOTABLE flag in .pwrsts, but it needs
to be set in .flags, fix this.
Linus Torvalds [Thu, 19 Dec 2019 01:17:36 +0000 (17:17 -0800)]
Merge tag 'tpmdd-next-20191219' of git://git.infradead.org/users/jjs/linux-tpmdd
Pull tpm fixes from Jarkko Sakkinen:
"Bunch of fixes for rc3"
* tag 'tpmdd-next-20191219' of git://git.infradead.org/users/jjs/linux-tpmdd:
tpm/tpm_ftpm_tee: add shutdown call back
tpm: selftest: cleanup after unseal with wrong auth/policy test
tpm: selftest: add test covering async mode
tpm: fix invalid locking in NONBLOCKING mode
security: keys: trusted: fix lost handle flush
tpm_tis: reserve chip for duration of tpm_tis_core_init
KEYS: asymmetric: return ENOMEM if akcipher_request_alloc() fails
KEYS: remove CONFIG_KEYS_COMPAT
Vasily Gorbik [Tue, 10 Dec 2019 12:50:23 +0000 (13:50 +0100)]
s390/ftrace: save traced function caller
A typical backtrace acquired from ftraced function currently looks like
the following (e.g. for "path_openat"):
arch_stack_walk+0x15c/0x2d8
stack_trace_save+0x50/0x68
stack_trace_call+0x15a/0x3b8
ftrace_graph_caller+0x0/0x1c
0x3e0007e3c98 <- ftraced function caller (should be do_filp_open+0x7c/0xe8)
do_open_execat+0x70/0x1b8
__do_execve_file.isra.0+0x7d8/0x860
__s390x_sys_execve+0x56/0x68
system_call+0xdc/0x2d8
Note random "0x3e0007e3c98" stack value as ftraced function caller. This
value causes either imprecise unwinder result or unwinding failure.
That "0x3e0007e3c98" comes from r14 of ftraced function stack frame, which
it haven't had a chance to initialize since the very first instruction
calls ftrace code ("ftrace_caller"). (ftraced function might never
save r14 as well). Nevertheless according to s390 ABI any function
is called with stack frame allocated for it and r14 contains return
address. "ftrace_caller" itself is called with "brasl %r0,ftrace_caller".
So, to fix this issue simply always save traced function caller onto
ftraced function stack frame.
Vasily Gorbik [Wed, 11 Dec 2019 16:27:31 +0000 (17:27 +0100)]
s390/unwind: stop gracefully at user mode pt_regs in irq stack
Consider reaching user mode pt_regs at the bottom of irq stack graceful
unwinder termination. This is the case when irq/mcck/ext interrupt arrives
while in user mode.
s390/purgatory: do not build purgatory with kcov, kasan and friends
the purgatory must not rely on functions from the "old" kernel,
so we must disable kasan and friends. We also need to have a
separate copy of string.c as the default does not build memcmp
with KASAN.
The stack overflow happens because get_tod_clock_monotonic() gets called
by ftrace but itself calls preempt_{disable,enable}(), which leads to a
endless recursion. Fix this by using preempt_{disable,enable}_notrace().
Jose Abreu [Wed, 18 Dec 2019 10:17:43 +0000 (11:17 +0100)]
net: stmmac: Always arm TX Timer at end of transmission start
If TX Coalesce timer is enabled we should always arm it, otherwise we
may hit the case where an interrupt is missed and the TX Queue will
timeout.
Arming the timer does not necessarly mean it will run the tx_clean()
because this function is wrapped around NAPI launcher.
Fixes: 9125cdd1be11 ("stmmac: add the initial tx coalesce schema") Signed-off-by: Jose Abreu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jose Abreu [Wed, 18 Dec 2019 10:17:42 +0000 (11:17 +0100)]
net: stmmac: Enable 16KB buffer size
XGMAC supports maximum MTU that can go to 16KB. Lets add this check in
the calculation of RX buffer size.
Fixes: 7ac6653a085b ("stmmac: Move the STMicroelectronics driver") Signed-off-by: Jose Abreu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jose Abreu [Wed, 18 Dec 2019 10:17:41 +0000 (11:17 +0100)]
net: stmmac: 16KB buffer must be 16 byte aligned
The 16KB RX Buffer must also be 16 byte aligned. Fix it.
Fixes: 7ac6653a085b ("stmmac: Move the STMicroelectronics driver") Signed-off-by: Jose Abreu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jose Abreu [Wed, 18 Dec 2019 10:17:40 +0000 (11:17 +0100)]
net: stmmac: RX buffer size must be 16 byte aligned
We need to align the RX buffer size to at least 16 byte so that IP
doesn't mis-behave. This is required by HW.
Changes from v2:
- Align UP and not DOWN (David)
Fixes: 7ac6653a085b ("stmmac: Move the STMicroelectronics driver") Signed-off-by: Jose Abreu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jose Abreu [Wed, 18 Dec 2019 10:17:39 +0000 (11:17 +0100)]
net: stmmac: xgmac: Clear previous RX buffer size
When switching between buffer sizes we need to clear the previous value.
Fixes: d6ddfacd95c7 ("net: stmmac: Add DMA related callbacks for XGMAC2") Signed-off-by: Jose Abreu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jose Abreu [Wed, 18 Dec 2019 10:17:38 +0000 (11:17 +0100)]
net: stmmac: Only the last buffer has the FCS field
Only the last received buffer contains the FCS field. Check for end of
packet before trying to strip the FCS field.
Fixes: 88ebe2cf7f3f ("net: stmmac: Rework stmmac_rx()") Signed-off-by: Jose Abreu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jose Abreu [Wed, 18 Dec 2019 10:17:37 +0000 (11:17 +0100)]
net: stmmac: Do not accept invalid MTU values
The maximum MTU value is determined by the maximum size of TX FIFO so
that a full packet can fit in the FIFO. Add a check for this in the MTU
change callback.
Also check if provided and rounded MTU does not passes the maximum limit
of 16K.
Changes from v2:
- Align MTU before checking if its valid
Fixes: 7ac6653a085b ("stmmac: Move the STMicroelectronics driver") Signed-off-by: Jose Abreu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jose Abreu [Wed, 18 Dec 2019 10:17:36 +0000 (11:17 +0100)]
net: stmmac: Determine earlier the size of RX buffer
Split Header feature needs to know the size of RX buffer but current
code is determining it too late. Fix this by moving the RX buffer
computation to earlier stage.
Changes from v2:
- Do not try to align already aligned buffer size
Fixes: 67afd6d1cfdf ("net: stmmac: Add Split Header support and enable it in XGMAC cores") Signed-off-by: Jose Abreu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jose Abreu [Wed, 18 Dec 2019 10:17:35 +0000 (11:17 +0100)]
net: stmmac: selftests: Needs to check the number of Multicast regs
When running the MC and UC filter tests we setup a multicast address
that its expected to be blocked. If the number of available multicast
registers is zero, driver will always pass the multicast packets which
will fail the test.
Check if available multicast addresses is enough before running the
tests.
Fixes: 091810dbded9 ("net: stmmac: Introduce selftests support") Signed-off-by: Jose Abreu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jia-Ju Bai [Wed, 18 Dec 2019 09:21:55 +0000 (17:21 +0800)]
net: nfc: nci: fix a possible sleep-in-atomic-context bug in nci_uart_tty_receive()
The kernel may sleep while holding a spinlock.
The function call path (from bottom to top) in Linux 4.19 is:
net/nfc/nci/uart.c, 349:
nci_skb_alloc in nci_uart_default_recv_buf
net/nfc/nci/uart.c, 255:
(FUNC_PTR)nci_uart_default_recv_buf in nci_uart_tty_receive
net/nfc/nci/uart.c, 254:
spin_lock in nci_uart_tty_receive
nci_skb_alloc(GFP_KERNEL) can sleep at runtime.
(FUNC_PTR) means a function pointer is called.
To fix this bug, GFP_KERNEL is replaced with GFP_ATOMIC for
nci_skb_alloc().
This bug is found by a static analysis tool STCheck written by myself.
Jens Axboe [Wed, 18 Dec 2019 19:19:41 +0000 (12:19 -0700)]
io_uring: io_wq_submit_work() should not touch req->rw
I've been chasing a weird and obscure crash that was userspace stack
corruption, and finally narrowed it down to a bit flip that made a
stack address invalid. io_wq_submit_work() unconditionally flips
the req->rw.ki_flags IOCB_NOWAIT bit, but since it's a generic work
handler, this isn't valid. Normal read/write operations own that
part of the request, on other types it could be something else.
Move the IOCB_NOWAIT clear to the read/write handlers where it belongs.
Olof Johansson [Wed, 18 Dec 2019 17:56:21 +0000 (09:56 -0800)]
clk: Move clk_core_reparent_orphans() under CONFIG_OF
A recent addition exposed a helper that is only used for CONFIG_OF. Move
it into the CONFIG_OF zone in this file to make the compiler stop
warning about an unused function.
Pavel Begunkov [Wed, 18 Dec 2019 16:53:45 +0000 (19:53 +0300)]
io_uring: don't wait when under-submitting
There is no reliable way to submit and wait in a single syscall, as
io_submit_sqes() may under-consume sqes (in case of an early error).
Then it will wait for not-yet-submitted requests, deadlocking the user
in most cases.
Linus Torvalds [Wed, 18 Dec 2019 16:54:15 +0000 (08:54 -0800)]
Merge tag 'sound-5.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A slightly high amount at this time, but all good and small fixes:
- A PCM core fix that initializes the buffer properly for avoiding
information leaks; it is a long-standing minor problem, but good to
fix better now
- A few ASoC core fixes for the init / cleanup ordering issues that
surfaced after the recent refactoring
- Lots of SOF and topology-related fixes went in, as usual as such
hot topics
- Several ASoC codec and platform-specific small fixes: wm89xx,
realtek, and max98090, AMD, Intel-SST
- A fix for the previous incomplete regression of HD-audio, now
hitting Nvidia HDMI
- A few HD-audio CA0132 codec fixes"
* tag 'sound-5.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (27 commits)
ALSA: hda - Downgrade error message for single-cmd fallback
ASoC: wm8962: fix lambda value
ALSA: hda: Fix regression by strip mask fix
ALSA: hda/ca0132 - Fix work handling in delayed HP detection
ALSA: hda/ca0132 - Avoid endless loop
ALSA: hda/ca0132 - Keep power on during processing DSP response
ALSA: pcm: Avoid possible info leaks from PCM stream buffers
ASoC: Intel: common: work-around incorrect ACPI HID for CML boards
ASoC: SOF: Intel: split cht and byt debug window sizes
ASoC: SOF: loader: fix snd_sof_fw_parse_ext_data
ASoC: SOF: loader: snd_sof_fw_parse_ext_data log warning on unknown header
ASoC: simple-card: Don't create separate link when platform is present
ASoC: topology: Check return value for soc_tplg_pcm_create()
ASoC: topology: Check return value for snd_soc_add_dai_link()
ASoC: core: only flush inited work during free
ASoC: Intel: bytcr_rt5640: Update quirk for Teclast X89
ASoC: core: Init pcm runtime work early to avoid warnings
ASoC: Intel: sst: Add missing include <linux/io.h>
ASoC: max98090: fix possible race conditions
ASoC: max98090: exit workaround earlier if PLL is locked
...
The host reports support for the synthetic feature X86_FEATURE_SSBD
when any of the three following hardware features are set:
CPUID.(EAX=7,ECX=0):EDX.SSBD[bit 31]
CPUID.80000008H:EBX.AMD_SSBD[bit 24]
CPUID.80000008H:EBX.VIRT_SSBD[bit 25]
Either of the first two hardware features implies the existence of the
IA32_SPEC_CTRL MSR, but CPUID.80000008H:EBX.VIRT_SSBD[bit 25] does
not. Therefore, CPUID.80000008H:EBX.AMD_SSBD[bit 24] should only be
set in the guest if CPUID.(EAX=7,ECX=0):EDX.SSBD[bit 31] or
CPUID.80000008H:EBX.AMD_SSBD[bit 24] is set on the host.
The host reports support for the synthetic feature X86_FEATURE_SSBD
when any of the three following hardware features are set:
CPUID.(EAX=7,ECX=0):EDX.SSBD[bit 31]
CPUID.80000008H:EBX.AMD_SSBD[bit 24]
CPUID.80000008H:EBX.VIRT_SSBD[bit 25]
Either of the first two hardware features implies the existence of the
IA32_SPEC_CTRL MSR, but CPUID.80000008H:EBX.VIRT_SSBD[bit 25] does
not. Therefore, CPUID.(EAX=7,ECX=0):EDX.SSBD[bit 31] should only be
set in the guest if CPUID.(EAX=7,ECX=0):EDX.SSBD[bit 31] or
CPUID.80000008H:EBX.AMD_SSBD[bit 24] is set on the host.
Robin Murphy [Mon, 9 Dec 2019 19:47:25 +0000 (19:47 +0000)]
iommu/dma: Relax locking in iommu_dma_prepare_msi()
Since commit ece6e6f0218b ("iommu/dma-iommu: Split iommu_dma_map_msi_msg()
in two parts"), iommu_dma_prepare_msi() should no longer have to worry
about preempting itself, nor being called in atomic context at all. Thus
we can downgrade the IRQ-safe locking to a simple mutex to avoid angering
the new might_sleep() check in iommu_map().
Hanjun Guo [Wed, 11 Dec 2019 06:43:06 +0000 (14:43 +0800)]
perf/smmuv3: Remove the leftover put_cpu() in error path
In smmu_pmu_probe(), there is put_cpu() in the error path,
which is wrong because we use raw_smp_processor_id() to
get the cpu ID, not get_cpu(), remove it.
While we are at it, kill 'out_cpuhp_err' altogether and
just return err if we fail to add the hotplug instance.
Lu Baolu [Wed, 20 Nov 2019 06:10:16 +0000 (14:10 +0800)]
iommu/vt-d: Remove incorrect PSI capability check
The PSI (Page Selective Invalidation) bit in the capability register
is only valid for second-level translation. Intel IOMMU supporting
scalable mode must support page/address selective IOTLB invalidation
for first-level translation. Remove the PSI capability check in SVA
cache invalidation code.
Jérôme Pouiller [Tue, 17 Dec 2019 16:14:40 +0000 (16:14 +0000)]
staging: wfx: fix wrong error message
The driver checks that the number of retries made by the device is
coherent with the rate policy. However, this check make sense only if
the device has returned RETRY_EXCEEDED.
Jérôme Pouiller [Tue, 17 Dec 2019 16:14:37 +0000 (16:14 +0000)]
staging: wfx: detect race condition in WEP authentication
Current code has a special case to handle association with WEP. Before
to rework the tx data handling, let's try to detect any possible misuse
of this code.
Jérôme Pouiller [Tue, 17 Dec 2019 16:14:36 +0000 (16:14 +0000)]
staging: wfx: ensure that retry policy always fallbacks to MCS0 / 1Mbps
When not using HT mode, minstrel always includes 1Mbps as fallback rate.
But, when using HT mode, this fallback is not included. Yet, it seems
that it could save some frames. So, this patch add it unconditionally.
Jérôme Pouiller [Tue, 17 Dec 2019 16:14:34 +0000 (16:14 +0000)]
staging: wfx: fix rate control handling
A tx_retry_policy (the equivalent of a list of ieee80211_tx_rate in
hardware API) is not able to include a rate multiple time. So currently,
the driver merges the identical rates from the policy provided by
minstrel (and it try to do the best choice it can in the associated
flags) before to sent it to firmware.
Until now, when rates are merged, field "count" is set to
max(count1, count2). But, it means that the sum of retries for all rates
could be far less than initial number of retries. So, this patch changes
the value of field "count" to count1 + count2. Thus, sum of all retries
for all rates stay the same.
Jérôme Pouiller [Tue, 17 Dec 2019 16:14:30 +0000 (16:14 +0000)]
staging: wfx: fix counter overflow
Some weird behaviors were observed when connection is really good and
packets are small. It appears that sometime, number of packets in queues
can exceed 255 and generate an overflow in field usage_count.
Jérôme Pouiller [Tue, 17 Dec 2019 16:14:29 +0000 (16:14 +0000)]
staging: wfx: fix case of lack of tx_retry_policies
In some rare cases, driver may not have any available tx_retry_policies.
In this case, the driver asks to mac80211 to stop sending data. However,
it seems that a race is possible and a few frames can be sent to the
driver. In this case, driver can't wait for free tx_retry_policies since
wfx_tx() must be atomic. So, this patch fix this case by sending these
frames with the special policy number 15.
The firmware normally use policy 15 to send internal frames (PS-poll,
beacons, etc...). So, it is not a so bad fallback.
Jérôme Pouiller [Tue, 17 Dec 2019 16:14:27 +0000 (16:14 +0000)]
staging: wfx: fix the cache of rate policies on interface reset
Device and driver maintain a cache of rate policies (aka.
tx_retry_policy in hardware API).
When hif_reset() is sent to hardware, device resets its cache of rate
policies. In order to keep driver in sync, it is necessary to do the
same on driver.
Note, when driver tries to use a rate policy that has not been defined
on device, data is sent at 1Mbps. So, this patch should fix abnormal
throughput observed sometime after a reset of the interface.
Adrian Hunter [Tue, 17 Dec 2019 09:53:49 +0000 (11:53 +0200)]
mmc: sdhci: Add a quirk for broken command queuing
Command queuing has been reported broken on some systems based on Intel
GLK. A separate patch disables command queuing in some cases.
This patch adds a quirk for broken command queuing, which enables users
with problems to disable command queuing using sdhci module parameters for
quirks.
Adrian Hunter [Tue, 17 Dec 2019 09:53:48 +0000 (11:53 +0200)]
mmc: sdhci: Workaround broken command queuing on Intel GLK
Command queuing has been reported broken on some Lenovo systems based on
Intel GLK. This is likely a BIOS issue, so disable command queuing for
Intel GLK if the BIOS vendor string is "LENOVO".
Yangbo Lu [Mon, 16 Dec 2019 03:18:42 +0000 (11:18 +0800)]
mmc: sdhci-of-esdhc: fix P2020 errata handling
Two previous patches introduced below quirks for P2020 platforms.
- SDHCI_QUIRK_RESET_AFTER_REQUEST
- SDHCI_QUIRK_BROKEN_TIMEOUT_VAL
The patches made a mistake to add them in quirks2 of sdhci_host
structure, while they were defined for quirks.
host->quirks2 |= SDHCI_QUIRK_RESET_AFTER_REQUEST;
host->quirks2 |= SDHCI_QUIRK_BROKEN_TIMEOUT_VAL;
This patch is to fix them.
host->quirks |= SDHCI_QUIRK_RESET_AFTER_REQUEST;
host->quirks |= SDHCI_QUIRK_BROKEN_TIMEOUT_VAL;
Chris Wilson [Tue, 17 Dec 2019 13:47:29 +0000 (13:47 +0000)]
drm/i915/gem: Keep request alive while attaching fences
Since commit e5dadff4b093 ("drm/i915: Protect request retirement with
timeline->mutex"), the request retirement can happen outside of the
struct_mutex serialised only by the timeline->mutex. We drop the
timeline->mutex on submitting the request (i915_request_add) so after
that point, it is liable to be freed. Make sure our local reference is
kept alive until we have finished attaching it to the signalers. (Note
that this erodes the argument that i915_request_add should consume the
reference, but that is a slightly larger patch!)
John Hurley [Tue, 17 Dec 2019 11:28:56 +0000 (11:28 +0000)]
nfp: flower: fix stats id allocation
As flower rules are added, they are given a stats ID based on the number
of rules that can be supported in firmware. Only after the initial
allocation of all available IDs does the driver begin to reuse those that
have been released.
The initial allocation of IDs was modified to account for multiple memory
units on the offloaded device. However, this introduced a bug whereby the
counter that controls the IDs could be decremented before the ID was
assigned (where it is further decremented). This means that the stats ID
could be assigned as -1/0xfffffff which is out of range.
Fix this by only decrementing the main counter after the current ID has
been assigned.
Fixes: 467322e2627f ("nfp: flower: support multiple memory units for filter offloads") Signed-off-by: John Hurley <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Oleksij Rempel [Tue, 17 Dec 2019 06:51:45 +0000 (07:51 +0100)]
net: ag71xx: fix compile warnings
drivers/net/ethernet/atheros/ag71xx.c: In function 'ag71xx_probe':
drivers/net/ethernet/atheros/ag71xx.c:1776:30: warning: passing argument 2 of
'of_get_phy_mode' makes pointer from integer without a cast [-Wint-conversion]
In file included from drivers/net/ethernet/atheros/ag71xx.c:33:
./include/linux/of_net.h:15:69: note: expected 'phy_interface_t *'
{aka 'enum <anonymous> *'} but argument is of type 'int'
Fixes: 0c65b2b90d13c1 ("net: of_get_phy_mode: Change API to solve int/unit warnings") Signed-off-by: Oleksij Rempel <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Eric Dumazet [Tue, 17 Dec 2019 02:51:03 +0000 (18:51 -0800)]
net: annotate lockless accesses to sk->sk_pacing_shift
sk->sk_pacing_shift can be read and written without lock
synchronization. This patch adds annotations to
document this fact and avoid future syzbot complains.
This might also avoid unexpected false sharing
in sk_pacing_shift_update(), as the compiler
could remove the conditional check and always
write over sk->sk_pacing_shift :
if (sk->sk_pacing_shift != val)
sk->sk_pacing_shift = val;
Ben Hutchings [Tue, 17 Dec 2019 01:57:40 +0000 (01:57 +0000)]
net: qlogic: Fix error paths in ql_alloc_large_buffers()
ql_alloc_large_buffers() has the usual RX buffer allocation
loop where it allocates skbs and maps them for DMA. It also
treats failure as a fatal error.
There are (at least) three bugs in the error paths:
1. ql_free_large_buffers() assumes that the lrg_buf[] entry for the
first buffer that couldn't be allocated will have .skb == NULL.
But the qla_buf[] array is not zero-initialised.
2. ql_free_large_buffers() DMA-unmaps all skbs in lrg_buf[]. This is
incorrect for the last allocated skb, if DMA mapping failed.
3. Commit 1acb8f2a7a9f ("net: qlogic: Fix memory leak in
ql_alloc_large_buffers") added a direct call to dev_kfree_skb_any()
after the skb is recorded in lrg_buf[], so ql_free_large_buffers()
will double-free it.
The bugs are somewhat inter-twined, so fix them all at once:
* Clear each entry in qla_buf[] before attempting to allocate
an skb for it. This goes half-way to fixing bug 1.
* Set the .skb field only after the skb is DMA-mapped. This
fixes the rest.
Fixes: 1357bfcf7106 ("qla3xxx: Dynamically size the rx buffer queue ...") Fixes: 0f8ab89e825f ("qla3xxx: Check return code from pci_map_single() ...") Fixes: 1acb8f2a7a9f ("net: qlogic: Fix memory leak in ql_alloc_large_buffers") Signed-off-by: Ben Hutchings <[email protected]> Signed-off-by: David S. Miller <[email protected]>
sctp: fix memleak on err handling of stream initialization
syzbot reported a memory leak when an allocation fails within
genradix_prealloc() for output streams. That's because
genradix_prealloc() leaves initialized members initialized when the
issue happens and SCTP stack will abort the current initialization but
without cleaning up such members.
The fix here is to always call genradix_free() when genradix_prealloc()
fails, for output and also input streams, as it suffers from the same
issue.
Eric Sandeen [Fri, 6 Dec 2019 16:55:59 +0000 (10:55 -0600)]
fs: call fsnotify_sb_delete after evict_inodes
When a filesystem is unmounted, we currently call fsnotify_sb_delete()
before evict_inodes(), which means that fsnotify_unmount_inodes()
must iterate over all inodes on the superblock looking for any inodes
with watches. This is inefficient and can lead to livelocks as it
iterates over many unwatched inodes.
At this point, SB_ACTIVE is gone and dropping refcount to zero kicks
the inode out out immediately, so anything processed by
fsnotify_sb_delete / fsnotify_unmount_inodes gets evicted in that loop.
After that, the call to evict_inodes will evict everything else with a
zero refcount.
This should speed things up overall, and avoid livelocks in
fsnotify_unmount_inodes().
Eric Sandeen [Fri, 6 Dec 2019 16:54:23 +0000 (10:54 -0600)]
fs: avoid softlockups in s_inodes iterators
Anything that walks all inodes on sb->s_inodes list without rescheduling
risks softlockups.
Previous efforts were made in 2 functions, see:
c27d82f fs/drop_caches.c: avoid softlockups in drop_pagecache_sb() ac05fbb inode: don't softlockup when evicting inodes
but there hasn't been an audit of all walkers, so do that now. This
also consistently moves the cond_resched() calls to the bottom of each
loop in cases where it already exists.
One loop remains: remove_dquot_ref(), because I'm not quite sure how
to deal with that one w/o taking the i_lock.
Changbin Du [Wed, 18 Dec 2019 04:51:56 +0000 (20:51 -0800)]
lib/Kconfig.debug: fix some messed up configurations
Some configuration items are messed up during conflict resolving. For
example, STRICT_DEVMEM should not in testing menu, but kunit should.
This patch fixes all of them.
Yang Shi [Wed, 18 Dec 2019 04:51:52 +0000 (20:51 -0800)]
mm: vmscan: protect shrinker idr replace with CONFIG_MEMCG
Since commit 0a432dcbeb32 ("mm: shrinker: make shrinker not depend on
memcg kmem"), shrinkers' idr is protected by CONFIG_MEMCG instead of
CONFIG_MEMCG_KMEM, so it makes no sense to protect shrinker idr replace
with CONFIG_MEMCG_KMEM.
And in the CONFIG_MEMCG && CONFIG_SLOB case, shrinker_idr contains only
shrinker, and it is deferred_split_shrinker. But it is never actually
called, since idr_replace() is never compiled due to the wrong #ifdef.
The deferred_split_shrinker all the time is staying in half-registered
state, and it's never called for subordinate mem cgroups.
Daniel Axtens [Wed, 18 Dec 2019 04:51:49 +0000 (20:51 -0800)]
kasan: don't assume percpu shadow allocations will succeed
syzkaller and the fault injector showed that I was wrong to assume that
we could ignore percpu shadow allocation failures.
Handle failures properly. Merge all the allocated areas back into the
free list and release the shadow, then clean up and return NULL. The
shadow is released unconditionally, which relies upon the fact that the
release function is able to tolerate pages not being present.
Also clean up shadows in the recovery path - currently they are not
released, which leaks a bit of memory.
Daniel Axtens [Wed, 18 Dec 2019 04:51:46 +0000 (20:51 -0800)]
kasan: use apply_to_existing_page_range() for releasing vmalloc shadow
kasan_release_vmalloc uses apply_to_page_range to release vmalloc
shadow. Unfortunately, apply_to_page_range can allocate memory to fill
in page table entries, which is not what we want.
Also, kasan_release_vmalloc is called under free_vmap_area_lock, so if
apply_to_page_range does allocate memory, we get a sleep in atomic bug:
BUG: sleeping function called from invalid context at mm/page_alloc.c:4681
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 15087, name:
apply_to_page_range() takes an address range, and if any parts of it are
not covered by the existing page table hierarchy, it allocates memory to
fill them in.
In some use cases, this is not what we want - we want to be able to
operate exclusively on PTEs that are already in the tables.
Add apply_to_existing_page_range() for this. Adjust the walker
functions for apply_to_page_range to take 'create', which switches them
between the old and new modes.
Andrey Ryabinin [Wed, 18 Dec 2019 04:51:38 +0000 (20:51 -0800)]
kasan: fix crashes on access to memory mapped by vm_map_ram()
With CONFIG_KASAN_VMALLOC=y any use of memory obtained via vm_map_ram()
will crash because there is no shadow backing that memory.
Instead of sprinkling additional kasan_populate_vmalloc() calls all over
the vmalloc code, move it into alloc_vmap_area(). This will fix
vm_map_ram() and simplify the code a bit.
Paul Mackerras [Wed, 18 Dec 2019 00:43:06 +0000 (11:43 +1100)]
KVM: PPC: Book3S HV: Don't do ultravisor calls on systems without ultravisor
Commit 22945688acd4 ("KVM: PPC: Book3S HV: Support reset of secure
guest") added a call to uv_svm_terminate, which is an ultravisor
call, without any check that the guest is a secure guest or even that
the system has an ultravisor. On a system without an ultravisor,
the ultracall will degenerate to a hypercall, but since we are not
in KVM guest context, the hypercall will get treated as a system
call, which could have random effects depending on what happens to
be in r0, and could also corrupt the current task's kernel stack.
Hence this adds a test for the guest being a secure guest before
doing uv_svm_terminate().
Fixes: 22945688acd4 ("KVM: PPC: Book3S HV: Support reset of secure guest") Signed-off-by: Paul Mackerras <[email protected]>
Jens Axboe [Wed, 18 Dec 2019 02:45:06 +0000 (19:45 -0700)]
io_uring: warn about unhandled opcode
Now that we have all the opcodes handled in terms of command prep and
SQE reuse, add a printk_once() to warn about any potentially new and
unhandled ones.
Jens Axboe [Wed, 18 Dec 2019 02:53:05 +0000 (19:53 -0700)]
io_uring: read opcode and user_data from SQE exactly once
If we defer a request, we can't be reading the opcode again. Ensure that
the user_data and opcode fields are stable. For the user_data we already
have a place for it, for the opcode we can fill a one byte hold and store
that as well. For both of them, assign them when we originally read the
SQE in io_get_sqring(). Any code that uses sqe->opcode or sqe->user_data
is switched to req->opcode and req->user_data.
Jens Axboe [Wed, 18 Dec 2019 01:50:29 +0000 (18:50 -0700)]
io_uring: make IORING_OP_TIMEOUT_REMOVE deferrable
If we defer this command as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the timeout remove op into
the prep handling to make it safe for SQE reuse.
Jens Axboe [Wed, 18 Dec 2019 01:45:56 +0000 (18:45 -0700)]
io_uring: make IORING_OP_CANCEL_ASYNC deferrable
If we defer this command as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the async cancel op into
the prep handling to make it safe for SQE reuse.
Jens Axboe [Wed, 18 Dec 2019 01:40:57 +0000 (18:40 -0700)]
io_uring: make IORING_POLL_ADD and IORING_POLL_REMOVE deferrable
If we defer these commands as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the poll add/remove into
the prep handling to make it safe for SQE reuse.
Pavel Begunkov [Tue, 17 Dec 2019 17:57:05 +0000 (20:57 +0300)]
io_uring: make HARDLINK imply LINK
The rules are as follows, if IOSQE_IO_HARDLINK is specified, then it's a
link and there is no need to set IOSQE_IO_LINK separately, though it
could be there. Add proper check and ensure that IOSQE_IO_HARDLINK
implies IOSQE_IO_LINK.