Git Repo - linux.git/log

Merge branch 'bug fixes for sockmap'

Xu Kuohai says:

====================
bug fixes and a new test case for sockmap.

v3:
fix bpf ci failure

v2: https://lore.kernel.org/bpf/20230803064838 [email protected]
add a test case

v1:
https://lore.kernel.org/bpf/20230728105649.3978774 [email protected]
https://lore.kernel.org/bpf/20230728105717.3978849 [email protected]
====================

Signed-off-by: Martin KaFai Lau <[email protected]>

selftests/bpf: Add sockmap test for redirecting partial skb data

Add a test case to check whether sockmap redirection works correctly
when data length returned by stream_parser is less than skb->len.

In addition, this test checks whether strp_done is called correctly.
The reason is that we returns skb->len - 1 from the stream_parser, so
the last byte in the skb will be held by strp->skb_head. Therefore,
if strp_done is not called to free strp->skb_head, we'll get a memleak
warning.

Signed-off-by: Xu Kuohai <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>

selftests/bpf: fix a CI failure caused by vsock sockmap test

BPF CI has reported the following failure:

Error: #200/79 sockmap_listen/sockmap VSOCK test_vsock_redir
  Error: #200/79 sockmap_listen/sockmap VSOCK test_vsock_redir
  ./test_progs:vsock_unix_redir_connectible:1506: egress: write: Transport endpoint is not connected
  vsock_unix_redir_connectible:FAIL:1506
  ./test_progs:vsock_unix_redir_connectible:1506: ingress: write: Transport endpoint is not connected
  vsock_unix_redir_connectible:FAIL:1506
  ./test_progs:vsock_unix_redir_connectible:1506: egress: write: Transport endpoint is not connected
  vsock_unix_redir_connectible:FAIL:1506
  ./test_progs:vsock_unix_redir_connectible:1514: ingress: recv() err, errno=11
  vsock_unix_redir_connectible:FAIL:1514
  ./test_progs:vsock_unix_redir_connectible:1518: ingress: vsock socket map failed, a != b
  vsock_unix_redir_connectible:FAIL:1518
  ./test_progs:vsock_unix_redir_connectible:1525: ingress: want pass count 1, have 0

It’s because the recv(... MSG_DONTWAIT) syscall in the test case is
called before the queued work sk_psock_backlog() in the kernel finishes
executing. So the data to be read is still queued in psock->ingress_skb
and cannot be read by the user program. Therefore, the non-blocking
recv() reads nothing and reports an EAGAIN error.

So replace recv(... MSG_DONTWAIT) with xrecv_nonblock(), which calls
select() to wait for data to be readable or timeout before calls recv().

Fixes: d61bd8c1fd02 ("selftests/bpf: add a test case for vsock sockmap")
Signed-off-by: Xu Kuohai <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>

bpf, sockmap: Fix bug that strp_done cannot be called

strp_done is only called when psock->progs.stream_parser is not NULL,
but stream_parser was set to NULL by sk_psock_stop_strp(), called
by sk_psock_drop() earlier. So, strp_done can never be called.

Introduce SK_PSOCK_RX_ENABLED to mark whether there is strp on psock.
Change the condition for calling strp_done from judging whether
stream_parser is set to judging whether this flag is set. This flag is
only set once when strp_init() succeeds, and will never be cleared later.

Fixes: c0d95d3380ee ("bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap")
Signed-off-by: Xu Kuohai <[email protected]>
Reviewed-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>

bpf, sockmap: Fix map type error in sock_map_del_link

sock_map_del_link() operates on both SOCKMAP and SOCKHASH, although
both types have member named "progs", the offset of "progs" member in
these two types is different, so "progs" should be accessed with the
real map type.

Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Xu Kuohai <[email protected]>
Reviewed-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>

xsk: fix refcount underflow in error path

Fix a refcount underflow problem reported by syzbot that can happen
when a system is running out of memory. If xp_alloc_tx_descs() fails,
and it can only fail due to not having enough memory, then the error
path is triggered. In this error path, the refcount of the pool is
decremented as it has incremented before. However, the reference to
the pool in the socket was not nulled. This means that when the socket
is closed later, the socket teardown logic will think that there is a
pool attached to the socket and try to decrease the refcount again,
leading to a refcount underflow.

I chose this fix as it involved adding just a single line. Another
option would have been to move xp_get_pool() and the assignment of
xs->pool to after the if-statement and using xs_umem->pool instead of
xs->pool in the whole if-statement resulting in somewhat simpler code,
but this would have led to much more churn in the code base perhaps
making it harder to backport.

Fixes: ba3beec2ec1d ("xsk: Fix possible crash when multiple sockets are created")
Reported-by: [email protected]
Signed-off-by: Magnus Karlsson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>

ASoC: meson: axg-tdm-formatter: fix channel slot allocation

When the tdm lane mask is computed, the driver currently fills the 1st lane
before moving on to the next. If the stream has less channels than the
lanes can accommodate, slots will be disabled on the last lanes.

Unfortunately, the HW distribute channels in a different way. It distribute
channels in pair on each lanes before moving on the next slots.

This difference leads to problems if a device has an interface with more
than 1 lane and with more than 2 slots per lane.

For example: a playback interface with 2 lanes and 4 slots each (total 8
slots - zero based numbering)
- Playing a 8ch stream:
  - All slots activated by the driver
  - channel #2 will be played on lane #1 - slot #0 following HW placement
- Playing a 4ch stream:
  - Lane #1 disabled by the driver
  - channel #2 will be played on lane #0 - slot #2

This behaviour is obviously not desirable.

Change the way slots are activated on the TDM lanes to follow what the HW
does and make sure each channel always get mapped to the same slot/lane.

Fixes: 1a11d88f499c ("ASoC: meson: add tdm formatter base driver")
Signed-off-by: Jerome Brunet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

ipv6: adjust ndisc_is_useropt() to also return true for PIO

The upcoming (and nearly finalized):
https://datatracker.ietf.org/doc/draft-collink-6man-pio-pflag/
will update the IPv6 RA to include a new flag in the PIO field,
which will serve as a hint to perform DHCPv6-PD.

As we don't want DHCPv6 related logic inside the kernel, this piece of
information needs to be exposed to userspace. The simplest option is to
simply expose the entire PIO through the already existing mechanism.

Even without this new flag, the already existing PIO R (router address)
flag (from RFC6275) cannot AFAICT be handled entirely in kernel,
and provides useful information that should be exposed to userspace
(the router's global address, for use by Mobile IPv6).

Also cc'ing stable@ for inclusion in LTS, as while technically this is
not quite a bugfix, and instead more of a feature, it is absolutely
trivial and the alternative is manually cherrypicking into all Android
Common Kernel trees - and I know Greg will ask for it to be sent in via
LTS instead...

Cc: Jen Linkova <[email protected]>
Cc: Lorenzo Colitti <[email protected]>
Cc: David Ahern <[email protected]>
Cc: YOSHIFUJI Hideaki / 吉藤英明 <[email protected]>
Cc: [email protected]
Signed-off-by: Maciej Żenczykowski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge tag 'wireless-2023-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
Just a few small updates:
* fix an integer overflow in nl80211
* fix rtw89 8852AE disconnections
* fix a buffer overflow in ath12k
* fix AP_VLAN configuration lookups
* fix allocation failure handling in brcm80211
* update MAINTAINERS for some drivers

* tag 'wireless-2023-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: ath12k: Fix buffer overflow when scanning with extraie
  wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems()
  wifi: cfg80211: fix sband iftype data lookup for AP_VLAN
  wifi: rtw89: fix 8852AE disconnection caused by RX full flags
  MAINTAINERS: Remove tree entry for rtl8180
  MAINTAINERS: Update entry for rtl8187
  wifi: brcm80211: handle params_v1 allocation failure
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

blk-iocost: fix queue stats accounting

The q->stats->accounting is not only used by iocost, but iocost only
increase this counter, never decrease it. So queue stats accounting
will always enabled after using iocost once.

Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

block: don't make REQ_POLLED imply REQ_NOWAIT

Normally these two flags do go together, as the issuer of polled IO
generally cannot wait for resources that will get freed as part of IO
completion. This is because that very task is the one that will complete
the request and free those resources, hence that would introduce a
deadlock.

But it is possible to have someone else issue the polled IO, eg via
io_uring if the request is punted to io-wq. For that case, it's fine to
have the task block on IO submission, as it is not the same task that
will be completing the IO.

It's completely up to the caller to ask for both polled and nowait IO
separately! If we don't allow polled IO where IOCB_NOWAIT isn't set in
the kiocb, then we can run into repeated -EAGAIN submissions and not
make any progress.

Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

Merge branch 'selftests-forwarding-various-fixes'

Ido Schimmel says:

====================
selftests: forwarding: Various fixes

Fix various problems with forwarding selftests. See individual patches
for problem description and solution.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: forwarding: bridge_mdb: Make test more robust

Some test cases check that the group timer is (or isn't) 0. Instead of
grepping for "0.00" grep for " 0.00" as the former can also match
"260.00" which is the default group membership interval.

Fixes: b6d00da08610 ("selftests: forwarding: Add bridge MDB test")
Reported-by: Mirsad Todorovac <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Ido Schimmel <[email protected]>
Tested-by: Mirsad Todorovac <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: forwarding: bridge_mdb_max: Fix failing test with old libnet

As explained in commit 8bcfb4ae4d97 ("selftests: forwarding: Fix failing
tests with old libnet"), old versions of libnet (used by mausezahn) do
not use the "SO_BINDTODEVICE" socket option. For IP unicast packets,
this can be solved by prefixing mausezahn invocations with "ip vrf
exec". However, IP multicast packets do not perform routing and simply
egress the bound device, which does not exist in this case.

Fix by specifying the source and destination MAC of the packet which
will cause mausezahn to use a packet socket instead of an IP socket.

Fixes: 3446dcd7df05 ("selftests: forwarding: bridge_mdb_max: Add a new selftest")
Reported-by: Mirsad Todorovac <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Ido Schimmel <[email protected]>
Tested-by: Mirsad Todorovac <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: forwarding: bridge_mdb: Fix failing test with old libnet

As explained in commit 8bcfb4ae4d97 ("selftests: forwarding: Fix failing
tests with old libnet"), old versions of libnet (used by mausezahn) do
not use the "SO_BINDTODEVICE" socket option. For IP unicast packets,
this can be solved by prefixing mausezahn invocations with "ip vrf
exec". However, IP multicast packets do not perform routing and simply
egress the bound device, which does not exist in this case.

Fix by specifying the source and destination MAC of the packet which
will cause mausezahn to use a packet socket instead of an IP socket.

Fixes: b6d00da08610 ("selftests: forwarding: Add bridge MDB test")
Reported-by: Mirsad Todorovac <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Ido Schimmel <[email protected]>
Tested-by: Mirsad Todorovac <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: forwarding: tc_flower_l2_miss: Fix failing test with old libnet

As explained in commit 8bcfb4ae4d97 ("selftests: forwarding: Fix failing
tests with old libnet"), old versions of libnet (used by mausezahn) do
not use the "SO_BINDTODEVICE" socket option. For IP unicast packets,
this can be solved by prefixing mausezahn invocations with "ip vrf
exec". However, IP multicast packets do not perform routing and simply
egress the bound device, which does not exist in this case.

Fix by specifying the source and destination MAC of the packet which
will cause mausezahn to use a packet socket instead of an IP socket.

Fixes: 8c33266ae26a ("selftests: forwarding: Add layer 2 miss test cases")
Reported-by: Mirsad Todorovac <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Ido Schimmel <[email protected]>
Tested-by: Mirsad Todorovac <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: forwarding: tc_tunnel_key: Make filters more specific

The test installs filters that match on various IP fragments (e.g., no
fragment, first fragment) and expects a certain amount of packets to hit
each filter. This is problematic as the filters are not specific enough
and can match IP packets (e.g., IGMP) generated by the stack, resulting
in failures [1].

Fix by making the filters more specific and match on more fields in the
IP header: Source IP, destination IP and protocol.

[1]
# timeout set to 0
# selftests: net/forwarding: tc_tunnel_key.sh
# TEST: tunnel_key nofrag (skip_hw) [FAIL]
# packet smaller than MTU was not tunneled
# INFO: Could not test offloaded functionality
not ok 89 selftests: net/forwarding: tc_tunnel_key.sh # exit=1

Fixes: 533a89b1940f ("selftests: forwarding: add tunnel_key "nofrag" test case")
Reported-by: Mirsad Todorovac <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Petr Machata <[email protected]>
Tested-by: Mirsad Todorovac <[email protected]>
Acked-by: Davide Caratti <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: forwarding: tc_flower: Relax success criterion

The test checks that filters that match on source or destination MAC
were only hit once. A host can send more than one packet with a given
source or destination MAC, resulting in failures.

Fix by relaxing the success criterion and instead check that the filters
were not hit zero times. Using tc_check_at_least_x_packets() is also an
option, but it is not available in older kernels.

Fixes: 07e5c75184a1 ("selftests: forwarding: Introduce tc flower matching tests")
Reported-by: Mirsad Todorovac <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Petr Machata <[email protected]>
Tested-by: Mirsad Todorovac <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: forwarding: tc_actions: Use ncat instead of nc

The test relies on 'nc' being the netcat version from the nmap project.
While this seems to be the case on Fedora, it is not the case on Ubuntu,
resulting in failures such as [1].

Fix by explicitly using the 'ncat' utility from the nmap project and the
skip the test in case it is not installed.

[1]
# timeout set to 0
# selftests: net/forwarding: tc_actions.sh
# TEST: gact drop and ok (skip_hw)                                    [ OK ]
# TEST: mirred egress flower redirect (skip_hw)                       [ OK ]
# TEST: mirred egress flower mirror (skip_hw)                         [ OK ]
# TEST: mirred egress matchall mirror (skip_hw)                       [ OK ]
# TEST: mirred_egress_to_ingress (skip_hw)                            [ OK ]
# nc: invalid option -- '-'
# usage: nc [-46CDdFhklNnrStUuvZz] [-I length] [-i interval] [-M ttl]
#         [-m minttl] [-O length] [-P proxy_username] [-p source_port]
#         [-q seconds] [-s sourceaddr] [-T keyword] [-V rtable] [-W recvlimit]
#         [-w timeout] [-X proxy_protocol] [-x proxy_address[:port]]
#         [destination] [port]
# nc: invalid option -- '-'
# usage: nc [-46CDdFhklNnrStUuvZz] [-I length] [-i interval] [-M ttl]
#         [-m minttl] [-O length] [-P proxy_username] [-p source_port]
#         [-q seconds] [-s sourceaddr] [-T keyword] [-V rtable] [-W recvlimit]
#         [-w timeout] [-X proxy_protocol] [-x proxy_address[:port]]
#         [destination] [port]
# TEST: mirred_egress_to_ingress_tcp (skip_hw)                        [FAIL]
#       server output check failed
# INFO: Could not test offloaded functionality
not ok 80 selftests: net/forwarding: tc_actions.sh # exit=1

Fixes: ca22da2fbd69 ("act_mirred: use the backlog for nested calls to mirred ingress")
Reported-by: Mirsad Todorovac <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Petr Machata <[email protected]>
Tested-by: Mirsad Todorovac <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: forwarding: ethtool_mm: Skip when MAC Merge is not supported

MAC Merge cannot be tested with veth pairs, resulting in failures:

# ./ethtool_mm.sh
[...]
TEST: Manual configuration with verification: swp1 to swp2 [FAIL]
Verification did not succeed

Fix by skipping the test when the interfaces do not support MAC Merge.

Fixes: e6991384ace5 ("selftests: forwarding: add a test for MAC Merge layer")
Reported-by: Mirsad Todorovac <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: forwarding: hw_stats_l3_gre: Skip when using veth pairs

Layer 3 hardware stats cannot be used when the underlying interfaces are
veth pairs, resulting in failures:

# ./hw_stats_l3_gre.sh
TEST: ping gre flat                                                 [ OK ]
TEST: Test rx packets:                                              [FAIL]
         Traffic not reflected in the counter: 0 -> 0
TEST: Test tx packets:                                              [FAIL]
         Traffic not reflected in the counter: 0 -> 0

Fix by skipping the test when used with veth pairs.

Fixes: 813f97a26860 ("selftests: forwarding: Add a tunnel-based test for L3 HW stats")
Reported-by: Mirsad Todorovac <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Petr Machata <[email protected]>
Tested-by: Mirsad Todorovac <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: forwarding: ethtool_extended_state: Skip when using veth pairs

Ethtool extended state cannot be tested with veth pairs, resulting in
failures:

# ./ethtool_extended_state.sh
TEST: Autoneg, No partner detected [FAIL]
Expected "Autoneg", got "Link detected: no"
[...]

Fix by skipping the test when used with veth pairs.

Fixes: 7d10bcce98cd ("selftests: forwarding: Add tests for ethtool extended state")
Reported-by: Mirsad Todorovac <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Petr Machata <[email protected]>
Tested-by: Mirsad Todorovac <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: forwarding: ethtool: Skip when using veth pairs

Auto-negotiation cannot be tested with veth pairs, resulting in
failures:

# ./ethtool.sh
TEST: force of same speed autoneg off [FAIL]
error in configuration. swp1 speed Not autoneg off
[...]

Fix by skipping the test when used with veth pairs.

Fixes: 64916b57c0b1 ("selftests: forwarding: Add speed and auto-negotiation test")
Reported-by: Mirsad Todorovac <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Petr Machata <[email protected]>
Tested-by: Mirsad Todorovac <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: forwarding: Add a helper to skip test when using veth pairs

A handful of tests require physical loopbacks to be used instead of veth
pairs. Add a helper that these tests will invoke in order to be skipped
when executed with veth pairs.

Fixes: 64916b57c0b1 ("selftests: forwarding: Add speed and auto-negotiation test")
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Petr Machata <[email protected]>
Tested-by: Mirsad Todorovac <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: forwarding: Set default IPv6 traceroute utility

The test uses the 'TROUTE6' environment variable to encode the name of
the IPv6 traceroute utility. By default (without a configuration file),
this variable is not set, resulting in failures:

# ./ip6_forward_instats_vrf.sh
TEST: ping6                                                         [ OK ]
TEST: Ip6InTooBigErrors                                             [ OK ]
TEST: Ip6InHdrErrors                                                [FAIL]
TEST: Ip6InAddrErrors                                               [ OK ]
TEST: Ip6InDiscards                                                 [ OK ]

Fix by setting a default utility name and skip the test if the utility
is not present.

Fixes: 0857d6f8c759 ("ipv6: When forwarding count rx stats on the orig netdev")
Reported-by: Mirsad Todorovac <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Petr Machata <[email protected]>
Tested-by: Mirsad Todorovac <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: forwarding: bridge_mdb_max: Check iproute2 version

The selftest relies on iproute2 changes present in version 6.3, but the
test does not check for it, resulting in errors:

# ./bridge_mdb_max.sh
  INFO: 802.1d tests
  TEST: cfg4: port: ngroups reporting                                 [FAIL]
          Number of groups was null, now is null, but 5 expected
  TEST: ctl4: port: ngroups reporting                                 [FAIL]
          Number of groups was null, now is null, but 5 expected
  TEST: cfg6: port: ngroups reporting                                 [FAIL]
          Number of groups was null, now is null, but 5 expected
  [...]

Fix by skipping the test if iproute2 is too old.

Fixes: 3446dcd7df05 ("selftests: forwarding: bridge_mdb_max: Add a new selftest")
Reported-by: Mirsad Todorovac <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Petr Machata <[email protected]>
Tested-by: Mirsad Todorovac <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: forwarding: bridge_mdb: Check iproute2 version

The selftest relies on iproute2 changes present in version 6.3, but the
test does not check for it, resulting in error:

# ./bridge_mdb.sh

INFO: # Host entries configuration tests
TEST: Common host entries configuration tests (IPv4)                [FAIL]
         Managed to add IPv4 host entry with a filter mode
TEST: Common host entries configuration tests (IPv6)                [FAIL]
         Managed to add IPv6 host entry with a filter mode
TEST: Common host entries configuration tests (L2)                  [FAIL]
         Managed to add L2 host entry with a filter mode

INFO: # Port group entries configuration tests - (*, G)
Command "replace" is unknown, try "bridge mdb help".
[...]

Fix by skipping the test if iproute2 is too old.

Fixes: b6d00da08610 ("selftests: forwarding: Add bridge MDB test")
Reported-by: Mirsad Todorovac <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Petr Machata <[email protected]>
Tested-by: Mirsad Todorovac <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: forwarding: Switch off timeout

The default timeout for selftests is 45 seconds, but it is not enough
for forwarding selftests which can takes minutes to finish depending on
the number of tests cases:

# make -C tools/testing/selftests TARGETS=net/forwarding run_tests
TAP version 13
1..102
# timeout set to 45
# selftests: net/forwarding: bridge_igmp.sh
# TEST: IGMPv2 report 239.10.10.10                                    [ OK ]
# TEST: IGMPv2 leave 239.10.10.10                                     [ OK ]
# TEST: IGMPv3 report 239.10.10.10 is_include                         [ OK ]
# TEST: IGMPv3 report 239.10.10.10 include -> allow                   [ OK ]
#
not ok 1 selftests: net/forwarding: bridge_igmp.sh # TIMEOUT 45 seconds

Fix by switching off the timeout and setting it to 0. A similar change
was done for BPF selftests in commit 6fc5916cc256 ("selftests: bpf:
Switch off timeout").

Fixes: 81573b18f26d ("selftests/net/forwarding: add Makefile to install tests")
Reported-by: Mirsad Todorovac <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Petr Machata <[email protected]>
Tested-by: Mirsad Todorovac <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests: forwarding: Skip test when no interfaces are specified

As explained in [1], the forwarding selftests are meant to be run with
either physical loopbacks or veth pairs. The interfaces are expected to
be specified in a user-provided forwarding.config file or as command
line arguments. By default, this file is not present and the tests fail:

# make -C tools/testing/selftests TARGETS=net/forwarding run_tests
[...]
TAP version 13
1..102
# timeout set to 45
# selftests: net/forwarding: bridge_igmp.sh
# Command line is not complete. Try option "help"
# Failed to create netif
not ok 1 selftests: net/forwarding: bridge_igmp.sh # exit=1
[...]

Fix by skipping a test if interfaces are not provided either via the
configuration file or command line arguments.

# make -C tools/testing/selftests TARGETS=net/forwarding run_tests
[...]
TAP version 13
1..102
# timeout set to 45
# selftests: net/forwarding: bridge_igmp.sh
# SKIP: Cannot create interface. Name not specified
ok 1 selftests: net/forwarding: bridge_igmp.sh # SKIP

[1] tools/testing/selftests/net/forwarding/README

Fixes: 81573b18f26d ("selftests/net/forwarding: add Makefile to install tests")
Reported-by: Mirsad Todorovac <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Petr Machata <[email protected]>
Tested-by: Mirsad Todorovac <[email protected]>
Reviewed-by: Hangbin Liu <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

tcp: add missing family to tcp_set_ca_state() tracepoint

Before this code is copied, add the missing family, as we did in
commit 3dd344ea84e1 ("net: tracepoint: exposing sk_family in all tcp:tracepoints")

Fixes: 15fcdf6ae116 ("tcp: Add tracepoint for tcp_set_ca_state")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Ping Gan <[email protected]>
Cc: Manjusaka <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'nexthop-nexthop-dump-fixes'

Ido Schimmel says:

====================
nexthop: Nexthop dump fixes

Patches #1 and #3 fix two problems related to nexthops and nexthop
buckets dump, respectively. Patch #2 is a preparation for the third
patch.

The pattern described in these patches of splitting the NLMSG_DONE to a
separate response is prevalent in other rtnetlink dump callbacks. I
don't know if it's because I'm missing something or if this was done
intentionally to ensure the message is delivered to user space. After
commit 0642840b8bb0 ("af_netlink: ensure that NLMSG_DONE never fails in
dumps") this is no longer necessary and I can improve these dump
callbacks assuming this analysis is correct.

No regressions in existing tests:

# ./fib_nexthops.sh
[...]
Tests passed: 230
Tests failed: 0
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

nexthop: Fix infinite nexthop bucket dump when using maximum nexthop ID

A netlink dump callback can return a positive number to signal that more
information needs to be dumped or zero to signal that the dump is
complete. In the second case, the core netlink code will append the
NLMSG_DONE message to the skb in order to indicate to user space that
the dump is complete.

The nexthop bucket dump callback always returns a positive number if
nexthop buckets were filled in the provided skb, even if the dump is
complete. This means that a dump will span at least two recvmsg() calls
as long as nexthop buckets are present. In the last recvmsg() call the
dump callback will not fill in any nexthop buckets because the previous
call indicated that the dump should restart from the last dumped nexthop
ID plus one.

# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id 10 group 1 type resilient buckets 2
# strace -e sendto,recvmsg -s 5 ip nexthop bucket
sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOPBUCKET, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691396980, nlmsg_pid=0}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 128
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 128
id 10 index 0 idle_time 6.66 nhid 1
id 10 index 1 idle_time 6.66 nhid 1
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
+++ exited with 0 +++

This behavior is both inefficient and buggy. If the last nexthop to be
dumped had the maximum ID of 0xffffffff, then the dump will restart from
0 (0xffffffff + 1) and never end:

# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id $((2**32-1)) group 1 type resilient buckets 2
# ip nexthop bucket
id 4294967295 index 0 idle_time 5.55 nhid 1
id 4294967295 index 1 idle_time 5.55 nhid 1
id 4294967295 index 0 idle_time 5.55 nhid 1
id 4294967295 index 1 idle_time 5.55 nhid 1
[...]

Fix by adjusting the dump callback to return zero when the dump is
complete. After the fix only one recvmsg() call is made and the
NLMSG_DONE message is appended to the RTM_NEWNEXTHOPBUCKET responses:

# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id $((2**32-1)) group 1 type resilient buckets 2
# strace -e sendto,recvmsg -s 5 ip nexthop bucket
sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOPBUCKET, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691396737, nlmsg_pid=0}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 148
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, 0]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 148
id 4294967295 index 0 idle_time 6.61 nhid 1
id 4294967295 index 1 idle_time 6.61 nhid 1
+++ exited with 0 +++

Note that if the NLMSG_DONE message cannot be appended because of size
limitations, then another recvmsg() will be needed, but the core netlink
code will not invoke the dump callback and simply reply with a
NLMSG_DONE message since it knows that the callback previously returned
zero.

Add a test that fails before the fix:

# ./fib_nexthops.sh -t basic_res
[...]
TEST: Maximum nexthop ID dump [FAIL]
[...]

And passes after it:

# ./fib_nexthops.sh -t basic_res
[...]
TEST: Maximum nexthop ID dump [ OK ]
[...]

Fixes: 8a1bbabb034d ("nexthop: Add netlink handlers for bucket dump")
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Petr Machata <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

nexthop: Make nexthop bucket dump more efficient

rtm_dump_nexthop_bucket_nh() is used to dump nexthop buckets belonging
to a specific resilient nexthop group. The function returns a positive
return code (the skb length) upon both success and failure.

The above behavior is problematic. When a complete nexthop bucket dump
is requested, the function that walks the different nexthops treats the
non-zero return code as an error. This causes buckets belonging to
different resilient nexthop groups to be dumped using different buffers
even if they can all fit in the same buffer:

# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id 10 group 1 type resilient buckets 1
# ip nexthop add id 20 group 1 type resilient buckets 1
# strace -e recvmsg -s 0 ip nexthop bucket
[...]
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 64
id 10 index 0 idle_time 10.27 nhid 1
[...]
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 64
id 20 index 0 idle_time 6.44 nhid 1
[...]

Fix by only returning a non-zero return code when an error occurred and
restarting the dump from the bucket index we failed to fill in. This
allows buckets belonging to different resilient nexthop groups to be
dumped using the same buffer:

# ip link add name dummy1 up type dummy
# ip nexthop add id 1 dev dummy1
# ip nexthop add id 10 group 1 type resilient buckets 1
# ip nexthop add id 20 group 1 type resilient buckets 1
# strace -e recvmsg -s 0 ip nexthop bucket
[...]
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 128
id 10 index 0 idle_time 30.21 nhid 1
id 20 index 0 idle_time 26.7 nhid 1
[...]

While this change is more of a performance improvement change than an
actual bug fix, it is a prerequisite for a subsequent patch that does
fix a bug.

Fixes: 8a1bbabb034d ("nexthop: Add netlink handlers for bucket dump")
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Petr Machata <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

nexthop: Fix infinite nexthop dump when using maximum nexthop ID

A netlink dump callback can return a positive number to signal that more
information needs to be dumped or zero to signal that the dump is
complete. In the second case, the core netlink code will append the
NLMSG_DONE message to the skb in order to indicate to user space that
the dump is complete.

The nexthop dump callback always returns a positive number if nexthops
were filled in the provided skb, even if the dump is complete. This
means that a dump will span at least two recvmsg() calls as long as
nexthops are present. In the last recvmsg() call the dump callback will
not fill in any nexthops because the previous call indicated that the
dump should restart from the last dumped nexthop ID plus one.

# ip nexthop add id 1 blackhole
# strace -e sendto,recvmsg -s 5 ip nexthop
sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOP, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691394315, nlmsg_pid=0}, {nh_family=AF_UNSPEC, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 36
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=36, nlmsg_type=RTM_NEWNEXTHOP, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394315, nlmsg_pid=343}, {nh_family=AF_INET, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}, [[{nla_len=8, nla_type=NHA_ID}, 1], {nla_len=4, nla_type=NHA_BLACKHOLE}]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 36
id 1 blackhole
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394315, nlmsg_pid=343}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
+++ exited with 0 +++

This behavior is both inefficient and buggy. If the last nexthop to be
dumped had the maximum ID of 0xffffffff, then the dump will restart from
0 (0xffffffff + 1) and never end:

# ip nexthop add id $((2**32-1)) blackhole
# ip nexthop
id 4294967295 blackhole
id 4294967295 blackhole
[...]

Fix by adjusting the dump callback to return zero when the dump is
complete. After the fix only one recvmsg() call is made and the
NLMSG_DONE message is appended to the RTM_NEWNEXTHOP response:

# ip nexthop add id $((2**32-1)) blackhole
# strace -e sendto,recvmsg -s 5 ip nexthop
sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOP, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691394080, nlmsg_pid=0}, {nh_family=AF_UNSPEC, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 56
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=36, nlmsg_type=RTM_NEWNEXTHOP, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394080, nlmsg_pid=342}, {nh_family=AF_INET, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}, [[{nla_len=8, nla_type=NHA_ID}, 4294967295], {nla_len=4, nla_type=NHA_BLACKHOLE}]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394080, nlmsg_pid=342}, 0]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 56
id 4294967295 blackhole
+++ exited with 0 +++

Note that if the NLMSG_DONE message cannot be appended because of size
limitations, then another recvmsg() will be needed, but the core netlink
code will not invoke the dump callback and simply reply with a
NLMSG_DONE message since it knows that the callback previously returned
zero.

Add a test that fails before the fix:

# ./fib_nexthops.sh -t basic
[...]
TEST: Maximum nexthop ID dump [FAIL]
[...]

And passes after it:

# ./fib_nexthops.sh -t basic
[...]
TEST: Maximum nexthop ID dump [ OK ]
[...]

Fixes: ab84be7e54fc ("net: Initial nexthop code")
Reported-by: Petr Machata <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Petr Machata <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

vlan: Fix VLAN 0 memory leak

The referenced commit intended to fix memleak of VLAN 0 that is implicitly
created on devices with NETIF_F_HW_VLAN_CTAG_FILTER feature. However, it
doesn't take into account that the feature can be re-set during the
netdevice lifetime which will cause memory leak if feature is disabled
during the device deletion as illustrated by [0]. Fix the leak by
unconditionally deleting VLAN 0 on NETDEV_DOWN event.

[0]:
> modprobe 8021q
> ip l set dev eth2 up
> ethtool -K eth2 rx-vlan-filter off
> modprobe -r mlx5_ib
> modprobe -r mlx5_core
> cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff888103dcd900 (size 256):
  comm "ip", pid 1490, jiffies 4294907305 (age 325.364s)
  hex dump (first 32 bytes):
    00 80 5d 03 81 88 ff ff 00 00 00 00 00 00 00 00  ..].............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000899f3bb9>] kmalloc_trace+0x25/0x80
    [<000000002889a7a2>] vlan_vid_add+0xa0/0x210
    [<000000007177800e>] vlan_device_event+0x374/0x760 [8021q]
    [<000000009a0716b1>] notifier_call_chain+0x35/0xb0
    [<00000000bbf3d162>] __dev_notify_flags+0x58/0xf0
    [<0000000053d2b05d>] dev_change_flags+0x4d/0x60
    [<00000000982807e9>] do_setlink+0x28d/0x10a0
    [<0000000058c1be00>] __rtnl_newlink+0x545/0x980
    [<00000000e66c3bd9>] rtnl_newlink+0x44/0x70
    [<00000000a2cc5970>] rtnetlink_rcv_msg+0x29c/0x390
    [<00000000d307d1e4>] netlink_rcv_skb+0x54/0x100
    [<00000000259d16f9>] netlink_unicast+0x1f6/0x2c0
    [<000000007ce2afa1>] netlink_sendmsg+0x232/0x4a0
    [<00000000f3f4bb39>] sock_sendmsg+0x38/0x60
    [<000000002f9c0624>] ____sys_sendmsg+0x1e3/0x200
    [<00000000d6ff5520>] ___sys_sendmsg+0x80/0xc0
unreferenced object 0xffff88813354fde0 (size 32):
  comm "ip", pid 1490, jiffies 4294907305 (age 325.364s)
  hex dump (first 32 bytes):
    a0 d9 dc 03 81 88 ff ff a0 d9 dc 03 81 88 ff ff  ................
    81 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000899f3bb9>] kmalloc_trace+0x25/0x80
    [<000000002da64724>] vlan_vid_add+0xdf/0x210
    [<000000007177800e>] vlan_device_event+0x374/0x760 [8021q]
    [<000000009a0716b1>] notifier_call_chain+0x35/0xb0
    [<00000000bbf3d162>] __dev_notify_flags+0x58/0xf0
    [<0000000053d2b05d>] dev_change_flags+0x4d/0x60
    [<00000000982807e9>] do_setlink+0x28d/0x10a0
    [<0000000058c1be00>] __rtnl_newlink+0x545/0x980
    [<00000000e66c3bd9>] rtnl_newlink+0x44/0x70
    [<00000000a2cc5970>] rtnetlink_rcv_msg+0x29c/0x390
    [<00000000d307d1e4>] netlink_rcv_skb+0x54/0x100
    [<00000000259d16f9>] netlink_unicast+0x1f6/0x2c0
    [<000000007ce2afa1>] netlink_sendmsg+0x232/0x4a0
    [<00000000f3f4bb39>] sock_sendmsg+0x38/0x60
    [<000000002f9c0624>] ____sys_sendmsg+0x1e3/0x200
    [<00000000d6ff5520>] ___sys_sendmsg+0x80/0xc0

Fixes: efc73f4bbc23 ("net: Fix memory leak - vlan_info struct")
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: Vlad Buslov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

x86/mm: Fix VDSO and VVAR placement on 5-level paging machines

Yingcong has noticed that on the 5-level paging machine, VDSO and VVAR
VMAs are placed above the 47-bit border:

8000001a9000-8000001ad000 r--p 00000000 00:00 0 [vvar]
8000001ad000-8000001af000 r-xp 00000000 00:00 0 [vdso]

This might confuse users who are not aware of 5-level paging and expect
all userspace addresses to be under the 47-bit border.

So far problem has only been triggered with ASLR disabled, although it
may also occur with ASLR enabled if the layout is randomized in a just
right way.

The problem happens due to custom placement for the VMAs in the VDSO
code: vdso_addr() tries to place them above the stack and checks the
result against TASK_SIZE_MAX, which is wrong. TASK_SIZE_MAX is set to
the 56-bit border on 5-level paging machines. Use DEFAULT_MAP_WINDOW
instead.

Fixes: b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace")
Reported-by: Yingcong Wu <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/20230803151609.22141-1-kirill.shutemov%40linux.intel.com

platform/x86: ISST: Reduce noise for missing numa information in logs

On platforms with no numa support and with several CPUs, logs have lots
of noise for message "Fail to get numa node for CPU:.."

Change pr_info() to pr_info_once() as one print is enough to show the
issue.

Signed-off-by: Srinivas Pandruvada <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Hans de Goede <[email protected]>
Signed-off-by: Hans de Goede <[email protected]>

platform/x86: msi-ec: Fix the build

The msi-ec driver fails to build for me (gcc 7.5):

  CC [M]  drivers/platform/x86/msi-ec.o
drivers/platform/x86/msi-ec.c:72:6: error: initializer element is not constant
    { SM_ECO_NAME,     0xc2 },
      ^~~~~~~~~~~
drivers/platform/x86/msi-ec.c:72:6: note: (near initialization for ‘CONF0.shift_mode.modes[0].name’)
drivers/platform/x86/msi-ec.c:73:6: error: initializer element is not constant
    { SM_COMFORT_NAME, 0xc1 },
      ^~~~~~~~~~~~~~~
drivers/platform/x86/msi-ec.c:73:6: note: (near initialization for ‘CONF0.shift_mode.modes[1].name’)
drivers/platform/x86/msi-ec.c:74:6: error: initializer element is not constant
    { SM_SPORT_NAME,   0xc0 },
      ^~~~~~~~~~~~~
drivers/platform/x86/msi-ec.c:74:6: note: (near initialization for ‘CONF0.shift_mode.modes[2].name’)
(...)

Don't try to be smart, just use defines for the constant strings. The
compiler will recognize it's the same string and will store it only
once in the data section anyway.

Signed-off-by: Jean Delvare <[email protected]>
Fixes: 392cacf2aa10 ("platform/x86: Add new msi-ec driver")
Cc: [email protected]
Cc: Nikita Kravets <[email protected]>
Cc: Hans de Goede <[email protected]>
Cc: Mark Gross <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Hans de Goede <[email protected]>
Signed-off-by: Hans de Goede <[email protected]>

ACPI: resource: Honor MADT INT_SRC_OVR settings for IRQ1 on AMD Zen

On AMD Zen acpi_dev_irq_override() by default prefers the DSDT IRQ 1
settings over the MADT settings.

This causes the keyboard to malfunction on some laptop models
(see Links), all models from the Links have an INT_SRC_OVR MADT entry
for IRQ 1.

Fixes: a9c4a912b7dc ("ACPI: resource: Remove "Zen" specific match and quirks")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217336
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217394
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217406
Cc: All applicable <[email protected]>
Signed-off-by: Hans de Goede <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

ACPI: resource: Always use MADT override IRQ settings for all legacy non i8042 IRQs

All the cases, were the DSDT IRQ settings should be used instead of
the MADT override, are for IRQ 1 or 12, the PS/2 kbd resp. mouse IRQs.

Simplify things by always honering the override for other legacy IRQs
(for non DMI quirked cases).

This allows removing the DMI quirks to honor the override for
some non i8042 IRQs on some AMD ZEN based Lenovo models.

Fixes: a9c4a912b7dc ("ACPI: resource: Remove "Zen" specific match and quirks")
Cc: All applicable <[email protected]>
Signed-off-by: Hans de Goede <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

ACPI: resource: revert "Remove "Zen" specific match and quirks"

Commit a9c4a912b7dc ("ACPI: resource: Remove "Zen" specific match and
quirks") is causing keyboard problems for quite a log of AMD based
laptop users, leading to many bug reports.

Revert this change for now, until we can come up with
a better fix for the PS/2 IRQ trigger-type/polarity problems
on some x86 laptops.

Fixes: a9c4a912b7dc ("ACPI: resource: Remove "Zen" specific match and quirks")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2228891
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2229165
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2229317
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217718
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217726
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217731
Cc: All applicable <[email protected]>
Signed-off-by: Hans de Goede <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

rust: macros: vtable: fix `HAS_*` redefinition (`gen_const_name`)

If we define the same function name twice in a trait (using `#[cfg]`),
the `vtable` macro will redefine its `gen_const_name`, e.g. this will
define `HAS_BAR` twice:

    #[vtable]
    pub trait Foo {
        #[cfg(CONFIG_X)]
        fn bar();

        #[cfg(not(CONFIG_X))]
        fn bar(x: usize);
    }

Fixes: b44becc5ee80 ("rust: macros: add `#[vtable]` proc macro")
Signed-off-by: Qingsong Chen <[email protected]>
Reviewed-by: Andreas Hindborg <[email protected]>
Reviewed-by: Gary Guo <[email protected]>
Reviewed-by: Sergio González Collado <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Miguel Ojeda <[email protected]>

alpha: remove __init annotation from exported page_is_ram()

EXPORT_SYMBOL and __init is a bad combination because the .init.text
section is freed up after the initialization.

Commit c5a130325f13 ("ACPI/APEI: Add parameter check before error
injection") exported page_is_ram(), hence the __init annotation should
be removed.

This fixes the modpost warning in ARCH=alpha builds:

WARNING: modpost: vmlinux: page_is_ram: EXPORT_SYMBOL used for init symbol. Remove __init or EXPORT_SYMBOL.

Fixes: c5a130325f13 ("ACPI/APEI: Add parameter check before error injection")
Signed-off-by: Masahiro Yamada <[email protected]>
Reviewed-by: Randy Dunlap <[email protected]>

tpm: Add a helper for checking hwrng enabled

The same checks are repeated in three places to decide whether to use
hwrng. Consolidate these into a helper.

Also this fixes a case that one of them was missing a check in the
cleanup path.

Fixes: 554b841d4703 ("tpm: Disable RNG for all AMD fTPMs")
Signed-off-by: Mario Limonciello <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm: Fix access_remote_vm() regression on tagged addresses

GDB uses /proc/PID/mem to access memory of the target process. GDB
doesn't untag addresses manually, but relies on kernel to do the right
thing.

mem_rw() of procfs uses access_remote_vm() to get data from the target
process. It worked fine until recent changes in __access_remote_vm()
that now checks if there's VMA at target address using raw address.

Untag the address before looking up the VMA.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reported-by: Christina Schimpe <[email protected]>
Fixes: eee9c708cc89 ("gup: avoid stack expansion warning for known-good case")
Cc: [email protected]
Signed-off-by: Linus Torvalds <[email protected]>

drm/amdkfd: disable IOMMUv2 support for Raven

Use the dGPU path instead.  There were a lot of platform
issues with IOMMU in general on these chips due to windows
not enabling IOMMU at the time.  The dGPU path has been
used for a long time with newer APUs and works fine.  This
also paves the way to simplify the driver significantly.

Reviewed-by: Felix Kuehling <[email protected]>
Acked-by: Christian König <[email protected]>
Tested-by: Mike Lothian <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amdkfd: disable IOMMUv2 support for KV/CZ

Use the dGPU path instead.  There were a lot of platform
issues with IOMMU in general on these chips due to windows
not enabling IOMMU at the time.  The dGPU path has been
used for a long time with newer APUs and works fine.  This
also paves the way to simplify the driver significantly.

v2: use the dGPU queue manager functions

Reviewed-by: Felix Kuehling <[email protected]>
Acked-by: Christian König <[email protected]>
Tested-by: Mike Lothian <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amdkfd: ignore crat by default

We are dropping the IOMMUv2 path, so no need to enable this.
It's often buggy on consumer platforms anyway.

Reviewed-by: Felix Kuehling <[email protected]>
Acked-by: Christian König <[email protected]>
Tested-by: Mike Lothian <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amdgpu/gfx11: only enable CP GFX shadowing on SR-IOV

This is only required for SR-IOV world switches, but it
adds additional latency leading to reduced performance in
some benchmarks. Disable for now on bare metal.

Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/display: Fix a regression on Polaris cards

DCE products don't define a `remove_stream_from_ctx` like DCN ones
do. This means that when compute_mst_dsc_configs_for_state() is called
it always returns -EINVAL which causes MST to fail to setup.

Cc: [email protected] # 6.4.y
Cc: Harry Wentland <[email protected]>
Reported-by: [email protected]
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/2671
Fixes: efa4c4df864e ("drm/amd/display: call remove_stream_from_ctx from res_pool funcs")
Signed-off-by: Mario Limonciello <[email protected]>
Reviewed-by: Harry Wentland <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amdgpu: fix possible UAF in amdgpu_cs_pass1()

Since the gang_size check is outside of chunk parsing
loop, we need to reset i before we free the chunk data.

Suggested by Ye Zhang (@VAR10CK) of Baidu Security.

Reviewed-by: Guchun Chen <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]

x86/CPU/AMD: Do not leak quotient data after a division by 0

Under certain circumstances, an integer division by 0 which faults, can
leave stale quotient data from a previous division operation on Zen1
microarchitectures.

Do a dummy division 0/1 before returning from the #DE exception handler
in order to avoid any leaks of potentially sensitive data.

Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Cc: <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

drm/amd/pm: Fix SMU v13.0.6 energy reporting

Energy counter should be reported in units of 15.259 uJ. Don't apply
any conversion.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/display: check attr flag before set cursor degamma on DCN3+

Don't set predefined degamma curve to cursor plane if the cursor
attribute flag is not set. Applying a degamma curve to the cursor by
default breaks userspace expectation. Checking the flag before
performing any color transformation prevents too dark cursor gamma in
DCN3+ on many Linux desktop environment (KDE Plasma, GNOME,
wlroots-based, etc.) as reported at:
- https://gitlab.freedesktop.org/drm/amd/-/issues/1513

This is the same approach followed by DCN2 drivers where the issue is
not present.

Fixes: 03f54d7d3448 ("drm/amd/display: Add DCN3 DPP")
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1513
Signed-off-by: Melissa Wen <[email protected]>
Reviewed-by: Harry Wentland <[email protected]>
Tested-by: Alex Hung <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]

drm/amd/pm: disable the SMU13 OD feature support temporarily

The existing OD interface cannot support the growing demand for more
OD features. We are in the transition to a new OD mechanism. So,
disable the SMU13 OD feature support temporarily. And this should be
reverted when the new OD mechanism online.

Signed-off-by: Evan Quan <[email protected]>
Reviewed-by: Guchun Chen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/pm: correct the pcie width for smu 13.0.0

correct the pcie width value in pp_dpm_pcie for smu 13.0.0

Signed-off-by: Kenneth Feng <[email protected]>
Reviewed-by: Harish Kasiviswanathan <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]

drm/amd/display: Don't show stack trace for missing eDP

Some systems are only connected by HDMI or DP, so warning related to
missing eDP is unnecessary. Downgrade to debug instead.

Cc: Hamza Mahfooz <[email protected]>
Fixes: 6d9b6dceaa51 ("drm/amd/display: only warn once in dce110_edp_wait_for_hpd_ready()")
Reported-by: [email protected]
Signed-off-by: Mario Limonciello <[email protected]>
Reviewed-by: Hamza Mahfooz <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amdgpu: Match against exact bootloader status

On PSP v13.x ASICs, boot loader will set only the MSB to 1 and clear the
least significant bits for any command submission. Hence match against
the exact register value, otherwise a register value of all 0xFFs also
could falsely indicate that boot loader is ready. Also, from PSP v13.0.6
and newer, bits[7:0] will be used to indicate command error status.

Signed-off-by: Lijo Lazar <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/pm: skip the RLC stop when S0i3 suspend for SMU v13.0.4/11

For SMU v13.0.4/11, driver does not need to stop RLC for S0i3,
the firmwares will handle that properly.

Signed-off-by: Tim Huang <[email protected]>
Reviewed-by: Mario Limonciello <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd: Disable S/G for APUs when 64GB or more host memory

Users report a white flickering screen on multiple systems that
is tied to having 64GB or more memory.  When S/G is enabled pages
will get pinned to both VRAM carve out and system RAM leading to
this.

Until it can be fixed properly, disable S/G when 64GB of memory or
more is detected.  This will force pages to be pinned into VRAM.
This should fix white screen flickers but if VRAM pressure is
encountered may lead to black screens.  It's a trade-off for now.

Fixes: 81d0bcf99009 ("drm/amdgpu: make display pinning more flexible (v2)")
Cc: Hamza Mahfooz <[email protected]>
Cc: Roman Li <[email protected]>
Cc: <[email protected]> # 6.1.y: bf0207e172703 ("drm/amdgpu: add S/G display parameter")
Cc: <[email protected]> # 6.4.y
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2735
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2354
Signed-off-by: Mario Limonciello <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

ASoC: SOF: ipc4-topology: Update the basecfg for copier earlier

The sof_ipc4_update_resource_usage() call updates the CPC value in basecfg
and it must be done prior to making a copy of the copier configuration
for the init message.
Other module types do the resource update as last step or in case of a
process module at the correct time, before the memcpy.

Fixes: d8a2c9879349 ("ASoC: SOF: ipc4-loader/topology: Query the CPC value from manifest")
Signed-off-by: Peter Ujfalusi <[email protected]>
Reviewed-by: Ranjani Sridharan <[email protected]>
Reviewed-by: Adrian Bonislawski <[email protected]>
Reviewed-by: Pierre-Louis Bossart <[email protected]>
Reviewed-by: Bard Liao <[email protected]>
Reviewed-by: Chao Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

wifi: ath12k: Fix buffer overflow when scanning with extraie

If cfg80211 is providing extraie's for a scanning process then ath12k will
copy that over to the firmware. The extraie.len is a 32 bit value in struct
element_info and describes the amount of bytes for the vendor information
elements.

The problem is the allocation of the buffer. It has to align the TLV
sections by 4 bytes. But the code was using an u8 to store the newly
calculated length of this section (with alignment). And the new
calculated length was then used to allocate the skbuff. But the actual
code to copy in the data is using the extraie.len and not the calculated
"aligned" length.

The length of extraie with IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS enabled
was 264 bytes during tests with a wifi card. But it only allocated 8
bytes (264 bytes % 256) for it. As consequence, the code to memcpy the
extraie into the skb was then just overwriting data after skb->end. Things
like shinfo were therefore corrupted. This could usually be seen by a crash
in skb_zcopy_clear which tried to call a ubuf_info callback (using a bogus
address).

Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0-03427-QCAHMTSWPL_V1.0_V2.0_SILICONZ-1.15378.4

Signed-off-by: Wen Gong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Johannes Berg <[email protected]>

wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems()

nl80211_parse_mbssid_elems() uses a u8 variable num_elems to count the
number of MBSSID elements in the nested netlink attribute attrs, which can
lead to an integer overflow if a user of the nl80211 interface specifies
256 or more elements in the corresponding attribute in userspace. The
integer overflow can lead to a heap buffer overflow as num_elems determines
the size of the trailing array in elems, and this array is thereafter
written to for each element in attrs.

Note that this vulnerability only affects devices with the
wiphy->mbssid_max_interfaces member set for the wireless physical device
struct in the device driver, and can only be triggered by a process with
CAP_NET_ADMIN capabilities.

Fix this by checking for a maximum of 255 elements in attrs.

Cc: [email protected]
Fixes: dc1e3cb8da8b ("nl80211: MBSSID and EMA support in AP mode")
Signed-off-by: Keith Yeo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Johannes Berg <[email protected]>

netfilter: nf_tables: don't skip expired elements during walk

There is an asymmetry between commit/abort and preparation phase if the
following conditions are met:

1. set is a verdict map ("1.2.3.4 : jump foo")
2. timeouts are enabled

In this case, following sequence is problematic:

1. element E in set S refers to chain C
2. userspace requests removal of set S
3. kernel does a set walk to decrement chain->use count for all elements
   from preparation phase
4. kernel does another set walk to remove elements from the commit phase
   (or another walk to do a chain->use increment for all elements from
    abort phase)

If E has already expired in 1), it will be ignored during list walk, so its use count
won't have been changed.

Then, when set is culled, ->destroy callback will zap the element via
nf_tables_set_elem_destroy(), but this function is only safe for
elements that have been deactivated earlier from the preparation phase:
lack of earlier deactivate removes the element but leaks the chain use
count, which results in a WARN splat when the chain gets removed later,
plus a leak of the nft_chain structure.

Update pipapo_get() not to skip expired elements, otherwise flush
command reports bogus ENOENT errors.

Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Fixes: 8d8540c4f5e0 ("netfilter: nft_set_rbtree: add timeout support")
Fixes: 9d0982927e79 ("netfilter: nft_hash: add support for timeouts")
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

drm/rockchip: Don't spam logs in atomic check

Userspace should not be able to trigger DRM_ERROR messages to spam the
logs; especially not through atomic commit parameters which are
completely legitimate for userspace to attempt.

Signed-off-by: Daniel Stone <[email protected]>
Fixes: 7707f7227f09 ("drm/rockchip: Add support for afbc")
Signed-off-by: Heiko Stuebner <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

serial: core: Revert port_id use

Guenter reports boot issues with duplicate sysfs entries for multiport
drivers. Let's go back to using port->line for now to fix the regression.

With this change, the serial core port device names are not correct for the
hardware specific 8250 single port drivers, but that's a cosmetic issue for
now.

Fixes: d962de6ae51f ("serial: core: Fix serial core port id to not use port->line")
Reported-by: Guenter Roeck <[email protected]>
Signed-off-by: Tony Lindgren <[email protected]>
Tested-by: Guenter Roeck <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

TIOCSTI: Document CAP_SYS_ADMIN behaviour in Kconfig

Clarifies that the LEGACY_TIOCSTI setting is safe to turn off even
when running BRLTTY, as it was introduced in commit 690c8b804ad2
("TIOCSTI: always enable for CAP_SYS_ADMIN").

Signed-off-by: Günther Noack <[email protected]>
Reviewed-by: Samuel Thibault <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

media: imx: imx7-media-csi: Fix applying format constraints

v4l_bound_align_image() aligns to a multiple of 2 to the power of
walign, not to walign. Depending on the pixel format, this causes the
image width to be aligned to 16 or 256 pixels instead of 4 or 8 as
required by the hardware. Fix it by rounding and clamping the width and
height manually.

Closes: https://lore.kernel.org/linux-media/CAJ+vNU0BOVLTL17ofgHwtexbpuMYwH_aGUC==EXABUtHHiv_ag@mail.gmail.com
Reported-by: Tim Harvey <[email protected]>
Fixes: 6f482c4729d9 ("media: imx: imx7-media-csi: Get rid of superfluous call to imx7_csi_mbus_fmt_to_pix_fmt")
Co-developed-by: Alexander Stein <[email protected]>
Signed-off-by: Alexander Stein <[email protected]>
Signed-off-by: Fabio Estevam <[email protected]>
Signed-off-by: Laurent Pinchart <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

mmc: sunplus: Fix error handling in spmmc_drv_probe()

When mmc allocation succeeds, the error paths are not freeing mmc.

Fix the above issue by changing mmc_alloc_host() to devm_mmc_alloc_host()
to simplify the error handling. Remove label 'probe_free_host' as devm_*
api takes care of freeing, also remove mmc_free_host() from remove
function as devm_* takes care of freeing.

Fixes: 4e268fed8b18 ("mmc: Add mmc driver for Sunplus SP7021")
Reported-by: Dan Carpenter <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Harshit Mogalapalli <[email protected]>
Reviewed-by: Dan Carpenter <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ulf Hansson <[email protected]>

mmc: sunplus: fix return value check of mmc_add_host()

mmc_add_host() may return error, if we ignore its return value,
1. the memory allocated in mmc_alloc_host() will be leaked
2. null-ptr-deref will happen when calling mmc_remove_host()
in remove function spmmc_drv_remove() because deleting not
added device.

Fix this by checking the return value of mmc_add_host(). Moreover,
I fixed the error handling path of spmmc_drv_probe() to clean up.

Fixes: 4e268fed8b18 ("mmc: Add mmc driver for Sunplus SP7021")
Cc: [email protected]
Signed-off-by: Wei Chen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ulf Hansson <[email protected]>

mmc: wbsd: fix double mmc_free_host() in wbsd_init()

mmc_free_host() has already be called in wbsd_free_mmc(),
remove the mmc_free_host() in error path in wbsd_init().

Fixes: dc5b9b50fc9d ("mmc: wbsd: fix return value check of mmc_add_host()")
Signed-off-by: Yang Yingliang <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ulf Hansson <[email protected]>

Merge branch 'smc-fixes'

Gerd Bayer says:

====================
net/smc: Fix effective buffer size

commit 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock
and make them tunable") started to derive the effective buffer size for
SMC connections inconsistently in case a TCP fallback was used and
memory consumption of SMC with the default settings was doubled when
a connection negotiated SMC. That was not what we want.

This series consolidates the resulting effective buffer size that is
used with SMC sockets, which is based on Jan Karcher's effort (see
[1]). For all TCP exchanges (in particular in case of a fall back when
no SMC connection was possible) the values from net.ipv4.tcp_[rw]mem
are used. If SMC succeeds in establishing a SMC connection, the newly
introduced values from net.smc.[rw]mem are used.

net.smc.[rw]mem is initialized to 64kB, respectively. Internal test
have show this to be a good compromise between throughput/latency
and memory consumption. Also net.smc.[rw]mem is now decoupled completely
from any tuning through net.ipv4.tcp_[rw]mem.

If a user chose to tune a socket's receive or send buffer size with
setsockopt, this tuning is now consistently applied to either fall-back
TCP or proper SMC connections over the socket.

Thanks,
Gerd

v2 - v3:
- Rebase to and resolve conflict of second patch with latest net/master.
v1 - v2:
- In second patch, use sock_net() helper as suggested by Tony and demanded
by kernel test robot.
====================

Signed-off-by: David S. Miller <[email protected]>

net/smc: Use correct buffer sizes when switching between TCP and SMC

Tuning of the effective buffer size through setsockopts was working for
SMC traffic only but not for TCP fall-back connections even before
commit 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and
make them tunable"). That change made it apparent that TCP fall-back
connections would use net.smc.[rw]mem as buffer size instead of
net.ipv4_tcp_[rw]mem.

Amend the code that copies attributes between the (TCP) clcsock and the
SMC socket and adjust buffer sizes appropriately:
- Copy over sk_userlocks so that both sockets agree on whether tuning
  via setsockopt is active.
- When falling back to TCP use sk_sndbuf or sk_rcvbuf as specified with
  setsockopt. Otherwise, use the sysctl value for TCP/IPv4.
- Likewise, use either values from setsockopt or from sysctl for SMC
  (duplicated) on successful SMC connect.

In smc_tcp_listen_work() drop the explicit copy of buffer sizes as that
is taken care of by the attribute copy.

Fixes: 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable")
Reviewed-by: Wenjia Zhang <[email protected]>
Reviewed-by: Tony Lu <[email protected]>
Signed-off-by: Gerd Bayer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/smc: Fix setsockopt and sysctl to specify same buffer size again

Commit 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock
and make them tunable") introduced the net.smc.rmem and net.smc.wmem
sysctls to specify the size of buffers to be used for SMC type
connections. This created a regression for users that specified the
buffer size via setsockopt() as the effective buffer size was now
doubled.

Re-introduce the division by 2 in the SMC buffer create code and level
this out by duplicating the net.smc.[rw]mem values used for initializing
sk_rcvbuf/sk_sndbuf at socket creation time. This gives users of both
methods (setsockopt or sysctl) the effective buffer size that they
expect.

Initialize net.smc.[rw]mem from its own constant of 64kB, respectively.
Internal performance tests show that this value is a good compromise
between throughput/latency and memory consumption. Also, this decouples
it from any tuning that was done to net.ipv4.tcp_[rw]mem[1] before the
module for SMC protocol was loaded. Check that no more than INT_MAX / 2
is assigned to net.smc.[rw]mem, in order to avoid any overflow condition
when that is doubled for use in sk_sndbuf or sk_rcvbuf.

While at it, drop the confusing sk_buf_size variable from
__smc_buf_create and name "compressed" buffer size variables more
consistently.

Background:

Before the commit mentioned above, SMC's buffer allocator in
__smc_buf_create() always used half of the sockets' sk_rcvbuf/sk_sndbuf
value as initial value to search for appropriate buffers. If the search
resorted to using a bigger buffer when all buffers of the specified
size were busy, the duplicate of the used effective buffer size is
stored back to sk_rcvbuf/sk_sndbuf.

When available, buffers of exactly the size that a user had specified as
input to setsockopt() were used, despite setsockopt()'s documentation in
"man 7 socket" talking of a mandatory duplication:

[...]
       SO_SNDBUF
              Sets  or  gets the maximum socket send buffer in bytes.
              The kernel doubles this value (to allow space for book‐
              keeping  overhead)  when it is set using setsockopt(2),
              and this doubled value is  returned  by  getsockopt(2).
              The     default     value     is     set     by     the
              /proc/sys/net/core/wmem_default file  and  the  maximum
              allowed value is set by the /proc/sys/net/core/wmem_max
              file.  The minimum (doubled) value for this  option  is
              2048.
[...]

Fixes: 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable")
Co-developed-by: Jan Karcher <[email protected]>
Signed-off-by: Jan Karcher <[email protected]>
Reviewed-by: Wenjia Zhang <[email protected]>
Reviewed-by: Tony Lu <[email protected]>
Signed-off-by: Gerd Bayer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'enetc-probe-fix'

Vladimir Oltean says:

====================
Fix ENETC probing after 6fffbc7ae137 ("PCI: Honor firmware's device disabled status")

I'm not sure who should take this patch set (net maintainers or PCI
maintainers). Everyone could pick up just their part, and that would
work (no compile time dependencies). However, the entire series needs
ACK from both sides and Rob for sure.

v1 at:
https://lore.kernel.org/netdev/20230521115141.2384444 [email protected]/
====================

Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: enetc: remove of_device_is_available() handling

Since commit 6fffbc7ae137 ("PCI: Honor firmware's device disabled
status"), this is redundant and does nothing, because enetc_pf_probe()
no longer even gets called.

Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: enetc: reimplement RFS/RSS memory clearing as PCI quirk

The workaround implemented in commit 3222b5b613db ("net: enetc:
initialize RFS/RSS memories for unused ports too") is no longer
effective after commit 6fffbc7ae137 ("PCI: Honor firmware's device
disabled status"). Thus, it has introduced a regression and we see AER
errors being reported again:

$ ip link set sw2p0 up && dhclient -i sw2p0 && ip addr show sw2p0
fsl_enetc 0000:00:00.2 eno2: configuring for fixed/internal link mode
fsl_enetc 0000:00:00.2 eno2: Link is Up - 2.5Gbps/Full - flow control rx/tx
mscc_felix 0000:00:00.5 swp2: configuring for fixed/sgmii link mode
mscc_felix 0000:00:00.5 swp2: Link is Up - 1Gbps/Full - flow control off
sja1105 spi2.2 sw2p0: configuring for phy/rgmii-id link mode
sja1105 spi2.2 sw2p0: Link is Up - 1Gbps/Full - flow control off
pcieport 0000:00:1f.0: AER: Multiple Corrected error received: 0000:00:00.0
pcieport 0000:00:1f.0: AER: can't find device of ID0000

Rob's suggestion is to reimplement the enetc driver workaround as a
PCI fixup, and to modify the PCI core to run the fixups for all PCI
functions. This change handles the first part.

We refactor the common code in enetc_psi_create() and enetc_psi_destroy(),
and use the PCI fixup only for those functions for which enetc_pf_probe()
won't get called. This avoids some work being done twice for the PFs
which are enabled.

Fixes: 6fffbc7ae137 ("PCI: Honor firmware's device disabled status")
Link: https://lore.kernel.org/netdev/CAL_JsqLsVYiPLx2kcHkDQ4t=hQVCR7NHziDwi9cCFUFhx48Qow@mail.gmail.com/
Suggested-by: Rob Herring <[email protected]>
Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

PCI: move OF status = "disabled" detection to dev->match_driver

The blamed commit has broken probing on
arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi when &enetc_port0
(PCI function 0) has status = "disabled".

Background: pci_scan_slot() has logic to say that if the function 0 of a
device is absent, the entire device is absent and we can skip the other
functions entirely. Traditionally, this has meant that
pci_bus_read_dev_vendor_id() returns an error code for that function.

However, since the blamed commit, there is an extra confounding
condition: function 0 of the device exists and has a valid vendor id,
but it is disabled in the device tree. In that case, pci_scan_slot()
would incorrectly skip the entire device instead of just that function.

In the case of NXP LS1028A, status = "disabled" does not mean that the
PCI function's config space is not available for reading. It is, but the
Ethernet port is just not functionally useful with a particular SerDes
protocol configuration (0x9999) due to pinmuxing constraints of the Soc.
So, pci_scan_slot() skips all other functions on the ENETC ECAM
(enetc_port1, enetc_port2, enetc_mdio_pf3 etc) when just enetc_port0 had
to not be probed.

There is an additional regression introduced by the change, caused by
its fundamental premise. The enetc driver needs to run code for all PCI
functions, regardless of whether they're enabled or not in the device
tree. That is no longer possible if the driver's probe function is no
longer called. But Rob recommends that we move the of_device_is_available()
detection to dev->match_driver, and this makes the PCI fixups still run
on all functions, while just probing drivers for those functions that
are enabled. So, a separate change in the enetc driver will have to move
the workarounds to a PCI fixup.

Fixes: 6fffbc7ae137 ("PCI: Honor firmware's device disabled status")
Link: https://lore.kernel.org/netdev/CAL_JsqLsVYiPLx2kcHkDQ4t=hQVCR7NHziDwi9cCFUFhx48Qow@mail.gmail.com/
Suggested-by: Rob Herring <[email protected]>
Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

iavf: fix potential races for FDIR filters

Add fdir_fltr_lock locking in unprotected places.

The change in iavf_fdir_is_dup_fltr adds a spinlock around a loop which
iterates over all filters and looks for a duplicate. The filter can be
removed from list and freed from memory at the same time it's being
compared. All other places where filters are deleted are already
protected with spinlock.

The remaining changes protect adapter->fdir_active_fltr variable so now
all its uses are under a spinlock.

Fixes: 527691bf0682 ("iavf: Support IPv4 Flow Director filters")
Signed-off-by: Piotr Gardocki <[email protected]>
Tested-by: Rafal Romanowski <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

igc: Add lock to safeguard global Qbv variables

Access to shared variables through hrtimer requires locking in order
to protect the variables because actions to write into these variables
(oper_gate_closed, admin_gate_closed, and qbv_transition) might potentially
occur simultaneously. This patch provides a locking mechanisms to avoid
such scenarios.

Fixes: 175c241288c0 ("igc: Fix TX Hang issue when QBV Gate is closed")
Suggested-by: Leon Romanovsky <[email protected]>
Signed-off-by: Muhammad Husaini Zulkifli <[email protected]>
Tested-by: Naama Meir <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge tag 'mlx5-fixes-2023-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5 fixes 2023-08-07

This series provides bug fixes to mlx5 driver.

* tag 'mlx5-fixes-2023-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5e: Add capability check for vnic counters
  net/mlx5: Reload auxiliary devices in pci error handlers
  net/mlx5: Skip clock update work when device is in error state
  net/mlx5: LAG, Check correct bucket when modifying LAG
  net/mlx5e: Unoffload post act rule when handling FIB events
  net/mlx5: Fix devlink controller number for ECVF
  net/mlx5: Allow 0 for total host VFs
  net/mlx5: Return correct EC_VF function ID
  net/mlx5: DR, Fix wrong allocation of modify hdr pattern
  net/mlx5e: TC, Fix internal port memory leak
  net/mlx5e: Take RTNL lock when needed before calling xdp_set_features()
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'there-are-some-bugfix-for-the-hns3-ethernet-driver'

Jijie Shao says:

====================
There are some bugfix for the HNS3 ethernet driver

There are some bugfix for the HNS3 ethernet driver

v1: https://lore.kernel.org/all/20230728075840.4022760 [email protected]/
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: hns3: fix deadlock issue when externel_lb and reset are executed together

When externel_lb and reset are executed together, a deadlock may
occur:
[ 3147.217009] INFO: task kworker/u321:0:7 blocked for more than 120 seconds.
[ 3147.230483] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3147.238999] task:kworker/u321:0  state:D stack:    0 pid:    7 ppid:     2 flags:0x00000008
[ 3147.248045] Workqueue: hclge hclge_service_task [hclge]
[ 3147.253957] Call trace:
[ 3147.257093]  __switch_to+0x7c/0xbc
[ 3147.261183]  __schedule+0x338/0x6f0
[ 3147.265357]  schedule+0x50/0xe0
[ 3147.269185]  schedule_preempt_disabled+0x18/0x24
[ 3147.274488]  __mutex_lock.constprop.0+0x1d4/0x5dc
[ 3147.279880]  __mutex_lock_slowpath+0x1c/0x30
[ 3147.284839]  mutex_lock+0x50/0x60
[ 3147.288841]  rtnl_lock+0x20/0x2c
[ 3147.292759]  hclge_reset_prepare+0x68/0x90 [hclge]
[ 3147.298239]  hclge_reset_subtask+0x88/0xe0 [hclge]
[ 3147.303718]  hclge_reset_service_task+0x84/0x120 [hclge]
[ 3147.309718]  hclge_service_task+0x2c/0x70 [hclge]
[ 3147.315109]  process_one_work+0x1d0/0x490
[ 3147.319805]  worker_thread+0x158/0x3d0
[ 3147.324240]  kthread+0x108/0x13c
[ 3147.328154]  ret_from_fork+0x10/0x18

In externel_lb process, the hns3 driver call napi_disable()
first, then the reset happen, then the restore process of the
externel_lb will fail, and will not call napi_enable(). When
doing externel_lb again, napi_disable() will be double call,
cause a deadlock of rtnl_lock().

This patch use the HNS3_NIC_STATE_DOWN state to protect the
calling of napi_disable() and napi_enable() in externel_lb
process, just as the usage in ndo_stop() and ndo_start().

Fixes: 04b6ba143521 ("net: hns3: add support for external loopback test")
Signed-off-by: Yonglong Liu <[email protected]>
Signed-off-by: Jijie Shao <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: hns3: add wait until mac link down

In some configure flow of hns3 driver, for example, change mtu, it will
disable MAC through firmware before configuration. But firmware disables
MAC asynchronously. The rx traffic may be not stopped in this case.

So fixes it by waiting until mac link is down.

Fixes: a9775bb64aa7 ("net: hns3: fix set and get link ksettings issue")
Signed-off-by: Jie Wang <[email protected]>
Signed-off-by: Jijie Shao <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: hns3: refactor hclge_mac_link_status_wait for interface reuse

Some nic configurations could only be performed after link is down. So this
patch refactor this API for reuse.

Signed-off-by: Jie Wang <[email protected]>
Signed-off-by: Jijie Shao <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: hns3: restore user pause configure when disable autoneg

Restore the mac pause state to user configuration when autoneg is disabled

Signed-off-by: Jian Shen <[email protected]>
Signed-off-by: Peiyang Wang <[email protected]>
Signed-off-by: Jijie Shao <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net/unix: use consistent error code in SO_PEERPIDFD

Change the new (unreleased) SO_PEERPIDFD sockopt to return ENODATA
rather than ESRCH if a socket type does not support remote peer-PID
queries.

Currently, SO_PEERPIDFD returns ESRCH when the socket in question is
not an AF_UNIX socket. This is quite unexpected, given that one would
assume ESRCH means the peer process already exited and thus cannot be
found. However, in that case the sockopt actually returns EINVAL (via
pidfd_prepare()). This is rather inconsistent with other syscalls, which
usually return ESRCH if a given PID refers to a non-existant process.

This changes SO_PEERPIDFD to return ENODATA instead. This is also what
SO_PEERGROUPS returns, and thus keeps a consistent behavior across
sockopts.

Note that this code is returned in 2 cases: First, if the socket type is
not AF_UNIX, and secondly if the socket was not yet connected. In both
cases ENODATA seems suitable.

Signed-off-by: David Rheinsberg <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>
Acked-by: Luca Boccassi <[email protected]>
Fixes: 7b26952a91cf ("net: core: add getsockopt SO_PEERPIDFD")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

riscv: mm: fix 2 instances of -Wmissing-variable-declarations

I'm looking to enable -Wmissing-variable-declarations behind W=1. 0day
bot spotted the following instance in ARCH=riscv builds:

  arch/riscv/mm/init.c:276:7: warning: no previous extern declaration
  for non-static variable 'trampoline_pg_dir'
  [-Wmissing-variable-declarations]
  276 | pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
      |       ^
  arch/riscv/mm/init.c:276:1: note: declare 'static' if the variable is
  not intended to be used outside of this translation unit
  276 | pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
      | ^
  arch/riscv/mm/init.c:279:7: warning: no previous extern declaration
  for non-static variable 'early_pg_dir'
  [-Wmissing-variable-declarations]
  279 | pgd_t early_pg_dir[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE);
      |       ^
  arch/riscv/mm/init.c:279:1: note: declare 'static' if the variable is
  not intended to be used outside of this translation unit
  279 | pgd_t early_pg_dir[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE);
      | ^

These symbols are referenced by more than one translation unit, so make
sure they're both declared and include the correct header for their
declarations. Finally, sort the list of includes to help keep them tidy.

Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/llvm/[email protected]/
Signed-off-by: Nick Desaulniers <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Cc: [email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv,mmio: Fix readX()-to-delay() ordering

Section 2.1 of the Platform Specification [1] states:

Unless otherwise specified by a given I/O device, I/O devices are on
ordering channel 0 (i.e., they are point-to-point strongly ordered).

which is not sufficient to guarantee that a readX() by a hart completes
before a subsequent delay() on the same hart (cf. memory-barriers.txt,
"Kernel I/O barrier effects").

Set the I(nput) bit in __io_ar() to restore the ordering, align inline
comments.

[1] https://github.com/riscv/riscv-platform-specs

Signed-off-by: Andrea Parri <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Fixes: fab957c11efe ("RISC-V: Atomic and Locking Code")
Cc: [email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: Fix CPU feature detection with SMP disabled

commit 914d6f44fc50 ("RISC-V: only iterate over possible CPUs in ISA
string parser") changed riscv_fill_hwcap() from iterating over CPU DT
nodes to iterating over logical CPU IDs. Since this function runs long
before cpu_dev_init() creates CPU devices, it hits the fallback path in
of_cpu_device_node_get(), which itself iterates over the DT nodes,
searching for a node with the requested CPU ID. (Incidentally, this
makes riscv_fill_hwcap() now take quadratic time.)

riscv_fill_hwcap() passes a logical CPU ID to of_cpu_device_node_get(),
which uses the arch_match_cpu_phys_id() hook to translate the logical ID
to a physical ID as found in the DT.

arch_match_cpu_phys_id() has a generic weak definition, and RISC-V
provides a strong definition using cpuid_to_hartid_map(). However, the
RISC-V specific implementation is located in arch/riscv/kernel/smp.c,
and that file is only compiled when SMP is enabled.

As a result, when SMP is disabled, the generic definition is used, and
riscv_isa gets initialized based on the ISA string of hart 0, not the
boot hart. On FU740, this means has_fpu() returns false, and userspace
crashes when trying to use floating-point instructions.

Fix this by moving arch_match_cpu_phys_id() to a file which is always
compiled.

Fixes: 70114560b285 ("RISC-V: Add RISC-V specific arch_match_cpu_phys_id")
Fixes: 914d6f44fc50 ("RISC-V: only iterate over possible CPUs in ISA string parser")
Reported-by: Palmer Dabbelt <[email protected]>
Signed-off-by: Samuel Holland <[email protected]>
Reviewed-by: Conor Dooley <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Cc: [email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>

Merge tag 'hardening-v6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening fixes from Kees Cook:

- Replace remaining open-coded struct_size_t() instance (Gustavo A. R.
   Silva)

- Adjust vboxsf's trailing arrays to be proper flexible arrays

* tag 'hardening-v6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  media: venus: Use struct_size_t() helper in pkt_session_unset_buffers()
  vboxsf: Use flexible arrays for trailing string member

block: get rid of unused plug->nowait flag

This was introduced to add a plug based way of signaling nowait issues,
but we have since moved on from that. Kill the old dead code, nobody is
setting it anymore.

Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

MAINTAINERS: update Claudiu Beznea's email address

Update MAINTAINERS entries with a valid email address as the Microchip
one is no longer valid.

Acked-by: Conor Dooley <[email protected]>
Acked-by: Nicolas Ferre <[email protected]>
Signed-off-by: Claudiu Beznea <[email protected]>
Acked-by: Sebastian Reichel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

x86/linkage: Fix typo of BUILD_VDSO in asm/linkage.h

The BUILD_VDSO macro was incorrectly spelled as BULID_VDSO in
asm/linkage.h. This causes the !defined(BULID_VDSO) directive to always
evaluate to true.

Correct the spelling to BUILD_VDSO.

Fixes: bea75b33895f ("x86/Kconfig: Introduce function padding")
Signed-off-by: Jinghao Jia <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

PCI: acpiphp: Use pci_assign_unassigned_bridge_resources() only for non-root bus

40613da52b13 ("PCI: acpiphp: Reassign resources on bridge if necessary")
changed acpiphp hotplug to use pci_assign_unassigned_bridge_resources()
which depends on bridge being available, however enable_slot() can be
called without bridge associated:

  1. Legitimate case of hotplug on root bus (widely used in virt world)

  2. A (misbehaving) firmware, that sends ACPI Bus Check notifications to
     non existing root ports (Dell Inspiron 7352/0W6WV0), which end up at
     enable_slot(..., bridge = 0) where bus has no bridge assigned to it.
     acpihp doesn't know that it's a bridge, and bus specific 'PCI
     subsystem' can't augment ACPI context with bridge information since
     the PCI device to get this data from is/was not available.

Issue is easy to reproduce with QEMU's 'pc' machine, which supports PCI
hotplug on hostbridge slots. To reproduce, boot kernel at commit
40613da52b13 in VM started with following CLI (assuming guest root fs is
installed on sda1 partition):

  # qemu-system-x86_64 -M pc -m 1G -enable-kvm -cpu host \
        -monitor stdio -serial file:serial.log           \
        -kernel arch/x86/boot/bzImage                    \
        -append "root=/dev/sda1 console=ttyS0"           \
        guest_disk.img

Once guest OS is fully booted at qemu prompt:

  (qemu) device_add e1000

(check serial.log) it will cause NULL pointer dereference at:

  void pci_assign_unassigned_bridge_resources(struct pci_dev *bridge)
  {
    struct pci_bus *parent = bridge->subordinate;

  BUG: kernel NULL pointer dereference, address: 0000000000000018

   ? pci_assign_unassigned_bridge_resources+0x1f/0x260
   enable_slot+0x21f/0x3e0
   acpiphp_hotplug_notify+0x13d/0x260
   acpi_device_hotplug+0xbc/0x540
   acpi_hotplug_work_fn+0x15/0x20
   process_one_work+0x1f7/0x370
   worker_thread+0x45/0x3b0

The issue was discovered on Dell Inspiron 7352/0W6WV0 laptop with following
sequence:

  1. Suspend to RAM
  2. Wake up with the same backtrace being observed:
  3. 2nd suspend to RAM attempt makes laptop freeze

Fix it by using __pci_bus_assign_resources() instead of
pci_assign_unassigned_bridge_resources() as we used to do, but only in case
when bus doesn't have a bridge associated (to cover for the case of ACPI
event on hostbridge or non existing root port).

That lets us keep hotplug on root bus working like it used to and at the
same time keeps resource reassignment usable on root ports (and other 1st
level bridges) that was fixed by 40613da52b13.

Fixes: 40613da52b13 ("PCI: acpiphp: Reassign resources on bridge if necessary")
Link: https://lore.kernel.org/r/[email protected]
Reported-by: Woody Suwalski <[email protected]>
Tested-by: Woody Suwalski <[email protected]>
Tested-by: Michal Koutný <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Igor Mammedov <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>

io_uring/parisc: Adjust pgoff in io_uring mmap() for parisc

The changes from commit 32832a407a71 ("io_uring: Fix io_uring mmap() by
using architecture-provided get_unmapped_area()") to the parisc
implementation of get_unmapped_area() broke glibc's locale-gen
executable when running on parisc.

This patch reverts those architecture-specific changes, and instead
adjusts in io_uring_mmu_get_unmapped_area() the pgoff offset which is
then given to parisc's get_unmapped_area() function. This is much
cleaner than the previous approach, and we still will get a coherent
addresss.

This patch has no effect on other architectures (SHM_COLOUR is only
defined on parisc), and the liburing testcase stil passes on parisc.

Cc: [email protected] # 6.4
Signed-off-by: Helge Deller <[email protected]>
Reported-by: Christoph Biedl <[email protected]>
Fixes: 32832a407a71 ("io_uring: Fix io_uring mmap() by using architecture-provided get_unmapped_area()")
Fixes: d808459b2e31 ("io_uring: Adjust mapping wrt architecture aliasing requirements")
Link: https://lore.kernel.org/r/ZNEyGV0jyI8kOOfz@p100
Signed-off-by: Jens Axboe <[email protected]>

MAINTAINERS: Remove maintainer of HiSilicon RoCE

Haoyue no longer maintains the Hisilicon RoCE driver. So remove him
from MAINTAINERS.

Signed-off-by: Junxian Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Leon Romanovsky <[email protected]>

perf stat: Don't display zero tool counts

Andi reported (see link below) a regression when printing the
'duration_time' tool event, where it gets printed as "not counted" for
most of the CPUs, fix it by skipping zero counts for tool events.

Reported-by: Andi Kleen <[email protected]>
Signed-off-by: Ian Rogers <[email protected]>
Tested-by: Andi Kleen <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Claire Jensen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/all/ZMlrzcVrVi1lTDmn@tassilo/
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

Merge tag 'gfs2-v6.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fixes from Andreas Gruenbacher:

- Fix a freeze consistency check in gfs2_trans_add_meta()

- Don't use filemap_splice_read as it can cause deadlocks on gfs2

* tag 'gfs2-v6.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Don't use filemap_splice_read
gfs2: Fix freeze consistency check in gfs2_trans_add_meta

accel/ivpu: Add set_pages_array_wc/uc for internal buffers

Buffers mapped with pgprot_writecombined() are not correctly
flushed. This triggers issues on VPU access using random
memory content such as MMU translation faults, invalid context
descriptors being fetched and can lead to VPU FW crashes.

Fixes: 647371a6609d ("accel/ivpu: Add GEM buffer object management")
Cc: [email protected] # 6.3+
Signed-off-by: Karol Wachowski <[email protected]>
Reviewed-by: Stanislaw Gruszka <[email protected]>
Signed-off-by: Stanislaw Gruszka <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]