Git Repo - linux.git/log
17 months agoMerge branch 'switch-dsa-to-inclusive-terminology'
Jakub Kicinski [Tue, 24 Oct 2023 20:08:15 +0000 (13:08 -0700)]
Merge branch 'switch-dsa-to-inclusive-terminology'

Florian Fainelli says:

====================
Switch DSA to inclusive terminology

One of the action items following Netconf'23 is to switch subsystems to
use inclusive terminology. DSA has been making extensive use of the
"master" and "slave" words which are now replaced by "conduit" and
"user" respectively.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agonet: dsa: Rename IFLA_DSA_MASTER to IFLA_DSA_CONDUIT
Florian Fainelli [Mon, 23 Oct 2023 18:17:29 +0000 (11:17 -0700)]
net: dsa: Rename IFLA_DSA_MASTER to IFLA_DSA_CONDUIT

This preserves the existing IFLA_DSA_MASTER which is part of the uAPI
and creates an alias named IFLA_DSA_CONDUIT.

Reviewed-by: Andrew Lunn <[email protected]>
Reviewed-by: Vladimir Oltean <[email protected]>
Signed-off-by: Florian Fainelli <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agonet: dsa: Use conduit and user terms
Florian Fainelli [Mon, 23 Oct 2023 18:17:28 +0000 (11:17 -0700)]
net: dsa: Use conduit and user terms

Use more inclusive terms throughout the DSA subsystem by moving away
from "master" which is replaced by "conduit" and "slave" which is
replaced by "user". No functional changes.

Acked-by: Rob Herring <[email protected]>
Acked-by: Stephen Hemminger <[email protected]>
Reviewed-by: Vladimir Oltean <[email protected]>
Signed-off-by: Florian Fainelli <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agotsnep: Fix tsnep_request_irq() format-overflow warning
Gerhard Engleder [Mon, 23 Oct 2023 18:38:56 +0000 (20:38 +0200)]
tsnep: Fix tsnep_request_irq() format-overflow warning

Compiler warns about a possible format-overflow in tsnep_request_irq():
drivers/net/ethernet/engleder/tsnep_main.c:884:55: warning: 'sprintf' may write a terminating nul past the end of the destination [-Wformat-overflow=]
                         sprintf(queue->name, "%s-rx-%d", name,
                                                       ^
drivers/net/ethernet/engleder/tsnep_main.c:881:55: warning: 'sprintf' may write a terminating nul past the end of the destination [-Wformat-overflow=]
                         sprintf(queue->name, "%s-tx-%d", name,
                                                       ^
drivers/net/ethernet/engleder/tsnep_main.c:878:49: warning: '-txrx-' directive writing 6 bytes into a region of size between 5 and 25 [-Wformat-overflow=]
                         sprintf(queue->name, "%s-txrx-%d", name,
                                                 ^~~~~~

Actually, overflow cannot happen. The name is limited to IFNAMSIZ, because
netdev_name() is called during ndo_open(), and queue_index is a single
character, because fewer than 10 queues are supported.

Fix warning with snprintf(). Additionally increase buffer to 32 bytes,
because those 7 additional bytes were unused anyway.
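
For illustration, the bounded-write pattern described above looks roughly
like this (a sketch based on the warning output and the commit text, not
the driver's exact code; 'queue_index' is the per-queue index mentioned
above and queue->name is the enlarged 32-byte buffer):

        snprintf(queue->name, sizeof(queue->name), "%s-txrx-%d", name,
                 queue_index);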

Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Gerhard Engleder <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agoMerge branch 'net-deduplicate-netdev-name-allocation'
Jakub Kicinski [Tue, 24 Oct 2023 20:03:00 +0000 (13:03 -0700)]
Merge branch 'net-deduplicate-netdev-name-allocation'

Jakub Kicinski says:

====================
net: deduplicate netdev name allocation

After recent fixes we have even more duplicated code in netdev name
allocation helpers. There are two complications in this code.
First, __dev_alloc_name() clobbers its output arg even if allocation
fails, forcing callers to do extra copies. Second, as our experience in
commit 55a5ec9b7710 ("Revert "net: core: dev_get_valid_name is now the same as dev_alloc_name_ns"") and
commit 029b6d140550 ("Revert "net: core: maybe return -EEXIST in __dev_alloc_name"")
taught us, user space is very sensitive to the exact error codes.

Align the callers of __dev_alloc_name(), and remove some of its
complexity.

v1: https://lore.kernel.org/all/20231020011856.3244410[email protected]/
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agonet: remove else after return in dev_prep_valid_name()
Jakub Kicinski [Mon, 23 Oct 2023 15:23:46 +0000 (08:23 -0700)]
net: remove else after return in dev_prep_valid_name()

Remove unnecessary else clauses after return.
I copied this if / else construct from somewhere;
it makes the code harder to read.
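
For illustration, the pattern being removed is the generic one below
(cond/do_a/do_b are placeholders, not the actual dev_prep_valid_name()
code):

        /* before: the else is redundant because the if branch returns */
        if (cond)
                return do_a();
        else
                return do_b();

        /* after: flat control flow, easier to read */
        if (cond)
                return do_a();
        return do_b();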

Reviewed-by: Jiri Pirko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agonet: remove dev_valid_name() check from __dev_alloc_name()
Jakub Kicinski [Mon, 23 Oct 2023 15:23:45 +0000 (08:23 -0700)]
net: remove dev_valid_name() check from __dev_alloc_name()

__dev_alloc_name() is only called by dev_prep_valid_name(),
which already checks that name is valid.

Reviewed-by: Jiri Pirko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agonet: trust the bitmap in __dev_alloc_name()
Jakub Kicinski [Mon, 23 Oct 2023 15:23:44 +0000 (08:23 -0700)]
net: trust the bitmap in __dev_alloc_name()

Prior to restructuring, __dev_alloc_name() handled both printf
and non-printf names. In a clever attempt at code reuse it
always printed the name into a buffer and checked whether the
result was a duplicate.

Trust the bitmap, and return an error if it's full.

This shrinks the possible ID space by one from 32K to 32K - 1,
as previously the max value would have been tried as a valid ID.
It seems very unlikely that anyone would care as we heard
no requests to increase the max beyond 32k.
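
A minimal sketch of the resulting allocation step, using the bitmap
helpers the function is built around (illustrative, not the exact
upstream code):

        i = find_first_zero_bit(inuse, max_netdevices);
        if (i == max_netdevices)
                return -ENFILE;         /* bitmap full: no free ID left */

        /* trust the bitmap: no need to re-check the name for duplicates */
        snprintf(buf, IFNAMSIZ, name, i);
        return i;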

Reviewed-by: Jiri Pirko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agonet: reduce indentation of __dev_alloc_name()
Jakub Kicinski [Mon, 23 Oct 2023 15:23:43 +0000 (08:23 -0700)]
net: reduce indentation of __dev_alloc_name()

All callers of __dev_alloc_name() go thru dev_prep_valid_name(),
which handles the non-printf case. Focus __dev_alloc_name() on
the sprintf case and remove one indentation level.

Minor functional change of returning -EINVAL if % is not found,
which should now never happen.

Reviewed-by: Jiri Pirko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agonet: make dev_alloc_name() call dev_prep_valid_name()
Jakub Kicinski [Mon, 23 Oct 2023 15:23:42 +0000 (08:23 -0700)]
net: make dev_alloc_name() call dev_prep_valid_name()

__dev_alloc_name() handles both the sprintf and non-sprintf
target names. This complicates the code.

dev_prep_valid_name() already handles the non-sprintf case
before calling __dev_alloc_name(); make the only other caller
also go thru dev_prep_valid_name(). This way we can drop
the non-sprintf handling in __dev_alloc_name() in one of
the next changes.

commit 55a5ec9b7710 ("Revert "net: core: dev_get_valid_name is now the same as dev_alloc_name_ns"") and
commit 029b6d140550 ("Revert "net: core: maybe return -EEXIST in __dev_alloc_name"")
tell us that we can't start returning -EEXIST from dev_alloc_name()
on name duplicates. Bite the bullet and pass the expected errno to
dev_prep_valid_name().

dev_prep_valid_name() must now propagate out the allocated id
for printf names.

Reviewed-by: Jiri Pirko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agonet: don't use input buffer of __dev_alloc_name() as a scratch space
Jakub Kicinski [Mon, 23 Oct 2023 15:23:41 +0000 (08:23 -0700)]
net: don't use input buffer of __dev_alloc_name() as a scratch space

Callers of __dev_alloc_name() want to pass dev->name as
the output buffer. Make __dev_alloc_name() not clobber
that buffer on failure, and remove the workarounds
in callers.

dev_alloc_name_ns() is now completely unnecessary.

The extra strscpy() added here will be gone by the end
of the patch series.
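
The non-clobbering pattern described above looks roughly like this
(a sketch; __dev_alloc_name() here keeps its existing "write the result
into a caller-provided buffer" contract):

        char buf[IFNAMSIZ];
        int ret;

        ret = __dev_alloc_name(net, name, buf);    /* scratch buffer, not dev->name */
        if (ret < 0)
                return ret;                        /* dev->name is left untouched */
        strscpy(dev->name, buf, IFNAMSIZ);         /* the extra copy mentioned above */
        return ret;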

Reviewed-by: Jiri Pirko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agoMerge branch 'mptcp-convert-netlink-code-to-use-yaml-spec'
Jakub Kicinski [Tue, 24 Oct 2023 20:00:33 +0000 (13:00 -0700)]
Merge branch 'mptcp-convert-netlink-code-to-use-yaml-spec'

Mat Martineau says:

====================
mptcp: convert Netlink code to use YAML spec

This series from Davide converts most of the MPTCP Netlink interface
(plus uAPI bits) to use sources generated by YNL using a YAML spec file.

This new YAML file is useful to validate the API and to generate a good
documentation page.

Patch 1 modifies YNL spec to support "uns-admin-perm" for genetlink
legacy.

Patch 2 adds support for validating exact length of netlink attrs.

Patch 3 converts Netlink structures from small_ops to ops to prepare the
switch to YAML.

Patch 4 adds the Netlink YAML spec for MPTCP.

Patch 5 adds and uses a new header file generated from the new YAML
spec.

Patch 6 renames some handlers to match the ones generated from the YAML
spec.

Patch 7 adds and uses Netlink policies automatically generated from the
YAML spec.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agonet: mptcp: use policy generated by YAML spec
Davide Caratti [Mon, 23 Oct 2023 18:17:11 +0000 (11:17 -0700)]
net: mptcp: use policy generated by YAML spec

generated with:

 $ ./tools/net/ynl/ynl-gen-c.py --mode kernel \
 > --spec Documentation/netlink/specs/mptcp.yaml --source \
 > -o net/mptcp/mptcp_pm_gen.c
 $ ./tools/net/ynl/ynl-gen-c.py --mode kernel \
 > --spec Documentation/netlink/specs/mptcp.yaml --header \
 > -o net/mptcp/mptcp_pm_gen.h

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/340
Acked-by: Paolo Abeni <[email protected]>
Signed-off-by: Davide Caratti <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agonet: mptcp: rename netlink handlers to mptcp_pm_nl_<blah>_{doit,dumpit}
Davide Caratti [Mon, 23 Oct 2023 18:17:10 +0000 (11:17 -0700)]
net: mptcp: rename netlink handlers to mptcp_pm_nl_<blah>_{doit,dumpit}

so that they will match names generated from YAML spec.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/340
Suggested-by: Paolo Abeni <[email protected]>
Acked-by: Paolo Abeni <[email protected]>
Signed-off-by: Davide Caratti <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agouapi: mptcp: use header file generated from YAML spec
Davide Caratti [Mon, 23 Oct 2023 18:17:09 +0000 (11:17 -0700)]
uapi: mptcp: use header file generated from YAML spec

generated with:

 $ ./tools/net/ynl/ynl-gen-c.py --mode uapi \
 > --spec Documentation/netlink/specs/mptcp.yaml \
 > --header -o include/uapi/linux/mptcp_pm.h

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/340
Acked-by: Paolo Abeni <[email protected]>
Signed-off-by: Davide Caratti <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agoDocumentation: netlink: add a YAML spec for mptcp
Davide Caratti [Mon, 23 Oct 2023 18:17:08 +0000 (11:17 -0700)]
Documentation: netlink: add a YAML spec for mptcp

It describes most of the current netlink interface (uAPI definitions,
doit/dumpit operations and attributes).

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/340
Acked-by: Paolo Abeni <[email protected]>
Signed-off-by: Davide Caratti <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agonet: mptcp: convert netlink from small_ops to ops
Davide Caratti [Mon, 23 Oct 2023 18:17:07 +0000 (11:17 -0700)]
net: mptcp: convert netlink from small_ops to ops

In the current MPTCP control plane, all operations use a netlink
attribute of the same type, "MPTCP_PM_ATTR". However, add/del/get/flush
operations only parse the first element in the message, the one that
describes MPTCP endpoints (it was named MPTCP_PM_ATTR_ADDR and mostly
used in ADD_ADDR operations; the similarity of "attr", "addr" and "add"
probably causes some confusion to human readers).
Convert MPTCP from 'small_ops' to 'ops', thus allowing different
attributes for each single operation, which hopefully makes all this
clearer to human readers.

- use a separate attribute set for add/del/get/flush address operation,
  binary compatible with the existing one, to store the endpoint address.
  MPTCP_PM_ENDPOINT_ADDR is added to the uAPI (with the same value as
  MPTCP_PM_ATTR_ADDR) for these operations.
- convert mptcp_pm_ops[] and add policy files accordingly.

This prepares the MPTCP control plane to be described as a YAML spec.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/340
Acked-by: Paolo Abeni <[email protected]>
Signed-off-by: Davide Caratti <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agotools: ynl-gen: add support for exact-len validation
Davide Caratti [Mon, 23 Oct 2023 18:17:06 +0000 (11:17 -0700)]
tools: ynl-gen: add support for exact-len validation

add support for 'exact-len' validation on netlink attributes.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/340
Acked-by: Matthieu Baerts <[email protected]>
Signed-off-by: Davide Caratti <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agotools: ynl: add uns-admin-perm to genetlink legacy
Davide Caratti [Mon, 23 Oct 2023 18:17:05 +0000 (11:17 -0700)]
tools: ynl: add uns-admin-perm to genetlink legacy

this flag maps to GENL_UNS_ADMIN_PERM and will be used by future specs.

Acked-by: Paolo Abeni <[email protected]>
Signed-off-by: Davide Caratti <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agoMerge tag 'mm-hotfixes-stable-2023-10-24-09-40' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Tue, 24 Oct 2023 19:52:16 +0000 (09:52 -1000)]
Merge tag 'mm-hotfixes-stable-2023-10-24-09-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "20 hotfixes. 12 are cc:stable and the remainder address post-6.5
  issues or aren't considered necessary for earlier kernel versions"

* tag 'mm-hotfixes-stable-2023-10-24-09-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  maple_tree: add GFP_KERNEL to allocations in mas_expected_entries()
  selftests/mm: include mman header to access MREMAP_DONTUNMAP identifier
  mailmap: correct email aliasing for Oleksij Rempel
  mailmap: map Bartosz's old address to the current one
  mm/damon/sysfs: check DAMOS regions update progress from before_terminate()
  MAINTAINERS: Ondrej has moved
  kasan: disable kasan_non_canonical_hook() for HW tags
  kasan: print the original fault addr when access invalid shadow
  hugetlbfs: close race between MADV_DONTNEED and page fault
  hugetlbfs: extend hugetlb_vma_lock to private VMAs
  hugetlbfs: clear resv_map pointer if mmap fails
  mm: zswap: fix pool refcount bug around shrink_worker()
  mm/migrate: fix do_pages_move for compat pointers
  riscv: fix set_huge_pte_at() for NAPOT mappings when a swap entry is set
  riscv: handle VM_FAULT_[HWPOISON|HWPOISON_LARGE] faults instead of panicking
  mmap: fix error paths with dup_anon_vma()
  mmap: fix vma_iterator in error path of vma_merge()
  mm: fix vm_brk_flags() to not bail out while holding lock
  mm/mempolicy: fix set_mempolicy_home_node() previous VMA pointer
  mm/page_alloc: correct start page when guard page debug is enabled

17 months agonetfilter: nf_tables: Carry reset boolean in nft_set_dump_ctx
Phil Sutter [Tue, 24 Oct 2023 13:10:40 +0000 (15:10 +0200)]
netfilter: nf_tables: Carry reset boolean in nft_set_dump_ctx

Relieve the dump callback from having to check nlmsg_type upon each
call. Prep work for set element reset locking.

Signed-off-by: Phil Sutter <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
17 months agobpf: Improve JEQ/JNE branch taken logic
Andrii Nakryiko [Sun, 22 Oct 2023 20:57:37 +0000 (13:57 -0700)]
bpf: Improve JEQ/JNE branch taken logic

When determining if an if/else branch will always or never be taken, use
signed range knowledge in addition to currently used unsigned range knowledge.
If either signed or unsigned range suggests that condition is always/never
taken, return corresponding branch_taken verdict.

Current use of unsigned range for this seems arbitrary and unnecessarily
incomplete. It is possible for *signed* operations to be performed on a
register, which could "invalidate" the unsigned range for that register. In
such a case branch_taken will be artificially useless, even if we can still
tell that some constant is outside of the register's value range based on its signed
bounds.

veristat-based validation shows zero differences across selftests, Cilium,
and Meta-internal BPF object files.

Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Shung-Hsi Yu <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
17 months agobpf: Fold smp_mb__before_atomic() into atomic_set_release()
Paul E. McKenney [Wed, 18 Oct 2023 22:28:32 +0000 (15:28 -0700)]
bpf: Fold smp_mb__before_atomic() into atomic_set_release()

The bpf_user_ringbuf_drain() BPF_CALL function uses an atomic_set()
immediately preceded by smp_mb__before_atomic() so as to order storing
of ring-buffer consumer and producer positions prior to the atomic_set()
call's clearing of the ->busy flag, as follows:

        smp_mb__before_atomic();
        atomic_set(&rb->busy, 0);

Although this works given current architectures and implementations, and
given that this only needs to order prior writes against a later write,
it does so by accident: smp_mb__before_atomic() is only guaranteed to
work with read-modify-write atomic operations, and not at all with
things like atomic_set() and atomic_read().

Note especially that smp_mb__before_atomic() will not, repeat *not*,
order the prior write to "a" before the subsequent non-read-modify-write
atomic read from "b", even on strongly ordered systems such as x86:

        WRITE_ONCE(a, 1);
        smp_mb__before_atomic();
        r1 = atomic_read(&b);

Therefore, replace the smp_mb__before_atomic() and atomic_set() with
atomic_set_release() as follows:

        atomic_set_release(&rb->busy, 0);

This is no slower (and sometimes is faster) than the original, and also
provides a formal guarantee of ordering that the original lacks.

Signed-off-by: Paul E. McKenney <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: David Vernet <[email protected]>
Link: https://lore.kernel.org/bpf/ec86d38e-cfb4-44aa-8fdb-6c925922d93c@paulmck-laptop
17 months agobpf: Fix unnecessary -EBUSY from htab_lock_bucket
Song Liu [Thu, 12 Oct 2023 05:57:41 +0000 (22:57 -0700)]
bpf: Fix unnecessary -EBUSY from htab_lock_bucket

htab_lock_bucket uses the following logic to avoid recursion:

1. preempt_disable();
2. check percpu counter htab->map_locked[hash] for recursion;
   2.1. if map_locked[hash] is already taken, return -EBUSY;
3. raw_spin_lock_irqsave();

However, if an IRQ hits between 2 and 3, BPF programs attached to the IRQ
logic will not be able to access the same hash of the hashtab and will
get -EBUSY.

This -EBUSY is not really necessary. Fix it by disabling IRQ before
checking map_locked:

1. preempt_disable();
2. local_irq_save();
3. check percpu counter htab->map_locked[hash] for recursion;
   3.1. if map_locked[hash] is already taken, return -EBUSY;
4. raw_spin_lock().

Similarly, use raw_spin_unlock() and local_irq_restore() in
htab_unlock_bucket().
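
A sketch of the resulting lock sequence (field and helper names follow
the description above; not necessarily the exact kernel code):

        static int htab_lock_bucket(const struct bpf_htab *htab, struct bucket *b,
                                    u32 hash, unsigned long *pflags)
        {
                unsigned long flags;

                preempt_disable();
                local_irq_save(flags);          /* block IRQs before the recursion check */
                if (__this_cpu_inc_return(*(htab->map_locked[hash])) != 1) {
                        __this_cpu_dec(*(htab->map_locked[hash]));
                        local_irq_restore(flags);
                        preempt_enable();
                        return -EBUSY;          /* genuine recursion on this bucket */
                }
                raw_spin_lock(&b->raw_lock);    /* plain spin lock, IRQs already off */
                *pflags = flags;
                return 0;
        }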

Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked")
Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Link: https://lore.kernel.org/bpf/[email protected]
17 months agonetfilter: nf_tables: set->ops->insert returns opaque set element in case of EEXIST
Pablo Neira Ayuso [Wed, 18 Oct 2023 20:23:35 +0000 (22:23 +0200)]
netfilter: nf_tables: set->ops->insert returns opaque set element in case of EEXIST

Return struct nft_elem_priv instead of struct nft_set_ext for
consistency with ("netfilter: nf_tables: expose opaque set element as
struct nft_elem_priv") and to prepare the introduction of element
timeout updates from control path.

Signed-off-by: Pablo Neira Ayuso <[email protected]>
17 months agonetfilter: nf_tables: shrink memory consumption of set elements
Pablo Neira Ayuso [Mon, 16 Oct 2023 12:29:27 +0000 (14:29 +0200)]
netfilter: nf_tables: shrink memory consumption of set elements

Instead of copying struct nft_set_elem into struct nft_trans_elem, store
the pointer to the opaque set element object in the transaction. Adapt
set backend API (and set backend implementations) to take the pointer to
opaque set element representation whenever required.

This patch deconstifies the .remove() and .activate() set backend API
since these modify the opaque set element object. It also constifies
nft_set_elem_ext(), which provides access to the nft_set_ext struct
without updating the object.

According to pahole on x86_64, this patch shrinks struct nft_trans_elem
size from 216 to 24 bytes.

This patch also reduces stack memory consumption by removing the
template struct nft_set_elem object, using the opaque set element object
instead such as from the set iterator API, catchall elements and the get
element command.

Signed-off-by: Pablo Neira Ayuso <[email protected]>
17 months agonetfilter: nf_tables: expose opaque set element as struct nft_elem_priv
Pablo Neira Ayuso [Wed, 18 Oct 2023 20:23:07 +0000 (22:23 +0200)]
netfilter: nf_tables: expose opaque set element as struct nft_elem_priv

Add placeholder structure and place it at the beginning of each struct
nft_*_elem for each existing set backend, instead of exposing elements
as void type to the frontend which defeats compiler type checks. Use
this pointer to this new type to replace void *.

This patch updates the following set backend API to use this new struct
nft_elem_priv placeholder structure:

- update
- deactivate
- flush
- get

as well as the following helper functions:

- nft_set_elem_ext()
- nft_set_elem_init()
- nft_set_elem_destroy()
- nf_tables_set_elem_destroy()

This patch adds nft_elem_priv_cast() to cast struct nft_elem_priv to
native element representation from the corresponding set backend.
BUILD_BUG_ON() makes sure this .priv placeholder is always at the top
of the opaque set element representation.
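
A sketch of the layout and cast described above (the rbtree element is
used as an example backend; everything past .priv is illustrative):

        struct nft_elem_priv { };                       /* opaque placeholder type */

        struct nft_rbtree_elem {
                struct nft_elem_priv    priv;           /* must stay the first member */
                struct rb_node          node;
                struct nft_set_ext      ext;
        };

        static inline void *nft_elem_priv_cast(const struct nft_elem_priv *priv)
        {
                return (void *)priv;                    /* backend recovers its native type */
        }

        /* a BUILD_BUG_ON() in the real code checks offsetof(..., priv) == 0 */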

Suggested-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
17 months agonetfilter: nf_tables: set backend .flush always succeeds
Pablo Neira Ayuso [Wed, 18 Oct 2023 20:20:23 +0000 (22:20 +0200)]
netfilter: nf_tables: set backend .flush always succeeds

.flush is always successful since it results from iterating over the
set elements to mark each element as inactive in the next
generation.

Signed-off-by: Pablo Neira Ayuso <[email protected]>
17 months agonetfilter: nft_set_pipapo: no need to call pipapo_deactivate() from flush
Pablo Neira Ayuso [Wed, 18 Oct 2023 20:20:10 +0000 (22:20 +0200)]
netfilter: nft_set_pipapo: no need to call pipapo_deactivate() from flush

Use the element object that is already offered instead.

Signed-off-by: Pablo Neira Ayuso <[email protected]>
17 months agonetfilter: nf_tables: Carry reset boolean in nft_obj_dump_ctx
Phil Sutter [Fri, 20 Oct 2023 17:34:33 +0000 (19:34 +0200)]
netfilter: nf_tables: Carry reset boolean in nft_obj_dump_ctx

Relieve the dump callback from having to inspect nlmsg_type upon each
call, just do it once at start of the dump.

Signed-off-by: Phil Sutter <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
17 months agonetfilter: nf_tables: nft_obj_filter fits into cb->ctx
Phil Sutter [Fri, 20 Oct 2023 17:34:32 +0000 (19:34 +0200)]
netfilter: nf_tables: nft_obj_filter fits into cb->ctx

No need to allocate it if one may just use struct netlink_callback's
scratch area for it.

Signed-off-by: Phil Sutter <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
17 months agonetfilter: nf_tables: Carry s_idx in nft_obj_dump_ctx
Phil Sutter [Fri, 20 Oct 2023 17:34:31 +0000 (19:34 +0200)]
netfilter: nf_tables: Carry s_idx in nft_obj_dump_ctx

Prep work for moving the context into struct netlink_callback scratch
area.

Signed-off-by: Phil Sutter <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
17 months agonetfilter: nf_tables: A better name for nft_obj_filter
Phil Sutter [Fri, 20 Oct 2023 17:34:30 +0000 (19:34 +0200)]
netfilter: nf_tables: A better name for nft_obj_filter

Name it for what it is supposed to become, a real nft_obj_dump_ctx. No
functional change intended.

Signed-off-by: Phil Sutter <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
17 months agonetfilter: nf_tables: Unconditionally allocate nft_obj_filter
Phil Sutter [Fri, 20 Oct 2023 17:34:29 +0000 (19:34 +0200)]
netfilter: nf_tables: Unconditionally allocate nft_obj_filter

Prep work for moving the filter into struct netlink_callback's scratch
area.

Signed-off-by: Phil Sutter <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
17 months agonetfilter: nf_tables: Drop pointless memset in nf_tables_dump_obj
Phil Sutter [Fri, 20 Oct 2023 17:34:28 +0000 (19:34 +0200)]
netfilter: nf_tables: Drop pointless memset in nf_tables_dump_obj

The code does not make use of cb->args fields past the first one, no
need to zero them.

Signed-off-by: Phil Sutter <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
17 months agonetfilter: conntrack: switch connlabels to atomic_t
Florian Westphal [Fri, 20 Oct 2023 12:38:15 +0000 (14:38 +0200)]
netfilter: conntrack: switch connlabels to atomic_t

The spinlock dates back to the days when connlabels did not have
a fixed size and reallocation had to be supported.

Remove it.  This change also allows calling the helpers from
softirq or timer context without deadlocks.

Also add WARN()s to catch refcounting imbalances.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
17 months agobr_netfilter: use single forward hook for ip and arp
Florian Westphal [Fri, 20 Oct 2023 11:14:25 +0000 (13:14 +0200)]
br_netfilter: use single forward hook for ip and arp

br_netfilter registers two forward hooks, one for ip and one for arp.

Just use a common function for both and then call the arp/ip helper
as needed.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
17 months agonetfilter: nf_tables: Add locking for NFT_MSG_GETRULE_RESET requests
Phil Sutter [Thu, 19 Oct 2023 14:03:36 +0000 (16:03 +0200)]
netfilter: nf_tables: Add locking for NFT_MSG_GETRULE_RESET requests

Rule reset is not concurrency-safe per-se, so multiple CPUs may reset
the same rule at the same time. At least counter and quota expressions
will suffer from value underruns in this case.

Prevent this by introducing dedicated locking callbacks for nfnetlink
and the asynchronous dump handling to serialize access.

Signed-off-by: Phil Sutter <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
17 months agonetfilter: nf_tables: Introduce nf_tables_getrule_single()
Phil Sutter [Thu, 19 Oct 2023 14:03:35 +0000 (16:03 +0200)]
netfilter: nf_tables: Introduce nf_tables_getrule_single()

Outsource the reply skb preparation for non-dump getrule requests into a
distinct function. Prep work for rule reset locking.

Signed-off-by: Phil Sutter <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
17 months agonetfilter: nf_tables: Open-code audit log call in nf_tables_getrule()
Phil Sutter [Thu, 19 Oct 2023 14:03:34 +0000 (16:03 +0200)]
netfilter: nf_tables: Open-code audit log call in nf_tables_getrule()

The table lookup will be dropped from that function, so remove that
dependency from audit logging code. Using whatever is in
nla[NFTA_RULE_TABLE] is sufficient as long as the previous rule info
filling succeeded.

Signed-off-by: Phil Sutter <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
17 months agonetfilter: nft_set_rbtree: prefer sync gc to async worker
Florian Westphal [Fri, 13 Oct 2023 12:18:16 +0000 (14:18 +0200)]
netfilter: nft_set_rbtree: prefer sync gc to async worker

There is no need for asynchronous garbage collection, rbtree inserts
can only happen from the netlink control plane.

We already perform on-demand gc on insertion, in the area of the
tree where the insertion takes place, but we don't do a full tree
walk there for performance reasons.

Do a full gc walk at the end of the transaction instead and
remove the async worker.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
17 months agonetfilter: nft_set_rbtree: rename gc deactivate+erase function
Florian Westphal [Fri, 13 Oct 2023 12:18:15 +0000 (14:18 +0200)]
netfilter: nft_set_rbtree: rename gc deactivate+erase function

The next patch adds a caller that doesn't hold the priv->write lock and
will need a similar function.

Rename the existing function to make it clear that it can only
be used for opportunistic gc during insertion.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
17 months agonet: sched: sch_qfq: Use non-work-conserving warning handler
Liu Jian [Mon, 23 Oct 2023 06:47:29 +0000 (14:47 +0800)]
net: sched: sch_qfq: Use non-work-conserving warning handler

A helper function for printing non-work-conserving alarms was added in
commit b00355db3f88 ("pkt_sched: sch_hfsc: sch_htb: Add non-work-conserving
warning handler."). In this commit, use qdisc_warn_nonwc() instead of
WARN_ONCE() to handle the non-work-conserving warning in the qfq Qdisc.
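
The helper takes a text label and the qdisc itself, so the call is a
one-liner of roughly this shape (the surrounding condition is
illustrative):

        if (!cl) {                              /* nothing eligible to dequeue */
                qdisc_warn_nonwc("qfq", sch);   /* warns once per qdisc instance */
                return NULL;
        }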

Signed-off-by: Liu Jian <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
17 months agoMerge branch 'gtp-tunnel-driver-fixes'
Paolo Abeni [Tue, 24 Oct 2023 10:02:03 +0000 (12:02 +0200)]
Merge branch 'gtp-tunnel-driver-fixes'

Pablo Neira Ayuso says:

====================
GTP tunnel driver fixes

The following patchset contains two fixes for the GTP tunnel driver:

1) Incorrect GTPA_MAX definition in UAPI headers. This is updating an
   existing UAPI definition but for a good reason, this is certainly
   broken. Similar fixes for incorrect _MAX definition in netlink
   headers were applied in the past too.

2) Fix GTP driver PMTU with GRO packets, add missing call to
   skb_gso_validate_network_len() to handle GRO packets.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
17 months agogtp: fix fragmentation needed check with gso
Pablo Neira Ayuso [Sun, 22 Oct 2023 20:25:18 +0000 (22:25 +0200)]
gtp: fix fragmentation needed check with gso

Call skb_gso_validate_network_len() to check if the packet is over the PMTU.
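
The usual shape of such a check in tunnel drivers is roughly the
following (a sketch of the pattern, not the exact gtp change; the error
label is hypothetical):

        if (iph->frag_off & htons(IP_DF) &&
            ((!skb_is_gso(skb) && skb->len > mtu) ||
             (skb_is_gso(skb) && !skb_gso_validate_network_len(skb, mtu)))) {
                /* the packet (or its GSO segments) would exceed the path MTU */
                icmp_ndo_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
                goto err_rt;
        }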

Fixes: 459aa660eb1d ("gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)")
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
17 months agogtp: uapi: fix GTPA_MAX
Pablo Neira Ayuso [Sun, 22 Oct 2023 20:25:17 +0000 (22:25 +0200)]
gtp: uapi: fix GTPA_MAX

Subtract one from __GTPA_MAX, otherwise GTPA_MAX is off by 2.
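
The netlink convention being restored is (illustrative):

        /* __GTPA_MAX is one past the last valid attribute, so the public
         * macro must subtract one; the old definition left it off by 2 */
        #define GTPA_MAX (__GTPA_MAX - 1)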

Fixes: 459aa660eb1d ("gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)")
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
17 months agoxsk: Avoid starving the xsk further down the list
Albert Huang [Mon, 23 Oct 2023 12:57:31 +0000 (20:57 +0800)]
xsk: Avoid starving the xsk further down the list

In the previous implementation, when multiple xsk sockets were
associated with a single xsk_buff_pool, a situation could arise
where the xsk_tx_list maintained data at the front for one xsk
socket while starving the xsk sockets at the back of the list.
This could result in issues such as the inability to transmit packets,
increased latency, and jitter. To address this problem, we introduce
a new variable called tx_budget_spent, which limits each xsk to transmit
a maximum of MAX_PER_SOCKET_BUDGET tx descriptors. This allocation ensures
equitable opportunities for subsequent xsk sockets to send tx descriptors.
The value of MAX_PER_SOCKET_BUDGET is set to 32.
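
A sketch of the budget check described above (names follow the commit
text; the loop context is simplified):

        #define MAX_PER_SOCKET_BUDGET 32

                /* inside the walk over the pool's xsk_tx_list */
                if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET)
                        continue;               /* give the next socket a chance */
                xs->tx_budget_spent++;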

Signed-off-by: Albert Huang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Magnus Karlsson <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
17 months agonet: ethernet: davinci_emac: Use MAC Address from Device Tree
Adam Ford [Sun, 22 Oct 2023 15:19:11 +0000 (10:19 -0500)]
net: ethernet: davinci_emac: Use MAC Address from Device Tree

Currently there is a device tree entry called "local-mac-address"
which can be filled by the bootloader or manually set. This is
useful when the user does not want to use the MAC address
programmed into the SoC.

Currently, the davinci_emac reads the MAC from the DT, copies
it from pdata->mac_addr to priv->mac_addr, then blindly overwrites
it by reading from registers in the SoC, and falls back to a
random MAC if it's still not valid.  This completely ignores any
MAC address in the device tree.

In order to use the local-mac-address, check to see if the contents
of priv->mac_addr are valid before falling back to reading from the
SoC when the MAC address is not valid.

Signed-off-by: Adam Ford <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
17 months agoFix NULL pointer dereference in cn_filter()
Anjali Kulkarni [Fri, 20 Oct 2023 23:40:58 +0000 (16:40 -0700)]
Fix NULL pointer dereference in cn_filter()

Check that sk_user_data is not NULL, else return from cn_filter().
We could not reproduce this issue, but Oliver Sang verified that it fixes
the problem in the "Closes" report below.

Fixes: 2aa1f7a1f47c ("connector/cn_proc: Add filtering to fix some bugs")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]/
Signed-off-by: Anjali Kulkarni <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
17 months agosock: Ignore memcg pressure heuristics when raising allocated
Abel Wu [Thu, 19 Oct 2023 12:00:26 +0000 (20:00 +0800)]
sock: Ignore memcg pressure heuristics when raising allocated

Before sockets became aware of net-memcg's memory pressure in
commit e1aab161e013 ("socket: initial cgroup code."), the memory
usage would be allowed to rise if below average, even when under
the protocol's pressure. This provides fairness among the sockets of
the same protocol.

That commit changed this behavior: the heuristic also becomes
effective when only the memcg is under pressure, which makes no sense.
So revert that behavior.

After reverting, __sk_mem_raise_allocated() no longer considers
memcg's pressure. As memcgs are isolated from each other w.r.t.
memory accounting, consuming one's budget won't affect others.
So except for the places where buffer sizes need to be tuned,
allow workloads to use the memory they are provisioned.

Signed-off-by: Abel Wu <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Paolo Abeni <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
17 months agosock: Doc behaviors for pressure heuristics
Abel Wu [Thu, 19 Oct 2023 12:00:25 +0000 (20:00 +0800)]
sock: Doc behaviors for pressure heuristics

There are now two accounting infrastructures for skmem, while the
heuristics in __sk_mem_raise_allocated() were actually introduced
before memcg was born.

Add some comments to clarify whether they can be applied to both
infrastructures or not.

Suggested-by: Shakeel Butt <[email protected]>
Signed-off-by: Abel Wu <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
17 months agosock: Code cleanup on __sk_mem_raise_allocated()
Abel Wu [Thu, 19 Oct 2023 12:00:24 +0000 (20:00 +0800)]
sock: Code cleanup on __sk_mem_raise_allocated()

Code cleanup for both better simplicity and readability.
No functional change intended.

Signed-off-by: Abel Wu <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
17 months agonet: ti: icssg-prueth: Add phys_port_name support
Jan Kiszka [Sun, 22 Oct 2023 08:56:22 +0000 (10:56 +0200)]
net: ti: icssg-prueth: Add phys_port_name support

This helps identify the ports, e.g. in udev rules.

Signed-off-by: Jan Kiszka <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
17 months agonet: microchip: lan743x: improve throughput with rx timestamp config
Vishvambar Panth S [Fri, 20 Oct 2023 18:58:01 +0000 (00:28 +0530)]
net: microchip: lan743x: improve throughput with rx timestamp config

Currently all RX frames are timestamped, which results in a performance
penalty when timestamping is not needed. The default is now being
changed to not timestamp any RX frames (HWTSTAMP_FILTER_NONE), but
support has been added to allow changing the desired RX timestamping
mode via SIOCSHWTSTAMP (HWTSTAMP_FILTER_ALL, which was the previous
setting, and HWTSTAMP_FILTER_PTP_V2_EVENT are now supported). All
settings were tested using the hwstamp_ctl application.
It is also noted that ptp4l, when started, preconfigures the device to
timestamp using HWTSTAMP_FILTER_PTP_V2_EVENT, so this driver continues
to work properly "out of the box".
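
A sketch of the SIOCSHWTSTAMP handling described above; the register
programming is hidden behind a hypothetical helper, not the driver's
real function names:

        switch (config->rx_filter) {
        case HWTSTAMP_FILTER_NONE:                      /* new default */
                lan743x_rx_ts_enable(adapter, false, false);
                break;
        case HWTSTAMP_FILTER_PTP_V2_EVENT:              /* PTP events only */
                lan743x_rx_ts_enable(adapter, true, true);
                break;
        case HWTSTAMP_FILTER_ALL:                       /* previous behaviour */
                lan743x_rx_ts_enable(adapter, true, false);
                break;
        default:
                config->rx_filter = HWTSTAMP_FILTER_NONE;
                return -ERANGE;
        }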

Test setup:  x64 PC with LAN7430 ---> x64 PC as partner

iperf3 with - Timestamp all incoming packets:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-5.05   sec   517 MBytes   859 Mbits/sec    0    sender
[  5]   0.00-5.00   sec   515 MBytes   864 Mbits/sec         receiver

iperf Done.

iperf3 with - Timestamp only PTP packets:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-5.04   sec   563 MBytes   937 Mbits/sec    0    sender
[  5]   0.00-5.00   sec   561 MBytes   941 Mbits/sec         receiver

Signed-off-by: Vishvambar Panth S <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
17 months agoMerge tag 'pull-nfsd-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Tue, 24 Oct 2023 06:40:04 +0000 (20:40 -1000)]
Merge tag 'pull-nfsd-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull nfsd fix from Al Viro:
 "Catch from lock_rename() audit; nfsd_rename() checked that both
  directories belonged to the same filesystem, but only after having
  done lock_rename().

  Trivial fix, tested and acked by nfs folks"

* tag 'pull-nfsd-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  nfsd: lock_rename() needs both directories to live on the same fs

17 months agoMerge branch 'exact-states-comparison-for-iterator-convergence-checks'
Alexei Starovoitov [Tue, 24 Oct 2023 04:49:32 +0000 (21:49 -0700)]
Merge branch 'exact-states-comparison-for-iterator-convergence-checks'

Eduard Zingerman says:

====================
exact states comparison for iterator convergence checks

Iterator convergence logic in is_state_visited() uses states_equal()
for states with a branches counter > 0 to check if an iterator based loop
converges. This is not fully correct because states_equal() relies on the
presence of read and precision marks on registers. These marks are not
guaranteed to be finalized while a state has branches.
Commit message for patch #3 describes a program that exhibits such
behavior.

This patch-set aims to fix iterator convergence logic by adding notion
of exact states comparison. Exact comparison does not rely on the presence
of read or precision marks and thus is more strict.
As explained in the commit message for patch #3, exact comparisons require
the addition of speculative register bounds widening. The end result for
BPF verifier users could be summarized as follows:

(!) After this update verifier would reject programs that conjure an
    imprecise value on the first loop iteration and use it as precise
    on the second (for iterator based loops).

I urge people to at least skim over the commit message for patch #3.

Patches are organized as follows:
- patches #1,2: moving/extracting utility functions;
- patch #3: introduces exact mode for states comparison and adds
  widening heuristic;
- patch #4: adds test-cases that demonstrate why the series is
  necessary;
- patch #5: extends patch #3 with a notion of state loop entries,
  these entries have to be tracked to correctly identify that
  different verifier states belong to the same states loop;
- patch #6: adds a test-case that demonstrates a program
  which requires loop entry tracking for correct verification;
- patch #7: just adds a few debug prints.

The following actions are planned as a followup for this patch-set:
- implementation has to be adapted for callbacks handling logic as a
  part of a fix for [1];
- it is necessary to explore ways to improve widening heuristic to
  handle iters_task_vma test w/o need to insert barrier_var() calls;
- explored states eviction logic on cache miss has to be extended
  to either:
  - allow eviction of checkpoint states -or-
  - be sped up in case if there are many active checkpoints associated
    with the same instruction.

The patch-set is a followup for mailing list discussion [1].

Changelog:
- V2 [3] -> V3:
  - correct check for stack spills in widen_imprecise_scalars(),
    added test case progs/iters.c:widen_spill to check the behavior
    (suggested by Andrii);
  - allow eviction of checkpoint states in is_state_visited() to avoid
    pathological verifier performance when iterator based loop does not
    converge (discussion with Alexei).
- V1 [2] -> V2, applied changes suggested by Alexei offlist:
  - __explored_state() function removed;
  - same_callsites() function is now used in clean_live_states();
  - patches #1,2 are added as preparatory code movement;
  - in process_iter_next_call() a safeguard is added to verify that
    cur_st->parent exists and has expected insn index / call sites.

[1] https://lore.kernel.org/bpf/97a90da09404c65c8e810cf83c94ac703705dc0e[email protected]/
[2] https://lore.kernel.org/bpf/20231021005939[email protected]/
[3] https://lore.kernel.org/bpf/20231022010812[email protected]/
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
17 months agobpf: print full verifier states on infinite loop detection
Eduard Zingerman [Tue, 24 Oct 2023 00:09:17 +0000 (03:09 +0300)]
bpf: print full verifier states on infinite loop detection

Additional logging in is_state_visited(): if infinite loop is detected
print full verifier state for both current and equivalent states.

Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
17 months agoselftests/bpf: test if state loops are detected in a tricky case
Eduard Zingerman [Tue, 24 Oct 2023 00:09:16 +0000 (03:09 +0300)]
selftests/bpf: test if state loops are detected in a tricky case

A convoluted test case for the iterators convergence logic that
demonstrates that states with a branch count equal to 0 might still be
a part of a not completely explored loop.

E.g. consider the following state diagram:

               initial     Here state 'succ' was processed first,
                 |         it was eventually tracked to produce a
                 V         state identical to 'hdr'.
    .---------> hdr        All branches from 'succ' had been explored
    |            |         and thus 'succ' has its .branches == 0.
    |            V
    |    .------...        Suppose states 'cur' and 'succ' correspond
    |    |       |         to the same instruction + callsites.
    |    V       V         In such case it is necessary to check
    |   ...     ...        whether 'succ' and 'cur' are identical.
    |    |       |         If 'succ' and 'cur' are a part of the same loop
    |    V       V         they have to be compared exactly.
    |   succ <- cur
    |    |
    |    V
    |   ...
    |    |
    '----'

Signed-off-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
17 months agobpf: correct loop detection for iterators convergence
Eduard Zingerman [Tue, 24 Oct 2023 00:09:15 +0000 (03:09 +0300)]
bpf: correct loop detection for iterators convergence

It turns out that .branches > 0 in is_state_visited() is not a
sufficient condition to identify if two verifier states form a loop
when iterators convergence is computed. This commit adds logic to
distinguish situations like below:

 (I)            initial       (II)            initial
                  |                             |
                  V                             V
     .---------> hdr                           ..
     |            |                             |
     |            V                             V
     |    .------...                    .------..
     |    |       |                     |       |
     |    V       V                     V       V
     |   ...     ...               .-> hdr     ..
     |    |       |                |    |       |
     |    V       V                |    V       V
     |   succ <- cur               |   succ <- cur
     |    |                        |    |
     |    V                        |    V
     |   ...                       |   ...
     |    |                        |    |
     '----'                        '----'

For both (I) and (II) successor 'succ' of the current state 'cur' was
previously explored and has branches count at 0. However, loop entry
'hdr' corresponding to 'succ' might be a part of current DFS path.
If that is the case 'succ' and 'cur' are members of the same loop
and have to be compared exactly.

Co-developed-by: Andrii Nakryiko <[email protected]>
Co-developed-by: Alexei Starovoitov <[email protected]>
Reviewed-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
17 months agoselftests/bpf: tests with delayed read/precision marks in loop body
Eduard Zingerman [Tue, 24 Oct 2023 00:09:14 +0000 (03:09 +0300)]
selftests/bpf: tests with delayed read/precision marks in loop body

These test cases try to hide read and precision marks from loop
convergence logic: marks would only be assigned on subsequent loop
iterations or after exploring states pushed to env->head stack first.
Without the verifier fix to use exact states comparison logic for
iterator convergence, these tests (except 'triple_continue') would be
erroneously marked as safe.

Signed-off-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
17 months agobpf: exact states comparison for iterator convergence checks
Eduard Zingerman [Tue, 24 Oct 2023 00:09:13 +0000 (03:09 +0300)]
bpf: exact states comparison for iterator convergence checks

Convergence for open coded iterators is computed in is_state_visited()
by examining states with branches count > 1 and using states_equal().
states_equal() computes sub-state relation using read and precision marks.
Read and precision marks are propagated from children states,
thus are not guaranteed to be complete inside a loop when branches
count > 1. This could be demonstrated using the following unsafe program:

     1. r7 = -16
     2. r6 = bpf_get_prandom_u32()
     3. while (bpf_iter_num_next(&fp[-8])) {
     4.   if (r6 != 42) {
     5.     r7 = -32
     6.     r6 = bpf_get_prandom_u32()
     7.     continue
     8.   }
     9.   r0 = r10
    10.   r0 += r7
    11.   r8 = *(u64 *)(r0 + 0)
    12.   r6 = bpf_get_prandom_u32()
    13. }

Here verifier would first visit path 1-3, create a checkpoint at 3
with r7=-16, continue to 4-7,3 with r7=-32.

Because instructions at 9-12 had not been visited yet, the existing
checkpoint at 3 does not have read or precision marks for r7.
Thus states_equal() would return true and verifier would discard
current state, thus unsafe memory access at 11 would not be caught.

This commit fixes this loophole by introducing exact state comparisons
for iterator convergence logic:
- registers are compared using regs_exact() regardless of read or
  precision marks;
- stack slots have to have identical type.

Unfortunately, this is too strict even for simple programs like below:

    i = 0;
    while(iter_next(&it))
      i++;

At each iteration step i++ would produce a new distinct state and
eventually instruction processing limit would be reached.

To avoid such behavior, speculatively forget (widen) the range for
imprecise scalar registers, if those registers were not precise at the
end of the previous iteration and do not match exactly.

This is a conservative heuristic that allows verifying a wide range of
programs, however it precludes verification of programs that conjure
an imprecise value on the first loop iteration and use it as precise
on the second.

Test case iter_task_vma_for_each() presents one of such cases:

        unsigned int seen = 0;
        ...
        bpf_for_each(task_vma, vma, task, 0) {
                if (seen >= 1000)
                        break;
                ...
                seen++;
        }

Here clang generates the following code:

<LBB0_4>:
      24:       r8 = r6                          ; stash current value of
                ... body ...                       'seen'
      29:       r1 = r10
      30:       r1 += -0x8
      31:       call bpf_iter_task_vma_next
      32:       r6 += 0x1                        ; seen++;
      33:       if r0 == 0x0 goto +0x2 <LBB0_6>  ; exit on next() == NULL
      34:       r7 += 0x10
      35:       if r8 < 0x3e7 goto -0xc <LBB0_4> ; loop on seen < 1000

<LBB0_6>:
      ... exit ...

Note that the counter in r6 is copied to r8 and then incremented,
while the conditional jump is done using r8. Because of this, the
precision mark for r6 lags one state behind the precision mark on r8
and the widening logic kicks in.

Adding barrier_var(seen) after the conditional is sufficient to force
clang to use the same register for both the counter and the conditional
jump, as sketched below.
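
In the selftest the workaround looks roughly like this (sketch):

        unsigned int seen = 0;

        bpf_for_each(task_vma, vma, task, 0) {
                if (seen >= 1000)
                        break;
                barrier_var(seen);      /* keep 'seen' in a single register */
                seen++;
        }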

This issue was discussed in the thread [1] which was started by
Andrew Werner <[email protected]> demonstrating a similar bug
in callback functions handling. The callbacks would be addressed
in a followup patch.

[1] https://lore.kernel.org/bpf/97a90da09404c65c8e810cf83c94ac703705dc0e[email protected]/

Co-developed-by: Andrii Nakryiko <[email protected]>
Co-developed-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
17 months agobpf: extract same_callsites() as utility function
Eduard Zingerman [Tue, 24 Oct 2023 00:09:12 +0000 (03:09 +0300)]
bpf: extract same_callsites() as utility function

Extract same_callsites() from clean_live_states() as a utility function.
This function would be used by the next patch in the set.

Signed-off-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
17 months agobpf: move explored_state() closer to the beginning of verifier.c
Eduard Zingerman [Tue, 24 Oct 2023 00:09:11 +0000 (03:09 +0300)]
bpf: move explored_state() closer to the beginning of verifier.c

Subsequent patches will make use of the explored_state() function.
Move it up to avoid adding an unnecessary prototype.

Signed-off-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
17 months agoMerge branch 'introduce-page_pool_alloc-related-api'
Jakub Kicinski [Tue, 24 Oct 2023 02:14:51 +0000 (19:14 -0700)]
Merge branch 'introduce-page_pool_alloc-related-api'

Yunsheng Lin says:

====================
introduce page_pool_alloc() related API

In [1] & [2] & [3], there are use cases for veth and virtio_net
to use frag support in page pool to reduce memory usage, and they
may request different frag sizes depending on the head/tail
room space for xdp_frame/shinfo and mtu/packet size. When the
requested frag size is large enough that a single page can not
be split into more than one frag, using frag support only brings
a performance penalty because of the extra frag count handling
for frag support.

So this patchset provides a page pool API for the driver to
allocate memory with the least memory utilization and performance
penalty when it doesn't know the size of memory it needs
beforehand.

1. https://patchwork.kernel.org/project/netdevbpf/patch/d3ae6bd3537fbce379382ac6a42f67e22f27ece2.1683896626[email protected]/
2. https://patchwork.kernel.org/project/netdevbpf/patch/20230526054621[email protected]/
3. https://github.com/alobakin/linux/tree/iavf-pp-frag
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agonet: veth: use newly added page pool API for veth with xdp
Yunsheng Lin [Fri, 20 Oct 2023 09:59:52 +0000 (17:59 +0800)]
net: veth: use newly added page pool API for veth with xdp

Use page_pool_alloc() API to allocate memory with least
memory utilization and performance penalty.

Signed-off-by: Yunsheng Lin <[email protected]>
CC: Lorenzo Bianconi <[email protected]>
CC: Alexander Duyck <[email protected]>
CC: Liang Chen <[email protected]>
CC: Alexander Lobakin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agopage_pool: update document about fragment API
Yunsheng Lin [Fri, 20 Oct 2023 09:59:51 +0000 (17:59 +0800)]
page_pool: update document about fragment API

As more drivers begin to use the fragment API, update the
document on how a driver author should decide which API to
use.

Signed-off-by: Yunsheng Lin <[email protected]>
CC: Lorenzo Bianconi <[email protected]>
CC: Alexander Duyck <[email protected]>
CC: Liang Chen <[email protected]>
CC: Alexander Lobakin <[email protected]>
CC: Dima Tisnek <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agopage_pool: introduce page_pool_alloc() API
Yunsheng Lin [Fri, 20 Oct 2023 09:59:50 +0000 (17:59 +0800)]
page_pool: introduce page_pool_alloc() API

Currently page pool supports the use cases below:
use case 1: allocate a page without page splitting using the
            page_pool_alloc_pages() API if the driver knows
            that the memory it needs is always bigger than
            half of the page allocated from the page pool.
use case 2: allocate a page frag with page splitting using the
            page_pool_alloc_frag() API if the driver knows
            that the memory it needs is always smaller than
            or equal to half of the page allocated from the
            page pool.

There are emerging use cases [1] & [2] that are a mix of the
above two: the driver does not know the size of the memory it
needs beforehand, so it may end up doing something like the
below to allocate memory with the least memory waste and
performance penalty:

    if (size << 1 > max_size)
            page = page_pool_alloc_pages();
    else
            page = page_pool_alloc_frag();

To avoid every driver open-coding something like the above, add
the page_pool_alloc() API to support this use case, and report
the size of the memory that is actually allocated back to the
driver via '*size' in order to avoid exacerbating the truesize
underestimation problem.
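
A hedged sketch of the shape such a helper could take, built on the
existing page_pool_alloc_pages()/page_pool_alloc_frag() helpers; the
actual implementation in this patch may differ in detail:

    static inline struct page *page_pool_alloc(struct page_pool *pool,
                                               unsigned int *offset,
                                               unsigned int *size, gfp_t gfp)
    {
            unsigned int max_size = PAGE_SIZE << pool->p.order;
            struct page *page;

            if ((*size << 1) > max_size) {
                    /* a page cannot hold two frags of this size: hand out
                     * an unsplit page and report its true size
                     */
                    *size = max_size;
                    *offset = 0;
                    return page_pool_alloc_pages(pool, gfp);
            }

            page = page_pool_alloc_frag(pool, offset, *size, gfp);
            if (unlikely(!page))
                    return NULL;

            /* if the leftover space cannot hold another frag, give it to
             * the caller so truesize is not underestimated
             */
            if (pool->frag_offset + *size > max_size) {
                    *size = max_size - *offset;
                    pool->frag_offset = max_size;
            }

            return page;
    }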

Rename page_pool_free(), which is used in the destroy path, to
__page_pool_destroy() to avoid confusion with the newly added
API.

1. https://lore.kernel.org/all/d3ae6bd3537fbce379382ac6a42f67e22f27ece2.1683896626[email protected]/
2. https://lore.kernel.org/all/20230526054621[email protected]/

Signed-off-by: Yunsheng Lin <[email protected]>
CC: Lorenzo Bianconi <[email protected]>
CC: Alexander Duyck <[email protected]>
CC: Liang Chen <[email protected]>
CC: Alexander Lobakin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agopage_pool: remove PP_FLAG_PAGE_FRAG
Yunsheng Lin [Fri, 20 Oct 2023 09:59:49 +0000 (17:59 +0800)]
page_pool: remove PP_FLAG_PAGE_FRAG

PP_FLAG_PAGE_FRAG is no longer needed now that pp_frag_count
handling is unified and page_pool_alloc_frag() is supported on
32-bit arches with 64-bit DMA, so remove it.

Signed-off-by: Yunsheng Lin <[email protected]>
CC: Lorenzo Bianconi <[email protected]>
CC: Alexander Duyck <[email protected]>
CC: Liang Chen <[email protected]>
CC: Alexander Lobakin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agopage_pool: unify frag_count handling in page_pool_is_last_frag()
Yunsheng Lin [Fri, 20 Oct 2023 09:59:48 +0000 (17:59 +0800)]
page_pool: unify frag_count handling in page_pool_is_last_frag()

Currently, when page_pool_create() is called with the
PP_FLAG_PAGE_FRAG flag, page_pool_alloc_pages() may only be
called under the constraints below:
1. page_pool_fragment_page() needs to be called immediately to
   set up page->pp_frag_count.
2. page_pool_defrag_page() often needs to be called to drain
   page->pp_frag_count once no more users are holding on to
   that page.

These constraints exist in order to allow a page to be split
into multiple fragments.

These constraints add some overhead because of cache line
dirtying/bouncing and the atomic update.

These constraints are unavoidable when a page needs to be split
into more than one fragment, but there are also cases where we
want to avoid them and their overhead, namely when a page cannot
be split because it only holds a single fragment of the size
requested by the user. Depending on the driver, there are three
use cases:
use case 1: allocate a page without page splitting.
use case 2: allocate a page with page splitting.
use case 3: allocate a page with or without page splitting,
            depending on the fragment size.

Currently page pool only provides the page_pool_alloc_pages()
and page_pool_alloc_frag() APIs, which enable use cases 1 and 2
separately, so we cannot combine them to enable use case 3; the
per-page_pool flag PP_FLAG_PAGE_FRAG makes that impossible.

So, in order to allow allocating unsplit pages without the
overhead of split pages, while still allowing split pages, we
need to remove the per-page_pool flag check from
page_pool_is_last_frag(). As best as I can tell, there are two
ways to do that:
1. Add a per-page flag/bit to indicate whether a page is split
   or not, which means we might need to update that flag/bit
   every time the page is recycled, dirtying the cache line of
   'struct page' for use case 1.
2. Unify the page->pp_frag_count handling for both split and
   unsplit pages by assuming every page in the page pool is
   initially split into one big fragment.

Page pool already supports use case 1 without dirtying the cache
line of 'struct page' whenever a page is recycled, so use case 3
has to be supported with minimal overhead and, in particular,
without adding any noticeable overhead for use case 1. Since we
already avoid updating pp_frag_count in page_pool_defrag_page()
for the last fragment user, this patch chooses method 2 and
unifies the pp_frag_count handling to support use case 3, as
sketched below.
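
A hedged illustration of method 2 (not the exact kernel code): with
every pool page treated as one big fragment from the start,
page_pool_is_last_frag() reduces to a single defrag step and no
longer needs the per-page_pool PP_FLAG_PAGE_FRAG check:

    static inline bool page_pool_is_last_frag(struct page *page)
    {
            /* if the defrag step returns 0, we were the last user */
            return page_pool_defrag_page(page, 1) == 0;
    }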

Micro-benchmark testing in [1] shows no noticeable performance
degradation with this patch applied and gives some justification
for unifying the frag_count handling.

1. https://lore.kernel.org/all/bf2591f8-7b3c-4480-bb2c-31dc9da1d6ac@huawei.com/

Signed-off-by: Yunsheng Lin <[email protected]>
CC: Lorenzo Bianconi <[email protected]>
CC: Alexander Duyck <[email protected]>
CC: Liang Chen <[email protected]>
CC: Alexander Lobakin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agoMerge tag 'urgent/nolibc.2023.10.16a' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Tue, 24 Oct 2023 00:19:11 +0000 (14:19 -1000)]
Merge tag 'urgent/nolibc.2023.10.16a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull nolibc fixes from Paul McKenney:

 - tools/nolibc: i386: Fix a stack misalign bug on _start

 - MAINTAINERS: nolibc: update tree location

 - tools/nolibc: mark start_c as weak to avoid linker errors

* tag 'urgent/nolibc.2023.10.16a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  tools/nolibc: mark start_c as weak
  MAINTAINERS: nolibc: update tree location
  tools/nolibc: i386: Fix a stack misalign bug on _start

17 months agoMerge tag 'for-net-next-2023-10-23' of git://git.kernel.org/pub/scm/linux/kernel...
Jakub Kicinski [Mon, 23 Oct 2023 23:44:18 +0000 (16:44 -0700)]
Merge tag 'for-net-next-2023-10-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Luiz Augusto von Dentz says:

====================
bluetooth-next pull request for net-next:

 - Add 0bda:b85b for Fn-Link RTL8852BE
 - ISO: Many fixes for broadcast support
 - Mark bcm4378/bcm4387 as BROKEN_LE_CODED
 - Add support ITTIM PE50-M75C
 - Add RTW8852BE device 13d3:3570
 - Add support for QCA2066
 - Add support for Intel Misty Peak - 8087:0038

* tag 'for-net-next-2023-10-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next:
  Bluetooth: hci_sync: Fix Opcode prints in bt_dev_dbg/err
  Bluetooth: Fix double free in hci_conn_cleanup
  Bluetooth: btmtksdio: enable bluetooth wakeup in system suspend
  Bluetooth: btusb: Add 0bda:b85b for Fn-Link RTL8852BE
  Bluetooth: hci_bcm4377: Mark bcm4378/bcm4387 as BROKEN_LE_CODED
  Bluetooth: ISO: Copy BASE if service data matches EIR_BAA_SERVICE_UUID
  Bluetooth: Make handle of hci_conn be unique
  Bluetooth: btusb: Add date->evt_skb is NULL check
  Bluetooth: ISO: Fix bcast listener cleanup
  Bluetooth: msft: __hci_cmd_sync() doesn't return NULL
  Bluetooth: ISO: Match QoS adv handle with BIG handle
  Bluetooth: ISO: Allow binding a bcast listener to 0 bises
  Bluetooth: btusb: Add RTW8852BE device 13d3:3570 to device tables
  Bluetooth: qca: add support for QCA2066
  Bluetooth: ISO: Set CIS bit only for devices with CIS support
  Bluetooth: Add support for Intel Misty Peak - 8087:0038
  Bluetooth: Add support ITTIM PE50-M75C
  Bluetooth: ISO: Pass BIG encryption info through QoS
  Bluetooth: ISO: Fix BIS cleanup
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agoMerge branch 'devlink-finish-conversion-to-generated-split_ops'
Jakub Kicinski [Mon, 23 Oct 2023 23:16:52 +0000 (16:16 -0700)]
Merge branch 'devlink-finish-conversion-to-generated-split_ops'

Jiri Pirko says:

====================
devlink: finish conversion to generated split_ops

This patchset converts the remaining genetlink commands to generated
split_ops and removes the existing small_ops arrays entirely, along
with the shared netlink attribute policy.

Patches #1-#6 are small preparations and fixes in multiple places.
              Note that a couple of patches carry a "Fixes" tag, but
              there is no need to route them through the -net tree.
Patch #7 is a simple rename preparation.
Patch #8 is the main one in this set and adds the actual command
         definitions to the yaml file.
Patches #9-#10 finalize the change by removing the bits that are no
               longer in use.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agodevlink: remove netlink small_ops
Jiri Pirko [Sat, 21 Oct 2023 11:27:11 +0000 (13:27 +0200)]
devlink: remove netlink small_ops

All commands are now covered by generated split_ops. Remove the
small_ops entirely, along with the unified devlink netlink policy array.

Signed-off-by: Jiri Pirko <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agodevlink: remove duplicated netlink callback prototypes
Jiri Pirko [Sat, 21 Oct 2023 11:27:10 +0000 (13:27 +0200)]
devlink: remove duplicated netlink callback prototypes

The prototypes are now generated, remove the old ones.

Signed-off-by: Jiri Pirko <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agonetlink: specs: devlink: add the remaining command to generate complete split_ops
Jiri Pirko [Sat, 21 Oct 2023 11:27:09 +0000 (13:27 +0200)]
netlink: specs: devlink: add the remaining command to generate complete split_ops

Currently, some of the commands are not described in the devlink yaml
file and are manually filled into small_ops in net/devlink/netlink.c.
To make them all part of split_ops, add definitions for the rest of
the commands, along with the needed attributes and enums.

Note that this focuses on the kernel side. The requests are fully
described in order to generate split_ops along with the policies.
A follow-up will describe the replies in order to make the userspace
helpers complete.

Signed-off-by: Jiri Pirko <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agodevlink: rename netlink callback to be aligned with the generated ones
Jiri Pirko [Sat, 21 Oct 2023 11:27:08 +0000 (13:27 +0200)]
devlink: rename netlink callback to be aligned with the generated ones

All remaining doit and dumpit netlink callback functions are going to
be used by generated split ops, which expect a certain name format.
Rename the callbacks to align with the generated names.

Signed-off-by: Jiri Pirko <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agodevlink: make devlink_flash_overwrite enum named one
Jiri Pirko [Sat, 21 Oct 2023 11:27:07 +0000 (13:27 +0200)]
devlink: make devlink_flash_overwrite enum named one

Since this enum is going to be used in a generated userspace file,
give it a proper name.

Signed-off-by: Jiri Pirko <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agonetlink: specs: devlink: make dont-validate single line
Jiri Pirko [Sat, 21 Oct 2023 11:27:06 +0000 (13:27 +0200)]
netlink: specs: devlink: make dont-validate single line

Make the dont-validate field more compact by putting it on a single line.

Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agonetlink: specs: devlink: remove reload-action from devlink-get cmd reply
Jiri Pirko [Sat, 21 Oct 2023 11:27:05 +0000 (13:27 +0200)]
netlink: specs: devlink: remove reload-action from devlink-get cmd reply

The devlink-get command does not include the reload-action attr in
its reply. Remove it from the spec.

Signed-off-by: Jiri Pirko <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agotools: ynl-gen: render rsp_parse() helpers if cmd has only dump op
Jiri Pirko [Sat, 21 Oct 2023 11:27:04 +0000 (13:27 +0200)]
tools: ynl-gen: render rsp_parse() helpers if cmd has only dump op

Due to a check in the RenderInfo class constructor, the
type_consistent flag is set to False to avoid rendering the same
response-parsing helper for both the do and dump ops. However, when
there is no do op, the helper still needs to be rendered for the
dump op. Split the check to achieve that.

Signed-off-by: Jiri Pirko <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agotools: ynl-gen: introduce support for bitfield32 attribute type
Jiri Pirko [Sat, 21 Oct 2023 11:27:03 +0000 (13:27 +0200)]
tools: ynl-gen: introduce support for bitfield32 attribute type

Introduce support for the bitfield32 attribute type.
Note that since the generated code works with struct nla_bitfield32,
the generator adds netlink.h to the list of includes for userspace
headers whenever a bitfield32 attribute is present.

Note that this is added only to the genetlink-legacy scheme, as
requested by Jakub Kicinski.
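
For reference, the UAPI structure the generated code relies on is the
existing value/selector pair from <linux/netlink.h>:

    struct nla_bitfield32 {
            __u32 value;
            __u32 selector;
    };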

Signed-off-by: Jiri Pirko <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agogenetlink: don't merge dumpit split op for different cmds into single iter
Jiri Pirko [Sat, 21 Oct 2023 11:27:02 +0000 (13:27 +0200)]
genetlink: don't merge dumpit split op for different cmds into single iter

Currently, doit and dumpit split ops are merged into a single
iterator item when they are adjacent. However, there is no guarantee
that the dumpit op is for the same cmd as the doit op.

Fix this by checking that the cmd is the same for both.
This problem does not occur in any existing family.

Signed-off-by: Jiri Pirko <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agoMerge branch 'intel-wired-lan-driver-updates-2023-10-19-idpf'
Jakub Kicinski [Mon, 23 Oct 2023 22:55:33 +0000 (15:55 -0700)]
Merge branch 'intel-wired-lan-driver-updates-2023-10-19-idpf'

Jacob Keller says:

====================
Intel Wired LAN Driver Updates 2023-10-19 (idpf)

This series contains two fixes for the recently merged idpf driver.

Michal adds missing logic for programming the scheduling mode of completion
queues.

Pavan fixes a call trace caused by the mailbox work item not being canceled
properly if an error occurred during initialization.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agoidpf: cancel mailbox work in error path
Pavan Kumar Linga [Mon, 23 Oct 2023 20:26:55 +0000 (13:26 -0700)]
idpf: cancel mailbox work in error path

In idpf_vc_core_init, the mailbox work is queued on the mailbox
workqueue but is not cancelled on error. This results in a call
trace when idpf_mbx_task tries to access the freed mailbox queue
pointer. Fix it by cancelling the mailbox work in the error path.
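
A hedged sketch of the shape of the fix; the error-path label and the
'mbx_task' field name in idpf_vc_core_init() are assumptions, not
quotes from the driver:

    err_intr_req:
            /* stop idpf_mbx_task before the mailbox queues are freed */
            cancel_delayed_work_sync(&adapter->mbx_task);
            /* existing error-path teardown continues from here */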

Fixes: 4930fbf419a7 ("idpf: add core init and interrupt request")
Reviewed-by: Wojciech Drewek <[email protected]>
Signed-off-by: Pavan Kumar Linga <[email protected]>
Tested-by: Krishneil Singh <[email protected]>
Signed-off-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agoidpf: set scheduling mode for completion queue
Michal Kubiak [Mon, 23 Oct 2023 20:26:54 +0000 (13:26 -0700)]
idpf: set scheduling mode for completion queue

The HW must be programmed differently for queue-based scheduling mode.
To program the completion queue context correctly, the control plane
must know the scheduling mode not only for the Tx queue, but also for
the completion queue.
Unfortunately, the driver currently sets the scheduling mode only
for the Tx queues.

Propagate the scheduling mode data for the completion queue as
well when sending the queue configuration messages.

Fixes: 1c325aac10a8 ("idpf: configure resources for TX queues")
Reviewed-by: Alexander Lobakin <[email protected]>
Signed-off-by: Michal Kubiak <[email protected]>
Reviewed-by: Alan Brady <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
Tested-by: Krishneil Singh <[email protected]>
Signed-off-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agonet_sched: sch_fq: fastpath needs to take care of sk->sk_pacing_status
Eric Dumazet [Fri, 20 Oct 2023 20:12:54 +0000 (20:12 +0000)]
net_sched: sch_fq: fastpath needs to take care of sk->sk_pacing_status

If the packets of a TCP flow take the fast path, we need to make sure
sk->sk_pacing_status is set to SK_PACING_FQ; otherwise TCP might
fall back to internal pacing, which is not optimal.
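
A minimal sketch of the kind of check the fast path needs; the exact
placement inside sch_fq is an assumption:

    if (sk && smp_load_acquire(&sk->sk_pacing_status) != SK_PACING_FQ)
            smp_store_release(&sk->sk_pacing_status, SK_PACING_FQ);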

Fixes: 076433bd78d7 ("net_sched: sch_fq: add fast path for mostly idle qdisc")
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agonet_sched: sch_fq: fix off-by-one error in fq_dequeue()
Eric Dumazet [Fri, 20 Oct 2023 20:00:53 +0000 (20:00 +0000)]
net_sched: sch_fq: fix off-by-one error in fq_dequeue()

A last-minute change went wrong.

We need to look for a packet in all 3 bands, not only two.
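
In loop terms, the dequeue scan has to cover all three bands; a hedged
sketch with illustrative names, not copied from sch_fq.c:

    for (band = 0; band < 3; band++) {              /* not band < 2 */
            skb = try_dequeue_band(q, band);        /* hypothetical helper */
            if (skb)
                    return skb;
    }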

Fixes: 29f834aa326e ("net_sched: sch_fq: add 3 bands and WRR scheduling")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Soheil Hassas Yeganeh <[email protected]>
Cc: Dave Taht <[email protected]>
Cc: Toke Høiland-Jørgensen <[email protected]>
Tested-by: Willem de Bruijn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agosfc: cleanup and reduce netlink error messages
Pieter Jansen van Vuuren [Fri, 20 Oct 2023 14:01:49 +0000 (15:01 +0100)]
sfc: cleanup and reduce netlink error messages

Reduce the length of netlink error messages as they are likely to be
truncated anyway. Additionally, reword netlink error messages so they
are more consistent with previous messages.

Fixes: 9dbc8d2b9a02 ("sfc: add decrement ipv6 hop limit by offloading set hop limit actions")
Fixes: 3c9561c0a5b9 ("sfc: support TC decap rules matching on enc_ip_tos")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Pieter Jansen van Vuuren <[email protected]>
Reviewed-by: Edward Cree <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
17 months agobpf, tcx: Get rid of tcx_link_const
Daniel Borkmann [Mon, 23 Oct 2023 18:50:15 +0000 (20:50 +0200)]
bpf, tcx: Get rid of tcx_link_const

Small cleanup to get rid of the extra tcx_link_const() and retain
only tcx_link().

Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
17 months agoBluetooth: hci_sync: Fix Opcode prints in bt_dev_dbg/err
Marcel Ziswiler [Wed, 18 Oct 2023 14:47:35 +0000 (16:47 +0200)]
Bluetooth: hci_sync: Fix Opcode prints in bt_dev_dbg/err

Printed Opcodes may be missing leading zeros:

Bluetooth: hci0: Opcode 0x c03 failed: -110

Fix this by always printing leading zeros:

Bluetooth: hci0: Opcode 0x0c03 failed: -110
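
A hedged sketch of the kind of format string that always prints four
hex digits; the actual call sites in hci_sync.c are not quoted here:

    bt_dev_err(hdev, "Opcode 0x%4.4x failed: %d", opcode, err);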

Fixes: d0b137062b2d ("Bluetooth: hci_sync: Rework init stages")
Fixes: 6a98e3836fa2 ("Bluetooth: Add helper for serialized HCI command execution")
Signed-off-by: Marcel Ziswiler <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
17 months agoBluetooth: Fix double free in hci_conn_cleanup
ZhengHan Wang [Wed, 18 Oct 2023 10:30:55 +0000 (12:30 +0200)]
Bluetooth: Fix double free in hci_conn_cleanup

syzbot reports a slab use-after-free in hci_conn_hash_flush [1].
After releasing an object using hci_conn_del_sysfs in the
hci_conn_cleanup function, releasing the same object again
using the hci_dev_put and hci_conn_put functions causes a double free.
Here's a simplified flow:

hci_conn_del_sysfs:
  hci_dev_put
    put_device
      kobject_put
        kref_put
          kobject_release
            kobject_cleanup
              kfree_const
                kfree(name)

hci_dev_put:
  ...
    kfree(name)

hci_conn_put:
  put_device
    ...
      kfree(name)

This patch drops the hci_dev_put and hci_conn_put calls in the
hci_conn_cleanup function, because the object is already freed
in the hci_conn_del_sysfs function.

This patch also fixes the refcounting in hci_conn_add_sysfs() and
hci_conn_del_sysfs() to take into account device_add() failures.

This fixes CVE-2023-28464.

Link: https://syzkaller.appspot.com/bug?id=1bb51491ca5df96a5f724899d1dbb87afda61419
Signed-off-by: ZhengHan Wang <[email protected]>
Co-developed-by: Luiz Augusto von Dentz <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
17 months agoBluetooth: btmtksdio: enable bluetooth wakeup in system suspend
Zhengping Jiang [Fri, 13 Oct 2023 19:47:07 +0000 (12:47 -0700)]
Bluetooth: btmtksdio: enable bluetooth wakeup in system suspend

The BTMTKSDIO_BT_WAKE_ENABLED flag is set for the bluetooth interrupt
during system suspend, and the wakeup count is increased for
bluetooth events.

Signed-off-by: Zhengping Jiang <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
17 months agoBluetooth: btusb: Add 0bda:b85b for Fn-Link RTL8852BE
Guan Wentao [Thu, 12 Oct 2023 11:21:17 +0000 (19:21 +0800)]
Bluetooth: btusb: Add 0bda:b85b for Fn-Link RTL8852BE

Add VID/PID 0bda:b85b for the Realtek RTL8852BE USB bluetooth part.
The VID/PID was reported in a patch last year [1].
Some SBCs, like the rockpi 5B with its A8 module, contain the device,
and it is also reported on the web [2] [3].

Here is the device listing from /sys/kernel/debug/usb/devices:

T:  Bus=07 Lev=01 Prnt=01 Port=01 Cnt=01 Dev#=  2 Spd=12   MxCh= 0
D:  Ver= 1.00 Cls=e0(wlcon) Sub=01 Prot=01 MxPS=64 #Cfgs=  1
P:  Vendor=0bda ProdID=b85b Rev= 0.00
S:  Manufacturer=Realtek
S:  Product=Bluetooth Radio
S:  SerialNumber=00e04c000001
C:* #Ifs= 2 Cfg#= 1 Atr=e0 MxPwr=500mA
I:* If#= 0 Alt= 0 #EPs= 3 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
E:  Ad=81(I) Atr=03(Int.) MxPS=  16 Ivl=1ms
E:  Ad=02(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=82(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
I:* If#= 1 Alt= 0 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
E:  Ad=03(O) Atr=01(Isoc) MxPS=   0 Ivl=1ms
E:  Ad=83(I) Atr=01(Isoc) MxPS=   0 Ivl=1ms
I:  If#= 1 Alt= 1 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
E:  Ad=03(O) Atr=01(Isoc) MxPS=   9 Ivl=1ms
E:  Ad=83(I) Atr=01(Isoc) MxPS=   9 Ivl=1ms
I:  If#= 1 Alt= 2 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
E:  Ad=03(O) Atr=01(Isoc) MxPS=  17 Ivl=1ms
E:  Ad=83(I) Atr=01(Isoc) MxPS=  17 Ivl=1ms
I:  If#= 1 Alt= 3 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
E:  Ad=03(O) Atr=01(Isoc) MxPS=  25 Ivl=1ms
E:  Ad=83(I) Atr=01(Isoc) MxPS=  25 Ivl=1ms
I:  If#= 1 Alt= 4 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
E:  Ad=03(O) Atr=01(Isoc) MxPS=  33 Ivl=1ms
E:  Ad=83(I) Atr=01(Isoc) MxPS=  33 Ivl=1ms
I:  If#= 1 Alt= 5 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
E:  Ad=03(O) Atr=01(Isoc) MxPS=  49 Ivl=1ms
E:  Ad=83(I) Atr=01(Isoc) MxPS=  49 Ivl=1ms

Link: https://lore.kernel.org/all/[email protected]/
Link: https://forum.radxa.com/t/bluetooth-on-ubuntu/13051/4
Link: https://ubuntuforums.org/showthread.php?t=2489527
Cc: [email protected]
Signed-off-by: Meng Tang <[email protected]>
Signed-off-by: Guan Wentao <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
17 months agoBluetooth: hci_bcm4377: Mark bcm4378/bcm4387 as BROKEN_LE_CODED
Janne Grunau [Mon, 16 Oct 2023 07:13:08 +0000 (09:13 +0200)]
Bluetooth: hci_bcm4377: Mark bcm4378/bcm4387 as BROKEN_LE_CODED

bcm4378 and bcm4387 claim to support LE Coded PHY but fail to pair
(reliably) with BLE devices if it is enabled.
On bcm4378 pairing usually succeeds after 2-3 tries. On bcm4387
pairing appears to be completely broken.

Cc: [email protected] # 6.4.y+
Link: https://discussion.fedoraproject.org/t/mx-master-3-bluetooth-mouse-doesnt-connect/87072/33
Link: https://github.com/AsahiLinux/linux/issues/177
Fixes: 288c90224eec ("Bluetooth: Enable all supported LE PHY by default")
Signed-off-by: Janne Grunau <[email protected]>
Reviewed-by: Eric Curtin <[email protected]>
Reviewed-by: Neal Gompa <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
17 months agoBluetooth: ISO: Copy BASE if service data matches EIR_BAA_SERVICE_UUID
Claudia Draghicescu [Thu, 28 Sep 2023 08:02:08 +0000 (11:02 +0300)]
Bluetooth: ISO: Copy BASE if service data matches EIR_BAA_SERVICE_UUID

Copy the content of a Periodic Advertisement Report to BASE only if
the service UUID is Basic Audio Announcement Service UUID.

Signed-off-by: Claudia Draghicescu <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
17 months agoBluetooth: Make handle of hci_conn be unique
Ziyang Xuan [Wed, 11 Oct 2023 09:57:31 +0000 (17:57 +0800)]
Bluetooth: Make handle of hci_conn be unique

The handle of a new hci_conn is always HCI_CONN_HANDLE_MAX + 1 if
the handle of the first hci_conn entry in hci_dev->conn_hash->list
is not HCI_CONN_HANDLE_MAX + 1. Use an IDA to manage the allocation
of hci_conn->handle so that it is unique.
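
A hedged sketch of IDA-based handle allocation; the 'unset_handle_ida'
field name and the exact bounds are assumptions about this patch:

    int handle;

    handle = ida_alloc_range(&hdev->unset_handle_ida,
                             HCI_CONN_HANDLE_MAX + 1, U16_MAX,
                             GFP_KERNEL);
    if (handle < 0)
            return NULL;
    conn->handle = handle;

    /* ... and on cleanup ... */
    ida_free(&hdev->unset_handle_ida, conn->handle);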

Fixes: 9f78191cc9f1 ("Bluetooth: hci_conn: Always allocate unique handles")
Signed-off-by: Ziyang Xuan <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
17 months agoBluetooth: btusb: Add date->evt_skb is NULL check
youwan Wang [Wed, 11 Oct 2023 05:14:47 +0000 (13:14 +0800)]
Bluetooth: btusb: Add date->evt_skb is NULL check

Fix a crash caused by a NULL pointer dereference:

[ 6104.969662] BUG: kernel NULL pointer dereference, address: 00000000000000c8
[ 6104.969667] #PF: supervisor read access in kernel mode
[ 6104.969668] #PF: error_code(0x0000) - not-present page
[ 6104.969670] PGD 0 P4D 0
[ 6104.969673] Oops: 0000 [#1] SMP NOPTI
[ 6104.969684] RIP: 0010:btusb_mtk_hci_wmt_sync+0x144/0x220 [btusb]
[ 6104.969688] RSP: 0018:ffffb8d681533d48 EFLAGS: 00010246
[ 6104.969689] RAX: 0000000000000000 RBX: ffff8ad560bb2000 RCX: 0000000000000006
[ 6104.969691] RDX: 0000000000000000 RSI: ffffb8d681533d08 RDI: 0000000000000000
[ 6104.969692] RBP: ffffb8d681533d70 R08: 0000000000000001 R09: 0000000000000001
[ 6104.969694] R10: 0000000000000001 R11: 00000000fa83b2da R12: ffff8ad461d1d7c0
[ 6104.969695] R13: 0000000000000000 R14: ffff8ad459618c18 R15: ffffb8d681533d90
[ 6104.969697] FS:  00007f5a1cab9d40(0000) GS:ffff8ad578200000(0000) knlGS:00000
[ 6104.969699] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6104.969700] CR2: 00000000000000c8 CR3: 000000018620c001 CR4: 0000000000760ef0
[ 6104.969701] PKRU: 55555554
[ 6104.969702] Call Trace:
[ 6104.969708]  btusb_mtk_shutdown+0x44/0x80 [btusb]
[ 6104.969732]  hci_dev_do_close+0x470/0x5c0 [bluetooth]
[ 6104.969748]  hci_rfkill_set_block+0x56/0xa0 [bluetooth]
[ 6104.969753]  rfkill_set_block+0x92/0x160
[ 6104.969755]  rfkill_fop_write+0x136/0x1e0
[ 6104.969759]  __vfs_write+0x18/0x40
[ 6104.969761]  vfs_write+0xdf/0x1c0
[ 6104.969763]  ksys_write+0xb1/0xe0
[ 6104.969765]  __x64_sys_write+0x1a/0x20
[ 6104.969769]  do_syscall_64+0x51/0x180
[ 6104.969771]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 6104.969773] RIP: 0033:0x7f5a21f18fef
[ 6104.9] RSP: 002b:00007ffeefe39010 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[ 6104.969780] RAX: ffffffffffffffda RBX: 000055c10a7560a0 RCX: 00007f5a21f18fef
[ 6104.969781] RDX: 0000000000000008 RSI: 00007ffeefe39060 RDI: 0000000000000012
[ 6104.969782] RBP: 00007ffeefe39060 R08: 0000000000000000 R09: 0000000000000017
[ 6104.969784] R10: 00007ffeefe38d97 R11: 0000000000000293 R12: 0000000000000002
[ 6104.969785] R13: 00007ffeefe39220 R14: 00007ffeefe391a0 R15: 000055c10a72acf0
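
A hedged sketch of the kind of guard this adds in
btusb_mtk_hci_wmt_sync(); the error value and the cleanup label are
assumptions:

    if (!data->evt_skb) {
            err = -ENODATA;
            goto err_free_wc;       /* hypothetical cleanup label */
    }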

Signed-off-by: youwan Wang <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
17 months agoBluetooth: ISO: Fix bcast listener cleanup
Iulia Tanasescu [Wed, 11 Oct 2023 14:24:07 +0000 (17:24 +0300)]
Bluetooth: ISO: Fix bcast listener cleanup

This fixes the cleanup callback for slave bis and pa sync hcons.

Closing all bis hcons will trigger BIG Terminate Sync, while closing
all bises and the pa sync hcon will also trigger PA Terminate Sync.

Signed-off-by: Iulia Tanasescu <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
17 months agoBluetooth: msft: __hci_cmd_sync() doesn't return NULL
Dan Carpenter [Thu, 5 Oct 2023 11:19:23 +0000 (14:19 +0300)]
Bluetooth: msft: __hci_cmd_sync() doesn't return NULL

The __hci_cmd_sync() function doesn't return NULL.  Checking for NULL
doesn't make the code safer, it just confuses people.

When a function returns both error pointers and NULL then generally the
NULL is a kind of success case.  For example, maybe we look up an item
then errors mean we ran out of memory but NULL means the item is not
found.  Or if we request a feature, then error pointers mean that there
was an error but NULL means that the feature has been deliberately
turned off.

In this code it's different.  The NULL is handled as if there is a bug
in __hci_cmd_sync() where it accidentally returns NULL instead of a
proper error code.  This was done consistently until commit 9e14606d8f38
("Bluetooth: msft: Extended monitor tracking by address filter") which
deleted the work around for the potential future bug and treated NULL as
success.

Predicting potential future bugs is complicated, but we should just fix
them instead of working around them.  Instead of debating whether NULL
is failure or success, let's just say it's currently impossible and
delete the dead code.
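
The error-pointer-only convention described above, sketched against
the __hci_cmd_sync() signature (the timeout constant shown is
illustrative):

    skb = __hci_cmd_sync(hdev, opcode, plen, param, HCI_CMD_TIMEOUT);
    if (IS_ERR(skb))
            return PTR_ERR(skb);    /* no NULL case to handle */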

Signed-off-by: Dan Carpenter <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
17 months agoBluetooth: ISO: Match QoS adv handle with BIG handle
Iulia Tanasescu [Tue, 3 Oct 2023 14:37:39 +0000 (17:37 +0300)]
Bluetooth: ISO: Match QoS adv handle with BIG handle

In case the user binds multiple sockets for the same BIG, the BIG
handle should be matched with the associated adv handle, if it has
already been allocated previously.

Signed-off-by: Iulia Tanasescu <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>