Git Repo - linux.git/log

bpf: Add bpf_xdp_output() helper

Introduce new helper that reuses existing xdp perf_event output
implementation, but can be called from raw_tracepoint programs
that receive 'struct xdp_buff *' as a tracepoint argument.

Signed-off-by: Eelco Chaudron <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: John Fastabend <[email protected]>
Acked-by: Toke Høiland-Jørgensen <[email protected]>
Link: https://lore.kernel.org/bpf/158348514556.2239.11050972434793741444.stgit@xdp-tutorial

Merge branch 'bpf_get_ns_current_pid_tgid'

Carlos Neira says:

====================
Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
scripts but this helper returns the pid as seen by the root namespace which is
fine when a bcc script is not executed inside a container.
When the process of interest is inside a container, pid filtering will not work
if bpf_get_current_pid_tgid() is used.
This helper addresses this limitation returning the pid as it's seen by the current
namespace where the script is executing.

In the future different pid_ns files may belong to different devices, according to the
discussion between Eric Biederman and Yonghong in 2017 Linux plumbers conference.
To address that situation the helper requires inum and dev_t from /proc/self/ns/pid.
This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
used to do pid filtering even inside a container.
====================

Signed-off-by: Alexei Starovoitov <[email protected]>

tools/testing/selftests/bpf: Add self-tests for new helper bpf_get_ns_current_pid_tgid.

Self tests added for new helper bpf_get_ns_current_pid_tgid

Signed-off-by: Carlos Neira <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

Merge tag 'topic/mst-bw-check-fixes-for-airlied-2020-03-12-2' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

UAPI Changes: None

Cross-subsystem Changes: None

Core Changes: Fixed regressions introduced by commit cd82d82cbc04
("drm/dp_mst: Add branch bandwidth validation to MST atomic check"),
which would cause us to:

* Calculate the available bandwidth on an MST topology incorrectly, and
  as a result reject most display configurations that would try to enable
  more then one sink on a topology
* Occasionally expose MST connectors to userspace before finishing
  probing their PBN capabilities, resulting in us rejecting display
  configurations because we assumed briefly that no bandwidth was
  available

Driver Changes: None

Signed-off-by: Dave Airlie <[email protected]>
From: Lyude Paul <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

bpf: Added new helper bpf_get_ns_current_pid_tgid

New bpf helper bpf_get_ns_current_pid_tgid,
This helper will return pid and tgid from current task
which namespace matches dev_t and inode number provided,
this will allows us to instrument a process inside a container.

Signed-off-by: Carlos Neira <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

fs/nsfs.c: Added ns_match

ns_match returns true if the namespace inode and dev_t matches the ones
provided by the caller.

Signed-off-by: Carlos Neira <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

Merge tag 'drm-intel-fixes-2020-03-12' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes

drm/i915 fixes for v5.6-rc6:
- hard lockup fix
- GVT fixes
- 32-bit alignment issue fix
- timeline wait fixes
- cacheline_retire and free

Signed-off-by: Dave Airlie <[email protected]>
From: Jani Nikula <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

tools: bpftool: Fix minor bash completion mistakes

Minor fixes for bash completion: addition of program name completion for
two subcommands, and correction for program test-runs and map pinning.

The completion for the following commands is fixed or improved:

    # bpftool prog run [TAB]
    # bpftool prog pin [TAB]
    # bpftool map pin [TAB]
    # bpftool net attach xdp name [TAB]

Signed-off-by: Quentin Monnet <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Martin KaFai Lau <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

tools: bpftool: Allow all prog/map handles for pinning objects

Documentation and interactive help for bpftool have always explained
that the regular handles for programs (id|name|tag|pinned) and maps
(id|name|pinned) can be passed to the utility when attempting to pin
objects (bpftool prog pin PROG / bpftool map pin MAP).

THIS IS A LIE!! The tool actually accepts only ids, as the parsing is
done in do_pin_any() in common.c instead of reusing the parsing
functions that have long been generic for program and map handles.

Instead of fixing the doc, fix the code. It is trivial to reuse the
generic parsing, and to simplify do_pin_any() in the process.

Do not accept to pin multiple objects at the same time with
prog_parse_fds() or map_parse_fds() (this would require a more complex
syntax for passing multiple sysfs paths and validating that they
correspond to the number of e.g. programs we find for a given name or
tag).

Signed-off-by: Quentin Monnet <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Martin KaFai Lau <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

Merge tag 'amd-drm-fixes-5.6-2020-03-11' of git://people.freedesktop.org/~agd5f/linux into drm-fixes

amd-drm-fixes-5.6-2020-03-11:

amdgpu:
- Update the display watermark bounding box for navi14
- Fix fetching vbios directly from rom on vega20/arcturus
- Navi and renoir watermark fixes

Signed-off-by: Dave Airlie <[email protected]>
From: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from David Miller:
"It looks like a decent sized set of fixes, but a lot of these are one
  liner off-by-one and similar type changes:

   1) Fix netlink header pointer to calcular bad attribute offset
      reported to user. From Pablo Neira Ayuso.

   2) Don't double clear PHY interrupts when ->did_interrupt is set,
      from Heiner Kallweit.

   3) Add missing validation of various (devlink, nl802154, fib, etc.)
      attributes, from Jakub Kicinski.

   4) Missing *pos increments in various netfilter seq_next ops, from
      Vasily Averin.

   5) Missing break in of_mdiobus_register() loop, from Dajun Jin.

   6) Don't double bump tx_dropped in veth driver, from Jiang Lidong.

   7) Work around FMAN erratum A050385, from Madalin Bucur.

   8) Make sure ARP header is pulled early enough in bonding driver,
      from Eric Dumazet.

   9) Do a cond_resched() during multicast processing of ipvlan and
      macvlan, from Mahesh Bandewar.

  10) Don't attach cgroups to unrelated sockets when in interrupt
      context, from Shakeel Butt.

  11) Fix tpacket ring state management when encountering unknown GSO
      types. From Willem de Bruijn.

  12) Fix MDIO bus PHY resume by checking mdio_bus_phy_may_suspend()
      only in the suspend context. From Heiner Kallweit"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (112 commits)
  net: systemport: fix index check to avoid an array out of bounds access
  tc-testing: add ETS scheduler to tdc build configuration
  net: phy: fix MDIO bus PM PHY resuming
  net: hns3: clear port base VLAN when unload PF
  net: hns3: fix RMW issue for VLAN filter switch
  net: hns3: fix VF VLAN table entries inconsistent issue
  net: hns3: fix "tc qdisc del" failed issue
  taprio: Fix sending packets without dequeueing them
  net: mvmdio: avoid error message for optional IRQ
  net: dsa: mv88e6xxx: Add missing mask of ATU occupancy register
  net: memcg: fix lockdep splat in inet_csk_accept()
  s390/qeth: implement smarter resizing of the RX buffer pool
  s390/qeth: refactor buffer pool code
  s390/qeth: use page pointers to manage RX buffer pool
  seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number
  net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed
  net/packet: tpacket_rcv: do not increment ring index on drop
  sxgbe: Fix off by one in samsung driver strncpy size arg
  net: caif: Add lockdep expression to RCU traversal primitive
  MAINTAINERS: remove Sathya Perla as Emulex NIC maintainer
  ...

libbpf: Split BTF presence checks into libbpf- and kernel-specific parts

Needs for application BTF being present differs between user-space libbpf needs and kernel
needs. Currently, BTF is mandatory only in kernel only when BPF application is
using STRUCT_OPS. While libbpf itself relies more heavily on presense of BTF:
  - for BTF-defined maps;
  - for Kconfig externs;
  - for STRUCT_OPS as well.

Thus, checks for presence and validness of bpf_object's BPF needs to be
performed separately, which is patch does.

Fixes: 5327644614a1 ("libbpf: Relax check whether BTF is mandatory")
Reported-by: Michal Rostecki <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Martin KaFai Lau <[email protected]>
Cc: Quentin Monnet <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

bpftool: Add _bpftool and profiler.skel.h to .gitignore

These files are generated, so ignore them.

Signed-off-by: Song Liu <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

bpftool: Skeleton should depend on libbpf

Add the dependency to libbpf, to fix build errors like:

  In file included from skeleton/profiler.bpf.c:5:
  .../bpf_helpers.h:5:10: fatal error: 'bpf_helper_defs.h' file not found
  #include "bpf_helper_defs.h"
           ^~~~~~~~~~~~~~~~~~~
  1 error generated.
  make: *** [skeleton/profiler.bpf.o] Error 1
  make: *** Waiting for unfinished jobs....

Fixes: 47c09d6a9f67 ("bpftool: Introduce "prog profile" command")
Suggested-by: Quentin Monnet <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

bpftool: Only build bpftool-prog-profile if supported by clang

bpftool-prog-profile requires clang to generate BTF for global variables.
When compared with older clang, skip this command. This is achieved by
adding a new feature, clang-bpf-global-var, to tools/build/feature.

Signed-off-by: Song Liu <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

drm/dp_mst: Rewrite and fix bandwidth limit checks

Sigh, this is mostly my fault for not giving commit cd82d82cbc04
("drm/dp_mst: Add branch bandwidth validation to MST atomic check")
enough scrutiny during review. The way we're checking bandwidth
limitations here is mostly wrong:

For starters, drm_dp_mst_atomic_check_bw_limit() determines the
pbn_limit of a branch by simply scanning each port on the current branch
device, then uses the last non-zero full_pbn value that it finds. It
then counts the sum of the PBN used on each branch device for that
level, and compares against the full_pbn value it found before.

This is wrong because ports can and will have different PBN limitations
on many hubs, especially since a number of DisplayPort hubs out there
will be clever and only use the smallest link rate required for each
downstream sink - potentially giving every port a different full_pbn
value depending on what link rate it's trained at. This means with our
current code, which max PBN value we end up with is not well defined.

Additionally, we also need to remember when checking bandwidth
limitations that the top-most device in any MST topology is a branch
device, not a port. This means that the first level of a topology
doesn't technically have a full_pbn value that needs to be checked.
Instead, we should assume that so long as our VCPI allocations fit we're
within the bandwidth limitations of the primary MSTB.

We do however, want to check full_pbn on every port including those of
the primary MSTB. However, it's important to keep in mind that this
value represents the minimum link rate /between a port's sink or mstb,
and the mstb itself/. A quick diagram to explain:

                                MSTB #1
                               /       \
                              /         \
                           Port #1    Port #2
       full_pbn for Port #1 → |          | ← full_pbn for Port #2
                           Sink #1    MSTB #2
                                         |
                                       etc...

Note that in the above diagram, the combined PBN from all VCPI
allocations on said hub should not exceed the full_pbn value of port #2,
and the display configuration on sink #1 should not exceed the full_pbn
value of port #1. However, port #1 and port #2 can otherwise consume as
much bandwidth as they want so long as their VCPI allocations still fit.

And finally - our current bandwidth checking code also makes the mistake
of not checking whether something is an end device or not before trying
to traverse down it.

So, let's fix it by rewriting our bandwidth checking helpers. We split
the function into one part for handling branches which simply adds up
the total PBN on each branch and returns it, and one for checking each
port to ensure we're not going over its PBN limit. Phew.

This should fix regressions seen, where we erroneously reject display
configurations due to thinking they're going over our bandwidth limits
when they're not.

Changes since v1:
* Took an even closer look at how PBN limitations are supposed to be
  handled, and did some experimenting with Sean Paul. Ended up rewriting
  these helpers again, but this time they should actually be correct!
Changes since v2:
* Small indenting fix
* Fix pbn_used check in drm_dp_mst_atomic_check_port_bw_limit()

Signed-off-by: Lyude Paul <[email protected]>
Fixes: cd82d82cbc04 ("drm/dp_mst: Add branch bandwidth validation to MST atomic check")
Cc: Sean Paul <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Mikita Lipski <[email protected]>
Tested-by: Hans de Goede <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/dp_mst: Reprobe path resources in CSN handler

We used to punt off reprobing path resources to the link address probe
work, but now that we handle CSNs asynchronously from the driver's HPD
handling we can do whatever the heck we want from the CSN!

So, reprobe the path resources from drm_dp_mst_handle_conn_stat(). Also,
get rid of the path resource reprobing code in
drm_dp_check_and_send_link_address() since it's needlessly complicated
when we already reprobe path resources from
drm_dp_handle_link_address_port(). And finally, teach
drm_dp_send_enum_path_resources() to return 1 on PBN changes so we know
if we need to send another hotplug or not.

This fixes issues where we've indicated to userspace that a port has
just been connected, before we actually probed it's available PBN -
something that results in unexpected atomic check failures.

Signed-off-by: Lyude Paul <[email protected]>
Fixes: cd82d82cbc04 ("drm/dp_mst: Add branch bandwidth validation to MST atomic check")
Cc: Mikita Lipski <[email protected]>
Cc: Hans de Goede <[email protected]>
Cc: Sean Paul <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Reviewed-by: Alex Deucher <[email protected]>
Tested-by: Hans de Goede <[email protected]>

drm/dp_mst: Use full_pbn instead of available_pbn for bandwidth checks

DisplayPort specifications are fun. For a while, it's been really
unclear to us what available_pbn actually does. There's a somewhat vague
explanation in the DisplayPort spec (starting from 1.2) that partially
explains it:

  The minimum payload bandwidth number supported by the path. Each node
  updates this number with its available payload bandwidth number if its
  payload bandwidth number is less than that in the Message Transaction
  reply.

So, it sounds like available_pbn represents the smallest link rate in
use between the source and the branch device. Cool, so full_pbn is just
the highest possible PBN that the branch device supports right?

Well, we assumed that for quite a while until Sean Paul noticed that on
some MST hubs, available_pbn will actually get set to 0 whenever there's
any active payloads on the respective branch device. This caused quite a
bit of confusion since clearing the payload ID table would end up fixing
the available_pbn value.

So, we just went with that until commit cd82d82cbc04 ("drm/dp_mst: Add
branch bandwidth validation to MST atomic check") started breaking
people's setups due to us getting erroneous available_pbn values. So, we
did some more digging and got confused until we finally looked at the
definition for full_pbn:

  The bandwidth of the link at the trained link rate and lane count
  between the DP Source device and the DP Sink device with no time slots
  allocated to VC Payloads, represented as a Payload Bandwidth Number. As
  with the Available_Payload_Bandwidth_Number, this number is determined
  by the link with the lowest lane count and link rate.

That's what we get for not reading specs closely enough, hehe. So, since
full_pbn is definitely what we want for doing bandwidth restriction
checks - let's start using that instead and ignore available_pbn
entirely.

Signed-off-by: Lyude Paul <[email protected]>
Fixes: cd82d82cbc04 ("drm/dp_mst: Add branch bandwidth validation to MST atomic check")
Cc: Mikita Lipski <[email protected]>
Cc: Hans de Goede <[email protected]>
Cc: Sean Paul <[email protected]>
Reviewed-by: Mikita Lipski <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Reviewed-by: Alex Deucher <[email protected]>
Tested-by: Hans de Goede <[email protected]>

drm/dp_mst: Rename drm_dp_mst_is_dp_mst_end_device() to be less redundant

It's already prefixed by dp_mst, so we don't really need to repeat
ourselves here. One of the changes I should have picked up originally
when reviewing MST DSC support.

There should be no functional changes here

Cc: Mikita Lipski <[email protected]>
Cc: Sean Paul <[email protected]>
Cc: Hans de Goede <[email protected]>
Signed-off-by: Lyude Paul <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Tested-by: Hans de Goede <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

inet: Use fallthrough;

Convert the various uses of fallthrough comments to fallthrough;

Done via script
Link: https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/
And by hand:

net/ipv6/ip6_fib.c has a fallthrough comment outside of an #ifdef block
that causes gcc to emit a warning if converted in-place.

So move the new fallthrough; inside the containing #ifdef/#endif too.

Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

octeontx2-pf: unlock on error path in otx2_config_pause_frm()

We need to unlock before returning if this allocation fails.

Fixes: 75f36270990c ("octeontx2-pf: Support to enable/disable pause frames via ethtool")
Signed-off-by: Dan Carpenter <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fixes from Al Viro:
"A couple of fixes for old crap in ->atomic_open() instances"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
cifs_atomic_open(): fix double-put on late allocation failure
gfs2_atomic_open(): fix O_EXCL|O_CREAT handling on cold dcache

net: systemport: fix index check to avoid an array out of bounds access

Currently the bounds check on index is off by one and can lead to
an out of bounds access on array priv->filters_loc when index is
RXCHK_BRCM_TAG_MAX.

Fixes: bb9051a2b230 ("net: systemport: Add support for WAKE_FILTER")
Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'ipa-fixes'

Alex Elder says:

====================
net: fix net-next

David: These patches resolve two issues caused by the IPA driver
being incorporated into net-next.  I hope you will merge
them as soon as you can.

The IPA driver was merged into net-next last week, but two problems
arise as a result, affecting net-next and linux-next:
  - The patch that defines field_max() was not incorporated into
    net-next, but is required by the IPA code
  - A patch that updates "sdm845.dtsi" *was* incorporated into
    net-next, but other changes to that file in the Qualcomm
    for-next branch lead to errors

Bjorn has agreed to incorporate the DTS file change into the
Qualcomm tree after it is reverted from net-next.
====================

Signed-off-by: David S. Miller <[email protected]>

Revert "arm64: dts: sdm845: add IPA information"

This reverts commit 9cc5ae125f0eaee471bc87fb5cbf29385fd9272a.

This commit:
b303f9f0050b arm64: dts: sdm845: Redefine interconnect provider DT nodes
found in the Qualcomm for-next tree removes/redefines the interconnect
provider node(s) used for IPA. I'm not sure whether it technically
conflicts with the IPA change to "sdm845.dtsi" in for-next, but it renders
it broken.

Revert this commit in the for-next tree, with the plan to incorporate
it into the Qualcomm tree instead.

Signed-off-by: Alex Elder <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bitfield.h: add FIELD_MAX() and field_max()

Define FIELD_MAX(), which supplies the maximum value that can be
represented by a field value. Define field_max() as well, to go
along with the lower-case forms of the field mask functions.

Signed-off-by: Alex Elder <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tc-testing: add ETS scheduler to tdc build configuration

add CONFIG_NET_SCH_ETS to 'config', otherwise test suites using this file
to perform a full tdc run will encounter the following warning:

ok 645 e90e - Add ETS qdisc using bands # skipped - "-----> teardown stage" did not complete successfully

Fixes: 82c664b69c8b ("selftests: qdiscs: Add test coverage for ETS Qdisc")
Reported-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: Davide Caratti <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: fix MDIO bus PM PHY resuming

So far we have the unfortunate situation that mdio_bus_phy_may_suspend()
is called in suspend AND resume path, assuming that function result is
the same. After the original change this is no longer the case,
resulting in broken resume as reported by Geert.

To fix this call mdio_bus_phy_may_suspend() in the suspend path only,
and let the phy_device store the info whether it was suspended by
MDIO bus PM.

Fixes: 503ba7c69610 ("net: phy: Avoid multiple suspends")
Reported-by: Geert Uytterhoeven <[email protected]>
Tested-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Heiner Kallweit <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'ethtool-netlink-interface-part-3'

Michal Kubecek says:

====================
ethtool netlink interface, part 3

Implementation of more netlink request types:

  - netdev features (ethtool -k/-K, patches 3-6)
  - private flags (--show-priv-flags / --set-priv-flags, patches 7-9)
  - ring sizes (ethtool -g/-G, patches 10-12)
  - channel counts (ethtool -l/-L, patches 13-15)

Patch 1 is a style cleanup suggested in part 2 review and patch 2 updates
the mapping between netdev features and legacy ioctl requests (which are
still used by ethtool for backward compatibility).

Changes in v2:
  - fix netdev reference leaks in error path of ethnl_set_rings() and
    ethnl_set_channels() (found by Jakub Kicinski)
  - use __set_bit() rather than set_bit() (suggested by David Miller)
  - in replies to RINGS_GET and CHANNELS_GET requests, omit ring and
    channel types not supported by driver/device (suggested by Jakub
    Kicinski)
  - more descriptive message size calculations in rings_reply_size() and
    channels_reply_size() (suggested by Jakub Kicinski)
  - coding style cleanup (suggested by Jakub Kicinski)
====================

Signed-off-by: David S. Miller <[email protected]>

ethtool: add CHANNELS_NTF notification

Send ETHTOOL_MSG_CHANNELS_NTF notification whenever channel counts of
a network device are modified using ETHTOOL_MSG_CHANNELS_SET netlink
message or ETHTOOL_SCHANNELS ioctl request.

Signed-off-by: Michal Kubecek <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ethtool: set device channel counts with CHANNELS_SET request

Implement CHANNELS_SET netlink request to set channel counts of a network
device. These are traditionally set with ETHTOOL_SCHANNELS ioctl request.

Like the ioctl implementation, the generic ethtool code checks if supplied
values do not exceed driver defined limits; if they do, first offending
attribute is reported using extack. Checks preventing removing channels
used for RX indirection table or zerocopy AF_XDP socket are also
implemented.

Move ethtool_get_max_rxfh_channel() helper into common.c so that it can be
used by both ioctl and netlink code.

v2:
- fix netdev reference leak in error path (found by Jakub Kicinsky)

Signed-off-by: Michal Kubecek <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ethtool: provide channel counts with CHANNELS_GET request

Implement CHANNELS_GET request to get channel counts of a network device.
These are traditionally available via ETHTOOL_GCHANNELS ioctl request.

Omit attributes for channel types which are not supported by driver or
device (zero reported for maximum).

v2: (all suggested by Jakub Kicinski)
  - minor cleanup in channels_prepare_data()
  - more descriptive channels_reply_size()
  - omit attributes with zero max count

Signed-off-by: Michal Kubecek <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ethtool: add RINGS_NTF notification

Send ETHTOOL_MSG_RINGS_NTF notification whenever ring sizes of a network
device are modified using ETHTOOL_MSG_RINGS_SET netlink message or
ETHTOOL_SRINGPARAM ioctl request.

Signed-off-by: Michal Kubecek <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ethtool: set device ring sizes with RINGS_SET request

Implement RINGS_SET netlink request to set ring sizes of a network device.
These are traditionally set with ETHTOOL_SRINGPARAM ioctl request.

Like the ioctl implementation, the generic ethtool code checks if supplied
values do not exceed driver defined limits; if they do, first offending
attribute is reported using extack.

v2:
- fix netdev reference leak in error path (found by Jakub Kicinsky)

Signed-off-by: Michal Kubecek <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ethtool: provide ring sizes with RINGS_GET request

Implement RINGS_GET request to get ring sizes of a network device. These
are traditionally available via ETHTOOL_GRINGPARAM ioctl request.

Omit attributes for ring types which are not supported by driver or device
(zero reported for maximum).

v2: (all suggested by Jakub Kicinski)
  - minor cleanup in rings_prepare_data()
  - more descriptive rings_reply_size()
  - omit attributes with zero max size

Signed-off-by: Michal Kubecek <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ethtool: add PRIVFLAGS_NTF notification

Send ETHTOOL_MSG_PRIVFLAGS_NTF notification whenever private flags of
a network device are modified using ETHTOOL_MSG_PRIVFLAGS_SET netlink
message or ETHTOOL_SPFLAGS ioctl request.

Signed-off-by: Michal Kubecek <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ethtool: set device private flags with PRIVFLAGS_SET request

Implement PRIVFLAGS_SET netlink request to set private flags of a network
device. These are traditionally set with ETHTOOL_SPFLAGS ioctl request.

Signed-off-by: Michal Kubecek <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ethtool: provide private flags with PRIVFLAGS_GET request

Implement PRIVFLAGS_GET request to get private flags for a network device.
These are traditionally available via ETHTOOL_GPFLAGS ioctl request.

Signed-off-by: Michal Kubecek <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ethtool: add FEATURES_NTF notification

Send ETHTOOL_MSG_FEATURES_NTF notification whenever network device features
are modified using ETHTOOL_MSG_FEATURES_SET netlink message, ethtool ioctl
request or any other way resulting in call to netdev_update_features() or
netdev_change_features()

Signed-off-by: Michal Kubecek <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ethtool: set netdev features with FEATURES_SET request

Implement FEATURES_SET netlink request to set network device features.
These are traditionally set using ETHTOOL_SFEATURES ioctl request.

Actual change is subject to netdev_change_features() sanity checks so that
it can differ from what was requested. Unlike with most other SET requests,
in addition to error code and optional extack, kernel provides an optional
reply message (ETHTOOL_MSG_FEATURES_SET_REPLY) in the same format but with
different semantics: information about difference between user request and
actual result and difference between old and new state of dev->features.
This reply message can be suppressed by setting ETHTOOL_FLAG_OMIT_REPLY
flag in request header.

Signed-off-by: Michal Kubecek <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ethtool: add ethnl_parse_bitset() helper

Unlike other SET type commands, modifying netdev features is required to
provide a reply telling userspace what was actually changed, compared to
what was requested. For that purpose, the "modified" flag provided by
ethnl_update_bitset() is not sufficient, we need full information which
bits were requested to change.

Therefore provide ethnl_parse_bitset() returning effective value and mask
bitmaps equivalent to the contents of a bitset nested attribute.

v2: use non-atomic __set_bit() (suggested by David Miller)

Signed-off-by: Michal Kubecek <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ethtool: provide netdev features with FEATURES_GET request

Implement FEATURES_GET request to get network device features. These are
traditionally available via ETHTOOL_GFEATURES ioctl request.

v2:
- style cleanup suggested by Jakub Kicinski

Signed-off-by: Michal Kubecek <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ethtool: update mapping of features to legacy ioctl requests

Legacy ioctl request like ETHTOOL_GTXCSUM are still used by ethtool utility
to get values of legacy flags (which rather work as feature groups). These
are calculated from values of actual features and request to set them is
implemented as an attempt to set all features mapping to them but there are
two inconsistencies:

- tx-checksum-fcoe-crc is shown under tx-checksumming but NETIF_F_FCOE_CRC
is not included in ETHTOOL_GTXCSUM/ETHTOOL_STXCSUM
- tx-scatter-gather-fraglist is shown under scatter-gather but
NETIF_F_FRAGLIST is not included in ETHTOOL_GSG/ETHTOOL_SSG

As the mapping in ethtool output is more correct from logical point of
view, fix ethtool_get_feature_mask() to match it.

Signed-off-by: Michal Kubecek <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ethtool: rename ethnl_parse_header() to ethnl_parse_header_dev_get()

Andrew Lunn pointed out that even if it's documented that
ethnl_parse_header() takes reference to network device if it fills it
into the target structure, its name doesn't make it apparent so that
corresponding dev_put() looks like mismatched.

Rename the function ethnl_parse_header_dev_get() to indicate that it
takes a reference.

Suggested-by: Andrew Lunn <[email protected]>
Signed-off-by: Michal Kubecek <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

cifs_atomic_open(): fix double-put on late allocation failure

several iterations of ->atomic_open() calling conventions ago, we
used to need fput() if ->atomic_open() failed at some point after
successful finish_open().  Now (since 2016) it's not needed -
struct file carries enough state to make fput() work regardless
of the point in struct file lifecycle and discarding it on
failure exits in open() got unified.  Unfortunately, I'd missed
the fact that we had an instance of ->atomic_open() (cifs one)
that used to need that fput(), as well as the stale comment in
finish_open() demanding such late failure handling.  Trivially
fixed...

Fixes: fe9ec8291fca "do_last(): take fput() on error after opening to out:"
Cc: [email protected] # v4.7+
Signed-off-by: Al Viro <[email protected]>

gfs2_atomic_open(): fix O_EXCL|O_CREAT handling on cold dcache

with the way fs/namei.c:do_last() had been done, ->atomic_open()
instances needed to recognize the case when existing file got
found with O_EXCL|O_CREAT, either by falling back to finish_no_open()
or failing themselves. gfs2 one didn't.

Fixes: 6d4ade986f9c (GFS2: Add atomic_open support)
Cc: [email protected] # v3.11
Signed-off-by: Al Viro <[email protected]>

Merge branch 'Introduce-connection-tracking-offload'

Paul Blakey says:

====================
Introduce connection tracking offload

Background
----------

The connection tracking action provides the ability to associate connection state to a packet.
The connection state may be used for stateful packet processing such as stateful firewalls
and NAT operations.

Connection tracking in TC SW
----------------------------

The CT state may be matched only after the CT action is performed.
As such, CT use cases are commonly implemented using multiple chains.
Consider the following TC filters, as an example:
1. tc filter add dev ens1f0_0 ingress prio 1 chain 0 proto ip flower \
    src_mac 24:8a:07:a5:28:01 ct_state -trk \
    action ct \
    pipe action goto chain 2

2. tc filter add dev ens1f0_0 ingress prio 1 chain 2 proto ip flower \
    ct_state +trk+new \
    action ct commit \
    pipe action tunnel_key set \
        src_ip 0.0.0.0 \
        dst_ip 7.7.7.8 \
        id 98 \
        dst_port 4789 \
    action mirred egress redirect dev vxlan0

3. tc filter add dev ens1f0_0 ingress prio 1 chain 2 proto ip flower \
    ct_state +trk+est \
    action tunnel_key set \
        src_ip 0.0.0.0 \
        dst_ip 7.7.7.8 \
        id 98 \
        dst_port 4789 \
    action mirred egress redirect dev vxlan0

Filter #1 (chain 0) decides, after initial packet classification, to send the packet to the
connection tracking module (ct action).
Once the ct_state is initialized by the CT action the packet processing continues on chain 2.

Chain 2 classifies the packet based on the ct_state.
Filter #2 matches on the +trk+new CT state while filter #3 matches on the +trk+est ct_state.

MLX5 Connection tracking HW offload - MLX5 driver patches
------------------------------

The MLX5 hardware model aligns with the software model by realizing a multi-table
architecture. In SW the TC CT action sets the CT state on the skb. Similarly,
HW sets the CT state on a HW register. Driver gets this CT state while offloading
a tuple with a new ct_metadata action that provides it.

Matches on ct_state are translated to HW register matches.

TC filter with CT action broken to two rules, a pre_ct rule, and a post_ct rule.
pre_ct rule:
   Inserted on the corrosponding tc chain table, matches on original tc match, with
   actions: any pre ct actions, set fte_id, set zone, and goto the ct table.
   The fte_id is a register mapping uniquely identifying this filter.
post_ct_rule:
   Inserted in a post_ct table, matches on the fte_id register mapping, with
   actions: counter + any post ct actions (this is usally 'goto chain X')

post_ct table is a table that all the tuples inserted to the ct table goto, so
if there is a tuple hit, packet will continue from ct table to post_ct table,
after being marked with the CT state (mark/label..)

This design ensures that the rule's actions and counters will be executed only after a CT hit.
HW misses will continue processing in SW from the last chain ID that was processed in hardware.

The following illustrates the HW model:

+-------------------+      +--------------------+    +--------------+
+ pre_ct (tc chain) +----->+ CT (nat or no nat) +--->+ post_ct      +----->
+ original match    +   |  + tuple + zone match + |  + fte_id match +  |
+-------------------+   |  +--------------------+ |  +--------------+  |
                        v                         v                    v
                     set chain miss mapping    set mark             original
                     set fte_id                set label            filter
                     set zone                  set established      actions
                     set tunnel_id             do nat (if needed)
                     do decap

To fill CT table, driver registers a CB for flow offload events, for each new
flow table that is passed to it from offloading ct actions. Once a flow offload
event is triggered on this CB, offload this flow to the hardware CT table.

Established events offload
--------------------------

Currently, act_ct maintains an FT instance per ct zone. Flow table entries
are created, per ct connection, when connections enter an established
state and deleted otherwise. Once an entry is created, the FT assumes
ownership of the entries, and manages their aging. FT is used for software
offload of conntrack. FT entries associate 5-tuples with an action list.

The act_ct changes in this patchset:
Populate the action list with a (new) ct_metadata action, providing the
connection's ct state (zone,mark and label), and mangle actions if NAT
is configured.

Pass the action's flow table instance as ct action entry parameter,
so  when the action is offloaded, the driver may register a callback on
it's block to receive FT flow offload add/del/stats events.

Netilter changes
--------------------------
The netfilter changes export the relevant bits, and add the relevant CBs
to support the above.

Applying this patchset
--------------------------

On top of current net-next ("r8169: simplify getting stats by using netdev_stats_to_stats64"),
pull Saeed's ct-offload branch, from git git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git
and fix the following non trivial conflict in fs_core.c as follows:

Then apply this patchset.

Changelog:
  v2->v3:
    Added the first two patches needed after rebasing on net-next:
     "net/mlx5: E-Switch, Enable reg c1 loopback when possible"
     "net/mlx5e: en_rep: Create uplink rep root table after eswitch offloads table"
====================

Signed-off-by: David S. Miller <[email protected]>

net/mlx5e: CT: Support clear action

Clear action, as with software, removes all ct metadata from
the packet.

Signed-off-by: Paul Blakey <[email protected]>
Reviewed-by: Oz Shlomo <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5e: CT: Handle misses after executing CT action

Mark packets with a unique tupleid, and on miss use that id to get
the act ct restore_cookie. Using that restore cookie, we ask CT to
restore the relevant info on the SKB.

Signed-off-by: Paul Blakey <[email protected]>
Reviewed-by: Oz Shlomo <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5e: CT: Offload established flows

Register driver callbacks with the nf flow table platform.
FT add/delete events will create/delete FTE in the CT/CT_NAT tables.

Restoring the CT state on miss will be added in the following patch.

Signed-off-by: Paul Blakey <[email protected]>
Reviewed-by: Oz Shlomo <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5e: CT: Introduce connection tracking

Add support for offloading tc ct action and ct matches.
We translate the tc filter with CT action the following HW model:

+-------------------+      +--------------------+    +--------------+
+ pre_ct (tc chain) +----->+ CT (nat or no nat) +--->+ post_ct      +----->
+ original match    +  |   + tuple + zone match + |  + fte_id match +  |
+-------------------+  |   +--------------------+ |  +--------------+  |
                       v                          v                    v
                      set chain miss mapping  set mark             original
                      set fte_id              set label            filter
                      set zone                set established      actions
                      set tunnel_id           do nat (if needed)
                      do decap

Signed-off-by: Paul Blakey <[email protected]>
Reviewed-by: Oz Shlomo <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

flow_offload: Add flow_match_ct to get rule ct match

Add relevant getter for ct info dissector.

Signed-off-by: Paul Blakey <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5: E-Switch, Support getting chain mapping

Currently, we write chain register mapping on miss from the the last
prio of a chain. It is used to restore the chain in software.

To support re-using the chain register mapping from global tables (such
as CT tuple table) misses, export the chain mapping.

Signed-off-by: Paul Blakey <[email protected]>
Reviewed-by: Oz Shlomo <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5: E-Switch, Add support for offloading rules with no in_port

FTEs in global tables may match on packets from multiple in_ports.
Provide the capability to omit the in_port match condition.

Signed-off-by: Paul Blakey <[email protected]>
Reviewed-by: Oz Shlomo <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5: E-Switch, Introduce global tables

Currently, flow tables are automatically connected according to their
<chain,prio,level> tuple.

Introduce global tables which are flow tables that are detached from the
eswitch chains processing, and will be connected by explicitly referencing
them from multiple chains.

Add this new table type, and allow connecting them by refenece.

Signed-off-by: Paul Blakey <[email protected]>
Reviewed-by: Oz Shlomo <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/sched: act_ct: Enable hardware offload of flow table entires

Pass the zone's flow table instance on the flow action to the drivers.
Thus, allowing drivers to register FT add/del/stats callbacks.

Finally, enable hardware offload on the flow table instance.

Signed-off-by: Paul Blakey <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/sched: act_ct: Support refreshing the flow table entries

If driver deleted an FT entry, a FT failed to offload, or registered to the
flow table after flows were already added, we still get packets in
software.

For those packets, while restoring the ct state from the flow table
entry, refresh it's hardware offload.

Signed-off-by: Paul Blakey <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/sched: act_ct: Support restoring conntrack info on skbs

Provide an API to restore the ct state pointer.

This may be used by drivers to restore the ct state if they
miss in tc chain after they already did the hardware connection
tracking action (ct_metadata action).

For example, consider the following rule on chain 0 that is in_hw,
however chain 1 is not_in_hw:

$ tc filter add dev ... chain 0 ... \
flower ... action ct pipe action goto chain 1

Packets of a flow offloaded (via nf flow table offload) by the driver
hit this rule in hardware, will be marked with the ct metadata action
(mark, label, zone) that does the equivalent of the software ct action,
and when the packet jumps to hardware chain 1, there would be a miss.

CT was already processed in hardware. Therefore, the driver's miss
handling should restore the ct state on the skb, using the provided API,
and continue the packet processing in chain 1.

Signed-off-by: Paul Blakey <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/sched: act_ct: Instantiate flow table entry actions

NF flow table API associate 5-tuple rule with an action list by calling
the flow table type action() CB to fill the rule's actions.

In action CB of act_ct, populate the ct offload entry actions with a new
ct_metadata action. Initialize the ct_metadata with the ct mark, label and
zone information. If ct nat was performed, then also append the relevant
packet mangle actions (e.g. ipv4/ipv6/tcp/udp header rewrites).

Drivers that offload the ft entries may match on the 5-tuple and perform
the action list.

Signed-off-by: Paul Blakey <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Reviewed-by: Edward Cree <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

netfilter: flowtable: Add API for registering to flow table events

Let drivers to add their cb allowing them to receive flow offload events
of type TC_SETUP_CLSFLOWER (REPLACE/DEL/STATS) for flows managed by the
flow table.

Signed-off-by: Paul Blakey <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5e: en_rep: Create uplink rep root table after eswitch offloads table

The eswitch offloads table, which has the reps (vport) rx miss rules,
was moved from OFFLOADS namespace [0,0] (prio, level), to [1,0], so
the restore table (the new [0,0]) can come before it. The destinations
of these miss rules is the rep root ft (ttc for non uplink reps).

Uplink rep root ft is created as OFFLOADS namespace [0,1], and is used
as a hook to next RX prio (either ethtool or ttc), but this fails to
pass fs_core level's check.

Move uplink rep root ft to OFFLOADS prio 1, level 1 ([1,1]), so it
will keep the same relative position after the restore table
change.

Signed-off-by: Paul Blakey <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5: E-Switch, Enable reg c1 loopback when possible

Enable reg c1 loopback if firmware reports it's supported,
as this is needed for restoring packet metadata (e.g chain).

Also define helper to query if it is enabled.

Signed-off-by: Paul Blakey <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'ct-offload' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Merge branch 'bind_addr_zero'

Kuniyuki Iwashima says:

====================
Improve bind(addr, 0) behaviour.

Currently we fail to bind sockets to ephemeral ports when all of the ports
are exhausted even if all sockets have SO_REUSEADDR enabled. In this case,
we still have a chance to connect to the different remote hosts.

These patches add net.ipv4.ip_autobind_reuse option and fix the behaviour
to fully utilize all space of the local (addr, port) tuples.

Changes in v5:
  - Add more description to documents.
  - Fix sysctl option to use proc_dointvec_minmax.
  - Remove the Fixes: tag and squash two commits.

Changes in v4:
  - Add net.ipv4.ip_autobind_reuse option to not change the current behaviour.
  - Modify .gitignore for test.
  https://lore.kernel.org/netdev/20200308181615 [email protected]/

Changes in v3:
  - Change the title and write more specific description of the 3rd patch.
  - Add a test in tools/testing/selftests/net/ as the 4th patch.
  https://lore.kernel.org/netdev/20200229113554 [email protected]/

Changes in v2:
  - Change the description of the 2nd patch ('localhost' -> 'address').
  - Correct the description and the if statement of the 3rd patch.
  https://lore.kernel.org/netdev/20200226074631 [email protected]/

v1 with tests:
  https://lore.kernel.org/netdev/20200220152020 [email protected]/
====================

Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

selftests: net: Add SO_REUSEADDR test to check if 4-tuples are fully utilized.

This commit adds a test to check if we can fully utilize 4-tuples for
connect() when all ephemeral ports are exhausted.

The test program changes the local port range to use only one port and binds
two sockets with or without SO_REUSEADDR and SO_REUSEPORT, and with the same
EUID or with different EUIDs, then do listen().

We should be able to bind only one socket having both SO_REUSEADDR and
SO_REUSEPORT per EUID, which restriction is to prevent unintentional
listen().

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: Forbid to bind more than one sockets haveing SO_REUSEADDR and SO_REUSEPORT per EUID.

If there is no TCP_LISTEN socket on a ephemeral port, we can bind multiple
sockets having SO_REUSEADDR to the same port. Then if all sockets bound to
the port have also SO_REUSEPORT enabled and have the same EUID, all of them
can be listened. This is not safe.

Let's say, an application has root privilege and binds sockets to an
ephemeral port with both of SO_REUSEADDR and SO_REUSEPORT. When none of
sockets is not listened yet, a malicious user can use sudo, exhaust
ephemeral ports, and bind sockets to the same ephemeral port, so he or she
can call listen and steal the port.

To prevent this issue, we must not bind more than one sockets that have the
same EUID and both of SO_REUSEADDR and SO_REUSEPORT.

On the other hand, if the sockets have different EUIDs, the issue above does
not occur. After sockets with different EUIDs are bound to the same port and
one of them is listened, no more socket can be listened. This is because the
condition below is evaluated true and listen() for the second socket fails.

} else if (!reuseport_ok ||
   !reuseport || !sk2->sk_reuseport ||
   rcu_access_pointer(sk->sk_reuseport_cb) ||
   (sk2->sk_state != TCP_TIME_WAIT &&
    !uid_eq(uid, sock_i_uid(sk2)))) {
if (inet_rcv_saddr_equal(sk, sk2, true))
break;
}

Therefore, on the same port, we cannot do listen() for multiple sockets with
different EUIDs and any other listen syscalls fail, so the problem does not
happen. In this case, we can still call connect() for other sockets that
cannot be listened, so we have to succeed to call bind() in order to fully
utilize 4-tuples.

Summarizing the above, we should be able to bind only one socket having
SO_REUSEADDR and SO_REUSEPORT per EUID.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: bind(0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted.

Commit aacd9289af8b82f5fb01bcdd53d0e3406d1333c7 ("tcp: bind() use stronger
condition for bind_conflict") introduced a restriction to forbid to bind
SO_REUSEADDR enabled sockets to the same (addr, port) tuple in order to
assign ports dispersedly so that we can connect to the same remote host.

The change results in accelerating port depletion so that we fail to bind
sockets to the same local port even if we want to connect to the different
remote hosts.

You can reproduce this issue by following instructions below.

  1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
  2. set SO_REUSEADDR to two sockets.
  3. bind two sockets to (localhost, 0) and the latter fails.

Therefore, when ephemeral ports are exhausted, bind(0) should fallback to
the legacy behaviour to enable the SO_REUSEADDR option and make it possible
to connect to different remote (addr, port) tuples.

This patch allows us to bind SO_REUSEADDR enabled sockets to the same
(addr, port) only when net.ipv4.ip_autobind_reuse is set 1 and all
ephemeral ports are exhausted. This also allows connect() and listen() to
share ports in the following way and may break some applications. So the
ip_autobind_reuse is 0 by default and disables the feature.

  1. setsockopt(sk1, SO_REUSEADDR)
  2. setsockopt(sk2, SO_REUSEADDR)
  3. bind(sk1, saddr, 0)
  4. bind(sk2, saddr, 0)
  5. connect(sk1, daddr)
  6. listen(sk2)

If it is set 1, we can fully utilize the 4-tuples, but we should use
IP_BIND_ADDRESS_NO_PORT for bind()+connect() as possible.

The notable thing is that if all sockets bound to the same port have
both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
ephemeral port and also do listen().

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: Remove unnecessary conditions in inet_csk_bind_conflict().

When we get an ephemeral port, the relax is false, so the SO_REUSEADDR
conditions may be evaluated twice. We do not need to check the conditions
again.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'hns3-fixes'

Huazhong Tan says:

====================
net: hns3: fixes for -net

This series includes several bugfixes for the HNS3 ethernet driver.

[patch 1] fixes an "tc qdisc del" failure.
[patch 2] fixes SW & HW VLAN table not consistent issue.
[patch 3] fixes a RMW issue related to VLAN filter switch.
[patch 4] clears port based VLAN when uploading PF.
====================

Signed-off-by: David S. Miller <[email protected]>

net: hns3: clear port base VLAN when unload PF

Currently, PF missed to clear the port base VLAN for VF when
unload. In this case, the VLAN id will remain in the VLAN
table. This patch fixes it.

Fixes: 92f11ea177cd ("net: hns3: fix set port based VLAN issue for VF")
Signed-off-by: Jian Shen <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: fix RMW issue for VLAN filter switch

According to the user manual, the ingress and egress VLAN filter
are configured at the same time. Currently, hclge_init_vlan_config()
and hclge_set_vlan_spoofchk() will both change the VLAN filter
switch. So it's necessary to read the old configuration before
modifying it.

Fixes: 22044f95faa0 ("net: hns3: add support for spoof check setting")
Signed-off-by: Jian Shen <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: fix VF VLAN table entries inconsistent issue

Currently, if VF is loaded on the host side, the host doesn't
clear the VF's VLAN table entries when VF removing. In this
case, when doing reset and disabling sriov at the same time the
VLAN device over VF will be removed, but the VLAN table entries
in hardware are remained.

This patch fixes it by asking PF to clear the VLAN table entries for
VF when VF is removing. It also clears the VLAN table full bit
after VF VLAN table entries being cleared.

Fixes: c6075b193462 ("net: hns3: Record VF vlan tables")
Signed-off-by: Jian Shen <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: fix "tc qdisc del" failed issue

The HNS3 driver supports to configure TC numbers and TC to priority
map via "tc" tool. But when delete the rule, will fail, because
the HNS3 driver needs at least one TC, but the "tc" tool sets TC
number to zero when delete.

This patch makes sure that the TC number is at least one.

Fixes: 30d240dfa2e8 ("net: hns3: Add mqprio hardware offload support in hns3 driver")
Signed-off-by: Yonglong Liu <[email protected]>
Signed-off-by: Huazhong Tan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'ethtool-consolidate-irq-coalescing-part-4'

Jakub Kicinski says:

====================
ethtool: consolidate irq coalescing - part 4

Convert more drivers following the groundwork laid in a recent
patch set [1] and continued in [2], [3]. The aim of the effort
is to consolidate irq coalescing parameter validation in the core.

This set converts 15 drivers in drivers/net/ethernet - remaining
Intel drivers, Freescale/NXP, and others.
2 more conversion sets to come.

[1] https://lore.kernel.org/netdev/20200305051542 [email protected]/
[2] https://lore.kernel.org/netdev/20200306010602.1620354 [email protected]/
[3] https://lore.kernel.org/netdev/20200310021512.1861626 [email protected]/
====================

Signed-off-by: David S. Miller <[email protected]>

net: ixgbevf: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ixgbe: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: igc: let core reject the unsupported coalescing parameters

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver was rejecting almost all unsupported
parameters already, it was only missing a check
for tx_max_coalesced_frames_irq.

As a side effect of these changes the error code for
unsupported params changes from ENOTSUPP to EOPNOTSUPP.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: igbvf: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: igb: let core reject the unsupported coalescing parameters

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver was rejecting almost all unsupported
parameters already, it was only missing a check
for tx_max_coalesced_frames_irq.

As a side effect of these changes the error code for
unsupported params changes from ENOTSUPP to EOPNOTSUPP.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: iavf: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: i40e: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: fm10k: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <[email protected]>
Acked-by: Jacob Keller <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: e1000: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: gianfar: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: fec: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <[email protected]>
Acked-by: Fugang Duan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dpaa: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters
(other than adaptive rx, which will now be rejected by core).

Signed-off-by: Jakub Kicinski <[email protected]>
Acked-by: Madalin Bucur <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: be2net: reject unsupported coalescing params

Set ethtool_ops->supported_coalesce_params to let
the core reject unsupported coalescing parameters.

This driver did not previously reject unsupported parameters.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sfc: ethtool: Refactor to remove fallthrough comments in case blocks

Converting fallthrough comments to fallthrough; creates warnings
in this code when compiled with gcc.

This code is overly complicated and reads rather better with a
little refactoring and no fallthrough uses at all.

Remove the fallthrough comments and simplify the written source
code while reducing the object code size.

Consolidate duplicated switch/case blocks for IPV4 and IPV6.

defconfig x86-64 with sfc:

$ size drivers/net/ethernet/sfc/ethtool.o*
   text    data     bss     dec     hex filename
  10055      12       0   10067    2753 drivers/net/ethernet/sfc/ethtool.o.new
  10135      12       0   10147    27a3 drivers/net/ethernet/sfc/ethtool.o.old

Signed-off-by: Joe Perches <[email protected]>
Acked-by: Martin Habets <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

taprio: Fix sending packets without dequeueing them

There was a bug that was causing packets to be sent to the driver
without first calling dequeue() on the "child" qdisc. And the KASAN
report below shows that sending a packet without calling dequeue()
leads to bad results.

The problem is that when checking the last qdisc "child" we do not set
the returned skb to NULL, which can cause it to be sent to the driver,
and so after the skb is sent, it may be freed, and in some situations a
reference to it may still be in the child qdisc, because it was never
dequeued.

The crash log looks like this:

[   19.937538] ==================================================================
[   19.938300] BUG: KASAN: use-after-free in taprio_dequeue_soft+0x620/0x780
[   19.938968] Read of size 4 at addr ffff8881128628cc by task swapper/1/0
[   19.939612]
[   19.939772] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc3+ #97
[   19.940397] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qe4
[   19.941523] Call Trace:
[   19.941774]  <IRQ>
[   19.941985]  dump_stack+0x97/0xe0
[   19.942323]  print_address_description.constprop.0+0x3b/0x60
[   19.942884]  ? taprio_dequeue_soft+0x620/0x780
[   19.943325]  ? taprio_dequeue_soft+0x620/0x780
[   19.943767]  __kasan_report.cold+0x1a/0x32
[   19.944173]  ? taprio_dequeue_soft+0x620/0x780
[   19.944612]  kasan_report+0xe/0x20
[   19.944954]  taprio_dequeue_soft+0x620/0x780
[   19.945380]  __qdisc_run+0x164/0x18d0
[   19.945749]  net_tx_action+0x2c4/0x730
[   19.946124]  __do_softirq+0x268/0x7bc
[   19.946491]  irq_exit+0x17d/0x1b0
[   19.946824]  smp_apic_timer_interrupt+0xeb/0x380
[   19.947280]  apic_timer_interrupt+0xf/0x20
[   19.947687]  </IRQ>
[   19.947912] RIP: 0010:default_idle+0x2d/0x2d0
[   19.948345] Code: 00 00 41 56 41 55 65 44 8b 2d 3f 8d 7c 7c 41 54 55 53 0f 1f 44 00 00 e8 b1 b2 c5 fd e9 07 00 3
[   19.950166] RSP: 0018:ffff88811a3efda0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
[   19.950909] RAX: 0000000080000000 RBX: ffff88811a3a9600 RCX: ffffffff8385327e
[   19.951608] RDX: 1ffff110234752c0 RSI: 0000000000000000 RDI: ffffffff8385262f
[   19.952309] RBP: ffffed10234752c0 R08: 0000000000000001 R09: ffffed10234752c1
[   19.953009] R10: ffffed10234752c0 R11: ffff88811a3a9607 R12: 0000000000000001
[   19.953709] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[   19.954408]  ? default_idle_call+0x2e/0x70
[   19.954816]  ? default_idle+0x1f/0x2d0
[   19.955192]  default_idle_call+0x5e/0x70
[   19.955584]  do_idle+0x3d4/0x500
[   19.955909]  ? arch_cpu_idle_exit+0x40/0x40
[   19.956325]  ? _raw_spin_unlock_irqrestore+0x23/0x30
[   19.956829]  ? trace_hardirqs_on+0x30/0x160
[   19.957242]  cpu_startup_entry+0x19/0x20
[   19.957633]  start_secondary+0x2a6/0x380
[   19.958026]  ? set_cpu_sibling_map+0x18b0/0x18b0
[   19.958486]  secondary_startup_64+0xa4/0xb0
[   19.958921]
[   19.959078] Allocated by task 33:
[   19.959412]  save_stack+0x1b/0x80
[   19.959747]  __kasan_kmalloc.constprop.0+0xc2/0xd0
[   19.960222]  kmem_cache_alloc+0xe4/0x230
[   19.960617]  __alloc_skb+0x91/0x510
[   19.960967]  ndisc_alloc_skb+0x133/0x330
[   19.961358]  ndisc_send_ns+0x134/0x810
[   19.961735]  addrconf_dad_work+0xad5/0xf80
[   19.962144]  process_one_work+0x78e/0x13a0
[   19.962551]  worker_thread+0x8f/0xfa0
[   19.962919]  kthread+0x2ba/0x3b0
[   19.963242]  ret_from_fork+0x3a/0x50
[   19.963596]
[   19.963753] Freed by task 33:
[   19.964055]  save_stack+0x1b/0x80
[   19.964386]  __kasan_slab_free+0x12f/0x180
[   19.964830]  kmem_cache_free+0x80/0x290
[   19.965231]  ip6_mc_input+0x38a/0x4d0
[   19.965617]  ipv6_rcv+0x1a4/0x1d0
[   19.965948]  __netif_receive_skb_one_core+0xf2/0x180
[   19.966437]  netif_receive_skb+0x8c/0x3c0
[   19.966846]  br_handle_frame_finish+0x779/0x1310
[   19.967302]  br_handle_frame+0x42a/0x830
[   19.967694]  __netif_receive_skb_core+0xf0e/0x2a90
[   19.968167]  __netif_receive_skb_one_core+0x96/0x180
[   19.968658]  process_backlog+0x198/0x650
[   19.969047]  net_rx_action+0x2fa/0xaa0
[   19.969420]  __do_softirq+0x268/0x7bc
[   19.969785]
[   19.969940] The buggy address belongs to the object at ffff888112862840
[   19.969940]  which belongs to the cache skbuff_head_cache of size 224
[   19.971202] The buggy address is located 140 bytes inside of
[   19.971202]  224-byte region [ffff888112862840, ffff888112862920)
[   19.972344] The buggy address belongs to the page:
[   19.972820] page:ffffea00044a1800 refcount:1 mapcount:0 mapping:ffff88811a2bd1c0 index:0xffff8881128625c0 compo0
[   19.973930] flags: 0x8000000000010200(slab|head)
[   19.974388] raw: 8000000000010200 ffff88811a2ed650 ffff88811a2ed650 ffff88811a2bd1c0
[   19.975151] raw: ffff8881128625c0 0000000000190013 00000001ffffffff 0000000000000000
[   19.975915] page dumped because: kasan: bad access detected
[   19.976461] page_owner tracks the page as allocated
[   19.976946] page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NO)
[   19.978332]  prep_new_page+0x24b/0x330
[   19.978707]  get_page_from_freelist+0x2057/0x2c90
[   19.979170]  __alloc_pages_nodemask+0x218/0x590
[   19.979619]  new_slab+0x9d/0x300
[   19.979948]  ___slab_alloc.constprop.0+0x2f9/0x6f0
[   19.980421]  __slab_alloc.constprop.0+0x30/0x60
[   19.980870]  kmem_cache_alloc+0x201/0x230
[   19.981269]  __alloc_skb+0x91/0x510
[   19.981620]  alloc_skb_with_frags+0x78/0x4a0
[   19.982043]  sock_alloc_send_pskb+0x5eb/0x750
[   19.982476]  unix_stream_sendmsg+0x399/0x7f0
[   19.982904]  sock_sendmsg+0xe2/0x110
[   19.983262]  ____sys_sendmsg+0x4de/0x6d0
[   19.983660]  ___sys_sendmsg+0xe4/0x160
[   19.984032]  __sys_sendmsg+0xab/0x130
[   19.984396]  do_syscall_64+0xe7/0xae0
[   19.984761] page last free stack trace:
[   19.985142]  __free_pages_ok+0x432/0xbc0
[   19.985533]  qlist_free_all+0x56/0xc0
[   19.985907]  quarantine_reduce+0x149/0x170
[   19.986315]  __kasan_kmalloc.constprop.0+0x9e/0xd0
[   19.986791]  kmem_cache_alloc+0xe4/0x230
[   19.987182]  prepare_creds+0x24/0x440
[   19.987548]  do_faccessat+0x80/0x590
[   19.987906]  do_syscall_64+0xe7/0xae0
[   19.988276]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   19.988775]
[   19.988930] Memory state around the buggy address:
[   19.989402]  ffff888112862780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   19.990111]  ffff888112862800: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
[   19.990822] >ffff888112862880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   19.991529]                                               ^
[   19.992081]  ffff888112862900: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
[   19.992796]  ffff888112862980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
Reported-by: Michael Schmidt <[email protected]>
Signed-off-by: Vinicius Costa Gomes <[email protected]>
Acked-by: Andre Guedes <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Revert "net: sched: make newly activated qdiscs visible"

This reverts commit 4cda75275f9f89f9485b0ca4d6950c95258a9bce
from net-next.

Brown bag time.

Michal noticed that this change doesn't work at all when
netif_set_real_num_tx_queues() gets called prior to an initial
dev_activate(), as for instance igb does.

Doing so dies with:

[   40.579142] BUG: kernel NULL pointer dereference, address: 0000000000000400
[   40.586922] #PF: supervisor read access in kernel mode
[   40.592668] #PF: error_code(0x0000) - not-present page
[   40.598405] PGD 0 P4D 0
[   40.601234] Oops: 0000 [#1] PREEMPT SMP PTI
[   40.605909] CPU: 18 PID: 1681 Comm: wickedd Tainted: G            E     5.6.0-rc3-ethnl.50-default #1
[   40.616205] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
[   40.627377] RIP: 0010:qdisc_hash_add.part.22+0x2e/0x90
[   40.633115] Code: 00 55 53 89 f5 48 89 fb e8 2f 9b fb ff 85 c0 74 44 48 8b 43 40 48 8b 08 69 43 38 47 86 c8 61 c1 e8 1c 48 83 e8 80 48 8d 14 c1 <48> 8b 04 c1 48 8d 4b 28 48 89 53 30 48 89 43 28 48 85 c0 48 89 0a
[   40.654080] RSP: 0018:ffffb879864934d8 EFLAGS: 00010203
[   40.659914] RAX: 0000000000000080 RBX: ffffffffb8328d80 RCX: 0000000000000000
[   40.667882] RDX: 0000000000000400 RSI: 0000000000000000 RDI: ffffffffb831faa0
[   40.675849] RBP: 0000000000000000 R08: ffffa0752c8b9088 R09: ffffa0752c8b9208
[   40.683816] R10: 0000000000000006 R11: 0000000000000000 R12: ffffa0752d734000
[   40.691783] R13: 0000000000000008 R14: 0000000000000000 R15: ffffa07113c18000
[   40.699750] FS:  00007f94548e5880(0000) GS:ffffa0752e980000(0000) knlGS:0000000000000000
[   40.708782] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   40.715189] CR2: 0000000000000400 CR3: 000000082b6ae006 CR4: 00000000001606e0
[   40.723156] Call Trace:
[   40.725888]  dev_qdisc_set_real_num_tx_queues+0x61/0x90
[   40.731725]  netif_set_real_num_tx_queues+0x94/0x1d0
[   40.737286]  __igb_open+0x19a/0x5d0 [igb]
[   40.741767]  __dev_open+0xbb/0x150
[   40.745567]  __dev_change_flags+0x157/0x1a0
[   40.750240]  dev_change_flags+0x23/0x60

[...]

Fixes: 4cda75275f9f ("net: sched: make newly activated qdiscs visible")
Reported-by: Michal Kubecek <[email protected]>
CC: Michal Kubecek <[email protected]>
CC: Eric Dumazet <[email protected]>
CC: Jamal Hadi Salim <[email protected]>
CC: Cong Wang <[email protected]>
CC: Jiri Pirko <[email protected]>
Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'for-linus-5.6-2' of git://github.com/cminyard/linux-ipmi

Pull IPMI fix from Corey Minyard:
"Fix a message spew on some system

  The call to platform_get_irq() was changed to print a log if the
  interrupt was not available, and that was causing bogus messages to
  spew out for the IPMI driver. People have requested that this get in
  to 5.6 so I'm sending it along"

* tag 'for-linus-5.6-2' of git://github.com/cminyard/linux-ipmi:
  ipmi_si: Avoid spurious errors for optional IRQs

Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto fix from Herbert Xu:
"Fix a build problem with x86/curve25519"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: x86/curve25519 - support assemblers with no adx support

bpftool: Use linux/types.h from source tree for profiler build

When compiling bpftool on a system where the /usr/include/asm symlink
doesn't exist (e.g. on an Ubuntu system without gcc-multilib installed),
the build fails with:

    CLANG    skeleton/profiler.bpf.o
  In file included from skeleton/profiler.bpf.c:4:
  In file included from /usr/include/linux/bpf.h:11:
  /usr/include/linux/types.h:5:10: fatal error: 'asm/types.h' file not found
  #include <asm/types.h>
           ^~~~~~~~~~~~~
  1 error generated.
  make: *** [Makefile:123: skeleton/profiler.bpf.o] Error 1

This indicates that the build is using linux/types.h from system headers
instead of source tree headers.

To fix this, adjust the clang search path to include the necessary
headers from tools/testing/selftests/bpf/include/uapi and
tools/include/uapi. Also use __bitwise__ instead of __bitwise in
skeleton/profiler.h to avoid clashing with the definition in
tools/testing/selftests/bpf/include/uapi/linux/types.h.

Signed-off-by: Tobias Klauser <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

dt-bindings: soc: qcom: fix IPA binding

The definitions for the "qcom,smem-states" and "qcom,smem-state-names"
properties need to list their "$ref" under an "allOf" keyword.

In addition, fix two problems in the example at the end:
  - Use #include for header files that define needed symbolic values
  - Terminate the line that includes the "ipa-shared" register space
    name with a comma rather than a semicolon

Finally, update some white space in the example for better alignment.

Signed-off-by: Alex Elder <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: mvmdio: avoid error message for optional IRQ

Per the dt-binding the interrupt is optional so use
platform_get_irq_optional() instead of platform_get_irq(). Since
commit 7723f4c5ecdb ("driver core: platform: Add an error message to
platform_get_irq*()") platform_get_irq() produces an error message

orion-mdio f1072004.mdio: IRQ index 0 not found

which is perfectly normal if one hasn't specified the optional property
in the device tree.

Signed-off-by: Chris Packham <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: mv88e6xxx: Add missing mask of ATU occupancy register

Only the bottom 12 bits contain the ATU bin occupancy statistics. The
upper bits need masking off.

Fixes: e0c69ca7dfbb ("net: dsa: mv88e6xxx: Add ATU occupancy via devlink resources")
Signed-off-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: mptcp: don't hang before sending 'MP capable with data'

the following packetdrill script

  socket(..., SOCK_STREAM, IPPROTO_MPTCP) = 3
  fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)
  fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
  connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
  > S 0:0(0) <mss 1460,sackOK,TS val 100 ecr 0,nop,wscale 8,mpcapable v1 flags[flag_h] nokey>
  < S. 0:0(0) ack 1 win 65535 <mss 1460,sackOK,TS val 700 ecr 100,nop,wscale 8,mpcapable v1 flags[flag_h] key[skey=2]>
  > . 1:1(0) ack 1 win 256 <nop, nop, TS val 100 ecr 700,mpcapable v1 flags[flag_h] key[ckey,skey]>
  getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
  fcntl(3, F_SETFL, O_RDWR) = 0
  write(3, ..., 1000) = 1000

doesn't transmit 1KB data packet after a successful three-way-handshake,
using mp_capable with data as required by protocol v1, and write() hangs
forever:

PID: 973    TASK: ffff97dd399cae80  CPU: 1   COMMAND: "packetdrill"
  #0 [ffffa9b94062fb78] __schedule at ffffffff9c90a000
  #1 [ffffa9b94062fc08] schedule at ffffffff9c90a4a0
  #2 [ffffa9b94062fc18] schedule_timeout at ffffffff9c90e00d
  #3 [ffffa9b94062fc90] wait_woken at ffffffff9c120184
  #4 [ffffa9b94062fcb0] sk_stream_wait_connect at ffffffff9c75b064
  #5 [ffffa9b94062fd20] mptcp_sendmsg at ffffffff9c8e801c
  #6 [ffffa9b94062fdc0] sock_sendmsg at ffffffff9c747324
  #7 [ffffa9b94062fdd8] sock_write_iter at ffffffff9c7473c7
  #8 [ffffa9b94062fe48] new_sync_write at ffffffff9c302976
  #9 [ffffa9b94062fed0] vfs_write at ffffffff9c305685
#10 [ffffa9b94062ff00] ksys_write at ffffffff9c305985
#11 [ffffa9b94062ff38] do_syscall_64 at ffffffff9c004475
#12 [ffffa9b94062ff50] entry_SYSCALL_64_after_hwframe at ffffffff9ca0008c
     RIP: 00007f959407eaf7  RSP: 00007ffe9e95a910  RFLAGS: 00000293
     RAX: ffffffffffffffda  RBX: 0000000000000008  RCX: 00007f959407eaf7
     RDX: 00000000000003e8  RSI: 0000000001785fe0  RDI: 0000000000000008
     RBP: 0000000001785fe0   R8: 0000000000000000   R9: 0000000000000003
     R10: 0000000000000007  R11: 0000000000000293  R12: 00000000000003e8
     R13: 00007ffe9e95ae30  R14: 0000000000000000  R15: 0000000000000000
     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

Fix it ensuring that socket state is TCP_ESTABLISHED on reception of the
third ack.

Fixes: 1954b86016cf ("mptcp: Check connection state before attempting send")
Suggested-by: Paolo Abeni <[email protected]>
Signed-off-by: Davide Caratti <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: memcg: fix lockdep splat in inet_csk_accept()

Locking newsk while still holding the listener lock triggered
a lockdep splat [1]

We can simply move the memcg code after we release the listener lock,
as this can also help if multiple threads are sharing a common listener.

Also fix a typo while reading socket sk_rmem_alloc.

[1]
WARNING: possible recursive locking detected
5.6.0-rc3-syzkaller #0 Not tainted
--------------------------------------------
syz-executor598/9524 is trying to acquire lock:
ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492

but task is already holding lock:
ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445

other info that might help us debug this:
Possible unsafe locking scenario:

       CPU0
       ----
  lock(sk_lock-AF_INET6);
  lock(sk_lock-AF_INET6);

*** DEADLOCK ***

May be due to missing lock nesting notation

1 lock held by syz-executor598/9524:
#0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
#0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445

stack backtrace:
CPU: 0 PID: 9524 Comm: syz-executor598 Not tainted 5.6.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x188/0x20d lib/dump_stack.c:118
print_deadlock_bug kernel/locking/lockdep.c:2370 [inline]
check_deadlock kernel/locking/lockdep.c:2411 [inline]
validate_chain kernel/locking/lockdep.c:2954 [inline]
__lock_acquire.cold+0x114/0x288 kernel/locking/lockdep.c:3954
lock_acquire+0x197/0x420 kernel/locking/lockdep.c:4484
lock_sock_nested+0xc5/0x110 net/core/sock.c:2947
lock_sock include/net/sock.h:1541 [inline]
inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492
inet_accept+0xe9/0x7c0 net/ipv4/af_inet.c:734
__sys_accept4_file+0x3ac/0x5b0 net/socket.c:1758
__sys_accept4+0x53/0x90 net/socket.c:1809
__do_sys_accept4 net/socket.c:1821 [inline]
__se_sys_accept4 net/socket.c:1818 [inline]
__x64_sys_accept4+0x93/0xf0 net/socket.c:1818
do_syscall_64+0xf6/0x790 arch/x86/entry/common.c:294
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4445c9
Code: e8 0c 0d 03 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffc35b37608 EFLAGS: 00000246 ORIG_RAX: 0000000000000120
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004445c9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000306777 R09: 0000000000306777
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00000000004053d0 R14: 0000000000000000 R15: 0000000000000000

Fixes: d752a4986532 ("net: memcg: late association of sock to memcg")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Shakeel Butt <[email protected]>
Reported-by: syzbot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>