Git Repo - linux.git/log

net: dsa: remove useless args of dsa_slave_create

dsa_slave_create currently takes 4 arguments while it only needs the
related dsa_port and its name. Remove all other arguments.

Signed-off-by: Vivien Didelot <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: remove useless args of dsa_cpu_dsa_setup

dsa_cpu_dsa_setup currently takes 4 arguments but they are all available
from the dsa_port argument. Remove all others.

Signed-off-by: Vivien Didelot <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: remove useless argument in legacy setup

dsa_switch_alloc() already assigns ds-dev, which can be used in
dsa_switch_setup_one and dsa_cpu_dsa_setups instead of requiring an
additional struct device argument.

Signed-off-by: Vivien Didelot <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: hns3: fix spelling mistake: "capabilty" -> "capability"

Trivial fix to spelling mistake in dev_err error message and also
split overly long line to avoid a checkpatch warning.

Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'Refactor-lan9303_xxx_packet_processing'

Egil Hjelmeland says:

====================
Refactor lan9303_xxx_packet_processing

This series is purely non functional.

It changes the lan9303_enable_packet_processing,
lan9303_disable_packet_processing() to pass port number (0,1,2) as
parameter instead of port offset. This aligns them with
other functions in the module, and makes it possible to simplify the code.

The lan9303_enable_packet_processing, lan9303_disable_packet_processing
functions operate on port. Therefore rename the functions to reflect that
as well.

Reviewer pointed out lan9303_get_ethtool_stats would be better off with
the use of a lan9303_read_switch_port(). So that was added to the series.

Changes v3 -> v4:
- Whitespace adjustments.

Changes v2 -> v3:
- Patch 1: Removed the change in lan9303_get_ethtool_stats
- Added patch 4: rename lan9303_xxx_packet_processing
- Added patch 5: refactor lan9303_get_ethtool_stats

Changes v1 -> v2:
- introduced lan9303_write_switch_port() in first patch
- inserted LAN9303_NUM_PORTS patch
- Use LAN9303_NUM_PORTS in last patch. Plus whitespace change.
====================

Signed-off-by: David S. Miller <[email protected]>

net: dsa: lan9303: refactor lan9303_get_ethtool_stats

In lan9303_get_ethtool_stats: Get rid of 0x400 constant magic
by using new lan9303_read_switch_reg() inside loop.
Reduced scope of two variables.

Signed-off-by: Egil Hjelmeland <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: lan9303: Rename lan9303_xxx_packet_processing()

The lan9303_enable_packet_processing, lan9303_disable_packet_processing
functions operate on port, so the names should reflect that.
And to align with lan9303_disable_processing(), rename:

lan9303_enable_packet_processing -> lan9303_enable_processing_port
lan9303_disable_packet_processing -> lan9303_disable_processing_port

Signed-off-by: Egil Hjelmeland <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: lan9303: Simplify lan9303_xxx_packet_processing() usage

Simplify usage of lan9303_enable_packet_processing,
lan9303_disable_packet_processing()

Signed-off-by: Egil Hjelmeland <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: lan9303: define LAN9303_NUM_PORTS 3

Will be used instead of '3' in upcomming patches.

Signed-off-by: Egil Hjelmeland <[email protected]>
Signed-off-by: Andrew Lunn <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: lan9303: Change lan9303_xxx_packet_processing() port param.

lan9303_enable_packet_processing, lan9303_disable_packet_processing()
Pass port number (0,1,2) as parameter instead of port offset.
Because other functions in the module pass port numbers.
And to enable simplifications in following patch.

Introduce lan9303_write_switch_port().

Signed-off-by: Egil Hjelmeland <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'ipv6-sr-add-support-for-advanced-local-segment-processing'

David Lebrun says:

====================
ipv6: sr: add support for advanced local segment processing

v2: use EXPORT_SYMBOL_GPL

The current implementation of IPv6 SR supports SRH insertion/encapsulation
and basic segment endpoint behavior (i.e., processing of an SRH contained in
a packet whose active segment (IPv6 DA) is routed to the local node). This
behavior simply consists of updating the DA to the next segment and forwarding
the packet accordingly. This processing is realised for all such packets,
regardless of the active segment.

The most recent specifications of IPv6 SR [1] [2] extend the SRH processing
features as follows. Each segment endpoint defines a MyLocalSID table.
This table maps segments to operations to perform. For each ingress IPv6
packet whose DA is part of a given prefix, the segment endpoint looks
up the active segment (i.e., the IPv6 DA) in the MyLocalSID table and
applies the corresponding operation. Such specifications enable to specify
arbitrary operations besides the basic SRH processing and allow for a more
fine-grained classification.

This patch series implements those extended specifications by leveraging
a new type of lightweight tunnel, seg6local. The MyLocalSID table is
simply an arbitrary routing table (using CONFIG_IPV6_MULTIPLE_TABLES). The
following commands would assign the prefix fc00::/64 to the MyLocalSID
table, map the segment fc00::42 to the regular SRH processing function
(named "End"), and drop all packets received with an undefined active
segment:

ip -6 rule add fc00::/64 lookup 100
ip -6 route add fc00::42 encap seg6local action End dev eth0 table 100
ip -6 route add blackhole default table 100

As another example, the following command would assign the segment
fc00::1234 to the regular SRH processing function, except that the
processed packet must be forwarded to the next-hop fc42::1 (this operation
is named "End.X"):

ip -6 route add fc00::1234 encap seg6local action End.X nh6 fc42::1 dev eth0 table 100

Those two basic operations (End and End.X) are defined in [1]. A more
extensive list of advanced operations is defined in [2].

The first two patches of the series are preliminary work that remove an
assumption about initial SRH format, and export the two functions used to
insert and encapsulate an SRH onto packets. The third patch defines the
new seg6local lightweight tunnel and implement the core functions. The
fourth patch implements the operations needed to handle the newly defined
rtnetlink attributes. The fifth patch implements a few SRH processing
operations, including End and End.X.

[1] https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-07
[2] https://tools.ietf.org/html/draft-filsfils-spring-srv6-network-programming-01
====================

Signed-off-by: David S. Miller <[email protected]>

ipv6: sr: implement several seg6local actions

This patch implements the following seg6local actions.

- SEG6_LOCAL_ACTION_END: regular SRH processing. The DA of the packet
  is updated to the next segment and forwarded accordingly.

- SEG6_LOCAL_ACTION_END_X: same as above, except that the packet is
  forwarded to the specified IPv6 next-hop.

- SEG6_LOCAL_ACTION_END_DX6: decapsulate the packet and forward to
  inner IPv6 packet to the specified IPv6 next-hop.

- SEG6_LOCAL_ACTION_END_B6: insert the specified SRH directly after
  the IPv6 header of the packet.

- SEG6_LOCAL_ACTION_END_B6_ENCAP: encapsulate the packet within
  an outer IPv6 header, containing the specified SRH.

Signed-off-by: David Lebrun <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv6: sr: add rtnetlink functions for seg6local action parameters

This patch adds the necessary functions to parse, fill, and compare
seg6local rtnetlink attributes, for all defined action parameters.

- The SRH parameter defines an SRH to be inserted or encapsulated.
- The TABLE parameter defines the table to use for the route lookup of
  the next segment or the inner decapsulated packet.
- The NH4 parameter defines the IPv4 next-hop for an inner decapsulated
  IPv4 packet.
- The NH6 parameter defines the IPv6 next-hop for the next segment or
  for an inner decapsulated IPv6 packet
- The IIF parameter defines an ingress interface index.
- The OIF parameter defines an egress interface index.

Signed-off-by: David Lebrun <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv6: sr: define core operations for seg6local lightweight tunnel

This patch implements a new type of lightweight tunnel named seg6local.
A seg6local lwt is defined by a type of action and a set of parameters.
The action represents the operation to perform on the packets matching the
lwt's route, and is not necessarily an encapsulation. The set of parameters
are arguments for the processing function.

Each action is defined in a struct seg6_action_desc within
seg6_action_table[]. This structure contains the action, mandatory
attributes, the processing function, and a static headroom size required by
the action. The mandatory attributes are encoded as a bitmask field. The
static headroom is set to a non-zero value when the processing function
always add a constant number of bytes to the skb (e.g. the header size for
encapsulations).

To facilitate rtnetlink-related operations such as parsing, fill_encap,
and cmp_encap, each type of action parameter is associated to three
function pointers, in seg6_action_params[].

All actions defined in seg6_local.h are detailed in [1].

[1] https://tools.ietf.org/html/draft-filsfils-spring-srv6-network-programming-01

Signed-off-by: David Lebrun <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv6: sr: export SRH insertion functions

This patch exports the seg6_do_srh_encap() and seg6_do_srh_inline()
functions. It also removes the CONFIG_IPV6_SEG6_INLINE knob
that enabled the compilation of seg6_do_srh_inline(). This function
is now built-in.

Signed-off-by: David Lebrun <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ipv6: sr: allow SRH insertion with arbitrary segments_left value

The seg6_validate_srh() function only allows SRHs whose active segment is
the first segment of the path. However, an application may insert an SRH
whose active segment is not the first one. Such an application might be
for example an SR-aware Virtual Network Function.

This patch enables to insert SRHs with an arbitrary active segment.

Signed-off-by: David Lebrun <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bpf: devmap fix mutex in rcu critical section

Originally we used a mutex to protect concurrent devmap update
and delete operations from racing with netdev unregister notifier
callbacks.

The notifier hook is needed because we increment the netdev ref
count when a dev is added to the devmap. This ensures the netdev
reference is valid in the datapath. However, we don't want to block
unregister events, hence the initial mutex and notifier handler.

The concern was in the notifier hook we search the map for dev
entries that hold a refcnt on the net device being torn down. But,
in order to do this we require two steps,

  (i) dereference the netdev:  dev = rcu_dereference(map[i])
(ii) test ifindex:   dev->ifindex == removing_ifindex

and then finally we can swap in the NULL dev in the map via an
xchg operation,

  xchg(map[i], NULL)

The danger here is a concurrent update could run a different
xchg op concurrently leading us to replace the new dev with a
NULL dev incorrectly.

      CPU 1                        CPU 2

   notifier hook                   bpf devmap update

   dev = rcu_dereference(map[i])
                                   dev = rcu_dereference(map[i])
                                   xchg(map[i]), new_dev);
                                   rcu_call(dev,...)
   xchg(map[i], NULL)

The above flow would create the incorrect state with the dev
reference in the update path being lost. To resolve this the
original code used a mutex around the above block. However,
updates, deletes, and lookups occur inside rcu critical sections
so we can't use a mutex in this context safely.

Fortunately, by writing slightly better code we can avoid the
mutex altogether. If CPU 1 in the above example uses a cmpxchg
and _only_ replaces the dev reference in the map when it is in
fact the expected dev the race is removed completely. The two
cases being illustrated here, first the race condition,

      CPU 1                          CPU 2

   notifier hook                     bpf devmap update

   dev = rcu_dereference(map[i])
                                     dev = rcu_dereference(map[i])
                                     xchg(map[i]), new_dev);
                                     rcu_call(dev,...)
   odev = cmpxchg(map[i], dev, NULL)

Now we can test the cmpxchg return value, detect odev != dev and
abort. Or in the good case,

      CPU 1                          CPU 2

   notifier hook                     bpf devmap update
   dev = rcu_dereference(map[i])
   odev = cmpxchg(map[i], dev, NULL)
                                     [...]

Now 'odev == dev' and we can do proper cleanup.

And viola the original race we tried to solve with a mutex is
corrected and the trace noted by Sasha below is resolved due
to removal of the mutex.

Note: When walking the devmap and removing dev references as needed
we depend on the core to fail any calls to dev_get_by_index() using
the ifindex of the device being removed. This way we do not race with
the user while searching the devmap.

Additionally, the mutex was also protecting list add/del/read on
the list of maps in-use. This patch converts this to an RCU list
and spinlock implementation. This protects the list from concurrent
alloc/free operations. The notifier hook walks this list so it uses
RCU read semantics.

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747
in_atomic(): 1, irqs_disabled(): 0, pid: 16315, name: syz-executor1
1 lock held by syz-executor1/16315:
#0:  (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] map_delete_elem kernel/bpf/syscall.c:577 [inline]
#0:  (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] SYSC_bpf kernel/bpf/syscall.c:1427 [inline]
#0:  (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] SyS_bpf+0x1d32/0x4ba0 kernel/bpf/syscall.c:1388

Fixes: 2ddf71e23cc2 ("net: add notifier hooks for devmap bpf map")
Reported-by: Sasha Levin <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: John Fastabend <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net_sched-clean-up-filter-handle'

Cong Wang says:

====================
net_sched: clean up filter handle

This patchset sits in my local branch for a long time, it is time to
send it out. It cleans up the ambiguous use of 'unsigned long fh',
please see each of them for details.
====================

Signed-off-by: David S. Miller <[email protected]>

net_sched: use void pointer for filter handle

Now we use 'unsigned long fh' as a pointer in every place,
it is safe to convert it to a void pointer now. This gets
rid of many casts to pointer.

Cc: Jamal Hadi Salim <[email protected]>
Cc: Jiri Pirko <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net_sched: refactor notification code for RTM_DELTFILTER

It is confusing to use 'unsigned long fh' as both a handle
and a pointer, especially commit 9ee7837449b3
("net sched filters: fix notification of filter delete with proper handle").

This patch introduces tfilter_del_notify() so that we can
pass it as a pointer as before, and we don't need to check
RTM_DELTFILTER in tcf_fill_node() any more.

This prepares for the next patch.

Cc: Jamal Hadi Salim <[email protected]>
Cc: Jiri Pirko <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

lwtunnel: replace EXPORT_SYMBOL with EXPORT_SYMBOL_GPL

Signed-off-by: Roopa Prabhu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'bpf-add-support-for-sys-enter-exit-tracepoints'

Yonghong Song says:

====================
bpf: add support for sys_{enter|exit}_* tracepoints

Currently, bpf programs cannot be attached to sys_enter_* and sys_exit_*
style tracepoints. The main reason is that syscalls/sys_enter_* and syscalls/sys_exit_*
tracepoints are treated differently from other tracepoints and there
is no bpf hook to it.

This patch set adds bpf support for these syscalls tracepoints and also
adds a test case for it.

Changelogs:
v3 -> v4:
- Check the legality of ctx offset access for syscall tracepoint as well.
   trace_event_get_offsets will return correct max offset for each
   specific syscall tracepoint.
- Use variable length array to avoid hardcode 6 as the maximum
   arguments beyond syscall_nr.
v2 -> v3:
- Fix a build issue
v1 -> v2:
- Do not use TRACE_EVENT_FL_CAP_ANY to identify syscall tracepoint.
   Instead use trace_event_call->class.
====================

Signed-off-by: David S. Miller <[email protected]>

bpf: add a test case for syscalls/sys_{enter|exit}_* tracepoints

Signed-off-by: Yonghong Song <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bpf: add support for sys_enter_* and sys_exit_* tracepoints

Currently, bpf programs cannot be attached to sys_enter_* and sys_exit_*
style tracepoints. The iovisor/bcc issue #748
(https://github.com/iovisor/bcc/issues/748) documents this issue.
For example, if you try to attach a bpf program to tracepoints
syscalls/sys_enter_newfstat, you will get the following error:
   # ./tools/trace.py t:syscalls:sys_enter_newfstat
   Ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
   Failed to attach BPF to tracepoint

The main reason is that syscalls/sys_enter_* and syscalls/sys_exit_*
tracepoints are treated differently from other tracepoints and there
is no bpf hook to it.

This patch adds bpf support for these syscalls tracepoints by
  . permitting bpf attachment in ioctl PERF_EVENT_IOC_SET_BPF
  . calling bpf programs in perf_syscall_enter and perf_syscall_exit

The legality of bpf program ctx access is also checked.
Function trace_event_get_offsets returns correct max offset for each
specific syscall tracepoint, which is compared against the maximum offset
access in bpf program.

Signed-off-by: Yonghong Song <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

of_mdio: use of_property_read_u32_array()

The "fixed-link" prop support predated of_property_read_u32_array(), so
basically had to open-code it. Using the modern API saves 24 bytes of the
object code (ARM gcc 4.8.5); the only behavior change would be that the
prop length check is now less strict (however the strict pre-check done
in of_phy_is_fixed_link() is left intact anyway)...

Signed-off-by: Sergei Shtylyov <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Reviewed-by: Rob Herring <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ibmvnic: Report rx buffer return codes as netdev_dbg

Reporting any return code for a receive buffer as an "rx error" only
produces alarming noise and the only values that have been observed to be
used in this field are not error conditions. Change this to a netdev_dbg
with a more descriptive message.

Signed-off-by: John Allen <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net-l3mdev-Support-for-sockets-bound-to-enslaved-device'

David Ahern says:

====================
net: l3mdev: Support for sockets bound to enslaved device

A missing piece to the VRF puzzle is the ability to bind sockets to
devices enslaved to a VRF. This patch set adds the enslaved device
index, sdif, to IPv4 and IPv6 socket lookups. The end result for users
is the following scope options for services:

1. "global" services - sockets not bound to any device

   Allows 1 service to work across all network interfaces with
   connected sockets bound to the VRF the connection originates
   (Requires net.ipv4.tcp_l3mdev_accept=1 for TCP and
    net.ipv4.udp_l3mdev_accept=1 for UDP)

2. "VRF" local services - sockets bound to a VRF

   Sockets work across all network interfaces enslaved to a VRF but
   are limited to just the one VRF.

3. "device" services - sockets bound to a specific network interface

   Service works only through the one specific interface.

v3
- convert __inet_lookup_established in dccp_v4_err; missed in v2

v2
- remove sk_lookup struct and add sdif as an argument to existing
  functions

Changes since RFC:
- no significant logic changes; mainly whitespace cleanups
====================

Signed-off-by: David S. Miller <[email protected]>

net: ipv6: add second dif to raw socket lookups

Add a second device index, sdif, to raw socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ipv6: add second dif to inet6 socket lookups

Add a second device index, sdif, to inet6 socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.

TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux) the
ingress index is obtained from IPCB using inet_sdif and after tcp_v4_rcv
tcp_v4_sdif is used.

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ipv6: add second dif to udp socket lookups

Add a second device index, sdif, to udp socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.

Early demux lookups are handled in the next patch as part of INET_MATCH
changes.

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ipv4: add second dif to multicast source filter

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ipv4: add second dif to raw socket lookups

Add a second device index, sdif, to raw socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ipv4: add second dif to inet socket lookups

Add a second device index, sdif, to inet socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.

TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux) the
ingress index is obtained from IPCB using inet_sdif and after the cb move
in tcp_v4_rcv the tcp_v4_sdif helper is used.

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ipv4: add second dif to udp socket lookups

Add a second device index, sdif, to udp socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.

Early demux lookups are handled in the next patch as part of INET_MATCH
changes.

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'wireless-drivers-next-for-davem-2017-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
wireless-drivers-next patches for 4.14

The first wireless-drivers-next pull request for 4.14. I'm submitting
this unusally late in the cycle as my vacation postponed this. But
even if this is late there's not still that much new features, mostly
cleanup or fixes.

Major changes:

ath10k

* preparation for wcn3990 support

iwlwifi

* Reorganization of the code into separate directories continues

qtnfmac

* regulatory support updates

* add get_channel, dump_survey and channel_switch cfg80211 handlers
====================

Signed-off-by: David S. Miller <[email protected]>

hysdn: fix to a race condition in put_log_buffer

The synchronization type that was used earlier to guard the loop that
deletes unused log buffers may lead to a situation that prevents any
thread from going through the loop.

The patch deletes previously used synchronization mechanism and moves
the loop under the spin_lock so the similar cases won't be feasible in
the future.

Found by by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Anton Volkov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

s390/qeth: fix L3 next-hop in xmit qeth hdr

On L3, the qeth_hdr struct needs to be filled with the next-hop
IP address.
The current code accesses rtable->rt_gateway without checking that
rtable is a valid address. The accidental access to a lowcore area
results in a random next-hop address in the qeth_hdr.
rtable (or more precisely, skb_dst(skb)) can be NULL in rare cases
(for instance together with AF_PACKET sockets).
This patch adds the missing NULL-ptr checks.

Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: Ursula Braun <[email protected]>
Fixes: 87e7597b5a3 qeth: Move away from using neighbour entries in qeth_l3_fill_header()
Signed-off-by: David S. Miller <[email protected]>

hns3: fix unused function warning

Without CONFIG_PCI_IOV, we get a harmless warning about an
unused function:

drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:2273:13: error: 'hclge_disable_sriov' defined but not used [-Werror=unused-function]

The #ifdefs in this driver are obviously wrong, so this just
removes them and uses an IS_ENABLED() check that does the same
thing correctly in a more readable way.

Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support")
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'mlx5-shared-2017-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Saeed Mahameed says:

====================
mlx5-shared-2017-08-07

This series includes some mlx5 updates for both net-next and rdma trees.

From Saeed,
Core driver updates to allow selectively building the driver with
or without some large driver components, such as
- E-Switch (Ethernet SRIOV support).
- Multi-Physical Function Switch (MPFs) support.
For that we split E-Switch and MPFs functionalities into separate files.

From Erez,
Delay mlx5_core events when mlx5 interfaces, namely mlx5_ib, registration
is taking place and until it completes.

From Rabie,
Increase the maximum supported flow counters.
====================

Signed-off-by: David S. Miller <[email protected]>

Merge tag 'rdma-rc-2017-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma into leon-ipoib

IPoIB fixes for 4.13

The patchset provides various fixes for IPoIB. It is combination of
fixes to various issues discovered during verification along with
static checkers cleanup patches.

Most of the patches are from pre-git era and hence lack of Fixes lines.

There is one exception in this IPoIB group - addition of patch revert:
Revert "IB/core: Allow QP state transition from reset to error", but
it followed by proper fix to the annoying print, so I thought it is
appropriate to include it.

Signed-off-by: Doug Ledford <[email protected]>

Merge branch 'asix-Improve-robustness'

Dean Jenkins says:

====================
asix: Improve robustness

Please consider taking these patches to improve the robustness of the ASIX USB
to Ethernet driver.

Failures prompting an ASIX driver code review
=============================================

On an ARM i.MX6 embedded platform some strange one-off and two-off failures were
observed in and around the ASIX USB to Ethernet driver. This was observed on a
highly modified kernel 3.14 with the ASIX driver containing back-ported changes
from kernel.org up to kernel 4.8 approximately.

a) A one-off failure in asix_rx_fixup_internal():

There was an occurrence of an attempt to write off the end of the netdev buffer
which was trapped by skb_over_panic() in skb_put().

[20030.846440] skbuff: skb_over_panic: text:7f2271c0 len:120 put:60 head:8366ecc0 data:8366ed02 tail:0x8366ed7a end:0x8366ed40 dev:eth0
[20030.863007] Kernel BUG at 8044ce38 [verbose debug info unavailable]

[20031.215345] Backtrace:
[20031.217884] [<8044cde0>] (skb_panic) from [<8044d50c>] (skb_put+0x50/0x5c)
[20031.227408] [<8044d4bc>] (skb_put) from [<7f2271c0>] (asix_rx_fixup_internal+0x1c4/0x23c [asix])
[20031.242024] [<7f226ffc>] (asix_rx_fixup_internal [asix]) from [<7f22724c>] (asix_rx_fixup_common+0x14/0x18 [asix])
[20031.260309] [<7f227238>] (asix_rx_fixup_common [asix]) from [<7f21f7d4>] (usbnet_bh+0x74/0x224 [usbnet])
[20031.269879] [<7f21f760>] (usbnet_bh [usbnet]) from [<8002f834>] (call_timer_fn+0xa4/0x1f0)
[20031.283961] [<8002f790>] (call_timer_fn) from [<80030834>] (run_timer_softirq+0x230/0x2a8)
[20031.302782] [<80030604>] (run_timer_softirq) from [<80028780>] (__do_softirq+0x15c/0x37c)
[20031.321511] [<80028624>] (__do_softirq) from [<80028c38>] (irq_exit+0x8c/0xe8)
[20031.339298] [<80028bac>] (irq_exit) from [<8000e9c8>] (handle_IRQ+0x8c/0xc8)
[20031.350038] [<8000e93c>] (handle_IRQ) from [<800085c8>] (gic_handle_irq+0xb8/0xf8)
[20031.365528] [<80008510>] (gic_handle_irq) from [<8050de80>] (__irq_svc+0x40/0x70)

Analysis of the logic of the ASIX driver (containing backported changes from
kernel.org up to kernel 4.8 approximately) suggested that the software could not
trigger skb_over_panic(). The analysis of the kernel BUG() crash information
suggested that the netdev buffer was written with 2 minimal 60 octet length
Ethernet frames (ASIX hardware drops the 4 octet FCS field) and the 2nd Ethernet
frame attempted to write off the end of the netdev buffer.

Note that the netdev buffer should only contain 1 Ethernet frame so if an
attempt to write 2 Ethernet frames into the buffer is made then that is wrong.
However, the logic of the asix_rx_fixup_internal() only allows 1 Ethernet frame
to be written into the netdev buffer.

Potentially this failure was due to memory corruption because it was only seen
once.

b) Two-off failures in the NAPI layer's backlog queue:

There were 2 crashes in the NAPI layer's backlog queue presumably after
asix_rx_fixup_internal() called usbnet_skb_return().

[24097.273945] Unable to handle kernel NULL pointer dereference at virtual address 00000004

[24097.398944] PC is at process_backlog+0x80/0x16c

[24097.569466] Backtrace:
[24097.572007] [<8045ad98>] (process_backlog) from [<8045b64c>] (net_rx_action+0xcc/0x248)
[24097.591631] [<8045b580>] (net_rx_action) from [<80028780>] (__do_softirq+0x15c/0x37c)
[24097.610022] [<80028624>] (__do_softirq) from [<800289cc>] (run_ksoftirqd+0x2c/0x84)

and

[ 1059.828452] Unable to handle kernel NULL pointer dereference at virtual address 00000000

[ 1059.953715] PC is at process_backlog+0x84/0x16c

[ 1060.140896] Backtrace:
[ 1060.143434] [<8045ad98>] (process_backlog) from [<8045b64c>] (net_rx_action+0xcc/0x248)
[ 1060.163075] [<8045b580>] (net_rx_action) from [<80028780>] (__do_softirq+0x15c/0x37c)
[ 1060.181474] [<80028624>] (__do_softirq) from [<80028c38>] (irq_exit+0x8c/0xe8)
[ 1060.199256] [<80028bac>] (irq_exit) from [<8000e9c8>] (handle_IRQ+0x8c/0xc8)
[ 1060.210006] [<8000e93c>] (handle_IRQ) from [<800085c8>] (gic_handle_irq+0xb8/0xf8)
[ 1060.225492] [<80008510>] (gic_handle_irq) from [<8050de80>] (__irq_svc+0x40/0x70)

The embedded board was only using an ASIX USB to Ethernet adaptor eth0.

Analysis suggested that the doubly-linked list pointers of the backlog queue had
been corrupted because one of the link pointers was NULL.

Potentially this failure was due to memory corruption because it was only seen
twice.

Results of the ASIX driver code review
======================================

During the code review some weaknesses were observed in the ASIX driver and the
following patches have been created to improve the robustness.

Brief overview of the patches
-----------------------------

1. asix: Add rx->ax_skb = NULL after usbnet_skb_return()

The current ASIX driver sends the received Ethernet frame to the NAPI layer of
the network stack via the call to usbnet_skb_return() in
asix_rx_fixup_internal() but retains the rx->ax_skb pointer to the netdev
buffer. The driver no longer needs the rx->ax_skb pointer at this point because
the NAPI layer now has the Ethernet frame.

This means that asix_rx_fixup_internal() must not use rx->ax_skb after the call
to usbnet_skb_return() because it could corrupt the handling of the Ethernet
frame within the network layer.

Therefore, to remove the risk of erroneous usage of rx->ax_skb, set rx->ax_skb
to NULL after the call to usbnet_skb_return(). This avoids potential erroneous
freeing of rx->ax_skb and erroneous writing to the netdev buffer. If the
software now somehow inappropriately reused rx->ax_skb, then a NULL pointer
dereference of rx->ax_skb would occur which makes investigation easier.

2. asix: Ensure asix_rx_fixup_info members are all reset

This patch creates reset_asix_rx_fixup_info() to allow all the
asix_rx_fixup_info structure members to be consistently reset to initial
conditions.

Call reset_asix_rx_fixup_info() upon each detectable error condition so that the
next URB is processed from a known state.

Otherwise, there is a risk that some members of the asix_rx_fixup_info structure
may be incorrect after an error occurred so potentially leading to a
malfunction.

3. asix: Fix small memory leak in ax88772_unbind()

This patch creates asix_rx_fixup_common_free() to allow the rx->ax_skb to be
freed when necessary.

asix_rx_fixup_common_free() is called from ax88772_unbind() before the parent
private data structure is freed.

Without this patch, there is a risk of a small netdev buffer memory leak each
time ax88772_unbind() is called during the reception of an Ethernet frame that
spans across 2 URBs.

Testing
=======

The patches have been sanity tested on a 64-bit Linux laptop running kernel
4.13-rc2 with the 3 patches applied on top.

The ASIX USB to Adaptor used for testing was (output of lsusb):
ID 0b95:772b ASIX Electronics Corp. AX88772B

Test #1
-------

The test ran a flood ping test script which slowly incremented the ICMP Echo
Request's payload from 0 to 5000 octets. This eventually causes IPv4
fragmentation to occur which causes Ethernet frames to be sent very close to
each other so increases the probability that an Ethernet frame will span 2 URBs.
The test showed that all pings were successful. The test took about 15 minutes
to complete.

Test #2
-------

A script was run on the laptop to periodically run ifdown and ifup every second
so that the ASIX USB to Adaptor was up for 1 second and down for 1 second.

From a Linux PC connected to the laptop, the following ping command was used
ping -f -s 5000 <ip address of laptop>

The large ICMP payload causes IPv4 fragmentation resulting in multiple
Ethernet frames per original IP packet.

Kernel debug within the ASIX driver was enabled to see whether any ASIX errors
were generated. The test was run for about 24 hours and no ASIX errors were
seen.

Patches
=======

The 3 patches have been rebased off the net-next repo master branch with HEAD
fbbeefd net: fec: Allow reception of frames bigger than 1522 bytes
====================

Signed-off-by: David S. Miller <[email protected]>

asix: Fix small memory leak in ax88772_unbind()

When Ethernet frames span mulitple URBs, the netdev buffer memory
pointed to by the asix_rx_fixup_info structure remains allocated
during the time gap between the 2 executions of asix_rx_fixup_internal().

This means that if ax88772_unbind() is called within this time
gap to free the memory of the parent private data structure then
a memory leak of the part filled netdev buffer memory will occur.

Therefore, create a new function asix_rx_fixup_common_free() to
free the memory of the netdev buffer and add a call to
asix_rx_fixup_common_free() from inside ax88772_unbind().

Consequently when an unbind occurs part way through receiving
an Ethernet frame, the netdev buffer memory that is holding part
of the received Ethernet frame will now be freed.

Signed-off-by: Dean Jenkins <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

asix: Ensure asix_rx_fixup_info members are all reset

There is a risk that the members of the structure asix_rx_fixup_info
become unsynchronised leading to the possibility of a malfunction.

For example, rx->split_head was not being set to false after an
error was detected so potentially could cause a malformed 32-bit
Data header word to be formed.

Therefore add function reset_asix_rx_fixup_info() to reset all the
members of asix_rx_fixup_info so that future processing will start
with known initial conditions.

Also, if (skb->len != offset) becomes true then call
reset_asix_rx_fixup_info() so that the processing of the next URB
starts with known initial conditions. Without the call, the check
does nothing which potentially could lead to a malfunction
when the next URB is processed.

In addition, for robustness, call reset_asix_rx_fixup_info() before
every error path's "return 0". This ensures that the next URB is
processed from known initial conditions.

Signed-off-by: Dean Jenkins <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

asix: Add rx->ax_skb = NULL after usbnet_skb_return()

In asix_rx_fixup_internal() there is a risk that rx->ax_skb gets
reused after passing the Ethernet frame into the network stack via
usbnet_skb_return().

The risks include:

a) asynchronously freeing rx->ax_skb after passing the netdev buffer
to the NAPI layer which might corrupt the backlog queue.

b) erroneously reusing rx->ax_skb such as calling skb_put_data() multiple
times which causes writing off the end of the netdev buffer.

Therefore add a defensive rx->ax_skb = NULL after usbnet_skb_return()
so that it is not possible to free rx->ax_skb or to apply
skb_put_data() too many times.

Signed-off-by: Dean Jenkins <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bpf: fix selftest/bpf/test_pkt_md_access on s390x

Commit 18f3d6be6be1 ("selftests/bpf: Add test cases to test narrower ctx field loads")
introduced new eBPF test cases. One of them (test_pkt_md_access.c)
fails on s390x. The BPF verifier error message is:

[root@s8360046 bpf]# ./test_progs
test_pkt_access:PASS:ipv4 349 nsec
test_pkt_access:PASS:ipv6 212 nsec
[....]
libbpf: load bpf program failed: Permission denied
libbpf: -- BEGIN DUMP LOG ---
libbpf:
0: (71) r2 = *(u8 *)(r1 +0)
invalid bpf_context access off=0 size=1

libbpf: -- END LOG --
libbpf: failed to load program 'test1'
libbpf: failed to load object './test_pkt_md_access.o'
Summary: 29 PASSED, 1 FAILED
[root@s8360046 bpf]#

This is caused by a byte endianness issue. S390x is a big endian
architecture. Pointer access to the lowest byte or halfword of a
four byte value need to add an offset.
On little endian architectures this offset is not needed.

Fix this and use the same approach as the originator used for other files
(for example test_verifier.c) in his original commit.

With this fix the test program test_progs succeeds on s390x:
[root@s8360046 bpf]# ./test_progs
test_pkt_access:PASS:ipv4 236 nsec
test_pkt_access:PASS:ipv6 217 nsec
test_xdp:PASS:ipv4 3624 nsec
test_xdp:PASS:ipv6 1722 nsec
test_l4lb:PASS:ipv4 926 nsec
test_l4lb:PASS:ipv6 1322 nsec
test_tcp_estats:PASS: 0 nsec
test_bpf_obj_id:PASS:get-fd-by-notexist-prog-id 0 nsec
test_bpf_obj_id:PASS:get-fd-by-notexist-map-id 0 nsec
test_bpf_obj_id:PASS:get-prog-info(fd) 0 nsec
test_bpf_obj_id:PASS:get-map-info(fd) 0 nsec
test_bpf_obj_id:PASS:get-prog-info(fd) 0 nsec
test_bpf_obj_id:PASS:get-map-info(fd) 0 nsec
test_bpf_obj_id:PASS:get-prog-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:get-prog-info(next_id->fd) 0 nsec
test_bpf_obj_id:PASS:get-prog-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:get-prog-info(next_id->fd) 0 nsec
test_bpf_obj_id:PASS:check total prog id found by get_next_id 0 nsec
test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:check get-map-info(next_id->fd) 0 nsec
test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
test_bpf_obj_id:PASS:check get-map-info(next_id->fd) 0 nsec
test_bpf_obj_id:PASS:check total map id found by get_next_id 0 nsec
test_pkt_md_access:PASS: 277 nsec
Summary: 30 PASSED, 0 FAILED
[root@s8360046 bpf]#

Fixes: 18f3d6be6be1 ("selftests/bpf: Add test cases to test narrower ctx field loads")
Signed-off-by: Thomas Richter <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net-sched-summer-cleanup-part-2-ndo_setup_tc'

Jiri Pirko says:

====================
net: sched: summer cleanup part 2, ndo_setup_tc

This patchset focuses on ndo_setup_tc and its args.
Currently there are couple of things that do not make much sense.
The type is passed in struct tc_to_netdev, but as it is always
required, should be arg of the ndo. Other things are passed as args
but they are only relevant for cls offloads and not mqprio. Therefore,
they should be pushed to struct. As the tc_to_netdev struct in the end
is just a container of single pointer, we get rid of it and pass the
struct according to type. So in the end, we have:
ndo_setup_tc(dev, type, type_data_struct)

There are couple of cosmetics done on the way to make things smooth.
Also, reported error is consolidated to eopnotsupp in case the
asked offload is not supported.

v1->v2:
- added forgotten hns3pf bits
====================

Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: get rid of struct tc_to_netdev

Get rid of struct tc_to_netdev which is now just unnecessary container
and rather pass per-type structures down to drivers directly.
Along with that, consolidate the naming of per-type structure variables
in cls_*.

Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: change return value of ndo_setup_tc for driver supporting mqprio only

Change the return value from -EINVAL to -EOPNOTSUPP. The rest of the
drivers have it like that, so be aligned.

Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: move prio into cls_common

prio is not cls_flower specific, but it is meaningful for all
classifiers. Seems that only mlxsw cares about the value. Obviously,
cls offload in other drivers is broken.

Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: push cls related args into cls_common structure

As ndo_setup_tc is generic offload op for whole tc subsystem, does not
really make sense to have cls-specific args. So move them under
cls_common structurure which is embedded in all cls structs.

Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

hns3pf: don't check handle during mqprio offload

Similar to the rest offloaders of mqprio, no need to check handle.

Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

nfp: change flows in apps that offload ndo_setup_tc

Change the flows a bit in preparation of follow-up changes in
ndo_setup_tc args. Also, change the error code to align with the rest of
the drivers.

Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

dsa: push cls_matchall setup_tc processing into a separate function

Let dsa_slave_setup_tc be a splitter for specific setup_tc types and
push out cls_matchall specific code into a separate function.

Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mlxsw: spectrum: rename cls arg in matchall processing

To sync-up with the naming in the rest of the driver, rename the cls arg.

Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mlxsw: spectrum: push cls_flower and cls_matchall setup_tc processing into separate functions

Let mlxsw_sp_setup_tc be a splitter for specific setup_tc types and push
out cls_flower and cls_matchall specific codes into separate functions.

Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mlx5e_rep: push cls_flower setup_tc processing into a separate function

Let mlx5e_rep_setup_tc (former mlx5e_rep_ndo_setup_tc) be a splitter for
specific setup_tc types and push out cls_flower specific code into
a separate function.

Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mlx5e: push cls_flower and mqprio setup_tc processing into separate functions

Let mlx5e_setup_tc (former mlx5e_ndo_setup_tc) be a splitter for specific
setup_tc types and push out cls_flower and mqprio specific codes into
separate functions. Also change the return values so they are the same
as in the rest of the drivers.

Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ixgbe: push cls_u32 and mqprio setup_tc processing into separate functions

Let __ixgbe_setup_tc be a splitter for specific setup_tc types and push out
cls_u32 and mqprio specific codes into separate functions. Also change
the return values so they are the same as in the rest of the drivers.

Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

cxgb4: push cls_u32 setup_tc processing into a separate function

Let cxgb_setup_tc be a splitter for specific setup_tc types and push out
cls_u32 specific code into a separate function.

Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: make egress_dev flag part of flower offload struct

Since this is specific to flower now, make it part of the flower offload
struct.

Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: rename TC_SETUP_MATCHALL to TC_SETUP_CLSMATCHALL

In order to be aligned with the rest of the types, rename
TC_SETUP_MATCHALL to TC_SETUP_CLSMATCHALL.

Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: make type an argument for ndo_setup_tc

Since the type is always present, push it to be a separate argument to
ndo_setup_tc. On the way, name the type enum and use it for arg type.

Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/mlx5: Increase the maximum flow counters supported

Read new NIC capability field which represnts 16 MSBs of the max flow
counters number supported (max_flow_counter_31_16).

Backward compatibility with older firmware is preserved, the modified
driver reads max_flow_counter_31_16 as 0 from the older firmware and
uses up to 64K counters.

Changed flow counter id from 16 bits to 32 bits. Backward compatibility
with older firmware is preserved as we kept the 16 LSBs of the counter
id in place and added 16 MSBs from reserved field.

Changed the background bulk reading of flow counters to work in chunks
of at most 32K counters, to make sure we don't attempt to allocate very
large buffers.

Signed-off-by: Rabie Loulou <[email protected]>
Reviewed-by: Or Gerlitz <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: Fix counter list hardware structure

The counter list hardware structure doesn't contain a clear and
num_of_counters fields, remove them.

These wrong fields were never used by the driver hence no other driver
changes.

Fixes: a351a1b03bf1 ("net/mlx5: Introduce bulk reading of flow counters")
Signed-off-by: Rabie Loulou <[email protected]>
Reviewed-by: Or Gerlitz <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: Delay events till ib registration ends

When mlx5_ib registers itself to mlx5_core as an interface, it will
call mlx5_add_device which will call mlx5_ib interface add callback,
in case the latter successfully returns, only then mlx5_core will add
it to the interface list and async events will be forwarded to mlx5_ib.
Between mlx5_ib interface add callback and mlx5_core adding the mlx5_ib
interface to its devices list, arriving mlx5_core events can be missed
by the new mlx5_ib registering interface.

In other words:
thread 1: mlx5_ib: mlx5_register_interface(dev)
thread 1: mlx5_core: mlx5_add_device(dev)
thread 1: mlx5_core: ctx = dev->add => (mlx5_ib)->mlx5_ib_add
thread 2: mlx5_core_event: **new event arrives, forward to dev_list
thread 1: mlx5_core: add_ctx_to_dev_list(ctx)
/* previous event was missed by the new interface.*/
It is ok to miss events before dev->add (mlx5_ib)->mlx5_ib_add_device
but not after.

We fix this race by accumulating the events that come between the
ib_register_device (inside mlx5_add_device->(dev->add)) till the adding
to the list completes and fire them to the new registering interface
after that.

Fixes: f1ee87fe55c8 ("net/mlx5: Organize device list API in one place")
Signed-off-by: Erez Shitrit <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5: Add CONFIG_MLX5_ESWITCH Kconfig

Allow to selectively build the driver with or without sriov eswitch, VF
representors and TC offloads.

Also remove the need of two ndo ops structures (sriov & basic)
and keep only one unified ndo ops, compile out VF SRIOV ndos when not
needed (MLX5_ESWITCH=n), and for VF netdev calling those ndos will result
in returning -EPERM.

Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Or Gerlitz <[email protected]>
Cc: Jes Sorensen <[email protected]>
Cc: [email protected]

net/mlx5: Separate between E-Switch and MPFS

Multi-Physical Function Switch (MPFs) is required for when multi-PF
configuration is enabled to allow passing user configured unicast MAC
addresses to the requesting PF.

Before this patch eswitch.c used to manage the HW MPFS l2 table,
E-Switch always (regardless of sriov) enabled vport(0) (NIC PF) vport's
contexts update on unicast mac address list changes, to populate the PF's
MPFS L2 table accordingly.

In downstream patch we would like to allow compiling the driver without
E-Switch functionalities, for that we move MPFS l2 table logic out
of eswitch.c into its own file, and provide Kconfig flag (MLX5_MPFS) to
allow compiling out MPFS for those who don't want Multi-PF support.

NIC PF netdevice will now directly update MPFS l2 table via the new MPFS
API. VF netdevice has no access to MPFS L2 table, so E-Switch will remain
responsible of updating its MPFS l2 table on behalf of its VFs.

Due to this change we also don't require enabling vport(0) (PF vport)
unicast mac changes events anymore, for when SRIOV is not enabled.
Which means E-Switch is now activated only on SRIOV activation, and not
required otherwise.

Signed-off-by: Saeed Mahameed <[email protected]>
Cc: Jes Sorensen <[email protected]>
Cc: [email protected]

net/mlx5: Unify vport manager capability check

Expose MLX5_VPORT_MANAGER macro to check for strict vport manager
E-switch and MPFS (Multi Physical Function Switch) abilities.

VPORT manager must be a PF with an ethernet link and with FW advertised
vport group manager capability

Replace older checks with the new macro and use it where needed in
eswitch.c and mlx5e netdev eswitch related flows.

The same macro will be reused in MPFS separation downstream patch.

Signed-off-by: Saeed Mahameed <[email protected]>

net/mlx5e: NIC netdev init flow cleanup

Remove redundant call to unregister vport representor in mlx5e_add error
flow.

Hide the representor priv and eswitch internal structures from en_main.c
as preparation step for downstream patches which would allow building
the driver without support for representors and eswitch.

Fixes: 6f08a22c5fb2 ("net/mlx5e: Register/unregister vport representors on interface attach/detach")
Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Or Gerlitz <[email protected]>

net/mlx5e: Rearrange netdevice ops structures

Since we are going to allow building the driver without eswitch support,
it would be possible to compile out the sriov netdevice ops struct such
that the basic ops instance will be used for non VF devices too.

Add missing udp tunnel ndos into mlx5e_netdev_ops_basic.

While here, rearrange some ndos in the sriov ops struct and put
vf/eswitch related ndos towards the end of it.

Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Or Gerlitz <[email protected]>

Merge branch 'sctp-remove-typedefs-from-structures-part-5'

Xin Long says:

====================
sctp: remove typedefs from structures part 5

As we know, typedef is suggested not to use in kernel, even checkpatch.pl
also gives warnings about it. Now sctp is using it for many structures.

All this kind of typedef's using should be removed. This patchset is the
part 5 to remove all typedefs in include/net/sctp/constants.h.

Just as the part 1-4, No any code's logic would be changed in these patches,
only cleaning up.
====================

Signed-off-by: David S. Miller <[email protected]>

sctp: remove the typedef sctp_subtype_t

This patch is to remove the typedef sctp_subtype_t, and
replace with union sctp_subtype in the places where it's
using this typedef.

Note that it doesn't fix many indents although it should,
as sctp_disposition_t's removal would mess them up again.
So better to fix them when removing sctp_disposition_t in
later patch.

Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: remove the typedef sctp_event_t

This patch is to remove the typedef sctp_event_t, and
replace with enum sctp_event in the places where it's
using this typedef.

Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: remove the typedef sctp_event_timeout_t

This patch is to remove the typedef sctp_event_timeout_t, and
replace with enum sctp_event_timeout in the places where it's
using this typedef.

Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: remove the typedef sctp_event_other_t

This patch is to remove the typedef sctp_event_other_t, and
replace with enum sctp_event_other in the places where it's
using this typedef.

Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: remove the typedef sctp_event_primitive_t

This patch is to remove the typedef sctp_event_primitive_t, and
replace with enum sctp_event_primitive in the places where it's
using this typedef.

Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: remove the typedef sctp_state_t

This patch is to remove the typedef sctp_state_t, and
replace with enum sctp_state in the places where it's
using this typedef.

Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: remove the typedef sctp_ierror_t

This patch is to remove the typedef sctp_ierror_t, and
replace with enum sctp_ierror in the places where it's
using this typedef.

Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: remove the typedef sctp_xmit_t

This patch is to remove the typedef sctp_xmit_t, and
replace with enum sctp_xmit in the places where it's
using this typedef.

Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: remove the typedef sctp_sock_state_t

This patch is to remove the typedef sctp_sock_state_t, and
replace with enum sctp_sock_state in the places where it's
using this typedef.

Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: remove the typedef sctp_transport_cmd_t

This patch is to remove the typedef sctp_transport_cmd_t, and
replace with enum sctp_transport_cmd in the places where it's
using this typedef.

Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: remove the typedef sctp_scope_t

This patch is to remove the typedef sctp_scope_t, and
replace with enum sctp_scope in the places where it's
using this typedef.

Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: remove the typedef sctp_scope_policy_t

This patch is to remove the typedef sctp_scope_policy_t and keep
it's members as an anonymous enum.

It is also to define SCTP_SCOPE_POLICY_MAX to replace the num 3
in sysctl.c to make codes clear.

Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: remove the typedef sctp_retransmit_reason_t

This patch is to remove the typedef sctp_retransmit_reason_t, and
replace with enum sctp_retransmit_reason in the places where it's
using this typedef.

Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: remove the typedef sctp_lower_cwnd_t

This patch is to remove the typedef sctp_lower_cwnd_t, and
replace with enum sctp_lower_cwnd in the places where it's
using this typedef.

Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

dt-bindings: net: Document bindings for anarion-gmac

Signed-off-by: Alexandru Gagniuc <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: Add Adaptrum Anarion GMAC glue layer

Before the GMAC on the Anarion chip can be used, the PHY interface
selection must be configured with the DWMAC block in reset.

This layer covers a block containing only two registers. Although it
is possible to model this as a reset controller and use the "resets"
property of stmmac, it's much more intuitive to include this in the
glue layer instead.

At this time only RGMII is supported, because it is the only mode
which has been validated hardware-wise.

Signed-off-by: Alexandru Gagniuc <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

netvsc: fix rtnl deadlock on unregister of vf

With new transparent VF support, it is possible to get a deadlock
when some of the deferred work is running and the unregister_vf
is trying to cancel the work element. The solution is to use
trylock and reschedule (similar to bonding and team device).

Reported-by: Vitaly Kuznetsov <[email protected]>
Fixes: 0c195567a8f6 ("netvsc: transparent VF management")
Signed-off-by: Stephen Hemminger <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dsa: User per-cpu 64-bit statistics

During testing with a background iperf pushing 1Gbit/sec worth of
traffic and having both ifconfig and ethtool collect statistics, we
could see quite frequent deadlocks. Convert the often accessed DSA slave
network devices statistics to per-cpu 64-bit statistics to remove these
deadlocks and provide fast efficient statistics updates.

Fixes: f613ed665bb3 ("net: dsa: Add support for 64-bit statistics")
Signed-off-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'tcp-cwnd-undo-refactor'

Yuchung Cheng says:

====================
tcp cwnd undo refactor

This patch series consolidate similar cwnd undo functions
implemented by various congestion control by using existing
tcp socket state variable. The first patch fixes a corner
case in of cwnd undo in Reno and HTCP. Since the bug has
existed for many years and is very minor, we consider this
patch set more suitable for net-next as the major change
is the refactor itself.

- v1->v2
Fix trivial compile errors
====================

Signed-off-by: David S. Miller <[email protected]>

tcp: consolidate congestion control undo functions

Most TCP congestion controls are using identical logic to undo
cwnd except BBR. This patch consolidates these similar functions
to the one used currently by Reno and others.

Suggested-by: Neal Cardwell <[email protected]>
Signed-off-by: Yuchung Cheng <[email protected]>
Signed-off-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: fix cwnd undo in Reno and HTCP congestion controls

Using ssthresh to revert cwnd is less reliable when ssthresh is
bounded to 2 packets. This patch uses an existing variable in TCP
"prior_cwnd" that snapshots the cwnd right before entering fast
recovery and RTO recovery in Reno. This fixes the issue discussed
in netdev thread: "A buggy behavior for Linux TCP Reno and HTCP"
https://www.spinics.net/lists/netdev/msg444955.html

Suggested-by: Neal Cardwell <[email protected]>
Reported-by: Wei Sun <[email protected]>
Signed-off-by: Yuchung Cheng <[email protected]>
Signed-off-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

netvsc: fix race on sub channel creation

The existing sub channel code did not wait for all the sub-channels
to completely initialize. This could lead to race causing crash
in napi_netif_del() from bad list. The existing code would send
an init message, then wait only for the initial response that
the init message was received. It thought it was waiting for
sub channels but really the init response did the wakeup.

The new code keeps track of the number of open channels and
waits until that many are open.

Other issues here were:
* host might return less sub-channels than was requested.
* the new init status is not valid until after init was completed.

Fixes: b3e6b82a0099 ("hv_netvsc: Wait for sub-channels to be processed during probe")
Signed-off-by: Stephen Hemminger <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: systemport: Support 64bit statistics

When using Broadcom Systemport device in 32bit Platform, ifconfig can
only report up to 4G tx,rx status, which will be wrapped to 0 when the
number of incoming or outgoing packets exceeds 4G, only taking
around 2 hours in busy network environment (such as streaming).
Therefore, it makes hard for network diagnostic tool to get reliable
statistical result, so the patch is used to add 64bit support for
Broadcom Systemport device in 32bit Platform.

This patch provides 64bit statistics capability on both ethtool and ifconfig.

Signed-off-by: Jianming.qiao <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

liquidio: moved console_bitmask module param to lio_main.c

Moving PF module param console_bitmask to lio_main.c for consistency.

Signed-off-by: Intiyaz Basha <[email protected]>
Signed-off-by: Felix Manlunas <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

liquidio: add missing strings in oct_dev_state_str array

There's supposed to be a one-to-one correspondence between the 18 macros
that #define the OCT_DEV states (in octeon_device.h) and the strings in the
oct_dev_state_str array, but there are only 14 strings in the array.

Add the missing strings (so they become 18 in total), and also revise some
incorrect/outdated text of existing strings.

Signed-off-by: Intiyaz Basha <[email protected]>
Signed-off-by: Felix Manlunas <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'phylink-and-sfp-support'

Russell King says:

====================
phylink and sfp support

This patch series introduces generic support for SFP sockets found on
various Marvell based platforms.  The idea here is to provide common
SFP socket support which can be re-used by network drivers as
appropriate, rather than each network driver having to re-implement
SFP socket support.

SFP sockets typically use other system resources, eg, I2C buses to read
identifying information, and GPIOs to monitor socket state and control
the socket.  Meanwhile, some network drivers drive multiple ethernet
ports from one instantiation of the driver.

It is not desirable to block the initialisation of a network driver
(thus denying other ports from being operational) if the resources
for the SFP socket are not yet available.  This means that an element
of independence between the SFP support code and the driver is
required.

More than that, SFP modules effectively bring hotplug PHYs to
networking - SFP copper modules normally contain a standard PHY
accessed over the I2C bus, and it is desirable to read their state
so network drivers can be appropriately configured.

To add to the complexity, SFP modules can be connected in at least
two places:

1. Directly to the serdes output of a MAC with no intervening PHY.
   For example:

     mvneta ----> SFP socket

2. To a PHY, for example:

     mvpp2 ---> PHY ---> copper
                 |
                 `-----> SFP socket

This code supports both setups, although it's not fully implemented
with scenario (2).

Moreover, the link presented by the SFP module can be one of the
10Gbase-R family (for SFP+ sockets), SGMII or 1000base-X (for SFP
sockets) depending on the module, and network drivers need to
reconfigure themselves accordingly for the link to come up.

For example, if the MAC is configured for SGMII and a fibre module
is plugged in, the link won't come up until the MAC is reconfigured
for 1000base-X mode.

The SFP code manages the SFP socket - detecting the module, reading
the identifying information, and managing the control and status
signals.  Importantly, it disables the SFP module transmitter when
the MAC is down, so that the laser is turned off (but that is not
a guarantee.)

phylink provides the mechanisms necessary to manage the link modes,
based on the SFP module type, and supports hot-plugging of the PHY
without needing the MAC driver to be brought up and down on
transitions.  phylink also supports the classical static PHY and
fixed-link modes.

I currently (but not included in this series) have code to convert
mvneta to use phylink, and the out of tree mvpp2x driver.  I have
nothing for the mvpp2 driver at present as that driver is only
recently becoming functional on 10G hardware, and is missing a lot
of features that are necessary to make things work correctly.
====================

Signed-off-by: David S. Miller <[email protected]>

sfp: add SFP module support

Add support for SFP hotpluggable modules via sfp-bus and phylink.
This supports both copper and optical SFP modules, which require
different Serdes modes in order to properly negotiate the link.

Optical SFP modules typically require the Serdes link to be talking
1000BaseX mode - this is the gigabit ethernet mode defined by the
802.3 standard.

Copper SFP modules typically integrate a PHY in the module to convert
from Serdes to copper, and the PHY will be configured by the vendor
to either present a 1000BaseX Serdes link (for fixed 1000BaseT) or a
SGMII Serdes link. However, this is vendor defined, so we instead
detect the PHY, switch the link to SGMII mode, and use traditional
PHY based negotiation.

Signed-off-by: Russell King <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

phylink: add in-band autonegotiation support for 10GBase-KR mode.

Add in-band autonegotation support for 10GBase-KR mode.

Signed-off-by: Russell King <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

phylink: add support for MII ioctl access to Clause 45 PHYs

Add support for reading and writing the clause 45 MII registers.

Signed-off-by: Russell King <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>