Git Repo - linux.git/log

nfp: bpf: detect load/store sequences lowered from memory copy

This patch add the optimization frontend, but adding a new eBPF IR scan
pass "nfp_bpf_opt_ldst_gather".

The pass will traverse the IR to recognize the load/store pairs sequences
that come from lowering of memory copy builtins.

The gathered memory copy information will be kept in the meta info
structure of the first load instruction in the sequence and will be
consumed by the optimization backend added in the previous patches.

NOTE: a sequence with cross memory access doesn't qualify this
optimization, i.e. if one load in the sequence will load from place that
has been written by previous store. This is because when we turn the
sequence into single CPP operation, we are reading all contents at once
into NFP transfer registers, then write them out as a whole. This is not
identical with what the original load/store sequence is doing.

Detecting cross memory access for two random pointers will be difficult,
fortunately under XDP/eBPF's restrictied runtime environment, the copy
normally happen among map, packet data and stack, they do not overlap with
each other.

And for cases supported by NFP, cross memory access will only happen on
PTR_TO_PACKET. Fortunately for this, there is ID information that we could
do accurate memory alias check.

Signed-off-by: Jiong Wang <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

nfp: bpf: implement memory bulk copy for length bigger than 32-bytes

When the gathered copy length is bigger than 32-bytes and within 128-bytes
(the maximum length a single CPP Pull/Push request can finish), the
strategy of read/write are changeed into:

  * Read.
      - use direct reference mode when length is within 32-bytes.
      - use indirect mode when length is bigger than 32-bytes.

  * Write.
      - length <= 8-bytes
        use write8 (direct_ref).
      - length <= 32-byte and 4-bytes aligned
        use write32 (direct_ref).
      - length <= 32-bytes but not 4-bytes aligned
        use write8 (indirect_ref).
      - length > 32-bytes and 4-bytes aligned
        use write32 (indirect_ref).
      - length > 32-bytes and not 4-bytes aligned and <= 40-bytes
        use write32 (direct_ref) to finish the first 32-bytes.
        use write8 (direct_ref) to finish all remaining hanging part.
      - length > 32-bytes and not 4-bytes aligned
        use write32 (indirect_ref) to finish those 4-byte aligned parts.
        use write8 (direct_ref) to finish all remaining hanging part.

Signed-off-by: Jiong Wang <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

nfp: bpf: implement memory bulk copy for length within 32-bytes

For NFP, we want to re-group a sequence of load/store pairs lowered from
memcpy/memmove into single memory bulk operation which then could be
accelerated using NFP CPP bus.

This patch extends the existing load/store auxiliary information by adding
two new fields:

struct bpf_insn *paired_st;
s16 ldst_gather_len;

Both fields are supposed to be carried by the the load instruction at the
head of the sequence. "paired_st" is the corresponding store instruction at
the head and "ldst_gather_len" is the gathered length.

If "ldst_gather_len" is negative, then the sequence is doing memory
load/store in descending order, otherwise it is in ascending order. We need
this information to detect overlapped memory access.

This patch then optimize memory bulk copy when the copy length is within
32-bytes.

The strategy of read/write used is:

  * Read.
    Use read32 (direct_ref), always.

  * Write.
    - length <= 8-bytes
      write8 (direct_ref).
    - length <= 32-bytes and is 4-byte aligned
      write32 (direct_ref).
    - length <= 32-bytes but is not 4-byte aligned
      write8 (indirect_ref).

NOTE: the optimization should not change program semantics. The destination
register of the last load instruction should contain the same value before
and after this optimization.

Signed-off-by: Jiong Wang <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

nfp: bpf: factor out is_mbpf_load & is_mbpf_store

It is usual that we need to check if one BPF insn is for loading/storeing
data from/to memory.

Therefore, it makes sense to factor out related code to become common
helper functions.

Signed-off-by: Jiong Wang <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

nfp: bpf: encode indirect commands

Add support for emitting commands with field overwrites.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Jiong Wang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

nfp: bpf: correct the encoding for No-Dest immed

When immed is used with No-Dest, the emitter should use reg.dst instead of
reg.areg for the destination, using the latter will actually encode
register zero.

Signed-off-by: Jiong Wang <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

nfp: bpf: relax source operands check

The NFP normally requires the source operands to be difference addressing
modes, but we should rule out the very special NN_REG_NONE type.

There are instruction that ignores both A/B operands, for example:

local_csr_rd

For these instructions, we might pass the same operand type, NN_REG_NONE,
for both A/B operands.

NOTE: in current NFP ISA, it is only possible for instructions with
unrestricted operands to take none operands, but in case there is new and
similar instructoin in restricted form, they would follow similar rules,
so swreg_to_restricted is updated as well.

Signed-off-by: Jiong Wang <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

nfp: bpf: don't do ld/shifts combination if shifts are jump destination

If any of the shift insns in the ld/shift sequence is jump destination,
don't do combination.

Signed-off-by: Jiong Wang <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

nfp: bpf: don't do ld/mask combination if mask is jump destination

If the mask insn in the ld/mask pair is jump destination, then don't do
combination.

Signed-off-by: Jiong Wang <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

nfp: bpf: flag jump destination to guide insn combine optimizations

NFP eBPF offload JIT engine is doing some instruction combine based
optimizations which however must not be safe if the combined sequences
are across basic block boarders.

Currently, there are post checks during fixing jump destinations. If the
jump destination is found to be eBPF insn that has been combined into
another one, then JIT engine will raise error and abort.

This is not optimal. The JIT engine ought to disable the optimization on
such cross-bb-border sequences instead of abort.

As there is no control flow information in eBPF infrastructure that we
can't do basic block based optimizations, this patch extends the existing
jump destination record pass to also flag the jump destination, then in
instruction combine passes we could skip the optimizations if insns in the
sequence are jump targets.

Suggested-by: Jakub Kicinski <[email protected]>
Signed-off-by: Jiong Wang <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

nfp: bpf: record jump destination to simplify jump fixup

eBPF insns are internally organized as dual-list inside NFP offload JIT.
Random access to an insn needs to be done by either forward or backward
traversal along the list.

One place we need to do such traversal is at nfp_fixup_branches where one
traversal is needed for each jump insn to find the destination. Such
traversals could be avoided if jump destinations are collected through a
single travesal in a pre-scan pass, and such information could also be
useful in other places where jump destination info are needed.

This patch adds such jump destination collection in nfp_prog_prepare.

Suggested-by: Jakub Kicinski <[email protected]>
Signed-off-by: Jiong Wang <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

nfp: bpf: support backward jump

This patch adds support for backward jump on NFP.

  - restrictions on backward jump in various functions have been removed.
  - nfp_fixup_branches now supports backward jump.

There is one thing to note, currently an input eBPF JMP insn may generate
several NFP insns, for example,

  NFP imm move insn A \
  NFP compare insn  B  --> 3 NFP insn jited from eBPF JMP insn M
  NFP branch insn   C /
  ---
  NFP insn X           --> 1 NFP insn jited from eBPF insn N
  ---
  ...

therefore, we are doing sanity check to make sure the last jited insn from
an eBPF JMP is a NFP branch instruction.

Once backward jump is allowed, it is possible an eBPF JMP insn is at the
end of the program. This is however causing trouble for the sanity check.
Because the sanity check requires the end index of the NFP insns jited from
one eBPF insn while only the start index is recorded before this patch that
we can only get the end index by:

  start_index_of_the_next_eBPF_insn - 1

or for the above example:

  start_index_of_eBPF_insn_N (which is the index of NFP insn X) - 1

nfp_fixup_branches was using nfp_for_each_insn_walk2 to expose *next* insn
to each iteration during the traversal so the last index could be
calculated from which. Now, it needs some extra code to handle the last
insn. Meanwhile, the use of walk2 is actually unnecessary, we could simply
use generic single instruction walk to do this, the next insn could be
easily calculated using list_next_entry.

So, this patch migrates the jump fixup traversal method to
*list_for_each_entry*, this simplifies the code logic a little bit.

The other thing to note is a new state variable "last_bpf_off" is
introduced to track the index of the last jited NFP insn. This is necessary
because NFP is generating special purposes epilogue sequences, so the index
of the last jited NFP insn is *not* always nfp_prog->prog_len - 1.

Suggested-by: Jakub Kicinski <[email protected]>
Signed-off-by: Jiong Wang <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

nfp: fix old kdoc issues

Since commit 3a025e1d1c2e ("Add optional check for bad kernel-doc
comments") when built with W=1 build will complain about kdoc errors.
Fix the kdoc issues we have. kdoc is still confused by defines in
nfp_net_ctrl.h but those are not really errors.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

Merge branch 'bpf-verifier-misc-improvements'

Alexei Starovoitov says:

====================
Small set of verifier improvements and cleanups which is
necessary for bigger patch set of bpf-to-bpf calls coming later.
See individual patches for details.
Tested on x86 and arm64 hw.
====================

Signed-off-by: Daniel Borkmann <[email protected]>

selftests/bpf: adjust test_align expected output

since verifier started to print liveness state of the registers
adjust expected output of test_align.
Now this test checks for both proper alignment handling by verifier
and correctness of liveness marks.

Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

bpf: cleanup register_is_null()

don't pass large struct bpf_reg_state by value.
Instead pass it by pointer.

Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: John Fastabend <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

bpf: improve JEQ/JNE path walking

verifier knows how to trim paths that are known not to be
taken at run-time when register containing run-time constant
is compared with another constant.
It was done only for JEQ comparison.
Extend it to include JNE as well.
More cases can be added in the future.

                     before  after
bpf_lb-DLB_L3.o       2270    2051
bpf_lb-DLB_L4.o       3682    3287
bpf_lb-DUNKNOWN.o     1110    1080
bpf_lxc-DDROP_ALL.o   27876   24980
bpf_lxc-DUNKNOWN.o    38780   34308
bpf_netdev.o          16937   15404
bpf_overlay.o         7929    7191

Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: John Fastabend <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

bpf: improve verifier liveness marks

registers with pointers filled from stack were missing live_written marks
which caused liveness propagation to unnecessary mark more registers as
live_read and miss state pruning opportunities later on.

                     before  after
bpf_lb-DLB_L3.o       2285   2270
bpf_lb-DLB_L4.o       3723   3682
bpf_lb-DUNKNOWN.o     1110   1110
bpf_lxc-DDROP_ALL.o   27954  27876
bpf_lxc-DUNKNOWN.o    38954  38780
bpf_netdev.o          16943  16937
bpf_overlay.o         7929   7929

Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: John Fastabend <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

bpf: don't mark FP reg as uninit

when verifier hits an internal bug don't mark register R10==FP as uninit,
since it's read only register and it's not technically correct to let
verifier run further, since it may assume that R10 has valid auxiliary state.

While developing subsequent patches this issue was discovered,
though the code eventually changed that aux reg state doesn't have
pointers any more it is still safer to avoid clearing readonly register.

Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: John Fastabend <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

bpf: print liveness info to verifier log

let verifier print register and stack liveness information
into verifier log

Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: John Fastabend <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

bpf: fix stack state printing in verifier log

fix incorrect stack state prints in print_verifier_state()

Fixes: 638f5b90d460 ("bpf: reduce verifier memory consumption")
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: John Fastabend <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

samples/bpf: Convert magic numbers to names in multi-prog cgroup test case

Attach flag 1 == BPF_F_ALLOW_OVERRIDE; attach flag 2 == BPF_F_ALLOW_MULTI.
Update the calls to bpf_prog_attach() in test_cgrp2_attach2.c to use the
names over the magic numbers.

Fixes: 39323e788cb67 ("samples/bpf: add multi-prog cgroup test case")
Signed-off-by: David Ahern <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

net: dsa: bcm_sf2: Utilize b53_get_tag_protocol()

Utilize the much more capable b53_get_tag_protocol() which takes care of
all Broadcom switches specifics to resolve which port can have Broadcom
tags enabled or not.

Signed-off-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/reuseport: drop legacy code

Since commit e32ea7e74727 ("soreuseport: fast reuseport UDP socket
selection") and commit c125e80b8868 ("soreuseport: fast reuseport
TCP socket selection") the relevant reuseport socket matching the current
packet is selected by the reuseport_select_sock() call. The only
exceptions are invalid BPF filters/filters returning out-of-range
indices.
In the latter case the code implicitly falls back to using the hash
demultiplexing, but instead of selecting the socket inside the
reuseport_select_sock() function, it relies on the hash selection
logic introduced with the early soreuseport implementation.

With this patch, in case of a BPF filter returning a bad socket
index value, we fall back to hash-based selection inside the
reuseport_select_sock() body, so that we can drop some duplicate
code in the ipv4 and ipv6 stack.

This also allows faster lookup in the above scenario and will allow
us to avoid computing the hash value for successful, BPF based
demultiplexing - in a later patch.

Signed-off-by: Paolo Abeni <[email protected]>
Acked-by: Craig Gallek <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Documentation: net: dsa: Cut set_addr() documentation

This is not supported anymore, devices needing a MAC address
just assign one at random, it's just a driver pecularity.

Signed-off-by: Linus Walleij <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net-dst_entry-shrink'

David Miller says:

====================
net: Significantly shrink the size of routes.

Through a combination of several things, our route structures are
larger than they need to be.

Mostly this stems from having members in dst_entry which are only used
by one class of routes.  So the majority of the work in this series is
about "un-commoning" these members and pushing them into the type
specific structures.

Unfortunately, IPSEC needed the most surgery.  The majority of the
changes here had to do with bundle creation and management.

The other issue is the refcount alignment in dst_entry.  Once we get
rid of the not-so-common members, it really opens the door to removing
that alignment entirely.

I think the new layout looks really nice, so I'll reproduce it here:

struct net_device       *dev;
struct  dst_ops         *ops;
unsigned long _metrics;
unsigned long           expires;
struct xfrm_state *xfrm;
int (*input)(struct sk_buff *);
int (*output)(struct net *net, struct sock *sk, struct sk_buff *skb);
unsigned short flags;
short obsolete;
unsigned short header_len;
unsigned short trailer_len;
atomic_t __refcnt;
int __use;
unsigned long lastuse;
struct lwtunnel_state   *lwtstate;
struct rcu_head rcu_head;
short error;
short __pad;
__u32 tclassid;

(This is for 64-bit, on 32-bit the __refcnt comes at the very end)

So, the good news:

1) struct dst_entry shrinks from 160 to 112 bytes.

2) struct rtable shrinks from 216 to 168 bytes.

3) struct rt6_info shrinks from 384 to 320 bytes.

Enjoy.

v2:
Collapse some patches logically based upon feedback.
Fix the strange patch #7.

v3: xfrm_dst_path() needs inline keyword
Properly align __refcnt on 32-bit.
====================

Signed-off-by: David S. Miller <[email protected]>

net: Remove dst->next

There are no more users.

Signed-off-by: David S. Miller <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>

xfrm: Stop using dst->next in bundle construction.

While building ipsec bundles, blocks of xfrm dsts are linked together
using dst->next from bottom to the top.

The only thing this is used for is initializing the pmtu values of the
xfrm stack, and for updating the mtu values at xfrm_bundle_ok() time.

The bundle pmtu entries must be processed in this order so that pmtu
values lower in the stack of routes can propagate up to the higher
ones.

Avoid using dst->next by simply maintaining an array of dst pointers
as we already do for the xfrm_state objects when building the bundle.

Signed-off-by: David S. Miller <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>

net: Rearrange dst_entry layout to avoid useless padding.

We have padding to try and align the refcount on a separate cache
line. But after several simplifications the padding has increased
substantially.

So now it's easy to change the layout to get rid of the padding
entirely.

We group the write-heavy __refcnt and __use with less often used
items such as the rcu_head and the error code.

Signed-off-by: David S. Miller <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>

xfrm: Move dst->path into struct xfrm_dst

The first member of an IPSEC route bundle chain sets it's dst->path to
the underlying ipv4/ipv6 route that carries the bundle.

Stated another way, if one were to follow the xfrm_dst->child chain of
the bundle, the final non-NULL pointer would be the path and point to
either an ipv4 or an ipv6 route.

This is largely used to make sure that PMTU events propagate down to
the correct ipv4 or ipv6 route.

When we don't have the top of an IPSEC bundle 'dst->path == dst'.

Move it down into xfrm_dst and key off of dst->xfrm.

Signed-off-by: David S. Miller <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>

ipv6: Move dst->from into struct rt6_info.

The dst->from value is only used by ipv6 routes to track where
a route "came from".

Any time we clone or copy a core ipv6 route in the ipv6 routing
tables, we have the copy/clone's ->from point to the base route.

This is used to handle route expiration properly.

Only ipv6 uses this mechanism, and only ipv6 code references
it. So it is safe to move it into rt6_info.

Signed-off-by: David S. Miller <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>

xfrm: Move child route linkage into xfrm_dst.

XFRM bundle child chains look like this:

xdst1 --> xdst2 --> xdst3 --> path_dst

All of xdstN are xfrm_dst objects and xdst->u.dst.xfrm is non-NULL.
The final child pointer in the chain, here called 'path_dst', is some
other kind of route such as an ipv4 or ipv6 one.

The xfrm output path pops routes, one at a time, via the child
pointer, until we hit one which has a dst->xfrm pointer which
is NULL.

We can easily preserve the above mechanisms with child sitting
only in the xfrm_dst structure. All children in the chain
before we break out of the xfrm_output() loop have dst->xfrm
non-NULL and are therefore xfrm_dst objects.

Since we break out of the loop when we find dst->xfrm NULL, we
will not try to dereference 'dst' as if it were an xfrm_dst.

Signed-off-by: David S. Miller <[email protected]>

ipsec: Create and use new helpers for dst child access.

This will make a future change moving the dst->child pointer less
invasive.

Signed-off-by: David S. Miller <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>

net: Create and use new helper xfrm_dst_child().

Only IPSEC routes have a non-NULL dst->child pointer. And IPSEC
routes are identified by a non-NULL dst->xfrm pointer.

Signed-off-by: David S. Miller <[email protected]>

ipv6: Move rt6_next from dst_entry into ipv6 route structure.

Signed-off-by: David S. Miller <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>

decnet: Move dn_next into decnet route structure.

Signed-off-by: David S. Miller <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>

net: dst->rt_next is unused.

Delete it.

Signed-off-by: David S. Miller <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>

forcedeth: optimize the xmit with unlikely

In xmit, it is very impossible that TX_ERROR occurs. So using
unlikely optimizes the xmit process.

CC: Srinivas Eeda <[email protected]>
CC: Joe Jin <[email protected]>
CC: Junxiao Bi <[email protected]>
Signed-off-by: Zhu Yanjun <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

atm: mpoa: remove 32-bit timekeeping

net/atm/mpoa_* files use 'struct timeval' to store event
timestamps. struct timeval uses a 32-bit seconds field which will
overflow in the year 2038 and beyond. Morever, the timestamps are being
compared only to get seconds elapsed, so struct timeval which stores
a seconds and microseconds field is an overkill. This patch replaces
the use of struct timeval with time64_t to store a 64-bit seconds field.

Signed-off-by: Tina Ruchandani <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

atm: eni: fix several indentation issues

There are several statements that have incorrect indentation. Fix
these.

Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

openvswitch: use ktime_get_ts64() instead of ktime_get_ts()

timespec is deprecated because of the y2038 overflow, so let's convert
this one to ktime_get_ts64(). The code is already safe even on 32-bit
architectures, since it uses monotonic times. On 64-bit architectures,
nothing changes, while on 32-bit architectures this avoids one
type conversion.

Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

netxen: remove timespec usage

netxen_collect_minidump() evidently just wants to get a monotonic
timestamp. Using jiffies_to_timespec(jiffies, &ts) is not
appropriate here, since it will overflow after 2^32 jiffies,
which may be as short as 49 days of uptime.

ktime_get_seconds() is the correct interface here.

Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: harmonize phy_id{,_mask} data type

Previously phy_id was u32 and phy_id_mask was unsigned int. As the
phy_id_mask defines the important bits of the phy_id (and is therefore
the same size) these two variables should be the same data type.

Signed-off-by: Richard Leitner <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ethernet: davinci_emac: Deduplicate bus_find_device() by name matching

No need to reinvent the wheel, we have bus_find_device_by_name().

Cc: Grygorii Strashko <[email protected]>
Signed-off-by: Lukas Wunner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: thunderx: Set max queue count taking XDP_TX into account

on T81 there are only 4 cores, hence setting max queue count to 4
would leave nothing for XDP_TX. This patch fixes this by doubling
max queue count in above scenarios.

Signed-off-by: Sunil Goutham <[email protected]>
Signed-off-by: cjacob <[email protected]>
Signed-off-by: Aleksey Makarov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: thunderx: Add support for xdp redirect

This patch adds support for XDP_REDIRECT. Flush is not
yet supported.

Signed-off-by: Sunil Goutham <[email protected]>
Signed-off-by: cjacob <[email protected]>
Signed-off-by: Aleksey Makarov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'nfsd-4.15-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
"I screwed up my merge window pull request; I only sent half of what I
  meant to.

  There were no new features, just bugfixes of various importance and
  some very minor cleanup, so I think it's all still appropriate for
  -rc2.

  Highlights:

   - Fixes from Trond for some races in the NFSv4 state code.

   - Fix from Naofumi Honda for a typo in the blocked lock notificiation
     code

   - Fixes from Vasily Averin for some problems starting and stopping
     lockd especially in network namespaces"

* tag 'nfsd-4.15-1' of git://linux-nfs.org/~bfields/linux: (23 commits)
  lockd: fix "list_add double add" caused by legacy signal interface
  nlm_shutdown_hosts_net() cleanup
  race of nfsd inetaddr notifiers vs nn->nfsd_serv change
  race of lockd inetaddr notifiers vs nlmsvc_rqst change
  SUNRPC: make cache_detail structures const
  NFSD: make cache_detail structures const
  sunrpc: make the function arg as const
  nfsd: check for use of the closed special stateid
  nfsd: fix panic in posix_unblock_lock called from nfs4_laundromat
  lockd: lost rollback of set_grace_period() in lockd_down_net()
  lockd: added cleanup checks in exit_net hook
  grace: replace BUG_ON by WARN_ONCE in exit_net hook
  nfsd: fix locking validator warning on nfs4_ol_stateid->st_mutex class
  lockd: remove net pointer from messages
  nfsd: remove net pointer from debug messages
  nfsd: Fix races with check_stateid_generation()
  nfsd: Ensure we check stateid validity in the seqid operation checks
  nfsd: Fix race in lock stateid creation
  nfsd4: move find_lock_stateid
  nfsd: Ensure we don't recognise lock stateids after freeing them
  ...

Merge tag 'for-4.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
"We've collected some fixes in since the pre-merge window freeze.

  There's technically only one regression fix for 4.15, but the rest
  seems important and candidates for stable.

   - fix missing flush bio puts in error cases (is serious, but rarely
     happens)

   - fix reporting stat::st_blocks for buffered append writes

   - fix space cache invalidation

   - fix out of bound memory access when setting zlib level

   - fix potential memory corruption when fsync fails in the middle

   - fix crash in integrity checker

   - incremetnal send fix, path mixup for certain unlink/rename
     combination

   - pass flags to writeback so compressed writes can be throttled
     properly

   - error handling fixes"

* tag 'for-4.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: incremental send, fix wrong unlink path after renaming file
  btrfs: tree-checker: Fix false panic for sanity test
  Btrfs: fix list_add corruption and soft lockups in fsync
  btrfs: Fix wild memory access in compression level parser
  btrfs: fix deadlock when writing out space cache
  btrfs: clear space cache inode generation always
  Btrfs: fix reported number of inode blocks after buffered append writes
  Btrfs: move definition of the function btrfs_find_new_delalloc_bytes
  Btrfs: bail out gracefully rather than BUG_ON
  btrfs: dev_alloc_list is not protected by RCU, use normal list_del
  btrfs: add missing device::flush_bio puts
  btrfs: Fix transaction abort during failure in btrfs_rm_dev_item
  Btrfs: add write_flags for compression bio

Merge tag 'microblaze-4.15-rc2' of git://git.monstr.eu/linux-2.6-microblaze

Pull Microblaze fix from Michal Simek:
"Add missing header to mmu_context_mm.h"

* tag 'microblaze-4.15-rc2' of git://git.monstr.eu/linux-2.6-microblaze:
microblaze: add missing include to mmu_context_mm.h

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc

Pull sparc fix from David Miller:
"Sparc T4 and later cpu bootup regression fix"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc64: Fix boot on T4 and later.

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) The forcedeth conversion from pci_*() DMA interfaces to dma_*() ones
    missed one spot. From Zhu Yanjun.

2) Missing CRYPTO_SHA256 Kconfig dep in cfg80211, from Johannes Berg.

3) Fix checksum offloading in thunderx driver, from Sunil Goutham.

4) Add SPDX to vm_sockets_diag.h, from Stephen Hemminger.

5) Fix use after free of packet headers in TIPC, from Jon Maloy.

6) "sizeof(ptr)" vs "sizeof(*ptr)" bug in i40e, from Gustavo A R Silva.

7) Tunneling fixes in mlxsw driver, from Petr Machata.

8) Fix crash in fanout_demux_rollover() of AF_PACKET, from Mike
    Maloney.

9) Fix race in AF_PACKET bind() vs. NETDEV_UP notifier, from Eric
    Dumazet.

10) Fix regression in sch_sfq.c due to one of the timer_setup()
    conversions. From Paolo Abeni.

11) SCTP does list_for_each_entry() using wrong struct member, fix from
    Xin Long.

12) Don't use big endian netlink attribute read for
    IFLA_BOND_AD_ACTOR_SYSTEM, it is in cpu endianness. Also from Xin
    Long.

13) Fix mis-initialization of q->link.clock in CBQ scheduler, preventing
    adding filters there. From Jiri Pirko.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits)
  ethernet: dwmac-stm32: Fix copyright
  net: via: via-rhine: use %p to format void * address instead of %x
  net: ethernet: xilinx: Mark XILINX_LL_TEMAC broken on 64-bit
  myri10ge: Update MAINTAINERS
  net: sched: cbq: create block for q->link.block
  atm: suni: remove extraneous space to fix indentation
  atm: lanai: use %p to format kernel addresses instead of %x
  VSOCK: Don't set sk_state to TCP_CLOSE before testing it
  atm: fore200e: use %pK to format kernel addresses instead of %x
  ambassador: fix incorrect indentation of assignment statement
  vxlan: use __be32 type for the param vni in __vxlan_fdb_delete
  bonding: use nla_get_u64 to extract the value for IFLA_BOND_AD_ACTOR_SYSTEM
  sctp: use right member as the param of list_for_each_entry
  sch_sfq: fix null pointer dereference at timer expiration
  cls_bpf: don't decrement net's refcount when offload fails
  net/packet: fix a race in packet_bind() and packet_notifier()
  packet: fix crash in fanout_demux_rollover()
  sctp: remove extern from stream sched
  sctp: force the params with right types for sctp csum apis
  sctp: force SCTP_ERROR_INV_STRM with __u32 when calling sctp_chunk_fail
  ...

sparc64: Fix boot on T4 and later.

If we don't put the NG4fls.o object into the same part of
the link as the generic sparc64 objects for fls() and __fls()
then the relocation in the branch we use for patching will
not fit.

Move NG4fls.o into lib-y to fix this problem.

Fixes: 46ad8d2d22c1 ("sparc64: Use sparc optimized fls and __fls for T4 and above")
Signed-off-by: David S. Miller <[email protected]>
Reported-by: Anatoly Pugachev <[email protected]>
Tested-by: Anatoly Pugachev <[email protected]>

vsprintf: don't use 'restricted_pointer()' when not restricting

Instead, just fall back on the new '%p' behavior which hashes the
pointer.

Otherwise, '%pK' - that was intended to mark a pointer as restricted -
just ends up leaking pointers that a normal '%p' wouldn't leak. Which
just make the whole thing pointless.

I suspect we should actually get rid of '%pK' entirely, and make it just
work as '%p' regardless, but this is the minimal obvious fix. People
who actually use 'kptr_restrict' should weigh in on which behavior they
want.

Cc: Tobin Harding <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

kallsyms: take advantage of the new '%px' format

The conditional kallsym hex printing used a special fixed-width '%lx'
output (KALLSYM_FMT) in preparation for the hashing of %p, but that
series ended up adding a %px specifier to help with the conversions.

Use it, and avoid the "print pointer as an unsigned long" code.

Signed-off-by: Linus Torvalds <[email protected]>

Merge tag 'printk-hash-pointer-4.15-rc2' of git://github.com/tcharding/linux

Pull printk pointer hashing update from Tobin Harding:
"Here is the patch set that implements hashing of printk specifier %p.

  First we have two clean up patches then we do the hashing. Hashing is
  done via the SipHash algorithm. The next patch adds printk specifier
  %px for printing pointers when we _really_ want to see the address i.e
  %px is functionally equivalent to %lx. Final patch in the set fixes
  KASAN since we break it by hashing %p.

  For the record here is the justification for the series:

    Currently there exist approximately 14 000 places in the Kernel
    where addresses are being printed using an unadorned %p. This
    potentially leaks sensitive information about the Kernel layout in
    memory. Many of these calls are stale, instead of fixing every call
    we hash the address by default before printing. We then add %px to
    provide a way to print the actual address. Although this is
    achievable using %lx, using %px will assist us if we ever want to
    change pointer printing behaviour. %px is more uniquely grep'able
    (there are already >50 000 uses of %lx).

    The added advantage of hashing %p is that security is now opt-out,
    if you _really_ want the address you have to work a little harder
    and use %px.

  This will of course break some users, forcing code printing needed
  addresses to be updated"

[ I do expect this to be an annoyance, and a number of %px users to be
  added for debuggability. But nobody is willing to audit existing %p
  users for information leaks, and a number of places really only use
  the pointer as an object identifier rather than really 'I need the
  address'.

  IOW - sorry for the inconvenience, but it's the least inconvenient of
  the options.    - Linus ]

* tag 'printk-hash-pointer-4.15-rc2' of git://github.com/tcharding/linux:
  kasan: use %px to print addresses instead of %p
  vsprintf: add printk specifier %px
  printk: hash addresses printed with %p
  vsprintf: refactor %pK code out of pointer()
  docs: correct documentation for %pK

Revert "mm, thp: Do not make pmd/pud dirty without a reason"

This reverts commit 152e93af3cfe2d29d8136cc0a02a8612507136ee.

It was a nice cleanup in theory, but as Nicolai Stange points out, we do
need to make the page dirty for the copy-on-write case even when we
didn't end up making it writable, since the dirty bit is what we use to
check that we've gone through a COW cycle.

Reported-by: Michal Hocko <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ethernet: dwmac-stm32: Fix copyright

Uniformize STMicroelectronics copyrights header

Signed-off-by: Benjamin Gaignard <[email protected]>
CC: Alexandre Torgue <[email protected]>
Acked-by: Alexandre TORGUE <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: via: via-rhine: use %p to format void * address instead of %x

Don't use %x and casting to print out an address, instead use %p
and remove the casting. Cleans up smatch warnings:

drivers/net/ethernet/via/via-rhine.c:998 rhine_init_one_common()
warn: argument 4 to %lx specifier is cast from pointer

Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ethernet: xilinx: Mark XILINX_LL_TEMAC broken on 64-bit

On 64-bit (e.g. powerpc64/allmodconfig):

    drivers/net/ethernet/xilinx/ll_temac_main.c: In function 'temac_start_xmit_done':
    drivers/net/ethernet/xilinx/ll_temac_main.c:633:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
dev_kfree_skb_irq((struct sk_buff *)cur_p->app4);
  ^

cdmac_bd.app4 is u32, so it is too small to hold a kernel pointer.

Note that several other fields in struct cdmac_bd are also too small to
hold physical addresses on 64-bit platforms.

Signed-off-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

myri10ge: Update MAINTAINERS

Change the maintainer to Chris Lee who has access to Myricom hardware
and can test/review. Update the website URL.

Signed-off-by: Hyong-Youb Kim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

kasan: use %px to print addresses instead of %p

Pointers printed with %p are now hashed by default. Kasan needs the
actual address. We can use the new printk specifier %px for this
purpose.

Use %px instead of %p to print addresses.

Signed-off-by: Tobin C. Harding <[email protected]>

vsprintf: add printk specifier %px

printk specifier %p now hashes all addresses before printing. Sometimes
we need to see the actual unmodified address. This can be achieved using
%lx but then we face the risk that if in future we want to change the
way the Kernel handles printing of pointers we will have to grep through
the already existent 50 000 %lx call sites. Let's add specifier %px as a
clear, opt-in, way to print a pointer and maintain some level of
isolation from all the other hex integer output within the Kernel.

Add printk specifier %px to print the actual unmodified address.

Signed-off-by: Tobin C. Harding <[email protected]>

printk: hash addresses printed with %p

Currently there exist approximately 14 000 places in the kernel where
addresses are being printed using an unadorned %p. This potentially
leaks sensitive information regarding the Kernel layout in memory. Many
of these calls are stale, instead of fixing every call lets hash the
address by default before printing. This will of course break some
users, forcing code printing needed addresses to be updated.

Code that _really_ needs the address will soon be able to use the new
printk specifier %px to print the address.

For what it's worth, usage of unadorned %p can be broken down as
follows (thanks to Joe Perches).

$ git grep -E '%p[^A-Za-z0-9]' | cut -f1 -d"/" | sort | uniq -c
   1084 arch
     20 block
     10 crypto
     32 Documentation
   8121 drivers
   1221 fs
    143 include
    101 kernel
     69 lib
    100 mm
   1510 net
     40 samples
      7 scripts
     11 security
    166 sound
    152 tools
      2 virt

Add function ptr_to_id() to map an address to a 32 bit unique
identifier. Hash any unadorned usage of specifier %p and any malformed
specifiers.

Signed-off-by: Tobin C. Harding <[email protected]>

vsprintf: refactor %pK code out of pointer()

Currently code to handle %pK is all within the switch statement in
pointer(). This is the wrong level of abstraction. Each of the other switch
clauses call a helper function, pK should do the same.

Refactor code out of pointer() to new function restricted_pointer().

Signed-off-by: Tobin C. Harding <[email protected]>

docs: correct documentation for %pK

Current documentation indicates that %pK prints a leading '0x'. This is
not the case.

Correct documentation for printk specifier %pK.

Signed-off-by: Tobin C. Harding <[email protected]>

Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto fixes from Herbert Xu:

- avoid potential bogus alignment for some AEAD operations

- fix crash in algif_aead

- avoid sleeping in softirq context with async af_alg

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: skcipher - Fix skcipher_walk_aead_common
  crypto: af_alg - remove locking in async callback
  crypto: algif_aead - skip SGL entries with NULL page

net: sched: cbq: create block for q->link.block

q->link.block is not initialized, that leads to EINVAL when one tries to
add filter there. So initialize it properly.

This can be reproduced by:
$ tc qdisc add dev eth0 root handle 1: cbq avpkt 1000 rate 1000Mbit bandwidth 1000Mbit
$ tc filter add dev eth0 parent 1: protocol ip prio 100 u32 match ip protocol 0 0x00 flowid 1:1

Reported-by: Jaroslav Aster <[email protected]>
Reported-by: Ivan Vecera <[email protected]>
Fixes: 6529eaba33f0 ("net: sched: introduce tcf block infractructure")
Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Eelco Chaudron <[email protected]>
Reviewed-by: Ivan Vecera <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

atm: suni: remove extraneous space to fix indentation

Remove a leading space, fixes indentation

Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

atm: lanai: use %p to format kernel addresses instead of %x

Don't use %x and casting to print out a kernel address, instead use %p
and remove the casting. Cleans up smatch warnings:

drivers/atm/lanai.c:1589 service_buffer_allocate() warn: argument 2 to
%08lX specifier is cast from pointer
drivers/atm/lanai.c:2221 lanai_dev_open() warn: argument 4 to %lx
specifier is cast from pointer

Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

VSOCK: Don't set sk_state to TCP_CLOSE before testing it

A recent commit (3b4477d2dcf2) converted the sk_state to use
TCP constants. In that change, vmci_transport_handle_detach
was changed such that sk->sk_state was set to TCP_CLOSE before
we test whether it is TCP_SYN_SENT. This change moves the
sk_state change back to the original locations in that function.

Signed-off-by: Jorgen Hansen <[email protected]>
Reviewed-by: Stefan Hajnoczi <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

atm: fore200e: use %pK to format kernel addresses instead of %x

Don't use %x and casting to print out a kernel address, instead use the
%pK and remove the casting. Cleans up smatch warning:

drivers/atm/fore200e.c:3093 fore200e_proc_read() warn: argument 3 to %08x
specifier is cast from pointer

Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ambassador: fix incorrect indentation of assignment statement

Remove one extraneous level of indentation on assignment statement.

Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

vxlan: use __be32 type for the param vni in __vxlan_fdb_delete

All callers of __vxlan_fdb_delete pass vni with __be32 type, and
this param should be declared as __be32 type.

Fixes: 3ad7a4b141eb ("vxlan: support fdb and learning in COLLECT_METADATA mode")
Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bonding: use nla_get_u64 to extract the value for IFLA_BOND_AD_ACTOR_SYSTEM

bond_opt_initval expects a u64 type param, it's better to use
nla_get_u64 to extract the value here, to eliminate a sparse
endianness mismatch warning.

Fixes: 171a42c38c6e ("bonding: add netlink support for sys prio, actor sys mac, and port key")
Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: use right member as the param of list_for_each_entry

Commit d04adf1b3551 ("sctp: reset owner sk for data chunks on out queues
when migrating a sock") made a mistake that using 'list' as the param of
list_for_each_entry to traverse the retransmit, sacked and abandoned
queues, while chunks are using 'transmitted_list' to link into these
queues.

It could cause NULL dereference panic if there are chunks in any of these
queues when peeling off one asoc.

So use the chunk member 'transmitted_list' instead in this patch.

Fixes: d04adf1b3551 ("sctp: reset owner sk for data chunks on out queues when migrating a sock")
Signed-off-by: Xin Long <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Acked-by: Neil Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sch_sfq: fix null pointer dereference at timer expiration

While converting sch_sfq to use timer_setup(), the commit cdeabbb88134
("net: sched: Convert timers to use timer_setup()") forgot to
initialize the 'sch' field. As a result, the timer callback tries to
dereference a NULL pointer, and the kernel does oops.

Fix it initializing such field at qdisc creation time.

Fixes: cdeabbb88134 ("net: sched: Convert timers to use timer_setup()")
Signed-off-by: Paolo Abeni <[email protected]>
Acked-by: Cong Wang <[email protected]>
Acked-by: Kees Cook <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

cls_bpf: don't decrement net's refcount when offload fails

When cls_bpf offload was added it seemed like a good idea to
call cls_bpf_delete_prog() instead of extending the error
handling path, since the software state is fully initialized
at that point. This handling of errors without jumping to
the end of the function is error prone, as proven by later
commit missing that extra call to __cls_bpf_delete_prog().

__cls_bpf_delete_prog() is now expected to be invoked with
a reference on exts->net or the field zeroed out. The call
on the offload's error patch does not fullfil this requirement,
leading to each error stealing a reference on net namespace.

Create a function undoing what cls_bpf_set_parms() did and
use it from __cls_bpf_delete_prog() and the error path.

Fixes: aae2c35ec892 ("cls_bpf: use tcf_exts_get_net() before call_rcu()")
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Acked-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'drm-for-v4.15-part2-fixes' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:

- TTM regression fix for some virt gpus (bochs vga)

- a few i915 stable fixes

- one vc4 fix

- one uapi fix

* tag 'drm-for-v4.15-part2-fixes' of git://people.freedesktop.org/~airlied/linux:
  drm/ttm: don't attempt to use hugepages if dma32 requested (v2)
  drm/vblank: Pass crtc_id to page_flip_ioctl.
  drm/i915: Fix init_clock_gating for resume
  drm/i915: Mark the userptr invalidate workqueue as WQ_MEM_RECLAIM
  drm/i915: Clear breadcrumb node when cancelling signaling
  drm/i915/gvt: ensure -ve return value is handled correctly
  drm/i915: Re-register PMIC bus access notifier on runtime resume
  drm/i915: Fix false-positive assert_rpm_wakelock_held in i915_pmic_bus_access_notifier v2
  drm/edid: Don't send non-zero YQ in AVI infoframe for HDMI 1.x sinks
  drm/vc4: Account for interrupts in flight

Revert "ALSA: usb-audio: Fix potential zero-division at parsing FU"

The commit 8428a8ebde2d ("ALSA: usb-audio: Fix potential zero-division
at parsing FU") is utterly bogus and breaks the case with csize=1
instead of fixing anything. Just take it back again.

Reported-by: Jörg Otte <[email protected]>
Fixes: 8428a8ebde2d ("ALSA: usb-audio: Fix potential zero-division at parsing FU"
Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Btrfs: incremental send, fix wrong unlink path after renaming file

Under some circumstances, an incremental send operation can issue wrong
paths for unlink commands related to files that have multiple hard links
and some (or all) of those links were renamed between the parent and send
snapshots. Consider the following example:

Parent snapshot

.                                                      (ino 256)
|---- a/                                               (ino 257)
|     |---- b/                                         (ino 259)
|     |     |---- c/                                   (ino 260)
|     |     |---- f2                                   (ino 261)
|     |
|     |---- f2l1                                       (ino 261)
|
|---- d/                                               (ino 262)
       |---- f1l1_2                                     (ino 258)
       |---- f2l2                                       (ino 261)
       |---- f1_2                                       (ino 258)

Send snapshot

.                                                      (ino 256)
|---- a/                                               (ino 257)
|     |---- f2l1/                                      (ino 263)
|             |---- b2/                                (ino 259)
|                   |---- c/                           (ino 260)
|                   |     |---- d3                     (ino 262)
|                   |           |---- f1l1_2           (ino 258)
|                   |           |---- f2l2_2           (ino 261)
|                   |           |---- f1_2             (ino 258)
|                   |
|                   |---- f2                           (ino 261)
|                   |---- f1l2                         (ino 258)
|
|---- d                                                (ino 261)

When computing the incremental send stream the following steps happen:

1) When processing inode 261, a rename operation is issued that renames
   inode 262, which currently as a path of "d", to an orphan name of
   "o262-7-0". This is done because in the send snapshot, inode 261 has
   of its hard links with a path of "d" as well.

2) Two link operations are issued that create the new hard links for
   inode 261, whose names are "d" and "f2l2_2", at paths "/" and
   "o262-7-0/" respectively.

3) Still while processing inode 261, unlink operations are issued to
   remove the old hard links of inode 261, with names "f2l1" and "f2l2",
   at paths "a/" and "d/". However path "d/" does not correspond anymore
   to the directory inode 262 but corresponds instead to a hard link of
   inode 261 (link command issued in the previous step). This makes the
   receiver fail with a ENOTDIR error when attempting the unlink
   operation.

The problem happens because before sending the unlink operation, we failed
to detect that inode 262 was one of ancestors for inode 261 in the parent
snapshot, and therefore we didn't recompute the path for inode 262 before
issuing the unlink operation for the link named "f2l2" of inode 262. The
detection failed because the function "is_ancestor()" only follows the
first hard link it finds for an inode instead of all of its hard links
(as it was originally created for being used with directories only, for
which only one hard link exists). So fix this by making "is_ancestor()"
follow all hard links of the input inode.

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>

net/packet: fix a race in packet_bind() and packet_notifier()

syzbot reported crashes [1] and provided a C repro easing bug hunting.

When/if packet_do_bind() calls __unregister_prot_hook() and releases
po->bind_lock, another thread can run packet_notifier() and process an
NETDEV_UP event.

This calls register_prot_hook() and hooks again the socket right before
first thread is able to grab again po->bind_lock.

Fixes this issue by temporarily setting po->num to 0, as suggested by
David Miller.

[1]
dev_remove_pack: ffff8801bf16fa80 not found
------------[ cut here ]------------
kernel BUG at net/core/dev.c:7945!  ( BUG_ON(!list_empty(&dev->ptype_all)); )
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
device syz0 entered promiscuous mode
CPU: 0 PID: 3161 Comm: syzkaller404108 Not tainted 4.14.0+ #190
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
task: ffff8801cc57a500 task.stack: ffff8801cc588000
RIP: 0010:netdev_run_todo+0x772/0xae0 net/core/dev.c:7945
RSP: 0018:ffff8801cc58f598 EFLAGS: 00010293
RAX: ffff8801cc57a500 RBX: dffffc0000000000 RCX: ffffffff841f75b2
RDX: 0000000000000000 RSI: 1ffff100398b1ede RDI: ffff8801bf1f8810
device syz0 entered promiscuous mode
RBP: ffff8801cc58f898 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801bf1f8cd8
R13: ffff8801cc58f870 R14: ffff8801bf1f8780 R15: ffff8801cc58f7f0
FS:  0000000001716880(0000) GS:ffff8801db400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020b13000 CR3: 0000000005e25000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
rtnl_unlock+0xe/0x10 net/core/rtnetlink.c:106
tun_detach drivers/net/tun.c:670 [inline]
tun_chr_close+0x49/0x60 drivers/net/tun.c:2845
__fput+0x333/0x7f0 fs/file_table.c:210
____fput+0x15/0x20 fs/file_table.c:244
task_work_run+0x199/0x270 kernel/task_work.c:113
exit_task_work include/linux/task_work.h:22 [inline]
do_exit+0x9bb/0x1ae0 kernel/exit.c:865
do_group_exit+0x149/0x400 kernel/exit.c:968
SYSC_exit_group kernel/exit.c:979 [inline]
SyS_exit_group+0x1d/0x20 kernel/exit.c:977
entry_SYSCALL_64_fastpath+0x1f/0x96
RIP: 0033:0x44ad19

Fixes: 30f7ea1c2b5f ("packet: race condition in packet_bind")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: syzbot <[email protected]>
Cc: Francesco Ruggeri <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

packet: fix crash in fanout_demux_rollover()

syzkaller found a race condition fanout_demux_rollover() while removing
a packet socket from a fanout group.

po->rollover is read and operated on during packet_rcv_fanout(), via
fanout_demux_rollover(), but the pointer is currently cleared before the
synchronization in packet_release(). It is safer to delay the cleanup
until after synchronize_net() has been called, ensuring all calls to
packet_rcv_fanout() for this socket have finished.

To further simplify synchronization around the rollover structure, set
po->rollover in fanout_add() only if there are no errors. This removes
the need for rcu in the struct and in the call to
packet_getsockopt(..., PACKET_ROLLOVER_STATS, ...).

Crashing stack trace:
fanout_demux_rollover+0xb6/0x4d0 net/packet/af_packet.c:1392
packet_rcv_fanout+0x649/0x7c8 net/packet/af_packet.c:1487
dev_queue_xmit_nit+0x835/0xc10 net/core/dev.c:1953
xmit_one net/core/dev.c:2975 [inline]
dev_hard_start_xmit+0x16b/0xac0 net/core/dev.c:2995
__dev_queue_xmit+0x17a4/0x2050 net/core/dev.c:3476
dev_queue_xmit+0x17/0x20 net/core/dev.c:3509
neigh_connected_output+0x489/0x720 net/core/neighbour.c:1379
neigh_output include/net/neighbour.h:482 [inline]
ip6_finish_output2+0xad1/0x22a0 net/ipv6/ip6_output.c:120
ip6_finish_output+0x2f9/0x920 net/ipv6/ip6_output.c:146
NF_HOOK_COND include/linux/netfilter.h:239 [inline]
ip6_output+0x1f4/0x850 net/ipv6/ip6_output.c:163
dst_output include/net/dst.h:459 [inline]
NF_HOOK.constprop.35+0xff/0x630 include/linux/netfilter.h:250
mld_sendpack+0x6a8/0xcc0 net/ipv6/mcast.c:1660
mld_send_initial_cr.part.24+0x103/0x150 net/ipv6/mcast.c:2072
mld_send_initial_cr net/ipv6/mcast.c:2056 [inline]
ipv6_mc_dad_complete+0x99/0x130 net/ipv6/mcast.c:2079
addrconf_dad_completed+0x595/0x970 net/ipv6/addrconf.c:4039
addrconf_dad_work+0xac9/0x1160 net/ipv6/addrconf.c:3971
process_one_work+0xbf0/0x1bc0 kernel/workqueue.c:2113
worker_thread+0x223/0x1990 kernel/workqueue.c:2247
kthread+0x35e/0x430 kernel/kthread.c:231
ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:432

Fixes: 0648ab70afe6 ("packet: rollover prepare: per-socket state")
Fixes: 509c7a1ecc860 ("packet: avoid panic in packet_getsockopt()")
Reported-by: syzbot <[email protected]>
Signed-off-by: Mike Maloney <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'sctp-fix-sparse-errors'

Xin Long says:

====================
sctp: fix some other sparse errors

After the last fixes for sparse errors, there are still three sparse
errors in sctp codes, two of them are type cast, and the other one
is using extern.
====================

Signed-off-by: David S. Miller <[email protected]>

sctp: remove extern from stream sched

Now each stream sched ops is defined in different .c file and
added into the global ops in another .c file, it uses extern
to make this work.

However extern is not good coding style to get them in and
even make C=2 reports errors for this.

This patch adds sctp_sched_ops_xxx_init for each stream sched
ops in their .c file, then get them into the global ops by
calling them when initializing sctp module.

Fixes: 637784ade221 ("sctp: introduce priority based stream scheduler")
Fixes: ac1ed8b82cd6 ("sctp: introduce round robin stream scheduler")
Signed-off-by: Xin Long <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: force the params with right types for sctp csum apis

Now sctp_csum_xxx doesn't really match the param types of these common
csum apis. As sctp_csum_xxx is defined in sctp/checksum.h, many sparse
errors occur when make C=2 not only with M=net/sctp but also with other
modules that include this header file.

This patch is to force them fit in csum apis with the right types.

Fixes: e6d8b64b34aa ("net: sctp: fix and consolidate SCTP checksumming code")
Signed-off-by: Xin Long <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

sctp: force SCTP_ERROR_INV_STRM with __u32 when calling sctp_chunk_fail

This patch is to force SCTP_ERROR_INV_STRM with right type to
fit in sctp_chunk_fail to avoid the sparse error.

Signed-off-by: Xin Long <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

lmc: Use memdup_user() as a cleanup

Fix coccicheck warning which recommends to use memdup_user():
drivers/net/wan/lmc/lmc_main.c:497:27-34: WARNING opportunity for memdup_user
Generated by: scripts/coccinelle/memdup_user/memdup_user.cocci

Signed-off-by: Vasyl Gomonovych <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

bnxt_en: Fix an error handling path in 'bnxt_get_module_eeprom()'

Error code returned by 'bnxt_read_sfp_module_eeprom_info()' is handled a
few lines above when reading the A0 portion of the EEPROM.
The same should be done when reading the A2 portion of the EEPROM.

In order to correctly propagate an error, update 'rc' in this 2nd call as
well, otherwise 0 (success) is returned.

Signed-off-by: Christophe JAILLET <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: phy: marvell10g: fix the PHY id mask

The Marvell 10G PHY driver supports different hardware revisions, which
have their bits 3..0 differing. To get the correct revision number these
bits should be ignored. This patch fixes this by using the already
defined MARVELL_PHY_ID_MASK (0xfffffff0) instead of the custom
0xffffffff mask.

Fixes: 20b2af32ff3f ("net: phy: add Marvell Alaska X 88X3310 10Gigabit PHY support")
Suggested-by: Yan Markman <[email protected]>
Signed-off-by: Antoine Tenart <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'mvpp2-fixes'

Antoine Tenart says:

====================
net: mvpp2: set of fixes

This series fixes various issues with the Marvell PPv2 driver. The
patches are sent together to avoid any possible conflict. The series is
based on today's net tree.
====================

Signed-off-by: David S. Miller <[email protected]>

net: mvpp2: check ethtool sets the Tx ring size is to a valid min value

This patch fixes the Tx ring size checks when using ethtool, by adding
an extra check in the PPv2 check_ringparam_valid helper. The Tx ring
size cannot be set to a value smaller than the minimum number of
descriptors needed for TSO.

Fixes: 1d17db08c056 ("net: mvpp2: limit TSO segments and use stop/wake thresholds")
Suggested-by: Yan Markman <[email protected]>
Signed-off-by: Antoine Tenart <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: mvpp2: do not disable GMAC padding

Short fragmented packets may never be sent by the hardware when padding
is disabled. This patch stop modifying the GMAC padding bits, to leave
them to their reset value (disabled).

Fixes: 3919357fb0bb ("net: mvpp2: initialize the GMAC when using a port")
Signed-off-by: Yan Markman <[email protected]>
[Antoine: commit message]
Signed-off-by: Antoine Tenart <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: mvpp2: cleanup probed ports in the probe error path

This patches fixes the probe error path by cleaning up probed ports, to
avoid leaving registered net devices when the driver failed to probe.

Fixes: 3f518509dedc ("ethernet: Add new driver for Marvell Armada 375 network unit")
Signed-off-by: Antoine Tenart <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: mvpp2: fix the txq_init error path

When an allocation in the txq_init path fails, the allocated buffers
end-up being freed twice: in the txq_init error path, and in txq_deinit.
This lead to issues as txq_deinit would work on already freed memory
regions:

kernel BUG at mm/slub.c:3915!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP

This patch fixes this by removing the txq_init own error path, as the
txq_deinit function is always called on errors. This was introduced by
TSO as way more buffers are allocated.

Fixes: 186cd4d4e414 ("net: mvpp2: software tso support")
Signed-off-by: Antoine Tenart <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'mlxsw-GRE-offloading-fixes'

Jiri Pirko says:

====================
mlxsw: GRE offloading fixes

Petr says:

This patchset fixes a couple bugs in offloading GRE tunnels in mlxsw
driver.

Patch #1 fixes a problem that local routes pointing at a GRE tunnel
device are offloaded even if that netdevice is down.

Patch #2 detects that as a result of moving a GRE netdevice to a
different VRF, two tunnels now have a conflict of local addresses,
something that the mlxsw driver can't offload.

Patch #3 fixes a FIB abort caused by forming a route pointing at a
GRE tunnel that is eligible for offloading but already onloaded.

Patch #4 fixes a problem that next hops migrated to a new RIF kept the
old RIF reference, which went dangling shortly afterwards.
====================

Signed-off-by: David S. Miller <[email protected]>

mlxsw: spectrum_router: Update nexthop RIF on update

The function mlxsw_sp_nexthop_rif_update() walks the list of nexthops
associated with a RIF, and updates the corresponding entries in the
switch. It is used in particular when a tunnel underlay netdevice moves
to a different VRF, and all the nexthops are migrated over to a new RIF.
The problem is that each nexthop holds a reference to its RIF, and that
is not updated. So after the old RIF is gone, further activity on these
nexthops (such as downing the underlay netdevice) dereferences a
dangling pointer.

Fix the issue by updating rif of impacted nexthops before calling
mlxsw_sp_nexthop_rif_update().

Fixes: 0c5f1cd5ba8c ("mlxsw: spectrum_router: Generalize __mlxsw_sp_ipip_entry_update_tunnel()")
Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mlxsw: spectrum_router: Handle encap to demoted tunnels

Some tunnels that are offloadable on their own can nonetheless be
demoted to slow path if their local address is in conflict with that of
another tunnel. When a route is formed for such a tunnel,
mlxsw_sp_nexthop_ipip_init() fails to find the corresponding IPIP entry,
and that triggers a FIB abort.

Resolve the problem by not assuming that a tunnel for which
mlxsw_sp_ipip_ops.can_offload() holds also automatically has an IPIP
entry.

Fixes: af641713e97d ("mlxsw: spectrum_router: Onload conflicting tunnels")
Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mlxsw: spectrum_router: Demote tunnels on VRF migration

The mlxsw driver currently doesn't offload GRE tunnels if they have the
same local address and use the same underlay VRF. When such a situation
arises, the tunnels in conflict are demoted to slow path.

However, the current code only verifies this condition on tunnel
creation and tunnel change, not when a tunnel is moved to a different
VRF. When the tunnel has no bound device, underlay and overlay are the
same. Thus moving a tunnel moves the underlay as well, and that can
cause local address conflict.

So modify mlxsw_sp_netdevice_ipip_ol_vrf_event() to check if there are
any conflicting tunnels, and demote them if yes.

Fixes: af641713e97d ("mlxsw: spectrum_router: Onload conflicting tunnels")
Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mlxsw: spectrum_router: Offload decap only for up tunnels

When a new local route is added, an IPIP entry is looked up to determine
whether the route should be offloaded as a tunnel decap or as a trap.
That decision should take into account whether the tunnel netdevice in
question is actually IFF_UP, and only install a decap offload if it is.

Fixes: 0063587d3587 ("mlxsw: spectrum: Support decap-only IP-in-IP tunnels")
Signed-off-by: Petr Machata <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2017-11-27

This series contains updates to e1000, e1000e and i40e.

Gustavo A. R. Silva fixes a sizeof() issue where we were taking the size of
the pointer (which is always the size of the pointer).

Sasha does a follow up fix to a previous fix for buffer overrun, to resolve
community feedback from David Laight and the use of magic numbers.

Amritha fixes the reporting of error codes for when adding a cloud filter
fails.

Ahmad Fatoum brushes the dust off the e1000 driver to fix a code comment
and debug message which was incorrect about what the code was really doing.
====================

Signed-off-by: David S. Miller <[email protected]>