Git Repo - linux.git/log

ARM: 8778/1: clkdev: don't call __of_clk_get_by_name() unnecessarily from clk_get()

The way this function is implemented caused some confusion when
converting the TI DaVinci platform to using the common clock framework.

Current kernel supports booting DaVinci boards both in device tree as
well as legacy, board-file mode. In the latter, we always end up
calling clk_get_sys() as of_node is NULL and __of_clk_get_by_name()
returns -ENOENT.

It was not obvious at first glance how clk_get(dev, NULL) will work in
board-file mode since we always call __of_clk_get_by_name(). Let's make
it clearer by checking if of_node is NULL and skipping right to
clk_get_sys().

Cc: Sekhar Nori <[email protected]>
Cc: Kevin Hilman <[email protected]>
Cc: David Lechner <[email protected]>
Reviewed-by: David Lechner <[email protected]>
Reviewed-by: Sekhar Nori <[email protected]>
Signed-off-by: Bartosz Golaszewski <[email protected]>
Signed-off-by: Russell King <[email protected]>

Documentation: remove dynamic-resolution-notes reference to non-existent file

File dt-object-internal.txt does not exist. This patch removes
a reference to it.

Signed-off-by: Harish Jenny K N <[email protected]>
Reviewed-by: Frank Rowand <[email protected]>
Signed-off-by: Rob Herring <[email protected]>

Bluetooth: mediatek: pass correct size to h4_recv_buf()

We're supposed to pass the number of elements in the mtk_recv_pkts, not
the number of bytes.

Fixes: 7237c4c9ec92 ("Bluetooth: mediatek: Add protocol support for MediaTek serial devices")
Signed-off-by: Dan Carpenter <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

Merge tag 'asoc-v4.19' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus

ASoC: Updates for v4.19

A fairly big update, including quite a bit of core activity this time
around (which is good to see) along with a fairly large set of new
drivers.

- A new snd_pcm_stop_xrun() helper which is now used in several
   drivers.
- Support for providing name prefixes to generic component nodes.
- Quite a few fixes for DPCM as it gains a bit wider use and more
   robust testing.
- Generalization of the DIO2125 support to a simple amplifier driver.
- Accessory detection support for the audio graph card.
- DT support for PXA AC'97 devices.
- Quirks for a number of new x86 systems.
- Support for AM Logic Meson, Everest ES7154, Intel systems with
   RT5682, Qualcomm QDSP6 and WCD9335, Realtek RT5682 and TI TAS5707.

parisc: Fix and improve kernel stack unwinding

This patchset fixes and improves stack unwinding a lot:
1. Show backward stack traces with up to 30 callsites
2. Add callinfo to ENTRY_CFI() such that every assembler function will get an
entry in the unwind table
3. Use constants instead of numbers in call_on_stack()
4. Do not depend on CONFIG_KALLSYMS to generate backtraces.
5. Speed up backtrace generation

Make sure you have this patch to GNU as installed:
https://sourceware.org/ml/binutils/2018-07/msg00474.html
Without this patch, unwind info in the kernel is often wrong for various
functions.

Signed-off-by: Helge Deller <[email protected]>

parisc: Remove unnecessary barriers from spinlock.h

Now that mb() is an instruction barrier, it will slow performance if we issue
unnecessary barriers.

The spinlock defines have a number of unnecessary barriers.  The __ldcw()
define is both a hardware and compiler barrier.  The mb() barriers in the
routines using __ldcw() serve no purpose.

The only barrier needed is the one in arch_spin_unlock().  We need to ensure
all accesses are complete prior to releasing the lock.

Signed-off-by: John David Anglin <[email protected]>
Cc: [email protected] # 4.0+
Signed-off-by: Helge Deller <[email protected]>

parisc: Remove ordered stores from syscall.S

Now that we use a sync prior to releasing the locks in syscall.S, we don't need
the PA 2.0 ordered stores used to release some locks. Using an ordered store,
potentially slows the release and subsequent code.

There are a number of other ordered stores and loads that serve no purpose. I
have converted these to normal stores.

Signed-off-by: John David Anglin <[email protected]>
Cc: [email protected] # 4.0+
Signed-off-by: Helge Deller <[email protected]>

parisc: prefer _THIS_IP_ and _RET_IP_ statement expressions

As part of the effort to reduce the code duplication between _THIS_IP_
and current_text_addr(), let's consolidate callers of
current_text_addr() to use _THIS_IP_.

Signed-off-by: Nick Desaulniers <[email protected]>
Signed-off-by: Helge Deller <[email protected]>

parisc: Add HAVE_REGS_AND_STACK_ACCESS_API feature

Some parts of the HAVE_REGS_AND_STACK_ACCESS_API feature is needed for
the rseq syscall. This patch adds the most important parts, and as long
as we don't support kprobes, we should be fine.

Signed-off-by: Helge Deller <[email protected]>

parisc: Drop architecture-specific ENOTSUP define

parisc is the only Linux architecture which has defined a value for ENOTSUP.
All other architectures #define ENOTSUP as EOPNOTSUPP in their libc headers.

Having an own value for ENOTSUP which is different than EOPNOTSUPP often gives
problems with userspace programs which expect both to be the same. One such
example is a build error in the libuv package, as can be seen in
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900237.

Since we dropped HP-UX support, there is no real benefit in keeping an own
value for ENOTSUP. This patch drops the parisc value for ENOTSUP from the
kernel sources. glibc needs no patch, it reuses the exported headers.

Signed-off-by: Helge Deller <[email protected]>

parisc: use generic dma_noncoherent_ops

Switch to the generic noncoherent direct mapping implementation.

Fix sync_single_for_cpu to do skip the cache flush unless the transfer
is to the device to match the more tested unmap_single path which should
have the same cache coherency implications.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Helge Deller <[email protected]>

parisc: always use flush_kernel_dcache_range for DMA cache maintainance

Current the S/G list based DMA ops use flush_kernel_vmap_range which
contains a few UP optimizations, while the rest of the DMA operations
uses flush_kernel_dcache_range. The single vs sg operations are supposed
to have the same effect, so they should use the same routines. Use
the more conservation version for now, but if people more familiar with
parisc think the vmap version is generally fine for DMA we should switch
all interfaces over to it.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Helge Deller <[email protected]>

parisc: merge pcx_dma_ops and pcxl_dma_ops

The only difference is that pcxl supports dma coherent allocations, while
pcx only supports non-consistent allocations and otherwise fails.

But dma_alloc* is not in the fast path, and merging these two allows an
easy migration path to the generic dma-noncoherent implementation, so
do it.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Helge Deller <[email protected]>

kconfig: fix the rule of mainmenu_stmt symbol

The rule of mainmenu_stmt does not have debug print of zconf_lineno(),
but if it had, it would print a wrong line number for the same reason
as commit b2d00d7c61c8 ("kconfig: fix line numbers for if-entries in
menu tree").

The mainmenu_stmt does not need to eat following empty lines because
they are reduced to common_stmt.

Signed-off-by: Masahiro Yamada <[email protected]>

Merge branch 'bpf-ancestor-cgroup-id'

Andrey Ignatov says:

====================
This patch set adds new BPF helper bpf_skb_ancestor_cgroup_id that returns
id of cgroup v2 that is ancestor of cgroup associated with the skb at the
ancestor_level.

The helper is useful to implement policies in TC based on cgroups that are
upper in hierarchy than immediate cgroup associated with skb.

v1->v2:
- more reliable check for testing IPv6 to become ready in selftest.
====================

Signed-off-by: Daniel Borkmann <[email protected]>

selftests/bpf: Selftest for bpf_skb_ancestor_cgroup_id

Add selftests for bpf_skb_ancestor_cgroup_id helper.

test_skb_cgroup_id.sh prepares testing interface and adds tc qdisc and
filter for it using BPF object compiled from test_skb_cgroup_id_kern.c
program.

BPF program in test_skb_cgroup_id_kern.c gets ancestor cgroup id using
the new helper at different levels of cgroup hierarchy that skb belongs
to, including root level and non-existing level, and saves it to the map
where the key is the level of corresponding cgroup and the value is its
id.

To trigger BPF program, user space program test_skb_cgroup_id_user is
run. It adds itself into testing cgroup and sends UDP datagram to
link-local multicast address of testing interface. Then it reads cgroup
ids saved in kernel for different levels from the BPF map and compares
them with those in user space. They must be equal for every level of
ancestry.

Example of run:
  # ./test_skb_cgroup_id.sh
  Wait for testing link-local IP to become available ... OK
  Note: 8 bytes struct bpf_elf_map fixup performed due to size mismatch!
  [PASS]

Signed-off-by: Andrey Ignatov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

selftests/bpf: Add cgroup id helpers to bpf_helpers.h

Add bpf_skb_cgroup_id and bpf_skb_ancestor_cgroup_id helpers to
bpf_helpers.h to use them in tests and samples.

Signed-off-by: Andrey Ignatov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

bpf: Sync bpf.h to tools/

Sync skb_ancestor_cgroup_id() related bpf UAPI changes to tools/.

Signed-off-by: Andrey Ignatov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

bpf: Introduce bpf_skb_ancestor_cgroup_id helper

== Problem description ==

It's useful to be able to identify cgroup associated with skb in TC so
that a policy can be applied to this skb, and existing bpf_skb_cgroup_id
helper can help with this.

Though in real life cgroup hierarchy and hierarchy to apply a policy to
don't map 1:1.

It's often the case that there is a container and corresponding cgroup,
but there are many more sub-cgroups inside container, e.g. because it's
delegated to containerized application to control resources for its
subsystems, or to separate application inside container from infra that
belongs to containerization system (e.g. sshd).

At the same time it may be useful to apply a policy to container as a
whole.

If multiple containers like this are run on a host (what is often the
case) and many of them have sub-cgroups, it may not be possible to apply
per-container policy in TC with existing helpers such as
bpf_skb_under_cgroup or bpf_skb_cgroup_id:

* bpf_skb_cgroup_id will return id of immediate cgroup associated with
  skb, i.e. if it's a sub-cgroup inside container, it can't be used to
  identify container's cgroup;

* bpf_skb_under_cgroup can work only with one cgroup and doesn't scale,
  i.e. if there are N containers on a host and a policy has to be
  applied to M of them (0 <= M <= N), it'd require M calls to
  bpf_skb_under_cgroup, and, if M changes, it'd require to rebuild &
  load new BPF program.

== Solution ==

The patch introduces new helper bpf_skb_ancestor_cgroup_id that can be
used to get id of cgroup v2 that is an ancestor of cgroup associated
with skb at specified level of cgroup hierarchy.

That way admin can place all containers on one level of cgroup hierarchy
(what is a good practice in general and already used in many
configurations) and identify specific cgroup on this level no matter
what sub-cgroup skb is associated with.

E.g. if there is a cgroup hierarchy:
  root/
  root/container1/
  root/container1/app11/
  root/container1/app11/sub-app-a/
  root/container1/app12/
  root/container2/
  root/container2/app21/
  root/container2/app22/
  root/container2/app22/sub-app-b/

, then having skb associated with root/container1/app11/sub-app-a/ it's
possible to get ancestor at level 1, what is container1 and apply policy
for this container, or apply another policy if it's container2.

Policies can be kept e.g. in a hash map where key is a container cgroup
id and value is an action.

Levels where container cgroups are created are usually known in advance
whether cgroup hierarchy inside container may be hard to predict
especially in case when its creation is delegated to containerized
application.

== Implementation details ==

The helper gets ancestor by walking parents up to specified level.

Another option would be to get different kind of "id" from
cgroup->ancestor_ids[level] and use it with idr_find() to get struct
cgroup for ancestor. But that would require radix lookup what doesn't
seem to be better (at least it's not obviously better).

Format of return value of the new helper is same as that of
bpf_skb_cgroup_id.

Signed-off-by: Andrey Ignatov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

bpf: decouple btf from seq bpf fs dump and enable more maps

Commit a26ca7c982cb ("bpf: btf: Add pretty print support to
the basic arraymap") and 699c86d6ec21 ("bpf: btf: add pretty
print for hash/lru_hash maps") enabled support for BTF and
dumping via BPF fs for array and hash/lru map. However, both
can be decoupled from each other such that regular BPF maps
can be supported for attaching BTF key/value information,
while not all maps necessarily need to dump via map_seq_show_elem()
callback.

The basic sanity check which is a prerequisite for all maps
is that key/value size has to match in any case, and some maps
can have extra checks via map_check_btf() callback, e.g.
probing certain types or indicating no support in general. With
that we can also enable retrieving BTF info for per-cpu map
types and lpm.

Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Acked-by: Yonghong Song <[email protected]>

Linux 4.18

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"Eight fixes.

  The most important one is the mpt3sas fix which makes the driver work
  again on big endian systems. The rest are mostly minor error path or
  checker issues and the vmw_scsi one fixes a performance problem"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: vmw_pvscsi: Return DID_RESET for status SAM_STAT_COMMAND_TERMINATED
  scsi: sr: Avoid that opening a CD-ROM hangs with runtime power management enabled
  scsi: mpt3sas: Swap I/O memory read value back to cpu endianness
  scsi: fcoe: clear FC_RP_STARTED flags when receiving a LOGO
  scsi: fcoe: drop frames in ELS LOGO error path
  scsi: fcoe: fix use-after-free in fcoe_ctlr_els_send
  scsi: qedi: Fix a potential buffer overflow
  scsi: qla2xxx: Fix memory leak for allocating abort IOCB

init: rename and re-order boot_cpu_state_init()

This is purely a preparatory patch for upcoming changes during the 4.19
merge window.

We have a function called "boot_cpu_state_init()" that isn't really
about the bootup cpu state: that is done much earlier by the similarly
named "boot_cpu_init()" (note lack of "state" in name).

This function initializes some hotplug CPU state, and needs to run after
the percpu data has been properly initialized.  It even has a comment to
that effect.

Except it _doesn't_ actually run after the percpu data has been properly
initialized.  On x86 it happens to do that, but on at least arm and
arm64, the percpu base pointers are initialized by the arch-specific
'smp_prepare_boot_cpu()' hook, which ran _after_ boot_cpu_state_init().

This had some unexpected results, and in particular we have a patch
pending for the merge window that did the obvious cleanup of using
'this_cpu_write()' in the cpu hotplug init code:

  -       per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE;
  +       this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);

which is obviously the right thing to do.  Except because of the
ordering issue, it actually failed miserably and unexpectedly on arm64.

So this just fixes the ordering, and changes the name of the function to
be 'boot_cpu_hotplug_init()' to make it obvious that it's about cpu
hotplug state, because the core CPU state was supposed to have already
been done earlier.

Marked for stable, since the (not yet merged) patch that will show this
problem is marked for stable.

Reported-by: Vlastimil Babka <[email protected]>
Reported-by: Mian Yousaf Kaukab <[email protected]>
Suggested-by: Catalin Marinas <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Signed-off-by: Linus Torvalds <[email protected]>

Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fixes from Al Viro:
"A bunch of race fixes, mostly around lazy pathwalk.

  All of it is -stable fodder, a large part going back to 2013"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make sure that __dentry_kill() always invalidates d_seq, unhashed or not
  fix __legitimize_mnt()/mntput() race
  fix mntput/mntput race
  root dentries need RCU-delayed freeing

xfs: fix a null pointer dereference in xfs_bmap_extents_to_btree

Fuzzing tool reports a write to null pointer error in the
xfs_bmap_extents_to_btree, fix it by bailing out on encountering
a null pointer.

Signed-off-by: Shan Hai <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: remove b_last_holder & associated macros

The old lock tracking infrastructure in xfs using the b_last_holder
field seems to only be useful if you can get into the system with a
debugger; it seems that the existing tracepoints would be the way to
go these days, and this old infrastructure can be removed.

Signed-off-by: Eric Sandeen <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

iomap: Switch to offset_in_page for clarity

Instead of open-coding pos & (PAGE_SIZE - 1) and pos & ~PAGE_MASK, use
the offset_in_page macro.

Signed-off-by: Andreas Gruenbacher <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

xfs: Close race between direct IO and xfs_break_layouts()

This patch is the duplicate of ross's fix for ext4 for xfs.

If the refcount of a page is lowered between the time that it is returned
by dax_busy_page() and when the refcount is again checked in
xfs_break_layouts() => ___wait_var_event(), the waiting function
xfs_wait_dax_page() will never be called. This means that
xfs_break_layouts() will still have 'retry' set to false, so we'll stop
looping and never check the refcount of other pages in this inode.

Instead, always continue looping as long as dax_layout_busy_page() gives us
a page which it found with an elevated refcount.

Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

Merge branch 'for-next' into for-linus

Preparation for 4.19 merge material.

Signed-off-by: Takashi Iwai <[email protected]>

Merge branch 'ip-faster-in-order-IP-fragments'

Peter Oskolkov says:

====================
ip: faster in-order IP fragments

Added "Signed-off-by" in v2.
====================

Signed-off-by: David S. Miller <[email protected]>

ip: process in-order fragments efficiently

This patch changes the runtime behavior of IP defrag queue:
incoming in-order fragments are added to the end of the current
list/"run" of in-order fragments at the tail.

On some workloads, UDP stream performance is substantially improved:

RX: ./udp_stream -F 10 -T 2 -l 60
TX: ./udp_stream -c -H <host> -F 10 -T 5 -l 60

with this patchset applied on a 10Gbps receiver:

  throughput=9524.18
  throughput_units=Mbit/s

upstream (net-next):

  throughput=4608.93
  throughput_units=Mbit/s

Reported-by: Willem de Bruijn <[email protected]>
Signed-off-by: Peter Oskolkov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Florian Westphal <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

ip: add helpers to process in-order fragments faster.

This patch introduces several helper functions/macros that will be
used in the follow-up patch. No runtime changes yet.

The new logic (fully implemented in the second patch) is as follows:

* Nodes in the rb-tree will now contain not single fragments, but lists
  of consecutive fragments ("runs").

* At each point in time, the current "active" run at the tail is
  maintained/tracked. Fragments that arrive in-order, adjacent
  to the previous tail fragment, are added to this tail run without
  triggering the re-balancing of the rb-tree.

* If a fragment arrives out of order with the offset _before_ the tail run,
  it is inserted into the rb-tree as a single fragment.

* If a fragment arrives after the current tail fragment (with a gap),
  it starts a new "tail" run, as is inserted into the rb-tree
  at the end as the head of the new run.

skb->cb is used to store additional information
needed here (suggested by Eric Dumazet).

Reported-by: Willem de Bruijn <[email protected]>
Signed-off-by: Peter Oskolkov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Florian Westphal <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net

blkcg: Make blkg_root_lookup() work for queues in bypass mode

For legacy queues the only call of blkg_root_lookup() happens after
bypass mode has been enabled. Since blkg_lookup() returns NULL for
queues in bypass mode, modify the blkg_root_lookup() such that it
no longer depends on bypass mode. Rename the function into
blk_queue_root_blkg() as suggested by Tejun.

Suggested-by: Tejun Heo <[email protected]>
Fixes: 6bad9b210a22 ("blkcg: Introduce blkg_root_lookup()")
Signed-off-by: Bart Van Assche <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

Merge branch 'Remove-rtnl-lock-dependency-from-all-action-implementations'

Vlad Buslov says:

====================
Remove rtnl lock dependency from all action implementations

Currently, all netlink protocol handlers for updating rules, actions and
qdiscs are protected with single global rtnl lock which removes any
possibility for parallelism. This patch set is a second step to remove
rtnl lock dependency from TC rules update path.

Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added.
Handlers registered with this flag are called without RTNL taken. End
goal is to have rule update handlers(RTM_NEWTFILTER, RTM_DELTFILTER,
etc.) to be registered with UNLOCKED flag to allow parallel execution.
However, there is no intention to completely remove or split rtnl lock
itself. This patch set addresses specific problems in implementation of
tc actions that prevent their control path from being executed
concurrently. Additional changes are required to refactor classifiers
API and individual classifiers for parallel execution. This patch set
lays groundwork to eventually register rule update handlers as
rtnl-unlocked.

Action API is already prepared for parallel execution with previous
patch set, which means that action ops that use action API for their
implementation do not require additional modifications. (delete, search,
etc.) Action API implements concurrency-safe reference counting and
guarantees that cleanup/delete is called only once, after last reference
to action is released.

The goal of this change is to update specific actions APIs that access
action private state directly, in order to be independent from external
locking. General approach is to re-use existing tcf_lock spinlock (used
by some action implementation to synchronize control path with data
path) to protect action private state from concurrent modification. If
action has rcu-protected pointer, tcf spinlock is used to protect its
update code, instead of relying on rtnl lock.

Some actions need to determine rtnl mutex status in order to release it.
For example, ife action can load additional kernel modules(meta ops) and
must make sure that no locks are held during module load. In such cases
'rtnl_held' argument is used to conditionally release rtnl mutex.

Changes from V1 to V2:
- Patch 12:
  - new patch
- Patch 14:
  - refactor gen_new_estimator() to reuse stats_lock when re-assigning
    rate estimator statistics pointer
- Remove mirred and tunnel_key helper function changes. (to be submitted
  and standalone patch)
====================

Signed-off-by: David S. Miller <[email protected]>

net: sched: act_police: remove dependency on rtnl lock

Use tcf spinlock to protect police action private data from concurrent
modification during dump. (init already uses tcf spinlock when changing
police action state)

Pass tcf spinlock as estimator lock argument to gen_replace_estimator()
during action init.

Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: core: protect rate estimator statistics pointer with lock

Extend gen_new_estimator() to also take stats_lock when re-assigning rate
estimator statistics pointer. (to be used by unlocked actions)

Rename 'stats_lock' to 'lock' and change argument description to explain
that it is now also used for control path.

Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: act_mirred: remove dependency on rtnl lock

Re-introduce mirred list spinlock, that was removed some time ago, in order
to protect it from concurrent modifications, instead of relying on rtnl
lock.

Use tcf spinlock to protect mirred action private data from concurrent
modification in init and dump. Rearrange access to mirred data in order to
be performed only while holding the lock.

Rearrange net dev access to always hold reference while working with it,
instead of relying on rntl lock.

Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: extend action ops with put_dev callback

As a preparation for removing dependency on rtnl lock from rules update
path, all users of shared objects must take reference while working with
them.

Extend action ops with put_dev() API to be used on net device returned by
get_dev().

Modify mirred action (only action that implements get_dev callback):
- Take reference to net device in get_dev.
- Implement put_dev API that releases reference to net device.

Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: act_vlan: remove dependency on rtnl lock

Use tcf spinlock to protect vlan action private data from concurrent
modification during dump and init. Use rcu swap operation to reassign
params pointer under protection of tcf lock. (old params value is not used
by init, so there is no need of standalone rcu dereference step)

Remove rtnl assertion that is no longer necessary.

Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: act_tunnel_key: remove dependency on rtnl lock

Use tcf lock to protect tunnel key action struct private data from
concurrent modification in init and dump. Use rcu swap operation to
reassign params pointer under protection of tcf lock. (old params value is
not used by init, so there is no need of standalone rcu dereference step)

Remove rtnl lock assertion that is no longer required.

Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: act_skbmod: remove dependency on rtnl lock

Move read of skbmod_p rcu pointer to be protected by tcf spinlock. Use tcf
spinlock to protect private skbmod data from concurrent modification during
dump.

Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: act_simple: remove dependency on rtnl lock

Use tcf spinlock to protect private simple action data from concurrent
modification during dump. (simple init already uses tcf spinlock when
changing action state)

Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: act_sample: remove dependency on rtnl lock

Use tcf spinlock to protect private sample action data from concurrent
modification during dump and init.

Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: act_pedit: remove dependency on rtnl lock

Rearrange pedit init code to only access pedit action data while holding
tcf spinlock. Change keys allocation type to atomic to allow it to execute
while holding tcf spinlock. Take tcf spinlock in dump function when
accessing pedit action data.

Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: act_ipt: remove dependency on rtnl lock

Use tcf spinlock to protect ipt action private data from concurrent
modification during dump. Ipt init already takes tcf spinlock when
modifying ipt state.

Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: act_ife: remove dependency on rtnl lock

Use tcf spinlock and rcu to protect params pointer from concurrent
modification during dump and init. Use rcu swap operation to reassign
params pointer under protection of tcf lock. (old params value is not used
by init, so there is no need of standalone rcu dereference step)

Ife action has meta-actions that are compiled as standalone modules. Rtnl
mutex must be released while loading a kernel module. In order to support
execution without rtnl mutex, propagate 'rtnl_held' argument to meta action
loading functions. When requesting meta action module, conditionally
release rtnl lock depending on 'rtnl_held' argument.

Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: act_gact: remove dependency on rtnl lock

Use tcf spinlock to protect gact action private state from concurrent
modification during dump and init. Remove rtnl assertion that is no longer
necessary.

Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: act_csum: remove dependency on rtnl lock

Use tcf lock to protect csum action struct private data from concurrent
modification in init and dump. Use rcu swap operation to reassign params
pointer under protection of tcf lock. (old params value is not used by
init, so there is no need of standalone rcu dereference step)

Remove rtnl assertion that is no longer necessary.

Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: sched: act_bpf: remove dependency on rtnl lock

Use tcf spinlock to protect bpf action private data from concurrent
modification during dump and init. Remove rtnl lock assertion that is no
longer necessary.

Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'net-sctp-Avoid-allocating-high-order-memory-with-kmalloc'

Konstantin Khorenko says:

====================
net/sctp: Avoid allocating high order memory with kmalloc()

Each SCTP association can have up to 65535 input and output streams.
For each stream type an array of sctp_stream_in or sctp_stream_out
structures is allocated using kmalloc_array() function. This function
allocates physically contiguous memory regions, so this can lead
to allocation of memory regions of very high order, i.e.:

  sizeof(struct sctp_stream_out) == 24,
  ((65535 * 24) / 4096) == 383 memory pages (4096 byte per page),
  which means 9th memory order.

This can lead to a memory allocation failures on the systems
under a memory stress.

We actually do not need these arrays of memory to be physically
contiguous. Possible simple solution would be to use kvmalloc()
instread of kmalloc() as kvmalloc() can allocate physically scattered
pages if contiguous pages are not available. But the problem
is that the allocation can happed in a softirq context with
GFP_ATOMIC flag set, and kvmalloc() cannot be used in this scenario.

So the other possible solution is to use flexible arrays instead of
contiguios arrays of memory so that the memory would be allocated
on a per-page basis.

This patchset replaces kvmalloc() with flex_array usage.
It consists of two parts:

  * First patch is preparatory - it mechanically wraps all direct
    access to assoc->stream.out[] and assoc->stream.in[] arrays
    with SCTP_SO() and SCTP_SI() wrappers so that later a direct
    array access could be easily changed to an access to a
    flex_array (or any other possible alternative).
  * Second patch replaces kmalloc_array() with flex_array usage.

v2 changes:
sctp_stream_in() users are updated to provide stream as an argument,
sctp_stream_{in,out}_ptr() are now just sctp_stream_{in,out}().

v3 changes:
Move type chages struct sctp_stream_out -> flex_array to next patch.
Make sctp_stream_{in,out}() static incline and move them to a header.

Performance results (single stream):
====================================
  * Kernel: v4.18-rc6 - stock and with 2 patches from Oleg (earlier in this thread)
  * Node: CPU (8 cores): Intel(R) Xeon(R) CPU E31230 @ 3.20GHz
          RAM: 32 Gb

  * netperf: taken from https://github.com/HewlettPackard/netperf.git,
     compiled from sources with sctp support
  * netperf server and client are run on the same node
  * ip link set lo mtu 1500

The script used to run tests:
# cat run_tests.sh
#!/bin/bash

for test in SCTP_STREAM SCTP_STREAM_MANY SCTP_RR SCTP_RR_MANY; do
  echo "TEST: $test";
  for i in `seq 1 3`; do
    echo "Iteration: $i";
    set -x
    netperf -t $test -H localhost -p 22222 -S 200000,200000 -s 200000,200000 \
            -l 60 -- -m 1452;
    set +x
  done
done
================================================

Results (a bit reformatted to be more readable):
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

v4.18-rc7 v4.18-rc7 + fixes
TEST: SCTP_STREAM
212992 212992   1452    60.21 1125.52 1247.04
212992 212992   1452    60.20 1376.38 1149.95
212992 212992   1452    60.20 1131.40 1163.85
TEST: SCTP_STREAM_MANY
212992 212992   1452    60.00 1111.00 1310.05
212992 212992   1452    60.00 1188.55 1130.50
212992 212992   1452    60.00 1108.06 1162.50

===========
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

v4.18-rc7 v4.18-rc7 + fixes
TEST: SCTP_RR
212992 212992 1        1       60.00 45486.98 46089.43
212992 212992 1        1       60.00 45584.18 45994.21
212992 212992 1        1       60.00 45703.86 45720.84
TEST: SCTP_RR_MANY
212992 212992 1        1       60.00 40.75 40.77
212992 212992 1        1       60.00 40.58 40.08
212992 212992 1        1       60.00 39.98 39.97

Performance results for many streams:
=====================================
   * Kernel: v4.18-rc8 - stock and with 2 patches v3
   * Node: CPU (8 cores): Intel(R) Xeon(R) CPU E31230 @ 3.20GHz
           RAM: 32 Gb

   * sctp_test: https://github.com/sctp/lksctp-tools
   * both server and client are run on the same node
   * ip link set lo mtu 1500
   * sysctl -w vm.max_map_count=65530000 (need it to make memory fragmented)

The script used to run tests:
=============================
# cat run_sctp_test.sh
#!/bin/bash

set -x

uname -r
ip link set lo mtu 1500
swapoff -a

free
cat /proc/buddyinfo

./src/apps/sctp_test -H 127.0.0.1 -P 22222 -l -d 0 &
sleep 3

time ./src/apps/sctp_test -H 127.0.0.1 -P 22221 -h 127.0.0.1 -p 22222 \
         -s -c 1 -M 65535 -T -t 1 -x 100000 -d 0 1>/dev/null

killall -9 lt-sctp_test
===============================

Results (a bit reformatted to be more readable):

1) ms stock kernel v4.18-rc8, no memory fragmentation
test 1 test 2 test 3
real    0m14.715s 0m14.593s 0m15.954s
user    0m0.954s 0m0.955s 0m0.854s
sys     0m13.388s 0m12.537s 0m13.749s

2) kernel with fixes, no memory fragmentation
test 1 test 2 test 3
real    0m14.959s 0m14.693s 0m14.762s
user    0m0.948s 0m0.921s 0m0.929s
sys     0m13.538s 0m13.225s 0m13.217s

3) kernel with fixes, memory fragmented
'free':
               total        used        free      shared  buff/cache   available
Mem:       32906008    30555200      302740         764     2048068      266452
Mem:       32906008    30379948      541436         764     1984624      442376
Mem:       32906008    30717312      262380         764     1926316      109908

/proc/buddyinfo:
Node 0, zone   Normal  40773     37     34     29      0      0      0      0      0      0      0
Node 0, zone   Normal 100332     68      8      4      2      1      1      0      0      0      0
Node 0, zone   Normal  31113      7      2      1      0      0      0      0      0      0      0

test 1 test 2 test 3
real    0m14.159s 0m15.252s 0m15.826s
user    0m0.839s 0m1.004s 0m1.048s
sys     0m11.827s 0m14.240s 0m14.778s
====================

Signed-off-by: David S. Miller <[email protected]>

net/sctp: Replace in/out stream arrays with flex_array

This path replaces physically contiguous memory arrays
allocated using kmalloc_array() with flexible arrays.
This enables to avoid memory allocation failures on the
systems under a memory stress.

Signed-off-by: Oleg Babin <[email protected]>
Signed-off-by: Konstantin Khorenko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net/sctp: Make wrappers for accessing in/out streams

This patch introduces wrappers for accessing in/out streams indirectly.
This will enable to replace physically contiguous memory arrays
of streams with flexible arrays (or maybe any other appropriate
mechanism) which do memory allocation on a per-page basis.

Signed-off-by: Oleg Babin <[email protected]>
Signed-off-by: Konstantin Khorenko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tc: Update README and add config

Updated README.

Added config file that contains the minimum required features enabled to
run the tests currently present in the kernel.
This must be updated when new unittests are created and require their own
modules.

Signed-off-by: Keara Leibovitz <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'l2tp-rework-pppol2tp-ioctl-handling'

Guillaume Nault says:

====================
l2tp: rework pppol2tp ioctl handling

The current ioctl() handling code can be simplified. It tests for
non-relevant conditions and uselessly holds sockets. Once useless
code is removed, it becomes even simpler to let pppol2tp_ioctl() handle
commands directly, rather than dispatch them to pppol2tp_tunnel_ioctl()
or pppol2tp_session_ioctl(). That is the approach taken by this series.

Patch #1 and #2 define helper functions aimed at simplifying the rest
of the patch set.

Patch #3 drops useless tests in pppol2p_ioctl() and avoid holding a
refcount on the socket.

Patches #4, #5 and #6 are the core of the series. They let
pppol2tp_ioctl() handle all ioctls and drop the tunnel and session
specific functions.

Then patch #6 brings a little bit of consolidation.

Finally, patch #7 takes advantage of the simplified code to make
pppol2tp sockets compatible with dev_ioctl(). Certainly not a killer
feature, but it is trivial and it is always nice to see l2tp getting
better integration with the rest of the stack.
====================

Signed-off-by: David S. Miller <[email protected]>

l2tp: let pppol2tp_ioctl() fallback to dev_ioctl()

Return -ENOIOCTLCMD for unknown ioctl commands. This lets dev_ioctl()
handle generic socket ioctls like SIOCGIFNAME or SIOCGIFINDEX.
PF_PPPOX/PX_PROTO_OL2TP was one of the few socket types not honouring
this mechanism.

Signed-off-by: Guillaume Nault <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

l2tp: zero out stats in pppol2tp_copy_stats()

Integrate memset(0) in pppol2tp_copy_stats() to avoid calling it
manually every time.

While there, constify 'stats'.

Signed-off-by: Guillaume Nault <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

l2tp: remove pppol2tp_session_ioctl()

pppol2tp_ioctl() has everything in place for handling PPPIOCGL2TPSTATS
on session sockets. We just need to copy the stats and set ->session_id.

As a side effect of sharing session and tunnel code, ->using_ipsec is
properly set even when the request was made using a session socket.

Signed-off-by: Guillaume Nault <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

l2tp: remove pppol2tp_tunnel_ioctl()

Handle PPPIOCGL2TPSTATS in pppol2tp_ioctl() if the socket represents a
tunnel. This one is a bit special because the caller may use the tunnel
socket to retrieve statistics of one of its sessions. If the session_id
is set, the corresponding session's statistics are returned, instead of
those of the tunnel. This is handled by the new
pppol2tp_tunnel_copy_stats() helper function.

Set ->tunnel_id and ->using_ipsec out of the conditional, so
that it can be used by the 'else' branch in the following patch.
We cannot do that for ->session_id, because tunnel sockets have to
report the value that was originally passed in 'stats.session_id',
while session sockets have to report their own session_id.

Signed-off-by: Guillaume Nault <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

l2tp: handle PPPIOC[GS]MRU and PPPIOC[GS]FLAGS in pppol2tp_ioctl()

Let pppol2tp_ioctl() handle ioctl commands directly. It still relies on
pppol2tp_{session,tunnel}_ioctl() for PPPIOCGL2TPSTATS.

Signed-off-by: Guillaume Nault <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

l2tp: simplify pppol2tp_ioctl()

* Drop test on 'sk': sock->sk cannot be NULL, or pppox_ioctl() could
    not have called us.

  * Drop test on 'SOCK_DEAD' state: if this flag was set, the socket
    would be in the process of being released and no ioctl could be
    running anymore.

  * Drop test on 'PPPOX_*' state: we depend on ->sk_user_data to get
    the session structure. If it is non-NULL, then the socket is
    connected. Testing for PPPOX_* is redundant.

  * Retrieve session using ->sk_user_data directly, instead of going
    through pppol2tp_sock_to_session(). This avoids grabbing a useless
    reference on the socket.

Signed-off-by: Guillaume Nault <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

l2tp: split l2tp_session_get()

l2tp_session_get() is used for two different purposes. If 'tunnel' is
NULL, the session is searched globally in the supplied network
namespace. Otherwise it is searched exclusively in the tunnel context.

Callers always know the context in which they need to search the
session. But some of them do provide both a namespace and a tunnel,
making the semantic of the call unclear.

This patch defines l2tp_tunnel_get_session() for lookups done in a
tunnel and restricts l2tp_session_get() to namespace searches.

Signed-off-by: Guillaume Nault <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

l2tp: define l2tp_tunnel_uses_xfrm()

Use helper function to figure out if a tunnel is using ipsec.
Also, avoid accessing ->sk_policy directly since it's RCU protected.

Signed-off-by: Guillaume Nault <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'netsec-driver-improvements'

Ilias Apalodimas says:

====================
netsec driver improvements

This patchset introduces some improvements on socionext netsec driver.
- patch 1/2, avoids unneeded MMIO reads on the Rx path
- patch 2/2, is adjusting the numbers of descriptors used

Changes since v1:
- Move dma_rmb() to protect descriptor accesses until the device
has updated the NETSEC_RX_PKT_OWN_FIELD bit
====================

Signed-off-by: David S. Miller <[email protected]>

net: socionext: Increase descriptors to 256

Increasing descriptors to 256 from 128 and adjusting the NAPI weight
to 64 increases performace on Rx by ~20% on 64byte packets

Signed-off-by: Ilias Apalodimas <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: socionext: Use descriptor info instead of MMIO reads on Rx

MMIO reads for remaining packets in queue occur (at least)twice per
invocation of netsec_process_rx(). We can use the packet descriptor to
identify if it's owned by the hardware and break out, avoiding the more
expensive MMIO read operations. This has a ~2% increase on the pps of the
Rx path when tested with 64byte packets

Signed-off-by: Ilias Apalodimas <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

vxge: remove set but not used variable 'req_out', 'status' and 'ret'

Fixes gcc '-Wunused-but-set-variable' warning:

drivers/net/ethernet/neterion/vxge/vxge-config.c:1097:6: warning:
variable 'ret' set but not used [-Wunused-but-set-variable]
drivers/net/ethernet/neterion/vxge/vxge-config.c:2263:6: warning:
variable 'req_out' set but not used [-Wunused-but-set-variable]
drivers/net/ethernet/neterion/vxge/vxge-config.c:2262:22: warning:
variable 'status' set but not used [-Wunused-but-set-variable]
drivers/net/ethernet/neterion/vxge/vxge-config.c:2360:22: warning:
variable 'status' set but not used [-Wunused-but-set-variable]
enum vxge_hw_status status = VXGE_HW_OK;

Signed-off-by: YueHaibing <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'virtio_net-Expand-affinity-to-arbitrary-numbers-of-cpu-and-vq'

Caleb Raitto says:

====================
virtio_net: Expand affinity to arbitrary numbers of cpu and vq

Virtio-net tries to pin each virtual queue rx and tx interrupt to a cpu if
there are as many queues as cpus.

Expand this heuristic to configure a reasonable affinity setting also
when the number of cpus != the number of virtual queues.

Patch 1 allows vqs to take an affinity mask with more than 1 cpu.
Patch 2 generalizes the algorithm in virtnet_set_affinity beyond
the case where #cpus == #vqs.

v2 changes:
Renamed "virtio_net: Make vp_set_vq_affinity() take a mask." to
"virtio: Make vp_set_vq_affinity() take a mask."

Tested:

[InstanceSetup]
set_multiqueue = false

$ cd /proc/irq
$ for i in `seq 24 60` ; do sudo grep ".*" $i/smp_affinity_list;  done
0-15
0
0
1
1
2
2
3
3
4
4
5
5
6
6
7
7
8
8
9
9
10
10
11
11
12
12
13
13
14
14
15
15
0-15
0-15
0-15
0-15

$ cd /sys/class/net/eth0/queues/
$ for i in `seq 0 15` ; do sudo grep ".*" tx-$i/xps_cpus; done
0001
0002
0004
0008
0010
0020
0040
0080
0100
0200
0400
0800
1000
2000
4000
8000

$ sudo ethtool -L eth0 combined 15

$ cd /proc/irq
$ for i in `seq 24 60` ; do sudo grep ".*" $i/smp_affinity_list;  done
0-15
0-1
0-1
2
2
3
3
4
4
5
5
6
6
7
7
8
8
9
9
10
10
11
11
12
12
13
13
14
14
15
15
15
15
0-15
0-15
0-15
0-15

$ cd /sys/class/net/eth0/queues/
$ for i in `seq 0 14` ; do sudo grep ".*" tx-$i/xps_cpus; done
0003
0004
0008
0010
0020
0040
0080
0100
0200
0400
0800
1000
2000
4000
8000

$ sudo ethtool -L eth0 combined 8

$ cd /proc/irq
$ for i in `seq 24 60` ; do sudo grep ".*" $i/smp_affinity_list;  done
0-15
0-1
0-1
2-3
2-3
4-5
4-5
6-7
6-7
8-9
8-9
10-11
10-11
12-13
12-13
14-15
14-15
9
9
10
10
11
11
12
12
13
13
14
14
15
15
15
15
0-15
0-15
0-15
0-15

$ cd /sys/class/net/eth0/queues/
$ for i in `seq 0 7` ; do sudo grep ".*" tx-$i/xps_cpus; done
0003
000c
0030
00c0
0300
0c00
3000
c000

$ sudo ethtool -L eth0 combined 16
$ sudo sh -c "echo 0 > /sys/devices/system/cpu/cpu15/online"

$ cd /proc/irq
$ for i in `seq 24 60` ; do sudo grep ".*" $i/smp_affinity_list;  done
0-15
0
0
1
1
2
2
3
3
4
4
5
5
6
6
7
7
8
8
9
9
10
10
11
11
12
12
13
13
14
14
0
0
0-15
0-15
0-15
0-15

$ cd /sys/class/net/eth0/queues/
$ for i in `seq 0 15` ; do sudo grep ".*" tx-$i/xps_cpus; done
0001
0002
0004
0008
0010
0020
0040
0080
0100
0200
0400
0800
1000
2000
4000
0001

$ for i in `seq 8 15`; \
do sudo sh -c "echo 0 > /sys/devices/system/cpu/cpu$i/online"; done

$ cd /proc/irq
$ for i in `seq 24 60` ; do sudo grep ".*" $i/smp_affinity_list;  done
0-15
0
0
1
1
2
2
3
3
4
4
5
5
6
6
7
7
0
0
1
1
2
2
3
3
4
4
5
5
6
6
7
7
0-15
0-15
0-15
0-15

$ cd /sys/class/net/eth0/queues/
$ for i in `seq 0 15` ; do sudo grep ".*" tx-$i/xps_cpus; done
0001
0002
0004
0008
0010
0020
0040
0080
0001
0002
0004
0008
0010
0020
0040
0080
====================

Signed-off-by: David S. Miller <[email protected]>

virtio_net: Stripe queue affinities across cores.

Always set the affinity hint, even if #cpu != #vq.

Handle the case where #cpu > #vq (including when #cpu % #vq != 0) and
when #vq > #cpu (including when #vq % #cpu != 0).

Signed-off-by: Caleb Raitto <[email protected]>
Signed-off-by: Willem de Bruijn <[email protected]>
Acked-by: Jon Olson <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

virtio: Make vp_set_vq_affinity() take a mask.

Make vp_set_vq_affinity() take a cpumask instead of taking a single CPU.

If there are fewer queues than cores, queue affinity should be able to
map to multiple cores.

Link: https://patchwork.ozlabs.org/patch/948149/
Suggested-by: Willem de Bruijn <[email protected]>
Signed-off-by: Caleb Raitto <[email protected]>
Acked-by: Gonglei <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

lan743x: lan743x: Add PTP support

PTP support includes:
    Ingress, and egress timestamping.
    One step timestamping available.
    PTP clock support.
    Periodic output support.

Signed-off-by: Bryan Whitehead <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'tcp-new-mechanism-to-ACK-immediately'

Yuchung Cheng says:

====================
tcp: new mechanism to ACK immediately

This patch is a follow-up feature improvement to the recent fixes on
the performance issues in ECN (delayed) ACKs. Many of the fixes use
tcp_enter_quickack_mode routine to force immediate ACKs. However the
routine also reset tracking interactive session. This is not ideal
because these immediate ACKs are required by protocol specifics
unrelated to the interactiveness nature of the application.

This patch set introduces a new flag to send a one-time immediate ACK
without changing the status of interactive session tracking. With this
patch set the immediate ACKs are generated upon these protocol states:

1) When a hole is repaired
2) When CE status changes between subsequent data packets received
3) When a data packet carries CWR flag
====================

Signed-off-by: David S. Miller <[email protected]>

tcp: avoid resetting ACK timer upon receiving packet with ECN CWR flag

Previously commit 9aee40006190 ("tcp: ack immediately when a cwr
packet arrives") calls tcp_enter_quickack_mode to force sending
two immediate ACKs upon receiving a packet w/ CWR flag. The side
effect is it'll also reset the delayed ACK timer and interactive
session tracking. This patch removes that side effect by using the
new ACK_NOW flag to force an immmediate ACK.

Packetdrill to demonstrate:

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
   +0 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
  +.1 < [ect0] . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4

   +0 < [ect0] . 1:1001(1000) ack 1 win 257
   +0 > [ect01] . 1:1(0) ack 1001

   +0 write(4, ..., 1) = 1
   +0 > [ect01] P. 1:2(1) ack 1001

   +0 < [ect0] . 1001:2001(1000) ack 2 win 257
   +0 write(4, ..., 1) = 1
   +0 > [ect01] P. 2:3(1) ack 2001

   +0 < [ect0] . 2001:3001(1000) ack 3 win 257
   +0 < [ect0] . 3001:4001(1000) ack 3 win 257
   // Ack delayed ...

   +.01 < [ce] P. 4001:4501(500) ack 3 win 257
   +0 > [ect01] . 3:3(0) ack 4001
   +0 > [ect01] E. 3:3(0) ack 4501

+.001 read(4, ..., 4500) = 4500
   +0 write(4, ..., 1) = 1
   +0 > [ect01] PE. 3:4(1) ack 4501 win 100

+.01 < [ect0] W. 4501:5501(1000) ack 4 win 257
   // No delayed ACK on CWR flag
   +0 > [ect01] . 4:4(0) ack 5501

+.31 < [ect0] . 5501:6501(1000) ack 4 win 257
   +0 > [ect01] . 4:4(0) ack 6501

Fixes: 9aee40006190 ("tcp: ack immediately when a cwr packet arrives")
Signed-off-by: Yuchung Cheng <[email protected]>
Signed-off-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: always ACK immediately on hole repairs

RFC 5681 sec 4.2:
  To provide feedback to senders recovering from losses, the receiver
  SHOULD send an immediate ACK when it receives a data segment that
  fills in all or part of a gap in the sequence space.

When a gap is partially filled, __tcp_ack_snd_check already checks
the out-of-order queue and correctly send an immediate ACK. However
when a gap is fully filled, the previous implementation only resets
pingpong mode which does not guarantee an immediate ACK because the
quick ACK counter may be zero. This patch addresses this issue by
marking the one-time immediate ACK flag instead.

Signed-off-by: Yuchung Cheng <[email protected]>
Signed-off-by: Neal Cardwell <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: avoid resetting ACK timer in DCTCP

The recent fix of acking immediately in DCTCP on CE status change
has an undesirable side-effect: it also resets TCP ack timer and
disables pingpong mode (interactive session). But the CE status
change has nothing to do with them. This patch addresses that by
using the new one-time immediate ACK flag instead of calling
tcp_enter_quickack_mode().

Signed-off-by: Yuchung Cheng <[email protected]>
Signed-off-by: Neal Cardwell <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

tcp: mandate a one-time immediate ACK

Add a new flag to indicate a one-time immediate ACK. This flag is
occasionaly set under specific TCP protocol states in addition to
the more common quickack mechanism for interactive application.

In several cases in the TCP code we want to force an immediate ACK
but do not want to call tcp_enter_quickack_mode() because we do
not want to forget the icsk_ack.pingpong or icsk_ack.ato state.

Signed-off-by: Yuchung Cheng <[email protected]>
Signed-off-by: Neal Cardwell <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

wimax: usb-tx: mark expected switch fall-through

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Notice that in this particular case, I placed the "fall through"
annotation at the bottom of the case, which is what GCC is expecting
to find.

Addresses-Coverity-ID: 115075 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

wimax: usb-fw: mark expected switch fall-through

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Notice that in this particular case, I placed the "fall through"
annotation at the bottom of the case, which is what GCC is expecting
to find.

Addresses-Coverity-ID: 1369529 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: dp83640: Mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Notice that in this particular case, I replaced the code comment at the
top of the switch statement with a proper "fall through" annotation for
each case, which is what GCC is expecting to find.

Addresses-Coverity-ID: 1056542 ("Missing break in switch")
Addresses-Coverity-ID: 1339579 ("Missing break in switch")
Addresses-Coverity-ID: 1369526 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <[email protected]>
Acked-by: Richard Cochran <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

rxrpc: remove redundant static int 'zero'

The static int 'zero' is defined but is never used hence it is
redundant and can be removed. The use of this variable was removed
with commit a158bdd3247b ("rxrpc: Fix call timeouts").

Cleans up clang warning:
warning: 'zero' defined but not used [-Wunused-const-variable=]

Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drivers/net/usb/r8152: remove the unneeded variable "ret" in rtl8152_system_suspend

rtl8152_system_suspend defines the variable "ret", but it is not modified
after initialization. So just remove it.

Signed-off-by: zhong jiang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:
"Last bit of straggler fixes...

  1) Fix btf library licensing to LGPL, from Martin KaFai lau.

  2) Fix error handling in bpf sockmap code, from Daniel Borkmann.

  3) XDP cpumap teardown handling wrt. execution contexts, from Jesper
     Dangaard Brouer.

  4) Fix loss of runtime PM on failed vlan add/del, from Ivan
     Khoronzhuk.

  5) xen-netfront caches skb_shinfo(skb) across a __pskb_pull_tail()
     call, which potentially changes the skb's data buffer, and thus
     skb_shinfo(). Fix from Juergen Gross"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  xen/netfront: don't cache skb_shinfo()
  net: ethernet: ti: cpsw: fix runtime_pm while add/kill vlan
  net: ethernet: ti: cpsw: clear all entries when delete vid
  xdp: fix bug in devmap teardown code path
  samples/bpf: xdp_redirect_cpu adjustment to reproduce teardown race easier
  xdp: fix bug in cpumap teardown code path
  bpf, sockmap: fix cork timeout for select due to epipe
  bpf, sockmap: fix leak in bpf_tcp_sendmsg wait for mem path
  bpf, sockmap: fix bpf_tcp_sendmsg sock error handling
  bpf: btf: Change tools/lib/bpf/btf to LGPL

xen/netfront: don't cache skb_shinfo()

skb_shinfo() can change when calling __pskb_pull_tail(): Don't cache
its return value.

Cc: [email protected]
Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: Wei Liu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'cpsw-runtime-pm-fix'

Grygorii Strashko says:

====================
net: ethernet: ti: cpsw: fix runtime pm while add/del reserved vid

Here 2 not critical fixes for:
- vlan ale table leak while error if deleting vlan (simplifies next fix)
- runtime pm while try to set reserved vlan
====================

Reviewed-by: Grygorii Strashko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ethernet: ti: cpsw: fix runtime_pm while add/kill vlan

It's exclusive with normal behaviour but if try to set vlan to one of
the reserved values is made, the cpsw runtime pm is broken.

Fixes: a6c5d14f5136 ("drivers: net: cpsw: ndev: fix accessing to suspended device")
Signed-off-by: Ivan Khoronzhuk <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: ethernet: ti: cpsw: clear all entries when delete vid

In cases if some of the entries were not found in forwarding table
while killing vlan, the rest not needed entries still left in the
table. No need to stop, as entry was deleted anyway. So fix this by
returning error only after all was cleaned. To implement this, return
-ENOENT in cpsw_ale_del_mcast() as it's supposed to be.

Signed-off-by: Ivan Khoronzhuk <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mtd: rawnand: atmel: Select GENERIC_ALLOCATOR

The driver uses genalloc functions. Select GENERIC_ALLOCATOR to prevent
build errors when selected through COMPILE_TEST.

Fixes: 88a40e7dca00 ("mtd: rawnand: atmel: Allow selection of this driver when COMPILE_TEST=y")
Reported-by: Randy Dunlap <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Acked-by: Miquel Raynal <[email protected]>
Acked-by: Randy Dunlap <[email protected]>

Merge tag 'spi-nor/for-4.19' of git://git.infradead.org/linux-mtd into mtd/next

Pull SPI NOR updates from Boris Brezillon:
"
Core changes:
- Apply reset hacks only when reset is explicitly marked as broken in
the DT

Driver changes:
- Minor cleanup/fixes in the m25p80 driver
- Release flash_np in the nxp-spifi driver
- Add suspend/resume hooks to the atmel-quadspi driver
- Include gpio/consumer.h instead of gpio.h in the atmel-quadspi driver
- Use %pK instead of %p in the stm32-quadspi driver
- Improve timeout handling in the cadence-quadspi driver
- Use mtd_device_register() instead of mtd_device_parse_register() in
the intel-spi driver
"

Merge tag 'nand/for-4.19' of git://git.infradead.org/linux-mtd into mtd/next

Pull NAND updates from Miquel Raynal:

"
NAND core changes:
- Add the SPI-NAND framework.
- Create a helper to find the best ECC configuration.
- Create NAND controller operations.
- Allocate dynamically ONFI parameters structure.
- Add defines for ONFI version bits.
- Add manufacturer fixup for ONFI parameter page.
- Add an option to specify NAND chip as a boot device.
- Add Reed-Solomon error correction algorithm.
- Better name for the controller structure.
- Remove unused caller_is_module() definition.
- Make subop helpers return unsigned values.
- Expose _notsupp() helpers for raw page accessors.
- Add default values for dynamic timings.
- Kill the chip->scan_bbt() hook.
- Rename nand_default_bbt() into nand_create_bbt().
- Start to clean the nand_chip structure.
- Remove stale prototype from rawnand.h.

Raw NAND controllers drivers changes:
- Qcom: structuring cleanup.
- Denali: use core helper to find the best ECC configuration.
- Possible build of almost all drivers by adding a dependency on
   COMPILE_TEST for almost all of them in Kconfig, implies various
   fixes, Kconfig cleanup, GPIO headers inclusion cleanup, and even
   changes in sparc64 and ia64 architectures.
- Clean the ->probe() functions error path of a lot of drivers.
- Migrate all drivers to use nand_scan() instead of
   nand_scan_ident()/nand_scan_tail() pair.
- Use mtd_device_register() where applicable to simplify the code.
- Marvell:
   * Handle on-die ECC.
   * Better clocks handling.
   * Remove bogus comment.
   * Add suspend and resume support.
- Tegra: add NAND controller driver.
- Atmel:
   * Add module param to avoid using dma.
   * Drop Wenyou Yang from MAINTAINERS.
- Denali: optimize timings handling.
- FSMC: Stop using chip->read_buf().
- FSL:
   * Switch to SPDX license tag identifiers.
   * Fix qualifiers in MXC init functions.

Raw NAND chip drivers changes:
- Micron:
   * Add fixup for ONFI revision.
   * Update ecc_stats.corrected.
   * Make ECC activation stateful.
   * Avoid enabling/disabling ECC when it can't be disabled.
   * Get the actual number of bitflips.
   * Allow forced on-die ECC.
   * Support 8/512 on-die ECC.
   * Fix on-die ECC detection logic.
- Hynix:
   * Fix decoding the OOB size on H27UCG8T2BTR.
   * Use ->exec_op() in hynix_nand_reg_write_op().
"

zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature

If zram supports writeback feature, it's no longer a
BD_CAP_SYNCHRONOUS_IO device beause zram does asynchronous IO operations
for incompressible pages.

Do not pretend to be synchronous IO device.  It makes the system very
sluggish due to waiting for IO completion from upper layers.

Furthermore, it causes a user-after-free problem because swap thinks the
opearion is done when the IO functions returns so it can free the page
(e.g., lock_page_or_retry and goto out_release in do_swap_page) but in
fact, IO is asynchronous so the driver could access a just freed page
afterward.

This patch fixes the problem.

  BUG: Bad page state in process qemu-system-x86  pfn:3dfab21
  page:ffffdfb137eac840 count:0 mapcount:0 mapping:0000000000000000 index:0x1
  flags: 0x17fffc000000008(uptodate)
  raw: 017fffc000000008 dead000000000100 dead000000000200 0000000000000000
  raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
  bad because of flags: 0x8(uptodate)
  CPU: 4 PID: 1039 Comm: qemu-system-x86 Tainted: G    B 4.18.0-rc5+ #1
  Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017
  Call Trace:
    dump_stack+0x5c/0x7b
    bad_page+0xba/0x120
    get_page_from_freelist+0x1016/0x1250
    __alloc_pages_nodemask+0xfa/0x250
    alloc_pages_vma+0x7c/0x1c0
    do_swap_page+0x347/0x920
    __handle_mm_fault+0x7b4/0x1110
    handle_mm_fault+0xfc/0x1f0
    __get_user_pages+0x12f/0x690
    get_user_pages_unlocked+0x148/0x1f0
    __gfn_to_pfn_memslot+0xff/0x3c0 [kvm]
    try_async_pf+0x87/0x230 [kvm]
    tdp_page_fault+0x132/0x290 [kvm]
    kvm_mmu_page_fault+0x74/0x570 [kvm]
    kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
    kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
    do_vfs_ioctl+0xa2/0x630
    ksys_ioctl+0x70/0x80
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x55/0x100
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

Link: https://lore.kernel.org/lkml/[email protected]/
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: fix changelog, add comment]
Link: https://lore.kernel.org/lkml/[email protected]/
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: coding-style fixes]
Signed-off-by: Minchan Kim <[email protected]>
Reported-by: Tino Lehnig <[email protected]>
Tested-by: Tino Lehnig <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: <[email protected]> [4.15+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

mm/memory.c: check return value of ioremap_prot

ioremap_prot() can return NULL which could lead to an oops.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: chen jie <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: chenjie <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

lib/ubsan: remove null-pointer checks

With gcc-8 fsanitize=null become very noisy.  GCC started to complain
about things like &a->b, where 'a' is NULL pointer.  There is no NULL
dereference, we just calculate address to struct member.  It's
technically undefined behavior so UBSAN is correct to report it.  But as
long as there is no real NULL-dereference, I think, we should be fine.

-fno-delete-null-pointer-checks compiler flag should protect us from any
consequences.  So let's just no use -fsanitize=null as it's not useful
for us.  If there is a real NULL-deref we will see crash.  Even if
userspace mapped something at NULL (root can do this), with things like
SMAP should catch the issue.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrey Ryabinin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

MAINTAINERS: GDB: update e-mail address

This entry was created with my personal e-mail address. Update this entry
to my open-source kernel.org account.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kieran Bingham <[email protected]>
Cc: Jan Kiszka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

bnxt_en: Fix strcpy() warnings in bnxt_ethtool.c

This patch fixes following smatch warnings:

drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c:2826 bnxt_fill_coredump_seg_hdr() error: strcpy() '"sEgM"' too large for 'seg_hdr->signature' (5 vs 4)
drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c:2858 bnxt_fill_coredump_record() error: strcpy() '"cOrE"' too large for 'record->signature' (5 vs 4)
drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c:2879 bnxt_fill_coredump_record() error: strcpy() 'utsname()->sysname' too large for 'record->os_name' (65 vs 32)

Fixes: 6c5657d085ae ("bnxt_en: Add support for ethtool get dump.")
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Vasundhara Volam <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'bpf-reuseport-map'

Martin KaFai Lau says:

====================
This series introduces a new map type "BPF_MAP_TYPE_REUSEPORT_SOCKARRAY"
and a new prog type BPF_PROG_TYPE_SK_REUSEPORT.

Here is a snippet from a commit message:

"To unleash the full potential of a bpf prog, it is essential for the
userspace to be capable of directly setting up a bpf map which can then
be consumed by the bpf prog to make decision.  In this case, decide which
SO_REUSEPORT sk to serve the incoming request.

By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
the bpf prog can directly select a sk from the bpf map.  That will
raise the programmability of the bpf prog attached to a reuseport
group (a group of sk serving the same IP:PORT).

For example, in UDP, the bpf prog can peek into the payload (e.g.
through the "data" pointer introduced in the later patch) to learn
the application level's connection information and then decide which sk
to pick from a bpf map.  The userspace can tightly couple the sk's location
in a bpf map with the application logic in generating the UDP payload's
connection information.  This connection info contact/API stays within the
userspace.

Also, when used with map-in-map, the userspace can switch the
old-server-process's inner map to a new-server-process's inner map
in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
The bpf prog will then direct incoming requests to the new process instead
of the old process.  The old process can finish draining the pending
requests (e.g. by "accept()") before closing the old-fds.  [Note that
deleting a fd from a bpf map does not necessary mean the fd is closed]"
====================

Signed-off-by: Daniel Borkmann <[email protected]>

bpf: Test BPF_PROG_TYPE_SK_REUSEPORT

This patch add tests for the new BPF_PROG_TYPE_SK_REUSEPORT.

The tests cover:
- IPv4/IPv6 + TCP/UDP
- TCP syncookie
- TCP fastopen
- Cases when the bpf_sk_select_reuseport() returning errors
- Cases when the bpf prog returns SK_DROP
- Values from sk_reuseport_md
- outer_map => reuseport_array

The test depends on
commit 3eee1f75f2b9 ("bpf: fix bpf_skb_load_bytes_relative pkt length check")

Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

bpf: test BPF_MAP_TYPE_REUSEPORT_SOCKARRAY

This patch adds tests for the new BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.

Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

bpf: Sync bpf.h uapi to tools/

This patch sync include/uapi/linux/bpf.h to
tools/include/uapi/linux/

Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

bpf: Refactor ARRAY_SIZE macro to bpf_util.h

This patch refactors the ARRAY_SIZE macro to bpf_util.h.

Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>

bpf: Enable BPF_PROG_TYPE_SK_REUSEPORT bpf prog in reuseport selection

This patch allows a BPF_PROG_TYPE_SK_REUSEPORT bpf prog to select a
SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY introduced in
the earlier patch.  "bpf_run_sk_reuseport()" will return -ECONNREFUSED
when the BPF_PROG_TYPE_SK_REUSEPORT prog returns SK_DROP.
The callers, in inet[6]_hashtable.c and ipv[46]/udp.c, are modified to
handle this case and return NULL immediately instead of continuing the
sk search from its hashtable.

It re-uses the existing SO_ATTACH_REUSEPORT_EBPF setsockopt to attach
BPF_PROG_TYPE_SK_REUSEPORT.  The "sk_reuseport_attach_bpf()" will check
if the attaching bpf prog is in the new SK_REUSEPORT or the existing
SOCKET_FILTER type and then check different things accordingly.

One level of "__reuseport_attach_prog()" call is removed.  The
"sk_unhashed() && ..." and "sk->sk_reuseport_cb" tests are pushed
back to "reuseport_attach_prog()" in sock_reuseport.c.  sock_reuseport.c
seems to have more knowledge on those test requirements than filter.c.
In "reuseport_attach_prog()", after new_prog is attached to reuse->prog,
the old_prog (if any) is also directly freed instead of returning the
old_prog to the caller and asking the caller to free.

The sysctl_optmem_max check is moved back to the
"sk_reuseport_attach_filter()" and "sk_reuseport_attach_bpf()".
As of other bpf prog types, the new BPF_PROG_TYPE_SK_REUSEPORT is only
bounded by the usual "bpf_prog_charge_memlock()" during load time
instead of bounded by both bpf_prog_charge_memlock and sysctl_optmem_max.

Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>