brcmfmac: split brcmf_attach() and brcmf_detach() functions
Move code allocating/freeing wiphy out of above functions. This will
allow reinitializing the driver (e.g. on some error) without allocating
a new wiphy.
brcmfmac: move "cfg80211_ops" pointer to another struct
This moves "ops" pointer from "struct brcmf_cfg80211_info" to the
"struct brcmf_pub". This movement makes it possible to allocate wiphy
without attaching cfg80211 (brcmf_cfg80211_attach()). It's required for
later separation of wiphy allocation and driver initialization.
While at it, also fix an unlikely memory leak in brcmf_attach().
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Fix error path of nf_tables_updobj(), from Dan Carpenter.
2) Move large structure away from stack in the nf_tables offload
infrastructure, from Arnd Bergmann.
3) Move indirect flow_block logic to nf_tables_offload.
4) Support for synproxy objects, from Fernando Fernandez Mancera.
5) Support for fwd and dup offload.
6) Add __nft_offload_get_chain() helper, this implicitly fixes missing
mutex and check for offload flags in the indirect block support,
patch from wenxu.
7) Remove rules on device unregistration, from wenxu. This includes
two preparation patches to reuse nft_flow_offload_chain() and
nft_flow_offload_rule().
Large batch from Jeremy Sowden to make a second pass to the
CONFIG_HEADER_TEST support and a bit of housekeeping:
8) Missing include guard in conntrack label header, from Jeremy Sowden.
9) A few coding style errors: trailing whitespace, incorrect indent in
Kconfig, and semicolons at the end of function definitions.
10) Remove unused ipt_init() and ip6t_init() declarations.
11) Inline xt_hashlimit, ebt_802_3 and xt_physdev headers. They are
only used once.
12) Update include directive in several netfilter files.
14) Move nf_ip6_ext_hdr() to include/linux/netfilter_ipv6.h
15) Move several synproxy structure definitions to nf_synproxy.h
16) Move nf_bridge_frag_data structure to include/linux/netfilter_bridge.h
17) Clean up static inline definitions in nf_conntrack_ecache.h.
18) Replace defined(CONFIG...) || defined(CONFIG...MODULE) with IS_ENABLED(CONFIG...).
19) Missing inline function conditional definitions based on Kconfig
preferences in synproxy and nf_conntrack_timeout.
20) Update br_nf_pre_routing_ipv6() definition.
21) Move conntrack code in linux/skbuff.h to nf_conntrack headers.
22) Several patches to remove superfluous CONFIG_NETFILTER and
CONFIG_NF_CONNTRACK checks in headers, coming from the initial batch
support for CONFIG_HEADER_TEST for netfilter.
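
As an illustration of item 18, the sketch below models in user space how a single IS_ENABLED()-style test covers both the built-in (=y) and the modular (=m) case. The DEMO_* macros and the CONFIG_DEMO_* options are hypothetical stand-ins, not the kernel's own definitions; the kernel macro is reproduced here only in simplified form.

```c
/* Hypothetical options, standing in for what Kconfig would generate. */
#define CONFIG_DEMO_BUILTIN 1          /* =y: built into the image */
#define CONFIG_DEMO_LOADABLE_MODULE 1  /* =m: only the _MODULE symbol exists */
/* CONFIG_DEMO_ABSENT is deliberately left undefined (=n). */

/*
 * Simplified model of the preprocessor trick: if the option expands to 1,
 * the pasted placeholder expands to "0," and shifts a literal 1 into the
 * second argument slot; otherwise the second argument stays 0.
 */
#define DEMO_PLACEHOLDER_1 0,
#define DEMO_SECOND_ARG(ignored, val, ...) val
#define DEMO_DEFINED_3(arg1_or_junk) DEMO_SECOND_ARG(arg1_or_junk 1, 0, 0)
#define DEMO_DEFINED_2(val) DEMO_DEFINED_3(DEMO_PLACEHOLDER_##val)
#define DEMO_DEFINED_1(x) DEMO_DEFINED_2(x)
#define DEMO_IS_ENABLED(option) \
	(DEMO_DEFINED_1(option) || DEMO_DEFINED_1(option##_MODULE))

/* Old style: two explicit tests per option. */
static int demo_old_style_enabled(void)
{
#if defined(CONFIG_DEMO_LOADABLE) || defined(CONFIG_DEMO_LOADABLE_MODULE)
	return 1;
#else
	return 0;
#endif
}

/* New style: one test that covers both =y and =m. */
static int demo_new_style_enabled(void)
{
#if DEMO_IS_ENABLED(CONFIG_DEMO_LOADABLE)
	return 1;
#else
	return 0;
#endif
}
```

Both helpers agree for every option state, which is why the replacement is expected to be a pure cleanup.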
====================
mmc: tmio: Fixup runtime PM management during probe
tmio_mmc_host_probe() calls pm_runtime_set_active() to update the
runtime PM status of the device, so that it reflects the current status
of the HW. This works fine for most cases, but unfortunately not for all.
In particular, there is a generic problem when the device has a genpd attached
and that genpd has the ->start|stop() callbacks assigned.
More precisely, if the driver calls pm_runtime_set_active() during
->probe(), genpd does not get to invoke the ->start() callback for it,
which means the HW isn't really fully powered on. Furthermore, in the next
phase, when the device becomes runtime suspended, genpd will invoke the
->stop() callback for it, potentially leading to usage count imbalance
problems, depending on what's implemented behind the callbacks of course.
To fix this problem, call pm_runtime_get_sync() from
tmio_mmc_host_probe() rather than pm_runtime_set_active(). Additionally, to
avoid bumping usage counters and unnecessarily re-initializing the HW the
first time the tmio driver's ->runtime_resume() callback is called,
introduce a state flag to keep track of this.
It turns out that the above commit introduces other problems. For example,
calling pm_runtime_set_active() must not be done prior to calling
pm_runtime_enable() as that makes it fail. This leads to additional
problems, such as clock enables being wrongly balanced.
Rather than fixing the problem on top, let's start over by doing a revert.
Jeremy Sowden [Fri, 13 Sep 2019 08:13:16 +0000 (09:13 +0100)]
netfilter: remove CONFIG_NETFILTER checks from headers.
`struct nf_hook_ops`, `struct nf_hook_state` and the `nf_hookfn`
function typedef appear in function and struct declarations and
definitions in a number of netfilter headers. The structs and typedef
themselves are defined by linux/netfilter.h but only when
CONFIG_NETFILTER is enabled. Define them unconditionally and add
forward declarations in order to remove CONFIG_NETFILTER conditionals
from the other headers.
Jeremy Sowden [Fri, 13 Sep 2019 08:13:14 +0000 (09:13 +0100)]
netfilter: conntrack: move code to linux/nf_conntrack_common.h.
Move some `struct nf_conntrack` code from linux/skbuff.h to
linux/nf_conntrack_common.h. Together with a couple of helpers for
getting and setting skb->_nfct, it allows us to remove
CONFIG_NF_CONNTRACK checks from net/netfilter/nf_conntrack.h.
Jeremy Sowden [Fri, 13 Sep 2019 08:13:13 +0000 (09:13 +0100)]
netfilter: br_netfilter: update stub br_nf_pre_routing_ipv6 parameter to `void *priv`.
The real br_nf_pre_routing_ipv6 function, defined when CONFIG_IPV6 is
enabled, expects `void *priv`, not `const struct nf_hook_ops *ops`.
Update the stub br_nf_pre_routing_ipv6, defined when CONFIG_IPV6 is
disabled, to match.
Fixes: 06198b34a3e0 ("netfilter: Pass priv instead of nf_hook_ops to netfilter hooks")
Signed-off-by: Jeremy Sowden
Signed-off-by: Pablo Neira Ayuso
Jeremy Sowden [Fri, 13 Sep 2019 08:13:12 +0000 (09:13 +0100)]
netfilter: conntrack: wrap two inline functions in config checks.
nf_conntrack_synproxy.h contains three inline functions. The contents
of two of them are wrapped in CONFIG_NETFILTER_SYNPROXY checks and just
return NULL if it is not enabled. The third does nothing if they return
NULL, so wrap its contents as well.
nf_ct_timeout_data is only called if CONFIG_NETFILTER_TIMEOUT is
enabled. Wrap its contents in a CONFIG_NETFILTER_TIMEOUT check like the
other inline functions in nf_conntrack_timeout.h.
Jeremy Sowden [Fri, 13 Sep 2019 08:13:09 +0000 (09:13 +0100)]
netfilter: move nf_bridge_frag_data struct definition to a more appropriate header.
There is a struct definition in nf_conntrack_bridge.h which is
not specific to conntrack and is used elsewhere in netfilter. Move it
into netfilter_bridge.h.
Jeremy Sowden [Fri, 13 Sep 2019 08:13:07 +0000 (09:13 +0100)]
netfilter: move inline nf_ip6_ext_hdr() function to a more appropriate header.
There is an inline function in ip6_tables.h which is not specific to
ip6tables and is used elsewhere in netfilter. Move it into
netfilter_ipv6.h and update the callers.
Jeremy Sowden [Fri, 13 Sep 2019 08:13:06 +0000 (09:13 +0100)]
netfilter: remove nf_conntrack_icmpv6.h header.
nf_conntrack_icmpv6.h contains two object macros which duplicate macros
in linux/icmpv6.h. The latter definitions are also visible wherever it
is included, so remove it.
Jeremy Sowden [Fri, 13 Sep 2019 08:13:02 +0000 (09:13 +0100)]
netfilter: fix coding-style errors.
Several header-files, Kconfig files and Makefiles have trailing
white-space. Remove it.
In netfilter/Kconfig, indent the type of CONFIG_NETFILTER_NETLINK_ACCT
correctly.
There are semicolons at the end of two function definitions in
include/net/netfilter/nf_conntrack_acct.h and
include/net/netfilter/nf_conntrack_ecache.h. Remove them.
Merge tag 'for-5.3-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Here are two fixes, one of them urgent fixing a bug introduced in 5.2
and reported by many users. It took time to identify the root cause,
catching the 5.3 release is highly desired also to push the fix to 5.2
stable tree.
The bug is a mess up of return values after adding proper error
handling and honestly the kind of bug that can cause sleeping
disorders until it's caught. My apologies to everybody who was
affected.
Summary of what could happen:
1) either a hang when committing a transaction, if this happens
there's no risk of corruption, still the hang is very inconvenient
and can't be resolved without a reboot
2) writeback for some btree nodes may never be started and we end up
committing a transaction without noticing that, this is really
serious and that will lead to the "parent transid verify failed"
messages"
* tag 'for-5.3-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix unwritten extent buffers and hangs on future writeback attempts
Btrfs: fix assertion failure during fsync and use of stale transaction
netfilter: nf_tables_offload: add __nft_offload_get_chain function
Add __nft_offload_get_chain function to get basechain from device. This
function requires that caller holds the per-netns nftables mutex. This
patch implicitly fixes missing offload flags check and proper mutex from
nft_indr_block_cb().
Fixes: 9a32669fecfb ("netfilter: nf_tables_offload: support indr block call")
Signed-off-by: wenxu
Signed-off-by: Pablo Neira Ayuso
Roman Gushchin [Thu, 12 Sep 2019 17:56:45 +0000 (10:56 -0700)]
cgroup: freezer: fix frozen state inheritance
If a new child cgroup is created in the frozen cgroup hierarchy
(one or more of ancestor cgroups is frozen), the CGRP_FREEZE cgroup
flag should be set. Otherwise, if a process is attached to the
child cgroup, it won't become frozen.
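
The inheritance rule can be sketched with a small user-space model. The struct and helpers below are hypothetical simplifications, not the kernel's cgroup code: `e_freeze` models the effective freeze state inherited from ancestors, and `flag_freeze` models the CGRP_FREEZE flag that decides whether a newly attached task is frozen.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for a cgroup node. */
struct demo_cgroup {
	struct demo_cgroup *parent;
	bool e_freeze;     /* effective: this cgroup or an ancestor is frozen */
	bool flag_freeze;  /* models the CGRP_FREEZE flag */
};

/* Freeze a cgroup that already exists. */
static void demo_cgroup_freeze(struct demo_cgroup *cg)
{
	cg->e_freeze = true;
	cg->flag_freeze = true;
}

/*
 * Create a child under @parent. Without the fix only the effective state
 * is inherited; with the fix the flag is set as well.
 */
static void demo_cgroup_mkdir(struct demo_cgroup *child,
			      struct demo_cgroup *parent, bool with_fix)
{
	child->parent = parent;
	child->e_freeze = parent->e_freeze;
	child->flag_freeze = false;
	if (with_fix && child->e_freeze)
		child->flag_freeze = true;
}

/* In this model a task attached to @cg is frozen only if the flag is set. */
static bool demo_task_frozen_on_attach(const struct demo_cgroup *cg)
{
	return cg->flag_freeze;
}
```

With `with_fix` set to false, a task attached to a child of a frozen parent stays running in this model, which mirrors the failure of test_cgfreezer_mkdir.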
The problem can be reproduced with the test_cgfreezer_mkdir test.
This is the output before this patch:
~/test_freezer
ok 1 test_cgfreezer_simple
ok 2 test_cgfreezer_tree
ok 3 test_cgfreezer_forkbomb
Cgroup /sys/fs/cgroup/cg_test_mkdir_A/cg_test_mkdir_B isn't frozen
not ok 4 test_cgfreezer_mkdir
ok 5 test_cgfreezer_rmdir
ok 6 test_cgfreezer_migrate
ok 7 test_cgfreezer_ptrace
ok 8 test_cgfreezer_stopped
ok 9 test_cgfreezer_ptraced
ok 10 test_cgfreezer_vfork
And with this patch:
~/test_freezer
ok 1 test_cgfreezer_simple
ok 2 test_cgfreezer_tree
ok 3 test_cgfreezer_forkbomb
ok 4 test_cgfreezer_mkdir
ok 5 test_cgfreezer_rmdir
ok 6 test_cgfreezer_migrate
ok 7 test_cgfreezer_ptrace
ok 8 test_cgfreezer_stopped
ok 9 test_cgfreezer_ptraced
ok 10 test_cgfreezer_vfork
Roman Gushchin [Thu, 12 Sep 2019 17:56:44 +0000 (10:56 -0700)]
kselftests: cgroup: add freezer mkdir test
Add a new cgroup freezer selftest, which checks that if a cgroup is
frozen, its new child cgroups will properly inherit the frozen
state.
It creates a parent cgroup, freezes it, creates a child cgroup
and populates it with a dummy process. Then it checks that both
parent and child cgroup are frozen.
Tony Nguyen [Mon, 9 Sep 2019 13:47:46 +0000 (06:47 -0700)]
ice: Enable DDP package download
Attempt to request an optional device-specific DDP package file
(one with the PCIe Device Serial Number in its name so that different DDP
package files can be used on different devices). If the optional package
file exists, download it to the device. If not, download the default
package file.
Log an appropriate message based on whether or not a DDP package
file exists and the return code from the attempt to download it to the
device. If the download fails and there is not already a package file on
the device, go into "Safe Mode" where some features are not supported.
Tony Nguyen [Mon, 9 Sep 2019 13:47:45 +0000 (06:47 -0700)]
ice: Initialize DDP package structures
Add functions to initialize, parse, and clean structures representing
the DDP package.
Upon completion of package download, read and store the DDP package
contents to these structures. This configuration is used to
identify the default behavior and later used to update the HW table
entries.
Add the required defines, structures, and functions to enable downloading
a DDP package. Before download, checks are performed to ensure the package
is valid and compatible.
Note that package download is not yet requested by the driver as further
initialization is required to utilize the package.
The FW build id is currently being displayed as an int which doesn't make
sense. Instead display FW build id as a hex value. Also add other useful
information to the output such as NVM version, API patch info, and FW
build hash.
The driver is required to send a version to the firmware
to indicate that the driver is up. If the driver doesn't
do this the firmware doesn't behave properly.
Lior David [Tue, 10 Sep 2019 13:46:43 +0000 (16:46 +0300)]
wil6210: ignore reset errors for FW during probe
There are special kinds of FW such as WMI only which
are used for testing, diagnostics and other specific
scenarios.
Such FW is loaded during driver probe and the driver
disallows enabling any network interface, to avoid
operational issues.
In many cases it is used to debug early versions
of FW with new features, which sometimes fail
on startup.
Currently when such FW fails to load (for example,
because of init failure), the driver probe would fail
and shut down the device, making it difficult to debug
the early failure.
To fix this, ignore load failures in WMI only FW and
allow driver probe to succeed, making it possible to
continue and debug the FW load failure.
Lior David [Tue, 10 Sep 2019 13:46:41 +0000 (16:46 +0300)]
wil6210: fix RX short frame check
The short frame check in wil_sring_reap_rx_edma uses
skb->len which stores the maximum frame length. Fix
this to use dmalen which is the actual length of
the received frame.
Upon driver rmmod, cancel_work_sync() can be invoked on
p2p.discovery_expired_work before this work struct was initialized.
This causes a WARN_ON with newer kernel versions.
Add initialization of discovery_expired_work inside wil_vif_init().
wil6210: make sure DR bit is read before rest of the status message
Due to compiler optimization, it's possible that dr_bit (descriptor
ready) is read last from the status message.
Due to race condition between HW writing the status message and
driver reading it, other fields that were read earlier (before dr_bit)
could have invalid values.
Fix this by explicitly reading the dr_bit first and then using rmb
before reading the rest of the status message.
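
The ordering requirement can be sketched in user space with C11 atomics. This is an analogue under stated assumptions, not the driver code: the message layout and function names are hypothetical, and an acquire fence stands in for the kernel's rmb().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status message; the real layout is defined by the hardware. */
struct demo_status_msg {
	_Atomic uint8_t dr_bit;  /* written last by the producer */
	uint16_t length;
	uint16_t buff_id;
};

/* Producer side (stand-in for the HW): payload first, ready bit last. */
static void demo_publish(struct demo_status_msg *msg, uint16_t length,
			 uint16_t buff_id, uint8_t dr_bit)
{
	msg->length = length;
	msg->buff_id = buff_id;
	atomic_store_explicit(&msg->dr_bit, dr_bit, memory_order_release);
}

/* Consumer side: read the ready bit first, fence, then read the rest. */
static bool demo_consume(struct demo_status_msg *msg, uint8_t expected_dr,
			 uint16_t *length, uint16_t *buff_id)
{
	uint8_t dr = atomic_load_explicit(&msg->dr_bit, memory_order_relaxed);

	if (dr != expected_dr)
		return false;  /* descriptor not ready, nothing else is read */

	/* Keeps the loads below from being reordered before the dr_bit load. */
	atomic_thread_fence(memory_order_acquire);

	*length = msg->length;
	*buff_id = msg->buff_id;
	return true;
}
```

Without the fence, the compiler or CPU would be free to load `length` and `buff_id` before `dr_bit`, which is the stale-field scenario described above.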
Ahmad Masri [Tue, 10 Sep 2019 13:46:26 +0000 (16:46 +0300)]
wil6210: fix PTK re-key race
Fix a race between cfg80211 add_key call and transmitting of 4/4 EAP
packet. In case the transmit is delayed until after the add key takes
place, message 4/4 will be encrypted with the new key, and the
receiver side (AP) will drop it due to MIC error.
Wil6210 will monitor and look for the transmitted packet 4/4 eap key.
In case add_key takes place before the transmission completed, then
wil6210 will let the FW store the key and wil6210 will notify the FW
to use the PTK key only after 4/4 eap packet transmission was
completed.
PMC is a hardware debug mechanism which allows capturing real time
debug data and stream it to host memory. The driver allocates memory
buffers and sets them inside the PMC ring of descriptors.
Add pmcring debugfs that application can use to read the binary
content of descriptors inside the PMC ring (cat pmcring).
Rakesh Pillai [Fri, 8 Mar 2019 11:26:06 +0000 (16:56 +0530)]
ath10k: fix channel info parsing for non tlv target
The tlv targets such as WCN3990 send more data in the chan info event, which is
not sent by the non tlv targets. There is a minimum size check in the wmi event
for non-tlv targets and hence we cannot update the common channel info
structure as it was done in commit 13104929d2ec ("ath10k: fill the channel
survey results for WCN3990 correctly"). This broke channel survey results on
10.x firmware versions.
If the common channel info structure is updated, the size check for chan info
event for non-tlv targets will fail and return -EPROTO and we see the below
error messages:
ath10k_pci 0000:01:00.0: failed to parse chan info event: -71
Add tlv specific channel info structure and restore the original size of the
common channel info structure to mitigate this issue.
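
A minimal sketch of why growing the common structure breaks the non-TLV path. The field names and sizes are hypothetical; only the relationship between the structure size and the minimum-length check matters. The -71 in the log line above is -EPROTO on Linux.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout of the event as sent by non-TLV targets. */
struct demo_chan_info_common {
	uint32_t err_code;
	uint32_t freq;
	uint32_t noise_floor;
};

/* Hypothetical TLV-specific layout carrying the extra data. */
struct demo_chan_info_tlv {
	struct demo_chan_info_common common;
	uint32_t extra_a;
	uint32_t extra_b;
};

/* Minimum-size check, as done for the non-TLV event. */
static int demo_check_len(size_t len, size_t min_len)
{
	return len < min_len ? -EPROTO : 0;
}

/* After the fix: the check uses the original, smaller structure. */
static int demo_parse_non_tlv(const void *buf, size_t len,
			      struct demo_chan_info_common *out)
{
	int ret = demo_check_len(len, sizeof(*out));

	if (ret)
		return ret;
	memcpy(out, buf, sizeof(*out));
	return 0;
}

/* Before the fix: the common structure had grown to the TLV size. */
static int demo_parse_with_grown_struct(const void *buf, size_t len)
{
	(void)buf;
	return demo_check_len(len, sizeof(struct demo_chan_info_tlv));
}
```

An event of exactly `sizeof(struct demo_chan_info_common)` bytes passes the first parser and is rejected by the second, which is the failure mode seen on 10.x firmware.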
Nicolas Boichat [Tue, 10 Sep 2019 13:46:17 +0000 (16:46 +0300)]
ath10k: adjust skb length in ath10k_sdio_mbox_rx_packet
When the FW bundles multiple packets, pkt->act_len may be incorrect
as it refers to the first packet only (however, the FW will only
bundle packets that fit into the same pkt->alloc_len).
Before this patch, the skb length would be set (incorrectly) to
pkt->act_len in ath10k_sdio_mbox_rx_packet, and then later manually
adjusted in ath10k_sdio_mbox_rx_process_packet.
The first problem is that ath10k_sdio_mbox_rx_process_packet does not
use proper skb_put commands to adjust the length (it directly changes
skb->len), so we end up with a mismatch between skb->head + skb->tail
and skb->data + skb->len. This is quite serious, and causes corruptions
in the TCP stack, as the stack tries to coalesce packets, and relies
on skb->tail being correct (that is, skb_tail_pointer must point to
the first byte _after_ the data).
Instead of re-adjusting the size in ath10k_sdio_mbox_rx_process_packet,
this moves the code to ath10k_sdio_mbox_rx_packet, and also adds a
bounds check, as skb_put would crash the kernel if not enough space is
available.
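
The mismatch can be illustrated with a toy buffer in user space. The struct below is a hypothetical stand-in for an skb, and `demo_put()` models skb_put() with the difference that it returns NULL instead of crashing when the tailroom is exhausted.

```c
#include <stddef.h>

/* Hypothetical stand-in for an skb with a fixed 64-byte buffer. */
struct demo_skb {
	unsigned char buf[64];
	size_t data;  /* offset of the first payload byte */
	size_t tail;  /* offset one past the last payload byte */
	size_t len;   /* payload length; must equal tail - data */
};

static size_t demo_tailroom(const struct demo_skb *skb)
{
	return sizeof(skb->buf) - skb->tail;
}

/* Extend the payload by @n bytes, keeping tail and len in step. */
static unsigned char *demo_put(struct demo_skb *skb, size_t n)
{
	unsigned char *old_tail;

	if (n > demo_tailroom(skb))
		return NULL;  /* bounds check: refuse instead of overrunning */

	old_tail = skb->buf + skb->tail;
	skb->tail += n;
	skb->len += n;
	return old_tail;
}

/* The invariant that packet coalescing relies on. */
static int demo_consistent(const struct demo_skb *skb)
{
	return skb->len == skb->tail - skb->data;
}
```

Assigning `len` directly leaves `tail` behind, so `demo_consistent()` fails; going through `demo_put()` keeps the two in step.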
Tested with QCA6174 SDIO with firmware
WLAN.RMH.4.4.1-00007-QCARMSWP-1.
Fixes: 8530b4e7b22bc3b ("ath10k: sdio: set skb len for all rx packets")
Signed-off-by: Nicolas Boichat
Signed-off-by: Wen Gong
Signed-off-by: Kalle Valo
Ben Greear [Tue, 10 Sep 2019 13:46:15 +0000 (16:46 +0300)]
ath10k: free beacon buf later in vdev teardown
My wave-1 firmware often crashes when I am bringing down
AP vdevs, and sometimes at least some machines lock up hard
after spewing IOMMU errors.
I don't see the same issue in STA mode, so I suspect beacons
are the issue.
Moving the beacon buf deletion to later in the vdev teardown
logic appears to help this problem. Firmware still crashes
often, but several iterations did not show IOMMU errors and
machine didn't hang.
Tested hardware: QCA9880
Tested firmware: ath10k-ct from beginning of 2019, exact version unknown
Chris Wilson [Thu, 12 Sep 2019 12:56:34 +0000 (13:56 +0100)]
Revert "drm/i915/userptr: Acquire the page lock around set_page_dirty()"
The userptr put_pages can be called from inside try_to_unmap, and so
enters with the page lock held on one of the object's backing pages. We
cannot take the page lock ourselves for fear of recursion.
Merge tag 'for-linus-20190912' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux
Pull clone3 fix from Christian Brauner:
"This is a last-minute bugfix for clone3() that should go in before we
release 5.3 with clone3().
clone3() did not verify that the exit_signal argument was set to a
valid signal. This can be used to cause a crash by specifying a signal
greater than NSIG, e.g. -1.
The commit from Eugene adds a check to copy_clone_args_from_user() to
verify that the exit signal is limited by CSIGNAL as with legacy
clone() and that the signal is valid. With this we don't get the
legacy clone behavior where an invalid signal could be handed down and
would only be detected and then ignored in do_notify_parent(). Users
of clone3() will now get a proper error right when they pass an
invalid exit signal. Note that this is not a change in user-visible
behavior since no kernel with clone3() has been released yet"
* tag 'for-linus-20190912' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux:
fork: block invalid exit signals with clone3()
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"A KVM guest fix, and a kdump kernel relocation errors fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/timer: Force PIT initialization when !X86_FEATURE_ARAT
x86/purgatory: Change compiler flags from -mcmodel=kernel to -mcmodel=large to fix kexec relocation errors
Previously, higher 32 bits of exit_signal fields were lost when copied
to the kernel args structure (that uses int as a type for the respective
field). Moreover, as Oleg has noted, exit_signal is used unchecked, so
it has to be checked for sanity before use; for the legacy syscalls,
applying CSIGNAL mask guarantees that it is at least non-negative;
however, no such check is done in the clone3() code path, and that
can break at least thread_group_leader.
This commit adds a check to copy_clone_args_from_user() to verify that
the exit signal is limited by CSIGNAL as with legacy clone() and that
the signal is valid. With this we don't get the legacy clone behavior
where an invalid signal could be handed down and would only be detected
and ignored in do_notify_parent(). Users of clone3() will now get a
proper error when they pass an invalid exit signal. Note that this is
not a change in user-visible behavior since no kernel with clone3() has been
released yet.
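
The check described above can be modelled in user space as follows. This is a sketch, not the kernel source: DEMO_CSIGNAL and DEMO_NSIG are local stand-ins (0xff for the CSIGNAL mask, 64 as the highest signal number on most architectures), and the function name is hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_CSIGNAL 0xffULL  /* assumed mask, as applied by legacy clone() */
#define DEMO_NSIG    64       /* assumed highest valid signal number */

/*
 * Model of the validation applied to the 64-bit exit_signal field before
 * it is narrowed to an int: reject any bits outside the CSIGNAL mask and
 * any value above the highest valid signal.
 */
static int demo_check_exit_signal(uint64_t exit_signal)
{
	if ((exit_signal & ~DEMO_CSIGNAL) || exit_signal > DEMO_NSIG)
		return -EINVAL;
	return 0;
}
```

A value such as `(uint64_t)-1`, or a valid signal with a bit set in the upper 32 bits, is rejected up front instead of being silently truncated.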
The following program will cause a splat on a non-fixed clone3() version
and will fail correctly on a fixed version:
Thomas Huth [Thu, 12 Sep 2019 11:54:38 +0000 (13:54 +0200)]
KVM: s390: Do not leak kernel stack data in the KVM_S390_INTERRUPT ioctl
When the userspace program runs the KVM_S390_INTERRUPT ioctl to inject
an interrupt, we convert it from the legacy struct kvm_s390_interrupt
to the new struct kvm_s390_irq via the s390int_to_s390irq() function.
However, this function does not take care of all types of interrupts
that we can inject into the guest later (see do_inject_vcpu()). Since we
do not clear out the s390irq values before calling s390int_to_s390irq(),
there is a chance that we copy random data from the kernel stack which
could be leaked to the userspace later.
Specifically, the problem exists with the KVM_S390_INT_PFAULT_INIT
interrupt: s390int_to_s390irq() does not handle it, and the function
__inject_pfault_init() later copies irq->u.ext which contains the
random kernel stack data. This data can then be leaked either to
the guest memory in __deliver_pfault_init(), or the userspace might
retrieve it directly with the KVM_S390_GET_IRQ_STATE ioctl.
Fix it by handling that interrupt type in s390int_to_s390irq(), too,
and by making sure that the s390irq struct is properly pre-initialized.
And while we're at it, make sure that s390int_to_s390irq() now
directly returns -EINVAL for unknown interrupt types, so that we
immediately get a proper error code in case we add more interrupt
types to do_inject_vcpu() without updating s390int_to_s390irq()
sometime in the future.
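
The shape of the fix can be sketched with a toy conversion function. The types, field names and interrupt numbers below are hypothetical simplifications of the structures named above; the point is the pre-initialised destination and the -EINVAL default.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical interrupt types. */
enum {
	DEMO_INT_EXTERNAL = 1,
	DEMO_INT_PFAULT_INIT = 2,
};

/* Hypothetical legacy and new interrupt descriptions. */
struct demo_legacy_int {
	uint32_t type;
	uint32_t parm;
	uint64_t parm64;
};

struct demo_irq {
	uint32_t type;
	uint32_t ext_parm;
	uint64_t ext_parm2;
};

/* Convert, handling every type that can be injected later. */
static int demo_convert(const struct demo_legacy_int *in, struct demo_irq *out)
{
	out->type = in->type;
	switch (in->type) {
	case DEMO_INT_EXTERNAL:
		out->ext_parm = in->parm;
		break;
	case DEMO_INT_PFAULT_INIT:
		out->ext_parm2 = in->parm64;
		break;
	default:
		return -EINVAL;  /* unknown types fail immediately */
	}
	return 0;
}

/* Entry point: clear the destination so no stale bytes survive. */
static int demo_inject(const struct demo_legacy_int *in, struct demo_irq *out)
{
	memset(out, 0, sizeof(*out));
	return demo_convert(in, out);
}
```

Even if the caller's `struct demo_irq` held garbage beforehand, fields that the selected case does not write read back as zero.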
The ixgbe driver currently does IPsec TX offloading
based on an existing secpath. However, the secpath
can also come from the RX side, in this case it is
misinterpreted for TX offload and the packets are
dropped with a "bad sa_idx" error. Fix this by using
the xfrm_offload() function to test for TX offload.
Fixes: 592594704761 ("ixgbe: process the Tx ipsec offload")
Reported-by: Michael Marley
Signed-off-by: Steffen Klassert
Signed-off-by: David S. Miller
Filipe Manana [Wed, 11 Sep 2019 16:42:00 +0000 (17:42 +0100)]
Btrfs: fix unwritten extent buffers and hangs on future writeback attempts
lock_extent_buffer_io() returns 1 to the caller to tell it everything
went fine and the caller needs to start writeback for the extent buffer
(submit a bio, etc), 0 to tell the caller everything went fine but it does
not need to start writeback for the extent buffer, and a negative value if
some error happened.
When it's about to return 1 it tries to lock all pages, and if a try lock
on a page fails, and we didn't flush any existing bio in our "epd", it
calls flush_write_bio(epd) and overwrites the return value of 1 to 0 or
an error. The page might have been locked elsewhere, not with the goal
of starting writeback of the extent buffer, and even by some code other
than btrfs, like page migration for example, so it does not mean the
writeback of the extent buffer was already started by some other task,
so returning a 0 tells the caller (btree_write_cache_pages()) to not
start writeback for the extent buffer. Note that epd might currently have
either no bio, so flush_write_bio() returns 0 (success) or it might have
a bio for another extent buffer with a lower index (logical address).
Since we return 0 with the EXTENT_BUFFER_WRITEBACK bit set on the
extent buffer and writeback is never started for the extent buffer,
future attempts to writeback the extent buffer will hang forever waiting
on that bit to be cleared, since it can only be cleared after writeback
completes. Such a hang is reported with a trace like the following:
So fix this by not overwriting the return value (ret) with the result
from flush_write_bio(). We also need to clear the EXTENT_BUFFER_WRITEBACK
bit in case flush_write_bio() returns an error, otherwise it will hang
any future attempts to writeback the extent buffer, and undo all work
done before (set back EXTENT_BUFFER_DIRTY, etc).
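
The return-value bug reduces to the pattern below. This is a toy model with hypothetical names: `demo_flush()` stands in for flush_write_bio(), and the locking itself is omitted.

```c
#include <stdbool.h>

/* Stand-in for flush_write_bio(): 0 on success, negative on error. */
static int demo_flush(int flush_result)
{
	return flush_result;
}

/* Buggy shape: a successful flush turns "start writeback" (1) into 0. */
static int demo_lock_buggy(bool trylock_fails, int flush_result)
{
	int ret = 1;  /* 1 = caller must start writeback */

	if (trylock_fails)
		ret = demo_flush(flush_result);
	return ret;
}

/* Fixed shape: the flush result goes in its own variable. */
static int demo_lock_fixed(bool trylock_fails, int flush_result)
{
	int ret = 1;

	if (trylock_fails) {
		int err = demo_flush(flush_result);

		/*
		 * On error the real fix also clears the writeback bit and
		 * restores the dirty state before returning.
		 */
		if (err < 0)
			return err;
	}
	return ret;
}
```

With a failed trylock and a successful flush, the buggy variant returns 0 and the caller skips writeback, while the fixed variant still returns 1.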
This is a regression introduced in the 5.2 kernel.
Filipe Manana [Tue, 10 Sep 2019 14:26:49 +0000 (15:26 +0100)]
Btrfs: fix assertion failure during fsync and use of stale transaction
Sometimes when fsync'ing a file we need to log that other inodes exist and
when we need to do that we acquire a reference on the inodes and then drop
that reference using iput() after logging them.
That generally is not a problem except if we end up doing the final iput()
(dropping the last reference) on the inode and that inode has a link count
of 0, which can happen in a very short time window if the logging path
gets a reference on the inode while it's being unlinked.
In that case we end up getting the eviction callback, btrfs_evict_inode(),
invoked through the iput() call chain which needs to drop all of the
inode's items from its subvolume btree, and in order to do that, it needs
to join a transaction at the helper function evict_refill_and_join().
However because the task previously started a transaction at the fsync
handler, btrfs_sync_file(), it has current->journal_info already pointing
to a transaction handle and therefore evict_refill_and_join() will get
that transaction handle from btrfs_join_transaction(). From this point on,
two different problems can happen:
1) evict_refill_and_join() will often change the transaction handle's
block reserve (->block_rsv) and set its ->bytes_reserved field to a
value greater than 0. If evict_refill_and_join() never commits the
transaction, the eviction handler ends up decreasing the reference
count (->use_count) of the transaction handle through the call to
btrfs_end_transaction(), and after that point we have a transaction
handle with a NULL ->block_rsv (which is the value prior to the
transaction join from evict_refill_and_join()) and a ->bytes_reserved
value greater than 0. If after the eviction/iput completes the inode
logging path hits an error or it decides that it must fall back to a
transaction commit, the btrfs fsync handler, btrfs_sync_file(), gets a
non-zero value from btrfs_log_dentry_safe(), and because of that
non-zero value it tries to commit the transaction using a handle with
a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
the transaction commit hit an assertion failure at
btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
the ->block_rsv is NULL. The produced stack trace for that is like the
following:
2) If evict_refill_and_join() decides to commit the transaction, it will
be able to do it, since the nested transaction join only increments the
transaction handle's ->use_count reference counter and it does not
prevent the transaction from getting committed. This means that after
eviction completes, the fsync logging path will be using a transaction
handle that refers to an already committed transaction. What happens
when using such a stale transaction can be unpredictable, we are at
least having a use-after-free on the transaction handle itself, since
the transaction commit will call kmem_cache_free() against the handle
regardless of its ->use_count value, or we can end up silently losing
all the updates to the log tree after that iput() in the logging path,
or using a transaction handle that in the meanwhile was allocated to
another task for a new transaction, etc, pretty much unpredictable
what can happen.
In order to fix both of them, instead of using iput() during logging, use
btrfs_add_delayed_iput(), so that the logging path of fsync never drops
the last reference on an inode, that step is offloaded to a safe context
(usually the cleaner kthread).
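
The idea of deferring the final reference drop can be sketched as follows. This is a simplified, single-threaded model with hypothetical names; unlike the real helper it always defers the drop, and it has no locking.

```c
#include <stddef.h>

/* Hypothetical stand-in for an inode with a reference count. */
struct demo_inode {
	int refcount;
	int evicted;
	struct demo_inode *next_delayed;
};

/* Pending drops, processed later from a safe context. */
static struct demo_inode *demo_delayed_list;

/* Immediate put: eviction runs in the caller's context on the last drop. */
static void demo_iput(struct demo_inode *inode)
{
	if (--inode->refcount == 0)
		inode->evicted = 1;
}

/* Deferred put: only queue the inode, never evict here. */
static void demo_add_delayed_iput(struct demo_inode *inode)
{
	inode->next_delayed = demo_delayed_list;
	demo_delayed_list = inode;
}

/* Run from the safe context (the cleaner thread in the text above). */
static void demo_run_delayed_iputs(void)
{
	while (demo_delayed_list) {
		struct demo_inode *inode = demo_delayed_list;

		demo_delayed_list = inode->next_delayed;
		inode->next_delayed = NULL;
		demo_iput(inode);
	}
}
```

The logging path would call `demo_add_delayed_iput()`, so eviction, and with it any nested transaction join, cannot happen while the fsync transaction handle is still in use.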
The assertion failure issue was sporadically triggered by the test case
generic/475 from fstests, which loads the dm error target while fsstress
is running, which leads to fsync failing while logging inodes with -EIO
errors and then trying later to commit the transaction, triggering the
assertion failure.
In event of failure during register_netdevice, free_netdev is
invoked immediately. free_netdev assumes that all the netdevice
refcounts have been dropped prior to it being called and as a
result frees and clears out the refcount pointer.
However, this is not necessarily true as some of the operations
in the NETDEV_UNREGISTER notifier handlers queue RCU callbacks for
invocation after a grace period. The IPv4 callback in_dev_rcu_put
tries to access the refcount after free_netdev is called which
leads to a NULL dereference.
Fix this by waiting for the completion of the call_rcu() in
case of register_netdevice errors.
Fixes: 93ee31f14f6f ("[NET]: Fix free_netdev on register_netdev failure.")
Cc: Sean Tranchetti
Signed-off-by: Subash Abhinov Kasiviswanathan
Signed-off-by: David S. Miller
====================
add ksz9567 with I2C support to ksz9477 driver
Resurrect KSZ9477 I2C driver support patch originally sent to the list
by Tristram Ha and resolve outstanding issues. It now works as similarly to
the ksz9477 SPI driver as possible, using the same regmap macros.
Add support for ksz9567 to the ksz9477 driver (tested on a board with
ksz9567 connected via I2C).
Remove NET_DSA_TAG_KSZ_COMMON since it's not needed.
Changes since v1:
Put ksz9477_i2c.c includes in alphabetical order.
Added Reviewed-Bys.
====================
Remove the superfluous NET_DSA_TAG_KSZ_COMMON and just use the existing
NET_DSA_TAG_KSZ. Update the description to mention the three switch
families it supports. No functional change.
net: dsa: microchip: add ksz9567 to ksz9477 driver
Add support for the KSZ9567 7-Port Gigabit Ethernet Switch to the
ksz9477 driver. The KSZ9567 supports both SPI and I2C. Oddly the
ksz9567 is already in the device tree binding documentation.
tun_chr_read_iter() accessed memory which had been freed by free_netdev()
called by tun_set_iff():
        CPUA                                CPUB
  tun_set_iff()
    alloc_netdev_mqs()
    tun_attach()
                                      tun_chr_read_iter()
                                        tun_get()
                                        tun_do_read()
                                          tun_ring_recv()
    register_netdevice() <-- inject error
    goto err_detach
    tun_detach_all() <-- set RCV_SHUTDOWN
    free_netdev() <-- called from
                      err_free_dev path
      netdev_freemem() <-- free the memory
                           without check refcount
      (In this path, the refcount cannot prevent
       freeing the memory of dev, and the memory
       will be used by dev_put() called by
       tun_chr_read_iter() on CPUB.)
                                          (Break from tun_ring_recv(),
                                          because RCV_SHUTDOWN is set)
                                        tun_put()
                                          dev_put() <-- use the memory
                                                        freed by netdev_freemem()
Put the publishing of tfile->tun after register_netdevice(),
so tun_get() won't get the tun pointer that was freed by the
err_detach path if register_netdevice() failed.
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
"Last minute bugfixes.
A couple of security things.
And an error handling bugfix that is never encountered by most people,
but that also makes it kind of safe to push at the last minute, and it
helps push the fix to stable a bit sooner"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost: make sure log_num < in_num
vhost: block speculation of translated descriptors
virtio_ring: fix unmap of indirect descriptors
Merge tag 'pinctrl-v5.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fix from Linus Walleij:
"Hopefully last pin control fix: a single patch for some Aspeed
problems. The BMCs are much happier now"
* tag 'pinctrl-v5.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: aspeed: Fix spurious mux failures on the AST2500
Merge tag 'gpio-v5.3-6' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO fixes from Linus Walleij:
"I don't really like to send so many fixes at the very last minute, but
the bug-sport activity is unpredictable.
Four fixes, three are -stable material that will go everywhere, one is
for the current cycle:
- An ACPI DSDT error fixup of the type we always see and Hans
invariably gets to fix.
- A OF quirk fix for the current release (v5.3)
- Some consistency checks on the userspace ABI.
- A memory leak"
* tag 'gpio-v5.3-6' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
gpiolib: acpi: Add gpiolib_acpi_run_edge_events_on_boot option and blacklist
gpiolib: of: fix fallback quirks handling
gpio: fix line flag validation in lineevent_create
gpio: fix line flag validation in linehandle_create
gpio: mockup: add missing single_release()
Andrew Jeffery [Thu, 29 Aug 2019 07:17:38 +0000 (16:47 +0930)]
pinctrl: aspeed: Fix spurious mux failures on the AST2500
Commit 674fa8daa8c9 ("pinctrl: aspeed-g5: Delay acquisition of regmaps")
was determined to be a partial fix to the problem of acquiring the LPC
Host Controller and GFX regmaps: The AST2500 pin controller may need to
fetch syscon regmaps during expression evaluation as well as when
setting mux state. For example, this case is hit by attempting to export
pins exposing the LPC Host Controller as GPIOs.
An optional eval() hook is added to the Aspeed pinmux operation struct
and called from aspeed_sig_expr_eval() if the pointer is set by the
SoC-specific driver. This enables the AST2500 to perform the custom
action of acquiring its regmap dependencies as required.
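The optional-hook pattern described here can be sketched in plain C. All names below (struct mux_ctx, struct mux_ops, generic_eval(), soc_eval(), sig_expr_eval()) are hypothetical and only mirror the shape of the change, not the Aspeed driver's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context and ops struct; not the Aspeed driver's types. */
struct mux_ctx { int regmap_ready; int expr_value; };

struct mux_ops {
    /* Optional: SoC-specific evaluation, may be NULL. */
    int (*eval)(struct mux_ctx *ctx);
};

/* Generic evaluation used when no SoC-specific hook is installed. */
static int generic_eval(struct mux_ctx *ctx)
{
    return ctx->expr_value;
}

/* SoC-specific hook: acquire dependencies first, then evaluate. */
static int soc_eval(struct mux_ctx *ctx)
{
    ctx->regmap_ready = 1;      /* stands in for fetching a syscon regmap */
    return generic_eval(ctx);
}

/* Dispatch: call the hook if it is set, otherwise fall back. */
static int sig_expr_eval(const struct mux_ops *ops, struct mux_ctx *ctx)
{
    assert(ops && ctx);
    if (ops->eval)
        return ops->eval(ctx);
    return generic_eval(ctx);
}
```

An SoC that needs no extra work simply leaves eval as NULL and gets the generic path unchanged.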
John Wang tested the fix on an Inspur FP5280G2 machine (AST2500-based)
where the issue was found, and I've booted the fix on Witherspoon
(AST2500) and Palmetto (AST2400) machines, and poked at relevant pins
under QEMU by forcing mux configurations via devmem before exporting
GPIOs to exercise the driver.
David S. Miller [Wed, 11 Sep 2019 23:05:52 +0000 (00:05 +0100)]
Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2019-09-11
This series contains fixes to ixgbe.
Alex fixes up the adaptive ITR scheme for ixgbe which could result in a
value that was either 0 or something less than 10 which was causing
issues with hardware features, like RSC, that do not function well with
ITR values that low.
Ilya Maximets fixes the ixgbe driver to limit the number of transmit
descriptors to clean by the number of transmit descriptors used in the
transmit ring, so that the driver does not try to "double" clean the
same descriptors.
====================
The PluDevice register provides the authoritative chip model/revision.
Since the model number is purely used for reporting purposes, follow
the hardware team convention of subtracting 0x10 from the PluDevice
register to obtain the chip model/revision number.
Eric Dumazet [Tue, 10 Sep 2019 21:49:28 +0000 (14:49 -0700)]
tcp: force a PSH flag on TSO packets
When tcp sends a TSO packet, adding a PSH flag on it
reduces the sojourn time of GRO packet in GRO receivers.
This is particularly the case under pressure, since RX queues
receive packets for many concurrent flows.
A sender can give a hint to GRO engines when it is
appropriate to flush a super-packet, especially when pacing
is in the picture, since next packet is probably delayed by
one ms.
Having fewer packets in the GRO engine reduces the chance
of LRU eviction or inflated RTT, and reduces GRO cost.
We found recently that we must not set the PSH flag on
individual full-size MSS segments [1] :
Under pressure (CWR state), we better let the packet sit
for a small delay (depending on NAPI logic) so that the
ACK packet is delayed, and thus next packet we send is
also delayed a bit. Eventually the bottleneck queue can
be drained. DCTCP flows with CWND=1 have demonstrated
the issue.
This patch makes it possible to slow down the aggregate traffic
without involving high resolution timers on senders and/or
receivers.
It has been used at Google for about four years,
and has been discussed at various networking conferences.
[1] segments smaller than MSS already have PSH flag set
by tcp_sendmsg() / tcp_mark_push(), unless MSG_MORE
has been requested by the user.
tcp: fix tcp_ecn_withdraw_cwr() to clear TCP_ECN_QUEUE_CWR
Fix tcp_ecn_withdraw_cwr() to clear the correct bit:
TCP_ECN_QUEUE_CWR.
Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about
the behavior of data receivers, and deciding whether to reflect
incoming IP ECN CE marks as outgoing TCP th->ece marks. The
TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders,
and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function
is only called from tcp_undo_cwnd_reduction() by data senders during
an undo, so it should zero the sender-side state,
TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of
incoming CE bits on incoming data packets just because outgoing
packets were spuriously retransmitted.
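A small userspace sketch of the bit handling described above. The flag values are believed to match include/net/tcp.h but should be treated as illustrative; struct ecn_state and ecn_withdraw_cwr() are hypothetical stand-ins for the ecn_flags field and tcp_ecn_withdraw_cwr().

```c
#include <assert.h>
#include <stdint.h>

/* ECN state bits; values assumed to mirror include/net/tcp.h. */
#define TCP_ECN_OK          1
#define TCP_ECN_QUEUE_CWR   2   /* sender side: a CWR is queued to be sent */
#define TCP_ECN_DEMAND_CWR  4   /* receiver side: keep echoing ECE */
#define TCP_ECN_SEEN        8

/* Hypothetical stand-in for the socket's ecn_flags field. */
struct ecn_state { uint8_t ecn_flags; };

/* Sender-side undo: drop only the queued CWR, leave receiver state alone. */
static void ecn_withdraw_cwr(struct ecn_state *s)
{
    assert(s);
    s->ecn_flags &= (uint8_t)~TCP_ECN_QUEUE_CWR;
}
```

After the call, TCP_ECN_DEMAND_CWR is still set if it was set before, so incoming CE marks continue to be reflected.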
The bug has been reproduced with packetdrill to manifest in a scenario
with RFC3168 ECN, with an incoming data packet with CE bit set and
carrying a TCP timestamp value that causes cwnd undo. Before this fix,
the IP CE bit was ignored and not reflected in the TCP ECE header bit,
and sender sent a TCP CWR ('W') bit on the next outgoing data packet,
even though the cwnd reduction had been undone. After this fix, the
sender properly reflects the CE bit and does not set the W bit.
Note: the bug actually predates 2005 git history; this Fixes footer is
chosen to be the oldest SHA1 I have tested (from Sep 2007) for which
the patch applies cleanly (since before this commit the code was in a
.h file).
ipv6: Don't use dst gateway directly in ip6_confirm_neigh()
This is the equivalent of commit 2c6b55f45d53 ("ipv6: fix neighbour
resolution with raw socket") for ip6_confirm_neigh(): we can send a
packet with MSG_CONFIRM on a raw socket for a connected route, so the
gateway would be :: here, and we should pick the next hop using
rt6_nexthop() instead.
This was found by code review and, to the best of my knowledge, doesn't
actually fix a practical issue: the destination address from the packet
is not considered while confirming a neighbour, as ip6_confirm_neigh()
calls choose_neigh_daddr() without passing the packet, so there are no
similar issues as the one fixed by said commit.
A possible source of issues with the existing implementation might come
from the fact that, if we have a cached dst, we won't consider it,
while rt6_nexthop() takes care of that. I might just not be creative
enough to find a practical problem here: the only way to affect this
with cached routes is to have one coming from an ICMPv6 redirect, but
if the next hop is a directly connected host, there should be no
topology for which a redirect applies here, and tests with redirected
routes show no differences for MSG_CONFIRM (and MSG_PROBE) packets on
raw sockets destined to a directly connected host.
However, directly using the dst gateway here is not consistent anymore
with neighbour resolution, and, in general, as we want the next hop,
using rt6_nexthop() looks like the only sane way to fetch it.
This patch adds the ability to switch between two PHY SGMII modes.
Some hardware, for example, FPGA IP designs may use 6-wire mode
which enables differential SGMII clock to MAC.
The code assumes log_num < in_num everywhere, and that is true as long as
in_num is incremented by descriptor iov count, and log_num by 1. However
this breaks if there's a zero sized descriptor.
As a result, if a malicious guest creates a vring desc with desc.len = 0,
it may cause the host kernel to crash by overflowing the log array. This
bug can be triggered during the VM migration.
There's no need to log when desc.len = 0, so just don't increment log_num
in this case.
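The idea of skipping zero-length descriptors when filling the log can be sketched as follows. The structs and log_descs() are hypothetical simplifications, not the vhost data structures, and the explicit capacity check is an addition of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified descriptor and log entry; not the vhost structs. */
struct desc { unsigned long addr; unsigned int len; };
struct log_entry { unsigned long addr; unsigned int len; };

/*
 * Copy descriptors into the log, skipping zero-length ones.
 * Returns the number of log entries used, never more than log_cap.
 */
static size_t log_descs(const struct desc *d, size_t n,
                        struct log_entry *log, size_t log_cap)
{
    size_t log_num = 0;

    assert(d && log);
    for (size_t i = 0; i < n; i++) {
        if (d[i].len == 0)
            continue;               /* nothing written, nothing to log */
        if (log_num == log_cap)
            break;                  /* defensive bound, added by this sketch */
        log[log_num].addr = d[i].addr;
        log[log_num].len = d[i].len;
        log_num++;
    }
    return log_num;
}
```

Because an entry is only consumed for a descriptor that contributes data, the log index cannot run ahead of the data actually described.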
vhost: block speculation of translated descriptors
iovec addresses coming from vhost are assumed to be
pre-validated, but in fact can be speculated to a value
out of range.
Userspace addresses are later validated with array_index_nospec so we can
be sure kernel info does not leak through these addresses, but vhost
must also not leak userspace info outside the allowed memory table to
guests.
Following the defence in depth principle, make sure
the address is not validated out of node range.
Ilya Maximets [Thu, 22 Aug 2019 17:12:37 +0000 (20:12 +0300)]
ixgbe: fix double clean of Tx descriptors with xdp
Tx code doesn't clear the descriptors' status after cleaning.
So, if the budget is larger than number of used elems in a ring, some
descriptors will be accounted twice and xsk_umem_complete_tx will move
prod_tail far beyond the prod_head breaking the completion queue ring.
Fix that by limiting the number of descriptors to clean by the number
of used descriptors in the Tx ring.
The 'ixgbe_clean_xdp_tx_irq()' function is refactored to look more like
'ixgbe_xsk_clean_tx_ring()' since we're allowed to directly use
'next_to_clean' and 'next_to_use' indexes.
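Bounding the cleanup by the number of used descriptors can be sketched with plain ring arithmetic. struct tx_ring, ring_used() and clean_tx() are hypothetical; the sketch ignores descriptor status bits and only shows why the budget alone is not a safe bound.

```c
#include <assert.h>

/* Hypothetical ring state; field names echo the indexes named above. */
struct tx_ring {
    unsigned int count;          /* number of descriptors in the ring */
    unsigned int next_to_clean;  /* oldest descriptor not yet cleaned */
    unsigned int next_to_use;    /* next free slot for the producer */
};

/* Descriptors currently in flight, accounting for wrap-around. */
static unsigned int ring_used(const struct tx_ring *r)
{
    if (r->next_to_use >= r->next_to_clean)
        return r->next_to_use - r->next_to_clean;
    return r->next_to_use + r->count - r->next_to_clean;
}

/* Clean at most 'budget' descriptors, but never more than are in use. */
static unsigned int clean_tx(struct tx_ring *r, unsigned int budget)
{
    unsigned int used, n;

    assert(r && r->count > 0);
    used = ring_used(r);
    n = budget < used ? budget : used;
    r->next_to_clean = (r->next_to_clean + n) % r->count;
    return n;
}
```

With a budget larger than the number of used entries, a second call reports zero cleaned descriptors instead of counting the same ones again.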
Alexander Duyck [Wed, 4 Sep 2019 15:07:11 +0000 (08:07 -0700)]
ixgbe: Prevent u8 wrapping of ITR value to something less than 10us
There were a couple cases where the ITR value generated via the adaptive
ITR scheme could exceed 126. This resulted in the value becoming either 0
or something less than 10. Switching back and forth between a value less
than 10 and a value greater than 10 can cause issues, because certain
hardware features such as RSC do not function well when the ITR value has
dropped that low.
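The wrap-around can be reproduced with a small sketch. The layout is an assumption made for illustration (low 7 bits hold microseconds, bit 7 is a mode flag), and the macro and function names are hypothetical rather than taken from ixgbe.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: low 7 bits carry microseconds, bit 7 is a mode flag. */
#define ITR_USECS_MASK 0x7fu
#define ITR_MIN_USECS  10u
#define ITR_MAX_USECS  126u

/* Unclamped packing: values above 127 lose their high bits. */
static uint8_t itr_pack_wrapping(unsigned int usecs)
{
    return (uint8_t)(usecs & ITR_USECS_MASK);
}

/* Clamp into [ITR_MIN_USECS, ITR_MAX_USECS] before packing. */
static uint8_t itr_pack_clamped(unsigned int usecs)
{
    if (usecs < ITR_MIN_USECS)
        usecs = ITR_MIN_USECS;
    if (usecs > ITR_MAX_USECS)
        usecs = ITR_MAX_USECS;
    assert(usecs >= ITR_MIN_USECS && usecs <= ITR_MAX_USECS);
    return (uint8_t)usecs;
}
```

Under this assumed layout a computed value of 128 packs to 0 and 132 packs to 4 without the clamp, which matches the "0 or something less than 10" symptom described above.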
Magnus Karlsson [Mon, 9 Sep 2019 16:55:38 +0000 (09:55 -0700)]
i40e: fix potential RX buffer starvation for AF_XDP
When the RX rings are created they are also populated with buffers
so that packets can be received. Usually these are kernel buffers,
but for AF_XDP in zero-copy mode, these are user-space buffers and
in this case the application might not have sent down any buffers
to the driver at this point. And if no buffers are allocated at ring
creation time, no packets can be received and no interrupts will be
generated so the NAPI poll function that allocates buffers to the
rings will never get executed.
To rectify this, we kick the NAPI context of any queue with an
attached AF_XDP zero-copy socket in two places in the code. Once
after an XDP program has loaded and once after the umem is registered.
This takes care of both cases: XDP program gets loaded first then AF_XDP
socket is created, and the reverse, AF_XDP socket is created first,
then XDP program is loaded.
Fixes: 0a714186d3c0 ("i40e: add AF_XDP zero-copy Rx support")
Signed-off-by: Magnus Karlsson
Tested-by: Andrew Bowers
Signed-off-by: Jeff Kirsher
Stefan Assmann [Thu, 5 Sep 2019 06:34:22 +0000 (08:34 +0200)]
iavf: fix MAC address setting for VFs when filter is rejected
Currently iavf unconditionally applies MAC address change requests. This
puts the VF into a state where it is no longer able to pass traffic if
the PF rejects a MAC filter change for the VF.
A typical scenario for a rejected MAC filter is for an untrusted VF to
request to change the MAC address when an administratively set MAC is
present.
To keep iavf working in this scenario the MAC filter handling in iavf
needs to act on the PF reply regarding the MAC filter change. In the
case of an ack the new MAC address gets set, whereas in the case of a
nack the previous MAC address needs to stay in place.
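The ack/nack handling can be sketched as a two-step state change. struct vf_state, vf_request_mac() and vf_handle_reply() are hypothetical names for illustration and do not reflect the iavf data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAC_LEN 6

/* Hypothetical VF state; field names are illustrative only. */
struct vf_state {
    unsigned char mac[MAC_LEN];          /* address currently in use */
    unsigned char requested[MAC_LEN];    /* address awaiting the PF reply */
    bool pending;
};

/* Record the request, but keep using the old address for now. */
static void vf_request_mac(struct vf_state *vf, const unsigned char *new_mac)
{
    assert(vf && new_mac);
    memcpy(vf->requested, new_mac, MAC_LEN);
    vf->pending = true;
}

/* Apply the new address only when the PF acknowledged the filter change. */
static void vf_handle_reply(struct vf_state *vf, bool ack)
{
    assert(vf);
    if (!vf->pending)
        return;
    if (ack)
        memcpy(vf->mac, vf->requested, MAC_LEN);
    vf->pending = false;    /* on a nack the previous address stays in place */
}
```

Deferring the copy until the reply arrives is what keeps the VF usable when the request is rejected.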
Stefan Assmann [Tue, 3 Sep 2019 06:08:10 +0000 (08:08 +0200)]
i40e: clear __I40E_VIRTCHNL_OP_PENDING on invalid min Tx rate
In the case of an invalid min Tx rate being requested
i40e_ndo_set_vf_bw() immediately returns -EINVAL instead of releasing
__I40E_VIRTCHNL_OP_PENDING first.
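The error-path pattern behind this fix, sketched in userspace. struct pf_state and set_vf_bw() are hypothetical; the bool stands in for the __I40E_VIRTCHNL_OP_PENDING bit, and rejecting any nonzero minimum rate is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical state; op_pending stands in for the pending-operation bit. */
struct pf_state { bool op_pending; int max_tx_rate; };

static int set_vf_bw(struct pf_state *pf, int min_tx_rate, int max_tx_rate)
{
    int ret = 0;

    assert(pf);
    if (pf->op_pending)
        return -EAGAIN;         /* another operation is in flight */
    pf->op_pending = true;

    if (min_tx_rate) {
        ret = -EINVAL;          /* assumed: a minimum rate is not supported */
        goto error;             /* must not return directly: flag is held */
    }
    pf->max_tx_rate = max_tx_rate;
error:
    pf->op_pending = false;     /* released on success and on failure */
    return ret;
}
```

Routing every exit after the flag is taken through one label makes it hard to leave the flag set by accident.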
Jacob Keller [Mon, 26 Aug 2019 18:16:55 +0000 (11:16 -0700)]
i40e: use BIT macro to specify the cloud filter field flags
The macros used to specify the cloud filter fields are intended to be
individual bits. Declare them using the BIT() macro to make their
intention a little more clear.
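For readers unfamiliar with the helper, a brief sketch of single-bit flags declared with a BIT()-style macro. The macro below is a userspace stand-in, and the FIELD_* names and bit positions are hypothetical, not the i40e cloud filter definitions.

```c
#include <assert.h>

/* Userspace stand-in for the kernel's BIT() helper. */
#define BIT(nr) (1UL << (nr))

/* Hypothetical field flags; names and bit positions are illustrative. */
#define FIELD_OMAC   BIT(0)
#define FIELD_IMAC   BIT(1)
#define FIELD_IVLAN  BIT(2)

/* Return nonzero when every bit in 'wanted' is set in 'fields'. */
static int has_fields(unsigned long fields, unsigned long wanted)
{
    return (fields & wanted) == wanted;
}
```

Compared with literal values such as 0x01, 0x02 and 0x04, the bit index makes it obvious that each flag occupies exactly one position.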
Czeslaw Zagorski [Mon, 26 Aug 2019 18:16:54 +0000 (11:16 -0700)]
i40e: Fix message for other card without FEC.
When the variables "req_fec", "fec" and "an" are empty,
dmesg shows a log line with "Requested FEC: , Negotiated FEC: , Autoneg:".
Add a link dmesg log for cards without FEC.
i40e: fix missed "Negotiated" string in i40e_print_link_message()
The "Negotiated" string was missing from the i40e_print_link_message()
function. The string has been added to the dmesg output, together with a
small refactoring that removes common substrings and unifies the link
status message format. Without this patch it was not clear that the
reported FEC is the negotiated FEC.
Jacob Keller [Mon, 26 Aug 2019 18:16:52 +0000 (11:16 -0700)]
i40e: mark additional missing bits as reserved
Mark bits 0xD through 0xF for the command flags of a cloud filter as
reserved. These bits are not yet defined and are considered as reserved
in the data sheet.