Linus Torvalds [Sun, 20 Aug 2017 15:54:30 +0000 (08:54 -0700)]
Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull watchdog fix from Thomas Gleixner:
"A fix for the hardlockup watchdog to prevent false positives with
extreme Turbo-Modes which make the perf/NMI watchdog fire faster than
the hrtimer which is used to verify.
Slightly larger than the minimal fix, which just would increase the
hrtimer frequency, but comes with extra overhead of more watchdog
timer interrupts and thread wakeups for all users.
With this change we restrict the overhead to the extreme Turbo-Mode
systems"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kernel/watchdog: Prevent false positives with turbo modes
Add outbound_pci_buffer_overflow to ethtool output for monitoring the
number of packets that were dropped due to lack of PCIe buffers on
receive path from NIC port toward the host(s).
This counter is valid only in case that tx_overflow_buffer_pkt is
supported in MCAM enhanced features.
Gal Pressman [Thu, 15 Jun 2017 15:29:32 +0000 (18:29 +0300)]
net/mlx5e: Add PCIe outbound stalls counters
outbound_pci_stalled_rd - The percentage of time within the last second
that the NIC had outbound non-posted read requests but could not perform
the operation due to insufficient non-posted credits.
outbound_pci_stalled_wr - The percentage of time within the
last second that the NIC had outbound posted writes requests but could
not perform the operation due to insufficient posted credits.
outbound_pci_stalled_rd_events - The number of events where
outbound_pci_stalled_rd was above the threshold.
outbound_pci_stalled_wr_events - The number of events where
outbound_pci_stalled_wr was above the threshold.
Shalom Lagziel [Sun, 28 May 2017 13:40:24 +0000 (16:40 +0300)]
net/mlx5e: IPoIB, Add support for get_link_ksettings in ethtool
Add support for "ethtool DEVNAME" over ipoib ports,
Display standard port information for IPoIB netdevices using ethtool
For example:
$ ethtool ib2
> Settings for ib2:
Supported ports: [ ]
Supported link modes: Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Speed: 100000Mb/s
Duplex: Full
Port: Other
PHYAD: 0
Transceiver: internal
Auto-negotiation: off
Link detected: yes
net/mlx5e: IPoIB, Fix driver name retrieved by ethtool
Printing an enhanced IPoIB device information using
"ethtool -i DEVNAME", prints the low level driver name: mlx5_core.
This commit changes the name to mlx5_core [ib_ipoib], to include the
ipoib device driver infromation.
Eran Ben Elisha [Sun, 5 Feb 2017 15:57:40 +0000 (17:57 +0200)]
net/mlx5e: Send PAOS command on interface up/down
Upon interface up/down, driver will send PAOS (Ports Administrative and
Operational Status Register) in order to inform the Firmware on the
desired status of the port by the driver.
Since now we might change physical link status on mlx5e_open/close,
logical VF representor should not use mlx5e_open/close ndos as is, and
should call the logical version mlx5e_open/closed_locked.
Daniel Borkmann [Sat, 19 Aug 2017 01:12:46 +0000 (03:12 +0200)]
bpf: inline map in map lookup functions for array and htab
Avoid two successive functions calls for the map in map lookup, first
is the bpf_map_lookup_elem() helper call, and second the callback via
map->ops->map_lookup_elem() to get to the map in map implementation.
Implementation inlines array and htab flavor for map in map lookups.
Daniel Borkmann [Sat, 19 Aug 2017 01:12:45 +0000 (03:12 +0200)]
bpf: make htab inlining more robust wrt assumptions
Commit 9015d2f59535 ("bpf: inline htab_map_lookup_elem()") was
making the assumption that a direct call emission to the function
__htab_map_lookup_elem() will always work out for JITs.
This is currently true since all JITs we have are for 64 bit archs,
but in case of 32 bit JITs like upcoming arm32, we get a NULL pointer
dereference when executing the call to __htab_map_lookup_elem()
since passed arguments are of a different size (due to pointer args)
than what we do out of BPF. Guard and thus limit this for now for
the current 64 bit JITs only.
Martin KaFai Lau [Fri, 18 Aug 2017 18:28:00 +0000 (11:28 -0700)]
bpf: Allow selecting numa node during map creation
The current map creation API does not allow to provide the numa-node
preference. The memory usually comes from where the map-creation-process
is running. The performance is not ideal if the bpf_prog is known to
always run in a numa node different from the map-creation-process.
One of the use case is sharding on CPU to different LRU maps (i.e.
an array of LRU maps). Here is the test result of map_perf_test on
the INNER_LRU_HASH_PREALLOC test if we force the lru map used by
CPU0 to be allocated from a remote numa node:
[ The machine has 20 cores. CPU0-9 at node 0. CPU10-19 at node 1 ]
># taskset -c 10 ./map_perf_test 512 8 12600008000000
5:inner_lru_hash_map_perf pre-alloc 1628380 events per sec
4:inner_lru_hash_map_perf pre-alloc 1626396 events per sec
3:inner_lru_hash_map_perf pre-alloc 1626144 events per sec
6:inner_lru_hash_map_perf pre-alloc 1621657 events per sec
2:inner_lru_hash_map_perf pre-alloc 1621534 events per sec
1:inner_lru_hash_map_perf pre-alloc 1620292 events per sec
7:inner_lru_hash_map_perf pre-alloc 1613305 events per sec
0:inner_lru_hash_map_perf pre-alloc 1239150 events per sec #<<<
After specifying numa node:
># taskset -c 10 ./map_perf_test 512 8 12600008000000
5:inner_lru_hash_map_perf pre-alloc 1629627 events per sec
3:inner_lru_hash_map_perf pre-alloc 1628057 events per sec
1:inner_lru_hash_map_perf pre-alloc 1623054 events per sec
6:inner_lru_hash_map_perf pre-alloc 1616033 events per sec
2:inner_lru_hash_map_perf pre-alloc 1614630 events per sec
4:inner_lru_hash_map_perf pre-alloc 1612651 events per sec
7:inner_lru_hash_map_perf pre-alloc 1609337 events per sec
0:inner_lru_hash_map_perf pre-alloc 1619340 events per sec #<<<
This patch adds one field, numa_node, to the bpf_attr. Since numa node 0
is a valid node, a new flag BPF_F_NUMA_NODE is also added. The numa_node
field is honored if and only if the BPF_F_NUMA_NODE flag is set.
Numa node selection is not supported for percpu map.
This patch does not change all the kmalloc. F.e.
'htab = kzalloc()' is not changed since the object
is small enough to stay in the cache.
David S. Miller [Sun, 20 Aug 2017 00:13:41 +0000 (17:13 -0700)]
Merge branch 'net-const-eisa_device_id'
Arvind Yadav says:
====================
constify net eisa_device_id
eisa_device_id are not supposed to change at runtime. All functions
working with eisa_device_id provided by <linux/eisa.h> work with
const eisa_device_id. So mark the non-const structs as const.
====================
Arvind Yadav [Sat, 19 Aug 2017 06:53:34 +0000 (12:23 +0530)]
net: defxx: constify eisa_device_id
eisa_device_id are not supposed to change at runtime. All functions
working with eisa_device_id provided by <linux/eisa.h> work with
const eisa_device_id. So mark the non-const structs as const.
Arvind Yadav [Sat, 19 Aug 2017 06:53:14 +0000 (12:23 +0530)]
net: hp100: constify eisa_device_id
eisa_device_id are not supposed to change at runtime. All functions
working with eisa_device_id provided by <linux/eisa.h> work with
const eisa_device_id. So mark the non-const structs as const.
Arvind Yadav [Sat, 19 Aug 2017 06:52:44 +0000 (12:22 +0530)]
net: de4x5: constify eisa_device_id
eisa_device_id are not supposed to change at runtime. All functions
working with eisa_device_id provided by <linux/eisa.h> work with
const eisa_device_id. So mark the non-const structs as const.
Arvind Yadav [Sat, 19 Aug 2017 06:52:13 +0000 (12:22 +0530)]
net: 3c59x: constify eisa_device_id
eisa_device_id are not supposed to change at runtime. All functions
working with eisa_device_id provided by <linux/eisa.h> work with
const eisa_device_id. So mark the non-const structs as const.
Arvind Yadav [Sat, 19 Aug 2017 06:51:43 +0000 (12:21 +0530)]
net: 3c509: constify eisa_device_id
eisa_device_id are not supposed to change at runtime. All functions
working with eisa_device_id provided by <linux/eisa.h> work with
const eisa_device_id. So mark the non-const structs as const.
====================
nfp: add basic ethtool callbacks to representors
This set extends the basic ethtool functionality to representor
netdevs. I start with providing link state via ethtool and then
move on to functions such as driver information, statistics and
FW log dump. The series contains a number of clean ups to the
ethtool stats code too, some of the logic is simplified by making
better use of the nfp_port abstraction. The stats we expose on
representors are only the PCIe and MAC port statistics firmware
maintains for us.
====================
Jakub Kicinski [Fri, 18 Aug 2017 22:48:22 +0000 (15:48 -0700)]
nfp: don't reuse pointers in ring dumping
We were reusing skb pointer when reading page frag, since ring
entries contain a union of a skb and frag pointer. This can
be confusing to people reading the code. Refactor the code
to read frag pointer directly.
Jakub Kicinski [Fri, 18 Aug 2017 22:48:21 +0000 (15:48 -0700)]
nfp: fix copy paste in names and messages regarding vNICs
Data and control vNICs currently use the same area name and
error message. This could lead to confusion. Make sure
the error message says "ctrl" in case of control and the
data area is called "nfp.bar0".
Jakub Kicinski [Fri, 18 Aug 2017 22:48:20 +0000 (15:48 -0700)]
nfp: add ethtool statistics for representors
Representors may be associated with both VFs or more importantly
with physical ports. Allow vNIC and MAC statistics to be read
with ethtool -S on representors. In case of vNICs we reuse
the vNIC statistic helper, we just need to swap RX and TX to
give statistics the "switch perspective."
Jakub Kicinski [Fri, 18 Aug 2017 22:48:19 +0000 (15:48 -0700)]
nfp: add pointer to vNIC config memory to nfp_port structure
Simplify the statistics handling code by keeping pointer to vNIC's
config memory in nfp_port. Note that this is referring to the
representor side of vNICs, vNIC side has the pointer in nfp_net.
Jakub Kicinski [Fri, 18 Aug 2017 22:48:18 +0000 (15:48 -0700)]
nfp: report MAC statistics in ethtool
Add reporting of MAC statistics in ethtool. MAC statistics
are read out from the MAC IP and accumulated by application
FW, therefore their presence depends on the application FW.
Add missing defines and string names for the statistics and
dump them in ethtool -S.
Jakub Kicinski [Fri, 18 Aug 2017 22:48:17 +0000 (15:48 -0700)]
nfp: store pointer to MAC statistics in nfp_port
Store pointer to device memory containing MAC statistics
in nfp_port. This simplifies representor code and will
be used to dump those statistics in ethtool as well.
Jakub Kicinski [Fri, 18 Aug 2017 22:48:16 +0000 (15:48 -0700)]
nfp: split software and hardware vNIC statistics
In preparation for reporting vNIC HW stats on representors
split handling of the SW and HW stats in ethtool -S.
Representors don't have SW stats (since vNIC is assigned
to the VM).
Remove the questionable defines which assume nn variable
exists in the scope.
Jakub Kicinski [Fri, 18 Aug 2017 22:48:13 +0000 (15:48 -0700)]
nfp: allow retreiving management FW logs on representors
Users should be able to dump the management FW logs on any
of the driver's netdevs. Make the code only depend on the
nfp_app and share it between vNICs and representors.
Storing the dump flag is simply dropped for now, since we
only support the argument being set to 0.
Jakub Kicinski [Fri, 18 Aug 2017 22:48:11 +0000 (15:48 -0700)]
nfp: link basic ethtool ops to representors
Start linking ethtool ops to representors. Begin by adding
a separate ops structure and providing link state. Next
patches will convert appropriate functions to only use nfp_port,
which will make them reusable on representors.
Ideally, attribute groups would contain const attributes
but there are too many places that do modifications of list
during startup (in other code) to allow that.
The following updates are included in this driver update series:
- Set the MDIO mode to clause 45 for the 10GBase-T configuration
- Set the MII control width to 8-bits for speeds less than 1Gbps
- Fix an issue to related to module removal when the devices are up
- Fix ethtool statistics related to packet counting of TSO packets
- Add support for device renaming
- Add additional dynamic debug output for the PCS window calculation
- Optimize reading of DMA channel interrupt enablement register
- Add additional dynamic debug output about the hardware features
- Add per queue Tx and Rx ethtool statistics
- Add a macro to clear ethtool_link_ksettings modes
- Convert the driver to use the ethtool_link_ksettings
- Add support for VXLAN offload capabilities
- Add additional ethtool statistics related to VXLAN
This patch series is based on net-next.
====================
Lendacky, Thomas [Fri, 18 Aug 2017 14:04:04 +0000 (09:04 -0500)]
amd-xgbe: Add support for VXLAN offload capabilities
The hardware has the capability to perform checksum offload support
(both Tx and Rx) and TSO support for VXLAN packets. Add the support
required to enable this.
The hardware can only support a single VXLAN port for offload. If more
than one VXLAN port is added then the offload capabilities have to be
disabled and can no longer be advertised.
Lendacky, Thomas [Fri, 18 Aug 2017 14:03:55 +0000 (09:03 -0500)]
amd-xgbe: Convert to using the new link mode settings
Convert from using the old u32 supported, advertising, etc. link settings
to the new link mode settings that support bit positions / settings
greater than 32 bits.
Currently whenever the driver needs to enable or disable interrupts for
a DMA channel it reads the interrupt enable register (IER), updates the
value and then writes the new value back to the IER. Since the hardware
does not change the IER, software can track this value and elimiate the
need to read it each time.
Add the IER value to the channel related data structure and use that as
the base for enabling and disabling interrupts, thus removing the need
for the MMIO read.
Lendacky, Thomas [Fri, 18 Aug 2017 14:02:57 +0000 (09:02 -0500)]
amd-xgbe: Add support to handle device renaming
Many of the names used by the driver are based upon the name of the device
found during device probe. Move the formatting of the names into the
device open function so that any renaming that occurs before the device is
brought up will be accounted for. This also means moving the creation of
some named workqueues into the device open path.
Add support to register for net events so that if a device is renamed
the corresponding debugfs directory can be renamed.
Lendacky, Thomas [Fri, 18 Aug 2017 14:02:49 +0000 (09:02 -0500)]
amd-xgbe: Update TSO packet statistics accuracy
When transmitting a TSO packet, the driver only increments the TSO packet
statistic by one rather than the number of total packets that were sent.
Update the driver to record the total number of packets that resulted from
TSO transmit.
Lendacky, Thomas [Fri, 18 Aug 2017 14:02:40 +0000 (09:02 -0500)]
amd-xgbe: Be sure driver shuts down cleanly on module removal
Sometimes when the driver is being unloaded while the devices are still
up the driver can issue errors. This is based on timing and the double
invocation of some routines. The phy_exit() call needs to be run after
the network device has been closed and unregistered from the system.
Also, the phy_exit() does not need to invoke phy_stop() since that will
be called as part of the device closing, so remove that call.
Lendacky, Thomas [Fri, 18 Aug 2017 14:02:27 +0000 (09:02 -0500)]
amd-xgbe: Set the MII control width for the MAC interface
When running in SGMII mode at speeds below 1000Mbps, the auto-negotition
control register must set the MII control width for the MAC interface
to be 8-bits wide. By default the width is 4-bits.
Colin Ian King [Fri, 18 Aug 2017 13:49:25 +0000 (14:49 +0100)]
mlx5: ensure 0 is returned when vport is zero
Currently, if vport is zero then then an uninialized return status
in err is returned. Since the only return status at the end of the
function esw_add_uc_addr is zero for the current set of return paths
we may as well just return 0 rather than err to fix this issue.
Detected by CoverityScan, CID#1452698 ("Uninitialized scalar variable")
Fixes: eeb66cdb6826 ("net/mlx5: Separate between E-Switch and MPFS") Signed-off-by: Colin Ian King <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Xin Long [Fri, 18 Aug 2017 03:01:36 +0000 (11:01 +0800)]
net: sched: fix NULL pointer dereference when action calls some targets
As we know in some target's checkentry it may dereference par.entryinfo
to check entry stuff inside. But when sched action calls xt_check_target,
par.entryinfo is set with NULL. It would cause kernel panic when calling
some targets.
It can be reproduce with:
# tc qd add dev eth1 ingress handle ffff:
# tc filter add dev eth1 parent ffff: u32 match u32 0 0 action xt \
-j ECN --ecn-tcp-remove
It could also crash kernel when using target CLUSTERIP or TPROXY.
By now there's no proper value for par.entryinfo in ipt_init_target,
but it can not be set with NULL. This patch is to void all these
panics by setting it with an ipt_entry obj with all members = 0.
Note that this issue has been there since the very beginning.
Martin KaFai Lau [Fri, 18 Aug 2017 01:14:43 +0000 (18:14 -0700)]
bpf: Fix map-in-map checking in the verifier
In check_map_func_compatibility(), a 'break' has been accidentally
removed for the BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS
cases. This patch adds it back.
David Howells [Thu, 17 Aug 2017 23:19:42 +0000 (00:19 +0100)]
rxrpc: Fix oops when discarding a preallocated service call
rxrpc_service_prealloc_one() doesn't set the socket pointer on any new call
it preallocates, but does add it to the rxrpc net namespace call list.
This, however, causes rxrpc_put_call() to oops when the call is discarded
when the socket is closed. rxrpc_put_call() needs the socket to be able to
reach the namespace so that it can use a lock held therein.
Fix this by setting a call's socket pointer immediately before discarding
it.
This can be triggered by unloading the kafs module, resulting in an oops
like the following:
Fixes: 2baec2c3f854 ("rxrpc: Support network namespacing") Signed-off-by: David Howells <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Colin Ian King [Thu, 17 Aug 2017 22:14:58 +0000 (23:14 +0100)]
irda: do not leak initialized list.dev to userspace
list.dev has not been initialized and so the copy_to_user is copying
data from the stack back to user space which is a potential
information leak. Fix this ensuring all of list is initialized to
zero.
Detected by CoverityScan, CID#1357894 ("Uninitialized scalar variable")
Working on streamlining the tracepoints for XDP. The eBPF programs
and XDP have no flow-control or queueing. Investigating using
tracepoint to provide a feedback on XDP_REDIRECT xmit overflow events.
====================
xdp: adjust xdp redirect tracepoint to include return error code
The return error code need to be included in the tracepoint
xdp:xdp_redirect, else its not possible to distinguish successful or
failed XDP_REDIRECT transmits.
XDP have no queuing mechanism. Thus, it is fairly easily to overrun a
NIC transmit queue. The eBPF program invoking helpers (bpf_redirect
or bpf_redirect_map) to redirect a packet doesn't get any feedback
whether the packet was actually transmitted.
Info on failed transmits in the tracepoint xdp:xdp_redirect, is
interesting as this opens for providing a feedback-loop to the
receiving XDP program.
Huy Nguyen [Thu, 17 Aug 2017 15:29:52 +0000 (18:29 +0300)]
net/mlx4_core: Enable 4K UAR if SRIOV module parameter is not enabled
enable_4k_uar module parameter was added in patch cited below to
address the backward compatibility issue in SRIOV when the VM has
system's PAGE_SIZE uar implementation and the Hypervisor has 4k uar
implementation.
The above compatibility issue does not exist in the non SRIOV case.
In this patch, we always enable 4k uar implementation if SRIOV
is not enabled on mlx4's supported cards.
Thierry Reding [Thu, 17 Aug 2017 11:06:14 +0000 (13:06 +0200)]
PCI: Allow PCI express root ports to find themselves
If the pci_find_pcie_root_port() function is called on a root port
itself, return the root port rather than NULL.
This effectively reverts commit 0e405232871d6 ("PCI: fix oops when
try to find Root Port for a PCI device") which added an extra check
that would now be redundant.
Fixes: a99b646afa8a ("PCI: Disable PCIe Relaxed Ordering if unsupported") Fixes: c56d4450eb68 ("PCI: Turn off Request Attributes to avoid Chelsio T5 Completion erratum") Signed-off-by: Thierry Reding <[email protected]> Acked-by: Bjorn Helgaas <[email protected]> Tested-by: Shawn Lin <[email protected]> Tested-by: Michael Ellerman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Neal Cardwell [Wed, 16 Aug 2017 21:53:36 +0000 (17:53 -0400)]
tcp: when rearming RTO, if RTO time is in past then fire RTO ASAP
In some situations tcp_send_loss_probe() can realize that it's unable
to send a loss probe (TLP), and falls back to calling tcp_rearm_rto()
to schedule an RTO timer. In such cases, sometimes tcp_rearm_rto()
realizes that the RTO was eligible to fire immediately or at some
point in the past (delta_us <= 0). Previously in such cases
tcp_rearm_rto() was scheduling such "overdue" RTOs to happen at now +
icsk_rto, which caused needless delays of hundreds of milliseconds
(and non-linear behavior that made reproducible testing
difficult). This commit changes the logic to schedule "overdue" RTOs
ASAP, rather than at now + icsk_rto.
Currently macvlan devices do not set their hw_enc_features making
encapsulated Tx packets resort to SW fallbacks. Add encapsulation GSO
offloads to ->features as is done for the other GSOs and set
->hw_enc_features.
Linus Torvalds [Fri, 18 Aug 2017 23:06:33 +0000 (16:06 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"14 fixes"
* emailed patches from Andrew Morton <[email protected]>:
mm: revert x86_64 and arm64 ELF_ET_DYN_BASE base changes
mm/vmalloc.c: don't unconditonally use __GFP_HIGHMEM
mm/mempolicy: fix use after free when calling get_mempolicy
mm/cma_debug.c: fix stack corruption due to sprintf usage
signal: don't remove SIGNAL_UNKILLABLE for traced tasks.
mm, oom: fix potential data corruption when oom_reaper races with writer
mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced SIGBUS
slub: fix per memcg cache leak on css offline
mm: discard memblock data later
test_kmod: fix description for -s -and -c parameters
kmod: fix wait on recursive loop
wait: add wait_event_killable_timeout()
kernel/watchdog: fix Kconfig constraints for perf hardlockup watchdog
mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback()
Wei Wang [Wed, 16 Aug 2017 18:18:09 +0000 (11:18 -0700)]
ipv6: reset fn->rr_ptr when replacing route
syzcaller reported the following use-after-free issue in rt6_select():
BUG: KASAN: use-after-free in rt6_select net/ipv6/route.c:755 [inline] at addr ffff8800bc6994e8
BUG: KASAN: use-after-free in ip6_pol_route.isra.46+0x1429/0x1470 net/ipv6/route.c:1084 at addr ffff8800bc6994e8
Read of size 4 by task syz-executor1/439628
CPU: 0 PID: 439628 Comm: syz-executor1 Not tainted 4.3.5+ #8
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 0000000000000000ffff88018fe435b0ffffffff81ca384dffff8801d3588c00 ffff8800bc699380ffff8800bc699500dffffc0000000000ffff8801d40a47c0 ffff88018fe435d8ffffffff81735751ffff88018fe43660ffff8800bc699380
Call Trace:
[<ffffffff81ca384d>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff81ca384d>] dump_stack+0xc1/0x124 lib/dump_stack.c:51
sctp: [Deprecated]: syz-executor0 (pid 439615) Use of struct sctp_assoc_value in delayed_ack socket option.
Use struct sctp_sack_info instead
[<ffffffff81735751>] kasan_object_err+0x21/0x70 mm/kasan/report.c:158
[<ffffffff817359c4>] print_address_description mm/kasan/report.c:196 [inline]
[<ffffffff817359c4>] kasan_report_error+0x1b4/0x4a0 mm/kasan/report.c:285
[<ffffffff81735d93>] kasan_report mm/kasan/report.c:305 [inline]
[<ffffffff81735d93>] __asan_report_load4_noabort+0x43/0x50 mm/kasan/report.c:325
[<ffffffff82a28e39>] rt6_select net/ipv6/route.c:755 [inline]
[<ffffffff82a28e39>] ip6_pol_route.isra.46+0x1429/0x1470 net/ipv6/route.c:1084
[<ffffffff82a28fb1>] ip6_pol_route_output+0x81/0xb0 net/ipv6/route.c:1203
[<ffffffff82ab0a50>] fib6_rule_action+0x1f0/0x680 net/ipv6/fib6_rules.c:95
[<ffffffff8265cbb6>] fib_rules_lookup+0x2a6/0x7a0 net/core/fib_rules.c:223
[<ffffffff82ab1430>] fib6_rule_lookup+0xd0/0x250 net/ipv6/fib6_rules.c:41
[<ffffffff82a22006>] ip6_route_output+0x1d6/0x2c0 net/ipv6/route.c:1224
[<ffffffff829e83d2>] ip6_dst_lookup_tail+0x4d2/0x890 net/ipv6/ip6_output.c:943
[<ffffffff829e889a>] ip6_dst_lookup_flow+0x9a/0x250 net/ipv6/ip6_output.c:1079
[<ffffffff82a9f7d8>] ip6_datagram_dst_update+0x538/0xd40 net/ipv6/datagram.c:91
[<ffffffff82aa0978>] __ip6_datagram_connect net/ipv6/datagram.c:251 [inline]
[<ffffffff82aa0978>] ip6_datagram_connect+0x518/0xe50 net/ipv6/datagram.c:272
[<ffffffff82aa1313>] ip6_datagram_connect_v6_only+0x63/0x90 net/ipv6/datagram.c:284
[<ffffffff8292f790>] inet_dgram_connect+0x170/0x1f0 net/ipv4/af_inet.c:564
[<ffffffff82565547>] SYSC_connect+0x1a7/0x2f0 net/socket.c:1582
[<ffffffff8256a649>] SyS_connect+0x29/0x30 net/socket.c:1563
[<ffffffff82c72032>] entry_SYSCALL_64_fastpath+0x12/0x17
Object at ffff8800bc699380, in cache ip6_dst_cache size: 384
The root cause of it is that in fib6_add_rt2node(), when it replaces an
existing route with the new one, it does not update fn->rr_ptr.
This commit resets fn->rr_ptr to NULL when it points to a route which is
replaced in fib6_add_rt2node().
sctp: fully initialize the IPv6 address in sctp_v6_to_addr()
KMSAN reported use of uninitialized sctp_addr->v4.sin_addr.s_addr and
sctp_addr->v6.sin6_scope_id in sctp_v6_cmp_addr() (see below).
Make sure all fields of an IPv6 address are initialized, which
guarantees that the IPv4 fields are also initialized.
Eric Dumazet [Wed, 16 Aug 2017 16:41:54 +0000 (09:41 -0700)]
tipc: fix use-after-free
syszkaller reported use-after-free in tipc [1]
When msg->rep skb is freed, set the pointer to NULL,
so that caller does not free it again.
[1]
==================================================================
BUG: KASAN: use-after-free in skb_push+0xd4/0xe0 net/core/skbuff.c:1466
Read of size 8 at addr ffff8801c6e71e90 by task syz-executor5/4115
Memory state around the buggy address: ffff8801c6e71d80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ffff8801c6e71e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8801c6e71e80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
^ ffff8801c6e71f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff8801c6e71f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Memory state around the buggy address: ffff8801d2843a00: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc ffff8801d2843a80: 00 00 00 fc fc fc fc fc fb fb fb fb fc fc fc fc
>ffff8801d2843b00: 00 00 00 00 fc fc fc fc fb fb fb fb fc fc fc fc
^ ffff8801d2843b80: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc ffff8801d2843c00: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
Fixes: cf124db566e6 ("net: Fix inconsistent teardown and release of private netdev state.") Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Kees Cook [Fri, 18 Aug 2017 22:16:31 +0000 (15:16 -0700)]
mm: revert x86_64 and arm64 ELF_ET_DYN_BASE base changes
Moving the x86_64 and arm64 PIE base from 0x555555554000 to 0x000100000000
broke AddressSanitizer. This is a partial revert of:
eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE") 02445990a96e ("arm64: move ELF_ET_DYN_BASE to 4GB / 4MB")
The AddressSanitizer tool has hard-coded expectations about where
executable mappings are loaded.
The motivation for changing the PIE base in the above commits was to
avoid the Stack-Clash CVEs that allowed executable mappings to get too
close to heap and stack. This was mainly a problem on 32-bit, but the
64-bit bases were moved too, in an effort to proactively protect those
systems (proofs of concept do exist that show 64-bit collisions, but
other recent changes to fix stack accounting and setuid behaviors will
minimize the impact).
The new 32-bit PIE base is fine for ASan (since it matches the ET_EXEC
base), so only the 64-bit PIE base needs to be reverted to let x86 and
arm64 ASan binaries run again. Future changes to the 64-bit PIE base on
these architectures can be made optional once a more dynamic method for
dealing with AddressSanitizer is found. (e.g. always loading PIE into
the mmap region for marked binaries.)
Laura Abbott [Fri, 18 Aug 2017 22:16:27 +0000 (15:16 -0700)]
mm/vmalloc.c: don't unconditonally use __GFP_HIGHMEM
Commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly") added
use of __GFP_HIGHMEM for allocations. vmalloc_32 may use
GFP_DMA/GFP_DMA32 which does not play nice with __GFP_HIGHMEM and will
trigger a BUG in gfp_zone.
Only add __GFP_HIGHMEM if we aren't using GFP_DMA/GFP_DMA32.
Bytes b4 ffff8801f582d750: ae 01 ff ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
Object ffff8801f582d760: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object ffff8801f582d770: 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkk.
Redzone ffff8801f582d778: bb bb bb bb bb bb bb bb ........
Padding ffff8801f582d8b8: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
Memory state around the buggy address: ffff8801f582d600: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc ffff8801f582d680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff8801f582d700: fc fc fc fc fc fc fc fc fc fc fc fc fb fb fb fc
!shared memory policy is not protected against parallel removal by other
thread which is normally protected by the mmap_sem. do_get_mempolicy,
however, drops the lock midway while we can still access it later.
Early premature up_read is a historical artifact from times when
put_user was called in this path see https://lwn.net/Articles/124754/
but that is gone since 8bccd85ffbaf ("[PATCH] Implement sys_* do_*
layering in the memory policy layer."). but when we have the the
current mempolicy ref count model. The issue was introduced
accordingly.
Prakash Gupta [Fri, 18 Aug 2017 22:16:21 +0000 (15:16 -0700)]
mm/cma_debug.c: fix stack corruption due to sprintf usage
name[] in cma_debugfs_add_one() can only accommodate 16 chars including
NULL to store sprintf output. It's common for cma device name to be
larger than 15 chars. This can cause stack corrpution. If the gcc
stack protector is turned on, this can cause a panic due to stack
corruption.
Jamie Iles [Fri, 18 Aug 2017 22:16:18 +0000 (15:16 -0700)]
signal: don't remove SIGNAL_UNKILLABLE for traced tasks.
When forcing a signal, SIGNAL_UNKILLABLE is removed to prevent recursive
faults, but this is undesirable when tracing. For example, debugging an
init process (whether global or namespace), hitting a breakpoint and
SIGTRAP will force SIGTRAP and then remove SIGNAL_UNKILLABLE.
Everything continues fine, but then once debugging has finished, the
init process is left killable which is unlikely what the user expects,
resulting in either an accidentally killed init or an init that stops
reaping zombies.
Michal Hocko [Fri, 18 Aug 2017 22:16:15 +0000 (15:16 -0700)]
mm, oom: fix potential data corruption when oom_reaper races with writer
Wenwei Tao has noticed that our current assumption that the oom victim
is dying and never doing any visible changes after it dies, and so the
oom_reaper can tear it down, is not entirely true.
__task_will_free_mem consider a task dying when SIGNAL_GROUP_EXIT is set
but do_group_exit sends SIGKILL to all threads _after_ the flag is set.
So there is a race window when some threads won't have
fatal_signal_pending while the oom_reaper could start unmapping the
address space. Moreover some paths might not check for fatal signals
before each PF/g-u-p/copy_from_user.
We already have a protection for oom_reaper vs. PF races by checking
MMF_UNSTABLE. This has been, however, checked only for kernel threads
(use_mm users) which can outlive the oom victim. A simple fix would be
to extend the current check in handle_mm_fault for all tasks but that
wouldn't be sufficient because the current check assumes that a kernel
thread would bail out after EFAULT from get_user*/copy_from_user and
never re-read the same address which would succeed because the PF path
has established page tables already. This seems to be the case for the
only existing use_mm user currently (virtio driver) but it is rather
fragile in general.
This is even more fragile in general for more complex paths such as
generic_perform_write which can re-read the same address more times
(e.g. iov_iter_copy_from_user_atomic to fail and then
iov_iter_fault_in_readable on retry).
Therefore we have to implement MMF_UNSTABLE protection in a robust way
and never make a potentially corrupted content visible. That requires
to hook deeper into the PF path and check for the flag _every time_
before a pte for anonymous memory is established (that means all
!VM_SHARED mappings).
The corruption can be triggered artificially
(http://lkml.kernel.org/r/201708040646[email protected])
but there doesn't seem to be any real life bug report. The race window
should be quite tight to trigger most of the time.
Michal Hocko [Fri, 18 Aug 2017 22:16:12 +0000 (15:16 -0700)]
mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced SIGBUS
Tetsuo Handa has noticed that MMF_UNSTABLE SIGBUS path in
handle_mm_fault causes a lockdep splat
Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
a.out (1169) used greatest stack depth: 11664 bytes left
DEBUG_LOCKS_WARN_ON(depth <= 0)
------------[ cut here ]------------
WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
RIP: 0010:lock_release+0x172/0x1e0
Call Trace:
up_read+0x1a/0x40
__do_page_fault+0x28e/0x4c0
do_page_fault+0x30/0x80
page_fault+0x28/0x30
The reason is that the page fault path might have dropped the mmap_sem
and returned with VM_FAULT_RETRY. MMF_UNSTABLE check however rewrites
the error path to VM_FAULT_SIGBUS and we always expect mmap_sem taken in
that path. Fix this by taking mmap_sem when VM_FAULT_RETRY is held in
the MMF_UNSTABLE path.
We cannot simply add VM_FAULT_SIGBUS to the existing error code because
all arch specific page fault handlers and g-u-p would have to learn a
new error code combination.
Vladimir Davydov [Fri, 18 Aug 2017 22:16:08 +0000 (15:16 -0700)]
slub: fix per memcg cache leak on css offline
To avoid a possible deadlock, sysfs_slab_remove() schedules an
asynchronous work to delete sysfs entries corresponding to the kmem
cache. To ensure the cache isn't freed before the work function is
called, it takes a reference to the cache kobject. The reference is
supposed to be released by the work function.
However, the work function (sysfs_slab_remove_workfn()) does nothing in
case the cache sysfs entry has already been deleted, leaking the kobject
and the corresponding cache.
This may happen on a per memcg cache destruction, because sysfs entries
of a per memcg cache are deleted on memcg offline if the cache is empty
(see __kmemcg_cache_deactivate()).
Pavel Tatashin [Fri, 18 Aug 2017 22:16:05 +0000 (15:16 -0700)]
mm: discard memblock data later
There is existing use after free bug when deferred struct pages are
enabled:
The memblock_add() allocates memory for the memory array if more than
128 entries are needed. See comment in e820__memblock_setup():
* The bootstrap memblock region count maximum is 128 entries
* (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
* than that - so allow memblock resizing.
This memblock memory is freed here:
free_low_memory_core_early()
We access the freed memblock.memory later in boot when deferred pages
are initialized in this path:
deferred_init_memmap()
for_each_mem_pfn_range()
__next_mem_pfn_range()
type = &memblock.memory;
One possible explanation for why this use-after-free hasn't been hit
before is that the limit of INIT_MEMBLOCK_REGIONS has never been
exceeded at least on systems where deferred struct pages were enabled.
Tested by reducing INIT_MEMBLOCK_REGIONS down to 4 from the current 128,
and verifying in qemu that this code is getting excuted and that the
freed pages are sane.