Vladimir Oltean [Fri, 30 Aug 2019 01:07:22 +0000 (04:07 +0300)]
taprio: Set default link speed to 10 Mbps in taprio_set_picos_per_byte
The taprio budget needs to be adapted at runtime according to interface
link speed. But that handling is problematic.
For one thing, installing a qdisc on an interface that doesn't have
carrier is not illegal. But taprio prints the following stack trace:
[ 31.851373] ------------[ cut here ]------------
[ 31.856024] WARNING: CPU: 1 PID: 207 at net/sched/sch_taprio.c:481 taprio_dequeue+0x1a8/0x2d4
[ 31.864566] taprio: dequeue() called with unknown picos per byte.
[ 31.864570] Modules linked in:
[ 31.873701] CPU: 1 PID: 207 Comm: tc Not tainted 5.3.0-rc5-01199-g8838fe023cd6 #1689
[ 31.881398] Hardware name: Freescale LS1021A
[ 31.885661] [<c03133a4>] (unwind_backtrace) from [<c030d8cc>] (show_stack+0x10/0x14)
[ 31.893368] [<c030d8cc>] (show_stack) from [<c10ac958>] (dump_stack+0xb4/0xc8)
[ 31.900555] [<c10ac958>] (dump_stack) from [<c0349d04>] (__warn+0xe0/0xf8)
[ 31.907395] [<c0349d04>] (__warn) from [<c0349d64>] (warn_slowpath_fmt+0x48/0x6c)
[ 31.914841] [<c0349d64>] (warn_slowpath_fmt) from [<c0f38db4>] (taprio_dequeue+0x1a8/0x2d4)
[ 31.923150] [<c0f38db4>] (taprio_dequeue) from [<c0f227b0>] (__qdisc_run+0x90/0x61c)
[ 31.930856] [<c0f227b0>] (__qdisc_run) from [<c0ec82ac>] (net_tx_action+0x12c/0x2bc)
[ 31.938560] [<c0ec82ac>] (net_tx_action) from [<c0302298>] (__do_softirq+0x130/0x3c8)
[ 31.946350] [<c0302298>] (__do_softirq) from [<c03502a0>] (irq_exit+0xbc/0xd8)
[ 31.953536] [<c03502a0>] (irq_exit) from [<c03a4808>] (__handle_domain_irq+0x60/0xb4)
[ 31.961328] [<c03a4808>] (__handle_domain_irq) from [<c0754478>] (gic_handle_irq+0x58/0x9c)
[ 31.969638] [<c0754478>] (gic_handle_irq) from [<c0301a8c>] (__irq_svc+0x6c/0x90)
[ 31.977076] Exception stack(0xe8167b20 to 0xe8167b68)
[ 31.982100] 7b20: e9d4bd8000000cc0000000cf00000000e9d4bd80c1f3895800000cc0c1f38960
[ 31.990234] 7b40: 00000001000000cf00000004e9dc080000000000e8167b70c0f478ecc0f46d94
[ 31.998363] 7b60: 60070013ffffffff
[ 32.001833] [<c0301a8c>] (__irq_svc) from [<c0f46d94>] (netlink_trim+0x18/0xd8)
[ 32.009104] [<c0f46d94>] (netlink_trim) from [<c0f478ec>] (netlink_broadcast_filtered+0x34/0x414)
[ 32.017930] [<c0f478ec>] (netlink_broadcast_filtered) from [<c0f47cec>] (netlink_broadcast+0x20/0x28)
[ 32.027102] [<c0f47cec>] (netlink_broadcast) from [<c0eea378>] (rtnetlink_send+0x34/0x88)
[ 32.035238] [<c0eea378>] (rtnetlink_send) from [<c0f25890>] (notify_and_destroy+0x2c/0x44)
[ 32.043461] [<c0f25890>] (notify_and_destroy) from [<c0f25e08>] (qdisc_graft+0x398/0x470)
[ 32.051595] [<c0f25e08>] (qdisc_graft) from [<c0f27a00>] (tc_modify_qdisc+0x3a4/0x724)
[ 32.059470] [<c0f27a00>] (tc_modify_qdisc) from [<c0ee4c84>] (rtnetlink_rcv_msg+0x260/0x2ec)
[ 32.067864] [<c0ee4c84>] (rtnetlink_rcv_msg) from [<c0f4a988>] (netlink_rcv_skb+0xb8/0x110)
[ 32.076172] [<c0f4a988>] (netlink_rcv_skb) from [<c0f4a170>] (netlink_unicast+0x1b4/0x22c)
[ 32.084392] [<c0f4a170>] (netlink_unicast) from [<c0f4a5e4>] (netlink_sendmsg+0x33c/0x380)
[ 32.092614] [<c0f4a5e4>] (netlink_sendmsg) from [<c0ea9f40>] (sock_sendmsg+0x14/0x24)
[ 32.100403] [<c0ea9f40>] (sock_sendmsg) from [<c0eaa780>] (___sys_sendmsg+0x214/0x228)
[ 32.108279] [<c0eaa780>] (___sys_sendmsg) from [<c0eabad0>] (__sys_sendmsg+0x50/0x8c)
[ 32.116068] [<c0eabad0>] (__sys_sendmsg) from [<c0301000>] (ret_fast_syscall+0x0/0x54)
[ 32.123938] Exception stack(0xe8167fa8 to 0xe8167ff0)
[ 32.128960] 7fa0: b6fa68c8000000f800000003bea142d00000000000000000
[ 32.137093] 7fc0: b6fa68c8000000f80052154c000001285d6468a2000000000000002800558c9c
[ 32.145224] 7fe0: 00000070bea1427800530d64b6e17e64
[ 32.150659] ---[ end trace 2139c9827c3e5177 ]---
This happens because the qdisc ->dequeue callback gets called. Which
again is not illegal, the qdisc will dequeue even when the interface is
up but doesn't have carrier (and hence SPEED_UNKNOWN), and the frames
will be dropped further down the stack in dev_direct_xmit().
And, at the end of the day, for what? For calculating the initial budget
of an interface which is non-operational at the moment and where frames
will get dropped anyway.
So if we can't figure out the link speed, default to SPEED_10 and move
along. We can also remove the runtime check now.
Vladimir Oltean [Fri, 30 Aug 2019 01:07:21 +0000 (04:07 +0300)]
taprio: Fix kernel panic in taprio_destroy
taprio_init may fail earlier than this line:
list_add(&q->taprio_list, &taprio_list);
i.e. due to the net device not being multi queue.
Attempting to remove q from the global taprio_list when it is not part
of it will result in a kernel panic.
Fix it by matching list_add and list_del better to one another in the
order of operations. This way we can keep the deletion unconditional
and with lower complexity - O(1).
David S. Miller [Sat, 31 Aug 2019 20:32:30 +0000 (13:32 -0700)]
Merge branch 'qed-Enhancements'
Sudarsana Reddy Kalluru says:
====================
qed*: Enhancements.
The patch series adds couple of enhancements to qed/qede drivers.
- Support for dumping the config id attributes via ethtool -w/W.
- Support for dumping the GRC data of required memory regions using
ethtool -w/W interfaces.
Patch (1) adds driver APIs for reading the config id attributes.
Patch (2) adds ethtool support for dumping the config id attributes.
Patch (3) adds support for configuring the GRC dump config flags.
Patch (4) adds ethtool support for dumping the grc dump.
====================
====================
Dynamic toggling of vlan_filtering for SJA1105 DSA
This patchset addresses a limitation in dsa_8021q where this sequence of
commands was causing the switch to stop forwarding traffic:
ip link add name br0 type bridge vlan_filtering 0
ip link set dev swp2 master br0
echo 1 > /sys/class/net/br0/bridge/vlan_filtering
echo 0 > /sys/class/net/br0/bridge/vlan_filtering
The issue has to do with the VLAN table manipulations that dsa_8021q
does without notifying the bridge layer. The solution is to always
restore the VLANs that the bridge knows about, when disabling tagging.
====================
Vladimir Oltean [Fri, 30 Aug 2019 00:53:25 +0000 (03:53 +0300)]
net: dsa: tag_8021q: Restore bridge VLANs when enabling vlan_filtering
The bridge core assumes that enabling/disabling vlan_filtering will
translate into the simple toggling of a flag for switchdev drivers.
That is clearly not the case for sja1105, which alters the VLAN table
and the pvids in order to obtain port separation in standalone mode.
There are 2 parts to the issue.
First, tag_8021q changes the pvid to a unique per-port rx_vid for frame
identification. But we need to disable tag_8021q when vlan_filtering
kicks in, and at that point, the VLAN configured as pvid will have to be
removed from the filtering table of the ports. With an invalid pvid, the
ports will drop all traffic. Since the bridge will not call any vlan
operation through switchdev after enabling vlan_filtering, we need to
ensure we're in a functional state ourselves. Hence read the pvid that
the bridge is aware of, and program that into our ports.
Secondly, tag_8021q uses the 1024-3071 range privately in
vlan_filtering=0 mode. Had the user installed one of these VLANs during
a previous vlan_filtering=1 session, then upon the next tag_8021q
cleanup for vlan_filtering to kick in again, VLANs in that range will
get deleted unconditionally, hence breaking user expectation. So when
deleting the VLANs, check if the bridge had knowledge about them, and if
it did, re-apply the settings. Wrap this logic inside a
dsa_8021q_vid_apply helper function to reduce code duplication.
It is intuitive that the pvid of a netdevice should have the
BRIDGE_VLAN_INFO_PVID flag set.
However I can't seem to pinpoint a commit where this behavior was
introduced. It seems like it's been like that since forever.
At a first glance it would make more sense to just handle the
BRIDGE_VLAN_INFO_PVID flag in __vlan_add_flags. However, as Nikolay
explains:
There are a few reasons why we don't do it, most importantly because
we need to have only one visible pvid at any single time, even if it's
stale - it must be just one. Right now that rule will not be violated
by this change, but people will try using this flag and could see two
pvids simultaneously. You can see that the pvid code is even using
memory barriers to propagate the new value faster and everywhere the
pvid is read only once. That is the reason the flag is set
dynamically when dumping entries, too. A second (weaker) argument
against would be given the above we don't want another way to do the
same thing, specifically if it can provide us with two pvids (e.g. if
walking the vlan list) or if it can provide us with a pvid different
from the one set in the vg. [Obviously, I'm talking about RCU
pvid/vlan use cases similar to the dumps. The locked cases are fine.
I would like to avoid explaining why this shouldn't be relied upon
without locking]
So instead of introducing the above change and making sure of the pvid
uniqueness under RCU, simply dynamically populate the pvid flag in
br_vlan_get_info().
Use the register value width as the regmap_config name to prevent the
following error when the second and third regmap_configs are
initialized.
"debugfs: Directory '${bus-id}' with parent 'regmap' already present!"
Pu Wen [Sat, 31 Aug 2019 02:20:31 +0000 (10:20 +0800)]
tools/power turbostat: Add support for Hygon Fam 18h (Dhyana) RAPL
Commit 9392bd98bba760be96ee ("tools/power turbostat: Add support for AMD
Fam 17h (Zen) RAPL") and the commit 3316f99a9f1b68c578c5 ("tools/power
turbostat: Also read package power on AMD F17h (Zen)") add AMD Fam 17h
RAPL support.
Hygon Family 18h(Dhyana) support RAPL in bit 14 of CPUID 0x80000007 EDX,
and has MSRs RAPL_PWR_UNIT/CORE_ENERGY_STAT/PKG_ENERGY_STAT. So add Hygon
Dhyana Family 18h support for RAPL.
Already tested on Hygon multi-node systems and it shows correct per-core
energy usage and the total package power.
Pu Wen [Sat, 31 Aug 2019 02:19:58 +0000 (10:19 +0800)]
tools/power turbostat: Fix caller parameter of get_tdp_amd()
Commit 9392bd98bba760be96ee ("tools/power turbostat: Add support for AMD
Fam 17h (Zen) RAPL") add a function get_tdp_amd(), the parameter is CPU
family. But the rapl_probe_amd() function use wrong model parameter.
Fix the wrong caller parameter of get_tdp_amd() to use family.
In some case C1% will be wrong value, when platform doesn't have MSR for
C1 residency.
For example:
Core CPU CPU%c1
- - 100.00
0 0 100.00
0 2 100.00
1 1 100.00
1 3 100.00
But adding Busy% will fix this
Core CPU Busy% CPU%c1
- - 99.77 0.23
0 0 99.77 0.23
0 2 99.77 0.23
1 1 99.77 0.23
1 3 99.77 0.23
This issue can be reproduced on most of the recent systems including
Broadwell, Skylake and later.
This is because if we don't select Busy% or Avg_MHz or Bzy_MHz then
mperf value will not be read from MSR, so it will be 0. But this
is required for C1% calculation when MSR for C1 residency is not present.
Same is true for C3, C6 and C7 column selection.
So add another define DO_BIC_READ(), which doesn't depend on user
column selection and use for mperf, C3, C6 and C7 related counters.
So when there is no platform support for C1 residency counters,
we still read these counters, if the CPU has support and user selected
display of CPU%c1.
Artem Bityutskiy [Wed, 14 Aug 2019 17:12:56 +0000 (20:12 +0300)]
tools/power turbostat: do not enforce 1ms
Turbostat works by taking a snapshot of counters, sleeping, taking another
snapshot, calculating deltas, and printing out the table.
The sleep time is controlled via -i option or by user sending a signal or a
character to stdin. In the latter case, turbostat always adds 1 ms
sleep before it reads the counters, in order to avoid larger imprecisions
in the results in prints.
While the 1 ms delay may be a good idea for a "dumb" user, it is a
problem for an "aware" user. I do thousands and thousands of measurements
over a short period of time (like 2ms), and turbostat unconditionally adds
a 1ms to my interval, so I cannot get what I really need.
This patch removes the unconditional 1ms sleep. This is an expert user
tool, after all, and non-experts will unlikely ever use it in the non-fixed
interval mode anyway, so I think it is OK to remove the 1ms delay.
Artem Bityutskiy [Wed, 14 Aug 2019 17:12:55 +0000 (20:12 +0300)]
tools/power turbostat: read from pipes too
Commit '47936f944e78 tools/power turbostat: fix printing on input' make
a valid fix, but it completely disabled piped stdin support, which is
a valuable use-case. Indeed, if stdin is a pipe, turbostat won't read
anything from it, so it becomes impossible to get turbostat output at
user-defined moments, instead of the regular intervals.
There is no reason why this should works for terminals, but not for
pipes. This patch improves the situation. Instead of ignoring pipes, we
read data from them but gracefully handle the EOF case.
turbostat could be terminated by general protection fault on some latest
hardwares which (for example) support 9 levels of C-states and show 18
"tADDED" lines. That bloats the total output and finally causes buffer
overrun. So let's extend the buffer to avoid this.
Colin Ian King [Mon, 8 Apr 2019 09:00:44 +0000 (10:00 +0100)]
tools/power turbostat: fix leak of file descriptor on error return path
Currently the error return path does not close the file fp and leaks
a file descriptor. Fix this by closing the file.
Fixes: 5ea7647b333f ("tools/power turbostat: Warn on bad ACPI LPIT data") Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Len Brown <[email protected]>
Yazen Ghannam [Mon, 25 Mar 2019 17:32:42 +0000 (17:32 +0000)]
tools/power turbostat: Make interval calculation per thread to reduce jitter
Turbostat currently normalizes TSC and other values by dividing by an
interval. This interval is the delta between the start of one global
(all counters on all CPUs) sampling and the start of another. However,
this introduces a lot of jitter into the data.
In order to reduce jitter, the interval calculation should be based on
timestamps taken per thread and close to the start of the thread's
sampling.
Define a per thread time value to hold the delta between samples taken
on the thread.
Use the timestamp taken at the beginning of sampling to calculate the
delta.
Move the thread's beginning timestamp to after the CPU migration to
avoid jitter due to the migration.
Use the global time delta for the average time delta.
The -w argument in x86_energy_perf_policy currently triggers an
unconditional segfault.
This is because the argument string reads: "+a:c:dD:E:e:f:m:M:rt:u:vw" and
yet the argument handler expects an argument.
When parse_optarg_string is called with a null argument, we then proceed to
crash in strncmp, not horribly friendly.
The man page describes -w as taking an argument, the long form
(--hwp-window) is correctly marked as taking a required argument, and the
code expects it.
As such, this patch simply marks the short form (-w) as requiring an
argument.
Ben Hutchings [Sun, 16 Sep 2018 15:05:53 +0000 (16:05 +0100)]
tools/power x86_energy_perf_policy: Fix "uninitialized variable" warnings at -O2
x86_energy_perf_policy first uses __get_cpuid() to check the maximum
CPUID level and exits if it is too low. It then assumes that later
calls will succeed (which I think is architecturally guaranteed). It
also assumes that CPUID works at all (which is not guaranteed on
x86_32).
If optimisations are enabled, gcc warns about potentially
uninitialized variables. Fix this by adding an exit-on-error after
every call to __get_cpuid() instead of just checking the maximum
level.
Linus Torvalds [Sat, 31 Aug 2019 16:15:25 +0000 (09:15 -0700)]
Merge tag 'trace-v5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Small fixes and minor cleanups for tracing:
- Make exported ftrace function not static
- Fix NULL pointer dereference in reading probes as they are created
- Fix NULL pointer dereference in k/uprobe clean up path
- Various documentation fixes"
* tag 'trace-v5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Correct kdoc formats
ftrace/x86: Remove mcount() declaration
tracing/probe: Fix null pointer dereference
tracing: Make exported ftrace_set_clr_event non-static
ftrace: Check for successful allocation of hash
ftrace: Check for empty hash and comment the race with registering probes
ftrace: Fix NULL pointer dereference in t_probe_next()
Linus Torvalds [Sat, 31 Aug 2019 15:51:48 +0000 (08:51 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:
"PPC:
- Fix bug which could leave locks held in the host on return to a
guest.
x86:
- Prevent infinitely looping emulation of a failing syscall while
single stepping.
- Do not crash the host when nesting is disabled"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: Don't update RIP or do single-step on faulting emulation
KVM: x86: hyper-v: don't crash on KVM_GET_SUPPORTED_HV_CPUID when kvm_intel.nested is disabled
KVM: PPC: Book3S: Fix incorrect guest-to-user-translation error handling
Linus Torvalds [Sat, 31 Aug 2019 15:31:48 +0000 (08:31 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc mm fixes from Andrew Morton:
"7 fixes"
* emailed patches from Andrew Morton <[email protected]>:
mm: memcontrol: fix percpu vmstats and vmevents flush
mm, memcg: do not set reclaim_state on soft limit reclaim
mailmap: add aliases for Dmitry Safonov
mm/z3fold.c: fix lock/unlock imbalance in z3fold_page_isolate
mm, memcg: partially revert "mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones"
mm/zsmalloc.c: fix build when CONFIG_COMPACTION=n
mm: memcontrol: flush percpu slab vmstats on kmem offlining
Jakub Kicinski [Wed, 28 Aug 2019 05:25:47 +0000 (22:25 -0700)]
tracing: Correct kdoc formats
Fix the following kdoc warnings:
kernel/trace/trace.c:1579: warning: Function parameter or member 'tr' not described in 'update_max_tr_single'
kernel/trace/trace.c:1579: warning: Function parameter or member 'tsk' not described in 'update_max_tr_single'
kernel/trace/trace.c:1579: warning: Function parameter or member 'cpu' not described in 'update_max_tr_single'
kernel/trace/trace.c:1776: warning: Function parameter or member 'type' not described in 'register_tracer'
kernel/trace/trace.c:2239: warning: Function parameter or member 'task' not described in 'tracing_record_taskinfo'
kernel/trace/trace.c:2239: warning: Function parameter or member 'flags' not described in 'tracing_record_taskinfo'
kernel/trace/trace.c:2269: warning: Function parameter or member 'prev' not described in 'tracing_record_taskinfo_sched_switch'
kernel/trace/trace.c:2269: warning: Function parameter or member 'next' not described in 'tracing_record_taskinfo_sched_switch'
kernel/trace/trace.c:2269: warning: Function parameter or member 'flags' not described in 'tracing_record_taskinfo_sched_switch'
kernel/trace/trace.c:3078: warning: Function parameter or member 'ip' not described in 'trace_vbprintk'
kernel/trace/trace.c:3078: warning: Function parameter or member 'fmt' not described in 'trace_vbprintk'
kernel/trace/trace.c:3078: warning: Function parameter or member 'args' not described in 'trace_vbprintk'
Jisheng Zhang [Mon, 26 Aug 2019 09:13:12 +0000 (09:13 +0000)]
ftrace/x86: Remove mcount() declaration
Commit 562e14f72292 ("ftrace/x86: Remove mcount support") removed the
support for using mcount, so we could remove the mcount() declaration
to clean up.
Xinpeng Liu [Wed, 7 Aug 2019 23:29:23 +0000 (07:29 +0800)]
tracing/probe: Fix null pointer dereference
BUG: KASAN: null-ptr-deref in trace_probe_cleanup+0x8d/0xd0
Read of size 8 at addr 0000000000000000 by task syz-executor.0/9746
trace_probe_cleanup+0x8d/0xd0
free_trace_kprobe.part.14+0x15/0x50
alloc_trace_kprobe+0x23e/0x250
tracing: Make exported ftrace_set_clr_event non-static
The function ftrace_set_clr_event is declared static and marked
EXPORT_SYMBOL_GPL(), which is at best an odd combination. Because the
function was decided to be a part of API, this commit removes the static
attribute and adds the declaration to the header.
Linus Torvalds [Sat, 31 Aug 2019 01:47:15 +0000 (18:47 -0700)]
Partially revert "kfifo: fix kfifo_alloc() and kfifo_init()"
Commit dfe2a77fd243 ("kfifo: fix kfifo_alloc() and kfifo_init()") made
the kfifo code round the number of elements up. That was good for
__kfifo_alloc(), but it's actually wrong for __kfifo_init().
The difference? __kfifo_alloc() will allocate the rounded-up number of
elements, but __kfifo_init() uses an allocation done by the caller. We
can't just say "use more elements than the caller allocated", and have
to round down.
The good news? All the normal cases will be using power-of-two arrays
anyway, and most users of kfifo's don't use kfifo_init() at all, but one
of the helper macros to declare a KFIFO that enforce the proper
power-of-two behavior. But it looks like at least ibmvscsis might be
affected.
The bad news? Will Deacon refers to an old thread and points points out
that the memory ordering in kfifo's is questionable. See
Shakeel Butt [Fri, 30 Aug 2019 23:04:53 +0000 (16:04 -0700)]
mm: memcontrol: fix percpu vmstats and vmevents flush
Instead of using raw_cpu_read() use per_cpu() to read the actual data of
the corresponding cpu otherwise we will be reading the data of the
current cpu for the number of online CPUs.
which tells us that soft limit reclaim is about to overwrite the
reclaim_state configured up in the call chain (kswapd in this case but
the direct reclaim is equally possible). This means that reclaim stats
would get misleading once the soft reclaim returns and another reclaim
is done.
Fix the warning by dropping set_task_reclaim_state from the soft reclaim
which is always called with reclaim_state set up.
Roman Gushchin [Fri, 30 Aug 2019 23:04:39 +0000 (16:04 -0700)]
mm, memcg: partially revert "mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones"
Commit 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync
with the hierarchical ones") effectively decreased the precision of
per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.
That's good for displaying in memory.stat, but brings a serious
regression into the reclaim process.
One issue I've discovered and debugged is the following:
lruvec_lru_size() can return 0 instead of the actual number of pages in
the lru list, preventing the kernel to reclaim last remaining pages.
Result is yet another dying memory cgroups flooding. The opposite is
also happening: scanning an empty lru list is the waste of cpu time.
Also, inactive_list_is_low() can return incorrect values, preventing the
active lru from being scanned and freed. It can fail both because the
size of active and inactive lists are inaccurate, and because the number
of workingset refaults isn't precise. In other words, the result is
pretty random.
I'm not sure, if using the approximate number of slab pages in
count_shadow_number() is acceptable, but issues described above are
enough to partially revert the patch.
Let's keep per-memcg vmstat_local batched (they are only used for
displaying stats to the userspace), but keep lruvec stats precise. This
change fixes the dead memcg flooding on my setup.
Roman Gushchin [Fri, 30 Aug 2019 23:04:32 +0000 (16:04 -0700)]
mm: memcontrol: flush percpu slab vmstats on kmem offlining
I've noticed that the "slab" value in memory.stat is sometimes 0, even
if some children memory cgroups have a non-zero "slab" value. The
following investigation showed that this is the result of the kmem_cache
reparenting in combination with the per-cpu batching of slab vmstats.
At the offlining some vmstat value may leave in the percpu cache, not
being propagated upwards by the cgroup hierarchy. It means that stats
on ancestor levels are lower than actual. Later when slab pages are
released, the precise number of pages is substracted on the parent
level, making the value negative. We don't show negative values, 0 is
printed instead.
To fix this issue, let's flush percpu slab memcg and lruvec stats on
memcg offlining. This guarantees that numbers on all ancestor levels
are accurate and match the actual number of outstanding slab pages.
Michael Chan [Fri, 30 Aug 2019 23:10:38 +0000 (19:10 -0400)]
bnxt_en: Fix compile error regression with CONFIG_BNXT_SRIOV not set.
Add a new function bnxt_get_registered_vfs() to handle the work
of getting the number of registered VFs under #ifdef CONFIG_BNXT_SRIOV.
The main code will call this function and will always work correctly
whether CONFIG_BNXT_SRIOV is set or not.
Fixes: 230d1f0de754 ("bnxt_en: Handle firmware reset.") Reported-by: kbuild test robot <[email protected]> Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Vlad Buslov [Thu, 29 Aug 2019 16:15:17 +0000 (19:15 +0300)]
net/mlx5e: Move local var definition into ifdef block
New local variable "struct flow_block_offload *f" was added to
mlx5e_setup_tc() in recent rtnl lock removal patches. The variable is used
in code that is only compiled when CONFIG_MLX5_ESWITCH is enabled. This
results compilation warning about unused variable when CONFIG_MLX5_ESWITCH
is not set. Move the variable definition into eswitch-specific code block
from the beginning of mlx5e_setup_tc() function.
Fixes: c9f14470d048 ("net: sched: add API for registering unlocked offload block callbacks") Reported-by: tanhuazhong <[email protected]> Signed-off-by: Vlad Buslov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Vlad Buslov [Thu, 29 Aug 2019 16:15:16 +0000 (19:15 +0300)]
net: sched: cls_matchall: cleanup flow_action before deallocating
Recent rtnl lock removal patch changed flow_action infra to require proper
cleanup besides simple memory deallocation. However, matchall classifier
was not updated to call tc_cleanup_flow_action(). Add proper cleanup to
mall_replace_hw_filter() and mall_reoffload().
Fixes: 5a6ff4b13d59 ("net: sched: take reference to action dev before calling offloads") Reported-by: Ido Schimmel <[email protected]> Tested-by: Ido Schimmel <[email protected]> Signed-off-by: Vlad Buslov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David Howells [Thu, 29 Aug 2019 13:12:11 +0000 (14:12 +0100)]
rxrpc: Fix lack of conn cleanup when local endpoint is cleaned up [ver #2]
When a local endpoint is ceases to be in use, such as when the kafs module
is unloaded, the kernel will emit an assertion failure if there are any
outstanding client connections:
rxrpc: Assertion failed
------------[ cut here ]------------
kernel BUG at net/rxrpc/local_object.c:433!
and even beyond that, will evince other oopses if there are service
connections still present.
Fix this by:
(1) Removing the triggering of connection reaping when an rxrpc socket is
released. These don't actually clean up the connections anyway - and
further, the local endpoint may still be in use through another
socket.
(2) Mark the local endpoint as dead when we start the process of tearing
it down.
(3) When destroying a local endpoint, strip all of its client connections
from the idle list and discard the ref on each that the list was
holding.
(4) When destroying a local endpoint, call the service connection reaper
directly (rather than through a workqueue) to immediately kill off all
outstanding service connections.
(5) Make the service connection reaper reap connections for which the
local endpoint is marked dead.
Only after destroying the connections can we close the socket lest we get
an oops in a workqueue that's looking at a connection or a peer.
David S. Miller [Fri, 30 Aug 2019 21:54:41 +0000 (14:54 -0700)]
Merge tag 'rxrpc-fixes-20190827' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Fix use of skb_cow_data()
Here's a series of patches that replaces the use of skb_cow_data() in rxrpc
with skb_unshare() early on in the input process. The problem that is
being seen is that skb_cow_data() indirectly requires that the maximum
usage count on an sk_buff be 1, and it may generate an assertion failure in
pskb_expand_head() if not.
This can occur because rxrpc_input_data() may be still holding a ref when
it has just attached the sk_buff to the rx ring and given that attachment
its own ref. If recvmsg happens fast enough, skb_cow_data() can see the
ref still held by the softirq handler.
Further, a packet may contain multiple subpackets, each of which gets its
own attachment to the ring and its own ref - also making skb_cow_data() go
bang.
Fix this by:
(1) The DATA packet is currently parsed for subpackets twice by the input
routines. Parse it just once instead and make notes in the sk_buff
private data.
(2) Use the notes from (1) when attaching the packet to the ring multiple
times. Once the packet is attached to the ring, recvmsg can see it
and start modifying it, so the softirq handler is not permitted to
look inside it from that point.
(3) Pass the ref from the input code to the ring rather than getting an
extra ref. rxrpc_input_data() uses a ref on the second refcount to
prevent the packet from evaporating under it.
(4) Call skb_unshare() on secured DATA packets in rxrpc_input_packet()
before we take call->input_lock. Other sorts of packets don't get
modified and so can be left.
A trace is emitted if skb_unshare() eats the skb. Note that
skb_share() for our accounting in this regard as we can't see the
parameters in the packet to log in a trace line if it releases it.
(5) Remove the calls to skb_cow_data(). These are then no longer
necessary.
There are also patches to improve the rxrpc_skb tracepoint to make sure
that Tx-derived buffers are identified separately from Rx-derived buffers
in the trace.
====================
Stephen Rothwell [Fri, 30 Aug 2019 21:34:05 +0000 (14:34 -0700)]
net: stmmac: depend on COMMON_CLK
Fixes: 190f73ab4c43 ("net: stmmac: setup higher frequency clk support for EHL & TGL") Signed-off-by: Stephen Rothwell <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Chen-Yu Tsai [Thu, 29 Aug 2019 03:17:24 +0000 (11:17 +0800)]
net: stmmac: dwmac-rk: Don't fail if phy regulator is absent
The devicetree binding lists the phy phy as optional. As such, the
driver should not bail out if it can't find a regulator. Instead it
should just skip the remaining regulator related code and continue
on normally.
Skip the remainder of phy_power_on() if a regulator supply isn't
available. This also gets rid of the bogus return code.
Fixes: 2e12f536635f ("net: stmmac: dwmac-rk: Use standard devicetree property for phy regulator") Signed-off-by: Chen-Yu Tsai <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Colin Ian King [Wed, 28 Aug 2019 23:14:50 +0000 (00:14 +0100)]
arcnet: capmode: remove redundant assignment to pointer pkt
Pointer pkt is being initialized with a value that is never read
and pkt is being re-assigned a little later on. The assignment is
redundant and hence can be removed.
Addresses-Coverity: ("Ununsed value") Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Fri, 30 Aug 2019 21:02:19 +0000 (14:02 -0700)]
Merge branch 'bnxt_en-health-and-error-recovery'
Michael Chan says:
====================
bnxt_en: health and error recovery.
This patchset implements adapter health and error recovery. The status
is reported through several devlink reporters and the driver will
initiate and complete the recovery process using the devlink infrastructure.
v2: Added 4 patches at the beginning of the patchset to clean up error code
handling related to firmware messages and to convert to use standard
error codes.
Removed the dropping of rtnl_lock in bnxt_close().
Broke up the patches some more for better patch organization and
future bisection.
====================
Michael Chan [Fri, 30 Aug 2019 03:55:04 +0000 (23:55 -0400)]
bnxt_en: Add bnxt_fw_exception() to handle fatal firmware errors.
This call will handle fatal firmware errors by forcing a reset on the
firmware. The master function driver will carry out the forced reset.
The sequence will go through the same bnxt_fw_reset_task() workqueue.
This fatal reset differs from the non-fatal reset at the beginning
stages. From the BNXT_FW_RESET_STATE_ENABLE_DEV state onwards where
the firmware is coming out of reset, it is practically identical to the
non-fatal reset.
The next patch will add the periodic heartbeat check and the devlink
reporter to report the fatal event and to initiate the bnxt_fw_exception()
call.
Michael Chan [Fri, 30 Aug 2019 03:55:03 +0000 (23:55 -0400)]
bnxt_en: Add RESET_FW state logic to bnxt_fw_reset_task().
This state handles driver initiated chip reset during error recovery.
Only the master function will perform this step during error recovery.
The next patch will add code to initiate this reset from the master
function.
Michael Chan [Fri, 30 Aug 2019 03:55:02 +0000 (23:55 -0400)]
bnxt_en: Do not send firmware messages if firmware is in error state.
Add a flag to mark that the firmware has encountered fatal condition.
The driver will not send any more firmware messages and will return
error to the caller. Fix up some clean up functions to continue
and not abort when the firmware message function returns error.
This is preparation work to fully handle firmware error recovery
under fatal conditions.
Vasundhara Volam [Fri, 30 Aug 2019 03:55:00 +0000 (23:55 -0400)]
bnxt_en: Add devlink health reset reporter.
Add devlink health reporter for the firmware reset event. Once we get
the notification from firmware about the impending reset, the driver
will report this to devlink and the call to bnxt_fw_reset() will be
initiated to complete the reset sequence.
Michael Chan [Fri, 30 Aug 2019 03:54:59 +0000 (23:54 -0400)]
bnxt_en: Handle firmware reset.
Add the bnxt_fw_reset() main function to handle firmware reset. This
is triggered by firmware to initiate an orderly reset, for example
when a non-fatal exception condition has been detected. bnxt_fw_reset()
will first wait for all VFs to shutdown and then start the
bnxt_fw_reset_task() work queue to go through the sequence of reset,
re-probe, and re-initialization.
The next patch will add the devlink reporter to start the sequence and
call bnxt_fw_reset().
Michael Chan [Fri, 30 Aug 2019 03:54:58 +0000 (23:54 -0400)]
bnxt_en: Handle RESET_NOTIFY async event from firmware.
This event from firmware signals a coordinated reset initiated by the
firmware. It may be triggered by some error conditions encountered
in the firmware or other orderly reset conditions.
We store the parameters from this event. Subsequent patches will
add logic to handle reset itself using devlink reporters.
Michael Chan [Fri, 30 Aug 2019 03:54:56 +0000 (23:54 -0400)]
bnxt_en: Add BNXT_STATE_IN_FW_RESET state.
The new flag will be set in subsequent patches when firmware is
going through reset. If bnxt_close() is called while the new flag
is set, the FW reset sequence will have to be aborted because the
NIC is prematurely closed before FW reset has completed. We also
reject SRIOV configurations while FW reset is in progress.
v2: No longer drop rtnl_lock() in close and wait for FW reset to complete.
Call the new firmware API HWRM_ERROR_RECOVERY_QCFG if it is supported
to discover the firmware health and recovery capabilities and settings.
This feature allows the driver to reset the chip if firmware crashes and
becomes unresponsive.
Michael Chan [Fri, 30 Aug 2019 03:54:52 +0000 (23:54 -0400)]
bnxt_en: Handle firmware reset status during IF_UP.
During IF_UP, newer firmware has a new status flag that indicates that
firmware has reset. Add new function bnxt_fw_init_one() to re-probe the
firmware and re-setup VF resources on the PF if necessary. If the
re-probe fails, set a flag to prevent bnxt_open() from proceeding again.
Vasundhara Volam [Fri, 30 Aug 2019 03:54:51 +0000 (23:54 -0400)]
bnxt_en: Register buffers for VFs before reserving resources.
When VFs need to be reconfigured dynamically after firmwware reset, the
configuration sequence on the PF needs to be changed to register the VF
buffers first. Otherwise, some VF firmware commands may not succeed as
there may not be PF buffers ready for the re-directed firmware commands.
This sequencing did not matter much before when we only supported
the normal bring-up of VFs.
Michael Chan [Fri, 30 Aug 2019 03:54:50 +0000 (23:54 -0400)]
bnxt_en: Refactor bnxt_sriov_enable().
Refactor the hardware/firmware configuration portion in
bnxt_sriov_enable() into a new function bnxt_cfg_hw_sriov(). This
new function can be called after a firmware reset to reconfigure the
VFs previously enabled.
v2: straight refactor of the code. Reordering done in the next patch.
Michael Chan [Fri, 30 Aug 2019 03:54:49 +0000 (23:54 -0400)]
bnxt_en: Prepare bnxt_init_one() to be called multiple times.
In preparation for the new firmware reset feature, some of the logic
in bnxt_init_one() and related functions will be called again after
firmware has reset. Reset some of the flags and capabilities so that
everything that can change can be re-initialized. Refactor some
functions to probe firmware versions and capabilities. Check some
buffers before allocating as they may have been allocated previously.
Michael Chan [Fri, 30 Aug 2019 03:54:48 +0000 (23:54 -0400)]
bnxt_en: Suppress all error messages in hwrm_do_send_msg() in silent mode.
If the silent parameter is set, suppress all messages when there is
no response from firmware. When polling for firmware to come out of
reset, no response may be normal and we want to suppress the error
messages. Also, don't poll for the firmware DMA response if Bus Master
is disabled. This is in preparation for error recovery when firmware
may be in error or reset state or Bus Master is disabled.
Michael Chan [Fri, 30 Aug 2019 03:54:47 +0000 (23:54 -0400)]
bnxt_en: Simplify error checking in the SR-IOV message forwarding functions.
There are 4 functions handling message forwarding for SR-IOV. They
check for non-zero firmware response code and then return -1. There
is no need to do this anymore. The main messaging function will
now return standard error code. Since we don't need to examine the
response, we can use the hwrm_send_message() variant which will
take the mutex automatically.
Michael Chan [Fri, 30 Aug 2019 03:54:46 +0000 (23:54 -0400)]
bnxt_en: Convert error code in firmware message response to standard code.
The main firmware messaging function returns the firmware defined error
code and many callers have to convert to standard error code for proper
propagation to userspace. Convert bnxt_hwrm_do_send_msg() to return
standard error code so we can do away with all the special error code
handling by the many callers.
David S. Miller [Fri, 30 Aug 2019 20:54:36 +0000 (13:54 -0700)]
Merge branch 'ioc3-eth-improvements'
Thomas Bogendoerfer says:
====================
ioc3-eth improvements
In my patch series for splitting out the serial code from ioc3-eth
by using a MFD device there was one big patch for ioc3-eth.c,
which wasn't really usefull for reviews. This series contains the
ioc3-eth changes splitted in smaller steps and few more cleanups.
Only the conversion to MFD will be done later in a different series.
Changes in v3:
- no need to check skb == NULL before passing it to dev_kfree_skb_any
- free memory allocated with get_page(s) with free_page(s)
- allocate rx ring with just GFP_KERNEL
- add required alignment for rings in comments
Changes in v2:
- use net_err_ratelimited for printing various ioc3 errors
- added missing clearing of rx buf valid flags into ioc3_alloc_rings
- use __func__ for printing out of memory messages
====================
ioc3_init did everything from reset to init rings to starting the chip.
This change move out chip start into a new function as preparation
for easier handling of receive buffer allocation failures.
net: sgi: ioc3-eth: separate tx and rx ring handling
After allocation of descriptor memory is now done once in probe
handling of tx ring is completely done by ioc3_clean_tx_ring. So
we remove the remaining tx ring actions out of ioc3_alloc_rings
and ioc3_free_rings and rename it to ioc3_[alloc|free]_rx_bufs
to better describe what they are doing.