Merge tag 'for-4.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"This contains a few fixups to the qgroup patches that were merged this
dev cycle, unaligned access fix, blockgroup removal corner case fix
and a small debugging output tweak"
* tag 'for-4.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: print-tree: debugging output enhancement
btrfs: Fix race condition between delayed refs and blockgroup removal
btrfs: fix unaligned access in readdir
btrfs: Fix wrong btrfs_delalloc_release_extents parameter
btrfs: delayed-inode: Remove wrong qgroup meta reservation calls
btrfs: qgroup: Use independent and accurate per inode qgroup rsv
btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT
Cong Wang [Fri, 20 Apr 2018 04:54:34 +0000 (21:54 -0700)]
llc: fix NULL pointer deref for SOCK_ZAPPED
For SOCK_ZAPPED socket, we don't need to care about llc->sap,
so we should just skip these refcount functions in this case.
Fixes: f7e43672683b ("llc: hold llc_sap before release_sock()") Reported-by: kernel test robot <[email protected]> Signed-off-by: Cong Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
The CPDMA_TX_PRIORITY_MAP in real is vlan pcp field priority mapping
register and basically replaces vlan pcp field for tagged packets.
So, set it to be 1:1 mapping. Otherwise, it will cause unexpected
change of egress vlan tagged packets, like prio 2 -> prio 5.
Fixes: e05107e6b747 ("net: ethernet: ti: cpsw: add multi queue support") Reviewed-by: Grygorii Strashko <[email protected]> Signed-off-by: Ivan Khoronzhuk <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Cong Wang [Thu, 19 Apr 2018 19:25:38 +0000 (12:25 -0700)]
llc: delete timers synchronously in llc_sk_free()
The connection timers of an llc sock could be still flying
after we delete them in llc_sk_free(), and even possibly
after we free the sock. We could just wait synchronously
here in case of troubles.
Note, I leave other call paths as they are, since they may
not have to wait, at least we can change them to synchronously
when needed.
Also, move the code to net/llc/llc_conn.c, which is apparently
a better place.
l2tp: fix {pppol2tp, l2tp_dfs}_seq_stop() in case of seq_file overflow
Commit 0e0c3fee3a59 ("l2tp: hold reference on tunnels printed in pppol2tp proc file")
assumed that if pppol2tp_seq_stop() was called with non-NULL private
data (the 'v' pointer), then pppol2tp_seq_start() would not be called
again. It turns out that this isn't guaranteed, and overflowing the
seq_file's buffer in pppol2tp_seq_show() is a way to get into this
situation.
Therefore, pppol2tp_seq_stop() needs to reset pd->tunnel, so that
pppol2tp_seq_start() won't drop a reference again if it gets called.
We also have to clear pd->session, because the rest of the code expects
a non-NULL tunnel when pd->session is set.
The l2tp_debugfs module has the same issue. Fix it in the same way.
Fixes: 0e0c3fee3a59 ("l2tp: hold reference on tunnels printed in pppol2tp proc file") Fixes: f726214d9b23 ("l2tp: hold reference on tunnels printed in l2tp/tunnels debugfs file") Signed-off-by: Guillaume Nault <[email protected]> Signed-off-by: David S. Miller <[email protected]>
s390/qeth: use Read device to query hypervisor for MAC
For z/VM NICs, qeth needs to consider which of the three CCW devices in
an MPC group it uses for requesting a managed MAC address.
On the Base device, the hypervisor returns a default MAC which is
pre-assigned when creating the NIC (this MAC is also returned by the
READ MAC primitive). Querying any other device results in the allocation
of an additional MAC address.
For consistency with READ MAC and to avoid using up more addresses than
necessary, it is preferable to use the NIC's default MAC. So switch the
the diag26c over to using a NIC's Read device, which should always be
identical to the Base device.
Fixes: ec61bd2fd2a2 ("s390/qeth: use diag26c to get MAC address on L2") Signed-off-by: Julian Wiedmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
s390/qeth: fix request-side race during cmd IO timeout
Submitting a cmd IO request (usually on the WRITE device, but for IDX
also on the READ device) is currently done with ccw_device_start()
and a manual timeout in the caller.
On timeout, the caller cleans up the related resources (eg. IO buffer).
But 1) the IO might still be active and utilize those resources, and
2) when the IO completes, qeth_irq() will attempt to clean up the
same resources again.
Instead of introducing additional resource locking, switch to
ccw_device_start_timeout() to ensure IO termination after timeout, and
let the IRQ handler alone deal with cleaning up after a request.
This also removes a stray write->irq_pending reset from
clear_ipacmd_list(). The routine doesn't terminate any pending IO on
the WRITE device, so this should be handled properly via IO timeout
in the IRQ handler.
When changing the MAC address on a L2 qeth device, current code first
unregisters the old address, then registers the new one.
If HW rejects the new address (or the IO fails), the device ends up with
no operable address at all.
Re-order the code flow so that the old address only gets dropped if the
new address was registered successfully. While at it, add logic to catch
some corner-cases.
For control IO, qeth currently tracks the index of the buffer that it
expects to complete the next IO on each qeth_channel. If the channel
presents an IRQ while this buffer has not yet completed, no completion
processing for _any_ completed buffer takes place.
So if the 'next buffer' is skipped for any sort of reason* (eg. when it
is released due to error conditions, before the IO is started), the
buffer obviously won't switch to PROCESSED until it is eventually
allocated for a _different_ IO and completes.
Until this happens, all completion processing on that channel stalls
and pending requests possibly time out.
As a fix, remove the whole 'next buffer' logic and simply process any
IO buffer right when it completes. A channel will never have more than
one IO pending, so there's no risk of processing out-of-sequence.
*Note: currently just one location in the code really handles this problem,
by advancing the 'next' index manually.
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A small set of fixes for x86:
- Prevent X2APIC ID 0xFFFFFFFF from being treated as valid, which
causes the possible CPU count to be wrong.
- Prevent 32bit truncation in calc_hpet_ref() which causes the TSC
calibration to fail
- Fix the page table setup for temporary text mappings in the resume
code which causes resume failures
- Make the page table dump code handle HIGHPTE correctly instead of
oopsing
- Support for topologies where NUMA nodes share an LLC to prevent a
invalid topology warning and further malfunction on such systems.
- Remove the now unused pci-nommu code
- Remove stale function declarations"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/power/64: Fix page-table setup for temporary text mapping
x86/mm: Prevent kernel Oops in PTDUMP code with HIGHPTE=y
x86,sched: Allow topologies where NUMA nodes share an LLC
x86/processor: Remove two unused function declarations
x86/acpi: Prevent X2APIC id 0xffffffff from being accounted
x86/tsc: Prevent 32bit truncation in calc_hpet_ref()
x86: Remove pci-nommu.c
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"A larger set of updates for perf.
Kernel:
- Handle the SBOX uncore monitoring correctly on Broadwell CPUs which
do not have SBOX.
- Store context switch out type in PERF_RECORD_SWITCH[_CPU_WIDE]. The
percentage of preempting and non-preempting context switches help
understanding the nature of workloads (CPU or IO bound) that are
running on a machine. This adds the kernel facility and userspace
changes needed to show this information in 'perf script' and 'perf
report -D' (Alexey Budankov)
- Remove a WARN_ON() in the trace/kprobes code which is pointless
because the return error code is already telling the caller what's
wrong.
- Revert a fugly workaround for clang BPF targets.
- Fix sample_max_stack maximum check and do not proceed when an error
has been detect, return them to avoid misidentifying errors (Jiri
Olsa)
- Add SPDX idenitifiers and get rid of GPL boilderplate.
Tools:
- Synchronize kernel ABI headers, v4.17-rc1 (Ingo Molnar)
- Support MAP_FIXED_NOREPLACE, noticed when updating the
tools/include/ copies (Arnaldo Carvalho de Melo)
- Add '\n' at the end of parse-options error messages (Ravi Bangoria)
- Add s390 support for detailed/verbose PMU event description (Thomas
Richter)
- perf annotate fixes and improvements:
* Allow showing offsets in more than just jump targets, use the
new 'O' hotkey in the TUI, config ~/.perfconfig
annotate.offset_level for it and for --stdio2 (Arnaldo Carvalho
de Melo)
* Use the resolved variable names from objdump disassembled lines
to make them more compact, just like was already done for some
instructions, like "mov", this eventually will be done more
generally, but lets now add some more to the existing mechanism
(Arnaldo Carvalho de Melo)
- perf record fixes:
* Change warning for missing topology sysfs entry to debug, as not
all architectures have those files, s390 being one of those
(Thomas Richter)
* Remove old error messages about things that unlikely to be the
root cause in modern systems (Andi Kleen)
* Enable 1ms interval for printing event counters values in
(Alexey Budankov)
- perf test fixes:
* Run dwarf unwind on arm32 (Kim Phillips)
* Remove unused ptrace.h include from LLVM test, sidesteping older
clang's lack of support for some asm constructs (Arnaldo
Carvalho de Melo)
* Fixup BPF test using epoll_pwait syscall function probe, to cope
with the syscall routines renames performed in this development
cycle (Arnaldo Carvalho de Melo)
- perf version fixes:
* Do not print info about HAVE_LIBAUDIT_SUPPORT in 'perf version
--build-options' when HAVE_SYSCALL_TABLE_SUPPORT is true, as
libaudit won't be used in that case, print info about
syscall_table support instead (Jin Yao)
- Build system fixes:
* Use HAVE_..._SUPPORT used consistently (Jin Yao)
* Restore READ_ONCE() C++ compatibility in tools/include (Mark
Rutland)
* Give hints about package names needed to build jvmti (Arnaldo
Carvalho de Melo)"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
perf/x86/intel/uncore: Fix SBOX support for Broadwell CPUs
perf/x86/intel/uncore: Revert "Remove SBOX support for Broadwell server"
coresight: Move to SPDX identifier
perf test BPF: Fixup BPF test using epoll_pwait syscall function probe
perf tests mmap: Show which tracepoint is failing
perf tools: Add '\n' at the end of parse-options error messages
perf record: Remove suggestion to enable APIC
perf record: Remove misleading error suggestion
perf hists browser: Clarify top/report browser help
perf mem: Allow all record/report options
perf trace: Support MAP_FIXED_NOREPLACE
perf: Remove superfluous allocation error check
perf: Fix sample_max_stack maximum check
perf: Return proper values for user stack errors
perf list: Add s390 support for detailed/verbose PMU event description
perf script: Extend misc field decoding with switch out event type
perf report: Extend raw dump (-D) out with switch out event type
perf/core: Store context switch out type in PERF_RECORD_SWITCH[_CPU_WIDE]
tools/headers: Synchronize kernel ABI headers, v4.17-rc1
trace_kprobe: Remove warning message "Could not insert probe at..."
...
Merge tag 'random_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random
Pull /dev/random fixes from Ted Ts'o:
"Fix some bugs in the /dev/random driver which causes getrandom(2) to
unblock earlier than designed.
Thanks to Jann Horn from Google's Project Zero for pointing this out
to me"
* tag 'random_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
random: add new ioctl RNDRESEEDCRNG
random: crng_reseed() should lock the crng instance that it is modifying
random: set up the NUMA crng instances after the CRNG is fully initialized
random: use a different mixing algorithm for add_device_randomness()
random: fix crng_ready() test
Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
"A regression fix, new unit test infrastructure and a build fix:
- Regression fix addressing support for the new NVDIMM label storage
area access commands (_LSI, _LSR, and _LSW).
The Intel specific version of these commands communicated the
"Device Locked" status on the label-storage-information command.
However, these new commands (standardized in ACPI 6.2) communicate
the "Device Locked" status on the label-storage-read command, and
the driver was missing the indication.
Reading from locked persistent memory is similar to reading
unmapped PCI memory space, returns all 1's.
- Unit test infrastructure is added to regression test the "Device
Locked" detection failure.
- A build fix is included to allow the "of_pmem" driver to be built
as a module and translate an Open Firmware described device to its
local numa node"
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
MAINTAINERS: Add backup maintainers for libnvdimm and DAX
device-dax: allow MAP_SYNC to succeed
Revert "libnvdimm, of_pmem: workaround OF_NUMA=n build error"
libnvdimm, of_pmem: use dev_to_node() instead of of_node_to_nid()
tools/testing/nvdimm: enable labels for nfit_test.1 dimms
tools/testing/nvdimm: fix missing newline in nfit_test_dimm 'handle' attribute
tools/testing/nvdimm: support nfit_test_dimm attributes under nfit_test.1
tools/testing/nvdimm: allow custom error code injection
libnvdimm, dimm: handle EACCES failures from label reads
====================
net/ipv6: Another followup to the fib6_info change
Last one - for this week.
Patches 1, 2 and 7 are more cleanup patches - removing dead code,
moving code from a header to near its single caller, and updating
function name.
Patches 3-5 do some refactoring leading up to patch 6 which fixes
a NULL dereference. I have only managed to trigger a panic once, so
I can not definitively confirm it addresses the problem but it seems
pretty clear that it is a race on removing a 'from' reference on
an rt6_info and another path using that 'from' value to do
cookie checking.
====================
David Ahern [Fri, 20 Apr 2018 22:38:03 +0000 (15:38 -0700)]
net/ipv6: Remove unncessary check on f6i in fib6_check
Dan reported an imbalance in fib6_check on use of f6i and checking
whether it is null. Since fib6_check is only called if f6i is non-null,
remove the unnecessary check.
David Ahern [Fri, 20 Apr 2018 22:38:02 +0000 (15:38 -0700)]
net/ipv6: Make from in rt6_info rcu protected
When a dst entry is created from a fib entry, the 'from' in rt6_info
is set to the fib entry. The 'from' reference is used most notably for
cookie checking - making sure stale dst entries are updated if the
fib entry is changed.
When a fib entry is deleted, the pcpu routes on it are walked releasing
the fib6_info reference. This is needed for the fib6_info cleanup to
happen and to make sure all device references are released in a timely
manner.
There is a race window when a FIB entry is deleted and the 'from' on the
pcpu route is dropped and the pcpu route hits a cookie check. Handle
this race using rcu on from.
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Initial work on BPF Type Format (BTF) is added, which is a meta
data format which describes the data types of BPF programs / maps.
BTF has its roots from CTF (Compact C-Type format) with a number
of changes to it. First use case is to provide a generic pretty
print capability for BPF maps inspection, later work will also
add BTF to bpftool. pahole support to convert dwarf to BTF will
be upstreamed as well (https://github.com/iamkafai/pahole/tree/btf),
from Martin.
2) Add a new xdp_bpf_adjust_tail() BPF helper for XDP that allows
for changing the data_end pointer. Only shrinking is currently
supported which helps for crafting ICMP control messages. Minor
changes in drivers have been added where needed so they recalc
the packet's length also when data_end was adjusted, from Nikita.
3) Improve bpftool to make it easier to feed hex bytes via cmdline
for map operations, from Quentin.
4) Add support for various missing BPF prog types and attach types
that have been added to kernel recently but neither to bpftool
nor libbpf yet. Doc and bash completion updates have been added
as well for bpftool, from Andrey.
5) Proper fix for avoiding to leak info stored in frame data on page
reuse for the two bpf_xdp_adjust_{head,meta} helpers by disallowing
to move the pointers into struct xdp_frame area, from Jesper.
6) Follow-up compile fix from BTF in order to include stdbool.h in
libbpf, from Björn.
7) Few fixes in BPF sample code, that is, a typo on the netdevice
in a comment and fixup proper dump of XDP action code in the
tracepoint exception, from Wang and Jesper.
====================
Merge tag 'sound-4.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A few small fixes:
- a fix for the NULL-dereference in rawmidi compat ioctls, triggered
by fuzzer
- HD-audio Realtek codec quirks, a VIA controller fixup
- a long-standing bug fix in LINE6 MIDI"
* tag 'sound-4.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: rawmidi: Fix missing input substream checks in compat ioctls
ALSA: hda/realtek - adjust the location of one mic
ALSA: hda/realtek - set PINCFG_HEADSET_MIC to parse_flags
ALSA: hda - New VIA controller suppor no-snoop path
ALSA: line6: Use correct endpoint type for midi output
Merge tag 'linux-watchdog-4.17-rc2' of git://www.linux-watchdog.org/linux-watchdog
Pull watchdog fixes from Wim Van Sebroeck:
- fall-through fixes
- MAINTAINER change for hpwdt
- renesas-wdt: Add support for WDIOF_CARDRESET
- aspeed: set bootstatus during probe
* tag 'linux-watchdog-4.17-rc2' of git://www.linux-watchdog.org/linux-watchdog:
aspeed: watchdog: Set bootstatus during probe
watchdog: renesas-wdt: Add support for WDIOF_CARDRESET
watchdog: wafer5823wdt: Mark expected switch fall-through
watchdog: w83977f_wdt: Mark expected switch fall-through
watchdog: sch311x_wdt: Mark expected switch fall-through
watchdog: hpwdt: change maintainer.
Merge tag 'linux-kselftest-4.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull Kselftest fix from Shuah Khan:
"A fix from Michael Ellerman to not run dnotify_test by default to
prevent Kselftest running forever"
* tag 'linux-kselftest-4.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/filesystems: Don't run dnotify_test by default
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- kasan: avoid pfn_to_nid() before the page array is initialised
- Fix typo causing the "upgrade" of known signals to SIGKILL
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: signal: don't force known signals to SIGKILL
arm64: kasan: avoid pfn_to_nid() before page array is initialized
- "fork: unconditionally clear stack on fork" is a non-bugfix which got
lost during the merge window - performance concerns appear to have
been adequately addressed.
- and a bunch of fixes
* emailed patches from Andrew Morton <[email protected]>:
mm/filemap.c: fix NULL pointer in page_cache_tree_insert()
mm: memcg: add __GFP_NOWARN in __memcg_schedule_kmem_cache_create()
fs, elf: don't complain MAP_FIXED_NOREPLACE unless -EEXIST error
kexec_file: do not add extra alignment to efi memmap
proc: fix /proc/loadavg regression
proc: revalidate kernel thread inodes to root:root
autofs: mount point create should honour passed in mode
MAINTAINERS: add personal addresses for Sascha and Uwe
kasan: add no_sanitize attribute for clang builds
rapidio: fix rio_dma_transfer error handling
mm: enable thp migration for shmem thp
writeback: safer lock nesting
mm, pagemap: fix swap offset value for PMD migration entry
mm: fix do_pages_move status handling
fork: unconditionally clear stack on fork
Merge tag 'perf-urgent-for-mingo-4.17-20180420' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
Pull perf/urgent fixes and improvements from Arnaldo Carvalho de Melo:
- Store context switch out type in PERF_RECORD_SWITCH[_CPU_WIDE].
The percentage of preempting and non-preempting context switches help
understanding the nature of workloads (CPU or IO bound) that are running
on a machine. This adds the kernel facility and userspace changes needed
to show this information in 'perf script' and 'perf report -D' (Alexey Budankov)
- Remove old error messages about things that unlikely to be the root cause
in modern systems (Andi Kleen)
- Synchronize kernel ABI headers, v4.17-rc1 (Ingo Molnar)
- Support MAP_FIXED_NOREPLACE, noticed when updating the tools/include/
copies (Arnaldo Carvalho de Melo)
- Fixup BPF test using epoll_pwait syscall function probe, to cope with
the syscall routines renames performed in this development cycle (Arnaldo Carvalho de Melo)
- Fix sample_max_stack maximum check and do not proceed when an error
has been detect, return them to avoid misidentifying errors (Jiri Olsa)
- Add '\n' at the end of parse-options error messages (Ravi Bangoria)
- Add s390 support for detailed/verbose PMU event description (Thomas Richter)
Matthew Wilcox [Fri, 20 Apr 2018 21:56:20 +0000 (14:56 -0700)]
mm/filemap.c: fix NULL pointer in page_cache_tree_insert()
f2fs specifies the __GFP_ZERO flag for allocating some of its pages.
Unfortunately, the page cache also uses the mapping's GFP flags for
allocating radix tree nodes. It always masked off the __GFP_HIGHMEM
flag, and masks off __GFP_ZERO in some paths, but not all. That causes
radix tree nodes to be allocated with a NULL list_head, which causes
backtraces like:
The __GFP_DMA and __GFP_DMA32 flags would also be able to sneak through
if they are ever used. Fix them all by using GFP_RECLAIM_MASK at the
innermost location, and remove it from earlier in the callchain.
Minchan Kim [Fri, 20 Apr 2018 21:56:17 +0000 (14:56 -0700)]
mm: memcg: add __GFP_NOWARN in __memcg_schedule_kmem_cache_create()
If there is heavy memory pressure, page allocation with __GFP_NOWAIT
fails easily although it's order-0 request. I got below warning 9 times
for normal boot.
__memcg_schedule_kmem_cache_create() tries to create a shadow slab cache
and the worker allocation failure is not really critical because we will
retry on the next kmem charge. We might miss some charges but that
shouldn't be critical. The excessive allocation failure report is not
very helpful.
And __efi_memmap_init maps with the size including the alignment bytes
but efi_memmap_unmap use nr_maps * desc_size which does not include the
extra bytes.
The alignment in kexec code is only needed for the kexec buffer internal
use Actually kexec should pass exact size of the efi memmap to 2nd
kernel.
Ian Kent [Fri, 20 Apr 2018 21:55:59 +0000 (14:55 -0700)]
autofs: mount point create should honour passed in mode
The autofs file system mkdir inode operation blindly sets the created
directory mode to S_IFDIR | 0555, ingoring the passed in mode, which can
cause selinux dac_override denials.
But the function also checks if the caller is the daemon (as no-one else
should be able to do anything here) so there's no point in not honouring
the passed in mode, allowing the daemon to set appropriate mode when
required.
MAINTAINERS: add personal addresses for Sascha and Uwe
The idea behind using [email protected] (i.e. the mail alias for the
kernel people at Pengutronix) as email address was to have a backup when
a given developer is on vacation or run over by a bus. Make this more
explicit by adding the alias as reviewer and use the personal address
for Sascha and me.
KASAN uses the __no_sanitize_address macro to disable instrumentation of
particular functions. Right now it's defined only for GCC build, which
causes false positives when clang is used.
This patch adds a definition for clang.
Note, that clang's revision 329612 or higher is required.
Ioan Nicu [Fri, 20 Apr 2018 21:55:49 +0000 (14:55 -0700)]
rapidio: fix rio_dma_transfer error handling
Some of the mport_dma_req structure members were initialized late
inside the do_dma_request() function, just before submitting the
request to the dma engine. But we have some error branches before
that. In case of such an error, the code would return on the error
path and trigger the calling of dma_req_free() with a req structure
which is not completely initialized. This causes a NULL pointer
dereference in dma_req_free().
This patch fixes these error branches by making sure that all
necessary mport_dma_req structure members are initialized in
rio_dma_transfer() immediately after the request structure gets
allocated.
My testing for the latest kernel supporting thp migration showed an
infinite loop in offlining the memory block that is filled with shmem
thps. We can get out of the loop with a signal, but kernel should return
with failure in this case.
What happens in the loop is that scan_movable_pages() repeats returning
the same pfn without any progress. That's because page migration always
fails for shmem thps.
In memory offline code, memory blocks containing unmovable pages should be
prevented from being offline targets by has_unmovable_pages() inside
start_isolate_page_range(). So it's possible to change migratability for
non-anonymous thps to avoid the issue, but it introduces more complex and
thp-specific handling in migration code, so it might not good.
So this patch is suggesting to fix the issue by enabling thp migration for
shmem thp. Both of anon/shmem thp are migratable so we don't need
precheck about the type of thps.
lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.
unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the given inode is switching writeback domains. Switches occur when
enough writes are issued from a new domain.
This existing pattern is thus suspicious:
lock_page_memcg(page);
unlocked_inode_to_wb_begin(inode, &locked);
...
unlocked_inode_to_wb_end(inode, locked);
unlock_page_memcg(page);
If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock. This suggests the
possibility of deadlock if an interrupt occurs before unlock_page_memcg().
Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).
If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:
cd /mnt/cgroup/memory
mkdir a b
echo 1 > a/memory.move_charge_at_immigrate
echo 1 > b/memory.move_charge_at_immigrate
(
echo $BASHPID > a/cgroup.procs
while true; do
dd if=/dev/zero of=/mnt/big bs=1M count=256
done
) &
while true; do
sync
done &
sleep 1h &
SLEEP=$!
while true; do
echo $SLEEP > a/cgroup.procs
echo $SLEEP > b/cgroup.procs
done
The deadlock does not seem possible, so it's debatable if there's any
reason to modify the kernel. I suggest we should to prevent future
surprises. And Wang Long said "this deadlock occurs three times in our
environment", so there's more reason to apply this, even to stable.
Stable 4.4 has minor conflicts applying this patch. For a clean 4.4 patch
see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146
Wang Long said "this deadlock occurs three times in our environment"
mm, pagemap: fix swap offset value for PMD migration entry
The swap offset reported by /proc/<pid>/pagemap may be not correct for
PMD migration entries. If addr passed into pagemap_pmd_range() isn't
aligned with PMD start address, the swap offset reported doesn't
reflect this. And in the loop to report information of each sub-page,
the swap offset isn't increased accordingly as that for PFN.
This may happen after opening /proc/<pid>/pagemap and seeking to a page
whose address doesn't align with a PMD start address. I have verified
this with a simple test program.
BTW: migration swap entries have PFN information, do we need to restrict
whether to show them?
Michal Hocko [Fri, 20 Apr 2018 21:55:35 +0000 (14:55 -0700)]
mm: fix do_pages_move status handling
Li Wang has reported that LTP move_pages04 test fails with the current
tree:
LTP move_pages04:
TFAIL : move_pages04.c:143: status[1] is EPERM, expected EFAULT
The test allocates an array of two pages, one is present while the other
is not (resp. backed by zero page) and it expects EFAULT for the second
page as the man page suggests. We are reporting EPERM which doesn't make
any sense and this is a result of a bug from cf5f16b23ec9 ("mm: unclutter
THP migration").
do_pages_move tries to handle as many pages in one batch as possible so we
queue all pages with the same node target together and that corresponds to
[start, i] range which is then used to update status array.
add_page_for_migration will correctly notice the zero (resp. !present)
page and returns with EFAULT which gets written to the status. But if
this is the last page in the array we do not update start and so the last
store_status after the loop will overwrite the range of the last batch
with NUMA_NO_NODE (which corresponds to EPERM).
Fix this by simply bailing out from the last flush if the pagelist is
empty as there is clearly nothing more to do.
One of the classes of kernel stack content leaks[1] is exposing the
contents of prior heap or stack contents when a new process stack is
allocated. Normally, those stacks are not zeroed, and the old contents
remain in place. In the face of stack content exposure flaws, those
contents can leak to userspace.
Fixing this will make the kernel no longer vulnerable to these flaws, as
the stack will be wiped each time a stack is assigned to a new process.
There's not a meaningful change in runtime performance; it almost looks
like it provides a benefit.
It continues to look like it's faster, though the deviation is rather
wide, but I'm not sure what I could do that would be less noisy. I'm
open to ideas!
Since at least the 3.10 kernel and likely a lot earlier we have
not been able to create unix domain sockets in a cifs share
when mounted using the SFU mount option (except when mounted
with the cifs unix extensions to Samba e.g.)
Trying to create a socket, for example using the af_unix command from
xfstests will cause :
BUG: unable to handle kernel NULL pointer dereference at 00000000 00000040
Since no one uses or depends on being able to create unix domains sockets
on a cifs share the easiest fix to stop this vulnerability is to simply
not allow creation of any other special files than char or block devices
when sfu is used.
Added update to Ronnie's patch to handle a tcon link leak, and
to address a buf leak noticed by Gustavo and Colin.
Merge tag 'mmc-v4.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"A couple of MMC host fixes:
- sdhci-pci: Fixup tuning for AMD for eMMC HS200 mode
- renesas_sdhi_internal_dmac: Avoid data corruption by limiting
DMA RX"
* tag 'mmc-v4.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: renesas_sdhi_internal_dmac: limit DMA RX for old SoCs
mmc: sdhci-pci: Only do AMD tuning for HS200
Merge tag 'md/4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD fixes from Shaohua Li:
"Three small fixes for MD:
- md-cluster fix for faulty device from Guoqing
- writehint fix for writebehind IO for raid1 from Mariusz
- a live lock fix for interrupted recovery from Yufen"
* tag 'md/4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
raid1: copy write hint from master bio to behind bio
md/raid1: exit sync request if MD_RECOVERY_INTR is set
md-cluster: don't update recovery_offset for faulty device
- tree block header
* add generation and owner output for node and leaf
- node pointer generation output
- allow btrfs_print_tree() to not follow nodes
* just like btrfs-progs
Please note that, although function btrfs_print_tree() is not called by
anyone right now, it's still a pretty useful function to debug kernel.
So that function is still kept for later use.
Nikolay Borisov [Wed, 18 Apr 2018 06:41:54 +0000 (09:41 +0300)]
btrfs: Fix race condition between delayed refs and blockgroup removal
When the delayed refs for a head are all run, eventually
cleanup_ref_head is called which (in case of deletion) obtains a
reference for the relevant btrfs_space_info struct by querying the bg
for the range. This is problematic because when the last extent of a
bg is deleted a race window emerges between removal of that bg and the
subsequent invocation of cleanup_ref_head. This can result in cache being null
and either a null pointer dereference or assertion failure.
To fix this, introduce a new flag "is_system" to head_ref structs,
which is populated at insertion time. This allows to decouple the
querying for the spaceinfo from querying the possibly deleted bg.
David Howells [Wed, 18 Apr 2018 08:38:34 +0000 (09:38 +0100)]
afs: Fix server record deletion
AFS server records get removed from the net->fs_servers tree when
they're deleted, but not from the net->fs_addresses{4,6} lists, which
can lead to an oops in afs_find_server() when a server record has been
removed, for instance during rmmod.
Fix this by deleting the record from the by-address lists before posting
it for RCU destruction.
The reason this hasn't been noticed before is that the fileserver keeps
probing the local cache manager, thereby keeping the service record
alive, so the oops would only happen when a fileserver eventually gets
bored and stops pinging or if the module gets rmmod'd and a call comes
in from the fileserver during the window between the server records
being destroyed and the socket being closed.
Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Signed-off-by: David Howells <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
1) Unbalanced refcounting in TIPC, from Jon Maloy.
2) Only allow TCP_MD5SIG to be set on sockets in close or listen state.
Once the connection is established it makes no sense to change this.
From Eric Dumazet.
3) Missing attribute validation in neigh_dump_table(), also from Eric
Dumazet.
4) Fix address comparisons in SCTP, from Xin Long.
5) Neigh proxy table clearing can deadlock, from Wolfgang Bumiller.
6) Fix tunnel refcounting in l2tp, from Guillaume Nault.
7) Fix double list insert in team driver, from Paolo Abeni.
8) af_vsock.ko module was accidently made unremovable, from Stefan
Hajnoczi.
9) Fix reference to freed llc_sap object in llc stack, from Cong Wang.
10) Don't assume netdevice struct is DMA'able memory in virtio_net
driver, from Michael S. Tsirkin.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits)
net/smc: fix shutdown in state SMC_LISTEN
bnxt_en: Fix memory fault in bnxt_ethtool_init()
virtio_net: sparse annotation fix
virtio_net: fix adding vids on big-endian
virtio_net: split out ctrl buffer
net: hns: Avoid action name truncation
docs: ip-sysctl.txt: fix name of some ipv6 variables
vmxnet3: fix incorrect dereference when rxvlan is disabled
llc: hold llc_sap before release_sock()
MAINTAINERS: Direct networking documentation changes to netdev
atm: iphase: fix spelling mistake: "Tansmit" -> "Transmit"
net: qmi_wwan: add Wistron Neweb D19Q1
net: caif: fix spelling mistake "UKNOWN" -> "UNKNOWN"
net: stmmac: Disable ACS Feature for GMAC >= 4
net: mvpp2: Fix DMA address mask size
net: change the comment of dev_mc_init
net: qualcomm: rmnet: Fix warning seen with fill_info
tun: fix vlan packet truncation
tipc: fix infinite loop when dumping link monitor summary
tipc: fix use-after-free in tipc_nametbl_stop
...
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
"Assorted fixes.
Some of that is only a matter with fault injection (broken handling of
small allocation failure in various mount-related places), but the
last one is a root-triggerable stack overflow, and combined with
userns it gets really nasty ;-/"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
Don't leak MNT_INTERNAL away from internal mounts
mm,vmscan: Allow preallocating memory for register_shrinker().
rpc_pipefs: fix double-dput()
orangefs_kill_sb(): deal with allocation failures
jffs2_kill_sb(): deal with failed allocations
hypfs_kill_super(): deal with failed allocations
Merge tag 'ecryptfs-4.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
Pull eCryptfs fixes from Tyler Hicks:
"Minor cleanups and a bug fix to completely ignore unencrypted
filenames in the lower filesystem when filename encryption is enabled
at the eCryptfs layer"
* tag 'ecryptfs-4.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
eCryptfs: don't pass up plaintext names when using filename encryption
ecryptfs: fix spelling mistake: "cadidate" -> "candidate"
ecryptfs: lookup: Don't check if mount_crypt_stat is NULL
Merge tag 'for_v4.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
- isofs memory leak fix
- two fsnotify fixes of event mask handling
- udf fix of UTF-16 handling
- couple other smaller cleanups
* tag 'for_v4.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Fix leak of UTF-16 surrogates into encoded strings
fs: ext2: Adding new return type vm_fault_t
isofs: fix potential memory leak in mount option parsing
MAINTAINERS: add an entry for FSNOTIFY infrastructure
fsnotify: fix typo in a comment about mark->g_list
fsnotify: fix ignore mask logic in send_to_group()
isofs compress: Remove VLA usage
fs: quota: Replace GFP_ATOMIC with GFP_KERNEL in dquot_init
fanotify: fix logic of events on child
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
Pull HID updates from Jiri Kosina:
- suspend/resume handling fix for Raydium I2C-connected touchscreen
from Aaron Ma
- protocol fixup for certain BT-connected Wacoms from Aaron Armstrong
Skomra
- battery level reporting fix on BT-connected mice from Dmitry Torokhov
- hidraw race condition fix from Rodrigo Rivas Costa
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
HID: i2c-hid: fix inverted return value from i2c_hid_command()
HID: i2c-hid: Fix resume issue on Raydium touchscreen device
HID: wacom: bluetooth: send exit report for recent Bluetooth devices
HID: hidraw: Fix crash on HIDIOCGFEATURE with a destroyed device
HID: input: fix battery level reporting on BT mice
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatching fix from Jiri Kosina:
"Shadow variable API list_head initialization fix from Petr Mladek"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: Allow to call a custom callback when freeing shadow variables
livepatch: Initialize shadow variables safely by a custom callback
Paolo Abeni [Fri, 20 Apr 2018 11:18:16 +0000 (13:18 +0200)]
tun: do not compute the rxhash, if not needed
Currently, the tun driver, in absence of an eBPF steering program,
always compute the rxhash in its rx path, even when such value
is later unused due to additional checks (
This changeset moves the all the related checks just before the
__skb_get_hash_symmetric(), so that the latter is no more computed
when unneeded.
Also replace an unneeded RCU section with rcu_access_pointer().
====================
lan78xx: Read configuration from Device Tree
The Microchip LAN78XX family of devices are Ethernet controllers with
a USB interface. Despite being discoverable devices it can be useful to
be able to configure them from Device Tree, particularly in low-cost
applications without an EEPROM or programmed OTP.
This patch set adds support for reading the MAC address and LED modes from
Device Tree.
v4:
- Rename nodes in bindings doc.
v3:
- Move LED setting into PHY driver.
v2:
- Use eth_platform_get_mac_address.
- Support up to 4 LEDs, and move LED mode constants into dt-bindings header.
- Improve bindings document.
- Remove EEE support.
====================
Phil Elwell [Thu, 19 Apr 2018 16:59:40 +0000 (17:59 +0100)]
dt-bindings: Document the DT bindings for lan78xx
The Microchip LAN78XX family of devices are Ethernet controllers with
a USB interface. Despite being discoverable devices it can be useful to
be able to configure them from Device Tree, particularly in low-cost
applications without an EEPROM or programmed OTP.
Document the supported properties in a bindings file.
Phil Elwell [Thu, 19 Apr 2018 16:59:39 +0000 (17:59 +0100)]
lan78xx: Read LED states from Device Tree
Add support for DT property "microchip,led-modes", a vector of zero
to four cells (u32s) in the range 0-15, each of which sets the mode
for one of the LEDs. Some possible values are:
These values are given symbolic constants in a dt-bindings header.
Also use the presence of the DT property to indicate that the
LEDs should be enabled - necessary in the event that no valid OTP
or EEPROM is available.
Phil Elwell [Thu, 19 Apr 2018 16:59:38 +0000 (17:59 +0100)]
lan78xx: Read MAC address from DT if present
There is a standard mechanism for locating and using a MAC address from
the Device Tree. Use this facility in the lan78xx driver to support
applications without programmed EEPROM or OTP. At the same time,
regularise the handling of the different address sources.
Use cpumask_local_spread to provide interrupt affinity hints
for each queue. This will spread interrupts across NUMA local
CPUs first, extending to remote nodes if needed.
Eric Dumazet [Thu, 19 Apr 2018 16:14:53 +0000 (09:14 -0700)]
net/ipv6: Fix ip6_convert_metrics() bug
If ip6_convert_metrics() fails to allocate memory, it should not
overwrite rt->fib6_metrics or we risk a crash later as syzbot found.
BUG: KASAN: null-ptr-deref in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
BUG: KASAN: null-ptr-deref in refcount_sub_and_test+0x92/0x330 lib/refcount.c:179
Read of size 4 at addr 0000000000000044 by task syzkaller832429/4487
Merge tag 'for-linus-4.17-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- some fixes of kmalloc() flags
- one fix of the xenbus driver
- an update of the pv sound driver interface needed for a driver which
will go through the sound tree
* tag 'for-linus-4.17-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: xenbus_dev_frontend: Really return response string
xen/sndif: Sync up with the canonical definition in Xen
xen: xen-pciback: Replace GFP_ATOMIC with GFP_KERNEL in pcistub_reg_add
xen: xen-pciback: Replace GFP_ATOMIC with GFP_KERNEL in xen_pcibk_config_quirks_init
xen: xen-pciback: Replace GFP_ATOMIC with GFP_KERNEL in pcistub_device_alloc
xen: xen-pciback: Replace GFP_ATOMIC with GFP_KERNEL in pcistub_init_device
xen: xen-pciback: Replace GFP_ATOMIC with GFP_KERNEL in pcistub_probe
====================
qed* : Use trust mode to override forced MAC
This patchset adds a support to override forced MAC (MAC set by PF for a VF)
when trust mode is enabled using
First patch adds a real change to use .ndo_set_vf_trust to override forced MAC
and allow user to change VFs from VF interface itself.
Second patch takes care of a corner case, where MAC change from VF won't
take effect when VF interface is down, by introducing a new TLV
(a way to send message from VF to PF) to give a hint to PF to update
its bulletin board.
====================
qed* : Add new TLV to request PF to update MAC in bulletin board
There may be a need for VF driver to request PF to explicitly update its
bulletin with a MAC address.
e.g. When user assigns a MAC address to VF while VF is still down,
and PF's bulletin board contains different MAC address, in this case,
when VF's interface is brought up, it gets loaded with MAC address from
bulletin board which is not desirable.
To handle this corner case, we need a new TLV to request PF to update
its bulletin board with suggested MAC.
This request will be honored only for trusted VFs.
qed* : use trust mode to allow VF to override forced MAC
As per existing behavior, when PF sets a MAC address for a VF
(also called as forced MAC), VF is not allowed to change its
MAC address afterwards.
This puts the limitation on few use cases such as bonding of VFs,
where bonding driver asks VF to change its MAC address.
This patch uses a VF trust mode to allow VF to change its MAC address
in spite PF has set a forced MAC for that VF.
Merge tag 'mips_fixes_4.17_1' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips
Pull MIPS fixes from James Hogan:
- io: Add barriers to read*() & write*()
- dts: Fix boston PCI bus DTC warnings (4.17)
- memset: Several corner case fixes (one 3.10, others longer)
* tag 'mips_fixes_4.17_1' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips:
MIPS: uaccess: Add micromips clobbers to bzero invocation
MIPS: memset.S: Fix clobber of v1 in last_fixup
MIPS: memset.S: Fix return of __clear_user from Lpartial_fixup
MIPS: memset.S: EVA & fault support for small_memset
MIPS: dts: Boston: Fix PCI bus dtc warnings:
MIPS: io: Add barrier after register read in readX()
MIPS: io: Prevent compiler reordering writeX()
Merge tag 'powerpc-4.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix an off-by-one bug in our alternative asm patching which leads to
incorrectly patched code. This bug lay dormant for nearly 10 years
but we finally hit it due to a recent change.
- Fix lockups when running KVM guests on Power8 due to a missing check
when a thread that's running KVM comes out of idle.
- Fix an out-of-spec behaviour in the XIVE code (P9 interrupt
controller).
- Fix EEH handling of bridge MMIO windows.
- Prevent crashes in our RFI fallback flush handler if firmware didn't
tell us the size of the L1 cache (only seen on simulators).
Thanks to: Benjamin Herrenschmidt, Madhavan Srinivasan, Michael Neuling.
* tag 'powerpc-4.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/kvm: Fix lockups when running KVM guests on Power8
powerpc/eeh: Fix enabling bridge MMIO windows
powerpc/xive: Fix trying to "push" an already active pool VP
powerpc/64s: Default l1d_size to 64K in RFI fallback flush
powerpc/lib: Fix off-by-one in alternate feature patching
David S. Miller [Fri, 20 Apr 2018 15:20:06 +0000 (11:20 -0400)]
Merge branch 'geneve-mtu'
Alexey Kodanev says:
====================
geneve: verify user specified MTU or adjust with a lower device
The first two patches don't introduce any functional changes and
contain minor cleanups for code readability.
The last one adds a new function geneve_link_config() similar to the
other tunnels. The function will be used on a new link creation or
when 'remote' parameter is changed. It adjusts a user specified MTU
or, if it finds a lower device, tunes the tunnel MTU using it.
====================
Currently, on a new link creation or when 'remote' address parameter
is updated, an MTU is not changed and always equals 1500. When a lower
device has a larger MTU, it might not be efficient, e.g. for UDP, and
requires the manual MTU adjustments to match the MTU of the lower
device.
This patch tries to automate this process, finds a lower device using
the 'remote' address parameter, then uses its MTU to tune GENEVE's MTU:
* on a new link creation
* when 'remote' parameter is changed
Also with this patch, the MTU from a user, on a new link creation, is
passed to geneve_change_mtu() where it is verified, and MTU adjustments
with a lower device is skipped in that case. Prior that change, it was
possible to set the invalid MTU values on a new link creation.
geneve: check MTU for a minimum in geneve_change_mtu()
geneve_change_mtu() will be used not only as ndo_change_mtu() callback,
but also to verify a user specified MTU on a new link creation in the
next patch.
geneve: cleanup hard coded value for Ethernet header length
Use ETH_HLEN instead and introduce two new macros: GENEVE_IPV4_HLEN
and GENEVE_IPV6_HLEN that include Ethernet header length, corresponded
IP header length and GENEVE_BASE_HLEN.
====================
tipc: Confgiuration of MTU for media UDP
Systematic measurements have shown that an emulated MTU of 14k for
UDP bearers is the optimal value for maximal throughput. Accordingly,
the default MTU of UDP bearers is changed to 14k.
We also provide users with a fallback option from this value,
by providing support to configure MTU for UDP bearers. The following
options are introduced which are symmetrical to the design of
confguring link tolerance.
- Configure media with new MTU value, which will take effect on
links going up after the moment it was configured. Alternatively,
the bearer has to be disabled and re-enabled, for existing links to
reflect the configured value.
- Configure bearer with new MTU value, which take effect on
running links dynamically.
Please note:
- User has to change MTU at both endpoints, otherwise the link
will fall back to smallest MTU after a reset.
- Failover from a link with higher MTU to a link with lower MTU
====================
tipc: confgiure and apply UDP bearer MTU on running links
Currently, we have option to configure MTU of UDP media. The configured
MTU takes effect on the links going up after that moment. I.e, a user
has to reset bearer to have new value applied across its links. This is
confusing and disturbing on a running cluster.
We now introduce the functionality to change the default UDP bearer MTU
in struct tipc_bearer. Additionally, the links are updated dynamically,
without any need for a reset, when bearer value is changed. We leverage
the existing per-link functionality and the design being symetrical to
the confguration of link tolerance.
In previous commit, we changed the default emulated MTU for UDP bearers
to 14k.
This commit adds the functionality to set/change the default value
by configuring new MTU for UDP media. UDP bearer(s) have to be disabled
and enabled back for the new MTU to take effect.
Currently, all bearers are configured with MTU value same as the
underlying L2 device. However, in case of bearers with media type
UDP, higher throughput is possible with a fixed and higher emulated
MTU value than adapting to the underlying L2 MTU.
In this commit, we introduce a parameter mtu in struct tipc_media
and a default value is set for UDP. A default value of 14k
was determined by experimentation and found to have a higher throughput
than 16k. MTU for UDP bearers are assigned the above set value of
media MTU.
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes and kexec-file-load from Martin Schwidefsky:
"After the common code kexec patches went in via Andrew we can now push
the architecture parts to implement the kexec-file-load system call.
Plus a few more bug fixes and cleanups, this includes an update to the
default configurations"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/signal: cleanup uapi struct sigaction
s390: rename default_defconfig to debug_defconfig
s390: remove gcov defconfig
s390: update defconfig
s390: add support for IBM z14 Model ZR1
s390: remove couple of duplicate includes
s390/boot: remove unused COMPILE_VERSION and ccflags-y
s390/nospec: include cpu.h
s390/decompressor: Ignore file vmlinux.bin.full
s390/kexec_file: add generated files to .gitignore
s390/Kconfig: Move kexec config options to "Processor type and features"
s390/kexec_file: Add ELF loader
s390/kexec_file: Add crash support to image loader
s390/kexec_file: Add image loader
s390/kexec_file: Add kexec_file_load system call
s390/kexec_file: Add purgatory
s390/kexec_file: Prepare setup.h for kexec_file_load
s390/smsgiucv: disable SMSG on module unload
s390/sclp: avoid potential usage of uninitialized value
net: ethernet: ave: add support for phy-mode setting of system controller
This patch adds support for specifying system controller that configures
phy-mode setting.
According to the DT property "phy-mode", it's necessary to configure the
controller, which is used to choose the settings of the MAC suitable,
for example, mdio pin connections, internal clocks, and so on.
Supported phy-modes are SoC-dependent. The driver allows phy-mode to set
"internal" if the SoC has a built-in PHY, and {"mii", "rmii", "rgmii"}
if the SoC supports each mode. So we have to check whether the phy-mode
is valid or not.
This adds the following features for each SoC:
- check whether the SoC supports the specified phy-mode
- configure the controller accroding to phy-mode
The DT property accepts one argument to distinguish them for multiple MAC
instances.
net: ethernet: ave: add multiple clocks and resets support as required property
When the link is becoming up for Pro4 SoC, the kernel is stalled
due to some missing clocks and resets.
The AVE block for Pro4 is connected to the GIO bus in the SoC.
Without its clock/reset, the access to the AVE register makes the
system stall.
In the same way, another MAC clock for Giga-bit Connection and
the PHY clock are also required for Pro4 to activate the Giga-bit feature
and to recognize the PHY.
To satisfy these requirements, this patch adds support for multiple clocks
and resets, and adds the clock-names and reset-names to the binding because
we need to distinguish clock/reset for the AVE main block and the others.
Also, make the resets a required property. Currently, "reset is
optional" relies on that the bootloader or firmware has deasserted
the reset before booting the kernel. Drivers should work without
such expectation.
mdiobus_register will search for any mdiobus board info registered for
the bus being registered. If found, it will probe devices on the bus.
That device, if for example it is an ethernet switch, may then try to
register an mdio bus. Thus we need to allow recursive calls to
mdiobus_register.
Holding the mdio_board_lock will cause a deadlock during this
recursion. Release the lock and use list_for_each_entry_safe.
Oskar Senft [Fri, 23 Mar 2018 13:11:30 +0000 (09:11 -0400)]
perf/x86/intel/uncore: Fix SBOX support for Broadwell CPUs
SBOX on some Broadwell CPUs is broken because it's enabled unconditionally
despite the fact that there are no SBOXes available.
Check the Power Control Unit CAPID4 register to determine the number of
available SBOXes on the particular CPU before trying to enable them. If
there are none, nullify the SBOX descriptor so it isn't tried to be
initialized.
x86/power/64: Fix page-table setup for temporary text mapping
On a system with 4-level page-tables there is no p4d, so the pud in the pgd
should be mapped. The old code before commit fb43d6cb91ef already did that.
The change from above commit causes an invalid page-table which causes
undefined behavior. In one report it caused triple faults.