Guo Ren [Tue, 4 Jun 2019 10:54:48 +0000 (18:54 +0800)]
csky: Fixup some error count in 810 & 860.
CK810 pmu only support event with index 0-8 and 0xd; CK860 only
support event 1~4, 0xa~0x1b. So do not register unsupport event
to hardware cache event, which may leader to unknown behavior.
Mao Han [Tue, 4 Jun 2019 10:54:49 +0000 (18:54 +0800)]
csky: Fix perf record in kernel/user space
csky_pmu_event_init is called several times during the perf record
initialzation. After configure the event counter in either kernel
space or user space, csky_pmu_event_init is called twice with no
attr specified. Configuration will be overwritten with sampling in
both kernel space and user space. --all-kernel/--all-user is
useless without this patch applied.
Mao Han [Tue, 4 Jun 2019 10:54:45 +0000 (18:54 +0800)]
csky: Add count-width property for csky pmu
The csky pmu counter may have different io width. When the counter is
smaller then 64 bits and counter value is smaller than the old value, it
will result to a extremely large delta value. So the sampled value should
be extend to 64 bits to avoid this, the extension bits base on the
count-width property from dts.
Mao Han [Tue, 4 Jun 2019 10:54:44 +0000 (18:54 +0800)]
csky: Init pmu as a device
This patch change the csky pmu initialization from arch init to
device init. The pmu can be configued with information from
device tree(pmu device name, irq number and etc.).
Accessing 'current' in bpf context makes no sense, since packets
are processed from softirq context.
As Neal stated : The capability check in tcp_set_congestion_control()
was written assuming a system call context, and then was reused from
a BPF call site.
The fix is to add a new parameter to tcp_set_congestion_control(),
so that the ns_capable() call is only performed under the right
context.
In case of error, the function of_get_mac_address() returns ERR_PTR()
and never returns NULL. The NULL test in the return value check should
be replaced with IS_ERR().
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
Dan Williams [Thu, 18 Jul 2019 22:58:43 +0000 (15:58 -0700)]
mm/sparsemem: cleanup 'section number' data types
David points out that there is a mixture of 'int' and 'unsigned long'
usage for section number data types. Update the memory hotplug path to
use 'unsigned long' consistently for section numbers.
Dan Williams [Thu, 18 Jul 2019 22:58:40 +0000 (15:58 -0700)]
libnvdimm/pfn: stop padding pmem namespaces to section alignment
Now that the mm core supports section-unaligned hotplug of ZONE_DEVICE
memory, we no longer need to add padding at pfn/dax device creation
time. The kernel will still honor padding established by older kernels.
At namespace creation time there is the potential for the "expected to
be zero" fields of a 'pfn' info-block to be filled with indeterminate
data. While the kernel buffer is zeroed on allocation it is immediately
overwritten by nd_pfn_validate() filling it with the current contents of
the on-media info-block location. For fields like, 'flags' and the
'padding' it potentially means that future implementations can not rely on
those fields being zero.
In preparation to stop using the 'start_pad' and 'end_trunc' fields for
section alignment, arrange for fields that are not explicitly
initialized to be guaranteed zero. Bump the minor version to indicate
it is safe to assume the 'padding' and 'flags' are zero. Otherwise,
this corruption is expected to benign since all other critical fields
are explicitly initialized.
Note The cc: stable is about spreading this new policy to as many
kernels as possible not fixing an issue in those kernels. It is not
until the change titled "libnvdimm/pfn: Stop padding pmem namespaces to
section alignment" where this improper initialization becomes a problem.
So if someone decides to backport "libnvdimm/pfn: Stop padding pmem
namespaces to section alignment" (which is not tagged for stable), make
sure this pre-requisite is flagged.
Dan Williams [Thu, 18 Jul 2019 22:58:33 +0000 (15:58 -0700)]
mm/devm_memremap_pages: enable sub-section remap
Teach devm_memremap_pages() about the new sub-section capabilities of
arch_{add,remove}_memory(). Effectively, just replace all usage of
align_start, align_end, and align_size with res->start, res->end, and
resource_size(res). The existing sanity check will still make sure that
the two separate remap attempts do not collide within a sub-section (2MB
on x86).
Dan Williams [Thu, 18 Jul 2019 22:58:26 +0000 (15:58 -0700)]
mm/sparsemem: support sub-section hotplug
The libnvdimm sub-system has suffered a series of hacks and broken
workarounds for the memory-hotplug implementation's awkward
section-aligned (128MB) granularity.
For example the following backtrace is emitted when attempting
arch_add_memory() with physical address ranges that intersect 'System
RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section:
It was discovered that the problem goes beyond RAM vs PMEM collisions as
some platform produce PMEM vs PMEM collisions within a given section.
The libnvdimm workaround for that case revealed that the libnvdimm
section-alignment-padding implementation has been broken for a long
while.
A fix for that long-standing breakage introduces as many problems as it
solves as it would require a backward-incompatible change to the
namespace metadata interpretation. Instead of that dubious route [1],
address the root problem in the memory-hotplug implementation.
Note that EEXIST is no longer treated as success as that is how
sparse_add_section() reports subsection collisions, it was also obviated
by recent changes to perform the request_region() for 'System RAM'
before arch_add_memory() in the add_memory() sequence.
Dan Williams [Thu, 18 Jul 2019 22:58:22 +0000 (15:58 -0700)]
mm/sparsemem: prepare for sub-section ranges
Prepare the memory hot-{add,remove} paths for handling sub-section
ranges by plumbing the starting page frame and number of pages being
handled through arch_{add,remove}_memory() to
sparse_{add,remove}_one_section().
This is simply plumbing, small cleanups, and some identifier renames.
No intended functional changes.
Dan Williams [Thu, 18 Jul 2019 22:58:15 +0000 (15:58 -0700)]
mm/hotplug: kill is_dev_zone() usage in __remove_pages()
The zone type check was a leftover from the cleanup that plumbed altmap
through the memory hotplug path, i.e. commit da024512a1fa "mm: pass the
vmem_altmap to arch_remove_memory and __remove_pages".
Dan Williams [Thu, 18 Jul 2019 22:58:11 +0000 (15:58 -0700)]
mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap()
Allow sub-section sized ranges to be added to the memmap.
populate_section_memmap() takes an explict pfn range rather than
assuming a full section, and those parameters are plumbed all the way
through to vmmemap_populate(). There should be no sub-section usage in
current deployments. New warnings are added to clarify which memmap
allocation paths are sub-section capable.
Dan Williams [Thu, 18 Jul 2019 22:58:07 +0000 (15:58 -0700)]
mm/hotplug: prepare shrink_{zone, pgdat}_span for sub-section removal
Sub-section hotplug support reduces the unit of operation of hotplug
from section-sized-units (PAGES_PER_SECTION) to sub-section-sized units
(PAGES_PER_SUBSECTION). Teach shrink_{zone,pgdat}_span() to consider
PAGES_PER_SUBSECTION boundaries as the points where pfn_valid(), not
valid_section(), can toggle.
Dan Williams [Thu, 18 Jul 2019 22:58:04 +0000 (15:58 -0700)]
mm/sparsemem: add helpers track active portions of a section at boot
Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
sub-section active bitmask, each bit representing a PMD_SIZE span of the
architecture's memory hotplug section size.
The implications of a partially populated section is that pfn_valid()
needs to go beyond a valid_section() check and either determine that the
section is an "early section", or read the sub-section active ranges
from the bitmask. The expectation is that the bitmask (subsection_map)
fits in the same cacheline as the valid_section() / early_section()
data, so the incremental performance overhead to pfn_valid() should be
negligible.
The rationale for using early_section() to short-ciruit the
subsection_map check is that there are legacy code paths that use
pfn_valid() at section granularity before validating the pfn against
pgdat data. So, the early_section() check allows those traditional
assumptions to persist while also permitting subsection_map to tell the
truth for purposes of populating the unused portions of early sections
with PMEM and other ZONE_DEVICE mappings.
Dan Williams [Thu, 18 Jul 2019 22:58:00 +0000 (15:58 -0700)]
mm/sparsemem: introduce a SECTION_IS_EARLY flag
In preparation for sub-section hotplug, track whether a given section
was created during early memory initialization, or later via memory
hotplug. This distinction is needed to maintain the coarse expectation
that pfn_valid() returns true for any pfn within a given section even if
that section has pages that are reserved from the page allocator.
For example one of the of goals of subsection hotplug is to support
cases where the system physical memory layout collides System RAM and
PMEM within a section. Several pfn_valid() users expect to just check
if a section is valid, but they are not careful to check if the given
pfn is within a "System RAM" boundary and instead expect pgdat
information to further validate the pfn.
Rather than unwind those paths to make their pfn_valid() queries more
precise a follow on patch uses the SECTION_IS_EARLY flag to maintain the
traditional expectation that pfn_valid() returns true for all early
sections.
Dan Williams [Thu, 18 Jul 2019 22:57:57 +0000 (15:57 -0700)]
mm/sparsemem: introduce struct mem_section_usage
Patch series "mm: Sub-section memory hotplug support", v10.
The memory hotplug section is an arbitrary / convenient unit for memory
hotplug. 'Section-size' units have bled into the user interface
('memblock' sysfs) and can not be changed without breaking existing
userspace. The section-size constraint, while mostly benign for typical
memory hotplug, has and continues to wreak havoc with 'device-memory'
use cases, persistent memory (pmem) in particular. Recall that pmem
uses devm_memremap_pages(), and subsequently arch_add_memory(), to
allocate a 'struct page' memmap for pmem. However, it does not use the
'bottom half' of memory hotplug, i.e. never marks pmem pages online and
never exposes the userspace memblock interface for pmem. This leaves an
opening to redress the section-size constraint.
To date, the libnvdimm subsystem has attempted to inject padding to
satisfy the internal constraints of arch_add_memory(). Beyond
complicating the code, leading to bugs [2], wasting memory, and limiting
configuration flexibility, the padding hack is broken when the platform
changes this physical memory alignment of pmem from one boot to the
next. Device failure (intermittent or permanent) and physical
reconfiguration are events that can cause the platform firmware to
change the physical placement of pmem on a subsequent boot, and device
failure is an everyday event in a data-center.
It turns out that sections are only a hard requirement of the
user-facing interface for memory hotplug and with a bit more
infrastructure sub-section arch_add_memory() support can be added for
kernel internal usages like devm_memremap_pages(). Here is an analysis
of the current design assumptions in the current code and how they are
addressed in the new implementation:
Current design assumptions:
- Sections that describe boot memory (early sections) are never
unplugged / removed.
- pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y, case devolves to a
valid_section() check
- __add_pages() and helper routines assume all operations occur in
PAGES_PER_SECTION units.
- The memblock sysfs interface only comprehends full sections
New design assumptions:
- Sections are instrumented with a sub-section bitmask to track (on
x86) individual 2MB sub-divisions of a 128MB section.
- Partially populated early sections can be extended with additional
sub-sections, and those sub-sections can be removed with
arch_remove_memory(). With this in place we no longer lose usable
memory capacity to padding.
- pfn_valid() is updated to look deeper than valid_section() to also
check the active-sub-section mask. This indication is in the same
cacheline as the valid_section() so the performance impact is
expected to be negligible. So far the lkp robot has not reported any
regressions.
- Outside of the core vmemmap population routines which are replaced,
other helper routines like shrink_{zone,pgdat}_span() are updated to
handle the smaller granularity. Core memory hotplug routines that
deal with online memory are not touched.
- The existing memblock sysfs user api guarantees / assumptions are not
touched since this capability is limited to !online
!memblock-sysfs-accessible sections.
Meanwhile the issue reports continue to roll in from users that do not
understand when and how the 128MB constraint will bite them. The current
implementation relied on being able to support at least one misaligned
namespace, but that immediately falls over on any moderately complex
namespace creation attempt. Beyond the initial problem of 'System RAM'
colliding with pmem, and the unsolvable problem of physical alignment
changes, Linux is now being exposed to platforms that collide pmem ranges
with other pmem ranges by default [3]. In short, devm_memremap_pages()
has pushed the venerable section-size constraint past the breaking point,
and the simplicity of section-aligned arch_add_memory() is no longer
tenable.
These patches are exposed to the kbuild robot on a subsection-v10 branch
[4], and a preview of the unit test for this functionality is available
on the 'subsection-pending' branch of ndctl [5].
Towards enabling memory hotplug to track partial population of a section,
introduce 'struct mem_section_usage'.
A pointer to a 'struct mem_section_usage' instance replaces the existing
pointer to a 'pageblock_flags' bitmap. Effectively it adds one more
'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
a new 'subsection_map' bitmap. The new bitmap enables the memory
hot{plug,remove} implementation to act on incremental sub-divisions of a
section.
SUBSECTION_SHIFT is defined as global constant instead of per-architecture
value like SECTION_SIZE_BITS in order to allow cross-arch compatibility of
subsection users. Specifically a common subsection size allows for the
possibility that persistent memory namespace configurations be made
compatible across architectures.
The primary motivation for this functionality is to support platforms that
mix "System RAM" and "Persistent Memory" within a single section, or
multiple PMEM ranges with different mapping lifetimes within a single
section. The section restriction for hotplug has caused an ongoing saga
of hacks and bugs for devm_memremap_pages() users.
Beyond the fixups to teach existing paths how to retrieve the 'usemap'
from a section, and updates to usemap allocation path, there are no
expected behavior changes.
mm/memory_hotplug: rename walk_memory_range() and pass start+size instead of pfns
walk_memory_range() was once used to iterate over sections. Now, it
iterates over memory blocks. Rename the function, fixup the
documentation.
Also, pass start+size instead of PFNs, which is what most callers
already have at hand. (we'll rework link_mem_sections() most probably
soon)
Follow-up patches will rework, simplify, and move walk_memory_blocks()
to drivers/base/memory.c.
Note: walk_memory_blocks() only works correctly right now if the
start_pfn is aligned to a section start. This is the case right now,
but we'll generalize the function in a follow up patch so the semantics
match the documentation.
Patch series "mm: Further memory block device cleanups", v1.
Some further cleanups around memory block devices. Especially, clean up
and simplify walk_memory_range(). Including some other minor cleanups.
This patch (of 6):
We are using a mixture of "int" and "unsigned long". Let's make this
consistent by using "unsigned long" everywhere. We'll do the same with
memory block ids next.
While at it, turn the "unsigned long i" in removable_show() into an int
- sections_per_block is an int.
Nadav Amit [Thu, 18 Jul 2019 22:57:34 +0000 (15:57 -0700)]
resource: avoid unnecessary lookups in find_next_iomem_res()
find_next_iomem_res() shows up to be a source for overhead in dax
benchmarks.
Improve performance by not considering children of the tree if the top
level does not match. Since the range of the parents should include the
range of the children such check is redundant.
Running sysbench on dax (pmem emulation, with write_cache disabled):
sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
--file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run
Nadav Amit [Thu, 18 Jul 2019 22:57:31 +0000 (15:57 -0700)]
resource: fix locking in find_next_iomem_res()
Since resources can be removed, locking should ensure that the resource
is not removed while accessing it. However, find_next_iomem_res() does
not hold the lock while copying the data of the resource.
Keep holding the lock while the data is copied. While at it, change the
return value to a more informative value. It is disregarded by the
callers.
Yang Shi [Thu, 18 Jul 2019 22:57:27 +0000 (15:57 -0700)]
mm: thp: fix false negative of shmem vma's THP eligibility
Commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each
vma") introduced THPeligible bit for processes' smaps. But, when
checking the eligibility for shmem vma, __transparent_hugepage_enabled()
is called to override the result from shmem_huge_enabled(). It may
result in the anonymous vma's THP flag override shmem's. For example,
running a simple test which create THP for shmem, but with anonymous THP
disabled, when reading the process's smaps, it may show:
And, /proc/meminfo does show THP allocated and PMD mapped too:
ShmemHugePages: 4096 kB
ShmemPmdMapped: 4096 kB
This doesn't make too much sense. The shmem objects should be treated
separately from anonymous THP. Calling shmem_huge_enabled() with
checking MMF_DISABLE_THP sounds good enough. And, we could skip stack
and dax vma check since we already checked if the vma is shmem already.
Also check if vma is suitable for THP by calling
transhuge_vma_suitable().
And minor fix to smaps output format and documentation.
Yang Shi [Thu, 18 Jul 2019 22:57:24 +0000 (15:57 -0700)]
mm: thp: make transhuge_vma_suitable available for anonymous THP
transhuge_vma_suitable() was only available for shmem THP, but anonymous
THP has the same check except pgoff check. And, it will be used for THP
eligible check in the later patch, so make it available for all kind of
THPs. This also helps reduce code duplication slightly.
Since anonymous THP doesn't have to check pgoff, so make pgoff check
shmem vma only.
And regroup some functions in include/linux/mm.h to solve compile issue
since transhuge_vma_suitable() needs call vma_is_anonymous() which was
defined after huge_mm.h is included.
Wei Yang [Thu, 18 Jul 2019 22:57:21 +0000 (15:57 -0700)]
mm/sparse.c: set section nid for hot-add memory
In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
section_to_node_table[]. While for hot-add memory, this is missed.
Without this information, page_to_nid() may not give the right node id.
BTW, current online_pages works because it leverages nid in
memory_block. But the granularity of node id should be mem_section
wide.
mm/memory_hotplug: remove "zone" parameter from sparse_remove_one_section
The parameter is unused, so let's drop it. Memory removal paths should
never care about zones. This is the job of memory offlining and will
require more refactorings.
mm/memory_hotplug: make unregister_memory_block_under_nodes() never fail
We really don't want anything during memory hotunplug to fail. We
always pass a valid memory block device, that check can go. Avoid
allocating memory and eventually failing. As we are always called under
lock, we can use a static piece of memory. This avoids having to put
the structure onto the stack, having to guess about the stack size of
callers.
Patch inspired by a patch from Oscar Salvador.
In the future, there might be no need to iterate over nodes at all.
mem->nid should tell us exactly what to remove. Memory block devices
with mixed nodes (added during boot) should properly fenced off and
never removed.
mm/memory_hotplug: remove memory block devices before arch_remove_memory()
Let's factor out removing of memory block devices, which is only
necessary for memory added via add_memory() and friends that created
memory block devices. Remove the devices before calling
arch_remove_memory().
This finishes factoring out memory block device handling from
arch_add_memory() and arch_remove_memory().
mm/memory_hotplug: create memory block devices after arch_add_memory()
Only memory to be added to the buddy and to be onlined/offlined by user
space using /sys/devices/system/memory/... needs (and should have!)
memory block devices.
Factor out creation of memory block devices. Create all devices after
arch_add_memory() succeeded. We can later drop the want_memblock
parameter, because it is now effectively stale.
Only after memory block devices have been added, memory can be onlined
by user space. This implies, that memory is not visible to user space
at all before arch_add_memory() succeeded.
While at it
- use WARN_ON_ONCE instead of BUG_ON in moved unregister_memory()
- introduce find_memory_block_by_id() to search via block id
- Use find_memory_block_by_id() in init_memory_block() to catch
duplicates
mm/memory_hotplug: allow arch_remove_memory() without CONFIG_MEMORY_HOTREMOVE
We want to improve error handling while adding memory by allowing to use
arch_remove_memory() and __remove_pages() even if
CONFIG_MEMORY_HOTREMOVE is not set to e.g., implement something like:
arch_add_memory()
rc = do_something();
if (rc) {
arch_remove_memory();
}
We won't get rid of CONFIG_MEMORY_HOTREMOVE for now, as it will require
quite some dependencies for memory offlining.
A proper arch_remove_memory() implementation is on its way, which also
cleanly removes page tables in arch_add_memory() in case something goes
wrong.
As we want to use arch_remove_memory() in case something goes wrong
during memory hotplug after arch_add_memory() finished, let's add a
temporary hack that is sufficient enough until we get a proper
implementation that cleans up page table entries.
We will remove CONFIG_MEMORY_HOTREMOVE around this code in follow up
patches.
mm/memory_hotplug: simplify and fix check_hotplug_memory_range()
Patch series "mm/memory_hotplug: Factor out memory block devicehandling", v3.
We only want memory block devices for memory to be onlined/offlined
(add/remove from the buddy). This is required so user space can
online/offline memory and kdump gets notified about newly onlined
memory.
Let's factor out creation/removal of memory block devices. This helps
to further cleanup arch_add_memory/arch_remove_memory() and to make
implementation of new features easier - especially sub-section memory
hot add from Dan.
Anshuman Khandual is currently working on arch_remove_memory(). I added
a temporary solution via "arm64/mm: Add temporary arch_remove_memory()
implementation", that is sufficient as a firsts tep in the context of
this series. (we don't cleanup page tables in case anything goes wrong
already)
Did a quick sanity test with DIMM plug/unplug, making sure all devices
and sysfs links properly get added/removed. Compile tested on s390x and
x86-64.
This patch (of 11):
By converting start and size to page granularity, we actually ignore
unaligned parts within a page instead of properly bailing out with an
error.
Michael Chan [Wed, 17 Jul 2019 07:07:23 +0000 (03:07 -0400)]
bnxt_en: Fix VNIC accounting when enabling aRFS on 57500 chips.
Unlike legacy chips, 57500 chips don't need additional VNIC resources
for aRFS/ntuple. Fix the code accordingly so that we don't reserve
and allocate additional VNICs on 57500 chips. Without this patch,
the driver is failing to initialize when it tries to allocate extra
VNICs.
Fixes: ac33906c67e2 ("bnxt_en: Add support for aRFS on 57500 chips.") Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Ronnie Sahlberg [Thu, 18 Jul 2019 22:12:11 +0000 (08:12 +1000)]
cifs: flush before set-info if we have writeable handles
Servers can defer destaging any data and updating the mtime until close().
This means that if we do a setinfo to modify the mtime while other handles
are open for write the server may overwrite our setinfo timestamps when
if flushes the file on close() of the writeable handle.
To solve this we add an explicit flush when the mtime is about to
be updated.
This fixes "cp -p" to preserve mtime when copying a file onto an SMB2 share.
Steve French [Thu, 18 Jul 2019 22:22:18 +0000 (17:22 -0500)]
smb3: optimize open to not send query file internal info
We can cut one third of the traffic on open by not querying the
inode number explicitly via SMB3 query_info since it is now
returned on open in the qfid context.
This is better in multiple ways, and
speeds up file open about 10% (more if network is slow).
Merge tag 'for-5.3/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull more device mapper updates from Mike Snitzer:
- Fix zone state management race in DM zoned target by eliminating the
unnecessary DMZ_ACTIVE state.
- A couple fixes for issues the DM snapshot target's optional discard
support added during first week of the 5.3 merge.
- Increase default size of outstanding IO that is allowed for a each
dm-kcopyd client and introduce tunable to allow user adjust.
- Update DM core to use printk ratelimiting functions rather than
duplicate them and in doing so fix an issue where DMDEBUG_LIMIT()
rate limited KERN_DEBUG messages had excessive "callbacks suppressed"
messages.
* tag 'for-5.3/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: use printk ratelimiting functions
dm kcopyd: Increase default sub-job size to 512KB
dm snapshot: fix oversights in optional discard support
dm zoned: fix zone state management race
Merge tag 'nfs-for-5.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- SUNRPC: Ensure bvecs are re-synced when we re-encode the RPC
request
- Fix an Oops in ff_layout_track_ds_error due to a PTR_ERR()
dereference
- Revert buggy NFS readdirplus optimisation
- NFSv4: Handle the special Linux file open access mode
- pnfs: Fix a problem where we gratuitously start doing I/O through
the MDS
Features:
- Allow NFS client to set up multiple TCP connections to the server
using a new 'nconnect=X' mount option. Queue length is used to
balance load.
- Enhance statistics reporting to report on all transports when using
multiple connections.
- Speed up SUNRPC by removing bh-safe spinlocks
- Add a mechanism to allow NFSv4 to request that containers set a
unique per-host identifier for when the hostname is not set.
- Ensure NFSv4 updates the lease_time after a clientid update
Bugfixes and cleanup:
- Fix use-after-free in rpcrdma_post_recvs
- Fix a memory leak when nfs_match_client() is interrupted
- Fix buggy file access checking in NFSv4 open for execute
- disable unsupported client side deduplication
- Fix spurious client disconnections
- Fix occasional RDMA transport deadlock
- Various RDMA cleanups
- Various tracepoint fixes
- Fix the TCP callback channel to guarantee the server can actually
send the number of callback requests that was negotiated at mount
time"
* tag 'nfs-for-5.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (68 commits)
pnfs/flexfiles: Add tracepoints for detecting pnfs fallback to MDS
pnfs: Fix a problem where we gratuitously start doing I/O through the MDS
SUNRPC: Optimise transport balancing code
SUNRPC: Ensure the bvecs are reset when we re-encode the RPC request
pnfs/flexfiles: Fix PTR_ERR() dereferences in ff_layout_track_ds_error
NFSv4: Don't use the zero stateid with layoutget
SUNRPC: Fix up backchannel slot table accounting
SUNRPC: Fix initialisation of struct rpc_xprt_switch
SUNRPC: Skip zero-refcount transports
SUNRPC: Replace division by multiplication in calculation of queue length
NFSv4: Validate the stateid before applying it to state recovery
nfs4.0: Refetch lease_time after clientid update
nfs4: Rename nfs41_setup_state_renewal
nfs4: Make nfs4_proc_get_lease_time available for nfs4.0
nfs: Fix copy-and-paste error in debug message
NFS: Replace 16 seq_printf() calls by seq_puts()
NFS: Use seq_putc() in nfs_show_stats()
Revert "NFS: readdirplus optimization by cache mechanism" (memleak)
SUNRPC: Fix transport accounting when caller specifies an rpc_xprt
NFS: Record task, client ID, and XID in xdr_status trace points
...
Thomas Gleixner [Wed, 17 Jul 2019 20:01:49 +0000 (22:01 +0200)]
sched/rt, Kconfig: Introduce CONFIG_PREEMPT_RT
Add a new entry to the preemption menu which enables the real-time support
for the kernel. The choice is only enabled when an architecture supports
it.
It selects PREEMPT as the RT features depend on it. To achieve that the
existing PREEMPT choice is renamed to PREEMPT_LL which select PREEMPT as
well.
arch/x86/boot/compressed/pgtable_64.c: In function 'find_trampoline_placement':
arch/x86/boot/compressed/pgtable_64.c:43:16: warning: unused variable 'trampoline_start' [-Wunused-variable]
unsigned long trampoline_start;
^
pnfs: Fix a problem where we gratuitously start doing I/O through the MDS
If the client has to stop in pnfs_update_layout() to wait for another
layoutget to complete, it currently exits and defaults to I/O through
the MDS if the layoutget was successful.
Fixes: d03360aaf5cc ("pNFS: Ensure we return the error if someone kills...") Signed-off-by: Trond Myklebust <[email protected]> Cc: [email protected] # v4.20+
Merge tag 'riscv/for-v5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V updates from Paul Walmsley:
- Hugepage support
- "Image" header support for RISC-V kernel binaries, compatible with
the current ARM64 "Image" header
- Initial page table setup now split into two stages
- CONFIG_SOC support (starting with SiFive SoCs)
- Avoid reserving memory between RAM start and the kernel in
setup_bootmem()
- Enable high-res timers and dynamic tick in the RV64 defconfig
- Remove long-deprecated gate area stubs
- MAINTAINERS updates to switch to the newly-created shared RISC-V git
tree, and to fix a get_maintainers.pl issue for patches involving
SiFive E-mail addresses
Also, one integration fix to resolve a build problem introduced during
in the v5.3-rc1 merge window:
- Fix build break after macro-to-function conversion in
asm-generic/cacheflush.h
* tag 'riscv/for-v5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: fix build break after macro-to-function conversion in generic cacheflush.h
RISC-V: Add an Image header that boot loader can parse.
RISC-V: Setup initial page tables in two stages
riscv: remove free_initrd_mem
riscv: ccache: Remove unused variable
riscv: Introduce huge page support for 32/64bit kernel
x86, arm64: Move ARCH_WANT_HUGE_PMD_SHARE config in arch/Kconfig
RISC-V: Fix memory reservation in setup_bootmem()
riscv: defconfig: enable SOC_SIFIVE
riscv: select SiFive platform drivers with SOC_SIFIVE
arch: riscv: add config option for building SiFive's SoC resource
riscv: Remove gate area stubs
MAINTAINERS: change the arch/riscv git tree to the new shared tree
MAINTAINERS: don't automatically patches involving SiFive to the linux-riscv list
RISC-V: defconfig: Enable NO_HZ_IDLE and HIGH_RES_TIMERS
Merge branch 'parisc-5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
- Prevent kernel panics by adding proper checking of register values
injected via the ptrace interface
- Wire up the new clone3 syscall
* 'parisc-5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Wire up clone3 syscall
parisc: Avoid kernel panic triggered by invalid kprobe
parisc: Ensure userspace privilege for ptraced processes in regset functions
parisc: Fix kernel panic due invalid values in IAOQ0 or IAOQ1
hersen wu [Wed, 26 Jun 2019 17:06:07 +0000 (13:06 -0400)]
drm/amd/display: init res_pool dccg_ref, dchub_ref with xtalin_freq
[WHY] dc sw clock implementation of navi10 and raven are not exact the
same. dcccg, dchub reference clock initialization is done after dc calls
vbios dispcontroller_init table. for raven family, before
dispcontroller_init is called by dc, the ref clk values are referred
by sw clock implementation and program asic register using wrong
values. this causes dchub pstate error. This need provide valid ref
clk values. for navi10, since dispcontroller_init is not called,
dchubbub_global_timer_enable = 0, hubbub2_get_dchub_ref_freq will
hit aeert. this need remove hubbub2_get_dchub_ref_freq from this
location and move to dcn20_init_hw.
[HOW] for all asic, initialize dccg, dchub ref clk with data from
vbios firmware table by default. for raven asic family, use these data
from vbios, for asic which support sw dccg component, like navi10,
read ref clk by sw dccg functions and update the ref clk.
Merge tag 'modules-for-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull module updates from Jessica Yu:
"Summary of modules changes for the 5.3 merge window:
- Code fixes and cleanups
- Fix bug where set_memory_x() wasn't being called when rodata=n
- Fix bug where -EEXIST was being returned for going modules
- Allow arches to override module_exit_section()"
* tag 'modules-for-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
modules: fix compile error if don't have strict module rwx
ARM: module: recognize unwind exit sections
module: allow arch overrides for .exit section names
modules: fix BUG when load module with rodata=n
kernel/module: Fix mem leak in module_add_modinfo_attrs
kernel: module: Use struct_size() helper
kernel/module.c: Only return -EEXIST for modules that have finished loading
Amir Goldstein [Mon, 10 Jun 2019 17:36:57 +0000 (20:36 +0300)]
cifs: copy_file_range needs to strip setuid bits and update timestamps
cifs has both source and destination inodes locked throughout the copy.
Like ->write_iter(), we update mtime and strip setuid bits of destination
file before copy and like ->read_iter(), we update atime of source file
after copy.
When a route needs to be appended to an existing multipath route,
fib6_add_rt2node() first appends it to the siblings list and increments
the number of sibling routes on each sibling.
Later, the function notifies the route via call_fib6_entry_notifiers().
In case the notification is vetoed, the route is not unlinked from the
siblings list, which can result in a use-after-free.
Fix this by unlinking the route from the siblings list before returning
an error.
Audited the rest of the call sites from which the FIB notification chain
is called and could not find more problems.
Fixes: 2233000cba40 ("net/ipv6: Move call_fib6_entry_notifiers up for route adds") Signed-off-by: Ido Schimmel <[email protected]> Reported-by: Alexander Petrovskiy <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
This is a conditional retpoline sibling call, which is now possible
thanks to retpolines. Objtool hasn't seen that before. It's
incorrectly interpreting the conditional jump as an unconditional
dynamic jump.
Since %eax was masked with 0x21, only the first two and the last two
entries are possible.
Objtool doesn't actually emulate all the code, so it isn't smart enough
to know that all the middle entries aren't reachable. They point to the
NOP padding area after the end of the function, so objtool seg faulted
when it tried to dereference a NULL insn->func.
After this fix, objtool still gives an "unreachable" error because it
stops reading the jump table when it encounters the bad addresses:
While the above code is technically correct, it's very wasteful of
memory -- it uses 34 jump table entries when only 4 are needed. It's
also not possible for objtool to validate this type of switch table
because the unused entries point outside the function and objtool has no
way of determining if that's intentional. Hopefully the Clang folks can
fix it.
This confuses objtool. It has never seen multiple indirect jump
sites which use the same jump table.
For GCC switch tables, the only way of detecting the size of a table
is by continuing to scan for more tables. The size of the previous
table can only be determined after another switch table is found, or
when the scan reaches the end of the function.
That logic was reused for C jump tables, and was based on the
assumption that each jump table only has a single jump site. The
above optimization breaks that assumption.
Now that C jump tables are supported, call them "jump tables" instead of
"switch tables". Also rename some other variables, add comments, and
simplify the code flow a bit.
- Add an alias check in validate_functions(). With this change, aliases
no longer need uaccess_safe set.
- Add an alias check in decode_instructions(). With this change, the
"if (!insn->func)" check is no longer needed.
- Don't create aliases for zero-length functions, as it can have
unexpected results. The next patch will spit out a warning for
zero-length functions anyway.
If 'insn->func' is NULL, objtool skips some important checks, including
sibling call validation. So if some .fixup code does an invalid sibling
call, objtool ignores it.
Treat all code branches (including alts) as part of the original
function by keeping track of the original func value from
validate_functions().
This improves the usefulness of some clang function fallthrough
warnings, and exposes some additional kernel bugs in the process.
bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()
On x86-64, with CONFIG_RETPOLINE=n, GCC's "global common subexpression
elimination" optimization results in ___bpf_prog_run()'s jumptable code
changing from this:
The jumptable address is placed in a register once, at the beginning of
the function. The function execution can then go through multiple
indirect jumps which rely on that same register value. This has a few
issues:
1) Objtool isn't smart enough to be able to track such a register value
across multiple recursive indirect jumps through the jump table.
2) With CONFIG_RETPOLINE enabled, this optimization actually results in
a small slowdown. I measured a ~4.7% slowdown in the test_bpf
"tcpdump port 22" selftest.
This slowdown is actually predicted by the GCC manual:
Note: When compiling a program using computed gotos, a GCC
extension, you may get better run-time performance if you
disable the global common subexpression elimination pass by
adding -fno-gcse to the command line.
So just disable the optimization for this function.
x86/uaccess: Don't leak AC flag into fentry from mcsafe_handle_tail()
After adding mcsafe_handle_tail() to the objtool uaccess safe list,
objtool reports:
arch/x86/lib/usercopy_64.o: warning: objtool: mcsafe_handle_tail()+0x0: call to __fentry__() with UACCESS enabled
With SMAP, this function is called with AC=1, so it needs to be careful
about which functions it calls. Disable the ftrace entry hook, which
can potentially pull in a lot of extra code.
copy_user_handle_tail() is incorrectly marked as a callable function, so
objtool is rightfully concerned about the CLAC with no corresponding
STAC.
Remove the ELF function annotation. The copy_user_handle_tail() code
path is already verified by objtool because it's jumped to by other
callable asm code (which does the corresponding STAC).
x86/head/64: Annotate start_cpu0() as non-callable
After an objtool improvement, it complains about the fact that
start_cpu0() jumps to code which has an LRET instruction.
arch/x86/kernel/head_64.o: warning: objtool: .head.text+0xe4: unsupported instruction in callable function
Technically, start_cpu0() is callable, but it acts nothing like a
callable function. Prevent objtool from treating it like one by
removing its ELF function annotation.
arch/x86/entry/thunk_64.o: warning: objtool: trace_hardirqs_on_thunk() is missing an ELF size annotation
arch/x86/entry/thunk_64.o: warning: objtool: trace_hardirqs_off_thunk() is missing an ELF size annotation
arch/x86/entry/thunk_64.o: warning: objtool: lockdep_sys_exit_thunk() is missing an ELF size annotation
x86/kvm: Don't call kvm_spurious_fault() from .fixup
After making a change to improve objtool's sibling call detection, it
started showing the following warning:
arch/x86/kvm/vmx/nested.o: warning: objtool: .fixup+0x15: sibling call from callable instruction with modified stack frame
The problem is the ____kvm_handle_fault_on_reboot() macro. It does a
fake call by pushing a fake RIP and doing a jump. That tricks the
unwinder into printing the function which triggered the exception,
rather than the .fixup code.
Instead of the hack to make it look like the original function made the
call, just change the macro so that the original function actually does
make the call. This allows removal of the hack, and also makes objtool
happy.
I triggered a vmx instruction exception and verified that the stack
trace is still sane:
x86/kvm: Replace vmx_vmenter()'s call to kvm_spurious_fault() with UD2
Objtool reports the following:
arch/x86/kvm/vmx/vmenter.o: warning: objtool: vmx_vmenter()+0x14: call without frame pointer save/setup
But frame pointers are necessarily broken anyway, because
__vmx_vcpu_run() clobbers RBP with the guest's value before calling
vmx_vmenter(). So calling without a frame pointer doesn't make things
any worse.
Some of the fastop functions, e.g. em_setcc(), are actually just used as
global labels which point to blocks of functions. The global labels are
incorrectly annotated as functions. Also the functions themselves don't
have size annotations.
Fixes a bunch of warnings like the following:
arch/x86/kvm/emulate.o: warning: objtool: seto() is missing an ELF size annotation
arch/x86/kvm/emulate.o: warning: objtool: em_setcc() is missing an ELF size annotation
arch/x86/kvm/emulate.o: warning: objtool: setno() is missing an ELF size annotation
arch/x86/kvm/emulate.o: warning: objtool: setc() is missing an ELF size annotation
The __raw_callee_save_*() functions have an ELF symbol size of zero,
which confuses objtool and other tools.
Fixes a bunch of warnings like the following:
arch/x86/xen/mmu_pv.o: warning: objtool: __raw_callee_save_xen_pte_val() is missing an ELF size annotation
arch/x86/xen/mmu_pv.o: warning: objtool: __raw_callee_save_xen_pgd_val() is missing an ELF size annotation
arch/x86/xen/mmu_pv.o: warning: objtool: __raw_callee_save_xen_make_pte() is missing an ELF size annotation
arch/x86/xen/mmu_pv.o: warning: objtool: __raw_callee_save_xen_make_pgd() is missing an ELF size annotation