Linus Torvalds [Thu, 16 Nov 2017 19:41:22 +0000 (11:41 -0800)]
Merge tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS updates from David Howells:
"kAFS filesystem driver overhaul.
The major points of the overhaul are:
(1) Preliminary groundwork is laid for supporting network-namespacing
of kAFS. The remainder of the namespacing work requires some way
to pass namespace information to submounts triggered by an
automount. This requires something like the mount overhaul that's
in progress.
(2) sockaddr_rxrpc is used in preference to in_addr for holding
addresses internally and add support for talking to the YFS VL
server. With this, kAFS can do everything over IPv6 as well as
IPv4 if it's talking to servers that support it.
(3) Callback handling is overhauled to be generally passive rather
than active. 'Callbacks' are promises by the server to tell us
about data and metadata changes. Callbacks are now checked when
we next touch an inode rather than actively going and looking for
it where possible.
(4) File access permit caching is overhauled to store the caching
information per-inode rather than per-directory, shared over
subordinate files. Whilst older AFS servers only allow ACLs on
directories (shared to the files in that directory), newer AFS
servers break that restriction.
To improve memory usage and to make it easier to do mass-key
removal, permit combinations are cached and shared.
(5) Cell database management is overhauled to allow lighter locks to
be used and to make cell records autonomous state machines that
look after getting their own DNS records and cleaning themselves
up, in particular preventing races in acquiring and relinquishing
the fscache token for the cell.
(6) Volume caching is overhauled. The afs_vlocation record is got rid
of to simplify things and the superblock is now keyed on the cell
and the numeric volume ID only. The volume record is tied to a
superblock and normal superblock management is used to mediate
the lifetime of the volume fscache token.
(7) File server record caching is overhauled to make server records
independent of cells and volumes. A server can be in multiple
cells (in such a case, the administrator must make sure that the
VL services for all cells correctly reflect the volumes shared
between those cells).
Server records are now indexed using the UUID of the server
rather than the address since a server can have multiple
addresses.
(8) File server rotation is overhauled to handle VMOVED, VBUSY (and
similar), VOFFLINE and VNOVOL indications and to handle rotation
both of servers and addresses of those servers. The rotation will
also wait and retry if the server says it is busy.
(9) Data writeback is overhauled. Each inode no longer stores a list
of modified sections tagged with the key that authorised it in
favour of noting the modified region of a page in page->private
and storing a list of keys that made modifications in the inode.
This simplifies things and allows other keys to be used to
actually write to the server if a key that made a modification
becomes useless.
(10) Writable mmap() is implemented. This allows a kernel to be build
entirely on AFS.
Note that Pre AFS-3.4 servers are no longer supported, though this can
be added back if necessary (AFS-3.4 was released in 1998)"
* tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (35 commits)
afs: Protect call->state changes against signals
afs: Trace page dirty/clean
afs: Implement shared-writeable mmap
afs: Get rid of the afs_writeback record
afs: Introduce a file-private data record
afs: Use a dynamic port if 7001 is in use
afs: Fix directory read/modify race
afs: Trace the sending of pages
afs: Trace the initiation and completion of client calls
afs: Fix documentation on # vs % prefix in mount source specification
afs: Fix total-length calculation for multiple-page send
afs: Only progress call state at end of Tx phase from rxrpc callback
afs: Make use of the YFS service upgrade to fully support IPv6
afs: Overhaul volume and server record caching and fileserver rotation
afs: Move server rotation code into its own file
afs: Add an address list concept
afs: Overhaul cell database management
afs: Overhaul permit caching
afs: Overhaul the callback handling
afs: Rename struct afs_call server member to cm_server
...
Linus Torvalds [Thu, 16 Nov 2017 18:57:11 +0000 (10:57 -0800)]
Merge tag 'pinctrl-v4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control updates from Linus Walleij:
"This is the bulk of pin control changes for the v4.15 kernel cycle:
Core:
- The pin control Kconfig entry PINCTRL is now turned into a
menuconfig option. This obviously has the implication of making the
subsystem menu visible in menuconfig. This is happening because of
two things:
(a) Intel have started to deploy and depend on pin controllers in
a way that is affecting users directly. This happens on the
highly integrated laptop chipsets named after geographical
places: baytrail, broxton, cannonlake, cedarfork, cherryview,
denverton, geminilake, lewisburg, merrifield, sunrisepoint...
It started a while back and now it is ever more evident that
this is crucial infrastructure for x86 laptops and not an
embedded obscurity anymore. Users need to be aware.
(b) Pin control expanders on I2C and SPI that are arch-agnostic.
Currently Semtech SX150X and Microchip MCP28x08 but more are
expected. Users will have to be able to configure these in
directly for their set-up.
- Just go and select GPIOLIB now that we made sure that GPIOLIB is a
very vanilla subsystem. Do not depend on it, if we need it, select
it.
- Exposing the pin control subsystem in menuconfig uncovered a bunch
of obscure bugs that are now hopefully fixed, all more or less
pertaining to Blackfin.
- Unified namespace for cross-calls between pin control and GPIO.
- New support for clock skew/delay generic DT bindings and generic
pin config options for this.
- Minor documentation improvements.
Various:
- The Renesas SH-PFC pin controller has evolved a lot. It seems
Renesas are churning out new SoCs by the minute.
- A bunch of non-critical fixes for the Rockchip driver.
- Improve the use of library functions instead of open coding.
- Support the MCP28018 variant in the MCP28x08 driver.
- Static constifying"
* tag 'pinctrl-v4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (91 commits)
pinctrl: gemini: Fix missing pad descriptions
pinctrl: Add some depends on HAS_IOMEM
pinctrl: samsung/s3c24xx: add CONFIG_OF dependency
pinctrl: gemini: Fix GMAC groups
pinctrl: qcom: spmi-gpio: Add pmi8994 gpio support
pinctrl: ti-iodelay: remove redundant unused variable dev
pinctrl: max77620: Use common error handling code in max77620_pinconf_set()
pinctrl: gemini: Implement clock skew/delay config
pinctrl: gemini: Use generic DT parser
pinctrl: Add skew-delay pin config and bindings
pinctrl: armada-37xx: Add edge both type gpio irq support
pinctrl: uniphier: remove eMMC hardware reset pin-mux
pinctrl: rockchip: Add iomux-route switching support for rk3288
pinctrl: intel: Add Intel Cedar Fork PCH pin controller support
pinctrl: intel: Make offset to interrupt status register configurable
pinctrl: sunxi: Enforce the strict mode by default
pinctrl: sunxi: Disable strict mode for old pinctrl drivers
pinctrl: sunxi: Introduce the strict flag
pinctrl: sh-pfc: Save/restore registers for PSCI system suspend
pinctrl: sh-pfc: r8a7796: Use generic IOCTRL register description
...
Rob Herring [Thu, 16 Nov 2017 17:54:17 +0000 (11:54 -0600)]
of: unittest: disable interrupts_property warning
The testcases.dts has purposely bad data which now generates a dtc warning:
drivers/of/unittest-data/testcases.dtb: Warning (interrupts_property): interrupts size is (4), expected multiple of 8 in /testcase-data/testcase-device2
Disable this warning for now. The proper solution is to split the unit
tests into good and bad data.
Rob Herring [Thu, 16 Nov 2017 17:37:09 +0000 (11:37 -0600)]
of: unittest: let dtc generate __local_fixups__
Remove the manually added __local_fixups__ because dtc can now generate
them. This also fixes a new warning in the process:
drivers/of/unittest-data/testcases.dtb: Warning (interrupts_extended_property): Could not get phandle node for /__local_fixups__/testcase-data/interrupts/interrupts-extended0:interrupts-extended(cell 3)
James Smart [Thu, 16 Nov 2017 01:00:21 +0000 (17:00 -0800)]
nvmet_fc: fix better length checking
Reorganize nvmet_fc_handle_fcp_rqst() so that the nvmet req.transfer_len
field is set after the call nvmet_req_init(). An update to nvmet now
has nvmet_req_init() clearing the field, thus the fc transport was losing
the value.
Bug fixes:
- Fix typo in MAINTAINERS
- Fix error handling; mxs-lradc
- Clean-up IRQs on .remove; fsl-imx25-tsadc"
* tag 'mfd-next-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (21 commits)
dt-bindings: mfd: mc13xxx: Remove obsolete property
mfd: axp20x: Add axp20x-regulator cell for AXP813
mfd: Add Spreadtrum SC27xx series PMICs driver
dt-bindings: mfd: Add Spreadtrum SC27xx PMIC documentation
mfd: ssbi: Use devm_of_platform_populate()
mfd: fsl-imx25: Clean up irq settings during removal
mfd: mxs-lradc: Fix error handling in mxs_lradc_probe()
mfd: lpc_ich: Avoton/Rangeley uses SPI_BYT method
mfd: tps65218: Introduce dependency on CONFIG_OF
mfd: tps65218: Correct the config description
MAINTAINERS: Fix Dialog search term for watchdog binding file
mfd: fsl-imx25: Set irq handler and data in one go
mfd: rts5249: Add support for RTS5250S power saving
ACPI / PMIC: Add opregion driver for Intel Dollar Cove TI PMIC
mfd: Add support for Cherry Trail Dollar Cove TI PMIC
syscon: dt-bindings: Add binding document for iProc MHB block
syscon: dt-bindings: Add binding doc for Broadcom iProc CDRU
mfd: max77693: Add muic of_compatible in mfd_cell
mfd: stw481x: Make three arrays static const, reduces object code size
mfd: tps65217: Introduce dependency on CONFIG_OF
...
Linus Torvalds [Thu, 16 Nov 2017 17:10:59 +0000 (09:10 -0800)]
Merge tag 'char-misc-4.15-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc updates from Greg KH:
"Here is the big set of char/misc and other driver subsystem patches
for 4.15-rc1.
There are small changes all over here, hyperv driver updates, pcmcia
driver updates, w1 driver updats, vme driver updates, nvmem driver
updates, and lots of other little one-off driver updates as well. The
shortlog has the full details.
All of these have been in linux-next for quite a while with no
reported issues"
* tag 'char-misc-4.15-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (90 commits)
VME: Return -EBUSY when DMA list in use
w1: keep balance of mutex locks and refcnts
MAINTAINERS: Update VME subsystem tree.
nvmem: sunxi-sid: add support for A64/H5's SID controller
nvmem: imx-ocotp: Update module description
nvmem: imx-ocotp: Enable i.MX7D OTP write support
nvmem: imx-ocotp: Add i.MX7D timing write clock setup support
nvmem: imx-ocotp: Move i.MX6 write clock setup to dedicated function
nvmem: imx-ocotp: Add support for banked OTP addressing
nvmem: imx-ocotp: Pass parameters via a struct
nvmem: imx-ocotp: Restrict OTP write to IMX6 processors
nvmem: uniphier: add UniPhier eFuse driver
dt-bindings: nvmem: add description for UniPhier eFuse
nvmem: set nvmem->owner to nvmem->dev->driver->owner if unset
nvmem: qfprom: fix different address space warnings of sparse
nvmem: mtk-efuse: fix different address space warnings of sparse
nvmem: mtk-efuse: use stack for nvmem_config instead of malloc'ing it
nvmem: imx-iim: use stack for nvmem_config instead of malloc'ing it
thunderbolt: tb: fix use after free in tb_activate_pcie_devices
MAINTAINERS: Add git tree for Thunderbolt development
...
Johan Hovold [Thu, 9 Nov 2017 17:07:16 +0000 (18:07 +0100)]
dt-bindings: usb: fix example hub node name
According to the OF Recommended Practice for USB, hub nodes shall be
named "hub", but the example had mixed up the label and node names. Fix
the node name and drop the redundant label.
While at it, remove a newline and add a missing semicolon to the example
source.
Robin Murphy [Wed, 15 Nov 2017 12:50:06 +0000 (12:50 +0000)]
of/pci: Fix theoretical NULL dereference
In the (relatively mechanical) process of adapting the RID-mapping code
to put the resulting ID in an output argument rather than the funtion
return value, we ended up with the debug print using the argument
pointer rather than the local value, which potentially defeats the
earlier NULL check.
Fixes: 987068fcbdb7: "of/irq: Break out msi-map lookup (again)" Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Robin Murphy <[email protected]> Signed-off-by: Rob Herring <[email protected]>
Linus Torvalds [Thu, 16 Nov 2017 16:55:30 +0000 (08:55 -0800)]
Merge tag 'driver-core-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the set of driver core / debugfs patches for 4.15-rc1.
Not many here, mostly all are debugfs fixes to resolve some
long-reported problems with files going away with references to them
in userspace. There's also some SPDX cleanups for the debugfs code, as
well as a few other minor driver core changes for issues reported by
people.
All of these have been in linux-next for a week or more with no
reported issues"
* tag 'driver-core-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
driver core: Fix device link deferred probe
debugfs: Remove redundant license text
debugfs: add SPDX identifiers to all debugfs files
debugfs: defer debugfs_fsdata allocation to first usage
debugfs: call debugfs_real_fops() only after debugfs_file_get()
debugfs: purge obsolete SRCU based removal protection
IB/hfi1: convert to debugfs_file_get() and -put()
debugfs: convert to debugfs_file_get() and -put()
debugfs: debugfs_real_fops(): drop __must_hold sparse annotation
debugfs: implement per-file removal protection
debugfs: add support for more elaborate ->d_fsdata
driver core: Move device_links_purge() after bus_remove_device()
arch_topology: Fix section miss match warning due to free_raw_capacity()
driver-core: pr_err() strings should end with newlines
Masahiro Yamada [Wed, 15 Nov 2017 04:15:12 +0000 (13:15 +0900)]
arm64: dts: uniphier: route on-board device IRQ to GPIO controller for PXs3
Commit 429f203eb712 ("arm64: dts: uniphier: route on-board device IRQ
to GPIO controller") missed to update this DTS. It becames a real
problem when arm and arm64 trees are merged together.
Steffen Maier [Tue, 17 Oct 2017 16:40:51 +0000 (18:40 +0200)]
zfcp: purely mechanical update using timer API, plus blank lines
erp_memwait only occurs in seldom memory pressure situations.
The typical case never uses the associated timer and thus also
does not need to initialize the timer.
Also, we don't want to re-initialize the timer each time we re-use an
erp_action in zfcp_erp_setup_act() [see also v4.14-rc7 commit ab31fd0ce65e
("scsi: zfcp: fix erp_action use-before-initialize in REC action trace")
for erp_action life cycle].
Hence, retain the lazy inintialization of zfcp_erp_action.timer
in zfcp_erp_strategy_memwait().
Add an empty line after declarations in zfcp_erp_timeout_handler()
and zfcp_fsf_request_timeout_handler() even though it was also missing
before the timer conversion.
Fix checkpatch warning:
WARNING: function definition argument 'struct timer_list *' should also have an identifier name
+extern void zfcp_erp_timeout_handler(struct timer_list *);
Kees Cook [Mon, 16 Oct 2017 23:44:34 +0000 (16:44 -0700)]
s390/scsi: Convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.
s390/cpum_sf: correctly set the PID and TID in perf samples
The hardware sampler creates samples that are processed at a later
point in time. The PID and TID values of the perf samples that are
created for hardware samples are initialized with values from the
current task. Hence, the PID and TID values are not correct and
perf samples are associated with wrong processes.
The PID and TID values are obtained from the Host Program Parameter
(HPP) field in the basic-sampling data entries. These PIDs are
valid in the init PID namespace. Ensure that the PIDs in the perf
samples are resolved considering the PID namespace in which the
perf event was created.
To correct the PID and TID values in the created perf samples,
a special overflow handler is installed. It replaces the default
overflow handler and does not become effective if any other
overflow handler is used. With the special overflow handler most
of the perf samples are associated with the right processes.
For processes, that are no longer exist, the association might
still be wrong.
s390/cpum_sf: load program parameter at sampler enablement
The lpp instruction is used to place the PID of the current
task in the program-parameter (PP) register. The register
contents is then included in the sampling data entries.
The lpp instruction loads the PP register only when at least
one sampling function is enabled. Otherwise it is executed
as a no-op.
Linux calls lpp at context switch. If the context switch
happens before the sampler is enabled, the PP register is
empty. That means, the PID of the task that is sampled is
not stored in sampling data until the next context switch.
s390/perf: add perf register support for floating-point registers
For correct unwinding of user space processes, the floating-point
register contents are required. For example, leaf functions might
use fp registers to temporarily store the return address.
s390/perf: extend perf_regs support to include floating-point registers
Extend the perf register support to also export floating-point register
contents for user space tasks. Floating-point registers might be used
in leaf functions to contain the return address. Hence, they are required
for proper DWARF unwinding.
s390/perf: define common DWARF register string table
Instead of defining DWARF register to string table in dwarf-regs-table.h
and dwarf-regs.c, use a common table in dwarf-regs-table.h.
Ensure that the DWARF register table is up-to-date with
http://refspecs.linuxfoundation.org/ELF/zSeries/lzsabi0_s390/x1542.html.
For unwinding with libdw, also ensure to correctly setup the DWARF
register frame according to the register mappings. Currently, libdw
supports up to 32 registers only.
Heiko Carstens [Tue, 19 Jan 2016 10:23:38 +0000 (11:23 +0100)]
s390/perf: add support for perf_regs and libdw
With support for perf_regs and libdw, you can record and report
call graphs for user space programs. Simply invoke perf with
the --call-graph=dwarf command line option.
s390/cpum_sf: do not register PMU if no sampling mode is authorized
Previously, the cpum_sf PMU was registered even if there is no
sampling mode authorized. Add a check and register cpum_sf only
at least one sampling mode is authorized.
Pu Hou [Fri, 19 May 2017 09:16:55 +0000 (11:16 +0200)]
s390/cpumf: remove raw event support in basic-only sampling mode
Raw sample was implemented to export the diagnostic samples.
With having this achieved with AUX buffers, there is no requirement
for basic samples to export raw data. In particular, most basic
sampling information are consumed for creating the perf event sample.
Pu Hou [Thu, 1 Sep 2016 08:48:22 +0000 (10:48 +0200)]
s390/perf: add callback to perf to enable using AUX buffer
Perf tool need implement a callback to enable using AUX buffer. Perf
will do another mmap() to trigger the setup of AUX buffer in kernel
if there is such callback. The default size of the AUX buffer is set
properly according to the sampling frequency to avoid overflow. It
could also be manually set by -m option of perf.
The interface of perf is not changed. Diagnostic mode sampling
could be started by `perf record -e rBD000` like before.
Pu Hou [Fri, 11 Nov 2016 02:08:49 +0000 (03:08 +0100)]
s390/cpumf: introduce AUX buffer for dump diagnostic sample data
Current implementation uses a private buffer for cpumf to dump samples.
Samples first go to this buffer. Then copy to ring buffer allocated
by perf core. With AUX buffer, this copy is not needed. AUX buffer is
shared and zero-copy mapped to user space. The trailer information at
the end of each SDB(sample data block) is also exported to user space.
AUX buffer is used when diagnostic sampling mode is enabled.
This patch contains functions to setup/free AUX buffer or to begin/end
sampling per-cpu. Also include function called in interrupt to
collect samples.
net/sctp: Always set scope_id in sctp_inet6_skb_msgname
Alexandar Potapenko while testing the kernel with KMSAN and syzkaller
discovered that in some configurations sctp would leak 4 bytes of
kernel stack.
Working with his reproducer I discovered that those 4 bytes that
are leaked is the scope id of an ipv6 address returned by recvmsg.
With a little code inspection and a shrewd guess I discovered that
sctp_inet6_skb_msgname only initializes the scope_id field for link
local ipv6 addresses to the interface index the link local address
pertains to instead of initializing the scope_id field for all ipv6
addresses.
That is almost reasonable as scope_id's are meaniningful only for link
local addresses. Set the scope_id in all other cases to 0 which is
not a valid interface index to make it clear there is nothing useful
in the scope_id field.
There should be no danger of breaking userspace as the stack leak
guaranteed that previously meaningless random data was being returned.
Fixes: 372f525b495c ("SCTP: Resync with LKSCTP tree.")
History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Reported-by: Alexander Potapenko <[email protected]> Tested-by: Alexander Potapenko <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Huacai Chen [Thu, 16 Nov 2017 03:07:15 +0000 (11:07 +0800)]
fealnx: Fix building error on MIPS
This patch try to fix the building error on MIPS. The reason is MIPS
has already defined the LONG macro, which conflicts with the LONG enum
in drivers/net/ethernet/fealnx.c.
Radim Krčmář [Thu, 16 Nov 2017 13:39:46 +0000 (14:39 +0100)]
Merge tag 'kvm-s390-next-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux
KVM: s390: fixes and improvements for 4.15
- Some initial preparation patches for exitless interrupts and crypto
- New capability for AIS migration
- Fixes
- merge of the sthyi tree from the base s390 team, which moves the sthyi
out of KVM into a shared function also for non-KVM
David S. Miller [Thu, 16 Nov 2017 13:33:54 +0000 (22:33 +0900)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
1) Copy policy family in clone_policy, otherwise this can
trigger a BUG_ON in af_key. From Herbert Xu.
2) Revert "xfrm: Fix stack-out-of-bounds read in xfrm_state_find."
This added a regression with transport mode when no addresses
are configured on the policy template.
Both patches are stable candidates.
====================
Vasily Gorbik [Wed, 15 Nov 2017 13:15:36 +0000 (14:15 +0100)]
s390/disassembler: increase show_code buffer size
Current buffer size of 64 is too small. objdump shows that there are
instructions which would require up to 75 bytes buffer (with current
formating). 128 bytes "ought to be enough for anybody".
Also replaces 8 spaces with a single tab to reduce the memory footprint.
Fixes the following KASAN finding:
BUG: KASAN: stack-out-of-bounds in number+0x3fe/0x538
Write of size 1 at addr 000000005a4a75a0 by task bash/1282
Kernel panic - not syncing: Fatal exception: panic_on_oops
With CONFIG_HARDENED_USERCOPY copy_to_user() checks in __check_object_size()
if the source address is within the kernel image. When the crash tool reads
from 0x100000, this check leads to the kernel BUG().
So disable the kernel config option until this bug is fixed.
Corresponding bug report on LKML: https://lkml.org/lkml/2017/11/10/341
Craig Bergstrom [Wed, 15 Nov 2017 22:29:51 +0000 (15:29 -0700)]
x86/mm: Limit mmap() of /dev/mem to valid physical addresses
One thing /dev/mem access APIs should verify is that there's no way
that excessively large pfn's can leak into the high bits of the
page table entry.
In particular, if people can use "very large physical page addresses"
through /dev/mem to set the bits past bit 58 - SOFTW4 and permission
key bits and NX bit, that could *really* confuse the kernel.
We had an earlier attempt:
ce56a86e2ade ("x86/mm: Limit mmap() of /dev/mem to valid physical addresses")
... which turned out to be too restrictive (breaking mem=... bootups for example) and
had to be reverted in:
90edaac62729 ("Revert "x86/mm: Limit mmap() of /dev/mem to valid physical addresses"")
This v2 attempt modifies the original patch and makes sure that mmap(/dev/mem)
limits the pfns so that it at least fits in the actual pteval_t architecturally:
- Make sure mmap_mem() actually validates that the offset fits in phys_addr_t
( This may be indirectly true due to some other check, but it's not
entirely obvious. )
- Change valid_mmap_phys_addr_range() to just use phys_addr_valid()
on the top byte
( Top byte is sufficient, because mmap_mem() has already checked that
it cannot wrap. )
- Add a few comments about what the valid_phys_addr_range() vs.
valid_mmap_phys_addr_range() difference is.
x86/selftests: Add test for mapping placement for 5-level paging
5-level paging provides a 56-bit virtual address space for user space
application. But the kernel defaults to mappings below the 47-bit address
space boundary, which is the upper bound for 4-level paging, unless an
application explicitely request it by using a mmap(2) address hint above
the 47-bit boundary. The kernel prevents mappings which spawn across the
47-bit boundary unless mmap(2) was invoked with MAP_FIXED.
Add a self-test that covers the corner cases of the interface and validates
the correctness of the implementation.
x86/mm: Prevent non-MAP_FIXED mapping across DEFAULT_MAP_WINDOW border
In case of 5-level paging, the kernel does not place any mapping above
47-bit, unless userspace explicitly asks for it.
Userspace can request an allocation from the full address space by
specifying the mmap address hint above 47-bit.
Nicholas noticed that the current implementation violates this interface:
If user space requests a mapping at the end of the 47-bit address space
with a length which causes the mapping to cross the 47-bit border
(DEFAULT_MAP_WINDOW), then the vma is partially in the address space
below and above.
Sanity check the mmap address hint so that start and end of the resulting
vma are on the same side of the 47-bit border. If that's not the case fall
back to the code path which ignores the address hint and allocate from the
regular address space below 47-bit.
To make the checks consistent, mask out the address hints lower bits
(either PAGE_MASK or huge_page_mask()) instead of using ALIGN() which can
push them up to the next boundary.
[ tglx: Moved the address check to a function and massaged comment and
changelog ]
Luca Coelho [Wed, 15 Nov 2017 16:28:04 +0000 (18:28 +0200)]
iwlwifi: fix PCI IDs and configuration mapping for 9000 series
A lot of PCI IDs were missing and there were some problems with the
configuration and firmware selection for devices on the 9000 series.
Fix the firmware selection by adding files for the B-steps; add
configuration for some integrated devices; and add a bunch of PCI IDs
(mostly for integrated devices) that were missing from the driver's
list.
Without this patch, a lot of devices will not be recognized or will
try to load the wrong firmware file.
Ming Lei [Thu, 16 Nov 2017 00:08:44 +0000 (08:08 +0800)]
block: wake up all tasks blocked in get_request()
Once blk_set_queue_dying() is done in blk_cleanup_queue(), we call
blk_freeze_queue() and wait for q->q_usage_counter becoming zero. But
if there are tasks blocked in get_request(), q->q_usage_counter can
never become zero. So we have to wake up all these tasks in
blk_set_queue_dying() first.
Fixes: 3ef28e83ab157997 ("block: generic request_queue reference counting") Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
New driver:
- tve200: Faraday Technology TVE200 block.
This "TV Encoder" encodes a ITU-T BT.656 stream and can be found in
the StorLink SL3516 (later Cortina Systems CS3516) as well as the
Grain Media GM8180.
i915:
- Remove Coffeelake from alpha support
- Cannonlake workarounds
- Infoframe refactoring for DisplayPort
- VBT updates
- DisplayPort vswing/emph/buffer translation refactoring
- CCS fixes
- Restore GPU clock boost on missed vblanks
- Scatter list updates for userptr allocations
- Gen9+ transition watermarks
- Display IPC (Isochronous Priority Control)
- Private PAT management
- GVT: improved error handling and pci config sanitizing
- Execlist refactoring
- Transparent Huge Page support
- User defined priorities support
- HuC/GuC firmware refactoring
- DP MST fixes
- eDP power sequencing fixes
- Use RCU instead of stop_machine
- PSR state tracking support
- Eviction fixes
- BDW DP aux channel timeout fixes
- LSPCON fixes
- Cannonlake PLL fixes
amdgpu:
- Per VM BO support
- Powerplay cleanups
- CI powerplay support
- PASID mgr for kfd
- SR-IOV fixes
- initial GPU reset for vega10
- Prime mmap support
- TTM updates
- Clock query interface for Raven
- Fence to handle ioctl
- UVD encode ring support on Polaris
- Transparent huge page DMA support
- Compute LRU pipe tweaks
- BO flag to allow buffers to opt out of implicit sync
- CTX priority setting API
- VRAM lost infrastructure plumbing
qxl:
- fix flicker since atomic rework
amdkfd:
- Further improvements from internal AMD tree
- Usermode events
- Drop radeon support
nouveau:
- Pascal temperature sensor support
- Improved BAR2 handling
- MMU rework to support Pascal MMU
exynos:
- Improved HDMI/mixer support
- HDMI audio interface support
tegra:
- Prep work for tegra186
- Cleanup/fixes
msm:
- Preemption support for a5xx
- Display fixes for 8x96 (snapdragon 820)
- Async cursor plane fixes
- FW loading rework
- GPU debugging improvements
vc4:
- Prep for DSI panels
- fix T-format tiling scanout
- New madvise ioctl
adv7511:
- Improve EDID handling.
- HDMI CEC support
sii8620:
- Add remote control support"
* tag 'drm-for-v4.15' of git://people.freedesktop.org/~airlied/linux: (1480 commits)
drm/rockchip: analogix_dp: Use mutex rather than spinlock
drm/mode_object: fix documentation for object lookups.
drm/i915: Reorder context-close to avoid calling i915_vma_close() under RCU
drm/i915: Move init_clock_gating() back to where it was
drm/i915: Prune the reservation shared fence array
drm/i915: Idle the GPU before shinking everything
drm/i915: Lock llist_del_first() vs llist_del_all()
drm/i915: Calculate ironlake intermediate watermarks correctly, v2.
drm/i915: Disable lazy PPGTT page table optimization for vGPU
drm/i915/execlists: Remove the priority "optimisation"
drm/i915: Filter out spurious execlists context-switch interrupts
drm/amdgpu: use irq-safe lock for kiq->ring_lock
drm/amdgpu: bypass lru touch for KIQ ring submission
drm/amdgpu: Potential uninitialized variable in amdgpu_vm_update_directories()
drm/amdgpu: potential uninitialized variable in amdgpu_vce_ring_parse_cs()
drm/amd/powerplay: initialize a variable before using it
drm/amd/powerplay: suppress KASAN out of bounds warning in vega10_populate_all_memory_levels
drm/amd/amdgpu: fix evicted VRAM bo adjudgement condition
drm/vblank: Tune drm_crtc_accurate_vblank_count() WARN down to a debug
drm/rockchip: add CONFIG_OF dependency for lvds
...
Linus Torvalds [Thu, 16 Nov 2017 04:30:12 +0000 (20:30 -0800)]
Merge tag 'media/v4.15-1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media updates from Mauro Carvalho Chehab:
- Documentation for digital TV (both kAPI and uAPI) are now in sync
with the implementation (except for legacy/deprecated ioctls). This
is a major step, as there were always a gap there
- New sensor driver: imx274
- New cec driver: cec-gpio
- New platform driver for rockship rga and tegra CEC
- New RC driver: tango-ir
- Several cleanups at atomisp driver
- Core improvements for RC, CEC, V4L2 async probing support and DVB
- Lots of drivers cleanup, fixes and improvements.
* tag 'media/v4.15-1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (332 commits)
dvb_frontend: don't use-after-free the frontend struct
media: dib0700: fix invalid dvb_detach argument
media: v4l2-ctrls: Don't validate BITMASK twice
media: s5p-mfc: fix lockdep warning
media: dvb-core: always call invoke_release() in fe_free()
media: usb: dvb-usb-v2: dvb_usb_core: remove redundant code in dvb_usb_fe_sleep
media: au0828: make const array addr_list static
media: cx88: make const arrays default_addr_list and pvr2000_addr_list static
media: drxd: make const array fastIncrDecLUT static
media: usb: fix spelling mistake: "synchronuously" -> "synchronously"
media: ddbridge: fix build warnings
media: av7110: avoid 2038 overflow in debug print
media: Don't do DMA on stack for firmware upload in the AS102 driver
media: v4l: async: fix unregister for implicitly registered sub-device notifiers
media: v4l: async: fix return of unitialized variable ret
media: imx274: fix missing return assignment from call to imx274_mode_regs
media: camss-vfe: always initialize reg at vfe_set_xbar_cfg()
media: atomisp: make function calls cleaner
media: atomisp: get rid of storage_class.h
media: atomisp: get rid of wrong stddef.h include
...
Linus Torvalds [Thu, 16 Nov 2017 04:11:56 +0000 (20:11 -0800)]
Merge tag 'leaks-4.15-rc1' of git://github.com/tcharding/linux
Pull leaking_addresses script updates from Tobin Harding:
"Here are development patches for the leaking_addresses.pl script.
Changes include:
- add summary reporting to the script
- add 'SigIgn' to false positives
- add a file read timeout so the script doesn't block indefinitely
- add infrastructure to enable multi-arch support and add support for ppc
- add some exclude files/paths suggested by various people
- code clean up and refactoring
- overhaul command line options"
* tag 'leaks-4.15-rc1' of git://github.com/tcharding/linux:
leaking_addresses: add SigIgn to false positives
leaking_addresses: add timeout on file read
leaking_addresses: add support for ppc64
leaking_addresses: add summary reporting options
leaking_addresses: add to exclude files/paths list
leaking_addresses: fix comment string typo
leaking_addresses: remove command line options
leaking_addresses: remove dead/unused code
leaking_addresses: use tabs instead of spaces
Linus Torvalds [Thu, 16 Nov 2017 03:42:40 +0000 (19:42 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
- a few misc bits
- ocfs2 updates
- almost all of MM
* emailed patches from Andrew Morton <[email protected]>: (131 commits)
memory hotplug: fix comments when adding section
mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP
mm: simplify nodemask printing
mm,oom_reaper: remove pointless kthread_run() error check
mm/page_ext.c: check if page_ext is not prepared
writeback: remove unused function parameter
mm: do not rely on preempt_count in print_vma_addr
mm, sparse: do not swamp log with huge vmemmap allocation failures
mm/hmm: remove redundant variable align_end
mm/list_lru.c: mark expected switch fall-through
mm/shmem.c: mark expected switch fall-through
mm/page_alloc.c: broken deferred calculation
mm: don't warn about allocations which stall for too long
fs: fuse: account fuse_inode slab memory as reclaimable
mm, page_alloc: fix potential false positive in __zone_watermark_ok
mm: mlock: remove lru_add_drain_all()
mm, sysctl: make NUMA stats configurable
shmem: convert shmem_init_inodecache() to void
Unify migrate_pages and move_pages access checks
mm, pagevec: rename pagevec drained field
...
Dave Airlie [Thu, 16 Nov 2017 02:39:40 +0000 (12:39 +1000)]
Merge branch 'drm-next-4.15-dc' of git://people.freedesktop.org/~agd5f/linux into drm-next
Various fixes for DC for 4.15.
* 'drm-next-4.15-dc' of git://people.freedesktop.org/~agd5f/linux:
drm/amd/display: fix MST link training fail division by 0
drm/amd/display: Fix formatting for null pointer dereference fix
drm/amd/display: Remove dangling planes on dc commit state
drm/amd/display: add flip_immediate to commit update for stream
drm/amd/display: Miss register MST encoder cbs
drm/amd/display: Fix warnings on S3 resume
drm/amd/display: use num_timing_generator instead of pipe_count
drm/amd/display: use configurable FBC option in dm
drm/amd/display: fix AZ clock not enabled before program AZ endpoint
amdgpu/dm: Don't use DRM_ERROR in amdgpu_dm_atomic_check
Oscar Salvador [Thu, 16 Nov 2017 01:39:18 +0000 (17:39 -0800)]
mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP
free_area_init_node() calls alloc_node_mem_map(), but this function does
nothing unless we have CONFIG_FLAT_NODE_MEM_MAP.
As a cleanup, we can move the "#ifdef CONFIG_FLAT_NODE_MEM_MAP" within
alloc_node_mem_map() out of the function, and define a
alloc_node_mem_map() { } when CONFIG_FLAT_NODE_MEM_MAP is not present.
This also moves the printk that lays within the "#ifdef
CONFIG_FLAT_NODE_MEM_MAP" block from free_area_init_node() to
alloc_node_mem_map(), getting rid of the "#ifdef
CONFIG_FLAT_NODE_MEM_MAP" in free_area_init_node().
Michal Hocko [Thu, 16 Nov 2017 01:39:14 +0000 (17:39 -0800)]
mm: simplify nodemask printing
alloc_warn() and dump_header() have to explicitly handle NULL nodemask
which forces both paths to use pr_cont. We can do better. printk
already handles NULL pointers properly so all we need is to teach
nodemask_pr_args to handle NULL nodemask carefully. This allows
simplification of both alloc_warn() and dump_header() and gets rid of
pr_cont altogether.
This patch has been motivated by patch from Joe Perches
Since oom_init() is called before userspace processes start, memory
allocation failure for creating the OOM reaper kernel thread will let
the OOM killer call panic() rather than wake up the OOM reaper.
Jaewon Kim [Thu, 16 Nov 2017 01:39:07 +0000 (17:39 -0800)]
mm/page_ext.c: check if page_ext is not prepared
online_page_ext() and page_ext_init() allocate page_ext for each
section, but they do not allocate if the first PFN is !pfn_present(pfn)
or !pfn_valid(pfn). Then section->page_ext remains as NULL.
lookup_page_ext checks NULL only if CONFIG_DEBUG_VM is enabled. For a
valid PFN, __set_page_owner will try to get page_ext through
lookup_page_ext. Without CONFIG_DEBUG_VM lookup_page_ext will misuse
NULL pointer as value 0. This incurrs invalid address access.
This is the panic example when PFN 0x100000 is not valid but PFN
0x13FC00 is being used for page_ext. section->page_ext is NULL,
get_entry returned invalid page_ext address as 0x1DFA000 for a PFN
0x13FC00.
To avoid this panic, CONFIG_DEBUG_VM should be removed so that page_ext
will be checked at all times.
Unable to handle kernel paging request at virtual address 01dfa014
------------[ cut here ]------------
Kernel BUG at ffffff80082371e0 [verbose debug info unavailable]
Internal error: Oops: 96000045 [#1] PREEMPT SMP
Modules linked in:
PC is at __set_page_owner+0x48/0x78
LR is at __set_page_owner+0x44/0x78
__set_page_owner+0x48/0x78
get_page_from_freelist+0x880/0x8e8
__alloc_pages_nodemask+0x14c/0xc48
__do_page_cache_readahead+0xdc/0x264
filemap_fault+0x2ac/0x550
ext4_filemap_fault+0x3c/0x58
__do_fault+0x80/0x120
handle_mm_fault+0x704/0xbb0
do_page_fault+0x2e8/0x394
do_mem_abort+0x88/0x124
Pre-4.7 kernels also need commit f86e4271978b ("mm: check the return
value of lookup_page_ext for all call sites").
Michal Hocko [Thu, 16 Nov 2017 01:38:59 +0000 (17:38 -0800)]
mm: do not rely on preempt_count in print_vma_addr
The preempt count check on print_vma_addr has been added by commit e8bff74afbdb ("x86: fix "BUG: sleeping function called from invalid
context" in print_vma_addr()") and it relied on the elevated preempt
count from preempt_conditional_sti because preempt_count check doesn't
work on non preemptive kernels by default.
The code has evolved though and commit d99e1bd175f4 ("x86/entry/traps:
Refactor preemption and interrupt flag handling") has replaced
preempt_conditional_sti by an explicit preempt_disable which is noop on
!PREEMPT so the check in print_vma_addr is broken.
Fix the issue by using trylock on mmap_sem rather than chacking the
preempt count. The allocation we are relying on has to be GFP_NOWAIT as
well. There is a chance that we won't dump the vma state if the lock is
contended or the memory short but this is acceptable outcome and much
less fragile than the not working preemption check or tricks around it.
Michal Hocko [Thu, 16 Nov 2017 01:38:56 +0000 (17:38 -0800)]
mm, sparse: do not swamp log with huge vmemmap allocation failures
While doing memory hotplug tests under heavy memory pressure we have
noticed too many page allocation failures when allocating vmemmap memmap
backed by huge page
and we do see many of those because essentially every allocation fails
for each memory section. This is an excessive way to tell the user that
there is nothing to really worry about because we do have a fallback
mechanism to use base pages. The only downside might be a performance
degradation due to TLB pressure.
This patch changes vmemmap_alloc_block() to use __GFP_NOWARN and warn
explicitly once on the first allocation failure. This will reduce the
noise in the kernel log considerably, while we still have an indication
that a performance might be impacted.
Colin Ian King [Thu, 16 Nov 2017 01:38:52 +0000 (17:38 -0800)]
mm/hmm: remove redundant variable align_end
Variable align_end is assigned a value but it is never read, so the
variable is redundant and can be removed. Cleans up the clang warning:
Value stored to 'align_end' is never read
Pavel Tatashin [Thu, 16 Nov 2017 01:38:41 +0000 (17:38 -0800)]
mm/page_alloc.c: broken deferred calculation
In reset_deferred_meminit() we determine number of pages that must not
be deferred. We initialize pages for at least 2G of memory, but also
pages for reserved memory in this node.
The reserved memory is determined in this function:
memblock_reserved_memory_within(), which operates over physical
addresses, and returns size in bytes. However, reset_deferred_meminit()
assumes that that this function operates with pfns, and returns page
count.
The result is that in the best case machine boots slower than expected
due to initializing more pages than needed in single thread, and in the
worst case panics because fewer than needed pages are initialized early.
Tetsuo Handa [Thu, 16 Nov 2017 01:38:37 +0000 (17:38 -0800)]
mm: don't warn about allocations which stall for too long
Commit 63f53dea0c98 ("mm: warn about allocations which stall for too
long") was a great step for reducing possibility of silent hang up
problem caused by memory allocation stalls. But this commit reverts it,
for it is possible to trigger OOM lockup and/or soft lockups when many
threads concurrently called warn_alloc() (in order to warn about memory
allocation stalls) due to current implementation of printk(), and it is
difficult to obtain useful information due to limitation of synchronous
warning approach.
Current printk() implementation flushes all pending logs using the
context of a thread which called console_unlock(). printk() should be
able to flush all pending logs eventually unless somebody continues
appending to printk() buffer.
Since warn_alloc() started appending to printk() buffer while waiting
for oom_kill_process() to make forward progress when oom_kill_process()
is processing pending logs, it became possible for warn_alloc() to force
oom_kill_process() loop inside printk(). As a result, warn_alloc()
significantly increased possibility of preventing oom_kill_process()
from making forward progress.
---------- Pseudo code start ----------
Before warn_alloc() was introduced:
retry:
if (mutex_trylock(&oom_lock)) {
while (atomic_read(&printk_pending_logs) > 0) {
atomic_dec(&printk_pending_logs);
print_one_log();
}
// Send SIGKILL here.
mutex_unlock(&oom_lock)
}
goto retry;
After warn_alloc() was introduced:
retry:
if (mutex_trylock(&oom_lock)) {
while (atomic_read(&printk_pending_logs) > 0) {
atomic_dec(&printk_pending_logs);
print_one_log();
}
// Send SIGKILL here.
mutex_unlock(&oom_lock)
} else if (waited_for_10seconds()) {
atomic_inc(&printk_pending_logs);
}
goto retry;
---------- Pseudo code end ----------
Although waited_for_10seconds() becomes true once per 10 seconds,
unbounded number of threads can call waited_for_10seconds() at the same
time. Also, since threads doing waited_for_10seconds() keep doing
almost busy loop, the thread doing print_one_log() can use little CPU
resource. Therefore, this situation can be simplified like
when printk() is called faster than print_one_log() can process a log.
One of possible mitigation would be to introduce a new lock in order to
make sure that no other series of printk() (either oom_kill_process() or
warn_alloc()) can append to printk() buffer when one series of printk()
(either oom_kill_process() or warn_alloc()) is already in progress.
Such serialization will also help obtaining kernel messages in readable
form.
But this commit does not go that direction, for we don't want to
introduce a new lock dependency, and we unlikely be able to obtain
useful information even if we serialized oom_kill_process() and
warn_alloc().
Synchronous approach is prone to unexpected results (e.g. too late [1],
too frequent [2], overlooked [3]). As far as I know, warn_alloc() never
helped with providing information other than "something is going wrong".
I want to consider asynchronous approach which can obtain information
during stalls with possibly relevant threads (e.g. the owner of
oom_lock and kswapd-like threads) and serve as a trigger for actions
(e.g. turn on/off tracepoints, ask libvirt daemon to take a memory dump
of stalling KVM guest for diagnostic purpose).
This commit temporarily loses ability to report e.g. OOM lockup due to
unable to invoke the OOM killer due to !__GFP_FS allocation request.
But asynchronous approach will be able to detect such situation and emit
warning. Thus, let's remove warn_alloc().
[1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
[2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com
[3] commit db73ee0d46379922 ("mm, vmscan: do not loop on too_many_isolated for ever"))
Johannes Weiner [Thu, 16 Nov 2017 01:38:34 +0000 (17:38 -0800)]
fs: fuse: account fuse_inode slab memory as reclaimable
Fuse inodes are currently included in the unreclaimable slab counts -
SUnreclaim in /proc/meminfo, slab_unreclaimable in /proc/vmstat and the
per-cgroup memory.stat. But they are reclaimable just like other
filesystems' inodes, and /proc/sys/vm/drop_caches frees them easily.
Vlastimil Babka [Thu, 16 Nov 2017 01:38:30 +0000 (17:38 -0800)]
mm, page_alloc: fix potential false positive in __zone_watermark_ok
Since commit 97a16fc82a7c ("mm, page_alloc: only enforce watermarks for
order-0 allocations"), __zone_watermark_ok() check for high-order
allocations will shortcut per-migratetype free list checks for
ALLOC_HARDER allocations, and return true as long as there's free page
of any migratetype. The intention is that ALLOC_HARDER can allocate
from MIGRATE_HIGHATOMIC free lists, while normal allocations can't.
However, as a side effect, the watermark check will then also return
true when there are pages only on the MIGRATE_ISOLATE list, or (prior to
CMA conversion to ZONE_MOVABLE) on the MIGRATE_CMA list. Since the
allocation cannot actually obtain isolated pages, and might not be able
to obtain CMA pages, this can result in a false positive.
The condition should be rare and perhaps the outcome is not a fatal one.
Still, it's better if the watermark check is correct. There also
shouldn't be a performance tradeoff here.
Shakeel Butt [Thu, 16 Nov 2017 01:38:26 +0000 (17:38 -0800)]
mm: mlock: remove lru_add_drain_all()
lru_add_drain_all() is not required by mlock() and it will drain
everything that has been cached at the time mlock is called. And that
is not really related to the memory which will be faulted in (and
cached) and mlocked by the syscall itself.
If anything lru_add_drain_all() should be called _after_ pages have been
mlocked and faulted in but even that is not strictly needed because
those pages would get to the appropriate LRUs lazily during the reclaim
path. Moreover follow_page_pte (gup) will drain the local pcp LRU
cache.
On larger machines the overhead of lru_add_drain_all() in mlock() can be
significant when mlocking data already in memory. We have observed high
latency in mlock() due to lru_add_drain_all() when the users were
mlocking in memory tmpfs files.
Kemi Wang [Thu, 16 Nov 2017 01:38:22 +0000 (17:38 -0800)]
mm, sysctl: make NUMA stats configurable
This is the second step which introduces a tunable interface that allow
numa stats configurable for optimizing zone_statistics(), as suggested
by Dave Hansen and Ying Huang.
When page allocation performance becomes a bottleneck and you can
tolerate some possible tool breakage and decreased numa counter
precision, you can do:
echo 0 > /proc/sys/vm/numa_stat
In this case, numa counter update is ignored. We can see about
*4.8%*(185->176) drop of cpu cycles per single page allocation and
reclaim on Jesper's page_bench01 (single thread) and *8.1%*(343->315)
drop of cpu cycles per single page allocation and reclaim on Jesper's
page_bench03 (88 threads) running on a 2-Socket Broadwell-based server
(88 threads, 126G memory).
Benchmark link provided by Jesper D Brouer (increase loop times to 10000000):
Otto Ebeling [Thu, 16 Nov 2017 01:38:14 +0000 (17:38 -0800)]
Unify migrate_pages and move_pages access checks
Commit 197e7e521384 ("Sanitize 'move_pages()' permission checks") fixed
a security issue I reported in the move_pages syscall, and made it so
that you can't act on set-uid processes unless you have the
CAP_SYS_PTRACE capability.
Unify the access check logic of migrate_pages to match the new behavior
of move_pages. We discussed this a bit in the security@ list and
thought it'd be good for consistency even though there's no evident
security impact. The NUMA node access checks are left intact and
require CAP_SYS_NICE as before.
Mel Gorman [Thu, 16 Nov 2017 01:38:10 +0000 (17:38 -0800)]
mm, pagevec: rename pagevec drained field
According to Vlastimil Babka, the drained field in pagevec is
potentially misleading because it might be interpreted as draining this
pagevec instead of the percpu lru pagevecs. Rename the field for
clarity.
Vlastimil Babka [Thu, 16 Nov 2017 01:38:07 +0000 (17:38 -0800)]
mm, page_alloc: simplify list handling in rmqueue_bulk()
rmqueue_bulk() fills an empty pcplist with pages from the free list. It
tries to preserve increasing order by pfn to the caller, because it
leads to better performance with some I/O controllers, as explained in
commit e084b2d95e48 ("page-allocator: preserve PFN ordering when
__GFP_COLD is set").
To preserve the order, it's sufficient to add pages to the tail of the
list as they are retrieved. The current code instead adds to the head
of the list, but then updates the list head pointer to the last added
page, in each step. This does result in the same order, but is
needlessly confusing and potentially wasteful, with no apparent benefit.
This patch simplifies the code and adjusts comment accordingly.
Mel Gorman [Thu, 16 Nov 2017 01:38:03 +0000 (17:38 -0800)]
mm: remove __GFP_COLD
As the page free path makes no distinction between cache hot and cold
pages, there is no real useful ordering of pages in the free list that
allocation requests can take advantage of. Juding from the users of
__GFP_COLD, it is likely that a number of them are the result of copying
other sites instead of actually measuring the impact. Remove the
__GFP_COLD parameter which simplifies a number of paths in the page
allocator.
This is potentially controversial but bear in mind that the size of the
per-cpu pagelists versus modern cache sizes means that the whole per-cpu
list can often fit in the L3 cache. Hence, there is only a potential
benefit for microbenchmarks that alloc/free pages in a tight loop. It's
even worse when THP is taken into account which has little or no chance
of getting a cache-hot page as the per-cpu list is bypassed and the
zeroing of multiple pages will thrash the cache anyway.
The truncate microbenchmarks are not shown as this patch affects the
allocation path and not the free path. A page fault microbenchmark was
tested but it showed no sigificant difference which is not surprising
given that the __GFP_COLD branches are a miniscule percentage of the
fault path.
Mel Gorman [Thu, 16 Nov 2017 01:37:59 +0000 (17:37 -0800)]
mm: remove cold parameter from free_hot_cold_page*
Most callers users of free_hot_cold_page claim the pages being released
are cache hot. The exception is the page reclaim paths where it is
likely that enough pages will be freed in the near future that the
per-cpu lists are going to be recycled and the cache hotness information
is lost. As no one really cares about the hotness of pages being
released to the allocator, just ditch the parameter.
The APIs are renamed to indicate that it's no longer about hot/cold
pages. It should also be less confusing as there are subtle differences
between them. __free_pages drops a reference and frees a page when the
refcount reaches zero. free_hot_cold_page handled pages whose refcount
was already zero which is non-obvious from the name. free_unref_page
should be more obvious.
No performance impact is expected as the overhead is marginal. The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.
Mel Gorman [Thu, 16 Nov 2017 01:37:55 +0000 (17:37 -0800)]
mm: remove cold parameter for release_pages
All callers of release_pages claim the pages being released are cache
hot. As no one cares about the hotness of pages being released to the
allocator, just ditch the parameter.
No performance impact is expected as the overhead is marginal. The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.
Mel Gorman [Thu, 16 Nov 2017 01:37:52 +0000 (17:37 -0800)]
mm, pagevec: remove cold parameter for pagevecs
Every pagevec_init user claims the pages being released are hot even in
cases where it is unlikely the pages are hot. As no one cares about the
hotness of pages being released to the allocator, just ditch the
parameter.
No performance impact is expected as the overhead is marginal. The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.
Mel Gorman [Thu, 16 Nov 2017 01:37:48 +0000 (17:37 -0800)]
mm: only drain per-cpu pagevecs once per pagevec usage
When a pagevec is initialised on the stack, it is generally used
multiple times over a range of pages, looking up entries and then
releasing them. On each pagevec_release, the per-cpu deferred LRU
pagevecs are drained on the grounds the page being released may be on
those queues and the pages may be cache hot. In many cases only the
first drain is necessary as it's unlikely that the range of pages being
walked is racing against LRU addition. Even if there is such a race,
the impact is marginal where as constantly redraining the lru pagevecs
costs.
This patch ensures that pagevec is only drained once in a given
lifecycle without increasing the cache footprint of the pagevec
structure. Only sparsetruncate tiny is shown here as large files have
many exceptional entries and calls pagecache_release less frequently.
sparsetruncate (tiny)
4.14.0-rc4 4.14.0-rc4
batchshadow-v1r1 onedrain-v1r1
Min Time 141.00 ( 0.00%) 141.00 ( 0.00%)
1st-qrtle Time 142.00 ( 0.00%) 142.00 ( 0.00%)
2nd-qrtle Time 142.00 ( 0.00%) 142.00 ( 0.00%)
3rd-qrtle Time 143.00 ( 0.00%) 143.00 ( 0.00%)
Max-90% Time 144.00 ( 0.00%) 144.00 ( 0.00%)
Max-95% Time 146.00 ( 0.00%) 145.00 ( 0.68%)
Max-99% Time 198.00 ( 0.00%) 194.00 ( 2.02%)
Max Time 254.00 ( 0.00%) 208.00 ( 18.11%)
Amean Time 145.12 ( 0.00%) 144.30 ( 0.56%)
Stddev Time 12.74 ( 0.00%) 9.62 ( 24.49%)
Coeff Time 8.78 ( 0.00%) 6.67 ( 24.06%)
Best99%Amean Time 144.29 ( 0.00%) 143.82 ( 0.32%)
Best95%Amean Time 142.68 ( 0.00%) 142.31 ( 0.26%)
Best90%Amean Time 142.52 ( 0.00%) 142.19 ( 0.24%)
Best75%Amean Time 142.26 ( 0.00%) 141.98 ( 0.20%)
Best50%Amean Time 141.90 ( 0.00%) 141.71 ( 0.13%)
Best25%Amean Time 141.80 ( 0.00%) 141.43 ( 0.26%)
The impact on bonnie is marginal and within the noise because a
significant percentage of the file being truncated has been reclaimed
and consists of shadow entries which reduce the hotness of the
pagevec_release path.
Mel Gorman [Thu, 16 Nov 2017 01:37:44 +0000 (17:37 -0800)]
mm, truncate: remove all exceptional entries from pagevec under one lock
During truncate each entry in a pagevec is checked to see if it is an
exceptional entry and if so, the shadow entry is cleaned up. This is
potentially expensive as multiple entries for a mapping locks/unlocks
the tree lock. This batches the operation such that any exceptional
entries removed from a pagevec only acquire the mapping tree lock once.
The corner case where this is more expensive is where there is only one
exceptional entry but this is unlikely due to temporal locality and how
it affects LRU ordering. Note that for truncations of small files
created recently, this patch should show no gain because it only batches
the handling of exceptional entries.
sparsetruncate (large)
4.14.0-rc4 4.14.0-rc4
pickhelper-v1r1 batchshadow-v1r1
Min Time 38.00 ( 0.00%) 27.00 ( 28.95%)
1st-qrtle Time 40.00 ( 0.00%) 28.00 ( 30.00%)
2nd-qrtle Time 44.00 ( 0.00%) 41.00 ( 6.82%)
3rd-qrtle Time 146.00 ( 0.00%) 147.00 ( -0.68%)
Max-90% Time 153.00 ( 0.00%) 153.00 ( 0.00%)
Max-95% Time 155.00 ( 0.00%) 156.00 ( -0.65%)
Max-99% Time 181.00 ( 0.00%) 171.00 ( 5.52%)
Amean Time 93.04 ( 0.00%) 88.43 ( 4.96%)
Best99%Amean Time 92.08 ( 0.00%) 86.13 ( 6.46%)
Best95%Amean Time 89.19 ( 0.00%) 83.13 ( 6.80%)
Best90%Amean Time 85.60 ( 0.00%) 79.15 ( 7.53%)
Best75%Amean Time 72.95 ( 0.00%) 65.09 ( 10.78%)
Best50%Amean Time 39.86 ( 0.00%) 28.20 ( 29.25%)
Best25%Amean Time 39.44 ( 0.00%) 27.70 ( 29.77%)