]> Git Repo - linux.git/log
linux.git
7 years agoMerge tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowel...
Linus Torvalds [Thu, 16 Nov 2017 19:41:22 +0000 (11:41 -0800)]
Merge tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull AFS updates from David Howells:
 "kAFS filesystem driver overhaul.

  The major points of the overhaul are:

   (1) Preliminary groundwork is laid for supporting network-namespacing
       of kAFS. The remainder of the namespacing work requires some way
       to pass namespace information to submounts triggered by an
       automount. This requires something like the mount overhaul that's
       in progress.

   (2) sockaddr_rxrpc is used in preference to in_addr for holding
       addresses internally and add support for talking to the YFS VL
       server. With this, kAFS can do everything over IPv6 as well as
       IPv4 if it's talking to servers that support it.

   (3) Callback handling is overhauled to be generally passive rather
       than active. 'Callbacks' are promises by the server to tell us
       about data and metadata changes. Callbacks are now checked when
       we next touch an inode rather than actively going and looking for
       it where possible.

   (4) File access permit caching is overhauled to store the caching
       information per-inode rather than per-directory, shared over
       subordinate files. Whilst older AFS servers only allow ACLs on
       directories (shared to the files in that directory), newer AFS
       servers break that restriction.

       To improve memory usage and to make it easier to do mass-key
       removal, permit combinations are cached and shared.

   (5) Cell database management is overhauled to allow lighter locks to
       be used and to make cell records autonomous state machines that
       look after getting their own DNS records and cleaning themselves
       up, in particular preventing races in acquiring and relinquishing
       the fscache token for the cell.

   (6) Volume caching is overhauled. The afs_vlocation record is got rid
       of to simplify things and the superblock is now keyed on the cell
       and the numeric volume ID only. The volume record is tied to a
       superblock and normal superblock management is used to mediate
       the lifetime of the volume fscache token.

   (7) File server record caching is overhauled to make server records
       independent of cells and volumes. A server can be in multiple
       cells (in such a case, the administrator must make sure that the
       VL services for all cells correctly reflect the volumes shared
       between those cells).

       Server records are now indexed using the UUID of the server
       rather than the address since a server can have multiple
       addresses.

   (8) File server rotation is overhauled to handle VMOVED, VBUSY (and
       similar), VOFFLINE and VNOVOL indications and to handle rotation
       both of servers and addresses of those servers. The rotation will
       also wait and retry if the server says it is busy.

   (9) Data writeback is overhauled. Each inode no longer stores a list
       of modified sections tagged with the key that authorised it in
       favour of noting the modified region of a page in page->private
       and storing a list of keys that made modifications in the inode.

       This simplifies things and allows other keys to be used to
       actually write to the server if a key that made a modification
       becomes useless.

  (10) Writable mmap() is implemented. This allows a kernel to be build
       entirely on AFS.

  Note that Pre AFS-3.4 servers are no longer supported, though this can
  be added back if necessary (AFS-3.4 was released in 1998)"

* tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (35 commits)
  afs: Protect call->state changes against signals
  afs: Trace page dirty/clean
  afs: Implement shared-writeable mmap
  afs: Get rid of the afs_writeback record
  afs: Introduce a file-private data record
  afs: Use a dynamic port if 7001 is in use
  afs: Fix directory read/modify race
  afs: Trace the sending of pages
  afs: Trace the initiation and completion of client calls
  afs: Fix documentation on # vs % prefix in mount source specification
  afs: Fix total-length calculation for multiple-page send
  afs: Only progress call state at end of Tx phase from rxrpc callback
  afs: Make use of the YFS service upgrade to fully support IPv6
  afs: Overhaul volume and server record caching and fileserver rotation
  afs: Move server rotation code into its own file
  afs: Add an address list concept
  afs: Overhaul cell database management
  afs: Overhaul permit caching
  afs: Overhaul the callback handling
  afs: Rename struct afs_call server member to cm_server
  ...

7 years agoMerge tag 'pinctrl-v4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw...
Linus Torvalds [Thu, 16 Nov 2017 18:57:11 +0000 (10:57 -0800)]
Merge tag 'pinctrl-v4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

Pull pin control updates from Linus Walleij:
 "This is the bulk of pin control changes for the v4.15 kernel cycle:

  Core:

   - The pin control Kconfig entry PINCTRL is now turned into a
     menuconfig option. This obviously has the implication of making the
     subsystem menu visible in menuconfig. This is happening because of
     two things:

      (a) Intel have started to deploy and depend on pin controllers in
          a way that is affecting users directly. This happens on the
          highly integrated laptop chipsets named after geographical
          places: baytrail, broxton, cannonlake, cedarfork, cherryview,
          denverton, geminilake, lewisburg, merrifield, sunrisepoint...
          It started a while back and now it is ever more evident that
          this is crucial infrastructure for x86 laptops and not an
          embedded obscurity anymore. Users need to be aware.

      (b) Pin control expanders on I2C and SPI that are arch-agnostic.
          Currently Semtech SX150X and Microchip MCP28x08 but more are
          expected. Users will have to be able to configure these in
          directly for their set-up.

   - Just go and select GPIOLIB now that we made sure that GPIOLIB is a
     very vanilla subsystem. Do not depend on it, if we need it, select
     it.

   - Exposing the pin control subsystem in menuconfig uncovered a bunch
     of obscure bugs that are now hopefully fixed, all more or less
     pertaining to Blackfin.

   - Unified namespace for cross-calls between pin control and GPIO.

   - New support for clock skew/delay generic DT bindings and generic
     pin config options for this.

   - Minor documentation improvements.

  Various:

   - The Renesas SH-PFC pin controller has evolved a lot. It seems
     Renesas are churning out new SoCs by the minute.

   - A bunch of non-critical fixes for the Rockchip driver.

   - Improve the use of library functions instead of open coding.

   - Support the MCP28018 variant in the MCP28x08 driver.

   - Static constifying"

* tag 'pinctrl-v4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (91 commits)
  pinctrl: gemini: Fix missing pad descriptions
  pinctrl: Add some depends on HAS_IOMEM
  pinctrl: samsung/s3c24xx: add CONFIG_OF dependency
  pinctrl: gemini: Fix GMAC groups
  pinctrl: qcom: spmi-gpio: Add pmi8994 gpio support
  pinctrl: ti-iodelay: remove redundant unused variable dev
  pinctrl: max77620: Use common error handling code in max77620_pinconf_set()
  pinctrl: gemini: Implement clock skew/delay config
  pinctrl: gemini: Use generic DT parser
  pinctrl: Add skew-delay pin config and bindings
  pinctrl: armada-37xx: Add edge both type gpio irq support
  pinctrl: uniphier: remove eMMC hardware reset pin-mux
  pinctrl: rockchip: Add iomux-route switching support for rk3288
  pinctrl: intel: Add Intel Cedar Fork PCH pin controller support
  pinctrl: intel: Make offset to interrupt status register configurable
  pinctrl: sunxi: Enforce the strict mode by default
  pinctrl: sunxi: Disable strict mode for old pinctrl drivers
  pinctrl: sunxi: Introduce the strict flag
  pinctrl: sh-pfc: Save/restore registers for PSCI system suspend
  pinctrl: sh-pfc: r8a7796: Use generic IOCTRL register description
  ...

7 years agoof: unittest: disable interrupts_property warning
Rob Herring [Thu, 16 Nov 2017 17:54:17 +0000 (11:54 -0600)]
of: unittest: disable interrupts_property warning

The testcases.dts has purposely bad data which now generates a dtc warning:

drivers/of/unittest-data/testcases.dtb: Warning (interrupts_property): interrupts size is (4), expected multiple of 8 in /testcase-data/testcase-device2

Disable this warning for now. The proper solution is to split the unit
tests into good and bad data.

Signed-off-by: Rob Herring <[email protected]>
7 years agoof: unittest: let dtc generate __local_fixups__
Rob Herring [Thu, 16 Nov 2017 17:37:09 +0000 (11:37 -0600)]
of: unittest: let dtc generate __local_fixups__

Remove the manually added __local_fixups__ because dtc can now generate
them. This also fixes a new warning in the process:

drivers/of/unittest-data/testcases.dtb: Warning (interrupts_extended_property): Could not get phandle node for /__local_fixups__/testcase-data/interrupts/interrupts-extended0:interrupts-extended(cell 3)

Signed-off-by: Rob Herring <[email protected]>
7 years agodrm/amd/powerplay: fix unfreeze level smc message for smu7
Eric Huang [Thu, 16 Nov 2017 16:14:21 +0000 (11:14 -0500)]
drm/amd/powerplay: fix unfreeze level smc message for smu7

Copy paste typo.

Signed-off-by: Eric Huang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]
7 years agoMerge tag 'backlight-next-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Thu, 16 Nov 2017 18:36:46 +0000 (10:36 -0800)]
Merge tag 'backlight-next-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight

Pull backlight updates from Lee Jones:

   - handle 32bit overflow in pwm_bl

   - remove redundant code/checks in tps65217_bl and ili922x

* tag 'backlight-next-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
  backlight: ili922x: Remove redundant variable len
  backlight: tps65217_bl: Remove unnecessary default brightness check
  backlight: pwm_bl: Fix overflow condition

7 years agodrm/amdgpu:fix memleak
Monk Liu [Tue, 14 Nov 2017 08:55:14 +0000 (16:55 +0800)]
drm/amdgpu:fix memleak

those RLC used buffers are not cleared in GFX's sw_fini

Signed-off-by: Monk Liu <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
7 years agodrm/amdgpu:fix memleak in takedown
Monk Liu [Tue, 14 Nov 2017 08:48:06 +0000 (16:48 +0800)]
drm/amdgpu:fix memleak in takedown

this can fix the memory leak under the case that not all
BO are freed during "takedown" stage, because originally
it blocks following kfree on mgr.

Signed-off-by: Monk Liu <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
7 years agonvmet_fc: fix better length checking
James Smart [Thu, 16 Nov 2017 01:00:21 +0000 (17:00 -0800)]
nvmet_fc: fix better length checking

Reorganize nvmet_fc_handle_fcp_rqst() so that the nvmet req.transfer_len
field is set after the call nvmet_req_init(). An update to nvmet now
has nvmet_req_init() clearing the field, thus the fc transport was losing
the value.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: James Smart <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
7 years agoMerge tag 'mfd-next-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd
Linus Torvalds [Thu, 16 Nov 2017 17:15:57 +0000 (09:15 -0800)]
Merge tag 'mfd-next-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd

Pull MFD updates from Lee Jones:
 "New drivers:
   - Add support for Cherry Trail Dollar Cove TI PMIC
   - Add support for Add Spreadtrum SC27xx series PMICs

  New device support:
   - Add support Regulator to axp20x

  New functionality:
   - Add DT support; aspeed-scu sc27xx-pmic
   - Add power saving support; rts5249

  Fix-ups:
   - DT clean-up/rework; tps65217, max77693, iproc-cdru, iproc-mhb, tps65218
   - Staticise/constify; stw481x
   - Use new succinct IRQ API; fsl-imx25-tsadc
   - Kconfig fix-ups; MFD_TPS65218
   - Identify SPI method; lpc_ich
   - Use managed resources (devm_*) calls; ssbi
   - Remove unused/obsolete code/documentation; mc13xxx

  Bug fixes:
   - Fix typo in MAINTAINERS
   - Fix error handling; mxs-lradc
   - Clean-up IRQs on .remove; fsl-imx25-tsadc"

* tag 'mfd-next-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (21 commits)
  dt-bindings: mfd: mc13xxx: Remove obsolete property
  mfd: axp20x: Add axp20x-regulator cell for AXP813
  mfd: Add Spreadtrum SC27xx series PMICs driver
  dt-bindings: mfd: Add Spreadtrum SC27xx PMIC documentation
  mfd: ssbi: Use devm_of_platform_populate()
  mfd: fsl-imx25: Clean up irq settings during removal
  mfd: mxs-lradc: Fix error handling in mxs_lradc_probe()
  mfd: lpc_ich: Avoton/Rangeley uses SPI_BYT method
  mfd: tps65218: Introduce dependency on CONFIG_OF
  mfd: tps65218: Correct the config description
  MAINTAINERS: Fix Dialog search term for watchdog binding file
  mfd: fsl-imx25: Set irq handler and data in one go
  mfd: rts5249: Add support for RTS5250S power saving
  ACPI / PMIC: Add opregion driver for Intel Dollar Cove TI PMIC
  mfd: Add support for Cherry Trail Dollar Cove TI PMIC
  syscon: dt-bindings: Add binding document for iProc MHB block
  syscon: dt-bindings: Add binding doc for Broadcom iProc CDRU
  mfd: max77693: Add muic of_compatible in mfd_cell
  mfd: stw481x: Make three arrays static const, reduces object code size
  mfd: tps65217: Introduce dependency on CONFIG_OF
  ...

7 years agoMerge tag 'char-misc-4.15-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Thu, 16 Nov 2017 17:10:59 +0000 (09:10 -0800)]
Merge tag 'char-misc-4.15-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc updates from Greg KH:
 "Here is the big set of char/misc and other driver subsystem patches
  for 4.15-rc1.

  There are small changes all over here, hyperv driver updates, pcmcia
  driver updates, w1 driver updats, vme driver updates, nvmem driver
  updates, and lots of other little one-off driver updates as well. The
  shortlog has the full details.

  All of these have been in linux-next for quite a while with no
  reported issues"

* tag 'char-misc-4.15-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (90 commits)
  VME: Return -EBUSY when DMA list in use
  w1: keep balance of mutex locks and refcnts
  MAINTAINERS: Update VME subsystem tree.
  nvmem: sunxi-sid: add support for A64/H5's SID controller
  nvmem: imx-ocotp: Update module description
  nvmem: imx-ocotp: Enable i.MX7D OTP write support
  nvmem: imx-ocotp: Add i.MX7D timing write clock setup support
  nvmem: imx-ocotp: Move i.MX6 write clock setup to dedicated function
  nvmem: imx-ocotp: Add support for banked OTP addressing
  nvmem: imx-ocotp: Pass parameters via a struct
  nvmem: imx-ocotp: Restrict OTP write to IMX6 processors
  nvmem: uniphier: add UniPhier eFuse driver
  dt-bindings: nvmem: add description for UniPhier eFuse
  nvmem: set nvmem->owner to nvmem->dev->driver->owner if unset
  nvmem: qfprom: fix different address space warnings of sparse
  nvmem: mtk-efuse: fix different address space warnings of sparse
  nvmem: mtk-efuse: use stack for nvmem_config instead of malloc'ing it
  nvmem: imx-iim: use stack for nvmem_config instead of malloc'ing it
  thunderbolt: tb: fix use after free in tb_activate_pcie_devices
  MAINTAINERS: Add git tree for Thunderbolt development
  ...

7 years agodt-bindings: usb: document hub and host-controller properties
Johan Hovold [Thu, 9 Nov 2017 17:07:19 +0000 (18:07 +0100)]
dt-bindings: usb: document hub and host-controller properties

Hub nodes and host-controller nodes with child nodes must specify values
for #address-cells (1) and #size-cells (0).

Also make the definition of the related reg property a bit more
stringent, and add comments to the example source.

Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
7 years agodt-bindings: usb: clean up compatible property
Johan Hovold [Thu, 9 Nov 2017 17:07:18 +0000 (18:07 +0100)]
dt-bindings: usb: clean up compatible property

Add quotation marks around the compatible string to avoid ambiguity due
to following punctuation, and define the VID and PID components.

Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
7 years agodt-bindings: usb: fix reg-property port-number range
Johan Hovold [Thu, 9 Nov 2017 17:07:17 +0000 (18:07 +0100)]
dt-bindings: usb: fix reg-property port-number range

The USB hub port-number range for USB 2.0 is 1-255 and not 1-31 which
reflects an arbitrary limit set by the current Linux implementation.

Note that for USB 3.1 hubs the valid range is 1-15.

Increase the documented valid range in the binding to 255, which is the
maximum allowed by the specifications.

Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
7 years agodt-bindings: usb: fix example hub node name
Johan Hovold [Thu, 9 Nov 2017 17:07:16 +0000 (18:07 +0100)]
dt-bindings: usb: fix example hub node name

According to the OF Recommended Practice for USB, hub nodes shall be
named "hub", but the example had mixed up the label and node names. Fix
the node name and drop the redundant label.

While at it, remove a newline and add a missing semicolon to the example
source.

Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
7 years agoof/pci: Fix theoretical NULL dereference
Robin Murphy [Wed, 15 Nov 2017 12:50:06 +0000 (12:50 +0000)]
of/pci: Fix theoretical NULL dereference

In the (relatively mechanical) process of adapting the RID-mapping code
to put the resulting ID in an output argument rather than the funtion
return value, we ended up with the debug print using the argument
pointer rather than the local value, which potentially defeats the
earlier NULL check.

Fixes: 987068fcbdb7: "of/irq: Break out msi-map lookup (again)"
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Robin Murphy <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
7 years agoMerge tag 'driver-core-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Thu, 16 Nov 2017 16:55:30 +0000 (08:55 -0800)]
Merge tag 'driver-core-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the set of driver core / debugfs patches for 4.15-rc1.

  Not many here, mostly all are debugfs fixes to resolve some
  long-reported problems with files going away with references to them
  in userspace. There's also some SPDX cleanups for the debugfs code, as
  well as a few other minor driver core changes for issues reported by
  people.

  All of these have been in linux-next for a week or more with no
  reported issues"

* tag 'driver-core-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  driver core: Fix device link deferred probe
  debugfs: Remove redundant license text
  debugfs: add SPDX identifiers to all debugfs files
  debugfs: defer debugfs_fsdata allocation to first usage
  debugfs: call debugfs_real_fops() only after debugfs_file_get()
  debugfs: purge obsolete SRCU based removal protection
  IB/hfi1: convert to debugfs_file_get() and -put()
  debugfs: convert to debugfs_file_get() and -put()
  debugfs: debugfs_real_fops(): drop __must_hold sparse annotation
  debugfs: implement per-file removal protection
  debugfs: add support for more elaborate ->d_fsdata
  driver core: Move device_links_purge() after bus_remove_device()
  arch_topology: Fix section miss match warning due to free_raw_capacity()
  driver-core: pr_err() strings should end with newlines

7 years agoarm64: dts: uniphier: route on-board device IRQ to GPIO controller for PXs3
Masahiro Yamada [Wed, 15 Nov 2017 04:15:12 +0000 (13:15 +0900)]
arm64: dts: uniphier: route on-board device IRQ to GPIO controller for PXs3

Commit 429f203eb712 ("arm64: dts: uniphier: route on-board device IRQ
to GPIO controller") missed to update this DTS.  It becames a real
problem when arm and arm64 trees are merged together.

Signed-off-by: Masahiro Yamada <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
7 years agokbuild: move coccicheck help from scripts/Makefile.help to top Makefile
Masahiro Yamada [Tue, 14 Nov 2017 09:47:20 +0000 (18:47 +0900)]
kbuild: move coccicheck help from scripts/Makefile.help to top Makefile

In my view, it is not helpful to have a separate file just for
the coccicheck help message.  Merge scripts/Makefile.help into
the top-level Makefile.

Signed-off-by: Masahiro Yamada <[email protected]>
Acked-by: Julia Lawall <[email protected]>
7 years agosh: decompressor: add shipped files to .gitignore
Masahiro Yamada [Mon, 13 Nov 2017 10:58:09 +0000 (19:58 +0900)]
sh: decompressor: add shipped files to .gitignore

These files are copied from arch/sh/lib, so should be ignored by git.

Signed-off-by: Masahiro Yamada <[email protected]>
7 years agofrv: .gitignore: ignore vmlinux.lds
Masahiro Yamada [Mon, 13 Nov 2017 10:43:00 +0000 (19:43 +0900)]
frv: .gitignore: ignore vmlinux.lds

Signed-off-by: Masahiro Yamada <[email protected]>
7 years agos390: remove unused parameter from Makefile
Heiko Carstens [Thu, 16 Nov 2017 12:49:50 +0000 (13:49 +0100)]
s390: remove unused parameter from Makefile

Remove unused parameter from the call function,
which I accidentally added.

Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
7 years agozfcp: purely mechanical update using timer API, plus blank lines
Steffen Maier [Tue, 17 Oct 2017 16:40:51 +0000 (18:40 +0200)]
zfcp: purely mechanical update using timer API, plus blank lines

erp_memwait only occurs in seldom memory pressure situations.
The typical case never uses the associated timer and thus also
does not need to initialize the timer.
Also, we don't want to re-initialize the timer each time we re-use an
erp_action in zfcp_erp_setup_act() [see also v4.14-rc7 commit ab31fd0ce65e
("scsi: zfcp: fix erp_action use-before-initialize in REC action trace")
for erp_action life cycle].
Hence, retain the lazy inintialization of zfcp_erp_action.timer
in zfcp_erp_strategy_memwait().

Add an empty line after declarations in zfcp_erp_timeout_handler()
and zfcp_fsf_request_timeout_handler() even though it was also missing
before the timer conversion.

Fix checkpatch warning:
WARNING: function definition argument 'struct timer_list *' should also have an identifier name
+extern void zfcp_erp_timeout_handler(struct timer_list *);

Depends-on: v4.14-rc3 commit 686fef928bba ("timer: Prepare to change timer callback argument type")
Signed-off-by: Steffen Maier <[email protected]>
Reviewed-by: Jens Remus <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
7 years agos390/scsi: Convert timers to use timer_setup()
Kees Cook [Mon, 16 Oct 2017 23:44:34 +0000 (16:44 -0700)]
s390/scsi: Convert timers to use timer_setup()

In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: Steffen Maier <[email protected]>
Cc: Benjamin Block <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: [email protected]
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
7 years agos390/cpum_sf: correctly set the PID and TID in perf samples
Hendrik Brueckner [Tue, 8 Mar 2016 13:00:23 +0000 (14:00 +0100)]
s390/cpum_sf: correctly set the PID and TID in perf samples

The hardware sampler creates samples that are processed at a later
point in time.  The PID and TID values of the perf samples that are
created for hardware samples are initialized with values from the
current task.  Hence, the PID and TID values are not correct and
perf samples are associated with wrong processes.

The PID and TID values are obtained from the Host Program Parameter
(HPP) field in the basic-sampling data entries.  These PIDs are
valid in the init PID namespace.  Ensure that the PIDs in the perf
samples are resolved considering the PID namespace in which the
perf event was created.

To correct the PID and TID values in the created perf samples,
a special overflow handler is installed.  It replaces the default
overflow handler and does not become effective if any other
overflow handler is used.  With the special overflow handler most
of the perf samples are associated with the right processes.
For processes, that are no longer exist, the association might
still be wrong.

Signed-off-by: Hendrik Brueckner <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
7 years agos390/cpum_sf: load program parameter at sampler enablement
Hendrik Brueckner [Fri, 27 Oct 2017 13:45:19 +0000 (15:45 +0200)]
s390/cpum_sf: load program parameter at sampler enablement

The lpp instruction is used to place the PID of the current
task in the program-parameter (PP) register.  The register
contents is then included in the sampling data entries.

The lpp instruction loads the PP register only when at least
one sampling function is enabled.  Otherwise it is executed
as a no-op.

Linux calls lpp at context switch.  If the context switch
happens before the sampler is enabled, the PP register is
empty.  That means, the PID of the task that is sampled is
not stored in sampling data until the next context switch.

Hence, always call lpp when enabling the sampler.

Signed-off-by: Hendrik Brueckner <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
7 years agos390/perf: add perf register support for floating-point registers
Hendrik Brueckner [Wed, 8 Nov 2017 08:17:38 +0000 (09:17 +0100)]
s390/perf: add perf register support for floating-point registers

For correct unwinding of user space processes, the floating-point
register contents are required.  For example, leaf functions might
use fp registers to temporarily store the return address.

Signed-off-by: Hendrik Brueckner <[email protected]>
Reviewed-and-tested-by: Thomas Richter <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
7 years agos390/perf: extend perf_regs support to include floating-point registers
Hendrik Brueckner [Wed, 8 Nov 2017 06:30:15 +0000 (07:30 +0100)]
s390/perf: extend perf_regs support to include floating-point registers

Extend the perf register support to also export floating-point register
contents for user space tasks.  Floating-point registers might be used
in leaf functions to contain the return address.  Hence, they are required
for proper DWARF unwinding.

Signed-off-by: Hendrik Brueckner <[email protected]>
Reviewed-and-tested-by: Thomas Richter <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
7 years agos390/perf: define common DWARF register string table
Hendrik Brueckner [Wed, 8 Nov 2017 08:01:12 +0000 (09:01 +0100)]
s390/perf: define common DWARF register string table

Instead of defining DWARF register to string table in dwarf-regs-table.h
and dwarf-regs.c, use a common table in dwarf-regs-table.h.

Ensure that the DWARF register table is up-to-date with
http://refspecs.linuxfoundation.org/ELF/zSeries/lzsabi0_s390/x1542.html.

For unwinding with libdw, also ensure to correctly setup the DWARF
register frame according to the register mappings.  Currently, libdw
supports up to 32 registers only.

Suggested-by: Thomas Richter <[email protected]>
Signed-off-by: Hendrik Brueckner <[email protected]>
Reviewed-and-tested-by: Thomas Richter <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
7 years agos390/perf: add support for perf_regs and libdw
Heiko Carstens [Tue, 19 Jan 2016 10:23:38 +0000 (11:23 +0100)]
s390/perf: add support for perf_regs and libdw

With support for perf_regs and libdw, you can record and report
call graphs for user space programs. Simply invoke perf with
the --call-graph=dwarf command line option.

Signed-off-by: Heiko Carstens <[email protected]>
[brueckner: added dwfl_thread_state_register_pc() call]
Signed-off-by: Hendrik Brueckner <[email protected]>
Reviewed-and-tested-by: Thomas Richter <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
7 years agos390/perf: add perf_regs support and user stack dump
Heiko Carstens [Sat, 6 Jun 2015 10:44:25 +0000 (12:44 +0200)]
s390/perf: add perf_regs support and user stack dump

Add s390 support to dump user stack to user space for DWARF
stack unwinding.

Signed-off-by: Heiko Carstens <[email protected]>
Reviewed-by: Hendrik Brueckner <[email protected]>
Reviewed-and-tested-by: Thomas Richter <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
7 years agos390/cpum_sf: do not register PMU if no sampling mode is authorized
Hendrik Brueckner [Wed, 31 May 2017 13:22:19 +0000 (15:22 +0200)]
s390/cpum_sf: do not register PMU if no sampling mode is authorized

Previously, the cpum_sf PMU was registered even if there is no
sampling mode authorized.  Add a check and register cpum_sf only
at least one sampling mode is authorized.

Signed-off-by: Hendrik Brueckner <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
7 years agos390/cpumf: remove raw event support in basic-only sampling mode
Pu Hou [Fri, 19 May 2017 09:16:55 +0000 (11:16 +0200)]
s390/cpumf: remove raw event support in basic-only sampling mode

Raw sample was implemented to export the diagnostic samples.
With having this achieved with AUX buffers, there is no requirement
for basic samples to export raw data.  In particular, most basic
sampling information are consumed for creating the perf event sample.

Signed-off-by: Pu Hou <[email protected]>
Reviewed-by: Hendrik Brueckner <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
7 years agos390/perf: add callback to perf to enable using AUX buffer
Pu Hou [Thu, 1 Sep 2016 08:48:22 +0000 (10:48 +0200)]
s390/perf: add callback to perf to enable using AUX buffer

Perf tool need implement a callback to enable using AUX buffer. Perf
will do another mmap() to trigger the setup of AUX buffer in kernel
if there is such callback. The default size of the AUX buffer is set
properly according to the sampling frequency to avoid overflow. It
could also be manually set by -m option of perf.

The interface of perf is not changed. Diagnostic mode sampling
could be started by `perf record -e rBD000` like before.

Signed-off-by: Pu Hou <[email protected]>
Reviewed-by: Hendrik Brueckner <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
7 years agos390/cpumf: enable using AUX buffer
Pu Hou [Fri, 11 Nov 2016 03:23:15 +0000 (04:23 +0100)]
s390/cpumf: enable using AUX buffer

Modify PMU callback to use AUX buffer for diagnostic mode sampling.
Basic-mode sampling still use orignal way.

Signed-off-by: Pu Hou <[email protected]>
Reviewed-by: Hendrik Brueckner <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
7 years agos390/cpumf: introduce AUX buffer for dump diagnostic sample data
Pu Hou [Fri, 11 Nov 2016 02:08:49 +0000 (03:08 +0100)]
s390/cpumf: introduce AUX buffer for dump diagnostic sample data

Current implementation uses a private buffer for cpumf to dump samples.
Samples first go to this buffer. Then copy to ring buffer allocated
by perf core. With AUX buffer, this copy is not needed. AUX buffer is
shared and zero-copy mapped to user space. The trailer information at
the end of each SDB(sample data block) is also exported to user space.
AUX buffer is used when diagnostic sampling mode is enabled.

This patch contains functions to setup/free AUX buffer or to begin/end
sampling per-cpu. Also include function called in interrupt to
collect samples.

Signed-off-by: Pu Hou <[email protected]>
Reviewed-by: Hendrik Brueckner <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
7 years agonet/sctp: Always set scope_id in sctp_inet6_skb_msgname
Eric W. Biederman [Thu, 16 Nov 2017 04:17:48 +0000 (22:17 -0600)]
net/sctp: Always set scope_id in sctp_inet6_skb_msgname

Alexandar Potapenko while testing the kernel with KMSAN and syzkaller
discovered that in some configurations sctp would leak 4 bytes of
kernel stack.

Working with his reproducer I discovered that those 4 bytes that
are leaked is the scope id of an ipv6 address returned by recvmsg.

With a little code inspection and a shrewd guess I discovered that
sctp_inet6_skb_msgname only initializes the scope_id field for link
local ipv6 addresses to the interface index the link local address
pertains to instead of initializing the scope_id field for all ipv6
addresses.

That is almost reasonable as scope_id's are meaniningful only for link
local addresses.  Set the scope_id in all other cases to 0 which is
not a valid interface index to make it clear there is nothing useful
in the scope_id field.

There should be no danger of breaking userspace as the stack leak
guaranteed that previously meaningless random data was being returned.

Fixes: 372f525b495c ("SCTP: Resync with LKSCTP tree.")
History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Reported-by: Alexander Potapenko <[email protected]>
Tested-by: Alexander Potapenko <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
7 years agofealnx: Fix building error on MIPS
Huacai Chen [Thu, 16 Nov 2017 03:07:15 +0000 (11:07 +0800)]
fealnx: Fix building error on MIPS

This patch try to fix the building error on MIPS. The reason is MIPS
has already defined the LONG macro, which conflicts with the LONG enum
in drivers/net/ethernet/fealnx.c.

Signed-off-by: Huacai Chen <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
7 years agoMerge tag 'kvm-s390-next-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git...
Radim Krčmář [Thu, 16 Nov 2017 13:39:46 +0000 (14:39 +0100)]
Merge tag 'kvm-s390-next-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux

KVM: s390: fixes and improvements for 4.15

- Some initial preparation patches for exitless interrupts and crypto
- New capability for AIS migration
- Fixes
- merge of the sthyi tree from the base s390 team, which moves the sthyi
out of KVM into a shared function also for non-KVM

7 years agoMerge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
David S. Miller [Thu, 16 Nov 2017 13:33:54 +0000 (22:33 +0900)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
1) Copy policy family in clone_policy, otherwise this can
   trigger a BUG_ON in af_key. From Herbert Xu.

2) Revert "xfrm: Fix stack-out-of-bounds read in xfrm_state_find."
   This added a regression with transport mode when no addresses
   are configured on the policy template.

Both patches are stable candidates.
====================

Signed-off-by: David S. Miller <[email protected]>
7 years agoMerge branch 'isdn-hisax-Fix-pnp_irq-error-checking'
David S. Miller [Thu, 16 Nov 2017 13:31:16 +0000 (22:31 +0900)]
Merge branch 'isdn-hisax-Fix-pnp_irq-error-checking'

Arvind Yadav says:

====================
isdn: hisax: Fix pnp_irq's error checking

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.
====================

Signed-off-by: David S. Miller <[email protected]>
7 years agoisdn: hisax: Fix pnp_irq's error checking for setup_teles3
Arvind Yadav [Thu, 16 Nov 2017 04:27:29 +0000 (09:57 +0530)]
isdn: hisax: Fix pnp_irq's error checking for setup_teles3

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
7 years agoisdn: hisax: Fix pnp_irq's error checking for setup_sedlbauer_isapnp
Arvind Yadav [Thu, 16 Nov 2017 04:27:28 +0000 (09:57 +0530)]
isdn: hisax: Fix pnp_irq's error checking for setup_sedlbauer_isapnp

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
7 years agoisdn: hisax: Fix pnp_irq's error checking for setup_niccy
Arvind Yadav [Thu, 16 Nov 2017 04:27:27 +0000 (09:57 +0530)]
isdn: hisax: Fix pnp_irq's error checking for setup_niccy

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
7 years agoisdn: hisax: Fix pnp_irq's error checking for setup_ix1micro
Arvind Yadav [Thu, 16 Nov 2017 04:27:26 +0000 (09:57 +0530)]
isdn: hisax: Fix pnp_irq's error checking for setup_ix1micro

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
7 years agoisdn: hisax: Fix pnp_irq's error checking for setup_isurf
Arvind Yadav [Thu, 16 Nov 2017 04:27:25 +0000 (09:57 +0530)]
isdn: hisax: Fix pnp_irq's error checking for setup_isurf

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
7 years agoisdn: hisax: Handle return value of pnp_irq and pnp_port_start
Arvind Yadav [Thu, 16 Nov 2017 04:27:24 +0000 (09:57 +0530)]
isdn: hisax: Handle return value of pnp_irq and pnp_port_start

pnp_irq() and pnp_port_start() can fail here and we must check
its return value.

Signed-off-by: Arvind Yadav <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
7 years agoisdn: hisax: Fix pnp_irq's error checking for setup_hfcs
Arvind Yadav [Thu, 16 Nov 2017 04:27:23 +0000 (09:57 +0530)]
isdn: hisax: Fix pnp_irq's error checking for setup_hfcs

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
7 years agoisdn: hisax: Fix pnp_irq's error checking for setup_hfcsx
Arvind Yadav [Thu, 16 Nov 2017 04:27:22 +0000 (09:57 +0530)]
isdn: hisax: Fix pnp_irq's error checking for setup_hfcsx

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
7 years agoisdn: hisax: Fix pnp_irq's error checking for setup_elsa_isapnp
Arvind Yadav [Thu, 16 Nov 2017 04:27:21 +0000 (09:57 +0530)]
isdn: hisax: Fix pnp_irq's error checking for setup_elsa_isapnp

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
7 years agoisdn: hisax: Fix pnp_irq's error checking for setup_diva_isapnp
Arvind Yadav [Thu, 16 Nov 2017 04:27:20 +0000 (09:57 +0530)]
isdn: hisax: Fix pnp_irq's error checking for setup_diva_isapnp

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
7 years agoisdn: hisax: Fix pnp_irq's error checking for avm_pnp_setup
Arvind Yadav [Thu, 16 Nov 2017 04:27:19 +0000 (09:57 +0530)]
isdn: hisax: Fix pnp_irq's error checking for avm_pnp_setup

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
7 years agoisdn: hisax: Fix pnp_irq's error checking for setup_asuscom
Arvind Yadav [Thu, 16 Nov 2017 04:27:18 +0000 (09:57 +0530)]
isdn: hisax: Fix pnp_irq's error checking for setup_asuscom

The pnp_irq() function returns -1 if an error occurs.
pnp_irq() error checking for zero is not correct.

Signed-off-by: Arvind Yadav <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
7 years agos390/disassembler: increase show_code buffer size
Vasily Gorbik [Wed, 15 Nov 2017 13:15:36 +0000 (14:15 +0100)]
s390/disassembler: increase show_code buffer size

Current buffer size of 64 is too small. objdump shows that there are
instructions which would require up to 75 bytes buffer (with current
formating). 128 bytes "ought to be enough for anybody".

Also replaces 8 spaces with a single tab to reduce the memory footprint.

Fixes the following KASAN finding:

BUG: KASAN: stack-out-of-bounds in number+0x3fe/0x538
Write of size 1 at addr 000000005a4a75a0 by task bash/1282

CPU: 1 PID: 1282 Comm: bash Not tainted 4.14.0+ #215
Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
Call Trace:
([<000000000011eeb6>] show_stack+0x56/0x88)
 [<0000000000e1ce1a>] dump_stack+0x15a/0x1b0
 [<00000000004e2994>] print_address_description+0xf4/0x288
 [<00000000004e2cf2>] kasan_report+0x13a/0x230
 [<0000000000e38ae6>] number+0x3fe/0x538
 [<0000000000e3dfe4>] vsnprintf+0x194/0x948
 [<0000000000e3ea42>] sprintf+0xa2/0xb8
 [<00000000001198dc>] print_insn+0x374/0x500
 [<0000000000119346>] show_code+0x4ee/0x538
 [<000000000011f234>] show_registers+0x34c/0x388
 [<000000000011f2ae>] show_regs+0x3e/0xa8
 [<000000000011f502>] die+0x1ea/0x2e8
 [<0000000000138f0e>] do_no_context+0x106/0x168
 [<0000000000139a1a>] do_protection_exception+0x4da/0x7d0
 [<0000000000e55914>] pgm_check_handler+0x16c/0x1c0
 [<000000000090639e>] sysrq_handle_crash+0x46/0x58
([<0000000000000007>] 0x7)
 [<00000000009073fa>] __handle_sysrq+0x102/0x218
 [<0000000000907c06>] write_sysrq_trigger+0xd6/0x100
 [<000000000061d67a>] proc_reg_write+0xb2/0x128
 [<0000000000520be6>] __vfs_write+0xee/0x368
 [<0000000000521222>] vfs_write+0x21a/0x278
 [<000000000052156a>] SyS_write+0xda/0x178
 [<0000000000e555cc>] system_call+0xc4/0x270

The buggy address belongs to the page:
page:000003d1016929c0 count:0 mapcount:0 mapping:          (null) index:0x0
flags: 0x0()
raw: 0000000000000000 0000000000000000 0000000000000000 ffffffff00000000
raw: 0000000000000100 0000000000000200 0000000000000000 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 000000005a4a7480: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
 000000005a4a7500: 00 00 00 00 00 00 00 00 f2 f2 f2 f2 00 00 00 00
>000000005a4a7580: 00 00 00 00 f3 f3 f3 f3 00 00 00 00 00 00 00 00
                               ^
 000000005a4a7600: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 f8 f8
 000000005a4a7680: f2 f2 f2 f2 f2 f2 f8 f8 f2 f2 f3 f3 f3 f3 00 00
==================================================================

Cc: <[email protected]>
Signed-off-by: Vasily Gorbik <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
7 years agos390: Remove CONFIG_HARDENED_USERCOPY
Michael Holzheu [Wed, 15 Nov 2017 16:06:30 +0000 (17:06 +0100)]
s390: Remove CONFIG_HARDENED_USERCOPY

When running the crash tool on a s390 live system we get a kernel panic
for reading memory within the kernel image:

 # uname -a
   Linux r3545011 4.14.0-rc8-00066-g1c9dbd4615fd #45 SMP PREEMPT Fri Nov 10 16:16:22 CET 2017 s390x s390x s390x GNU/Linux
 # crash /boot/vmlinux-devel /dev/mem
 # crash> rd 0x100000

 usercopy: kernel memory exposure attempt detected from 0000000000100000 (<kernel text>) (8 bytes)
 ------------[ cut here ]------------
 kernel BUG at mm/usercopy.c:72!
 illegal operation: 0001 ilc:1 [#1] PREEMPT SMP.
 Modules linked in:
 CPU: 0 PID: 1461 Comm: crash Not tainted 4.14.0-rc8-00066-g1c9dbd4615fd-dirty #46
 Hardware name: IBM 2827 H66 706 (z/VM 6.3.0)
 task: 000000001ad10100 task.stack: 000000001df78000
 Krnl PSW : 0704d00180000000 000000000038165c (__check_object_size+0x164/0x1d0)
            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
 Krnl GPRS: 0000000012440e1d 0000000080000000 0000000000000061 00000000001cabc0
            00000000001cc6d6 0000000000000000 0000000000cc4ed2 0000000000001000
            000003ffc22fdd20 0000000000000008 0000000000100008 0000000000000001
            0000000000000008 0000000000100000 0000000000381658 000000001df7bc90
 Krnl Code: 000000000038164cc020004a1c4a        larl    %r2,cc4ee0
            0000000000381652c0e5fff2581b        brasl   %r14,1cc688
           #0000000000381658a7f40001            brc     15,38165a
           >000000000038165ceb42000c000c        srlg    %r4,%r2,12
            0000000000381662eb32001c000c        srlg    %r3,%r2,28
            0000000000381668c0110003ffff        lgfi    %r1,262143
            000000000038166eec31ff752065        clgrj   %r3,%r1,2,381558
            0000000000381674a7f4ff67            brc     15,381542
 Call Trace:
 ([<0000000000381658>] __check_object_size+0x160/0x1d0)
  [<000000000082263a>] read_mem+0xaa/0x130.
  [<0000000000386182>] __vfs_read+0x42/0x168.
  [<000000000038632e>] vfs_read+0x86/0x140.
  [<0000000000386a26>] SyS_read+0x66/0xc0.
  [<0000000000ace6a4>] system_call+0xc4/0x2b0.
 INFO: lockdep is turned off.
 Last Breaking-Event-Address:
  [<0000000000381658>] __check_object_size+0x160/0x1d0

 Kernel panic - not syncing: Fatal exception: panic_on_oops

With CONFIG_HARDENED_USERCOPY copy_to_user() checks in __check_object_size()
if the source address is within the kernel image. When the crash tool reads
from 0x100000, this check leads to the kernel BUG().

So disable the kernel config option until this bug is fixed.

Corresponding bug report on LKML: https://lkml.org/lkml/2017/11/10/341

Signed-off-by: Michael Holzheu <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
7 years agox86/mm: Limit mmap() of /dev/mem to valid physical addresses
Craig Bergstrom [Wed, 15 Nov 2017 22:29:51 +0000 (15:29 -0700)]
x86/mm: Limit mmap() of /dev/mem to valid physical addresses

One thing /dev/mem access APIs should verify is that there's no way
that excessively large pfn's can leak into the high bits of the
page table entry.

In particular, if people can use "very large physical page addresses"
through /dev/mem to set the bits past bit 58 - SOFTW4 and permission
key bits and NX bit, that could *really* confuse the kernel.

We had an earlier attempt:

  ce56a86e2ade ("x86/mm: Limit mmap() of /dev/mem to valid physical addresses")

... which turned out to be too restrictive (breaking mem=... bootups for example) and
had to be reverted in:

  90edaac62729 ("Revert "x86/mm: Limit mmap() of /dev/mem to valid physical addresses"")

This v2 attempt modifies the original patch and makes sure that mmap(/dev/mem)
limits the pfns so that it at least fits in the actual pteval_t architecturally:

 - Make sure mmap_mem() actually validates that the offset fits in phys_addr_t

    ( This may be indirectly true due to some other check, but it's not
      entirely obvious. )

 - Change valid_mmap_phys_addr_range() to just use phys_addr_valid()
   on the top byte

    ( Top byte is sufficient, because mmap_mem() has already checked that
      it cannot wrap. )

 - Add a few comments about what the valid_phys_addr_range() vs.
   valid_mmap_phys_addr_range() difference is.

Signed-off-by: Craig Bergstrom <[email protected]>
[ Fixed the checks and added comments. ]
Signed-off-by: Linus Torvalds <[email protected]>
[ Collected the discussion and patches into a commit. ]
Cc: Boris Ostrovsky <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hans Verkuil <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sander Eikelenboom <[email protected]>
Cc: Sean Young <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/CA+55aFyEcOMb657vWSmrM13OxmHxC-XxeBmNis=DwVvpJUOogQ@mail.gmail.com
Signed-off-by: Ingo Molnar <[email protected]>
7 years agox86/selftests: Add test for mapping placement for 5-level paging
Kirill A. Shutemov [Wed, 15 Nov 2017 14:36:07 +0000 (17:36 +0300)]
x86/selftests: Add test for mapping placement for 5-level paging

5-level paging provides a 56-bit virtual address space for user space
application. But the kernel defaults to mappings below the 47-bit address
space boundary, which is the upper bound for 4-level paging, unless an
application explicitely request it by using a mmap(2) address hint above
the 47-bit boundary. The kernel prevents mappings which spawn across the
47-bit boundary unless mmap(2) was invoked with MAP_FIXED.

Add a self-test that covers the corner cases of the interface and validates
the correctness of the implementation.

[ tglx: Massaged changelog once more ]

Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
7 years agox86/mm: Prevent non-MAP_FIXED mapping across DEFAULT_MAP_WINDOW border
Kirill A. Shutemov [Wed, 15 Nov 2017 14:36:06 +0000 (17:36 +0300)]
x86/mm: Prevent non-MAP_FIXED mapping across DEFAULT_MAP_WINDOW border

In case of 5-level paging, the kernel does not place any mapping above
47-bit, unless userspace explicitly asks for it.

Userspace can request an allocation from the full address space by
specifying the mmap address hint above 47-bit.

Nicholas noticed that the current implementation violates this interface:

  If user space requests a mapping at the end of the 47-bit address space
  with a length which causes the mapping to cross the 47-bit border
  (DEFAULT_MAP_WINDOW), then the vma is partially in the address space
  below and above.

Sanity check the mmap address hint so that start and end of the resulting
vma are on the same side of the 47-bit border. If that's not the case fall
back to the code path which ignores the address hint and allocate from the
regular address space below 47-bit.

To make the checks consistent, mask out the address hints lower bits
(either PAGE_MASK or huge_page_mask()) instead of using ALIGN() which can
push them up to the next boundary.

[ tglx: Moved the address check to a function and massaged comment and
   changelog ]

Reported-by: Nicholas Piggin <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
7 years agohwmon: (w83793) Remove duplicate NULL check
Andy Shevchenko [Tue, 31 Oct 2017 14:21:46 +0000 (16:21 +0200)]
hwmon: (w83793) Remove duplicate NULL check

Since i2c_unregister_device() became NULL-aware we may remove duplicate
NULL check.

Cc: Rudolf Marek <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: [email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Reviewed-by: Jean Delvare <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
7 years agohwmon: (w83792d) Remove duplicate NULL check
Andy Shevchenko [Tue, 31 Oct 2017 14:21:45 +0000 (16:21 +0200)]
hwmon: (w83792d) Remove duplicate NULL check

Since i2c_unregister_device() became NULL-aware we may remove duplicate
NULL check.

Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: [email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Reviewed-by: Jean Delvare <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
7 years agohwmon: (w83791d) Remove duplicate NULL check
Andy Shevchenko [Tue, 31 Oct 2017 14:21:44 +0000 (16:21 +0200)]
hwmon: (w83791d) Remove duplicate NULL check

Since i2c_unregister_device() became NULL-aware we may remove duplicate
NULL check.

Cc: Marc Hulsman <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: [email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Reviewed-by: Jean Delvare <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
7 years agohwmon: (w83781d) Remove duplicate NULL check
Andy Shevchenko [Tue, 31 Oct 2017 14:21:43 +0000 (16:21 +0200)]
hwmon: (w83781d) Remove duplicate NULL check

Since i2c_unregister_device() became NULL-aware we may remove duplicate
NULL check.

Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: [email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Reviewed-by: Jean Delvare <[email protected]>
Acked-by: Wolfram Sang <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
7 years agohwmon: (k10temp) Correct model name for Ryzen 1600X
Guenter Roeck [Mon, 13 Nov 2017 20:38:23 +0000 (12:38 -0800)]
hwmon: (k10temp) Correct model name for Ryzen 1600X

Ryzen 1600X is a Ryzen 5 processor.

Signed-off-by: Guenter Roeck <[email protected]>
7 years agoiwlwifi: fix firmware names for 9000 and A000 series hw
Thomas Backlund [Tue, 14 Nov 2017 10:37:51 +0000 (12:37 +0200)]
iwlwifi: fix firmware names for 9000 and A000 series hw

iwlwifi 9000 and a0000 series hw contains an extra dash in firmware
file name as seeen in modinfo output for kernel 4.14:

firmware:       iwlwifi-9260-th-b0-jf-b0--34.ucode
firmware:       iwlwifi-9260-th-a0-jf-a0--34.ucode
firmware:       iwlwifi-9000-pu-a0-jf-b0--34.ucode
firmware:       iwlwifi-9000-pu-a0-jf-a0--34.ucode
firmware:       iwlwifi-QuQnj-a0-hr-a0--34.ucode
firmware:       iwlwifi-QuQnj-a0-jf-b0--34.ucode
firmware:       iwlwifi-QuQnj-f0-hr-a0--34.ucode
firmware:       iwlwifi-Qu-a0-jf-b0--34.ucode
firmware:       iwlwifi-Qu-a0-hr-a0--34.ucode

Fix that by dropping the extra adding of '"-"'.

Signed-off-by: Thomas Backlund <[email protected]>
Cc: [email protected] # 4.13
Signed-off-by: Luca Coelho <[email protected]>
7 years agoiwlwifi: fix PCI IDs and configuration mapping for 9000 series
Luca Coelho [Wed, 15 Nov 2017 16:28:04 +0000 (18:28 +0200)]
iwlwifi: fix PCI IDs and configuration mapping for 9000 series

A lot of PCI IDs were missing and there were some problems with the
configuration and firmware selection for devices on the 9000 series.
Fix the firmware selection by adding files for the B-steps; add
configuration for some integrated devices; and add a bunch of PCI IDs
(mostly for integrated devices) that were missing from the driver's
list.

Without this patch, a lot of devices will not be recognized or will
try to load the wrong firmware file.

Cc: [email protected] # 4.13
Signed-off-by: Luca Coelho <[email protected]>
7 years agosched/deadline: Fix the description of runtime accounting in the documentation
Claudio Scordino [Tue, 14 Nov 2017 11:19:26 +0000 (12:19 +0100)]
sched/deadline: Fix the description of runtime accounting in the documentation

Signed-off-by: Claudio Scordino <[email protected]>
Signed-off-by: Luca Abeni <[email protected]>
Acked-by: Daniel Bristot de Oliveira <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mathieu Poirier <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tommaso Cucinotta <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
7 years agorpmsg: glink: The mbox client knows_txdone
Bjorn Andersson [Thu, 16 Nov 2017 06:08:33 +0000 (22:08 -0800)]
rpmsg: glink: The mbox client knows_txdone

As the GLINK driver is ticking the txdone of the mailbox channel (to
implement the doorbell) it needs to set knows_txdone.

Signed-off-by: Bjorn Andersson <[email protected]>
7 years agoblock: wake up all tasks blocked in get_request()
Ming Lei [Thu, 16 Nov 2017 00:08:44 +0000 (08:08 +0800)]
block: wake up all tasks blocked in get_request()

Once blk_set_queue_dying() is done in blk_cleanup_queue(), we call
blk_freeze_queue() and wait for q->q_usage_counter becoming zero. But
if there are tasks blocked in get_request(), q->q_usage_counter can
never become zero. So we have to wake up all these tasks in
blk_set_queue_dying() first.

Fixes: 3ef28e83ab157997 ("block: generic request_queue reference counting")
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
7 years agoMerge tag 'drm-for-v4.15' of git://people.freedesktop.org/~airlied/linux
Linus Torvalds [Thu, 16 Nov 2017 04:42:10 +0000 (20:42 -0800)]
Merge tag 'drm-for-v4.15' of git://people.freedesktop.org/~airlied/linux

Pull drm updates from Dave Airlie:
 "This is the main drm pull request for v4.15.

  Core:
   - Atomic object lifetime fixes
   - Atomic iterator improvements
   - Sparse/smatch fixes
   - Legacy kms ioctls to be interruptible
   - EDID override improvements
   - fb/gem helper cleanups
   - Simple outreachy patches
   - Documentation improvements
   - Fix dma-buf rcu races
   - DRM mode object leasing for improving VR use cases.
   - vgaarb improvements for non-x86 platforms.

  New driver:
   - tve200: Faraday Technology TVE200 block.

     This "TV Encoder" encodes a ITU-T BT.656 stream and can be found in
     the StorLink SL3516 (later Cortina Systems CS3516) as well as the
     Grain Media GM8180.

  New bridges:
   - SiI9234 support

  New panels:
   - S6E63J0X03, OTM8009A, Seiko 43WVF1G, 7" rpi touch panel, Toshiba
     LT089AC19000, Innolux AT043TN24

  i915:
   - Remove Coffeelake from alpha support
   - Cannonlake workarounds
   - Infoframe refactoring for DisplayPort
   - VBT updates
   - DisplayPort vswing/emph/buffer translation refactoring
   - CCS fixes
   - Restore GPU clock boost on missed vblanks
   - Scatter list updates for userptr allocations
   - Gen9+ transition watermarks
   - Display IPC (Isochronous Priority Control)
   - Private PAT management
   - GVT: improved error handling and pci config sanitizing
   - Execlist refactoring
   - Transparent Huge Page support
   - User defined priorities support
   - HuC/GuC firmware refactoring
   - DP MST fixes
   - eDP power sequencing fixes
   - Use RCU instead of stop_machine
   - PSR state tracking support
   - Eviction fixes
   - BDW DP aux channel timeout fixes
   - LSPCON fixes
   - Cannonlake PLL fixes

  amdgpu:
   - Per VM BO support
   - Powerplay cleanups
   - CI powerplay support
   - PASID mgr for kfd
   - SR-IOV fixes
   - initial GPU reset for vega10
   - Prime mmap support
   - TTM updates
   - Clock query interface for Raven
   - Fence to handle ioctl
   - UVD encode ring support on Polaris
   - Transparent huge page DMA support
   - Compute LRU pipe tweaks
   - BO flag to allow buffers to opt out of implicit sync
   - CTX priority setting API
   - VRAM lost infrastructure plumbing

  qxl:
   - fix flicker since atomic rework

  amdkfd:
   - Further improvements from internal AMD tree
   - Usermode events
   - Drop radeon support

  nouveau:
   - Pascal temperature sensor support
   - Improved BAR2 handling
   - MMU rework to support Pascal MMU

  exynos:
   - Improved HDMI/mixer support
   - HDMI audio interface support

  tegra:
   - Prep work for tegra186
   - Cleanup/fixes

  msm:
   - Preemption support for a5xx
   - Display fixes for 8x96 (snapdragon 820)
   - Async cursor plane fixes
   - FW loading rework
   - GPU debugging improvements

  vc4:
   - Prep for DSI panels
   - fix T-format tiling scanout
   - New madvise ioctl

  Rockchip:
   - LVDS support

  omapdrm:
   - omap4 HDMI CEC support

  etnaviv:
   - GPU performance counters groundwork

  sun4i:
   - refactor driver load + TCON backend
   - HDMI improvements
   - A31 support
   - Misc fixes

  udl:
   - Probe/EDID read fixes.

  tilcdc:
   - Misc fixes.

  pl111:
   - Support more variants

  adv7511:
   - Improve EDID handling.
   - HDMI CEC support

  sii8620:
   - Add remote control support"

* tag 'drm-for-v4.15' of git://people.freedesktop.org/~airlied/linux: (1480 commits)
  drm/rockchip: analogix_dp: Use mutex rather than spinlock
  drm/mode_object: fix documentation for object lookups.
  drm/i915: Reorder context-close to avoid calling i915_vma_close() under RCU
  drm/i915: Move init_clock_gating() back to where it was
  drm/i915: Prune the reservation shared fence array
  drm/i915: Idle the GPU before shinking everything
  drm/i915: Lock llist_del_first() vs llist_del_all()
  drm/i915: Calculate ironlake intermediate watermarks correctly, v2.
  drm/i915: Disable lazy PPGTT page table optimization for vGPU
  drm/i915/execlists: Remove the priority "optimisation"
  drm/i915: Filter out spurious execlists context-switch interrupts
  drm/amdgpu: use irq-safe lock for kiq->ring_lock
  drm/amdgpu: bypass lru touch for KIQ ring submission
  drm/amdgpu: Potential uninitialized variable in amdgpu_vm_update_directories()
  drm/amdgpu: potential uninitialized variable in amdgpu_vce_ring_parse_cs()
  drm/amd/powerplay: initialize a variable before using it
  drm/amd/powerplay: suppress KASAN out of bounds warning in vega10_populate_all_memory_levels
  drm/amd/amdgpu: fix evicted VRAM bo adjudgement condition
  drm/vblank: Tune drm_crtc_accurate_vblank_count() WARN down to a debug
  drm/rockchip: add CONFIG_OF dependency for lvds
  ...

7 years agoMerge tag 'media/v4.15-1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mcheh...
Linus Torvalds [Thu, 16 Nov 2017 04:30:12 +0000 (20:30 -0800)]
Merge tag 'media/v4.15-1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media updates from Mauro Carvalho Chehab:

 - Documentation for digital TV (both kAPI and uAPI) are now in sync
   with the implementation (except for legacy/deprecated ioctls). This
   is a major step, as there were always a gap there

 - New sensor driver: imx274

 - New cec driver: cec-gpio

 - New platform driver for rockship rga and tegra CEC

 - New RC driver: tango-ir

 - Several cleanups at atomisp driver

 - Core improvements for RC, CEC, V4L2 async probing support and DVB

 - Lots of drivers cleanup, fixes and improvements.

* tag 'media/v4.15-1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (332 commits)
  dvb_frontend: don't use-after-free the frontend struct
  media: dib0700: fix invalid dvb_detach argument
  media: v4l2-ctrls: Don't validate BITMASK twice
  media: s5p-mfc: fix lockdep warning
  media: dvb-core: always call invoke_release() in fe_free()
  media: usb: dvb-usb-v2: dvb_usb_core: remove redundant code in dvb_usb_fe_sleep
  media: au0828: make const array addr_list static
  media: cx88: make const arrays default_addr_list and pvr2000_addr_list static
  media: drxd: make const array fastIncrDecLUT static
  media: usb: fix spelling mistake: "synchronuously" -> "synchronously"
  media: ddbridge: fix build warnings
  media: av7110: avoid 2038 overflow in debug print
  media: Don't do DMA on stack for firmware upload in the AS102 driver
  media: v4l: async: fix unregister for implicitly registered sub-device notifiers
  media: v4l: async: fix return of unitialized variable ret
  media: imx274: fix missing return assignment from call to imx274_mode_regs
  media: camss-vfe: always initialize reg at vfe_set_xbar_cfg()
  media: atomisp: make function calls cleaner
  media: atomisp: get rid of storage_class.h
  media: atomisp: get rid of wrong stddef.h include
  ...

7 years agoMerge tag 'leaks-4.15-rc1' of git://github.com/tcharding/linux
Linus Torvalds [Thu, 16 Nov 2017 04:11:56 +0000 (20:11 -0800)]
Merge tag 'leaks-4.15-rc1' of git://github.com/tcharding/linux

Pull leaking_addresses script updates from Tobin Harding:
 "Here are development patches for the leaking_addresses.pl script.

  Changes include:

   - add summary reporting to the script

   - add 'SigIgn' to false positives

   - add a file read timeout so the script doesn't block indefinitely

   - add infrastructure to enable multi-arch support and add support for ppc

   - add some exclude files/paths suggested by various people

   - code clean up and refactoring

   - overhaul command line options"

* tag 'leaks-4.15-rc1' of git://github.com/tcharding/linux:
  leaking_addresses: add SigIgn to false positives
  leaking_addresses: add timeout on file read
  leaking_addresses: add support for ppc64
  leaking_addresses: add summary reporting options
  leaking_addresses: add to exclude files/paths list
  leaking_addresses: fix comment string typo
  leaking_addresses: remove command line options
  leaking_addresses: remove dead/unused code
  leaking_addresses: use tabs instead of spaces

7 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Thu, 16 Nov 2017 03:42:40 +0000 (19:42 -0800)]
Merge branch 'akpm' (patches from Andrew)

Merge updates from Andrew Morton:

 - a few misc bits

 - ocfs2 updates

 - almost all of MM

* emailed patches from Andrew Morton <[email protected]>: (131 commits)
  memory hotplug: fix comments when adding section
  mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP
  mm: simplify nodemask printing
  mm,oom_reaper: remove pointless kthread_run() error check
  mm/page_ext.c: check if page_ext is not prepared
  writeback: remove unused function parameter
  mm: do not rely on preempt_count in print_vma_addr
  mm, sparse: do not swamp log with huge vmemmap allocation failures
  mm/hmm: remove redundant variable align_end
  mm/list_lru.c: mark expected switch fall-through
  mm/shmem.c: mark expected switch fall-through
  mm/page_alloc.c: broken deferred calculation
  mm: don't warn about allocations which stall for too long
  fs: fuse: account fuse_inode slab memory as reclaimable
  mm, page_alloc: fix potential false positive in __zone_watermark_ok
  mm: mlock: remove lru_add_drain_all()
  mm, sysctl: make NUMA stats configurable
  shmem: convert shmem_init_inodecache() to void
  Unify migrate_pages and move_pages access checks
  mm, pagevec: rename pagevec drained field
  ...

7 years agoMerge branch 'drm-next-4.15-dc' of git://people.freedesktop.org/~agd5f/linux into...
Dave Airlie [Thu, 16 Nov 2017 02:39:40 +0000 (12:39 +1000)]
Merge branch 'drm-next-4.15-dc' of git://people.freedesktop.org/~agd5f/linux into drm-next

Various fixes for DC for 4.15.

* 'drm-next-4.15-dc' of git://people.freedesktop.org/~agd5f/linux:
  drm/amd/display: fix MST link training fail division by 0
  drm/amd/display: Fix formatting for null pointer dereference fix
  drm/amd/display: Remove dangling planes on dc commit state
  drm/amd/display: add flip_immediate to commit update for stream
  drm/amd/display: Miss register MST encoder cbs
  drm/amd/display: Fix warnings on S3 resume
  drm/amd/display: use num_timing_generator instead of pipe_count
  drm/amd/display: use configurable FBC option in dm
  drm/amd/display: fix AZ clock not enabled before program AZ endpoint
  amdgpu/dm: Don't use DRM_ERROR in amdgpu_dm_atomic_check

7 years agomemory hotplug: fix comments when adding section
Fan Du [Thu, 16 Nov 2017 01:39:21 +0000 (17:39 -0800)]
memory hotplug: fix comments when adding section

Here, pfn_to_node should be page_to_nid.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Fan Du <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP
Oscar Salvador [Thu, 16 Nov 2017 01:39:18 +0000 (17:39 -0800)]
mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP

free_area_init_node() calls alloc_node_mem_map(), but this function does
nothing unless we have CONFIG_FLAT_NODE_MEM_MAP.

As a cleanup, we can move the "#ifdef CONFIG_FLAT_NODE_MEM_MAP" within
alloc_node_mem_map() out of the function, and define a
alloc_node_mem_map() { } when CONFIG_FLAT_NODE_MEM_MAP is not present.

This also moves the printk that lays within the "#ifdef
CONFIG_FLAT_NODE_MEM_MAP" block from free_area_init_node() to
alloc_node_mem_map(), getting rid of the "#ifdef
CONFIG_FLAT_NODE_MEM_MAP" in free_area_init_node().

[[email protected]: clean up the printk while we're there]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm: simplify nodemask printing
Michal Hocko [Thu, 16 Nov 2017 01:39:14 +0000 (17:39 -0800)]
mm: simplify nodemask printing

alloc_warn() and dump_header() have to explicitly handle NULL nodemask
which forces both paths to use pr_cont.  We can do better.  printk
already handles NULL pointers properly so all we need is to teach
nodemask_pr_args to handle NULL nodemask carefully.  This allows
simplification of both alloc_warn() and dump_header() and gets rid of
pr_cont altogether.

This patch has been motivated by patch from Joe Perches

  http://lkml.kernel.org/r/b31236dfe3fc924054fd7842bde678e71d193638.1509991345[email protected]

[[email protected]: fix tile warning, per Arnd]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Joe Perches <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm,oom_reaper: remove pointless kthread_run() error check
Tetsuo Handa [Thu, 16 Nov 2017 01:39:10 +0000 (17:39 -0800)]
mm,oom_reaper: remove pointless kthread_run() error check

Since oom_init() is called before userspace processes start, memory
allocation failure for creating the OOM reaper kernel thread will let
the OOM killer call panic() rather than wake up the OOM reaper.

Link: http://lkml.kernel.org/r/1510137800-4602-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm/page_ext.c: check if page_ext is not prepared
Jaewon Kim [Thu, 16 Nov 2017 01:39:07 +0000 (17:39 -0800)]
mm/page_ext.c: check if page_ext is not prepared

online_page_ext() and page_ext_init() allocate page_ext for each
section, but they do not allocate if the first PFN is !pfn_present(pfn)
or !pfn_valid(pfn).  Then section->page_ext remains as NULL.
lookup_page_ext checks NULL only if CONFIG_DEBUG_VM is enabled.  For a
valid PFN, __set_page_owner will try to get page_ext through
lookup_page_ext.  Without CONFIG_DEBUG_VM lookup_page_ext will misuse
NULL pointer as value 0.  This incurrs invalid address access.

This is the panic example when PFN 0x100000 is not valid but PFN
0x13FC00 is being used for page_ext.  section->page_ext is NULL,
get_entry returned invalid page_ext address as 0x1DFA000 for a PFN
0x13FC00.

To avoid this panic, CONFIG_DEBUG_VM should be removed so that page_ext
will be checked at all times.

  Unable to handle kernel paging request at virtual address 01dfa014
  ------------[ cut here ]------------
  Kernel BUG at ffffff80082371e0 [verbose debug info unavailable]
  Internal error: Oops: 96000045 [#1] PREEMPT SMP
  Modules linked in:
  PC is at __set_page_owner+0x48/0x78
  LR is at __set_page_owner+0x44/0x78
    __set_page_owner+0x48/0x78
    get_page_from_freelist+0x880/0x8e8
    __alloc_pages_nodemask+0x14c/0xc48
    __do_page_cache_readahead+0xdc/0x264
    filemap_fault+0x2ac/0x550
    ext4_filemap_fault+0x3c/0x58
    __do_fault+0x80/0x120
    handle_mm_fault+0x704/0xbb0
    do_page_fault+0x2e8/0x394
    do_mem_abort+0x88/0x124

Pre-4.7 kernels also need commit f86e4271978b ("mm: check the return
value of lookup_page_ext for all call sites").

Link: http://lkml.kernel.org/r/[email protected]
Fixes: eefa864b701d ("mm/page_ext: resurrect struct page extending code for debugging")
Signed-off-by: Jaewon Kim <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: <[email protected]> [depends on f86e427197, see above]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agowriteback: remove unused function parameter
Wang Long [Thu, 16 Nov 2017 01:39:03 +0000 (17:39 -0800)]
writeback: remove unused function parameter

The parameter `struct bdi_writeback *wb` is not been used in the
function body.  Remove it.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wang Long <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm: do not rely on preempt_count in print_vma_addr
Michal Hocko [Thu, 16 Nov 2017 01:38:59 +0000 (17:38 -0800)]
mm: do not rely on preempt_count in print_vma_addr

The preempt count check on print_vma_addr has been added by commit
e8bff74afbdb ("x86: fix "BUG: sleeping function called from invalid
context" in print_vma_addr()") and it relied on the elevated preempt
count from preempt_conditional_sti because preempt_count check doesn't
work on non preemptive kernels by default.

The code has evolved though and commit d99e1bd175f4 ("x86/entry/traps:
Refactor preemption and interrupt flag handling") has replaced
preempt_conditional_sti by an explicit preempt_disable which is noop on
!PREEMPT so the check in print_vma_addr is broken.

Fix the issue by using trylock on mmap_sem rather than chacking the
preempt count.  The allocation we are relying on has to be GFP_NOWAIT as
well.  There is a chance that we won't dump the vma state if the lock is
contended or the memory short but this is acceptable outcome and much
less fragile than the not working preemption check or tricks around it.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: d99e1bd175f4 ("x86/entry/traps: Refactor preemption and interrupt flag handling")
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Yang Shi <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm, sparse: do not swamp log with huge vmemmap allocation failures
Michal Hocko [Thu, 16 Nov 2017 01:38:56 +0000 (17:38 -0800)]
mm, sparse: do not swamp log with huge vmemmap allocation failures

While doing memory hotplug tests under heavy memory pressure we have
noticed too many page allocation failures when allocating vmemmap memmap
backed by huge page

  kworker/u3072:1: page allocation failure: order:9, mode:0x24084c0(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO)
  [...]
  Call Trace:
    dump_trace+0x59/0x310
    show_stack_log_lvl+0xea/0x170
    show_stack+0x21/0x40
    dump_stack+0x5c/0x7c
    warn_alloc_failed+0xe2/0x150
    __alloc_pages_nodemask+0x3ed/0xb20
    alloc_pages_current+0x7f/0x100
    vmemmap_alloc_block+0x79/0xb6
    __vmemmap_alloc_block_buf+0x136/0x145
    vmemmap_populate+0xd2/0x2b9
    sparse_mem_map_populate+0x23/0x30
    sparse_add_one_section+0x68/0x18e
    __add_pages+0x10a/0x1d0
    arch_add_memory+0x4a/0xc0
    add_memory_resource+0x89/0x160
    add_memory+0x6d/0xd0
    acpi_memory_device_add+0x181/0x251
    acpi_bus_attach+0xfd/0x19b
    acpi_bus_scan+0x59/0x69
    acpi_device_hotplug+0xd2/0x41f
    acpi_hotplug_work_fn+0x1a/0x23
    process_one_work+0x14e/0x410
    worker_thread+0x116/0x490
    kthread+0xbd/0xe0
    ret_from_fork+0x3f/0x70

and we do see many of those because essentially every allocation fails
for each memory section.  This is an excessive way to tell the user that
there is nothing to really worry about because we do have a fallback
mechanism to use base pages.  The only downside might be a performance
degradation due to TLB pressure.

This patch changes vmemmap_alloc_block() to use __GFP_NOWARN and warn
explicitly once on the first allocation failure.  This will reduce the
noise in the kernel log considerably, while we still have an indication
that a performance might be impacted.

[[email protected]: forgot to git add the follow up fix]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Khalid Aziz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm/hmm: remove redundant variable align_end
Colin Ian King [Thu, 16 Nov 2017 01:38:52 +0000 (17:38 -0800)]
mm/hmm: remove redundant variable align_end

Variable align_end is assigned a value but it is never read, so the
variable is redundant and can be removed.  Cleans up the clang warning:
Value stored to 'align_end' is never read

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Colin Ian King <[email protected]>
Reviewed-by: Jérôme Glisse <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm/list_lru.c: mark expected switch fall-through
Gustavo A. R. Silva [Thu, 16 Nov 2017 01:38:49 +0000 (17:38 -0800)]
mm/list_lru.c: mark expected switch fall-through

In preparation for enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Gustavo A. R. Silva <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm/shmem.c: mark expected switch fall-through
Gustavo A. R. Silva [Thu, 16 Nov 2017 01:38:45 +0000 (17:38 -0800)]
mm/shmem.c: mark expected switch fall-through

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Gustavo A. R. Silva <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm/page_alloc.c: broken deferred calculation
Pavel Tatashin [Thu, 16 Nov 2017 01:38:41 +0000 (17:38 -0800)]
mm/page_alloc.c: broken deferred calculation

In reset_deferred_meminit() we determine number of pages that must not
be deferred.  We initialize pages for at least 2G of memory, but also
pages for reserved memory in this node.

The reserved memory is determined in this function:
memblock_reserved_memory_within(), which operates over physical
addresses, and returns size in bytes.  However, reset_deferred_meminit()
assumes that that this function operates with pfns, and returns page
count.

The result is that in the best case machine boots slower than expected
due to initializing more pages than needed in single thread, and in the
worst case panics because fewer than needed pages are initialized early.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing")
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm: don't warn about allocations which stall for too long
Tetsuo Handa [Thu, 16 Nov 2017 01:38:37 +0000 (17:38 -0800)]
mm: don't warn about allocations which stall for too long

Commit 63f53dea0c98 ("mm: warn about allocations which stall for too
long") was a great step for reducing possibility of silent hang up
problem caused by memory allocation stalls.  But this commit reverts it,
for it is possible to trigger OOM lockup and/or soft lockups when many
threads concurrently called warn_alloc() (in order to warn about memory
allocation stalls) due to current implementation of printk(), and it is
difficult to obtain useful information due to limitation of synchronous
warning approach.

Current printk() implementation flushes all pending logs using the
context of a thread which called console_unlock().  printk() should be
able to flush all pending logs eventually unless somebody continues
appending to printk() buffer.

Since warn_alloc() started appending to printk() buffer while waiting
for oom_kill_process() to make forward progress when oom_kill_process()
is processing pending logs, it became possible for warn_alloc() to force
oom_kill_process() loop inside printk().  As a result, warn_alloc()
significantly increased possibility of preventing oom_kill_process()
from making forward progress.

---------- Pseudo code start ----------
Before warn_alloc() was introduced:

  retry:
    if (mutex_trylock(&oom_lock)) {
      while (atomic_read(&printk_pending_logs) > 0) {
        atomic_dec(&printk_pending_logs);
        print_one_log();
      }
      // Send SIGKILL here.
      mutex_unlock(&oom_lock)
    }
    goto retry;

After warn_alloc() was introduced:

  retry:
    if (mutex_trylock(&oom_lock)) {
      while (atomic_read(&printk_pending_logs) > 0) {
        atomic_dec(&printk_pending_logs);
        print_one_log();
      }
      // Send SIGKILL here.
      mutex_unlock(&oom_lock)
    } else if (waited_for_10seconds()) {
      atomic_inc(&printk_pending_logs);
    }
    goto retry;
---------- Pseudo code end ----------

Although waited_for_10seconds() becomes true once per 10 seconds,
unbounded number of threads can call waited_for_10seconds() at the same
time.  Also, since threads doing waited_for_10seconds() keep doing
almost busy loop, the thread doing print_one_log() can use little CPU
resource.  Therefore, this situation can be simplified like

---------- Pseudo code start ----------
  retry:
    if (mutex_trylock(&oom_lock)) {
      while (atomic_read(&printk_pending_logs) > 0) {
        atomic_dec(&printk_pending_logs);
        print_one_log();
      }
      // Send SIGKILL here.
      mutex_unlock(&oom_lock)
    } else {
      atomic_inc(&printk_pending_logs);
    }
    goto retry;
---------- Pseudo code end ----------

when printk() is called faster than print_one_log() can process a log.

One of possible mitigation would be to introduce a new lock in order to
make sure that no other series of printk() (either oom_kill_process() or
warn_alloc()) can append to printk() buffer when one series of printk()
(either oom_kill_process() or warn_alloc()) is already in progress.

Such serialization will also help obtaining kernel messages in readable
form.

---------- Pseudo code start ----------
  retry:
    if (mutex_trylock(&oom_lock)) {
      mutex_lock(&oom_printk_lock);
      while (atomic_read(&printk_pending_logs) > 0) {
        atomic_dec(&printk_pending_logs);
        print_one_log();
      }
      // Send SIGKILL here.
      mutex_unlock(&oom_printk_lock);
      mutex_unlock(&oom_lock)
    } else {
      if (mutex_trylock(&oom_printk_lock)) {
        atomic_inc(&printk_pending_logs);
        mutex_unlock(&oom_printk_lock);
      }
    }
    goto retry;
---------- Pseudo code end ----------

But this commit does not go that direction, for we don't want to
introduce a new lock dependency, and we unlikely be able to obtain
useful information even if we serialized oom_kill_process() and
warn_alloc().

Synchronous approach is prone to unexpected results (e.g.  too late [1],
too frequent [2], overlooked [3]).  As far as I know, warn_alloc() never
helped with providing information other than "something is going wrong".
I want to consider asynchronous approach which can obtain information
during stalls with possibly relevant threads (e.g.  the owner of
oom_lock and kswapd-like threads) and serve as a trigger for actions
(e.g.  turn on/off tracepoints, ask libvirt daemon to take a memory dump
of stalling KVM guest for diagnostic purpose).

This commit temporarily loses ability to report e.g.  OOM lockup due to
unable to invoke the OOM killer due to !__GFP_FS allocation request.
But asynchronous approach will be able to detect such situation and emit
warning.  Thus, let's remove warn_alloc().

[1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
[2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com
[3] commit db73ee0d46379922 ("mm, vmscan: do not loop on too_many_isolated for ever"))

Link: http://lkml.kernel.org/r/1509017339-4802-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <[email protected]>
Reported-by: Cong Wang <[email protected]>
Reported-by: yuwang.yuwang <[email protected]>
Reported-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agofs: fuse: account fuse_inode slab memory as reclaimable
Johannes Weiner [Thu, 16 Nov 2017 01:38:34 +0000 (17:38 -0800)]
fs: fuse: account fuse_inode slab memory as reclaimable

Fuse inodes are currently included in the unreclaimable slab counts -
SUnreclaim in /proc/meminfo, slab_unreclaimable in /proc/vmstat and the
per-cgroup memory.stat.  But they are reclaimable just like other
filesystems' inodes, and /proc/sys/vm/drop_caches frees them easily.

Mark the slab cache reclaimable.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm, page_alloc: fix potential false positive in __zone_watermark_ok
Vlastimil Babka [Thu, 16 Nov 2017 01:38:30 +0000 (17:38 -0800)]
mm, page_alloc: fix potential false positive in __zone_watermark_ok

Since commit 97a16fc82a7c ("mm, page_alloc: only enforce watermarks for
order-0 allocations"), __zone_watermark_ok() check for high-order
allocations will shortcut per-migratetype free list checks for
ALLOC_HARDER allocations, and return true as long as there's free page
of any migratetype.  The intention is that ALLOC_HARDER can allocate
from MIGRATE_HIGHATOMIC free lists, while normal allocations can't.

However, as a side effect, the watermark check will then also return
true when there are pages only on the MIGRATE_ISOLATE list, or (prior to
CMA conversion to ZONE_MOVABLE) on the MIGRATE_CMA list.  Since the
allocation cannot actually obtain isolated pages, and might not be able
to obtain CMA pages, this can result in a false positive.

The condition should be rare and perhaps the outcome is not a fatal one.
Still, it's better if the watermark check is correct.  There also
shouldn't be a performance tradeoff here.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 97a16fc82a7c ("mm, page_alloc: only enforce watermarks for order-0 allocations")
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm: mlock: remove lru_add_drain_all()
Shakeel Butt [Thu, 16 Nov 2017 01:38:26 +0000 (17:38 -0800)]
mm: mlock: remove lru_add_drain_all()

lru_add_drain_all() is not required by mlock() and it will drain
everything that has been cached at the time mlock is called.  And that
is not really related to the memory which will be faulted in (and
cached) and mlocked by the syscall itself.

If anything lru_add_drain_all() should be called _after_ pages have been
mlocked and faulted in but even that is not strictly needed because
those pages would get to the appropriate LRUs lazily during the reclaim
path.  Moreover follow_page_pte (gup) will drain the local pcp LRU
cache.

On larger machines the overhead of lru_add_drain_all() in mlock() can be
significant when mlocking data already in memory.  We have observed high
latency in mlock() due to lru_add_drain_all() when the users were
mlocking in memory tmpfs files.

[[email protected]: changelog fix]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Yisheng Xie <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm, sysctl: make NUMA stats configurable
Kemi Wang [Thu, 16 Nov 2017 01:38:22 +0000 (17:38 -0800)]
mm, sysctl: make NUMA stats configurable

This is the second step which introduces a tunable interface that allow
numa stats configurable for optimizing zone_statistics(), as suggested
by Dave Hansen and Ying Huang.

=========================================================================

When page allocation performance becomes a bottleneck and you can
tolerate some possible tool breakage and decreased numa counter
precision, you can do:

echo 0 > /proc/sys/vm/numa_stat

In this case, numa counter update is ignored.  We can see about
*4.8%*(185->176) drop of cpu cycles per single page allocation and
reclaim on Jesper's page_bench01 (single thread) and *8.1%*(343->315)
drop of cpu cycles per single page allocation and reclaim on Jesper's
page_bench03 (88 threads) running on a 2-Socket Broadwell-based server
(88 threads, 126G memory).

Benchmark link provided by Jesper D Brouer (increase loop times to
10000000):

  https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

=========================================================================

When page allocation performance is not a bottleneck and you want all
tooling to work, you can do:

echo 1 > /proc/sys/vm/numa_stat

This is system default setting.

Many thanks to Michal Hocko, Dave Hansen, Ying Huang and Vlastimil Babka
for comments to help improve the original patch.

[[email protected]: make sure mutex is a global static]
Link: http://lkml.kernel.org/r/20171107213809.GA4314@beast
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kemi Wang <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Reported-by: Jesper Dangaard Brouer <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Suggested-by: Ying Huang <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: "Luis R . Rodriguez" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Aaron Lu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agoshmem: convert shmem_init_inodecache() to void
weiping zhang [Thu, 16 Nov 2017 01:38:18 +0000 (17:38 -0800)]
shmem: convert shmem_init_inodecache() to void

shmem_inode_cachep was created with SLAB_PANIC flag and
shmem_init_inodecache() never returns non-zero, so convert this
function to return void.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: weiping zhang <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agoUnify migrate_pages and move_pages access checks
Otto Ebeling [Thu, 16 Nov 2017 01:38:14 +0000 (17:38 -0800)]
Unify migrate_pages and move_pages access checks

Commit 197e7e521384 ("Sanitize 'move_pages()' permission checks") fixed
a security issue I reported in the move_pages syscall, and made it so
that you can't act on set-uid processes unless you have the
CAP_SYS_PTRACE capability.

Unify the access check logic of migrate_pages to match the new behavior
of move_pages.  We discussed this a bit in the security@ list and
thought it'd be good for consistency even though there's no evident
security impact.  The NUMA node access checks are left intact and
require CAP_SYS_NICE as before.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Otto Ebeling <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Willy Tarreau <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm, pagevec: rename pagevec drained field
Mel Gorman [Thu, 16 Nov 2017 01:38:10 +0000 (17:38 -0800)]
mm, pagevec: rename pagevec drained field

According to Vlastimil Babka, the drained field in pagevec is
potentially misleading because it might be interpreted as draining this
pagevec instead of the percpu lru pagevecs.  Rename the field for
clarity.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mel Gorman <[email protected]>
Suggested-by: Vlastimil Babka <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm, page_alloc: simplify list handling in rmqueue_bulk()
Vlastimil Babka [Thu, 16 Nov 2017 01:38:07 +0000 (17:38 -0800)]
mm, page_alloc: simplify list handling in rmqueue_bulk()

rmqueue_bulk() fills an empty pcplist with pages from the free list.  It
tries to preserve increasing order by pfn to the caller, because it
leads to better performance with some I/O controllers, as explained in
commit e084b2d95e48 ("page-allocator: preserve PFN ordering when
__GFP_COLD is set").

To preserve the order, it's sufficient to add pages to the tail of the
list as they are retrieved.  The current code instead adds to the head
of the list, but then updates the list head pointer to the last added
page, in each step.  This does result in the same order, but is
needlessly confusing and potentially wasteful, with no apparent benefit.
This patch simplifies the code and adjusts comment accordingly.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm: remove __GFP_COLD
Mel Gorman [Thu, 16 Nov 2017 01:38:03 +0000 (17:38 -0800)]
mm: remove __GFP_COLD

As the page free path makes no distinction between cache hot and cold
pages, there is no real useful ordering of pages in the free list that
allocation requests can take advantage of.  Juding from the users of
__GFP_COLD, it is likely that a number of them are the result of copying
other sites instead of actually measuring the impact.  Remove the
__GFP_COLD parameter which simplifies a number of paths in the page
allocator.

This is potentially controversial but bear in mind that the size of the
per-cpu pagelists versus modern cache sizes means that the whole per-cpu
list can often fit in the L3 cache.  Hence, there is only a potential
benefit for microbenchmarks that alloc/free pages in a tight loop.  It's
even worse when THP is taken into account which has little or no chance
of getting a cache-hot page as the per-cpu list is bypassed and the
zeroing of multiple pages will thrash the cache anyway.

The truncate microbenchmarks are not shown as this patch affects the
allocation path and not the free path.  A page fault microbenchmark was
tested but it showed no sigificant difference which is not surprising
given that the __GFP_COLD branches are a miniscule percentage of the
fault path.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm: remove cold parameter from free_hot_cold_page*
Mel Gorman [Thu, 16 Nov 2017 01:37:59 +0000 (17:37 -0800)]
mm: remove cold parameter from free_hot_cold_page*

Most callers users of free_hot_cold_page claim the pages being released
are cache hot.  The exception is the page reclaim paths where it is
likely that enough pages will be freed in the near future that the
per-cpu lists are going to be recycled and the cache hotness information
is lost.  As no one really cares about the hotness of pages being
released to the allocator, just ditch the parameter.

The APIs are renamed to indicate that it's no longer about hot/cold
pages.  It should also be less confusing as there are subtle differences
between them.  __free_pages drops a reference and frees a page when the
refcount reaches zero.  free_hot_cold_page handled pages whose refcount
was already zero which is non-obvious from the name.  free_unref_page
should be more obvious.

No performance impact is expected as the overhead is marginal.  The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.

[[email protected]: add pages to head, not tail]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm: remove cold parameter for release_pages
Mel Gorman [Thu, 16 Nov 2017 01:37:55 +0000 (17:37 -0800)]
mm: remove cold parameter for release_pages

All callers of release_pages claim the pages being released are cache
hot.  As no one cares about the hotness of pages being released to the
allocator, just ditch the parameter.

No performance impact is expected as the overhead is marginal.  The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm, pagevec: remove cold parameter for pagevecs
Mel Gorman [Thu, 16 Nov 2017 01:37:52 +0000 (17:37 -0800)]
mm, pagevec: remove cold parameter for pagevecs

Every pagevec_init user claims the pages being released are hot even in
cases where it is unlikely the pages are hot.  As no one cares about the
hotness of pages being released to the allocator, just ditch the
parameter.

No performance impact is expected as the overhead is marginal.  The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm: only drain per-cpu pagevecs once per pagevec usage
Mel Gorman [Thu, 16 Nov 2017 01:37:48 +0000 (17:37 -0800)]
mm: only drain per-cpu pagevecs once per pagevec usage

When a pagevec is initialised on the stack, it is generally used
multiple times over a range of pages, looking up entries and then
releasing them.  On each pagevec_release, the per-cpu deferred LRU
pagevecs are drained on the grounds the page being released may be on
those queues and the pages may be cache hot.  In many cases only the
first drain is necessary as it's unlikely that the range of pages being
walked is racing against LRU addition.  Even if there is such a race,
the impact is marginal where as constantly redraining the lru pagevecs
costs.

This patch ensures that pagevec is only drained once in a given
lifecycle without increasing the cache footprint of the pagevec
structure.  Only sparsetruncate tiny is shown here as large files have
many exceptional entries and calls pagecache_release less frequently.

sparsetruncate (tiny)
                              4.14.0-rc4             4.14.0-rc4
                        batchshadow-v1r1          onedrain-v1r1
Min          Time      141.00 (   0.00%)      141.00 (   0.00%)
1st-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
2nd-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
3rd-qrtle    Time      143.00 (   0.00%)      143.00 (   0.00%)
Max-90%      Time      144.00 (   0.00%)      144.00 (   0.00%)
Max-95%      Time      146.00 (   0.00%)      145.00 (   0.68%)
Max-99%      Time      198.00 (   0.00%)      194.00 (   2.02%)
Max          Time      254.00 (   0.00%)      208.00 (  18.11%)
Amean        Time      145.12 (   0.00%)      144.30 (   0.56%)
Stddev       Time       12.74 (   0.00%)        9.62 (  24.49%)
Coeff        Time        8.78 (   0.00%)        6.67 (  24.06%)
Best99%Amean Time      144.29 (   0.00%)      143.82 (   0.32%)
Best95%Amean Time      142.68 (   0.00%)      142.31 (   0.26%)
Best90%Amean Time      142.52 (   0.00%)      142.19 (   0.24%)
Best75%Amean Time      142.26 (   0.00%)      141.98 (   0.20%)
Best50%Amean Time      141.90 (   0.00%)      141.71 (   0.13%)
Best25%Amean Time      141.80 (   0.00%)      141.43 (   0.26%)

The impact on bonnie is marginal and within the noise because a
significant percentage of the file being truncated has been reclaimed
and consists of shadow entries which reduce the hotness of the
pagevec_release path.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mel Gorman <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
7 years agomm, truncate: remove all exceptional entries from pagevec under one lock
Mel Gorman [Thu, 16 Nov 2017 01:37:44 +0000 (17:37 -0800)]
mm, truncate: remove all exceptional entries from pagevec under one lock

During truncate each entry in a pagevec is checked to see if it is an
exceptional entry and if so, the shadow entry is cleaned up.  This is
potentially expensive as multiple entries for a mapping locks/unlocks
the tree lock.  This batches the operation such that any exceptional
entries removed from a pagevec only acquire the mapping tree lock once.
The corner case where this is more expensive is where there is only one
exceptional entry but this is unlikely due to temporal locality and how
it affects LRU ordering.  Note that for truncations of small files
created recently, this patch should show no gain because it only batches
the handling of exceptional entries.

sparsetruncate (large)
                              4.14.0-rc4             4.14.0-rc4
                         pickhelper-v1r1       batchshadow-v1r1
Min          Time       38.00 (   0.00%)       27.00 (  28.95%)
1st-qrtle    Time       40.00 (   0.00%)       28.00 (  30.00%)
2nd-qrtle    Time       44.00 (   0.00%)       41.00 (   6.82%)
3rd-qrtle    Time      146.00 (   0.00%)      147.00 (  -0.68%)
Max-90%      Time      153.00 (   0.00%)      153.00 (   0.00%)
Max-95%      Time      155.00 (   0.00%)      156.00 (  -0.65%)
Max-99%      Time      181.00 (   0.00%)      171.00 (   5.52%)
Amean        Time       93.04 (   0.00%)       88.43 (   4.96%)
Best99%Amean Time       92.08 (   0.00%)       86.13 (   6.46%)
Best95%Amean Time       89.19 (   0.00%)       83.13 (   6.80%)
Best90%Amean Time       85.60 (   0.00%)       79.15 (   7.53%)
Best75%Amean Time       72.95 (   0.00%)       65.09 (  10.78%)
Best50%Amean Time       39.86 (   0.00%)       28.20 (  29.25%)
Best25%Amean Time       39.44 (   0.00%)       27.70 (  29.77%)

bonnie
                                      4.14.0-rc4             4.14.0-rc4
                                 pickhelper-v1r1       batchshadow-v1r1
Hmean     SeqCreate ops         71.92 (   0.00%)       76.78 (   6.76%)
Hmean     SeqCreate read        42.42 (   0.00%)       45.01 (   6.10%)
Hmean     SeqCreate del      26519.88 (   0.00%)    27191.87 (   2.53%)
Hmean     RandCreate ops        71.92 (   0.00%)       76.95 (   7.00%)
Hmean     RandCreate read       44.44 (   0.00%)       49.23 (  10.78%)
Hmean     RandCreate del     24948.62 (   0.00%)    24764.97 (  -0.74%)

Truncation of a large number of files shows a substantial gain with 99%
of files being truncated 6.46% faster.  bonnie shows a modest gain of
2.53%

[[email protected]: fix truncate_exceptional_pvec_entries()]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
This page took 0.150135 seconds and 4 git commands to generate.