Git Repo - linux.git/log

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client

Pull Ceph updates from Sage Weil:
"This changeset has a few main parts:

   - Ilya has finished a huge refactoring effort to sync up the
     client-side logic in libceph with the user-space client code, which
     has evolved significantly over the last couple years, with lots of
     additional behaviors (e.g., how requests are handled when cluster
     is full and transitions from full to non-full).

     This structure of the code is more closely aligned with userspace
     now such that it will be much easier to maintain going forward when
     behavior changes take place.  There are some locking improvements
     bundled in as well.

   - Zheng adds multi-filesystem support (multiple namespaces within the
     same Ceph cluster)

   - Zheng has changed the readdir offsets and directory enumeration so
     that dentry offsets are hash-based and therefore stable across
     directory fragmentation events on the MDS.

   - Zheng has a smorgasbord of bug fixes across fs/ceph"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits)
  ceph: fix wake_up_session_cb()
  ceph: don't use truncate_pagecache() to invalidate read cache
  ceph: SetPageError() for writeback pages if writepages fails
  ceph: handle interrupted ceph_writepage()
  ceph: make ceph_update_writeable_page() uninterruptible
  libceph: make ceph_osdc_wait_request() uninterruptible
  ceph: handle -EAGAIN returned by ceph_update_writeable_page()
  ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM
  ceph: block non-fatal signals for fault/page_mkwrite
  ceph: make logical calculation functions return bool
  ceph: tolerate bad i_size for symlink inode
  ceph: improve fragtree change detection
  ceph: keep leaf frag when updating fragtree
  ceph: fix dir_auth check in ceph_fill_dirfrag()
  ceph: don't assume frag tree splits in mds reply are sorted
  ceph: fix inode reference leak
  ceph: using hash value to compose dentry offset
  ceph: don't forbid marking directory complete after forward seek
  ceph: record 'offset' for each entry of readdir result
  ceph: define 'end/complete' in readdir reply as bit flags
  ...

Btrfs: fix handling of faults from btrfs_copy_from_user

When btrfs_copy_from_user isn't able to copy all of the pages, we need
to adjust our accounting to reflect the work that was actually done.

Commit 2e78c927d79 changed around the decisions a little and we ended up
skipping the accounting adjustments some of the time.  This commit makes
sure that when we don't copy anything at all, we still hop into
the adjustments, and switches to release_bytes instead of write_bytes,
since write_bytes isn't aligned.

The accounting errors led to warnings during btrfs_destroy_inode:

[   70.847532] WARNING: CPU: 10 PID: 514 at fs/btrfs/inode.c:9350 btrfs_destroy_inode+0x2b3/0x2c0
[   70.847536] Modules linked in: i2c_piix4 virtio_net i2c_core input_leds button led_class serio_raw acpi_cpufreq sch_fq_codel autofs4 virtio_blk
[   70.847538] CPU: 10 PID: 514 Comm: umount Tainted: G        W 4.6.0-rc6_00062_g2997da1-dirty #23
[   70.847539] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
[   70.847542]  0000000000000000 ffff880ff5cafab8 ffffffff8149d5e9 0000000000000202
[   70.847543]  0000000000000000 0000000000000000 0000000000000000 ffff880ff5cafb08
[   70.847547]  ffffffff8107bdfd ffff880ff5cafaf8 000024868120013d ffff880ff5cafb28
[   70.847547] Call Trace:
[   70.847550]  [<ffffffff8149d5e9>] dump_stack+0x51/0x78
[   70.847551]  [<ffffffff8107bdfd>] __warn+0xfd/0x120
[   70.847553]  [<ffffffff8107be3d>] warn_slowpath_null+0x1d/0x20
[   70.847555]  [<ffffffff8139c9e3>] btrfs_destroy_inode+0x2b3/0x2c0
[   70.847556]  [<ffffffff812003a1>] ? __destroy_inode+0x71/0x140
[   70.847558]  [<ffffffff812004b3>] destroy_inode+0x43/0x70
[   70.847559]  [<ffffffff810b7b5f>] ? wake_up_bit+0x2f/0x40
[   70.847560]  [<ffffffff81200c68>] evict+0x148/0x1d0
[   70.847562]  [<ffffffff81398ade>] ? start_transaction+0x3de/0x460
[   70.847564]  [<ffffffff81200d49>] dispose_list+0x59/0x80
[   70.847565]  [<ffffffff81201ba0>] evict_inodes+0x180/0x190
[   70.847566]  [<ffffffff812191ff>] ? __sync_filesystem+0x3f/0x50
[   70.847568]  [<ffffffff811e95f8>] generic_shutdown_super+0x48/0x100
[   70.847569]  [<ffffffff810b75c0>] ? woken_wake_function+0x20/0x20
[   70.847571]  [<ffffffff811e9796>] kill_anon_super+0x16/0x30
[   70.847573]  [<ffffffff81365cde>] btrfs_kill_super+0x1e/0x130
[   70.847574]  [<ffffffff811e99be>] deactivate_locked_super+0x4e/0x90
[   70.847576]  [<ffffffff811e9e61>] deactivate_super+0x51/0x70
[   70.847577]  [<ffffffff8120536f>] cleanup_mnt+0x3f/0x80
[   70.847579]  [<ffffffff81205402>] __cleanup_mnt+0x12/0x20
[   70.847581]  [<ffffffff81098358>] task_work_run+0x68/0xa0
[   70.847582]  [<ffffffff810022b6>] exit_to_usermode_loop+0xd6/0xe0
[   70.847583]  [<ffffffff81002e1d>] do_syscall_64+0xbd/0x170
[   70.847586]  [<ffffffff817d4dbc>] entry_SYSCALL64_slow_path+0x25/0x25

This is the test program I used to force short returns from
btrfs_copy_from_user

void *dontneed(void *arg)
{
char *p = arg;
int ret;

while(1) {
ret = madvise(p, BUFSIZE/4, MADV_DONTNEED);
if (ret) {
perror("madvise");
exit(1);
}
}
}

int main(int ac, char **av) {
int ret;
int fd;
char *filename;
unsigned long offset;
char *buf;
int i;
pthread_t tid;

if (ac != 2) {
fprintf(stderr, "usage: dammitdave filename\n");
exit(1);
}

buf = mmap(NULL, BUFSIZE, PROT_READ|PROT_WRITE,
   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
if (buf == MAP_FAILED) {
perror("mmap");
exit(1);
}
memset(buf, 'a', BUFSIZE);
filename = av[1];

ret = pthread_create(&tid, NULL, dontneed, buf);
if (ret) {
fprintf(stderr, "error %d from pthread_create\n", ret);
exit(1);
}

ret = pthread_detach(tid);
if (ret) {
fprintf(stderr, "pthread detach failed %d\n", ret);
exit(1);
}

while (1) {
fd = open(filename, O_RDWR | O_CREAT, 0600);
if (fd < 0) {
perror("open");
exit(1);
}

for (i = 0; i < ROUNDS; i++) {
int this_write = BUFSIZE;

offset = rand() % MAXSIZE;
ret = pwrite(fd, buf, this_write, offset);
if (ret < 0) {
perror("pwrite");
exit(1);
} else if (ret != this_write) {
fprintf(stderr, "short write to %s offset %lu ret %d\n",
filename, offset, ret);
exit(1);
}
if (i == ROUNDS - 1) {
ret = sync_file_range(fd, offset, 4096,
    SYNC_FILE_RANGE_WRITE);
if (ret < 0) {
perror("sync_file_range");
exit(1);
}
}
}
ret = ftruncate(fd, 0);
if (ret < 0) {
perror("ftruncate");
exit(1);
}
ret = close(fd);
if (ret) {
perror("close");
exit(1);
}
ret = unlink(filename);
if (ret) {
perror("unlink");
exit(1);
}

}
return 0;
}

Signed-off-by: Chris Mason <[email protected]>
Reported-by: Dave Jones <[email protected]>
Fixes: 2e78c927d79333f299a8ac81c2fd2952caeef335
cc: [email protected] # v4.6
Signed-off-by: Chris Mason <[email protected]>

Merge branch 'for-chris-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.7

i2c: dev: switch from register_chrdev to cdev API

i2c-dev had never moved away from the older register_chrdev interface to
implement its char device registration. The register_chrdev API has the
limitation of enabling only up to 256 i2c-dev busses to exist.

Large platforms with lots of i2c devices (i.e. pluggable transceivers)
with dedicated busses may have to exceed that limit.
In particular, there are also platforms making use of the i2c bus
multiplexing API, which instantiates a virtual bus for each possible
multiplexed selection.

This patch removes the register_chrdev usage and replaces it with the
less old cdev API, which takes away the 256 i2c-dev bus limitation.
It should not have any other impact for i2c bus drivers or user space.

This patch has been tested on qemu x86 and qemu powerpc platforms with
the aid of a module which adds and removes 5000 virtual i2c busses, as
well as validated on an existing powerpc hardware platform which makes
use of the i2c bus multiplexing API.
i2c-dev busses with device minor numbers larger than 256 have also been
validated to work with the existing i2c-tools.

Signed-off-by: Erico Nunes <[email protected]>
[wsa: kept includes sorted]
Signed-off-by: Wolfram Sang <[email protected]>

i2c: xlr: rename ARCH_TANGOX to ARCH_TANGO

The ARCH name was changed during the review process of the mach, and
this driver was forgotten to be converted. Fix it now.
http://article.gmane.org/gmane.linux.ports.arm.kernel/456331

Signed-off-by: Marc Gonzalez <[email protected]>
[wsa: updated commit message slightly]
Signed-off-by: Wolfram Sang <[email protected]>

i2c: at91: change log when dma configuration fails

When the DMA configuration fails, there is a log reporting that we can't
use DMA and indicating the error number. When booting the kernel, it is
annoying to see this error number. Moreover, people can think something
is going wrong. It is not the case, it means that DMA can't be used but
it doesn't prevent to use i2c.

Signed-off-by: Ludovic Desroches <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

misc: at24: Fix typo in at24 header file

This commit fixes a simple typo s/mvmem/nvmem in the
example.

Signed-off-by: Moritz Fischer <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

i2c: rcar: should depend on HAS_DMA

If NO_DMA=y:

    drivers/built-in.o: In function `rcar_i2c_dma_unmap':
    i2c-rcar.c:(.text+0x6f06c6): undefined reference to `bad_dma_ops'
    drivers/built-in.o: In function `rcar_i2c_dma':
    i2c-rcar.c:(.text+0x6f07e2): undefined reference to `bad_dma_ops'
    i2c-rcar.c:(.text+0x6f0838): undefined reference to `bad_dma_ops'

Add a dependency on HAS_DMA to fix this.

Signed-off-by: Geert Uytterhoeven <[email protected]>
Acked-by: Niklas Söderlund <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

i2c: rcar: use dma_request_chan()

New drivers should not use dma_request_slave_channel_reason() but
dma_request_chan(). The former is a macro to the later so this change do
not effect the driver in any way.

Reported-by: Arnd Bergmann <[email protected]>
Signed-off-by: Niklas Söderlund <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

Merge tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
"Highlights include:

  Features:
   - Add support for the NFS v4.2 COPY operation
   - Add support for NFS/RDMA over IPv6

  Bugfixes and cleanups:
   - Avoid race that crashes nfs_init_commit()
   - Fix oops in callback path
   - Fix LOCK/OPEN race when unlinking an open file
   - Choose correct stateids when using delegations in setattr, read and
     write
   - Don't send empty SETATTR after OPEN_CREATE
   - xprtrdma: Prevent server from writing a reply into memory client
     has released
   - xprtrdma: Support using Read list and Reply chunk in one RPC call"

* tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (61 commits)
  pnfs: pnfs_update_layout needs to consider if strict iomode checking is on
  nfs/flexfiles: Use the layout segment for reading unless it a IOMODE_RW and reading is disabled
  nfs/flexfiles: Helper function to detect FF_FLAGS_NO_READ_IO
  nfs: avoid race that crashes nfs_init_commit
  NFS: checking for NULL instead of IS_ERR() in nfs_commit_file()
  pnfs: make pnfs_layout_process more robust
  pnfs: rework LAYOUTGET retry handling
  pnfs: lift retry logic from send_layoutget to pnfs_update_layout
  pnfs: fix bad error handling in send_layoutget
  flexfiles: add kerneldoc header to nfs4_ff_layout_prepare_ds
  flexfiles: remove pointless setting of NFS_LAYOUT_RETURN_REQUESTED
  pnfs: only tear down lsegs that precede seqid in LAYOUTRETURN args
  pnfs: keep track of the return sequence number in pnfs_layout_hdr
  pnfs: record sequence in pnfs_layout_segment when it's created
  pnfs: don't merge new ff lsegs with ones that have LAYOUTRETURN bit set
  pNFS/flexfiles: When initing reads or writes, we might have to retry connecting to DSes
  pNFS/flexfiles: When checking for available DSes, conditionally check for MDS io
  pNFS/flexfile: Fix erroneous fall back to read/write through the MDS
  NFS: Reclaim writes via writepage are opportunistic
  NFSv4: Use the right stateid for delegations in setattr, read and write
  ...

Merge tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs updates from Dave Chinner:
"A pretty average collection of fixes, cleanups and improvements in
  this request.

  Summary:
   - fixes for mount line parsing, sparse warnings, read-only compat
     feature remount behaviour
   - allow fast path symlink lookups for inline symlinks.
   - attribute listing cleanups
   - writeback goes direct to bios rather than indirecting through
     bufferheads
   - transaction allocation cleanup
   - optimised kmem_realloc
   - added configurable error handling for metadata write errors,
     changed default error handling behaviour from "retry forever" to
     "retry until unmount then fail"
   - fixed several inode cluster writeback lookup vs reclaim race
     conditions
   - fixed inode cluster writeback checking wrong inode after lookup
   - fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
   - cleaned up inode reclaim tagging"

* tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits)
  xfs: fix warning in xfs_finish_page_writeback for non-debug builds
  xfs: move reclaim tagging functions
  xfs: simplify inode reclaim tagging interfaces
  xfs: rename variables in xfs_iflush_cluster for clarity
  xfs: xfs_iflush_cluster has range issues
  xfs: mark reclaimed inodes invalid earlier
  xfs: xfs_inode_free() isn't RCU safe
  xfs: optimise xfs_iext_destroy
  xfs: skip stale inodes in xfs_iflush_cluster
  xfs: fix inode validity check in xfs_iflush_cluster
  xfs: xfs_iflush_cluster fails to abort on error
  xfs: remove xfs_fs_evict_inode()
  xfs: add "fail at unmount" error handling configuration
  xfs: add configuration handlers for specific errors
  xfs: add configuration of error failure speed
  xfs: introduce table-based init for error behaviors
  xfs: add configurable error support to metadata buffers
  xfs: introduce metadata IO error class
  xfs: configurable error behavior via sysfs
  xfs: buffer ->bi_end_io function requires irq-safe lock
  ...

staging/rdma: Remove the entire rdma subdirectory of staging

This is no longer in use. Remove it.

Signed-off-by: Doug Ledford <[email protected]>

IB/core: Make device counter infrastructure dynamic

In practice, each RDMA device has a unique set of counters that the
hardware implements. Having a central set of counters that they must
all adhere to is limiting and causes many useful counters to not be
available.

Therefore we create a dynamic counter registration infrastructure.

The driver must implement a stats structure allocation routine, in
which the driver must place the directory name it wants, a list of
names for all of the counters, an array of u64 counters themselves,
plus a few generic configuration options.

We then implement a core routine to create a sysfs file for each
of the named stats elements, and a core routine to retrieve the
stats when any of the sysfs attribute files are read.

To avoid excessive beating on the stats generation routine in the
drivers, the core code also caches the stats for a short period of
time so that someone attempting to read all of the stats in a
given device's directory will not result in a stats generation
call per file read.

Future work will attempt to standardize just the shared stats
elements, and possibly add a method to get the stats via netlink
in addition to sysfs.

Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Mark Bloch <[email protected]>
Reviewed-by: Steve Wise <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>
[ Add caching, make structure names more informative, add i40iw support,
other significant rewrites from the original patch ]

Merge branch 'hfi1-2' into k.o/for-4.7

Merge branch 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging

Pull hwmon fixlets from Jean Delvare.

* 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
Documentation/hwmon: Update links in max34440
hwmon: (emc2103) Fix typo in MODULE_PARM_DESC

Merge tag 'mmc-v4.7-rc1' of git://git.linaro.org/people/ulf.hansson/mmc

Pull MMC fixes from Ulf Hansson:
"Here are some mmc fixes intended for v4.7 rc1.  They are based on a
  commit earlier in the merge window and have been tested in linux-next
  for a while.

  MMC core:
   - Prevent re-tuning while serving requests for RPMB partitions
   - Extend timeout for long read time quirk to support more eMMCs

  MMC host:
   - sdhci-acpi: Ensure connected devices are powered when probing
   - sdhci-pci|acpi: Remove unreliable MMC_CAP_BUS_WIDTH_TEST for Intel HWs
   - dw_mmc: Correct the assigning of max_blk_size
   - dw_mmc-rockchip: Allow RPMB partitions to be created
   - dw_mmc-rockchip: Set the drive phase properly"

* tag 'mmc-v4.7-rc1' of git://git.linaro.org/people/ulf.hansson/mmc:
  mmc: sdhci-acpi: Remove MMC_CAP_BUS_WIDTH_TEST for Intel controllers
  mmc: sdhci-pci: Remove MMC_CAP_BUS_WIDTH_TEST for Intel controllers
  mmc: longer timeout for long read time quirk
  mmc: dw_mmc: rockchip: Set the drive phase properly
  mmc: dw_mmc: fix the wrong max_blk_size
  mmc: dw_mmc-rockchip: add MMC_CAP_CMD23 capabilities
  mmc: sdhci-acpi: Ensure connected devices are powered when probing
  ACPI / PM: Export acpi_device_fix_up_power()
  mmc: block: Pause re-tuning while switched to the RPMB partition
  mmc: block: Always switch back to main area after RPMB access
  mmc: core: Add a facility to "pause" re-tuning

Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux

Pull thermal management updates from Zhang Rui:

- Introduce generic ADC thermal driver, based on OF thermal (Laxman
   Dewangan)

- Introduce new thermal driver for Tango chips (Marc Gonzalez)

- Rockchip driver support for RK3399, RK3366, and some fixes (Caesar
   Wang, Elaine Zhang and Shawn Lin)

- Add CPU power cooling model to Mediatek thermal driver (Dawei Chien)

- Wider usage of dev_thermal_zone_of_sensor_register (Eduardo Valentin)

- TI thermal driver gained a new maintainer (Keerthy).

- Enabled powerclamp driver by checking CPU feature and package cstate
   counter instead of CPU whitelist (Jacob Pan)

- Various fixes on thermal governor, OF thermal, Tegra, and RCAR

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: (50 commits)
  thermal: tango: initialize TEMPSI_CFG
  thermal: rockchip: use the usleep_range instead of udelay
  thermal: rockchip: add the notes for better reading
  thermal: rockchip: Support RK3366 SoCs in the thermal driver
  thermal: rockchip: handle the power sequence for tsadc controller
  thermal: rockchip: update the tsadc table for rk3399
  thermal: rockchip: fixes the code_to_temp for tsadc driver
  thermal: rockchip: disable thermal->clk in err case
  thermal: tegra: add Tegra132 specific SOC_THERM driver
  thermal: fix ptr_ret.cocci warnings
  thermal: mediatek: Add cpu dynamic power cooling model.
  thermal: generic-adc: Add ADC based thermal sensor driver
  thermal: generic-adc: Add DT binding for ADC based thermal sensor
  thermal: tegra: fix static checker warning
  thermal: tegra: mark PM functions __maybe_unused
  thermal: add temperature sensor support for tango SoC
  thermal: hisilicon: fix IRQ imbalance enabling
  thermal: hisilicon: support to use any sensor
  thermal: rcar: Remove binding docs for r8a7794
  thermal: tegra: add PM support
  ...

IB/hfi1: Fix pio map initialization

The pio map initialization function is off by 1 causing the last
kernel send context that is allocated to not get mapped into the
pio map which leads to the last kernel send context not being used
by any of the qps.

The send context reserved for VL15 is taken care of by setting the
scontext variable that is used as the index into the kernel send
context array to 1 and does not need to be accounted for in the
kernel send context counting loop as it is currently done.

Fix the kernel send context counting loop to account for all the
allocated send contexts and map all of them to the different VLs.

Reviewed-by: Dennis Dalessandro <[email protected]>
Reviewed-by: Mike Marciniszyn <[email protected]>
Reviewed-by: Jianxin Xiong <[email protected]>
Signed-off-by: Jubin John <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Correct 8051 link parameter settings

Two 8051 link settings, external device config and tuning method,
were written in the wrong location and the previous settings were
not cleared. For both, clear the old value and write the new
value.

Fixes: 8ebd4cf1852a ("staging/rdma/hfi1: Add active and optical cable support")
Reviewed-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Dean Luick <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Update pkey table properly after link down or FM start

When FM is disabled, and the HFI port on the switch is
changed from MgmtAllowed=YES to MgmtAllowed=NO and the
link is bounced, FULL_MGMT_P_KEY doesn't get cleared
from the pkey table. This also occurs when the QSFP
cable is moved from a switch port with MgmtAllowed=YES
to a MgmtAllowed=NO port. Clear pkey entry properly.

Also, when the driver is loaded and the switch port is
set to MgmtAllowed=NO, FULL_MGMT_P_KEY shouldn't be added
to pkey table after FM is started. Only set FULL_MGMT_P_KEY
in the pkey table if switch port is configured to
MgmtAllowed=YES.

Reviewed-by: Dean Luick <[email protected]>
Signed-off-by: Sebastian Sanchez <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/rdamvt: Fix rdmavt s_ack_queue sizing

rdmavt allows the driver to specify the size of the ack queue, but
only uses it for the modify QP limit testing for setting the atomic
limit value.

The driver dependent size is now used to size the s_ack_queue ring
dynamicially.

Since the driver knows its size, the driver will use its define
for any ring size dependent code.

Reviewed-by: Mitko Haralanov <[email protected]>
Signed-off-by: Mike Marciniszyn <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/rdmavt: Max atomic value should be a u8

This matches the ib_qp_attr size and
avoids a extremely large value when the lower level
driver registers.

As part of the patch, the u8 ordinals are moved to the
end of the struct to reduce pahole noted excesses.

Reviewed-by: Mitko Haralanov <[email protected]>
Reviewed-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Mike Marciniszyn <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Fix hard lockup due to not using save/restore spin lock

Commit b9b06cb6feda
("IB/hfi1: Fix missing lock/unlock in verbs drain callback")
added a spin lock.

Unfortunately, the new lock code can be called from a base
level interrupt state, and an interrupt that can get stacked
will attempt to get the same lock.

Fix by using the flag save/restore spin lock variation.

Cc: [email protected] # 4.6+
Reviewed-by: Sebastian Sanchez <[email protected]>
Signed-off-by: Mike Marciniszyn <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Add tracing support for send with invalidate opcode

Enable trace generation for packets with the "Send Last with
Invalidate" and "Send Only with Invalidate" opcodes.

Reviewed-by: Mike Marciniszyn <[email protected]>
Reviewed-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Jianxin Xiong <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1, qib: Add ieth to the packet header definitions

A new union member "ieth" (Invalidate Extended Transport Header) is
added to the packet header definition in preparation of supporting
the send with invalidate opcode.

Reviewed-by: Mike Marciniszyn <[email protected]>
Reviewed-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Jianxin Xiong <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security

Pull Yama locking fix from James Morris:
"Fix for the Yama LSM"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
Yama: fix double-spinlock and user access in atomic context

Merge branches 'misc-4.7-2', 'ipoib' and 'ib-router' into k.o/for-4.7

IB/hfi1: Move driver out of staging

The TODO list for the hfi1 driver was completed during 4.6. In addition
other objections raised (which are far beyond what was in the TODO list)
have been addressed as well. It is now time to remove the driver from
staging and into the drivers/infiniband sub-tree.

Reviewed-by: Jubin John <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Do not free hfi1 cdev parent structure early

The deletion of a cdev is not a fence for holding off references to the
structure. The driver attempts to delete the cdev and then proceeds to
free the parent structure, the hfi1_devdata, or dd. This can potentially
lead to a kernel panic in situations where a user has an FD for the cdev
open, and the pci device gets removed. If the user then closes the FD
there will be a NULL dereference when trying to do put on the cdev's
kobject.

Fix this by pointing the cdev's kobject.parent at a new kobject embedded
in its parent structure. Also take a reference when the device is opened
and put it back when it is closed.

Reviewed-by: Mitko Haralanov <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Add trace message in user IOCTL handling

Add a trace message to HFI1s user IOCTL handling. This allows debugging
of which IOCTLs are being handled by the driver.

Reviewed-by: Ira Weiny <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Remove write(), use ioctl() for user cmds

Remove the write() handler for user space commands now that ioctl
handling is available. User apps will need to change to use ioctl from
this point forward.

Reviewed-by: Mitko Haralanov <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Add ioctl() interface for user commands

IOCTL is more suited to what user space commands need to do than the
write() interface. Add IOCTL definitions for all existing write commands
and the handling for those. The write() interface will be removed in a
follow on patch.

Reviewed-by: Mitko Haralanov <[email protected]>
Reviewed-by: Mike Marciniszyn <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Remove unused user command

The HFI1_CMD_SDMA_STATUS_UPD command was never implemented it has no
reason to live in the driver. Remove it.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Mitko Haralanov <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Remove snoop/diag interface

The snoop/diag interface is better served by an implementation which is
more general and usable by other drivers perhaps. Go ahead and remove
the code now and get rid of the char dev. We can put the feature back
when we have a more agreeable solution.

Reviewed-by: Dean Luick <[email protected]>
Reviewed-by: Mike Marciniszyn <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Remove EPROM functionality from data device

Remove EPROM handling from the cdev which is used for user application
data traffic.

Reviewed-by: Dean Luick <[email protected]>
Reviewed-by: Mike Marciniszyn <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Remove UI char device

Remove UI char device which exposes direct access to registers for user
space. This was put in to aid in debugging the hardware. We are looking
into alternatives means of providing the same functionality. This
removes another char device from HFI1's footprint.

Reviewed-by: Dean Luick <[email protected]>
Reviewed-by: Mitko Haralanov <[email protected]>
Reviewed-by: Mike Marciniszyn <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Remove multiple device cdev

hfi1 current exports a cdev that can be used to target all of the hfi's
in the system. However there is a problem with this approach in
that the devices could be on different subnets. This is a problem that
user space can figure out and explicitly tell the driver on which device
to create a context.

Remove the multi-purpose cdev leaving a dedicated cdev for each port.
Also remove the striping capability that is dependent upon the user
choosing the multi-purpose cdev. It is now up to user space to determine
how to stripe contexts.

Reviewed-by: Dean Luick <[email protected]>
Reviewed-by: Mitko Haralanov <[email protected]>
Reviewed-by: Mike Marciniszyn <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Remove anti-pattern in cdev init

Remove the usage of an anti-pattern goto in hfi1_cdev_init to improve
code readability.

Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Fix bug that blocks process on exit after port bounce

During the processing of a user SDMA request, if there was an
error before the request counter was increased, the state of
the packet queue could be updated incorrectly, causing the
counter to underflow. As the result, the process could get
stuck later since the counter could never get back to 0.

This patch adds a condition to guard the packet queue update
so that the counter is only decreased if it has been increased
before the error happens.

Reviewed-by: Mitko Haralanov <[email protected]>
Signed-off-by: Jianxin Xiong <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/qib: Remove unused qib_7322_intr_msgs[]

Building the qib driver with gcc version 6.1.0 raises the following
build warning:
drivers/infiniband/hw/qib/qib_iba7322.c:1311:39: warning:
'qib_7322_intr_msgs' defined but not used [-Wunused-const-variable=]
static const struct qib_hwerror_msgs qib_7322_intr_msgs[] = {
^~~~~~~~~~~~~~~~~~
Remove the unused qib_7322_intr_msgs[]

Reviewed-by: Dennis Dalessandro <[email protected]>
Reviewed-by: Mike Marciniszyn <[email protected]>
Signed-off-by: Jubin John <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Remove unnecessary comment

This comment was old, the MTU enums have been defined.

Reviewed-by: Mitko Haralanov <[email protected]>
Reviewed-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Fix sdma_event_names[] build warning

sdma_event_names[] is only used within CONFIG_SDMA_VERBOSITY ifdefs, so
when CONFIG_SDMA_VERBOSITY is disabled, it results in the following
0-day build warning:
>> drivers/infiniband/hw/hfi1/sdma.c:137:27: warning: 'sdma_event_names'
>> defined but not used [-Wunused-const-variable=]
static const char * const sdma_event_names[] = {
^~~~~~~~~~~~~~~~
This occurs on the following compiler:
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430

For more information check:
https://lists.01.org/pipermail/kbuild-all/2016-May/020060.html

Fix this warning by defining sdma_event_name[] only within the
CONFIG_SDMA_VERBOSITY ifdefs.

Reported-by: kbuild test robot <[email protected]>
Reviewed-by: Mike Marciniszyn <[email protected]>
Reviewed-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Jubin John <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/rdmavt: Use kzalloc_node

Use kzalloc_node instead of kzalloc for rdmavt memory region segment
allocation to optimize for performance on NUMA platforms.

Reviewed-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Jubin John <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/rdmavt: Insure QP vmalloc variants zero memory

The usage of the various vmalloc APIs do not consistently zero memory
when allocating the swqe. Insure zeroing variants are used.

Reviewed-by: Mitko Haralanov <[email protected]>
Signed-off-by: Mike Marciniszyn <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

IB/hfi1: Fix an interval RB node reference count leak

Commit e88c9271d9f8 ("IB/hfi1: Fix buffer cache corner case which
may cause corruption") introduced a bug which may cause a reference
count of a interval RB node to be leaked in the case where an SDMA
transfer from that node completes at the same time as the node is
being extended.

If a node is being extended, it is first removed from the RB tree
in order to be processed without the risk of an invalidation event
removing the node at the same time.

If a SDMA completion happens during that time, the completion handler
will fail to find the node in the RB tree and, therefore, fail to
correctly decrement its refcount. This leaves the node in the tree and
its pages pinned for the duration of the user process.

To prevent this from happening the io vector adds a reference to the
RB node, which is used during the SDMA completion instead of looking
up the node in the RB tree.

This change adds a performance improvement as a side effect by avoiding
the RB tree lookup.

Fixes: e88c9271d9f8 ("IB/hfi1: Fix buffer cache corner case which may cause corruption")
Reviewed-by: Dean Luick <[email protected]>
Reviewed-by: Harish Chegondi <[email protected]>
Signed-off-by: Mitko Haralanov <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>

blk-mq: clear q->mq_ops if init fail

blk_mq_init_queue() calls blk_mq_init_allocated_queue(), but q->mq_ops
was not cleared when blk_mq_init_allocated_queue() fails.
Then blk_cleanup_queue() calls blk_mq_free_queue() which will crash because:
- q->all_q_node is not added to all_q_list yet
- q->tag_set is NULL
- hctx was not setup yet or already freed

Fixed it by clearing q->mq_ops on error path.

Signed-off-by: Ming Lin <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

pnfs: pnfs_update_layout needs to consider if strict iomode checking is on

As flexfiles has FF_FLAGS_NO_READ_IO, there is a need to generically
support enforcing that a IOMODE_RW segment will not allow READ I/O.

Signed-off-by: Tom Haynes <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

nfs/flexfiles: Use the layout segment for reading unless it a IOMODE_RW and reading is disabled

Signed-off-by: Tom Haynes <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

Documentation/hwmon: Update links in max34440

It appears the website for maxim-ic.com changed to
maximintegrated.com.

Signed-off-by: Glenn Dayton <[email protected]>
Signed-off-by: Jean Delvare <[email protected]>

hwmon: (emc2103) Fix typo in MODULE_PARM_DESC

"apd" was intended here instead of "init".

Signed-off-by: Dan Carpenter <[email protected]>
Signed-off-by: Jean Delvare <[email protected]>

restore killability of old mutex_lock_killable(&inode->i_mutex) users

The ones that are taking it exclusive, that is...

Signed-off-by: Al Viro <[email protected]>

add down_write_killable_nested()

Signed-off-by: Al Viro <[email protected]>

update D/f/directory-locking

Signed-off-by: Al Viro <[email protected]>

Revert "mtd: atmel_nand: Support variable RB_EDGE interrupts"

This reverts commit 5ddc7bd43ccc ("mtd: atmel_nand: Support variable
RB_EDGE interrupts")

Because for current SoCs, the RB_EDGE3(i.e. bit 27) of HSMC_SR
register does not exist, the RB_EDGE0 (i.e. bit 24) is the ready/busy
line edge status bit. It is a datasheet bug.

Cc: <[email protected]>
Fixes: commit 5ddc7bd43ccc ("mtd: atmel_nand: Support variable RB_EDGE interrupts")
Signed-off-by: Wenyou Yang <[email protected]>
Signed-off-by: Brian Norris <[email protected]>

Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:
"Misc fixes: EFI, entry code, pkeys and MPX fixes, TASK_SIZE cleanups
  and a tsc frequency table fix"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Switch from TASK_SIZE to TASK_SIZE_MAX in the page fault code
  x86/fsgsbase/64: Use TASK_SIZE_MAX for FSBASE/GSBASE upper limits
  x86/mm/mpx: Work around MPX erratum SKD046
  x86/entry/64: Fix stack return address retrieval in thunk
  x86/efi: Fix 7-parameter efi_call()s
  x86/cpufeature, x86/mm/pkeys: Fix broken compile-time disabling of pkeys
  x86/tsc: Add missing Cherrytrail frequency to the table

Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
"Two fixes: one for a lost wakeup, the other to fix the compiler
  optimizing out preempt operations on ARM64 (and possibly other non-x86
  architectures)"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix remote wakeups
  sched/preempt: Fix preempt_count manipulations

Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf updates from Ingo Molnar:
"Mostly tooling and PMU driver fixes, but also a number of late updates
  such as the reworking of the call-chain size limiting logic to make
  call-graph recording more robust, plus tooling side changes for the
  new 'backwards ring-buffer' extension to the perf ring-buffer"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
  perf record: Read from backward ring buffer
  perf record: Rename variable to make code clear
  perf record: Prevent reading invalid data in record__mmap_read
  perf evlist: Add API to pause/resume
  perf trace: Use the ptr->name beautifier as default for "filename" args
  perf trace: Use the fd->name beautifier as default for "fd" args
  perf report: Add srcline_from/to branch sort keys
  perf evsel: Record fd into perf_mmap
  perf evsel: Add overwrite attribute and check write_backward
  perf tools: Set buildid dir under symfs when --symfs is provided
  perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced
  perf annotate: Sort list of recognised instructions
  perf annotate: Fix identification of ARM blt and bls instructions
  perf tools: Fix usage of max_stack sysctl
  perf callchain: Stop validating callchains by the max_stack sysctl
  perf trace: Fix exit_group() formatting
  perf top: Use machine->kptr_restrict_warned
  perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1
  perf machine: Do not bail out if not managing to read ref reloc symbol
  perf/x86/intel/p4: Trival indentation fix, remove space
  ...

Yama: fix double-spinlock and user access in atomic context

Commit 8a56038c2aef ("Yama: consolidate error reporting") causes lockups
when someone hits a Yama denial. Call chain:

process_vm_readv -> process_vm_rw -> process_vm_rw_core -> mm_access
-> ptrace_may_access
task_lock(...) is taken
__ptrace_may_access -> security_ptrace_access_check
-> yama_ptrace_access_check -> report_access -> kstrdup_quotable_cmdline
-> get_cmdline -> access_process_vm -> get_task_mm
task_lock(...) is taken again

task_lock(p) just calls spin_lock(&p->alloc_lock), so at this point,
spin_lock() is called on a lock that is already held by the current
process.

Also: Since the alloc_lock is a spinlock, sleeping inside
security_ptrace_access_check hooks is probably not allowed at all? So it's
not even possible to print the cmdline from in there because that might
involve paging in userspace memory.

It would be tempting to rewrite ptrace_may_access() to drop the alloc_lock
before calling the LSM, but even then, ptrace_may_access() itself might be
called from various contexts in which you're not allowed to sleep; for
example, as far as I understand, to be able to hold a reference to another
task, usually an RCU read lock will be taken (see e.g. kcmp() and
get_robust_list()), so that also prohibits sleeping. (And using e.g. FUSE,
a user can cause pagefault handling to take arbitrary amounts of time -
see https://bugs.chromium.org/p/project-zero/issues/detail?id=808.)

Therefore, AFAIK, in order to print the name of a process below
security_ptrace_access_check(), you'd have to either grab a reference to
the mm_struct and defer the access violation reporting or just use the
"comm" value that's stored in kernelspace and accessible without big
complications. (Or you could try to use some kind of atomic remote VM
access that fails if the memory isn't paged in, similar to
copy_from_user_inatomic(), and if necessary fall back to comm, but
that'd be kind of ugly because the comm/cmdline choice would look
pretty random to the user.)

Fix it by deferring reporting of the access violation until current
exits kernelspace the next time.

v2: Don't oops on PTRACE_TRACEME, call report_access under
task_lock(current). Also fix nonsensical comment. And don't use
GPF_ATOMIC for memory allocation with no locks held.
This patch is tested both for ptrace attach and ptrace traceme.

Fixes: 8a56038c2aef ("Yama: consolidate error reporting")
Signed-off-by: Jann Horn <[email protected]>
Acked-by: Kees Cook <[email protected]>
Signed-off-by: James Morris <[email protected]>

Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool build fix from Ingo Molnar:
"An libtool fix for older libelf versions"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Allow building with older libelf

ceph: fix wake_up_session_cb()

We should reset i_requested_max_size before waking the waiters.
(zero i_requested_max_size make waiter re-request the max size)

Signed-off-by: Yan, Zheng <[email protected]>

ceph: don't use truncate_pagecache() to invalidate read cache

truncate_pagecache() drops dirty pages, it's dangerous to use it
to invalidate read cache. Besides, we shouldn't start invalidating
read cache while there are buffer writers. Because buffer writers
may add dirty pages later.

Signed-off-by: Yan, Zheng <[email protected]>

ceph: SetPageError() for writeback pages if writepages fails

Signed-off-by: Yan, Zheng <[email protected]>

ceph: handle interrupted ceph_writepage()

writepage() can be interrupted when it's called by direct memory
reclaimer (the direct memory relaimer is killed). To avoid lossing
data, we redirty the page.

Signed-off-by: Yan, Zheng <[email protected]>

ceph: make ceph_update_writeable_page() uninterruptible

ceph_update_writeable_page() is used by ceph_write_begin(). It beaks
atomicity of write operation if it's interruptible.

Signed-off-by: Yan, Zheng <[email protected]>

libceph: make ceph_osdc_wait_request() uninterruptible

Ceph_osdc_wait_request() is used when cephfs issues sync IO. In most
cases, the sync IO should be uninterruptible. The fix is use killale
wait function in ceph_osdc_wait_request().

Signed-off-by: Yan, Zheng <[email protected]>

ceph: handle -EAGAIN returned by ceph_update_writeable_page()

when ceph_update_writeable_page() return -EAGAIN, caller should
lock the page and call ceph_update_writeable_page() again.

Signed-off-by: Yan, Zheng <[email protected]>

ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM

Signed-off-by: Yan, Zheng <[email protected]>

ceph: block non-fatal signals for fault/page_mkwrite

Fault and page_mkwrite are supposed to be uninterruptable. But they
call ceph functions that are interruptible. So they should block
signals before calling functions that are interruptible

Signed-off-by: Yan, Zheng <[email protected]>

ceph: make logical calculation functions return bool

This patch makes serverl logical caculation functions return bool to
improve readability due to these particular functions only using 0/1
as their return value.

No functional change.

Signed-off-by: Zhang Zhuoyu <[email protected]>

ceph: tolerate bad i_size for symlink inode

A mds bug can cause symlink's size to be truncated to zero.

Signed-off-by: Yan, Zheng <[email protected]>

ceph: improve fragtree change detection

check if number of splits in i_fragtree is equal to number of splits
in mds reply

Signed-off-by: Yan, Zheng <[email protected]>

ceph: keep leaf frag when updating fragtree

Nodes in i_fragtree are sorted according to ceph_compare_frag().
It means frag node in i_fragtree always follow its direct parent
node. To check if a leaf node is valid, we just need to check if
it's child of previous split node.

Signed-off-by: Yan, Zheng <[email protected]>

ceph: fix dir_auth check in ceph_fill_dirfrag()

-1 is CDIR_AUTH_PARENT, it means dir's auth mds is the same as
inode's auth mds

Signed-off-by: Yan, Zheng <[email protected]>

ceph: don't assume frag tree splits in mds reply are sorted

The algorithm that updates i_fragtree relies on that the frag tree
splits in mds reply are of the same order of i_fragtree. This is not
true because current MDS encodes frag tree splits in ascending order
of (unsigned)frag_t. But nodes in i_fragtree are sorted according to
ceph_frag_compare().

The fix is sort the frag tree splits first, then updates i_fragtree.

Signed-off-by: Yan, Zheng <[email protected]>

ceph: fix inode reference leak

Signed-off-by: Yan, Zheng <[email protected]>

ceph: using hash value to compose dentry offset

If MDS sorts dentries in dirfrag in hash order, we use hash value to
compose dentry offset. dentry offset is:

(0xff << 52) | ((24 bits hash) << 28) |
(the nth entry hash hash collision)

This offset is stable across directory fragmentation. This alos means
there is no need to reset readdir offset if directory get fragmented
in the middle of readdir.

Signed-off-by: Yan, Zheng <[email protected]>

ceph: don't forbid marking directory complete after forward seek

Forward seek within same frag does not update fi->last_name, it will
not affect contents of later readdir reply. So there is no need to
forbid marking directory complete

Signed-off-by: Yan, Zheng <[email protected]>

ceph: record 'offset' for each entry of readdir result

This is preparation for using hash value as dentry 'offset'

Signed-off-by: Yan, Zheng <[email protected]>

ceph: define 'end/complete' in readdir reply as bit flags

Set a flag in readdir request, which indicates that client interprets
'end/complete' as bit flags. So that mds can reply additional flags in
readdir reply.

Signed-off-by: Yan, Zheng <[email protected]>

ceph: define struct for dir entry in readdir reply

This avoids defining multiple arrays for entries in readdir reply

Signed-off-by: Yan, Zheng <[email protected]>

ceph: simplify 'offset in frag'

don't distinguish leftmost frag from other frags. always use 2 as
first entry's offset.

Signed-off-by: Yan, Zheng <[email protected]>

ceph: remove unnecessary checks in __dcache_readdir

we never add snapdir and the hidden .ceph dir into readdir cache

Signed-off-by: Yan, Zheng <[email protected]>

ceph: search cache postion for dcache readdir

use binary search to find cache index that corresponds to readdir
postion.

Signed-off-by: Yan, Zheng <[email protected]>

ceph: use CEPH_MDS_OP_RMXATTR request to remove xattr

Setxattr with NULL value and XATTR_REPLACE flag should be equivalent
to removexattr. But current MDS does not support deleting vxattrs through
MDS_OP_SETXATTR request. The workaround is sending MDS_OP_RMXATTR request
if setxattr actually removs xattr.

Signed-off-by: Yan, Zheng <[email protected]>

ceph: report mount root in session metadata

Signed-off-by: Yan, Zheng <[email protected]>

ceph: don't show symlink target in debugfs/mdsc

symlink target is useless for debug and can be very long. It's annoying
to show it in debugfs/mdsc.

Signed-off-by: Yan, Zheng <[email protected]>

ceph: don't call truncate_pagecache in ceph_writepages_start

truncate_pagecache() may decrease inode's reference. This can cause
deadlock if inode's last reference is dropped and iput_final() wants
to evict the inode. (evict() calls inode_wait_for_writeback(), which
waits for ceph_writepages_start() to return).

The fix is use work thead to truncate dirty pages. Also add 'forced
umount' check to ceph_update_writeable_page(), which prevents new
pages getting dirty.

Signed-off-by: Yan, Zheng <[email protected]>

ceph: renew caps for read/write if mds session got killed.

When mds session gets killed, read/write operation may hang.
Client waits for Frw caps, but mds does not know what caps client
wants. To recover this, client sends an open request to mds. The
request will tell mds what caps client wants.

Signed-off-by: Yan, Zheng <[email protected]>

ceph: CEPH_FEATURE_MDSENC support

Signed-off-by: Yan, Zheng <[email protected]>

ceph: multiple filesystem support

To access non-default filesystem, we just need to subscribe to
mdsmap.<MDS_NAMESPACE_ID> and add a new mount option for mds
namespace id.

Signed-off-by: Yan, Zheng <[email protected]>
[[email protected]: switch to a new libceph API]
Signed-off-by: Ilya Dryomov <[email protected]>

libceph: support for subscribing to "mdsmap.<id>" maps

Signed-off-by: Ilya Dryomov <[email protected]>

libceph: replace ceph_monc_request_next_osdmap()

... with a wrapper around maybe_request_map() - no need for two
osdmap-specific functions.

Signed-off-by: Ilya Dryomov <[email protected]>

libceph: take osdc->lock in osdmap_show() and dump flags in hex

There is now about a dozen CEPH_OSDMAP_* flags. This is a debugging
interface, so just dump in hex instead of spelling each flag out.

Signed-off-by: Ilya Dryomov <[email protected]>

libceph: pool deletion detection

This adds the "map check" infrastructure for sending osdmap version
checks on CALC_TARGET_POOL_DNE and completing in-flight requests with
-ENOENT if the target pool doesn't exist or has just been deleted.

Signed-off-by: Ilya Dryomov <[email protected]>

libceph: async MON client generic requests

For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION
messages asynchronously and get a callback on completion. Refactor MON
client to allow firing off generic requests asynchronously and add an
async variant of ceph_monc_get_version(). ceph_monc_do_statfs() is
switched over and remains sync.

Signed-off-by: Ilya Dryomov <[email protected]>

libceph: support for checking on status of watch

Implement ceph_osdc_watch_check() to be able to check on status of
watch. Note that the time it takes for a watch/notify event to get
delivered through the notify_wq is taken into account.

Signed-off-by: Ilya Dryomov <[email protected]>

libceph: support for sending notifies

Implement ceph_osdc_notify() for sending notifies.

Due to the fact that the current messenger can't do read-in into
pagelists (it can only do write-out from them), I had to go with a page
vector for a NOTIFY_COMPLETE payload, for now.

Signed-off-by: Ilya Dryomov <[email protected]>

libceph, rbd: ceph_osd_linger_request, watch/notify v2

This adds support and switches rbd to a new, more reliable version of
watch/notify protocol.  As with the OSD client update, this is mostly
about getting the right structures linked into the right places so that
reconnects are properly sent when needed.  watch/notify v2 also
requires sending regular pings to the OSDs - send_linger_ping().

A major change from the old watch/notify implementation is the
introduction of ceph_osd_linger_request - linger requests no longer
piggy back on ceph_osd_request.  ceph_osd_event has been merged into
ceph_osd_linger_request.

All the details are now hidden within libceph, the interface consists
of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack().
ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep
the lifetime management simple.

ceph_osdc_notify_ack() accepts an optional data payload, which is
relayed back to the notifier.

Portions of this patch are loosely based on work by Douglas Fuller
<[email protected]> and Mike Christie <[email protected]>.

Signed-off-by: Ilya Dryomov <[email protected]>

rbd: rbd_dev_header_unwatch_sync() variant

Introduce __rbd_dev_header_unwatch_sync(), which doesn't flush notify
callbacks. This is for the new rados_watcherrcb_t, which would be
called from a notify callback.

Signed-off-by: Ilya Dryomov <[email protected]>

libceph: wait_request_timeout()

The unwatch timeout is currently implemented in rbd. With
watch/unwatch code moving into libceph, we are going to need
a ceph_osdc_wait_request() variant with a timeout.

Signed-off-by: Ilya Dryomov <[email protected]>