Marc Dionne [Thu, 16 Mar 2017 16:27:47 +0000 (16:27 +0000)]
afs: Populate and use client modification time
The inode timestamps should be set from the client time
in the status received from the server, rather than the
server time which is meant for internal server use.
Set AFS_SET_MTIME and populate the mtime for operations
that take an input status, such as file/dir creation
and StoreData. If an input time is not provided the
server will set the vnode times based on the current server
time.
In a situation where the server has some skew with the
client, this could lead to the client seeing a timestamp
in the future for a file that it just created or wrote.
David Howells [Thu, 16 Mar 2017 16:27:47 +0000 (16:27 +0000)]
afs: Better abort and net error handling
If we receive a network error, a remote abort or a protocol error whilst
we're still transmitting data, make sure we return an appropriate error to
the caller rather than ESHUTDOWN or ECONNABORTED.
David Howells [Thu, 16 Mar 2017 16:27:47 +0000 (16:27 +0000)]
afs: Fix the maths in afs_fs_store_data()
afs_fs_store_data() works out the size of the write it's going to make,
but it uses 32-bit unsigned subtraction in one place that gets
automatically cast to loff_t.
However, if to < offset, then the number goes negative, but as the result
isn't signed, this doesn't get sign-extended to 64-bits when placed in a
loff_t.
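As a small userspace illustration of that failure mode (a sketch only, not the kernel code; the variable names simply mirror the description above):
--------------- %< ---------------
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t offset = 4096;	/* start of the region being written */
	uint32_t to = 100;	/* end of the write; here to < offset */
	int64_t size;

	/* Buggy: the subtraction happens in 32-bit unsigned arithmetic,
	 * wraps around, and the wrapped value is then zero-extended when
	 * stored in the 64-bit signed result. */
	size = to - offset;
	printf("32-bit unsigned maths: %lld\n", (long long)size);

	/* Correct: widen to a signed 64-bit type before subtracting so a
	 * negative difference stays negative. */
	size = (int64_t)to - (int64_t)offset;
	printf("64-bit signed maths:   %lld\n", (long long)size);
	return 0;
}
--------------- >% ---------------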
David Howells [Thu, 16 Mar 2017 16:27:46 +0000 (16:27 +0000)]
afs: Use a bvec rather than a kvec in afs_send_pages()
Use a bvec rather than a kvec in afs_send_pages() as we don't then have to
call kmap() in advance. This allows us to pass the array of contiguous
pages that we extracted through to rxrpc in one go rather than passing a
single page at a time.
David Howells [Thu, 16 Mar 2017 16:27:46 +0000 (16:27 +0000)]
afs: Make struct afs_read::remain 64-bit
Make struct afs_read::remain 64-bit so that it can handle huge transfers if
we ever request them or the server decides to give us a bit of extra data (the
other fields there are already 64-bit).
David Howells [Thu, 16 Mar 2017 16:27:46 +0000 (16:27 +0000)]
afs: Fix AFS read bug
Fix a bug in AFS read whereby the request page afs_read::index isn't
incremented after calling ->page_done() if ->remain reaches 0, indicating
that the data read is complete.
Without this, a NULL pointer dereference happens when ->page_done() is called
twice for the last page: the page-clearing loop calls it again, and
afs_readpages_page_done() clears the current entry in the page list.
Tina Ruchandani [Thu, 16 Mar 2017 16:27:46 +0000 (16:27 +0000)]
afs: Prevent callback expiry timer overflow
get_seconds() returns real wall-clock seconds. On 32-bit systems
this value will overflow in year 2038 and beyond. This patch changes
afs_vnode record to use ktime_get_real_seconds() instead, for the
fields cb_expires and cb_expires_at.
Tina Ruchandani [Thu, 16 Mar 2017 16:27:46 +0000 (16:27 +0000)]
afs: Migrate vlocation fields to 64-bit
get_seconds() returns real wall-clock seconds. On 32-bit systems
this value will overflow in year 2038 and beyond. This patch changes
afs's vlocation record to use ktime_get_real_seconds() instead, for the
fields time_of_death and update_at.
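Both of these changes guard against the same rollover; a minimal userspace sketch of what goes wrong with a 32-bit seconds counter (illustrative values only):
--------------- %< ---------------
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* One hour past the 2^31-second mark, which falls on 19 January 2038. */
	int64_t seconds64 = 2147483648LL + 3600;

	/* On common two's-complement systems a 32-bit time value wraps to a
	 * large negative number, i.e. a date back in 1901, so any expiry
	 * comparison against it goes wrong. */
	int32_t seconds32 = (int32_t)seconds64;

	printf("64-bit seconds: %lld\n", (long long)seconds64);
	printf("32-bit seconds: %d\n", seconds32);
	return 0;
}
--------------- >% ---------------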
afs: security: Replace rcu_assign_pointer() with RCU_INIT_POINTER()
The use of "rcu_assign_pointer()" is NULLing out the pointer.
According to RCU_INIT_POINTER()'s block comment:
"1. This use of RCU_INIT_POINTER() is NULLing out the pointer"
it is better to use it instead of rcu_assign_pointer() because it has a
smaller overhead.
The following Coccinelle semantic patch was used:
@@
@@
afs: inode: Replace rcu_assign_pointer() with RCU_INIT_POINTER()
The use of "rcu_assign_pointer()" is NULLing out the pointer.
According to RCU_INIT_POINTER()'s block comment:
"1. This use of RCU_INIT_POINTER() is NULLing out the pointer"
it is better to use it instead of rcu_assign_pointer() because it has a
smaller overhead.
The following Coccinelle semantic patch was used:
@@
@@
David Howells [Thu, 16 Mar 2017 16:27:45 +0000 (16:27 +0000)]
afs: Distinguish mountpoints from symlinks by file mode alone
In AFS, mountpoints appear as symlinks with mode 0644 and normal symlinks
have mode 0777, so use this to distinguish them rather than reading the
content and parsing it. In the case of a mountpoint, the symlink body is a
formatted string indicating the location of the target volume.
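A minimal sketch of that check (illustrative userspace code; the helper name is made up, not the kernel's):
--------------- %< ---------------
#include <stdio.h>
#include <sys/stat.h>

/* AFS presents mountpoints as symlinks with mode 0644, while genuine
 * symlinks carry mode 0777, so the mode alone identifies a mountpoint. */
static int is_afs_mountpoint(mode_t mode)
{
	return S_ISLNK(mode) && (mode & 0777) == 0644;
}

int main(void)
{
	printf("mode 0120644: %s\n", is_afs_mountpoint(0120644) ? "mountpoint" : "symlink");
	printf("mode 0120777: %s\n", is_afs_mountpoint(0120777) ? "mountpoint" : "symlink");
	return 0;
}
--------------- >% ---------------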
Note that with this, kAFS no longer 'pre-fetches' the contents of symlinks,
so afs_readpage() may fail with an access denial: when the VFS calls
d_automount(), it wraps the call in a credentials override that sets the
initial creds, thereby preventing access to the caller's keyrings and the
authentication keys held therein.
A patch reverting that change to the VFS is therefore also required.
David Howells [Thu, 16 Mar 2017 16:27:44 +0000 (16:27 +0000)]
afs: Handle a short write to an AFS page
Handle the situation where afs_write_begin() is told to expect that a
full-page write will be made, but this doesn't happen (EFAULT, CTRL-C,
etc.), and so afs_write_end() sees that a partial write took place. Currently,
no attempt is made to deal with the discrepancy.
David Howells [Thu, 16 Mar 2017 16:27:44 +0000 (16:27 +0000)]
afs: Handle better the server returning excess or short data
When an AFS server is given an FS.FetchData{,64} request to read data from
a file, it is permitted by the protocol to return more or less than was
requested. kafs currently relies on the latter behaviour in readpage{,s}
to handle a partial page at the end of the file (we just ask for a whole
page and clear space beyond the short read).
However, we don't handle all cases. Add:
(1) Handle excess data by discarding it rather than aborting. Note that
we use a common static buffer to discard into so that the decryption
algorithm advances the PCBC state.
(2) Handle a short read that affects more than just the last page.
Note that if a read comes up unexpectedly short or long, it's possible that
the server's copy of the file changed - in which case the data version
number will have been incremented and the callback will have been broken -
in which case all the pages currently attached to the inode will be zapped
anyway at some point.
Marc Dionne [Thu, 16 Mar 2017 16:27:44 +0000 (16:27 +0000)]
afs: Deal with an empty callback array
Servers may send a callback array that is the same size as
the FID array, or an empty array. If the callback count is
0, the code would attempt to read (fid_count * 12) bytes of
data, which would fail and result in an unmarshalling error.
This would lead to stale data for remotely modified files
or directories.
Store the callback array size in the internal afs_call
structure and use that to determine the amount of data to
read.
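A sketch of the sizing rule being described (hypothetical helper, not the kernel's unmarshalling code): the number of bytes to read must be derived from the callback count in the reply, which may be zero, not from the FID count.
--------------- %< ---------------
#include <stdio.h>
#include <stddef.h>

#define AFS_CALLBACK_WIRE_SIZE 12	/* 3 x 32-bit fields per callback */

/* Bytes of callback data that follow the FID array in the reply. */
static size_t callback_bytes_to_read(unsigned int cb_count)
{
	return (size_t)cb_count * AFS_CALLBACK_WIRE_SIZE;
}

int main(void)
{
	unsigned int fid_count = 50, cb_count = 0;	/* empty callback array */

	printf("old, fid-based size:        %zu\n",
	       (size_t)fid_count * AFS_CALLBACK_WIRE_SIZE);
	printf("fixed, callback-based size: %zu\n",
	       callback_bytes_to_read(cb_count));
	return 0;
}
--------------- >% ---------------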
David Howells [Thu, 16 Mar 2017 16:27:43 +0000 (16:27 +0000)]
afs: Fix page overput in afs_fill_page()
afs_fill_page() loads the page it wants to fill into the afs_read request
without incrementing its refcount - but then calls afs_put_read() to clean
up afterwards, which then releases a ref on the page.
Fix this by getting a ref on the page before calling
afs_vnode_fetch_data().
Without this fix, a sync after a write hangs in afs_writepages_region() because
find_get_pages_tag() gets confused and doesn't return.
Fixes: 196ee9cd2d04 ("afs: Make afs_fs_fetch_data() take a list of pages")
Reported-by: Marc Dionne <[email protected]>
Signed-off-by: David Howells <[email protected]>
Tested-by: Marc Dionne <[email protected]>
Romain Izard [Thu, 9 Mar 2017 15:18:20 +0000 (16:18 +0100)]
mmc: sdhci-of-at91: Support external regulators
The SDHCI controller in the SAMA5D2 chip requires a valid voltage set
in the power control register, otherwise commands will fail with a
timeout error.
When using the regulator framework to specify the regulator used by the
mmc device, the voltage is not configured, and it is not possible to use
the connected device.
Implement a custom 'set_power' function for this specific hardware, that
configures the voltage in the register in all cases.
Peter Zijlstra [Thu, 16 Mar 2017 12:47:51 +0000 (13:47 +0100)]
perf/core: Better explain the inherit magic
While going through the event inheritance code Oleg got confused.
Add some comments to better explain the silent disappearance of
orphaned events.
So what happens is that at perf_event_release_kernel() time, when an
event loses its connection to userspace (and ceases to exist from the
user's perspective) we can still have an arbitrary amount of inherited
copies of the event. We want to synchronously find and remove all
these child events.
Since that requires a bit of lock juggling, there is the possibility
that concurrent clone()s will create new child events. Therefore we
first mark the parent event as DEAD, which marks all the extant child
events as orphaned.
We then avoid copying orphaned events; in order to avoid getting more
of them.
Peter Zijlstra [Thu, 16 Mar 2017 12:47:49 +0000 (13:47 +0100)]
perf/core: Fix event inheritance on fork()
While hunting for clues to a use-after-free, Oleg spotted that
perf_event_init_context() can lose an error value with the result
that fork() can succeed even though we did not fully inherit the perf
event context.
This patch closes the hole by making perf_event_free_task() destroy the
task <-> ctx relation such that perf_event_release_kernel() will no longer
observe the now dead task.
Tony Luck [Wed, 8 Mar 2017 17:45:39 +0000 (01:45 +0800)]
EDAC, pnd2_edac: Add new EDAC driver for Intel SoC platforms
Initial targets for this driver are the Intel Apollo Lake platform and
the Denverton micro-server; they use the same internal memory controller IP
called Pondicherry2.
Memory controller registers are not in PCI config space like earlier
Intel memory controllers. For the Apollo Lake platform they are accessed via
a "side-band" interface; for the Denverton micro-server they are accessed via
PCI config space and memory-mapped I/O. This driver is for both Apollo Lake and
Denverton, but only Denverton is fully enabled while we wait for the
sideband driver.
Apollo Lake driver and initial cut at the Denverton driver by Tony Luck.
Extensive cleanup, refactoring and basic verification by Qiuxu Zhuo.
Prarit Bhargava [Tue, 14 Mar 2017 11:36:02 +0000 (07:36 -0400)]
hwrng: geode - Revert managed API changes
After commit e9afc746299d ("hwrng: geode - Use linux/io.h instead of
asm/io.h") the geode-rng driver uses devres with pci_dev->dev to keep
track of resources, but does not actually register a PCI driver. This
results in the following issues:
1. The driver leaks memory because the driver does not attach to a
device. The driver only uses the PCI device as a reference. devm_*()
functions will release resources on driver detach, which the geode-rng
driver will never do.
2. As a result, the driver cannot be reloaded because there is always a use of the
ioport and region after the first load of the driver.
Revert the changes made by e9afc746299d ("hwrng: geode - Use linux/io.h
instead of asm/io.h").
Prarit Bhargava [Tue, 14 Mar 2017 11:36:01 +0000 (07:36 -0400)]
hwrng: amd - Revert managed API changes
After commit 31b2a73c9c5f ("hwrng: amd - Migrate to managed API"), the
amd-rng driver uses devres with pci_dev->dev to keep track of resources,
but does not actually register a PCI driver. This results in the
following issues:
1. The message
WARNING: CPU: 2 PID: 621 at drivers/base/dd.c:349 driver_probe_device+0x38c
is output when the i2c_amd756 driver loads and attempts to register a PCI
driver. The PCI & device subsystems assume that no resources have been
registered for the device, and the WARN_ON() triggers since amd-rng has
already done so.
2. The driver leaks memory because the driver does not attach to a
device. The driver only uses the PCI device as a reference. devm_*()
functions will release resources on driver detach, which the amd-rng
driver will never do.
3. As a result, the driver cannot be reloaded because there is always a use of the
ioport and region after the first load of the driver.
Revert the changes made by 31b2a73c9c5f ("hwrng: amd - Migrate to managed
API").
Gary R Hook [Fri, 10 Mar 2017 18:28:18 +0000 (12:28 -0600)]
crypto: ccp - Assign DMA commands to the channel's CCP
The CCP driver generally uses a round-robin approach when
assigning operations to available CCPs. For the DMA engine,
however, the DMA mappings of the SGs are associated with a
specific CCP. When an IOMMU is enabled, the IOMMU is
programmed based on this specific device.
If the DMA operations are not performed by that specific
CCP then addressing errors and I/O page faults will occur.
Update the CCP driver to allow a specific CCP device to be
requested for an operation and use this in the DMA engine
support.
Johannes Berg [Wed, 15 Mar 2017 13:26:04 +0000 (14:26 +0100)]
nl80211: fix dumpit error path RTNL deadlocks
Sowmini pointed out Dmitry's RTNL deadlock report to me, and it turns out
to be perfectly accurate - there are various error paths that miss unlock
of the RTNL.
To fix those, change the locking a bit to not be conditional in all those
nl80211_prepare_*_dump() functions, but make those require the RTNL to
start with, and fix the buggy error paths. This also let me use sparse
(by appropriately overriding the rtnl_lock/rtnl_unlock functions) to
validate the changes.
sched/deadline: Use deadline instead of period when calculating overflow
I was testing Daniel's changes with his test case, and tweaked it a
little. Instead of having the runtime equal to the deadline, I
increased the deadline tenfold.
The results were rather surprising. The behavior that Daniel's patch
was fixing came back. The task started using much more than .1% of the
CPU. More like 20%.
Looking into this I found that it was due to the dl_entity_overflow()
constantly returning true. That's because it uses the relative period
against relative runtime vs the absolute deadline against absolute
runtime.
runtime / (deadline - t) > dl_runtime / dl_period
There's even a comment mentioning this, and saying that when relative
deadline equals relative period, that the equation is the same as using
deadline instead of period. That comment is backwards! What we really
want is:
runtime / (deadline - t) > dl_runtime / dl_deadline
We care about whether the runtime can make its deadline, not its period. And
then we can say "when the deadline equals the period, the equation is
the same as using dl_period instead of dl_deadline".
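A compact userspace sketch of the corrected test (the kernel's dl_entity_overflow() shifts its 64-bit operands before multiplying to avoid overflow; the reservation numbers below are illustrative, loosely following the tweaked test case described above):
--------------- %< ---------------
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/*
 * Written as a cross-multiplication so no division is needed:
 *     runtime / (deadline - t) > dl_runtime / dl_x
 * becomes
 *     runtime * dl_x > (deadline - t) * dl_runtime
 * Passing dl_period as dl_x is the old, buggy behaviour; passing
 * dl_deadline is the fix.
 */
static bool dl_entity_overflow(uint64_t runtime, uint64_t until_deadline,
			       uint64_t dl_runtime, uint64_t dl_x)
{
	return runtime * dl_x > until_deadline * dl_runtime;
}

int main(void)
{
	/* Illustrative reservation: 2 ms runtime, 20 ms relative deadline,
	 * 2000 ms period.  The task wakes with 1 ms of runtime left and
	 * 15 ms until its absolute deadline, so it can comfortably make
	 * the deadline and should not be replenished. */
	uint64_t runtime = 1, until_deadline = 15;
	uint64_t dl_runtime = 2, dl_deadline = 20, dl_period = 2000;

	printf("using dl_deadline (fixed): %d\n",
	       dl_entity_overflow(runtime, until_deadline, dl_runtime, dl_deadline));
	printf("using dl_period   (buggy): %d\n",
	       dl_entity_overflow(runtime, until_deadline, dl_runtime, dl_period));
	return 0;
}
--------------- >% ---------------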
After correcting this, now when the task gets enqueued, it can throttle
correctly, and Daniel's fix to the throttling of sleeping deadline
tasks works even when the runtime and deadline are not the same.
sched/deadline: Throttle a constrained deadline task activated after the deadline
During the activation, CBS checks if it can reuse the current task's
runtime and period. If the deadline of the task is in the past, CBS
cannot use the runtime, and so it replenishes the task. This rule
works fine for implicit deadline tasks (deadline == period), and the
CBS was designed for implicit deadline tasks. However, a task with
constrained deadline (deadline < period) might be awakened after the
deadline, but before the next period. In this case, replenishing the
task would allow it to run for runtime / deadline. As in this case
deadline < period, CBS enables a task to run for more than the
runtime / period. In a very loaded system, this can cause a domino
effect, making other tasks miss their deadlines.
To avoid this problem, in the activation of a constrained deadline
task after the deadline but before the next period, throttle the
task and set the replenishing timer to the begin of the next period,
unless it is boosted.
Reproducer:
--------------- %< ---------------
int main (int argc, char **argv)
{
	int ret;
	int flags = 0;
	unsigned long l = 0;
	struct timespec ts;
	struct sched_attr attr;

	/* Set up a SCHED_DEADLINE reservation.  The 2/2000 values are
	 * assumed from the description below; sched_setattr() is taken
	 * to be a syscall wrapper available to the test program. */
	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = SCHED_DEADLINE;
	attr.sched_runtime = 2 * 1000 * 1000;		/* 2 ms */
	attr.sched_deadline = 2 * 1000 * 1000;		/* 2 ms (constrained: < period) */
	attr.sched_period = 2000 * 1000 * 1000;		/* 2000 ms */

	/* Sleep length (assumed): short enough to wake before the next period. */
	ts.tv_sec = 0;
	ts.tv_nsec = 100 * 1000;

	ret = sched_setattr(0, &attr, flags);
	if (ret < 0) {
		perror("sched_setattr");
		exit(-1);
	}

	for(;;) {
		/* XXX: you may need to adjust the loop */
		for (l = 0; l < 150000; l++);
		/*
		 * The idea is to go to sleep right before the deadline
		 * and then wake up before the next period to receive
		 * a new replenishment.
		 */
		nanosleep(&ts, NULL);
	}

	exit(0);
}
--------------- >% ---------------
On my box, this reproducer uses almost 50% of the CPU time, which is
obviously wrong for a task with a 2/2000 reservation.
sched/deadline: Make sure the replenishment timer fires in the next period
Currently, the replenishment timer is set to fire at the deadline
of a task. Although that works for implicit deadline tasks because the
deadline is equal to the beginning of the next period, that is not correct
for constrained deadline tasks (deadline < period).
For instance:
f.c:
--------------- %< ---------------
int main (void)
{
for(;;);
}
--------------- >% ---------------
The task is throttled after running 492318 us, as expected:
f-11295 [003] 13322.606094: sched_switch: f:11295 [-1] R ==> watchdog/3:32 [0]
But then, the task is replenished 500719 us after the first
replenishment:
<idle>-0 [003] 13322.614495: sched_switch: swapper/3:0 [120] R ==> f:11295 [-1]
Running for 490277 us:
f-11295 [003] 13323.104772: sched_switch: f:11295 [-1] R ==> swapper/3:0 [120]
Hence, in the first period, the task runs 2 * runtime, and that is a bug.
During the first replenishment, the next deadline is set one period away.
So the runtime / period starts to be respected. However, as the second
replenishment took place in the wrong instant, the next replenishment
will also be held in a wrong instant of time. Rather than occurring in
the nth period away from the first activation, it is taking place
in the (nth period - relative deadline).
Sudip Mukherjee [Mon, 6 Mar 2017 23:23:43 +0000 (23:23 +0000)]
ppdev: fix registering same device name
Usually every parallel port will have a single pardev registered with
it. But the ppdev driver is an exception. This userspace parallel port
driver allows the creation of multiple parallel port devices for a single
parallel port. As a result we were getting a big warning like:
"sysfs: cannot create duplicate filename '/devices/parport0/ppdev0.0'".
And with that, many parallel port printers stopped working.
We had been using the minor number as the id field when registering
a parallel port device with a parallel port. But when there are
multiple parallel port devices for one single parallel port, they all
try to register with the same name, such as 'pardev0.0', and everything
starts failing.
Use an incremented index as the id instead of the minor number.
Sudip Mukherjee [Mon, 6 Mar 2017 23:23:42 +0000 (23:23 +0000)]
parport: fix attempt to write duplicate procfiles
Usually every parallel port will have a single pardev registered with
it. But the ppdev driver is an exception. This userspace parallel port
driver allows the creation of multiple parallel port devices for a single
parallel port. As a result we were getting a nice warning like:
"sysctl table check failed:
/dev/parport/parport0/devices/ppdev0/timeslice Sysctl already exists"
Use the same logic as used in parport_register_device() and register
the proc files only once for each parallel port.
Niklas Cassel [Sat, 25 Feb 2017 00:17:53 +0000 (01:17 +0100)]
locking/rwsem: Fix down_write_killable() for CONFIG_RWSEM_GENERIC_SPINLOCK=y
We hang if SIGKILL has been sent, but the task is stuck in down_read()
(after do_exit()), even though no task is doing down_write() on the
rwsem in question:
INFO: task libupnp:21868 blocked for more than 120 seconds.
libupnp D 0 21868 1 0x08100008
...
Call Trace:
__schedule()
schedule()
__down_read()
do_exit()
do_group_exit()
__wake_up_parent()
This bug has already been fixed for CONFIG_RWSEM_XCHGADD_ALGORITHM=y in
the following commit:
Matt Fleming [Fri, 17 Feb 2017 12:07:31 +0000 (12:07 +0000)]
sched/loadavg: Use {READ,WRITE}_ONCE() for sample window
'calc_load_update' is accessed without any kind of locking and there's
a clear assumption in the code that only a single value is read or
written.
Make this explicit by using READ_ONCE() and WRITE_ONCE(), and avoid
unintentionally seeing multiple values, or having the load/stores
split.
Technically the loads in calc_global_*() don't require this since
those are the only functions that update 'calc_load_update', but I've
added the READ_ONCE() for consistency.
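A minimal sketch of the idiom, using userspace stand-ins for the two macros (the kernel's versions are more elaborate; the variable name follows the text above):
--------------- %< ---------------
#include <stdio.h>
#include <stdint.h>

/* Simplified stand-ins: force exactly one load or one store of the
 * variable, so the compiler can neither tear the access nor re-read it. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

static uint64_t calc_load_update;	/* shared sample-window deadline */

static void advance_window(uint64_t now, uint64_t load_freq)
{
	/* Publish the next sample window with a single store. */
	WRITE_ONCE(calc_load_update, now + load_freq);
}

static int inside_window(uint64_t now)
{
	/* Take one consistent snapshot instead of re-reading the variable
	 * and possibly acting on two different values. */
	uint64_t window = READ_ONCE(calc_load_update);

	return now < window;
}

int main(void)
{
	advance_window(1000, 1250);	/* e.g. ~5 s worth of jiffies at HZ=250 */
	printf("at 1200: inside window = %d\n", inside_window(1200));
	printf("at 3000: inside window = %d\n", inside_window(3000));
	return 0;
}
--------------- >% ---------------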
Matt Fleming [Fri, 17 Feb 2017 12:07:30 +0000 (12:07 +0000)]
sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
the pending sample window time on exit, setting the next update not
one window into the future, but two.
This has the effect of delaying load average updates for potentially
up to ~9 seconds.
This can result in huge spikes in the load average values due to
per-cpu uninterruptible task counts being out of sync when accumulated
across all CPUs.
It's safe to update the per-cpu active count if we wake between sample
windows because any load that we left in 'calc_load_idle' will have
been zero'd when the idle load was folded in calc_global_load().
This issue is easy to reproduce before
commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
just by forking short-lived process pipelines built from ps(1) and
grep(1) in a loop. I'm unable to reproduce the spikes after that
commit, but the bug still seems to be present from code review.
The DL task will be migrated to a suitable later deadline rq once the DL
timer fires and the current rq is offline. The rq clock of the new rq should
be updated. This patch fixes it by updating the rq clock after holding
the new rq's rq lock.
Brian Norris [Sat, 11 Mar 2017 01:39:23 +0000 (17:39 -0800)]
mwifiex: uninit wakeup info when removing device
We manually init wakeup info, but we don't detach it on device removal.
This means that if we (for example) rmmod + modprobe the driver, the
device framework might return -EEXIST the second time, and we'll
complain in the logs:
[ 839.311881] mwifiex_pcie 0000:01:00.0: fail to init wakeup for mwifiex
AFAICT, there's no other negative effect.
But we can fix this by disabling wakeup on remove, similar to what a few
other drivers do (e.g., the power supply framework).
This code (and bug) has existed on SDIO for a while, but it got moved
around and enabled for PCIe with commit 853402a00823 ("mwifiex: Enable
WoWLAN for both sdio and pcie").
Brian Norris [Sat, 11 Mar 2017 01:39:22 +0000 (17:39 -0800)]
mwifiex: set adapter->dev before starting to use mwifiex_dbg()
The mwifiex_dbg() log handler utilizes the struct device in
adapter->dev. Without it, it decides not to print anything.
As of commit 2e02b5814217 ("mwifiex: Allow mwifiex early access to device
structure"), we started assigning that pointer only after we finished
mwifiex_register() -- this effectively neuters any mwifiex_dbg() logging
done before this point.
Let's move the device assignment into mwifiex_register().
Fixes: 2e02b5814217 ("mwifiex: Allow mwifiex early access to device structure")
Cc: Rajat Jain <[email protected]>
Signed-off-by: Brian Norris <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Brian Norris [Sat, 11 Mar 2017 01:39:21 +0000 (17:39 -0800)]
mwifiex: pcie: don't leak DMA buffers when removing
When PCIe FLR support was added, much of the remove/release code for
PCIe was migrated to ->down_dev(), but ->down_dev() is never called for
device removal. Let's refactor the cleanup to be done in both cases.
Also, drop the comments above mwifiex_cleanup_pcie(), because they were
clearly wrong, and it's better to have clear and obvious code than to
detail the code steps in comments anyway.
Ville Syrjälä [Wed, 15 Mar 2017 14:31:58 +0000 (16:31 +0200)]
drm/i915: Do .init_clock_gating() earlier to avoid it clobbering watermarks
Currently ILK-BDW explicitly disable LP1+ watermarks from their
.init_clock_gating() hooks. Unfortunately that hook gets called way too
late since by that time we've already initialized all the watermark
state tracking which then gets out of sync with the hardware state.
We may eventually want to consider killing off the explicit LP1+
disable from .init_clock_gating(). In the meantime however, we can
avoid the problem by reordering the init sequence such that
intel_modeset_init_hw()->intel_init_clock_gating() gets called
prior to the hardware state takeover.
I suppose prior to the two stage watermark programming we were
magically saved by something that forced the watermarks to be
reprogrammed fully after .init_clock_gating() got called. But
now that no longer happens.
Note that the diff might look a bit odd as it kills off one
call of intel_update_cdclk(), but that's fine because
intel_modeset_init_hw() does the exact same thing. Previously
we just did it twice.
Actually even this new init sequence is pretty bogus as
.init_clock_gating() really should be called before any gem
hardware init since it can configure various clock gating
workarounds and whatnot that affect the GT side as well. Also
intel_modeset_init() really should get split up into better
defined init stages. Another "fun" detail is that
intel_modeset_gem_init() is where RPS/RC6 gets configured.
Why that is done from the display code is beyond me. I've
decided to leave all this be for now, and just try to fix
the init sequence enough for watermarks to work.
Sara Sharon [Tue, 14 Mar 2017 07:50:35 +0000 (09:50 +0200)]
iwlwifi: mvm: cleanup pending frames in DQA mode
When a station is asleep, the fw will set it as "asleep".
All queues that are used only by one station will be stopped by
the fw.
In pre-DQA mode this was relevant for aggregation queues. However,
in DQA mode a queue is owned by one station only, so all queues
will be stopped.
As a result, we don't expect to get filtered frames back to
mac80211 and don't have to maintain the entire pending_frames
state logic, the same way as we do in aggregations.
The correct behavior is to align DQA behavior with the aggregation
queue behaviour pre-DQA:
- Don't count pending frames.
- Let mac80211 know we have frames in these queues so that it can
properly handle trigger frames.
When a trigger frame is received, mac80211 tells the driver to send
frames from the queues using release_buffered_frames.
The driver will tell the fw to let frames out even if the station
is asleep. This is done by iwl_mvm_sta_modify_sleep_tx_count.
K. Y. Srinivasan [Mon, 13 Mar 2017 03:00:30 +0000 (20:00 -0700)]
Drivers: hv: vmbus: Don't leak memory when a channel is rescinded
When we close a channel that has been rescinded, we will leak memory since
vmbus_teardown_gpadl() returns an error. Fix this so that we can properly
cleanup the memory allocated to the ring buffers.
Fixes: ccb61f8a99e6 ("Drivers: hv: vmbus: Fix a rescind handling bug")
Signed-off-by: K. Y. Srinivasan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Drivers: hv: util: move waiting for release to hv_utils_transport itself
Waiting for release_event in all three drivers introduced issues on release
as on_reset() hook is not always called. E.g. if the device was never
opened we will never get the completion.
Move the waiting code to hvutil_transport_destroy() and make sure it is
only called when the device is open. hvt->lock serialization should
guarantee the absence of races.
Fixes: 5a66fecbf6aa ("Drivers: hv: util: kvp: Fix a rescind processing issue")
Fixes: 20951c7535b5 ("Drivers: hv: util: Fcopy: Fix a rescind processing issue")
Fixes: d77044d142e9 ("Drivers: hv: util: Backup: Fix a rescind processing issue")
Reported-by: Dexuan Cui <[email protected]>
Tested-by: Dexuan Cui <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: K. Y. Srinivasan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Dexuan Cui [Sun, 5 Mar 2017 01:13:58 +0000 (18:13 -0700)]
vmbus: remove hv_event_tasklet_disable/enable
With the recent introduction of per-channel tasklet, we need to update
the way we handle the 3 concurrency issues:
1. hv_process_channel_removal -> percpu_channel_deq vs.
vmbus_chan_sched -> list_for_each_entry(..., percpu_list);
2. vmbus_process_offer -> percpu_channel_enq/deq vs. vmbus_chan_sched.
3. vmbus_close_internal vs. the per-channel tasklet vmbus_on_event;
The first 2 issues can be handled by Stephen's recent patch
"vmbus: use rcu for per-cpu channel list", and the third issue
can be handled by calling tasklet_disable in vmbus_close_internal here.
We don't need the original hv_event_tasklet_disable/enable since we
now use per-channel tasklet instead of the previous per-CPU tasklet,
and actually we must remove them due to the side effect now:
vmbus_process_offer -> hv_event_tasklet_enable -> tasklet_schedule will
start the per-channel callback prematurely, causing a NULL dereference
(the channel may not have been properly configured to run the callback yet).
The per-cpu channel list is now referred to in the interrupt
routine. This is mostly safe since the host will not normally generate
an interrupt when a channel is being deleted, but if it did then there
would be a use-after-free problem.
To solve this, use RCU protection on the per-cpu list.
Fixes: 631e63a9f346 ("vmbus: change to per channel tasklet")
Signed-off-by: Stephen Hemminger <[email protected]>
Signed-off-by: K. Y. Srinivasan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
The driver still struggles with firmware that does not reply to the OS
version request. It is safe not to wait for the reply. First, the driver
doesn't do anything with the reply; second, the connection is closed
immediately, hence the packet will just be safely discarded in case it
is received; and last, the driver won't get stuck if the firmware never
replies.
Tomas Winkler [Sun, 5 Mar 2017 19:40:41 +0000 (21:40 +0200)]
mei: fix deadlock on mei reset
This patch fixes the earlier change 'mei: synchronize irq before initiating
a reset', which had introduced a deadlock between the irq thread and
mei_reset(), as they are both holding the same device lock.
Linus Torvalds [Wed, 15 Mar 2017 23:54:58 +0000 (16:54 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Four small fixes for this cycle:
- followup fix from Neil for a fix that went in before -rc2, ensuring
that we always see the full per-task bio_list.
- fix for blk-mq-sched from me that ensures that we retain similar
direct-to-issue behavior on running the queue.
- fix from Sagi fixing a potential NULL pointer dereference in blk-mq
on spurious CPU unplug.
- a memory leak fix in writeback from Tahsin, fixing a case where
device removal of a mounted device can leak a struct
wb_writeback_work"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq-sched: don't run the queue async from blk_mq_try_issue_directly()
writeback: fix memory leak in wb_queue_work()
blk-mq: Fix tagset reinit in the presence of cpu hot-unplug
blk: Ensure users for current->bio_list can see the full list.
Eric Dumazet [Wed, 15 Mar 2017 20:21:28 +0000 (13:21 -0700)]
net: properly release sk_frag.page
I mistakenly added the code to release sk->sk_frag in
sk_common_release() instead of sk_destruct()
TCP sockets using sk->sk_allocation == GFP_ATOMIC do not call
sk_common_release() at close time, thus leaking one (order-3) page.
iSCSI is using such sockets.
Fixes: 5640f7685831 ("net: use a per task frag allocator")
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Lendacky, Thomas [Wed, 15 Mar 2017 20:11:23 +0000 (15:11 -0500)]
amd-xgbe: Fix jumbo MTU processing on newer hardware
Newer hardware does not provide a cumulative payload length when multiple
descriptors are needed to handle the data. Once the MTU increases beyond
the size that can be handled by a single descriptor, the SKB does not get
built properly by the driver.
The driver will now calculate the size of the data buffers used by the
hardware. The first buffer of the first descriptor is for packet headers
or packet headers and data when the headers can't be split. Subsequent
descriptors in a multi-descriptor chain will not use the first buffer. The
second buffer is used by all the descriptors in the chain for payload data.
Based on whether the driver is processing the first, intermediate, or last
descriptor it can calculate the buffer usage and build the SKB properly.
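A much simplified model of that accounting (illustrative only; the buffer sizes and helper are assumptions, not the driver's actual values):
--------------- %< ---------------
#include <stdio.h>

#define HDR_BUF_SIZE	256	/* assumed first-descriptor header buffer */
#define DATA_BUF_SIZE	2048	/* assumed per-descriptor payload buffer */

/* Capacity contributed by one descriptor: only the first descriptor in a
 * chain has the header buffer available in addition to its data buffer. */
static unsigned int desc_capacity(int first)
{
	return (first ? HDR_BUF_SIZE : 0) + DATA_BUF_SIZE;
}

int main(void)
{
	unsigned int remaining = 9000;	/* a jumbo frame */
	int desc = 0;

	/* Attribute to each descriptor only what its own buffers can hold,
	 * instead of relying on a cumulative length the newer hardware no
	 * longer reports. */
	while (remaining) {
		unsigned int chunk = desc_capacity(desc == 0);

		if (chunk > remaining)
			chunk = remaining;
		printf("descriptor %d carries %u bytes\n", desc, chunk);
		remaining -= chunk;
		desc++;
	}
	return 0;
}
--------------- >% ---------------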
The following patchset contains Netfilter fixes for your net tree, a
rather large batch of fixes targeted to nf_tables, conntrack and bridge
netfilter. More specifically, they are:
1) Don't track fragmented packets if the socket option IP_NODEFRAG is set.
From Florian Westphal.
2) SCTP protocol tracker assumes that ICMP error messages contain the
checksum field, what results in packet drops. From Ying Xue.
3) Fix inconsistent handling of AH traffic from nf_tables.
4) Fix new bitmap set representation with big endian. Fix mismatches in
nf_tables due to incorrect big endian handling too. Both patches
from Liping Zhang.
5) Bridge netfilter doesn't honor maximum fragment size field, cap to
largest fragment seen. From Florian Westphal.
6) Fake conntrack entry needs to be aligned to 8 bytes since the 3 LSB
bits are now used to store the ctinfo. From Steven Rostedt.
7) Fix element comments with the bitmap set type. Revert the flush
field in the nft_set_iter structure, not required anymore after
fixing up element comments.
8) Missing error on invalid conntrack direction from nft_ct, also from
Liping Zhang.
====================
parisc: Optimize flush_kernel_vmap_range and invalidate_kernel_vmap_range
The previously submitted patch did not resolve the random segmentation
faults observed on the phantom buildd system. There are still
unresolved problems with the Debian 4.8 and 4.9 kernels on C8000.
The attached patch removes the flush of the offset map pages and does a
whole data cache flush for large ranges. No other arch flushes the
offset map in these routines as far as I can tell.
I have not observed any random segmentation faults on rp3440 in two
weeks of testing with 4.10.0 and 4.10.1.
Mikulas Patocka [Tue, 14 Mar 2017 15:47:29 +0000 (11:47 -0400)]
parisc: support R_PARISC_SECREL32 relocation in modules
The parisc kernel doesn't work with CONFIG_MODVERSIONS since commit
71810db27c1c853b335675bee335d893bc3d324b. It fails to load modules with the
error: "module unix: Unknown relocation: 41".
The commit changes __kcrctab from 64-bit values to 32-bit values. The
assembler generates R_PARISC_SECREL32 secrel relocation for them and the
module loader doesn't support this relocation.
This patch adds the R_PARISC_SECREL32 relocation to the module loader.
The tiadc_irq_h(int irq, void *private) function is handling FIFO
overruns by clearing flags, disabling and enabling the ADC to
recover.
If the ADC is running in continuous mode, a FIFO overrun happens
regularly. If the disabling of the ADC happens concurrently with
a new conversion, it might happen that the enabling of the ADC
is ignored by the hardware. This stops the ADC permanently; no
more interrupts are triggered.
According to the AM335x Reference Manual (SPRUH73H October 2011 -
Revised April 2013 - Chapter 12.4 and 12.5) it is necessary to
check the ADC FSM bits in REG_ADCFSM before enabling the ADC
again, because the disabling of the ADC is done right after the
current conversion has finished.
To trigger this bug it is necessary to run the ADC in continuous
mode. The ADC values of all channels need to be read in an endless
loop. The bug appears within the first 6 hours (~5.4 million
handled FIFO overruns). The user space application will hang on
reading new values from the character device.
Taku Izumi [Wed, 15 Mar 2017 04:47:50 +0000 (13:47 +0900)]
fjes: Fix wrong netdevice feature flags
This patch fixes netdev->features for Extended Socket network device.
Currently the Extended Socket network device's netdev->features claims
NETIF_F_HW_CSUM, however this is completely wrong. There is no
checksum offloading feature.
That causes invalid TCP/UDP checksum and packet rejection when IP
forwarding from Extended Socket network device to other network device.
Eric Biggers [Wed, 15 Mar 2017 19:08:48 +0000 (15:08 -0400)]
jbd2: don't leak memory if setting up journal fails
In journal_init_common(), if we failed to allocate the j_wbuf array, or
if we failed to create the buffer_head for the journal superblock, we
leaked the memory allocated for the revocation tables. Fix this.
Eric Biggers [Wed, 15 Mar 2017 18:52:02 +0000 (14:52 -0400)]
ext4: mark inode dirty after converting inline directory
If ext4_convert_inline_data() was called on a directory with inline
data, the filesystem was left in an inconsistent state (as considered by
e2fsck) because the file size was not increased to cover the new block.
This happened because the inode was not marked dirty after i_disksize
was updated. Fix this by marking the inode dirty at the end of
ext4_finish_convert_inline_dir().
This bug was probably not noticed before because most users mark the
inode dirty afterwards for other reasons. But if userspace executed
FS_IOC_SET_ENCRYPTION_POLICY with invalid parameters, as exercised by
'kvm-xfstests -c adv generic/396', then the inode was never marked dirty
after updating i_disksize.
Eric Biggers [Wed, 22 Feb 2017 21:25:14 +0000 (13:25 -0800)]
fscrypt: eliminate ->prepare_context() operation
The only use of the ->prepare_context() fscrypt operation was to allow
ext4 to evict inline data from the inode before ->set_context().
However, there is no reason why this cannot be done as simply the first
step in ->set_context(), and in fact it makes more sense to do it that
way because then the policy modes and flags get validated before any
real work is done. Therefore, merge ext4_prepare_context() into
ext4_set_context(), and remove ->prepare_context().
Linus Torvalds [Wed, 15 Mar 2017 17:44:19 +0000 (10:44 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"This is a rather large set of fixes. The bulk are for lpfc correcting
a lot of issues in the new NVME driver code which just went in in the
merge window.
The others are:
- fix a hang in the vmware paravirt driver caused by incorrect
handling of the new MSI vector allocation
- long standing bug in storvsc, which recent block changes turned
from being a harmless annoyance into a hang
- yet more fallout (in mpt3sas) from the changes to device blocking
The remainder are small fixes and updates"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (34 commits)
scsi: lpfc: Add shutdown method for kexec
scsi: storvsc: Workaround for virtual DVD SCSI version
scsi: lpfc: revise version number to 11.2.0.10
scsi: lpfc: code cleanups in NVME initiator discovery
scsi: lpfc: code cleanups in NVME initiator base
scsi: lpfc: correct rdp diag portnames
scsi: lpfc: remove dead sli3 nvme code
scsi: lpfc: correct double print
scsi: lpfc: Rename LPFC_MAX_EQ_DELAY to LPFC_MAX_EQ_DELAY_EQID_CNT
scsi: lpfc: Rework lpfc Kconfig for NVME options
scsi: lpfc: add transport eh_timed_out reference
scsi: lpfc: Fix eh_deadline setting for sli3 adapters.
scsi: lpfc: add NVME exchange aborts
scsi: lpfc: Fix nvme allocation bug on failed nvme_fc_register_localport
scsi: lpfc: Fix IO submission if WQ is full
scsi: lpfc: Fix NVME CMD IU byte swapped word 1 problem
scsi: lpfc: Fix RCTL value on NVME LS request and response
scsi: lpfc: Fix crash during Hardware error recovery on SLI3 adapters
scsi: lpfc: fix missing spin_unlock on sql_list_lock
scsi: lpfc: don't dereference dma_buf->iocbq before null check
...
James Smart [Wed, 8 Mar 2017 22:36:01 +0000 (14:36 -0800)]
scsi: lpfc: Finalize Kconfig options for nvme
Reviewing the result of what was just added for Kconfig, we made a poor
choice. It worked well for full kernel builds, but not so much for how
it would be deployed on a distro.
Here's the final result:
- lpfc will compile in NVME initiator and/or NVME target support based
on whether the kernel has the corresponding subsystem support.
Kconfig is not used to drive this specifically for lpfc.
- There is a module parameter, lpfc_enable_fc4_type, that indicates
whether the ports will do FCP-only or FCP & NVME (NVME-only not yet
possible due to dependency on fc transport). As FCP & NVME divvys up
exchange resources, and given NVME will not often be used initially, the
default is changed to FCP only.
Eric Biggers [Tue, 21 Feb 2017 23:07:11 +0000 (15:07 -0800)]
fscrypt: remove broken support for detecting keyring key revocation
Filesystem encryption ostensibly supported revoking a keyring key that
had been used to "unlock" encrypted files, causing those files to become
"locked" again. This was, however, buggy for several reasons, the most
severe of which was that when key revocation happened to be detected for
an inode, its fscrypt_info was immediately freed, even while other
threads could be using it for encryption or decryption concurrently.
This could be exploited to crash the kernel or worse.
This patch fixes the use-after-free by removing the code which detects
the keyring key having been revoked, invalidated, or expired. Instead,
an encrypted inode that is "unlocked" now simply remains unlocked until
it is evicted from memory. Note that this is no worse than the case for
block device-level encryption, e.g. dm-crypt, and it still remains
possible for a privileged user to evict unused pages, inodes, and
dentries by running 'sync; echo 3 > /proc/sys/vm/drop_caches', or by
simply unmounting the filesystem. In fact, one of those actions was
already needed anyway for key revocation to work even somewhat sanely.
This change is not expected to break any applications.
In the future I'd like to implement a real API for fscrypt key
revocation that interacts sanely with ongoing filesystem operations ---
waiting for existing operations to complete and blocking new operations,
and invalidating and sanitizing key material and plaintext from the VFS
caches. But this is a hard problem, and for now this bug must be fixed.
This bug affected almost all versions of ext4, f2fs, and ubifs
encryption, and it was potentially reachable in any kernel configured
with encryption support (CONFIG_EXT4_ENCRYPTION=y,
CONFIG_EXT4_FS_ENCRYPTION=y, CONFIG_F2FS_FS_ENCRYPTION=y, or
CONFIG_UBIFS_FS_ENCRYPTION=y). Note that older kernels did not use the
shared fs/crypto/ code, but due to the potential security implications
of this bug, it may still be worthwhile to backport this fix to them.
Linus Torvalds [Wed, 15 Mar 2017 16:33:15 +0000 (09:33 -0700)]
Merge tag 'gfs2-4.11-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fix from Bob Peterson:
"This is an emergency patch for 4.11-rc3
The GFS2 developers uncovered a really nasty problem that can lead to
random corruption and kernel panic, much like the last one. Andreas
Gruenbacher wrote a simple one-line patch to fix the problem."
* tag 'gfs2-4.11-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Avoid alignment hole in struct lm_lockname
cpufreq: intel_pstate: Avoid percentages in limits-related computations
Currently, intel_pstate_update_perf_limits() first converts the
policy minimum and maximum limits into percentages of the maximum
turbo frequency (rounding up to an integer) and then converts these
percentages to fractions (by using fixed-point arithmetic to divide
them by 100).
That introduces a rounding error unnecessarily, because the fractions
can be obtained by carrying out fixed-point divisions directly on the
input numbers.
Rework the computations in intel_pstate_hwp_set() to use fractions
instead of percentages (and drop redundant local variables from
there) and modify intel_pstate_update_perf_limits() to compute the
fractions directly and percentages out of them.
While at it, introduce percent_ext_fp() for converting percentages
to fractions (with extended number of fraction bits) and use it in
the computations.
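An illustrative userspace computation of the rounding error that goes away (FRAC_BITS, the frequencies and the names are assumptions for the sketch, not intel_pstate's exact values):
--------------- %< ---------------
#include <stdio.h>
#include <stdint.h>

#define FRAC_BITS 8
#define div_fp(x, y)	(((int64_t)(x) << FRAC_BITS) / (y))

int main(void)
{
	int64_t policy_min_khz = 1234000;	/* requested minimum (assumed) */
	int64_t max_turbo_khz  = 3500000;	/* max turbo frequency (assumed) */

	/* Old scheme: round up to a whole percentage first, then convert
	 * that percentage to a fixed-point fraction. */
	int64_t percent = (policy_min_khz * 100 + max_turbo_khz - 1) / max_turbo_khz;
	int64_t frac_via_percent = div_fp(percent, 100);

	/* New scheme: divide the raw numbers directly in fixed point, so no
	 * intermediate rounding to a whole percent happens. */
	int64_t frac_direct = div_fp(policy_min_khz, max_turbo_khz);

	printf("via percentage: %lld/256\n", (long long)frac_via_percent);
	printf("direct:         %lld/256\n", (long long)frac_direct);
	return 0;
}
--------------- >% ---------------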