Issue is that the old content from tcf_bpf is allocated and needs
to be released when we replace it. We seem to do that since the
beginning of act_bpf on the filter and insns, later on the name as
well.
Example test case, after patch:
# FOO="1,6 0 0 4294967295,"
# BAR="1,6 0 0 4294967294,"
# tc actions add action bpf bytecode "$FOO" index 2
# tc actions show action bpf
action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
index 2 ref 1 bind 0
# tc actions replace action bpf bytecode "$BAR" index 2
# tc actions show action bpf
action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe
index 2 ref 1 bind 0
# tc actions replace action bpf bytecode "$FOO" index 2
# tc actions show action bpf
action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
index 2 ref 1 bind 0
# tc actions del action bpf index 2
[...]
# echo "scan" > /sys/kernel/debug/kmemleak
# cat /sys/kernel/debug/kmemleak | grep "comm \"tc\"" | wc -l
0
Fixes: d23b8ad8ab23 ("tc: add BPF based action") Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Cortina phy does not have kernel driver and we don't attach
device with phy layer for intefaces like XFI, XLAUI etc,
Hence check for interface type before calling disconnect.
net: thunderx: Fix crash when changing rss with mutliple traffic flows
This fixes a crash when changing rss with multiple traffic flows.
While interface teardown, disable tx queues after all NAPI threads
are done. If done otherwise tx queues might be woken up inside NAPI
if any CQE_TX are processed.
net: thunderx: Wakeup TXQ only if CQE_TX are processed
Previously TXQ is wakedup whenever napi is executed
and irrespective of if any CQE_TX are processed or not.
Added 'txq_stop' and 'txq_wake' counters to aid in debugging
if there are any future issues.
Suppressing standard alloc_pages() warnings. Some kernel configs limit
alloc size and the network driver may fail. Do not drop a kernel
warning in this case, instead just drop a oneliner that the network
driver could not be loaded since the buffer could not be allocated.
With earlier configured value sufficient number of CQEs are not
being reserved for transmitted packets. Hence under heavy incoming
traffic load, receive notifications will take away most of the CQ
thus transmit notifications will be lost resulting in tx skbs not
being freed.
Finally SQ will be full and it will be stopped, watchdog timer
will kick in. After this fix receive notifications will not take
morethan half of CQ reserving the rest for transmit notifications.
Also changed CQ & SQ sizes from 16k to 4k.
This is also due to the receive notifications taking first half of
CQ under heavy load and time taken by NAPI to clear transmit notifications
will increase with higher queue sizes. Again results in SQ being stopped.
Eric Dumazet [Wed, 29 Jul 2015 10:01:41 +0000 (12:01 +0200)]
ipv6: flush nd cache on IFF_NOARP change
This patch is the IPv6 equivalent of commit 6c8b4e3ff81b ("arp: flush arp cache on IFF_NOARP change")
Without it, we keep buggy neighbours in the cache, with destination
MAC address equal to our own MAC address.
Tested:
tcpdump -i eth0 -s 0 ip6 -n -e &
ip link set dev eth0 arp off
ping6 remote // sends buggy frames
ip link set dev eth0 arp on
ping6 remote // should work once kernel is patched
Dave Airlie [Thu, 30 Jul 2015 02:41:44 +0000 (12:41 +1000)]
Merge branch 'msm-fixes-4.2' of git://people.freedesktop.org/~robclark/linux into drm-fixes
Fix for nasty crash on mdp4 in disable path, fix for dma-buf export,
smb leak on mdp5 which could result in intermittent modeset fails, and
don't let interrupted system call disturb atomic commit once we are
past the point of no return.
* 'msm-fixes-4.2' of git://people.freedesktop.org/~robclark/linux:
drm/msm/mdp5: release SMB (shared memory blocks) in various cases
drm/msm: change to uninterruptible wait in atomic commit
drm/msm: mdp4: Fix drm_framebuffer dereference crash
drm/msm: fix msm_gem_prime_get_sg_table()
Dave Airlie [Thu, 30 Jul 2015 02:40:27 +0000 (12:40 +1000)]
Merge branch 'drm-fixes-4.2' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
Radeon and amdgpu fixes for 4.2. The audio fix ended up being more
invasive than I would have liked, but this should finally fix up the
last of the regressions since DP audio support was added.
* 'drm-fixes-4.2' of git://people.freedesktop.org/~agd5f/linux:
drm/amdgpu: add new parameter to seperate map and unmap
drm/amdgpu: hdp_flush is not needed for inside IB
drm/amdgpu: different emit_ib for gfx and compute
drm/amdgpu: information leak in amdgpu_info_ioctl()
drm/amdgpu: clean up init sequence for failures
drm/radeon/combios: add some validation of lvds values
drm/radeon: rework audio modeset to handle non-audio hdmi features
drm/radeon: rework audio detect (v4)
drm/amdgpu: Drop drm/ prefix for including drm.h in amdgpu_drm.h
drm/radeon: Drop drm/ prefix for including drm.h in radeon_drm.h
net: netcp: ethss: cleanup gbe_probe() and gbe_remove() functions
This patch clean up error handle code to use goto label properly. In some
cases, the code unnecessarily use goto instead of just returning the error
code. Code also make explicit calls to devm_* APIs on error which is
not necessary. In the gbe_remove() also it makes similar calls which is
also unnecessary.
net: netcp: ethss: fix up incorrect use of list api
The code seems to assume a null is returned when the list is empty
from first_sec_slave() to break the loop which is incorrect. Fix the
code by using list_empty().
Merge tag 'pm+acpi-4.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI fixes from Rafael Wysocki:
"These fix three regressions, two recent ones (cpufreq core and ACPI
device power management) and one introduced during the 4.1 cycle
(intel_pstate).
Specifics:
- Fix a recently introduced issue in the cpufreq core causing it to
attempt to create duplicate symbolic links to the policy directory
in sysfs for CPUs that are offline when the cpufreq driver is being
registered (Rafael J Wysocki)
- Fix a recently introduced problem in the ACPI device power
management core code causing it to store an incorrect value in the
device object's power.state field in some cases which in turn leads
to attempts to turn power resources off while they should still be
on going forward (Mika Westerberg)
- Fix an intel_pstate driver issue introduced during the 4.1 cycle
which leads to kernel panics on boot on Knights Landing chips due
to incomplete support for them in that driver (Lukasz Anaczkowski)"
* tag 'pm+acpi-4.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: Avoid attempts to create duplicate symbolic links
ACPI / PM: Use target_state to set the device power state
intel_pstate: Add get_scaling cpu_defaults param to Knights Landing
Merge tag 'dm-4.2-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- fix DM thinp to consistently return -ENOSPC when out of data space
- fix a logic bug in the DM cache smq policy's creation error path
- revert a DM cache 4.2-rc3 change that reduced writeback efficiency
- fix a hang on DM cache device destruction due to improper
prealloc_used accounting introduced in 4.2-rc3
- update URL for dm-crypt wiki page
* tag 'dm-4.2-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm cache: fix device destroy hang due to improper prealloc_used accounting
Revert "dm cache: do not wake_worker() in free_migration()"
dm crypt: update wiki page URL
dm cache policy smq: fix alloc_bitset check that always evaluates as false
dm thin: return -ENOSPC when erroring retry list due to out of data space
Daniel Borkmann [Tue, 28 Jul 2015 13:26:36 +0000 (15:26 +0200)]
ebpf, x86: fix general protection fault when tail call is invoked
With eBPF JIT compiler enabled on x86_64, I was able to reliably trigger
the following general protection fault out of an eBPF program with a simple
tail call, f.e. tracex5 (or a stripped down version of it):
The problem is that rax has 5a5a5a5a5a5a5a5a, which suggests a tail call
jump to map slot 0 is pointing to a poisoned page. The issue is the following:
lea instruction has a wrong offset, i.e. it should be ...
lea 0x80(%rsi,%rdx,8),%rax
... but it actually seems to be ...
lea -0x80(%rsi,%rdx,8),%rax
... where 0x80 is offsetof(struct bpf_array, prog), thus the offset needs
to be positive instead of negative. Disassembling the interpreter, we btw
similarly do:
Now the other interesting fact is that this panic triggers only when things
like CONFIG_LOCKDEP are being used. In that case offsetof(struct bpf_array,
prog) starts at offset 0x80 and in non-CONFIG_LOCKDEP case at offset 0x50.
Reason is that the work_struct inside struct bpf_map grows by 48 bytes in my
case due to the lockdep_map member (which also has CONFIG_LOCK_STAT enabled
members).
Changing the emitter to always use the 4 byte displacement in the lea
instruction fixes the panic on my side. It increases the tail call instruction
emission by 3 more byte, but it should cover us from various combinations
(and perhaps other future increases on related structures).
Since mdb states were introduced when deleting an entry the state was
left as it was set in the delete request from the user which leads to
the following output when doing a monitor (for example):
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
$ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 temp
^^^
Note the "temp" state in the delete notification which is wrong since
the entry was permanent, the state in a delete is always reported as
"temp" regardless of the real state of the entry.
After this patch:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
$ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
There's one important note to make here that the state is actually not
matched when doing a delete, so one can delete a permanent entry by
stating "temp" in the end of the command, I've chosen this fix in order
not to break user-space tools which rely on this (incorrect) behaviour.
So to give an example after this patch and using the wrong state:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
$ bridge mdb del dev br0 port eth3 grp 239.0.0.1 temp
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
Note the state of the entry that got deleted is correct in the
notification.
Signed-off-by: Nikolay Aleksandrov <[email protected]> Fixes: ccb1c31a7a87 ("bridge: add flags to distinguish permanent mdb entires") Signed-off-by: David S. Miller <[email protected]>
bridge: mcast: give fast leave precedence over multicast router and querier
When fast leave is configured on a bridge port and an IGMP leave is
received for a group, the group is not deleted immediately if there is
a router detected or if multicast querier is configured.
Ideally the group should be deleted immediately when fast leave is
configured.
Wentao Xu [Fri, 19 Jun 2015 18:03:42 +0000 (14:03 -0400)]
drm/msm/mdp5: release SMB (shared memory blocks) in various cases
Release all blocks after the pipe is disabled, even when vsync
didn't happen in some error cases. Allow requesting SMB multiple
times before configuring to hardware, by releasing blocks not
programmed to hardware yet for shrinking case.
This fixes a potential leak of shared memory pool blocks.
Dan Carpenter [Tue, 28 Jul 2015 15:51:29 +0000 (18:51 +0300)]
drm/amdgpu: information leak in amdgpu_info_ioctl()
We recently changed the drm_amdgpu_info_device struct so now there is
a 4 byte hole at the end. We need to initialize it so we don't disclose
secret information from the stack.
Fixes: fa92754e9c47 ('drm/amdgpu: add VCE harvesting instance query') Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
Alex Deucher [Thu, 23 Jul 2015 14:01:09 +0000 (10:01 -0400)]
drm/radeon: rework audio detect (v4)
1. Always assign audio function pointers even if the display does
not support audio. We need to properly disable the audio stream
when when using a non-audio capable monitor. Fixes purple line
on some hdmi monitors.
2. Check if a pin is in use by another encoder before disabling
it.
v2: make sure we've fetched the edid before checking audio and
look up the encoder before calling audio_detect since
connector->encoder may not be assigned yet. Separate
pin and afmt. They are allocated at different times and
have no dependency on eachother.
v3: fix connector fetching in encoder functions
v4: fix missed dig->pin check in dce6_afmt_write_latency_fields
bridge: Fix network header pointer for vlan tagged packets
There are several devices that can receive vlan tagged packets with
CHECKSUM_PARTIAL like tap, possibly veth and xennet.
When (multiple) vlan tagged packets with CHECKSUM_PARTIAL are forwarded
by bridge to a device with the IP_CSUM feature, they end up with checksum
error because before entering bridge, the network header is set to
ETH_HLEN (not including vlan header length) in __netif_receive_skb_core(),
get_rps_cpu(), or drivers' rx functions, and nobody fixes the pointer later.
Since the network header is exepected to be ETH_HLEN in flow-dissection
and hash-calculation in RPS in rx path, and since the header pointer fix
is needed only in tx path, set the appropriate network header on forwarding
packets.
Mike Snitzer [Wed, 29 Jul 2015 17:48:23 +0000 (13:48 -0400)]
dm cache: fix device destroy hang due to improper prealloc_used accounting
Commit 665022d72f9 ("dm cache: avoid calls to prealloc_free_structs() if
possible") introduced a regression that caused the removal of a DM cache
device to hang in cache_postsuspend()'s call to wait_for_migrations()
with the following stack trace:
Fix this by accounting for the call to prealloc_data_structs()
immediately _before_ the call as opposed to after. This is needed
because it is possible to break out of the control loop after the call
to prealloc_data_structs() but before prealloc_used was set to true.
Taking the wake_worker() out of free_migration() will slow writeback
dramatically, and hence adaptability.
Say we have 10k blocks that need writing back, but are only able to
issue 5 concurrently due to the migration bandwidth: it's imperative
that we wake_worker() immediately after migration completion; waiting
for the next 1 second wake up (via do_waker) means it'll take a long
time to write that all back.
ALSA: hda - Fix race between PM ops and HDA init/probe
PM ops could be triggered before HDA is done initializing
and cause PM to set HDA controller to D3Hot. This can result
in "CORB reset timeout#2, CORBRP = 65535" and "no codecs
initialized". Additionally, PM ops can be triggered before
azx_probe_continue finishes (async probe). This can result
in a NULL deref kernel crash.
Pull SCSI target fixes from Nicholas Bellinger:
"This series is larger than what I'd normally be conformable with
sending for a -rc5 PULL request..
However, the bulk of the series is localized to qla2xxx target
specific fixes that address a number of real-world correctness issues,
that have been outstanding on the list for ~6 weeks now. They where
submitted + verified + acked by the HW LLD vendor, contributed by a
major production customer of the code, and are marked for v3.18.y
stable code.
That said, I don't see a good reason to wait another month to get
these fixes into mainline.
Beyond the qla2xx specific fixes, this series also includes:
- bugfix for a long standing use-after-free in iscsi-target during
TPG shutdown + demo-mode sessions.
- bugfix for a >= v4.0 regression OOPs in iscsi-target during a
iscsi_start_kthreads() failure.
- bugfix for a >= v4.0 regression hang in iscsi-target for iser
explicit session/connection logout.
- bugfix for a iser-target bug where a early CMA REJECTED status
during login triggers a NULL pointer dereference OOPs.
- bugfixes for a handful of v4.2-rc1 specific regressions related to
the larger set of recent backend configfs attribute changes.
A big thanks to QLogic + Pure Storage for the qla2xxx target bugfixes"
* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (28 commits)
Documentation/target: Fix tcm_mod_builder.py build breakage
iser-target: Fix REJECT CM event use-after-free OOPs
iscsi-target: Fix iser explicit logout TX kthread leak
iscsi-target: Fix iscsit_start_kthreads failure OOPs
iscsi-target: Fix use-after-free during TPG session shutdown
qla2xxx: terminate exchange when command is aborted by LIO
qla2xxx: drop cmds/tmrs arrived while session is being deleted
qla2xxx: disable scsi_transport_fc registration in target mode
qla2xxx: added sess generations to detect RSCN update races
qla2xxx: Abort stale cmds on qla_tgt_wq when plogi arrives
qla2xxx: delay plogi/prli ack until existing sessions are deleted
qla2xxx: cleanup cmd in qla workqueue before processing TMR
qla2xxx: kill sessions/log out initiator on RSCN and port down events
qla2xxx: fix command initialization in target mode.
qla2xxx: Remove msleep in qlt_send_term_exchange
qla2xxx: adjust debug flags
qla2xxx: release request queue reservation.
qla2xxx: Add flush after updating ATIOQ consumer index.
qla2xxx: Enable target mode for ISP27XX
qla2xxx: Fix hardware lock/unlock issue causing kernel panic.
...
Jeff Mahoney [Mon, 15 Jun 2015 13:41:19 +0000 (09:41 -0400)]
btrfs: add missing discards when unpinning extents with -o discard
When we clear the dirty bits in btrfs_delete_unused_bgs for extents
in the empty block group, it results in btrfs_finish_extent_commit being
unable to discard the freed extents.
The block group removal patch added an alternate path to forget extents
other than btrfs_finish_extent_commit. As a result, any extents that
would be freed when the block group is removed aren't discarded. In my
test run, with a large copy of mixed sized files followed by removal, it
left nearly 2/3 of extents undiscarded.
To clean up the block groups, we add the removed block group onto a list
that will be discarded after transaction commit.
Jeff Mahoney [Mon, 15 Jun 2015 13:41:18 +0000 (09:41 -0400)]
btrfs: explictly delete unused block groups in close_ctree and ro-remount
The cleaner thread may already be sleeping by the time we enter
close_ctree. If that's the case, we'll skip removing any unused
block groups queued for removal, even during a normal umount.
They'll be cleaned up automatically at next mount, but users
expect a umount to be a clean synchronization point, especially
when used on thin-provisioned storage with -odiscard. We also
explicitly remove unused block groups in the ro-remount path
for the same reason.
Jeff Mahoney [Mon, 15 Jun 2015 13:41:17 +0000 (09:41 -0400)]
btrfs: iterate over unused chunk space in FITRIM
Since we now clean up block groups automatically as they become
empty, iterating over block groups is no longer sufficient to discard
unused space.
This patch iterates over the unused chunk space and discards any regions
that are unallocated, regardless of whether they were ever used. This is
a change for btrfs but is consistent with other file systems.
We do this in a transactionless manner since the discard process can take
a substantial amount of time and a transaction would need to be started
before the acquisition of the device list lock. That would mean a
transaction would be held open across /all/ of the discards collectively.
In order to prevent other threads from allocating or freeing chunks, we
hold the chunks lock across the search and discard calls. We release it
between searches to allow the file system to perform more-or-less
normally. Since the running transaction can commit and disappear while
we're using the transaction pointer, we take a reference to it and
release it after the search. This is safe since it would happen normally
at the end of the transaction commit after any locks are released anyway.
We also take the commit_root_sem to protect against a transaction starting
and committing while we're running.
Jeff Mahoney [Mon, 15 Jun 2015 13:41:16 +0000 (09:41 -0400)]
btrfs: skip superblocks during discard
Btrfs doesn't track superblocks with extent records so there is nothing
persistent on-disk to indicate that those blocks are in use. We track
the superblocks in memory to ensure they don't get used by removing them
from the free space cache when we load a block group from disk. Prior
to 47ab2a6c6a (Btrfs: remove empty block groups automatically), that
was fine since the block group would never be reclaimed so the superblock
was always safe. Once we started removing the empty block groups, we
were protected by the fact that discards weren't being properly issued
for unused space either via FITRIM or -odiscard. The block groups were
still being released, but the blocks remained on disk.
In order to properly discard unused block groups, we need to filter out
the superblocks from the discard range. Superblocks are located at fixed
locations on each device, so it makes sense to filter them out in
btrfs_issue_discard, which is used by both -odiscard and FITRIM.
Jeff Mahoney [Mon, 15 Jun 2015 13:41:15 +0000 (09:41 -0400)]
btrfs: btrfs_issue_discard ensure offset/length are aligned to sector boundaries
It's possible, though unexpected, to pass unaligned offsets and lengths
to btrfs_issue_discard. We then shift the offset/length values to sector
units. If an unaligned offset has been passed, it will result in the
entire sector being discarded, possibly losing data. An unaligned
length is safe but we'll end up returning an inaccurate number of
discarded bytes.
This patch aligns the offset to the 512B boundary, adjusts the length,
and warns, since we shouldn't be discarding on an offset that isn't
aligned with our sector size.
Chris Wilson [Wed, 15 Jul 2015 08:50:42 +0000 (09:50 +0100)]
drm/i915: Replace WARN inside I915_READ64_2x32 with retry loop
Since we may conceivably encounter situations where the upper part of the
64bit register changes between reads, for example when a timestamp
counter overflows, change the WARN into a retry loop.
Dave Airlie [Wed, 29 Jul 2015 07:21:38 +0000 (17:21 +1000)]
Merge branch 'linux-4.2' of git://anongit.freedesktop.org/git/nouveau/linux-2.6 into drm-fixes
Two more nouveau fixes.
* 'linux-4.2' of git://anongit.freedesktop.org/git/nouveau/linux-2.6:
drm/nouveau/nouveau/ttm: fix tiled system memory with Maxwell
drm/nouveau/kms/nv50-: guard against enabling cursor on disabled heads
tpacket_fill_skb() can return a negative value (-errno) which
is stored in tp_len variable. In that case the following
condition will be (but shouldn't be) true:
tp_len > dev->mtu + dev->hard_header_len
as dev->mtu and dev->hard_header_len are both unsigned.
That may lead to just returning an incorrect EMSGSIZE errno
to the user.
Fixes: 52f1454f629fa ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case") Signed-off-by: Alexander Drozdov <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Eric Dumazet [Mon, 27 Jul 2015 09:33:50 +0000 (11:33 +0200)]
arp: filter NOARP neighbours for SIOCGARP
When arp is off on a device, and ioctl(SIOCGARP) is queried,
a buggy answer is given with MAC address of the device, instead
of the mac address of the destination/gateway.
We filter out NUD_NOARP neighbours for /proc/net/arp,
we must do the same for SIOCGARP ioctl.
lpaa23:~# ip link set dev eth0 arp off
lpaa23:~# cat /proc/net/arp # check arp table is now 'empty'
IP address HW type Flags HW address Mask Device
lpaa23:~# ./arp 10.246.7.190
MAC=00:1a:11:c3:0d:7f // buggy answer before patch (this is eth0 mac)
After patch :
lpaa23:~# ip link set dev eth0 arp off
lpaa23:~# ./arp 10.246.7.190
ioctl(SIOCGARP) failed: No such device or address
David Ward [Sun, 26 Jul 2015 16:18:58 +0000 (12:18 -0400)]
net/ipv4: suppress NETDEV_UP notification on address lifetime update
This notification causes the FIB to be updated, which is not needed
because the address already exists, and more importantly it may undo
intentional changes that were made to the FIB after the address was
originally added. (As a point of comparison, when an address becomes
deprecated because its preferred lifetime expired, a notification on
this chain is not generated.)
The motivation for this commit is fixing an incompatibility between
DHCP clients which set and update the address lifetime according to
the lease, and a commercial VPN client which replaces kernel routes
in a way that outbound traffic is sent only through the tunnel (and
disconnects if any further route changes are detected via netlink).
bridge: stp: when using userspace stp stop kernel hello and hold timers
These should be handled only by the respective STP which is in control.
They become problematic for devices with limited resources with many
ports because the hold_timer is per port and fires each second and the
hello timer fires each 2 seconds even though it's global. While in
user-space STP mode these timers are completely unnecessary so it's better
to keep them off.
Also ensure that when the bridge is up these timers are started only when
running with kernel STP.
Dave Chinner [Wed, 29 Jul 2015 01:48:02 +0000 (11:48 +1000)]
xfs: remote attributes need to be considered data
We don't log remote attribute contents, and instead write them
synchronously before we commit the block allocation and attribute
tree update transaction. As a result we are writing to the allocated
space before the allcoation has been made permanent.
As a result, we cannot consider this allocation to be a metadata
allocation. Metadata allocation can take blocks from the free list
and so reuse them before the transaction that freed the block is
committed to disk. This behaviour is perfectly fine for journalled
metadata changes as log recovery will ensure the free operation is
replayed before the overwrite, but for remote attribute writes this
is not the case.
Hence we have to consider the remote attribute blocks to contain
data and allocate accordingly. We do this by dropping the
XFS_BMAPI_METADATA flag from the block allocation. This means the
allocation will not use blocks that are on the busy list without
first ensuring that the freeing transaction has been committed to
disk and the blocks removed from the busy list. This ensures we will
never overwrite a freed block without first ensuring that it is
really free.
On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:
This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.
The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.
The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.
Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.
Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.
As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.
Dave Chinner [Wed, 29 Jul 2015 01:48:00 +0000 (11:48 +1000)]
xfs: call dax_fault on read page faults for DAX
When modifying the patch series to handle the XFS MMAP_LOCK nesting
of page faults, I botched the conversion of the read page fault
path, and so it is only every calling through the page cache. Re-add
the necessary __dax_fault() call for such files.
Because the get_blocks callback on read faults may not set up the
mapping buffer correctly to allow unwritten extent completion to be
run, we need to allow callers of __dax_fault() to pass a null
complete_unwritten() callback. The DAX code always zeros the
unwritten page when it is read faulted so there are no stale data
exposure issues with not doing the conversion. The only downside
will be the potential for increased CPU overhead on repeated read
faults of the same page. If this proves to be a problem, then the
filesystem needs to fix it's get_block callback and provide a
convert_unwritten() callback to the read fault path.
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fix from Catalin Marinas:
"Fix buffer overflow when UTF-16 UEFI vendor string is copied from the
system table into a char array with a size of 100 bytes"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64/efi: map the entire UEFI vendor string before reading it
Revert "Input: zforce - don't overwrite the stack"
This reverts commit 7d01cd261c76f95913c81554a751968a1d282d3a because
with given FRAME_MAXSIZE of 257 the check will never trigger and it
causes warnings from GCC (with -Wtype-limits). Also the check was
incorrect as it was not accounting for the already read 2 bytes of data
stored in the buffer.
Merge tag 'devicetree-fixes-for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree fixes from Rob Herring:
"A handful of DT related fixes for 4.2-rc"
* tag 'devicetree-fixes-for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
of: Drop owner assignment from platform and i2c driver
DEVICETREE: Misc fix for the AR7100 SPI controller binding
of: constify drv arg of of_driver_match_device stub
of: add HAS_IOMEM depends to OF_ADDRESS
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull vhost fixes from Michael Tsirkin:
"Two bugfixes only here"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost: fix error handling for memory region alloc
vhost: actually track log eventfd file
Merge tag 'linux-kselftest-4.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest fix from Shuah Khan.
* tag 'linux-kselftest-4.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/futex: Fix futex_cmp_requeue_pi() error handling
Merge tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable patches:
- Fix a situation where the client uses the wrong (zero) stateid.
- Fix a memory leak in nfs_do_recoalesce
Bugfixes:
- Plug a memory leak when ->prepare_layoutcommit fails
- Fix an Oops in the NFSv4 open code
- Fix a backchannel deadlock
- Fix a livelock in sunrpc when sendmsg fails due to low memory
availability
- Don't revalidate the mapping if both size and change attr are up to
date
- Ensure we don't miss a file extension when doing pNFS
- Several fixes to handle NFSv4.1 sequence operation status bits
correctly
- Several pNFS layout return bugfixes"
* tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits)
nfs: Fix an oops caused by using other thread's stack space in ASYNC mode
nfs: plug memory leak when ->prepare_layoutcommit fails
SUNRPC: Report TCP errors to the caller
sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable.
NFSv4.2: handle NFS-specific llseek errors
NFS: Don't clear desc->pg_moreio in nfs_do_recoalesce()
NFS: Fix a memory leak in nfs_do_recoalesce
NFS: nfs_mark_for_revalidate should always set NFS_INO_REVAL_PAGECACHE
NFS: Remove the "NFS_CAP_CHANGE_ATTR" capability
NFS: Set NFS_INO_REVAL_PAGECACHE if the change attribute is uninitialised
NFS: Don't revalidate the mapping if both size and change attr are up to date
NFSv4/pnfs: Ensure we don't miss a file extension
NFSv4: We must set NFS_OPEN_STATE flag in nfs_resync_open_stateid_locked
SUNRPC: xprt_complete_bc_request must also decrement the free slot count
SUNRPC: Fix a backchannel deadlock
pNFS: Don't throw out valid layout segments
pNFS: pnfs_roc_drain() fix a race with open
pNFS: Fix races between return-on-close and layoutreturn.
pNFS: pnfs_roc_drain should return 'true' when sleeping
pNFS: Layoutreturn must invalidate all existing layout segments.
...
Merge tag 'for-f2fs-v4.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs fixes from Jaegeuk Kim.
* tag 'for-f2fs-v4.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: call set_page_dirty to attach i_wb for cgroup
f2fs: handle error cases in move_encrypted_block
cpufreq: Avoid attempts to create duplicate symbolic links
After commit 87549141d516 (cpufreq: Stop migrating sysfs files on
hotplug) there is a problem with CPUs that share cpufreq policy
objects with other CPUs and are initially offline.
Say CPU1 shares a policy with CPU0 which is online and is registered
first. As part of the registration process, cpufreq_add_dev() is
called for it. It creates the policy object and a symbolic link
to it from the CPU1's sysfs directory. If CPU1 is registered
subsequently and it is offline at that time, cpufreq_add_dev() will
attempt to create a symbolic link to the policy object for it, but
that link is present already, so a warning about that will be
triggered.
To avoid that warning, make cpufreq use an additional CPU mask
containing related CPUs that are actually present for each policy
object. That mask is initialized when the policy object is populated
after its creation (for the first online CPU using it) and it includes
CPUs from the "policy CPUs" mask returned by the cpufreq driver's
->init() callback that are physically present at that time. Symbolic
links to the policy are created only for the CPUs in that mask.
If cpufreq_add_dev() is invoked for an offline CPU, it checks the
new mask and only creates the symlink if the CPU was not in it (the
CPU is added to the mask at the same time).
In turn, cpufreq_remove_dev() drops the given CPU from the new mask,
removes its symlink to the policy object and returns, unless it is
the CPU owning the policy object. In that case, the policy object
is moved to a new CPU's sysfs directory or deleted if the CPU being
removed was the last user of the policy.
While at it, notice that cpufreq_remove_dev() can't fail, because
its return value is ignored, so make it ignore return values from
__cpufreq_remove_dev_prepare() and __cpufreq_remove_dev_finish()
and prevent these functions from aborting on errors returned by
__cpufreq_governor(). Also drop the now unused sif argument from
them.
Fixes: 87549141d516 (cpufreq: Stop migrating sysfs files on hotplug) Signed-off-by: Rafael J. Wysocki <[email protected]> Reported-and-tested-by: Russell King <[email protected]> Acked-by: Viresh Kumar <[email protected]>
Mika Westerberg [Tue, 28 Jul 2015 10:51:21 +0000 (13:51 +0300)]
ACPI / PM: Use target_state to set the device power state
Commit 20dacb71ad28 ("ACPI / PM: Rework device power management to follow
ACPI 6") changed the device power management to use D3hot if the device
in question does not have _PR3 method even if D3cold was requested by the
caller.
However, if the device has _PR3 device->power.state is also set to D3hot
instead of D3Cold after power resources have been turned off because
device->power.state will be assigned from "state" instead of
"target_state".
Next time the device is transitioned to D0, acpi_power_transition() will
find that the current power state of the device is D3hot instead of D3cold
which causes it to power down all resources required for the current
(wrong) state D3hot.
Below is a simplified ASL example of a real touch panel device which
triggers the problem:
Scope (TPL1)
{
Name (_PR0, Package (1) { \_SB.PCI0.I2C1.PXTC })
Name (_PR3, Package (1) { \_SB.PCI0.I2C1.PXTC })
...
}
In both D0 and D3hot the same power resource is required. However, when
acpi_power_transition() turns off power resources required for D3hot (as
the device is transitioned to D0) it powers down PXTC which then makes the
device to lose its power.
Fix this by assigning "target_state" to the device power state instead of
"state" that is always D3hot even for devices with valid _PR3.
Fixes: 20dacb71ad28 (ACPI / PM: Rework device power management to follow ACPI 6) Signed-off-by: Mika Westerberg <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
Jeff Layton [Fri, 10 Jul 2015 19:58:42 +0000 (15:58 -0400)]
nfs: plug memory leak when ->prepare_layoutcommit fails
"data" is currently leaked when the prepare_layoutcommit operation
returns an error. Put the cred before taking the spinlock in that
case, take the lock and then goto out_unlock which will drop the
lock and then free "data".
Olof Johansson [Tue, 28 Jul 2015 10:32:24 +0000 (12:32 +0200)]
Merge tag 'imx-fixes-4.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux into fixes
The i.MX fixes for 4.2, 2nd round:
- Add the required second clock for i.MX35 FlexCAN in device tree,
so that the device can be probed by kernel successfully.
* tag 'imx-fixes-4.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux:
ARM: dts: i.MX35: Fix can support.
The illegal operation was executed because of a missing facility check,
which should have made sure that the ECAG execution would only be executed
on machines which have the general-instructions-extension facility
installed.
Colin Ian King [Mon, 27 Jul 2015 14:23:43 +0000 (15:23 +0100)]
KEYS: ensure we free the assoc array edit if edit is valid
__key_link_end is not freeing the associated array edit structure
and this leads to a 512 byte memory leak each time an identical
existing key is added with add_key().
The reason the add_key() system call returns okay is that
key_create_or_update() calls __key_link_begin() before checking to see
whether it can update a key directly rather than adding/replacing - which
it turns out it can. Thus __key_link() is not called through
__key_instantiate_and_link() and __key_link_end() must cancel the edit.
Dave Airlie [Tue, 28 Jul 2015 02:38:30 +0000 (12:38 +1000)]
Merge branch 'linux-4.2' of git://anongit.freedesktop.org/git/nouveau/linux-2.6 into drm-fixes
Various minor fixes all over the place, nothing too scary.
* 'linux-4.2' of git://anongit.freedesktop.org/git/nouveau/linux-2.6:
drm/nouveau/fbcon/g80: reduce PUSH_SPACE alloc, fire ring on accel init
drm/nouveau/fbcon/gf100-: reduce RING_SPACE allocation
drm/nouveau/fbcon/nv11-: correctly account for ring space usage
drm/nouveau/bios: add proper support for opcode 0x59
drm/nouveau/bios: add 0x59 and 0x5a opcodes
drm/nouveau/disp: Use NULL for pointers
drm/nouveau/pm: fix a potential race condition when creating an engine context
drm/nouveau/pm: prevent freeing the wrong engine context
drm/nouveau/gr/gf100: wait for GR idle after GO_IDLE bundle
drm/nouveau/gr/gf100: wait on bottom half of FE's pipeline
drm/nouveau/fifo/gk104: kick channels when deactivating them
drm/nouveau/ibus/gk20a: increase SM wait timeout
drm/nouveau/platform: fix compile error if !CONFIG_IOMMU
drm/nouveau: Do not leak client objects
drm/nouveau/clk/gt215: u32->s32 for difference in req. and set clock
drm/nouveau/drm/nv04-nv40/instmem: protect access to priv->heap by mutex
drm/nouveau: hold mutex when calling nouveau_abi16_fini()
Henrik Rydberg [Fri, 24 Jul 2015 21:45:05 +0000 (14:45 -0700)]
HID: apple: Add support for the 2015 Macbook Pro
This patch adds keyboard support for MacbookPro12,1 as WELLSPRING9
(0x0272, 0x0273, 0x0274). The touchpad is handled in a separate
bcm5974 patch, as usual.
Lars Westerhoff [Mon, 27 Jul 2015 22:32:21 +0000 (01:32 +0300)]
packet: missing dev_put() in packet_do_bind()
When binding a PF_PACKET socket, the use count of the bound interface is
always increased with dev_hold in dev_get_by_{index,name}. However,
when rebound with the same protocol and device as in the previous bind
the use count of the interface was not decreased. Ultimately, this
caused the deletion of the interface to fail with the following message:
unregister_netdevice: waiting for dummy0 to become free. Usage count = 1
This patch moves the dev_put out of the conditional part that was only
executed when either the protocol or device changed on a bind.
Fixes: 902fefb82ef7 ('packet: improve socket create/bind latency in some cases') Signed-off-by: Lars Westerhoff <[email protected]> Signed-off-by: Dan Carpenter <[email protected]> Reviewed-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Alexander Duyck [Mon, 27 Jul 2015 20:08:06 +0000 (13:08 -0700)]
fib_trie: Drop unnecessary calls to leaf_pull_suffix
It was reported that update_suffix was taking a long time on systems where
a large number of leaves were attached to a single node. As it turns out
fib_table_flush was calling update_suffix for each leaf that didn't have all
of the aliases stripped from it. As a result, on this large node removing
one leaf would result in us calling update_suffix for every other leaf on
the node.
The fix is to just remove the calls to leaf_pull_suffix since they are
redundant as we already have a call in resize that will go through and
update the suffix length for the node before we exit out of
fib_table_flush or fib_table_flush_external.
arm64/efi: map the entire UEFI vendor string before reading it
At boot, the UTF-16 UEFI vendor string is copied from the system
table into a char array with a size of 100 bytes. However, this
size of 100 bytes is also used for memremapping() the source,
which may not be sufficient if the vendor string exceeds 50
UTF-16 characters, and the placement of the vendor string inside
a 4 KB page happens to leave the end unmapped.
So use the correct '100 * sizeof(efi_char16_t)' for the size of
the mapping.
sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable.
The networking layer does not reliably report the distinction between
a non-block write failing because:
1/ the queue is too full already and
2/ a memory allocation attempt failed.
The distinction is important because in the first case it is
appropriate to retry as soon as the socket reports that it is
writable, and in the second case a small delay is required as the
socket will most likely report as writable but kmalloc could still
fail.
sk_stream_wait_memory() exhibits this distinction nicely, setting
'vm_wait' if a small wait is needed. However in the non-blocking case
it always returns -EAGAIN no matter the cause of the failure. This
-EAGAIN call get all the way to sunrpc.
The sunrpc layer expects EAGAIN to indicate the first cause, and
ENOBUFS to indicate the second. Various documentation suggests that
this is not unreasonable, but does not guarantee the desired error
codes.
The result of getting -EAGAIN when -ENOBUFS is expected is that the
send is tried again in a tight loop and soft lockups are reported.
so: add tests after calls to xs_sendpages() to translate -EAGAIN into
-ENOBUFS if the socket is writable. This cannot happen inside
xs_sendpages() as the test for "is socket writable" is different
between TCP and UDP.
With this change, the tight loop retrying xs_sendpages() becomes a
loop which only retries every 250ms, and so will not trigger a
soft-lockup warning.
It is possible that the write did fail because the queue was too full
and by the time xs_sendpages() completed, the queue was writable
again. In this case an extra 250ms delay is inserted that isn't
really needed. This circumstance suggests a degree of congestion so a
delay is not necessarily a bad thing, and it can only cause a single
250ms delay, not a series of them.
Igor Mammedov [Wed, 15 Jul 2015 14:48:50 +0000 (16:48 +0200)]
vhost: fix error handling for memory region alloc
callers of vhost_kvzalloc() expect the same behaviour on
allocation error as from kmalloc/vmalloc i.e. NULL return
value. So just return vzmalloc() returned value instead of
returning ERR_PTR(-ENOMEM)
Fixes: 4de7255f7d2be5 ("vhost: extend memory regions allocation to vmalloc") Spotted-by: Dan Carpenter <[email protected]> Suggested-by: Julia Lawall <[email protected]> Signed-off-by: Igor Mammedov <[email protected]> Signed-off-by: Michael S. Tsirkin <[email protected]>