OGAWA Hirofumi [Fri, 10 Mar 2017 00:17:37 +0000 (16:17 -0800)]
fat: fix using uninitialized fields of fat_inode/fsinfo_inode
The recently merged fallocate patch uses
MSDOS_I(inode)->mmu_private in fat_evict_inode(). However,
fat_inode/fsinfo_inode, introduced earlier, never initialized
MSDOS_I(inode) properly.
In combination, this caused accesses to random entries
in the FAT area.
Remove the incorrect CONFIG_IDE ifdef (the CONFIG_IDE config option is for
internal drivers/ide/ use) and make the IDE hardware interface always
initialized (not only when the IDE subsystem is built in).
This patch allows Cayman board to work with modular IDE subsystem
support and removes the requirement of having the whole core IDE
subsystem built-in when using libata PATA support.
Dmitry Vyukov [Fri, 10 Mar 2017 00:17:32 +0000 (16:17 -0800)]
kasan: fix races in quarantine_remove_cache()
quarantine_remove_cache() frees all pending objects that belong to the
cache, before we destroy the cache itself. However, there are currently
two ways it can fail to do so.
First, another thread can hold some of the objects from the cache in its
temp list in quarantine_put(). quarantine_put() has a window of
enabled interrupts, and on_each_cpu() in quarantine_remove_cache() can
finish right in that window. These objects will later be freed into the
destroyed cache.
Second, quarantine_reduce() has the same problem. It grabs a batch of
objects from the global quarantine, then unlocks quarantine_lock and
then frees the batch. quarantine_remove_cache() can finish while some
objects from the cache are still in the local to_free list in
quarantine_reduce().
Fix the race with quarantine_put() by disabling interrupts for the whole
duration of quarantine_put(). In combination with on_each_cpu() in
quarantine_remove_cache() it ensures that quarantine_remove_cache()
either sees the objects in the per-cpu list or in the global list.
Fix the race with quarantine_reduce() by protecting quarantine_reduce()
with srcu critical section and then doing synchronize_srcu() at the end
of quarantine_remove_cache().
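A minimal sketch of the srcu pairing described above; the srcu struct name and the two helpers are illustrative stand-ins, not the actual mm/kasan/quarantine.c code:

#include <linux/srcu.h>
#include <linux/slab.h>

DEFINE_STATIC_SRCU(remove_cache_srcu);		/* illustrative name */

static void quarantine_reduce_sketch(void)
{
	int idx;

	/* Reader side: the batch moved to the local to_free list stays
	 * visible to synchronize_srcu() until we finish freeing it. */
	idx = srcu_read_lock(&remove_cache_srcu);
	/* ... detach a batch from the global quarantine and free it ... */
	srcu_read_unlock(&remove_cache_srcu, idx);
}

static void quarantine_remove_cache_sketch(struct kmem_cache *cache)
{
	/* ... on_each_cpu() + global list scan, as before ... */

	/* Wait for any quarantine_reduce() that still holds objects from
	 * this cache in its local to_free list. */
	synchronize_srcu(&remove_cache_srcu);
}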
I've done some assessment of how good synchronize_srcu() works in this
case. And on a 4 CPU VM I see that it blocks waiting for pending read
critical sections in about 2-3% of cases. Which looks good to me.
I suspect that these races are the root cause of some GPFs that I
episodically hit. Previously I did not have any explanation for them.
I am not sure about backporting. The bug is quite hard to trigger; I've
seen it a few times during our massive continuous testing (however, it
could be the cause of some other episodic stray crashes, as it leads to
memory corruption...). If it is triggered, the consequences are very
bad -- almost certainly memory corruption. The fix is non-trivial
and has a chance of introducing new bugs. I am also not sure how
actively people use KASAN on older releases.
Dmitry Vyukov [Fri, 10 Mar 2017 00:17:28 +0000 (16:17 -0800)]
kasan: resched in quarantine_remove_cache()
We see reported stalls/lockups in quarantine_remove_cache() on machines
with large amounts of RAM. quarantine_remove_cache() needs to scan
whole quarantine in order to take out all objects belonging to the
cache. Quarantine is currently 1/32-th of RAM, e.g. on a machine with
256GB of memory that will be 8GB. Moreover quarantine scanning is a
walk over uncached linked list, which is slow.
Add cond_resched() after scanning each non-empty batch of objects.
Batches are specifically kept at a reasonable size for quarantine_put().
On a machine with 256GB of RAM we should have ~512 non-empty batches,
each with 16MB of objects.
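A small sketch of the shape of that loop, with stand-in types (the real qlist structures in mm/kasan/quarantine.c differ):

#include <linux/list.h>
#include <linux/sched.h>

struct qbatch {				/* illustrative stand-in */
	struct list_head objects;
};

static void scan_quarantine_sketch(struct qbatch *batches, int nr_batches)
{
	int i;

	for (i = 0; i < nr_batches; i++) {
		if (list_empty(&batches[i].objects))
			continue;
		/* ... move objects belonging to the dying cache out ... */
		cond_resched();		/* one resched per non-empty batch */
	}
}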
Tahsin Erdogan [Fri, 10 Mar 2017 00:17:26 +0000 (16:17 -0800)]
mm: do not call mem_cgroup_free() from within mem_cgroup_alloc()
mem_cgroup_free() indirectly calls wb_domain_exit() which is not
prepared to deal with a struct wb_domain object that hasn't executed
wb_domain_init(). For instance, the following warning message is
printed by lockdep if alloc_percpu() fails in mem_cgroup_alloc():
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 1 PID: 1950 Comm: mkdir Not tainted 4.10.0+ #151
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x67/0x99
register_lock_class+0x36d/0x540
__lock_acquire+0x7f/0x1a30
lock_acquire+0xcc/0x200
del_timer_sync+0x3c/0xc0
wb_domain_exit+0x14/0x20
mem_cgroup_free+0x14/0x40
mem_cgroup_css_alloc+0x3f9/0x620
cgroup_apply_control_enable+0x190/0x390
cgroup_mkdir+0x290/0x3d0
kernfs_iop_mkdir+0x58/0x80
vfs_mkdir+0x10e/0x1a0
SyS_mkdirat+0xa8/0xd0
SyS_mkdir+0x14/0x20
entry_SYSCALL_64_fastpath+0x18/0xad
Add __mem_cgroup_free(), which skips wb_domain_exit(). This is used by
both mem_cgroup_free() and the mem_cgroup_alloc() cleanup path.
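A rough sketch of the resulting split; the helper bodies, the alloc_percpu() call site and the wb_domain helper names only loosely follow mm/memcontrol.c and are assumptions here:

#include <linux/memcontrol.h>
#include <linux/percpu.h>
#include <linux/slab.h>

static void __mem_cgroup_free_sketch(struct mem_cgroup *memcg)
{
	/* free per-node data, per-cpu stats and the struct itself,
	 * without touching the wb_domain */
}

static void mem_cgroup_free_sketch(struct mem_cgroup *memcg)
{
	memcg_wb_domain_exit(memcg);	/* only valid after a full init */
	__mem_cgroup_free_sketch(memcg);
}

static struct mem_cgroup *mem_cgroup_alloc_sketch(void)
{
	struct mem_cgroup *memcg = kzalloc(sizeof(*memcg), GFP_KERNEL);

	if (!memcg)
		return NULL;
	memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
	if (!memcg->stat)
		goto fail;
	/* ... more allocations, then memcg_wb_domain_init() last ... */
	return memcg;
fail:
	/* wb_domain was never initialized, so skip wb_domain_exit() */
	__mem_cgroup_free_sketch(memcg);
	return NULL;
}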
The second mmap() creates a PTE-mapping of the first huge page in the file.
That makes the kernel munlock the page, as we never keep a PTE-mapped page
mlocked.
On munlockall(), when we handle the vma created by the first mmap(),
munlock_vma_page() returns page_mask == 0, as the page is not mlocked
anymore. On the next iteration follow_page_mask() returns the tail page, but
page_mask is HPAGE_NR_PAGES - 1. That makes us skip to the first tail
page of the next huge page and step on
VM_BUG_ON_PAGE(PageMlocked(page)).
The fix is to not use the page_mask from follow_page_mask() at all. It has
no use for us.
Andrea Arcangeli [Fri, 10 Mar 2017 00:17:14 +0000 (16:17 -0800)]
userfaultfd: selftest: vm: allow to build in vm/ directory
linux/tools/testing/selftests/vm $ make
gcc -Wall -I ../../../../usr/include compaction_test.c -lrt -o /compaction_test
/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.4/../../../../x86_64-pc-linux-gnu/bin/ld: cannot open output file /compaction_test: Permission denied
collect2: error: ld returned 1 exit status
make: *** [../lib.mk:54: /compaction_test] Error 1
Since commit a8ba798bc8ec ("selftests: enable O and KBUILD_OUTPUT")
selftests/vm build fails if run from the "selftests/vm" directory, but
it works in the selftests/ directory. It's quicker to be able to do a
local vm-only build after a tree wipe and this patch allows for it
again.
Andrea Arcangeli [Fri, 10 Mar 2017 00:17:11 +0000 (16:17 -0800)]
userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED
userfaultfd_remove() has to be executed before zapping the pagetables, or
UFFDIO_COPY could keep filling pages after zap_page_range() returned,
which would result in non-zero data after a MADV_DONTNEED.
However userfaultfd_remove() may have to release the mmap_sem. This was
handled correctly in MADV_REMOVE, but MADV_DONTNEED accessed a
potentially stale vma (the very vma passed to zap_page_range(vma, ...)).
The fix consists in revalidating the vma in case userfaultfd_remove()
had to release the mmap_sem.
This also optimizes away an unnecessary down_read/up_read in the
MADV_REMOVE case if UFFD_EVENT_FORK had to be delivered.
It all remains zero runtime cost in case CONFIG_USERFAULTFD=n as
userfaultfd_remove() will be defined as "true" at build time.
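A simplified sketch of the revalidation step; the helper names and the return-value convention only approximate mm/madvise.c, and the real code has more checks:

static long madvise_dontneed_sketch(struct vm_area_struct *vma,
				    struct vm_area_struct **prev,
				    unsigned long start, unsigned long end)
{
	*prev = vma;
	if (!can_madv_dontneed_vma(vma))
		return -EINVAL;

	if (!userfaultfd_remove(vma, start, end)) {
		/* userfaultfd_remove() dropped mmap_sem: vma/prev are stale */
		*prev = NULL;
		down_read(&current->mm->mmap_sem);
		vma = find_vma(current->mm, start);
		if (!vma || start < vma->vm_start)
			return -ENOMEM;
		if (!can_madv_dontneed_vma(vma))
			return -EINVAL;
		if (end > vma->vm_end)
			end = vma->vm_end;
	}
	zap_page_range(vma, start, end - start);
	return 0;
}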
Laurent Dufour [Fri, 10 Mar 2017 00:17:06 +0000 (16:17 -0800)]
mm/cgroup: avoid panic when init with low memory
The system may panic during initialisation when almost all memory is
assigned to huge pages via the kernel command line parameter
hugepage=xxxx. A panic can then occur.
This is a chicken-and-egg issue where the kernel tries to get free memory
when allocating per-node data in mem_cgroup_init(), but in that path
mem_cgroup_soft_limit_reclaim() is called, which assumes that this data
has already been allocated.
As mem_cgroup_soft_limit_reclaim() is best effort, it should return when
this data is not yet allocated.
This patch also fixes potential null pointer access in
mem_cgroup_remove_from_trees() and mem_cgroup_update_tree().
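The shape of the guard is roughly this (a sketch only; the real function in mm/memcontrol.c differs in detail, and the same kind of NULL check is added to the tree helpers):

unsigned long mem_cgroup_soft_limit_reclaim_sketch(pg_data_t *pgdat, int order,
						   gfp_t gfp_mask,
						   unsigned long *total_scanned)
{
	struct mem_cgroup_tree_per_node *mctz;

	mctz = soft_limit_tree_node(pgdat->node_id);

	/* During early boot the per-node trees may not be allocated yet;
	 * this reclaim is best effort, so just report nothing reclaimed. */
	if (!mctz || RB_EMPTY_ROOT(&mctz->rb_root))
		return 0;

	/* ... normal soft-limit reclaim loop ... */
	return 0;
}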
Andrea Arcangeli [Fri, 10 Mar 2017 00:16:54 +0000 (16:16 -0800)]
userfaultfd: non-cooperative: release all ctx in dup_userfaultfd_complete
Don't stop running dup_fctx() even if userfaultfd_event_wait_completion()
fails, as it has to run userfaultfd_ctx_put() on every ctx to pair against
the userfaultfd_ctx_get() that was run on every fctx->orig in
dup_userfaultfd().
Patch series "userfaultfd non-cooperative further update for 4.11 merge
window".
Unfortunately I noticed one relevant bug in userfaultfd_exit while doing
more testing. I had been testing before, and this was also tested
by the kbuild bot and exercised by the selftest, but this bug never
reproduced before.
I dropped userfaultfd_exit as a result. I dropped it because of the
implementation difficulty of receiving signals in __mmput, and because I
think -ENOSPC as a result from the background UFFDIO_COPY should be enough
already.
Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit
wasn't exercised by the selftest and when I tried to exercise it, after
moving it to a more correct place in __mmput where it would make more
sense and where the vma list is stable, it resulted in the
event_wait_completion in D state. So then I added the second patch to
be sure even if we call userfaultfd_event_wait_completion too late
during task exit(), we won't risk generating tasks in D state. The
same check exists in handle_userfault() for the same reason, except that it
makes a difference there, while here it is just a robustness check
run under WARN_ON_ONCE.
While looking at the userfaultfd_event_wait_completion() function, I
also looked back at its callers, and I think it's not OK to
stop executing dup_fctx() on the fcs list, because we rely on
userfaultfd_event_wait_completion() to execute
userfaultfd_ctx_put(fctx->orig), which is paired against
userfaultfd_ctx_get(fctx->orig) in dup_userfaultfd() just before
list_add(fcs). This change only takes care of fctx->orig, but this area
also needs further review looking for similar problems in fctx->new.
The only patch that is urgent is the first, because it's a use-after-free
during an SMP race condition that affects all processes if
CONFIG_USERFAULTFD=y. Very hard to reproduce though and probably
impossible without SLUB poisoning enabled.
This patch (of 3):
I once reproduced this oops with the userfaultfd selftest; it's not
easily reproducible and it requires SLUB poisoning to reproduce.
In the debugger I located the "mm" pointer on the stack, and walking
mm->mmap->vm_next through to the end shows the vma->vm_next list is fully
consistent and NULL-terminated as expected. So this has to
be an SMP race condition where userfaultfd_exit was running while the
vma list was being modified by another CPU.
When userfaultfd_exit() ran, one of the ->vm_next pointers pointed to
SLAB_POISON (RBX is the vma pointer and is 0x6b6b..).
The reason is that it's not running in __mmput but while there are still
other threads running, and it's not holding the mmap_sem (it can't, as it
has to wait for the event to be received by the manager). So this is a
use-after-free that was happening for all processes.
One more implementation problem aside from the race condition:
userfaultfd_exit really has to check a flag in mm->flags before walking
the vmas, or it's going to slow down the exit() path for regular tasks.
One more implementation problem: at that point signals can't be
delivered, so it would also create a task in D state if the manager
doesn't read the event.
The major design issue: it overall looks superfluous as the manager can
check for -ENOSPC in the background transfer:
	if (mmget_not_zero(ctx->mm)) {
		[..]
	} else {
		return -ENOSPC;
	}
It's safer to roll it back and re-introduce it later if at all.
Dan Williams [Fri, 10 Mar 2017 00:16:45 +0000 (16:16 -0800)]
x86, mm: unify exit paths in gup_pte_range()
All exit paths from gup_pte_range() require pte_unmap() of the original
pte page before returning. Refactor the code to have a single exit
point to do the unmap.
This mirrors the flow of the generic gup_pte_range() in mm/gup.c.
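A sketch of what the single-exit structure looks like (approximating the flow of arch/x86/mm/gup.c rather than quoting it):

static int gup_pte_range_sketch(pmd_t pmd, unsigned long addr,
				unsigned long end, int write,
				struct page **pages, int *nr)
{
	int ret = 0;
	pte_t *ptep, *ptem;

	ptem = ptep = pte_offset_map(&pmd, addr);
	do {
		pte_t pte = *ptep;

		if (!pte_allows_gup(pte_val(pte), write))
			goto pte_unmap;		/* every failure funnels here */
		/* ... devmap handling, get_page(), pages[(*nr)++] = page ... */
	} while (ptep++, addr += PAGE_SIZE, addr != end);
	ret = 1;

pte_unmap:
	pte_unmap(ptem);	/* the one and only unmap of the pte page */
	return ret;
}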
Dan Williams [Fri, 10 Mar 2017 00:16:42 +0000 (16:16 -0800)]
x86, mm: fix gup_pte_range() vs DAX mappings
gup_pte_range() fails to check pte_allows_gup() before translating a DAX
pte entry, pte_devmap(), to a page. This allows writes to read-only
mappings, and bypasses the DAX cacheline dirty tracking due to missed
'mkwrite' faults. The gup_huge_pmd() path and the gup_huge_pud() path
correctly check pte_allows_gup() before checking for _devmap() entries.
powerpc/mm: update pte_write and pte_wrprotect to handle savedwrite
We use pte_write() to check whether the pte entry is writable. This is
mostly used to later mark the pte read-only if it is writable. The other
use of pte_write() is to check whether the pte entry is writable so that the
hardware page table entry can be marked accordingly. This is used in KVM,
where we look at the qemu page table entry and update the hardware hash page
table for the guest with the correct write-enable bit.
With the above, for the first usage we should also check the savedwrite
bit so that we can correctly clear it. For the latter, we
add a new variant __pte_write().
With this we can revert the write_protect_page part of 595cd8f256d2 ("mm/ksm:
handle protnone saved writes when making page write protect"). But I left
it as is, as example code for the savedwrite check.
We need to mark the pages of the parent process read-only on fork. NUMA
fault handling needs a protnone pte variant with the saved write flag set.
On fork we need to make sure we remove the saved write bit. Instead of
adding the protnone check in the caller, update the ptep_set_wrprotect
variants to clear the savedwrite bit.
Without this we see random segfaults in applications on fork.
Masahiro Yamada [Fri, 10 Mar 2017 00:16:31 +0000 (16:16 -0800)]
scripts/spelling.txt: add "disble(d)" pattern and fix typo instances
Fix typos and add the following to the scripts/spelling.txt:
disble||disable
disbled||disabled
I kept TSL2563_INT_DISBLED in drivers/iio/light/tsl2563.c
untouched. The macro is not referenced at all, but this commit touches
only comment blocks, just in case.
__do_fault assumes vmf->page has been initialized and is valid if
VM_FAULT_NOPAGE is not returned by vma->vm_ops->fault(vma, vmf).
handle_userfault() in turn should return VM_FAULT_NOPAGE if it doesn't
return VM_FAULT_SIGBUS or VM_FAULT_RETRY (the other two possibilities).
This VM_FAULT_NOPAGE case is only invoked when signals are pending and it
didn't matter for anonymous memory before. It only started to matter
since shmem was introduced. hugetlbfs also takes a different path and
doesn't exercise __do_fault.
Linus Torvalds [Fri, 10 Mar 2017 00:30:37 +0000 (16:30 -0800)]
Merge tag 'pm-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix several issues in the intel_pstate driver and one issue in
the schedutil cpufreq governor, clean up that governor a bit and hook
up existing code for disabling cpufreq to a new kernel command line
option.
Specifics:
- Three fixes for intel_pstate problems related to the passive mode
(in which it acts as a regular cpufreq scaling driver), two for the
handling of global P-state limits and one for the handling of the
cpu_frequency tracepoint in that mode (Rafael Wysocki).
- Three fixes for the handling of P-state limits in intel_pstate in
the active mode (Rafael Wysocki).
- Introduction of a new cpufreq.off=1 kernel command line argument
that will disable cpufreq entirely if passed to the kernel and is
simply hooked up to the existing code used by Xen (Len Brown).
- Fix for the schedutil cpufreq governor to prevent it from using
stale raw frequency values in configurations with multiple CPUs
sharing one policy object and a cleanup for it reducing its
overhead slightly (Viresh Kumar)"
* tag 'pm-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy
cpufreq: intel_pstate: Fix intel_pstate_verify_policy()
cpufreq: intel_pstate: Fix global settings in active mode
cpufreq: Add the "cpufreq.off=1" cmdline option
cpufreq: schedutil: Pass sg_policy to get_next_freq()
cpufreq: schedutil: move cached_raw_freq to struct sugov_policy
cpufreq: intel_pstate: Avoid triggering cpu_frequency tracepoint unnecessarily
cpufreq: intel_pstate: Fix intel_cpufreq_verify_policy()
cpufreq: intel_pstate: Do not use performance_limits in passive mode
Linus Torvalds [Fri, 10 Mar 2017 00:02:04 +0000 (16:02 -0800)]
Merge tag 'pci-v4.11-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI fixes from Bjorn Helgaas:
"PCI fixes:
- fix NULL pointer dereference in Exynos driver
- fix NULL pointer dereference in ASPM with pre-1.1 PCIe devices
- blacklist QLogic ISP2722 to prevent panics while reading VPD"
* tag 'pci-v4.11-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI/ASPM: Always set link->downstream to avoid NULL dereference on remove
PCI: Prevent VPD access for QLogic ISP2722
PCI: exynos: Initialize elbi_base even when using PHY framework
Linus Torvalds [Thu, 9 Mar 2017 23:53:25 +0000 (15:53 -0800)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Sending this a bit sooner than I otherwise would have, as a fix in the
merge window had some unfortunate issues and side effects for some
folks.
This contains:
- Fixes from Jan for the bdi registration/unregistration. These have
been tested by the various parties reporting issues, and should be
solid at this point.
- Also from Jan, fix for axonram gendisk registration.
- A stable fix for zram from Johannes.
- A small series from Ming, fixing up some long standing issues with
blk-mq hardware queue kobject initialization and registration.
- A fix for sed opal from Jon, fixing a nonsensical range check and
some set-but-not-used variables.
- A fix from Neil for a long standing deadlock issue for stacking
device drivers. With this in place, dm/md don't have to work around
the issue anymore, and can be properly fixed up"
* 'for-linus' of git://git.kernel.dk/linux-block:
axonram: Fix gendisk handling
blk: improve order of bio handling in generic_make_request()
Revert "scsi, block: fix duplicate bdi name registration crashes"
block: Make del_gendisk() safer for disks without queues
bdi: Fix use-after-free in wb_congested_put()
block: Allow bdi re-registration
block/sed: Fix opal user range check and unused variables
zram: set physical queue limits to avoid array out of bounds accesses
blk-mq: free hctx->cpumask in release handler of hctx's kobject
blk-mq: make lifetime consistent between hctx and its kobject
blk-mq: make lifetime consitent between q/ctx and its kobject
blk-mq: initialize mq kobjects in blk_mq_init_allocated_queue()
Linus Torvalds [Thu, 9 Mar 2017 23:50:56 +0000 (15:50 -0800)]
Merge tag 'media/v4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
"Media regression fixes:
- serial_ir: fix a Kernel crash during boot on Kernel 4.11-rc1, due
to an IRQ code called too early
- other IR regression fixes at lirc and at the raw IR decoding
- a deadlock fix at the RC nuvoton driver
- fix another issue with DMA on stack at dw2102 driver
There's an extra patch there that changes a driver interface for the
SoC VSP1 driver, which is shared between the DRM and V4L2 drivers. The
patch itself is trivial, and was acked by David Airlie"
* tag 'media/v4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
[media] v4l: vsp1: Adapt vsp1_du_setup_lif() interface to use a structure
[media] dw2102: don't do DMA on stack
[media] rc: protocol is not set on register for raw IR devices
[media] rc: raw decoder for keymap protocol is not loaded on register
[media] rc: nuvoton: fix deadlock in nvt_write_wakeup_codes
[media] lirc: fix dead lock between open and wakeup_filter
[media] serial_ir: ensure we're ready to receive interrupts
When all map elements are pre-allocated, one cpu can delete and reuse an
htab_elem while another cpu is still walking the hlist. In such a case the
lookup may miss the element. Convert hlist to hlist_nulls to avoid this
scenario.
When the bucket lock is taken there is no need for such precautions,
so only convert map_lookup and map_get_next to nulls.
The race window is extremely small and only reproducible with an explicit
udelay() inside lookup_nulls_elem_raw().
Similar to hlist, add hlist_nulls_for_each_entry_safe() and
hlist_nulls_entry_safe() helpers.
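A usage sketch of the new iterator in a hashtab-style bucket walk; the element struct is a stand-in, not the real htab_elem:

#include <linux/list_nulls.h>
#include <linux/rculist_nulls.h>

struct elem_sketch {				/* stand-in for htab_elem */
	struct hlist_nulls_node hash_node;
};

static void flush_bucket_sketch(struct hlist_nulls_head *head)
{
	struct elem_sketch *l;
	struct hlist_nulls_node *n;

	/* 'n' caches the next pointer, so deleting 'l' is safe here */
	hlist_nulls_for_each_entry_safe(l, n, head, hash_node)
		hlist_nulls_del_rcu(&l->hash_node);
}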
When an htab_elem is removed from the bucket list, the htab_elem.hash_node.next
field should not be overridden too early, otherwise we have a tiny race window
between lookup and delete.
The bug was discovered by manual code analysis and reproducible
only with explicit udelay() in lookup_elem_raw().
Replace MAX_ADDR_LEN with its numeric value to fix the following
linux/packet_diag.h userspace compilation error:
/usr/include/linux/packet_diag.h:67:17: error: 'MAX_ADDR_LEN' undeclared here (not in a function)
__u8 pdmc_addr[MAX_ADDR_LEN];
This is not the first case in the UAPI where the numeric value
of MAX_ADDR_LEN is used instead of the symbolic one; uapi/linux/if_link.h
already does the same.
Paolo Abeni [Tue, 7 Mar 2017 17:33:31 +0000 (18:33 +0100)]
net/tunnel: set inner protocol in network gro hooks
The gso code of several tunnel types (gre and udp tunnels)
takes for granted that skb->inner_protocol is properly
initialized; otherwise the packet is dropped elsewhere.
On the forwarding path no one initializes this field,
so gro encapsulated packets are dropped on forward.
Since commit 38720352412a ("gre: Use inner_proto to obtain
inner header protocol"), this can be reproduced when the
encapsulated packets use gre as the tunneling protocol.
The issue happens also with vxlan and geneve tunnels since
commit 8bce6d7d0d1e ("udp: Generalize skb_udp_segment"), if the
forwarding host's ingress nic has h/w offload for such tunnel
and a vxlan/geneve device is configured on top of it, regardless
of the configured peer address and vni.
To address the issue, this change initializes the inner_protocol
field for encapsulated packets in both the ipv4 and ipv6 gro complete
callbacks.
Fixes: 38720352412a ("gre: Use inner_proto to obtain inner header protocol")
Fixes: 8bce6d7d0d1e ("udp: Generalize skb_udp_segment")
Signed-off-by: Paolo Abeni <[email protected]>
Acked-by: Alexander Duyck <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
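A sketch of the idea for the IPv4 hook (the IPv6 side is symmetric); this is a reading of the fix, not the verbatim af_inet.c hunk:

#include <linux/skbuff.h>
#include <linux/if_ether.h>

static int inet_gro_complete_sketch(struct sk_buff *skb, int nhoff)
{
	/* On the forwarding path nobody set inner_protocol, so the later
	 * tunnel gso code would drop the packet.  Record it here. */
	if (skb->encapsulation) {
		skb_set_inner_network_header(skb, nhoff);
		skb_set_inner_protocol(skb, htons(ETH_P_IP));
	}

	/* ... existing per-protocol gro_complete dispatch ... */
	return 0;
}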
In qed_ll2_start_ooo() the ll2_info variable is uninitialized and then
passed to qed_ll2_acquire_connection() where it is copied into a new
memory space.
This shouldn't cause any issue as long as none of the copied memory is
ever read.
But the potential for a bug being introduced by reading this memory
is real.
Detected by CoverityScan, CID#1399632 ("Uninitialized scalar variable")
This patch set fixes multiple issues such as IOMMU
translation faults when the kernel is booted with the IOMMU enabled
on the host, incorrect MAC ID reads from ACPI tables, and IPv6
UDP packet drops due to checksum validation failures.
Changes from v1:
- As suggested by David Miller, got rid of conditional
calling of DMA map/unmap APIs. Also updated commit message
in 'IOMMU translation faults' patch.
====================
Sunil Goutham [Tue, 7 Mar 2017 12:39:10 +0000 (18:09 +0530)]
net: thunderx: Fix invalid mac addresses for node1 interfaces
When booted with ACPI, random MAC addresses are being
assigned to node1 interfaces due to a mismatch of bgx_id
between the BGX driver and the ACPI tables.
This patch fixes the issue by setting the maximum number of BGX devices
per node based on the platform/SoC instead of a macro. This
change will set the bgx_id appropriately.
Sunil Goutham [Tue, 7 Mar 2017 12:39:09 +0000 (18:09 +0530)]
net: thunderx: Fix LMAC mode debug prints for QSGMII mode
When BGX/LMACs are in QSGMII mode, for some LMACs, mode info is
not being printed. This patch will fix that. With changes already
done to not do any sort of serdes 2 lane mapping config calculation
in kernel driver, we can get rid of this logic.
Sunil Goutham [Tue, 7 Mar 2017 12:39:08 +0000 (18:09 +0530)]
net: thunderx: Fix IOMMU translation faults
ACPI support has been added to the ARM IOMMU driver in the 4.10 kernel,
and that has resulted in VNIC interfaces throwing translation
faults when the kernel is booted with ACPI, as the driver was not using
the DMA API. This patch fixes the issue by using the DMA API, which in turn
will create translation tables when the IOMMU is enabled.
Also, the VNIC doesn't have a separate receive buffer ring per receive
queue, so there is no 1:1 descriptor index matching between CQE_RX
and the index in the buffer ring from where a buffer has been used for
DMA. Unlike other NICs, here it's not possible to maintain DMA
address to virtual address mappings within the driver. This leaves us
no other choice but to use the IOMMU's IOVA address conversion API to
get a buffer's virtual address, which can be given to the network stack
for processing.
Zhu Yanjun [Tue, 7 Mar 2017 07:48:36 +0000 (02:48 -0500)]
rds: ib: add error handle
In the function rds_ib_setup_qp(), error handling is missing. When an
error occurs, it is possible that a memory leak occurs. As such, error
handling is added.
VSR Burru [Tue, 7 Mar 2017 02:45:59 +0000 (18:45 -0800)]
liquidio: improve UDP TX performance
Improve UDP TX performance by:
* reducing the ring size from 2K to 512
* replacing the numerous streaming DMA allocations for info buffers and
gather lists with one large consistent DMA allocation per ring
BQL is not effective here. We reduced the ring size because there is heavy
overhead with dma_map_single every so often. With iommu=on, dma_map_single
in the PF Tx data path was taking a long time (~700 usec) for every ~250
packets. We debugged the intel_iommu code and found that the PF driver is
using too many static IO virtual address mapping entries (for gather list
entries and info buffers): about 100K entries for two PFs, each using 8 rings.
Also, finding an empty entry (in rbtree of device domain's iova mapping in
kernel) during Tx path becomes a bottleneck every so often; the loop to
find the empty entry goes through over 40K iterations; this is too costly
and was the major overhead. Overhead is low when this loop quits quickly.
David Ahern [Mon, 6 Mar 2017 23:57:31 +0000 (15:57 -0800)]
net: ipv6: Remove redundant RTA_OIF in multipath routes
Dinesh reported that RTA_MULTIPATH nexthops are 8-bytes larger with IPv6
than IPv4. The recent refactoring for multipath support in netlink
messages does discriminate between non-multipath, which needs the OIF,
and multipath, which adds an rtnexthop struct for each hop, making the
RTA_OIF attribute redundant. Resolve this by adding a flag to the info
function to skip the oif for multipath.
Fixes: beb1afac518d ("net: ipv6: Add support to dump multipath routes via RTA_MULTIPATH attribute")
Reported-by: Dinesh Dutt <[email protected]>
Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Linus Torvalds [Thu, 9 Mar 2017 20:23:30 +0000 (12:23 -0800)]
Merge tag 'for-linus-4.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fix and cleanup from Juergen Gross:
"This contains one fix for MSIX handling under Xen and a trivial
cleanup patch"
* tag 'for-linus-4.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xenbus: Remove duplicate inclusion of linux/init.h
xen: do not re-use pirq number cached in pci device msi msg data
arch, mm: convert all architectures to use 5level-fixup.h
If an architecture uses 4level-fixup.h we don't need to do anything, as
it includes 5level-fixup.h.
If an architecture uses pgtable-nop*d.h, define __ARCH_USE_5LEVEL_HACK
before including the header. This makes the asm-generic code use
5level-fixup.h.
If an architecture has 4-level paging or folds levels on its own,
include 5level-fixup.h directly.
We are going to introduce <asm-generic/pgtable-nop4d.h> to provide an
abstraction for a properly folded p4d level (as opposed to the
5level-fixup.h hack). The new header will be included from pgtable-nopud.h.
If an architecture uses <asm-generic/nop*d.h>, we cannot use
5level-fixup.h directly to quickly convert the architecture to 5-level
paging, as it would conflict with pgtable-nop4d.h.
With this patch an architecture can define __ARCH_USE_5LEVEL_HACK before
including <asm-generic/nop*d.h> to use 5level-fixup.h.
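A sketch of the resulting pattern in an architecture header (purely illustrative, no particular arch):

/* arch/example/include/asm/pgtable.h (illustrative) */
#ifndef _EXAMPLE_ASM_PGTABLE_H
#define _EXAMPLE_ASM_PGTABLE_H

/* Opt in to the p4d fixup before pulling in the folding headers,
 * otherwise pgtable-nop4d.h would be used and conflict with it. */
#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>

/* ... the architecture's pte/pmd/pud definitions follow ... */

#endif /* _EXAMPLE_ASM_PGTABLE_H */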
We are going to switch core MM to a 5-level paging abstraction.
This is a preparation step which adds <asm-generic/5level-fixup.h>.
As with 4level-fixup.h, the new header allows quickly making all
architectures compatible with 5-level paging in core MM.
In the long run we would like to switch architectures to a properly folded
p4d level by using <asm-generic/pgtable-nop4d.h>, but that requires more
changes to arch-specific code.
Shaohua Li [Tue, 28 Feb 2017 21:00:20 +0000 (13:00 -0800)]
md/raid1/10: fix potential deadlock
Neil Brown pointed out a potential deadlock in raid 10 code with
bio_split/chain. The raid1 code could have the same issue, but recent
barrier rework makes it less likely to happen. The deadlock happens in
below sequence:
1. generic_make_request(bio), this will set current->bio_list
2. raid10_make_request will split bio to bio1 and bio2
3. __make_request(bio1), wait_barrier, add underlying disk bio to
current->bio_list
4. __make_request(bio2), wait_barrier
If raise_barrier happens between 3 & 4, since wait_barrier runs at 3,
raise_barrier waits for IO completion from 3. And since raise_barrier
sets barrier, 4 waits for raise_barrier. But IO from 3 can't be
dispatched because raid10_make_request() hasn't finished yet.
The solution is to adjust the IO ordering. Quotes from Neil:
"
It is much safer to:
	if (need to split) {
		split = bio_split(bio, ...)
		bio_chain(...)
		make_request_fn(split);
		generic_make_request(bio);
	} else
		make_request_fn(mddev, bio);
This way we first process the initial section of the bio (in 'split')
which will queue some requests to the underlying devices. These
requests will be queued in generic_make_request.
Then we queue the remainder of the bio, which will be added to the end
of the generic_make_request queue.
Then we return.
generic_make_request() will pop the lower-level device requests off the
queue and handle them first. Then it will process the remainder
of the original bio once the first section has been fully processed.
"
Note, this only happens in the read path. In the write path, the bio is
flushed to the underlying disks either by blk flush (from schedule) or
offloaded to raid1/10d. It's queued in current->bio_list.
NeilBrown [Tue, 28 Feb 2017 20:31:28 +0000 (07:31 +1100)]
md: don't impose the MD_SB_DISKS limit on arrays without metadata.
These arrays, created with "mdadm --build", don't benefit from a limit.
The default will be used, which is '0' and is interpreted as "don't
impose a limit".
Guoqing Jiang [Fri, 24 Feb 2017 03:15:13 +0000 (11:15 +0800)]
md-cluster: remove useless memset from gather_all_resync_info
This memset is not needed. The lvb is already zeroed because
it was recently allocated by lockres_init, which uses kzalloc(),
and read_resync_info() doesn't need it to be zero anyway.
Shaohua Li [Thu, 23 Feb 2017 20:26:41 +0000 (12:26 -0800)]
md/raid10: submit bio directly to replacement disk
Commit 57c67df ("md/raid10: submit IO from originating thread instead of
md thread") submits bio directly for normal disks but not for replacement
disks. There is no reason we shouldn't do this for replacement disks too.
Guenter Roeck [Thu, 9 Mar 2017 13:39:37 +0000 (15:39 +0200)]
usb: host: xhci-plat: Fix timeout on removal of hot pluggable xhci controllers
Upstream commit 98d74f9ceaef ("xhci: fix 10 second timeout on removal of
PCI hotpluggable xhci controllers") fixes a problem with hot pluggable PCI
xhci controllers which can result in excessive timeouts, to the point where
the system reports a deadlock.
The same problem is seen with hot pluggable xhci controllers using the
xhci-plat driver, such as the driver used for Type-C ports on rk3399.
Similar to hot-pluggable PCI controllers, the driver for this chip
removes the xhci controller from the system when the Type-C cable is
disconnected.
The solution for PCI devices works just as well for non-PCI devices
and avoids the problem.
Peter Chen [Thu, 9 Mar 2017 13:39:36 +0000 (15:39 +0200)]
usb: host: xhci-dbg: HCIVERSION should be a binary number
According to the xHCI spec, HCIVERSION contains a BCD encoding
of the xHCI specification revision number; 0100h corresponds
to xHCI version 1.0. Change "100" to "0x100".
Chunfeng Yun [Thu, 9 Mar 2017 13:39:34 +0000 (15:39 +0200)]
usb: xhci-mtk: check hcc_params after adding primary hcd
hcc_params is set in xhci_gen_setup(), called from usb_add_hcd(),
so check the Maximum Primary Stream Array Size in the hcc_params
register after adding the primary hcd.
Yuantian Tang [Thu, 9 Mar 2017 09:13:29 +0000 (17:13 +0800)]
ahci: qoriq: correct the sata ecc setting error
SATA ECC is controlled by only one bit, bit 24 in big-endian, in the ECC
register. So setting only bit 24 to disable SATA ECC prevents other bits
in the ECC register from being overwritten.
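A sketch of the read-modify-write this implies; the register accessors and the bit name are illustrative, not the exact ahci_qoriq.c code:

#include <linux/bitops.h>
#include <linux/io.h>

#define ECC_DIS_SATA_SKETCH	BIT(24)		/* bit 24, big-endian register */

static void qoriq_disable_sata_ecc_sketch(void __iomem *ecc_reg)
{
	u32 val = ioread32be(ecc_reg);

	val |= ECC_DIS_SATA_SKETCH;	/* touch only this bit ... */
	iowrite32be(val, ecc_reg);	/* ... and preserve the rest */
}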
Wolfram Sang [Thu, 9 Mar 2017 15:41:48 +0000 (16:41 +0100)]
Revert "i2c: copy device properties when using i2c_register_board_info()"
This reverts commit b0c1e95ab44feaad8831f2c06a3473c974003b49. It
contains a flaw, and the next version has more features added, which makes
me want to move it to the next cycle.
Wolfram Sang [Thu, 9 Mar 2017 15:32:17 +0000 (16:32 +0100)]
Revert "i2c: add missing of_node_put in i2c_mux_del_adapters"
This reverts commit 02dbfa5e5583523035f05636c614a0eca77f1aab. I grabbed
the wrong version from the list and will pull the proper one from Peter
Rosin's mux tree.
i2c: exynos5: Avoid transaction timeouts due TRANSFER_DONE_AUTO not set
After commit 7999eecb7e56 ("i2c: exynos5: fix arbitration lost handling"),
some I2C transactions are failing because the TRANSFER_DONE_AUTO field is
not set in the I2C_TRANS_STATUS register, so the i2c->status value is left
at -EINVAL, causing the i2c->msg_complete completion to never be signaled.
For example, this happens when reading the time from an I2C rtc on an
Exynos5800 machine.
The Exynos5422 manual states clearly that most I2C_TRANS_STATUS reg bits
(including TRANSFER_DONE_AUTO) are cleared after the register is read. So
reading has side effects and should only be done if HSI2C_INT_I2C was set.
Fixes: 7999eecb7e56 ("i2c: exynos5: fix arbitration lost handling")
Signed-off-by: Javier Martinez Canillas <[email protected]>
Reviewed-by: Andrzej Hajda <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>
Radim Krčmář [Thu, 9 Mar 2017 14:48:42 +0000 (15:48 +0100)]
Merge tag 'kvm-arm-for-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm
KVM/ARM updates for v4.11-rc2
vgic updates:
- Honour disabling the ITS
- Don't deadlock when deactivating own interrupts via MMIO
- Correctly expose the lack of IRQ/FIQ bypass on GICv3
I/O virtualization:
- Make KVM_CAP_NR_MEMSLOTS big enough for large guests with
many PCIe devices
General bug fixes:
- Gracefully handle exceptions generated with syndromes that
the host doesn't understand
- Properly invalidate TLBs on VHE systems
Radim Krčmář [Tue, 7 Mar 2017 16:51:49 +0000 (17:51 +0100)]
KVM: nVMX: do not warn when MSR bitmap address is not backed
Before trying to do nested_get_page() in nested_vmx_merge_msr_bitmap(),
we have already checked that the MSR bitmap address is valid (4k aligned
and within physical limits). SDM doesn't specify what happens if the
there is no memory mapped at the valid address, but Intel CPUs treat the
situation as if the bitmap was configured to trap all MSRs.
KVM already does that by returning false, and correct handling doesn't
need the guest-triggerable warning that was reported by syzkaller.
(The warning was originally there to catch some possible bugs in nVMX.)
Reported-by: Dmitry Vyukov <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
[Jim Mattson explained the bare metal behavior: "I believe this behavior
would be documented in the chipset data sheet rather than the SDM,
since the chipset returns all 1s for an unclaimed read."]
Signed-off-by: Radim Krčmář <[email protected]>
* pm-cpufreq:
cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy
cpufreq: intel_pstate: Fix intel_pstate_verify_policy()
cpufreq: intel_pstate: Fix global settings in active mode
cpufreq: Add the "cpufreq.off=1" cmdline option
cpufreq: intel_pstate: Avoid triggering cpu_frequency tracepoint unnecessarily
cpufreq: intel_pstate: Fix intel_cpufreq_verify_policy()
cpufreq: intel_pstate: Do not use performance_limits in passive mode
Thomas Gleixner [Thu, 9 Mar 2017 11:06:41 +0000 (12:06 +0100)]
Merge tag 'irq-fixes-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/urgent
Pull irqchip/irqdomain updates for 4.11-rc2 from Marc Zyngier
- irqchip/crossbar: Some type tidying up
- irqchip/gicv3-its: Workaround for a Qualcomm erratum
- irqdomain: Compile fix for systems that don't use CONFIG_IRQ_DOMAIN
crypto: s5p-sss - Fix spinlock recursion on LRW(AES)
Running TCRYPT with LRW compiled causes spinlock recursion:
testing speed of async lrw(aes) (lrw(ecb-aes-s5p)) encryption
tcrypt: test 0 (256 bit key, 16 byte blocks): 19007 operations in 1 seconds (304112 bytes)
tcrypt: test 1 (256 bit key, 64 byte blocks): 15753 operations in 1 seconds (1008192 bytes)
tcrypt: test 2 (256 bit key, 256 byte blocks): 14293 operations in 1 seconds (3659008 bytes)
tcrypt: test 3 (256 bit key, 1024 byte blocks): 11906 operations in 1 seconds (12191744 bytes)
tcrypt: test 4 (256 bit key, 8192 byte blocks):
BUG: spinlock recursion on CPU#1, irq/84-10830000/89
lock: 0xeea99a68, .magic: dead4ead, .owner: irq/84-10830000/89, .owner_cpu: 1
CPU: 1 PID: 89 Comm: irq/84-10830000 Not tainted 4.11.0-rc1-00001-g897ca6d0800d #559
Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[<c010e1ec>] (unwind_backtrace) from [<c010ae1c>] (show_stack+0x10/0x14)
[<c010ae1c>] (show_stack) from [<c03449c0>] (dump_stack+0x78/0x8c)
[<c03449c0>] (dump_stack) from [<c015de68>] (do_raw_spin_lock+0x11c/0x120)
[<c015de68>] (do_raw_spin_lock) from [<c0720110>] (_raw_spin_lock_irqsave+0x20/0x28)
[<c0720110>] (_raw_spin_lock_irqsave) from [<c0572ca0>] (s5p_aes_crypt+0x2c/0xb4)
[<c0572ca0>] (s5p_aes_crypt) from [<bf1d8aa4>] (do_encrypt+0x78/0xb0 [lrw])
[<bf1d8aa4>] (do_encrypt [lrw]) from [<bf1d8b00>] (encrypt_done+0x24/0x54 [lrw])
[<bf1d8b00>] (encrypt_done [lrw]) from [<c05732a0>] (s5p_aes_complete+0x60/0xcc)
[<c05732a0>] (s5p_aes_complete) from [<c0573440>] (s5p_aes_interrupt+0x134/0x1a0)
[<c0573440>] (s5p_aes_interrupt) from [<c01667c4>] (irq_thread_fn+0x1c/0x54)
[<c01667c4>] (irq_thread_fn) from [<c0166a98>] (irq_thread+0x12c/0x1e0)
[<c0166a98>] (irq_thread) from [<c0136a28>] (kthread+0x108/0x138)
[<c0136a28>] (kthread) from [<c0107778>] (ret_from_fork+0x14/0x3c)
The interrupt handling routine was calling req->base.complete() under a
spinlock. In most cases this wasn't fatal, but when combined with some
of the cipher modes (like LRW) this caused recursion - starting a new
encryption (s5p_aes_crypt()) while still holding the spinlock from the
previous round (s5p_aes_complete()).
Besides that, the s5p_aes_interrupt() error handling path could execute
two completions in case of errors for the RX and TX blocks.
Rewrite the interrupt handling routine and the completion by:
1. Splitting the operations on scatterlist copies out of
s5p_aes_complete() into a separate s5p_sg_done(). This still should be
done under the lock.
s5p_aes_complete() now only calls req->base.complete() and it has
to be called outside of the lock.
2. Moving s5p_aes_complete() out of the spinlock critical sections.
In the interrupt service routine s5p_aes_interrupt(), it appeared in a few
places, including error paths inside other functions called from the ISR.
This code was not obvious to read, so simplify it by calling
s5p_aes_complete() only at the ISR level.
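A sketch of the resulting shape, with a minimal stand-in for the driver state (the real s5p_aes_dev struct and helpers differ):

#include <linux/spinlock.h>
#include <linux/crypto.h>

struct s5p_dev_sketch {			/* minimal stand-in */
	spinlock_t lock;
	struct ablkcipher_request *req;
	bool busy;
};

static void s5p_aes_done_sketch(struct s5p_dev_sketch *dev, int err)
{
	struct ablkcipher_request *req;
	unsigned long flags;

	spin_lock_irqsave(&dev->lock, flags);
	req = dev->req;
	/* s5p_sg_done(): scatterlist copy/cleanup stays under the lock */
	dev->busy = false;
	spin_unlock_irqrestore(&dev->lock, flags);

	/* may immediately start the next s5p_aes_crypt(), so it must run
	 * without dev->lock held to avoid the recursion shown above */
	req->base.complete(&req->base, err);
}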
Merge tag 'usb-serial-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial into usb-linus
Johan writes:
USB-serial fixes for v4.11-rc2
Here's a fix for a digi_acceleport regression in -rc1, and some fixes
for long-standing issues in three other drivers, including a
NULL-pointer dereference and a couple of information leaks that could be
triggered by a malicious device.
A recent change claimed to fix an off-by-one error in the OOB-port
completion handler, but instead introduced such an error. This could
specifically lead to modem-status changes going unnoticed, effectively
breaking TIOCMGET.
Note that the offending commit fixes a loop-condition underflow and is
marked for stable, but should not be backported without this fix.
Richard Leitner [Mon, 6 Mar 2017 08:24:21 +0000 (09:24 +0100)]
usb: usb251xb: dt: add unit suffix to oc-delay and power-on-time
Rename oc-delay-* to oc-delay-us and make it expect a time value.
Furthermore, add an -ms suffix to power-on-time. These changes were
suggested by Rob Herring in https://lkml.org/lkml/2017/2/15/1283.
Remove the max_{power,current}_{sp,bp} properties of the usb251xb driver
from devicetree. This is done to simplify the dt bindings as requested
by Rob Herring in https://lkml.org/lkml/2017/2/15/1283. If those
properties are ever needed by somebody they can be enabled again easily.
Tobias Jakobi [Mon, 27 Feb 2017 23:46:58 +0000 (00:46 +0100)]
usb-storage: Add ignore-residue quirk for Initio INIC-3619
This USB-SATA bridge chip is used in a StarTech enclosure for
optical drives.
Without the quirk MakeMKV fails during the key exchange with an
installed BluRay drive:
> Error 'Scsi error - ILLEGAL REQUEST:COPY PROTECTION KEY EXCHANGE FAILURE - KEY NOT ESTABLISHED'
> occurred while issuing SCSI command AD010..080002400 to device 'SG:dev_11:2'
Johan Hovold [Tue, 7 Mar 2017 15:11:04 +0000 (16:11 +0100)]
USB: iowarrior: fix NULL-deref in write
Make sure to verify that we have the required interrupt-out endpoint for
IOWarrior56 devices to avoid dereferencing a NULL-pointer in write
should a malicious device lack such an endpoint.
Johan Hovold [Tue, 7 Mar 2017 15:11:03 +0000 (16:11 +0100)]
USB: iowarrior: fix NULL-deref at probe
Make sure to check for the required interrupt-in endpoint to avoid
dereferencing a NULL-pointer should a malicious device lack such an
endpoint.
Note that a fairly recent change purported to fix this issue, but added
an insufficient test on the number of endpoints only, a test which can
now be removed.
Fixes: 4ec0ef3a8212 ("USB: iowarrior: fix oops with malicious USB descriptors")
Fixes: 946b960d13c1 ("USB: add driver for iowarrior devices.")
Cc: stable <[email protected]> # 2.6.21
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
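A sketch of the kind of probe-time check involved (simplified; the real driver also records the endpoint and additionally checks the interrupt-out one for IOWarrior56):

#include <linux/usb.h>

static int iowarrior_find_int_in_sketch(struct usb_interface *intf)
{
	struct usb_host_interface *alt = intf->cur_altsetting;
	int i;

	for (i = 0; i < alt->desc.bNumEndpoints; i++) {
		struct usb_endpoint_descriptor *ep = &alt->endpoint[i].desc;

		if (usb_endpoint_is_int_in(ep))
			return 0;	/* required endpoint present */
	}

	dev_err(&intf->dev, "no interrupt-in endpoint found\n");
	return -ENODEV;
}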
The driver doesn't have a struct of_device_id table but supported devices
are registered via Device Trees. This works on the assumption that an
I2C device registered via OF will always match a legacy I2C device ID and
that the MODALIAS reported will always be of the form i2c:<device>.
But this could change in the future so the correct approach is to have an
OF device ID table if the devices are registered via OF.
usb: ohci-at91: Do not drop unhandled USB suspend control requests
In patch 2e2aa1bc7eff90ecm, USB suspend and wakeup control requests are
passed to the SFR_OHCIICR register. If a processor does not have such a
register, these hub control requests will be dropped.
If no such SFR register is available, all USB suspend control requests
will now be processed using ohci_hub_control()
(like before patch 2e2aa1bc7eff90ecm).
Tested on an Atmel AT91SAM9G20 with an on-board TI TUSB2046B hub chip.
If the last USB device is unplugged from the USB hub, the hub goes to
sleep and will not wake up when a USB device is inserted.
powerpc/powernv/ioda2: Update iommu table base on ownership change
On the POWERNV platform, in order to do DMA via the IOMMU (i.e. 32bit DMA in
our case), a device needs an iommu_table pointer set via
set_iommu_table_base().
The codeflow is:
- pnv_pci_ioda2_setup_dma_pe()
- pnv_pci_ioda2_setup_default_config()
- pnv_ioda_setup_bus_dma() [1]
pnv_pci_ioda2_setup_dma_pe() creates IOMMU groups,
pnv_pci_ioda2_setup_default_config() does default DMA setup,
pnv_ioda_setup_bus_dma() takes a bus PE (on IODA2, all physical function
PEs act as bus PEs except NPU), walks through all underlying buses and
devices, adds all devices to an IOMMU group and sets iommu_table.
On IODA2, when VFIO is used, it takes ownership over a PE which means it
removes all tables and creates new ones (with a possibility of sharing
them among PEs). So when the ownership is returned from VFIO to
the kernel, the iommu_table pointer written to a device at [1] is
stale and needs an update.
This adds an "add_to_group" parameter to pnv_ioda_setup_bus_dma()
(in fact re-adds as it used to be there a while ago for different
reasons) to tell the helper if a device needs to be added to
an IOMMU group with an iommu_table update or just the latter.
This calls pnv_ioda_setup_bus_dma(..., false) from
pnv_ioda2_release_ownership() so when the ownership is restored,
32bit DMA can work again for a device. This does the same thing
on obtaining ownership as the iommu_table point is stale at this point
anyway and it is safer to have NULL there.
We did not hit this earlier as all tested devices in recent years were
only using 64bit DMA; the rare exception for this is MPT3 SAS adapter
which uses both 32bit and 64bit DMA access and it has not been tested
with VFIO much.
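A sketch of the helper with the new flag; the prototype, the table lookup and the flag handling only approximate the powernv code:

static void pnv_ioda_setup_bus_dma_sketch(struct pnv_ioda_pe *pe,
					  struct pci_bus *bus,
					  bool add_to_group)
{
	struct pci_dev *dev;

	list_for_each_entry(dev, &bus->devices, bus_list) {
		if (add_to_group)
			iommu_add_device(&dev->dev);

		/* always refresh the (possibly stale) iommu_table pointer */
		set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);

		if (dev->subordinate)
			pnv_ioda_setup_bus_dma_sketch(pe, dev->subordinate,
						      add_to_group);
	}
}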
Linu Cherian [Wed, 8 Mar 2017 06:08:35 +0000 (11:38 +0530)]
KVM: arm64: Increase number of user memslots to 512
Having only 32 memslots is a real constraint for the maximum
number of PCI devices that can be assigned to a single guest.
Assuming each PCI device/virtual function has two memory BAR
regions, we could assign only 15 devices/virtual functions to a
guest.
Hence increase KVM_USER_MEM_SLOTS to 512 as done in other archs like
powerpc.
Merge tag 'fixes-for-v4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb into usb-linus
Felipe writes:
usb: fixes for v4.11-rc2
dwc3 got a few fixes this time around:
Fixed an old bug where a broken endpoint descriptor passed in from
userspace through f_fs could prevent dwc3 from working, because when
calculating max bursts we could overwrite the top 16 bits of a register.
Also fixed a bug on dwc3's ep_dequeue implementation which wasn't
properly incrementing our TRB dequeue pointer.
dwc3 on omap got two fixes: one for system suspend/resume and another
added a missing break statement on dwc3_omap_set_mailbox().
Apart from these, we have a set of smaller fixes, including a memory leak
in configfs, a build warning fix in the atmel udc, and a revert of a broken
patch that went in during the merge window.
Chris Wilson [Thu, 2 Feb 2017 20:47:41 +0000 (20:47 +0000)]
drm/i915: Drain the freed state from the tail of the next commit
If we have any residual freed atomic state from earlier commits, flush
the freed list after performing the current modeset. This prevents the
freed list from ever-growing if userspace manages to starve the kernel
threads (i.e. we are never able to run our free state worker and
eventually the system may even oom).
Ville Syrjälä [Tue, 7 Mar 2017 20:54:19 +0000 (22:54 +0200)]
drm/i915: Nuke debug messages from the pipe update critical section
printks are slow so we should not be doing them from the vblank evade
critical section. These could explain why we sometimes seem to
blow past our 100 usec deadline.
The problem has been there ever since commit bfd16b2a23dc ("drm/i915:
Make updating pipe without modeset atomic.") but it may not have
been readily visible until commit e1edbd44e23b ("drm/i915: Complain
if we take too long under vblank evasion.") increased our chances
of noticing it.
Chris Wilson [Tue, 7 Mar 2017 12:03:38 +0000 (12:03 +0000)]
drm/i915: Use pagecache write to prepopulate shmemfs from pwrite-ioctl
Before we instantiate/pin the backing store for our use, we
can prepopulate the shmemfs filp efficiently using a write into the
pagecache. We avoid the penalty of instantiating all the pages, important
if the user is just writing to a few and never uses the object on the GPU,
and using a direct write into shmemfs allows it to avoid the cost of
retrieving a page (mostly the clear-before-use, but in theory we could
curtail swapin) before it is overwritten.
This can be extended later to provide additional specialisation for
other backends (other than shmemfs). For now it provides a defense
against very large write-only allocations from exhausting all of system
memory.
Chris Wilson [Tue, 7 Mar 2017 13:20:31 +0000 (13:20 +0000)]
drm/i915: Store a permanent error in obj->mm.pages
Once the object has been truncated, it is unrecoverable. To facilitate
detection of this state store the error in obj->mm.pages.
This is required for the next patch which should be applied to v4.10
(via stable), so we also need to mark this patch for backporting. In
that regard, let's consider this to be a fix/improvement too.
v2: Avoid dereferencing the ERR_PTR when freeing the object.