Daniel Borkmann [Wed, 22 Aug 2018 21:49:37 +0000 (23:49 +0200)]
bpf: use per htab salt for bucket hash
All BPF hash and LRU maps currently have a known and global seed
we feed into jhash() which is 0. This is suboptimal, thus fix it
by generating a random seed upon hashtab setup time which we can
later on feed into jhash() on lookup, update and deletions.
Bruce Allan [Thu, 9 Aug 2018 13:28:52 +0000 (06:28 -0700)]
ice: Remove unnecessary node owner check
There is already a check for owner == ICE_SCHED_NODE_OWNER_LAN at the
beginning of ice_sched_update_vsi_child_nodes. Remove the additional
check to address the static analysis tool smatch issue "warn: we tested
'owner' before and it was 'false'".
1) Fix "odd binop '0x0 & 0xc'" when performing the bitwise-and with a
constant value of zero (ICE_AQC_GSET_RSS_LUT_TABLE_SIZE_128_FLAG).
Remove a similar bitwise-and with 0 in ice_add_marker_act() and use the
right mask ICE_LG_ACT_GENERIC_OFFSET_M in the expression.
2) Fix a similar issue "odd binop '0x0 & 0x1800' in ice_req_irq_msix_misc.
3) Fix "odd binop '0x380000 & 0x7fff8'" in ice_add_marker_act(). Also, use
a new define ICE_LG_ACT_GENERIC_OFF_RX_DESC_PROF_IDX instead of magic
number '7'.
4) Fix warn: odd binop '0x0 & 0x18' in ice_set_dflt_vsi_ctx() by removing
unnecessary logic to explicitly unset bits 3 and 4 in port_vlan_bits.
These bits are unset already by the memset on ctxt->info.
Jens Axboe [Thu, 23 Aug 2018 15:34:46 +0000 (09:34 -0600)]
blk-wbt: don't maintain inflight counts if disabled
A previous commit removed the ability to have per-rq flags. We used
those flags to maintain inflight counts. Since we don't have those
anymore, we have to always maintain inflight counts, even if wbt is
disabled. This is clearly suboptimal.
Add a queue quiesce around changing the wbt latency settings from sysfs
to work around this. With that, we can reliably put the enabled check in
our bio_to_wbt_flags(), since we know the WBT_TRACKED flag will be
consistent for the lifetime of the request.
Fixes: c1c80384c8f ("block: remove external dependency on wbt_flags") Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
powerpc/mce: Fix SLB rebolting during MCE recovery path.
The commit e7e81847478 ("powerpc/64s: move machine check SLB flushing
to mm/slb.c") introduced a bug in reloading bolted SLB entries. Unused
bolted entries are stored with .esid=0 in the slb_shadow area, and
that value is now used directly as the RB input to slbmte, which means
the RB[52:63] index field is set to 0, which causes SLB entry 0 to be
cleared.
Fix this by storing the index bits in the unused bolted entries, which
directs the slbmte to the right place.
The SLB shadow area is also used by the hypervisor, but PAPR is okay
with that, from LoPAPR v1.1, 14.11.1.3 SLB Shadow Buffer:
Note: SLB is filled sequentially starting at index 0
from the shadow buffer ignoring the contents of
RB field bits 52-63
Fixes: e7e81847478b ("powerpc/64s: move machine check SLB flushing to mm/slb.c") Signed-off-by: Mahesh Salgaonkar <[email protected]> Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
Paul Mackerras [Thu, 23 Aug 2018 00:08:58 +0000 (10:08 +1000)]
KVM: PPC: Book3S: Fix guest DMA when guest partially backed by THP pages
Commit 76fa4975f3ed ("KVM: PPC: Check if IOMMU page is contained in
the pinned physical page", 2018-07-17) added some checks to ensure
that guest DMA mappings don't attempt to map more than the guest is
entitled to access. However, errors in the logic mean that legitimate
guest requests to map pages for DMA are being denied in some
situations. Specifically, if the first page of the range passed to
mm_iommu_get() is mapped with a normal page, and subsequent pages are
mapped with transparent huge pages, we end up with mem->pageshift ==
0. That means that the page size checks in mm_iommu_ua_to_hpa() and
mm_iommu_up_to_hpa_rm() will always fail for every page in that
region, and thus the guest can never map any memory in that region for
DMA, typically leading to a flood of error messages like this:
(a) use of 'ua' not 'ua + (i << PAGE_SHIFT)' in the find_linux_pte()
call (meaning that find_linux_pte() returns the pte for the
first address in the range, not the address we are currently up
to);
(b) use of 'pageshift' as the variable to receive the hugepage shift
returned by find_linux_pte() - for a normal page this gets set
to 0, leading to us setting mem->pageshift to 0 when we conclude
that the pte returned by find_linux_pte() didn't match the page
we were looking at;
(c) comparing 'compshift', which is a page order, i.e. log base 2 of
the number of pages, with 'pageshift', which is a log base 2 of
the number of bytes.
To fix these problems, this patch introduces 'cur_ua' to hold the
current user address and uses that in the find_linux_pte() call;
introduces 'pteshift' to hold the hugepage shift found by
find_linux_pte(); and compares 'pteshift' with 'compshift +
PAGE_SHIFT' rather than 'compshift'.
The patch also moves the local_irq_restore to the point after the PTE
pointer returned by find_linux_pte() has been dereferenced because
otherwise the PTE could change underneath us, and adds a check to
avoid doing the find_linux_pte() call once mem->pageshift has been
reduced to PAGE_SHIFT, as an optimization.
Fixes: 76fa4975f3ed ("KVM: PPC: Check if IOMMU page is contained in the pinned physical page") Cc: [email protected] # v4.12+ Signed-off-by: Paul Mackerras <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
powerpc/mm/radix: Only need the Nest MMU workaround for R -> RW transition
The Nest MMU workaround is only needed for RW upgrades. Avoid doing
that for other PTE updates.
We also avoid clearing the PTE while marking it invalid. This is
because other page table walkers will find this PTE none and can
result in unexpected behaviour due to that. Instead we clear
_PAGE_PRESENT and set the software PTE bit _PAGE_INVALID.
pte_present() is already updated to check for both bits. This makes
sure page table walkers will find the PTE present and things like
pte_pfn(pte) returns the right value.
Based on an original patch from Benjamin Herrenschmidt <[email protected]>
Arnd Bergmann [Tue, 21 Aug 2018 20:37:33 +0000 (22:37 +0200)]
ACPI: fix menuconfig presentation of ACPI submenu
My fix for a recursive Kconfig dependency caused another issue where the
ACPI specific options end up in the top-level menu in 'menuconfig'. This
was an unintended side-effect of having a silent option between
'menuconfig ACPI' and 'if ACPI'.
Moving the ARCH_SUPPORTS_ACPI symbol ahead of the ACPI menu solves that
problem and restores the previous presentation.
I have encountered an interrupt storm during the eMMC chip probing (and
the chip finally didn't get detected). It turned out that U-Boot left
the SDHI DMA interrupts enabled while the Linux driver didn't use those.
Masking those interrupts in renesas_sdhi_internal_dmac_request_dma() gets
rid of both issues...
Huazhong Tan [Thu, 23 Aug 2018 03:37:15 +0000 (11:37 +0800)]
net: hns3: fix page_offset overflow when CONFIG_ARM64_64K_PAGES
When enable the config item "CONFIG_ARM64_64K_PAGES", the size of
PAGE_SIZE is 65536(64K). But the type of page_offset is u16, it will
overflow. So change it to u32, when "CONFIG_ARM64_64K_PAGES" enabled.
Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Huazhong Tan <[email protected]> Signed-off-by: Salil Mehta <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Hangbin Liu [Thu, 23 Aug 2018 03:31:37 +0000 (11:31 +0800)]
net/ipv6: init ip6 anycast rt->dst.input as ip6_input
Commit 6edb3c96a5f02 ("net/ipv6: Defer initialization of dst to data path")
forgot to handle anycast route and init anycast rt->dst.input to ip6_forward.
Fix it by setting anycast rt->dst.input back to ip6_input.
Fixes: 6edb3c96a5f02 ("net/ipv6: Defer initialization of dst to data path") Signed-off-by: Hangbin Liu <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Huazhong Tan [Thu, 23 Aug 2018 03:10:12 +0000 (11:10 +0800)]
net: hns: fix skb->truesize underestimation
skb->truesize is not meant to be tracking amount of used bytes in a skb,
but amount of reserved/consumed bytes in memory.
For instance, if we use a single byte in last page fragment, we have to
account the full size of the fragment.
So skb_add_rx_frag needs to calculate the length of the entire buffer into
turesize.
Fixes: 9cbe9fd5214e ("net: hns: optimize XGE capability by reducing cpu usage") Signed-off-by: Huazhong tan <[email protected]> Signed-off-by: Salil Mehta <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Huazhong Tan [Thu, 23 Aug 2018 03:10:10 +0000 (11:10 +0800)]
net: hns: fix length and page_offset overflow when CONFIG_ARM64_64K_PAGES
When enable the config item "CONFIG_ARM64_64K_PAGES", the size of PAGE_SIZE
is 65536(64K). But the type of length and page_offset are u16, they will
overflow. So change them to u32.
Fixes: 6fe6611ff275 ("net: add Hisilicon Network Subsystem hnae framework support") Signed-off-by: Huazhong Tan <[email protected]> Signed-off-by: Salil Mehta <[email protected]> Signed-off-by: David S. Miller <[email protected]>
David S. Miller [Thu, 23 Aug 2018 04:45:32 +0000 (21:45 -0700)]
Merge branch 'tcp_bbr-PROBE_RTT-minor-bug-fixes'
Kevin Yang says:
====================
tcp_bbr: PROBE_RTT minor bug fixes
This series includes two minor bug fixes for the TCP BBR PROBE_RTT
mechanism, and one preparatory patch:
(1) A preparatory patch to reorganize the PROBE_RTT logic by refactoring
(into its own function) the code to exit PROBE_RTT, since the next
patch will be using that code in a new context.
(2) Fix: When BBR restarts from idle and if BBR is in PROBE_RTT mode,
BBR should check if it's time to exit PROBE_RTT. If yes, then BBR
should exit PROBE_RTT mode and restore the cwnd to its full value.
(3) Fix: Apply the PROBE_RTT cwnd cap even if the count of fully-ACKed
packets is 0.
====================
Kevin Yang [Wed, 22 Aug 2018 21:43:16 +0000 (17:43 -0400)]
tcp_bbr: apply PROBE_RTT cwnd cap even if acked==0
This commit fixes a corner case where TCP BBR would enter PROBE_RTT
mode but not reduce its cwnd. If a TCP receiver ACKed less than one
full segment, the number of delivered/acked packets was 0, so that
bbr_set_cwnd() would short-circuit and exit early, without cutting
cwnd to the value we want for PROBE_RTT.
The fix is to instead make sure that even when 0 full packets are
ACKed, we do apply all the appropriate caps, including the cap that
applies in PROBE_RTT mode.
Kevin Yang [Wed, 22 Aug 2018 21:43:15 +0000 (17:43 -0400)]
tcp_bbr: in restart from idle, see if we should exit PROBE_RTT
This patch fix the case where BBR does not exit PROBE_RTT mode when
it restarts from idle. When BBR restarts from idle and if BBR is in
PROBE_RTT mode, BBR should check if it's time to exit PROBE_RTT. If
yes, then BBR should exit PROBE_RTT mode and restore the cwnd to its
full value.
Kevin Yang [Wed, 22 Aug 2018 21:43:14 +0000 (17:43 -0400)]
tcp_bbr: add bbr_check_probe_rtt_done() helper
This patch add a helper function bbr_check_probe_rtt_done() to
1. check the condition to see if bbr should exit probe_rtt mode;
2. process the logic of exiting probe_rtt mode.
Eric Dumazet [Wed, 22 Aug 2018 20:30:45 +0000 (13:30 -0700)]
ipv4: tcp: send zero IPID for RST and ACK sent in SYN-RECV and TIME-WAIT state
tcp uses per-cpu (and per namespace) sockets (net->ipv4.tcp_sk) internally
to send some control packets.
1) RST packets, through tcp_v4_send_reset()
2) ACK packets in SYN-RECV and TIME-WAIT state, through tcp_v4_send_ack()
These packets assert IP_DF, and also use the hashed IP ident generator
to provide an IPv4 ID number.
Geoff Alexander reported this could be used to build off-path attacks.
These packets should not be fragmented, since their size is smaller than
IPV4_MIN_MTU. Only some tunneled paths could eventually have to fragment,
regardless of inner IPID.
We really can use zero IPID, to address the flaw, and as a bonus,
avoid a couple of atomic operations in ip_idents_reserve()
Arnd Bergmann [Wed, 22 Aug 2018 15:25:44 +0000 (17:25 +0200)]
net_sched: fix unused variable warning in stmmac
The new tcf_exts_for_each_action() macro doesn't reference its
arguments when CONFIG_NET_CLS_ACT is disabled, which leads to
a harmless warning in at least one driver:
drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c: In function 'tc_fill_actions':
drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c:64:6: error: unused variable 'i' [-Werror=unused-variable]
Adding a cast to void lets us avoid this kind of warning.
To be on the safe side, do it for all three arguments, not
just the one that caused the warning.
Fixes: 244cd96adb5f ("net_sched: remove list_head from tc_action") Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
sch_cake: Fix TC filter flow override and expand it to hosts as well
The TC filter flow mapping override completely skipped the call to
cake_hash(); however that meant that the internal state was not being
updated, which ultimately leads to deadlocks in some configurations. Fix
that by passing the overridden flow ID into cake_hash() instead so it can
react appropriately.
In addition, the major number of the class ID can now be set to override
the host mapping in host isolation mode. If both host and flow are
overridden (or if the respective modes are disabled), flow dissection and
hashing will be skipped entirely; otherwise, the hashing will be kept for
the portions that are not set by the filter.
net/ncsi: Fixup .dumpit message flags and ID check in Netlink handler
The ncsi_pkg_info_all_nl() .dumpit handler is missing the NLM_F_MULTI
flag, causing additional package information after the first to be lost.
Also fixup a sanity check in ncsi_write_package_info() to reject out of
range package IDs.
powerpc/mm/books3s: Add new pte bit to mark pte temporarily invalid.
When splitting a huge pmd pte, we need to mark the pmd entry invalid. We
can do that by clearing _PAGE_PRESENT bit. But then that will be taken as a
swap pte. In order to differentiate between the two use a software pte bit
when invalidating.
For regular pte, due to bd5050e38aec ("powerpc/mm/radix: Change pte relax
sequence to handle nest MMU hang") we need to mark the pte entry invalid when
relaxing access permission. Instead of marking pte_none which can result in
different page table walk routines possibly skipping this pte entry, invalidate
it but still keep it marked present.
Christophe Leroy [Tue, 21 Aug 2018 13:03:23 +0000 (13:03 +0000)]
powerpc/nohash: fix pte_access_permitted()
Commit 5769beaf180a8 ("powerpc/mm: Add proper pte access check helper
for other platforms") replaced generic pte_access_permitted() by an
arch specific one.
The generic one is defined as
(pte_present(pte) && (!(write) || pte_write(pte)))
The arch specific one is open coded checking that _PAGE_USER and
_PAGE_WRITE (_PAGE_RW) flags are set, but lacking to check that
_PAGE_RO and _PAGE_PRIVILEGED are unset, leading to a useless test
on targets like the 8xx which defines _PAGE_RW and _PAGE_USER as 0.
Commit 5fa5b16be5b31 ("powerpc/mm/hugetlb: Use pte_access_permitted
for hugetlb access check") replaced some tests performed with
pte helpers by a call to pte_access_permitted(), leading to the same
issue.
This patch rewrites powerpc/nohash pte_access_permitted()
using pte helpers.
Fixes: 5769beaf180a8 ("powerpc/mm: Add proper pte access check helper for other platforms") Fixes: 5fa5b16be5b31 ("powerpc/mm/hugetlb: Use pte_access_permitted for hugetlb access check") Cc: [email protected] # v4.15+ Signed-off-by: Christophe Leroy <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
Dave Airlie [Thu, 23 Aug 2018 01:24:45 +0000 (11:24 +1000)]
Merge branch 'drm-next-4.19' of git://people.freedesktop.org/~agd5f/linux into drm-next
Fixes for 4.19:
- Fix build when KCOV is enabled
- Misc display fixes
- A couple of SR-IOV fixes
- Fence fixes for eviction handling for KFD
- Misc other fixes
Nick Desaulniers [Wed, 22 Aug 2018 23:37:24 +0000 (16:37 -0700)]
include/linux/compiler*.h: make compiler-*.h mutually exclusive
Commit cafa0010cd51 ("Raise the minimum required gcc version to 4.6")
recently exposed a brittle part of the build for supporting non-gcc
compilers.
Both Clang and ICC define __GNUC__, __GNUC_MINOR__, and
__GNUC_PATCHLEVEL__ for quick compatibility with code bases that haven't
added compiler specific checks for __clang__ or __INTEL_COMPILER.
This is brittle, as they happened to get compatibility by posing as a
certain version of GCC. This broke when upgrading the minimal version
of GCC required to build the kernel, to a version above what ICC and
Clang claim to be.
Rather than always including compiler-gcc.h then undefining or
redefining macros in compiler-intel.h or compiler-clang.h, let's
separate out the compiler specific macro definitions into mutually
exclusive headers, do more proper compiler detection, and keep shared
definitions in compiler_types.h.
Chuck Lever [Thu, 16 Aug 2018 16:06:04 +0000 (12:06 -0400)]
nfsd: Use correct credential for NFSv4.0 callback with GSS
I've had trouble when operating a multi-homed Linux NFS server with
Kerberos using NFSv4.0. Lately, I've seen my clients reporting
this (and then hanging):
The client-side commit f11b2a1cfbf5 ("nfs4: copy acceptor name from
context to nfs_client") appears to be related, but I suspect this
problem has been going on for some time before that.
RFC 7530 Section 3.3.3 says:
> For Kerberos V5, nfs/hostname would be a server principal in the
> Kerberos Key Distribution Center database. This is the same
> principal the client acquired a GSS-API context for when it issued
> the SETCLIENTID operation ...
In other words, an NFSv4.0 client expects that the server will use
the same GSS principal for callback that the client used to
establish its lease. For example, if the client used the service
principal "[email protected]" to establish its lease, the server
is required to use "[email protected]" when performing NFSv4.0
callback operations.
The Linux NFS server currently does not. It uses a common service
principal for all callback connections. Sometimes this works as
expected, and other times -- for example, when the server is
accessible via multiple hostnames -- it won't work at all.
This patch scrapes the target name from the client credential,
and uses that for the NFSv4.0 callback credential. That should
be correct much more often.
Chuck Lever [Thu, 16 Aug 2018 16:05:59 +0000 (12:05 -0400)]
sunrpc: Extract target name into svc_cred
NFSv4.0 callback needs to know the GSS target name the client used
when it established its lease. That information is available from
the GSS context created by gssproxy. Make it available in each
svc_cred.
Note this will also give us access to the real target service
principal name (which is typically "nfs", but spec does not require
that).
Chuck Lever [Thu, 16 Aug 2018 16:05:54 +0000 (12:05 -0400)]
sunrpc: Enable the kernel to specify the hostname part of service principals
A multi-homed NFS server may have more than one "nfs" key in its
keytab. Enable the kernel to pick the key it wants as a machine
credential when establishing a GSS context.
This is useful for GSS-protected NFSv4.0 callbacks, which are
required by RFC 7530 S3.3.3 to use the same principal as the service
principal the client used when establishing its lease.
A complementary modification to rpc.gssd is required to fully enable
this feature.
This is BUG_ON(!virt_addr_valid(buf)) triggered by using a stack
allocated buffer with a scatterlist. Convert the buffer for
rc4salt to be dynamically allocated instead.
drm/amdgpu: Fix page fault and kasan warning on pci device remove.
Problem:
When executing echo 1 > /sys/class/drm/card0/device/remove kasan warning
as bellow and page fault happen because adev->gart.pages already freed by the
time amdgpu_gart_unbind is called.
Fix:
Split gmc_v{6,7,8,9}_0_gart_fini to postpone amdgpu_gart_fini to after
memory managers are shut down since gart unbind happens
as part of this procedure
Linus Torvalds [Wed, 22 Aug 2018 21:14:15 +0000 (14:14 -0700)]
Merge tag 'platform-drivers-x86-v4.19-1' of git://git.infradead.org/linux-platform-drivers-x86
Pull x86 platform driver updates from Andy Shevchenko:
- The driver for Silead touchscreen configurations has been renamed
from silead_dmi to touchscreen_dmi since it starts supporting other
touchscreens which require some DMI quirks
It also gets expanded to cover cases for Chuwi Vi10, ONDA V891W,
Connect Tablet 9, Onda V820w, and Cube KNote i1101 tablets.
- Another bunch of changes is related to Mellanox platform code to
allow user space to communicate with Mellanox for system control and
monitoring purposes. The driver notifies user on hotplug device
signal receiving.
- ASUS WMI drivers recognize lid flip action on UX360, and correctly
toggles airplane mode LED. In addition the keyboard backlight toggle
gets support.
- ThinkPad ACPI driver enables support for calculator key (on at least
P52). It also has been fixed to support three characters model
designators, which are used for modern laptops. Earlier the battery,
marked as BAT1, on ThinkPad laptops has not been configured properly,
which is fixed. On the opposite the multi-battery configurations now
probed correctly.
- Dell SMBIOS driver starts working on some Dell servers which do not
support token interface. The regression with backlight detection has
also been fixed. In order to support dock mode on some laptops, Intel
virtual button driver has been fixed. The last but not least is the
fix to Intel HID driver due to changes in Dell systems that prevented
to use power button.
* tag 'platform-drivers-x86-v4.19-1' of git://git.infradead.org/linux-platform-drivers-x86: (47 commits)
platform/x86: acer-wmi: Silence "unsupported" message a bit
platform/x86: intel_punit_ipc: fix build errors
platform/x86: ideapad: Add Y520-15IKBM and Y720-15IKBM to no_hw_rfkill
platform/x86: asus-nb-wmi: Add keymap entry for lid flip action on UX360
platform/x86: acer-wmi: refactor function has_cap
platform/x86: thinkpad_acpi: Fix multi-battery bug
platform/x86: thinkpad_acpi: extend battery quirk coverage
platform/x86: touchscreen_dmi: Add info for the Cube KNote i1101 tablet
platform/x86: mlx-platform: Fix copy-paste error in mlxplat_init()
platform/x86: mlx-platform: Remove unused define
platform/x86: mlx-platform: Change mlxreg-io configuration for MSN274x systems
Documentation/ABI: Add new attribute for mlxreg-io sysfs interfaces
platform/x86: mlx-platform: Allow mlxreg-io driver activation for more systems
platform/x86: mlx-platform: Add ASIC hotplug device configuration
platform/mellanox: mlxreg-hotplug: Add hotplug hwmon uevent notification
platform/mellanox: mlxreg-hotplug: Improve mechanism of ASIC health discovery
platform/x86: mlx-platform: Add mlxreg-fan platform driver activation
platform/x86: dell-laptop: Fix backlight detection
platform/x86: toshiba_acpi: Fix defined but not used build warnings
platform/x86: thinkpad_acpi: Support battery quirk
...
Jens Axboe [Mon, 20 Aug 2018 19:20:50 +0000 (13:20 -0600)]
blk-wbt: use wq_has_sleeper() for wq active check
We need the memory barrier before checking the list head,
use the appropriate helper for this. The matching queue
side memory barrier is provided by set_current_state().
Shan Hai [Wed, 22 Aug 2018 18:02:56 +0000 (02:02 +0800)]
bcache: release dc->writeback_lock properly in bch_writeback_thread()
The writeback thread would exit with a lock held when the cache device
is detached via sysfs interface, fix it by releasing the held lock
before exiting the while-loop.
Emily Deng [Wed, 22 Aug 2018 12:18:25 +0000 (20:18 +0800)]
amdgpu: fix multi-process hang issue
SWDEV-146499: hang during multi vulkan process testing
cause:
the second frame's PREAMBLE_IB have clear-state
and LOAD actions, those actions ruin the pipeline
that is still doing process in the previous frame's
work-load IB.
fix:
need insert pipeline sync if have context switch for
SRIOV (because only SRIOV will report PREEMPTION flag
to UMD)
Linus Torvalds [Wed, 22 Aug 2018 21:04:41 +0000 (14:04 -0700)]
Merge tag 'xtensa-20180820' of git://github.com/jcmvbkbc/linux-xtensa
Pull Xtensa updates from Max Filippov:
- switch xtensa arch to the generic noncoherent direct mapping
operations
- add support for DMA_ATTR_NO_KERNEL_MAPPING attribute
- clean up users of platform/hardware.h in generic Xtensa code
- fix assembly cache maintenance code for long cache lines
- rework noMMU cache attributes initialization
- add big-endian HiFi2 test_kc705_be CPU variant
* tag 'xtensa-20180820' of git://github.com/jcmvbkbc/linux-xtensa:
xtensa: add test_kc705_be variant
xtensa: clean up boot-elf/bootstrap.S
xtensa: make bootparam parsing optional
xtensa: drop variant IRQ support
xtensa: drop unneeded platform/hardware.h headers
xtensa: move PLATFORM_NR_IRQS to Kconfig
xtensa: rework {CONFIG,PLATFORM}_DEFAULT_MEM_START
xtensa: drop unused {CONFIG,PLATFORM}_DEFAULT_MEM_SIZE
xtensa: rework noMMU cache attributes initialization
xtensa: increase ranges in ___invalidate_{i,d}cache_all
xtensa: limit offsets in __loop_cache_{all,page}
xtensa: platform-specific handling of coherent memory
xtensa: support DMA_ATTR_NO_KERNEL_MAPPING attribute
xtensa: use generic dma_noncoherent_ops
Linus Torvalds [Wed, 22 Aug 2018 20:52:44 +0000 (13:52 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull second set of KVM updates from Paolo Bonzini:
"ARM:
- Support for Group0 interrupts in guests
- Cache management optimizations for ARMv8.4 systems
- Userspace interface for RAS
- Fault path optimization
- Emulated physical timer fixes
- Random cleanups
x86:
- fixes for L1TF
- a new test case
- non-support for SGX (inject the right exception in the guest)
- fix lockdep false positive"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (49 commits)
KVM: VMX: fixes for vmentry_l1d_flush module parameter
kvm: selftest: add dirty logging test
kvm: selftest: pass in extra memory when create vm
kvm: selftest: include the tools headers
kvm: selftest: unify the guest port macros
tools: introduce test_and_clear_bit
KVM: x86: SVM: Call x86_spec_ctrl_set_guest/host() with interrupts disabled
KVM: vmx: Inject #UD for SGX ENCLS instruction in guest
KVM: vmx: Add defines for SGX ENCLS exiting
x86/kvm/vmx: Fix coding style in vmx_setup_l1d_flush()
x86: kvm: avoid unused variable warning
KVM: Documentation: rename the capability of KVM_CAP_ARM_SET_SERROR_ESR
KVM: arm/arm64: Skip updating PTE entry if no change
KVM: arm/arm64: Skip updating PMD entry if no change
KVM: arm: Use true and false for boolean values
KVM: arm/arm64: vgic: Do not use spin_lock_irqsave/restore with irq disabled
KVM: arm/arm64: vgic: Move DEBUG_SPINLOCK_BUG_ON to vgic.h
KVM: arm: vgic-v3: Add support for ICC_SGI0R and ICC_ASGI1R accesses
KVM: arm64: vgic-v3: Add support for ICC_SGI0R_EL1 and ICC_ASGI1R_EL1 accesses
KVM: arm/arm64: vgic-v3: Add core support for Group0 SGIs
...
* tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block: (31 commits)
block/DAC960.c: make some arrays static const, shrinks object size
blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
blk-mq: init hctx sched after update ctx and hctx mapping
block: remove duplicate initialization
tracing/blktrace: Fix to allow setting same value
pktcdvd: fix setting of 'ret' error return for a few cases
block: change return type to bool
block, bfq: return nbytes and not zero from struct cftype .write() method
block, bfq: improve code of bfq_bfqq_charge_time
block, bfq: reduce write overcharge
block, bfq: always update the budget of an entity when needed
block, bfq: readd missing reset of parent-entity service
blk-wbt: fix IO hang in wbt_wait()
block: don't warn for flush on read-only device
bcache: add the missing comments for smp_mb()/smp_wmb()
bcache: remove unnecessary space before ioctl function pointer arguments
bcache: add missing SPDX header
bcache: move open brace at end of function definitions to next line
bcache: add static const prefix to char * array declarations
bcache: fix code comments style
...
Linus Torvalds [Wed, 22 Aug 2018 20:29:39 +0000 (13:29 -0700)]
Merge tag 'f2fs-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've tuned f2fs to improve general performance by
serializing block allocation and enhancing discard flows like fstrim
which avoids user IO contention. And we've added fsync_mode=nobarrier
which gives an option to user where it skips issuing cache_flush
commands to underlying flash storage. And there are many bug fixes
related to fuzzed images, revoked atomic writes, quota ops, and minor
direct IO.
Enhancements:
- add fsync_mode=nobarrier which bypasses cache_flush command
- enhance the discarding flow which avoids user IOs and issues in
LBA order
- readahead some encrypted blocks during GC
- enable in-memory inode checksum to verify the blocks if
F2FS_CHECK_FS is set
- enhance nat_bits behavior
- set -o discard by default
- set REQ_RAHEAD to bio in ->readpages
Bug fixes:
- fix a corner case to corrupt atomic_writes revoking flow
- revisit i_gc_rwsem to fix race conditions
- fix some dio behaviors captured by xfstests
- correct handling errors given by quota-related failures
- add many sanity check flows to avoid fuzz test failures
- add more error number propagation to their callers
- fix several corner cases to continue fault injection w/ shutdown
loop"
* tag 'f2fs-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (89 commits)
f2fs: readahead encrypted block during GC
f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc
f2fs: fix performance issue observed with multi-thread sequential read
f2fs: fix to skip verifying block address for non-regular inode
f2fs: rework fault injection handling to avoid a warning
f2fs: support fault_type mount option
f2fs: fix to return success when trimming meta area
f2fs: fix use-after-free of dicard command entry
f2fs: support discard submission error injection
f2fs: split discard command in prior to block layer
f2fs: wake up gc thread immediately when gc_urgent is set
f2fs: fix incorrect range->len in f2fs_trim_fs()
f2fs: refresh recent accessed nat entry in lru list
f2fs: fix avoid race between truncate and background GC
f2fs: avoid race between zero_range and background GC
f2fs: fix to do sanity check with block address in main area v2
f2fs: fix to do sanity check with inline flags
f2fs: fix to reset i_gc_failures correctly
f2fs: fix invalid memory access
f2fs: fix to avoid broken of dnode block list
...
John Fastabend [Wed, 22 Aug 2018 15:37:37 +0000 (08:37 -0700)]
bpf: sockmap: write_space events need to be passed to TCP handler
When sockmap code is using the stream parser it also handles the write
space events in order to handle the case where (a) verdict redirects
skb to another socket and (b) the sockmap then sends the skb but due
to memory constraints (or other EAGAIN errors) needs to do a retry.
But the initial code missed a third case where the
skb_send_sock_locked() triggers an sk_wait_event(). A typically case
would be when sndbuf size is exceeded. If this happens because we
do not pass the write_space event to the lower layers we never wake
up the event and it will wait for sndtimeo. Which as noted in ktls
fix may be rather large and look like a hang to the user.
To reproduce the best test is to reduce the sndbuf size and send
1B data chunks to stress the memory handling. To fix this pass the
event from the upper layer to the lower layer.
John Fastabend [Wed, 22 Aug 2018 15:37:32 +0000 (08:37 -0700)]
tls: possible hang when do_tcp_sendpages hits sndbuf is full case
Currently, the lower protocols sk_write_space handler is not called if
TLS is sending a scatterlist via tls_push_sg. However, normally
tls_push_sg calls do_tcp_sendpage, which may be under memory pressure,
that in turn may trigger a wait via sk_wait_event. Typically, this
happens when the in-flight bytes exceed the sdnbuf size. In the normal
case when enough ACKs are received sk_write_space() will be called and
the sk_wait_event will be woken up allowing it to send more data
and/or return to the user.
But, in the TLS case because the sk_write_space() handler does not
wake up the events the above send will wait until the sndtimeo is
exceeded. By default this is MAX_SCHEDULE_TIMEOUT so it look like a
hang to the user (especially this impatient user). To fix this pass
the sk_write_space event to the lower layers sk_write_space event
which in the TCP case will wake any pending events.
I observed the above while integrating sockmap and ktls. It
initially appeared as test_sockmap (modified to use ktls) occasionally
hanging. To reliably reproduce this reduce the sndbuf size and stress
the tls layer by sending many 1B sends. This results in every byte
needing a header and each byte individually being sent to the crypto
layer.
Linus Torvalds [Wed, 22 Aug 2018 19:34:08 +0000 (12:34 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
- the rest of MM
- procfs updates
- various misc things
- more y2038 fixes
- get_maintainer updates
- lib/ updates
- checkpatch updates
- various epoll updates
- autofs updates
- hfsplus
- some reiserfs work
- fatfs updates
- signal.c cleanups
- ipc/ updates
* emailed patches from Andrew Morton <[email protected]>: (166 commits)
ipc/util.c: update return value of ipc_getref from int to bool
ipc/util.c: further variable name cleanups
ipc: simplify ipc initialization
ipc: get rid of ids->tables_initialized hack
lib/rhashtable: guarantee initial hashtable allocation
lib/rhashtable: simplify bucket_table_alloc()
ipc: drop ipc_lock()
ipc/util.c: correct comment in ipc_obtain_object_check
ipc: rename ipcctl_pre_down_nolock()
ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
ipc: reorganize initialization of kern_ipc_perm.seq
ipc: compute kern_ipc_perm.id under the ipc lock
init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
adfs: use timespec64 for time conversion
kernel/sysctl.c: fix typos in comments
drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
fork: don't copy inconsistent signal handler state to child
signal: make get_signal() return bool
signal: make sigkill_pending() return bool
...
Daniel Borkmann [Wed, 22 Aug 2018 16:09:17 +0000 (18:09 +0200)]
bpf, sockmap: fix sock hash count in alloc_sock_hash_elem
When we try to allocate a new sock hash entry and the allocation
fails, then sock hash map fails to reduce the map element counter,
meaning we keep accounting this element although it was never used.
Fix it by dropping the element counter on error.
Fixes: 81110384441a ("bpf: sockmap, add hash map support") Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: John Fastabend <[email protected]>
Daniel Borkmann [Tue, 21 Aug 2018 13:55:00 +0000 (15:55 +0200)]
bpf, sockmap: fix sock_hash_alloc and reject zero-sized keys
Currently, it is possible to create a sock hash map with key size
of 0 and have the kernel return a fd back to user space. This is
invalid for hash maps (and kernel also hasn't been tested for zero
key size support in general at this point). Thus, reject such
configuration.
Manfred Spraul [Wed, 22 Aug 2018 05:02:00 +0000 (22:02 -0700)]
ipc/util.c: further variable name cleanups
The varable names got a mess, thus standardize them again:
id: user space id. Called semid, shmid, msgid if the type is known.
Most functions use "id" already.
idx: "index" for the idr lookup
Right now, some functions use lid, ipc_addid() already uses idx as
the variable name.
seq: sequence number, to avoid quick collisions of the user space id
key: user space key, used for the rhash tree
Davidlohr Bueso [Wed, 22 Aug 2018 05:01:52 +0000 (22:01 -0700)]
ipc: get rid of ids->tables_initialized hack
In sysvipc we have an ids->tables_initialized regarding the rhashtable,
introduced in 0cfb6aee70bd ("ipc: optimize semget/shmget/msgget for lots
of keys")
It's there, specifically, to prevent nil pointer dereferences, from using
an uninitialized api. Considering how rhashtable_init() can fail
(probably due to ENOMEM, if anything), this made the overall ipc
initialization capable of failure as well. That alone is ugly, but fine,
however I've spotted a few issues regarding the semantics of
tables_initialized (however unlikely they may be):
- There is inconsistency in what we return to userspace: ipc_addid()
returns ENOSPC which is certainly _wrong_, while ipc_obtain_object_idr()
returns EINVAL.
- After we started using rhashtables, ipc_findkey() can return nil upon
!tables_initialized, but the caller expects nil for when the ipc
structure isn't found, and can therefore call into ipcget() callbacks.
Now that rhashtable initialization cannot fail, we can properly get rid of
the hack altogether.
rhashtable_init() may fail due to -ENOMEM, thus making the entire api
unusable. This patch removes this scenario, however unlikely. In order
to guarantee memory allocation, this patch always ends up doing
GFP_KERNEL|__GFP_NOFAIL for both the tbl as well as
alloc_bucket_spinlocks().
Upon the first table allocation failure, we shrink the size to the
smallest value that makes sense and retry with __GFP_NOFAIL semantics.
With the defaults, this means that from 64 buckets, we retry with only 4.
Any later issues regarding performance due to collisions or larger table
resizing (when more memory becomes available) is the least of our
problems.
Davidlohr Bueso [Wed, 22 Aug 2018 05:01:45 +0000 (22:01 -0700)]
lib/rhashtable: simplify bucket_table_alloc()
As of ce91f6ee5b3b ("mm: kvmalloc does not fallback to vmalloc for
incompatible gfp flags") we can simplify the caller and trust kvzalloc()
to just do the right thing. For the case of the GFP_ATOMIC context, we
can drop the __GFP_NORETRY flag for obvious reasons, and for the
__GFP_NOWARN case, however, it is changed such that the caller passes the
flag instead of making bucket_table_alloc() handle it.
This slightly changes the gfp flags passed on to nested_table_alloc() as
it will now also use GFP_ATOMIC | __GFP_NOWARN. However, I consider this
a positive consequence as for the same reasons we want nowarn semantics in
bucket_table_alloc().
Davidlohr Bueso [Wed, 22 Aug 2018 05:01:41 +0000 (22:01 -0700)]
ipc: drop ipc_lock()
ipc/util.c contains multiple functions to get the ipc object pointer given
an id number.
There are two sets of function: One set verifies the sequence counter part
of the id number, other functions do not check the sequence counter.
The standard for function names in ipc/util.c is
- ..._check() functions verify the sequence counter
- ..._idr() functions do not verify the sequence counter
ipc_lock() is an exception: It does not verify the sequence counter value,
but this is not obvious from the function name.
Furthermore, shm.c is the only user of this helper. Thus, we can simply
move the logic into shm_lock() and get rid of the function altogether.
Manfred Spraul [Wed, 22 Aug 2018 05:01:37 +0000 (22:01 -0700)]
ipc/util.c: correct comment in ipc_obtain_object_check
The comment that explains ipc_obtain_object_check is wrong: The function
checks the sequence number, not the reference counter.
Note that checking the reference counter would be meaningless: The
reference counter is decreased without holding any locks, thus an object
with kern_ipc_perm.deleted=true may disappear at the end of the next rcu
grace period.
Manfred Spraul [Wed, 22 Aug 2018 05:01:34 +0000 (22:01 -0700)]
ipc: rename ipcctl_pre_down_nolock()
Both the comment and the name of ipcctl_pre_down_nolock() are misleading:
The function must be called while holdling the rw semaphore.
Therefore the patch renames the function to ipcctl_obtain_check(): This
name matches the other names used in util.c:
- "obtain" function look up a pointer in the idr, without
acquiring the object lock.
- The caller is responsible for locking.
- _check means that the sequence number is checked.
Manfred Spraul [Wed, 22 Aug 2018 05:01:29 +0000 (22:01 -0700)]
ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
ipc_addid() is impossible to use:
- for certain failures, the caller must not use ipc_rcu_putref(),
because the reference counter is not yet initialized.
- for other failures, the caller must use ipc_rcu_putref(),
because parallel operations could be ongoing already.
The patch cleans that up, by initializing the refcount early, and by
modifying all callers.
The issues is related to the finding of
syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com: syzbot found an
issue with reading kern_ipc_perm.seq, here both read and write to already
released memory could happen.
Manfred Spraul [Wed, 22 Aug 2018 05:01:25 +0000 (22:01 -0700)]
ipc: reorganize initialization of kern_ipc_perm.seq
ipc_addid() initializes kern_ipc_perm.seq after having called idr_alloc()
(within ipc_idr_alloc()).
Thus a parallel semop() or msgrcv() that uses ipc_obtain_object_check()
may see an uninitialized value.
The patch moves the initialization of kern_ipc_perm.seq before the calls
of idr_alloc().
Notes:
1) This patch has a user space visible side effect:
If /proc/sys/kernel/*_next_id is used (i.e.: checkpoint/restore) and
if semget()/msgget()/shmget() fails in the final step of adding the id
to the rhash tree, then .._next_id is cleared. Before the patch, is
remained unmodified.
There is no change of the behavior after a successful ..get() call: It
always clears .._next_id, there is no impact to non checkpoint/restore
code as that code does not use .._next_id.
2) The patch correctly documents that after a call to ipc_idr_alloc(),
the full tear-down sequence must be used. The callers of ipc_addid()
do not fullfill that, i.e. more bugfixes are required.
The patch is a squash of a patch from Dmitry and my own changes.
Adrian Reber [Wed, 22 Aug 2018 05:01:17 +0000 (22:01 -0700)]
init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
The CHECKPOINT_RESTORE configuration option was introduced in 2012 and
combined with EXPERT. CHECKPOINT_RESTORE is already enabled in many
distribution kernels and also part of the defconfigs of various
architectures.
To make it easier for distributions to enable CHECKPOINT_RESTORE this
removes EXPERT and moves the configuration option out of the EXPERT block.
Arnd Bergmann [Wed, 22 Aug 2018 05:01:13 +0000 (22:01 -0700)]
fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
get_seconds() is deprecated in favor of ktime_get_real_seconds(), which
returns a 64-bit timestamp.
In the SYSV file system, the superblock timestamp is only 32 bits wide,
and it is used to check whether a file system is clean, so the best
solution seems to be to force a wraparound and explicitly convert it to an
unsigned 32-bit value.
This is independent of the inode timestamps that are also 32-bit wide on
disk and that come from current_time().
Jann Horn [Wed, 22 Aug 2018 05:00:58 +0000 (22:00 -0700)]
fork: don't copy inconsistent signal handler state to child
Before this change, if a multithreaded process forks while one of its
threads is changing a signal handler using sigaction(), the memcpy() in
copy_sighand() can race with the struct assignment in do_sigaction(). It
isn't clear whether this can cause corruption of the userspace signal
handler pointer, but it definitely can cause inconsistency between
different fields of struct sigaction.
Take the appropriate spinlock to avoid this.
I have tested that this patch prevents inconsistency between sa_sigaction
and sa_flags, which is possible before this patch.