Linus Torvalds [Sun, 6 May 2018 03:30:58 +0000 (17:30 -1000)]
Merge tag 'platform-drivers-x86-v4.17-2' of git://git.infradead.org/linux-platform-drivers-x86
Pull x86 platform driver fixes from Darren Hart:
- We missed a case in the Dell config dependencies resulting in a
possible bad configuration, resolve it by giving up on trying to keep
DELL_LAPTOP visible in the menu and make it depend on DELL_SMBIOS.
- Fix a null pointer dereference at module unload for the asus-wireless
driver.
* tag 'platform-drivers-x86-v4.17-2' of git://git.infradead.org/linux-platform-drivers-x86:
platform/x86: Kconfig: Fix dell-laptop dependency chain.
platform/x86: asus-wireless: Fix NULL pointer dereference
Linus Torvalds [Sun, 6 May 2018 03:28:08 +0000 (17:28 -1000)]
Merge tag 'usb-4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some USB driver fixes for 4.17-rc4.
The majority of them are some USB gadget fixes that missed my last
pull request. The "largest" patch in here is a fix for the old visor
driver that syzbot found 6 months or so ago and I finally remembered
to fix it.
All of these have been in linux-next with no reported issues"
* tag 'usb-4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
Revert "usb: host: ehci: Use dma_pool_zalloc()"
usb: typec: tps6598x: handle block reads separately with plain-I2C adapters
usb: typec: tcpm: Release the role mux when exiting
USB: Accept bulk endpoints with 1024-byte maxpacket
xhci: Fix use-after-free in xhci_free_virt_device
USB: serial: visor: handle potential invalid device configuration
USB: serial: option: adding support for ublox R410M
usb: musb: trace: fix NULL pointer dereference in musb_g_tx()
usb: musb: host: fix potential NULL pointer dereference
usb: gadget: composite Allow for larger configuration descriptors
usb: dwc3: gadget: Fix list_del corruption in dwc3_ep_dequeue
usb: dwc3: gadget: dwc3_gadget_del_and_unmap_request() can be static
usb: dwc2: pci: Fix error return code in dwc2_pci_probe()
usb: dwc2: WA for Full speed ISOC IN in DDMA mode.
usb: dwc2: dwc2_vbus_supply_init: fix error check
usb: gadget: f_phonet: fix pn_net_xmit()'s return type
Since the commit "8003c9ae204e: add APIC Timer periodic/oneshot mode VMX
preemption timer support", a Windows 10 guest has some erratic timer
spikes.
Here the results on a 150000 times 1ms timer without any load:
Before 8003c9ae204e | After 8003c9ae204e
Max 1834us | 86000us
Mean 1100us | 1021us
Deviation 59us | 149us
Here the results on a 150000 times 1ms timer with a cpu-z stress test:
Before 8003c9ae204e | After 8003c9ae204e
Max 32000us | 140000us
Mean 1006us | 1997us
Deviation 140us | 11095us
The root cause of the problem is starting hrtimer with an expiry time
already in the past can take more than 20 milliseconds to trigger the
timer function. It can be solved by forward such past timers
immediately, rather than submitting them to hrtimer_start().
In case the timer is periodic, update the target expiration and call
hrtimer_start with it.
v2: Check if the tsc deadline is already expired. Thank you Mika.
v3: Execute the past timers immediately rather than submitting them to
hrtimer_start().
v4: Rearm the periodic timer with advance_periodic_target_expiration() a
simpler version of set_target_expiration(). Thank you Paolo.
Radim Krčmář [Sat, 5 May 2018 21:05:31 +0000 (23:05 +0200)]
Merge tag 'kvmarm-fixes-for-4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm
KVM/arm fixes for 4.17, take #2
- Fix proxying of GICv2 CPU interface accesses
- Fix crash when switching to BE
- Track source vcpu git GICv2 SGIs
- Fix an outdated bit of documentation
With commit ce88313069c36eef80f21fd7 ("arch/sh: make the DMA mapping
operations observe dev->dma_pfn_offset") the generic DMA allocation
function on which the SH 'dma_alloc_coherent()' function relies on,
accesses the 'dma_pfn_offset' field of struct device.
Unfortunately the 'dma_generic_alloc_coherent()' function is called from
several places with a NULL struct device argument, halting the CPU
during the boot process.
This patch fixes the issue by protecting access to dev->dma_pfn_offset,
with a trivial check for validity. It also passes a valid 'struct device'
in the 'platform_resource_setup_memory()' function which is the main user
of 'dma_alloc_coherent()', and inserts a WARN_ON() check to remind to future
(and existing) bogus users of this function to provide a valid 'struct device'
whenever possible.
Fixes: ce88313069c36eef80f21fd7 ("arch/sh: make the DMA mapping operations observe dev->dma_pfn_offset") Signed-off-by: Jacopo Mondi <[email protected]> Reviewed-by: Geert Uytterhoeven <[email protected]> Reviewed-by: Thomas Petazzoni <[email protected]> Signed-off-by: Rich Felker <[email protected]>
Linus Torvalds [Sat, 5 May 2018 07:15:25 +0000 (21:15 -1000)]
Merge tag 'kbuild-fixes-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- remove state comment in modpost
- extend MAINTAINERS entry to cover modpost and more makefiles
- fix missed building of SANCOV gcc-plugin
- replace left-over 'bison' with $(YACC)
- display short log when generating parer of genksyms
* tag 'kbuild-fixes-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
genksyms: fix typo in parse.tab.{c,h} generation rules
kbuild: replace hardcoded bison in cmd_bison_h with $(YACC)
gcc-plugins: fix build condition of SANCOV plugin
MAINTAINERS: Update Kbuild entry with a few paths
modpost: delete stale comment
Linus Torvalds [Sat, 5 May 2018 07:12:06 +0000 (21:12 -1000)]
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes froom Stephen Boyd:
"A handful of fixes for the stm32mp1 clk driver came in during the
merge window for the driver that got merged in the merge window.
Plus a warning fix for unused PM ops and a couple fixes for the meson
clk driver clk names that went unnoticed with the regmap rework.
There's also another fix in here for the mux rounding flag which
wasn't doing what it said it did, but now it does"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: meson: meson8b: fix meson8b_cpu_clk parent clock name
clk: meson: meson8b: fix meson8b_fclk_div3_div clock name
clk: meson: drop meson_aoclk_gate_regmap_ops
clk: meson: honor CLK_MUX_ROUND_CLOSEST in clk_regmap
clk: honor CLK_MUX_ROUND_CLOSEST in generic clk mux
clk: cs2000: mark resume function as __maybe_unused
clk: stm32mp1: remove ck_apb_dbg clock
clk: stm32mp1: set stgen_k clock as critical
clk: stm32mp1: add missing tzc2 clock
clk: stm32mp1: fix SAI3 & SAI4 clocks
clk: stm32mp1: remove unused dfsdm_src[] const
clk: stm32mp1: add missing static
Linus Torvalds [Sat, 5 May 2018 07:05:12 +0000 (21:05 -1000)]
Merge tag 'drm-fixes-for-v4.17-rc4' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"vmwgfx, i915, vc4, vga dac fixes.
This seems eerily quiet, so I expect it will explode next week or
something.
One i915 model firmware, two vmwgfx fixes, one vc4 fix and one bridge
leak fix"
* tag 'drm-fixes-for-v4.17-rc4' of git://people.freedesktop.org/~airlied/linux:
drm/bridge: vga-dac: Fix edid memory leak
drm/vc4: Make sure vc4_bo_{inc,dec}_usecnt() calls are balanced
drm/i915/glk: Add MODULE_FIRMWARE for Geminilake
drm/vmwgfx: Fix a buffer object leak
drm/vmwgfx: Clean up fbdev modeset locking
Linus Torvalds [Sat, 5 May 2018 06:57:28 +0000 (20:57 -1000)]
Merge tag 'trace-v4.17-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Some of the files in the tracing directory show file mode 0444 when
they are writable by root. To fix the confusion, they should be 0644.
Note, either case root can still write to them.
Zhengyuan asked why I never applied that patch (the first one is from
2014!). I simply forgot about it. /me lowers head in shame"
* tag 'trace-v4.17-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix the file mode of stack tracer
ftrace: Have set_graph_* files have normal file modes
Linus Torvalds [Sat, 5 May 2018 06:51:10 +0000 (20:51 -1000)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Doug Ledford:
"This is our first pull request of the rc cycle. It's not that it's
been overly quiet, we were just waiting on a few things before sending
this off.
For instance, the 6 patch series from Intel for the hfi1 driver had
actually been pulled in on Tuesday for a Wednesday pull request, only
to have Jason notice something I missed, so we held off for some
testing, and then on Thursday had to respin the series because the
very first patch needed a minor fix (unnecessary cast is all).
There is a sizable hns patch series in here, as well as a reasonably
largish hfi1 patch series, then all of the lines of uapi updates are
just the change to the new official Linux-OpenIB SPDX tag (a bunch of
our files had what amounts to a BSD-2-Clause + MIT Warranty statement
as their license as a result of the initial code submission years ago,
and the SPDX folks decided it was unique enough to warrant a unique
tag), then the typical mlx4 and mlx5 updates, and finally some cxgb4
and core/cache/cma updates to round out the bunch.
None of it was overly large by itself, but in the 2 1/2 weeks we've
been collecting patches, it has added up :-/.
As best I can tell, it's been through 0day (I got a notice about my
last for-next push, but not for my for-rc push, but Jason seems to
think that failure messages are prioritized and success messages not
so much). It's also been through linux-next. And yes, we did notice in
the context portion of the CMA query gid fix patch that there is a
dubious BUG_ON() in the code, and have plans to audit our BUG_ON usage
and remove it anywhere we can.
Summary:
- Various build fixes (USER_ACCESS=m and ADDR_TRANS turned off)
- SPDX license tag cleanups (new tag Linux-OpenIB)
- RoCE GID fixes related to default GIDs
- Various fixes to: cxgb4, uverbs, cma, iwpm, rxe, hns (big batch),
mlx4, mlx5, and hfi1 (medium batch)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (52 commits)
RDMA/cma: Do not query GID during QP state transition to RTR
IB/mlx4: Fix integer overflow when calculating optimal MTT size
IB/hfi1: Fix memory leak in exception path in get_irq_affinity()
IB/{hfi1, rdmavt}: Fix memory leak in hfi1_alloc_devdata() upon failure
IB/hfi1: Fix NULL pointer dereference when invalid num_vls is used
IB/hfi1: Fix loss of BECN with AHG
IB/hfi1 Use correct type for num_user_context
IB/hfi1: Fix handling of FECN marked multicast packet
IB/core: Make ib_mad_client_id atomic
iw_cxgb4: Atomically flush per QP HW CQEs
IB/uverbs: Fix kernel crash during MR deregistration flow
IB/uverbs: Prevent reregistration of DM_MR to regular MR
RDMA/mlx4: Add missed RSS hash inner header flag
RDMA/hns: Fix a couple misspellings
RDMA/hns: Submit bad wr
RDMA/hns: Update assignment method for owner field of send wqe
RDMA/hns: Adjust the order of cleanup hem table
RDMA/hns: Only assign dqpn if IB_QP_PATH_DEST_QPN bit is set
RDMA/hns: Remove some unnecessary attr_mask judgement
RDMA/hns: Only assign mtu if IB_QP_PATH_MTU bit is set
...
Linus Torvalds [Sat, 5 May 2018 06:41:44 +0000 (20:41 -1000)]
Merge tag 'for-linus-20180504' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A collection of fixes that should to into this release. This contains:
- Set of bcache fixes from Coly, fixing regression in patches that
went into this series.
- Set of NVMe fixes by way of Keith.
- Set of bdi related fixes, one from Jan and two from Tetsuo Handa,
fixing various issues around device addition/removal.
- Two block inflight fixes from Omar, fixing issues around the
transition to using tags for blk-mq inflight accounting that we
did a few releases ago"
* tag 'for-linus-20180504' of git://git.kernel.dk/linux-block:
bdi: Fix oops in wb_workfn()
nvmet: switch loopback target state to connecting when resetting
nvme/multipath: Fix multipath disabled naming collisions
nvme/multipath: Disable runtime writable enabling parameter
nvme: Set integrity flag for user passthrough commands
nvme: fix potential memory leak in option parsing
bdi: Fix use after free bug in debugfs_remove()
bdi: wake up concurrent wb_shutdown() callers.
bcache: use pr_info() to inform duplicated CACHE_SET_IO_DISABLE set
bcache: set dc->io_disable to true in conditional_stop_bcache_device()
bcache: add wait_for_kthread_stop() in bch_allocator_thread()
bcache: count backing device I/O error for writeback I/O
bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()
bcache: store disk name in struct cache and struct cached_dev
blk-mq: fix sysfs inflight counter
blk-mq: count allocated but not started requests in iostats inflight
Linus Torvalds [Sat, 5 May 2018 06:36:50 +0000 (20:36 -1000)]
Merge tag 'xfs-4.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"I've got one more bug fix for xfs for 4.17-rc4, which caps the amount
of data we try to handle in one dedupe request so that userspace can't
livelock the kernel.
This series has been run through a full xfstests run during the week
and through a quick xfstests run against this morning's master, with
no ajor failures reported.
Summary:
- Cap the maximum length of a deduplication request at MAX_RW_COUNT/2
to avoid kernel livelock due to excessively large IO requests"
* tag 'xfs-4.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: cap the length of deduplication requests
Linus Torvalds [Sat, 5 May 2018 06:32:18 +0000 (20:32 -1000)]
Merge tag 'for-4.17-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two regression fixes and one fix for stable"
* tag 'for-4.17-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: send, fix missing truncate for inode with prealloc extent past eof
btrfs: Take trans lock before access running trans in check_delayed_ref
btrfs: Fix wrong first_key parameter in replace_path
Since commit d677a4d60193 ("Makefile: support flag
-fsanitizer-coverage=trace-cmp"), you miss to build the SANCOV
plugin under some circumstances.
CONFIG_KCOV=y
CONFIG_KCOV_ENABLE_COMPARISONS=y
Your compiler does not support -fsanitize-coverage=trace-pc
Your compiler does not support -fsanitize-coverage=trace-cmp
Under this condition, $(CFLAGS_KCOV) is not empty but contains a
space, so the following ifeq-conditional is false.
ifeq ($(CFLAGS_KCOV),)
Then, scripts/Makefile.gcc-plugins misses to add sancov_plugin.so to
gcc-plugin-y while the SANCOV plugin is necessary as an alternative
means.
Fixes: d677a4d60193 ("Makefile: support flag -fsanitizer-coverage=trace-cmp") Signed-off-by: Masahiro Yamada <[email protected]> Acked-by: Kees Cook <[email protected]>
Rasmus Villemoes [Thu, 22 Mar 2018 20:58:27 +0000 (21:58 +0100)]
MAINTAINERS: Update Kbuild entry with a few paths
I managed to send some modpost patches to old addresses of both
Masahiro and Michal, and omitted linux-kbuild from cc, because my
tried and trusted scripts/get_maintainer wrapper failed me. Add the
modpost directory to the MAINTAINERS entry, and while at it make the
Makefile glob match scripts/Makefile itself, and add one matching the
Kbuild.include file as well.
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Sanitize attr->{prog,map}_type from bpf(2) since used as an array index
to retrieve prog/map specific ops such that we prevent potential out of
bounds value under speculation, from Mark and Daniel.
====================
Daniel Borkmann [Fri, 4 May 2018 14:27:53 +0000 (16:27 +0200)]
bpf, xskmap: fix crash in xsk_map_alloc error path handling
If bpf_map_precharge_memlock() did not fail, then we set err to zero.
However, any subsequent failure from either alloc_percpu() or the
bpf_map_area_alloc() will return ERR_PTR(0) which in find_and_alloc_map()
will cause NULL pointer deref.
In devmap we have the convention that we return -EINVAL on page count
overflow, so keep the same logic here and just set err to -ENOMEM
after successful bpf_map_precharge_memlock().
Daniel Borkmann [Fri, 4 May 2018 21:41:05 +0000 (23:41 +0200)]
Merge branch 'bpf-event-output-offload'
Jakub Kicinski says:
====================
This series centres on NFP offload of bpf_event_output(). The
first patch allows perf event arrays to be used by offloaded
programs. Next patch makes the nfp driver keep track of such
arrays to be able to filter FW events referring to maps.
Perf event arrays are not device bound. Having driver
reimplement and manage the perf array seems brittle and unnecessary.
Patch 4 moves slightly the verifier step which replaces map fds
with map pointers. This is useful for nfp JIT since we can then
easily replace host pointers with NFP table ids (patch 6). This
allows us to lift the limitation on map helpers having to be used
with the same map pointer on all paths. Second use of replacing
fds with real host map pointers is that we can use the host map
pointer as a key for FW events in perf event array offload.
Patch 5 adds perf event output offload support for the NFP.
There are some differences between bpf_event_output() offloaded
and non-offloaded version. The FW messages which carry events
may get dropped and reordered relatively easily. The return codes
from the helper are also not guaranteed to match the host. Users
are warned about some of those discrepancies with a one time
warning message to kernel logs.
bpftool gains an ability to dump perf ring events in a very simple
format. This was very useful for testing and simple debug, maybe
it will be useful to others?
Last patch is a trivial comment fix.
====================
Users of BPF sooner or later discover perf_event_output() helpers
and BPF_MAP_TYPE_PERF_EVENT_ARRAY. Dumping this array type is
not possible, however, we can add simple reading of perf events.
Create a new event_pipe subcommand for maps, this sub command
will only work with BPF_MAP_TYPE_PERF_EVENT_ARRAY maps.
Parts of the code from samples/bpf/trace_output_user.c.
Jakub Kicinski [Fri, 4 May 2018 01:37:14 +0000 (18:37 -0700)]
tools: bpftool: fold hex keyword in command help
Instead of spelling [hex] BYTES everywhere use DATA as keyword
for generalized value. This will help us keep the messages
concise when longer command are added in the future. It will
also be useful once BTF support comes. We will only have to
change the definition of DATA.
Jakub Kicinski [Fri, 4 May 2018 01:37:13 +0000 (18:37 -0700)]
nfp: bpf: rewrite map pointers with NFP TIDs
Kernel will now replace map fds with actual pointer before
calling the offload prepare. We can identify those pointers
and replace them with NFP table IDs instead of loading the
table ID in code generated for CALL instruction.
This allows us to support having the same CALL being used with
different maps.
Since we don't want to change the FW ABI we still need to
move the TID from R1 to portion of R0 before the jump.
Jakub Kicinski [Fri, 4 May 2018 01:37:12 +0000 (18:37 -0700)]
nfp: bpf: perf event output helpers support
Add support for the perf_event_output family of helpers.
The implementation on the NFP will not match the host code exactly.
The state of the host map and rings is unknown to the device, hence
device can't return errors when rings are not installed. The device
simply packs the data into a firmware notification message and sends
it over to the host, returning success to the program.
There is no notion of a host CPU on the device when packets are being
processed. Device will only offload programs which set BPF_F_CURRENT_CPU.
Still, if map index doesn't match CPU no error will be returned (see
above).
Dropped/lost firmware notification messages will not cause "lost
events" event on the perf ring, they are only visible via device
error counters.
Firmware notification messages may also get reordered in respect
to the packets which caused their generation.
Jakub Kicinski [Fri, 4 May 2018 01:37:11 +0000 (18:37 -0700)]
bpf: replace map pointer loads before calling into offloads
Offloads may find host map pointers more useful than map fds.
Map pointers can be used to identify the map, while fds are
only valid within the context of loading process.
Jump to skip_full_check on error in case verifier log overflow
has to be handled (replace_map_fd_with_map_ptr() prints to the
log, driver prep may do that too in the future).
Jakub Kicinski [Fri, 4 May 2018 01:37:10 +0000 (18:37 -0700)]
bpf: export bpf_event_output()
bpf_event_output() is useful for offloads to add events to BPF
event rings, export it. Note that export is placed near the stub
since tracing is optional and kernel/bpf/core.c is always going
to be built.
Jakub Kicinski [Fri, 4 May 2018 01:37:09 +0000 (18:37 -0700)]
nfp: bpf: record offload neutral maps in the driver
For asynchronous events originating from the device, like perf event
output, we need to be able to make sure that objects being referred
to by the FW message are valid on the host. FW events can get queued
and reordered. Even if we had a FW message "barrier" we should still
protect ourselves from bogus FW output.
Add a reverse-mapping hash table and record in it all raw map pointers
FW may refer to. Only record neutral maps, i.e. perf event arrays.
These are currently the only objects FW can refer to. Use RCU protection
on the read side, update side is under RTNL.
Since program vs map destruction order is slightly painful for offload
simply take an extra reference on all the recorded maps to make sure
they don't disappear.
Jakub Kicinski [Fri, 4 May 2018 01:37:08 +0000 (18:37 -0700)]
bpf: offload: allow offloaded programs to use perf event arrays
BPF_MAP_TYPE_PERF_EVENT_ARRAY is special as far as offload goes.
The map only holds glue to perf ring, not actual data. Allow
non-offloaded perf event arrays to be used in offloaded programs.
Offload driver can extract the events from HW and put them in
the map for user space to retrieve.
Alan writes:
What you can't see just from reading the patch is that in both
cases (ehci->itd_pool and ehci->sitd_pool) there are two
allocation paths -- the two branches of an "if" statement -- and
only one of the paths calls dma_pool_[z]alloc. However, the
memset is needed for both paths, and so it can't be eliminated.
Given that it must be present, there's no advantage to calling
dma_pool_zalloc rather than dma_pool_alloc.
When the module is removed the led workqueue is destroyed in the remove
callback, before the led device is unregistered from the led subsystem.
This leads to a NULL pointer derefence when the led device is
unregistered automatically later as part of the module removal cleanup.
Bellow is the backtrace showing the problem.
Daniel Borkmann [Fri, 4 May 2018 09:58:38 +0000 (11:58 +0200)]
Merge branch 'bpf-subprog-mgmt-cleanup'
Jiong Wang says:
====================
This patch set clean up some code logic related with managing subprog
information.
Part of the set are inspried by Edwin's code in his RFC:
"bpf/verifier: subprog/func_call simplifications"
but with clearer separation so it could be easier to review.
- Path 1 unifies main prog and subprogs. All of them are registered in
env->subprog_starts.
- After patch 1, it is clear that subprog_starts and subprog_stack_depth
could be merged as both of them now have main and subprog unified.
Patch 2 therefore does the merge, all subprog information are centred
at bpf_subprog_info.
- Patch 3 goes further to introduce a new fake "exit" subprog which
serves as an ending marker to the subprog list. We could then turn the
following code snippets across verifier:
There is no functional change by this patch set.
No bpf selftest (both non-jit and jit) regression found after this set.
v2:
- fixed adjust_subprog_starts to also update fake "exit" subprog start.
- for John's suggestion on renaming subprog to prog, I could work on
a follow-up patch if it is recognized as worth the change.
====================
Dump command mailbox length printed was correct only if data_only flag
was set. For the case that data_only flag was clear the offset to stop
printing at was wrong and so the buffer printed was too short.
Changed the print loop to stop according to number of buffers in
mailbox.
Leon Romanovsky [Tue, 3 Apr 2018 11:03:52 +0000 (14:03 +0300)]
net/mlx5: Decrease level of prints about non-existent MKEY
User-controlled application can cause multiple prints as below to flood
dmesg. Since knowledge of failed MKey release is important for debug,
let's decrease its level to debug.
mlx5_core 0000:00:04.0: mlx5_core_destroy_mkey:127:(pid 2352): failed
radix tree delete of mkey 0x1ed700
Antoine Tenart [Fri, 4 May 2018 15:10:54 +0000 (17:10 +0200)]
net: phy: sfp: fix the BR,min computation
In an SFP EEPROM values can be read to get information about a given SFP
module. One of those is the bitrate, which can be determined using a
nominal bitrate in addition with min and max values (in %). The SFP code
currently compute both BR,min and BR,max values thanks to this nominal
and min,max values.
This patch fixes the BR,min computation as the min value should be
subtracted to the nominal one, not added.
Fixes: 9962acf7fb8c ("sfp: add support for 1000Base-PX and 1000Base-BX10") Signed-off-by: Antoine Tenart <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Rob Taglang [Thu, 3 May 2018 21:13:06 +0000 (17:13 -0400)]
net: ethernet: sun: niu set correct packet size in skb
Currently, skb->len and skb->data_len are set to the page size, not
the packet size. This causes the frame check sequence to not be
located at the "end" of the packet resulting in ethernet frame check
errors. The driver does work currently, but stricter kernel facing
networking solutions like OpenVSwitch will drop these packets as
invalid.
These changes set the packet size correctly so that these errors no
longer occur. The length does not include the frame check sequence, so
that subtraction was removed.
Tested on Oracle/SUN Multithreaded 10-Gigabit Ethernet Network
Controller [108e:abcd] and validated in wireshark.
Michael Chan [Fri, 4 May 2018 00:04:27 +0000 (20:04 -0400)]
tg3: Fix vunmap() BUG_ON() triggered from tg3_free_consistent().
tg3_free_consistent() calls dma_free_coherent() to free tp->hw_stats
under spinlock and can trigger BUG_ON() in vunmap() because vunmap()
may sleep. Fix it by removing the spinlock and relying on the
TG3_FLAG_INIT_COMPLETE flag to prevent race conditions between
tg3_get_stats64() and tg3_free_consistent(). TG3_FLAG_INIT_COMPLETE
is always cleared under tp->lock before tg3_free_consistent()
and therefore tg3_get_stats64() can safely access tp->hw_stats
under tp->lock if TG3_FLAG_INIT_COMPLETE is set.
Fixes: f5992b72ebe0 ("tg3: Fix race condition in tg3_get_stats64().") Reported-by: Zumeng Chen <[email protected]> Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
ioc_data.dev_num can be controlled by user-space, hence leading to
a potential exploitation of the Spectre variant 1 vulnerability.
This issue was detected with the help of Smatch:
net/atm/lec.c:702 lec_vcc_attach() warn: potential spectre issue
'dev_lec'
Fix this by sanitizing ioc_data.dev_num before using it to index
dev_lec. Also, notice that there is another instance in which array
dev_lec is being indexed using ioc_data.dev_num at line 705:
lec_vcc_added(netdev_priv(dev_lec[ioc_data.dev_num]),
Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].
Fix this by sanitizing pool before using it to index
zatm_dev->pool_info
Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].
Stefano Brivio [Thu, 3 May 2018 16:13:25 +0000 (18:13 +0200)]
openvswitch: Don't swap table in nlattr_set() after OVS_ATTR_NESTED is found
If an OVS_ATTR_NESTED attribute type is found while walking
through netlink attributes, we call nlattr_set() recursively
passing the length table for the following nested attributes, if
different from the current one.
However, once we're done with those sub-nested attributes, we
should continue walking through attributes using the current
table, instead of using the one related to the sub-nested
attributes.
we switch to the 'ovs_tunnel_key_lens' table on attribute #3,
and we don't switch back to 'ovs_key_lens' while setting
attributes #9 to #11 in the sequence. As OVS_KEY_ATTR_MPLS
evaluates to 21, and the array size of 'ovs_tunnel_key_lens' is
15, we also get this kind of KASan splat while accessing the
wrong table:
[ 7655.132484] The buggy address belongs to the variable:
[ 7655.138226] ovs_tunnel_key_lens+0xf0/0xffffffffffffd400 [openvswitch]
[ 7655.145507]
[ 7655.147166] Memory state around the buggy address:
[ 7655.152514] ffffffffc169eb80: 00 00 00 00 00 00 00 00 00 00 fa fa fa fa fa fa
[ 7655.160585] ffffffffc169ec00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 7655.168644] >ffffffffc169ec80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fa fa
[ 7655.176701] ^
[ 7655.184372] ffffffffc169ed00: fa fa fa fa 00 00 00 00 fa fa fa fa 00 00 00 05
[ 7655.192431] ffffffffc169ed80: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
[ 7655.200490] ==================================================================
Reported-by: Hangbin Liu <[email protected]> Fixes: 982b52700482 ("openvswitch: Fix mask generation for nested attributes.") Signed-off-by: Stefano Brivio <[email protected]> Reviewed-by: Sabrina Dubroca <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Eric Dumazet [Thu, 19 Apr 2018 15:49:29 +0000 (08:49 -0700)]
net/mlx4_en: optimizes get_fixed_ipv6_csum()
While trying to support CHECKSUM_COMPLETE for IPV6 fragments,
I had to experiments various hacks in get_fixed_ipv6_csum().
I must admit I could not find how to implement this :/
However, get_fixed_ipv6_csum() does a lot of redundant operations,
calling csum_partial() twice.
First csum_partial() computes the checksum of saddr and daddr,
put in @csum_pseudo_hdr. Undone later in the second csum_partial()
computed on whole ipv6 header.
Then nexthdr is added once, added a second time, then substracted.
payload_len is added once, then substracted.
Really all this can be reduced to two add_csum(), to add back 6 bytes
that were removed by mlx4 when providing hw_checksum in RX descriptor.
James Morse [Fri, 4 May 2018 15:19:24 +0000 (16:19 +0100)]
arm64: vgic-v2: Fix proxying of cpuif access
Proxying the cpuif accesses at EL2 makes use of vcpu_data_guest_to_host
and co, which check the endianness, which call into vcpu_read_sys_reg...
which isn't mapped at EL2 (it was inlined before, and got moved OoL
with the VHE optimizations).
The result is of course a nice panic. Let's add some specialized
cruft to keep the broken platforms that require this hack alive.
But, this code used vcpu_data_guest_to_host(), which expected us to
write the value to host memory, instead we have trapped the guest's
read or write to an mmio-device, and are about to replay it using the
host's readl()/writel() which also perform swabbing based on the host
endianness. This goes wrong when both host and guest are big-endian,
as readl()/writel() will undo the guest's swabbing, causing the
big-endian value to be written to device-memory.
What needs doing?
A big-endian guest will have pre-swabbed data before storing, undo this.
If its necessary for the host, writel() will re-swab it.
For a read a big-endian guest expects to swab the data after the load.
The hosts's readl() will correct for host endianness, giving us the
device-memory's value in the register. For a big-endian guest, swab it
as if we'd only done the load.
For a little-endian guest, nothing needs doing as readl()/writel() leave
the correct device-memory value in registers.
Tested on Juno with that rarest of things: a big-endian 64K host.
Based on a patch from Marc Zyngier.
Reported-by: Suzuki K Poulose <[email protected]> Fixes: bf8feb39642b ("arm64: KVM: vgic-v2: Add GICV access from HYP") Signed-off-by: James Morse <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
Stefan comes up with an smc implementation for splice(). The first
three patches are preparational patches, the 4th patch implements
splice().
====================
KVM: arm/arm64: vgic_init: Cleanup reference to process_maintenance
One comment still mentioned process_maintenance operations after
commit af0614991ab6 ("KVM: arm/arm64: vgic: Get rid of unnecessary
process_maintenance operation")
Update the comment to point to vgic_fold_lr_state instead, which
is where maintenance interrupts are taken care of.
James Morse [Wed, 2 May 2018 11:17:02 +0000 (12:17 +0100)]
KVM: arm64: Fix order of vcpu_write_sys_reg() arguments
A typo in kvm_vcpu_set_be()'s call:
| vcpu_write_sys_reg(vcpu, SCTLR_EL1, sctlr)
causes us to use the 32bit register value as an index into the sys_reg[]
array, and sail off the end of the linear map when we try to bring up
big-endian secondaries.
| Unable to handle kernel paging request at virtual address ffff80098b982c00
| Mem abort info:
| ESR = 0x96000045
| Exception class = DABT (current EL), IL = 32 bits
| SET = 0, FnV = 0
| EA = 0, S1PTW = 0
| Data abort info:
| ISV = 0, ISS = 0x00000045
| CM = 0, WnR = 1
| swapper pgtable: 4k pages, 48-bit VAs, pgdp = 000000002ea0571a
| [ffff80098b982c00] pgd=00000009ffff8803, pud=0000000000000000
| Internal error: Oops: 96000045 [#1] PREEMPT SMP
| Modules linked in:
| CPU: 2 PID: 1561 Comm: kvm-vcpu-0 Not tainted 4.17.0-rc3-00001-ga912e2261ca6-dirty #1323
| Hardware name: ARM Juno development board (r1) (DT)
| pstate: 60000005 (nZCv daif -PAN -UAO)
| pc : vcpu_write_sys_reg+0x50/0x134
| lr : vcpu_write_sys_reg+0x50/0x134
Linus Torvalds [Fri, 4 May 2018 15:44:50 +0000 (05:44 -1000)]
Merge tag 'pm-4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"This fixes a regression from the 4.14 cycle in the CPPC cpufreq driver
causing it to use an incorrect transition delay value which leads to a
very high rate of frequency change requests when the schedutil
governor is in use (Prashanth Prakash)"
* tag 'pm-4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq / CPPC: Set platform specific transition_delay_us
Linus Torvalds [Fri, 4 May 2018 15:43:33 +0000 (05:43 -1000)]
Merge tag 'acpi-4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"This fixes an ACPICA utilities (acpidump) build regression from the
4.16 cycle by setting LD in the CFLAGS passed to the linker to $(CC)
again (Jiri Slaby)"
* tag 'acpi-4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
tools: power/acpi, revert to LD = gcc
Linus Torvalds [Fri, 4 May 2018 15:38:51 +0000 (05:38 -1000)]
Merge tag 'media/v4.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
- a trivial one-line fix addressing a PTR_ERR() getting value from a
wrong var at imx driver
- a patch changing my e-mail at the Kernel tree to [email protected].
no code changes
* tag 'media/v4.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
MAINTAINERS & files: Canonize the e-mails I use at files
media: imx-media-csi: Fix inconsistent IS_ERR and PTR_ERR
Linus Torvalds [Fri, 4 May 2018 15:37:22 +0000 (05:37 -1000)]
Merge tag 'sound-4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of small fixes, all deserved for stable.
Two are about core API fixes for the bugs that were triggered by
ever-growing fuzzers, while others are driver-specific fixes"
* tag 'sound-4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: pcm: Check PCM state at xfern compat ioctl
ALSA: aloop: Add missing cable lock to ctl API callbacks
ALSA: dice: fix kernel NULL pointer dereference due to invalid calculation for array index
ALSA: seq: Fix races at MIDI encoding in snd_virmidi_output_trigger()
ALSA: hda - Fix incorrect usage of IS_REACHABLE()
On the quest to remove all VLAs from the kernel[1], this avoids VLAs
in dm-raid1.c by just using the maximum size for the stack arrays.
The nr_mirrors value was already capped at 9, so this makes it a trivial
adjustment to the array sizes.
====================
sh_eth: complain on access to unimplemented TSU registers
Here's a set of 2 patches against DaveM's 'net-next.git' repo. The 1st patch
routes TSU_POST<n> register accesses thru sh_eth_tsu_{read|write}() and the 2nd
added WARN_ON() unimplemented register to those functions. I'm going to deal with
TSU_ADR{H|L}<n> registers in a later series...
====================
Sergei Shtylyov [Wed, 2 May 2018 19:55:52 +0000 (22:55 +0300)]
sh_eth: WARN_ON() access to unimplemented TSU register
Commit 3365711df024 ("sh_eth: WARN on access to a register not implemented
in a particular chip") added WARN_ON() to sh_eth_{read|write}() but not
to sh_eth_tsu_{read|write}(). Now that we've routed almost all TSU register
accesses (except TSU_ADR{H|L}<n> -- which are special) thru the latter
pair of accessors, it makes sense to check for the unimplemented TSU
registers as well...
Sergei Shtylyov [Wed, 2 May 2018 19:54:48 +0000 (22:54 +0300)]
sh_eth: use TSU register accessors for TSU_POST<n>
There's no particularly good reason TSU_POST<n> registers get accessed
circumventing sh_eth_tsu_{read|write}() -- start using those, removing
(badly named) sh_eth_tsu_get_post_reg_offset(), while at it...
MAINTAINERS & files: Canonize the e-mails I use at files
From now on, I'll start using my @kernel.org as my development e-mail.
As such, let's remove the entries that point to the old [email protected] at MAINTAINERS file.
For the files written with a copyright with mchehab@s-opensource,
let's keep Samsung on their names, using [email protected],
in order to keep pointing to my employer, with sponsors the work.
For the files written before I join Samsung (on July, 4 2013),
let's just use [email protected].
For bug reports, we can simply point to just kernel.org, as
this will reach my mchehab+samsung inbox anyway.
The reason is there is no marker in subprog_info array to tell the end of
it.
We could resolve this issue by introducing a faked "ending" subprog.
The special "ending" subprog is with "insn_cnt" as start offset, so it is
serving as the end mark whenever we iterate over all subprogs.
Jiong Wang [Wed, 2 May 2018 20:17:17 +0000 (16:17 -0400)]
bpf: unify main prog and subprog
Currently, verifier treat main prog and subprog differently. All subprogs
detected are kept in env->subprog_starts while main prog is not kept there.
Instead, main prog is implicitly defined as the prog start at 0.
There is actually no difference between main prog and subprog, it is better
to unify them, and register all progs detected into env->subprog_starts.
Commit 7ed1c1901fe5 (tools: fix cross-compile var clobbering) removed
setting of LD to $(CROSS_COMPILE)gcc. This broke build of acpica
(acpidump) in power/acpi:
ld: unrecognized option '-D_LINUX'
The tools pass CFLAGS to the linker (incl. -D_LINUX), so revert this
particular change and let LD be $(CC) again. Note that the old behaviour
was a bit different, it used $(CROSS_COMPILE)gcc which was eliminated by
the commit 7ed1c1901fe5. We use $(CC) for that reason.
Miquel Raynal [Thu, 3 May 2018 10:00:27 +0000 (12:00 +0200)]
mtd: rawnand: marvell: fix command xtype in BCH write hook
One layout supported by the Marvell NAND controller supports NAND pages
of 2048 bytes, all handled in one single chunk when using BCH with a
strength of 4-bit per 512 bytes. In this case, instead of the generic
XTYPE_WRITE_DISPATCH/XTYPE_LAST_NAKED_RW couple, the controller expects
to receive XTYPE_MONOLITHIC_RW.
This fixes problems at boot like:
[ 1.315475] Scanning device for bad blocks
[ 3.203108] marvell-nfc f10d0000.flash: Timeout waiting for RB signal
[ 3.209564] nand_bbt: error while writing BBT block -110
[ 4.243106] marvell-nfc f10d0000.flash: Timeout waiting for RB signal
[ 5.283106] marvell-nfc f10d0000.flash: Timeout waiting for RB signal
[ 5.289562] nand_bbt: error -110 while marking block 2047 bad
[ 6.323106] marvell-nfc f10d0000.flash: Timeout waiting for RB signal
[ 6.329559] nand_bbt: error while writing BBT block -110
[ 7.363106] marvell-nfc f10d0000.flash: Timeout waiting for RB signal
[ 8.403105] marvell-nfc f10d0000.flash: Timeout waiting for RB signal
[ 8.409559] nand_bbt: error -110 while marking block 2046 bad
...
Chris Packham [Thu, 3 May 2018 02:21:28 +0000 (14:21 +1200)]
mtd: rawnand: marvell: pass ms delay to wait_op
marvell_nfc_wait_op() expects the delay to be expressed in milliseconds
but nand_sdr_timings uses picoseconds. Use PSEC_TO_MSEC when passing
tPROG_max to marvell_nfc_wait_op().
Mathias Krause [Thu, 3 May 2018 08:55:07 +0000 (10:55 +0200)]
xfrm: use a dedicated slab cache for struct xfrm_state
struct xfrm_state is rather large (768 bytes here) and therefore wastes
quite a lot of memory as it falls into the kmalloc-1024 slab cache,
leaving 256 bytes of unused memory per XFRM state object -- a net waste
of 25%.
Using a dedicated slab cache for struct xfrm_state reduces the level of
internal fragmentation to a minimum.
On my configuration SLUB chooses to create a slab cache covering 4
pages holding 21 objects, resulting in an average memory waste of ~13
bytes per object -- a net waste of only 1.6%.
In my tests this led to memory savings of roughly 2.3MB for 10k XFRM
states.
Linus Torvalds [Fri, 4 May 2018 05:26:51 +0000 (19:26 -1000)]
Merge tag 'linux-kselftest-4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest fixes from Shuah Khan:
"This Kselftest update for 4.17-rc4 consists of a fix for a syntax
error in the script that runs selftests. Mathieu Desnoyers found this
bug in the script on systems running GNU Make 3.8 or older"
* tag 'linux-kselftest-4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests: Fix lib.mk run_tests target shell script
1) Various sockmap fixes from John Fastabend (pinned map handling,
blocking in recvmsg, double page put, error handling during redirect
failures, etc.)
2) Fix dead code handling in x86-64 JIT, from Gianluca Borello.
3) Missing device put in RDS IB code, from Dag Moxnes.
4) Don't process fast open during repair mode in TCP< from Yuchung
Cheng.
5) Move address/port comparison fixes in SCTP, from Xin Long.
6) Handle add a bond slave's master into a bridge properly, from
Hangbin Liu.
7) IPv6 multipath code can operate on unitialized memory due to an
assumption that the icmp header is in the linear SKB area. Fix from
Eric Dumazet.
8) Don't invoke do_tcp_sendpages() recursively via TLS, from Dave
Watson.
9) Fix memory leaks in x86-64 JIT, from Daniel Borkmann.
10) RDS leaks kernel memory to userspace, from Eric Dumazet.
11) DCCP can invoke a tasklet on a freed socket, take a refcount. Also
from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (78 commits)
dccp: fix tasklet usage
smc: fix sendpage() call
net/smc: handle unregistered buffers
net/smc: call consolidation
qed: fix spelling mistake: "offloded" -> "offloaded"
net/mlx5e: fix spelling mistake: "loobpack" -> "loopback"
tcp: restore autocorking
rds: do not leak kernel memory to user land
qmi_wwan: do not steal interfaces from class drivers
ipv4: fix fnhe usage by non-cached routes
bpf: sockmap, fix error handling in redirect failures
bpf: sockmap, zero sg_size on error when buffer is released
bpf: sockmap, fix scatterlist update on error path in send with apply
net_sched: fq: take care of throttled flows before reuse
ipv6: Revert "ipv6: Allow non-gateway ECMP for IPv6"
bpf, x64: fix memleak when not converging on calls
bpf, x64: fix memleak when not converging after image
net/smc: restrict non-blocking connect finish
8139too: Use disable_irq_nosync() in rtl8139_poll_controller()
sctp: fix the issue that the cookie-ack with auth can't get processed
...
Linus Torvalds [Fri, 4 May 2018 04:31:19 +0000 (18:31 -1000)]
Merge branch 'parisc-4.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
"Fix two section mismatches, convert to read_persistent_clock64(), add
further documentation regarding the HPMC crash handler and make
bzImage the default build target"
* 'parisc-4.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix section mismatches
parisc: drivers.c: Fix section mismatches
parisc: time: Convert read_persistent_clock() to read_persistent_clock64()
parisc: Document rules regarding checksum of HPMC handler
parisc: Make bzImage default build target
Daniel Borkmann [Fri, 4 May 2018 00:13:57 +0000 (02:13 +0200)]
bpf: use array_index_nospec in find_prog_type
Commit 9ef09e35e521 ("bpf: fix possible spectre-v1 in find_and_alloc_map()")
converted find_and_alloc_map() over to use array_index_nospec() to sanitize
map type that user space passes on map creation, and this patch does an
analogous conversion for progs in find_prog_type() as it's also passed from
user space when loading progs as attr->prog_type.
Andrzej Hajda [Fri, 2 Feb 2018 15:11:22 +0000 (16:11 +0100)]
drm/exynos/mixer: fix synchronization check in interlaced mode
In case of interlace mode video processor registers and mixer config
register must be check to ensure internal state is in sync with shadow
registers.
This patch fixes page-faults in interlaced mode.
Dave Airlie [Fri, 4 May 2018 00:03:27 +0000 (10:03 +1000)]
Merge branch 'vmwgfx-fixes-4.17' of git://people.freedesktop.org/~thomash/linux into drm-fixes
Two fixes for now, one for a long standing problem uncovered by a commit
in the 4.17 merge window, one for a regression introduced by a previous
bugfix, Cc'd stable.
* 'vmwgfx-fixes-4.17' of git://people.freedesktop.org/~thomash/linux:
drm/vmwgfx: Fix a buffer object leak
drm/vmwgfx: Clean up fbdev modeset locking
====================
This set simplifies BPF JITs significantly by moving ld_abs/ld_ind
to native BPF, for details see individual patches. Main rationale
is in patch 'implement ld_abs/ld_ind in native bpf'. Thanks!
v1 -> v2:
- Added missing seen_lds_abs in LDX_MSH and use X = A
initially due to being preserved on func call.
- Added a large batch of cBPF tests into test_bpf.
- Added x32 removal of LD_ABS/LD_IND, so all JITs are
covered.
====================
Daniel Borkmann [Thu, 3 May 2018 23:08:23 +0000 (01:08 +0200)]
bpf, x32: remove ld_abs/ld_ind
Since LD_ABS/LD_IND instructions are now removed from the core and
reimplemented through a combination of inlined BPF instructions and
a slow-path helper, we can get rid of the complexity from x32 JIT.
Daniel Borkmann [Thu, 3 May 2018 23:08:22 +0000 (01:08 +0200)]
bpf, s390x: remove ld_abs/ld_ind
Since LD_ABS/LD_IND instructions are now removed from the core and
reimplemented through a combination of inlined BPF instructions and
a slow-path helper, we can get rid of the complexity from s390x JIT.
Tested on s390x instance on LinuxONE.
Daniel Borkmann [Thu, 3 May 2018 23:08:21 +0000 (01:08 +0200)]
bpf, ppc64: remove ld_abs/ld_ind
Since LD_ABS/LD_IND instructions are now removed from the core and
reimplemented through a combination of inlined BPF instructions and
a slow-path helper, we can get rid of the complexity from ppc64 JIT.
Daniel Borkmann [Thu, 3 May 2018 23:08:20 +0000 (01:08 +0200)]
bpf, mips64: remove ld_abs/ld_ind
Since LD_ABS/LD_IND instructions are now removed from the core and
reimplemented through a combination of inlined BPF instructions and
a slow-path helper, we can get rid of the complexity from mips64 JIT.
Daniel Borkmann [Thu, 3 May 2018 23:08:19 +0000 (01:08 +0200)]
bpf, arm32: remove ld_abs/ld_ind
Since LD_ABS/LD_IND instructions are now removed from the core and
reimplemented through a combination of inlined BPF instructions and
a slow-path helper, we can get rid of the complexity from arm32 JIT.
Daniel Borkmann [Thu, 3 May 2018 23:08:18 +0000 (01:08 +0200)]
bpf, sparc64: remove ld_abs/ld_ind
Since LD_ABS/LD_IND instructions are now removed from the core and
reimplemented through a combination of inlined BPF instructions and
a slow-path helper, we can get rid of the complexity from sparc64 JIT.
Daniel Borkmann [Thu, 3 May 2018 23:08:17 +0000 (01:08 +0200)]
bpf, arm64: remove ld_abs/ld_ind
Since LD_ABS/LD_IND instructions are now removed from the core and
reimplemented through a combination of inlined BPF instructions and
a slow-path helper, we can get rid of the complexity from arm64 JIT.
Daniel Borkmann [Thu, 3 May 2018 23:08:16 +0000 (01:08 +0200)]
bpf, x64: remove ld_abs/ld_ind
Since LD_ABS/LD_IND instructions are now removed from the core and
reimplemented through a combination of inlined BPF instructions and
a slow-path helper, we can get rid of the complexity from x64 JIT.
Daniel Borkmann [Thu, 3 May 2018 23:08:15 +0000 (01:08 +0200)]
bpf: add skb_load_bytes_relative helper
This adds a small BPF helper similar to bpf_skb_load_bytes() that
is able to load relative to mac/net header offset from the skb's
linear data. Compared to bpf_skb_load_bytes(), it takes a fifth
argument namely start_header, which is either BPF_HDR_START_MAC
or BPF_HDR_START_NET. This allows for a more flexible alternative
compared to LD_ABS/LD_IND with negative offset. It's enabled for
tc BPF programs as well as sock filter program types where it's
mainly useful in reuseport programs to ease access to lower header
data.
Daniel Borkmann [Thu, 3 May 2018 23:08:14 +0000 (01:08 +0200)]
bpf: implement ld_abs/ld_ind in native bpf
The main part of this work is to finally allow removal of LD_ABS
and LD_IND from the BPF core by reimplementing them through native
eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF and
keeping them around in native eBPF caused way more trouble than
actually worth it. To just list some of the security issues in
the past:
* fdfaf64e7539 ("x86: bpf_jit: support negative offsets")
* 35607b02dbef ("sparc: bpf_jit: fix loads from negative offsets")
* e0ee9c12157d ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
* 07aee9439454 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call")
* 6d59b7dbf72e ("bpf, s390x: do not reload skb pointers in non-skb context")
* 87338c8e2cbb ("bpf, ppc64: do not reload skb pointers in non-skb context")
For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy
these days due to their limitations and more efficient/flexible
alternatives that have been developed over time such as direct
packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a
register, the load happens in host endianness and its exception
handling can yield unexpected behavior. The latter is explained
in depth in f6b1b3bf0d5f ("bpf: fix subprog verifier bypass by
div/mod by 0 exception") with similar cases of exceptions we had.
In native eBPF more recent program types will disable LD_ABS/LD_IND
altogether through may_access_skb() in verifier, and given the
limitations in terms of exception handling, it's also disabled
in programs that use BPF to BPF calls.
In terms of cBPF, the LD_ABS/LD_IND is used in networking programs
to access packet data. It is not used in seccomp-BPF but programs
that use it for socket filtering or reuseport for demuxing with
cBPF. This is mostly relevant for applications that have not yet
migrated to native eBPF.
The main complexity and source of bugs in LD_ABS/LD_IND is coming
from their implementation in the various JITs. Most of them keep
the model around from cBPF times by implementing a fastpath written
in asm. They use typically two from the BPF program hidden CPU
registers for caching the skb's headlen (skb->len - skb->data_len)
and skb->data. Throughout the JIT phase this requires to keep track
whether LD_ABS/LD_IND are used and if so, the two registers need
to be recached each time a BPF helper would change the underlying
packet data in native eBPF case. At least in eBPF case, available
CPU registers are rare and the additional exit path out of the
asm written JIT helper makes it also inflexible since not all
parts of the JITer are in control from plain C. A LD_ABS/LD_IND
implementation in eBPF therefore allows to significantly reduce
the complexity in JITs with comparable performance results for
them, e.g.:
test_bpf tcpdump port 22 tcpdump complex
x64 - before 15 21 10 14 19 18
- after 7 10 10 7 10 15
arm64 - before 40 91 92 40 91 151
- after 51 64 73 51 62 113
For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter()
and cache the skb's headlen and data in the cBPF prologue. The
BPF_REG_TMP gets remapped from R8 to R2 since it's mainly just
used as a local temporary variable. This allows to shrink the
image on x86_64 also for seccomp programs slightly since mapping
to %rsi is not an ereg. In callee-saved R8 and R9 we now track
skb data and headlen, respectively. For normal prologue emission
in the JITs this does not add any extra instructions since R8, R9
are pushed to stack in any case from eBPF side. cBPF uses the
convert_bpf_ld_abs() emitter which probes the fast path inline
already and falls back to bpf_skb_load_helper_{8,16,32}() helper
relying on the cached skb data and headlen as well. R8 and R9
never need to be reloaded due to bpf_helper_changes_pkt_data()
since all skb access in cBPF is read-only. Then, for the case
of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls
the bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally,
does neither cache skb data and headlen nor has an inlined fast
path. The reason for the latter is that native eBPF does not have
any extra registers available anyway, but even if there were, it
avoids any reload of skb data and headlen in the first place.
Additionally, for the negative offsets, we provide an alternative
bpf_skb_load_bytes_relative() helper in eBPF which operates
similarly as bpf_skb_load_bytes() and allows for more flexibility.
Tested myself on x64, arm64, s390x, from Sandipan on ppc64.
Daniel Borkmann [Thu, 3 May 2018 23:08:13 +0000 (01:08 +0200)]
bpf: migrate ebpf ld_abs/ld_ind tests to test_verifier
Remove all eBPF tests involving LD_ABS/LD_IND from test_bpf.ko. Reason
is that the eBPF tests from test_bpf module do not go via BPF verifier
and therefore any instruction rewrites from verifier cannot take place.
Therefore, move them into test_verifier which runs out of user space,
so that verfier can rewrite LD_ABS/LD_IND internally in upcoming patches.
It will have the same effect since runtime tests are also performed from
there. This also allows to finally unexport bpf_skb_vlan_{push,pop}_proto
and keep it internal to core kernel.
Additionally, also add further cBPF LD_ABS/LD_IND test coverage into
test_bpf.ko suite.