powerpc/preempt: Don't touch the idle task's preempt_count during hotplug
Powerpc currently resets a CPU's idle task preempt_count to 0 before
said task starts executing the secondary startup routine (and becomes an
idle task proper).
This conflicts with commit f1a0a376ca0c ("sched/core: Initialize the
idle task with preemption disabled").
which initializes all of the idle tasks' preempt_count to
PREEMPT_DISABLED during smp_init(). Note that this was superfluous
before said commit, as back then the hotplug machinery would invoke
init_idle() via idle_thread_get(), which would have already reset the
CPU's idle task's preempt_count to PREEMPT_ENABLED.
Commit 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other
atomics in .imm") converted BPF_XADD to BPF_ATOMIC and updated all JIT
implementations to reject JIT'ing instructions with an immediate value
different from BPF_ADD. However, ppc32 BPF JIT was implemented around
the same time and didn't include the same change. Update the ppc32 JIT
accordingly.
Commit 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other
atomics in .imm") converted BPF_XADD to BPF_ATOMIC and added a way to
distinguish instructions based on the immediate field. Existing JIT
implementations were updated to check for the immediate field and to
reject programs utilizing anything more than BPF_ADD (such as BPF_FETCH)
in the immediate field.
However, the check added to powerpc64 JIT did not look at the correct
BPF instruction. Due to this, such programs would be accepted and
incorrectly JIT'ed resulting in soft lockups, as seen with the atomic
bounds test. Fix this by looking at the correct immediate value.
The powerpc kernel is not prepared to handle exec faults from kernel.
Especially, the function is_exec_fault() will return 'false' when an
exec fault is taken by kernel, because the check is based on reading
current->thread.regs->trap which contains the trap from user.
For instance, when provoking a LKDTM EXEC_USERSPACE test,
current->thread.regs->trap is set to SYSCALL trap (0xc00), and
the fault taken by the kernel is not seen as an exec fault by
set_access_flags_filter().
Commit d7df2443cd5f ("powerpc/mm: Fix spurious segfaults on radix
with autonuma") made it clear and handled it properly. But later on
commit d3ca587404b3 ("powerpc/mm: Fix reporting of kernel execute
faults") removed that handling, introducing test based on error_code.
And here is the problem, because on the 603 all upper bits of SRR1
get cleared when the TLB instruction miss handler bails out to ISI.
Until commit cbd7e6ca0210 ("powerpc/fault: Avoid heavy
search_exception_tables() verification"), an exec fault from kernel
at a userspace address was indirectly caught by the lack of entry for
that address in the exception tables. But after that commit the
kernel mainly relies on KUAP or on core mm handling to catch wrong
user accesses. Here the access is not wrong, so mm handles it.
It is a minor fault because PAGE_EXEC is not set,
set_access_flags_filter() should set PAGE_EXEC and voila.
But as is_exec_fault() returns false as explained in the beginning,
set_access_flags_filter() bails out without setting PAGE_EXEC flag,
which leads to a forever minor exec fault.
As the kernel is not prepared to handle such exec faults, the thing to
do is to fire in bad_kernel_fault() for any exec fault taken by the
kernel, as it was prior to commit d3ca587404b3.
Merge tag 'powerpc-5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- A big series refactoring parts of our KVM code, and converting some
to C.
- Support for ARCH_HAS_SET_MEMORY, and ARCH_HAS_STRICT_MODULE_RWX on
some CPUs.
- Support for the Microwatt soft-core.
- Optimisations to our interrupt return path on 64-bit.
- Support for userspace access to the NX GZIP accelerator on PowerVM on
Power10.
- Enable KUAP and KUEP by default on 32-bit Book3S CPUs.
- Other smaller features, fixes & cleanups.
Thanks to: Andy Shevchenko, Aneesh Kumar K.V, Arnd Bergmann, Athira
Rajeev, Baokun Li, Benjamin Herrenschmidt, Bharata B Rao, Christophe
Leroy, Daniel Axtens, Daniel Henrique Barboza, Finn Thain, Geoff Levand,
Haren Myneni, Jason Wang, Jiapeng Chong, Joel Stanley, Jordan Niethe,
Kajol Jain, Nathan Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas
Piggin, Nick Desaulniers, Paul Mackerras, Russell Currey, Sathvika
Vasireddy, Shaokun Zhang, Stephen Rothwell, Sudeep Holla, Suraj Jitindar
Singh, Tom Rix, Vaibhav Jain, YueHaibing, Zhang Jianhua, and Zhen Lei.
* tag 'powerpc-5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (218 commits)
powerpc: Only build restart_table.c for 64s
powerpc/64s: move ret_from_fork etc above __end_soft_masked
powerpc/64s/interrupt: clean up interrupt return labels
powerpc/64/interrupt: add missing kprobe annotations on interrupt exit symbols
powerpc/64: enable MSR[EE] in irq replay pt_regs
powerpc/64s/interrupt: preserve regs->softe for NMI interrupts
powerpc/64s: add a table of implicit soft-masked addresses
powerpc/64e: remove implicit soft-masking and interrupt exit restart logic
powerpc/64e: fix CONFIG_RELOCATABLE build warnings
powerpc/64s: fix hash page fault interrupt handler
powerpc/4xx: Fix setup_kuep() on SMP
powerpc/32s: Fix setup_{kuap/kuep}() on SMP
powerpc/interrupt: Use names in check_return_regs_valid()
powerpc/interrupt: Also use exit_must_hard_disable() on PPC32
powerpc/sysfs: Replace sizeof(arr)/sizeof(arr[0]) with ARRAY_SIZE
powerpc/ptrace: Refactor regs_set_return_{msr/ip}
powerpc/ptrace: Move set_return_regs_changed() before regs_set_return_{msr/ip}
powerpc/stacktrace: Fix spurious "stale" traces in raise_backtrace_ipi()
powerpc/pseries/vas: Include irqdomain.h
powerpc: mark local variables around longjmp as volatile
...
Merge tag 'asm-generic-unaligned-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm/unaligned.h unification from Arnd Bergmann:
"Unify asm/unaligned.h around struct helper
The get_unaligned()/put_unaligned() helpers are traditionally
architecture specific, with the two main variants being the
"access-ok.h" version that assumes unaligned pointer accesses always
work on a particular architecture, and the "le-struct.h" version that
casts the data to a byte aligned type before dereferencing, for
architectures that cannot always do unaligned accesses in hardware.
Based on the discussion linked below, it appears that the access-ok
version is not realiable on any architecture, but the struct version
probably has no downsides. This series changes the code to use the
same implementation on all architectures, addressing the few
exceptions separately"
Link: https://lore.kernel.org/lkml/[email protected]/ Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363 Link: https://lore.kernel.org/lkml/[email protected]/
Link: git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic.git unaligned-rework-v2 Link: https://lore.kernel.org/lkml/CAHk-=whGObOKruA_bU3aPGZfoDqZM1_9wBkwREp0H0FgR-90uQ@mail.gmail.com/
* tag 'asm-generic-unaligned-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
asm-generic: simplify asm/unaligned.h
asm-generic: uaccess: 1-byte access is always aligned
netpoll: avoid put_unaligned() on single character
mwifiex: re-fix for unaligned accesses
apparmor: use get_unaligned() only for multi-byte words
partitions: msdos: fix one-byte get_unaligned()
asm-generic: unaligned always use struct helpers
asm-generic: unaligned: remove byteshift helpers
powerpc: use linux/unaligned/le_struct.h on LE power7
m68k: select CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
sh: remove unaligned access for sh4a
openrisc: always use unaligned-struct header
asm-generic: use asm-generic/unaligned.h for most architectures
Merge tag 'perf-tools-for-v5.14-2021-07-01' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull perf tool updates from Arnaldo Carvalho de Melo:
"Tools:
- Add cgroup support for 'perf top' (-G).
- Add support for KVM MSRs in 'perf kvm stat'
- Support probes on init functions in 'perf probe', to support the
bootconfig format.
- Improve error reporting in 'perf probe'.
- No need to synthesize BUILD_ID records in 'perf inject' if the
MMAP2 records have build ids already.
- Allow toggling source code ('s' hotkey) in 'perf annotate' in all
lines.
- Add itrace options support to 'perf annotate'.
- Support to custom DSO filters for 'perf script'.
Hardware enablement:
- Support the HYBRID_TOPOLOGY and HYBRID_CPU_PMU_CAPS features in the
perf.data file header.
- Support PMU prefix for mem-load and mem-store events, to support
hybrid (BIG little) CPUs such as Intel's Alderlake.
- Support hybrid CPUs in 'perf mem' and 'perf c2c'.
Hardware tracing:
- Intel PT now supports tracing KVM guests.
- Timestamp improvements for ARM's Coresight.
Build:
- Add 'make -C tools/perf build-test' entries for
libopencsd/CORESIGHT=1 and libbpf/LIBBPF_DYNAMIC=1.
- Use bison's --file-prefix-map option to avoid storing full paths
when using O= in the perf build.
Tests:
- Improve the 'perf test' entries for libpfm4 and BPF counters.
Misc:
- Sync msr-index.h, mount.h, kvm headers with the kernel originals.
- Add vendor events and metrics for Intel's Icelake Server & Client"
* tag 'perf-tools-for-v5.14-2021-07-01' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (123 commits)
perf session: Add missing evlist__delete when deleting a session
perf annotate: Allow 's' on source code lines
perf dlfilter: Add object_code() to perf_dlfilter_fns
perf dlfilter: Add attr() to perf_dlfilter_fns
perf dlfilter: Add srcline() to perf_dlfilter_fns
perf dlfilter: Add insn() to perf_dlfilter_fns
perf dlfilter: Add resolve_address() to perf_dlfilter_fns
perf build: Install perf_dlfilter.h
perf script: Add option to pass arguments to dlfilters
perf script: Add option to list dlfilters
perf script: Add dlfilter__filter_event_early()
perf script: Add API for filtering via dynamically loaded shared object
perf llvm: Return -ENOMEM when asprintf() fails
perf cs-etm: Delay decode of non-timeless data until cs_etm__flush_events()
tools headers UAPI: Synch KVM's svm.h header with the kernel
tools kvm headers arm64: Update KVM headers from the kernel sources
tools headers UAPI: Sync linux/kvm.h with the kernel sources
tools headers cpufeatures: Sync with the kernel sources
tools include UAPI: Update linux/mount.h copy
tools arch x86: Sync the msr-index.h copy with the kernel sources
...
Merge more updates from Andrew Morton:
"190 patches.
Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
signals, exec, kcov, selftests, compress/decompress, and ipc"
* emailed patches from Andrew Morton <[email protected]>: (190 commits)
ipc/util.c: use binary search for max_idx
ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
ipc: use kmalloc for msg_queue and shmid_kernel
ipc sem: use kvmalloc for sem_undo allocation
lib/decompressors: remove set but not used variabled 'level'
selftests/vm/pkeys: exercise x86 XSAVE init state
selftests/vm/pkeys: refill shadow register after implicit kernel write
selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
exec: remove checks in __register_bimfmt()
x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
hfsplus: report create_date to kstat.btime
hfsplus: remove unnecessary oom message
nilfs2: remove redundant continue statement in a while-loop
kprobes: remove duplicated strong free_insn_page in x86 and s390
init: print out unknown kernel parameters
checkpatch: do not complain about positive return values starting with EPOLL
checkpatch: improve the indented label test
checkpatch: scripts/spdxcheck.py now requires python3
...
Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu updates from Dennis Zhou:
- percpu chunk depopulation - depopulate backing pages for chunks with
empty pages when we exceed a global threshold without those pages.
This lets us reclaim a portion of memory that would previously be
lost until the full chunk would be freed (possibly never).
- memcg accounting cleanup - previously separate chunks were managed
for normal allocations and __GFP_ACCOUNT allocations. These are now
consolidated which cleans up the code quite a bit.
- a few misc clean ups for clang warnings
* 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
percpu: optimize locking in pcpu_balance_workfn()
percpu: initialize best_upa variable
percpu: rework memcg accounting
mm, memcg: introduce mem_cgroup_kmem_disabled()
mm, memcg: mark cgroup_memory_nosocket, nokmem and noswap as __ro_after_init
percpu: make symbol 'pcpu_free_slot' static
percpu: implement partial chunk depopulation
percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
percpu: factor out pcpu_check_block_hint()
percpu: split __pcpu_balance_workfn()
percpu: fix a comment about the chunks ordering
Merge tag 'mips_5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS updates from Thomas Bogendoerfer:
- add support for OpeneEmbed SOM9331 board
- Ingenic fixes/improvments
- other fixes and cleanups
* tag 'mips_5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (39 commits)
MIPS: Fix PKMAP with 32-bit MIPS huge page support
MIPS: CI20: Add second percpu timer for SMP.
MIPS: CI20: Reduce clocksource to 750 kHz.
MIPS: Ingenic: Add MAC syscon nodes for Ingenic SoCs.
dt-bindings: clock: Add documentation for MAC PHY control bindings.
MIPS: X1830: Respect cell count of common properties.
MIPS: set mips32r5 for virt extensions
MIPS: loongsoon64: Reserve memory below starting pfn to prevent Oops
MIPS: MT extensions are not available on MIPS32r1
mips/kvm: Use BUG_ON instead of if condition followed by BUG
MIPS: OCTEON: octeon-usb: Use devm_platform_get_and_ioremap_resource()
MIPS: add PMD table accounting into MIPS'pmd_alloc_one
MIPS: Loongson64: fix spelling of SPDX tag
MIPS: ingenic: rs90: Add dedicated VRAM memory region
MIPS: ingenic: gcw0: Set codec to cap-less mode for FM radio
MIPS: ingenic: jz4780: Fix I2C nodes to match DT doc
MIPS: ingenic: Select CPU_SUPPORTS_CPUFREQ && MIPS_EXTERNAL_TIMER
MIPS: Kconfig: ingenic: Ensure MACH_INGENIC_GENERIC selects all SoCs
MIPS: cpu-probe: Fix FPU detection on Ingenic JZ4760(B)
MIPS: boot: Support specifying UART port on Ingenic SoCs
...
Merge tag 'pinctrl-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control updates from Linus Walleij:
"This is the bulk of pin control changes for the v5.14 kernel. Not so
much going on. No core changes, just drivers.
The most interesting would be that MIPS Ralink is migrating to pin
control and we have some bindings but not yet code for the Apple M1
pin controller.
New drivers:
- Last merge window we created a driver for the Ralink RT2880. We are
now moving the Ralink SoC pin control drivers out of the MIPS
architecture code and into the pin control subsystem. This concerns
RT288X, MT7620, RT305X, RT3883 and MT7621.
- Qualcomm SM6125 SoC pin control driver.
- Qualcomm spmi-gpio support for PM7325.
- Qualcomm spmi-mpp also handles PMI8994 (just a compatible string)
- Mediatek MT8365 SoC pin controller.
- New device HID for the AMD GPIO controller.
Improvements:
- Pin bias config support for a slew of Renesas pin controllers.
- Incremental improvements and non-urgent bug fixes to the Renesas
SoC drivers.
- Implement irq_set_wake on the AMD pin controller so we can wake up
from external pin events.
Misc:
- Devicetree bindings for the Apple M1 pin controller, we will
probably see a proper driver for this soon as well"
* tag 'pinctrl-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (54 commits)
pinctrl: ralink: rt305x: add missing include
pinctrl: stm32: check for IRQ MUX validity during alloc()
pinctrl: zynqmp: some code cleanups
drivers: qcom: pinctrl: Add pinctrl driver for sm6125
dt-bindings: pinctrl: qcom: sm6125: Document SM6125 pinctrl driver
dt-bindings: pinctrl: mcp23s08: add documentation for reset-gpios
pinctrl: mcp23s08: Add optional reset GPIO
pinctrl: mediatek: fix mode encoding
pinctrl: mcp23s08: Fix missing unlock on error in mcp23s08_irq()
pinctrl: bcm: Constify static pinmux_ops
pinctrl: bcm: Constify static pinctrl_ops
pinctrl: ralink: move RT288X SoC pinmux config into a new 'pinctrl-rt288x.c' file
pinctrl: ralink: move MT7620 SoC pinmux config into a new 'pinctrl-mt7620.c' file
pinctrl: ralink: move RT305X SoC pinmux config into a new 'pinctrl-rt305x.c' file
pinctrl: ralink: move RT3883 SoC pinmux config into a new 'pinctrl-rt3883.c' file
pinctrl: ralink: move MT7621 SoC pinmux config into a new 'pinctrl-mt7621.c' file
pinctrl: ralink: move ralink architecture pinmux header into the driver
pinctrl: single: config: enable the pin's input
pinctrl: mtk: Fix mt8365 Kconfig dependency
pinctrl: mcp23s08: fix race condition in irq handler
...
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"This contains a replacement driver for Intel iWarp hardware. This new
driver supports the old ethernet hardware and also newer chips that
can do ROCE.
Other than that, this contains the typical mix of patches:
- Driver updates and cleanups for bnxt_re, cxgb4, mlx4, and mlx5
- Many static checker driven code clean ups, including a wide
refcount_t conversion
- Several series for the hns driver, more HIP09 HW capabilities,
migration to new HW register manipulators, and code cleanups
- Minor fixes and improvements in srp, rts, and cm
- Improvements throughout for sysfs related code to use
DEVICE_ATTR_*, make the ib_port sysfs first-class, and overall use
sysfs APIs properly
- Intel's new irdma driver replacing i40iw
- rxe general clean ups and Memory Window support"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (211 commits)
RDMA/core: Always release restrack object
RDMA/mlx5: Don't access NULL-cleared mpi pointer
RDMA/irdma: Fix potential overflow expression in irdma_prm_get_pbles
RDMA/irdma: Check contents of user-space irdma_mem_reg_req object
RDMA/rxe: Missing unlock on error in get_srq_wqe()
RDMA/cma: Fix rdma_resolve_route() memory leak
RDMA/core/sa_query: Remove unused argument
RDMA/cma: Fix incorrect Packet Lifetime calculation
RDMA/cma: Protect RMW with qp_mutex
RDMA/cma: Remove unnecessary INIT->INIT transition
RDMA/hns: Add window selection field of congestion control
RDMA/hfi1: Remove use of kmap()
RDMA/irdma: Remove use of kmap()
RDMA/bnxt_re: Fix uninitialized struct bit field rsvd1
IB/isert: Align target max I/O size to initiator size
RDMA/hns: Fix incorrect vlan enable bit in QPC
MAINTAINERS: Update Broadcom RDMA maintainers
RDMA/irdma: Use the queried port attributes
RDMA/rxe: Fix redundant skb_put_zero
RDMA/rxe: Fix extra copy in prepare_ack_packet
...
Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk updates from Stephen Boyd:
"This round has a diffstat dominated by Qualcomm clk drivers. Honestly
though that's just a bunch of data so the diffstat reflects that.
Looking beyond that there's just a bunch of updates all around in
various clk drivers. Renesas and NXP (for i.MX) are two SoC vendors
that have a lot of patches in here.
Overall the driver changes look to be mostly enabling more clks and
non-critical fixes that we could hold until the next merge window.
I'm especially excited about the series from Arnd that graduates
clkdev to be the only implementation of clk_get() and clk_put().
That's a good step in the right direction to migreate eveerything over
to the common clk framework. Now we don't have to worry about clkdev
specific details, they're just part of the clk API now.
Core:
- clkdev is now the only option, i.e. clk_get()/clk_put() is
implemented in only one place in the kernel instead of in
drivers/clk/clkdev.c and in architectures that want their own
implementation
Updates:
- Stop using clock-output-names in ST clk drivers (yay!)
- Support secure mode of STM32MP1 SoCs
- Improve clock support for Actions S500 SoC
- duty cycle setting support on qcom clks
- Add TI am33xx spread spectrum clock support
- Use determine_rate() for the Amlogic pll ops instead of
round_rate()
- Restrict Amlogic gp0/1 and audio plls range on g12a/sm1
- Improve Amlogic axg-audio controller error on deferral
- Add NNA clocks on Amlogic g12a
- Reduce memory footprint of Rockchip PLL rate tables
- A fix for the newly added Rockchip rk3568 clk driver
- Exported clock for the newly added Rockchip video decoder
- Remove audio ipg clock from i.MX8MP
- Remove deprecated legacy clock binding for i.MX SCU clock driver
- Use common clk-imx8qxp for both i.MX8QXP and i.MX8QM
- Add multiple clocks to clk-imx8qxp driver (enet, hdmi, lcdif,
audio, parallel interface)
- Add dedicated clock ops for i.MX paralel interface
- Different fixes for clocks controlled by ATF on i.MX SoCs
- Add A53/A72 frequency scaling support i.MX clk-scu driver
- Add special case for DCSS clock on suspend for i.MX clk-scu driver
- Add parent save/restore on suspend/resume to i.MX clk-scu driver
- Skip runtime PM enablement for CPU clocks in i.MX clk-scu driver
- Remove the sys1_pll/sys2_pll clock gates for i.MX8MQ and their
bindings
- Tegra clk driver no longer deasserts resets on clk_enable as it
gets in the way of certain power-up sequences
- Fix compile testing for Tegra clk driver
- One patch to fix a divider on the Allwinner v3s Audio PLL
- Add support for CPU core clock boost modes on Renesas R-Car Gen3
- Add ISPCS (Image Signal Processor) clocks on Renesas R-Car V3U
- Switch SH/R-Mobile and R-Car "DIV6" clocks to .determine_rate() and
improve support for multiple parents
- Switch Renesas RZ/N1 divider clocks to .determine_rate()
- Add ZA2 (Audio Clock Generator) clock on Renesas R-Car D3
- Convert ar7 to common clk framework
- Convert ralink to common clk framework"
* tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (161 commits)
clk: zynqmp: Handle divider specific read only flag
clk: zynqmp: Use firmware specific mux clock flags
clk: zynqmp: Use firmware specific divider clock flags
clk: zynqmp: Use firmware specific common clock flags
clk: lmk04832: Use of match table
clk: lmk04832: Depend on SPI
clk: stm32mp1: new compatible for secure RCC support
dt-bindings: clock: stm32mp1 new compatible for secure rcc
dt-bindings: reset: add MCU HOLD BOOT ID for SCMI reset domains on stm32mp15
dt-bindings: reset: add IDs for SCMI reset domains on stm32mp15
dt-bindings: clock: add IDs for SCMI clocks on stm32mp15
reset: stm32mp1: remove stm32mp1 reset
clk: hisilicon: Add clock driver for hi3559A SoC
dt-bindings: Document the hi3559a clock bindings
clk: si5341: Add sysfs properties to allow checking/resetting device faults
clk: si5341: Add silabs,iovdd-33 property
clk: si5341: Add silabs,xaxb-ext-clk property
clk: si5341: Allow different output VDD_SEL values
clk: si5341: Update initialization magic
clk: si5341: Check for input clock presence and PLL lock on startup
...
Merge tag 'drm-next-2021-07-01' of git://anongit.freedesktop.org/drm/drm
Pull drm updates from Dave Airlie:
"Highlights:
- AMD enables two more GPUs, with resulting header files
- i915 has started to move to TTM for discrete GPU and enable DG1
discrete GPU support (not by default yet)
- new HyperV drm driver
- vmwgfx adds arm64 support
- TTM refactoring ongoing
- 16bpc display support for AMD hw
Otherwise it's just the usual insane amounts of work all over the
place in lots of drivers and the core, as mostly summarised below:
Core:
- mark AGP ioctls as legacy
- disable force probing for non-master clients
- HDR metadata property helpers
- HDMI infoframe signal colorimetry support
- remove drm_device.pdev pointer
- remove DRM_KMS_FB_HELPER config option
- remove drm_pci_alloc/free
- drm_err_*/drm_dbg_* helpers
- use drm driver names for fbdev
- leaked DMA handle fix
- 16bpc fixed point format fourcc
- add prefetching memcpy for WC
- Documentation fixes
aperture:
- add aperture ownership helpers
dp:
- aux fixes
- downstream 0 port handling
- use extended base receiver capability DPCD
- Rename DP_PSR_SELECTIVE_UPDATE to better mach eDP spec
- mst: use khz as link rate during init
- VCPI fixes for StarTech hub
ttm:
- provide tt_shrink file via debugfs
- warn about freeing pinned BOs
- fix swapping error handling
- move page alignment into BO
- cleanup ttm_agp_backend
- add ttm_sys_manager
- don't override vm_ops
- ttm_bo_mmap removed
- make ttm_resource base of all managers
- remove VM_MIXEDMAP usage
panel:
- sysfs_emit support
- simple: runtime PM support
- simple: power up panel when reading EDID + caching
* tag 'drm-next-2021-07-01' of git://anongit.freedesktop.org/drm/drm: (1570 commits)
drm/i915: Reinstate the mmap ioctl for some platforms
drm/i915/dsc: abstract helpers to get bigjoiner primary/secondary crtc
Revert "drm/msm/mdp5: provide dynamic bandwidth management"
drm/msm/mdp5: provide dynamic bandwidth management
drm/msm/mdp5: add perf blocks for holding fudge factors
drm/msm/mdp5: switch to standard zpos property
drm/msm/mdp5: add support for alpha/blend_mode properties
drm/msm/mdp5: use drm_plane_state for pixel blend mode
drm/msm/mdp5: use drm_plane_state for storing alpha value
drm/msm/mdp5: use drm atomic helpers to handle base drm plane state
drm/msm/dsi: do not enable PHYs when called for the slave DSI interface
drm/msm: Add debugfs to trigger shrinker
drm/msm/dpu: Avoid ABBA deadlock between IRQ modules
drm/msm: devcoredump iommu fault support
iommu/arm-smmu-qcom: Add stall support
drm/msm: Improve the a6xx page fault handler
iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info
iommu/arm-smmu: Add support for driver IOMMU fault handlers
drm/msm: export hangcheck_period in debugfs
drm/msm/a6xx: add support for Adreno 660 GPU
...
Merge tag 'for-5.14/io_uring-2021-06-30' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
- Multi-queue iopoll improvement (Fam)
- Allow configurable io-wq CPU masks (me)
- renameat/linkat tightening (me)
- poll re-arm improvement (Olivier)
- SQPOLL race fix (Olivier)
- Cancelation unification (Pavel)
- SQPOLL cleanups (Pavel)
- Enable file backed buffers for shmem/memfd (Pavel)
- A ton of cleanups and performance improvements (Pavel)
- Followup and misc fixes (Colin, Fam, Hao, Olivier)
* tag 'for-5.14/io_uring-2021-06-30' of git://git.kernel.dk/linux-block: (83 commits)
io_uring: code clean for kiocb_done()
io_uring: spin in iopoll() only when reqs are in a single queue
io_uring: pre-initialise some of req fields
io_uring: refactor io_submit_flush_completions
io_uring: optimise hot path restricted checks
io_uring: remove not needed PF_EXITING check
io_uring: mainstream sqpoll task_work running
io_uring: refactor io_arm_poll_handler()
io_uring: reduce latency by reissueing the operation
io_uring: add IOPOLL and reserved field checks to IORING_OP_UNLINKAT
io_uring: add IOPOLL and reserved field checks to IORING_OP_RENAMEAT
io_uring: refactor io_openat2()
io_uring: simplify struct io_uring_sqe layout
io_uring: update sqe layout build checks
io_uring: fix code style problems
io_uring: refactor io_sq_thread()
io_uring: don't change sqpoll creds if not needed
io_uring: Create define to modify a SQPOLL parameter
io_uring: Fix race condition when sqp thread goes to sleep
io_uring: improve in tctx_task_work() resubmission
...
Riccardo Mancini [Thu, 24 Jun 2021 23:19:25 +0000 (01:19 +0200)]
perf session: Add missing evlist__delete when deleting a session
ASan reports a memory leak caused by evlist not being deleted on exit in
perf-report, perf-script and perf-data.
The problem is caused by evlist->session not being deleted, which is
allocated in perf_session__read_header, called in perf_session__new if
perf_data is in read mode.
In case of write mode, the session->evlist is filled by the caller.
This patch solves the problem by calling evlist__delete in
perf_session__delete if perf_data is in read mode.
Changes in v2:
- call evlist__delete from within perf_session__delete
Indirect leak of 2704 byte(s) in 1 object(s) allocated from:
#0 0x4f4137 in calloc (/home/user/linux/tools/perf/perf+0x4f4137)
#1 0xbe3d56 in zalloc /home/user/linux/tools/lib/perf/../../lib/zalloc.c:8:9
#2 0x7f999e in evlist__new /home/user/linux/tools/perf/util/evlist.c:77:26
#3 0x8ad938 in perf_session__read_header /home/user/linux/tools/perf/util/header.c:3797:20
#4 0x8ec714 in perf_session__open /home/user/linux/tools/perf/util/session.c:109:6
#5 0x8ebe83 in perf_session__new /home/user/linux/tools/perf/util/session.c:213:10
#6 0x60c6de in cmd_script /home/user/linux/tools/perf/builtin-script.c:3856:12
#7 0x7b2930 in run_builtin /home/user/linux/tools/perf/perf.c:313:11
#8 0x7b120f in handle_internal_command /home/user/linux/tools/perf/perf.c:365:8
#9 0x7b2493 in run_argv /home/user/linux/tools/perf/perf.c:409:2
#10 0x7b0c89 in main /home/user/linux/tools/perf/perf.c:539:3
#11 0x7f5260654b74 (/lib64/libc.so.6+0x27b74)
Indirect leak of 568 byte(s) in 1 object(s) allocated from:
#0 0x4f4137 in calloc (/home/user/linux/tools/perf/perf+0x4f4137)
#1 0xbe3d56 in zalloc /home/user/linux/tools/lib/perf/../../lib/zalloc.c:8:9
#2 0x80ce88 in evsel__new_idx /home/user/linux/tools/perf/util/evsel.c:268:24
#3 0x8aed93 in evsel__new /home/user/linux/tools/perf/util/evsel.h:210:9
#4 0x8ae07e in perf_session__read_header /home/user/linux/tools/perf/util/header.c:3853:11
#5 0x8ec714 in perf_session__open /home/user/linux/tools/perf/util/session.c:109:6
#6 0x8ebe83 in perf_session__new /home/user/linux/tools/perf/util/session.c:213:10
#7 0x60c6de in cmd_script /home/user/linux/tools/perf/builtin-script.c:3856:12
#8 0x7b2930 in run_builtin /home/user/linux/tools/perf/perf.c:313:11
#9 0x7b120f in handle_internal_command /home/user/linux/tools/perf/perf.c:365:8
#10 0x7b2493 in run_argv /home/user/linux/tools/perf/perf.c:409:2
#11 0x7b0c89 in main /home/user/linux/tools/perf/perf.c:539:3
#12 0x7f5260654b74 (/lib64/libc.so.6+0x27b74)
Indirect leak of 264 byte(s) in 1 object(s) allocated from:
#0 0x4f4137 in calloc (/home/user/linux/tools/perf/perf+0x4f4137)
#1 0xbe3d56 in zalloc /home/user/linux/tools/lib/perf/../../lib/zalloc.c:8:9
#2 0xbe3e70 in xyarray__new /home/user/linux/tools/lib/perf/xyarray.c:10:23
#3 0xbd7754 in perf_evsel__alloc_id /home/user/linux/tools/lib/perf/evsel.c:361:21
#4 0x8ae201 in perf_session__read_header /home/user/linux/tools/perf/util/header.c:3871:7
#5 0x8ec714 in perf_session__open /home/user/linux/tools/perf/util/session.c:109:6
#6 0x8ebe83 in perf_session__new /home/user/linux/tools/perf/util/session.c:213:10
#7 0x60c6de in cmd_script /home/user/linux/tools/perf/builtin-script.c:3856:12
#8 0x7b2930 in run_builtin /home/user/linux/tools/perf/perf.c:313:11
#9 0x7b120f in handle_internal_command /home/user/linux/tools/perf/perf.c:365:8
#10 0x7b2493 in run_argv /home/user/linux/tools/perf/perf.c:409:2
#11 0x7b0c89 in main /home/user/linux/tools/perf/perf.c:539:3
#12 0x7f5260654b74 (/lib64/libc.so.6+0x27b74)
Indirect leak of 32 byte(s) in 1 object(s) allocated from:
#0 0x4f4137 in calloc (/home/user/linux/tools/perf/perf+0x4f4137)
#1 0xbe3d56 in zalloc /home/user/linux/tools/lib/perf/../../lib/zalloc.c:8:9
#2 0xbd77e0 in perf_evsel__alloc_id /home/user/linux/tools/lib/perf/evsel.c:365:14
#3 0x8ae201 in perf_session__read_header /home/user/linux/tools/perf/util/header.c:3871:7
#4 0x8ec714 in perf_session__open /home/user/linux/tools/perf/util/session.c:109:6
#5 0x8ebe83 in perf_session__new /home/user/linux/tools/perf/util/session.c:213:10
#6 0x60c6de in cmd_script /home/user/linux/tools/perf/builtin-script.c:3856:12
#7 0x7b2930 in run_builtin /home/user/linux/tools/perf/perf.c:313:11
#8 0x7b120f in handle_internal_command /home/user/linux/tools/perf/perf.c:365:8
#9 0x7b2493 in run_argv /home/user/linux/tools/perf/perf.c:409:2
#10 0x7b0c89 in main /home/user/linux/tools/perf/perf.c:539:3
#11 0x7f5260654b74 (/lib64/libc.so.6+0x27b74)
Indirect leak of 7 byte(s) in 1 object(s) allocated from:
#0 0x4b8207 in strdup (/home/user/linux/tools/perf/perf+0x4b8207)
#1 0x8b4459 in evlist__set_event_name /home/user/linux/tools/perf/util/header.c:2292:16
#2 0x89d862 in process_event_desc /home/user/linux/tools/perf/util/header.c:2313:3
#3 0x8af319 in perf_file_section__process /home/user/linux/tools/perf/util/header.c:3651:9
#4 0x8aa6e9 in perf_header__process_sections /home/user/linux/tools/perf/util/header.c:3427:9
#5 0x8ae3e7 in perf_session__read_header /home/user/linux/tools/perf/util/header.c:3886:2
#6 0x8ec714 in perf_session__open /home/user/linux/tools/perf/util/session.c:109:6
#7 0x8ebe83 in perf_session__new /home/user/linux/tools/perf/util/session.c:213:10
#8 0x60c6de in cmd_script /home/user/linux/tools/perf/builtin-script.c:3856:12
#9 0x7b2930 in run_builtin /home/user/linux/tools/perf/perf.c:313:11
#10 0x7b120f in handle_internal_command /home/user/linux/tools/perf/perf.c:365:8
#11 0x7b2493 in run_argv /home/user/linux/tools/perf/perf.c:409:2
#12 0x7b0c89 in main /home/user/linux/tools/perf/perf.c:539:3
#13 0x7f5260654b74 (/lib64/libc.so.6+0x27b74)
SUMMARY: AddressSanitizer: 3728 byte(s) leaked in 7 allocation(s).
Riccardo Mancini [Thu, 24 Jun 2021 22:34:22 +0000 (00:34 +0200)]
perf annotate: Allow 's' on source code lines
In perf annotate, when 's' is pressed on a line containing source code,
it shows the message "Only available for assembly lines".
This patch gets rid of the error, moving the cursr to the next available
asm line (or the closest previous one if no asm line is found moving
forwards), before hiding source code lines.
Changes in v2:
- handle case of no asm line found in
annotate_browser__find_next_asm_line by returning NULL and
handling error in caller.
Adrian Hunter [Sun, 27 Jun 2021 13:18:11 +0000 (16:18 +0300)]
perf script: Add option to list dlfilters
Add option --list-dlfilters to list dlfilters in the current directory or
the exec-path e.g. ~/libexec/perf-core/dlfilters. Use with option -v (must
come before option --list-dlfilters) to show long descriptions.
Adrian Hunter [Sun, 27 Jun 2021 13:18:10 +0000 (16:18 +0300)]
perf script: Add dlfilter__filter_event_early()
filter_event_early() can be more than 30% faster than filter_event()
because it is called before internal filtering. In other respects it
is the same as filter_event(), except that it will be passed events
that have yet to be filtered out.
Adrian Hunter [Sun, 27 Jun 2021 13:18:09 +0000 (16:18 +0300)]
perf script: Add API for filtering via dynamically loaded shared object
In some cases, users want to filter very large amounts of data (e.g.
from AUX area tracing like Intel PT) looking for something specific.
While scripting such as Python can be used, Python is 10 to 20 times
slower than C. So define a C API so that custom filters can be written
and loaded.
Zhihao sent a patch but it made llvm__compile_bpf() return what
asprintf() returns on error, which is just -1, but since this function
returns -errno, fix it by returning -ENOMEM for this case instead.
James Clark [Wed, 9 Jun 2021 13:04:20 +0000 (16:04 +0300)]
perf cs-etm: Delay decode of non-timeless data until cs_etm__flush_events()
Currently, timeless mode starts the decode on PERF_RECORD_EXIT, and
non-timeless mode starts decoding on the fist PERF_RECORD_AUX record.
This can cause the "data has no samples!" error if the first
PERF_RECORD_AUX record comes before the first (or any relevant)
PERF_RECORD_MMAP2 record because the mmaps are required by the decoder
to access the binary data.
This change pushes the start of non-timeless decoding to the very end of
parsing the file. The PERF_RECORD_EXIT event can't be used because it
might not exist in system-wide or snapshot modes.
I have not been able to find the exact cause for the events to be
intermittently in the wrong order in the basic scenario:
perf record -e cs_etm/@tmc_etr0/u top
But it can be made to happen every time with the --delay option. This is
because "enable_on_exec" is disabled, which causes tracing to start
before the process to be launched is exec'd. For example:
perf record -e cs_etm/@tmc_etr0/u --delay=1 top
perf report -D | grep 'AUX\|MAP'
Another scenario in which decoding from the first aux record fails is a
workload that forks. Although the aux record comes after 'bash', it
comes before 'top', which is what we are interested in. For example:
perf record -e cs_etm/@tmc_etr0/u -- bash -c top
perf report -D | grep 'AUX\|MAP'
A third scenario is when the majority of time is spent in a shared
library that is not loaded at startup. For example a dynamically loaded
plugin.
Testing
=======
Testing was done by checking if any samples that are present in the
old output are missing from the new output. Timestamps must be
stripped out with awk because now they are set to the last AUX sample,
rather than the first:
Testing showed that the new output is a superset of the old. When lines
appear in the comm output, it is not because they are missing but
because [unknown] is now resolved to sensible locations. For example
last putp branch here now resolves to libtinfo, so it's not missing
from the output, but is actually improved:
In the following two modes, decoding now works and the "data has no
samples!" error is not displayed any more:
perf record -e cs_etm/@tmc_etr0/u -- bash -c top
perf record -e cs_etm/@tmc_etr0/u --delay=1 top
In snapshot mode, there is also an improvement to decoding. Previously
samples for the 'kill' process that was used to send SIGUSR2 were
completely missing, because the process hadn't started yet. But now
there are additional samples present:
perf record -e cs_etm/@tmc_etr0/u --snapshot -a
perf script
Warning: Kernel ABI header at 'tools/arch/x86/include/uapi/asm/svm.h' differs from latest version at 'arch/x86/include/uapi/asm/svm.h'
diff -u tools/arch/x86/include/uapi/asm/svm.h arch/x86/include/uapi/asm/svm.h
tools kvm headers arm64: Update KVM headers from the kernel sources
To pick the changes from:
f0376edb1ddcab19 ("KVM: arm64: Add ioctl to fetch/store tags in a guest")
That don't causes any changes in tooling (when built on x86), only
addresses this perf build warning:
Warning: Kernel ABI header at 'tools/arch/arm64/include/uapi/asm/kvm.h' differs from latest version at 'arch/arm64/include/uapi/asm/kvm.h'
diff -u tools/arch/arm64/include/uapi/asm/kvm.h arch/arm64/include/uapi/asm/kvm.h
Warning: Kernel ABI header at 'tools/arch/x86/include/uapi/asm/kvm.h' differs from latest version at 'arch/x86/include/uapi/asm/kvm.h'
diff -u tools/arch/x86/include/uapi/asm/kvm.h arch/x86/include/uapi/asm/kvm.h
Warning: Kernel ABI header at 'tools/include/uapi/linux/kvm.h' differs from latest version at 'include/uapi/linux/kvm.h'
diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h
tools headers cpufeatures: Sync with the kernel sources
To pick the changes from:
1348924ba8169f35 ("x86/msr: Define new bits in TSX_FORCE_ABORT MSR") cbcddaa33d7e11a0 ("perf/x86/rapl: Use CPUID bit on AMD and Hygon parts")
This only causes these perf files to be rebuilt:
CC /tmp/build/perf/bench/mem-memcpy-x86-64-asm.o
CC /tmp/build/perf/bench/mem-memset-x86-64-asm.o
And addresses this perf build warning:
Warning: Kernel ABI header at 'tools/arch/x86/include/asm/cpufeatures.h' differs from latest version at 'arch/x86/include/asm/cpufeatures.h'
diff -u tools/arch/x86/include/asm/cpufeatures.h arch/x86/include/asm/cpufeatures.h
dd8b477f9a3d8edb ("mount: Support "nosymfollow" in new mount api")
That ends up adding support for the new MOUNT_ATTR_NOSYMFOLLOW mount
attribute:
$ tools/perf/trace/beauty/fsmount.sh > before
$ cp include/uapi/linux/mount.h tools/include/uapi/linux/mount.h
$ tools/perf/trace/beauty/fsmount.sh > after
$ diff -u before after
--- before 2021-07-01 13:34:04.542517355 -0300
+++ after 2021-07-01 13:34:12.423694537 -0300
@@ -7,4 +7,5 @@
[ilog2(0x00000020) + 1] = "STRICTATIME",
[ilog2(0x00000080) + 1] = "NODIRATIME",
[ilog2(0x00100000) + 1] = "IDMAP",
+ [ilog2(0x00200000) + 1] = "NOSYMFOLLOW",
};
$
So now one can use it in --filter expressions for tracepoints.
This silences this perf build warnings:
Warning: Kernel ABI header at 'tools/include/uapi/linux/mount.h' differs from latest version at 'include/uapi/linux/mount.h'
diff -u tools/include/uapi/linux/mount.h include/uapi/linux/mount.h
tools arch x86: Sync the msr-index.h copy with the kernel sources
To pick up the changes from these csets:
1348924ba8169f35 ("x86/msr: Define new bits in TSX_FORCE_ABORT MSR")
That cause no changes to tooling:
$ tools/perf/trace/beauty/tracepoints/x86_msr.sh > before
$ cp arch/x86/include/asm/msr-index.h tools/arch/x86/include/asm/msr-index.h
$ tools/perf/trace/beauty/tracepoints/x86_msr.sh > after
$ diff -u before after
$
Just silences this perf build warning:
Warning: Kernel ABI header at 'tools/arch/x86/include/asm/msr-index.h' differs from latest version at 'arch/x86/include/asm/msr-index.h'
diff -u tools/arch/x86/include/asm/msr-index.h arch/x86/include/asm/msr-index.h
Leo Yan [Wed, 19 May 2021 07:19:39 +0000 (15:19 +0800)]
perf arm-spe: Don't wait for PERF_RECORD_EXIT event
When decode Arm SPE trace, it waits for PERF_RECORD_EXIT event (the last
perf event) for processing trace data, which is needless and even might
cause logic error, e.g. it might fail to correlate perf events with Arm
SPE events correctly.
So this patch removes the condition checking for PERF_RECORD_EXIT event.
Leo Yan [Wed, 19 May 2021 07:19:38 +0000 (15:19 +0800)]
perf arm-spe: Bail out if the trace is later than perf event
It's possible that record in Arm SPE trace is later than perf event and
vice versa. This asks to correlate the perf events and Arm SPE
synthesized events to be processed in the manner of correct timing.
To achieve the time ordering, this patch reverses the flow, it firstly
calls arm_spe_sample() and then calls arm_spe_decode(). By comparing
the timestamp value and detect the perf event is coming earlier than Arm
SPE trace data, it bails out from the decoding loop, the last record is
pushed into auxtrace stack and is deferred to generate sample. To track
the timestamp, everytime it updates timestamp for the latest record.
Leo Yan [Wed, 19 May 2021 07:19:37 +0000 (15:19 +0800)]
perf arm-spe: Assign kernel time to synthesized event
In current code, it assigns the arch timer counter to the synthesized
samples Arm SPE trace, thus the samples don't contain the kernel time
but only contain the raw counter value.
To fix the issue, this patch converts the timer counter to kernel time
and assigns it to sample timestamp.
Leo Yan [Wed, 19 May 2021 07:19:36 +0000 (15:19 +0800)]
perf arm-spe: Convert event kernel time to counter value
When handle a perf event, Arm SPE decoder needs to decide if this perf
event is earlier or later than the samples from Arm SPE trace data; to
do comparision, it needs to use the same unit for the time.
This patch converts the event kernel time to arch timer's counter value,
thus it can be used to compare with counter value contained in Arm SPE
Timestamp packet.
Leo Yan [Wed, 19 May 2021 07:19:35 +0000 (15:19 +0800)]
perf arm-spe: Save clock parameters from TIME_CONV event
During the recording phase, "perf record" tool synthesizes event
PERF_RECORD_TIME_CONV for the hardware clock parameters and saves the
event into the data file.
Afterwards, when processing the data file, the event TIME_CONV will be
processed at the very early time and is stored into session context.
This patch extracts these parameters from the session context and saves
into the structure "spe->tc" with the type perf_tsc_conversion, so that
the parameters are ready for conversion between clock counter and time
stamp.
The callback cs_etm_find_snapshot() is invoked for snapshot mode, its
main purpose is to find the correct AUX trace data and returns "head"
and "old" (we can call "old" as "old head") to the caller, the caller
__auxtrace_mmap__read() uses these two pointers to decide the AUX trace
data size.
This patch removes cs_etm_find_snapshot() with below reasons:
- The first thing in cs_etm_find_snapshot() is to check if the head has
wrapped around, if it is not, directly bails out. The checking is
pointless, this is because the "head" and "old" pointers both are
monotonical increasing so they never wrap around.
- cs_etm_find_snapshot() adjusts the "head" and "old" pointers and
assumes the AUX ring buffer is fully filled with the hardware trace
data, so it always subtracts the difference "mm->len" from "head" to
get "old". Let's imagine the snapshot is taken in very short
interval, the tracers only fill a small chunk of the trace data into
the AUX ring buffer, in this case, it's wrongly to copy the whole the
AUX ring buffer to perf file.
- As the "head" and "old" pointers are monotonically increased, the
function __auxtrace_mmap__read() handles these two pointers properly.
It calculates the reminders for these two pointers, and the size is
clamped to be never more than "snapshot_size". We can simply reply on
the function __auxtrace_mmap__read() to calculate the correct result
for data copying, it's not necessary to add Arm CoreSight specific
callback.
Merge tag 'fs_for_v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull misc fs updates from Jan Kara:
"The new quotactl_fd() syscall (remake of quotactl_path() syscall that
got introduced & disabled in 5.13 cycle), and couple of udf, reiserfs,
isofs, and writeback fixes and cleanups"
* tag 'fs_for_v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
writeback: fix obtain a reference to a freeing memcg css
quota: remove unnecessary oom message
isofs: remove redundant continue statement
quota: Wire up quotactl_fd syscall
quota: Change quotactl_path() systcall to an fd-based one
reiserfs: Remove unneed check in reiserfs_write_full_page()
udf: Fix NULL pointer dereference in udf_symlink function
reiserfs: add check for invalid 1st journal block
If semctl(), msgctl() and shmctl() are called with IPC_INFO, SEM_INFO,
MSG_INFO or SHM_INFO, then the return value is the index of the highest
used index in the kernel's internal array recording information about all
SysV objects of the requested type for the current namespace. (This
information can be used with repeated ..._STAT or ..._STAT_ANY operations
to obtain information about all SysV objects on the system.)
There is a cache for this value. But when the cache needs up be updated,
then the highest used index is determined by looping over all possible
values. With the introduction of IPCMNI_EXTEND_SHIFT, this could be a
loop over 16 million entries. And due to /proc/sys/kernel/*next_id, the
index values do not need to be consecutive.
With <write 16000000 to msg_next_id>, msgget(), msgctl(,IPC_RMID) in a
loop, I have observed a performance increase of around factor 13000.
As there is no get_last() function for idr structures: Implement a
"get_last()" using a binary search.
As far as I see, ipc is the only user that needs get_last(), thus
implement it in ipc/util.c and not in a central location.
ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
The patch solves three weaknesses in ipc/sem.c:
1) The initial read of use_global_lock in sem_lock() is an intentional
race. KCSAN detects these accesses and prints a warning.
2) The code assumes that plain C read/writes are not mangled by the CPU
or the compiler.
3) The comment it sysvipc_sem_proc_show() was hard to understand: The
rest of the comments in ipc/sem.c speaks about sem_perm.lock, and
suddenly this function speaks about ipc_lock_object().
To solve 1) and 2), use READ_ONCE()/WRITE_ONCE(). Plain C reads are used
in code that owns sma->sem_perm.lock.
msg_queue and shmid_kernel are quite small objects, no need to use
kvmalloc for them. mhocko@: "Both of them are 256B on most 64b systems."
Previously these objects was allocated via ipc_alloc/ipc_rcu_alloc(),
common function for several ipc objects. It had kvmalloc call inside().
Later, this function went away and was finally replaced by direct kvmalloc
call, and now we can use more suitable kmalloc/kfree for them.
Some ipc objects use the wrong allocation functions: small objects can use
kmalloc(), and vice versa, potentially large objects can use kmalloc().
This patch (of 2):
Size of sem_undo can exceed one page and with the maximum possible nsems =
32000 it can grow up to 64Kb. Let's switch its allocation to kvmalloc to
avoid user-triggered disruptive actions like OOM killer in case of
high-order memory shortage.
User triggerable high order allocations are quite a problem on heavily
fragmented systems. They can be a DoS vector.
Dave Hansen [Thu, 1 Jul 2021 01:57:03 +0000 (18:57 -0700)]
selftests/vm/pkeys: exercise x86 XSAVE init state
On x86, there is a set of instructions used to save and restore register
state collectively known as the XSAVE architecture. There are about a
dozen different features managed with XSAVE. The protection keys
register, PKRU, is one of those features.
The hardware optimizes XSAVE by tracking when the state has not changed
from its initial (init) state. In this case, it can avoid the cost of
writing state to memory (it would usually just be a bunch of 0's).
When the pkey register is 0x0 the hardware optionally choose to track the
register as being in the init state (optimize away the writes). AMD CPUs
do this more aggressively compared to Intel.
On x86, PKRU is rarely in its (very permissive) init state. Instead, the
value defaults to something very restrictive. It is not surprising that
bugs have popped up in the rare cases when PKRU reaches its init state.
Add a protection key selftest which gets the protection keys register into
its init state in a way that should work on Intel and AMD. Then, do a
bunch of pkey register reads to watch for inadvertent changes.
This adds "-mxsave" to CFLAGS for all the x86 vm selftests in order to
allow use of the XSAVE instruction __builtin functions. This will make
the builtins available on all of the vm selftests, but is expected to be
harmless.
Dave Hansen [Thu, 1 Jul 2021 01:56:59 +0000 (18:56 -0700)]
selftests/vm/pkeys: refill shadow register after implicit kernel write
The pkey test code keeps a "shadow" of the pkey register around. This
ensures that any bugs which might write to the register can be caught more
quickly.
Generally, userspace has a good idea when the kernel is going to write to
the register. For instance, alloc_pkey() is passed a permission mask.
The caller of alloc_pkey() can update the shadow based on the return value
and the mask.
But, the kernel can also modify the pkey register in a more sneaky way.
For mprotect(PROT_EXEC) mappings, the kernel will allocate a pkey and
write the pkey register to create an execute-only mapping. The kernel
never tells userspace what key it uses for this.
This can cause the test to fail with messages like:
The alloc_pkey() sefltest function wraps the sys_pkey_alloc() system call.
On success, it updates its "shadow" register value because
sys_pkey_alloc() updates the real register.
But, the success check is wrong. pkey_alloc() considers any non-zero
return code to indicate success where the pkey register will be modified.
This fails to take negative return codes into account.
Consider only a positive return value as a successful call.
Dave Hansen [Thu, 1 Jul 2021 01:56:53 +0000 (18:56 -0700)]
selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
Patch series "selftests/vm/pkeys: Bug fixes and a new test".
There has been a lot of activity on the x86 front around the XSAVE
architecture which is used to context-switch processor state (among other
things). In addition, AMD has recently joined the protection keys club by
adding processor support for PKU.
The AMD implementation helped uncover a kernel bug around the PKRU "init
state", which actually applied to Intel's implementation but was just
harder to hit. This series adds a test which is expected to help find
this class of bug both on AMD and Intel. All the work around pkeys on x86
also uncovered a few bugs in the selftest.
This patch (of 4):
The "random" pkey allocation code currently does the good old:
srand((unsigned int)time(NULL));
*But*, it unfortunately does this on every random pkey allocation.
There may be thousands of these a second. time() has a one second
resolution. So, each time alloc_random_pkey() is called, the PRNG is
*RESET* to time(). This is nasty. Normally, if you do:
srand(<ANYTHING>);
foo = rand();
bar = rand();
You'll be quite guaranteed that 'foo' and 'bar' are different. But, if
you do:
srand(1);
foo = rand();
srand(1);
bar = rand();
You are quite guaranteed that 'foo' and 'bar' are the *SAME*. The recent
"fix" effectively forced the test case to use the same "random" pkey for
the whole test, unless the test run crossed a second boundary.
Only run srand() once at program startup.
This explains some very odd and persistent test failures I've been seeing.
Marco Elver [Thu, 1 Jul 2021 01:56:49 +0000 (18:56 -0700)]
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
Until now no compiler supported an attribute to disable coverage
instrumentation as used by KCOV.
To work around this limitation on x86, noinstr functions have their
coverage instrumentation turned into nops by objtool. However, this
solution doesn't scale automatically to other architectures, such as
arm64, which are migrating to use the generic entry code.
Add __no_sanitize_coverage for both compilers, and add it to noinstr.
Note: In the Clang case, __has_feature(coverage_sanitizer) is only true if
the feature is enabled, and therefore we do not require an additional
defined(CONFIG_KCOV) (like in the GCC case where __has_attribute(..) is
always true) to avoid adding redundant attributes to functions if KCOV is
off. That being said, compilers that support the attribute will not
generate errors/warnings if the attribute is redundantly used; however,
where possible let's avoid it as it reduces preprocessed code size and
associated compile-time overheads.
Al Viro [Thu, 1 Jul 2021 01:56:43 +0000 (18:56 -0700)]
x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
Currently we handle SS_AUTODISARM as soon as we have stored the altstack
settings into sigframe - that's the point when we have set the things up
for eventual sigreturn to restore the old settings. And if we manage to
set the sigframe up (we are not done with that yet), everything's fine.
However, in case of failure we end up with sigframe-to-be abandoned and
SIGSEGV force-delivered. And in that case we end up with inconsistent
rules - late failures have altstack reset, early ones do not.
It's trivial to get consistent behaviour - just handle SS_AUTODISARM once
we have set the sigframe up and are committed to entering the handler,
i.e. in signal_delivered().
Barry Song [Thu, 1 Jul 2021 01:56:31 +0000 (18:56 -0700)]
kprobes: remove duplicated strong free_insn_page in x86 and s390
free_insn_page() in x86 and s390 is same with the common weak function in
kernel/kprobes.c. Plus, the comment "Recover page to RW mode before
releasing it" in x86 seems insensible to be there since resetting mapping
is done by common code in vfree() of module_memfree(). So drop these two
duplicated strong functions and related comment, then mark the common one
in kernel/kprobes.c strong.
Andrew Halaney [Thu, 1 Jul 2021 01:56:28 +0000 (18:56 -0700)]
init: print out unknown kernel parameters
It is easy to foobar setting a kernel parameter on the command line
without realizing it, there's not much output that you can use to assess
what the kernel did with that parameter by default.
Make it a little more explicit which parameters on the command line
_looked_ like a valid parameter for the kernel, but did not match anything
and ultimately got tossed to init. This is very similar to the unknown
parameter message received when loading a module.
This assumes the parameters are processed in a normal fashion, some
parameters (dyndbg= for example) don't register their parameter with the
rest of the kernel's parameters, and therefore always show up in this list
(and are also given to init - like the rest of this list).
Another example is BOOT_IMAGE= is highlighted as an offender, which it
technically is, but is passed by LILO and GRUB so most systems will see
that complaint.
An example output where "foobared" and "unrecognized" are intentionally
invalid parameters:
checkpatch: do not complain about positive return values starting with EPOLL
checkpatch complains about positive return values of poll functions.
Example:
WARNING: return of an errno should typically be negative (ie: return -EPOLLIN)
+ return EPOLLIN;
Poll functions return positive values. The defines for the return values
of poll functions all start with EPOLL, resulting in a number of false
positives. An often used workaround is to assign poll function return
values to variables and returning that variable, but that is a less than
perfect solution.
There is no error definition which starts with EPOLL, so it is safe to
omit the warning for return values starting with EPOLL.
Joe Perches [Thu, 1 Jul 2021 01:56:22 +0000 (18:56 -0700)]
checkpatch: improve the indented label test
checkpatch identifies a label only when a terminating colon
immediately follows an identifier.
Bitfield definitions can appear to be labels so ignore any
spaces between the identifier terminating colon and any digit
that may be used to define a bitfield length.
Miscellanea:
o Improve the initial checkpatch comment
o Use the more typical '&&' instead of 'and'
o Require the initial label character to be a non-digit
(Can't use $Ident here because $Ident allows ## concatenation)
o Use $sline instead of $line to ignore comments
o Use '$sline !~ /.../' instead of '!($line =~ /.../)'
checkpatch: scripts/spdxcheck.py now requires python3
Since commit d0259c42abff ("spdxcheck.py: Use Python 3"), spdxcheck.py
explicitly expects to run as python3 script. If "python" still points to
python v2.7 and the script is executed with "python scripts/spdxcheck.py",
the following error may be seen even if git-python is installed for
python3.
Traceback (most recent call last):
File "scripts/spdxcheck.py", line 10, in <module>
import git
ImportError: No module named git
To fix the problem, check for the existence of python3, check if
the script is executable and not just for its existence, and execute
it directly.
lib/decompress_unlz4.c: correctly handle zero-padding around initrds.
lz4 compatible decompressor is simple. The format is underspecified and
relies on EOF notification to determine when to stop. Initramfs buffer
format[1] explicitly states that it can have arbitrary number of zero
padding. Thus when operating without a fill function, be extra careful to
ensure that sizes less than 4, or apperantly empty chunksizes are treated
as EOF.
To test this I have created two cpio initrds, first a normal one,
main.cpio. And second one with just a single /test-file with content
"second" second.cpio. Then i compressed both of them with gzip, and with
lz4 -l. Then I created a padding of 4 bytes (dd if=/dev/zero of=pad4 bs=1
count=4). To create four testcase initrds:
The pad4 test-cases replicate the initrd load by grub, as it pads and
aligns every initrd it loads.
All of the above boot, however /test-file was not accessible in the initrd
for the testcase #4, as decoding in lz4 decompressor failed. Also an
error message printed which usually is harmless.
Whith a patched kernel, all of the above testcases now pass, and
/test-file is accessible.
This fixes lz4 initrd decompress warning on every boot with grub. And
more importantly this fixes inability to load multiple lz4 compressed
initrds with grub. This patch has been shipping in Ubuntu kernels since
January 2021.
Andy Shevchenko [Thu, 1 Jul 2021 01:56:10 +0000 (18:56 -0700)]
kernel.h: split out kstrtox() and simple_strtox() to a separate header
kernel.h is being used as a dump for all kinds of stuff for a long time.
Here is the attempt to start cleaning it up by splitting out kstrtox() and
simple_strtox() helpers.
At the same time convert users in header and lib folders to use new
header. Though for time being include new header back to kernel.h to
avoid twisted indirected includes for existing users.
The test_string module can't be removed because it lacks an exit hook.
Since there is no reason for it to be permanent, add an empty one to allow
module removal.
If the input is out of the range of the allowed values, either larger than
the largest value or closer to zero than the smallest non-zero allowed
value, then a division by zero would occur.
In the case of input too large, the division by zero will occur on the
first iteration. The best result (largest allowed value) will be found by
always choosing the semi-convergent and excluding the denominator based
limit when finding it.
In the case of the input too small, the division by zero will occur on the
second iteration. The numerator based semi-convergent should not be
calculated to avoid the division by zero. But the semi-convergent vs
previous convergent test is still needed, which effectively chooses
between 0 (the previous convergent) vs the smallest allowed fraction (best
semi-convergent) as the result.
Andy Shevchenko [Thu, 1 Jul 2021 01:55:05 +0000 (18:55 -0700)]
lib/string_helpers: switch to use BIT() macro
Patch series "lib/string_helpers: get rid of ugly *_escape_mem_ascii()", v3.
Get rid of ugly *_escape_mem_ascii() API since it's not flexible and has
the only single user. Provide better approach based on usage of the
string_escape_mem() with appropriate flags.
Test cases has been expanded accordingly to cover new functionality.
This patch (of 15):
Switch to use BIT() macro for flag definitions. No changes implied.
Andy Shevchenko [Thu, 1 Jul 2021 01:54:59 +0000 (18:54 -0700)]
kernel.h: split out panic and oops helpers
kernel.h is being used as a dump for all kinds of stuff for a long time.
Here is the attempt to start cleaning it up by splitting out panic and
oops helpers.
There are several purposes of doing this:
- dropping dependency in bug.h
- dropping a loop by moving out panic_notifier.h
- unload kernel.h from something which has its own domain
At the same time convert users tree-wide to use new headers, although for
the time being include new header back to kernel.h to avoid twisted
indirected includes for existing users.
And 'ino' field to /proc/<pid>/fdinfo/<FD> and
/proc/<pid>/task/<tid>/fdinfo/<FD>.
The inode numbers can be used to uniquely identify DMA buffers in user
space and avoids a dependency on /proc/<pid>/fd/* when accounting
per-process DMA buffer sizes.
procfs: allow reading fdinfo with PTRACE_MODE_READ
Android captures per-process system memory state when certain low memory
events (e.g a foreground app kill) occur, to identify potential memory
hoggers. In order to measure how much memory a process actually consumes,
it is necessary to include the DMA buffer sizes for that process in the
memory accounting. Since the handle to DMA buffers are raw FDs, it is
important to be able to identify which processes have FD references to a
DMA buffer.
Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
/proc/<pid>/fdinfo -- both are only readable by the process owner, as
follows:
1. Do a readlink on each FD.
2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
3. stat the file to get the dmabuf inode number.
4. Read/ proc/<pid>/fdinfo/<fd>, to get the DMA buffer size.
Accessing other processes' fdinfo requires root privileges. This limits
the use of the interface to debugging environments and is not suitable for
production builds. Granting root privileges even to a system process
increases the attack surface and is highly undesirable.
Since fdinfo doesn't permit reading process memory and manipulating
process state, allow accessing fdinfo under PTRACE_MODE_READ_FSCRED.
Use size_t when capping the count argument received by mem_rw(). Since
count is size_t, using min_t(int, ...) can lead to a negative value
that will later be passed to access_remote_vm(), which can cause
unexpected behavior.
Since we are capping the value to at maximum PAGE_SIZE, the conversion
from size_t to int when passing it to access_remote_vm() as "len"
shouldn't be a problem.
Some NVIDIA GPUs do not support direct atomic access to system memory via
PCIe. Instead this must be emulated by granting the GPU exclusive access
to the memory. This is achieved by replacing CPU page table entries with
special swap entries that fault on userspace access.
The driver then grants the GPU permission to update the page undergoing
atomic access via the GPU page tables. When CPU access to the page is
required a CPU fault is raised which calls into the device driver via MMU
notifiers to revoke the atomic access. The original page table entries
are then restored allowing CPU access to proceed.
Call mmu_interval_notifier_insert() as part of nouveau_range_fault().
This doesn't introduce any functional change but makes it easier for a
subsequent patch to alter the behaviour of nouveau_range_fault() to
support GPU atomic operations.