Miaohe Lin [Thu, 2 Sep 2021 21:53:33 +0000 (14:53 -0700)]
mm: gup: remove set but unused local variable major
Patch series "Cleanups and fixup for gup".
This series contains cleanups to remove unneeded variable, useless BUG_ON
and use helper to improve readability. Also we fix a potential pgmap
refcnt leak. More details can be found in the respective changelogs.
This patch (of 5):
Since commit a2beb5f1efed ("mm: clean up the last pieces of page fault
accountings"), the local variable major is unused. Remove it.
Currently cgroup_writeback_by_id calls mem_cgroup_wb_stats() to get dirty
pages for a memcg. However mem_cgroup_wb_stats() does a lot more than
just get the number of dirty pages. Just directly get the number of dirty
pages instead of calling mem_cgroup_wb_stats(). Also
cgroup_writeback_by_id() is only called for best-effort dirty flushing, so
remove the unused 'nr' parameter and no need to explicitly flush memcg
stats.
Johannes Weiner [Thu, 2 Sep 2021 21:53:24 +0000 (14:53 -0700)]
fs: inode: count invalidated shadow pages in pginodesteal
pginodesteal is supposed to capture the impact that inode reclaim has on
the page cache state. Currently, it doesn't consider shadow pages that
get dropped this way, even though this can have a significant impact on
paging behavior, memory pressure calculations etc.
To improve visibility into these effects, make sure shadow pages get
counted when they get dropped through inode reclaim.
This changes the return value semantics of invalidate_mapping_pages()
semantics slightly, but the only two users are the inode shrinker itsel
and a usb driver that logs it for debugging purposes.
Johannes Weiner [Thu, 2 Sep 2021 21:53:21 +0000 (14:53 -0700)]
fs: drop_caches: fix skipping over shadow cache inodes
When drop_caches truncates the page cache in an inode it also includes any
shadow entries for evicted pages. However, there is a preliminary check
on whether the inode has pages: if it has *only* shadow entries, it will
skip running truncation on the inode and leave it behind.
Fix the check to mapping_empty(), such that it runs truncation on any
inode that has cache entries at all.
Johannes Weiner [Thu, 2 Sep 2021 21:53:18 +0000 (14:53 -0700)]
mm: remove irqsave/restore locking from contexts with irqs enabled
The page cache deletion paths all have interrupts enabled, so no need to
use irqsafe/irqrestore locking variants.
They used to have irqs disabled by the memcg lock added in commit c4843a7593a9 ("memcg: add per cgroup dirty page accounting"), but that has
since been replaced by memcg taking the page lock instead, commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge AP").
Jan Kara [Thu, 2 Sep 2021 21:53:15 +0000 (14:53 -0700)]
writeback: use READ_ONCE for unlocked reads of writeback stats
We do some unlocked reads of writeback statistics like
avg_write_bandwidth, dirty_ratelimit, or bw_time_stamp. Generally we are
fine with getting somewhat out-of-date values but actually getting
different values in various parts of the functions because the compiler
decided to reload value from original memory location could confuse
calculations. Use READ_ONCE for these unlocked accesses and WRITE_ONCE
for the updates to be on the safe side.
Jan Kara [Thu, 2 Sep 2021 21:53:12 +0000 (14:53 -0700)]
writeback: rename domain_update_bandwidth()
Rename domain_update_bandwidth() to domain_update_dirty_limit(). The
original name is a misnomer. The function has nothing to do with a
bandwidth, it updates dirty limits.
Jan Kara [Thu, 2 Sep 2021 21:53:09 +0000 (14:53 -0700)]
writeback: fix bandwidth estimate for spiky workload
Michael Stapelberg has reported that for workload with short big spikes of
writes (GCC linker seem to trigger this frequently) the write throughput
is heavily underestimated and tends to steadily sink until it reaches
zero. This has rather bad impact on writeback throttling (causing
stalls). The problem is that writeback throughput estimate gets updated
at most once per 200 ms. One update happens early after we submit pages
for writeback (at that point writeout of only small fraction of pages is
completed and thus observed throughput is tiny). Next update happens only
during the next write spike (updates happen only from inode writeback and
dirty throttling code) and if that is more than 1s after previous spike,
we decide system was idle and just ignore whatever was written until this
moment.
Fix the problem by making sure writeback throughput estimate is also
updated shortly after writeback completes to get reasonable estimate of
throughput for spiky workloads.
[[email protected]: avoid division by 0 in wb_update_dirty_ratelimit()]
Jan Kara [Thu, 2 Sep 2021 21:53:06 +0000 (14:53 -0700)]
writeback: reliably update bandwidth estimation
Currently we trigger writeback bandwidth estimation from
balance_dirty_pages() and from wb_writeback(). However neither of these
need to trigger when the system is relatively idle and writeback is
triggered e.g. from fsync(2). Make sure writeback estimates happen
reliably by triggering them from do_writepages().
Jan Kara [Thu, 2 Sep 2021 21:53:04 +0000 (14:53 -0700)]
writeback: track number of inodes under writeback
Patch series "writeback: Fix bandwidth estimates", v4.
Fix estimate of writeback throughput when device is not fully busy doing
writeback. Michael Stapelberg has reported that such workload (e.g.
generated by linking) tends to push estimated throughput down to 0 and as
a result writeback on the device is practically stalled.
The first three patches fix the reported issue, the remaining two patches
are unrelated cleanups of problems I've noticed when reading the code.
This patch (of 4):
Track number of inodes under writeback for each bdi_writeback structure.
We will use this to decide whether wb does any IO and so we can estimate
its writeback throughput. In principle we could use number of pages under
writeback (WB_WRITEBACK counter) for this however normal percpu counter
reads are too inaccurate for our purposes and summing the counter is too
expensive.
mm: add kernel_misc_reclaimable in show_free_areas
Print NR_KERNEL_MISC_RECLAIMABLE stat from show_free_areas() so users can
check whether the shrinker is working correctly and to show the current
memory usage.
While it was not useful to that bug report to know where the reclaim lock
had been acquired, it might be useful under other circumstances. Allow
the caller of __fs_reclaim_acquire to specify the instruction pointer to
use.
In page table entry modifying tests, set_xxx_at() are used to populate
the page table entries. On ARM64, PG_arch_1 (PG_dcache_clean) flag is
set to the target page flag if execution permission is given. The logic
exits since commit 4f04d8f00545 ("arm64: MMU definitions"). The page
flag is kept when the page is free'd to buddy's free area list. However,
it will trigger page checking failure when it's pulled from the buddy's
free area list, as the following warning messages indicate.
This fixes the issue by clearing PG_arch_1 through flush_dcache_page()
after set_xxx_at() is called. For architectures other than ARM64, the
unexpected overhead of cache flushing is acceptable.
The variables used by old implementation isn't needed as we switched to
"struct pgtable_debug_args". Lets remove them and related code in
debug_vm_pgtable().
mm/debug_vm_pgtable: use struct pgtable_debug_args in PGD and P4D modifying tests
This uses struct pgtable_debug_args in PGD/P4D modifying tests. No
allocated huge page is used in these tests. Besides, the unused variable
@saved_p4dp and @saved_pudp are dropped.
mm/debug_vm_pgtable: use struct pgtable_debug_args in PUD modifying tests
This uses struct pgtable_debug_args in PUD modifying tests. The allocated
huge page is used when set_pud_at() is used. The corresponding tests are
skipped if the huge page doesn't exist. Besides, the following unused
variables in debug_vm_pgtable() are dropped: @prot, @paddr, @pud_aligned.
mm/debug_vm_pgtable: use struct pgtable_debug_args in PMD modifying tests
This uses struct pgtable_debug_args in PMD modifying tests. The allocated
huge page is used when set_pmd_at() is used. The corresponding tests are
skipped if the huge page doesn't exist. Besides, the unused variable
@pmd_aligned in debug_vm_pgtable() is dropped.
mm/debug_vm_pgtable: use struct pgtable_debug_args in PTE modifying tests
This uses struct pgtable_debug_args in PTE modifying tests. The allocated
page is used as set_pte_at() is used there. The tests are skipped if the
allocated page doesn't exist. It's notable that args->ptep need to be
mapped before the tests. The reason why we don't map args->ptep at the
beginning is PTE entry is only mapped and accessible in atomic context
when CONFIG_HIGHPTE is enabled. So we avoid to do that so that atomic
context is only enabled if needed.
Besides, the unused variable @pte_aligned and @ptep in debug_vm_pgtable()
are dropped.
mm/debug_vm_pgtable: use struct pgtable_debug_args in migration and thp tests
This uses struct pgtable_debug_args in the migration and thp test
functions. It's notable that the pre-allocated page is used in
swap_migration_tests() as set_pte_at() is used there.
Patch series "mm/debug_vm_pgtable: Enhancements", v6.
There are a couple of issues with current implementations and this series
tries to resolve the issues:
(a) All needed information are scattered in variables, passed to various
test functions. The code is organized in pretty much relaxed fashion.
(b) The page isn't allocated from buddy during page table entry modifying
tests. The page can be invalid, conflicting to the implementations
of set_xxx_at() on ARM64. The target page is accessed so that the
iCache can be flushed when execution permission is given on ARM64.
Besides, the target page can be unmapped and accessing to it causes
kernel crash.
"struct pgtable_debug_args" is introduced to address issue (a). For issue
(b), the used page is allocated from buddy in page table entry modifying
tests. The corresponding tets will be skipped if we fail to allocate the
(huge) page. For other test cases, the original page around to kernel
symbol (@start_kernel) is still used.
The patches are organized as below. PATCH[2-10] could be combined to one
patch, but it will make the review harder:
PATCH[1] introduces "struct pgtable_debug_args" as place holder of all
needed information. With it, the old and new implementation
can coexist.
PATCH[2-10] uses "struct pgtable_debug_args" in various test functions.
PATCH[11] removes the unused code for old implementation.
PATCH[12] fixes the issue of corrupted page flag for ARM64
This patch (of 6):
In debug_vm_pgtable(), there are many local variables introduced to track
the needed information and they are passed to the functions for various
test cases. It'd better to introduce a struct as place holder for these
information. With it, what the tests functions need is the struct. In
this way, the code is simplified and easier to be maintained.
Besides, set_xxx_at() could access the data on the corresponding pages in
the page table modifying tests. So the accessed pages in the tests should
have been allocated from buddy. Otherwise, we're accessing pages that
aren't owned by us. This causes issues like page flag corruption or
kernel crash on accessing unmapped page when CONFIG_DEBUG_PAGEALLOC is
enabled.
This introduces "struct pgtable_debug_args". The struct is initialized
and destroyed, but the information in the struct isn't used yet. It will
be used in subsequent patches.
Gang He [Thu, 2 Sep 2021 21:50:17 +0000 (14:50 -0700)]
ocfs2: ocfs2_downconvert_lock failure results in deadlock
Usually, ocfs2_downconvert_lock() function always downconverts dlm lock to
the expected level for satisfy dlm bast requests from the other nodes.
But there is a rare situation. When dlm lock conversion is being
canceled, ocfs2_downconvert_lock() function will return -EBUSY. You need
to be aware that ocfs2_cancel_convert() function is asynchronous in fsdlm
implementation.
If we does not requeue this lockres entry, ocfs2 downconvert thread no
longer handles this dlm lock bast request. Then, the other nodes will not
get the dlm lock again, the current node's process will be blocked when
acquire this dlm lock again.
Tuo Li [Thu, 2 Sep 2021 21:50:14 +0000 (14:50 -0700)]
ocfs2: quota_local: fix possible uninitialized-variable access in ocfs2_local_read_info()
A memory block is allocated through kmalloc(), and its return value is
assigned to the pointer oinfo. However, oinfo->dqi_gqinode is not
initialized but it is accessed in:
iput(oinfo->dqi_gqinode);
To fix this possible uninitialized-variable access, assign NULL to
oinfo->dqi_gqinode, and add ocfs2_qinfo_lock_res_init() behind the
assignment in ocfs2_local_read_info(). Remove ocfs2_qinfo_lock_res_init()
in ocfs2_global_read_info().
Patch series "ia64: Miscellaneous fixes and cleanups".
This patch series contains some miscellaneous fixes and cleanups for ia64.
The second patch fixes a naming conflict triggered by a patch for the FDT
code.
This patch (of 3):
The definition of reserve_elfcorehdr() depends on CONFIG_CRASH_DUMP, not
CONFIG_PROC_VMCORE.
Let's also remove masking off MAP_DENYWRITE from ksys_mmap_pgoff():
the last in-tree occurrence of MAP_DENYWRITE is now in LEGACY_MAP_MASK,
which accepts the flag e.g., for MAP_SHARED_VALIDATE; however, the flag
is ignored throughout the kernel now.
Add a comment to LEGACY_MAP_MASK stating that MAP_DENYWRITE is ignored.
At exec time when we mmap the new executable via MAP_DENYWRITE we have it
opened via do_open_execat() and already deny_write_access()'ed the file
successfully. Once exec completes, we allow_write_acces(); however,
we set mm->exe_file in begin_new_exec() via set_mm_exe_file() and
also deny_write_access() as long as mm->exe_file remains set. We'll
effectively deny write access to our executable via mm->exe_file
until mm->exe_file is changed -- when the process is removed, on new
exec, or via sys_prctl(PR_SET_MM_MAP/EXE_FILE).
Let's remove all usage of MAP_DENYWRITE, it's no longer necessary for
mm->exe_file.
In case of an elf interpreter, we'll now only deny write access to the file
during exec. This is somewhat okay, because the interpreter behaves
(and sometime is) a shared library; all shared libraries, especially the
ones loaded directly in user space like via dlopen() won't ever be mapped
via MAP_DENYWRITE, because we ignore that from user space completely;
these shared libraries can always be modified while mapped and executed.
Let's only special-case the main executable, denying write access while
being executed by a process. This can be considered a minor user space
visible change.
While this is a cleanup, it also fixes part of a problem reported with
VM_DENYWRITE on overlayfs, as VM_DENYWRITE is effectively unused with
this patch and will be removed next:
"Overlayfs did not honor positive i_writecount on realfile for
VM_DENYWRITE mappings." [1]
kernel/fork: always deny write access to current MM exe_file
We want to remove VM_DENYWRITE only currently only used when mapping the
executable during exec. During exec, we already deny_write_access() the
executable, however, after exec completes the VMAs mapped
with VM_DENYWRITE effectively keeps write access denied via
deny_write_access().
Let's deny write access when setting or replacing the MM exe_file. With
this change, we can remove VM_DENYWRITE for mapping executables.
Make set_mm_exe_file() return an error in case deny_write_access()
fails; note that this should never happen, because exec code does a
deny_write_access() early and keeps write access denied when calling
set_mm_exe_file. However, it makes the code easier to read and makes
set_mm_exe_file() and replace_mm_exe_file() look more similar.
This represents a minor user space visible change:
sys_prctl(PR_SET_MM_MAP/EXE_FILE) can now fail if the file is already
opened writable. Also, after sys_prctl(PR_SET_MM_MAP/EXE_FILE) the file
cannot be opened writable. Note that we can already fail with -EACCES if
the file doesn't have execute permissions.
binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()
uselib() is the legacy systemcall for loading shared libraries.
Nowadays, applications use dlopen() to load shared libraries, completely
implemented in user space via mmap().
For example, glibc uses MAP_COPY to mmap shared libraries. While this
maps to MAP_PRIVATE | MAP_DENYWRITE on Linux, Linux ignores any
MAP_DENYWRITE specification from user space in mmap.
With this change, all remaining in-tree users of MAP_DENYWRITE use it
to map an executable. We will be able to open shared libraries loaded
via uselib() writable, just as we already can via dlopen() from user
space.
This is one step into the direction of removing MAP_DENYWRITE from the
kernel. This can be considered a minor user space visible change.
It seems that this bug has already been fixed by Eric Dumazet in the
past in:
commit 78296c97ca1f ("netfilter: xt_socket: fix a stack corruption bug")
But a variant of the same issue has been introduced in
commit d64d80a2cde9 ("netfilter: x_tables: don't extract flow keys on early demuxed sks in socket match")
`daddr` and `saddr` potentially hold a reference to ipv6_var that is no
longer in scope when the call to `nf_socket_get_sock_v6` is made.
Fixes: d64d80a2cde9 ("netfilter: x_tables: don't extract flow keys on early demuxed sks in socket match") Acked-by: Matthieu Baerts <[email protected]> Signed-off-by: Benjamin Hesmans <[email protected]> Reviewed-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
Xiaoguang Wang [Fri, 3 Sep 2021 14:24:36 +0000 (22:24 +0800)]
io_uring: fix possible poll event lost in multi shot mode
IIUC, IORING_POLL_ADD_MULTI is similar to epoll's edge-triggered mode,
that means once one pure poll request returns one event(cqe), we'll
need to read or write continually until EAGAIN is returned, then I think
there is a possible poll event lost race in multi shot mode:
t1 poll request add | |
t2 | |
t3 event happens | |
t4 task work add | |
t5 | task work run |
t6 | commit one cqe |
t7 | | user app handles cqe
t8 | new event happen |
t9 | add back to waitqueue |
t10 |
After t6 but before t9, if new event happens, there'll be no wakeup
operation, and if user app has picked up this cqe in t7, read or write
until EAGAIN is returned. In t8, new event happens and will be lost,
though this race window maybe small.
To fix this possible race, add poll request back to waitqueue before
committing cqe.
Kate Hsuan [Fri, 3 Sep 2021 09:44:11 +0000 (17:44 +0800)]
libata: Add ATA_HORKAGE_NO_NCQ_ON_ATI for Samsung 860 and 870 SSD.
Many users are reporting that the Samsung 860 and 870 SSD are having
various issues when combined with AMD/ATI (vendor ID 0x1002) SATA
controllers and only completely disabling NCQ helps to avoid these
issues.
Always disabling NCQ for Samsung 860/870 SSDs regardless of the host
SATA adapter vendor will cause I/O performance degradation with well
behaved adapters. To limit the performance impact to ATI adapters,
introduce the ATA_HORKAGE_NO_NCQ_ON_ATI flag to force disable NCQ
only for these adapters.
Also, two libata.force parameters (noncqati and ncqati) are introduced
to disable and enable the NCQ for the system which equipped with ATI
SATA adapter and Samsung 860 and 870 SSDs. The user can determine NCQ
function to be enabled or disabled according to the demand.
After verifying the chipset from the user reports, the issue appears
on AMD/ATI SB7x0/SB8x0/SB9x0 SATA Controllers and does not appear on
recent AMD SATA adapters. The vendor ID of ATI should be 0x1002.
Therefore, ATA_HORKAGE_NO_NCQ_ON_AMD was modified to
ATA_HORKAGE_NO_NCQ_ON_ATI.
Hans de Goede [Mon, 23 Aug 2021 09:52:20 +0000 (11:52 +0200)]
libata: add ATA_HORKAGE_NO_NCQ_TRIM for Samsung 860 and 870 SSDs
Commit ca6bfcb2f6d9 ("libata: Enable queued TRIM for Samsung SSD 860")
limited the existing ATA_HORKAGE_NO_NCQ_TRIM quirk from "Samsung SSD 8*",
covering all Samsung 800 series SSDs, to only apply to "Samsung SSD 840*"
and "Samsung SSD 850*" series based on information from Samsung.
But there is a large number of users which is still reporting issues
with the Samsung 860 and 870 SSDs combined with Intel, ASmedia or
Marvell SATA controllers and all reporters also report these problems
going away when disabling queued trims.
Note that with AMD SATA controllers users are reporting even worse
issues and only completely disabling NCQ helps there, this will be
addressed in a separate patch.
That lets us resolve a conflict in arch/powerpc/sysdev/xive/common.c.
Between cbc06f051c52 ("powerpc/xive: Do not skip CPU-less nodes when
creating the IPIs"), which moved request_irq() out of xive_init_ipis(),
and 17df41fec5b8 ("powerpc: use IRQF_NO_DEBUG for IPIs") which added
IRQF_NO_DEBUG to that request_irq() call, which has now moved.
net: remove the unnecessary check in cipso_v4_doi_free
The commit 733c99ee8be9 ("net: fix NULL pointer reference in
cipso_v4_doi_free") was merged by a mistake, this patch try
to cleanup the mess.
And we already have the commit e842cb60e8ac ("net: fix NULL
pointer reference in cipso_v4_doi_free") which fixed the root
cause of the issue mentioned in it's description.
Before vlan/port mcast router support was added
br_multicast_set_port_router was used only with bh already disabled due
to the bridge port lock, but that is no longer the case and when it is
called to configure a vlan/port mcast router we can deadlock with the
timer, so always disable bh to make sure it can be called from contexts
with both enabled and disabled bh.
Fixes: 2796d846d74a ("net: bridge: vlan: convert mcast router global option to per-vlan entry") Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
The ISA DMA API is inconsistent between architectures, and while
powerpc implements most of what the others have, it does not provide
isa_virt_to_bus():
../drivers/net/ethernet/cirrus/cs89x0.c: In function ‘net_open’:
../drivers/net/ethernet/cirrus/cs89x0.c:897:20: error: implicit declaration of function ‘isa_virt_to_bus’ [-Werror=implicit-function-declaration]
(unsigned long)isa_virt_to_bus(lp->dma_buff));
../drivers/net/ethernet/cirrus/cs89x0.c:894:3: note: in expansion of macro ‘cs89_dbg’
cs89_dbg(1, debug, "%s: dma %lx %lx\n",
I tried a couple of approaches to handle this consistently across
all architectures, but as this driver is really only used on
ARM, I ended up taking the easy way out and just disable compile
testing on powerpc.
Pavel Begunkov [Wed, 1 Sep 2021 23:38:23 +0000 (00:38 +0100)]
io_uring: prolong tctx_task_work() with flushing
io_submit_flush_completions() may enqueue linked requests for task_work
execution, so don't leave tctx_task_work() right after the tw list is
exhausted, but try to flush and then retry.
Pavel Begunkov [Wed, 1 Sep 2021 23:38:22 +0000 (00:38 +0100)]
io_uring: don't disable kiocb_done() CQE batching
Not passing issue_flags from kiocb_done() into __io_complete_rw() means
that completion batching for this case is disabled, e.g. for most of
buffered reads.
io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
SQPOLL has a different thread doing submissions, we need to check for
that and use the right task context when updating the worker values.
Just hold the sqd->lock across the operation, this ensures that the
thread cannot go away while we poke at ->io_uring.
Jin Yao [Thu, 2 Sep 2021 06:59:55 +0000 (14:59 +0800)]
perf tests: Add test for PMU aliases
A perf uncore PMU may have two PMU names, a real name and an alias.
Add one test case to verify that the real and alias names have the same
effect.
Iterate sysfs to get one event which has an alias and create an evlist
by adding two evsels. Evsel1 is created by event and evsel2 is created
by alias.
Kan Liang [Thu, 2 Sep 2021 06:59:54 +0000 (14:59 +0800)]
perf pmu: Add PMU alias support
A perf uncore PMU may have two PMU names, a real name and an alias. The
alias is exported at /sys/bus/event_source/devices/uncore_*/alias.
The perf tool should support the alias as well.
Add alias_name in the struct perf_pmu to store the alias. For the PMU
which doesn't have an alias. It's NULL.
Introduce two X86 specific functions to retrieve the real name and the
alias separately.
Only go through the sysfs to retrieve the mapping between the real name
and the alias once. The result is cached in a list, uncore_pmu_list.
Nothing changed for the other ARCHs.
With the patch, the perf tool can monitor the PMU with either the real
name or the alias.
Use the real name,
$ perf stat -e uncore_cha_2/event=1/ -x, 4044879584,,uncore_cha_2/event=1/,2528059205,100.00,,
Use the alias,
$ perf stat -e uncore_type_0_2/event=1/ -x, 3659675336,,uncore_type_0_2/event=1/,2287306455,100.00,,
Committer notes:
Rename 'struct perf_pmu_alias_name' to 'pmu_alias', the 'perf_' prefix
should be used for libperf, things inside just tools/perf/ are being
moved away from that prefix.
Also 'pmu_alias' is shorter and reflects the abstraction.
Also don't use 'pmu' as the name for variables for that type, we should
use that for the 'struct perf_pmu' variables, avoiding confusion. Use
'pmu_alias' for 'struct pmu_alias' variables.
Stephen Brennan [Wed, 1 Sep 2021 21:08:15 +0000 (14:08 -0700)]
perf script python: Allow reporting the [un]throttle PERF_RECORD_ meta event
perf_events may sometimes throttle an event due to creating too many
samples during a given timer tick.
As of now, the perf tool will not report on throttling, which means this
is a silent error.
Implement a callback for the throttle and unthrottle events within the
Python scripting engine, which can allow scripts to detect and report
when events may have been lost due to throttling.
Leo Yan [Thu, 2 Sep 2021 08:18:00 +0000 (16:18 +0800)]
perf build: Report failure for testing feature libopencsd
When build perf tool with passing option 'CORESIGHT=1' explicitly, if
the feature test fails for library libopencsd, the build doesn't
complain the feature failure and continue to build the tool with
disabling the CoreSight feature insteadly.
This patch changes the building behaviour, when build perf tool with the
option 'CORESIGHT=1' and detect the failure for testing feature
libopencsd, the build process will be aborted and it shows the complaint
info.
Committer testing:
First make sure there is no opencsd library installed:
James Clark [Fri, 6 Aug 2021 13:41:09 +0000 (14:41 +0100)]
perf cs-etm: Show a warning for an unknown magic number
Currently perf reports "Cannot allocate memory" which isn't very helpful
for a potentially user facing issue. If we add a new magic number in
the future, perf will be able to report unrecognised magic numbers.
James Clark [Fri, 6 Aug 2021 13:41:08 +0000 (14:41 +0100)]
perf cs-etm: Print the decoder name
Use the real name of the decoder instead of hard-coding "ETM" to avoid
confusion when the trace is ETE. This also now distinguishes between
ETMv3 and ETMv4.
James Clark [Fri, 6 Aug 2021 13:41:07 +0000 (14:41 +0100)]
perf cs-etm: Create ETE decoder
If the magic number indicates ETE instantiate a OCSD_BUILTIN_DCD_ETE
decoder instead of OCSD_BUILTIN_DCD_ETMV4I. ETE is the new trace feature
for Armv9.
Testing performed
=================
* Old files with v0 and v1 headers for ETMv4 still open correctly
* New files with new magic number open on new versions of perf
* New files with new magic number fail to open on old versions of perf
* Decoding with the ETE decoder results in the same output as the ETMv4
decoder as long as there are no new ETE packet types
James Clark [Fri, 6 Aug 2021 13:41:06 +0000 (14:41 +0100)]
perf cs-etm: Update OpenCSD decoder for ETE
OpenCSD v1.1.1 has a bug fix for the installation of the ETE decoder
headers. This also means that including headers separately for each
decoder is unnecessary so remove these.
James Clark [Fri, 6 Aug 2021 13:41:01 +0000 (14:41 +0100)]
perf cs-etm: Refactor initialisation of decoder params.
The initialisation of the decoder params is duplicated between
creation of the packet printer and packet decoder. Put them both
into one function so that future changes only need to be made in one
place.
Mat Martineau [Thu, 2 Sep 2021 18:51:19 +0000 (11:51 -0700)]
mptcp: Only send extra TCP acks in eligible socket states
Recent changes exposed a bug where specifically-timed requests to the
path manager netlink API could trigger a divide-by-zero in
__tcp_select_window(), as syzkaller does:
mptcp_pm_nl_addr_send_ack() was attempting to send a TCP ACK on the
first subflow in the MPTCP socket's connection list without validating
that the subflow was in a suitable connection state. To address this,
always validate subflow state when sending extra ACKs on subflows
for address advertisement or subflow priority change.
Fixes: 84dfe3677a6f ("mptcp: send out dedicated ADD_ADDR packet") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/229 Co-developed-by: Paolo Abeni <[email protected]> Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Acked-by: Geliang Tang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
This can be demonstrated with "ptp4l -m -i <intf>"
To fix this, we pull the use of the queue_lock further up above the
callers of ionic_reconfigure_queues() and ionic_stop_queues_reconfig().
Fixes: 7ee99fc5ed2e ("ionic: pull hwstamp queue_lock up a level") Signed-off-by: Shannon Nelson <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Kernel v5.14 has various changes to optimize unaligned memory accesses,
e.g. commit 0652035a5794 ("asm-generic: unaligned: remove byteshift helpers").
Those changes triggered an unalignment-exception and thus crashed the
bootloader on parisc because the unaligned "output_len" variable now suddenly
was read word-wise while it was read byte-wise in the past.
Fix this issue by declaring the external output_len variable as char which then
forces the compiler to generate byte-accesses.
Signed-off-by: Helge Deller <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: John David Anglin <[email protected]>
Bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102162 Fixes: 8c031ba63f8f ("parisc: Unbreak bootloader due to gcc-7 optimizations") Fixes: 0652035a5794 ("asm-generic: unaligned: remove byteshift helpers") Cc: <[email protected]> # v5.14+
Ariel Marcovitch [Sun, 22 Aug 2021 19:22:01 +0000 (22:22 +0300)]
checkkconfigsymbols.py: Fix the '--ignore' option
It seems like the implementation of the --ignore option is broken.
In check_symbols_helper, when going through the list of files, a file is
added to the list of source files to check if it matches the ignore
pattern. Instead, as stated in the comment below this condition, the
file should be added if it doesn't match the pattern.
This means that when providing an ignore pattern, the only files that
will be checked will be the ones we want the ignore, in addition to the
Kconfig files that don't match the pattern (the check in
parse_kconfig_files is done right)
Masahiro Yamada [Thu, 19 Aug 2021 00:57:36 +0000 (09:57 +0900)]
kbuild: remove stale *.symversions
cmd_update_lto_symversions merges all the existing *.symversions, but
some of them might be stale.
If the last EXPORT_SYMBOL is removed from a C file, the *.symversions
file is not deleted or updated. It contains stale CRCs, but still they
will be used for linking the vmlinux or modules.
It is not a big deal when the EXPORT_SYMBOL is really removed. However,
when the EXPORT_SYMBOL is moved to another file, the same __crc_<symbol>
will appear twice in the merged *.symversions, possibly with different
CRCs if the function argument is changed at the same time. It would
confuse module versioning.
If no EXPORT_SYMBOL is found, let's remove *.symversions explicitly.
Nick Desaulniers [Tue, 17 Aug 2021 00:21:07 +0000 (17:21 -0700)]
x86: remove cc-option-yn test for -mtune=
As noted in the comment, -mtune= has been supported since GCC 3.4. The
minimum required version of GCC to build the kernel (as specified in
Documentation/process/changes.rst) is GCC 4.9.
tune is not immediately expanded. Instead it defines a macro that will
test via cc-option later values for -mtune=. But we can skip the test
whether to use -mtune= vs. -mcpu=.
Nick Desaulniers [Tue, 17 Aug 2021 00:21:06 +0000 (17:21 -0700)]
arc: replace cc-option-yn uses with cc-option
cc-option-yn can be replaced with cc-option. ie.
Checking for support:
ifeq ($(call cc-option-yn,$(FLAG)),y)
becomes:
ifneq ($(call cc-option,$(FLAG)),)
Checking for lack of support:
ifeq ($(call cc-option-yn,$(FLAG)),n)
becomes:
ifeq ($(call cc-option,$(FLAG)),)
Nick Desaulniers [Tue, 17 Aug 2021 00:21:04 +0000 (17:21 -0700)]
s390: replace cc-option-yn uses with cc-option
cc-option-yn can be replaced with cc-option. ie.
Checking for support:
ifeq ($(call cc-option-yn,$(FLAG)),y)
becomes:
ifneq ($(call cc-option,$(FLAG)),)
Checking for lack of support:
ifeq ($(call cc-option-yn,$(FLAG)),n)
becomes:
ifeq ($(call cc-option,$(FLAG)),)
sparc: move the install rule to arch/sparc/Makefile
Currently, the install target in arch/sparc/Makefile descends into
arch/sparc/boot/Makefile to invoke the shell script, but there is no
good reason to do so.
arch/sparc/Makefile can run the shell script directly.
kbuild: Switch to 'f' variants of integrated assembler flag
It has been brought up a few times in various code reviews that clang
3.5 introduced -f{,no-}integrated-as as the preferred way to enable and
disable the integrated assembler, mentioning that -{no-,}integrated-as
are now considered legacy flags.
Switch the kernel over to using those variants in case there is ever a
time where clang decides to remove the non-'f' variants of the flag.
Also, fix a typo in a comment ("intergrated" -> "integrated").
kbuild: Shuffle blank line to improve comment meaning
-Wunused-but-set-variable and -Wunused-const-variable are both disabled
for the same reason but there is a blank line between them and no blank
line between -Wno-unused-const-variable and the block.
Shuffle the new line so that it is clear that the comment applied to
both flags and the next block is separate from them.
Whenever a warning is disabled, it is helpful for future travelers to
understand why the warning is disabled and why it is acceptable to do
so. Add a comment for -Wno-gnu so that people understand why it is
disabled.
kbuild: Remove -Wno-format-invalid-specifier from clang block
Turning on -Wformat does not reveal any instances of this warning across
several different builds so remove this line to keep the number of
disabled warnings as slim as possible.
This has been disabled since commit 61163efae020 ("kbuild: LLVMLinux:
Add Kbuild support for building kernel with Clang"), which does not
explain exactly why it was turned off but since it was so long ago in
terms of both the kernel and LLVM so it is possible that some bug got
fixed along the way.