Git Repo - linux.git/log
linux.git
4 years ago mm/mempolicy.c: check parameters first in kernel_get_mempolicy
Wenchao Hao [Wed, 12 Aug 2020 01:31:16 +0000 (18:31 -0700)]
mm/mempolicy.c: check parameters first in kernel_get_mempolicy

The previous implementation calls untagged_addr() before the error check.
If the error check fails and returns EINVAL, the untagged_addr() call is
just wasted work.

Signed-off-by: Wenchao Hao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago mm: mempolicy: fix kerneldoc of numa_map_to_online_node()
Krzysztof Kozlowski [Wed, 12 Aug 2020 01:31:13 +0000 (18:31 -0700)]
mm: mempolicy: fix kerneldoc of numa_map_to_online_node()

Fix W=1 compile warnings (invalid kerneldoc):

    mm/mempolicy.c:137: warning: Function parameter or member 'node' not described in 'numa_map_to_online_node'
    mm/mempolicy.c:137: warning: Excess function parameter 'nid' description in 'numa_map_to_online_node'

Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago mm/compaction: correct the comments of compact_defer_shift
Alex Shi [Wed, 12 Aug 2020 01:31:10 +0000 (18:31 -0700)]
mm/compaction: correct the comments of compact_defer_shift

There is no compact_defer_limit; the field in use is compact_defer_shift.
Also add an explanation of compact_order_failed.

Signed-off-by: Alex Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago mm: use unsigned types for fragmentation score
Nitin Gupta [Wed, 12 Aug 2020 01:31:07 +0000 (18:31 -0700)]
mm: use unsigned types for fragmentation score

Proactive compaction uses a per-node/zone "fragmentation score" which is
always in the range [0, 100], so use an unsigned type for these scores as
well as for related constants.

Signed-off-by: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago mm: fix compile error due to COMPACTION_HPAGE_ORDER
Nitin Gupta [Wed, 12 Aug 2020 01:31:04 +0000 (18:31 -0700)]
mm: fix compile error due to COMPACTION_HPAGE_ORDER

Fix compile error when COMPACTION_HPAGE_ORDER is assigned to
HUGETLB_PAGE_ORDER.  The correct way to check if this constant is defined
is to check for CONFIG_HUGETLBFS.

Reported-by: Nathan Chancellor <[email protected]>
Signed-off-by: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Nathan Chancellor <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago mm: proactive compaction
Nitin Gupta [Wed, 12 Aug 2020 01:31:00 +0000 (18:31 -0700)]
mm: proactive compaction

For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented.  The Linux kernel currently does on-demand compaction
as we request more hugepages, but this style of compaction incurs very high
latency.  Experiments with one-time full memory compaction (followed by
hugepage allocations) show that the kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within <1 sec
for a 32G system.  Such data suggests that more proactive compaction can
help us allocate a large fraction of memory as hugepages while keeping
allocation latencies low.

For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
external fragmentation which kcompactd tries to maintain.

The tunable takes a value in range [0, 100], with a default of 20.

Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}), but this one reduces them to
just one sysctl.  Also, the new tunable is an opaque value instead of
asking for specific bounds of "external fragmentation", which would have
been difficult to estimate.  The internal interpretation of this opaque
value allows for future fine-tuning.

Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
The score for a node is defined as weighted mean of per-zone external
fragmentation.  A zone's present_pages determines its weight.
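
The translation described above can be sketched in a few lines of Python.
This is only an illustration of the arithmetic, not the kernel code: the
names score_thresholds, node_score and Zone are made up for the example,
and capping the high threshold at 100 is an assumption.

```python
from collections import namedtuple

# Hypothetical stand-in for a zone: its size in pages and its external
# fragmentation expressed as a percentage in [0, 100].
Zone = namedtuple("Zone", ["present_pages", "extfrag"])


def score_thresholds(proactiveness):
    """Map the tunable to (low, high) fragmentation score thresholds."""
    if not 0 <= proactiveness <= 100:
        raise ValueError("proactiveness must be in [0, 100]")
    low = 100 - proactiveness
    # high = low + 10; capping at 100 is an assumption of this sketch.
    high = min(low + 10, 100)
    return low, high


def node_score(zones):
    """Weighted mean of per-zone fragmentation, weighted by present_pages."""
    total = sum(z.present_pages for z in zones)
    if total == 0:
        return 0
    return sum(z.extfrag * z.present_pages for z in zones) // total
```

With the default proactiveness of 20, score_thresholds(20) gives
low=80 and high=90.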

To periodically check per-node score, we reuse per-node kcompactd threads,
which are woken up every 500 milliseconds to check the same.  If a node's
score exceeds its high threshold (as derived from user-provided
proactiveness value), proactive compaction is started until its score
reaches its low threshold value.  By default, proactiveness is set to 20,
which implies threshold values of low=80 and high=90.
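
The start/stop rule above is a hysteresis between the two thresholds.  A
minimal sketch follows, assuming one decision per periodic check (a
simplification of what kcompactd does); compaction_wanted and simulate are
hypothetical names.

```python
def compaction_wanted(score, compacting, low, high):
    """Decide whether proactive compaction should run after one check."""
    if compacting:
        # Already compacting: continue until the score reaches the low mark.
        return score > low
    # Idle: start only once the score exceeds the high mark.
    return score > high


def simulate(scores, low=80, high=90):
    """Return the compaction state after each periodic score check."""
    states = []
    compacting = False
    for score in scores:
        compacting = compaction_wanted(score, compacting, low, high)
        states.append(compacting)
    return states
```

A score between the two thresholds keeps whatever state the node was
already in, which avoids starting and stopping on every small change.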

This patch is largely based on ideas from Michal Hocko [2].  See also the
LWN article [3].

Performance data
================

System: x86_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch

echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

Before starting the driver, the system was fragmented from a userspace
program that allocates all memory and then for each 2M aligned section,
frees 3/4 of base pages using munmap.  The workload is mainly anonymous
userspace pages, which are easy to move around.  I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.

1. Kernel hugepage allocation latencies

With the system in such a fragmented state, a kernel driver then allocates
as many hugepages as possible and measures allocation latency:

(all latency values are in microseconds)

- With vanilla 5.6.0-rc3

  percentile latency
  –––––––––– –––––––
   5    7894
  10    9496
  25   12561
  30   15295
  40   18244
  50   21229
  60   27556
  75   30147
  80   31047
  90   32859
  95   33799

Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)

- With 5.6.0-rc3 + this patch, with proactiveness=20

sysctl -w vm.compaction_proactiveness=20

  percentile latency
  –––––––––– –––––––
   5       2
  10       2
  25       3
  30       3
  40       3
  50       4
  60       4
  75       4
  80       4
  90       5
  95     429

Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)
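
The totals quoted for both runs can be rechecked with simple arithmetic.
The sketch below assumes that "2M" means 2 MiB and "G" means GiB; the
function names are made up for the example.

```python
HUGEPAGE_MIB = 2  # assumed size of one hugepage in MiB


def hugepages_gib(count):
    """Total size in GiB of `count` 2 MiB hugepages."""
    return count * HUGEPAGE_MIB / 1024


def percent_of_free(count, free_gib):
    """Share of free memory that ended up backed by hugepages."""
    return 100 * hugepages_gib(count) / free_gib
```

Both hugepage counts reported above come out at roughly 98% of the 762G
that was free.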

2. JAVA heap allocation

In this test, we first fragment memory using the same method as for (1).

Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages.  We also set THP to madvise to
allow hugepage backing of this heap.

/usr/bin/time
 java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

The above command allocates 700G of Java heap using hugepages.

- With vanilla 5.6.0-rc3

17.39user 1666.48system 27:37.89elapsed

- With 5.6.0-rc3 + this patch, with proactiveness=20

8.35user 194.58system 3:19.62elapsed

Elapsed time remains around 3:15, as proactiveness is further increased.

Note that proactive compaction happens throughout the runtime of these
workloads.  The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.

In the above Java workload, proactiveness is set to 20.  The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more or less
time for the initial round of compaction.  As the benchmark consumes
hugepages, the node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings the score down to the low
threshold level (80).  Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark.  A kcompactd thread consumes 100% of one of
the CPUs while it tries to bring a node's score within thresholds.

Backoff behavior
================

Above workloads produce a memory state which is easy to compact.  However,
if memory is filled with unmovable pages, proactive compaction should
essentially back off.  To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  and then frees the first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred the maximum number of
  times with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).
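
The "~30 seconds" figure can be reproduced from the 500 ms check interval
mentioned earlier.  The sketch assumes the maximum number of deferrals is
1 << COMPACT_MAX_DEFER_SHIFT with a shift of 6; that value is not stated in
this message and is an assumption here.

```python
HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500  # wake-up period given in the text
COMPACT_MAX_DEFER_SHIFT = 6           # assumed value, not from this message


def retry_delay_seconds(interval_msec=HPAGE_FRAG_CHECK_INTERVAL_MSEC,
                        defer_shift=COMPACT_MAX_DEFER_SHIFT):
    """Time between retries when every check in between is deferred."""
    max_deferrals = 1 << defer_shift
    return max_deferrals * interval_msec / 1000
```

Under these assumptions the delay is 64 checks of 500 ms each, that is 32
seconds, in line with the rough figure quoted in the list.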

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412[email protected]/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Oleksandr Natalenko <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Khalid Aziz <[email protected]>
Reviewed-by: Oleksandr Natalenko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Oleksandr Natalenko <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago /proc/PID/smaps: consistent whitespace output format
Michal Koutný [Wed, 12 Aug 2020 01:30:57 +0000 (18:30 -0700)]
/proc/PID/smaps: consistent whitespace output format

The keys in smaps output are padded to fixed width with spaces.  All
except for THPeligible that uses tabs (only since commit c06306696f83
("mm: thp: fix false negative of shmem vma's THP eligibility")).

Unify the output formatting to save time debugging some naïve parsers.
(Part of the unification is also aligning FilePmdMapped with others.)
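
For comparison, a parser that accepts any run of whitespace after the key
handles both the tab-padded and the space-padded form.  A minimal sketch;
parse_smaps_fields is a hypothetical helper and the sample values in the
usage note are made up.

```python
import re

# "Key:   123 kB" or "Key:<tabs>0"; any whitespace may follow the colon.
_FIELD = re.compile(r"^(\w+):\s+(\d+)(?:\s+kB)?\s*$")


def parse_smaps_fields(text):
    """Collect the numeric fields of one smaps entry into a dict."""
    fields = {}
    for line in text.splitlines():
        match = _FIELD.match(line)
        if match:
            # Non-numeric lines (mapping header, VmFlags) are skipped.
            fields[match.group(1)] = int(match.group(2))
    return fields
```

For example, the text "Rss:      120 kB\nTHPeligible:\t\t0\n" yields
{"Rss": 120, "THPeligible": 0} regardless of which padding style is used.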

Signed-off-by: Michal Koutný <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Yang Shi <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago mm/vmscan: restore active/inactive ratio for anonymous LRU
Joonsoo Kim [Wed, 12 Aug 2020 01:30:54 +0000 (18:30 -0700)]
mm/vmscan: restore active/inactive ratio for anonymous LRU

Now that workingset detection is implemented for the anonymous LRU, we no
longer need a large inactive list to allow detecting frequently accessed
pages before they are reclaimed.  This effectively reverts the temporary
measure put in by commit "mm/vmscan: make active/inactive ratio as 1:1 for
anon lru".

Signed-off-by: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago mm/swap: implement workingset detection for anonymous LRU
Joonsoo Kim [Wed, 12 Aug 2020 01:30:50 +0000 (18:30 -0700)]
mm/swap: implement workingset detection for anonymous LRU

This patch implements workingset detection for anonymous LRU.  All the
infrastructure is implemented by the previous patches so this patch just
activates the workingset detection by installing/retrieving the shadow
entry and adding refault calculation.

Signed-off-by: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago mm/swapcache: support to handle the shadow entries
Joonsoo Kim [Wed, 12 Aug 2020 01:30:47 +0000 (18:30 -0700)]
mm/swapcache: support to handle the shadow entries

Workingset detection for anonymous pages will be implemented in the
following patch, and it requires storing shadow entries in the swapcache.
This patch implements the infrastructure to store the shadow entry in the
swapcache.

Signed-off-by: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago mm/workingset: prepare the workingset detection infrastructure for anon LRU
Joonsoo Kim [Wed, 12 Aug 2020 01:30:43 +0000 (18:30 -0700)]
mm/workingset: prepare the workingset detection infrastructure for anon LRU

To prepare the workingset detection for anon LRU, this patch splits
workingset event counters for refault, activate and restore into anon and
file variants, as well as the refaults counter in struct lruvec.

Signed-off-by: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago mm/vmscan: protect the workingset on anonymous LRU
Joonsoo Kim [Wed, 12 Aug 2020 01:30:40 +0000 (18:30 -0700)]
mm/vmscan: protect the workingset on anonymous LRU

In the current implementation, a newly created or swapped-in anonymous page
starts on the active list.  Growing the active list triggers rebalancing of
the active/inactive lists, so old pages on the active list are demoted to
the inactive list.  Hence, pages on the active list aren't protected at all.

The following is an example of this situation.

Assume that there are 50 hot pages on the active list.  Numbers denote the
number of pages on the active/inactive lists (active | inactive).  (h)
stands for hot pages and (uo) stands for used-once pages.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)

3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)

This patch tries to fix this issue.  As with the file LRU, newly created or
swapped-in anonymous pages are inserted into the inactive list.  They are
promoted to the active list if they are referenced enough.  This simple
modification changes the above example as follows.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)

As you can see, hot pages on active list would be protected.
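
Both variants of the example can be replayed with a toy model.  This is a
sketch under stated assumptions, not the kernel's LRU code: memory holds
100 pages, the active list is capped at 50 pages, pages are never
referenced again, and run is a hypothetical helper.

```python
from collections import Counter, deque

TOTAL_PAGES = 100  # memory size assumed for the example
ACTIVE_MAX = 50    # assumed cap on the active list


def run(insert_active):
    """Replay two bursts of 50 used-once pages; return page counts."""
    active = deque(["h"] * 50)  # index 0 is the newest page
    inactive = deque()
    swapped_out = []
    for _ in range(2 * 50):
        target = active if insert_active else inactive
        target.appendleft("uo")
        # Rebalance: demote the oldest active pages to the inactive list.
        while len(active) > ACTIVE_MAX:
            inactive.appendleft(active.pop())
        # Reclaim: swap out the oldest inactive pages when memory is full.
        while len(active) + len(inactive) > TOTAL_PAGES:
            swapped_out.append(inactive.pop())
    return (dict(Counter(active)), dict(Counter(inactive)),
            dict(Counter(swapped_out)))
```

run(True) models the old behaviour and ends with the 50 hot pages swapped
out; run(False) models this patch and ends with the hot pages still on the
active list and 50 used-once pages swapped out.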

Note that this implementation has a drawback: a page cannot be promoted
and will be swapped out if its re-access interval is greater than the size
of the inactive list but less than the total size (active + inactive).  To
solve this potential issue, a following patch will apply workingset
detection similar to the one that's already applied to the file LRU.

Signed-off-by: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago mm/vmscan: make active/inactive ratio as 1:1 for anon lru
Joonsoo Kim [Wed, 12 Aug 2020 01:30:36 +0000 (18:30 -0700)]
mm/vmscan: make active/inactive ratio as 1:1 for anon lru

Patch series "workingset protection/detection on the anonymous LRU list", v7.

* PROBLEM
In the current implementation, a newly created or swapped-in anonymous page
starts on the active list.  Growing the active list triggers rebalancing of
the active/inactive lists, so old pages on the active list are demoted to
the inactive list.  Hence, hot pages on the active list aren't protected at
all.

The following is an example of this situation.

Assume that there are 50 hot pages on the active list and that the system
can hold 100 pages in total.  Numbers denote the number of pages on the
active/inactive lists (active | inactive).  (h) stands for hot pages and
(uo) stands for used-once pages.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)

3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)

As we can see, hot pages are swapped out, which causes swap-ins later.

* SOLUTION
Since this is what we want to avoid, this patchset implements workingset
protection.  As with the file LRU list, a newly created or swapped-in
anonymous page starts on the inactive list.  Also as with the file LRU
list, the page is promoted if it is referenced enough.  This simple
modification changes the above example as follows.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)

The hot pages remain on the active list. :)

* EXPERIMENT
I tested this scenario on my test bed and confirmed that the problem
happens with the current implementation.  I also checked that it is fixed
by this patchset.

* SUBJECT
workingset detection

* PROBLEM
The later part of the patchset implements workingset detection for the
anonymous LRU list.  There is a corner case in which workingset protection
can cause thrashing.  If we can avoid this thrashing with workingset
detection, we get better performance.

The following is an example of thrashing due to workingset protection.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (will be hot) pages
50(h) | 50(wh)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)

4. workload: 50 (will be hot) pages
50(h) | 50(wh), swap-in 50(wh)

5. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)

6. repeat 4, 5

Without workingset detection, the (will be hot) pages in this kind of
workload cannot be promoted and the thrashing continues forever.

* SOLUTION
Therefore, this patchset implements workingset detection.  All the
infrastructure for workingset detection is already implemented, so there is
not much work to do.  First, extend the workingset detection code to deal
with the anonymous LRU list.  Then, make the swap cache handle the
exceptional value for the shadow entry.  Lastly, install/retrieve the
shadow value into/from the swap cache and check the refault distance.

* EXPERIMENT
I made a test program that imitates the above scenario and confirmed that
the problem exists.  Then, I checked that this patchset fixes it.

My test setup is a virtual machine with 8 cpus and 6100MB memory.  However,
the amount of memory that the test program can use is about 280 MB, because
the system uses a large ram-backed swap and a large ramdisk to capture the
trace.

The test scenario is as follows.

1. allocate cold memory (512MB)
2. allocate hot-1 memory (96MB)
3. activate hot-1 memory (96MB)
4. allocate another hot-2 memory (96MB)
5. access cold memory (128MB)
6. access hot-2 memory (96MB)
7. repeat 5, 6

Since hot-1 memory (96MB) is on the active list, the inactive list can
contain roughly 190MB of pages.  hot-2 memory's re-access interval (96+128
MB) is more than 190MB, so it cannot be promoted without workingset
detection and swap-in/out happens repeatedly.  With this patchset,
workingset detection works and promotion happens.  Therefore, swap-in/out
occurs less.
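
The reasoning above can be written out with the numbers from the scenario.
This is a simplified model of the refault-distance idea, not the code in
the patch: it assumes a page is activated on refault when the distance by
which it missed the inactive list does not exceed the active list size.
refault_decision is a hypothetical name.

```python
def refault_decision(reaccess_interval_mb, inactive_mb, active_mb):
    """Classify a re-access under a simplified refault-distance rule."""
    if reaccess_interval_mb <= inactive_mb:
        # The page is still resident when it is touched again.
        return "resident"
    # The page was evicted; measure by how much it missed the inactive list.
    refault_distance = reaccess_interval_mb - inactive_mb
    if refault_distance <= active_mb:
        return "refault, activated"
    return "refault, stays inactive"
```

For hot-2 memory, refault_decision(96 + 128, 190, 96) reports an
activation: the page misses the inactive list by 34MB, which is less than
the 96MB active list.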

Here is the result. (average of 5 runs)

type    swap-in  swap-out
base     863240    989945
patch    681565    809273

As we can see, the patched kernel does less swap-in/out.

* OVERALL TEST (ebizzy using modified random function)
ebizzy is a test program in which the main thread allocates lots of memory
and child threads access it randomly for a given time.  Swap-in will
happen if the allocated memory is larger than the system memory.

A random function that follows the zipf distribution is used to create
hot/cold memory.  The hot/cold ratio is controlled by a parameter.  If the
parameter is high, hot memory is accessed much more often than cold memory.
If the parameter is low, the number of accesses to each memory region is
similar.  I use various parameters in order to show the effect of the
patchset on workloads with various hot/cold ratios.

My test setup is a virtual machine with 8 cpus, 1024 MB memory and 5120 MB
ram swap.

The result format is as follows.

param: 1-1024-0.1
- 1 (number of threads)
- 1024 (allocated memory size, MB)
- 0.1 (zipf distribution alpha;
  0.1 behaves roughly like uniform random,
  1.3 behaves as if a small portion of memory is hot and the rest is cold)

pswpin: smaller is better
std: standard deviation
improvement: negative is better

* single thread
           param        pswpin       std       improvement
      base 1-1024.0-0.1 14101983.40   79441.19
      prot 1-1024.0-0.1 14065875.80  136413.01  (   -0.26 )
    detect 1-1024.0-0.1 13910435.60  100804.82  (   -1.36 )
      base 1-1024.0-0.7 7998368.80   43469.32
      prot 1-1024.0-0.7 7622245.80   88318.74  (   -4.70 )
    detect 1-1024.0-0.7 7618515.20   59742.07  (   -4.75 )
      base 1-1024.0-1.3 1017400.80   38756.30
      prot 1-1024.0-1.3  940464.60   29310.69  (   -7.56 )
    detect 1-1024.0-1.3  945511.40   24579.52  (   -7.07 )
      base 1-1280.0-0.1 22895541.40   50016.08
      prot 1-1280.0-0.1 22860305.40   51952.37  (   -0.15 )
    detect 1-1280.0-0.1 22705565.20   93380.35  (   -0.83 )
      base 1-1280.0-0.7 13717645.60   46250.65
      prot 1-1280.0-0.7 12935355.80   64754.43  (   -5.70 )
    detect 1-1280.0-0.7 13040232.00   63304.00  (   -4.94 )
      base 1-1280.0-1.3 1654251.40    4159.68
      prot 1-1280.0-1.3 1522680.60   33673.50  (   -7.95 )
    detect 1-1280.0-1.3 1599207.00   70327.89  (   -3.33 )
      base 1-1536.0-0.1 31621775.40   31156.28
      prot 1-1536.0-0.1 31540355.20   62241.36  (   -0.26 )
    detect 1-1536.0-0.1 31420056.00  123831.27  (   -0.64 )
      base 1-1536.0-0.7 19620760.60   60937.60
      prot 1-1536.0-0.7 18337839.60   56102.58  (   -6.54 )
    detect 1-1536.0-0.7 18599128.00   75289.48  (   -5.21 )
      base 1-1536.0-1.3 2378142.40   20994.43
      prot 1-1536.0-1.3 2166260.60   48455.46  (   -8.91 )
    detect 1-1536.0-1.3 2183762.20   16883.24  (   -8.17 )
      base 1-1792.0-0.1 40259714.80   90750.70
      prot 1-1792.0-0.1 40053917.20   64509.47  (   -0.51 )
    detect 1-1792.0-0.1 39949736.40  104989.64  (   -0.77 )
      base 1-1792.0-0.7 25704884.40   69429.68
      prot 1-1792.0-0.7 23937389.00   79945.60  (   -6.88 )
    detect 1-1792.0-0.7 24271902.00   35044.30  (   -5.57 )
      base 1-1792.0-1.3 3129497.00   32731.86
      prot 1-1792.0-1.3 2796994.40   19017.26  (  -10.62 )
    detect 1-1792.0-1.3 2886840.40   33938.82  (   -7.75 )
      base 1-2048.0-0.1 48746924.40   50863.88
      prot 1-2048.0-0.1 48631954.40   24537.30  (   -0.24 )
    detect 1-2048.0-0.1 48509419.80   27085.34  (   -0.49 )
      base 1-2048.0-0.7 32046424.40   78624.22
      prot 1-2048.0-0.7 29764182.20   86002.26  (   -7.12 )
    detect 1-2048.0-0.7 30250315.80  101282.14  (   -5.60 )
      base 1-2048.0-1.3 3916723.60   24048.55
      prot 1-2048.0-1.3 3490781.60   33292.61  (  -10.87 )
    detect 1-2048.0-1.3 3585002.20   44942.04  (   -8.47 )

* multi thread
           param        pswpin       std       improvement
      base 8-1024.0-0.1 16219822.60  329474.01
      prot 8-1024.0-0.1 15959494.00  654597.45  (   -1.61 )
    detect 8-1024.0-0.1 15773790.80  502275.25  (   -2.75 )
      base 8-1024.0-0.7 9174107.80  537619.33
      prot 8-1024.0-0.7 8571915.00  385230.08  (   -6.56 )
    detect 8-1024.0-0.7 8489484.20  364683.00  (   -7.46 )
      base 8-1024.0-1.3 1108495.60   83555.98
      prot 8-1024.0-1.3 1038906.20   63465.20  (   -6.28 )
    detect 8-1024.0-1.3  941817.80   32648.80  (  -15.04 )
      base 8-1280.0-0.1 25776114.20  450480.45
      prot 8-1280.0-0.1 25430847.00  465627.07  (   -1.34 )
    detect 8-1280.0-0.1 25282555.00  465666.55  (   -1.91 )
      base 8-1280.0-0.7 15218968.00  702007.69
      prot 8-1280.0-0.7 13957947.80  492643.86  (   -8.29 )
    detect 8-1280.0-0.7 14158331.20  238656.02  (   -6.97 )
      base 8-1280.0-1.3 1792482.80   30512.90
      prot 8-1280.0-1.3 1577686.40   34002.62  (  -11.98 )
    detect 8-1280.0-1.3 1556133.00   22944.79  (  -13.19 )
      base 8-1536.0-0.1 33923761.40  575455.85
      prot 8-1536.0-0.1 32715766.20  300633.51  (   -3.56 )
    detect 8-1536.0-0.1 33158477.40  117764.51  (   -2.26 )
      base 8-1536.0-0.7 20628907.80  303851.34
      prot 8-1536.0-0.7 19329511.20  341719.31  (   -6.30 )
    detect 8-1536.0-0.7 20013934.00  385358.66  (   -2.98 )
      base 8-1536.0-1.3 2588106.40  130769.20
      prot 8-1536.0-1.3 2275222.40   89637.06  (  -12.09 )
    detect 8-1536.0-1.3 2365008.40  124412.55  (   -8.62 )
      base 8-1792.0-0.1 43328279.20  946469.12
      prot 8-1792.0-0.1 41481980.80  525690.89  (   -4.26 )
    detect 8-1792.0-0.1 41713944.60  406798.93  (   -3.73 )
      base 8-1792.0-0.7 27155647.40  536253.57
      prot 8-1792.0-0.7 24989406.80  502734.52  (   -7.98 )
    detect 8-1792.0-0.7 25524806.40  263237.87  (   -6.01 )
      base 8-1792.0-1.3 3260372.80  137907.92
      prot 8-1792.0-1.3 2879187.80   63597.26  (  -11.69 )
    detect 8-1792.0-1.3 2892962.20   33229.13  (  -11.27 )
      base 8-2048.0-0.1 50583989.80  710121.48
      prot 8-2048.0-0.1 49599984.40  228782.42  (   -1.95 )
    detect 8-2048.0-0.1 50578596.00  660971.66  (   -0.01 )
      base 8-2048.0-0.7 33765479.60  812659.55
      prot 8-2048.0-0.7 30767021.20  462907.24  (   -8.88 )
    detect 8-2048.0-0.7 32213068.80  211884.24  (   -4.60 )
      base 8-2048.0-1.3 3941675.80   28436.45
      prot 8-2048.0-1.3 3538742.40   76856.08  (  -10.22 )
    detect 8-2048.0-1.3 3579397.80   58630.95  (   -9.19 )
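
The improvement column in the tables above is consistent with the relative
change in pswpin against the base row, in percent.  A small sketch that
rechecks the first single-thread rows; the function name is made up.

```python
def improvement(base, value):
    """Relative change of `value` against `base`, in percent."""
    return (value - base) / base * 100


# First single-thread rows: base, prot and detect for param 1-1024.0-0.1.
prot = round(improvement(14101983.40, 14065875.80), 2)
detect = round(improvement(14101983.40, 13910435.60), 2)
```

The two rounded values match the -0.26 and -1.36 listed in the table.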

As we can see, all the cases show improvement.  In particular, the test
cases with zipf distribution alpha 1.3 show larger improvements.  This
means that the patchset works better when there is a hot/cold tendency in
anon pages.

This patch (of 6):

The current implementation of LRU management for anonymous pages has some
problems.  The most important one is that it doesn't protect the
workingset, that is, pages on the active LRU list.  Although this problem
will be fixed by the following patches, some preparation is required and
this patch does it.

The following patch implements workingset protection.  After it, newly
created or swapped-in pages will start their lifetime on the inactive list.
If the inactive list is too small, a page does not get enough chance to be
referenced and cannot become part of the workingset.

In order to give newly created or swapped-in anonymous pages enough chance
to be referenced again, this patch makes the active/inactive LRU ratio 1:1.

This is just a temporary measure.  A later patch in the series introduces
workingset detection for the anonymous LRU that will be used to better
decide if pages should start on the active or inactive list.  Afterwards
this patch is effectively reverted.

Signed-off-by: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago mm/hugetlb: add mempolicy check in the reservation routine
Muchun Song [Wed, 12 Aug 2020 01:30:32 +0000 (18:30 -0700)]
mm/hugetlb: add mempolicy check in the reservation routine

In the reservation routine, we only check whether the cpuset meets the
memory allocation requirements, but we ignore the MPOL_BIND mempolicy case.
If someone mmaps hugetlb memory, the mmap may succeed but the subsequent
memory allocation may fail due to mempolicy restrictions, and the process
receives the SIGBUS signal.  This can be reproduced by the following steps.

 1) Compile the test case.
    cd tools/testing/selftests/vm/
    gcc map_hugetlb.c -o map_hugetlb

 2) Pre-allocate huge pages. Suppose there are 2 numa nodes in the
    system. Each node will pre-allocate one huge page.
    echo 2 > /proc/sys/vm/nr_hugepages

 3) Run the test case (mmap 4MB). We receive the SIGBUS signal.
    numactl --membind=0 ./map_hugetlb 4

With this patch applied, the mmap fails in step 3) and reports
"mmap: Cannot allocate memory".
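
The effect of the extra check can be modelled with the numbers from the
reproduction steps.  This is an illustrative sketch, not the kernel
routine: can_reserve is a hypothetical helper, and a 2MB huge page size is
assumed so that the 4MB mapping needs two pages.

```python
def can_reserve(free_per_node, pages, cpuset_nodes, bind_nodes=None):
    """Check a hugetlb reservation against the nodes usable by the task."""
    allowed = set(cpuset_nodes)
    if bind_nodes is not None:
        # MPOL_BIND narrows the usable nodes further.
        allowed &= set(bind_nodes)
    available = sum(free_per_node.get(node, 0) for node in allowed)
    return available >= pages


free = {0: 1, 1: 1}  # one pre-allocated huge page on each of two nodes
```

Checking only the cpuset, can_reserve(free, 2, {0, 1}) succeeds, which
corresponds to the mmap that later ends in SIGBUS.  With the bind to node
0 taken into account, can_reserve(free, 2, {0, 1}, bind_nodes={0}) fails
up front.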

[[email protected]: include sched.h for `current']

Reported-by: Jianchao Guo <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Signed-off-by: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Baoquan He <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago kselftests: cgroup: add perpcu memory accounting test
Roman Gushchin [Wed, 12 Aug 2020 01:30:29 +0000 (18:30 -0700)]
kselftests: cgroup: add perpcu memory accounting test

Add a simple test to check the percpu memory accounting.  The test creates
a cgroup tree with 1000 child cgroups and checks values of memory.current
and memory.stat::percpu.

Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Tobin C. Harding <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Bixuan Cui <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
4 years ago mm: memcg: charge memcg percpu memory to the parent cgroup
Roman Gushchin [Wed, 12 Aug 2020 01:30:25 +0000 (18:30 -0700)]
mm: memcg: charge memcg percpu memory to the parent cgroup

Memory cgroups are using large chunks of percpu memory to store vmstat
data.  Yet this memory is not accounted at all, so in the case when there
are many (dying) cgroups, it's not exactly clear where all the memory is.

Because the size of memory cgroup internal structures can dramatically
exceed the size of object or page which is pinning it in the memory, it's
not a good idea to simply ignore it.  It actually breaks the isolation
between cgroups.

Let's account the consumed percpu memory to the parent cgroup.

[[email protected]: add WARN_ON_ONCE()s, per Johannes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Dennis Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Tobin C. Harding <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Bixuan Cui <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm: memcg/percpu: per-memcg percpu memory statistics
Roman Gushchin [Wed, 12 Aug 2020 01:30:21 +0000 (18:30 -0700)]
mm: memcg/percpu: per-memcg percpu memory statistics

Percpu memory can represent a noticeable chunk of the total memory
consumption, especially on big machines with many CPUs.  Let's track
percpu memory usage for each memcg and display it in memory.stat.

A percpu allocation is usually scattered over multiple pages (and nodes),
and can be significantly smaller than a page.  So let's add a byte-sized
counter on the memcg level: MEMCG_PERCPU_B.  Byte-sized vmstat infra
created for slabs can be perfectly reused for percpu case.
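
To see why a page-sized counter would be too coarse, consider the
footprint of one small percpu object.  The sketch assumes the footprint is
simply the object size times the number of CPUs and a 4096-byte page; both
are simplifications chosen for this example, not values taken from the
patch.

```python
PAGE_SIZE = 4096  # assumed page size for the example


def percpu_footprint_bytes(obj_size, nr_cpus):
    """Total bytes used by one percpu object: one copy per CPU (simplified)."""
    return obj_size * nr_cpus


def whole_pages(nbytes):
    """Number of complete pages covered, i.e. what a page counter would see."""
    return nbytes // PAGE_SIZE
```

A 64-byte object on 32 CPUs occupies 2048 bytes, which rounds down to zero
whole pages, so only a byte-sized counter can represent it.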

[[email protected]: v3]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Dennis Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Tobin C. Harding <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Bixuan Cui <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
4 years agomm: memcg/percpu: account percpu memory to memory cgroups
Roman Gushchin [Wed, 12 Aug 2020 01:30:17 +0000 (18:30 -0700)]
mm: memcg/percpu: account percpu memory to memory cgroups

Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.

As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user.  Also, some cgroup internals (e.g.  memory controller
statistics) can be quite large.  On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.

So the lack of memcg accounting is creating a breach in the memory
isolation.  Similar to the slab memory, percpu memory should be accounted
by default.

To implement the percpu accounting it's possible to take the slab memory
accounting as a model to follow.  Let's introduce two types of percpu
chunks: root and memcg.  What makes memcg chunks different is an
additional space allocated to store memcg membership information.  If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.

To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.

[[email protected]: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[[email protected]: move unreachable code, per Roman]
[[email protected]: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Bixuan Cui <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Dennis Zhou <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Tobin C. Harding <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Bixuan Cui <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
4 years agopercpu: return number of released bytes from pcpu_free_area()
Roman Gushchin [Wed, 12 Aug 2020 01:30:14 +0000 (18:30 -0700)]
percpu: return number of released bytes from pcpu_free_area()

Patch series "mm: memcg accounting of percpu memory", v3.

This patchset adds percpu memory accounting to memory cgroups.  It's based
on the rework of the slab controller and reuses concepts and features
introduced for the per-object slab accounting.

Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.

As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user.  Also, some cgroup internals (e.g.  memory controller
statistics) can be quite large.  On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.

So the lack of memcg accounting is creating a breach in the memory
isolation.  Similar to the slab memory, percpu memory should be accounted
by default.

Percpu allocations by their nature are scattered over multiple pages, so
they can't be tracked on the per-page basis.  So the per-object tracking
introduced by the new slab controller is reused.

The patchset implements charging of percpu allocations, adds memcg-level
statistics, enables accounting for percpu allocations made by memory
cgroup internals and provides some basic tests.

To implement the accounting of percpu memory without a significant memory
and performance overhead the following approach is used: all accounted
allocations are placed into a separate percpu chunk (or chunks).  These
chunks are similar to default chunks, except that they do have an attached
vector of pointers to obj_cgroup objects, which is big enough to save a
pointer for each allocated object.  On the allocation, if the allocation
has to be accounted (__GFP_ACCOUNT is passed, the allocating process
belongs to a non-root memory cgroup, etc), the memory cgroup is getting
charged and if the maximum limit is not exceeded the allocation is
performed using a memcg-aware chunk.  Otherwise -ENOMEM is returned or the
allocation is forced over the limit, depending on gfp (as any other kernel
memory allocation).  The memory cgroup information is saved in the
obj_cgroup vector at the corresponding offset.  On the release time the
memcg information is restored from the vector and the cgroup is getting
uncharged.  Unaccounted allocations (at this point the absolute majority
of all percpu allocations) are performed in the old way, so no additional
overhead is expected.
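
The charge/record/uncharge flow described above can be modelled in a few
lines of user-space Python.  This is only a toy model under stated
assumptions: the class names, the per-offset bookkeeping and the simple
byte limit are invented for the example, and forcing an allocation over
the limit is not modelled.

```python
class MemCgroup:
    """Toy memory cgroup with a byte limit (stand-in for the real charge path)."""

    def __init__(self, limit):
        self.limit = limit
        self.charged = 0

    def try_charge(self, nbytes):
        if self.charged + nbytes > self.limit:
            return False
        self.charged += nbytes
        return True

    def uncharge(self, nbytes):
        self.charged -= nbytes


class MemcgChunk:
    """Toy memcg-aware chunk: one owner slot per allocation offset."""

    def __init__(self, nr_units):
        self.owners = [None] * nr_units  # models the obj_cgroup vector
        self.sizes = [0] * nr_units

    def alloc(self, offset, nbytes, memcg):
        # Charge first; fail (like -ENOMEM) if the limit would be exceeded.
        if not memcg.try_charge(nbytes):
            return False
        self.owners[offset] = memcg  # record ownership at the object's offset
        self.sizes[offset] = nbytes
        return True

    def free(self, offset):
        # Restore the owner from the vector, uncharge it, report the size.
        memcg, nbytes = self.owners[offset], self.sizes[offset]
        self.owners[offset], self.sizes[offset] = None, 0
        memcg.uncharge(nbytes)
        return nbytes
```

Note that free() needs the size of the released object in order to
uncharge the right amount.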

To avoid pinning dying memory cgroups by outstanding allocations,
obj_cgroup API is used instead of directly saving memory cgroup pointers.
obj_cgroup is basically a pointer to a memory cgroup with a standalone
reference counter.  The trick is that it can be atomically swapped to
point at the parent cgroup, so that the original memory cgroup can be
released prior to all objects which have been charged to it.  Because all
charges and statistics are fully recursive, it's perfectly correct to
uncharge the parent cgroup instead.  This scheme is used in the slab
memory accounting, and percpu memory can just follow the scheme.
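
A minimal sketch of the reparenting idea, assuming a toy hierarchy in which
every charge is propagated to all ancestors.  The class and method names
are hypothetical, and the atomic swap and reference counting are not
modelled.

```python
class Cgroup:
    """Toy cgroup whose usage is hierarchical (includes all descendants)."""

    def __init__(self, parent=None):
        self.parent = parent
        self.usage = 0

    def charge(self, nbytes):
        node = self
        while node is not None:  # recursive accounting up to the root
            node.usage += nbytes
            node = node.parent

    def uncharge(self, nbytes):
        self.charge(-nbytes)


class ObjCgroup:
    """Toy obj_cgroup: objects hold this indirection, not the cgroup itself."""

    def __init__(self, memcg):
        self.memcg = memcg

    def reparent(self):
        """Point at the parent so the original cgroup can be released."""
        self.memcg = self.memcg.parent
```

After reparent(), an object charged through the child is uncharged from
the parent, and because the parent's usage already included the child's
charge, the totals stay balanced.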

This patch (of 5):

To implement accounting of percpu memory we need the information about the
size of freed object.  Return it from pcpu_free_area().

Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Dennis Zhou <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Tobin C. Harding <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Bixuan Cui <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
4 years agofix breakage in do_rmdir()
Al Viro [Wed, 12 Aug 2020 04:15:18 +0000 (05:15 +0100)]
fix breakage in do_rmdir()

syzbot reported and bisected a use-after-free due to the recent init
cleanups.

The putname() should happen only after we'd *not* branched to retry,
same as it's done in do_unlinkat().

Reported-by: [email protected]
Fixes: e24ab0ef689d "fs: push the getname from do_rmdir into the callers"
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 years agoALSA: hda/hdmi: Use force connectivity quirk on another HP desktop
Kai-Heng Feng [Tue, 11 Aug 2020 09:53:34 +0000 (17:53 +0800)]
ALSA: hda/hdmi: Use force connectivity quirk on another HP desktop

Another HP desktop has a buggy BIOS which flags the Port
Connectivity bit as no connection.

Apply force connectivity quirk to enable DP/HDMI audio.

Signed-off-by: Kai-Heng Feng <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>
4 years agoNFS: Fix flexfiles read failover
Trond Myklebust [Tue, 11 Aug 2020 17:36:32 +0000 (13:36 -0400)]
NFS: Fix flexfiles read failover

The current mirrored read failover code is correctly resetting the mirror
index between failed reads, however it is not able to actually flip the
RPC call over to the next RPC client.
The end result is that we keep resending the RPC call to the same client
over and over.

The fix is to use the pnfs_read_resend_pnfs() mechanism to schedule a
new RPC call, but we need to add the ability to pass in a mirror
index so that we always retry the next mirror in the list.

Fixes: 166bd5b889ac ("pNFS/flexfiles: Fix layoutstats handling during read failovers")
Signed-off-by: Trond Myklebust <[email protected]>
4 years agoio_uring: fail poll arm on queue proc failure
Jens Axboe [Tue, 11 Aug 2020 15:50:19 +0000 (09:50 -0600)]
io_uring: fail poll arm on queue proc failure

Check the ipt.error value, it must have been either cleared to zero or
set to another error than the default -EINVAL if we don't go through the
waitqueue proc addition. Just give up on poll at that point and return
failure; this will fall back to async work.

io_poll_add() doesn't suffer from this failure case, as it returns the
error value directly.

Cc: [email protected] # v5.7+
Reported-by: [email protected]
Reviewed-by: Stefano Garzarella <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
4 years agofs: nfs: delete repeated words in comments
Randy Dunlap [Tue, 11 Aug 2020 02:18:35 +0000 (19:18 -0700)]
fs: nfs: delete repeated words in comments

Drop duplicated words {the, and} in comments.

Signed-off-by: Randy Dunlap <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: [email protected]
Signed-off-by: Trond Myklebust <[email protected]>
4 years agorpc_pipefs: convert comma to semicolon
Xu Wang [Mon, 10 Aug 2020 02:46:01 +0000 (02:46 +0000)]
rpc_pipefs: convert comma to semicolon

Replace a comma between expression statements by a semicolon.

Signed-off-by: Xu Wang <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
4 years agonfs: Fix getxattr kernel panic and memory overflow
Jeffrey Mitchell [Wed, 5 Aug 2020 17:23:19 +0000 (12:23 -0500)]
nfs: Fix getxattr kernel panic and memory overflow

Move the buffer size check to decode_attr_security_label(), before the
memcpy(), and only call memcpy() if the buffer is large enough.
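
The check-before-copy pattern can be illustrated with a small user-space
model.  The function name and the choice of ERANGE as the error value are
assumptions for this sketch, not details quoted from the patch.

```python
import errno


def copy_label(dst, src):
    """Copy src into the fixed-size bytearray dst only if it fits.

    Returns the number of bytes copied, or a negative errno-style value.
    """
    if len(src) > len(dst):
        return -errno.ERANGE  # assumed error code for this sketch
    dst[:len(src)] = src
    return len(src)
```

An oversized label is rejected before any byte is written, so the
destination buffer is never overrun.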

Fixes: aa9c2669626c ("NFS: Client implementation of Labeled-NFS")
Signed-off-by: Jeffrey Mitchell <[email protected]>
[Trond: clean up duplicate test of label->len != 0]
Signed-off-by: Trond Myklebust <[email protected]>
4 years agoNFS: Don't return layout segments that are in use
Trond Myklebust [Wed, 5 Aug 2020 13:03:56 +0000 (09:03 -0400)]
NFS: Don't return layout segments that are in use

If the NFS_LAYOUT_RETURN_REQUESTED flag is set, we want to return the
layout as soon as possible, meaning that the affected layout segments
should be marked as invalid, and should no longer be in use for I/O.

Fixes: f0b429819b5f ("pNFS: Ignore non-recalled layouts in pnfs_layout_need_return()")
Cc: [email protected] # v4.19+
Signed-off-by: Trond Myklebust <[email protected]>
4 years agoNFS: Don't move layouts to plh_return_segs list while in use
Trond Myklebust [Tue, 4 Aug 2020 20:30:30 +0000 (16:30 -0400)]
NFS: Don't move layouts to plh_return_segs list while in use

If the layout segment is still in use for a read or a write, we should
not move it to the layout plh_return_segs list. If we do, we can end
up returning the layout while I/O is still in progress.

Fixes: e0b7d420f72a ("pNFS: Don't discard layout segments that are marked for return")
Cc: [email protected] # v4.19+
Signed-off-by: Trond Myklebust <[email protected]>
4 years agoNFS: Add layout segment info to pnfs read/write/commit tracepoints
Trond Myklebust [Wed, 5 Aug 2020 02:16:06 +0000 (22:16 -0400)]
NFS: Add layout segment info to pnfs read/write/commit tracepoints

Allow the pnfs I/O tracepoints to trace which layout segment is being
used.

Signed-off-by: Trond Myklebust <[email protected]>
4 years agoparisc: Implement __smp_store_release and __smp_load_acquire barriers
John David Anglin [Thu, 30 Jul 2020 12:59:12 +0000 (08:59 -0400)]
parisc: Implement __smp_store_release and __smp_load_acquire barriers

This patch implements the __smp_store_release and __smp_load_acquire barriers
using ordered stores and loads.  This avoids the sync instruction present in
the generic implementation.

Cc: <[email protected]> # 4.14+
Signed-off-by: Dave Anglin <[email protected]>
Signed-off-by: Helge Deller <[email protected]>
4 years agoperf bench: Fix a couple of spelling mistakes in options text
Colin Ian King [Wed, 12 Aug 2020 06:46:47 +0000 (07:46 +0100)]
perf bench: Fix a couple of spelling mistakes in options text

There are a couple of spelling mistakes in the text. Fix these.

Signed-off-by: Colin King <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
4 years agoperf bench numa: Fix benchmark names
Alexander Gordeev [Mon, 10 Aug 2020 06:22:00 +0000 (08:22 +0200)]
perf bench numa: Fix benchmark names

Standard benchmark names let users know the test specifics.  For
example, the "2x1-bw-process" name tells that two processes, one thread
each, are run and the RAM bandwidth is measured.
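
As an illustration of that naming scheme, the sketch below splits such a
name into its parts.  It assumes the leading "PxT" always means processes
times threads, which is inferred from the single example above; names that
do not follow the pattern are rejected.

```python
import re


def parse_bench_name(name):
    """Split a name such as "2x1-bw-process" into (processes, threads, rest)."""
    m = re.match(r"^\s*(\d+)x(\d+)-(.+)$", name)
    if m is None:
        raise ValueError("unexpected benchmark name: %r" % name)
    return int(m.group(1)), int(m.group(2)), m.group(3)
```

For "2x1-bw-process" this yields 2 processes, 1 thread each, and the
remainder "bw-process".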

Several benchmark names do not correspond to their actual running
configuration. Fix that and also some whitespace and comment
inconsistencies.

Signed-off-by: Alexander Gordeev <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lore.kernel.org/lkml/6b6f2084f132ee8e9203dc7c32f9deb209b87a68.1597004831.git.agordeev@linux.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
4 years agoperf bench numa: Fix number of processes in "2x3-convergence" test
Alexander Gordeev [Mon, 10 Aug 2020 06:21:59 +0000 (08:21 +0200)]
perf bench numa: Fix number of processes in "2x3-convergence" test

Signed-off-by: Alexander Gordeev <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lore.kernel.org/lkml/d949f5f48e17fc816f3beecf8479f1b2480345e4.1597004831.git.agordeev@linux.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
4 years agotools headers UAPI: Sync kvm.h headers with the kernel sources
Arnaldo Carvalho de Melo [Wed, 12 Aug 2020 12:02:40 +0000 (09:02 -0300)]
tools headers UAPI: Sync kvm.h headers with the kernel sources

To pick the changes in:

  3edd68399dc1 ("KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support")
  1aa561b1a4c0 ("kvm: x86: Add "last CPU" to some KVM_EXIT information")
  23a60f834406 ("s390/kvm: diagnose 0x318 sync and reset")

That do not result in any change in tooling, as the additions are not
being used in any table generator.

This silences this perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/kvm.h' differs from latest version at 'include/uapi/linux/kvm.h'
  diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h

Cc: Adrian Hunter <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Collin Walling <[email protected]>
Cc: Jim Mattson <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mohammed Gamal <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
4 years agotools include UAPI: Sync linux/vhost.h with the kernel sources
Arnaldo Carvalho de Melo [Wed, 12 Aug 2020 11:57:07 +0000 (08:57 -0300)]
tools include UAPI: Sync linux/vhost.h with the kernel sources

To get the changes in:

  25abc060d282 ("vhost-vdpa: support IOTLB batching hints")

This doesn't result in any changes in tooling, no new ioctls to be
picked up by the id->string table generators, etc.

Silencing this perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/vhost.h' differs from latest version at 'include/uapi/linux/vhost.h'
  diff -u tools/include/uapi/linux/vhost.h include/uapi/linux/vhost.h

Cc: Adrian Hunter <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Namhyung Kim <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
4 years agotools headers kvm s390: Sync headers with the kernel sources
Arnaldo Carvalho de Melo [Wed, 12 Aug 2020 11:52:32 +0000 (08:52 -0300)]
tools headers kvm s390: Sync headers with the kernel sources

To pick the changes in:

  23a60f834406 ("s390/kvm: diagnose 0x318 sync and reset")

None of them trigger any changes in tooling, this time this is just to silence
these perf build warnings:

  Warning: Kernel ABI header at 'tools/arch/s390/include/uapi/asm/kvm.h' differs from latest version at 'arch/s390/include/uapi/asm/kvm.h'
  diff -u tools/arch/s390/include/uapi/asm/kvm.h arch/s390/include/uapi/asm/kvm.h

Cc: Adrian Hunter <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Collin Walling <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
4 years agoperf trace beauty: Use the autogenerated protocol family table
Arnaldo Carvalho de Melo [Wed, 12 Aug 2020 11:43:51 +0000 (08:43 -0300)]
perf trace beauty: Use the autogenerated protocol family table

That helps us not to lose new protocol families when they are
introduced, replacing that hardcoded, dated family->string table.

To recap what this allows us to do:

  # perf trace -e syscalls:sys_enter_socket/max-stack=10/ --filter=family==INET --max-events=1
     0.000 fetchmail/41097 syscalls:sys_enter_socket(family: INET, type: DGRAM|CLOEXEC|NONBLOCK, protocol: IP)
                                       __GI___socket (inlined)
                                       reopen (/usr/lib64/libresolv-2.31.so)
                                       send_dg (/usr/lib64/libresolv-2.31.so)
                                       __res_context_send (/usr/lib64/libresolv-2.31.so)
                                       __GI___res_context_query (inlined)
                                       __GI___res_context_search (inlined)
                                       _nss_dns_gethostbyname4_r (/usr/lib64/libnss_dns-2.31.so)
                                       gaih_inet.constprop.0 (/usr/lib64/libc-2.31.so)
                                       __GI_getaddrinfo (inlined)
                                       [0x15cb2] (/usr/bin/fetchmail)
  #

More work is still needed to allow for the more natural strace-like
syscall name usage instead of the trace event name:

  # perf trace -e socket/max-stack=10,family==INET/ --max-events=1

I.e. to allow for modifiers to follow the syscall name and for logical
expressions to be accepted as filters to use with that syscall, be it as
trace event filters or BPF based ones.

Using -v we can see how the trace event filter is built:

  # perf trace -v -e syscalls:sys_enter_socket/call-graph=dwarf/ --filter=family==INET --max-events=2
  <SNIP>
  New filter for syscalls:sys_enter_socket: (family==0x2) && (common_pid != 41384 && common_pid != 2836)
  <SNIP>

  $ tools/perf/trace/beauty/socket.sh | grep -w 2
   [2] = "INET",
  $
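
The translation from "family==INET" to "(family==0x2)" seen in the -v
output can be sketched as a table lookup.  This is not perf's code: the
function name is hypothetical and the dictionary is a hand-written subset
of the generated table.

```python
# Subset of the socket family table, name -> number.
SOCKET_FAMILIES = {"UNSPEC": 0, "LOCAL": 1, "INET": 2, "INET6": 10}


def resolve_filter(expr, names):
    """Replace a symbolic right-hand side in "field==NAME" with its hex value."""
    field, sep, value = expr.partition("==")
    if not sep:
        raise ValueError("expected field==value: %r" % expr)
    if value in names:
        value = hex(names[value])
    return "(%s==%s)" % (field, value)
```

resolve_filter("family==INET", SOCKET_FAMILIES) gives "(family==0x2)";
values that are not in the table are passed through unchanged.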

Cc: Adrian Hunter <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
4 years agoperf trace beauty: Add script to autogenerate socket families table
Arnaldo Carvalho de Melo [Wed, 12 Aug 2020 11:30:06 +0000 (08:30 -0300)]
perf trace beauty: Add script to autogenerate socket families table

To use with 'perf trace', to convert the protocol families to strings,
e.g:

  $ tools/perf/trace/beauty/socket.sh
  static const char *socket_families[] = {
   [0] = "UNSPEC",
   [1] = "LOCAL",
   [2] = "INET",
   [3] = "AX25",
   [4] = "IPX",
   [5] = "APPLETALK",
   [6] = "NETROM",
   [7] = "BRIDGE",
   [8] = "ATMPVC",
   [9] = "X25",
   [10] = "INET6",
   [11] = "ROSE",
   [12] = "DECnet",
   [13] = "NETBEUI",
   [14] = "SECURITY",
   [15] = "KEY",
   [16] = "NETLINK",
   [17] = "PACKET",
   [18] = "ASH",
   [19] = "ECONET",
   [20] = "ATMSVC",
   [21] = "RDS",
   [22] = "SNA",
   [23] = "IRDA",
   [24] = "PPPOX",
   [25] = "WANPIPE",
   [26] = "LLC",
   [27] = "IB",
   [28] = "MPLS",
   [29] = "CAN",
   [30] = "TIPC",
   [31] = "BLUETOOTH",
   [32] = "IUCV",
   [33] = "RXRPC",
   [34] = "ISDN",
   [35] = "PHONET",
   [36] = "IEEE802154",
   [37] = "CAIF",
   [38] = "ALG",
   [39] = "NFC",
   [40] = "VSOCK",
   [41] = "KCM",
   [42] = "QIPCRTR",
   [43] = "SMC",
   [44] = "XDP",
  };
  $
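
The actual generator is a shell script; purely as an illustration of the
idea, the Python sketch below builds the same kind of table from
"#define AF_<name> <number>" lines.  The function name and the handling of
aliases (the later definition of a number wins, AF_MAX and non-numeric
aliases are skipped) are assumptions of this sketch.

```python
import re


def socket_families_table(header_text):
    """Build the C table text from '#define AF_NAME <number>' lines."""
    table = {}
    pattern = r"^#define[ \t]+AF_(\w+)[ \t]+(\d+)\b"
    for m in re.finditer(pattern, header_text, re.M):
        name, num = m.group(1), int(m.group(2))
        if name == "MAX":
            continue  # upper bound, not a family
        table[num] = name  # a later alias replaces an earlier one
    lines = ["static const char *socket_families[] = {"]
    lines += ['\t[%d] = "%s",' % (num, table[num]) for num in sorted(table)]
    lines.append("};")
    return "\n".join(lines)
```

Feeding it a header fragment that defines AF_UNIX and AF_LOCAL as 1 emits
a single [1] = "LOCAL" entry.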

This uses a copy of include/linux/socket.h that is kept in a directory
to be used just for these table generation scripts and for checking if
the kernel has a new file that maybe gets something new for these
tables.

This allows us to:

- Avoid accessing files outside tools/, in the kernel sources, that may
  be changed in unexpected ways and thus break these scripts.

- Notice when those files change and thus check if the changes don't
  break those scripts, update them to automatically get the new
  definitions, a new socket family, for instance.

- Not add them to tools/include/, where they may end up used while
  building the tools and end up requiring dragging yet more stuff from
  the kernel or plain break the build in some of the myriad environments
  where perf may be built.

This will replace the previous static array in tools/perf/ that was
dated and was already missing the AF_KCM, AF_QIPCRTR, AF_SMC and AF_XDP
families.

The next cset will wire this up to the perf build process.

At some point this must be made into a library to be used in places such
as libtraceevent, bpftrace, etc.

Cc: Adrian Hunter <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
4 years agortc: pcf2127: fix alarm handling
Alexandre Belloni [Wed, 12 Aug 2020 08:51:14 +0000 (10:51 +0200)]
rtc: pcf2127: fix alarm handling

Fix multiple issues when handling alarms:
 - Use threaded interrupt to avoid scheduling when atomic
 - Stop matching on week day as it may not be set correctly
 - Avoid parsing the DT interrupt and use what is provided by the i2c or
   spi subsystem
 - Avoid returning IRQ_NONE in case of error in the interrupt handler
 - Never write WDTF as specified in the datasheet
 - Set uie_unsupported, as for the pcf85063, setting alarms every second
   is not working correctly and confuses the RTC.

Signed-off-by: Alexandre Belloni <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agortc: pcf2127: add alarm support
Liam Beguin [Tue, 30 Jun 2020 02:42:11 +0000 (22:42 -0400)]
rtc: pcf2127: add alarm support

Add alarm support for the pcf2127 RTC chip family.
Tested on pca2129.

Signed-off-by: Liam Beguin <[email protected]>
Signed-off-by: Alexandre Belloni <[email protected]>
Reviewed-by: Bruno Thomsen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agortc: pcf2127: add pca2129 device id
Liam Beguin [Tue, 30 Jun 2020 02:42:10 +0000 (22:42 -0400)]
rtc: pcf2127: add pca2129 device id

The PCA2129 is the automotive grade version of the PCF2129.
Add it to the list of compatibles.

Signed-off-by: Liam Beguin <[email protected]>
Signed-off-by: Alexandre Belloni <[email protected]>
Reviewed-by: Bruno Thomsen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
4 years agogenirq/PM: Always unlock IRQ descriptor in rearm_wake_irq()
Guenter Roeck [Tue, 11 Aug 2020 18:00:01 +0000 (11:00 -0700)]
genirq/PM: Always unlock IRQ descriptor in rearm_wake_irq()

rearm_wake_irq() does not unlock the irq descriptor if the interrupt
is not suspended or if wakeup is not enabled on it.

Restructure the exit conditions so the unlock is always ensured.

Fixes: 3a79bc63d9075 ("PCI: irq: Introduce rearm_wake_irq()")
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
4 years agobtrfs: trim: fix underflow in trim length to prevent access beyond device boundary
Qu Wenruo [Fri, 31 Jul 2020 11:29:11 +0000 (19:29 +0800)]
btrfs: trim: fix underflow in trim length to prevent access beyond device boundary

[BUG]
The following script can lead to tons of beyond device boundary access:

  mkfs.btrfs -f $dev -b 10G
  mount $dev $mnt
  trimfs $mnt
  btrfs filesystem resize 1:-1G $mnt
  trimfs $mnt

[CAUSE]
Since commit 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to
find_first_clear_extent_bit"), we try to avoid trimming ranges that are
already trimmed.

So we check device->alloc_state by finding the first range which has
neither CHUNK_TRIMMED nor CHUNK_ALLOCATED set.

But when we shrink the device, those bits are not cleared, so we can
easily get a range that starts beyond the shrunk device size.

As a result, the returned @start and @end are both beyond the device size.
We then call "end = min(end, device->total_bytes - 1);", making @end
smaller than the device size.

Finally we compute "len = end - start + 1", which underflows and leads
to the beyond-device-boundary access.
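
The underflow is easy to reproduce in a user-space model.  The sketch
below emulates unsigned 64-bit arithmetic with a mask; the function names
and the "return None to skip the range" convention are hypothetical and
only illustrate the kind of safety check involved.

```python
U64_MASK = (1 << 64) - 1  # model unsigned 64-bit wraparound


def trim_len(start, end, total_bytes):
    """Unchecked length computation as described above (u64 arithmetic)."""
    end = min(end, total_bytes - 1)
    return (end - start + 1) & U64_MASK


def trim_len_checked(start, end, total_bytes):
    """Same computation, but refuse ranges that start beyond the device."""
    if start >= total_bytes:
        return None  # hypothetical 'skip this range' result
    end = min(end, total_bytes - 1)
    return (end - start + 1) & U64_MASK
```

For a device shrunk to 9 GiB and a stale range starting at 9.5 GiB, the
unchecked length wraps around to 2^64 minus 0.5 GiB, while the checked
variant skips the range.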

[FIX]
This patch will fix the problem in two ways:

- Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device
  This is the root fix

- Add extra safety check when trimming free device extents
  We check and warn if the returned range is already beyond current
  device.

Link: https://github.com/kdave/btrfs-progs/issues/282
Fixes: 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit")
CC: [email protected] # 5.4+
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 years agoALSA: hda/realtek - Fix unused variable warning
Takashi Iwai [Wed, 12 Aug 2020 07:02:56 +0000 (09:02 +0200)]
ALSA: hda/realtek - Fix unused variable warning

The previous fix forgot to remove the unused variable that triggers a
compile warning now:
  sound/pci/hda/patch_realtek.c: In function 'alc285_fixup_hp_gpio_led':
  sound/pci/hda/patch_realtek.c:4163:19: warning: unused variable 'spec' [-Wunused-variable]

Fix it.

Fixes: 404690649e6a ("ALSA: hda - reverse the setting value in the micmute_led_set")
Reported-by: Stephen Rothwell <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>
4 years agodrm/ttm: revert "drm/ttm: make TT creation purely optional v3"
Christian König [Wed, 12 Aug 2020 03:03:49 +0000 (13:03 +1000)]
drm/ttm: revert "drm/ttm: make TT creation purely optional v3"

This reverts commit 2ddef17678bc2ea1d20517dd2b4ed4aa967ffa8b.

As it turned out VMWGFX needs a much wider audit to fix this.

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Dave Airlie <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
4 years agoMerge branch 'vmwgfx-next-5.9' of git://people.freedesktop.org/~sroland/linux into...
Dave Airlie [Wed, 12 Aug 2020 02:58:19 +0000 (12:58 +1000)]
Merge branch 'vmwgfx-next-5.9' of git://people.freedesktop.org/~sroland/linux into drm-next

The drm_mode_config_reset patches are very important, fixing a recently
introduced kernel crash; the others fix various older issues which are
a bit less serious in practice.

Signed-off-by: Dave Airlie <[email protected]>
From: "Roland Scheidegger (VMware)" <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
4 years agoparisc: mask out enable and reserved bits from sba imask
Sven Schnelle [Tue, 11 Aug 2020 16:19:19 +0000 (18:19 +0200)]
parisc: mask out enable and reserved bits from sba imask

When using kexec the SBA IOMMU IBASE might still have the RE
bit set. This triggers a WARN_ON when trying to write back the
IBASE register later, and it also makes some mask calculations fail.

Cc: <[email protected]>
Signed-off-by: Sven Schnelle <[email protected]>
Signed-off-by: Helge Deller <[email protected]>
4 years ago Merge tag 'tag-chrome-platform-for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Wed, 12 Aug 2020 00:28:32 +0000 (17:28 -0700)]
Merge tag 'tag-chrome-platform-for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux

Pull chrome platform updates from Benson Leung:
 "cros_ec_typec:

   - Add support for switch control and alternate modes to the Chrome EC
     Type C port driver

   - Add basic suspend/resume support

  sensorhub:

   - Fix timestamp overflow issue

   - Fix legacy timestamp spreading on Nami systems

  cros_ec_proto:

   - Stop exporting cros_ec_cmd_xfer after removing all of its users

   - Check for missing EC_CMD_HOST_EVENT_GET_WAKE_MASK and ignore
     wakeups on old ECs

  misc:

   - Documentation warning cleanup

   - Fix double unlock issue in ishtp"

* tag 'tag-chrome-platform-for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux: (21 commits)
  platform/chrome: cros_ec_proto: check for missing EC_CMD_HOST_EVENT_GET_WAKE_MASK
  platform/chrome: cros_ec_proto: ignore unnecessary wakeups on old ECs
  platform/chrome: cros_ec_sensorhub: Simplify legacy timestamp spreading
  platform/chrome: cros_ec_proto: Do not export cros_ec_cmd_xfer()
  platform/chrome: cros_ec_typec: Unregister partner on error
  platform/chrome: cros_ec_sensorhub: Fix EC timestamp overflow
  platform/chrome: cros_ec_typec: Add PM support
  platform/chrome: cros_ec_typec: Use workqueue for port update
  platform/chrome: cros_ec_typec: Add a dependency on USB_ROLE_SWITCH
  platform/chrome: cros_ec_ishtp: Fix a double-unlock issue
  platform/chrome: cros_ec_rpmsg: Document missing struct parameters
  platform/chrome: cros_ec_spi: Document missing function parameters
  platform/chrome: cros_ec_typec: Add TBT compat support
  platform/chrome: cros_ec: Add TBT pd_ctrl fields
  platform/chrome: cros_ec_typec: Make configure_mux static
  platform/chrome: cros_ec_typec: Support DP alt mode
  platform/chrome: cros_ec_typec: Add USB mux control
  platform/chrome: cros_ec_typec: Register PD CTRL cmd v2
  platform/chrome: cros_ec: Update mux state bits
  platform/chrome: cros_ec_typec: Register Type C switches
  ...

4 years ago Merge tag 'for-linus-5.9-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubca...
Linus Torvalds [Wed, 12 Aug 2020 00:08:11 +0000 (17:08 -0700)]
Merge tag 'for-linus-5.9-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs updates from Mike Marshall:
 "A fix and a cleanup...

  The fix: Al Viro pointed out that I had broken some acl functionality
  with one of my previous patches.

  And the cleanup: Jing Xiangfeng found and removed a needless variable
  assignment"

* tag 'for-linus-5.9-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: remove unnecessary assignment to variable ret
  orangefs: posix acl fix...

4 years ago Merge tag 'zonefs-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal...
Linus Torvalds [Wed, 12 Aug 2020 00:05:55 +0000 (17:05 -0700)]
Merge tag 'zonefs-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull zonefs update from Damien Le Moal:
 "A single change for this cycle adding support for zone capacities
  smaller than the zone size, from Johannes"

* tag 'zonefs-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: update documentation to reflect zone size vs capacity
  zonefs: add zone-capacity support

4 years ago exfat: retain 'VolumeFlags' properly
Tetsuhiro Kohada [Fri, 31 Jul 2020 05:58:26 +0000 (14:58 +0900)]
exfat: retain 'VolumeFlags' properly

MediaFailure and VolumeDirty should be retained if these are set before
mounting.

Section '3.1.13.3 Media Failure Field' of the exFAT specification states:

 If, upon mounting a volume, the value of this field is 1,
 implementations which scan the entire volume for media failures and
 record all failures as "bad" clusters in the FAT (or otherwise resolve
 media failures) may clear the value of this field to 0.

Therefore, we should not clear MediaFailure without scanning the volume.

Section '8.1 Recommended Write Ordering' of the exFAT specification states:

 Clear the value of the VolumeDirty field to 0, if its value prior to
 the first step was 0.

Therefore, we should not clear VolumeDirty after mounting.
Also rename ERR_MEDIUM to MEDIA_FAILURE.
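
The retention rule can be sketched as a small userspace model. The bit values
follow the VolumeFlags layout in the exFAT specification; the helper name
volume_flags_to_write() is hypothetical and the logic is a simplification for
illustration, not the code from this patch.

```c
#include <stdint.h>

/* VolumeFlags bits as laid out in the exFAT specification. */
#define VOLUME_DIRTY   0x0002
#define MEDIA_FAILURE  0x0004

/*
 * Hypothetical helper: compute the VolumeFlags value to write back.
 * flags_at_mount: value read from the boot sector when mounting.
 * set_dirty:      nonzero while the volume is being modified.
 */
static uint16_t volume_flags_to_write(uint16_t flags_at_mount, int set_dirty)
{
	/* Bits that were already set before mounting are kept as they are. */
	uint16_t flags = flags_at_mount & (VOLUME_DIRTY | MEDIA_FAILURE);

	if (set_dirty)
		flags |= VOLUME_DIRTY;
	return flags;
}
```

In this model, clearing the dirty state on a volume that was mounted dirty
still yields VOLUME_DIRTY, and MEDIA_FAILURE is never dropped.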

Signed-off-by: Tetsuhiro Kohada <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>
4 years ago exfat: optimize exfat_zeroed_cluster()
Tetsuhiro Kohada [Wed, 24 Jun 2020 02:30:40 +0000 (11:30 +0900)]
exfat: optimize exfat_zeroed_cluster()

Replace part of exfat_zeroed_cluster() with exfat_update_bhs().
And remove exfat_sync_bhs().

Signed-off-by: Tetsuhiro Kohada <[email protected]>
Reviewed-by: Sungjong Seo <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>
4 years ago exfat: add error check when updating dir-entries
Tetsuhiro Kohada [Wed, 24 Jun 2020 00:54:54 +0000 (09:54 +0900)]
exfat: add error check when updating dir-entries

Add error check when synchronously updating dir-entries.

Suggested-by: Sungjong Seo <[email protected]>
Signed-off-by: Tetsuhiro Kohada <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>
4 years ago exfat: write multiple sectors at once
Tetsuhiro Kohada [Tue, 23 Jun 2020 06:22:19 +0000 (15:22 +0900)]
exfat: write multiple sectors at once

Write multiple sectors at once when updating dir-entries.
Add exfat_update_bhs() for that. It waits for write completion once
instead of sector by sector.
It's only effective if sync is enabled.

Signed-off-by: Tetsuhiro Kohada <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>
4 years ago exfat: remove EXFAT_SB_DIRTY flag
Tetsuhiro Kohada [Tue, 16 Jun 2020 02:18:07 +0000 (11:18 +0900)]
exfat: remove EXFAT_SB_DIRTY flag

This flag is set/reset in exfat_put_super()/exfat_sync_fs()
to avoid sync_blockdev().
- exfat_put_super():
Before calling this, the VFS has already called sync_filesystem(),
so sync is never performed here.
- exfat_sync_fs():
After calling this, the VFS calls sync_blockdev(), so, it is meaningless
to check EXFAT_SB_DIRTY or to bypass sync_blockdev() here.

Remove the EXFAT_SB_DIRTY check to ensure synchronization.
And remove the code related to the flag.

Signed-off-by: Tetsuhiro Kohada <[email protected]>
Reviewed-by: Sungjong Seo <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>
4 years ago Merge branch 'net-initialize-fastreuse-on-inet_inherit_port'
David S. Miller [Tue, 11 Aug 2020 22:49:21 +0000 (15:49 -0700)]
Merge branch 'net-initialize-fastreuse-on-inet_inherit_port'

Tim Froidcoeur says:

====================
net: initialize fastreuse on inet_inherit_port

In the case of TPROXY, bind_conflict optimizations for SO_REUSEADDR or
SO_REUSEPORT are broken, possibly resulting in O(n) instead of O(1) bind
behaviour or in the incorrect reuse of a bind.

The kernel keeps track, for each bind_bucket, of whether all sockets in the
bind_bucket support SO_REUSEADDR or SO_REUSEPORT in two fastreuse flags.
These flags allow skipping the costly bind_conflict check when possible
(meaning when all sockets have the proper SO_REUSE option).

For every socket added to a bind_bucket, these flags need to be updated.
As soon as a socket that does not support reuse is added, the flag is
set to false and will never go back to true, unless the bind_bucket is
deleted.

Note that there is no mechanism to re-evaluate these flags when a socket
is removed (this might make sense when removing a socket that would not
allow reuse; this leaves room for a future patch).

For this optimization to work, it is mandatory that these flags are
properly initialized and updated.

When a child socket is created from a listen socket in
__inet_inherit_port, the TPROXY case could create a new bind bucket
without properly initializing these flags, thus preventing the
optimization from working. Alternatively, a socket not allowing reuse could
be added to an existing bind bucket without updating the flags, causing
bind_conflict to never be called as it should.

Patch 1/2 refactors the fastreuse update code in inet_csk_get_port into a
small helper function, making the actual fix tiny and easier to understand.

Patch 2/2 calls this new helper when __inet_inherit_port decides to create
a new bind_bucket or use a different bind_bucket than the one of the listen
socket.

v4: - rebase on latest linux/net master branch
v3: - remove company disclaimer from automatic signature
v2: - remove unnecessary cast
====================

Signed-off-by: David S. Miller <[email protected]>
4 years ago net: initialize fastreuse on inet_inherit_port
Tim Froidcoeur [Tue, 11 Aug 2020 18:33:24 +0000 (20:33 +0200)]
net: initialize fastreuse on inet_inherit_port

In the case of TPROXY, bind_conflict optimizations for SO_REUSEADDR or
SO_REUSEPORT are broken, possibly resulting in O(n) instead of O(1) bind
behaviour or in the incorrect reuse of a bind.

The kernel keeps track, for each bind_bucket, of whether all sockets in the
bind_bucket support SO_REUSEADDR or SO_REUSEPORT in two fastreuse flags.
These flags allow skipping the costly bind_conflict check when possible
(meaning when all sockets have the proper SO_REUSE option).

For every socket added to a bind_bucket, these flags need to be updated.
As soon as a socket that does not support reuse is added, the flag is
set to false and will never go back to true, unless the bind_bucket is
deleted.

Note that there is no mechanism to re-evaluate these flags when a socket
is removed (this might make sense when removing a socket that would not
allow reuse; this leaves room for a future patch).

For this optimization to work, it is mandatory that these flags are
properly initialized and updated.

When a child socket is created from a listen socket in
__inet_inherit_port, the TPROXY case could create a new bind bucket
without properly initializing these flags, thus preventing the
optimization from working. Alternatively, a socket not allowing reuse could
be added to an existing bind bucket without updating the flags, causing
bind_conflict to never be called as it should.

Call inet_csk_update_fastreuse when __inet_inherit_port decides to create
a new bind_bucket or use a different bind_bucket than the one of the
listen socket.
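
The flag bookkeeping described above can be sketched as a simplified
userspace model. The struct and function names here are hypothetical, and the
real helper tracks more state than this; the sketch only shows the "set on
first socket, then only ever cleared" behaviour.

```c
#include <stdbool.h>

/* Simplified stand-in for a bind bucket; not the kernel structure. */
struct model_bind_bucket {
	bool has_owners;     /* at least one socket is already in the bucket */
	bool fastreuse;      /* every socket so far allows SO_REUSEADDR */
	bool fastreuseport;  /* every socket so far allows SO_REUSEPORT */
};

/* Update the flags for one socket being added to the bucket. */
static void model_update_fastreuse(struct model_bind_bucket *tb,
				   bool reuse, bool reuseport)
{
	if (!tb->has_owners) {
		/* First socket: initialize the flags from its options. */
		tb->fastreuse = reuse;
		tb->fastreuseport = reuseport;
		tb->has_owners = true;
		return;
	}
	/* Later sockets can only clear a flag, never set it again. */
	if (!reuse)
		tb->fastreuse = false;
	if (!reuseport)
		tb->fastreuseport = false;
}
```

Skipping this update when a socket is added is exactly the kind of omission
the commit message describes: the flags then no longer reflect the bucket.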

Fixes: 093d282321da ("tproxy: fix hash locking issue when using port redirection in __inet_inherit_port()")
Acked-by: Matthieu Baerts <[email protected]>
Signed-off-by: Tim Froidcoeur <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
4 years ago net: refactor bind_bucket fastreuse into helper
Tim Froidcoeur [Tue, 11 Aug 2020 18:33:23 +0000 (20:33 +0200)]
net: refactor bind_bucket fastreuse into helper

Refactor the fastreuse update code in inet_csk_get_port into a small
helper function that can be called from other places.

Acked-by: Matthieu Baerts <[email protected]>
Signed-off-by: Tim Froidcoeur <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
4 years ago net: phy: marvell10g: fix null pointer dereference
Marek Behún [Mon, 10 Aug 2020 15:01:58 +0000 (17:01 +0200)]
net: phy: marvell10g: fix null pointer dereference

Commit c3e302edca24 ("net: phy: marvell10g: fix temperature sensor on 2110")
added a check for PHY ID via phydev->drv->phy_id in a function which is
called by devres at a time when phydev->drv is already set to null by
the phy_remove function.

This null pointer dereference can be triggered via the SFP subsystem with an
SFP module containing this Marvell PHY. When the SFP interface is put
down, the SFP subsystem removes the PHY.

Fixes: c3e302edca24 ("net: phy: marvell10g: fix temperature sensor on 2110")
Signed-off-by: Marek Behún <[email protected]>
Cc: Maxime Chevallier <[email protected]>
Cc: Andrew Lunn <[email protected]>
Cc: Baruch Siach <[email protected]>
Cc: Russell King <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
4 years ago net: Fix potential memory leak in proto_register()
Miaohe Lin [Mon, 10 Aug 2020 12:16:58 +0000 (08:16 -0400)]
net: Fix potential memory leak in proto_register()

If we failed to assign proto idx, we free the twsk_slab_name but forget to
free the twsk_slab. Add a helper function tw_prot_cleanup() to free these
together and also use this helper function in proto_unregister().
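
A minimal userspace sketch of the "free both together" idea follows. The
struct and function names are hypothetical stand-ins, and free() replaces the
kernel's allocator calls; this illustrates the pattern, not the patch itself.

```c
#include <stdlib.h>

/* Stand-in for the timewait sock ops fields mentioned above. */
struct model_tw_prot {
	void *twsk_slab;        /* models the slab cache */
	char *twsk_slab_name;   /* models the allocated cache name */
};

/* Release both resources in one place so no caller can forget one. */
static void model_tw_prot_cleanup(struct model_tw_prot *twsk)
{
	if (!twsk)
		return;
	free(twsk->twsk_slab_name);
	twsk->twsk_slab_name = NULL;
	free(twsk->twsk_slab);
	twsk->twsk_slab = NULL;
}
```

Resetting the pointers to NULL makes the helper safe to call from both the
error path and the unregister path.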

Fixes: b45ce32135d1 ("sock: fix potential memory leak in proto_register()")
Signed-off-by: Miaohe Lin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
4 years ago Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Linus Torvalds [Tue, 11 Aug 2020 21:43:12 +0000 (14:43 -0700)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fix from Catalin Marinas:
 "Fix recordmcount build failure on non-arm64 (caused by an arm64
  patch)"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  recordmcount: Fix build failure on non arm64

4 years ago Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Linus Torvalds [Tue, 11 Aug 2020 21:34:17 +0000 (14:34 -0700)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:

 - IRQ bypass support for vdpa and IFC

 - MLX5 vdpa driver

 - Endianness fixes for virtio drivers

 - Misc other fixes

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (71 commits)
  vdpa/mlx5: fix up endian-ness for mtu
  vdpa: Fix pointer math bug in vdpasim_get_config()
  vdpa/mlx5: Fix pointer math in mlx5_vdpa_get_config()
  vdpa/mlx5: fix memory allocation failure checks
  vdpa/mlx5: Fix uninitialised variable in core/mr.c
  vdpa_sim: init iommu lock
  virtio_config: fix up warnings on parisc
  vdpa/mlx5: Add VDPA driver for supported mlx5 devices
  vdpa/mlx5: Add shared memory registration code
  vdpa/mlx5: Add support library for mlx5 VDPA implementation
  vdpa/mlx5: Add hardware descriptive header file
  vdpa: Modify get_vq_state() to return error code
  net/vdpa: Use struct for set/get vq state
  vdpa: remove hard coded virtq num
  vdpasim: support batch updating
  vhost-vdpa: support IOTLB batching hints
  vhost-vdpa: support get/set backend features
  vhost: generialize backend features setting/getting
  vhost-vdpa: refine ioctl pre-processing
  vDPA: dont change vq irq after DRIVER_OK
  ...

4 years ago Merge tag 'for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux...
Linus Torvalds [Tue, 11 Aug 2020 21:30:36 +0000 (14:30 -0700)]
Merge tag 'for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security

Pull security subsystem updates from James Morris:
 "A couple of minor documentation updates only for this release"

* tag 'for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  LSM: drop duplicated words in header file comments
  Replace HTTP links with HTTPS ones: security

4 years ago Merge tag 'iommu-updates-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/joro...
Linus Torvalds [Tue, 11 Aug 2020 21:13:24 +0000 (14:13 -0700)]
Merge tag 'iommu-updates-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull iommu updates from Joerg Roedel:

 - Removal of the dev->archdata.iommu (or similar) pointers from most
   architectures. Only Sparc is left, but this is private to Sparc as
   their drivers don't use the IOMMU-API.

 - ARM-SMMU updates from Will Deacon:

     - Support for SMMU-500 implementation in Marvell Armada-AP806 SoC

     - Support for SMMU-500 implementation in NVIDIA Tegra194 SoC

     - DT compatible string updates

     - Remove unused IOMMU_SYS_CACHE_ONLY flag

     - Move ARM-SMMU drivers into their own subdirectory

 - Intel VT-d updates from Lu Baolu:

     - Misc tweaks and fixes for vSVA

     - Report/response page request events

     - Cleanups

 - Move the Kconfig and Makefile bits for the AMD and Intel drivers into
   their respective subdirectory.

 - MT6779 IOMMU Support

 - Support for new chipsets in the Renesas IOMMU driver

 - Other misc cleanups and fixes (e.g. to improve compile test coverage)

* tag 'iommu-updates-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (77 commits)
  iommu/amd: Move Kconfig and Makefile bits down into amd directory
  iommu/vt-d: Move Kconfig and Makefile bits down into intel directory
  iommu/arm-smmu: Move Arm SMMU drivers into their own subdirectory
  iommu/vt-d: Skip TE disabling on quirky gfx dedicated iommu
  iommu: Add gfp parameter to io_pgtable_ops->map()
  iommu: Mark __iommu_map_sg() as static
  iommu/vt-d: Rename intel-pasid.h to pasid.h
  iommu/vt-d: Add page response ops support
  iommu/vt-d: Report page request faults for guest SVA
  iommu/vt-d: Add a helper to get svm and sdev for pasid
  iommu/vt-d: Refactor device_to_iommu() helper
  iommu/vt-d: Disable multiple GPASID-dev bind
  iommu/vt-d: Warn on out-of-range invalidation address
  iommu/vt-d: Fix devTLB flush for vSVA
  iommu/vt-d: Handle non-page aligned address
  iommu/vt-d: Fix PASID devTLB invalidation
  iommu/vt-d: Remove global page support in devTLB flush
  iommu/vt-d: Enforce PASID devTLB field mask
  iommu: Make some functions static
  iommu/amd: Remove double zero check
  ...

4 years ago Merge tag 'backlight-next-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/lee...
Linus Torvalds [Tue, 11 Aug 2020 20:48:02 +0000 (13:48 -0700)]
Merge tag 'backlight-next-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight

Pull backlight updates from Lee Jones:
 "Core Framework:
   - Trivial: Code refactoring
   - New API backlight_is_blank()
   - New API backlight_get_brightness()
   - Additional/reworked documentation
   - Remove 'extern' labels from prototypes
   - Drop backlight_put()
   - Staticify of_find_backlight()

  Driver Removal:
   - Removal of unused OT200 driver
   - Removal of unused Generic Backlight driver

  Fix-ups
   - Bunch of W=1 warning fixes
   - Convert to GPIO descriptors; sky81452
   - Move platform data handling into driver; sky81452
   - Remove superfluous code; lms501kf03
   - Many instances of using new APIs"

* tag 'backlight-next-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight: (34 commits)
  video: backlight: cr_bllcd: Remove unused variable 'intensity'
  backlight: backlight: Make of_find_backlight static
  backlight: backlight: Drop backlight_put()
  backlight: Use backlight_get_brightness() throughout
  backlight: jornada720_bl: Introduce backlight_is_blank()
  backlight: gpio_backlight: Simplify update_status()
  backlight: cr_bllcd: Introduce gpio-backlight semantics
  backlight: as3711_bl: Simplify update_status
  backlight: backlight: Introduce backlight_get_brightness()
  doc-rst: Wire-up Backlight kernel-doc documentation
  backlight: backlight: Add overview and update existing doc
  backlight: backlight: Drop extern from prototypes
  backlight: generic_bl: Remove this driver as it is unused
  backlight: backlight: Document enums in backlight.h
  backlight: backlight: Document inline functions in backlight.h
  backlight: backlight: Improve backlight_device documentation
  backlight: backlight: Improve backlight_properties documentation
  backlight: backlight: Improve backlight_ops documentation
  backlight: backlight: Add backlight_is_blank()
  backlight: backlight: Refactor fb_notifier_callback()
  ...

4 years ago block: fix double account of flush request's driver tag
Ming Lei [Mon, 10 Aug 2020 03:59:50 +0000 (11:59 +0800)]
block: fix double account of flush request's driver tag

With the none scheduler, the flush request shares the data request's
driver tag, so the flush request has to be marked as INFLIGHT to
avoid double accounting of this driver tag.

Fixes: 568f27006577 ("blk-mq: centralise related handling into blk_mq_get_driver_tag")
Reported-by: Matthew Wilcox <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Tested-by: Matthew Wilcox <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
4 years ago Merge tag 'hwlock-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson...
Linus Torvalds [Tue, 11 Aug 2020 18:53:34 +0000 (11:53 -0700)]
Merge tag 'hwlock-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc

Pull hwspinlock updates from Bjorn Andersson:
 "This introduces a new DT binding format to describe the Qualcomm
  hardware mutex block and deprecates the old, invalid, one.

  It also cleans up the Kconfig slightly"

* tag 'hwlock-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc:
  dt-bindings: hwlock: qcom: Remove invalid binding
  hwspinlock: qcom: Allow mmio usage in addition to syscon
  dt-bindings: hwlock: qcom: Allow device on mmio bus
  dt-bindings: hwlock: qcom: Migrate binding to YAML
  hwspinlock: Simplify Kconfig

4 years ago Merge tag 'rproc-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson...
Linus Torvalds [Tue, 11 Aug 2020 18:17:45 +0000 (11:17 -0700)]
Merge tag 'rproc-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc

Pull remoteproc updates from Bjorn Andersson:
 "This introduces a new "detached" state for remote processors that are
  deemed to be running at the time Linux boots and the infrastructure
  for "attaching" to these. It then introduces the support for
  performing this operation for the STM32 platform.

  The coredump functionality is moved out from the core file and gains
  support for an optional mode where the recovery phase awaits the
  notification from devcoredump that the dump should be released. This
  allows userspace to grab the coredump in scenarios where vmalloc space
  is too low for creating a complete copy of the coredump before handing
  this to devcoredump.

  A new character device based interface is introduced to allow tying
  the stoppage of a remote processor to the termination of a user space
  process. This is useful in situations when such process provides
  crucial resources/operations for the firmware running on the remote
  processor.

  The Texas Instruments K3 driver gains support for the C66x and C71x
  DSPs.

  Qualcomm remoteprocs gain support for stashing relocation information
  in IMEM, to aid post mortem debugging, and the crash notification
  mechanism is generalized to be reusable in cases where loosely coupled
  drivers need to know about the status of a remote processor. One such
  example is the IPA hardware block, which is jointly owned with the
  modem and migrated to this improved interface.

  It also introduces a number of bug fixes and debug improvements for
  the Qualcomm modem remoteproc driver.

  And it cleans up the inconsistent interface for remoteproc drivers to
  implement power management"

* tag 'rproc-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc: (56 commits)
  remoteproc: core: Register the character device interface
  remoteproc: Add remoteproc character device interface
  remoteproc: kill IPA notify code
  net: ipa: new notification infrastructure
  remoteproc: k3-dsp: Add support for C71x DSPs
  dt-bindings: remoteproc: k3-dsp: Update bindings for C71x DSPs
  remoteproc: k3-dsp: Add support for L2RAM loading on C66x DSPs
  remoteproc: k3-dsp: Add a remoteproc driver of K3 C66x DSPs
  dt-bindings: remoteproc: Add bindings for C66x DSPs on TI K3 SoCs
  remoteproc: k3: Add TI-SCI processor control helper functions
  remoteproc: Introduce rproc_of_parse_firmware() helper
  dt-bindings: arm: keystone: Add common TI SCI bindings
  remoteproc: qcom_q6v5_mss: Remove redundant running state
  remoteproc: qcom: q6v5: Update running state before requesting stop
  remoteproc: qcom_q6v5_mss: Add modem debug policy support
  remoteproc: qcom_q6v5_mss: Validate modem blob firmware size before load
  remoteproc: qcom_q6v5_mss: Validate MBA firmware size before load
  rpmsg: update documentation
  remoteproc: qcom_q6v5_mss: Add MBA log extraction support
  remoteproc: Add coredump debugfs entry
  ...

4 years ago Merge tag 'rpmsg-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson...
Linus Torvalds [Tue, 11 Aug 2020 18:13:17 +0000 (11:13 -0700)]
Merge tag 'rpmsg-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc

Pull rpmsg update from Bjorn Andersson:
 "This ensures that rpmsg uses little-endian, per the VirtIO 1.0
  specification"

* tag 'rpmsg-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc:
  rpmsg: virtio: add endianness conversions
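
To illustrate what storing a field as little-endian means regardless of host
byte order, here is a small userspace sketch. The helper names put_le16() and
get_le16() are hypothetical; the kernel has its own conversion helpers, which
this sketch does not use.

```c
#include <stdint.h>

/* Store a 16-bit value as little-endian bytes, independent of the host. */
static void put_le16(uint8_t out[2], uint16_t value)
{
	out[0] = (uint8_t)(value & 0xff);  /* least significant byte first */
	out[1] = (uint8_t)(value >> 8);
}

/* Read a 16-bit little-endian value back into host byte order. */
static uint16_t get_le16(const uint8_t in[2])
{
	return (uint16_t)(in[0] | ((uint16_t)in[1] << 8));
}
```

Because the bytes are written explicitly, a big-endian and a little-endian
host produce the same wire representation.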

4 years ago Merge tag 'libnvdimm-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm...
Linus Torvalds [Tue, 11 Aug 2020 17:59:19 +0000 (10:59 -0700)]
Merge tag 'libnvdimm-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Vishal Verma:
 "You'd normally receive this pull request from Dan Williams, but he's
  busy watching a newborn (Congrats Dan!), so I'm watching libnvdimm
  this cycle.

  This adds a new feature in libnvdimm - 'Runtime Firmware Activation',
  and a few small cleanups and fixes in libnvdimm and DAX. I'd
  originally intended to make separate topic-based pull requests - one
  for libnvdimm, and one for DAX, but some of the DAX material fell out
  since it wasn't quite ready.

  Summary:

   - add 'Runtime Firmware Activation' support for NVDIMMs that
     advertise the relevant capability

   - misc libnvdimm and DAX cleanups"

* tag 'libnvdimm-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm/security: ensure sysfs poll thread woke up and fetch updated attr
  libnvdimm/security: the 'security' attr never show 'overwrite' state
  libnvdimm/security: fix a typo
  ACPI: NFIT: Fix ARS zero-sized allocation
  dax: Fix incorrect argument passed to xas_set_err()
  ACPI: NFIT: Add runtime firmware activate support
  PM, libnvdimm: Add runtime firmware activation support
  libnvdimm: Convert to DEVICE_ATTR_ADMIN_RO()
  drivers/dax: Expand lock scope to cover the use of addresses
  fs/dax: Remove unused size parameter
  dax: print error message by pr_info() in __generic_fsdax_supported()
  driver-core: Introduce DEVICE_ATTR_ADMIN_{RO,RW}
  tools/testing/nvdimm: Emulate firmware activation commands
  tools/testing/nvdimm: Prepare nfit_ctl_test() for ND_CMD_CALL emulation
  tools/testing/nvdimm: Add command debug messages
  tools/testing/nvdimm: Cleanup dimm index passing
  ACPI: NFIT: Define runtime firmware activation commands
  ACPI: NFIT: Move bus_dsm_mask out of generic nvdimm_bus_descriptor
  libnvdimm: Validate command family indices

4 years ago net: qcom/emac: add missed clk_disable_unprepare in error path of emac_clks_phase1_init
Wang Hai [Mon, 10 Aug 2020 02:57:05 +0000 (10:57 +0800)]
net: qcom/emac: add missed clk_disable_unprepare in error path of emac_clks_phase1_init

Fix the missing clk_disable_unprepare() before return
from emac_clks_phase1_init() in the error handling case.
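
The general shape of such an error path can be sketched in userspace as
follows. All names here (model_clk, model_clk_enable(), model_clks_init())
are hypothetical stand-ins rather than the clk API or the driver's code; the
point is only that every clock enabled before the failure is disabled again.

```c
#include <errno.h>

/* Userspace stand-in for a clock; not the kernel's struct clk. */
struct model_clk {
	int enabled;      /* 1 while the clock is enabled */
	int fail_enable;  /* set to make model_clk_enable() fail */
};

static int model_clk_enable(struct model_clk *clk)
{
	if (clk->fail_enable)
		return -EIO;
	clk->enabled = 1;
	return 0;
}

static void model_clk_disable(struct model_clk *clk)
{
	clk->enabled = 0;
}

/* Enable three clocks in order; on failure, undo the ones already enabled. */
static int model_clks_init(struct model_clk *a, struct model_clk *b,
			   struct model_clk *c)
{
	int ret;

	ret = model_clk_enable(a);
	if (ret)
		return ret;
	ret = model_clk_enable(b);
	if (ret)
		goto disable_a;
	ret = model_clk_enable(c);
	if (ret)
		goto disable_b;
	return 0;

disable_b:
	model_clk_disable(b);
disable_a:
	model_clk_disable(a);
	return ret;
}
```

Unwinding in reverse order through labels keeps each failure point
responsible only for what was set up before it.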

Fixes: b9b17debc69d ("net: emac: emac gigabit ethernet controller driver")
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Wang Hai <[email protected]>
Acked-by: Timur Tabi <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
4 years ago ionic_lif: Use devm_kcalloc() in ionic_qcq_alloc()
Xu Wang [Mon, 10 Aug 2020 02:38:07 +0000 (02:38 +0000)]
ionic_lif: Use devm_kcalloc() in ionic_qcq_alloc()

A multiplication for the size determination of a memory allocation
indicated that an array data structure should be processed.
Thus use the corresponding function "devm_kcalloc".
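
The reason an array allocator is preferred over an open-coded multiplication
can be shown with a userspace analogue. The helper alloc_array() is
hypothetical and uses calloc() from the C library; it is a sketch of the
overflow-checking idea, not of devm_kcalloc() itself.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Allocate a zeroed array of n elements of the given size.
 * Returns NULL if n * size would overflow, instead of silently
 * allocating a buffer that is too small.
 */
static void *alloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;  /* multiplication would wrap around */
	return calloc(n, size);
}
```

With a plain malloc(n * size), the wrapped product would be accepted and the
caller would later write past the end of the buffer.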

Signed-off-by: Xu Wang <[email protected]>
Acked-by: Shannon Nelson <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
4 years ago net/nfc/rawsock.c: add CAP_NET_RAW check.
Qingyu Li [Mon, 10 Aug 2020 01:51:00 +0000 (09:51 +0800)]
net/nfc/rawsock.c: add CAP_NET_RAW check.

When creating a raw AF_NFC socket, CAP_NET_RAW needs to be checked first.
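
The ordering of the check can be sketched as a userspace model. The function
model_rawsock_create() and its has_cap_net_raw parameter are hypothetical;
the capability lookup itself is replaced by a boolean, and -EPERM is assumed
as the error code.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>

/*
 * Model of a socket create hook: raw sockets are refused up front
 * unless the caller holds the (modelled) CAP_NET_RAW capability.
 */
static int model_rawsock_create(int type, bool has_cap_net_raw)
{
	if (type == SOCK_RAW && !has_cap_net_raw)
		return -EPERM;
	/* ... the rest of socket setup would follow here ... */
	return 0;
}
```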

Signed-off-by: Qingyu Li <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
4 years ago hinic: fix strncpy output truncated compile warnings
Luo bin [Sun, 9 Aug 2020 03:53:49 +0000 (11:53 +0800)]
hinic: fix strncpy output truncated compile warnings

Fix the compile warnings about 'strncpy' output truncated before
terminating nul copying N bytes from a string of the same length.
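
The warning concerns copies that can leave the destination without a
terminating nul. One common way to avoid that, shown here as a userspace
sketch with a hypothetical helper copy_name(), is to copy at most size - 1
bytes and terminate explicitly. This is a general technique and not
necessarily the change made in this patch.

```c
#include <string.h>

/* Copy src into dst, truncating if needed, and always nul-terminate. */
static void copy_name(char *dst, size_t dst_size, const char *src)
{
	if (dst_size == 0)
		return;
	strncpy(dst, src, dst_size - 1);
	dst[dst_size - 1] = '\0';
}
```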

Signed-off-by: Luo bin <[email protected]>
Reported-by: kernel test robot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
4 years ago drivers/net/wan/x25_asy: Added needed_headroom and a skb->len check
Xie He [Sun, 9 Aug 2020 02:35:48 +0000 (19:35 -0700)]
drivers/net/wan/x25_asy: Added needed_headroom and a skb->len check

1. Added a skb->len check

This driver expects upper layers to include a pseudo header of 1 byte
when passing down a skb for transmission. This driver will read this
1-byte header. This patch added a skb->len check before reading the
header to make sure the header exists.

2. Added needed_headroom

When this driver transmits data,
  first this driver will remove a pseudo header of 1 byte,
  then the lapb module will prepend the LAPB header of 2 or 3 bytes.
So the value of needed_headroom in this driver should be 3 - 1.
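
The two points above can be written out as a small userspace sketch. The
macro and function names are hypothetical; the sizes (a 1-byte pseudo header
and a LAPB header of up to 3 bytes) are taken from the description above.

```c
#include <stddef.h>

#define PSEUDO_HEADER_LEN  1  /* removed by the driver before transmit */
#define LAPB_HEADER_MAX    3  /* prepended by the lapb module (2 or 3 bytes) */

/* Extra headroom needed: bytes added minus bytes removed first. */
static int model_needed_headroom(void)
{
	return LAPB_HEADER_MAX - PSEUDO_HEADER_LEN;
}

/* The pseudo header may only be read if the buffer is long enough. */
static int model_has_pseudo_header(size_t skb_len)
{
	return skb_len >= PSEUDO_HEADER_LEN;
}
```

With these numbers the headroom works out to 2 bytes, and a zero-length
buffer is rejected before its first byte is read.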

Cc: Willem de Bruijn <[email protected]>
Cc: Martin Schiller <[email protected]>
Signed-off-by: Xie He <[email protected]>
Acked-by: Willem de Bruijn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
4 years ago net/tls: Fix kmap usage
Ira Weiny [Tue, 11 Aug 2020 00:02:58 +0000 (17:02 -0700)]
net/tls: Fix kmap usage

When MSG_OOB is specified to tls_device_sendpage() the mapped page is
never unmapped.

Hold off mapping the page until after the flags are checked and the page
is actually needed.
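
The reordering can be sketched with a userspace model in which mapping is
just a counter. All names below are hypothetical, and the -EOPNOTSUPP return
value is an assumption for illustration; the point is that the early return
now happens before anything has been mapped.

```c
#include <errno.h>
#include <sys/socket.h>

/* Stand-in for a page with a mapping count; not the kernel's struct page. */
struct model_page {
	int map_count;
};

static void *model_kmap(struct model_page *page)
{
	page->map_count++;
	return page;
}

static void model_kunmap(struct model_page *page)
{
	page->map_count--;
}

/* Check the flags first, and map the page only once it is really needed. */
static int model_sendpage(struct model_page *page, int flags)
{
	void *kaddr;

	if (flags & MSG_OOB)
		return -EOPNOTSUPP;  /* nothing mapped yet, nothing to undo */

	kaddr = model_kmap(page);
	(void)kaddr;                 /* the data would be sent from here */
	model_kunmap(page);
	return 0;
}
```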

Fixes: e8f69799810c ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Ira Weiny <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
4 years ago doc/zh_CN: resolve undefined label warning in admin-guide index
Lukas Bulwahn [Sun, 2 Aug 2020 16:19:56 +0000 (18:19 +0200)]
doc/zh_CN: resolve undefined label warning in admin-guide index

Documentation generation warns:

  Documentation/translations/zh_CN/admin-guide/index.rst:3:
  WARNING: undefined label: documentation/admin-guide/index.rst

Use doc reference for .rst files to resolve the warning.

Fixes: 37a607cf2318 ("doc/zh_CN: add admin-guide index")
Signed-off-by: Lukas Bulwahn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
4 years ago doc/zh_CN: fix title heading markup in admin-guide cpu-load
Lukas Bulwahn [Sun, 2 Aug 2020 16:21:01 +0000 (18:21 +0200)]
doc/zh_CN: fix title heading markup in admin-guide cpu-load

Documentation generation warns:

  Documentation/translations/zh_CN/admin-guide/cpu-load.rst:1:
  WARNING: Title overline too short.

Extend title heading markup by one. It was just off by one.

Fixes: e210c66d567c ("doc/zh_CN: add cpu-load Chinese version")
Signed-off-by: Lukas Bulwahn <[email protected]>
Acked-by: Tao Zhou <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
4 years ago docs: remove the 2.6 "Upgrading I2C Drivers" guide
Stephen Kitt [Thu, 6 Aug 2020 16:14:56 +0000 (18:14 +0200)]
docs: remove the 2.6 "Upgrading I2C Drivers" guide

All the drivers have long since been upgraded, and all the important
information here is also included in the "Implementing I2C device
drivers" guide.

Signed-off-by: Stephen Kitt <[email protected]>
Reviewed-by: Wolfram Sang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
4 years ago docs: Correct the release date of 5.2 stable
Billy Wilson [Thu, 6 Aug 2020 23:17:54 +0000 (17:17 -0600)]
docs: Correct the release date of 5.2 stable

A table lists the 5.2 stable release date as September 15, but it was
released on July 7. This may confuse a reader who is trying to
understand the stable update release cycle.

Signed-off-by: Billy Wilson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
4 years agomailmap: Update comments with format and more details
Kees Cook [Sat, 8 Aug 2020 07:14:36 +0000 (00:14 -0700)]
mailmap: Update comments with format and more details

Without having first read the git-shortlog man-page, the format
of .mailmap may not be immediately obvious. Add comments with pointers
to the man-page, along with other details.

Signed-off-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/202008080013.58EBD83@keescook
Signed-off-by: Jonathan Corbet <[email protected]>
4 years agodocs: cdrom: Fix a typo and rst markup
Remi Andruccioli [Sat, 8 Aug 2020 16:31:23 +0000 (18:31 +0200)]
docs: cdrom: Fix a typo and rst markup

"The capability fags" should be "The capability flags".

In rst markup, an incorrect markup expression causes bad rendering in
Sphinx output. Replace the erroneous single quote with a backquote.

Signed-off-by: Remi Andruccioli <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
4 years agoDoc: admin-guide: use correct legends in kernel-parameters.txt
Randy Dunlap [Mon, 10 Aug 2020 02:49:41 +0000 (19:49 -0700)]
Doc: admin-guide: use correct legends in kernel-parameters.txt

Documentation/admin-guide/kernel-parameters.rst includes a legend
telling us what configurations or hardware platforms are relevant
for certain boot options.  For X86, it is spelled "X86" and for
x86_64, it is spelled "X86-64", so make corrections for those.

Signed-off-by: Randy Dunlap <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
4 years agoDocumentation/features: refresh RISC-V arch support files
Tobias Klauser [Mon, 10 Aug 2020 09:50:00 +0000 (11:50 +0200)]
Documentation/features: refresh RISC-V arch support files

Support for these was added by the following commits:

  f2c9699f6555 ("riscv: Add STACKPROTECTOR supported")
  3c4697982982 ("riscv: Enable LOCKDEP_SUPPORT & fixup TRACE_IRQFLAGS_SUPPORT")
  ed48b297fe21 ("riscv: Enable context tracking")
  cbb3d91d3bcf ("riscv: Add kmemleak support")

Signed-off-by: Tobias Klauser <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
4 years agodocumentation: coccinelle: Improve command example for make C={1,2}
Sumera Priyadarsini [Tue, 11 Aug 2020 00:23:50 +0000 (05:53 +0530)]
documentation: coccinelle: Improve command example for make C={1,2}

Modify coccinelle documentation to further clarify
the usage of the makefile C variable by coccicheck.

Signed-off-by: Sumera Priyadarsini <[email protected]>
Acked-by: Julia Lawall <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
4 years agoCore-api: Documentation: Replace deprecated :c:func: Usage
Puranjay Mohan [Mon, 10 Aug 2020 18:30:19 +0000 (00:00 +0530)]
Core-api: Documentation: Replace deprecated :c:func: Usage

Replace :c:func: with func() as the previous usage is deprecated.

Signed-off-by: Puranjay Mohan <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
4 years agoDev-tools: Documentation: Replace deprecated :c:func: Usage
Puranjay Mohan [Mon, 10 Aug 2020 18:36:13 +0000 (00:06 +0530)]
Dev-tools: Documentation: Replace deprecated :c:func: Usage

Replace :c:func: with func() as the previous usage is deprecated.

Signed-off-by: Puranjay Mohan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
4 years agoFilesystems: Documentation: Replace deprecated :c:func: Usage
Puranjay Mohan [Mon, 10 Aug 2020 18:48:28 +0000 (00:18 +0530)]
Filesystems: Documentation: Replace deprecated :c:func: Usage

Replace :c:func: with func() as the previous usage is deprecated.

Signed-off-by: Puranjay Mohan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
4 years agodocs: trace: fix a typo
Bryan Brattlof [Tue, 11 Aug 2020 16:17:12 +0000 (16:17 +0000)]
docs: trace: fix a typo

emumerated -> enumerated

Signed-off-by: Bryan Brattlof <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
4 years agos390/numa: move code to arch/s390/kernel
Alexander Gordeev [Tue, 4 Aug 2020 18:35:50 +0000 (20:35 +0200)]
s390/numa: move code to arch/s390/kernel

Move all code from arch/s390/numa/ to arch/s390/kernel/
since numa.c is the only source file and all others were
deleted with the fake NUMA support removal.

Signed-off-by: Alexander Gordeev <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>
4 years agos390/time: remove select CLOCKSOURCE_VALIDATE_LAST_CYCLE again
Heiko Carstens [Wed, 5 Aug 2020 12:50:41 +0000 (14:50 +0200)]
s390/time: remove select CLOCKSOURCE_VALIDATE_LAST_CYCLE again

Sven Schnelle reported that setting CLOCKSOURCE_VALIDATE_LAST_CYCLE
doesn't make sense: even if our TOD clock overflows, the delta calculation
(now - last) with unsigned 64-bit values will still be correct.

Therefore revert commit 555701a714f7 ("s390/time: select
CLOCKSOURCE_VALIDATE_LAST_CYCLE").
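
The wrap-around argument can be checked with a minimal C sketch. The
helper name tod_delta() is hypothetical and is not kernel code; it only
assumes that both clock readings are unsigned 64-bit values.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): elapsed ticks between two
 * unsigned 64-bit clock readings. Unsigned subtraction is performed
 * modulo 2^64, so the result stays correct even if the counter
 * wrapped between `last` and `now`.
 */
static uint64_t tod_delta(uint64_t now, uint64_t last)
{
	return now - last;
}
```

For example, with last = UINT64_MAX - 4 and now = 5 the counter has
wrapped, yet tod_delta() still yields the 10 ticks that actually elapsed.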

Fixes: 555701a714f7 ("s390/time: select CLOCKSOURCE_VALIDATE_LAST_CYCLE")
Reported-by: Sven Schnelle <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>
4 years agos390/debug: debug feature version 3
Mikhail Zaslonko [Tue, 5 May 2020 08:34:52 +0000 (10:34 +0200)]
s390/debug: debug feature version 3

Change __debug_entry structure in the following way:
 - remove redundant union
 - Field containing cpuid is expanded to 16 bits. 8-bit width was not
   enough since we already support up to 512 cpus.
 - Field containing the timestamp is expanded to 60 bits. The timestamp
   itself is now stored in the absolute Unix time format in microseconds
   taking the Epoch Index into account.
Adjust default header for debug entries by setting minimum width for cpuid
to 4 digits.
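
As a rough illustration of the field widths described above, here is a
sketch in C. The struct name, field names, field order and the 4-bit
`flags` filler are assumptions made for this example, not the kernel's
actual __debug_entry layout, and zero padding of the cpuid is likewise
assumed.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical record for illustration only; names and ordering are
 * assumptions, not the kernel's __debug_entry layout. 64-bit
 * bit-fields are a GCC/Clang extension to ISO C.
 */
struct dbg_entry_sketch {
	uint64_t clock_us : 60;	/* microseconds, absolute Unix time */
	uint64_t flags    : 4;	/* remaining bits of the 64-bit word */
	uint16_t cpu;		/* 16 bits: cpu ids above 255 now fit */
};

/* Format the cpu id with a minimum width of 4 digits (zero padded). */
static int format_cpu(char *buf, size_t len, const struct dbg_entry_sketch *e)
{
	return snprintf(buf, len, "%04u", (unsigned int)e->cpu);
}
```

An 8-bit field cannot hold a cpu id such as 511, while the 16-bit field
stores it unchanged and format_cpu() renders it as "0511".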

Reviewed-by: Heiko Carstens <[email protected]>
Signed-off-by: Mikhail Zaslonko <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>
4 years agos390/Kconfig: add missing ZCRYPT dependency to VFIO_AP
Krzysztof Kozlowski [Wed, 5 Aug 2020 15:50:53 +0000 (17:50 +0200)]
s390/Kconfig: add missing ZCRYPT dependency to VFIO_AP

VFIO_AP uses the ap_driver_register() and ap_driver_unregister() functions
implemented in ap_bus.c (compiled into ap.o).  However, ap.o is built
only if CONFIG_ZCRYPT is selected.

This was not visible before commit e93a1695d7fb ("iommu: Enable compile
testing for some of drivers") because the CONFIG_VFIO_AP depends on
CONFIG_S390_AP_IOMMU which depends on the missing CONFIG_ZCRYPT.  After
adding COMPILE_TEST, it is possible to select a configuration with
VFIO_AP and S390_AP_IOMMU but without the ZCRYPT.

Add proper dependency to the VFIO_AP to fix build errors:

ERROR: modpost: "ap_driver_register" [drivers/s390/crypto/vfio_ap.ko] undefined!
ERROR: modpost: "ap_driver_unregister" [drivers/s390/crypto/vfio_ap.ko] undefined!

Reported-by: kernel test robot <[email protected]>
Fixes: e93a1695d7fb ("iommu: Enable compile testing for some of drivers")
Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>
4 years agos390/numa: set node distance to LOCAL_DISTANCE
Alexander Gordeev [Tue, 4 Aug 2020 18:35:49 +0000 (20:35 +0200)]
s390/numa: set node distance to LOCAL_DISTANCE

The node distance is hardcoded to 0, which causes trouble
for some user-level applications. In particular, "libnuma"
expects the distance of a node to itself as LOCAL_DISTANCE.
This update removes the offending node distance override.
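
What user space expects can be sketched as follows, assuming the generic
fallback from include/linux/topology.h applies once the override is
removed. The function name node_distance_sketch() is hypothetical; the
two constants carry the values the kernel defines.

```c
/* Values as defined in the kernel's include/linux/topology.h. */
#define LOCAL_DISTANCE	10
#define REMOTE_DISTANCE	20

/*
 * Sketch of the generic fallback: a node is LOCAL_DISTANCE away from
 * itself and REMOTE_DISTANCE away from any other node.
 */
static int node_distance_sketch(int from, int to)
{
	return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
}
```

With this rule the distance of node 0 to itself is 10 rather than 0,
which matches what libnuma checks for.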

Cc: <[email protected]> # 4.4
Fixes: 3a368f742da1 ("s390/numa: add core infrastructure")
Signed-off-by: Alexander Gordeev <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>
4 years agos390/pkey: remove redundant variable initialization
Tianjia Zhang [Sun, 2 Aug 2020 11:15:26 +0000 (19:15 +0800)]
s390/pkey: remove redundant variable initialization

The initialization value of `rc` is wrong in the first place, and
initializing the `rc` variables is unnecessary anyway, so remove the
initializations.

Fixes: f2bbc96e7cfad ("s390/pkey: add CCA AES cipher key support")
Signed-off-by: Tianjia Zhang <[email protected]>
Signed-off-by: Harald Freudenberger <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>
4 years agos390/test_unwind: fix possible memleak in test_unwind()
Wang Hai [Thu, 30 Jul 2020 06:36:02 +0000 (14:36 +0800)]
s390/test_unwind: fix possible memleak in test_unwind()

test_unwind() fails to call kfree(bt) in an error path.
Add the missing function call to fix it.
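
The general shape of such a fix can be sketched in user-space C, with
malloc()/free() standing in for the kernel allocator. The function
run_test_sketch() and its parameters are hypothetical and do not
reproduce test_unwind(); the sketch only shows routing every exit path
after the allocation through one cleanup label.

```c
#include <stdlib.h>

/*
 * User-space analogy; names are hypothetical. Every exit path after
 * the allocation goes through the `out` label, so the buffer cannot
 * leak when an error is detected.
 */
static int run_test_sketch(size_t len, int simulate_failure)
{
	int ret = 0;
	char *bt;

	if (len == 0)
		return -1;

	bt = malloc(len);
	if (!bt)
		return -1;

	if (simulate_failure) {
		ret = -2;	/* error path: bt must still be released */
		goto out;
	}

	bt[0] = '\0';		/* stand-in for the real work */
out:
	free(bt);
	return ret;
}
```

Both run_test_sketch(64, 0) and run_test_sketch(64, 1) release the
buffer; only the return value differs.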

Fixes: 0610154650f1 ("s390/test_unwind: print verbose unwinding results")
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Wang Hai <[email protected]>
Acked-by: Ilya Leoshkevich <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>
4 years agos390/gmap: improve THP splitting
Gerald Schaefer [Wed, 29 Jul 2020 20:22:34 +0000 (22:22 +0200)]
s390/gmap: improve THP splitting

During s390_enable_sie(), we need to take care of splitting all qemu user
process THP mappings. This is currently done with follow_page(FOLL_SPLIT),
by simply iterating over all vma ranges, with PAGE_SIZE increment.

This logic is sub-optimal and can result in a lot of unnecessary overhead,
especially when using qemu and ASAN with large shadow map. Ilya reported
significant system slow-down with one CPU busy for a long time and overall
unresponsiveness.

Fix this by using walk_page_vma() and directly calling split_huge_pmd()
only for present pmds, which greatly reduces overhead.
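
A back-of-the-envelope sketch of why the per-page iteration is costly.
The constants assume 4 KiB pages and 1 MiB covered per pmd entry, and
the helper names are hypothetical; the code only counts steps and does
not touch any page tables.

```c
#include <stdint.h>

#define PAGE_SIZE_SKETCH	(4096ULL)		/* assumed 4 KiB pages */
#define PMD_SIZE_SKETCH		(1024ULL * 1024ULL)	/* assumed 1 MiB per pmd */

/* Lookups needed when stepping through [start, end) one page at a time. */
static uint64_t steps_per_page(uint64_t start, uint64_t end)
{
	return (end - start + PAGE_SIZE_SKETCH - 1) / PAGE_SIZE_SKETCH;
}

/* pmd entries visited when walking the same range one pmd at a time. */
static uint64_t steps_per_pmd(uint64_t start, uint64_t end)
{
	uint64_t first = start / PMD_SIZE_SKETCH;
	uint64_t last = (end + PMD_SIZE_SKETCH - 1) / PMD_SIZE_SKETCH;

	return last - first;
}
```

Under these assumptions a 1 GiB range needs 262144 per-page steps but
only 1024 pmd visits, and a page table walk can additionally skip pmds
that are not present, which matters for large, sparsely populated
mappings such as an ASAN shadow map.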

Cc: <[email protected]> # v5.4+
Reported-by: Ilya Leoshkevich <[email protected]>
Tested-by: Ilya Leoshkevich <[email protected]>
Acked-by: Christian Borntraeger <[email protected]>
Signed-off-by: Gerald Schaefer <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>
4 years agos390/atomic: circumvent gcc 10 build regression
Vasily Gorbik [Thu, 30 Jul 2020 14:02:28 +0000 (16:02 +0200)]
s390/atomic: circumvent gcc 10 build regression

Circumvent the following gcc 10 allyesconfig build regression:

  CC      drivers/leds/trigger/ledtrig-cpu.o
In file included from ./arch/s390/include/asm/bitops.h:39,
                 from ./include/linux/bitops.h:29,
                 from ./include/linux/kernel.h:12,
                 from drivers/leds/trigger/ledtrig-cpu.c:18:
./arch/s390/include/asm/atomic_ops.h: In function 'ledtrig_cpu':
./arch/s390/include/asm/atomic_ops.h:46:2: warning: 'asm' operand 1 probably does not match constraints
   46 |  asm volatile(       \
      |  ^~~
./arch/s390/include/asm/atomic_ops.h:53:2: note: in expansion of macro '__ATOMIC_CONST_OP'
   53 |  __ATOMIC_CONST_OP(op_name, op_type, op_string, "\n")  \
      |  ^~~~~~~~~~~~~~~~~
./arch/s390/include/asm/atomic_ops.h:56:1: note: in expansion of macro '__ATOMIC_CONST_OPS'
   56 | __ATOMIC_CONST_OPS(__atomic_add_const, int, "asi")
      | ^~~~~~~~~~~~~~~~~~
./arch/s390/include/asm/atomic_ops.h:46:2: error: impossible constraint in 'asm'
   46 |  asm volatile(       \
      |  ^~~
./arch/s390/include/asm/atomic_ops.h:53:2: note: in expansion of macro '__ATOMIC_CONST_OP'
   53 |  __ATOMIC_CONST_OP(op_name, op_type, op_string, "\n")  \
      |  ^~~~~~~~~~~~~~~~~
./arch/s390/include/asm/atomic_ops.h:56:1: note: in expansion of macro '__ATOMIC_CONST_OPS'
   56 | __ATOMIC_CONST_OPS(__atomic_add_const, int, "asi")
      | ^~~~~~~~~~~~~~~~~~
scripts/Makefile.build:280: recipe for target 'drivers/leds/trigger/ledtrig-cpu.o' failed

By swapping conditions as proposed here:
https://gcc.gnu.org/pipermail/gcc-patches/2020-July/549318.html

Suggested-by: Ilya Leoshkevich <[email protected]>
Signed-off-by: Vasily Gorbik <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>
4 years agoparisc: Whitespace cleanups in atomic.h
Helge Deller [Sun, 14 Jun 2020 08:50:42 +0000 (10:50 +0200)]
parisc: Whitespace cleanups in atomic.h

Fix whitespace indenting and drop trailing backslashes.

Cc: <[email protected]> # 4.19+
Signed-off-by: Helge Deller <[email protected]>
4 years agocpufreq: intel_pstate: Implement passive mode with HWP enabled
Rafael J. Wysocki [Thu, 6 Aug 2020 12:03:55 +0000 (14:03 +0200)]
cpufreq: intel_pstate: Implement passive mode with HWP enabled

Allow intel_pstate to work in the passive mode with HWP enabled and
make it set the HWP minimum performance limit (HWP floor) to the
P-state value given by the target frequency supplied by the cpufreq
governor, so as to prevent the HWP algorithm and the CPU scheduler
from working against each other, at least when the schedutil governor
is in use, and update the intel_pstate documentation accordingly.
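
The idea of using the governor's target as the HWP floor can be sketched
as follows. The field positions follow the documented IA32_HWP_REQUEST
layout (minimum performance in bits 7:0, maximum performance in bits
15:8); the helper hwp_set_floor() and its clamping policy are
assumptions for illustration and are not the driver's code.

```c
#include <stdint.h>

/* Field positions as documented for IA32_HWP_REQUEST. */
#define HWP_MIN_MASK	0xffULL
#define HWP_MAX_SHIFT	8
#define HWP_MAX_MASK	(0xffULL << HWP_MAX_SHIFT)

/*
 * Hypothetical helper, not the driver's code: raise the HWP floor to
 * target_pstate, clamped to the currently programmed maximum, and
 * leave all other fields of the request value untouched.
 */
static uint64_t hwp_set_floor(uint64_t req, uint8_t target_pstate)
{
	uint8_t max = (req & HWP_MAX_MASK) >> HWP_MAX_SHIFT;
	uint8_t min = target_pstate > max ? max : target_pstate;

	return (req & ~HWP_MIN_MASK) | min;
}
```

Given a request value of 0x80002808 (minimum 0x08, maximum 0x28), a
target P-state of 0x10 produces 0x80002810, while a target of 0x30 is
clamped and produces 0x80002828.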

Among other things, this allows utilization clamps to be taken
into account, at least to a certain extent, when intel_pstate is
in use and makes it more likely that sufficient capacity for
deadline tasks will be provided.

After this change, the resulting behavior of an HWP system with
intel_pstate in the passive mode should be close to the behavior
of the analogous non-HWP system with intel_pstate in the passive
mode, except that the HWP algorithm is generally allowed to make the
CPU run at a frequency above the floor P-state set by intel_pstate in
the entire available range of P-states, while without HWP a CPU can
run in a P-state above the requested one if the latter falls into the
range of turbo P-states (referred to as the turbo range) or if the
P-states of all CPUs in one package are coordinated with each other
at the hardware level.

[Note that in principle the HWP floor may not be taken into account
 by the processor if it falls into the turbo range, in which case the
 processor has a license to choose any P-state, either below or above
 the HWP floor, just like a non-HWP processor in the case when the
 target P-state falls into the turbo range.]

With this change applied, intel_pstate in the passive mode assumes
complete control over the HWP request MSR and concurrent changes of
that MSR (e.g. via the direct MSR access interface) are overridden by
it.

Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Srinivas Pandruvada <[email protected]>
Reviewed-by: Francisco Jerez <[email protected]>