Merge branch 'for-next/bti-user' into for-next/bti

author Will Deacon <[email protected]>

Tue, 5 May 2020 14:15:58 +0000 (15:15 +0100)

committer Will Deacon <[email protected]>

Tue, 5 May 2020 14:15:58 +0000 (15:15 +0100)
diff --combined Documentation/filesystems/proc.rst

index 38b606991065b3df4f075bb3120dc9e5db09bf71,0000000000000000000000000000000000000000..9969bf4c0c44f6370cb65c59486f071a0178edc4

mode 100644,000000..100644
--- 1/Documentation/filesystems/proc.rst
--- /dev/null
+++ b/Documentation/filesystems/proc.rst
@@@ -1,2169 -1,0 +1,2170 @@@
+ +.. SPDX-License-Identifier: GPL-2.0
+ +
+ +====================
+ +The /proc Filesystem
+ +====================
+ +
+ +=====================  =======================================  ================
+ +/proc/sys              Terrehon Bowden <[email protected]>,     October 7 1999
+ +                       Bodo Bauer <[email protected]>
+ +2.4.x update           Jorge Nerin <[email protected]>          November 14 2000
+ +move /proc/sys         Shen Feng <[email protected]>            April 1 2009
+ +fixes/update part 1.1  Stefani Seibold <[email protected]>      June 9 2009
+ +=====================  =======================================  ================
+ +
+ +
+ +
+ +.. Table of Contents
+ +
+ +  0     Preface
+ +  0.1   Introduction/Credits
+ +  0.2   Legal Stuff
+ +
+ +  1     Collecting System Information
+ +  1.1   Process-Specific Subdirectories
+ +  1.2   Kernel data
+ +  1.3   IDE devices in /proc/ide
+ +  1.4   Networking info in /proc/net
+ +  1.5   SCSI info
+ +  1.6   Parallel port info in /proc/parport
+ +  1.7   TTY info in /proc/tty
+ +  1.8   Miscellaneous kernel statistics in /proc/stat
+ +  1.9   Ext4 file system parameters
+ +
+ +  2     Modifying System Parameters
+ +
+ +  3     Per-Process Parameters
+ +  3.1   /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer
+ +        score
+ +  3.2   /proc/<pid>/oom_score - Display current oom-killer score
+ +  3.3   /proc/<pid>/io - Display the IO accounting fields
+ +  3.4   /proc/<pid>/coredump_filter - Core dump filtering settings
+ +  3.5   /proc/<pid>/mountinfo - Information about mounts
+ +  3.6   /proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
+ +  3.7   /proc/<pid>/task/<tid>/children - Information about task children
+ +  3.8   /proc/<pid>/fdinfo/<fd> - Information about opened file
+ +  3.9   /proc/<pid>/map_files - Information about memory mapped files
+ +  3.10  /proc/<pid>/timerslack_ns - Task timerslack value
+ +  3.11  /proc/<pid>/patch_state - Livepatch patch operation state
+ +  3.12  /proc/<pid>/arch_status - Task architecture specific information
+ +
+ +  4   Configuring procfs
+ +  4.1 Mount options
+ +
+ +Preface
+ +=======
+ +
+ +0.1 Introduction/Credits
+ +------------------------
+ +
+ +This documentation is  part of a soon (or  so we hope) to be  released book on
+ +the SuSE  Linux distribution. As  there is  no complete documentation  for the
+ +/proc file system and we've used  many freely available sources to write these
+ +chapters, it  seems only fair  to give the work  back to the  Linux community.
+ +This work is  based on the 2.2.*  kernel version and the  upcoming 2.4.*. I'm
+ +afraid it's still far from complete, but we  hope it will be useful. As far as
+ +we know, it is the first 'all-in-one' document about the /proc file system. It
+ +is focused  on the Intel  x86 hardware,  so if you  are looking for  PPC, ARM,
+ +SPARC, AXP, etc., features, you probably  won't find what you are looking for.
+ +It also only covers IPv4 networking, not IPv6 nor other protocols - sorry. But
+ +additions and patches  are welcome and will  be added to this  document if you
+ +mail them to Bodo.
+ +
+ +We'd like  to  thank Alan Cox, Rik van Riel, and Alexey Kuznetsov and a lot of
+ +other people for help compiling this documentation. We'd also like to extend a
+ +special thank  you to Andi Kleen for documentation, which we relied on heavily
+ +to create  this  document,  as well as the additional information he provided.
+ +Thanks to  everybody  else  who contributed source or docs to the Linux kernel
+ +and helped create a great piece of software... :)
+ +
+ +If you  have  any comments, corrections or additions, please don't hesitate to
+ +contact Bodo  Bauer  at  [email protected].  We'll  be happy to add them to this
+ +document.
+ +
+ +The   latest   version    of   this   document   is    available   online   at
+ +http://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html
+ +
+ +If  the above  direction does  not work  for you,  you could  try the  kernel
+ +mailing  list  at  [email protected]  and/or try  to  reach  me  at
+ +[email protected].
+ +
+ +0.2 Legal Stuff
+ +---------------
+ +
+ +We don't  guarantee  the  correctness  of this document, and if you come to us
+ +complaining about  how  you  screwed  up  your  system  because  of  incorrect
+ +documentation, we won't feel responsible...
+ +
+ +Chapter 1: Collecting System Information
+ +========================================
+ +
+ +In This Chapter
+ +---------------
+ +* Investigating  the  properties  of  the  pseudo  file  system  /proc and its
+ +  ability to provide information on the running Linux system
+ +* Examining /proc's structure
+ +* Uncovering  various  information  about the kernel and the processes running
+ +  on the system
+ +
+ +------------------------------------------------------------------------------
+ +
+ +The proc  file  system acts as an interface to internal data structures in the
+ +kernel. It  can  be  used to obtain information about the system and to change
+ +certain kernel parameters at runtime (sysctl).
+ +
+ +First, we'll  take  a  look  at the read-only parts of /proc. In Chapter 2, we
+ +show you how you can use /proc/sys to change settings.
+ +
+ +1.1 Process-Specific Subdirectories
+ +-----------------------------------
+ +
+ +The directory  /proc  contains  (among other things) one subdirectory for each
+ +process running on the system, which is named after the process ID (PID).
+ +
+ +The link  self  points  to  the  process reading the file system. Each process
+ +subdirectory has the entries listed in Table 1-1.
+ +
+ +Note that an open file descriptor to /proc/<pid> or to any of its
+ +contained files or subdirectories does not prevent <pid> being reused
+ +for some other process in the event that <pid> exits. Operations on
+ +open /proc/<pid> file descriptors corresponding to dead processes
+ +never act on any new process that the kernel may, through chance, have
+ +also assigned the process ID <pid>. Instead, operations on these FDs
+ +usually fail with ESRCH.
+ +
+ +.. table:: Table 1-1: Process specific entries in /proc
+ +
+ + =============  ===============================================================
+ + File           Content
+ + =============  ===============================================================
+ + clear_refs     Clears page referenced bits shown in smaps output
+ + cmdline        Command line arguments
+ + cpu            Current and last cpu in which it was executed   (2.4)(smp)
+ + cwd            Link to the current working directory
+ + environ        Values of environment variables
+ + exe            Link to the executable of this process
+ + fd             Directory, which contains all file descriptors
+ + maps           Memory maps to executables and library files    (2.4)
+ + mem            Memory held by this process
+ + root           Link to the root directory of this process
+ + stat           Process status
+ + statm          Process memory status information
+ + status         Process status in human readable form
+ + wchan          Present with CONFIG_KALLSYMS=y: it shows the kernel function
+ +                symbol the task is blocked in - or "0" if not blocked.
+ + pagemap        Page table
+ + stack          Report full stack trace, enable via CONFIG_STACKTRACE
+ + smaps          An extension based on maps, showing the memory consumption of
+ +                each mapping and flags associated with it
+ + smaps_rollup   Accumulated smaps stats for all mappings of the process.  This
+ +                can be derived from smaps, but is faster and more convenient
+ + numa_maps      An extension based on maps, showing the memory locality and
+ +                binding policy as well as mem usage (in pages) of each mapping.
+ + =============  ===============================================================
+ +
+ +For example, to get the status information of a process, all you have to do is
+ +read the file /proc/PID/status::
+ +
+ +  >cat /proc/self/status
+ +  Name:   cat
+ +  State:  R (running)
+ +  Tgid:   5452
+ +  Pid:    5452
+ +  PPid:   743
+ +  TracerPid:      0                                           (2.4)
+ +  Uid:    501     501     501     501
+ +  Gid:    100     100     100     100
+ +  FDSize: 256
+ +  Groups: 100 14 16
+ +  VmPeak:     5004 kB
+ +  VmSize:     5004 kB
+ +  VmLck:         0 kB
+ +  VmHWM:       476 kB
+ +  VmRSS:       476 kB
+ +  RssAnon:             352 kB
+ +  RssFile:             120 kB
+ +  RssShmem:              4 kB
+ +  VmData:      156 kB
+ +  VmStk:        88 kB
+ +  VmExe:        68 kB
+ +  VmLib:      1412 kB
+ +  VmPTE:        20 kB
+ +  VmSwap:        0 kB
+ +  HugetlbPages:          0 kB
+ +  CoreDumping:    0
+ +  THP_enabled:          1
+ +  Threads:        1
+ +  SigQ:   0/28578
+ +  SigPnd: 0000000000000000
+ +  ShdPnd: 0000000000000000
+ +  SigBlk: 0000000000000000
+ +  SigIgn: 0000000000000000
+ +  SigCgt: 0000000000000000
+ +  CapInh: 00000000fffffeff
+ +  CapPrm: 0000000000000000
+ +  CapEff: 0000000000000000
+ +  CapBnd: ffffffffffffffff
+ +  CapAmb: 0000000000000000
+ +  NoNewPrivs:     0
+ +  Seccomp:        0
+ +  Speculation_Store_Bypass:       thread vulnerable
+ +  voluntary_ctxt_switches:        0
+ +  nonvoluntary_ctxt_switches:     1
+ +
+ +This shows you nearly the same information you would get if you viewed it with
+ +the ps  command.  In  fact,  ps  uses  the  proc  file  system  to  obtain its
+ +information.  But you get a more detailed  view of the  process by reading the
+ +file /proc/PID/status. Its fields are described in Table 1-2.
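
As an illustration of how the "Key: value" layout of the status file can be consumed, here is a minimal Python sketch. The helper name `parse_status` is hypothetical and not part of any kernel or library interface; it assumes only that each line has the form shown in the sample output above.

```python
def parse_status(text):
    """Parse the text of a /proc/PID/status file into a dict of field -> value."""
    fields = {}
    for line in text.splitlines():
        # Each line is "Key:<whitespace>value"; split on the first colon only,
        # so values that contain colons themselves are kept intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields
```

On a Linux system it could be used as `parse_status(open("/proc/self/status").read())["VmRSS"]`. Values are returned as strings (for example "476 kB"), so any unit handling is left to the caller.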
+ +
+ +The  statm  file  contains  more  detailed  information about the process
+ +memory usage. Its seven fields are explained in Table 1-3.  The stat file
+ +contains detailed information about the process itself.  Its fields are
+ +explained in Table 1-4.
+ +
+ +(for SMP CONFIG users)
+ +
+ +To keep accounting scalable, RSS-related information is handled in an
+ +asynchronous manner, so the reported values may not be exact. To get a precise
+ +snapshot at a given moment, read the /proc/<pid>/smaps file, which scans the
+ +page tables. It's slow but very precise.
+ +
+ +.. table:: Table 1-2: Contents of the status files (as of 4.19)
+ +
+ + ==========================  ===================================================
+ + Field                       Content
+ + ==========================  ===================================================
+ + Name                        filename of the executable
+ + Umask                       file mode creation mask
+ + State                       state (R is running, S is sleeping, D is sleeping
+ +                             in an uninterruptible wait, Z is zombie,
+ +                             T is traced or stopped)
+ + Tgid                        thread group ID
+ + Ngid                        NUMA group ID (0 if none)
+ + Pid                         process id
+ + PPid                        process id of the parent process
+ + TracerPid                   PID of process tracing this process (0 if not)
+ + Uid                         Real, effective, saved set, and  file system UIDs
+ + Gid                         Real, effective, saved set, and  file system GIDs
+ + FDSize                      number of file descriptor slots currently allocated
+ + Groups                      supplementary group list
+ + NStgid                      descendant namespace thread group ID hierarchy
+ + NSpid                       descendant namespace process ID hierarchy
+ + NSpgid                      descendant namespace process group ID hierarchy
+ + NSsid                       descendant namespace session ID hierarchy
+ + VmPeak                      peak virtual memory size
+ + VmSize                      total program size
+ + VmLck                       locked memory size
+ + VmPin                       pinned memory size
+ + VmHWM                       peak resident set size ("high water mark")
+ + VmRSS                       size of memory portions. It contains the three
+ +                             following parts
+ +                             (VmRSS = RssAnon + RssFile + RssShmem)
+ + RssAnon                     size of resident anonymous memory
+ + RssFile                     size of resident file mappings
+ + RssShmem                    size of resident shmem memory (includes SysV shm,
+ +                             mapping of tmpfs and shared anonymous mappings)
+ + VmData                      size of private data segments
+ + VmStk                       size of stack segments
+ + VmExe                       size of text segment
+ + VmLib                       size of shared library code
+ + VmPTE                       size of page table entries
+ + VmSwap                      amount of swap used by anonymous private data
+ +                             (shmem swap usage is not included)
+ + HugetlbPages                size of hugetlb memory portions
+ + CoreDumping                 process's memory is currently being dumped
+ +                             (killing the process may lead to a corrupted core)
+ + THP_enabled                 process is allowed to use THP (returns 0 when
+ +                             PR_SET_THP_DISABLE is set on the process)
+ + Threads                     number of threads
+ + SigQ                        number of signals queued/max. number for queue
+ + SigPnd                      bitmap of pending signals for the thread
+ + ShdPnd                      bitmap of shared pending signals for the process
+ + SigBlk                      bitmap of blocked signals
+ + SigIgn                      bitmap of ignored signals
+ + SigCgt                      bitmap of caught signals
+ + CapInh                      bitmap of inheritable capabilities
+ + CapPrm                      bitmap of permitted capabilities
+ + CapEff                      bitmap of effective capabilities
+ + CapBnd                      bitmap of capabilities bounding set
+ + CapAmb                      bitmap of ambient capabilities
+ + NoNewPrivs                  no_new_privs, like prctl(PR_GET_NO_NEW_PRIVS, ...)
+ + Seccomp                     seccomp mode, like prctl(PR_GET_SECCOMP, ...)
+ + Speculation_Store_Bypass    speculative store bypass mitigation status
+ + Cpus_allowed                mask of CPUs on which this process may run
+ + Cpus_allowed_list           Same as previous, but in "list format"
+ + Mems_allowed                mask of memory nodes allowed to this process
+ + Mems_allowed_list           Same as previous, but in "list format"
+ + voluntary_ctxt_switches     number of voluntary context switches
+ + nonvoluntary_ctxt_switches  number of non voluntary context switches
+ + ==========================  ===================================================
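
The signal bitmaps (SigPnd, ShdPnd, SigBlk, SigIgn, SigCgt) are printed as hexadecimal masks. A small sketch of decoding one into signal numbers follows; the function name `decode_signal_mask` is hypothetical, and the sketch relies on bit 0 of the mask corresponding to signal 1.

```python
def decode_signal_mask(hex_mask):
    """Return the signal numbers set in a SigPnd/SigBlk/SigIgn/SigCgt bitmap."""
    mask = int(hex_mask, 16)
    # Bit 0 corresponds to signal 1, bit 1 to signal 2, and so on.
    return [bit + 1 for bit in range(mask.bit_length()) if (mask >> bit) & 1]
```

For example, a SigCgt value of 0000000000004002 has bits 1 and 14 set, which decodes to signals 2 and 15.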
+ +
+ +
+ +.. table:: Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
+ +
+ + ======== ===============================     ==============================
+ + Field    Content
+ + ======== ===============================     ==============================
+ + size     total program size (pages)          (same as VmSize in status)
+ + resident size of memory portions (pages)     (same as VmRSS in status)
+ + shared   number of pages that are shared     (i.e. backed by a file, same
+ +                                              as RssFile+RssShmem in status)
+ + trs      number of pages that are 'code'     (not including libs; broken,
+ +                                              includes data segment)
+ + lrs      number of pages of library          (always 0 on 2.6)
+ + drs      number of pages of data/stack       (including libs; broken,
+ +                                              includes library text)
+ + dt       number of dirty pages               (always 0 on 2.6)
+ + ======== ===============================     ==============================
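
Because statm reports page counts while status reports kB, comparing the two requires the page size. The sketch below converts the seven fields to kB. The name `parse_statm` and the default page size of 4096 bytes are assumptions for illustration; on a real system the page size should be queried, for example with `os.sysconf("SC_PAGE_SIZE")`.

```python
STATM_FIELDS = ("size", "resident", "shared", "trs", "lrs", "drs", "dt")


def parse_statm(text, page_size=4096):
    """Map the seven statm fields to sizes in kB.

    page_size is in bytes; 4096 is only an assumed default.
    """
    pages = [int(value) for value in text.split()]
    return {name: count * page_size // 1024
            for name, count in zip(STATM_FIELDS, pages)}
```

With 4 kB pages, a statm line of `1251 119 31 17 0 39 0` gives a size of 5004 kB and a resident size of 476 kB, which would correspond to the VmSize and VmRSS values in the status sample shown earlier.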
+ +
+ +
+ +.. table:: Table 1-4: Contents of the stat files (as of 2.6.30-rc7)
+ +
+ +  ============= ===============================================================
+ +  Field         Content
+ +  ============= ===============================================================
+ +  pid           process id
+ +  tcomm         filename of the executable
+ +  state         state (R is running, S is sleeping, D is sleeping in an
+ +                uninterruptible wait, Z is zombie, T is traced or stopped)
+ +  ppid          process id of the parent process
+ +  pgrp          pgrp of the process
+ +  sid           session id
+ +  tty_nr        tty the process uses
+ +  tty_pgrp      pgrp of the tty
+ +  flags         task flags
+ +  min_flt       number of minor faults
+ +  cmin_flt      number of minor faults with child's
+ +  maj_flt       number of major faults
+ +  cmaj_flt      number of major faults with child's
+ +  utime         user mode jiffies
+ +  stime         kernel mode jiffies
+ +  cutime        user mode jiffies with child's
+ +  cstime        kernel mode jiffies with child's
+ +  priority      priority level
+ +  nice          nice level
+ +  num_threads   number of threads
+ +  it_real_value (obsolete, always 0)
+ +  start_time    time the process started after system boot
+ +  vsize         virtual memory size
+ +  rss           resident set memory size
+ +  rsslim        current limit in bytes on the rss
+ +  start_code    address above which program text can run
+ +  end_code      address below which program text can run
+ +  start_stack   address of the start of the main process stack
+ +  esp           current value of ESP
+ +  eip           current value of EIP
+ +  pending       bitmap of pending signals
+ +  blocked       bitmap of blocked signals
+ +  sigign        bitmap of ignored signals
+ +  sigcatch      bitmap of caught signals
+ +  0             (place holder, used to be the wchan address,
+ +                use /proc/PID/wchan instead)
+ +  0             (place holder)
+ +  0             (place holder)
+ +  exit_signal   signal to send to parent thread on exit
+ +  task_cpu      which CPU the task is scheduled on
+ +  rt_priority   realtime priority
+ +  policy        scheduling policy (man sched_setscheduler)
+ +  blkio_ticks   time spent waiting for block IO
+ +  gtime         guest time of the task in jiffies
+ +  cgtime        guest time of the task children in jiffies
+ +  start_data    address above which program data+bss is placed
+ +  end_data      address below which program data+bss is placed
+ +  start_brk     address above which program heap can be expanded with brk()
+ +  arg_start     address above which program command line is placed
+ +  arg_end       address below which program command line is placed
+ +  env_start     address above which program environment is placed
+ +  env_end       address below which program environment is placed
+ +  exit_code     the thread's exit_code in the form reported by the waitpid
+ +                system call
+ +  ============= ===============================================================
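
When parsing the stat file, note that the tcomm field is printed inside parentheses and the name may itself contain spaces or parentheses, so a plain whitespace split can misalign the remaining fields. A minimal sketch of a more robust split follows; `parse_stat` is a hypothetical helper name.

```python
def parse_stat(text):
    """Split a /proc/PID/stat line into (pid, tcomm, remaining fields)."""
    # tcomm is wrapped in parentheses and may contain spaces or ')',
    # so locate it via the first '(' and the last ')'.
    start = text.index("(")
    end = text.rindex(")")
    pid = int(text[:start])
    tcomm = text[start + 1:end]
    # Everything after the closing parenthesis is whitespace separated,
    # beginning with the state field.
    rest = text[end + 1:].split()
    return pid, tcomm, rest
```

The returned list starts at the state field, so `rest[0]` is the state and `rest[1]` is ppid, following the order of Table 1-4.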
+ +
+ +The /proc/PID/maps file contains the currently mapped memory regions and
+ +their access permissions.
+ +
+ +The format is::
+ +
+ +    address           perms offset  dev   inode      pathname
+ +
+ +    08048000-08049000 r-xp 00000000 03:00 8312       /opt/test
+ +    08049000-0804a000 rw-p 00001000 03:00 8312       /opt/test
+ +    0804a000-0806b000 rw-p 00000000 00:00 0          [heap]
+ +    a7cb1000-a7cb2000 ---p 00000000 00:00 0
+ +    a7cb2000-a7eb2000 rw-p 00000000 00:00 0
+ +    a7eb2000-a7eb3000 ---p 00000000 00:00 0
+ +    a7eb3000-a7ed5000 rw-p 00000000 00:00 0
+ +    a7ed5000-a8008000 r-xp 00000000 03:00 4222       /lib/libc.so.6
+ +    a8008000-a800a000 r--p 00133000 03:00 4222       /lib/libc.so.6
+ +    a800a000-a800b000 rw-p 00135000 03:00 4222       /lib/libc.so.6
+ +    a800b000-a800e000 rw-p 00000000 00:00 0
+ +    a800e000-a8022000 r-xp 00000000 03:00 14462      /lib/libpthread.so.0
+ +    a8022000-a8023000 r--p 00013000 03:00 14462      /lib/libpthread.so.0
+ +    a8023000-a8024000 rw-p 00014000 03:00 14462      /lib/libpthread.so.0
+ +    a8024000-a8027000 rw-p 00000000 00:00 0
+ +    a8027000-a8043000 r-xp 00000000 03:00 8317       /lib/ld-linux.so.2
+ +    a8043000-a8044000 r--p 0001b000 03:00 8317       /lib/ld-linux.so.2
+ +    a8044000-a8045000 rw-p 0001c000 03:00 8317       /lib/ld-linux.so.2
+ +    aff35000-aff4a000 rw-p 00000000 00:00 0          [stack]
+ +    ffffe000-fffff000 r-xp 00000000 00:00 0          [vdso]
+ +
+ +where "address" is the address space in the process that it occupies, "perms"
+ +is a set of permissions::
+ +
+ + r = read
+ + w = write
+ + x = execute
+ + s = shared
+ + p = private (copy on write)
+ +
+ +"offset" is the offset into the mapping, "dev" is the device (major:minor), and
+ +"inode" is the inode  on that device.  0 indicates that  no inode is associated
+ +with the memory region, as the case would be with BSS (uninitialized data).
+ +The "pathname" shows the name of the file associated with this mapping.  If the mapping
+ +is not associated with a file:
+ +
+ + =======                    ====================================
+ + [heap]                     the heap of the program
+ + [stack]                    the stack of the main process
+ + [vdso]                     the "virtual dynamic shared object",
+ +                            the kernel system call handler
+ + =======                    ====================================
+ +
+ + or if empty, the mapping is anonymous.
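
The field layout described above can be turned into a small parser. The following is a sketch only; the function name `parse_maps_line` and the dictionary keys are illustrative choices, not an established interface.

```python
def parse_maps_line(line):
    """Parse one line of /proc/PID/maps into a dict."""
    # The first five fields never contain whitespace; the optional pathname
    # is whatever remains, so limit the split to five cuts.
    parts = line.split(None, 5)
    addr, perms, offset, dev, inode = parts[:5]
    pathname = parts[5].strip() if len(parts) > 5 else ""
    start, end = (int(x, 16) for x in addr.split("-"))
    return {
        "start": start,
        "end": end,
        "perms": perms,
        "offset": int(offset, 16),
        "dev": dev,               # "major:minor", kept as a string
        "inode": int(inode),      # 0 means no inode is associated
        "pathname": pathname,     # "" for anonymous mappings
    }
```

Applied to the first line of the example, this yields a 4096-byte region (`end - start`) with permissions `r-xp` backed by /opt/test.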
+ +
+ +The /proc/PID/smaps is an extension based on maps, showing the memory
+ +consumption for each of the process's mappings. For each mapping (aka Virtual
+ +Memory Area, or VMA) there is a series of lines such as the following::
+ +
+ +    08048000-080bc000 r-xp 00000000 03:02 13130      /bin/bash
+ +
+ +    Size:               1084 kB
+ +    KernelPageSize:        4 kB
+ +    MMUPageSize:           4 kB
+ +    Rss:                 892 kB
+ +    Pss:                 374 kB
+ +    Shared_Clean:        892 kB
+ +    Shared_Dirty:          0 kB
+ +    Private_Clean:         0 kB
+ +    Private_Dirty:         0 kB
+ +    Referenced:          892 kB
+ +    Anonymous:             0 kB
+ +    LazyFree:              0 kB
+ +    AnonHugePages:         0 kB
+ +    ShmemPmdMapped:        0 kB
+ +    Shared_Hugetlb:        0 kB
+ +    Private_Hugetlb:       0 kB
+ +    Swap:                  0 kB
+ +    SwapPss:               0 kB
+ +    Locked:                0 kB
+ +    THPeligible:           0
+ +    VmFlags: rd ex mr mw me dw
+ +
+ +The first of these lines shows the same information as is displayed for the
+ +mapping in /proc/PID/maps.  Following lines show the size of the mapping
+ +(size); the size of each page allocated when backing a VMA (KernelPageSize),
+ +which is usually the same as the size in the page table entries; the page size
+ +used by the MMU when backing a VMA (in most cases, the same as KernelPageSize);
+ +the amount of the mapping that is currently resident in RAM (RSS); the
+ +process' proportional share of this mapping (PSS); and the number of clean and
+ +dirty shared and private pages in the mapping.
+ +
+ +The "proportional set size" (PSS) of a process is the count of pages it has
+ +in memory, where each page is divided by the number of processes sharing it.
+ +So if a process has 1000 pages all to itself, and 1000 shared with one other
+ +process, its PSS will be 1500.
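
The arithmetic in this example can be written out as a short sketch. The function `pss_pages` and its input format (a mapping from the number of sharing processes to a page count) are hypothetical and serve only to make the division explicit.

```python
def pss_pages(pages_by_sharers):
    """Compute a proportional set size, in pages.

    pages_by_sharers maps "number of processes sharing a page" to the
    number of pages shared that many ways.
    """
    # Each page contributes 1/sharers to this process's PSS.
    return sum(count / sharers for sharers, count in pages_by_sharers.items())
```

For the case above, `pss_pages({1: 1000, 2: 1000})` evaluates 1000/1 + 1000/2 and gives 1500.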
+ +
+ +Note that even a page which is part of a MAP_SHARED mapping, but has only
+ +a single pte mapped, i.e.  is currently used by only one process, is accounted
+ +as private and not as shared.
+ +
+ +"Referenced" indicates the amount of memory currently marked as referenced or
+ +accessed.
+ +
+ +"Anonymous" shows the amount of memory that does not belong to any file.  Even
+ +a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
+ +and a page is modified, the file page is replaced by a private anonymous copy.
+ +
+ +"LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE).
+ +The memory isn't freed immediately by madvise(). It's freed under memory
+ +pressure if the memory is clean. Please note that the printed value might
+ +be lower than the real value due to optimizations used in the current
+ +implementation. If this is not desirable please file a bug report.
+ +
+ +"AnonHugePages" shows the amount of memory backed by transparent hugepages.
+ +
+ +"ShmemPmdMapped" shows the amount of shared (shmem/tmpfs) memory backed by
+ +huge pages.
+ +
+ +"Shared_Hugetlb" and "Private_Hugetlb" show the amounts of memory backed by
+ +hugetlbfs pages, which are *not* counted in the "RSS" or "PSS" fields for
+ +historical reasons. They are also not included in the
+ +{Shared,Private}_{Clean,Dirty} fields.
+ +
+ +"Swap" shows how much would-be-anonymous memory is also used, but out on swap.
+ +
+ +For shmem mappings, "Swap" includes also the size of the mapped (and not
+ +replaced by copy-on-write) part of the underlying shmem object out on swap.
+ +"SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this
+ +does not take into account swapped out pages of underlying shmem objects.
+ +"Locked" indicates whether the mapping is locked in memory or not.
+ +"THPeligible" indicates whether the mapping is eligible for allocating THP
+ +pages - 1 if true, 0 otherwise. It just shows the current status.
+ +
+ +The "VmFlags" field deserves a separate description. This member represents the
+ +kernel flags associated with the particular virtual memory area, encoded as
+ +two-letter codes. The codes are the following:
+ +
+ +    ==    =======================================
+ +    rd    readable
+ +    wr    writeable
+ +    ex    executable
+ +    sh    shared
+ +    mr    may read
+ +    mw    may write
+ +    me    may execute
+ +    ms    may share
+ +    gd    stack segment grows down
+ +    pf    pure PFN range
+ +    dw    disabled write to the mapped file
+ +    lo    pages are locked in memory
+ +    io    memory mapped I/O area
+ +    sr    sequential read advise provided
+ +    rr    random read advise provided
+ +    dc    do not copy area on fork
+ +    de    do not expand area on remapping
+ +    ac    area is accountable
+ +    nr    swap space is not reserved for the area
+ +    ht    area uses huge tlb pages
+ +    ar    architecture specific flag
+ +    dd    do not include area into core dump
+ +    sd    soft dirty flag
+ +    mm    mixed map area
+ +    hg    huge page advise flag
+ +    nh    no huge page advise flag
+ +    mg    mergeable advise flag
++    bt  - arm64 BTI guarded page
+ +    ==    =======================================
+ +
+ +Note that there is no guarantee that every flag and associated mnemonic will
+ +be present in all future kernel releases. Things change: flags may vanish
+ +or, conversely, new ones may be added. The interpretation of their meaning
+ +might change in the future as well. So each consumer of these flags has to
+ +follow each specific kernel version for the exact semantics.
+ +
+ +This file is only present if the CONFIG_MMU kernel configuration option is
+ +enabled.
+ +
+ +Note: reading /proc/PID/maps or /proc/PID/smaps is inherently racy (consistent
+ +output can be achieved only in the single read call).
+ +
+ +This typically manifests when doing partial reads of these files while the
+ +memory map is being modified.  Despite the races, we do provide the following
+ +guarantees:
+ +
+ +1) The mapped addresses never go backwards, which implies no two
+ +   regions will ever overlap.
+ +2) If there is something at a given vaddr during the entirety of the
+ +   life of the smaps/maps walk, there will be some output for it.
+ +
+ +The /proc/PID/smaps_rollup file includes the same fields as /proc/PID/smaps,
+ +but their values are the sums of the corresponding values for all mappings of
+ +the process.  Additionally, it contains these fields:
+ +
+ +- Pss_Anon
+ +- Pss_File
+ +- Pss_Shmem
+ +
+ +They represent the proportional shares of anonymous, file, and shmem pages, as
+ +described for smaps above.  These fields are omitted in smaps since each
+ +mapping identifies the type (anon, file, or shmem) of all pages it contains.
+ +Thus all information in smaps_rollup can be derived from smaps, but at a
+ +significantly higher cost.
+ +
+ +The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG
+ +bits on both physical and virtual pages associated with a process, and the
+ +soft-dirty bit on pte (see Documentation/admin-guide/mm/soft-dirty.rst
+ +for details).
+ +To clear the bits for all the pages associated with the process::
+ +
+ +    > echo 1 > /proc/PID/clear_refs
+ +
+ +To clear the bits for the anonymous pages associated with the process::
+ +
+ +    > echo 2 > /proc/PID/clear_refs
+ +
+ +To clear the bits for the file mapped pages associated with the process::
+ +
+ +    > echo 3 > /proc/PID/clear_refs
+ +
+ +To clear the soft-dirty bit::
+ +
+ +    > echo 4 > /proc/PID/clear_refs
+ +
+ +To reset the peak resident set size ("high water mark") to the process's
+ +current value::
+ +
+ +    > echo 5 > /proc/PID/clear_refs
+ +
+ +Any other value written to /proc/PID/clear_refs will have no effect.
+ +
+ +The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags
+ +using /proc/kpageflags and number of times a page is mapped using
+ +/proc/kpagecount. For detailed explanation, see
+ +Documentation/admin-guide/mm/pagemap.rst.
+ +
+ +The /proc/pid/numa_maps is an extension based on maps, showing the memory
+ +locality and binding policy, as well as the memory usage (in pages) of
+ +each mapping. The output follows a general format where mapping details get
+ +summarized, separated by blank spaces, one mapping per line::
+ +
+ +    address   policy    mapping details
+ +
+ +    00400000 default file=/usr/local/bin/app mapped=1 active=0 N3=1 kernelpagesize_kB=4
+ +    00600000 default file=/usr/local/bin/app anon=1 dirty=1 N3=1 kernelpagesize_kB=4
+ +    3206000000 default file=/lib64/ld-2.12.so mapped=26 mapmax=6 N0=24 N3=2 kernelpagesize_kB=4
+ +    320621f000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4
+ +    3206220000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4
+ +    3206221000 default anon=1 dirty=1 N3=1 kernelpagesize_kB=4
+ +    3206800000 default file=/lib64/libc-2.12.so mapped=59 mapmax=21 active=55 N0=41 N3=18 kernelpagesize_kB=4
+ +    320698b000 default file=/lib64/libc-2.12.so
+ +    3206b8a000 default file=/lib64/libc-2.12.so anon=2 dirty=2 N3=2 kernelpagesize_kB=4
+ +    3206b8e000 default file=/lib64/libc-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4
+ +    3206b8f000 default anon=3 dirty=3 active=1 N3=3 kernelpagesize_kB=4
+ +    7f4dc10a2000 default anon=3 dirty=3 N3=3 kernelpagesize_kB=4
+ +    7f4dc10b4000 default anon=2 dirty=2 active=1 N3=2 kernelpagesize_kB=4
+ +    7f4dc1200000 default file=/anon_hugepage\040(deleted) huge anon=1 dirty=1 N3=1 kernelpagesize_kB=2048
+ +    7fff335f0000 default stack anon=3 dirty=3 N3=3 kernelpagesize_kB=4
+ +    7fff3369d000 default mapped=1 mapmax=35 active=0 N3=1 kernelpagesize_kB=4
+ +
+ +Where:
+ +
+ +"address" is the starting address for the mapping;
+ +
+ +"policy" reports the NUMA memory policy set for the mapping (see Documentation/admin-guide/mm/numa_memory_policy.rst);
+ +
+ +"mapping details" summarizes mapping data such as mapping type, page usage counters,
+ +node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page
+ +size, in KB, that backs the mapping.
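+ +
+ +The field layout described above can be sketched as a small parser. This is
+ +only an illustration: the function name parse_numa_maps_line is hypothetical,
+ +and the sample line is copied from the listing above. It assumes that fields
+ +never contain unescaped blanks, as in the example output.

```python
def parse_numa_maps_line(line):
    """Split one numa_maps line into (address, policy, details)."""
    fields = line.split()
    address = int(fields[0], 16)      # starting address is hexadecimal
    policy = fields[1]                # e.g. "default"
    details = {}
    for token in fields[2:]:
        key, sep, value = token.partition("=")
        if not sep:
            details[key] = True       # bare flag such as "stack" or "huge"
        elif value.isdigit():
            details[key] = int(value) # page counters and page size
        else:
            details[key] = value      # e.g. file=/lib64/ld-2.12.so
    return address, policy, details


sample = ("3206000000 default file=/lib64/ld-2.12.so mapped=26 mapmax=6 "
          "N0=24 N3=2 kernelpagesize_kB=4")
address, policy, details = parse_numa_maps_line(sample)

# Node locality counters are the keys of the form N<number>.
per_node = {k: v for k, v in details.items()
            if k.startswith("N") and k[1:].isdigit()}
print(hex(address), policy, per_node)
```

+ +Summing the per_node values gives the number of pages of the mapping that
+ +are resident on any node.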
+ +
+ +1.2 Kernel data
+ +---------------
+ +
+ +Similar to the process entries, the kernel data files give information about
+ +the running kernel. The files used to obtain this information are contained in
+ +/proc and are listed in Table 1-5. Not all of these will be present on your
+ +system; which files exist depends on the kernel configuration and the loaded
+ +modules.
+ +
+ +.. table:: Table 1-5: Kernel info in /proc
+ +
+ + ============ ===============================================================
+ + File         Content
+ + ============ ===============================================================
+ + apm          Advanced power management info
+ + buddyinfo    Kernel memory allocator information (see text)    (2.5)
+ + bus          Directory containing bus specific information
+ + cmdline      Kernel command line
+ + cpuinfo      Info about the CPU
+ + devices      Available devices (block and character)
+ + dma          Used DMA channels
+ + filesystems  Supported filesystems
+ + driver       Various drivers grouped here, currently rtc       (2.4)
+ + execdomains  Execdomains, related to security                  (2.4)
+ + fb           Frame Buffer devices                              (2.4)
+ + fs           File system parameters, currently nfs/exports     (2.4)
+ + ide          Directory containing info about the IDE subsystem
+ + interrupts   Interrupt usage
+ + iomem        Memory map                                        (2.4)
+ + ioports      I/O port usage
+ + irq          Masks for irq to cpu affinity                     (2.4)(smp?)
+ + isapnp       ISA PnP (Plug&Play) Info                          (2.4)
+ + kcore        Kernel core image (can be ELF or A.OUT(deprecated in 2.4))
+ + kmsg         Kernel messages
+ + ksyms        Kernel symbol table
+ + loadavg      Load average of last 1, 5 & 15 minutes
+ + locks        Kernel locks
+ + meminfo      Memory info
+ + misc         Miscellaneous
+ + modules      List of loaded modules
+ + mounts       Mounted filesystems
+ + net          Networking info (see text)
+ + pagetypeinfo Additional page allocator information (see text)  (2.5)
+ + partitions   Table of partitions known to the system
+ + pci          Deprecated info of PCI bus (new way -> /proc/bus/pci/,
+ +              decoupled by lspci)                               (2.4)
+ + rtc          Real time clock
+ + scsi         SCSI info (see text)
+ + slabinfo     Slab pool info
+ + softirqs     softirq usage
+ + stat         Overall statistics
+ + swaps        Swap space utilization
+ + sys          See chapter 2
+ + sysvipc      Info of SysVIPC Resources (msg, sem, shm)         (2.4)
+ + tty          Info of tty drivers
+ + uptime       Wall clock since boot, combined idle time of all cpus
+ + version      Kernel version
+ + video        bttv info of video resources                      (2.4)
+ + vmallocinfo  Show vmalloced areas
+ + ============ ===============================================================
+ +
+ +You can, for example, check which interrupts are currently in use and what
+ +they are used for by looking in the file /proc/interrupts::
+ +
+ +  > cat /proc/interrupts
+ +             CPU0
+ +    0:    8728810          XT-PIC  timer
+ +    1:        895          XT-PIC  keyboard
+ +    2:          0          XT-PIC  cascade
+ +    3:     531695          XT-PIC  aha152x
+ +    4:    2014133          XT-PIC  serial
+ +    5:      44401          XT-PIC  pcnet_cs
+ +    8:          2          XT-PIC  rtc
+ +   11:          8          XT-PIC  i82365
+ +   12:     182918          XT-PIC  PS/2 Mouse
+ +   13:          1          XT-PIC  fpu
+ +   14:    1232265          XT-PIC  ide0
+ +   15:          7          XT-PIC  ide1
+ +  NMI:          0
+ +
+ +In 2.4.* two lines were added to this file, LOC and ERR (this time the output
+ +is from an SMP machine)::
+ +
+ +  > cat /proc/interrupts
+ +
+ +             CPU0       CPU1
+ +    0:    1243498    1214548    IO-APIC-edge  timer
+ +    1:       8949       8958    IO-APIC-edge  keyboard
+ +    2:          0          0          XT-PIC  cascade
+ +    5:      11286      10161    IO-APIC-edge  soundblaster
+ +    8:          1          0    IO-APIC-edge  rtc
+ +    9:      27422      27407    IO-APIC-edge  3c503
+ +   12:     113645     113873    IO-APIC-edge  PS/2 Mouse
+ +   13:          0          0          XT-PIC  fpu
+ +   14:      22491      24012    IO-APIC-edge  ide0
+ +   15:       2183       2415    IO-APIC-edge  ide1
+ +   17:      30564      30414   IO-APIC-level  eth0
+ +   18:        177        164   IO-APIC-level  bttv
+ +  NMI:    2457961    2457959
+ +  LOC:    2457882    2457881
+ +  ERR:       2155
+ +
+ +NMI is incremented in this case because every timer interrupt generates an NMI
+ +(Non Maskable Interrupt) which is used by the NMI Watchdog to detect lockups.
+ +
+ +LOC is the local interrupt counter of the internal APIC of every CPU.
+ +
+ +ERR is incremented in the case of errors in the IO-APIC bus (the bus that
+ +connects the CPUs in an SMP system). This means that an error has been detected;
+ +the IO-APIC automatically retries the transmission, so it should not be a big
+ +problem, but you should read the SMP-FAQ.
+ +
+ +In 2.6.2* /proc/interrupts was expanded again.  This time the goal was for
+ +/proc/interrupts to display every IRQ vector in use by the system, not
+ +just those considered 'most important'.  The new vectors are:
+ +
+ +THR
+ +  interrupt raised when a machine check threshold counter
+ +  (typically counting ECC corrected errors of memory or cache) exceeds
+ +  a configurable threshold.  Only available on some systems.
+ +
+ +TRM
+ +  a thermal event interrupt occurs when a temperature threshold
+ +  has been exceeded for the CPU.  This interrupt may also be generated
+ +  when the temperature drops back to normal.
+ +
+ +SPU
+ +  a spurious interrupt is some interrupt that was raised then lowered
+ +  by some IO device before it could be fully processed by the APIC.  Hence
+ +  the APIC sees the interrupt but does not know what device it came from.
+ +  For this case the APIC will generate the interrupt with an IRQ vector
+ +  of 0xff. This might also be generated by chipset bugs.
+ +
+ +RES, CAL, TLB
+ +  rescheduling, call and TLB flush interrupts are
+ +  sent from one CPU to another per the needs of the OS.  Typically,
+ +  their statistics are used by kernel developers and interested users to
+ +  determine the occurrence of interrupts of the given type.
+ +
+ +The above IRQ vectors are displayed only when relevant.  For example,
+ +the threshold vector does not exist on x86_64 platforms.  Others are
+ +suppressed when the system is a uniprocessor.  As of this writing, only
+ +i386 and x86_64 platforms support the new IRQ vector displays.
+ +
+ +Of some interest is the introduction of the /proc/irq directory in 2.4.
+ +It can be used to set IRQ to CPU affinity. This means that you can "hook" an
+ +IRQ to only one CPU, or exclude a CPU from handling IRQs. The irq subdir
+ +contains one subdir for each IRQ, and two files: default_smp_affinity and
+ +prof_cpu_mask.
+ +
+ +For example::
+ +
+ +  > ls /proc/irq/
+ +  0  10  12  14  16  18  2  4  6  8  prof_cpu_mask
+ +  1  11  13  15  17  19  3  5  7  9  default_smp_affinity
+ +  > ls /proc/irq/0/
+ +  smp_affinity
+ +
+ +smp_affinity is a bitmask in which you can specify which CPUs can handle the
+ +IRQ. You can set it by doing::
+ +
+ +  > echo 1 > /proc/irq/10/smp_affinity
+ +
+ +This means that only the first CPU will handle the IRQ. You can also echo
+ +5, which means that only the first and third CPUs can handle the IRQ.
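+ +
+ +The relation between CPU numbers and the hexadecimal mask can be sketched as
+ +follows. The helper names cpus_to_mask and mask_to_cpus are hypothetical and
+ +not part of any kernel interface; bit N of the mask stands for CPU N. The
+ +handling of commas is an assumption about how wider masks may be grouped.

```python
def cpus_to_mask(cpus):
    """Return the hex bitmask string that selects the given CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu              # bit N stands for CPU N
    return format(mask, "x")


def mask_to_cpus(text):
    """Return the CPU numbers selected by a hex bitmask string."""
    # Assumption: any commas only group the hex digits and can be dropped.
    mask = int(text.strip().replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


print(cpus_to_mask([0]))              # first CPU only
print(cpus_to_mask([0, 2]))           # first and third CPU
print(mask_to_cpus("ffffffff"))       # every CPU from 0 to 31
```

+ +The string returned by cpus_to_mask is the kind of value that is echoed into
+ +smp_affinity in the example above.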
+ +
+ +The contents of each smp_affinity file are the same by default::
+ +
+ +  > cat /proc/irq/0/smp_affinity
+ +  ffffffff
+ +
+ +There is an alternate interface, smp_affinity_list, which allows specifying
+ +a CPU range instead of a bitmask::
+ +
+ +  > cat /proc/irq/0/smp_affinity_list
+ +  1024-1031
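+ +
+ +A range such as the one above can be expanded into individual CPU numbers
+ +with a few lines of code. This is a minimal sketch; the function name
+ +parse_cpu_list is hypothetical, and the support for comma-separated entries
+ +is an assumption that goes beyond the single range shown in the example.

```python
def parse_cpu_list(text):
    """Expand a CPU list string such as "1024-1031" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue                  # tolerate an empty string
        first, sep, last = part.partition("-")
        if sep:
            cpus.extend(range(int(first), int(last) + 1))  # inclusive range
        else:
            cpus.append(int(first))   # single CPU number
    return cpus


print(parse_cpu_list("1024-1031"))
```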
+ +
+ +The default_smp_affinity mask applies to all non-active IRQs, which are the
+ +IRQs which have not yet been allocated/activated, and hence which lack a
+ +/proc/irq/[0-9]* directory.
+ +
+ +The node file on an SMP system shows the node to which the device using the IRQ
+ +reports itself as being attached. This hardware locality information does not
+ +include information about any possible driver locality preference.
+ +
+ +prof_cpu_mask specifies which CPUs are to be profiled by the system wide
+ +profiler. Default value is ffffffff (all cpus if there are only 32 of them).
+ +
+ +The way IRQs are routed is handled by the IO-APIC, and it's Round Robin
+ +between all the CPUs which are allowed to handle it. As usual the kernel has
+ +more info than you and does a better job than you, so the defaults are the
+ +best choice for almost everyone.  [Note this applies only to those IO-APICs
+ +that support "Round Robin" interrupt distribution.]
+ +
+ +There are three more important subdirectories in /proc: net, scsi, and sys.
+ +The general rule is that the contents, or even the existence, of these
+ +directories depend on your kernel configuration. If SCSI is not enabled, the
+ +directory scsi may not exist. The same is true of net, which is there
+ +only when networking support is present in the running kernel.
+ +
+ +The slabinfo file gives information about memory usage at the slab level.
+ +Linux has used slab pools for memory management above page level since
+ +version 2.2. Commonly used objects have their own slab pool (such as network
+ +buffers, directory cache, and so on).
+ +
+ +::
+ +
+ +    > cat /proc/buddyinfo
+ +
+ +    Node 0, zone      DMA      0      4      5      4      4      3 ...
+ +    Node 0, zone   Normal      1      0      0      1    101      8 ...
+ +    Node 0, zone  HighMem      2      0      0      1      1      0 ...
+ +
+ +External fragmentation is a problem under some workloads, and buddyinfo is a
+ +useful tool for helping diagnose these problems.  Buddyinfo will give you a
+ +clue as to how big an area you can safely allocate, or why a previous
+ +allocation failed.
+ +
+ +Each column represents the number of free chunks of a certain order which are
+ +available.  In this case, there are 0 chunks of 2^0*PAGE_SIZE available in
+ +ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
+ +available in ZONE_NORMAL, etc...
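+ +
+ +The column arithmetic can be sketched as follows: a chunk of order N covers
+ +2^N pages, so the free pages in a zone are the sum of count * 2^N over all
+ +columns. The function names are hypothetical, and the sample line holds only
+ +the columns printed above (the elided ones are left out), so the total
+ +covers orders 0 to 5 only.

```python
def parse_buddyinfo_line(line):
    """Return (node, zone, counts) for one /proc/buddyinfo line."""
    fields = line.split()
    # Layout: "Node", "<n>,", "zone", "<name>", then one count per order.
    node = int(fields[1].rstrip(","))
    zone = fields[3]
    counts = [int(c) for c in fields[4:]]
    return node, zone, counts


def free_pages(counts):
    """Sum count * 2^order over all orders."""
    return sum(count << order for order, count in enumerate(counts))


sample = "Node 0, zone   Normal      1      0      0      1    101      8"
node, zone, counts = parse_buddyinfo_line(sample)
print(node, zone, free_pages(counts), "pages in the columns shown")
```

+ +Multiplying the result by the page size gives the amount of free memory in
+ +those orders.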
+ +
+ +More information relevant to external fragmentation can be found in
+ +pagetypeinfo::
+ +
+ +    > cat /proc/pagetypeinfo
+ +    Page block order: 9
+ +    Pages per block:  512
+ +
+ +    Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
+ +    Node    0, zone      DMA, type    Unmovable      0      0      0      1      1      1      1      1      1      1      0
+ +    Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
+ +    Node    0, zone      DMA, type      Movable      1      1      2      1      2      1      1      0      1      0      2
+ +    Node    0, zone      DMA, type      Reserve      0      0      0      0      0      0      0      0      0      1      0
+ +    Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
+ +    Node    0, zone    DMA32, type    Unmovable    103     54     77      1      1      1     11      8      7      1      9
+ +    Node    0, zone    DMA32, type  Reclaimable      0      0      2      1      0      0      0      0      1      0      0
+ +    Node    0, zone    DMA32, type      Movable    169    152    113     91     77     54     39     13      6      1    452
+ +    Node    0, zone    DMA32, type      Reserve      1      2      2      2      2      0      1      1      1      1      0
+ +    Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
+ +
+ +    Number of blocks type     Unmovable  Reclaimable      Movable      Reserve      Isolate
+ +    Node 0, zone      DMA            2            0            5            1            0
+ +    Node 0, zone    DMA32           41            6          967            2            0
+ +
+ +Fragmentation avoidance in the kernel works by grouping pages of the same
+ +migrate type into contiguous regions of memory called page blocks.
+ +A page block is typically the size of the default hugepage size, e.g. 2MB on
+ +x86-64. By keeping pages grouped based on their ability to move, the kernel
+ +can reclaim pages within a page block to satisfy a high-order allocation.
+ +
+ +The pagetypeinfo output begins with information on the size of a page block. It
+ +then gives the same type of information as buddyinfo except broken down
+ +by migrate-type and finishes with details on how many page blocks of each
+ +type exist.
+ +
+ +If min_free_kbytes has been tuned correctly (recommendations made by hugeadm
+ +from libhugetlbfs https://github.com/libhugetlbfs/libhugetlbfs/), one can
+ +make an estimate of the likely number of huge pages that can be allocated
+ +at a given point in time. All the "Movable" blocks should be allocatable
+ +unless memory has been mlock()'d. Some of the Reclaimable blocks should
+ +also be allocatable although a lot of filesystem metadata may have to be
+ +reclaimed to achieve this.
+ +
+ +
+ +meminfo
+ +~~~~~~~
+ +
+ +Provides information about distribution and utilization of memory.  This
+ +varies by architecture and compile options.  The following is from a
+ +16GB PIII, which has highmem enabled.  You may not have all of these fields.
+ +
+ +::
+ +
+ +    > cat /proc/meminfo
+ +
+ +    MemTotal:     16344972 kB
+ +    MemFree:      13634064 kB
+ +    MemAvailable: 14836172 kB
+ +    Buffers:          3656 kB
+ +    Cached:        1195708 kB
+ +    SwapCached:          0 kB
+ +    Active:         891636 kB
+ +    Inactive:      1077224 kB
+ +    HighTotal:    15597528 kB
+ +    HighFree:     13629632 kB
+ +    LowTotal:       747444 kB
+ +    LowFree:          4432 kB
+ +    SwapTotal:           0 kB
+ +    SwapFree:            0 kB
+ +    Dirty:             968 kB
+ +    Writeback:           0 kB
+ +    AnonPages:      861800 kB
+ +    Mapped:         280372 kB
+ +    Shmem:             644 kB
+ +    KReclaimable:   168048 kB
+ +    Slab:           284364 kB
+ +    SReclaimable:   159856 kB
+ +    SUnreclaim:     124508 kB
+ +    PageTables:      24448 kB
+ +    NFS_Unstable:        0 kB
+ +    Bounce:              0 kB
+ +    WritebackTmp:        0 kB
+ +    CommitLimit:   7669796 kB
+ +    Committed_AS:   100056 kB
+ +    VmallocTotal:   112216 kB
+ +    VmallocUsed:       428 kB
+ +    VmallocChunk:   111088 kB
+ +    Percpu:          62080 kB
+ +    HardwareCorrupted:   0 kB
+ +    AnonHugePages:   49152 kB
+ +    ShmemHugePages:      0 kB
+ +    ShmemPmdMapped:      0 kB
+ +
+ +MemTotal
+ +              Total usable RAM (i.e. physical RAM minus a few reserved
+ +              bits and the kernel binary code)
+ +MemFree
+ +              The sum of LowFree+HighFree
+ +MemAvailable
+ +              An estimate of how much memory is available for starting new
+ +              applications, without swapping. Calculated from MemFree,
+ +              SReclaimable, the size of the file LRU lists, and the low
+ +              watermarks in each zone.
+ +              The estimate takes into account that the system needs some
+ +              page cache to function well, and that not all reclaimable
+ +              slab will be reclaimable, due to items being in use. The
+ +              impact of those factors will vary from system to system.
+ +Buffers
+ +              Relatively temporary storage for raw disk blocks;
+ +              shouldn't get tremendously large (20MB or so)
+ +Cached
+ +              in-memory cache for files read from the disk (the
+ +              pagecache).  Doesn't include SwapCached
+ +SwapCached
+ +              Memory that once was swapped out, is swapped back in but
+ +              still also is in the swapfile (if memory is needed it
+ +              doesn't need to be swapped out AGAIN because it is already
+ +              in the swapfile. This saves I/O)
+ +Active
+ +              Memory that has been used more recently and usually not
+ +              reclaimed unless absolutely necessary.
+ +Inactive
+ +              Memory which has been less recently used.  It is more
+ +              eligible to be reclaimed for other purposes
+ +HighTotal, HighFree
+ +              Highmem is all memory above ~860MB of physical memory.
+ +              Highmem areas are for use by userspace programs, or
+ +              for the pagecache.  The kernel must use tricks to access
+ +              this memory, making it slower to access than lowmem.
+ +LowTotal, LowFree
+ +              Lowmem is memory which can be used for everything that
+ +              highmem can be used for, but it is also available for the
+ +              kernel's use for its own data structures.  Among many
+ +              other things, it is where everything from the Slab is
+ +              allocated.  Bad things happen when you're out of lowmem.
+ +SwapTotal
+ +              total amount of swap space available
+ +SwapFree
+ +              Amount of swap space that is currently unused
+ +Dirty
+ +              Memory which is waiting to get written back to the disk
+ +Writeback
+ +              Memory which is actively being written back to the disk
+ +AnonPages
+ +              Non-file backed pages mapped into userspace page tables
+ +HardwareCorrupted
+ +              The amount of RAM, in kB, that the kernel identifies as
+ +              corrupted.
+ +AnonHugePages
+ +              Non-file backed huge pages mapped into userspace page tables
+ +Mapped
+ +              files which have been mmapped, such as libraries
+ +Shmem
+ +              Total memory used by shared memory (shmem) and tmpfs
+ +ShmemHugePages
+ +              Memory used by shared memory (shmem) and tmpfs allocated
+ +              with huge pages
+ +ShmemPmdMapped
+ +              Shared memory mapped into userspace with huge pages
+ +KReclaimable
+ +              Kernel allocations that the kernel will attempt to reclaim
+ +              under memory pressure. Includes SReclaimable (below), and other
+ +              direct allocations with a shrinker.
+ +Slab
+ +              in-kernel data structures cache
+ +SReclaimable
+ +              Part of Slab, that might be reclaimed, such as caches
+ +SUnreclaim
+ +              Part of Slab, that cannot be reclaimed on memory pressure
+ +PageTables
+ +              amount of memory dedicated to the lowest level of page
+ +              tables.
+ +NFS_Unstable
+ +              NFS pages sent to the server, but not yet committed to stable
+ +              storage
+ +Bounce
+ +              Memory used for block device "bounce buffers"
+ +WritebackTmp
+ +              Memory used by FUSE for temporary writeback buffers
+ +CommitLimit
+ +              Based on the overcommit ratio ('vm.overcommit_ratio'),
+ +              this is the total amount of memory currently available to
+ +              be allocated on the system. This limit is only adhered to
+ +              if strict overcommit accounting is enabled (mode 2 in
+ +              'vm.overcommit_memory').
+ +
+ +              The CommitLimit is calculated with the following formula::
+ +
+ +                CommitLimit = ([total RAM pages] - [total huge TLB pages]) *
+ +                               overcommit_ratio / 100 + [total swap pages]
+ +
+ +              For example, on a system with 1G of physical RAM and 7G
+ +              of swap with a `vm.overcommit_ratio` of 30 it would
+ +              yield a CommitLimit of 7.3G.
+ +
+ +              For more details, see the memory overcommit documentation
+ +              in vm/overcommit-accounting.
+ +Committed_AS
+ +              The amount of memory presently allocated on the system.
+ +              The committed memory is a sum of all of the memory which
+ +              has been allocated by processes, even if it has not been
+ +              "used" by them as of yet. A process which malloc()'s 1G
+ +              of memory, but only touches 300M of it will show up as
+ +              using 1G. This 1G is memory which has been "committed" to
+ +              by the VM and can be used at any time by the allocating
+ +              application. With strict overcommit enabled on the system
+ +              (mode 2 in 'vm.overcommit_memory'), allocations which would
+ +              exceed the CommitLimit (detailed above) will not be permitted.
+ +              This is useful if one needs to guarantee that processes will
+ +              not fail due to lack of memory once that memory has been
+ +              successfully allocated.
+ +VmallocTotal
+ +              total size of vmalloc memory area
+ +VmallocUsed
+ +              amount of vmalloc area which is used
+ +VmallocChunk
+ +              largest contiguous block of vmalloc area which is free
+ +Percpu
+ +              Memory allocated to the percpu allocator used to back percpu
+ +              allocations. This stat excludes the cost of metadata.
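+ +
+ +The CommitLimit formula given above can be checked against its own example
+ +(1G of RAM, 7G of swap, vm.overcommit_ratio of 30). This is a sketch only:
+ +the function name commit_limit is hypothetical, the sizes are expressed in
+ +kB rather than pages, the division is assumed to be integer division, and
+ +the example is assumed to have no huge TLB pages.

```python
def commit_limit(total_ram, total_hugetlb, overcommit_ratio, total_swap):
    """Apply the CommitLimit formula; all sizes must share one unit."""
    # Assumption: integer division, so the result is rounded down.
    return (total_ram - total_hugetlb) * overcommit_ratio // 100 + total_swap


KB_PER_GB = 1024 * 1024

# 1G of RAM, no huge TLB pages (assumed), ratio 30, 7G of swap.
limit_kb = commit_limit(1 * KB_PER_GB, 0, 30, 7 * KB_PER_GB)
print(limit_kb, "kB, or about", round(limit_kb / KB_PER_GB, 1), "G")
```

+ +Only 30% of the RAM counts towards the limit, while swap counts in full,
+ +which is why the result is dominated by the 7G of swap.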
+ +
+ +vmallocinfo
+ +~~~~~~~~~~~
+ +
+ +Provides information about vmalloced/vmapped areas. One line per area,
+ +containing the virtual address range of the area, size in bytes,
+ +caller information of the creator, and optional information depending
+ +on the kind of area:
+ +
+ + ==========  ===================================================
+ + pages=nr    number of pages
+ + phys=addr   if a physical address was specified
+ + ioremap     I/O mapping (ioremap() and friends)
+ + vmalloc     vmalloc() area
+ + vmap        vmap()ed pages
+ + user        VM_USERMAP area
+ + vpages      buffer for page pointers was vmalloced (huge area)
+ + N<node>=nr  (Only on NUMA kernels)
+ +             Number of pages allocated on memory node <node>
+ + ==========  ===================================================
+ +
+ +::
+ +
+ +    > cat /proc/vmallocinfo
+ +    0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204 ...
+ +    /0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128
+ +    0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204 ...
+ +    /0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64
+ +    0xffffc20000302000-0xffffc20000304000    8192 acpi_tb_verify_table+0x21/0x4f...
+ +    phys=7fee8000 ioremap
+ +    0xffffc20000304000-0xffffc20000307000   12288 acpi_tb_verify_table+0x21/0x4f...
+ +    phys=7fee7000 ioremap
+ +    0xffffc2000031d000-0xffffc2000031f000    8192 init_vdso_vars+0x112/0x210
+ +    0xffffc2000031f000-0xffffc2000032b000   49152 cramfs_uncompress_init+0x2e ...
+ +    /0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3
+ +    0xffffc2000033a000-0xffffc2000033d000   12288 sys_swapon+0x640/0xac0      ...
+ +    pages=2 vmalloc N1=2
+ +    0xffffc20000347000-0xffffc2000034c000   20480 xt_alloc_table_info+0xfe ...
+ +    /0x130 [x_tables] pages=4 vmalloc N0=4
+ +    0xffffffffa0000000-0xffffffffa000f000   61440 sys_init_module+0xc27/0x1d00 ...
+ +    pages=14 vmalloc N2=14
+ +    0xffffffffa000f000-0xffffffffa0014000   20480 sys_init_module+0xc27/0x1d00 ...
+ +    pages=4 vmalloc N1=4
+ +    0xffffffffa0014000-0xffffffffa0017000   12288 sys_init_module+0xc27/0x1d00 ...
+ +    pages=2 vmalloc N1=2
+ +    0xffffffffa0017000-0xffffffffa0022000   45056 sys_init_module+0xc27/0x1d00 ...
+ +    pages=10 vmalloc N0=10
+ +
+ +
+ +softirqs
+ +~~~~~~~~
+ +
+ +Provides counts of softirq handlers serviced since boot time, for each cpu.
+ +
+ +::
+ +
+ +    > cat /proc/softirqs
+ +                  CPU0       CPU1       CPU2       CPU3
+ +      HI:          0          0          0          0
+ +    TIMER:      27166      27120      27097      27034
+ +    NET_TX:          0          0          0         17
+ +    NET_RX:         42          0          0         39
+ +    BLOCK:          0          0        107       1121
+ +    TASKLET:          0          0          0        290
+ +    SCHED:      27035      26983      26971      26746
+ +    HRTIMER:          0          0          0          0
+ +      RCU:       1678       1769       2178       2250
+ +
+ +
+ +1.3 IDE devices in /proc/ide
+ +----------------------------
+ +
+ +The subdirectory /proc/ide contains information about all IDE devices of which
+ +the kernel is aware. There is one subdirectory for each IDE controller, the
+ +file drivers, and a link for each IDE device, pointing to the device directory
+ +in the controller-specific subtree.
+ +
+ +The file drivers contains general information about the drivers used for the
+ +IDE devices::
+ +
+ +  > cat /proc/ide/drivers
+ +  ide-cdrom version 4.53
+ +  ide-disk version 1.08
+ +
+ +More detailed information can be found in the controller-specific
+ +subdirectories. These are named ide0, ide1 and so on. Each of these
+ +directories contains the files shown in Table 1-6.
+ +
+ +
+ +.. table:: Table 1-6: IDE controller info in /proc/ide/ide?
+ +
+ + ======= =======================================
+ + File    Content
+ + ======= =======================================
+ + channel IDE channel (0 or 1)
+ + config  Configuration (only for PCI/IDE bridge)
+ + mate    Mate name
+ + model   Type/Chipset of IDE controller
+ + ======= =======================================
+ +
+ +Each device connected to a controller has a separate subdirectory in the
+ +controller's directory. The files listed in Table 1-7 are contained in these
+ +directories.
+ +
+ +
+ +.. table:: Table 1-7: IDE device information
+ +
+ + ================ ==========================================
+ + File             Content
+ + ================ ==========================================
+ + cache            The cache
+ + capacity         Capacity of the medium (in 512Byte blocks)
+ + driver           driver and version
+ + geometry         physical and logical geometry
+ + identify         device identify block
+ + media            media type
+ + model            device identifier
+ + settings         device setup
+ + smart_thresholds IDE disk management thresholds
+ + smart_values     IDE disk management values
+ + ================ ==========================================
+ +
+ +The most interesting file is ``settings``. This file contains a nice
+ +overview of the drive parameters::
+ +
+ +  # cat /proc/ide/ide0/hda/settings
+ +  name                    value           min             max             mode
+ +  ----                    -----           ---             ---             ----
+ +  bios_cyl                526             0               65535           rw
+ +  bios_head               255             0               255             rw
+ +  bios_sect               63              0               63              rw
+ +  breada_readahead        4               0               127             rw
+ +  bswap                   0               0               1               r
+ +  file_readahead          72              0               2097151         rw
+ +  io_32bit                0               0               3               rw
+ +  keepsettings            0               0               1               rw
+ +  max_kb_per_request      122             1               127             rw
+ +  multcount               0               0               8               rw
+ +  nice1                   1               0               1               rw
+ +  nowerr                  0               0               1               rw
+ +  pio_mode                write-only      0               255             w
+ +  slow                    0               0               1               rw
+ +  unmaskirq               0               0               1               rw
+ +  using_dma               0               0               1               rw
+ +
+ +
+ +1.4 Networking info in /proc/net
+ +--------------------------------
+ +
+ +The subdirectory /proc/net follows the usual pattern. Table 1-8 shows the
+ +additional values you get for IP version 6 if you configure the kernel to
+ +support this. Table 1-9 lists the files and their meaning.
+ +
+ +
+ +.. table:: Table 1-8: IPv6 info in /proc/net
+ +
+ + ========== =====================================================
+ + File       Content
+ + ========== =====================================================
+ + udp6       UDP sockets (IPv6)
+ + tcp6       TCP sockets (IPv6)
+ + raw6       Raw device statistics (IPv6)
+ + igmp6      IP multicast addresses, which this host joined (IPv6)
+ + if_inet6   List of IPv6 interface addresses
+ + ipv6_route Kernel routing table for IPv6
+ + rt6_stats  Global IPv6 routing tables statistics
+ + sockstat6  Socket statistics (IPv6)
+ + snmp6      Snmp data (IPv6)
+ + ========== =====================================================
+ +
+ +.. table:: Table 1-9: Network info in /proc/net
+ +
+ + ============= ================================================================
+ + File          Content
+ + ============= ================================================================
+ + arp           Kernel ARP table
+ + dev           network devices with statistics
+ + dev_mcast     the Layer2 multicast groups a device is listening to
+ +               (interface index, label, number of references, number of bound
+ +               addresses).
+ + dev_stat      network device status
+ + ip_fwchains   Firewall chain linkage
+ + ip_fwnames    Firewall chain names
+ + ip_masq       Directory containing the masquerading tables
+ + ip_masquerade Major masquerading table
+ + netstat       Network statistics
+ + raw           raw device statistics
+ + route         Kernel routing table
+ + rpc           Directory containing rpc info
+ + rt_cache      Routing cache
+ + snmp          SNMP data
+ + sockstat      Socket statistics
+ + tcp           TCP sockets
+ + udp           UDP sockets
+ + unix          UNIX domain sockets
+ + wireless      Wireless interface data (Wavelan etc)
+ + igmp          IP multicast addresses, which this host joined
+ + psched        Global packet scheduler parameters.
+ + netlink       List of PF_NETLINK sockets
+ + ip_mr_vifs    List of multicast virtual interfaces
+ + ip_mr_cache   List of multicast routing cache
+ + ============= ================================================================
+ +
+ +You can use this information to see which network devices are available in
+ +your system and how much traffic was routed over those devices::
+ +
+ +  > cat /proc/net/dev
+ +  Inter-|Receive                                                   |[...
+ +   face |bytes    packets errs drop fifo frame compressed multicast|[...
+ +      lo:  908188   5596     0    0    0     0          0         0 [...
+ +    ppp0:15475140  20721   410    0    0   410          0         0 [...
+ +    eth0:  614530   7085     0    0    0     0          0         1 [...
+ +
+ +  ...] Transmit
+ +  ...] bytes    packets errs drop fifo colls carrier compressed
+ +  ...]  908188     5596    0    0    0     0       0          0
+ +  ...] 1375103    17405    0    0    0     0       0          0
+ +  ...] 1703981     5535    0    0    0     3       0          0
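+ +
+ +The listing above is wrapped for display; in the file itself the receive and
+ +transmit counters of an interface share one line. A minimal parsing sketch,
+ +assuming that layout, two header lines, and the column names shown above
+ +(the function name parse_net_dev is hypothetical):

```python
RX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "frame", "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "colls", "carrier", "compressed"]


def parse_net_dev(text):
    """Return {interface: {"rx": {...}, "tx": {...}}} from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines()[2:]:        # skip the two header lines
        name, sep, rest = line.partition(":") # counters may follow ":" directly
        if not sep:
            continue
        values = [int(v) for v in rest.split()]
        stats[name.strip()] = {
            "rx": dict(zip(RX_FIELDS, values[:8])),
            "tx": dict(zip(TX_FIELDS, values[8:16])),
        }
    return stats


# Sample rebuilt from the listing above by joining each wrapped line.
sample = "\n".join([
    "Inter-|Receive |Transmit",
    " face |bytes packets ... |bytes packets ...",
    "    lo:  908188   5596   0 0 0   0 0 0  908188  5596 0 0 0 0 0 0",
    "  ppp0:15475140  20721 410 0 0 410 0 0 1375103 17405 0 0 0 0 0 0",
    "  eth0:  614530   7085   0 0 0   0 0 1 1703981  5535 0 0 0 3 0 0",
])
stats = parse_net_dev(sample)
for name, counters in stats.items():
    print(name, counters["rx"]["bytes"], counters["tx"]["bytes"])
```

+ +Splitting on the colon first matters because a large counter can follow the
+ +interface name without a space, as in the ppp0 line.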
+ +
+ +In addition, each Channel Bond interface has its own directory.  For
+ +example, the bond0 device will have a directory called /proc/net/bond0/.
+ +It will contain information that is specific to that bond, such as the
+ +current slaves of the bond, the link status of the slaves, and how
+ +many times each slave's link has failed.
+ +
+ +1.5 SCSI info
+ +-------------
+ +
+ +If you have a SCSI host adapter in your system, you'll find a subdirectory
+ +named after the driver for this adapter in /proc/scsi. You'll also see a list
+ +of all recognized SCSI devices in /proc/scsi/scsi::
+ +
+ +  >cat /proc/scsi/scsi
+ +  Attached devices:
+ +  Host: scsi0 Channel: 00 Id: 00 Lun: 00
+ +    Vendor: IBM      Model: DGHS09U          Rev: 03E0
+ +    Type:   Direct-Access                    ANSI SCSI revision: 03
+ +  Host: scsi0 Channel: 00 Id: 06 Lun: 00
+ +    Vendor: PIONEER  Model: CD-ROM DR-U06S   Rev: 1.04
+ +    Type:   CD-ROM                           ANSI SCSI revision: 02
+ +
+ +
+ +The directory named after the driver has one file for each adapter found in
+ +the system. These files contain information about the controller, including
+ +the IRQ in use and the I/O address range. The amount of information shown
+ +depends on the adapter you use. The example shows the output for an Adaptec
+ +AHA-2940 SCSI adapter::
+ +
+ +  > cat /proc/scsi/aic7xxx/0
+ +
+ +  Adaptec AIC7xxx driver version: 5.1.19/3.2.4
+ +  Compile Options:
+ +    TCQ Enabled By Default : Disabled
+ +    AIC7XXX_PROC_STATS     : Disabled
+ +    AIC7XXX_RESET_DELAY    : 5
+ +  Adapter Configuration:
+ +             SCSI Adapter: Adaptec AHA-294X Ultra SCSI host adapter
+ +                             Ultra Wide Controller
+ +      PCI MMAPed I/O Base: 0xeb001000
+ +   Adapter SEEPROM Config: SEEPROM found and used.
+ +        Adaptec SCSI BIOS: Enabled
+ +                      IRQ: 10
+ +                     SCBs: Active 0, Max Active 2,
+ +                           Allocated 15, HW 16, Page 255
+ +               Interrupts: 160328
+ +        BIOS Control Word: 0x18b6
+ +     Adapter Control Word: 0x005b
+ +     Extended Translation: Enabled
+ +  Disconnect Enable Flags: 0xffff
+ +       Ultra Enable Flags: 0x0001
+ +   Tag Queue Enable Flags: 0x0000
+ +  Ordered Queue Tag Flags: 0x0000
+ +  Default Tag Queue Depth: 8
+ +      Tagged Queue By Device array for aic7xxx host instance 0:
+ +        {255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255}
+ +      Actual queue depth per device for aic7xxx host instance 0:
+ +        {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}
+ +  Statistics:
+ +  (scsi0:0:0:0)
+ +    Device using Wide/Sync transfers at 40.0 MByte/sec, offset 8
+ +    Transinfo settings: current(12/8/1/0), goal(12/8/1/0), user(12/15/1/0)
+ +    Total transfers 160151 (74577 reads and 85574 writes)
+ +  (scsi0:0:6:0)
+ +    Device using Narrow/Sync transfers at 5.0 MByte/sec, offset 15
+ +    Transinfo settings: current(50/15/0/0), goal(50/15/0/0), user(50/15/0/0)
+ +    Total transfers 0 (0 reads and 0 writes)
+ +
+ +
+ +1.6 Parallel port info in /proc/parport
+ +---------------------------------------
+ +
+ +The directory  /proc/parport  contains information about the parallel ports of
+ +your system.  It  has  one  subdirectory  for  each port, named after the port
+ +number (0,1,2,...).
+ +
+ +These directories contain the four files shown in Table 1-10.
+ +
+ +
+ +.. table:: Table 1-10: Files in /proc/parport
+ +
+ + ========= ====================================================================
+ + File      Content
+ + ========= ====================================================================
+ + autoprobe Any IEEE-1284 device ID information that has been acquired.
+ + devices   list of the device drivers using that port. A + will appear by the
+ +           name of the device currently using the port (it might not appear
+ +           against any).
+ + hardware  Parallel port's base address, IRQ line and DMA channel.
+ + irq       IRQ that parport is using for that port. This is in a separate
+ +           file to allow you to alter it by writing a new value in (IRQ
+ +           number or none).
+ + ========= ====================================================================
+ +
+ +1.7 TTY info in /proc/tty
+ +-------------------------
+ +
+ +Information about the available and actually used ttys can be found in the
+ +directory /proc/tty. You'll find entries for drivers and line disciplines in
+ +this directory, as shown in Table 1-11.
+ +
+ +
+ +.. table:: Table 1-11: Files in /proc/tty
+ +
+ + ============= ==============================================
+ + File          Content
+ + ============= ==============================================
+ + drivers       list of drivers and their usage
+ + ldiscs        registered line disciplines
+ + driver/serial usage statistic and status of single tty lines
+ + ============= ==============================================
+ +
+ +To see which ttys are currently in use, you can simply look into the file
+ +/proc/tty/drivers::
+ +
+ +  > cat /proc/tty/drivers
+ +  pty_slave            /dev/pts      136   0-255 pty:slave
+ +  pty_master           /dev/ptm      128   0-255 pty:master
+ +  pty_slave            /dev/ttyp       3   0-255 pty:slave
+ +  pty_master           /dev/pty        2   0-255 pty:master
+ +  serial               /dev/cua        5   64-67 serial:callout
+ +  serial               /dev/ttyS       4   64-67 serial
+ +  /dev/tty0            /dev/tty0       4       0 system:vtmaster
+ +  /dev/ptmx            /dev/ptmx       5       2 system
+ +  /dev/console         /dev/console    5       1 system:console
+ +  /dev/tty             /dev/tty        5       0 system:/dev/tty
+ +  unknown              /dev/tty        4    1-63 console
+ +
+ +
+ +1.8 Miscellaneous kernel statistics in /proc/stat
+ +-------------------------------------------------
+ +
+ +Various pieces   of  information about  kernel activity  are  available in the
+ +/proc/stat file.  All  of  the numbers reported  in  this file are  aggregates
+ +since the system first booted.  For a quick look, simply cat the file::
+ +
+ +  > cat /proc/stat
+ +  cpu  2255 34 2290 22625563 6290 127 456 0 0 0
+ +  cpu0 1132 34 1441 11311718 3675 127 438 0 0 0
+ +  cpu1 1123 0 849 11313845 2614 0 18 0 0 0
+ +  intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...]
+ +  ctxt 1990473
+ +  btime 1062191376
+ +  processes 2915
+ +  procs_running 1
+ +  procs_blocked 0
+ +  softirq 183433 0 21755 12 39 1137 231 21459 2263
+ +
+ +The very first  "cpu" line aggregates the  numbers in all  of the other "cpuN"
+ +lines.  These numbers identify the amount of time the CPU has spent performing
+ +different kinds of work.  Time units are in USER_HZ (typically hundredths of a
+ +second).  The meanings of the columns are as follows, from left to right:
+ +
+ +- user: normal processes executing in user mode
+ +- nice: niced processes executing in user mode
+ +- system: processes executing in kernel mode
+ +- idle: twiddling thumbs
+ +- iowait: In a word, iowait stands for waiting for I/O to complete. But there
+ +  are several problems:
+ +
+ +  1. The CPU does not wait for I/O to complete; iowait is the time that a task
+ +     is waiting for I/O to complete. When a CPU goes idle because of
+ +     outstanding task I/O, another task will be scheduled on that CPU.
+ +  2. On a multi-core CPU, the task waiting for I/O to complete is not running
+ +     on any CPU, so the iowait of each CPU is difficult to calculate.
+ +  3. The value of the iowait field in /proc/stat will decrease under certain
+ +     conditions.
+ +
+ +  So the iowait value read from /proc/stat is not reliable.
+ +- irq: servicing interrupts
+ +- softirq: servicing softirqs
+ +- steal: involuntary wait
+ +- guest: running a normal guest
+ +- guest_nice: running a niced guest
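+ +
+ +The column order above can be applied to a "cpu" line as in the following
+ +sketch. The names CPU_FIELDS and parse_cpu_line are hypothetical, and the
+ +default of 100 ticks per second is an assumption that matches the "typically
+ +hundredths of a second" note above rather than a guaranteed value.
+ +

```python
# Column names in the order listed above.
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait",
              "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line, user_hz=100):
    """Split one "cpu"/"cpuN" line into (label, {field: seconds}).

    user_hz is the number of USER_HZ ticks per second (assumed to be 100
    unless the caller passes another value).
    """
    label, *values = line.split()
    seconds = {name: int(value) / user_hz
               for name, value in zip(CPU_FIELDS, values)}
    return label, seconds


label, times = parse_cpu_line("cpu  2255 34 2290 22625563 6290 127 456 0 0 0")
print(label, times["user"], times["idle"])
```

+ +On a live system the tick rate can be queried with os.sysconf("SC_CLK_TCK")
+ +and passed as user_hz.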
+ +
+ +The "intr" line gives counts of interrupts  serviced since boot time, for each
+ +of the  possible system interrupts.   The first  column  is the  total of  all
+ +interrupts serviced  including  unnumbered  architecture specific  interrupts;
+ +each  subsequent column is the  total for that particular numbered interrupt.
+ +Unnumbered interrupts are not shown, only summed into the total.
+ +
+ +The "ctxt" line gives the total number of context switches across all CPUs.
+ +
+ +The "btime" line gives  the time at which the  system booted, in seconds since
+ +the Unix epoch.
+ +
+ +The "processes" line gives the number  of processes and threads created, which
+ +includes (but  is not limited  to) those  created by  calls to the  fork() and
+ +clone() system calls.
+ +
+ +The "procs_running" line gives the total number of threads that are
+ +running or ready to run (i.e., the total number of runnable threads).
+ +
+ +The "procs_blocked" line gives the number of processes currently blocked,
+ +waiting for I/O to complete.
+ +
+ +The "softirq" line gives counts of softirqs serviced since boot time, for each
+ +of the possible system softirqs. The first column is the total of all
+ +softirqs serviced; each subsequent column is the total for that particular
+ +softirq.
+ +
+ +
+ +1.9 Ext4 file system parameters
+ +-------------------------------
+ +
+ +Information about mounted ext4 file systems can be found in
+ +/proc/fs/ext4.  Each mounted filesystem will have a directory in
+ +/proc/fs/ext4 based on its device name (e.g., /proc/fs/ext4/hdc or
+ +/proc/fs/ext4/dm-0).   The files in each per-device directory are shown
+ +in Table 1-12, below.
+ +
+ +.. table:: Table 1-12: Files in /proc/fs/ext4/<devname>
+ +
+ + ==============  ==========================================================
+ + File            Content
+ + ==============  ==========================================================
+ + mb_groups       details of multiblock allocator buddy cache of free blocks
+ + ==============  ==========================================================
+ +
+ +2.0 /proc/consoles
+ +------------------
+ +Shows registered system console lines.
+ +
+ +To see which character device lines are currently used for the system console
+ +/dev/console, you may simply look into the file /proc/consoles::
+ +
+ +  > cat /proc/consoles
+ +  tty0                 -WU (ECp)       4:7
+ +  ttyS0                -W- (Ep)        4:64
+ +
+ +The columns are:
+ +
+ ++--------------------+-------------------------------------------------------+
+ +| device             | name of the device                                    |
+ ++====================+=======================================================+
+ +| operations         | * R = can do read operations                          |
+ +|                    | * W = can do write operations                         |
+ +|                    | * U = can do unblank                                  |
+ ++--------------------+-------------------------------------------------------+
+ +| flags              | * E = it is enabled                                   |
+ +|                    | * C = it is preferred console                         |
+ +|                    | * B = it is primary boot console                      |
+ +|                    | * p = it is used for printk buffer                    |
+ +|                    | * b = it is not a TTY but a Braille device            |
+ +|                    | * a = it is safe to use when cpu is offline           |
+ ++--------------------+-------------------------------------------------------+
+ +| major:minor        | major and minor number of the device separated by a   |
+ +|                    | colon                                                 |
+ ++--------------------+-------------------------------------------------------+
+ +
+ +Summary
+ +-------
+ +
+ +The /proc file system serves information about the running system. It not only
+ +allows access to process data but also allows you to request the kernel status
+ +by reading files in the hierarchy.
+ +
+ +The directory  structure  of /proc reflects the types of information and makes
+ +it easy, if not obvious, where to look for specific data.
+ +
+ +Chapter 2: Modifying System Parameters
+ +======================================
+ +
+ +In This Chapter
+ +---------------
+ +
+ +* Modifying kernel parameters by writing into files found in /proc/sys
+ +* Exploring the files which modify certain parameters
+ +* Review of the /proc/sys file tree
+ +
+ +------------------------------------------------------------------------------
+ +
+ +A very  interesting part of /proc is the directory /proc/sys. This is not only
+ +a source  of  information,  it also allows you to change parameters within the
+ +kernel. Be  very  careful  when attempting this. You can optimize your system,
+ +but you  can  also  cause  it  to  crash.  Never  alter kernel parameters on a
+ +production system.  Set  up  a  development machine and test to make sure that
+ +everything works  the  way  you want it to. You may have no alternative but to
+ +reboot the machine once an error has been made.
+ +
+ +To change  a  value,  simply  echo  the new value into the file. An example is
+ +given below  in the section on the file system data. You need to be root to do
+ +this. You  can  create  your  own  boot script to perform this every time your
+ +system boots.
+ +
+ +The files  in /proc/sys can be used to fine tune and monitor miscellaneous and
+ +general things  in  the operation of the Linux kernel. Since some of the files
+ +can inadvertently  disrupt  your  system,  it  is  advisable  to  read  both
+ +documentation and  source  before actually making adjustments. In any case, be
+ +very careful  when  writing  to  any  of these files. The entries in /proc may
+ +change slightly between the 2.1.* and the 2.2 kernel, so if there is any doubt
+ +review the kernel documentation in the directory /usr/src/linux/Documentation.
+ +This chapter  is  heavily  based  on the documentation included in the pre 2.2
+ +kernels, and became part of it in version 2.2.1 of the Linux kernel.
+ +
+ +Please see the Documentation/admin-guide/sysctl/ directory for descriptions of
+ +these entries.
+ +
+ +Summary
+ +-------
+ +
+ +Certain aspects  of  kernel  behavior  can be modified at runtime, without the
+ +need to  recompile  the kernel, or even to reboot the system. The files in the
+ +/proc/sys tree can not only be read, but also modified. You can use the echo
+ +command to write values into these files, thereby changing the default
+ +settings of the kernel.
+ +
+ +
+ +Chapter 3: Per-process Parameters
+ +=================================
+ +
+ +3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer score
+ +---------------------------------------------------------------------------------
+ +
+ +These files can be used to adjust the badness heuristic used to select which
+ +process gets killed in out of memory conditions.
+ +
+ +The badness heuristic assigns a value to each candidate task ranging from 0
+ +(never kill) to 1000 (always kill) to determine which process is targeted.  The
+ +units are roughly a proportion along that range of allowed memory the process
+ +may allocate from based on an estimation of its current memory and swap use.
+ +For example, if a task is using all allowed memory, its badness score will be
+ +1000.  If it is using half of its allowed memory, its score will be 500.
+ +
+ +There is an additional factor included in the badness score: the current memory
+ +and swap usage is discounted by 3% for root processes.
+ +
+ +The amount of "allowed" memory depends on the context in which the oom killer
+ +was called.  If it is due to the memory assigned to the allocating task's cpuset
+ +being exhausted, the allowed memory represents the set of mems assigned to that
+ +cpuset.  If it is due to a mempolicy's node(s) being exhausted, the allowed
+ +memory represents the set of mempolicy nodes.  If it is due to a memory
+ +limit (or swap limit) being reached, the allowed memory is that configured
+ +limit.  Finally, if it is due to the entire system being out of memory, the
+ +allowed memory represents all allocatable resources.
+ +
+ +The value of /proc/<pid>/oom_score_adj is added to the badness score before it
+ +is used to determine which task to kill.  Acceptable values range from -1000
+ +(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX).  This allows userspace to
+ +polarize the preference for oom killing either by always preferring a certain
+ +task or completely disabling it.  The lowest possible value, -1000, is
+ +equivalent to disabling oom killing entirely for that task since it will always
+ +report a badness score of 0.
+ +
+ +Consequently, it is very simple for userspace to define the amount of memory to
+ +consider for each task.  Setting a /proc/<pid>/oom_score_adj value of +500, for
+ +example, is roughly equivalent to allowing the remainder of tasks sharing the
+ +same system, cpuset, mempolicy, or memory controller resources to use at least
+ +50% more memory.  A value of -500, on the other hand, would be roughly
+ +equivalent to discounting 50% of the task's allowed memory from being considered
+ +as scoring against the task.
+ +
+ +For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also
+ +be used to tune the badness score.  Its acceptable values range from -16
+ +(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17
+ +(OOM_DISABLE) to disable oom killing entirely for that task.  Its value is
+ +scaled linearly with /proc/<pid>/oom_score_adj.
+ +
+ +The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last
+ +value set by a CAP_SYS_RESOURCE process. To reduce the value any lower
+ +requires CAP_SYS_RESOURCE.
+ +
+ +Caveat: when a parent task is selected, the oom killer will sacrifice any first
+ +generation children with separate address spaces instead, if possible.  This
+ +prevents servers and important system daemons from being killed and loses the
+ +minimal amount of work.
+ +
+ +
+ +3.2 /proc/<pid>/oom_score - Display current oom-killer score
+ +-------------------------------------------------------------
+ +
+ +This file can be used to check the current score used by the oom-killer for
+ +any given <pid>. Use it together with /proc/<pid>/oom_score_adj to tune which
+ +process should be killed in an out-of-memory situation.
+ +
+ +
+ +3.3  /proc/<pid>/io - Display the IO accounting fields
+ +-------------------------------------------------------
+ +
+ +This file contains IO statistics for each running process.
+ +
+ +Example
+ +~~~~~~~
+ +
+ +::
+ +
+ +    test:/tmp # dd if=/dev/zero of=/tmp/test.dat &
+ +    [1] 3828
+ +
+ +    test:/tmp # cat /proc/3828/io
+ +    rchar: 323934931
+ +    wchar: 323929600
+ +    syscr: 632687
+ +    syscw: 632675
+ +    read_bytes: 0
+ +    write_bytes: 323932160
+ +    cancelled_write_bytes: 0
+ +
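+ +Because every line has the form "name: value", the file is easy to read from
+ +a script. The following is a minimal sketch; parse_proc_io and SAMPLE_IO are
+ +illustrative names, and the sample text simply repeats the output above.
+ +

```python
# Illustrative copy of the /proc/<pid>/io output shown above.
SAMPLE_IO = """\
rchar: 323934931
wchar: 323929600
syscr: 632687
syscw: 632675
read_bytes: 0
write_bytes: 323932160
cancelled_write_bytes: 0
"""


def parse_proc_io(text):
    """Return the counters of a /proc/<pid>/io file as {name: int}."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:                              # ignore lines without a "name:" part
            counters[key.strip()] = int(value)
    return counters


counters = parse_proc_io(SAMPLE_IO)
print(counters["rchar"], counters["write_bytes"])
```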
+ +
+ +Description
+ +~~~~~~~~~~~
+ +
+ +rchar
+ +^^^^^
+ +
+ +I/O counter: chars read
+ +The number of bytes which this task has caused to be read from storage. This
+ +is simply the sum of bytes which this process passed to read() and pread().
+ +It includes things like tty IO and it is unaffected by whether or not actual
+ +physical disk IO was required (the read might have been satisfied from
+ +pagecache).
+ +
+ +
+ +wchar
+ +^^^^^
+ +
+ +I/O counter: chars written
+ +The number of bytes which this task has caused, or shall cause, to be written
+ +to disk. Similar caveats apply here as with rchar.
+ +
+ +
+ +syscr
+ +^^^^^
+ +
+ +I/O counter: read syscalls
+ +Attempt to count the number of read I/O operations, i.e. syscalls like read()
+ +and pread().
+ +
+ +
+ +syscw
+ +^^^^^
+ +
+ +I/O counter: write syscalls
+ +Attempt to count the number of write I/O operations, i.e. syscalls like
+ +write() and pwrite().
+ +
+ +
+ +read_bytes
+ +^^^^^^^^^^
+ +
+ +I/O counter: bytes read
+ +Attempt to count the number of bytes which this process really did cause to
+ +be fetched from the storage layer. Done at the submit_bio() level, so it is
+ +accurate for block-backed filesystems. <please add status regarding NFS and
+ +CIFS at a later time>
+ +
+ +
+ +write_bytes
+ +^^^^^^^^^^^
+ +
+ +I/O counter: bytes written
+ +Attempt to count the number of bytes which this process caused to be sent to
+ +the storage layer. This is done at page-dirtying time.
+ +
+ +
+ +cancelled_write_bytes
+ +^^^^^^^^^^^^^^^^^^^^^
+ +
+ +The big inaccuracy here is truncate. If a process writes 1MB to a file and
+ +then deletes the file, it will in fact perform no writeout. But it will have
+ +been accounted as having caused 1MB of write.
+ +In other words: The number of bytes which this process caused to not happen,
+ +by truncating pagecache. A task can cause "negative" IO too. If this task
+ +truncates some dirty pagecache, some IO which another task has been accounted
+ +for (in its write_bytes) will not be happening. We _could_ just subtract that
+ +from the truncating task's write_bytes, but there is information loss in doing
+ +that.
+ +
+ +
+ +.. Note::
+ +
+ +   At its current implementation state, this is a bit racy on 32-bit machines:
+ +   if process A reads process B's /proc/pid/io while process B is updating one
+ +   of those 64-bit counters, process A could see an intermediate result.
+ +
+ +
+ +More information about this can be found within the taskstats documentation in
+ +Documentation/accounting.
+ +
+ +3.4 /proc/<pid>/coredump_filter - Core dump filtering settings
+ +---------------------------------------------------------------
+ +When a process is dumped, all anonymous memory is written to a core file as
+ +long as the size of the core file isn't limited. But sometimes we don't want
+ +to dump some memory segments, for example, huge shared memory or DAX.
+ +Conversely, sometimes we want to save file-backed memory segments into a core
+ +file, not only the individual files.
+ +
+ +/proc/<pid>/coredump_filter allows you to customize which memory segments
+ +will be dumped when the <pid> process is dumped. coredump_filter is a bitmask
+ +of memory types. If a bit of the bitmask is set, memory segments of the
+ +corresponding memory type are dumped, otherwise they are not dumped.
+ +
+ +The following 9 memory types are supported:
+ +
+ +  - (bit 0) anonymous private memory
+ +  - (bit 1) anonymous shared memory
+ +  - (bit 2) file-backed private memory
+ +  - (bit 3) file-backed shared memory
+ +  - (bit 4) ELF header pages in file-backed private memory areas (it is
+ +    effective only if the bit 2 is cleared)
+ +  - (bit 5) hugetlb private memory
+ +  - (bit 6) hugetlb shared memory
+ +  - (bit 7) DAX private memory
+ +  - (bit 8) DAX shared memory
+ +
+ +  Note that MMIO pages such as frame buffer are never dumped and vDSO pages
+ +  are always dumped regardless of the bitmask status.
+ +
+ +  Note that bits 0-4 don't affect hugetlb or DAX memory. hugetlb memory is
+ +  only affected by bits 5-6, and DAX is only affected by bits 7-8.
+ +
+ +The default value of coredump_filter is 0x33; this means all anonymous memory
+ +segments, ELF header pages and hugetlb private memory are dumped.
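+ +
+ +A bitmask value can be checked against the list above with a small helper.
+ +This is only a sketch: COREDUMP_BITS and decode_coredump_filter are
+ +hypothetical names, and the strings are shortened forms of the descriptions
+ +in the list.
+ +

```python
# Index in the tuple == bit number in coredump_filter, per the list above.
COREDUMP_BITS = (
    "anonymous private memory",
    "anonymous shared memory",
    "file-backed private memory",
    "file-backed shared memory",
    "ELF header pages",
    "hugetlb private memory",
    "hugetlb shared memory",
    "DAX private memory",
    "DAX shared memory",
)


def decode_coredump_filter(mask):
    """Return the memory types whose bit is set in mask."""
    return [name for bit, name in enumerate(COREDUMP_BITS)
            if mask & (1 << bit)]


for name in decode_coredump_filter(0x33):
    print(name)
```

+ +Decoding 0x33 this way yields bits 0, 1, 4 and 5, which matches the default
+ +described above.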
+ +
+ +If you don't want to dump all shared memory segments attached to pid 1234,
+ +write 0x31 to the process's proc file::
+ +
+ +  $ echo 0x31 > /proc/1234/coredump_filter
+ +
+ +When a new process is created, the process inherits the bitmask status from its
+ +parent. It is useful to set up coredump_filter before the program runs.
+ +For example::
+ +
+ +  $ echo 0x7 > /proc/self/coredump_filter
+ +  $ ./some_program
+ +
+ +3.5   /proc/<pid>/mountinfo - Information about mounts
+ +--------------------------------------------------------
+ +
+ +This file contains lines of the form::
+ +
+ +    36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue
+ +    (1)(2)(3)   (4)   (5)      (6)      (7)   (8) (9)   (10)         (11)
+ +
+ +    (1) mount ID:  unique identifier of the mount (may be reused after umount)
+ +    (2) parent ID:  ID of parent (or of self for the top of the mount tree)
+ +    (3) major:minor:  value of st_dev for files on filesystem
+ +    (4) root:  root of the mount within the filesystem
+ +    (5) mount point:  mount point relative to the process's root
+ +    (6) mount options:  per mount options
+ +    (7) optional fields:  zero or more fields of the form "tag[:value]"
+ +    (8) separator:  marks the end of the optional fields
+ +    (9) filesystem type:  name of filesystem of the form "type[.subtype]"
+ +    (10) mount source:  filesystem specific information or "none"
+ +    (11) super options:  per super block options
+ +
+ +Parsers should ignore all unrecognised optional fields.  Currently the
+ +possible optional fields are:
+ +
+ +================  ==============================================================
+ +shared:X          mount is shared in peer group X
+ +master:X          mount is slave to peer group X
+ +propagate_from:X  mount is slave and receives propagation from peer group X [#]_
+ +unbindable        mount is unbindable
+ +================  ==============================================================
+ +
+ +.. [#] X is the closest dominant peer group under the process's root.  If
+ +       X is the immediate master of the mount, or if there's no dominant peer
+ +       group under the same root, then only the "master:X" field is present
+ +       and not the "propagate_from:X" field.
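+ +
+ +A parser following the field layout above might look like the sketch below.
+ +The function name and dictionary keys are illustrative assumptions. The
+ +separator is searched for from the seventh token onward, so any number of
+ +optional fields, including unrecognised ones, is tolerated.
+ +

```python
def parse_mountinfo_line(line):
    """Split one /proc/<pid>/mountinfo line into its eleven fields."""
    tokens = line.split()
    # Fields (1)-(6) are fixed; the optional fields (7) run up to the "-"
    # separator (8), so look for it starting at token index 6.
    sep = tokens.index("-", 6)
    return {
        "mount_id": int(tokens[0]),
        "parent_id": int(tokens[1]),
        "major_minor": tokens[2],
        "root": tokens[3],
        "mount_point": tokens[4],
        "mount_options": tokens[5].split(","),
        "optional_fields": tokens[6:sep],    # may be empty
        "fs_type": tokens[sep + 1],
        "mount_source": tokens[sep + 2],
        "super_options": tokens[sep + 3].split(","),
    }


EXAMPLE = ("36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - "
           "ext3 /dev/root rw,errors=continue")
print(parse_mountinfo_line(EXAMPLE))
```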
+ +
+ +For more information on mount propagation see:
+ +
+ +  Documentation/filesystems/sharedsubtree.txt
+ +
+ +
+ +3.6   /proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
+ +--------------------------------------------------------
+ +These files provide a method to access a task's comm value. They also allow
+ +a task to set its own or one of its thread siblings' comm value. The comm value
+ +is limited in size compared to the cmdline value, so writing anything longer
+ +than the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated
+ +comm value.
+ +
+ +
+ +3.7   /proc/<pid>/task/<tid>/children - Information about task children
+ +-------------------------------------------------------------------------
+ +This file provides a fast way to retrieve the first-level children pids
+ +of the task identified by the <pid>/<tid> pair. The format is a
+ +space-separated stream of pids.
+ +
+ +Note the "first level" here -- if a child has its own children they will
+ +not be listed here; one needs to read /proc/<children-pid>/task/<tid>/children
+ +to obtain the descendants.
+ +
+ +Since this interface is intended to be fast and cheap, it doesn't
+ +guarantee precise results and some children might be
+ +skipped, especially if they've exited right after we printed their
+ +pids, so one needs to either stop or freeze the processes being inspected
+ +if precise results are needed.
+ +
+ +
+ +3.8   /proc/<pid>/fdinfo/<fd> - Information about opened file
+ +---------------------------------------------------------------
+ +This file provides information associated with an opened file. The regular
+ +files have at least three fields -- 'pos', 'flags' and 'mnt_id'. The 'pos'
+ +represents the current offset of the opened file in decimal form [see lseek(2)
+ +for details], 'flags' denotes the octal O_xxx mask the file has been
+ +created with [see open(2) for details] and 'mnt_id' represents mount ID of
+ +the file system containing the opened file [see 3.5 /proc/<pid>/mountinfo
+ +for details].
+ +
+ +A typical output is::
+ +
+ +      pos:    0
+ +      flags:  0100002
+ +      mnt_id: 19
+ +
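+ +The fields can be decoded as in the sketch below. The helper names are
+ +hypothetical, and the 0o3 access-mode mask is an assumption that holds for
+ +the Linux O_RDONLY/O_WRONLY/O_RDWR encoding. Keys that repeat, such as
+ +several 'lock' lines, would overwrite each other in this simplified version.
+ +

```python
import os


def parse_fdinfo(text):
    """Return the "key: value" lines of an fdinfo file as a dict of strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def access_mode(flags_field):
    """Translate the octal 'flags' field into a readable access mode."""
    flags = int(flags_field, 8)              # the field is printed in octal
    modes = {os.O_RDONLY: "read-only",
             os.O_WRONLY: "write-only",
             os.O_RDWR: "read-write"}
    return modes[flags & 0o3]                # 0o3: access-mode mask (assumed, Linux)


info = parse_fdinfo("pos:    0\nflags:  0100002\nmnt_id: 19\n")
print(int(info["pos"]), access_mode(info["flags"]), int(info["mnt_id"]))
```

+ +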
+ +All locks associated with a file descriptor are shown in its fdinfo too::
+ +
+ +    lock:       1: FLOCK  ADVISORY  WRITE 359 00:13:11691 0 EOF
+ +
+ +Files such as eventfd, fsnotify, signalfd and epoll provide, in addition to the
+ +regular pos/flags pair, information particular to the objects they represent.
+ +
+ +Eventfd files
+ +~~~~~~~~~~~~~
+ +
+ +::
+ +
+ +      pos:    0
+ +      flags:  04002
+ +      mnt_id: 9
+ +      eventfd-count:  5a
+ +
+ +where 'eventfd-count' is the hex value of a counter.
+ +
+ +Signalfd files
+ +~~~~~~~~~~~~~~
+ +
+ +::
+ +
+ +      pos:    0
+ +      flags:  04002
+ +      mnt_id: 9
+ +      sigmask:        0000000000000200
+ +
+ +where 'sigmask' is the hex value of the signal mask associated
+ +with a file.
+ +
+ +Epoll files
+ +~~~~~~~~~~~
+ +
+ +::
+ +
+ +      pos:    0
+ +      flags:  02
+ +      mnt_id: 9
+ +      tfd:        5 events:       1d data: ffffffffffffffff pos:0 ino:61af sdev:7
+ +
+ +where 'tfd' is a target file descriptor number in decimal form,
+ +'events' is the mask of events being watched and 'data' is the data
+ +associated with a target [see epoll(7) for more details].
+ +
+ +The 'pos' is the current offset of the target file in decimal form
+ +[see lseek(2)], 'ino' and 'sdev' are the inode and device numbers
+ +where the target file resides, all in hex format.
+ +
+ +Fsnotify files
+ +~~~~~~~~~~~~~~
+ +For inotify files the format is the following::
+ +
+ +      pos:    0
+ +      flags:  02000000
+ +      inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d
+ +
+ +where 'wd' is a watch descriptor in decimal form, i.e. a target file
+ +descriptor number, 'ino' and 'sdev' are the inode and device where the
+ +target file resides and 'mask' is the mask of events, all in hex
+ +form [see inotify(7) for more details].
+ +
+ +If the kernel was built with exportfs support, the path to the target
+ +file is encoded as a file handle.  The file handle is provided by three
+ +fields 'fhandle-bytes', 'fhandle-type' and 'f_handle', all in hex
+ +format.
+ +
+ +If the kernel is built without exportfs support the file handle won't be
+ +printed out.
+ +
+ +If there is no inotify mark attached yet the 'inotify' line will be omitted.
+ +
+ +For fanotify files the format is::
+ +
+ +      pos:    0
+ +      flags:  02
+ +      mnt_id: 9
+ +      fanotify flags:10 event-flags:0
+ +      fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003
+ +      fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4
+ +
+ +where fanotify 'flags' and 'event-flags' are the values used in the
+ +fanotify_init call, 'mnt_id' is the mount point identifier, and 'mflags' is
+ +the value of the flags associated with the mark, which are tracked separately
+ +from the events mask. 'ino' and 'sdev' are the target inode and device, 'mask'
+ +is the events mask and 'ignored_mask' is the mask of events which are to be
+ +ignored. All are in hex format. Together, 'mflags', 'mask' and 'ignored_mask'
+ +provide information about the flags and mask used in the fanotify_mark
+ +call [see fsnotify manpage for details].
+ +
+ +While the first three lines are mandatory and always printed, the rest is
+ +optional and may be omitted if no marks have been created yet.
+ +
+ +Timerfd files
+ +~~~~~~~~~~~~~
+ +
+ +::
+ +
+ +      pos:    0
+ +      flags:  02
+ +      mnt_id: 9
+ +      clockid: 0
+ +      ticks: 0
+ +      settime flags: 01
+ +      it_value: (0, 49406829)
+ +      it_interval: (1, 0)
+ +
+ +where 'clockid' is the clock type and 'ticks' is the number of timer
+ +expirations that have occurred [see timerfd_create(2) for details]. 'settime
+ +flags' are the flags, in octal form, that were used to set up the timer [see
+ +timerfd_settime(2) for details]. 'it_value' is the remaining time until the
+ +timer expiration. 'it_interval' is the interval for the timer. Note that the
+ +timer might be set up with the TIMER_ABSTIME option, which will be shown in
+ +'settime flags', but 'it_value' still shows the timer's remaining time.
+ +
+ +3.9   /proc/<pid>/map_files - Information about memory mapped files
+ +---------------------------------------------------------------------
+ +This directory contains symbolic links which represent memory mapped files
+ +the process is maintaining.  Example output::
+ +
+ +     | lr-------- 1 root root 64 Jan 27 11:24 333c600000-333c620000 -> /usr/lib64/ld-2.18.so
+ +     | lr-------- 1 root root 64 Jan 27 11:24 333c81f000-333c820000 -> /usr/lib64/ld-2.18.so
+ +     | lr-------- 1 root root 64 Jan 27 11:24 333c820000-333c821000 -> /usr/lib64/ld-2.18.so
+ +     | ...
+ +     | lr-------- 1 root root 64 Jan 27 11:24 35d0421000-35d0422000 -> /usr/lib64/libselinux.so.1
+ +     | lr-------- 1 root root 64 Jan 27 11:24 400000-41a000 -> /usr/bin/ls
+ +
+ +The name of a link represents the virtual memory bounds of a mapping, i.e.
+ +vm_area_struct::vm_start-vm_area_struct::vm_end.
+ +
+ +The main purpose of map_files is to retrieve a set of memory mapped
+ +files in a fast way instead of parsing /proc/<pid>/maps or
+ +/proc/<pid>/smaps, both of which contain many more records.  At the same
+ +time one can open(2) mappings from the listings of two processes and
+ +compare their inode numbers to figure out which anonymous memory areas
+ +are actually shared.
+ +
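As an illustration of the link-name format described for map_files above, the following is a minimal user-space sketch in C. The helper name `parse_map_files_name` is hypothetical (it is not a kernel or libc interface), and the sketch assumes names always take the form `<hex start>-<hex end>` as in the example listing.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not a kernel or libc API): split a map_files link
 * name such as "400000-41a000" into its start and end addresses.
 * Returns 0 on success, -1 if the name is not "<hex>-<hex>" with start < end.
 */
static int parse_map_files_name(const char *name,
                                unsigned long long *start,
                                unsigned long long *end)
{
    char *p;

    /* Reject signs and whitespace that strtoull() would otherwise accept. */
    if (!isxdigit((unsigned char)*name))
        return -1;
    errno = 0;
    *start = strtoull(name, &p, 16);
    if (errno || *p != '-')
        return -1;

    name = p + 1;
    if (!isxdigit((unsigned char)*name))
        return -1;
    errno = 0;
    *end = strtoull(name, &p, 16);
    if (errno || *p != '\0')
        return -1;

    /* A mapping covers [start, end), so the bounds must be ordered. */
    return *start < *end ? 0 : -1;
}
```

A caller would typically feed this the `d_name` of each entry returned by readdir(3) on /proc/<pid>/map_files and then readlink(2) the entry to obtain the backing file.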
+ +3.10  /proc/<pid>/timerslack_ns - Task timerslack value
+ +---------------------------------------------------------
+ +This file provides the task's timerslack value in nanoseconds.
+ +This value specifies an amount of time that normal timers may be deferred
+ +in order to coalesce timers and avoid unnecessary wakeups.
+ +
+ +This allows a task's interactivity vs power consumption trade-off to be
+ +adjusted.
+ +
+ +Writing 0 to the file will set the task's timerslack to the default value.
+ +
+ +Valid values range from 0 to ULLONG_MAX.
+ +
+ +An application setting the value must have PTRACE_MODE_ATTACH_FSCREDS level
+ +permissions on the task specified to change its timerslack_ns value.
+ +
+ +3.11  /proc/<pid>/patch_state - Livepatch patch operation state
+ +-----------------------------------------------------------------
+ +When CONFIG_LIVEPATCH is enabled, this file displays the value of the
+ +patch state for the task.
+ +
+ +A value of '-1' indicates that no patch is in transition.
+ +
+ +A value of '0' indicates that a patch is in transition and the task is
+ +unpatched.  If the patch is being enabled, then the task hasn't been
+ +patched yet.  If the patch is being disabled, then the task has already
+ +been unpatched.
+ +
+ +A value of '1' indicates that a patch is in transition and the task is
+ +patched.  If the patch is being enabled, then the task has already been
+ +patched.  If the patch is being disabled, then the task hasn't been
+ +unpatched yet.
+ +
+ +3.12 /proc/<pid>/arch_status - task architecture specific status
+ +-------------------------------------------------------------------
+ +When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the
+ +architecture specific status of the task.
+ +
+ +Example
+ +~~~~~~~
+ +
+ +::
+ +
+ + $ cat /proc/6753/arch_status
+ + AVX512_elapsed_ms:      8
+ +
+ +Description
+ +~~~~~~~~~~~
+ +
+ +x86 specific entries:
+ +~~~~~~~~~~~~~~~~~~~~~
+ +
+ +AVX512_elapsed_ms:
+ +^^^^^^^^^^^^^^^^^^
+ +
+ +  If AVX512 is supported on the machine, this entry shows the milliseconds
+ +  elapsed since the last time AVX512 usage was recorded. The recording
+ +  happens on a best effort basis when a task is scheduled out. This means
+ +  that the value depends on two factors:
+ +
+ +    1) The time which the task spent on the CPU without being scheduled
+ +       out. With CPU isolation and a single runnable task this can take
+ +       several seconds.
+ +
+ +    2) The time since the task was scheduled out last. Depending on the
+ +       reason for being scheduled out (time slice exhausted, syscall ...)
+ +       this can be an arbitrarily long time.
+ +
+ +  As a consequence the value cannot be considered precise and authoritative
+ +  information. The application which uses this information has to be aware
+ +  of the overall scenario on the system in order to determine whether a
+ +  task is a real AVX512 user or not. Precise information can be obtained
+ +  with performance counters.
+ +
+ +  A special value of '-1' indicates that no AVX512 usage was recorded, thus
+ +  the task is unlikely to be an AVX512 user. However, depending on the workload
+ +  and the scheduling scenario, it could also be a false negative as mentioned
+ +  above.
+ +
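To make the `AVX512_elapsed_ms` format above concrete, here is a small hedged sketch in C of how a monitoring tool might parse one line of /proc/<pid>/arch_status. The helper name `parse_avx512_elapsed_ms` is made up for this example, and the sketch assumes the line layout shown in the example output.

```c
#include <stdio.h>

/*
 * Hypothetical helper: parse one line of /proc/<pid>/arch_status.
 * Returns 1 and stores the value in *ms if the line is an
 * AVX512_elapsed_ms entry, 0 otherwise. A stored value of -1 means
 * that no AVX512 usage was recorded for the task.
 */
static int parse_avx512_elapsed_ms(const char *line, long *ms)
{
    /* Whitespace in the format matches any run of blanks in the input. */
    return sscanf(line, " AVX512_elapsed_ms: %ld", ms) == 1;
}
```

As the description notes, the parsed value is only a best-effort hint, so a caller should treat both small values and -1 as indicative rather than authoritative.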
+ +Configuring procfs
+ +------------------
+ +
+ +4.1   Mount options
+ +---------------------
+ +
+ +The following mount options are supported:
+ +
+ +      =========       ========================================================
+ +      hidepid=        Set /proc/<pid>/ access mode.
+ +      gid=            Set the group authorized to learn processes information.
+ +      =========       ========================================================
+ +
+ +hidepid=0 means classic mode - everybody may access all /proc/<pid>/ directories
+ +(default).
+ +
+ +hidepid=1 means users may not access any /proc/<pid>/ directories but their
+ +own.  Sensitive files like cmdline, sched*, status are now protected against
+ +other users.  This makes it impossible to learn whether any user runs a
+ +specific program (given the program doesn't reveal itself by its behaviour).
+ +As an additional bonus, as /proc/<pid>/cmdline is inaccessible to other users,
+ +poorly written programs passing sensitive information via program arguments are
+ +now protected against local eavesdroppers.
+ +
+ +hidepid=2 means hidepid=1 plus all /proc/<pid>/ will be fully invisible to other
+ +users.  It doesn't hide whether a process with a specific pid value exists
+ +(that can be learned by other means, e.g. by "kill -0 $PID"), but it does hide
+ +the process' uid and gid, which could otherwise be learned by stat()'ing
+ +/proc/<pid>/.  It greatly complicates an intruder's task of gathering
+ +information about running processes, whether some daemon runs with elevated
+ +privileges, whether another user runs some sensitive program, whether other
+ +users run any program at all, etc.
+ +
+ +gid= defines a group authorized to learn process information otherwise
+ +prohibited by hidepid=.  If you use some daemon like identd which needs to learn
+ +information about processes, just add identd to this group.
diff --combined arch/arm64/Kconfig

index 40fb05d96c6072c9357cf69965ca006c0a5fdb27,53c77711f7522cda5a5212f4e1344039574994be..43be825d0730a6c8657402342fcec9da84a99f3c
--- 1/arch/arm64/Kconfig
--- 2/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@@ -9,6 -9,8 +9,7 @@@ config ARM6
         select ACPI_MCFG if (ACPI && PCI)
         select ACPI_SPCR_TABLE if ACPI
         select ACPI_PPTT if ACPI
- -      select ARCH_CLOCKSOURCE_DATA
+       select ARCH_BINFMT_ELF_STATE
         select ARCH_HAS_DEBUG_VIRTUAL
         select ARCH_HAS_DEVMEM_IS_ALLOWED
         select ARCH_HAS_DMA_PREP_COHERENT
@@@ -32,6 -34,7 +33,7 @@@
         select ARCH_HAS_SYSCALL_WRAPPER
         select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
         select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
+       select ARCH_HAVE_ELF_PROT
         select ARCH_HAVE_NMI_SAFE_CMPXCHG
         select ARCH_INLINE_READ_LOCK if !PREEMPTION
         select ARCH_INLINE_READ_LOCK_BH if !PREEMPTION
@@@ -61,6 -64,7 +63,7 @@@
         select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE if !PREEMPTION
         select ARCH_KEEP_MEMBLOCK
         select ARCH_USE_CMPXCHG_LOCKREF
+       select ARCH_USE_GNU_PROPERTY
         select ARCH_USE_QUEUED_RWLOCKS
         select ARCH_USE_QUEUED_SPINLOCKS
         select ARCH_SUPPORTS_MEMORY_FAILURE
@@@ -117,7 -121,6 +120,7 @@@
         select HAVE_ALIGNED_STRUCT_PAGE if SLUB
         select HAVE_ARCH_AUDITSYSCALL
         select HAVE_ARCH_BITREVERSE
+ +      select HAVE_ARCH_COMPILER_H
         select HAVE_ARCH_HUGE_VMAP
         select HAVE_ARCH_JUMP_LABEL
         select HAVE_ARCH_JUMP_LABEL_RELATIVE
@@@ -281,9 -284,6 +284,9 @@@ config ZONE_DMA3
   config ARCH_ENABLE_MEMORY_HOTPLUG
         def_bool y
   
+ +config ARCH_ENABLE_MEMORY_HOTREMOVE
+ +      def_bool y
+ +
   config SMP
         def_bool y
   
@@@ -955,11 -955,11 +958,11 @@@ config HOTPLUG_CP
   
   # Common NUMA Features
   config NUMA
- -      bool "Numa Memory Allocation and Scheduler Support"
+ +      bool "NUMA Memory Allocation and Scheduler Support"
         select ACPI_NUMA if ACPI
         select OF_NUMA
         help
- -        Enable NUMA (Non Uniform Memory Access) support.
+ +        Enable NUMA (Non-Uniform Memory Access) support.
   
           The kernel will try to allocate memory used by a CPU on the
           local memory of the CPU and add some more
@@@ -1501,12 -1501,6 +1504,12 @@@ config ARM64_PTR_AUT
         bool "Enable support for pointer authentication"
         default y
         depends on !KVM || ARM64_VHE
+ +      depends on (CC_HAS_SIGN_RETURN_ADDRESS || CC_HAS_BRANCH_PROT_PAC_RET) && AS_HAS_PAC
+ +      # GCC 9.1 and later inserts a .note.gnu.property section note for PAC
+ +      # which is only understood by binutils starting with version 2.33.1.
+ +      depends on !CC_IS_GCC || GCC_VERSION < 90100 || LD_VERSION >= 233010000
+ +      depends on !CC_IS_CLANG || AS_HAS_CFI_NEGATE_RA_STATE
+ +      depends on (!FUNCTION_GRAPH_TRACER || DYNAMIC_FTRACE_WITH_REGS)
         help
           Pointer authentication (part of the ARMv8.3 Extensions) provides
           instructions for signing and authenticating pointers against secret
@@@ -1514,76 -1508,42 +1517,98 @@@
           and other attacks.
   
           This option enables these instructions at EL0 (i.e. for userspace).
- -
           Choosing this option will cause the kernel to initialise secret keys
           for each process at exec() time, with these keys being
           context-switched along with the process.
   
+ +        If the compiler supports the -mbranch-protection or
+ +        -msign-return-address flag (e.g. GCC 7 or later), then this option
+ +        will also cause the kernel itself to be compiled with return address
+ +        protection. In this case, and if the target hardware is known to
+ +        support pointer authentication, then CONFIG_STACKPROTECTOR can be
+ +        disabled with minimal loss of protection.
+ +
           The feature is detected at runtime. If the feature is not present in
           hardware it will not be advertised to userspace/KVM guest nor will it
           be enabled. However, KVM guests also require VHE mode and hence
           CONFIG_ARM64_VHE=y option to use this feature.
   
+ +        If the feature is present on the boot CPU but not on a late CPU, then
+ +        the late CPU will be parked. Also, if the boot CPU does not have
+ +        address auth and the late CPU does, then the late CPU will still boot
+ +        but with the feature disabled. On such a system, this option should
+ +        not be selected.
+ +
+ +        This feature works with FUNCTION_GRAPH_TRACER option only if
+ +        DYNAMIC_FTRACE_WITH_REGS is enabled.
+ +
+ +config CC_HAS_BRANCH_PROT_PAC_RET
+ +      # GCC 9 or later, clang 8 or later
+ +      def_bool $(cc-option,-mbranch-protection=pac-ret+leaf)
+ +
+ +config CC_HAS_SIGN_RETURN_ADDRESS
+ +      # GCC 7, 8
+ +      def_bool $(cc-option,-msign-return-address=all)
+ +
+ +config AS_HAS_PAC
+ +      def_bool $(as-option,-Wa$(comma)-march=armv8.3-a)
+ +
+ +config AS_HAS_CFI_NEGATE_RA_STATE
+ +      def_bool $(as-instr,.cfi_startproc\n.cfi_negate_ra_state\n.cfi_endproc\n)
+ +
+ +endmenu
+ +
+ +menu "ARMv8.4 architectural features"
+ +
+ +config ARM64_AMU_EXTN
+ +      bool "Enable support for the Activity Monitors Unit CPU extension"
+ +      default y
+ +      help
+ +        The activity monitors extension is an optional extension introduced
+ +        by the ARMv8.4 CPU architecture. This enables support for version 1
+ +        of the activity monitors architecture, AMUv1.
+ +
+ +        To enable the use of this extension on CPUs that implement it, say Y.
+ +
+ +        Note that for architectural reasons, firmware _must_ implement AMU
+ +        support when running on CPUs that present the activity monitors
+ +        extension. The required support is present in:
+ +          * Version 1.5 and later of the ARM Trusted Firmware
+ +
+ +        For kernels that have this configuration enabled but boot with broken
+ +        firmware, you may need to say N here until the firmware is fixed.
+ +        Otherwise you may experience firmware panics or lockups when
+ +        accessing the counter registers. Even if you are not observing these
+ +        symptoms, the values returned by the register reads might not
+ +        correctly reflect reality. Most commonly, the value read will be 0,
+ +        indicating that the counter is not enabled.
+ +
   endmenu
   
   menu "ARMv8.5 architectural features"
   
+ config ARM64_BTI
+       bool "Branch Target Identification support"
+       default y
+       help
+         Branch Target Identification (part of the ARMv8.5 Extensions)
+         provides a mechanism to limit the set of locations to which computed
+         branch instructions such as BR or BLR can jump.
+ 
+         To make use of BTI on CPUs that support it, say Y.
+ 
+         BTI is intended to provide complementary protection to other control
+         flow integrity protection mechanisms, such as the Pointer
+         authentication mechanism provided as part of the ARMv8.3 Extensions.
+         For this reason, it does not make sense to enable this option without
+         also enabling support for pointer authentication.  Thus, when
+         enabling this option you should also select ARM64_PTR_AUTH=y.
+ 
+         Userspace binaries must also be specifically compiled to make use of
+         this mechanism.  If you say N here or the hardware does not support
+         BTI, such binaries can still run, but you get no additional
+         enforcement of branch destinations.
+ 
   config ARM64_E0PD
         bool "Enable support for E0PD"
         default y
diff --combined arch/arm64/include/asm/cpucaps.h

index 8eb5a088ae6588c8c0f2332b31ca92945eda9b66,58e776c22aab8b9341e507552705181f3536929f..7b6051494f717116e4c3510167bb5095f2726a6b
--- 1/arch/arm64/include/asm/cpucaps.h
--- 2/arch/arm64/include/asm/cpucaps.h
+++ b/arch/arm64/include/asm/cpucaps.h
@@@ -58,10 -58,8 +58,11 @@@
   #define ARM64_WORKAROUND_SPECULATIVE_AT_NVHE  48
   #define ARM64_HAS_E0PD                                49
   #define ARM64_HAS_RNG                         50
- -#define ARM64_BTI                             51
+ +#define ARM64_HAS_AMU_EXTN                    51
+ +#define ARM64_HAS_ADDRESS_AUTH                        52
+ +#define ARM64_HAS_GENERIC_AUTH                        53
++#define ARM64_BTI                             54
   
- #define ARM64_NCAPS                           54
- -#define ARM64_NCAPS                           52
++#define ARM64_NCAPS                           55
   
   #endif /* __ASM_CPUCAPS_H */
diff --combined arch/arm64/include/asm/cpufeature.h

index afe08251ff95640818a89453db51d2127258a11c,e3ebcc59e83b5bc1075f4f4b69bef8a072442358..99ab13a07a1d7d0d2f2454f2837a9e597fb59f6f
--- 1/arch/arm64/include/asm/cpufeature.h
--- 2/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@@ -208,10 -208,6 +208,10 @@@ extern struct arm64_ftr_reg arm64_ftr_r
    *     In some non-typical cases either both (a) and (b), or neither,
    *     should be permitted. This can be described by including neither
    *     or both flags in the capability's type field.
+ + *
+ + *     In case of a conflict, the CPU is prevented from booting. If the
+ + *     ARM64_CPUCAP_PANIC_ON_CONFLICT flag is specified for the capability,
+ + *     then a kernel panic is triggered.
    */
   
   
@@@ -244,8 -240,6 +244,8 @@@
   #define ARM64_CPUCAP_PERMITTED_FOR_LATE_CPU   ((u16)BIT(4))
   /* Is it safe for a late CPU to miss this capability when system has it */
   #define ARM64_CPUCAP_OPTIONAL_FOR_LATE_CPU    ((u16)BIT(5))
+ +/* Panic when a conflict is detected */
+ +#define ARM64_CPUCAP_PANIC_ON_CONFLICT                ((u16)BIT(6))
   
   /*
    * CPU errata workarounds that need to be enabled at boot time if one or
@@@ -285,20 -279,9 +285,20 @@@
   
   /*
    * CPU feature used early in the boot based on the boot CPU. All secondary
- - * CPUs must match the state of the capability as detected by the boot CPU.
+ + * CPUs must match the state of the capability as detected by the boot CPU. In
+ + * case of a conflict, a kernel panic is triggered.
+ + */
+ +#define ARM64_CPUCAP_STRICT_BOOT_CPU_FEATURE          \
+ +      (ARM64_CPUCAP_SCOPE_BOOT_CPU | ARM64_CPUCAP_PANIC_ON_CONFLICT)
+ +
+ +/*
+ + * CPU feature used early in the boot based on the boot CPU. It is safe for a
+ + * late CPU to have this feature even though the boot CPU hasn't enabled it,
+ + * although the feature will not be used by Linux in this case. If the boot CPU
+ + * has enabled this feature already, then every late CPU must have it.
    */
- -#define ARM64_CPUCAP_STRICT_BOOT_CPU_FEATURE ARM64_CPUCAP_SCOPE_BOOT_CPU
+ +#define ARM64_CPUCAP_BOOT_CPU_FEATURE                  \
+ +      (ARM64_CPUCAP_SCOPE_BOOT_CPU | ARM64_CPUCAP_PERMITTED_FOR_LATE_CPU)
   
   struct arm64_cpu_capabilities {
         const char *desc;
@@@ -357,6 -340,18 +357,6 @@@ static inline int cpucap_default_scope(
         return cap->type & ARM64_CPUCAP_SCOPE_MASK;
   }
   
- -static inline bool
- -cpucap_late_cpu_optional(const struct arm64_cpu_capabilities *cap)
- -{
- -      return !!(cap->type & ARM64_CPUCAP_OPTIONAL_FOR_LATE_CPU);
- -}
- -
- -static inline bool
- -cpucap_late_cpu_permitted(const struct arm64_cpu_capabilities *cap)
- -{
- -      return !!(cap->type & ARM64_CPUCAP_PERMITTED_FOR_LATE_CPU);
- -}
- -
   /*
    * Generic helper for handling capabilities with multiple (match,enable) pairs
    * of call backs, sharing the same capability bit.
@@@ -395,16 -390,14 +395,16 @@@ unsigned long cpu_get_elf_hwcap2(void)
   #define cpu_set_named_feature(name) cpu_set_feature(cpu_feature(name))
   #define cpu_have_named_feature(name) cpu_have_feature(cpu_feature(name))
   
- -/* System capability check for constant caps */
- -static __always_inline bool __cpus_have_const_cap(int num)
+ +static __always_inline bool system_capabilities_finalized(void)
   {
- -      if (num >= ARM64_NCAPS)
- -              return false;
- -      return static_branch_unlikely(&cpu_hwcap_keys[num]);
+ +      return static_branch_likely(&arm64_const_caps_ready);
   }
   
+ +/*
+ + * Test for a capability with a runtime check.
+ + *
+ + * Before the capability is detected, this returns false.
+ + */
   static inline bool cpus_have_cap(unsigned int num)
   {
         if (num >= ARM64_NCAPS)
@@@ -412,53 -405,14 +412,53 @@@
         return test_bit(num, cpu_hwcaps);
   }
   
+ +/*
+ + * Test for a capability without a runtime check.
+ + *
+ + * Before capabilities are finalized, this returns false.
+ + * After capabilities are finalized, this is patched to avoid a runtime check.
+ + *
+ + * @num must be a compile-time constant.
+ + */
+ +static __always_inline bool __cpus_have_const_cap(int num)
+ +{
+ +      if (num >= ARM64_NCAPS)
+ +              return false;
+ +      return static_branch_unlikely(&cpu_hwcap_keys[num]);
+ +}
+ +
+ +/*
+ + * Test for a capability, possibly with a runtime check.
+ + *
+ + * Before capabilities are finalized, this behaves as cpus_have_cap().
+ + * After capabilities are finalized, this is patched to avoid a runtime check.
+ + *
+ + * @num must be a compile-time constant.
+ + */
   static __always_inline bool cpus_have_const_cap(int num)
   {
- -      if (static_branch_likely(&arm64_const_caps_ready))
+ +      if (system_capabilities_finalized())
                 return __cpus_have_const_cap(num);
         else
                 return cpus_have_cap(num);
   }
   
+ +/*
+ + * Test for a capability without a runtime check.
+ + *
+ + * Before capabilities are finalized, this will BUG().
+ + * After capabilities are finalized, this is patched to avoid a runtime check.
+ + *
+ + * @num must be a compile-time constant.
+ + */
+ +static __always_inline bool cpus_have_final_cap(int num)
+ +{
+ +      if (system_capabilities_finalized())
+ +              return __cpus_have_const_cap(num);
+ +      else
+ +              BUG();
+ +}
+ +
   static inline void cpus_set_cap(unsigned int num)
   {
         if (num >= ARM64_NCAPS) {
@@@ -481,41 -435,18 +481,41 @@@ cpuid_feature_extract_signed_field(u64 
         return cpuid_feature_extract_signed_field_width(features, field, 4);
   }
   
- -static inline unsigned int __attribute_const__
+ +static __always_inline unsigned int __attribute_const__
   cpuid_feature_extract_unsigned_field_width(u64 features, int field, int width)
   {
         return (u64)(features << (64 - width - field)) >> (64 - width);
   }
   
- -static inline unsigned int __attribute_const__
+ +static __always_inline unsigned int __attribute_const__
   cpuid_feature_extract_unsigned_field(u64 features, int field)
   {
         return cpuid_feature_extract_unsigned_field_width(features, field, 4);
   }
   
+ +/*
+ + * Fields that identify the version of the Performance Monitors Extension do
+ + * not follow the standard ID scheme. See ARM DDI 0487E.a page D13-2825,
+ + * "Alternative ID scheme used for the Performance Monitors Extension version".
+ + */
+ +static inline u64 __attribute_const__
+ +cpuid_feature_cap_perfmon_field(u64 features, int field, u64 cap)
+ +{
+ +      u64 val = cpuid_feature_extract_unsigned_field(features, field);
+ +      u64 mask = GENMASK_ULL(field + 3, field);
+ +
+ +      /* Treat IMPLEMENTATION DEFINED functionality as unimplemented */
+ +      if (val == 0xf)
+ +              val = 0;
+ +
+ +      if (val > cap) {
+ +              features &= ~mask;
+ +              features |= (cap << field) & mask;
+ +      }
+ +
+ +      return features;
+ +}
+ +
   static inline u64 arm64_ftr_mask(const struct arm64_ftr_bits *ftrp)
   {
         return (u64)GENMASK(ftrp->shift + ftrp->width - 1, ftrp->shift);
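The behaviour of the new `cpuid_feature_cap_perfmon_field()` helper in the hunk above can be explored outside the kernel with a rough user-space model. The sketch below mirrors the logic shown in the diff, but the names `extract_unsigned_field`, `genmask_ull` and `cap_perfmon_field` are stand-ins written for this example, and the field position 8 used in the usage note is only an assumed example value.

```c
#include <stdint.h>

/* Extract the unsigned 4-bit ID field that starts at bit `field`. */
static unsigned int extract_unsigned_field(uint64_t features, int field)
{
    const int width = 4;

    return (uint64_t)(features << (64 - width - field)) >> (64 - width);
}

/* User-space stand-in for the kernel's GENMASK_ULL(h, l): bits l..h set. */
static uint64_t genmask_ull(int h, int l)
{
    return (~UINT64_C(0) << l) & (~UINT64_C(0) >> (63 - h));
}

/* Model of the capping logic: limit the field at `field` to at most `cap`. */
static uint64_t cap_perfmon_field(uint64_t features, int field, uint64_t cap)
{
    uint64_t val = extract_unsigned_field(features, field);
    uint64_t mask = genmask_ull(field + 3, field);

    /* 0xf (IMPLEMENTATION DEFINED) is compared as if it were 0. */
    if (val == 0xf)
        val = 0;

    if (val > cap) {
        features &= ~mask;
        features |= (cap << field) & mask;
    }

    return features;
}
```

With this model, `cap_perfmon_field(0x600, 8, 5)` yields `0x500`, while a field that already reads below the cap is left alone. Note that, as the logic is written, a field value of 0xf is treated as 0 only for the comparison, so the register value itself is returned unchanged in that case.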
@@@ -633,7 -564,7 +633,7 @@@ static inline bool system_supports_mixe
         return val == 0x1;
   }
   
- -static inline bool system_supports_fpsimd(void)
+ +static __always_inline bool system_supports_fpsimd(void)
   {
         return !cpus_have_const_cap(ARM64_HAS_NO_FPSIMD);
   }
@@@ -644,13 -575,13 +644,13 @@@ static inline bool system_uses_ttbr0_pa
                 !cpus_have_const_cap(ARM64_HAS_PAN);
   }
   
- -static inline bool system_supports_sve(void)
+ +static __always_inline bool system_supports_sve(void)
   {
         return IS_ENABLED(CONFIG_ARM64_SVE) &&
                 cpus_have_const_cap(ARM64_SVE);
   }
   
- -static inline bool system_supports_cnp(void)
+ +static __always_inline bool system_supports_cnp(void)
   {
         return IS_ENABLED(CONFIG_ARM64_CNP) &&
                 cpus_have_const_cap(ARM64_HAS_CNP);
@@@ -659,13 -590,15 +659,13 @@@
   static inline bool system_supports_address_auth(void)
   {
         return IS_ENABLED(CONFIG_ARM64_PTR_AUTH) &&
- -              (cpus_have_const_cap(ARM64_HAS_ADDRESS_AUTH_ARCH) ||
- -               cpus_have_const_cap(ARM64_HAS_ADDRESS_AUTH_IMP_DEF));
+ +              cpus_have_const_cap(ARM64_HAS_ADDRESS_AUTH);
   }
   
   static inline bool system_supports_generic_auth(void)
   {
         return IS_ENABLED(CONFIG_ARM64_PTR_AUTH) &&
- -              (cpus_have_const_cap(ARM64_HAS_GENERIC_AUTH_ARCH) ||
- -               cpus_have_const_cap(ARM64_HAS_GENERIC_AUTH_IMP_DEF));
+ +              cpus_have_const_cap(ARM64_HAS_GENERIC_AUTH);
   }
   
   static inline bool system_uses_irq_prio_masking(void)
@@@ -680,6 -613,17 +680,11 @@@ static inline bool system_has_prio_mask
                system_uses_irq_prio_masking();
   }
   
- -      return IS_ENABLED(CONFIG_ARM64_BTI) &&
- -              cpus_have_const_cap(ARM64_BTI);
- -}
- -
- -static inline bool system_capabilities_finalized(void)
- -{
- -      return static_branch_likely(&arm64_const_caps_ready);
+ static inline bool system_supports_bti(void)
+ {
++      return IS_ENABLED(CONFIG_ARM64_BTI) && cpus_have_const_cap(ARM64_BTI);
+ }
+ 
   #define ARM64_BP_HARDEN_UNKNOWN               -1
   #define ARM64_BP_HARDEN_WA_NEEDED     0
   #define ARM64_BP_HARDEN_NOT_REQUIRED  1
@@@ -740,11 -684,6 +745,11 @@@ static inline bool cpu_has_hw_af(void
                                                 ID_AA64MMFR1_HADBS_SHIFT);
   }
   
+ +#ifdef CONFIG_ARM64_AMU_EXTN
+ +/* Check whether the cpu supports the Activity Monitors Unit (AMU) */
+ +extern bool cpu_has_amu_feat(int cpu);
+ +#endif
+ +
   #endif /* __ASSEMBLY__ */
   
   #endif
diff --combined arch/arm64/include/asm/esr.h

index 6a395a7e6707bae66291b7ce2c2cd81f2003f262,390b8ba67830cd3cb9581820d85a42794f956bbe..035003acfa876dd998c56d842ea48e882aa44638
--- 1/arch/arm64/include/asm/esr.h
--- 2/arch/arm64/include/asm/esr.h
+++ b/arch/arm64/include/asm/esr.h
@@@ -22,7 -22,7 +22,7 @@@
   #define ESR_ELx_EC_PAC                (0x09)  /* EL2 and above */
   /* Unallocated EC: 0x0A - 0x0B */
   #define ESR_ELx_EC_CP14_64    (0x0C)
- /* Unallocated EC: 0x0d */
+ #define ESR_ELx_EC_BTI                (0x0D)
   #define ESR_ELx_EC_ILL                (0x0E)
   /* Unallocated EC: 0x0F - 0x10 */
   #define ESR_ELx_EC_SVC32      (0x11)
@@@ -60,7 -60,7 +60,7 @@@
   #define ESR_ELx_EC_BKPT32     (0x38)
   /* Unallocated EC: 0x39 */
   #define ESR_ELx_EC_VECTOR32   (0x3A)  /* EL2 only */
- -/* Unallocted EC: 0x3B */
+ +/* Unallocated EC: 0x3B */
   #define ESR_ELx_EC_BRK64      (0x3C)
   /* Unallocated EC: 0x3D - 0x3F */
   #define ESR_ELx_EC_MAX                (0x3F)
diff --combined arch/arm64/include/asm/kvm_emulate.h

index a30b4eec7cb40048c92d9d4261765446c2365af8,dee51c1dcb93a24bd3f3529b80b4e6423ebacd5e..6ea53e6e8b262b1497de1e9cc4b0b4be493f04c3
--- 1/arch/arm64/include/asm/kvm_emulate.h
--- 2/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@@ -36,7 -36,7 +36,7 @@@ void kvm_inject_undef32(struct kvm_vcp
   void kvm_inject_dabt32(struct kvm_vcpu *vcpu, unsigned long addr);
   void kvm_inject_pabt32(struct kvm_vcpu *vcpu, unsigned long addr);
   
- -static inline bool vcpu_el1_is_32bit(struct kvm_vcpu *vcpu)
+ +static __always_inline bool vcpu_el1_is_32bit(struct kvm_vcpu *vcpu)
   {
         return !(vcpu->arch.hcr_el2 & HCR_RW);
   }
@@@ -89,8 -89,7 +89,8 @@@ static inline unsigned long *vcpu_hcr(s
   static inline void vcpu_clear_wfx_traps(struct kvm_vcpu *vcpu)
   {
         vcpu->arch.hcr_el2 &= ~HCR_TWE;
- -      if (atomic_read(&vcpu->arch.vgic_cpu.vgic_v3.its_vpe.vlpi_count))
+ +      if (atomic_read(&vcpu->arch.vgic_cpu.vgic_v3.its_vpe.vlpi_count) ||
+ +          vcpu->kvm->arch.vgic.nassgireq)
                 vcpu->arch.hcr_el2 &= ~HCR_TWI;
         else
                 vcpu->arch.hcr_el2 |= HCR_TWI;
@@@ -128,7 -127,7 +128,7 @@@ static inline void vcpu_set_vsesr(struc
         vcpu->arch.vsesr_el2 = vsesr;
   }
   
- -static inline unsigned long *vcpu_pc(const struct kvm_vcpu *vcpu)
+ +static __always_inline unsigned long *vcpu_pc(const struct kvm_vcpu *vcpu)
   {
         return (unsigned long *)&vcpu_gp_regs(vcpu)->regs.pc;
   }
@@@ -154,17 -153,17 +154,17 @@@ static inline void vcpu_write_elr_el1(c
                 *__vcpu_elr_el1(vcpu) = v;
   }
   
- -static inline unsigned long *vcpu_cpsr(const struct kvm_vcpu *vcpu)
+ +static __always_inline unsigned long *vcpu_cpsr(const struct kvm_vcpu *vcpu)
   {
         return (unsigned long *)&vcpu_gp_regs(vcpu)->regs.pstate;
   }
   
- -static inline bool vcpu_mode_is_32bit(const struct kvm_vcpu *vcpu)
+ +static __always_inline bool vcpu_mode_is_32bit(const struct kvm_vcpu *vcpu)
   {
         return !!(*vcpu_cpsr(vcpu) & PSR_MODE32_BIT);
   }
   
- -static inline bool kvm_condition_valid(const struct kvm_vcpu *vcpu)
+ +static __always_inline bool kvm_condition_valid(const struct kvm_vcpu *vcpu)
   {
         if (vcpu_mode_is_32bit(vcpu))
                 return kvm_condition_valid32(vcpu);
@@@ -182,13 -181,13 +182,13 @@@ static inline void vcpu_set_thumb(struc
    * coming from a read of ESR_EL2. Otherwise, it may give the wrong result on
    * AArch32 with banked registers.
    */
- -static inline unsigned long vcpu_get_reg(const struct kvm_vcpu *vcpu,
+ +static __always_inline unsigned long vcpu_get_reg(const struct kvm_vcpu *vcpu,
                                          u8 reg_num)
   {
         return (reg_num == 31) ? 0 : vcpu_gp_regs(vcpu)->regs.regs[reg_num];
   }
   
- -static inline void vcpu_set_reg(struct kvm_vcpu *vcpu, u8 reg_num,
+ +static __always_inline void vcpu_set_reg(struct kvm_vcpu *vcpu, u8 reg_num,
                                 unsigned long val)
   {
         if (reg_num != 31)
@@@ -265,12 -264,12 +265,12 @@@ static inline bool vcpu_mode_priv(cons
         return mode != PSR_MODE_EL0t;
   }
   
- -static inline u32 kvm_vcpu_get_hsr(const struct kvm_vcpu *vcpu)
+ +static __always_inline u32 kvm_vcpu_get_hsr(const struct kvm_vcpu *vcpu)
   {
         return vcpu->arch.fault.esr_el2;
   }
   
- -static inline int kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu)
+ +static __always_inline int kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu)
   {
         u32 esr = kvm_vcpu_get_hsr(vcpu);
   
@@@ -280,12 -279,12 +280,12 @@@
         return -1;
   }
   
- -static inline unsigned long kvm_vcpu_get_hfar(const struct kvm_vcpu *vcpu)
+ +static __always_inline unsigned long kvm_vcpu_get_hfar(const struct kvm_vcpu *vcpu)
   {
         return vcpu->arch.fault.far_el2;
   }
   
- -static inline phys_addr_t kvm_vcpu_get_fault_ipa(const struct kvm_vcpu *vcpu)
+ +static __always_inline phys_addr_t kvm_vcpu_get_fault_ipa(const struct kvm_vcpu *vcpu)
   {
         return ((phys_addr_t)vcpu->arch.fault.hpfar_el2 & HPFAR_MASK) << 8;
   }
@@@ -300,7 -299,7 +300,7 @@@ static inline u32 kvm_vcpu_hvc_get_imm(
         return kvm_vcpu_get_hsr(vcpu) & ESR_ELx_xVC_IMM_MASK;
   }
   
- -static inline bool kvm_vcpu_dabt_isvalid(const struct kvm_vcpu *vcpu)
+ +static __always_inline bool kvm_vcpu_dabt_isvalid(const struct kvm_vcpu *vcpu)
   {
         return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_ISV);
   }
@@@ -320,17 -319,17 +320,17 @@@ static inline bool kvm_vcpu_dabt_issf(c
         return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SF);
   }
   
- -static inline int kvm_vcpu_dabt_get_rd(const struct kvm_vcpu *vcpu)
+ +static __always_inline int kvm_vcpu_dabt_get_rd(const struct kvm_vcpu *vcpu)
   {
         return (kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SRT_MASK) >> ESR_ELx_SRT_SHIFT;
   }
   
- -static inline bool kvm_vcpu_dabt_iss1tw(const struct kvm_vcpu *vcpu)
+ +static __always_inline bool kvm_vcpu_dabt_iss1tw(const struct kvm_vcpu *vcpu)
   {
         return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_S1PTW);
   }
   
- -static inline bool kvm_vcpu_dabt_iswrite(const struct kvm_vcpu *vcpu)
+ +static __always_inline bool kvm_vcpu_dabt_iswrite(const struct kvm_vcpu *vcpu)
   {
         return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_WNR) ||
                 kvm_vcpu_dabt_iss1tw(vcpu); /* AF/DBM update */
@@@ -341,18 -340,18 +341,18 @@@ static inline bool kvm_vcpu_dabt_is_cm(
         return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_CM);
   }
   
- -static inline unsigned int kvm_vcpu_dabt_get_as(const struct kvm_vcpu *vcpu)
+ +static __always_inline unsigned int kvm_vcpu_dabt_get_as(const struct kvm_vcpu *vcpu)
   {
         return 1 << ((kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SAS) >> ESR_ELx_SAS_SHIFT);
   }
   
   /* This one is not specific to Data Abort */
- -static inline bool kvm_vcpu_trap_il_is32bit(const struct kvm_vcpu *vcpu)
+ +static __always_inline bool kvm_vcpu_trap_il_is32bit(const struct kvm_vcpu *vcpu)
   {
         return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_IL);
   }
   
- -static inline u8 kvm_vcpu_trap_get_class(const struct kvm_vcpu *vcpu)
+ +static __always_inline u8 kvm_vcpu_trap_get_class(const struct kvm_vcpu *vcpu)
   {
         return ESR_ELx_EC(kvm_vcpu_get_hsr(vcpu));
   }
@@@ -362,17 -361,17 +362,17 @@@ static inline bool kvm_vcpu_trap_is_iab
         return kvm_vcpu_trap_get_class(vcpu) == ESR_ELx_EC_IABT_LOW;
   }
   
- -static inline u8 kvm_vcpu_trap_get_fault(const struct kvm_vcpu *vcpu)
+ +static __always_inline u8 kvm_vcpu_trap_get_fault(const struct kvm_vcpu *vcpu)
   {
         return kvm_vcpu_get_hsr(vcpu) & ESR_ELx_FSC;
   }
   
- -static inline u8 kvm_vcpu_trap_get_fault_type(const struct kvm_vcpu *vcpu)
+ +static __always_inline u8 kvm_vcpu_trap_get_fault_type(const struct kvm_vcpu *vcpu)
   {
         return kvm_vcpu_get_hsr(vcpu) & ESR_ELx_FSC_TYPE;
   }
   
- -static inline bool kvm_vcpu_dabt_isextabt(const struct kvm_vcpu *vcpu)
+ +static __always_inline bool kvm_vcpu_dabt_isextabt(const struct kvm_vcpu *vcpu)
   {
         switch (kvm_vcpu_trap_get_fault(vcpu)) {
         case FSC_SEA:
@@@ -391,7 -390,7 +391,7 @@@
         }
   }
   
- -static inline int kvm_vcpu_sys_get_rt(struct kvm_vcpu *vcpu)
+ +static __always_inline int kvm_vcpu_sys_get_rt(struct kvm_vcpu *vcpu)
   {
         u32 esr = kvm_vcpu_get_hsr(vcpu);
         return ESR_ELx_SYS64_ISS_RT(esr);
@@@ -505,12 -504,14 +505,14 @@@ static inline unsigned long vcpu_data_h
         return data;            /* Leave LE untouched */
   }
   
- -static inline void kvm_skip_instr(struct kvm_vcpu *vcpu, bool is_wide_instr)
+ +static __always_inline void kvm_skip_instr(struct kvm_vcpu *vcpu, bool is_wide_instr)
   {
-       if (vcpu_mode_is_32bit(vcpu))
+       if (vcpu_mode_is_32bit(vcpu)) {
                 kvm_skip_instr32(vcpu, is_wide_instr);
-       else
+       } else {
                 *vcpu_pc(vcpu) += 4;
+               *vcpu_cpsr(vcpu) &= ~PSR_BTYPE_MASK;
+       }
   
         /* advance the singlestep state machine */
         *vcpu_cpsr(vcpu) &= ~DBG_SPSR_SS;
@@@ -520,7 -521,7 +522,7 @@@
    * Skip an instruction which has been emulated at hyp while most guest sysregs
    * are live.
    */
- -static inline void __hyp_text __kvm_skip_instr(struct kvm_vcpu *vcpu)
+ +static __always_inline void __hyp_text __kvm_skip_instr(struct kvm_vcpu *vcpu)
   {
         *vcpu_pc(vcpu) = read_sysreg_el2(SYS_ELR);
         vcpu->arch.ctxt.gp_regs.regs.pstate = read_sysreg_el2(SYS_SPSR);
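
Note on the kvm_skip_instr() hunk above: when an emulated AArch64 instruction is skipped, the merged code advances the PC by 4 and also clears PSTATE.BTYPE, so the next guest instruction is not treated as the target of an indirect branch. The following is a minimal user-space sketch of that step, not the kernel code. `struct fake_vcpu` and `skip_instr64()` are hypothetical names, and the two constants are assumed to mirror the arm64 definitions (BTYPE in bits 11:10, software-step in bit 21).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values: PSTATE.BTYPE occupies bits 11:10, SS is bit 21. */
#define PSR_BTYPE_MASK	0x00000c00ULL
#define DBG_SPSR_SS	(1ULL << 21)

/* Hypothetical stand-in for the guest register state kept in a vcpu. */
struct fake_vcpu {
	uint64_t pc;
	uint64_t cpsr;
};

/* Sketch of the AArch64 branch of kvm_skip_instr(). */
static void skip_instr64(struct fake_vcpu *vcpu)
{
	assert(vcpu != NULL);

	vcpu->pc += 4;				/* A64 instructions are 4 bytes */
	vcpu->cpsr &= ~PSR_BTYPE_MASK;		/* next insn is not a branch target */

	/* advance the singlestep state machine */
	vcpu->cpsr &= ~DBG_SPSR_SS;
}
```

The 32-bit path is left out on purpose: the hunk keeps delegating it to kvm_skip_instr32(), and BTYPE is only cleared on the AArch64 branch.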
diff --combined arch/arm64/include/asm/sysreg.h

index c4ac0ac25a00809bb36d45adcfcd00e5ff369eca,db08ceb4cc9a296883ca2916807be52b06b63f03..2918eb19f15399a3a8fdab19e0895e00aef29ca1
--- 1/arch/arm64/include/asm/sysreg.h
--- 2/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@@ -49,9 -49,7 +49,9 @@@
   #ifndef CONFIG_BROKEN_GAS_INST
   
   #ifdef __ASSEMBLY__
- -#define __emit_inst(x)                        .inst (x)
+ +// The space separator is omitted so that __emit_inst(x) can be parsed as
+ +// either an assembler directive or an assembler macro argument.
+ +#define __emit_inst(x)                        .inst(x)
   #else
   #define __emit_inst(x)                        ".inst " __stringify((x)) "\n\t"
   #endif
@@@ -388,42 -386,6 +388,42 @@@
   #define SYS_TPIDR_EL0                 sys_reg(3, 3, 13, 0, 2)
   #define SYS_TPIDRRO_EL0                       sys_reg(3, 3, 13, 0, 3)
   
+ +/* Definitions for system register interface to AMU for ARMv8.4 onwards */
+ +#define SYS_AM_EL0(crm, op2)          sys_reg(3, 3, 13, (crm), (op2))
+ +#define SYS_AMCR_EL0                  SYS_AM_EL0(2, 0)
+ +#define SYS_AMCFGR_EL0                        SYS_AM_EL0(2, 1)
+ +#define SYS_AMCGCR_EL0                        SYS_AM_EL0(2, 2)
+ +#define SYS_AMUSERENR_EL0             SYS_AM_EL0(2, 3)
+ +#define SYS_AMCNTENCLR0_EL0           SYS_AM_EL0(2, 4)
+ +#define SYS_AMCNTENSET0_EL0           SYS_AM_EL0(2, 5)
+ +#define SYS_AMCNTENCLR1_EL0           SYS_AM_EL0(3, 0)
+ +#define SYS_AMCNTENSET1_EL0           SYS_AM_EL0(3, 1)
+ +
+ +/*
+ + * Group 0 of activity monitors (architected):
+ + *                op0  op1  CRn   CRm       op2
+ + * Counter:       11   011  1101  010:n<3>  n<2:0>
+ + * Type:          11   011  1101  011:n<3>  n<2:0>
+ + * n: 0-15
+ + *
+ + * Group 1 of activity monitors (auxiliary):
+ + *                op0  op1  CRn   CRm       op2
+ + * Counter:       11   011  1101  110:n<3>  n<2:0>
+ + * Type:          11   011  1101  111:n<3>  n<2:0>
+ + * n: 0-15
+ + */
+ +
+ +#define SYS_AMEVCNTR0_EL0(n)          SYS_AM_EL0(4 + ((n) >> 3), (n) & 7)
+ +#define SYS_AMEVTYPE0_EL0(n)          SYS_AM_EL0(6 + ((n) >> 3), (n) & 7)
+ +#define SYS_AMEVCNTR1_EL0(n)          SYS_AM_EL0(12 + ((n) >> 3), (n) & 7)
+ +#define SYS_AMEVTYPE1_EL0(n)          SYS_AM_EL0(14 + ((n) >> 3), (n) & 7)
+ +
+ +/* AMU v1: Fixed (architecturally defined) activity monitors */
+ +#define SYS_AMEVCNTR0_CORE_EL0                SYS_AMEVCNTR0_EL0(0)
+ +#define SYS_AMEVCNTR0_CONST_EL0               SYS_AMEVCNTR0_EL0(1)
+ +#define SYS_AMEVCNTR0_INST_RET_EL0    SYS_AMEVCNTR0_EL0(2)
+ +#define SYS_AMEVCNTR0_MEM_STALL               SYS_AMEVCNTR0_EL0(3)
+ +
   #define SYS_CNTFRQ_EL0                        sys_reg(3, 3, 14, 0, 0)
   
   #define SYS_CNTP_TVAL_EL0             sys_reg(3, 3, 14, 2, 0)
@@@ -552,6 -514,8 +552,8 @@@
   #endif
   
   /* SCTLR_EL1 specific flags. */
+ #define SCTLR_EL1_BT1         (BIT(36))
+ #define SCTLR_EL1_BT0         (BIT(35))
   #define SCTLR_EL1_UCI         (BIT(26))
   #define SCTLR_EL1_E0E         (BIT(24))
   #define SCTLR_EL1_SPAN                (BIT(23))
@@@ -636,7 -600,6 +638,7 @@@
   #define ID_AA64PFR0_CSV3_SHIFT                60
   #define ID_AA64PFR0_CSV2_SHIFT                56
   #define ID_AA64PFR0_DIT_SHIFT         48
+ +#define ID_AA64PFR0_AMU_SHIFT         44
   #define ID_AA64PFR0_SVE_SHIFT         32
   #define ID_AA64PFR0_RAS_SHIFT         28
   #define ID_AA64PFR0_GIC_SHIFT         24
@@@ -647,7 -610,6 +649,7 @@@
   #define ID_AA64PFR0_EL1_SHIFT         4
   #define ID_AA64PFR0_EL0_SHIFT         0
   
+ +#define ID_AA64PFR0_AMU                       0x1
   #define ID_AA64PFR0_SVE                       0x1
   #define ID_AA64PFR0_RAS_V1            0x1
   #define ID_AA64PFR0_FP_NI             0xf
@@@ -660,10 -622,12 +662,12 @@@
   
   /* id_aa64pfr1 */
   #define ID_AA64PFR1_SSBS_SHIFT                4
+ #define ID_AA64PFR1_BT_SHIFT          0
   
   #define ID_AA64PFR1_SSBS_PSTATE_NI    0
   #define ID_AA64PFR1_SSBS_PSTATE_ONLY  1
   #define ID_AA64PFR1_SSBS_PSTATE_INSNS 2
+ #define ID_AA64PFR1_BT_BTI            0x1
   
   /* id_aa64zfr0 */
   #define ID_AA64ZFR0_F64MM_SHIFT               56
@@@ -742,16 -706,6 +746,16 @@@
   #define ID_AA64DFR0_TRACEVER_SHIFT    4
   #define ID_AA64DFR0_DEBUGVER_SHIFT    0
   
+ +#define ID_AA64DFR0_PMUVER_8_0                0x1
+ +#define ID_AA64DFR0_PMUVER_8_1                0x4
+ +#define ID_AA64DFR0_PMUVER_8_4                0x5
+ +#define ID_AA64DFR0_PMUVER_8_5                0x6
+ +#define ID_AA64DFR0_PMUVER_IMP_DEF    0xf
+ +
+ +#define ID_DFR0_PERFMON_SHIFT         24
+ +
+ +#define ID_DFR0_PERFMON_8_1           0x4
+ +
   #define ID_ISAR5_RDM_SHIFT            24
   #define ID_ISAR5_CRC32_SHIFT          16
   #define ID_ISAR5_SHA2_SHIFT           12
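
Note on the AMU register macros in the sysreg.h hunk above: the counter index n (0-15) is split across two encoding fields, with n<3> going into the low bit of CRm and n<2:0> into op2. The sketch below only reproduces that arithmetic from the comment table and the SYS_AMEVCNTR*/SYS_AMEVTYPE* macros. `struct am_enc`, `amevcntr_enc()` and `amevtype_enc()` are hypothetical names used for illustration, and the full sys_reg() bit packing is deliberately not modelled.

```c
#include <assert.h>

/* Hypothetical holder for the two fields that vary with n. */
struct am_enc {
	unsigned int crm;
	unsigned int op2;
};

/* Counter registers: CRm base 4 (0b010:n<3>) for group 0, 12 (0b110:n<3>) for group 1. */
static struct am_enc amevcntr_enc(unsigned int group, unsigned int n)
{
	assert(group <= 1 && n <= 15);

	struct am_enc e = { (group ? 12u : 4u) + (n >> 3), n & 7u };
	return e;
}

/* Type registers: CRm base 6 (0b011:n<3>) for group 0, 14 (0b111:n<3>) for group 1. */
static struct am_enc amevtype_enc(unsigned int group, unsigned int n)
{
	assert(group <= 1 && n <= 15);

	struct am_enc e = { (group ? 14u : 6u) + (n >> 3), n & 7u };
	return e;
}
```

For instance, the fixed MEM_STALL counter is SYS_AMEVCNTR0_EL0(3), which this arithmetic places at CRm = 4, op2 = 3.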
diff --combined arch/arm64/kernel/cpufeature.c

index 9fac745aa7bb248771bf113c7b3e8539707af51a,e6d31776e49bddcaa49663ba9e1f50ffb675f8d8..b234d6f71cba6402a57bc833473b77bdfde2edba
--- 1/arch/arm64/kernel/cpufeature.c
--- 2/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@@ -116,8 -116,6 +116,8 @@@ cpufeature_pan_not_uao(const struct arm
   
   static void cpu_enable_cnp(struct arm64_cpu_capabilities const *cap);
   
+ +static bool __system_matches_cap(unsigned int n);
+ +
   /*
    * NOTE: Any changes to the visibility of features should be kept in
    * sync with the documentation of the CPU feature register ABI.
@@@ -165,7 -163,6 +165,7 @@@ static const struct arm64_ftr_bits ftr_
         ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR0_CSV3_SHIFT, 4, 0),
         ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR0_CSV2_SHIFT, 4, 0),
         ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64PFR0_DIT_SHIFT, 4, 0),
+ +      ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR0_AMU_SHIFT, 4, 0),
         ARM64_FTR_BITS(FTR_VISIBLE_IF_IS_ENABLED(CONFIG_ARM64_SVE),
                                    FTR_STRICT, FTR_LOWER_SAFE, ID_AA64PFR0_SVE_SHIFT, 4, 0),
         ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64PFR0_RAS_SHIFT, 4, 0),
@@@ -182,6 -179,8 +182,8 @@@
   
   static const struct arm64_ftr_bits ftr_id_aa64pfr1[] = {
         ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64PFR1_SSBS_SHIFT, 4, ID_AA64PFR1_SSBS_PSTATE_NI),
+       ARM64_FTR_BITS(FTR_VISIBLE_IF_IS_ENABLED(CONFIG_ARM64_BTI),
+                                   FTR_STRICT, FTR_LOWER_SAFE, ID_AA64PFR1_BT_SHIFT, 4, 0),
         ARM64_FTR_END,
   };
   
@@@ -554,7 -553,7 +556,7 @@@ static void __init init_cpu_ftr_reg(u3
   
         BUG_ON(!reg);
   
- -      for (ftrp  = reg->ftr_bits; ftrp->width; ftrp++) {
+ +      for (ftrp = reg->ftr_bits; ftrp->width; ftrp++) {
                 u64 ftr_mask = arm64_ftr_mask(ftrp);
                 s64 ftr_new = arm64_ftr_value(ftrp, new);
   
@@@ -1225,57 -1224,6 +1227,57 @@@ static bool has_hw_dbm(const struct arm
   
   #endif
   
+ +#ifdef CONFIG_ARM64_AMU_EXTN
+ +
+ +/*
+ + * The "amu_cpus" cpumask only signals that the CPU implementation for the
+ + * flagged CPUs supports the Activity Monitors Unit (AMU) but does not provide
+ + * information regarding all the events that it supports. When a CPU bit is
+ + * set in the cpumask, the user of this feature can only rely on the presence
+ + * of the 4 fixed counters for that CPU. But this does not guarantee that the
+ + * counters are enabled or access to these counters is enabled by code
+ + * executed at higher exception levels (firmware).
+ + */
+ +static struct cpumask amu_cpus __read_mostly;
+ +
+ +bool cpu_has_amu_feat(int cpu)
+ +{
+ +      return cpumask_test_cpu(cpu, &amu_cpus);
+ +}
+ +
+ +/* Initialize the use of AMU counters for frequency invariance */
+ +extern void init_cpu_freq_invariance_counters(void);
+ +
+ +static void cpu_amu_enable(struct arm64_cpu_capabilities const *cap)
+ +{
+ +      if (has_cpuid_feature(cap, SCOPE_LOCAL_CPU)) {
+ +              pr_info("detected CPU%d: Activity Monitors Unit (AMU)\n",
+ +                      smp_processor_id());
+ +              cpumask_set_cpu(smp_processor_id(), &amu_cpus);
+ +              init_cpu_freq_invariance_counters();
+ +      }
+ +}
+ +
+ +static bool has_amu(const struct arm64_cpu_capabilities *cap,
+ +                  int __unused)
+ +{
+ +      /*
+ +       * The AMU extension is a non-conflicting feature: the kernel can
+ +       * safely run a mix of CPUs with and without support for the
+ +       * activity monitors extension. Therefore, unconditionally enable
+ +       * the capability to allow any late CPU to use the feature.
+ +       *
+ +       * With this feature unconditionally enabled, the cpu_enable
+ +       * function will be called for all CPUs that match the criteria,
+ +       * including secondary and hotplugged, marking this feature as
+ +       * present on that respective CPU. The enable function will also
+ +       * print a detection message.
+ +       */
+ +
+ +      return true;
+ +}
+ +#endif
+ +
   #ifdef CONFIG_ARM64_VHE
   static bool runs_at_el2(const struct arm64_cpu_capabilities *entry, int __unused)
   {
@@@ -1370,18 -1318,10 +1372,18 @@@ static void cpu_clear_disr(const struc
   #endif /* CONFIG_ARM64_RAS_EXTN */
   
   #ifdef CONFIG_ARM64_PTR_AUTH
- -static void cpu_enable_address_auth(struct arm64_cpu_capabilities const *cap)
+ +static bool has_address_auth(const struct arm64_cpu_capabilities *entry,
+ +                           int __unused)
   {
- -      sysreg_clear_set(sctlr_el1, 0, SCTLR_ELx_ENIA | SCTLR_ELx_ENIB |
- -                                     SCTLR_ELx_ENDA | SCTLR_ELx_ENDB);
+ +      return __system_matches_cap(ARM64_HAS_ADDRESS_AUTH_ARCH) ||
+ +             __system_matches_cap(ARM64_HAS_ADDRESS_AUTH_IMP_DEF);
+ +}
+ +
+ +static bool has_generic_auth(const struct arm64_cpu_capabilities *entry,
+ +                           int __unused)
+ +{
+ +      return __system_matches_cap(ARM64_HAS_GENERIC_AUTH_ARCH) ||
+ +             __system_matches_cap(ARM64_HAS_GENERIC_AUTH_IMP_DEF);
   }
   #endif /* CONFIG_ARM64_PTR_AUTH */
   
@@@ -1409,25 -1349,21 +1411,40 @@@ static bool can_use_gic_priorities(cons
   }
   #endif
   
+ #ifdef CONFIG_ARM64_BTI
+ static void bti_enable(const struct arm64_cpu_capabilities *__unused)
+ {
+       /*
+        * Use of X16/X17 for tail-calls and trampolines that jump to
+        * function entry points using BR is a requirement for
+        * marking binaries with GNU_PROPERTY_AARCH64_FEATURE_1_BTI.
+        * So, be strict and forbid other BRs using other registers to
+        * jump onto a PACIxSP instruction:
+        */
+       sysreg_clear_set(sctlr_el1, 0, SCTLR_EL1_BT0 | SCTLR_EL1_BT1);
+       isb();
+ }
+ #endif /* CONFIG_ARM64_BTI */
+ 
+ +/* Internal helper functions to match cpu capability type */
+ +static bool
+ +cpucap_late_cpu_optional(const struct arm64_cpu_capabilities *cap)
+ +{
+ +      return !!(cap->type & ARM64_CPUCAP_OPTIONAL_FOR_LATE_CPU);
+ +}
+ +
+ +static bool
+ +cpucap_late_cpu_permitted(const struct arm64_cpu_capabilities *cap)
+ +{
+ +      return !!(cap->type & ARM64_CPUCAP_PERMITTED_FOR_LATE_CPU);
+ +}
+ +
+ +static bool
+ +cpucap_panic_on_conflict(const struct arm64_cpu_capabilities *cap)
+ +{
+ +      return !!(cap->type & ARM64_CPUCAP_PANIC_ON_CONFLICT);
+ +}
+ +
   static const struct arm64_cpu_capabilities arm64_features[] = {
         {
                 .desc = "GIC system register CPU interface",
@@@ -1580,24 -1516,6 +1597,24 @@@
                 .cpu_enable = cpu_clear_disr,
         },
   #endif /* CONFIG_ARM64_RAS_EXTN */
+ +#ifdef CONFIG_ARM64_AMU_EXTN
+ +      {
+ +              /*
+ +               * The feature is enabled by default if CONFIG_ARM64_AMU_EXTN=y.
+ +               * Therefore, don't provide .desc as we don't want the detection
+ +               * message to be shown until at least one CPU is detected to
+ +               * support the feature.
+ +               */
+ +              .capability = ARM64_HAS_AMU_EXTN,
+ +              .type = ARM64_CPUCAP_WEAK_LOCAL_CPU_FEATURE,
+ +              .matches = has_amu,
+ +              .sys_reg = SYS_ID_AA64PFR0_EL1,
+ +              .sign = FTR_UNSIGNED,
+ +              .field_pos = ID_AA64PFR0_AMU_SHIFT,
+ +              .min_field_value = ID_AA64PFR0_AMU,
+ +              .cpu_enable = cpu_amu_enable,
+ +      },
+ +#endif /* CONFIG_ARM64_AMU_EXTN */
         {
                 .desc = "Data cache clean to the PoU not required for I/D coherence",
                 .capability = ARM64_HAS_CACHE_IDC,
@@@ -1691,27 -1609,24 +1708,27 @@@
         {
                 .desc = "Address authentication (architected algorithm)",
                 .capability = ARM64_HAS_ADDRESS_AUTH_ARCH,
- -              .type = ARM64_CPUCAP_SYSTEM_FEATURE,
+ +              .type = ARM64_CPUCAP_BOOT_CPU_FEATURE,
                 .sys_reg = SYS_ID_AA64ISAR1_EL1,
                 .sign = FTR_UNSIGNED,
                 .field_pos = ID_AA64ISAR1_APA_SHIFT,
                 .min_field_value = ID_AA64ISAR1_APA_ARCHITECTED,
                 .matches = has_cpuid_feature,
- -              .cpu_enable = cpu_enable_address_auth,
         },
         {
                 .desc = "Address authentication (IMP DEF algorithm)",
                 .capability = ARM64_HAS_ADDRESS_AUTH_IMP_DEF,
- -              .type = ARM64_CPUCAP_SYSTEM_FEATURE,
+ +              .type = ARM64_CPUCAP_BOOT_CPU_FEATURE,
                 .sys_reg = SYS_ID_AA64ISAR1_EL1,
                 .sign = FTR_UNSIGNED,
                 .field_pos = ID_AA64ISAR1_API_SHIFT,
                 .min_field_value = ID_AA64ISAR1_API_IMP_DEF,
                 .matches = has_cpuid_feature,
- -              .cpu_enable = cpu_enable_address_auth,
+ +      },
+ +      {
+ +              .capability = ARM64_HAS_ADDRESS_AUTH,
+ +              .type = ARM64_CPUCAP_BOOT_CPU_FEATURE,
+ +              .matches = has_address_auth,
         },
         {
                 .desc = "Generic authentication (architected algorithm)",
@@@ -1733,11 -1648,6 +1750,11 @@@
                 .min_field_value = ID_AA64ISAR1_GPI_IMP_DEF,
                 .matches = has_cpuid_feature,
         },
+ +      {
+ +              .capability = ARM64_HAS_GENERIC_AUTH,
+ +              .type = ARM64_CPUCAP_SYSTEM_FEATURE,
+ +              .matches = has_generic_auth,
+ +      },
   #endif /* CONFIG_ARM64_PTR_AUTH */
   #ifdef CONFIG_ARM64_PSEUDO_NMI
         {
@@@ -1778,6 -1688,19 +1795,19 @@@
                 .sign = FTR_UNSIGNED,
                 .min_field_value = 1,
         },
+ #endif
+ #ifdef CONFIG_ARM64_BTI
+       {
+               .desc = "Branch Target Identification",
+               .capability = ARM64_BTI,
+               .type = ARM64_CPUCAP_SYSTEM_FEATURE,
+               .matches = has_cpuid_feature,
+               .cpu_enable = bti_enable,
+               .sys_reg = SYS_ID_AA64PFR1_EL1,
+               .field_pos = ID_AA64PFR1_BT_SHIFT,
+               .min_field_value = ID_AA64PFR1_BT_BTI,
+               .sign = FTR_UNSIGNED,
+       },
   #endif
         {},
   };
@@@ -1888,6 -1811,9 +1918,9 @@@ static const struct arm64_cpu_capabilit
         HWCAP_CAP(SYS_ID_AA64ZFR0_EL1, ID_AA64ZFR0_F64MM_SHIFT, FTR_UNSIGNED, ID_AA64ZFR0_F64MM, CAP_HWCAP, KERNEL_HWCAP_SVEF64MM),
   #endif
         HWCAP_CAP(SYS_ID_AA64PFR1_EL1, ID_AA64PFR1_SSBS_SHIFT, FTR_UNSIGNED, ID_AA64PFR1_SSBS_PSTATE_INSNS, CAP_HWCAP, KERNEL_HWCAP_SSBS),
+ #ifdef CONFIG_ARM64_BTI
+       HWCAP_CAP(SYS_ID_AA64PFR1_EL1, ID_AA64PFR1_BT_SHIFT, FTR_UNSIGNED, ID_AA64PFR1_BT_BTI, CAP_HWCAP, KERNEL_HWCAP_BTI),
+ #endif
   #ifdef CONFIG_ARM64_PTR_AUTH
         HWCAP_MULTI_CAP(ptr_auth_hwcap_addr_matches, CAP_HWCAP, KERNEL_HWCAP_PACA),
         HWCAP_MULTI_CAP(ptr_auth_hwcap_gen_matches, CAP_HWCAP, KERNEL_HWCAP_PACG),
@@@ -2087,8 -2013,10 +2120,8 @@@ static void __init enable_cpu_capabilit
    * Run through the list of capabilities to check for conflicts.
    * If the system has already detected a capability, take necessary
    * action on this CPU.
- - *
- - * Returns "false" on conflicts.
    */
- -static bool verify_local_cpu_caps(u16 scope_mask)
+ +static void verify_local_cpu_caps(u16 scope_mask)
   {
         int i;
         bool cpu_has_cap, system_has_cap;
@@@ -2133,12 -2061,10 +2166,12 @@@
                 pr_crit("CPU%d: Detected conflict for capability %d (%s), System: %d, CPU: %d\n",
                         smp_processor_id(), caps->capability,
                         caps->desc, system_has_cap, cpu_has_cap);
- -              return false;
- -      }
   
- -      return true;
+ +              if (cpucap_panic_on_conflict(caps))
+ +                      cpu_panic_kernel();
+ +              else
+ +                      cpu_die_early();
+ +      }
   }
   
   /*
@@@ -2148,8 -2074,12 +2181,8 @@@
   static void check_early_cpu_features(void)
   {
         verify_cpu_asid_bits();
- -      /*
- -       * Early features are used by the kernel already. If there
- -       * is a conflict, we cannot proceed further.
- -       */
- -      if (!verify_local_cpu_caps(SCOPE_BOOT_CPU))
- -              cpu_panic_kernel();
+ +
+ +      verify_local_cpu_caps(SCOPE_BOOT_CPU);
   }
   
   static void
@@@ -2197,7 -2127,8 +2230,7 @@@ static void verify_local_cpu_capabiliti
          * check_early_cpu_features(), as they need to be verified
          * on all secondary CPUs.
          */
- -      if (!verify_local_cpu_caps(SCOPE_ALL & ~SCOPE_BOOT_CPU))
- -              cpu_die_early();
+ +      verify_local_cpu_caps(SCOPE_ALL & ~SCOPE_BOOT_CPU);
   
         verify_local_elf_hwcaps(arm64_elf_hwcaps);
   
@@@ -2248,23 -2179,6 +2281,23 @@@ bool this_cpu_has_cap(unsigned int n
         return false;
   }
   
+ +/*
+ + * This helper function is used in a narrow window when,
+ + * - The system wide safe registers are set with all the SMP CPUs and,
+ + * - The SYSTEM_FEATURE cpu_hwcaps may not have been set.
+ + * In all other cases cpus_have_{const_}cap() should be used.
+ + */
+ +static bool __system_matches_cap(unsigned int n)
+ +{
+ +      if (n < ARM64_NCAPS) {
+ +              const struct arm64_cpu_capabilities *cap = cpu_hwcaps_ptrs[n];
+ +
+ +              if (cap)
+ +                      return cap->matches(cap, SCOPE_SYSTEM);
+ +      }
+ +      return false;
+ +}
+ +
   void cpu_set_feature(unsigned int num)
   {
         WARN_ON(num >= MAX_CPU_FEATURES);
@@@ -2337,7 -2251,7 +2370,7 @@@ void __init setup_cpu_features(void
   static bool __maybe_unused
   cpufeature_pan_not_uao(const struct arm64_cpu_capabilities *entry, int __unused)
   {
- -      return (cpus_have_const_cap(ARM64_HAS_PAN) && !cpus_have_const_cap(ARM64_HAS_UAO));
+ +      return (__system_matches_cap(ARM64_HAS_PAN) && !__system_matches_cap(ARM64_HAS_UAO));
   }
   
   static void __maybe_unused cpu_enable_cnp(struct arm64_cpu_capabilities const *cap)
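
Note on the capability entries in the cpufeature.c hunks above: entries such as ARM64_BTI and ARM64_HAS_AMU_EXTN describe a 4-bit unsigned ID register field by its shift (.field_pos) and the smallest value that counts as "implemented" (.min_field_value). The sketch below shows that comparison in isolation. It is an illustration under stated assumptions, not has_cpuid_feature() itself: `extract_unsigned_field()` and `field_matches()` are hypothetical helpers, the register value is passed in directly rather than read from the sanitised feature registers, and signed fields are not handled. The shift and minimum values are the ones defined in the sysreg.h part of this diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the sysreg.h hunks of this merge. */
#define ID_AA64PFR0_AMU_SHIFT	44
#define ID_AA64PFR0_AMU		0x1
#define ID_AA64PFR1_BT_SHIFT	0
#define ID_AA64PFR1_BT_BTI	0x1

/* Hypothetical helper: pull out one unsigned 4-bit ID register field. */
static unsigned int extract_unsigned_field(uint64_t reg, unsigned int shift)
{
	assert(shift <= 60);

	return (unsigned int)((reg >> shift) & 0xf);
}

/* Hypothetical helper: the feature is present if the field is >= the minimum. */
static bool field_matches(uint64_t reg, unsigned int shift, unsigned int min)
{
	return extract_unsigned_field(reg, shift) >= min;
}
```

A call such as field_matches(pfr1, ID_AA64PFR1_BT_SHIFT, ID_AA64PFR1_BT_BTI) then corresponds to the check described by the "Branch Target Identification" entry, with `pfr1` standing in for the ID_AA64PFR1_EL1 value.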
diff --combined arch/arm64/kernel/entry-common.c

index c839b5bf1904b128b0fe7f91d569397b8566d80a,55ec0627f5a7363857012cf1b1fd17f7f7cbb101..1196eb4f4c762a1225ff300fa99229bcf5b88e59
--- 1/arch/arm64/kernel/entry-common.c
--- 2/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@@ -175,7 -175,7 +175,7 @@@ NOKPROBE_SYMBOL(el0_pc)
   static void notrace el0_sp(struct pt_regs *regs, unsigned long esr)
   {
         user_exit_irqoff();
- -      local_daif_restore(DAIF_PROCCTX_NOIRQ);
+ +      local_daif_restore(DAIF_PROCCTX);
         do_sp_pc_abort(regs->sp, esr, regs);
   }
   NOKPROBE_SYMBOL(el0_sp);
@@@ -188,6 -188,14 +188,14 @@@ static void notrace el0_undef(struct pt
   }
   NOKPROBE_SYMBOL(el0_undef);
   
+ static void notrace el0_bti(struct pt_regs *regs)
+ {
+       user_exit_irqoff();
+       local_daif_restore(DAIF_PROCCTX);
+       do_bti(regs);
+ }
+ NOKPROBE_SYMBOL(el0_bti);
+ 
   static void notrace el0_inv(struct pt_regs *regs, unsigned long esr)
   {
         user_exit_irqoff();
@@@ -255,6 -263,9 +263,9 @@@ asmlinkage void notrace el0_sync_handle
         case ESR_ELx_EC_UNKNOWN:
                 el0_undef(regs);
                 break;
+       case ESR_ELx_EC_BTI:
+               el0_bti(regs);
+               break;
         case ESR_ELx_EC_BREAKPT_LOW:
         case ESR_ELx_EC_SOFTSTP_LOW:
         case ESR_ELx_EC_WATCHPT_LOW:
diff --combined arch/arm64/kernel/process.c

index 56be4cbf771f604a849f958382aec9acdf4e837f,127aee47843312ff645ea1946086c00f6edd49d3..eade7807e819d5637ea157315819ca136153e43d
--- 1/arch/arm64/kernel/process.c
--- 2/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@@ -11,6 -11,7 +11,7 @@@
   
   #include <linux/compat.h>
   #include <linux/efi.h>
+ #include <linux/elf.h>
   #include <linux/export.h>
   #include <linux/sched.h>
   #include <linux/sched/debug.h>
@@@ -18,6 -19,7 +19,7 @@@
   #include <linux/sched/task_stack.h>
   #include <linux/kernel.h>
   #include <linux/lockdep.h>
+ #include <linux/mman.h>
   #include <linux/mm.h>
   #include <linux/stddef.h>
   #include <linux/sysctl.h>
@@@ -141,11 -143,11 +143,11 @@@ void arch_cpu_idle_dead(void
    * to execute e.g. a RAM-based pin loop is not sufficient. This allows the
    * kexec'd kernel to use any and all RAM as it sees fit, without having to
    * avoid any code or data used by any SW CPU pin loop. The CPU hotplug
- - * functionality embodied in disable_nonboot_cpus() to achieve this.
+ + * functionality embodied in smp_shutdown_nonboot_cpus() to achieve this.
    */
   void machine_shutdown(void)
   {
- -      disable_nonboot_cpus();
+ +      smp_shutdown_nonboot_cpus(reboot_cpu);
   }
   
   /*
@@@ -209,6 -211,15 +211,15 @@@ void machine_restart(char *cmd
         while (1);
   }
   
+ #define bstr(suffix, str) [PSR_BTYPE_ ## suffix >> PSR_BTYPE_SHIFT] = str
+ static const char *const btypes[] = {
+       bstr(NONE, "--"),
+       bstr(  JC, "jc"),
+       bstr(   C, "-c"),
+       bstr(  J , "j-")
+ };
+ #undef bstr
+ 
   static void print_pstate(struct pt_regs *regs)
   {
         u64 pstate = regs->pstate;
@@@ -227,7 -238,10 +238,10 @@@
                         pstate & PSR_AA32_I_BIT ? 'I' : 'i',
                         pstate & PSR_AA32_F_BIT ? 'F' : 'f');
         } else {
-               printk("pstate: %08llx (%c%c%c%c %c%c%c%c %cPAN %cUAO)\n",
+               const char *btype_str = btypes[(pstate & PSR_BTYPE_MASK) >>
+                                              PSR_BTYPE_SHIFT];
+ 
+               printk("pstate: %08llx (%c%c%c%c %c%c%c%c %cPAN %cUAO BTYPE=%s)\n",
                         pstate,
                         pstate & PSR_N_BIT ? 'N' : 'n',
                         pstate & PSR_Z_BIT ? 'Z' : 'z',
@@@ -238,7 -252,8 +252,8 @@@
                         pstate & PSR_I_BIT ? 'I' : 'i',
                         pstate & PSR_F_BIT ? 'F' : 'f',
                         pstate & PSR_PAN_BIT ? '+' : '-',
-                       pstate & PSR_UAO_BIT ? '+' : '-');
+                       pstate & PSR_UAO_BIT ? '+' : '-',
+                       btype_str);
         }
   }
   
@@@ -262,7 -277,7 +277,7 @@@ void __show_regs(struct pt_regs *regs
   
         if (!user_mode(regs)) {
                 printk("pc : %pS\n", (void *)regs->pc);
- -              printk("lr : %pS\n", (void *)lr);
+ +              printk("lr : %pS\n", (void *)ptrauth_strip_insn_pac(lr));
         } else {
                 printk("pc : %016llx\n", regs->pc);
                 printk("lr : %016llx\n", lr);
@@@ -376,8 -391,6 +391,8 @@@ int copy_thread_tls(unsigned long clone
          */
         fpsimd_flush_task_state(p);
   
+ +      ptrauth_thread_init_kernel(p);
+ +
         if (likely(!(p->flags & PF_KTHREAD))) {
                 *childregs = *current_pt_regs();
                 childregs->regs[0] = 0;
@@@ -514,6 -527,7 +529,6 @@@ __notrace_funcgraph struct task_struct 
         contextidr_thread_switch(next);
         entry_task_switch(next);
         uao_thread_switch(next);
- -      ptrauth_thread_switch(next);
         ssbs_thread_switch(next);
   
         /*
@@@ -655,3 -669,25 +670,25 @@@ asmlinkage void __sched arm64_preempt_s
         if (system_capabilities_finalized())
                 preempt_schedule_irq();
   }
+ 
+ #ifdef CONFIG_BINFMT_ELF
+ int arch_elf_adjust_prot(int prot, const struct arch_elf_state *state,
+                        bool has_interp, bool is_interp)
+ {
+       /*
+        * For dynamically linked executables the interpreter is
+        * responsible for setting PROT_BTI on everything except
+        * itself.
+        */
+       if (is_interp != has_interp)
+               return prot;
+ 
+       if (!(state->flags & ARM64_ELF_BTI))
+               return prot;
+ 
+       if (prot & PROT_EXEC)
+               prot |= PROT_BTI;
+ 
+       return prot;
+ }
+ #endif
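
Note on arch_elf_adjust_prot() above: the kernel adds PROT_BTI only to the object it is itself responsible for, that is a static executable or the ELF interpreter, and leaves the main image of a dynamically linked program to the interpreter. The sketch below restates that decision as a plain function so the three cases can be checked in user space. `adjust_prot()` and the SK_-prefixed constants are hypothetical names, and the numeric values (PROT_EXEC = 0x4, PROT_BTI = 0x10, ARM64_ELF_BTI = bit 0) are assumptions about the arm64 definitions, not taken from this diff.

```c
#include <stdbool.h>

/* Assumed values, renamed to avoid clashing with <sys/mman.h>. */
#define SK_PROT_READ		0x1
#define SK_PROT_EXEC		0x4
#define SK_PROT_BTI		0x10
#define SK_ARM64_ELF_BTI	(1u << 0)

/* Sketch of the decision made by arch_elf_adjust_prot(). */
static int adjust_prot(int prot, unsigned int elf_flags,
		       bool has_interp, bool is_interp)
{
	/*
	 * Main image of a dynamically linked executable: the interpreter
	 * is expected to enable BTI on it, so leave prot alone.
	 */
	if (is_interp != has_interp)
		return prot;

	/* The ELF file is not marked as BTI compatible. */
	if (!(elf_flags & SK_ARM64_ELF_BTI))
		return prot;

	/* Only executable mappings are guarded. */
	if (prot & SK_PROT_EXEC)
		prot |= SK_PROT_BTI;

	return prot;
}
```

Here (has_interp, is_interp) is (false, false) for a static executable, (true, true) for the interpreter itself, and (true, false) for the main image of a dynamic executable, which is the only combination that returns early.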
diff --combined arch/arm64/kernel/ptrace.c

index b3d3005d9515de52dc5527696049aaffcc9633dd,fd8ac7cf68e70542ac193a30320410b4667a92a0..585dd7f5c826cd04232a11b765f5928193d7b707
--- 1/arch/arm64/kernel/ptrace.c
--- 2/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@@ -999,7 -999,7 +999,7 @@@ static struct ptrauth_key pac_key_from_
   }
   
   static void pac_address_keys_to_user(struct user_pac_address_keys *ukeys,
- -                                   const struct ptrauth_keys *keys)
+ +                                   const struct ptrauth_keys_user *keys)
   {
         ukeys->apiakey = pac_key_to_user(&keys->apia);
         ukeys->apibkey = pac_key_to_user(&keys->apib);
@@@ -1007,7 -1007,7 +1007,7 @@@
         ukeys->apdbkey = pac_key_to_user(&keys->apdb);
   }
   
- -static void pac_address_keys_from_user(struct ptrauth_keys *keys,
+ +static void pac_address_keys_from_user(struct ptrauth_keys_user *keys,
                                        const struct user_pac_address_keys *ukeys)
   {
         keys->apia = pac_key_from_user(ukeys->apiakey);
@@@ -1021,7 -1021,7 +1021,7 @@@ static int pac_address_keys_get(struct 
                                 unsigned int pos, unsigned int count,
                                 void *kbuf, void __user *ubuf)
   {
- -      struct ptrauth_keys *keys = &target->thread.keys_user;
+ +      struct ptrauth_keys_user *keys = &target->thread.keys_user;
         struct user_pac_address_keys user_keys;
   
         if (!system_supports_address_auth())
@@@ -1038,7 -1038,7 +1038,7 @@@ static int pac_address_keys_set(struct 
                                 unsigned int pos, unsigned int count,
                                 const void *kbuf, const void __user *ubuf)
   {
- -      struct ptrauth_keys *keys = &target->thread.keys_user;
+ +      struct ptrauth_keys_user *keys = &target->thread.keys_user;
         struct user_pac_address_keys user_keys;
         int ret;
   
@@@ -1056,12 -1056,12 +1056,12 @@@
   }
   
   static void pac_generic_keys_to_user(struct user_pac_generic_keys *ukeys,
- -                                   const struct ptrauth_keys *keys)
+ +                                   const struct ptrauth_keys_user *keys)
   {
         ukeys->apgakey = pac_key_to_user(&keys->apga);
   }
   
- -static void pac_generic_keys_from_user(struct ptrauth_keys *keys,
+ +static void pac_generic_keys_from_user(struct ptrauth_keys_user *keys,
                                        const struct user_pac_generic_keys *ukeys)
   {
         keys->apga = pac_key_from_user(ukeys->apgakey);
@@@ -1072,7 -1072,7 +1072,7 @@@ static int pac_generic_keys_get(struct 
                                 unsigned int pos, unsigned int count,
                                 void *kbuf, void __user *ubuf)
   {
- -      struct ptrauth_keys *keys = &target->thread.keys_user;
+ +      struct ptrauth_keys_user *keys = &target->thread.keys_user;
         struct user_pac_generic_keys user_keys;
   
         if (!system_supports_generic_auth())
@@@ -1089,7 -1089,7 +1089,7 @@@ static int pac_generic_keys_set(struct 
                                 unsigned int pos, unsigned int count,
                                 const void *kbuf, const void __user *ubuf)
   {
- -      struct ptrauth_keys *keys = &target->thread.keys_user;
+ +      struct ptrauth_keys_user *keys = &target->thread.keys_user;
         struct user_pac_generic_keys user_keys;
         int ret;
   
@@@ -1874,7 -1874,7 +1874,7 @@@ void syscall_trace_exit(struct pt_regs 
    */
   #define SPSR_EL1_AARCH64_RES0_BITS \
         (GENMASK_ULL(63, 32) | GENMASK_ULL(27, 25) | GENMASK_ULL(23, 22) | \
-        GENMASK_ULL(20, 13) | GENMASK_ULL(11, 10) | GENMASK_ULL(5, 5))
+        GENMASK_ULL(20, 13) | GENMASK_ULL(5, 5))
   #define SPSR_EL1_AARCH32_RES0_BITS \
         (GENMASK_ULL(63, 32) | GENMASK_ULL(22, 22) | GENMASK_ULL(20, 20))
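
Note on the SPSR_EL1_AARCH64_RES0_BITS hunk above: dropping GENMASK_ULL(11, 10) from the mask means bits 11:10 (PSTATE.BTYPE) are no longer treated as reserved for AArch64 tasks. The sketch below compares the two masks. `GENMASK64()`, `RES0_OLD`, `RES0_NEW` and `clear_res0()` are hypothetical names, and the sketch assumes the mask is applied by clearing the RES0 bits from a user-supplied pstate; the kernel's register validation performs further checks that are not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for GENMASK_ULL(h, l): bits h..l set, inclusive. */
#define GENMASK64(h, l) (((~0ULL) << (l)) & (~0ULL >> (63 - (h))))

/* Mask before the merge: bits 11:10 are reserved. */
#define RES0_OLD \
	(GENMASK64(63, 32) | GENMASK64(27, 25) | GENMASK64(23, 22) | \
	 GENMASK64(20, 13) | GENMASK64(11, 10) | GENMASK64(5, 5))

/* Mask after the merge: bits 11:10 carry BTYPE and are preserved. */
#define RES0_NEW \
	(GENMASK64(63, 32) | GENMASK64(27, 25) | GENMASK64(23, 22) | \
	 GENMASK64(20, 13) | GENMASK64(5, 5))

/* Hypothetical helper: drop every bit the given mask marks as RES0. */
static uint64_t clear_res0(uint64_t pstate, uint64_t res0)
{
	/* A mask that reserved every bit would make this a no-op check. */
	assert(res0 != ~0ULL);

	return pstate & ~res0;
}
```

With the old mask a BTYPE value written to pstate would be cleared, whereas the new mask lets it through unchanged.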
   
diff --combined fs/binfmt_elf.c

index 13f25e241ac46cbd2f5ffa23de45e60a035a0c1a,cceb29d6ef1d9ff14c0f4eab0b811fee1c0db786..4adb963cdb830661c794969272a8aeb5e23ee578
--- 1/fs/binfmt_elf.c
--- 2/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@@ -27,7 -27,6 +27,7 @@@
   #include <linux/highuid.h>
   #include <linux/compiler.h>
   #include <linux/highmem.h>
+ +#include <linux/hugetlb.h>
   #include <linux/pagemap.h>
   #include <linux/vmalloc.h>
   #include <linux/security.h>
@@@ -40,12 -39,18 +40,18 @@@
   #include <linux/sched/coredump.h>
   #include <linux/sched/task_stack.h>
   #include <linux/sched/cputime.h>
+ #include <linux/sizes.h>
+ #include <linux/types.h>
   #include <linux/cred.h>
   #include <linux/dax.h>
   #include <linux/uaccess.h>
   #include <asm/param.h>
   #include <asm/page.h>
   
+ #ifndef ELF_COMPAT
+ #define ELF_COMPAT 0
+ #endif
+ 
   #ifndef user_long_t
   #define user_long_t long
   #endif
@@@ -539,7 -544,8 +545,8 @@@ static inline int arch_check_elf(struc
   
   #endif /* !CONFIG_ARCH_BINFMT_ELF_STATE */
   
- static inline int make_prot(u32 p_flags)
+ static inline int make_prot(u32 p_flags, struct arch_elf_state *arch_state,
+                           bool has_interp, bool is_interp)
   {
         int prot = 0;
   
@@@ -549,7 -555,8 +556,8 @@@
                 prot |= PROT_WRITE;
         if (p_flags & PF_X)
                 prot |= PROT_EXEC;
-       return prot;
+ 
+       return arch_elf_adjust_prot(prot, arch_state, has_interp, is_interp);
   }
   
   /* This is much more generalized than the library routine read function,
@@@ -559,7 -566,8 +567,8 @@@
   
   static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
                 struct file *interpreter,
-               unsigned long no_base, struct elf_phdr *interp_elf_phdata)
+               unsigned long no_base, struct elf_phdr *interp_elf_phdata,
+               struct arch_elf_state *arch_state)
   {
         struct elf_phdr *eppnt;
         unsigned long load_addr = 0;
@@@ -591,7 -599,8 +600,8 @@@
         for (i = 0; i < interp_elf_ex->e_phnum; i++, eppnt++) {
                 if (eppnt->p_type == PT_LOAD) {
                         int elf_type = MAP_PRIVATE | MAP_DENYWRITE;
-                       int elf_prot = make_prot(eppnt->p_flags);
+                       int elf_prot = make_prot(eppnt->p_flags, arch_state,
+                                                true, true);
                         unsigned long vaddr = 0;
                         unsigned long k, map_addr;
   
@@@ -682,6 -691,111 +692,111 @@@ out
    * libraries.  There is no binary dependent code anywhere else.
    */
   
+ static int parse_elf_property(const char *data, size_t *off, size_t datasz,
+                             struct arch_elf_state *arch,
+                             bool have_prev_type, u32 *prev_type)
+ {
+       size_t o, step;
+       const struct gnu_property *pr;
+       int ret;
+ 
+       if (*off == datasz)
+               return -ENOENT;
+ 
+       if (WARN_ON_ONCE(*off > datasz || *off % ELF_GNU_PROPERTY_ALIGN))
+               return -EIO;
+       o = *off;
+       datasz -= *off;
+ 
+       if (datasz < sizeof(*pr))
+               return -ENOEXEC;
+       pr = (const struct gnu_property *)(data + o);
+       o += sizeof(*pr);
+       datasz -= sizeof(*pr);
+ 
+       if (pr->pr_datasz > datasz)
+               return -ENOEXEC;
+ 
+       WARN_ON_ONCE(o % ELF_GNU_PROPERTY_ALIGN);
+       step = round_up(pr->pr_datasz, ELF_GNU_PROPERTY_ALIGN);
+       if (step > datasz)
+               return -ENOEXEC;
+ 
+       /* Properties are supposed to be unique and sorted on pr_type: */
+       if (have_prev_type && pr->pr_type <= *prev_type)
+               return -ENOEXEC;
+       *prev_type = pr->pr_type;
+ 
+       ret = arch_parse_elf_property(pr->pr_type, data + o,
+                                     pr->pr_datasz, ELF_COMPAT, arch);
+       if (ret)
+               return ret;
+ 
+       *off = o + step;
+       return 0;
+ }
+ 
+ #define NOTE_DATA_SZ SZ_1K
+ #define GNU_PROPERTY_TYPE_0_NAME "GNU"
+ #define NOTE_NAME_SZ (sizeof(GNU_PROPERTY_TYPE_0_NAME))
+ 
+ static int parse_elf_properties(struct file *f, const struct elf_phdr *phdr,
+                               struct arch_elf_state *arch)
+ {
+       union {
+               struct elf_note nhdr;
+               char data[NOTE_DATA_SZ];
+       } note;
+       loff_t pos;
+       ssize_t n;
+       size_t off, datasz;
+       int ret;
+       bool have_prev_type;
+       u32 prev_type;
+ 
+       if (!IS_ENABLED(CONFIG_ARCH_USE_GNU_PROPERTY) || !phdr)
+               return 0;
+ 
+       /* load_elf_binary() shouldn't call us unless this is true... */
+       if (WARN_ON_ONCE(phdr->p_type != PT_GNU_PROPERTY))
+               return -ENOEXEC;
+ 
+       /* If the properties are crazy large, that's too bad (for now): */
+       if (phdr->p_filesz > sizeof(note))
+               return -ENOEXEC;
+ 
+       pos = phdr->p_offset;
+       n = kernel_read(f, &note, phdr->p_filesz, &pos);
+ 
+       BUILD_BUG_ON(sizeof(note) < sizeof(note.nhdr) + NOTE_NAME_SZ);
+       if (n < 0 || n < sizeof(note.nhdr) + NOTE_NAME_SZ)
+               return -EIO;
+ 
+       if (note.nhdr.n_type != NT_GNU_PROPERTY_TYPE_0 ||
+           note.nhdr.n_namesz != NOTE_NAME_SZ ||
+           strncmp(note.data + sizeof(note.nhdr),
+                   GNU_PROPERTY_TYPE_0_NAME, n - sizeof(note.nhdr)))
+               return -ENOEXEC;
+ 
+       off = round_up(sizeof(note.nhdr) + NOTE_NAME_SZ,
+                      ELF_GNU_PROPERTY_ALIGN);
+       if (off > n)
+               return -ENOEXEC;
+ 
+       if (note.nhdr.n_descsz > n - off)
+               return -ENOEXEC;
+       datasz = off + note.nhdr.n_descsz;
+ 
+       have_prev_type = false;
+       do {
+               ret = parse_elf_property(note.data, &off, datasz, arch,
+                                        have_prev_type, &prev_type);
+               have_prev_type = true;
+       } while (!ret);
+ 
+       return ret == -ENOENT ? 0 : ret;
+ }
+ 
   static int load_elf_binary(struct linux_binprm *bprm)
   {
         struct file *interpreter = NULL; /* to shut gcc up */
@@@ -689,6 -803,7 +804,7 @@@
         int load_addr_set = 0;
         unsigned long error;
         struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
+       struct elf_phdr *elf_property_phdata = NULL;
         unsigned long elf_bss, elf_brk;
         int bss_prot = 0;
         int retval, i;
@@@ -699,11 -814,19 +815,11 @@@
         unsigned long reloc_func_desc __maybe_unused = 0;
         int executable_stack = EXSTACK_DEFAULT;
         struct elfhdr *elf_ex = (struct elfhdr *)bprm->buf;
- -      struct {
- -              struct elfhdr interp_elf_ex;
- -      } *loc;
+ +      struct elfhdr *interp_elf_ex = NULL;
         struct arch_elf_state arch_state = INIT_ARCH_ELF_STATE;
         struct mm_struct *mm;
         struct pt_regs *regs;
   
- -      loc = kmalloc(sizeof(*loc), GFP_KERNEL);
- -      if (!loc) {
- -              retval = -ENOMEM;
- -              goto out_ret;
- -      }
- -
         retval = -ENOEXEC;
         /* First of all, some simple consistency checks */
         if (memcmp(elf_ex->e_ident, ELFMAG, SELFMAG) != 0)
@@@ -726,6 -849,11 +842,11 @@@
         for (i = 0; i < elf_ex->e_phnum; i++, elf_ppnt++) {
                 char *elf_interpreter;
   
+               if (elf_ppnt->p_type == PT_GNU_PROPERTY) {
+                       elf_property_phdata = elf_ppnt;
+                       continue;
+               }
+ 
                 if (elf_ppnt->p_type != PT_INTERP)
                         continue;
   
@@@ -763,15 -891,9 +884,15 @@@
                  */
                 would_dump(bprm, interpreter);
   
+ +              interp_elf_ex = kmalloc(sizeof(*interp_elf_ex), GFP_KERNEL);
+ +              if (!interp_elf_ex) {
+ +                      retval = -ENOMEM;
+ +                      goto out_free_ph;
+ +              }
+ +
                 /* Get the exec headers */
- -              retval = elf_read(interpreter, &loc->interp_elf_ex,
- -                                sizeof(loc->interp_elf_ex), 0);
+ +              retval = elf_read(interpreter, interp_elf_ex,
+ +                                sizeof(*interp_elf_ex), 0);
                 if (retval < 0)
                         goto out_free_dentry;
   
@@@ -805,25 -927,30 +926,30 @@@ out_free_interp
         if (interpreter) {
                 retval = -ELIBBAD;
                 /* Not an ELF interpreter */
- -              if (memcmp(loc->interp_elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
+ +              if (memcmp(interp_elf_ex->e_ident, ELFMAG, SELFMAG) != 0)
                         goto out_free_dentry;
                 /* Verify the interpreter has a valid arch */
- -              if (!elf_check_arch(&loc->interp_elf_ex) ||
- -                  elf_check_fdpic(&loc->interp_elf_ex))
+ +              if (!elf_check_arch(interp_elf_ex) ||
+ +                  elf_check_fdpic(interp_elf_ex))
                         goto out_free_dentry;
   
                 /* Load the interpreter program headers */
- -              interp_elf_phdata = load_elf_phdrs(&loc->interp_elf_ex,
+ +              interp_elf_phdata = load_elf_phdrs(interp_elf_ex,
                                                    interpreter);
                 if (!interp_elf_phdata)
                         goto out_free_dentry;
   
                 /* Pass PT_LOPROC..PT_HIPROC headers to arch code */
+               elf_property_phdata = NULL;
                 elf_ppnt = interp_elf_phdata;
- -              for (i = 0; i < loc->interp_elf_ex.e_phnum; i++, elf_ppnt++)
+ +              for (i = 0; i < interp_elf_ex->e_phnum; i++, elf_ppnt++)
                         switch (elf_ppnt->p_type) {
+                       case PT_GNU_PROPERTY:
+                               elf_property_phdata = elf_ppnt;
+                               break;
+ 
                         case PT_LOPROC ... PT_HIPROC:
- -                              retval = arch_elf_pt_proc(&loc->interp_elf_ex,
+ +                              retval = arch_elf_pt_proc(interp_elf_ex,
                                                           elf_ppnt, interpreter,
                                                           true, &arch_state);
                                 if (retval)
@@@ -832,13 -959,18 +958,18 @@@
                         }
         }
   
+       retval = parse_elf_properties(interpreter ?: bprm->file,
+                                     elf_property_phdata, &arch_state);
+       if (retval)
+               goto out_free_dentry;
+ 
         /*
          * Allow arch code to reject the ELF at this point, whilst it's
          * still possible to return an error to the code that invoked
          * the exec syscall.
          */
         retval = arch_check_elf(elf_ex,
- -                              !!interpreter, &loc->interp_elf_ex,
+ +                              !!interpreter, interp_elf_ex,
                                 &arch_state);
         if (retval)
                 goto out_free_dentry;
@@@ -913,7 -1045,8 +1044,8 @@@
                         }
                 }
   
-               elf_prot = make_prot(elf_ppnt->p_flags);
+               elf_prot = make_prot(elf_ppnt->p_flags, &arch_state,
+                                    !!interpreter, false);
   
                 elf_flags = MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE;
   
@@@ -1054,16 -1187,17 +1186,17 @@@
         }
   
         if (interpreter) {
- -              elf_entry = load_elf_interp(&loc->interp_elf_ex,
+ +              elf_entry = load_elf_interp(interp_elf_ex,
                                             interpreter,
-                                           load_bias, interp_elf_phdata);
+                                           load_bias, interp_elf_phdata,
+                                           &arch_state);
                 if (!IS_ERR((void *)elf_entry)) {
                         /*
                          * load_elf_interp() returns relocation
                          * adjustment
                          */
                         interp_load_addr = elf_entry;
- -                      elf_entry += loc->interp_elf_ex.e_entry;
+ +                      elf_entry += interp_elf_ex->e_entry;
                 }
                 if (BAD_ADDR(elf_entry)) {
                         retval = IS_ERR((void *)elf_entry) ?
@@@ -1074,9 -1208,6 +1207,9 @@@
   
                 allow_write_access(interpreter);
                 fput(interpreter);
+ +
+ +              kfree(interp_elf_ex);
+ +              kfree(interp_elf_phdata);
         } else {
                 elf_entry = e_entry;
                 if (BAD_ADDR(elf_entry)) {
@@@ -1085,6 -1216,7 +1218,6 @@@
                 }
         }
   
- -      kfree(interp_elf_phdata);
         kfree(elf_phdata);
   
         set_binfmt(&elf_format);
@@@ -1154,11 -1286,12 +1287,11 @@@
         start_thread(regs, elf_entry, bprm->p);
         retval = 0;
   out:
- -      kfree(loc);
- -out_ret:
         return retval;
   
         /* error cleanup */
   out_free_dentry:
+ +      kfree(interp_elf_ex);
         kfree(interp_elf_phdata);
         allow_write_access(interpreter);
         if (interpreter)
@@@ -1317,7 -1450,7 +1450,7 @@@ static unsigned long vma_dump_size(stru
         }
   
         /* Hugetlb memory check */
- -      if (vma->vm_flags & VM_HUGETLB) {
+ +      if (is_vm_hugetlb_page(vma)) {
                 if ((vma->vm_flags & VM_SHARED) && FILTER(HUGETLB_SHARED))
                         goto whole;
                 if (!(vma->vm_flags & VM_SHARED) && FILTER(HUGETLB_PRIVATE))
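Note on parse_elf_property() above: each property is a small header followed by pr_datasz bytes of data, padded to the property alignment, and types must be strictly increasing. The following user-space sketch walks such an array with the same bounds and ordering checks. It is not the kernel code: struct prop_hdr, PROP_ALIGN (assumed to be 8, as for 64-bit ELF), walk_properties() and the count_props() callback are hypothetical names, and the callback stands in for arch_parse_elf_property().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PROP_ALIGN 8u	/* assumption: alignment used for 64-bit ELF */

struct prop_hdr {	/* same two fields the diff reads from gnu_property */
	uint32_t pr_type;
	uint32_t pr_datasz;
};

typedef int (*prop_cb)(uint32_t type, const void *data, size_t datasz,
		       void *ctx);

static size_t round_up_sz(size_t v, size_t a)
{
	return (v + a - 1) / a * a;
}

/* Example callback: counts the properties it is shown. */
static int count_props(uint32_t type, const void *data, size_t datasz,
		       void *ctx)
{
	(void)type; (void)data; (void)datasz;
	(*(int *)ctx)++;
	return 0;
}

/* Walk a property array; returns 0 or a negative errno-style value. */
static int walk_properties(const unsigned char *buf, size_t len,
			   prop_cb cb, void *ctx)
{
	size_t off = 0;
	int have_prev = 0;
	uint32_t prev_type = 0;

	while (off < len) {
		struct prop_hdr hdr;
		size_t remaining = len - off;
		size_t step;
		int ret;

		assert(off % PROP_ALIGN == 0);
		if (remaining < sizeof(hdr))
			return -ENOEXEC;
		memcpy(&hdr, buf + off, sizeof(hdr));
		off += sizeof(hdr);
		remaining -= sizeof(hdr);

		/* Data and its padding must both fit in what is left. */
		if (hdr.pr_datasz > remaining)
			return -ENOEXEC;
		step = round_up_sz(hdr.pr_datasz, PROP_ALIGN);
		if (step > remaining)
			return -ENOEXEC;

		/* Properties must be unique and sorted on pr_type. */
		if (have_prev && hdr.pr_type <= prev_type)
			return -ENOEXEC;
		prev_type = hdr.pr_type;
		have_prev = 1;

		ret = cb(hdr.pr_type, buf + off, hdr.pr_datasz, ctx);
		if (ret)
			return ret;
		off += step;
	}
	return 0;
}
```

The sketch folds the kernel's -ENOENT "end of data" signal into the loop condition; the checks on each property are otherwise in the same order as in the diff.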
diff --combined fs/proc/task_mmu.c

index 8d382d4ec0672f32549ac9b2dd9d156fb16dd1da,1e3409c484d16cff89d60ff2e6e00841ce51d3fd..b73cdbb221e8d3d4cea62c5de916847a19f1c754
--- 1/fs/proc/task_mmu.c
--- 2/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@@ -123,14 -123,38 +123,14 @@@ static void release_task_mempolicy(stru
   }
   #endif
   
- -static void vma_stop(struct proc_maps_private *priv)
- -{
- -      struct mm_struct *mm = priv->mm;
- -
- -      release_task_mempolicy(priv);
- -      up_read(&mm->mmap_sem);
- -      mmput(mm);
- -}
- -
- -static struct vm_area_struct *
- -m_next_vma(struct proc_maps_private *priv, struct vm_area_struct *vma)
- -{
- -      if (vma == priv->tail_vma)
- -              return NULL;
- -      return vma->vm_next ?: priv->tail_vma;
- -}
- -
- -static void m_cache_vma(struct seq_file *m, struct vm_area_struct *vma)
- -{
- -      if (m->count < m->size) /* vma is copied successfully */
- -              m->version = m_next_vma(m->private, vma) ? vma->vm_end : -1UL;
- -}
- -
   static void *m_start(struct seq_file *m, loff_t *ppos)
   {
         struct proc_maps_private *priv = m->private;
- -      unsigned long last_addr = m->version;
+ +      unsigned long last_addr = *ppos;
         struct mm_struct *mm;
         struct vm_area_struct *vma;
- -      unsigned int pos = *ppos;
   
- -      /* See m_cache_vma(). Zero at the start or after lseek. */
+ +      /* See m_next(). Zero at the start or after lseek. */
         if (last_addr == -1UL)
                 return NULL;
   
@@@ -139,59 -163,64 +139,59 @@@
                 return ERR_PTR(-ESRCH);
   
         mm = priv->mm;
- -      if (!mm || !mmget_not_zero(mm))
+ +      if (!mm || !mmget_not_zero(mm)) {
+ +              put_task_struct(priv->task);
+ +              priv->task = NULL;
                 return NULL;
+ +      }
   
         if (down_read_killable(&mm->mmap_sem)) {
                 mmput(mm);
+ +              put_task_struct(priv->task);
+ +              priv->task = NULL;
                 return ERR_PTR(-EINTR);
         }
   
         hold_task_mempolicy(priv);
         priv->tail_vma = get_gate_vma(mm);
   
- -      if (last_addr) {
- -              vma = find_vma(mm, last_addr - 1);
- -              if (vma && vma->vm_start <= last_addr)
- -                      vma = m_next_vma(priv, vma);
- -              if (vma)
- -                      return vma;
- -      }
- -
- -      m->version = 0;
- -      if (pos < mm->map_count) {
- -              for (vma = mm->mmap; pos; pos--) {
- -                      m->version = vma->vm_start;
- -                      vma = vma->vm_next;
- -              }
+ +      vma = find_vma(mm, last_addr);
+ +      if (vma)
                 return vma;
- -      }
- -
- -      /* we do not bother to update m->version in this case */
- -      if (pos == mm->map_count && priv->tail_vma)
- -              return priv->tail_vma;
   
- -      vma_stop(priv);
- -      return NULL;
+ +      return priv->tail_vma;
   }
   
- -static void *m_next(struct seq_file *m, void *v, loff_t *pos)
+ +static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
   {
         struct proc_maps_private *priv = m->private;
- -      struct vm_area_struct *next;
+ +      struct vm_area_struct *next, *vma = v;
+ +
+ +      if (vma == priv->tail_vma)
+ +              next = NULL;
+ +      else if (vma->vm_next)
+ +              next = vma->vm_next;
+ +      else
+ +              next = priv->tail_vma;
+ +
+ +      *ppos = next ? next->vm_start : -1UL;
   
- -      (*pos)++;
- -      next = m_next_vma(priv, v);
- -      if (!next)
- -              vma_stop(priv);
         return next;
   }
   
   static void m_stop(struct seq_file *m, void *v)
   {
         struct proc_maps_private *priv = m->private;
+ +      struct mm_struct *mm = priv->mm;
   
- -      if (!IS_ERR_OR_NULL(v))
- -              vma_stop(priv);
- -      if (priv->task) {
- -              put_task_struct(priv->task);
- -              priv->task = NULL;
- -      }
+ +      if (!priv->task)
+ +              return;
+ +
+ +      release_task_mempolicy(priv);
+ +      up_read(&mm->mmap_sem);
+ +      mmput(mm);
+ +      put_task_struct(priv->task);
+ +      priv->task = NULL;
   }
   
   static int proc_maps_open(struct inode *inode, struct file *file,
@@@ -334,6 -363,7 +334,6 @@@ done
   static int show_map(struct seq_file *m, void *v)
   {
         show_map_vma(m, v);
- -      m_cache_vma(m, v);
         return 0;
   }
   
@@@ -638,6 -668,9 +638,9 @@@ static void show_smap_vma_flags(struct 
                 [ilog2(VM_ARCH_1)]      = "ar",
                 [ilog2(VM_WIPEONFORK)]  = "wf",
                 [ilog2(VM_DONTDUMP)]    = "dd",
+ #ifdef CONFIG_ARM64_BTI
+               [ilog2(VM_ARM64_BTI)]   = "bt",
+ #endif
   #ifdef CONFIG_MEM_SOFT_DIRTY
                 [ilog2(VM_SOFTDIRTY)]   = "sd",
   #endif
@@@ -817,6 -850,8 +820,6 @@@ static int show_smap(struct seq_file *m
                 seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
         show_smap_vma_flags(m, vma);
   
- -      m_cache_vma(m, vma);
- -
         return 0;
   }
   
@@@ -1855,6 -1890,7 +1858,6 @@@ static int show_numa_map(struct seq_fil
         seq_printf(m, " kernelpagesize_kB=%lu", vma_kernel_pagesize(vma) >> 10);
   out:
         seq_putc(m, '\n');
- -      m_cache_vma(m, vma);
         return 0;
   }
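Note on the m_start()/m_next() rewrite above: the seq_file position is now an address rather than an index. m_next() stores the start address of the next VMA (or -1UL at the end), and m_start() resumes with find_vma() on that address. The sketch below models this in user space with a sorted array of ranges. All names (struct range, find_range(), iter_start(), iter_next(), count_in_chunks()) are hypothetical, and the gate/tail VMA handling is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define POS_END (~0UL)	/* mirrors the -1UL sentinel in the diff */

struct range {		/* toy stand-in for a VMA: [start, end) */
	unsigned long start, end;
};

/* Stand-in for find_vma(): first range whose end lies above addr. */
static const struct range *find_range(const struct range *r, size_t n,
				      unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (r[i].end > addr)
			return &r[i];
	return NULL;
}

/* Resume iteration from an address-valued position. */
static const struct range *iter_start(const struct range *r, size_t n,
				      unsigned long pos)
{
	if (pos == POS_END)
		return NULL;
	return find_range(r, n, pos);
}

/* Advance and record where a later iter_start() should resume. */
static const struct range *iter_next(const struct range *r, size_t n,
				     const struct range *cur,
				     unsigned long *pos)
{
	const struct range *next = (cur + 1 < r + n) ? cur + 1 : NULL;

	*pos = next ? next->start : POS_END;
	return next;
}

/* Read all ranges in chunks, restarting from pos after each chunk. */
static size_t count_in_chunks(const struct range *r, size_t n, size_t chunk)
{
	unsigned long pos = 0;
	size_t seen = 0;
	const struct range *cur;

	assert(chunk > 0);
	while ((cur = iter_start(r, n, pos)) != NULL) {
		for (size_t i = 0; cur && i < chunk; i++) {
			seen++;
			cur = iter_next(r, n, cur, &pos);
		}
	}
	return seen;
}
```

Because the resume key is an address, a restart lands on the same range even when a read stops partway through the list, which is the property the removed m->version caching was providing.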
   
diff --combined include/linux/mm.h

index 5a323422d783d076c01b41b2a9a1f4bbd7d1a6a5,9e5fce1b2099de4728b301fb03f67e759ae70ae4..b61ca546eea4fcbd8e12b6c258d05e6c02c97189
--- 1/include/linux/mm.h
--- 2/include/linux/mm.h
+++ b/include/linux/mm.h
@@@ -27,7 -27,6 +27,7 @@@
   #include <linux/memremap.h>
   #include <linux/overflow.h>
   #include <linux/sizes.h>
+ +#include <linux/sched.h>
   
   struct mempolicy;
   struct anon_vma;
@@@ -325,6 -324,9 +325,9 @@@ extern unsigned int kobjsize(const voi
   #elif defined(CONFIG_SPARC64)
   # define VM_SPARC_ADI VM_ARCH_1       /* Uses ADI tag for access control */
   # define VM_ARCH_CLEAR        VM_SPARC_ADI
+ #elif defined(CONFIG_ARM64)
+ # define VM_ARM64_BTI VM_ARCH_1       /* BTI guarded page, a.k.a. GP bit */
+ # define VM_ARCH_CLEAR        VM_ARM64_BTI
   #elif !defined(CONFIG_MMU)
   # define VM_MAPPED_COPY       VM_ARCH_1       /* T if mapped copy of data (nommu mmap) */
   #endif
@@@ -343,20 -345,6 +346,20 @@@
   /* Bits set in the VMA until the stack is in its final location */
   #define VM_STACK_INCOMPLETE_SETUP     (VM_RAND_READ | VM_SEQ_READ)
   
+ +#define TASK_EXEC ((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0)
+ +
+ +/* Common data flag combinations */
+ +#define VM_DATA_FLAGS_TSK_EXEC        (VM_READ | VM_WRITE | TASK_EXEC | \
+ +                               VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
+ +#define VM_DATA_FLAGS_NON_EXEC        (VM_READ | VM_WRITE | VM_MAYREAD | \
+ +                               VM_MAYWRITE | VM_MAYEXEC)
+ +#define VM_DATA_FLAGS_EXEC    (VM_READ | VM_WRITE | VM_EXEC | \
+ +                               VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
+ +
+ +#ifndef VM_DATA_DEFAULT_FLAGS         /* arch can override this */
+ +#define VM_DATA_DEFAULT_FLAGS  VM_DATA_FLAGS_EXEC
+ +#endif
+ +
   #ifndef VM_STACK_DEFAULT_FLAGS                /* arch can override this */
   #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
   #endif
@@@ -369,18 -357,12 +372,18 @@@
   
   #define VM_STACK_FLAGS        (VM_STACK | VM_STACK_DEFAULT_FLAGS | VM_ACCOUNT)
   
+ +/* VMA basic access permission flags */
+ +#define VM_ACCESS_FLAGS (VM_READ | VM_WRITE | VM_EXEC)
+ +
+ +
   /*
    * Special vmas that are non-mergeable, non-mlock()able.
- - * Note: mm/huge_memory.c VM_NO_THP depends on this definition.
    */
   #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)
   
+ +/* This mask prevents VMA from being scanned with khugepaged */
+ +#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB)
+ +
   /* This mask defines which mm->def_flags a process can inherit from its parent */
   #define VM_INIT_DEF_MASK      VM_NOHUGEPAGE
   
@@@ -399,75 -381,15 +402,75 @@@
    */
   extern pgprot_t protection_map[16];
   
- -#define FAULT_FLAG_WRITE      0x01    /* Fault was a write access */
- -#define FAULT_FLAG_MKWRITE    0x02    /* Fault was mkwrite of existing pte */
- -#define FAULT_FLAG_ALLOW_RETRY        0x04    /* Retry fault if blocking */
- -#define FAULT_FLAG_RETRY_NOWAIT       0x08    /* Don't drop mmap_sem and wait when retrying */
- -#define FAULT_FLAG_KILLABLE   0x10    /* The fault task is in SIGKILL killable region */
- -#define FAULT_FLAG_TRIED      0x20    /* Second try */
- -#define FAULT_FLAG_USER               0x40    /* The fault originated in userspace */
- -#define FAULT_FLAG_REMOTE     0x80    /* faulting for non current tsk/mm */
- -#define FAULT_FLAG_INSTRUCTION  0x100 /* The fault was during an instruction fetch */
+ +/**
+ + * Fault flag definitions.
+ + *
+ + * @FAULT_FLAG_WRITE: Fault was a write fault.
+ + * @FAULT_FLAG_MKWRITE: Fault was mkwrite of existing PTE.
+ + * @FAULT_FLAG_ALLOW_RETRY: Allow to retry the fault if blocked.
+ + * @FAULT_FLAG_RETRY_NOWAIT: Don't drop mmap_sem and wait when retrying.
+ + * @FAULT_FLAG_KILLABLE: The fault task is in SIGKILL killable region.
+ + * @FAULT_FLAG_TRIED: The fault has been tried once.
+ + * @FAULT_FLAG_USER: The fault originated in userspace.
+ + * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
+ + * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
+ + * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ + *
+ + * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
+ + * whether we would allow page faults to retry by specifying these two
+ + * fault flags correctly.  Currently there can be three legal combinations:
+ + *
+ + * (a) ALLOW_RETRY and !TRIED:  this means the page fault allows retry, and
+ + *                              this is the first try
+ + *
+ + * (b) ALLOW_RETRY and TRIED:   this means the page fault allows retry, and
+ + *                              we've already tried at least once
+ + *
+ + * (c) !ALLOW_RETRY and !TRIED: this means the page fault does not allow retry
+ + *
+ + * The unlisted combination (!ALLOW_RETRY && TRIED) is illegal and should never
+ + * be used.  Note that page faults can be allowed to retry multiple times,
+ + * in which case we'll have an initial fault with flags (a) then later on
+ + * continuous faults with flags (b).  We should always try to detect pending
+ + * signals before a retry to make sure the continuous page faults can still be
+ + * interrupted if necessary.
+ + */
+ +#define FAULT_FLAG_WRITE                      0x01
+ +#define FAULT_FLAG_MKWRITE                    0x02
+ +#define FAULT_FLAG_ALLOW_RETRY                        0x04
+ +#define FAULT_FLAG_RETRY_NOWAIT                       0x08
+ +#define FAULT_FLAG_KILLABLE                   0x10
+ +#define FAULT_FLAG_TRIED                      0x20
+ +#define FAULT_FLAG_USER                               0x40
+ +#define FAULT_FLAG_REMOTE                     0x80
+ +#define FAULT_FLAG_INSTRUCTION                0x100
+ +#define FAULT_FLAG_INTERRUPTIBLE              0x200
+ +
+ +/*
+ + * The default fault flags that should be used by most of the
+ + * arch-specific page fault handlers.
+ + */
+ +#define FAULT_FLAG_DEFAULT  (FAULT_FLAG_ALLOW_RETRY | \
+ +                           FAULT_FLAG_KILLABLE | \
+ +                           FAULT_FLAG_INTERRUPTIBLE)
+ +
+ +/**
+ + * fault_flag_allow_retry_first - check ALLOW_RETRY the first time
+ + *
+ + * This is mostly used for places where we want to try to avoid taking
+ + * the mmap_sem for too long a time when waiting for another condition
+ + * to change, in which case we can try to be polite to release the
+ + * mmap_sem in the first round to avoid potential starvation of other
+ + * processes that would also want the mmap_sem.
+ + *
+ + * Return: true if the page fault allows retry and this is the first
+ + * attempt of the fault handling; false otherwise.
+ + */
+ +static inline bool fault_flag_allow_retry_first(unsigned int flags)
+ +{
+ +      return (flags & FAULT_FLAG_ALLOW_RETRY) &&
+ +          (!(flags & FAULT_FLAG_TRIED));
+ +}
   
   #define FAULT_FLAG_TRACE \
         { FAULT_FLAG_WRITE,             "WRITE" }, \
@@@ -478,8 -400,7 +481,8 @@@
         { FAULT_FLAG_TRIED,             "TRIED" }, \
         { FAULT_FLAG_USER,              "USER" }, \
         { FAULT_FLAG_REMOTE,            "REMOTE" }, \
- -      { FAULT_FLAG_INSTRUCTION,       "INSTRUCTION" }
+ +      { FAULT_FLAG_INSTRUCTION,       "INSTRUCTION" }, \
+ +      { FAULT_FLAG_INTERRUPTIBLE,     "INTERRUPTIBLE" }
   
   /*
    * vm_fault is filled by the pagefault handler and passed to the vma's
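Note on the fault-flag comment above: it names three legal combinations of ALLOW_RETRY and TRIED and one illegal one. The sketch below restates that table as code. fault_flag_allow_retry_first() is copied from the diff; fault_flags_legal() and fault_flags_for_retry() are hypothetical helpers added for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define FAULT_FLAG_ALLOW_RETRY	0x04
#define FAULT_FLAG_TRIED	0x20

/* As in the diff: retry is allowed and this is the first attempt. */
static bool fault_flag_allow_retry_first(unsigned int flags)
{
	return (flags & FAULT_FLAG_ALLOW_RETRY) &&
	    (!(flags & FAULT_FLAG_TRIED));
}

/* Hypothetical: only (!ALLOW_RETRY && TRIED) is illegal. */
static bool fault_flags_legal(unsigned int flags)
{
	return !((flags & FAULT_FLAG_TRIED) &&
		 !(flags & FAULT_FLAG_ALLOW_RETRY));
}

/* Hypothetical: move from combination (a) to (b) before a retry. */
static unsigned int fault_flags_for_retry(unsigned int flags)
{
	assert(flags & FAULT_FLAG_ALLOW_RETRY);
	return flags | FAULT_FLAG_TRIED;
}
```

With these definitions, combination (a) is 0x04, (b) is 0x24, (c) is 0x00, and the illegal case is 0x20 on its own.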
@@@ -623,36 -544,6 +626,36 @@@ static inline bool vma_is_anonymous(str
         return !vma->vm_ops;
   }
   
+ +static inline bool vma_is_temporary_stack(struct vm_area_struct *vma)
+ +{
+ +      int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP);
+ +
+ +      if (!maybe_stack)
+ +              return false;
+ +
+ +      if ((vma->vm_flags & VM_STACK_INCOMPLETE_SETUP) ==
+ +                                              VM_STACK_INCOMPLETE_SETUP)
+ +              return true;
+ +
+ +      return false;
+ +}
+ +
+ +static inline bool vma_is_foreign(struct vm_area_struct *vma)
+ +{
+ +      if (!current->mm)
+ +              return true;
+ +
+ +      if (current->mm != vma->vm_mm)
+ +              return true;
+ +
+ +      return false;
+ +}
+ +
+ +static inline bool vma_is_accessible(struct vm_area_struct *vma)
+ +{
+ +      return vma->vm_flags & VM_ACCESS_FLAGS;
+ +}
+ +
   #ifdef CONFIG_SHMEM
   /*
    * The vma_is_shmem is not inline because it is used only by slow
@@@ -882,24 -773,6 +885,24 @@@ static inline unsigned int compound_ord
         return page[1].compound_order;
   }
   
+ +static inline bool hpage_pincount_available(struct page *page)
+ +{
+ +      /*
+ +       * Can the page->hpage_pinned_refcount field be used? That field is in
+ +       * the 3rd page of the compound page, so the smallest (2-page) compound
+ +       * pages cannot support it.
+ +       */
+ +      page = compound_head(page);
+ +      return PageCompound(page) && compound_order(page) > 1;
+ +}
+ +
+ +static inline int compound_pincount(struct page *page)
+ +{
+ +      VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
+ +      page = compound_head(page);
+ +      return atomic_read(compound_pincount_ptr(page));
+ +}
+ +
   static inline void set_compound_order(struct page *page, unsigned int order)
   {
         page[1].compound_order = order;
@@@ -1131,8 -1004,6 +1134,8 @@@ static inline void get_page(struct pag
         page_ref_inc(page);
   }
   
+ +bool __must_check try_grab_page(struct page *page, unsigned int flags);
+ +
   static inline __must_check bool try_get_page(struct page *page)
   {
         page = compound_head(page);
@@@ -1161,87 -1032,29 +1164,87 @@@ static inline void put_page(struct pag
                 __put_page(page);
   }
   
- -/**
- - * unpin_user_page() - release a gup-pinned page
- - * @page:            pointer to page to be released
+ +/*
+ + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
+ + * the page's refcount so that two separate items are tracked: the original page
+ + * reference count, and also a new count of how many pin_user_pages() calls were
+ + * made against the page. ("gup-pinned" is another term for the latter).
+ + *
+ + * With this scheme, pin_user_pages() becomes special: such pages are marked as
+ + * distinct from normal pages. As such, the unpin_user_page() call (and its
+ + * variants) must be used in order to release gup-pinned pages.
    *
- - * Pages that were pinned via pin_user_pages*() must be released via either
- - * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
- - * that eventually such pages can be separately tracked and uniquely handled. In
- - * particular, interactions with RDMA and filesystems need special handling.
+ + * Choice of value:
    *
- - * unpin_user_page() and put_page() are not interchangeable, despite this early
- - * implementation that makes them look the same. unpin_user_page() calls must
- - * be perfectly matched up with pin*() calls.
+ + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
+ + * counts with respect to pin_user_pages() and unpin_user_page() becomes
+ + * simpler, due to the fact that adding an even power of two to the page
+ + * refcount has the effect of using only the upper N bits, for the code that
+ + * counts up using the bias value. This means that the lower bits are left for
+ + * the exclusive use of the original code that increments and decrements by one
+ + * (or at least, by much smaller values than the bias value).
+ + *
+ + * Of course, once the lower bits overflow into the upper bits (and this is
+ + * OK, because subtraction recovers the original values), then visual inspection
+ + * no longer suffices to directly view the separate counts. However, for normal
+ + * applications that don't have huge page reference counts, this won't be an
+ + * issue.
+ + *
+ + * Locking: the lockless algorithm described in page_cache_get_speculative()
+ + * and page_cache_gup_pin_speculative() provides safe operation for
+ + * get_user_pages and page_mkclean and other calls that race to set up page
+ + * table entries.
    */
- -static inline void unpin_user_page(struct page *page)
- -{
- -      put_page(page);
- -}
+ +#define GUP_PIN_COUNTING_BIAS (1U << 10)
   
+ +void unpin_user_page(struct page *page);
   void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
                                  bool make_dirty);
- -
   void unpin_user_pages(struct page **pages, unsigned long npages);
   
+ +/**
+ + * page_maybe_dma_pinned() - report if a page is pinned for DMA.
+ + *
+ + * This function checks if a page has been pinned via a call to
+ + * pin_user_pages*().
+ + *
+ + * For non-huge pages, the return value is partially fuzzy: false is not fuzzy,
+ + * because it means "definitely not pinned for DMA", but true means "probably
+ + * pinned for DMA, but possibly a false positive due to having at least
+ + * GUP_PIN_COUNTING_BIAS worth of normal page references".
+ + *
+ + * False positives are OK, because: a) it's unlikely for a page to get that many
+ + * refcounts, and b) all the callers of this routine are expected to be able to
+ + * deal gracefully with a false positive.
+ + *
+ + * For huge pages, the result will be exactly correct. That's because we have
+ + * more tracking data available: the 3rd struct page in the compound page is
+ + * used to track the pincount (instead of using the GUP_PIN_COUNTING_BIAS
+ + * scheme).
+ + *
+ + * For more information, please see Documentation/vm/pin_user_pages.rst.
+ + *
+ + * @page:     pointer to page to be queried.
+ + * @Return:   True, if it is likely that the page has been "dma-pinned".
+ + *            False, if the page is definitely not dma-pinned.
+ + */
+ +static inline bool page_maybe_dma_pinned(struct page *page)
+ +{
+ +      if (hpage_pincount_available(page))
+ +              return compound_pincount(page) > 0;
+ +
+ +      /*
+ +       * page_ref_count() is signed. If that refcount overflows, then
+ +       * page_ref_count() returns a negative value, and callers will avoid
+ +       * further incrementing the refcount.
+ +       *
+ +       * Here, for that overflow case, use the signed bit to count a little
+ +       * bit higher via unsigned math, and thus still get an accurate result.
+ +       */
+ +      return ((unsigned int)page_ref_count(compound_head(page))) >=
+ +              GUP_PIN_COUNTING_BIAS;
+ +}
+ +
   #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
   #define SECTION_IN_PAGE_FLAGS
   #endif
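Note on GUP_PIN_COUNTING_BIAS above: for non-huge pages, a pin adds the bias to the ordinary refcount, so "maybe pinned" is simply "refcount is at least the bias", which can be a false positive when a page has that many plain references. The toy model below shows both cases. struct fake_page and the fake_*() helpers are hypothetical user-space stand-ins, not kernel APIs, and they ignore atomicity and compound pages.

```c
#include <assert.h>
#include <stdbool.h>

#define GUP_PIN_COUNTING_BIAS (1U << 10)

struct fake_page {	/* toy stand-in for struct page */
	int refcount;
};

/* Ordinary reference: counts by one. */
static void fake_get(struct fake_page *p)
{
	p->refcount += 1;
}

static void fake_put(struct fake_page *p)
{
	assert(p->refcount >= 1);
	p->refcount -= 1;
}

/* Pin: counts by the bias, so pins live in the upper bits. */
static void fake_pin(struct fake_page *p)
{
	p->refcount += (int)GUP_PIN_COUNTING_BIAS;
}

static void fake_unpin(struct fake_page *p)
{
	assert((unsigned int)p->refcount >= GUP_PIN_COUNTING_BIAS);
	p->refcount -= (int)GUP_PIN_COUNTING_BIAS;
}

/* Same comparison as the non-huge path of page_maybe_dma_pinned(). */
static bool fake_maybe_dma_pinned(const struct fake_page *p)
{
	return (unsigned int)p->refcount >= GUP_PIN_COUNTING_BIAS;
}
```

A false result is exact, while a true result only says that the count has reached the bias by some mix of pins and plain references.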
@@@ -1789,26 -1602,9 +1792,26 @@@ extern unsigned long move_page_tables(s
                 unsigned long old_addr, struct vm_area_struct *new_vma,
                 unsigned long new_addr, unsigned long len,
                 bool need_rmap_locks);
+ +
+ +/*
+ + * Flags used by change_protection().  For now we make it a bitmap so
+ + * that we can pass in multiple flags just like parameters.  However
+ + * for now all the callers only use one of the flags at the same
+ + * time.
+ + */
+ +/* Whether we should allow dirty bit accounting */
+ +#define  MM_CP_DIRTY_ACCT                  (1UL << 0)
+ +/* Whether this protection change is for NUMA hints */
+ +#define  MM_CP_PROT_NUMA                   (1UL << 1)
+ +/* Whether this change is for write protecting */
+ +#define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
+ +#define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
+ +#define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
+ +                                          MM_CP_UFFD_WP_RESOLVE)
+ +
   extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
                               unsigned long end, pgprot_t newprot,
- -                            int dirty_accountable, int prot_numa);
+ +                            unsigned long cp_flags);
   extern int mprotect_fixup(struct vm_area_struct *vma,
                           struct vm_area_struct **pprev, unsigned long start,
                           unsigned long end, unsigned long newflags);
@@@ -1927,18 -1723,6 +1930,18 @@@ static inline void sync_mm_rss(struct m
   }
   #endif
   
+ +#ifndef CONFIG_ARCH_HAS_PTE_SPECIAL
+ +static inline int pte_special(pte_t pte)
+ +{
+ +      return 0;
+ +}
+ +
+ +static inline pte_t pte_mkspecial(pte_t pte)
+ +{
+ +      return pte;
+ +}
+ +#endif
+ +
   #ifndef CONFIG_ARCH_HAS_PTE_DEVMAP
   static inline int pte_devmap(pte_t pte)
   {
@@@ -2583,7 -2367,26 +2586,7 @@@ struct vm_unmapped_area_info 
         unsigned long align_offset;
   };
   
- -extern unsigned long unmapped_area(struct vm_unmapped_area_info *info);
- -extern unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info);
- -
- -/*
- - * Search for an unmapped address range.
- - *
- - * We are looking for a range that:
- - * - does not intersect with any VMA;
- - * - is contained within the [low_limit, high_limit) interval;
- - * - is at least the desired size.
- - * - satisfies (begin_addr & align_mask) == (align_offset & align_mask)
- - */
- -static inline unsigned long
- -vm_unmapped_area(struct vm_unmapped_area_info *info)
- -{
- -      if (info->flags & VM_UNMAPPED_AREA_TOPDOWN)
- -              return unmapped_area_topdown(info);
- -      else
- -              return unmapped_area(info);
- -}
+ +extern unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info);
   
   /* truncate.c */
   extern void truncate_inode_pages(struct address_space *, loff_t);
@@@ -2719,8 -2522,6 +2722,8 @@@ struct vm_area_struct *find_extend_vma(
   int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
                         unsigned long pfn, unsigned long size, pgprot_t);
   int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
+ +int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
+ +                      struct page **pages, unsigned long *num);
   int vm_map_pages(struct vm_area_struct *vma, struct page **pages,
                                 unsigned long num);
   int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
@@@ -2917,10 -2718,6 +2920,10 @@@ static inline bool debug_pagealloc_enab
   #if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_ARCH_HAS_SET_DIRECT_MAP)
   extern void __kernel_map_pages(struct page *page, int numpages, int enable);
   
+ +/*
+ + * When called in DEBUG_PAGEALLOC context, the call should most likely be
+ + * guarded by debug_pagealloc_enabled() or debug_pagealloc_enabled_static()
+ + */
   static inline void
   kernel_map_pages(struct page *page, int numpages, int enable)
   {
@@@ -3069,23 -2866,6 +3072,23 @@@ extern long copy_huge_page_from_user(st
                                 const void __user *usr_src,
                                 unsigned int pages_per_huge_page,
                                 bool allow_pagefault);
+ +
+ +/**
+ + * vma_is_special_huge - Are transhuge page-table entries considered special?
+ + * @vma: Pointer to the struct vm_area_struct to consider
+ + *
+ + * Whether transhuge page-table entries are considered "special" following
+ + * the definition in vm_normal_page().
+ + *
+ + * Return: true if transhuge page-table entries should be considered special,
+ + * false otherwise.
+ + */
+ +static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
+ +{
+ +      return vma_is_dax(vma) || (vma->vm_file &&
+ +                                 (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)));
+ +}
+ +
   #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
   
   #ifdef CONFIG_DEBUG_PAGEALLOC
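
The comment on page_maybe_dma_pinned() above describes a refcount-bias scheme for non-huge pages. The following userspace sketch models only that arithmetic and is not kernel code. The names model_pin() and model_maybe_dma_pinned() are hypothetical, and the bias value of 1U << 10 is an assumption, since this hunk does not show the definition of GUP_PIN_COUNTING_BIAS.

```c
#include <stdbool.h>

/* Assumed value for illustration; the hunk above does not define it. */
#define GUP_PIN_COUNTING_BIAS (1U << 10)

/* Hypothetical model: a pin adds the whole bias to the refcount. */
static int model_pin(int refcount)
{
	return refcount + (int)GUP_PIN_COUNTING_BIAS;
}

/*
 * Hypothetical model of the non-huge-page check: the signed refcount is
 * compared as unsigned, so a negative (overflowed) count compares as a
 * very large value and still reports "maybe pinned".
 */
static bool model_maybe_dma_pinned(int refcount)
{
	return (unsigned int)refcount >= GUP_PIN_COUNTING_BIAS;
}
```

Under these assumptions, a page with one ordinary reference reports false, the same page after model_pin() reports true, and a page that merely holds GUP_PIN_COUNTING_BIAS ordinary references also reports true, which is the false positive the comment says callers must tolerate.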
Documentation/filesystems/proc.rst
arch/arm64/Kconfig
arch/arm64/include/asm/cpucaps.h
arch/arm64/include/asm/cpufeature.h
arch/arm64/include/asm/esr.h
arch/arm64/include/asm/kvm_emulate.h
arch/arm64/include/asm/sysreg.h
arch/arm64/kernel/cpufeature.c
arch/arm64/kernel/entry-common.c
arch/arm64/kernel/process.c
arch/arm64/kernel/ptrace.c
fs/binfmt_elf.c
fs/proc/task_mmu.c
include/linux/mm.h