Git Repo - linux.git/log

media: atomisp: get rid of mmgr_load and mmgr_store

Those functions are just wrappers for hmm_load/hmm_store.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>

media: atomisp: go one step further to drop ia_css_memory_access.c

Move the attrs handling into hmm, simplifying even further
what the ia_css_memory_access.c file does.

Yet, the returned type for ia_css_memory_access.c is an
integer, instead of a pointer.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>

media: atomisp: reduce abstraction at ia_css_memory_access

Yet another memory abstraction layer. Getting rid of this
may be a little trickier, but let's reduce it to a minimal.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>

media: atomisp: get rid of the hrt/hive_isp_css_mm_hrt abstraction layer

Simplify the code by removing this extra memory management
abstraction layer.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>

media: atomisp: simplify hive_isp_css_mm_hrt wrapper

The code there is a wrapper for hmm/ wrapper. Simplify it,
and get rid of ION-specific code.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>

media: atomisp: fix a handful of spelling mistakes

There are several spelling mistakes in various messages and literal
strings. Fix these.

Signed-off-by: Colin Ian King <[email protected]>
Acked-by: Sakari Ailus <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

media: atomisp: fix size of delay_frames array

Right now, the variables that define the max number of
delay frames is defined as:

#define VIDEO_FRAME_DELAY 2
#define MAX_NUM_VIDEO_DELAY_FRAMES (VIDEO_FRAME_DELAY + 1)
#define NUM_PREVIEW_DVS_FRAMES (2)
#define MAX_NUM_DELAY_FRAMES MAX(MAX_NUM_VIDEO_DELAY_FRAMES, NUM_PREVIEW_DVS_FRAMES)

In other words, we have:
MAX_NUM_VIDEO_DELAY_FRAMES = 3
MAX_NUM_DELAY_FRAMES = 2

The MAX_NUM_DELAY_FRAMES macro is used only only when allocating
memory. On all other parts, including looping over such array,
MAX_NUM_VIDEO_DELAY_FRAMES is used instead, like:

void sh_css_binary_args_reset(struct sh_css_binary_args *args)
{
unsigned int i;
...

for (i = 0; i < MAX_NUM_VIDEO_DELAY_FRAMES; i++)
args->delay_frames[i] = NULL;

Which will cause buffer overflows, with may override the next array
(tnr_frames[]).

In practice, this may not be causing real issues, as the code
checks for num_delay_frames on some parts (but not everywhere).

So, get rid of the smallest value.

Acked-by: Sakari Ailus <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

media: atomisp: drop a cast for a const argument

Some arguments for tnf and ref settings are meant to be const, but
they're defined without such annotation. Due to that, there's an
ugly cast at sh_css_sp.c.

Acked-by: Sakari Ailus <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

media: atomisp: partially get rid of one abstraction layer

The very same macros are defined as CSS_foo and IA_CSS_foo.

Remove this abstraction, as it just make things confusing,
for no good reason.

Acked-by: Sakari Ailus <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

media: atomisp: make it use dbg_level to control debug level

This driver has 3 different types of debug messages:

- dev_dbg()
- dbg_level
- ia_css_debug_trace_level

Which is crazy. Ideally, it shold just use dev_dbg()
everywhere, but for now let's unify the last two machanisms.

Acked-by: Sakari Ailus <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

media: atomisp: get rid of some old broken debug code

It sounds that someone once changed the debug level at compile
time for some testing, but forgot to remove the legacy code after
finishing debuging it.

Get rid of the dead code.

Acked-by: Sakari Ailus <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

media: atomisp: update TODO list

Let's reflect the current status at the TODO list, as other
developers can help addressing issues over there.

Acked-by: Sakari Ailus <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

media: atomisp: don't flood dmesg with -EAGAIN return codes

Using DQBUF on non-blocking mode will return -EAGAIN
if nothing arrives. Printing it has no value, even for debug
purposes. So, only display real return codes.

Acked-by: Sakari Ailus <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

media: atomisp: improve debug messages for set format

There are several error conditions that don't print anything,
making harder to identify bugs at the code there.

Acked-by: Sakari Ailus <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

media: atomisp: avoid a copy of v4l2_mbus_framefmt at stack

There's no reason to copy isp_sink_fmt, as the driver
uses it for read-only purposes.

Linux stack is a precious resource. Let's avoid wasting it.

Acked-by: Sakari Ailus <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

media: atomisp: reduce debug printk rate when IRQs are received

Currently, when an EOF IRQ is received, it generates two messages:

[ 59.191893] atomisp-isp2 0000:00:03.0: irq:0x200000
[ 59.191913] atomisp-isp2 0000:00:03.0: atomisp_isr EOF exp_id 142, asd 0

Flooding the dmesg with lots of messages per second. The same
pattern happens for all other IRQs.

Change the logic for printing just one message per IRQ and
rate-limit those, as, for debugging purposes, it is usually
interesting to know that IRQs are being received, but not
displaying every single one.

Acked-by: Sakari Ailus <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

media: atomisp: get rid of hmm_vm.c

This is not used anywhere. So, let's trash it.

Acked-by: Sakari Ailus <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

media: atomisp: fix pipeline initialization code

The code under load_primary_binaries() is complex and
were hard to understand, because it used to have lots
of ifdefs and broken identation.

The patch which cleaned it and removed the version-specific
ifdefs added a regression.

Solve it.

Fixes: 3c0538fbad9f ("media: atomisp: get rid of most checks for ISP2401 version")
Signed-off-by: Mauro Carvalho Chehab <[email protected]>

io_uring: fix io_kiocb.flags modification race in IOPOLL mode

While testing io_uring in arm, we found sometimes io_sq_thread() keeps
polling io requests even though there are not inflight io requests in
block layer. After some investigations, found a possible race about
io_kiocb.flags, see below race codes:
  1) in the end of io_write() or io_read()
    req->flags &= ~REQ_F_NEED_CLEANUP;
    kfree(iovec);
    return ret;

  2) in io_complete_rw_iopoll()
    if (res != -EAGAIN)
        req->flags |= REQ_F_IOPOLL_COMPLETED;

In IOPOLL mode, io requests still maybe completed by interrupt, then
above codes are not safe, concurrent modifications to req->flags, which
is not protected by lock or is not atomic modifications. I also had
disassemble io_complete_rw_iopoll() in arm:
   req->flags |= REQ_F_IOPOLL_COMPLETED;
   0xffff000008387b18 <+76>:    ldr     w0, [x19,#104]
   0xffff000008387b1c <+80>:    orr     w0, w0, #0x1000
   0xffff000008387b20 <+84>:    str     w0, [x19,#104]

Seems that the "req->flags |= REQ_F_IOPOLL_COMPLETED;" is  load and
modification, two instructions, which obviously is not atomic.

To fix this issue, add a new iopoll_completed in io_kiocb to indicate
whether io request is completed.

Signed-off-by: Xiaoguang Wang <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

KVM: nVMX: Consult only the "basic" exit reason when routing nested exit

Consult only the basic exit reason, i.e. bits 15:0 of vmcs.EXIT_REASON,
when determining whether a nested VM-Exit should be reflected into L1 or
handled by KVM in L0.

For better or worse, the switch statement in nested_vmx_exit_reflected()
currently defaults to "true", i.e. reflects any nested VM-Exit without
dedicated logic.  Because the case statements only contain the basic
exit reason, any VM-Exit with modifier bits set will be reflected to L1,
even if KVM intended to handle it in L0.

Practically speaking, this only affects EXIT_REASON_MCE_DURING_VMENTRY,
i.e. a #MC that occurs on nested VM-Enter would be incorrectly routed to
L1, as "failed VM-Entry" is the only modifier that KVM can currently
encounter.  The SMM modifiers will never be generated as KVM doesn't
support/employ a SMI Transfer Monitor.  Ditto for "exit from enclave",
as KVM doesn't yet support virtualizing SGX, i.e. it's impossible to
enter an enclave in a KVM guest (L1 or L2).

Fixes: 644d711aa0e1 ("KVM: nVMX: Deciding if L0 or L1 should handle an L2 exit")
Cc: Jim Mattson <[email protected]>
Cc: Xiaoyao Li <[email protected]>
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20200227174430 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

spi: spi-fsl-dspi: Free DMA memory with matching function

Driver allocates DMA memory with dma_alloc_coherent() but frees it with
dma_unmap_single().

This causes DMA warning during system shutdown (with DMA debugging) on
Toradex Colibri VF50 module:

    WARNING: CPU: 0 PID: 1 at ../kernel/dma/debug.c:1036 check_unmap+0x3fc/0xb04
    DMA-API: fsl-edma 40098000.dma-controller: device driver frees DMA memory with wrong function
      [device address=0x0000000087040000] [size=8 bytes] [mapped as coherent] [unmapped as single]
    Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
      (unwind_backtrace) from [<8010bb34>] (show_stack+0x10/0x14)
      (show_stack) from [<8011ced8>] (__warn+0xf0/0x108)
      (__warn) from [<8011cf64>] (warn_slowpath_fmt+0x74/0xb8)
      (warn_slowpath_fmt) from [<8017d170>] (check_unmap+0x3fc/0xb04)
      (check_unmap) from [<8017d900>] (debug_dma_unmap_page+0x88/0x90)
      (debug_dma_unmap_page) from [<80601d68>] (dspi_release_dma+0x88/0x110)
      (dspi_release_dma) from [<80601e4c>] (dspi_shutdown+0x5c/0x80)
      (dspi_shutdown) from [<805845f8>] (device_shutdown+0x17c/0x220)
      (device_shutdown) from [<80143ef8>] (kernel_restart+0xc/0x50)
      (kernel_restart) from [<801441cc>] (__do_sys_reboot+0x18c/0x210)
      (__do_sys_reboot) from [<80100060>] (ret_fast_syscall+0x0/0x28)
    DMA-API: Mapped at:
     dma_alloc_attrs+0xa4/0x130
     dspi_probe+0x568/0x7b4
     platform_drv_probe+0x6c/0xa4
     really_probe+0x208/0x348
     driver_probe_device+0x5c/0xb4

Fixes: 90ba37033cb9 ("spi: spi-fsl-dspi: Add DMA support for Vybrid")
Signed-off-by: Krzysztof Kozlowski <[email protected]>
Acked-by: Vladimir Oltean <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

spi: tools: Add macro definitions to fix build errors

Add SPI_TX_OCTAL and SPI_RX_OCTAL to fix the following build errors:

CC       spidev_test.o
spidev_test.c: In function ‘transfer’:
spidev_test.c:131:13: error: ‘SPI_TX_OCTAL’ undeclared (first use in this function)
  if (mode & SPI_TX_OCTAL)
             ^
spidev_test.c:131:13: note: each undeclared identifier is reported only once for each function it appears in
spidev_test.c:137:13: error: ‘SPI_RX_OCTAL’ undeclared (first use in this function)
  if (mode & SPI_RX_OCTAL)
             ^
spidev_test.c: In function ‘parse_opts’:
spidev_test.c:290:12: error: ‘SPI_TX_OCTAL’ undeclared (first use in this function)
    mode |= SPI_TX_OCTAL;
            ^
spidev_test.c:308:12: error: ‘SPI_RX_OCTAL’ undeclared (first use in this function)
    mode |= SPI_RX_OCTAL;
            ^
  LD       spidev_test-in.o
ld: cannot find spidev_test.o: No such file or directory

Additionally, maybe SPI_CS_WORD and SPI_3WIRE_HIZ will be used in the future,
so add them too.

Fixes: 896fa735084e ("spi: spidev_test: Add support for Octal mode data transfers")
Signed-off-by: Qing Zhang <[email protected]>
Reviewed-by: Geert Uytterhoeven <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

spi: tools: Make default_tx/rx and input_tx static

Fix the following sparse warning:

./spidev_test.c:50:9: warning: symbol 'default_tx' was not declared. Should it be static?
./spidev_test.c:59:9: warning: symbol 'default_rx' was not declared. Should it be static?
./spidev_test.c:60:6: warning: symbol 'input_tx' was not declared. Should it be static?

Signed-off-by: Qing Zhang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

umem: remove redundant initialization of variable ret

The variable ret is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.

Signed-off-by: Colin Ian King <[email protected]>
Addresses-Coverity: ("Unused value")
Signed-off-by: Jens Axboe <[email protected]>

pktcdvd: remove redundant initialization of variable ret

The variable ret is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.

Signed-off-by: Colin Ian King <[email protected]>
Addresses-Coverity: ("Unused value")
Signed-off-by: Jens Axboe <[email protected]>

nvmet: fail outstanding host posted AEN req

In function nvmet_async_event_process() we only process AENs iff
there is an open slot on the ctrl->async_event_cmds[] && aen
event list posted by the target is not empty. This keeps host
posted AEN outstanding if target generated AEN list is empty.
We do cleanup the target generated entries from the aen list in
nvmet_ctrl_free()-> nvmet_async_events_free() but we don't
process AEN posted by the host. This leads to following problem :-

When processing admin sq at the time of nvmet_sq_destroy() holds
an extra percpu reference(atomic value = 1), so in the following code
path after switching to atomic rcu, release function (nvmet_sq_free())
is not getting called which blocks the sq->free_done in
nvmet_sq_destroy() :-

nvmet_sq_destroy()
percpu_ref_kill_and_confirm()
- __percpu_ref_switch_mode()
--  __percpu_ref_switch_to_atomic()
---   call_rcu() -> percpu_ref_switch_to_atomic_rcu()
----     /* calls switch callback */
- percpu_ref_put()
-- percpu_ref_put_many(ref, 1)
--- else if (unlikely(atomic_long_sub_and_test(nr, &ref->count)))
----   ref->release(ref); <---- Not called.

This results in indefinite hang:-

  void nvmet_sq_destroy(struct nvmet_sq *sq)
...
          if (ctrl && ctrl->sqs && ctrl->sqs[0] == sq) {
                  nvmet_async_events_process(ctrl, status);
                  percpu_ref_put(&sq->ref);
          }
          percpu_ref_kill_and_confirm(&sq->ref, nvmet_confirm_sq);
          wait_for_completion(&sq->confirm_done);
          wait_for_completion(&sq->free_done); <-- Hang here

Which breaks the further disconnect sequence. This problem seems to be
introduced after commit 64f5e9cdd711b ("nvmet: fix memory leak when
removing namespaces and controllers concurrently").

This patch processes ctrl->async_event_cmds[] in the admin sq destroy()
context irrespetive of aen_list. Also we get rid of the controller's
aen_list processing in the nvmet_sq_destroy() context and just ignore
ctrl->aen_list.

This results in nvmet_async_events_process() being called from workqueue
context so we adjust the code accordingly.

Fixes: 64f5e9cdd711 ("nvmet: fix memory leak when removing namespaces and controllers concurrently ")
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme-pci: use simple suspend when a HMB is enabled

While the NVMe specification allows the device to access the host memory
buffer in host DRAM from all power states, hosts will fail access to
DRAM during S3 and similar power states.

Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend")
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme-fc: don't call nvme_cleanup_cmd() for AENs

Asynchronous event notifications do not have an associated request.
When fcp_io() fails we unconditionally call nvme_cleanup_cmd() which
leads to a crash.

Fixes: 16686f3a6c3c ("nvme: move common call to nvme_cleanup_cmd to core layer")
Signed-off-by: Daniel Wagner <[email protected]>
Reviewed-by: Himanshu Madhani <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: James Smart <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvmet-tcp: constify nvmet_tcp_ops

nvmet_tcp_ops is never modified and can be made const to allow the
compiler to put it in read-only memory, as done in other transports.

Before:
   text    data     bss     dec     hex filename
  16164     160      12   16336    3fd0 drivers/nvme/target/tcp.o

After:
   text    data     bss     dec     hex filename
  16277      64      12   16353    3fe1 drivers/nvme/target/tcp.o

Signed-off-by: Max Gurtovoy <[email protected]>
Reviewed-by: Himanshu Madhani <[email protected]>
Reviewed-by: Israel Rukshin <[email protected]>
Acked-by: Sagi Grimberg <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme-tcp: constify nvme_tcp_mq_ops and nvme_tcp_admin_mq_ops

nvme_tcp_mq_ops and nvme_tcp_admin_mq_ops are never modified and can be
made const to allow the compiler to put them in read-only memory.

Before:
   text    data     bss     dec     hex filename
  53102    6885     576   60563    ec93 drivers/nvme/host/tcp.o

After:
   text    data     bss     dec     hex filename
  53422    6565     576   60563    ec93 drivers/nvme/host/tcp.o

Signed-off-by: Rikard Falkeborn <[email protected]>
Acked-by: Sagi Grimberg <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme: do not call del_gendisk() on a disk that was never added

device_add_disk() is negated by del_gendisk().
alloc_disk_node() is negated by put_disk().

In nvme_alloc_ns(), device_add_disk() is one of the last things being
called in the success case, and only void functions are being called
after this. Therefore this call should not be negated in the error path.

The superfluous call to del_gendisk() leads to the following prints:
[ 7.839975] kobject: '(null)' (000000001ff73734): is not initialized, yet kobject_put() is being called.
[ 7.840865] WARNING: CPU: 2 PID: 361 at lib/kobject.c:736 kobject_put+0x70/0x120

Fixes: 33cfdc2aa696 ("nvme: enforce extended LBA format for fabrics metadata")
Signed-off-by: Niklas Cassel <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

tools, bpftool: Fix memory leak in codegen error cases

Free the memory allocated for the template on error paths in function
codegen.

Signed-off-by: Tobias Klauser <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

selftests/bpf: Add cgroup_skb/egress test for load_bytes_relative

When cgroup_skb/egress triggers the MAC header is not set. Added a
test that asserts reading MAC header is a -EFAULT but NET header
succeeds. The test result from within the eBPF program is stored in
an 1-element array map that the userspace then reads and asserts on.

Another assertion is added that reading from a large offset, past
the end of packet, returns -EFAULT.

Signed-off-by: YiFei Zhu <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/bpf/9028ccbea4385a620e69c0a104f469ffd655c01e.1591812755.git.zhuyifei@google.com

net/filter: Permit reading NET in load_bytes_relative when MAC not set

Added a check in the switch case on start_header that checks for
the existence of the header, and in the case that MAC is not set
and the caller requests for MAC, -EFAULT. If the caller requests
for NET then MAC's existence is completely ignored.

There is no function to check NET header's existence and as far
as cgroup_skb/egress is concerned it should always be set.

Removed for ptr >= the start of header, considering offset is
bounded unsigned and should always be true. len <= end - mac is
redundant to ptr + len <= end.

Fixes: 3eee1f75f2b9 ("bpf: fix bpf_skb_load_bytes_relative pkt length check")
Signed-off-by: YiFei Zhu <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/bpf/76bb820ddb6a95f59a772ecbd8c8a336f646b362.1591812755.git.zhuyifei@google.com

x86/mce/dev-mcelog: Fix -Wstringop-truncation warning about strncpy()

The kbuild test robot reported this warning:

  arch/x86/kernel/cpu/mce/dev-mcelog.c: In function 'dev_mcelog_init_device':
  arch/x86/kernel/cpu/mce/dev-mcelog.c:346:2: warning: 'strncpy' output \
    truncated before terminating nul copying 12 bytes from a string of the \
    same length [-Wstringop-truncation]

This is accurate, but I don't care that the trailing NUL character isn't
copied. The string being copied is just a magic number signature so that
crash dump tools can be sure they are decoding the right blob of memory.

Use memcpy() instead of strncpy().

Fixes: d8ecca4043f2 ("x86/mce/dev-mcelog: Dynamically allocate space for machine check records")
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned

An interesting thing happened when a guest Linux instance took a machine
check. The VMM unmapped the bad page from guest physical space and
passed the machine check to the guest.

Linux took all the normal actions to offline the page from the process
that was using it. But then guest Linux crashed because it said there
was a second machine check inside the kernel with this stack trace:

do_memory_failure
    set_mce_nospec
         set_memory_uc
              _set_memory_uc
                   change_page_attr_set_clr
                        cpa_flush
                             clflush_cache_range_opt

This was odd, because a CLFLUSH instruction shouldn't raise a machine
check (it isn't consuming the data). Further investigation showed that
the VMM had passed in another machine check because is appeared that the
guest was accessing the bad page.

Fix is to check the scope of the poison by checking the MCi_MISC register.
If the entire page is affected, then unmap the page. If only part of the
page is affected, then mark the page as uncacheable.

This assumes that VMMs will do the logical thing and pass in the "whole
page scope" via the MCi_MISC register (since they unmapped the entire
page).

  [ bp: Adjust to x86/entry changes. ]

Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reported-by: Jue Wang <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Jue Wang <[email protected]>
Cc: <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

Merge branch 'x86/entry' into ras/core

to fixup conflicts in arch/x86/kernel/cpu/mce/core.c so MCE specific follow
up patches can be applied without creating a horrible merge conflict
afterwards.

x86/entry: Unbreak __irqentry_text_start/end magic

The entry rework moved interrupt entry code from the irqentry to the
noinstr section which made the irqentry section empty.

This breaks boundary checks which rely on the __irqentry_text_start/end
markers to find out whether a function in a stack trace is
interrupt/exception entry code. This affects the function graph tracer and
filter_irq_stacks().

As the IDT entry points are all sequentialy emitted this is rather simple
to unbreak by injecting __irqentry_text_start/end as global labels.

To make this work correctly:

  - Remove the IRQENTRY_TEXT section from the x86 linker script
  - Define __irqentry so it breaks the build if it's used
  - Adjust the entry mirroring in PTI
  - Remove the redundant kprobes and unwinder bound checks

Reported-by: Qian Cai <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>

x86/entry: __always_inline CR2 for noinstr

vmlinux.o: warning: objtool: exc_page_fault()+0x9: call to read_cr2() leaves .noinstr.text section
vmlinux.o: warning: objtool: exc_page_fault()+0x24: call to prefetchw() leaves .noinstr.text section
vmlinux.o: warning: objtool: exc_page_fault()+0x21: call to kvm_handle_async_pf.isra.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: exc_nmi()+0x1cc: call to write_cr2() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

lockdep: __always_inline more for noinstr

vmlinux.o: warning: objtool: debug_locks_off()+0xd: call to __debug_locks_off() leaves .noinstr.text section
vmlinux.o: warning: objtool: match_held_lock()+0x6a: call to look_up_lock_class.isra.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: lock_is_held_type()+0x90: call to lockdep_recursion_finish() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/entry: Re-order #DB handler to avoid *SAN instrumentation

vmlinux.o: warning: objtool: exc_debug()+0xbb: call to clear_ti_thread_flag.constprop.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: noist_exc_debug()+0x55: call to clear_ti_thread_flag.constprop.0() leaves .noinstr.text section

Rework things so that handle_debug() looses the noinstr and move the
clear_thread_flag() into that.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/entry: __always_inline arch_atomic_* for noinstr

vmlinux.o: warning: objtool: rcu_dynticks_eqs_exit()+0x33: call to arch_atomic_and.constprop.0() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/entry: __always_inline irqflags for noinstr

vmlinux.o: warning: objtool: lockdep_hardirqs_on()+0x65: call to arch_local_save_flags() leaves .noinstr.text section
vmlinux.o: warning: objtool: lockdep_hardirqs_off()+0x5d: call to arch_local_save_flags() leaves .noinstr.text section
vmlinux.o: warning: objtool: lock_is_held_type()+0x35: call to arch_local_irq_save() leaves .noinstr.text section
vmlinux.o: warning: objtool: check_preemption_disabled()+0x31: call to arch_local_save_flags() leaves .noinstr.text section
vmlinux.o: warning: objtool: check_preemption_disabled()+0x33: call to arch_irqs_disabled_flags() leaves .noinstr.text section
vmlinux.o: warning: objtool: lock_is_held_type()+0x2f: call to native_irq_disable() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/entry: __always_inline debugreg for noinstr

vmlinux.o: warning: objtool: exc_debug()+0x21: call to native_get_debugreg() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/idt: Consolidate idt functionality

- Move load_current_idt() out of line and replace the hideous comment with
a lockdep assert. This allows to make idt_table and idt_descr static.

- Mark idt_table read only after the IDT initialization is complete.

- Shuffle code around to consolidate the #ifdef sections into one.

- Adapt the F00F bug code.

Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/idt: Cleanup trap_init()

No point in having all the IDT cruft in trap_init(). Move it into the IDT
code and fixup the comments.

Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/idt: Use proper constants for table size

Use the actual struct size to calculate the IDT table size instead of
hardcoded values.

Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/idt: Add comments about early #PF handling

The difference between 32 and 64 bit vs. early #PF handling is not
documented. Replace the FIXME at idt_setup_early_pf() with proper comments.

Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/idt: Mark init only functions __init

Since 8175cfbbbfcb ("x86/idt: Remove update_intr_gate()") set_intr_gate()
and idt_setup_from_table() are only called from __init functions. Mark them
as well.

Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/entry: Rename trace_hardirqs_off_prepare()

The typical pattern for trace_hardirqs_off_prepare() is:

  ENTRY
    lockdep_hardirqs_off(); // because hardware
    ... do entry magic
    instrumentation_begin();
    trace_hardirqs_off_prepare();
    ... do actual work
    trace_hardirqs_on_prepare();
    lockdep_hardirqs_on_prepare();
    instrumentation_end();
    ... do exit magic
    lockdep_hardirqs_on();

which shows that it's named wrong, rename it to
trace_hardirqs_off_finish(), as it concludes the hardirq_off transition.

Also, given that the above is the only correct order, make the traditional
all-in-one trace_hardirqs_off() follow suit.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/entry: Clarify irq_{enter,exit}_rcu()

Because:

irq_enter_rcu() includes lockdep_hardirq_enter()
irq_exit_rcu() does *NOT* include lockdep_hardirq_exit()

Which resulted in two 'stray' lockdep_hardirq_exit() calls in
idtentry.h, and me spending a long time trying to find the matching
enter calls.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/entry: Remove DBn stacks

Both #DB itself, as all other IST users (NMI, #MC) now clear DR7 on
entry. Combined with not allowing breakpoints on entry/noinstr/NOKPROBE
text and no single step (EFLAGS.TF) inside the #DB handler should guarantee
no nested #DB.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/entry: Remove debug IDT frobbing

This is all unused now.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/entry: Optimize local_db_save() for virt

Because DRn access is 'difficult' with virt; but the DR7 read is cheaper
than a cacheline miss on native, add a virt specific fast path to
local_db_save(), such that when breakpoints are not in use to avoid
touching DRn entirely.

Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/entry, mce: Disallow #DB during #MC

#MC is fragile as heck, don't tempt fate.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/entry, nmi: Disable #DB

Instead of playing stupid games with IST stacks, fully disallow #DB
during NMIs. There is absolutely no reason to allow them, and killing
this saves a heap of trouble.

#DB is already forbidden on noinstr and CEA, so there can't be a #DB before
this. Disabling it right after nmi_enter() ensures that the full NMI code
is protected.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/entry: Introduce local_db_{save,restore}()

In order to allow other exceptions than #DB to disable breakpoints,
provide common helpers.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/hw_breakpoint: Prevent data breakpoints on user_pcid_flush_mask

The per-CPU user_pcid_flush_mask is used in the low level entry code. A
data breakpoint can cause #DB recursion.

Protect the full cpu_tlbstate structure for simplicity.

Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]

x86/hw_breakpoint: Prevent data breakpoints on per_cpu cpu_tss_rw

cpu_tss_rw is not directly referenced by hardware, but cpu_tss_rw is
accessed in CPU entry code, especially when #DB shifts its stacks.

If a data breakpoint would be set on cpu_tss_rw.x86_tss.ist[IST_INDEX_DB],
it would cause recursive #DB ending up in a double fault.

Add it to the list of protected items.

Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]

x86/hw_breakpoint: Prevent data breakpoints on direct GDT

A data breakpoint on the GDT can be fatal and must be avoided. The GDT in
the CPU entry area is already protected, but not the direct GDT.

Add the necessary protection.

Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]

x86/hw_breakpoint: Add within_area() to check data breakpoints

Add a within_area() helper to checking whether the data breakpoints overlap
with cpu_entry_area.

It will be used to completely prevent data breakpoints on GDT, IDT, or TSS.

Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]

xen: Move xen_setup_callback_vector() definition to include/xen/hvm.h

Kbuild test robot reports the following problem on ARM:

for 'xen_setup_callback_vector' [-Wmissing-prototypes]
1664 | void xen_setup_callback_vector(void) {}
| ^~~~~~~~~~~~~~~~~~~~~~~~~

The problem is that xen_setup_callback_vector is a x86 only thing, its
definition is present in arch/x86/xen/xen-ops.h but not on ARM. In
events_base.c there is a stub for !CONFIG_XEN_PVHVM but it is not declared
as 'static'.

On x86 the situation is hardly better: drivers/xen/events/events_base.c
doesn't include 'xen-ops.h' from arch/x86/xen/, it includes its namesake
from include/xen/ which also results in a 'no previous prototype' warning.

Currently, xen_setup_callback_vector() has two call sites: one in
drivers/xen/events_base.c and another in arch/x86/xen/suspend_hvm.c. The
former is placed under #ifdef CONFIG_X86 and the later is only compiled
in when CONFIG_XEN_PVHVM.

Resolve the issue by moving xen_setup_callback_vector() declaration to
arch neutral 'include/xen/hvm.h' as the implementation lives in arch
neutral drivers/xen/events/events_base.c.

Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Juergen Gross <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

x86/entry: Remove the TRACE_IRQS cruft

No more users.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Move paranoid irq tracing out of ASM code

The last step to remove the irq tracing cruft from ASM. Ignore #DF as the
maschine is going to die anyway.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry/64: Remove TRACE_IRQS_*_DEBUG

Since INT3/#BP no longer runs on an IST, this workaround is no longer
required.

Tested by running lockdep+ftrace as described in the initial commit:

5963e317b1e9 ("ftrace/x86: Do not change stacks in DEBUG when calling lockdep")

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry/32: Remove redundant irq disable code

All exceptions/interrupts return with interrupts disabled now. No point in
doing this in ASM again.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Make enter_from_user_mode() static

The ASM users are gone. All callers are local.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry/64: Remove IRQ stack switching ASM

No more users.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Remove the apic/BUILD interrupt leftovers

Remove all the code which was there to emit the system vector stubs. All
users are gone.

Move the now unused GET_CR2_INTO macro muck to head_64.S where the last
user is. Fixup the eye hurting comment there while at it.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Convert reschedule interrupt to IDTENTRY_SYSVEC_SIMPLE

The scheduler IPI does not need the full interrupt entry handling logic
when the entry is from kernel mode. Use IDTENTRY_SYSVEC_SIMPLE and spare
all the overhead.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Convert XEN hypercall vector to IDTENTRY_SYSVEC

Convert the last oldstyle defined vector to IDTENTRY_SYSVEC:

  - Implement the C entry point with DEFINE_IDTENTRY_SYSVEC
  - Emit the ASM stub with DECLARE_IDTENTRY_SYSVEC
  - Remove the ASM idtentries in 64-bit
  - Remove the BUILD_INTERRUPT entries in 32-bit
  - Remove the old prototypes

Fixup the related XEN code by providing the primary C entry point in x86 to
avoid cluttering the generic code with X86'isms.

No functional change.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Convert various hypervisor vectors to IDTENTRY_SYSVEC

Convert various hypervisor vectors to IDTENTRY_SYSVEC:

  - Implement the C entry point with DEFINE_IDTENTRY_SYSVEC
  - Emit the ASM stub with DECLARE_IDTENTRY_SYSVEC
  - Remove the ASM idtentries in 64-bit
  - Remove the BUILD_INTERRUPT entries in 32-bit
  - Remove the old prototypes

No functional change.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Reviewed-by: Wei Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Convert KVM vectors to IDTENTRY_SYSVEC*

Convert KVM specific system vectors to IDTENTRY_SYSVEC*:

The two empty stub handlers which only increment the stats counter do no
need to run on the interrupt stack. Use IDTENTRY_SYSVEC_SIMPLE for them.

The wakeup handler does more work and runs on the interrupt stack.

None of these handlers need to save and restore the irq_regs pointer.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Convert various system vectors

Convert various system vectors to IDTENTRY_SYSVEC:

  - Implement the C entry point with DEFINE_IDTENTRY_SYSVEC
  - Emit the ASM stub with DECLARE_IDTENTRY_SYSVEC
  - Remove the ASM idtentries in 64-bit
  - Remove the BUILD_INTERRUPT entries in 32-bit
  - Remove the old prototypes

No functional change.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Convert SMP system vectors to IDTENTRY_SYSVEC

Convert SMP system vectors to IDTENTRY_SYSVEC:

  - Implement the C entry point with DEFINE_IDTENTRY_SYSVEC
  - Emit the ASM stub with DECLARE_IDTENTRY_SYSVEC
  - Remove the ASM idtentries in 64-bit
  - Remove the BUILD_INTERRUPT entries in 32-bit
  - Remove the old prototypes

No functional change.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Convert APIC interrupts to IDTENTRY_SYSVEC

Convert APIC interrupts to IDTENTRY_SYSVEC:

  - Implement the C entry point with DEFINE_IDTENTRY_SYSVEC
  - Emit the ASM stub with DECLARE_IDTENTRY_SYSVEC
  - Remove the ASM idtentries in 64-bit
  - Remove the BUILD_INTERRUPT entries in 32-bit
  - Remove the old prototypes

No functional change.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Provide IDTENTRY_SYSVEC

Provide IDTENTRY variants for system vectors to consolidate the different
mechanisms to emit the ASM stubs for 32- and 64-bit.

On 64-bit this also moves the stack switching from ASM to C code. 32-bit will
excute the system vectors w/o stack switching as before.

The simple variant is meant for "empty" system vectors like scheduler IPI
and KVM posted interrupt vectors. These do not need the full glory of irq
enter/exit handling with softirq processing and more.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Use idtentry for interrupts

Replace the extra interrupt handling code and reuse the existing idtentry
machinery. This moves the irq stack switching on 64-bit from ASM to C code;
32-bit already does the stack switching in C.

This requires to remove HAVE_IRQ_EXIT_ON_IRQ_STACK as the stack switch is
not longer in the low level entry code.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Add IRQENTRY_IRQ macro

Provide a seperate IDTENTRY macro for device interrupts. Similar to
IDTENTRY_ERRORCODE with the addition of invoking irq_enter/exit_rcu() and
providing the errorcode as a 'u8' argument to the C function, which
truncates the sign extended vector number.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/irq: Rework handle_irq() for 64-bit

To consolidate the interrupt entry/exit code vs. the other exceptions
make handle_irq() an inline and handle both 64-bit and 32-bit mode.

Preparatory change to move irq stack switching for 64-bit to C which allows
to consolidate the entry exit handling by reusing the idtentry machinery
both in ASM and C.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/irq: Convey vector as argument and not in ptregs

Device interrupts which go through do_IRQ() or the spurious interrupt
handler have their separate entry code on 64 bit for no good reason.

Both 32 and 64 bit transport the vector number through ORIG_[RE]AX in
pt_regs. Further the vector number is forced to fit into an u8 and is
complemented and offset by 0x80 so it's in the signed character
range. Otherwise GAS would expand the pushq to a 5 byte instruction for any
vector > 0x7F.

Treat the vector number like an error code and hand it to the C function as
argument. This allows to get rid of the extra entry code in a later step.

Simplify the error code push magic by implementing the pushq imm8 via a
'.byte 0x6a, vector' sequence so GAS is not able to screw it up. As the
pushq imm8 is sign extending the resulting error code needs to be truncated
to 8 bits in C code.

Originally-by: Andy Lutomirski <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/irq: Use generic irq_regs implementation

The only difference is the name of the per-CPU variable: irq_regs
vs. __irq_regs, but the accessor functions are identical.

Remove the pointless copy and use the generic variant.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry/32: Remove common_exception()

No more users.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry/64: Remove error_exit()

No more users.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Change exit path of xen_failsafe_callback

xen_failsafe_callback() is invoked from XEN for two cases:

1. Fault while reloading DS, ES, FS or GS
2. Fault while executing IRET

#1 retries the IRET after XEN has fixed up the segments.
#2 injects a #GP which kills the task

For #1 there is no reason to go through the full exception return path
because the tasks TIF state is still the same. So just going straight to
the IRET path is good enough.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Remove the transition leftovers

Now that all exceptions are converted over the sane flag is not longer
needed. Also the vector argument of idtentry_body on 64-bit is pointless
now.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Switch page fault exception to IDTENTRY_RAW

Convert page fault exceptions to IDTENTRY_RAW:

  - Implement the C entry point with DEFINE_IDTENTRY_RAW
  - Add the CR2 read into the exception handler
  - Add the idtentry_enter/exit_cond_rcu() invocations in
    in the regular page fault handler and in the async PF
    part.
  - Emit the ASM stub with DECLARE_IDTENTRY_RAW
  - Remove the ASM idtentry in 64-bit
  - Remove the CR2 read from 64-bit
  - Remove the open coded ASM entry code in 32-bit
  - Fix up the XEN/PV code
  - Remove the old prototypes

No functional change.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry/64: Simplify idtentry_body

All C functions which do not have an error code have been converted to the
new IDTENTRY interface which does not expect an error code in the
arguments. Spare the XORL.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Switch XEN/PV hypercall entry to IDTENTRY

Convert the XEN/PV hypercall to IDTENTRY:

  - Emit the ASM stub with DECLARE_IDTENTRY
  - Remove the ASM idtentry in 64-bit
  - Remove the open coded ASM entry code in 32-bit
  - Remove the old prototypes

The handler stubs need to stay in ASM code as they need corner case handling
and adjustment of the stack pointer.

Provide a new C function which invokes the entry/exit handling and calls
into the XEN handler on the interrupt stack if required.

The exit code is slightly different from the regular idtentry_exit() on
non-preemptible kernels. If the hypercall is preemptible and need_resched()
is set then XEN provides a preempt hypercall scheduling function.

Move this functionality into the entry code so it can use the existing
idtentry functionality.

[ mingo: Build fixes. ]

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Acked-by: Juergen Gross <[email protected]>
Tested-by: Juergen Gross <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Split out idtentry_exit_cond_resched()

The XEN PV hypercall requires the ability of conditional rescheduling when
preemption is disabled because some hypercalls take ages.

Split out the rescheduling code from idtentry_exit_cond_rcu() so it can
be reused for that.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry/64: Move do_softirq_own_stack() to C

The first step to get rid of the ENTER/LEAVE_IRQ_STACK ASM macro maze. Use
the new C code helpers to move do_softirq_own_stack() out of ASM code.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Provide helpers for executing on the irqstack

Device interrupt handlers and system vector handlers are executed on the
interrupt stack. The stack switch happens in the low level assembly entry
code. This conflicts with the efforts to consolidate the exit code in C to
ensure correctness vs. RCU and tracing.

As there is no way to move #DB away from IST due to the MOV SS issue, the
requirements vs. #DB and NMI for switching to the interrupt stack do not
exist anymore. The only requirement is that interrupts are disabled.

That allows the moving of the stack switching to C code, which simplifies the
entry/exit handling further, because it allows the switching of stacks after
handling the entry and on exit before handling RCU, returning to usermode and
kernel preemption in the same way as for regular exceptions.

The initial attempt of having the stack switching in inline ASM caused too
much headache vs. objtool and the unwinder. After analysing the use cases
it was agreed on that having the stack switch in ASM for the price of an
indirect call is acceptable, as the main users are indirect call heavy
anyway and the few system vectors which are empty shells (scheduler IPI and
KVM posted interrupt vectors) can run from the regular stack.

Provide helper functions to check whether the interrupt stack is already
active and whether stack switching is required.

64-bit only for now, as 32-bit has a variant of that already. Once this is
cleaned up, the two implementations might be consolidated as an additional
cleanup on top.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

genirq: Provide __irq_enter/exit_raw()

Like __irq_enter/exit() but without time accounting. To be used for "empty"
system vectors like the scheduler IPI to avoid the overhead.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

genirq: Provide irq_enter/exit_rcu()

irq_enter()/exit() currently include RCU handling. To properly separate the RCU
handling code, provide variants which contain only the non-RCU related
functionality.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Clean up idtentry_enter/exit() leftovers

Now that everything is converted to conditional RCU handling remove
idtentry_enter/exit() and tidy up the conditional functions.

This does not remove rcu_irq_exit_preempt(), to avoid conflicts with the RCU
tree. Will be removed once all of this hits Linus's tree.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/idtentry: Switch to conditional RCU handling

Switch all idtentry_enter/exit() users over to the new conditional RCU
handling scheme and make the user mode entries in #DB, #INT3 and #MCE use
the user mode idtentry functions.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Provide idtentry_enter/exit_user()

As there are exceptions which already handle entry from user mode and from
kernel mode separately, providing explicit user entry/exit handling callbacks
makes sense and makes the code easier to understand.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry: Provide idtentry_entry/exit_cond_rcu()

After a lengthy discussion [1] it turned out that RCU does not need a full
rcu_irq_enter/exit() when RCU is already watching. All it needs if
NOHZ_FULL is active is to check whether the tick needs to be restarted.

This allows to avoid a separate variant for the pagefault handler which
cannot invoke rcu_irq_enter() on a kernel pagefault which might sleep.

The cond_rcu argument is only temporary and will be removed once the
existing users of idtentry_enter/exit() have been cleaned up. After that
the code can be significantly simplified.

[ mingo: Simplified the control flow ]

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: "Paul E. McKenney" <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Link: [1] https://lkml.kernel.org/r/20200515235125.628629605@linutronix.de
Link: https://lore.kernel.org/r/[email protected]

nmi, tracing: Make hardware latency tracing noinstr safe

The hardware latency tracer calls into instrumentable functions. Move the
calls into the RCU watching sections and annotate them.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

x86/entry: Fix allnoconfig build warning

The following commit:

095b7a3e7745 ("x86/entry: Convert double fault exception to IDTENTRY_DF")

introduced a new build warning on 64-bit allnoconfig kernels, that have CONFIG_VMAP_STACK disabled:

arch/x86/kernel/traps.c:332:16: warning: unused variable ‘address’ [-Wunused-variable]

This variable is only used if CONFIG_VMAP_STACK is defined, so make it
dependent on that, not CONFIG_X86_64.

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Alexandre Chartre <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>