--- /dev/null
+=======================================
+Silicon Errata and Software Workarounds
+=======================================
+
+
+Date : 27 November 2015
+
+It is an unfortunate fact of life that hardware is often produced with
+so-called "errata", which can cause it to deviate from the architecture
+under specific circumstances. For hardware produced by ARM, these
+errata are broadly classified into the following categories:
+
+ ========== ========================================================
+ Category A A critical error without a viable workaround.
+ Category B A significant or critical error with an acceptable
+ workaround.
+ Category C A minor error that is not expected to occur under normal
+ operation.
+ ========== ========================================================
+
+For more information, consult one of the "Software Developers Errata
+Notice" documents available on infocenter.arm.com (registration
+required).
+
+As far as Linux is concerned, Category B errata may require some special
+treatment in the operating system. For example, avoiding a particular
+sequence of code, or configuring the processor in a particular way. A
+less common situation may require similar actions in order to declassify
+a Category A erratum into a Category C erratum. These are collectively
+known as "software workarounds" and are only required in the minority of
+cases (e.g. those cases that both require a non-secure workaround *and*
+can be triggered by Linux).
+
+For software workarounds that may adversely impact systems unaffected by
+the erratum in question, a Kconfig entry is added under "Kernel
+Features" -> "ARM errata workarounds via the alternatives framework".
+These are enabled by default and patched in at runtime when an affected
+CPU is detected. For less-intrusive workarounds, a Kconfig option is not
+available and the code is structured (preferably with a comment) in such
+a way that the erratum will not be hit.
+
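+As a rough illustration only (the capability name below is a placeholder,
+not a real erratum entry), code gated on such a workaround typically
+checks a CPU capability that the kernel sets when an affected part is
+detected at boot, for example::
+
+	/*
+	 * Sketch only: ARM64_WORKAROUND_EXAMPLE is a hypothetical capability
+	 * used here for illustration; real workarounds use the per-erratum
+	 * capabilities defined in asm/cpucaps.h.
+	 */
+	#include <asm/cpufeature.h>
+	#include <asm/tlbflush.h>
+
+	static void example_tlb_flush(void)
+	{
+		local_flush_tlb_all();
+
+		/* Some errata require the maintenance operation to be repeated. */
+		if (cpus_have_const_cap(ARM64_WORKAROUND_EXAMPLE))
+			local_flush_tlb_all();
+	}
+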
+This approach can make it slightly onerous to determine exactly which
+errata are worked around in an arbitrary kernel source tree, so this
+file acts as a registry of software workarounds in the Linux Kernel and
+will be updated when new workarounds are committed and backported to
+stable kernels.
+
++----------------+-----------------+-----------------+-----------------------------+
+| Implementor    | Component       | Erratum ID      | Kconfig                     |
++================+=================+=================+=============================+
+| Allwinner      | A64/R18         | UNKNOWN1        | SUN50I_ERRATUM_UNKNOWN1     |
++----------------+-----------------+-----------------+-----------------------------+
++----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Cortex-A53      | #826319         | ARM64_ERRATUM_826319        |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Cortex-A53      | #827319         | ARM64_ERRATUM_827319        |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Cortex-A53      | #824069         | ARM64_ERRATUM_824069        |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Cortex-A53      | #819472         | ARM64_ERRATUM_819472        |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Cortex-A53      | #845719         | ARM64_ERRATUM_845719        |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Cortex-A53      | #843419         | ARM64_ERRATUM_843419        |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Cortex-A57      | #832075         | ARM64_ERRATUM_832075        |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Cortex-A57      | #852523         | N/A                         |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Cortex-A57      | #834220         | ARM64_ERRATUM_834220        |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Cortex-A72      | #853709         | N/A                         |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Cortex-A73      | #858921         | ARM64_ERRATUM_858921        |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Cortex-A55      | #1024718        | ARM64_ERRATUM_1024718       |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Cortex-A76      | #1188873,1418040| ARM64_ERRATUM_1418040       |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Cortex-A76      | #1165522        | ARM64_ERRATUM_1165522       |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Cortex-A76      | #1286807        | ARM64_ERRATUM_1286807       |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Cortex-A76      | #1463225        | ARM64_ERRATUM_1463225       |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Neoverse-N1     | #1188873,1418040| ARM64_ERRATUM_1418040       |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Neoverse-N1     | #1349291        | N/A                         |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM            | MMU-500         | #841119,826419  | N/A                         |
++----------------+-----------------+-----------------+-----------------------------+
++----------------+-----------------+-----------------+-----------------------------+
+| Cavium         | ThunderX ITS    | #22375,24313    | CAVIUM_ERRATUM_22375        |
++----------------+-----------------+-----------------+-----------------------------+
+| Cavium         | ThunderX ITS    | #23144          | CAVIUM_ERRATUM_23144        |
++----------------+-----------------+-----------------+-----------------------------+
+| Cavium         | ThunderX GICv3  | #23154          | CAVIUM_ERRATUM_23154        |
++----------------+-----------------+-----------------+-----------------------------+
+| Cavium         | ThunderX Core   | #27456          | CAVIUM_ERRATUM_27456        |
++----------------+-----------------+-----------------+-----------------------------+
+| Cavium         | ThunderX Core   | #30115          | CAVIUM_ERRATUM_30115        |
++----------------+-----------------+-----------------+-----------------------------+
+| Cavium         | ThunderX SMMUv2 | #27704          | N/A                         |
++----------------+-----------------+-----------------+-----------------------------+
+| Cavium         | ThunderX2 SMMUv3| #74             | N/A                         |
++----------------+-----------------+-----------------+-----------------------------+
+| Cavium         | ThunderX2 SMMUv3| #126            | N/A                         |
++----------------+-----------------+-----------------+-----------------------------+
++----------------+-----------------+-----------------+-----------------------------+
+| Freescale/NXP  | LS2080A/LS1043A | A-008585        | FSL_ERRATUM_A008585         |
++----------------+-----------------+-----------------+-----------------------------+
++----------------+-----------------+-----------------+-----------------------------+
+| Hisilicon      | Hip0{5,6,7}     | #161010101      | HISILICON_ERRATUM_161010101 |
++----------------+-----------------+-----------------+-----------------------------+
+| Hisilicon      | Hip0{6,7}       | #161010701      | N/A                         |
++----------------+-----------------+-----------------+-----------------------------+
+| Hisilicon      | Hip07           | #161600802      | HISILICON_ERRATUM_161600802 |
++----------------+-----------------+-----------------+-----------------------------+
+| Hisilicon      | Hip08 SMMU PMCG | #162001800      | N/A                         |
++----------------+-----------------+-----------------+-----------------------------+
++----------------+-----------------+-----------------+-----------------------------+
+| Qualcomm Tech. | Kryo/Falkor v1  | E1003           | QCOM_FALKOR_ERRATUM_1003    |
++----------------+-----------------+-----------------+-----------------------------+
+| Qualcomm Tech. | Falkor v1       | E1009           | QCOM_FALKOR_ERRATUM_1009    |
++----------------+-----------------+-----------------+-----------------------------+
+| Qualcomm Tech. | QDF2400 ITS     | E0065           | QCOM_QDF2400_ERRATUM_0065   |
++----------------+-----------------+-----------------+-----------------------------+
+| Qualcomm Tech. | Falkor v{1,2}   | E1041           | QCOM_FALKOR_ERRATUM_1041    |
++----------------+-----------------+-----------------+-----------------------------+
++----------------+-----------------+-----------------+-----------------------------+
+| Fujitsu        | A64FX           | E#010001        | FUJITSU_ERRATUM_010001      |
++----------------+-----------------+-----------------+-----------------------------+
this vcpu, and determines which register slices are visible through
this ioctl interface.
-(See Documentation/arm64/sve.txt for an explanation of the "vq"
+(See Documentation/arm64/sve.rst for an explanation of the "vq"
nomenclature.)
KVM_REG_ARM64_SVE_VLS is only accessible after KVM_ARM_VCPU_INIT.
See KVM_ARM_VCPU_INIT for details of vcpu features that require finalization
using this ioctl.
+ 4.120 KVM_SET_PMU_EVENT_FILTER
+
+ Capability: KVM_CAP_PMU_EVENT_FILTER
+ Architectures: x86
+ Type: vm ioctl
+ Parameters: struct kvm_pmu_event_filter (in)
+ Returns: 0 on success, -1 on error
+
+ struct kvm_pmu_event_filter {
+ __u32 action;
+ __u32 nevents;
+ __u64 events[0];
+ };
+
+ This ioctl restricts the set of PMU events that the guest can program.
+ The argument holds a list of events which will be allowed or denied.
+ The eventsel+umask of each event the guest attempts to program is compared
+ against the events field to determine whether the guest should have access.
+ This only affects general purpose counters; fixed purpose counters can
+ be disabled by changing the perfmon CPUID leaf.
+
+ Valid values for 'action':
+ #define KVM_PMU_EVENT_ALLOW 0
+ #define KVM_PMU_EVENT_DENY 1
+
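+ As a rough, untested userspace sketch (assuming the usual <linux/kvm.h>
+ and <sys/ioctl.h> includes, vm_fd being a VM file descriptor, and two
+ example Intel architectural event encodings chosen purely for
+ illustration):
+
+   struct {
+           struct kvm_pmu_event_filter filter;
+           __u64 events[2];
+   } f = {};
+
+   f.filter.action  = KVM_PMU_EVENT_ALLOW;
+   f.filter.nevents = 2;
+   f.events[0] = 0x003c;   /* eventsel 0x3c, umask 0x00: unhalted core cycles */
+   f.events[1] = 0x00c0;   /* eventsel 0xc0, umask 0x00: instructions retired */
+
+   if (ioctl(vm_fd, KVM_SET_PMU_EVENT_FILTER, &f.filter) < 0)
+           perror("KVM_SET_PMU_EVENT_FILTER");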
+
5. The kvm_run structure
------------------------
#define KVM_X86_DISABLE_EXITS_MWAIT (1 << 0)
#define KVM_X86_DISABLE_EXITS_HLT (1 << 1)
+ #define KVM_X86_DISABLE_EXITS_PAUSE (1 << 2)
+ #define KVM_X86_DISABLE_EXITS_CSTATE (1 << 3)
Enabling this capability on a VM provides userspace with a way to no
longer intercept some instructions for improved latency in some
cpus_have_const_cap(ARM64_HAS_IRQ_PRIO_MASKING);
}
+static inline bool system_has_prio_mask_debugging(void)
+{
+ return IS_ENABLED(CONFIG_ARM64_DEBUG_PRIORITY_MASKING) &&
+ system_uses_irq_prio_masking();
+}
+
+ #define ARM64_BP_HARDEN_UNKNOWN -1
+ #define ARM64_BP_HARDEN_WA_NEEDED 0
+ #define ARM64_BP_HARDEN_NOT_REQUIRED 1
+
+ int get_spectre_v2_workaround_state(void);
+
#define ARM64_SSBD_UNKNOWN -1
#define ARM64_SSBD_FORCE_DISABLE 0
#define ARM64_SSBD_KERNEL 1
#include <asm/arch_gicv3.h>
#include <asm/barrier.h>
#include <asm/cpufeature.h>
+ #include <asm/cputype.h>
#include <asm/daifflags.h>
#include <asm/fpsimd.h>
#include <asm/kvm.h>
#include <asm/kvm_asm.h>
#include <asm/kvm_mmio.h>
- #include <asm/smp_plat.h>
#include <asm/thread_info.h>
#define __KVM_HAVE_ARCH_INTC_INITIALIZED
DECLARE_PER_CPU(kvm_host_data_t, kvm_host_data);
- static inline void kvm_init_host_cpu_context(struct kvm_cpu_context *cpu_ctxt,
- int cpu)
+ static inline void kvm_init_host_cpu_context(struct kvm_cpu_context *cpu_ctxt)
{
/* The host's MPIDR is immutable, so let's set it up at boot time */
- cpu_ctxt->sys_regs[MPIDR_EL1] = cpu_logical_map(cpu);
+ cpu_ctxt->sys_regs[MPIDR_EL1] = read_cpuid_mpidr();
}
void __kvm_enable_ssbs(void);
* will not signal the CPU of interrupts of lower priority, and the
* only way to get out will be via guest exceptions.
* Naturally, we want to avoid this.
+ *
+ * local_daif_mask() already sets GIC_PRIO_PSR_I_SET, so we just need a
+ * dsb to ensure that the redistributor forwards EL2 IRQs to the CPU.
*/
- if (system_uses_irq_prio_masking()) {
- gic_write_pmr(GIC_PRIO_IRQON);
+ if (system_uses_irq_prio_masking())
dsb(sy);
- }
}
static inline void kvm_arm_vhe_guest_exit(void)
isb();
}
- static inline bool kvm_arm_harden_branch_predictor(void)
+ #define KVM_BP_HARDEN_UNKNOWN -1
+ #define KVM_BP_HARDEN_WA_NEEDED 0
+ #define KVM_BP_HARDEN_NOT_REQUIRED 1
+
+ static inline int kvm_arm_harden_branch_predictor(void)
{
- return cpus_have_const_cap(ARM64_HARDEN_BRANCH_PREDICTOR);
+ switch (get_spectre_v2_workaround_state()) {
+ case ARM64_BP_HARDEN_WA_NEEDED:
+ return KVM_BP_HARDEN_WA_NEEDED;
+ case ARM64_BP_HARDEN_NOT_REQUIRED:
+ return KVM_BP_HARDEN_NOT_REQUIRED;
+ case ARM64_BP_HARDEN_UNKNOWN:
+ default:
+ return KVM_BP_HARDEN_UNKNOWN;
+ }
}
#define KVM_SSBD_UNKNOWN -1
#define SYS_APGAKEYLO_EL1 sys_reg(3, 0, 2, 3, 0)
#define SYS_APGAKEYHI_EL1 sys_reg(3, 0, 2, 3, 1)
+ #define SYS_SPSR_EL1 sys_reg(3, 0, 4, 0, 0)
+ #define SYS_ELR_EL1 sys_reg(3, 0, 4, 0, 1)
+
#define SYS_ICC_PMR_EL1 sys_reg(3, 0, 4, 6, 0)
#define SYS_AFSR0_EL1 sys_reg(3, 0, 5, 1, 0)
#define SYS_CNTP_CTL_EL0 sys_reg(3, 3, 14, 2, 1)
#define SYS_CNTP_CVAL_EL0 sys_reg(3, 3, 14, 2, 2)
+ #define SYS_CNTV_CTL_EL0 sys_reg(3, 3, 14, 3, 1)
+ #define SYS_CNTV_CVAL_EL0 sys_reg(3, 3, 14, 3, 2)
+
#define SYS_AARCH32_CNTP_TVAL sys_reg(0, 0, 14, 2, 0)
#define SYS_AARCH32_CNTP_CTL sys_reg(0, 0, 14, 2, 1)
#define SYS_AARCH32_CNTP_CVAL sys_reg(0, 2, 0, 14, 0)
#define __TYPER_CRm(n) (0xc | (((n) >> 3) & 0x3))
#define SYS_PMEVTYPERn_EL0(n) sys_reg(3, 3, 14, __TYPER_CRm(n), __PMEV_op2(n))
- #define SYS_PMCCFILTR_EL0 sys_reg (3, 3, 14, 15, 7)
+ #define SYS_PMCCFILTR_EL0 sys_reg(3, 3, 14, 15, 7)
#define SYS_ZCR_EL2 sys_reg(3, 4, 1, 2, 0)
-
#define SYS_DACR32_EL2 sys_reg(3, 4, 3, 0, 0)
+ #define SYS_SPSR_EL2 sys_reg(3, 4, 4, 0, 0)
+ #define SYS_ELR_EL2 sys_reg(3, 4, 4, 0, 1)
#define SYS_IFSR32_EL2 sys_reg(3, 4, 5, 0, 1)
+ #define SYS_ESR_EL2 sys_reg(3, 4, 5, 2, 0)
#define SYS_VSESR_EL2 sys_reg(3, 4, 5, 2, 3)
#define SYS_FPEXC32_EL2 sys_reg(3, 4, 5, 3, 0)
+ #define SYS_FAR_EL2 sys_reg(3, 4, 6, 0, 0)
#define SYS_VDISR_EL2 sys_reg(3, 4, 12, 1, 1)
#define __SYS__AP0Rx_EL2(x) sys_reg(3, 4, 12, 8, x)
#define SYS_ICH_LR15_EL2 __SYS__LR8_EL2(7)
/* VHE encodings for architectural EL0/1 system registers */
+ #define SYS_SCTLR_EL12 sys_reg(3, 5, 1, 0, 0)
+ #define SYS_CPACR_EL12 sys_reg(3, 5, 1, 0, 2)
#define SYS_ZCR_EL12 sys_reg(3, 5, 1, 2, 0)
+ #define SYS_TTBR0_EL12 sys_reg(3, 5, 2, 0, 0)
+ #define SYS_TTBR1_EL12 sys_reg(3, 5, 2, 0, 1)
+ #define SYS_TCR_EL12 sys_reg(3, 5, 2, 0, 2)
+ #define SYS_SPSR_EL12 sys_reg(3, 5, 4, 0, 0)
+ #define SYS_ELR_EL12 sys_reg(3, 5, 4, 0, 1)
+ #define SYS_AFSR0_EL12 sys_reg(3, 5, 5, 1, 0)
+ #define SYS_AFSR1_EL12 sys_reg(3, 5, 5, 1, 1)
+ #define SYS_ESR_EL12 sys_reg(3, 5, 5, 2, 0)
+ #define SYS_FAR_EL12 sys_reg(3, 5, 6, 0, 0)
+ #define SYS_MAIR_EL12 sys_reg(3, 5, 10, 2, 0)
+ #define SYS_AMAIR_EL12 sys_reg(3, 5, 10, 3, 0)
+ #define SYS_VBAR_EL12 sys_reg(3, 5, 12, 0, 0)
+ #define SYS_CONTEXTIDR_EL12 sys_reg(3, 5, 13, 0, 1)
+ #define SYS_CNTKCTL_EL12 sys_reg(3, 5, 14, 1, 0)
+ #define SYS_CNTP_TVAL_EL02 sys_reg(3, 5, 14, 2, 0)
+ #define SYS_CNTP_CTL_EL02 sys_reg(3, 5, 14, 2, 1)
+ #define SYS_CNTP_CVAL_EL02 sys_reg(3, 5, 14, 2, 2)
+ #define SYS_CNTV_TVAL_EL02 sys_reg(3, 5, 14, 3, 0)
+ #define SYS_CNTV_CTL_EL02 sys_reg(3, 5, 14, 3, 1)
+ #define SYS_CNTV_CVAL_EL02 sys_reg(3, 5, 14, 3, 2)
/* Common SCTLR_ELx flags. */
#define SCTLR_ELx_DSSBS (_BITUL(44))
/* id_aa64isar1 */
#define ID_AA64ISAR1_SB_SHIFT 36
+#define ID_AA64ISAR1_FRINTTS_SHIFT 32
#define ID_AA64ISAR1_GPI_SHIFT 28
#define ID_AA64ISAR1_GPA_SHIFT 24
#define ID_AA64ISAR1_LRCPC_SHIFT 20
printk(" %pS\n", (void *)where);
}
-static void __dump_instr(const char *lvl, struct pt_regs *regs)
+static void dump_kernel_instr(const char *lvl, struct pt_regs *regs)
{
unsigned long addr = instruction_pointer(regs);
char str[sizeof("00000000 ") * 5 + 2 + 1], *p = str;
int i;
+ if (user_mode(regs))
+ return;
+
for (i = -4; i < 1; i++) {
unsigned int val, bad;
- bad = get_user(val, &((u32 *)addr)[i]);
+ bad = aarch64_insn_read(&((u32 *)addr)[i], &val);
if (!bad)
p += sprintf(p, i == 0 ? "(%08x) " : "%08x ", val);
break;
}
}
- printk("%sCode: %s\n", lvl, str);
-}
-static void dump_instr(const char *lvl, struct pt_regs *regs)
-{
- if (!user_mode(regs)) {
- mm_segment_t fs = get_fs();
- set_fs(KERNEL_DS);
- __dump_instr(lvl, regs);
- set_fs(fs);
- } else {
- __dump_instr(lvl, regs);
- }
+ printk("%sCode: %s\n", lvl, str);
}
void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
print_modules();
show_regs(regs);
- if (!user_mode(regs))
- dump_instr(KERN_EMERG, regs);
+ dump_kernel_instr(KERN_EMERG, regs);
return ret;
}
{
arm64_show_signal(signo, str);
if (signo == SIGKILL)
- force_sig(SIGKILL, current);
+ force_sig(SIGKILL);
else
- force_sig_fault(signo, code, addr, current);
+ force_sig_fault(signo, code, addr);
}
void arm64_force_sig_mceerr(int code, void __user *addr, short lsb,
const char *str)
{
arm64_show_signal(SIGBUS, str);
- force_sig_mceerr(code, addr, lsb, current);
+ force_sig_mceerr(code, addr, lsb);
}
void arm64_force_sig_ptrace_errno_trap(int errno, void __user *addr,
/*
* The CPU can't make progress. The exception may have
* been imprecise.
+ *
+ * Neoverse-N1 #1349291 means a non-KVM SError reported as
+ * Unrecoverable should be treated as Uncontainable. We
+ * call arm64_serror_panic() in both cases.
*/
return true;
if (ec != ESR_ELx_EC_DABT_LOW && ec != ESR_ELx_EC_IABT_LOW)
return true;
- far = read_sysreg_el2(far);
+ far = read_sysreg_el2(SYS_FAR);
/*
* The HPFAR can be invalid if the stage 2 fault did not
static bool __hyp_text fixup_guest_exit(struct kvm_vcpu *vcpu, u64 *exit_code)
{
if (ARM_EXCEPTION_CODE(*exit_code) != ARM_EXCEPTION_IRQ)
- vcpu->arch.fault.esr_el2 = read_sysreg_el2(esr);
+ vcpu->arch.fault.esr_el2 = read_sysreg_el2(SYS_ESR);
/*
* We're using the raw exception code in order to only process
* Naturally, we want to avoid this.
*/
if (system_uses_irq_prio_masking()) {
- gic_write_pmr(GIC_PRIO_IRQON);
+ gic_write_pmr(GIC_PRIO_IRQON | GIC_PRIO_PSR_I_SET);
dsb(sy);
}
asm volatile("ldr %0, =__hyp_panic_string" : "=r" (str_va));
__hyp_do_panic(str_va,
- spsr, elr,
- read_sysreg(esr_el2), read_sysreg_el2(far),
+ spsr, elr,
+ read_sysreg(esr_el2), read_sysreg_el2(SYS_FAR),
read_sysreg(hpfar_el2), par, vcpu);
}
panic(__hyp_panic_string,
spsr, elr,
- read_sysreg_el2(esr), read_sysreg_el2(far),
+ read_sysreg_el2(SYS_ESR), read_sysreg_el2(SYS_FAR),
read_sysreg(hpfar_el2), par, vcpu);
}
NOKPROBE_SYMBOL(__hyp_call_panic_vhe);
void __hyp_text __noreturn hyp_panic(struct kvm_cpu_context *host_ctxt)
{
- u64 spsr = read_sysreg_el2(spsr);
- u64 elr = read_sysreg_el2(elr);
+ u64 spsr = read_sysreg_el2(SYS_SPSR);
+ u64 elr = read_sysreg_el2(SYS_ELR);
u64 par = read_sysreg(par_el1);
if (!has_vhe())
#include <linux/kvm_host.h>
#include <linux/kvm.h>
#include <linux/seqlock.h>
+#include <linux/module.h>
#include <asm/debug.h>
#include <asm/cpu.h>
#include <asm/fpu/api.h>
unsigned short ibc;
};
+struct kvm_s390_module_hook {
+ int (*hook)(struct kvm_vcpu *vcpu);
+ struct module *owner;
+};
+
struct kvm_s390_crypto {
struct kvm_s390_crypto_cb *crycb;
+ struct kvm_s390_module_hook *pqap_hook;
__u32 crycbd;
__u8 aes_kw;
__u8 dea_kw;
extern int kvm_s390_gisc_unregister(struct kvm *kvm, u32 gisc);
static inline void kvm_arch_hardware_disable(void) {}
- static inline void kvm_arch_check_processor_compat(void *rtn) {}
static inline void kvm_arch_sync_events(struct kvm *kvm) {}
static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
return 0;
}
+ int kvm_arch_check_processor_compat(void)
+ {
+ return 0;
+ }
+
static void kvm_gmap_notifier(struct gmap *gmap, unsigned long start,
unsigned long end);
kvm->arch.sca = (struct bsca_block *) get_zeroed_page(alloc_flags);
if (!kvm->arch.sca)
goto out_err;
- spin_lock(&kvm_lock);
+ mutex_lock(&kvm_lock);
sca_offset += 16;
if (sca_offset + sizeof(struct bsca_block) > PAGE_SIZE)
sca_offset = 0;
kvm->arch.sca = (struct bsca_block *)
((char *) kvm->arch.sca + sca_offset);
- spin_unlock(&kvm_lock);
+ mutex_unlock(&kvm_lock);
sprintf(debug_name, "kvm-%u", current->pid);
set_kvm_facility(kvm->arch.model.fac_list, 147);
}
+ if (css_general_characteristics.aiv && test_facility(65))
+ set_kvm_facility(kvm->arch.model.fac_mask, 65);
+
kvm->arch.model.cpuid = kvm_s390_get_initial_cpuid();
kvm->arch.model.ibc = sclp.ibc & 0x0fff;
#define X2APIC_BROADCAST 0xFFFFFFFFul
#define LAPIC_TIMER_ADVANCE_ADJUST_DONE 100
+ #define LAPIC_TIMER_ADVANCE_ADJUST_INIT 1000
/* step-by-step approximation to mitigate fluctuation */
#define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8
apic_test_vector(vector, apic->regs + APIC_IRR);
}
- static inline void apic_clear_vector(int vec, void *bitmap)
- {
- clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
- }
-
static inline int __apic_test_and_set_vector(int vec, void *bitmap)
{
return __test_and_set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
if (unlikely(vcpu->arch.apicv_active)) {
/* need to update RVI */
- apic_clear_vector(vec, apic->regs + APIC_IRR);
+ kvm_lapic_clear_vector(vec, apic->regs + APIC_IRR);
kvm_x86_ops->hwapic_irr_update(vcpu,
apic_find_highest_irr(apic));
} else {
apic->irr_pending = false;
- apic_clear_vector(vec, apic->regs + APIC_IRR);
+ kvm_lapic_clear_vector(vec, apic->regs + APIC_IRR);
if (apic_search_irr(apic) != -1)
apic->irr_pending = true;
}
if (apic_test_vector(vector, apic->regs + APIC_TMR) != !!trig_mode) {
if (trig_mode)
- kvm_lapic_set_vector(vector, apic->regs + APIC_TMR);
+ kvm_lapic_set_vector(vector,
+ apic->regs + APIC_TMR);
else
- apic_clear_vector(vector, apic->regs + APIC_TMR);
+ kvm_lapic_clear_vector(vector,
+ apic->regs + APIC_TMR);
}
if (vcpu->arch.apicv_active)
return container_of(dev, struct kvm_lapic, dev);
}
+ #define APIC_REG_MASK(reg) (1ull << ((reg) >> 4))
+ #define APIC_REGS_MASK(first, count) \
+ (APIC_REG_MASK(first) * ((1ull << (count)) - 1))
+
int kvm_lapic_reg_read(struct kvm_lapic *apic, u32 offset, int len,
void *data)
{
unsigned char alignment = offset & 0xf;
u32 result;
/* this bitmask has a bit cleared for each reserved register */
- static const u64 rmask = 0x43ff01ffffffe70cULL;
-
- if ((alignment + len) > 4) {
- apic_debug("KVM_APIC_READ: alignment error %x %d\n",
- offset, len);
- return 1;
- }
+ u64 valid_reg_mask =
+ APIC_REG_MASK(APIC_ID) |
+ APIC_REG_MASK(APIC_LVR) |
+ APIC_REG_MASK(APIC_TASKPRI) |
+ APIC_REG_MASK(APIC_PROCPRI) |
+ APIC_REG_MASK(APIC_LDR) |
+ APIC_REG_MASK(APIC_DFR) |
+ APIC_REG_MASK(APIC_SPIV) |
+ APIC_REGS_MASK(APIC_ISR, APIC_ISR_NR) |
+ APIC_REGS_MASK(APIC_TMR, APIC_ISR_NR) |
+ APIC_REGS_MASK(APIC_IRR, APIC_ISR_NR) |
+ APIC_REG_MASK(APIC_ESR) |
+ APIC_REG_MASK(APIC_ICR) |
+ APIC_REG_MASK(APIC_ICR2) |
+ APIC_REG_MASK(APIC_LVTT) |
+ APIC_REG_MASK(APIC_LVTTHMR) |
+ APIC_REG_MASK(APIC_LVTPC) |
+ APIC_REG_MASK(APIC_LVT0) |
+ APIC_REG_MASK(APIC_LVT1) |
+ APIC_REG_MASK(APIC_LVTERR) |
+ APIC_REG_MASK(APIC_TMICT) |
+ APIC_REG_MASK(APIC_TMCCT) |
+ APIC_REG_MASK(APIC_TDCR);
+
+ /* ARBPRI is not valid on x2APIC */
+ if (!apic_x2apic_mode(apic))
+ valid_reg_mask |= APIC_REG_MASK(APIC_ARBPRI);
- if (offset > 0x3f0 || !(rmask & (1ULL << (offset >> 4)))) {
+ if (offset > 0x3f0 || !(valid_reg_mask & APIC_REG_MASK(offset))) {
apic_debug("KVM_APIC_READ: read reserved register %x\n",
offset);
return 1;
}
}
- void wait_lapic_expire(struct kvm_vcpu *vcpu)
+ static inline void adjust_lapic_timer_advance(struct kvm_vcpu *vcpu,
+ s64 advance_expire_delta)
{
struct kvm_lapic *apic = vcpu->arch.apic;
u32 timer_advance_ns = apic->lapic_timer.timer_advance_ns;
- u64 guest_tsc, tsc_deadline, ns;
+ u64 ns;
+
+ /* too early */
+ if (advance_expire_delta < 0) {
+ ns = -advance_expire_delta * 1000000ULL;
+ do_div(ns, vcpu->arch.virtual_tsc_khz);
+ timer_advance_ns -= min((u32)ns,
+ timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
+ } else {
+ /* too late */
+ ns = advance_expire_delta * 1000000ULL;
+ do_div(ns, vcpu->arch.virtual_tsc_khz);
+ timer_advance_ns += min((u32)ns,
+ timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
+ }
+
+ if (abs(advance_expire_delta) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
+ apic->lapic_timer.timer_advance_adjust_done = true;
+ if (unlikely(timer_advance_ns > 5000)) {
+ timer_advance_ns = LAPIC_TIMER_ADVANCE_ADJUST_INIT;
+ apic->lapic_timer.timer_advance_adjust_done = false;
+ }
+ apic->lapic_timer.timer_advance_ns = timer_advance_ns;
+ }
+
+ void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
+ {
+ struct kvm_lapic *apic = vcpu->arch.apic;
+ u64 guest_tsc, tsc_deadline;
if (apic->lapic_timer.expired_tscdeadline == 0)
return;
tsc_deadline = apic->lapic_timer.expired_tscdeadline;
apic->lapic_timer.expired_tscdeadline = 0;
guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
- trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
+ apic->lapic_timer.advance_expire_delta = guest_tsc - tsc_deadline;
if (guest_tsc < tsc_deadline)
__wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
- if (!apic->lapic_timer.timer_advance_adjust_done) {
- /* too early */
- if (guest_tsc < tsc_deadline) {
- ns = (tsc_deadline - guest_tsc) * 1000000ULL;
- do_div(ns, vcpu->arch.virtual_tsc_khz);
- timer_advance_ns -= min((u32)ns,
- timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
- } else {
- /* too late */
- ns = (guest_tsc - tsc_deadline) * 1000000ULL;
- do_div(ns, vcpu->arch.virtual_tsc_khz);
- timer_advance_ns += min((u32)ns,
- timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
- }
- if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
- apic->lapic_timer.timer_advance_adjust_done = true;
- if (unlikely(timer_advance_ns > 5000)) {
- timer_advance_ns = 0;
- apic->lapic_timer.timer_advance_adjust_done = true;
- }
- apic->lapic_timer.timer_advance_ns = timer_advance_ns;
- }
+ if (unlikely(!apic->lapic_timer.timer_advance_adjust_done))
+ adjust_lapic_timer_advance(vcpu, apic->lapic_timer.advance_expire_delta);
}
+ EXPORT_SYMBOL_GPL(kvm_wait_lapic_expire);
static void start_sw_tscdeadline(struct kvm_lapic *apic)
{
apic_debug("%s: offset 0x%x with length 0x%x, and value is "
"0x%x\n", __func__, offset, len, val);
- kvm_lapic_reg_write(apic, offset & 0xff0, val);
+ kvm_lapic_reg_write(apic, offset, val);
return 0;
}
HRTIMER_MODE_ABS_PINNED);
apic->lapic_timer.timer.function = apic_timer_fn;
if (timer_advance_ns == -1) {
- apic->lapic_timer.timer_advance_ns = 1000;
+ apic->lapic_timer.timer_advance_ns = LAPIC_TIMER_ADVANCE_ADJUST_INIT;
apic->lapic_timer.timer_advance_adjust_done = false;
} else {
apic->lapic_timer.timer_advance_ns = timer_advance_ns;
/*
* APIC is created enabled. This will prevent kvm_lapic_set_base from
- * thinking that APIC satet has changed.
+ * thinking that APIC state has changed.
*/
vcpu->arch.apic_base = MSR_IA32_APICBASE_ENABLE;
static_key_slow_inc(&apic_sw_disabled.key); /* sw disabled at reset */
return 0;
nomem_free_apic:
kfree(apic);
+ vcpu->arch.apic = NULL;
nomem:
return -ENOMEM;
}
struct kvm_lapic *apic = vcpu->arch.apic;
u32 ppr;
- if (!apic_enabled(apic))
+ if (!kvm_apic_hw_enabled(apic))
return -1;
__apic_update_ppr(apic, &ppr);
#include <trace/events/kvm.h>
- #define CREATE_TRACE_POINTS
- #include "mmutrace.h"
-
#define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
#define SPTE_MMU_WRITEABLE (1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
*/
static u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
+ /*
+ * The number of non-reserved physical address bits irrespective of features
+ * that repurpose legal bits, e.g. MKTME.
+ */
+ static u8 __read_mostly shadow_phys_bits;
static void mmu_spte_set(u64 *sptep, u64 spte);
+ static bool is_executable_pte(u64 spte);
static union kvm_mmu_page_role
kvm_mmu_calc_root_page_role(struct kvm_vcpu *vcpu);
+ #define CREATE_TRACE_POINTS
+ #include "mmutrace.h"
+
static inline bool kvm_available_flush_tlb_with_range(void)
{
}
EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes);
+ static u8 kvm_get_shadow_phys_bits(void)
+ {
+ /*
+ * boot_cpu_data.x86_phys_bits is reduced when MKTME is detected
+ * in CPU detection code, but MKTME treats those reduced bits as
+ * 'keyID' thus they are not reserved bits. Therefore for MKTME
+ * we should still return physical address bits reported by CPUID.
+ */
+ if (!boot_cpu_has(X86_FEATURE_TME) ||
+ WARN_ON_ONCE(boot_cpu_data.extended_cpuid_level < 0x80000008))
+ return boot_cpu_data.x86_phys_bits;
+
+ return cpuid_eax(0x80000008) & 0xff;
+ }
+
static void kvm_mmu_reset_all_pte_masks(void)
{
u8 low_phys_bits;
shadow_present_mask = 0;
shadow_acc_track_mask = 0;
+ shadow_phys_bits = kvm_get_shadow_phys_bits();
+
/*
* If the CPU has 46 or less physical address bits, then set an
* appropriate mask to guard against L1TF attacks. Otherwise, it is
/*
* The idea using the light way get the spte on x86_32 guest is from
- * gup_get_pte(arch/x86/mm/gup.c).
+ * gup_get_pte (mm/gup.c).
*
* An spte tlb flush may be pending, because kvm_set_pte_rmapp
* coalesces them and we are running out of the MMU lock. Therefore
static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
{
- if (sp->role.direct)
- BUG_ON(gfn != kvm_mmu_page_get_gfn(sp, index));
- else
+ if (!sp->role.direct) {
sp->gfns[index] = gfn;
+ return;
+ }
+
+ if (WARN_ON(gfn != kvm_mmu_page_get_gfn(sp, index)))
+ pr_err_ratelimited("gfn mismatch under direct page %llx "
+ "(expected %llx, got %llx)\n",
+ sp->gfn,
+ kvm_mmu_page_get_gfn(sp, index), gfn);
}
/*
ret = RET_PF_EMULATE;
pgprintk("%s: setting spte %llx\n", __func__, *sptep);
- pgprintk("instantiating %s PTE (%s) at %llx (%llx) addr %p\n",
- is_large_pte(*sptep)? "2MB" : "4kB",
- *sptep & PT_WRITABLE_MASK ? "RW" : "R", gfn,
- *sptep, sptep);
+ trace_kvm_mmu_set_spte(level, gfn, sptep);
if (!was_rmapped && is_large_pte(*sptep))
++vcpu->kvm->stat.lpages;
}
}
- kvm_release_pfn_clean(pfn);
-
return ret;
}
if (ret <= 0)
return -1;
- for (i = 0; i < ret; i++, gfn++, start++)
+ for (i = 0; i < ret; i++, gfn++, start++) {
mmu_set_spte(vcpu, start, access, 0, sp->role.level, gfn,
page_to_pfn(pages[i]), true, true);
+ put_page(pages[i]);
+ }
return 0;
}
__direct_pte_prefetch(vcpu, sp, sptep);
}
- static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable,
- int level, gfn_t gfn, kvm_pfn_t pfn, bool prefault)
+ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, int write,
+ int map_writable, int level, kvm_pfn_t pfn,
+ bool prefault)
{
- struct kvm_shadow_walk_iterator iterator;
+ struct kvm_shadow_walk_iterator it;
struct kvm_mmu_page *sp;
- int emulate = 0;
- gfn_t pseudo_gfn;
+ int ret;
+ gfn_t gfn = gpa >> PAGE_SHIFT;
+ gfn_t base_gfn = gfn;
if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
- return 0;
+ return RET_PF_RETRY;
- for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, iterator) {
- if (iterator.level == level) {
- emulate = mmu_set_spte(vcpu, iterator.sptep, ACC_ALL,
- write, level, gfn, pfn, prefault,
- map_writable);
- direct_pte_prefetch(vcpu, iterator.sptep);
- ++vcpu->stat.pf_fixed;
+ trace_kvm_mmu_spte_requested(gpa, level, pfn);
+ for_each_shadow_entry(vcpu, gpa, it) {
+ base_gfn = gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
+ if (it.level == level)
break;
- }
- drop_large_spte(vcpu, iterator.sptep);
- if (!is_shadow_present_pte(*iterator.sptep)) {
- u64 base_addr = iterator.addr;
+ drop_large_spte(vcpu, it.sptep);
+ if (!is_shadow_present_pte(*it.sptep)) {
+ sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
+ it.level - 1, true, ACC_ALL);
- base_addr &= PT64_LVL_ADDR_MASK(iterator.level);
- pseudo_gfn = base_addr >> PAGE_SHIFT;
- sp = kvm_mmu_get_page(vcpu, pseudo_gfn, iterator.addr,
- iterator.level - 1, 1, ACC_ALL);
-
- link_shadow_page(vcpu, iterator.sptep, sp);
+ link_shadow_page(vcpu, it.sptep, sp);
}
}
- return emulate;
+
+ ret = mmu_set_spte(vcpu, it.sptep, ACC_ALL,
+ write, level, base_gfn, pfn, prefault,
+ map_writable);
+ direct_pte_prefetch(vcpu, it.sptep);
+ ++vcpu->stat.pf_fixed;
+ return ret;
}
static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct *tsk)
}
static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
- gfn_t *gfnp, kvm_pfn_t *pfnp,
+ gfn_t gfn, kvm_pfn_t *pfnp,
int *levelp)
{
kvm_pfn_t pfn = *pfnp;
- gfn_t gfn = *gfnp;
int level = *levelp;
/*
mask = KVM_PAGES_PER_HPAGE(level) - 1;
VM_BUG_ON((gfn & mask) != (pfn & mask));
if (pfn & mask) {
- gfn &= ~mask;
- *gfnp = gfn;
kvm_release_pfn_clean(pfn);
pfn &= ~mask;
kvm_get_pfn(pfn);
if (handle_abnormal_pfn(vcpu, v, gfn, pfn, ACC_ALL, &r))
return r;
+ r = RET_PF_RETRY;
spin_lock(&vcpu->kvm->mmu_lock);
if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
goto out_unlock;
if (make_mmu_pages_available(vcpu) < 0)
goto out_unlock;
if (likely(!force_pt_level))
- transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
- r = __direct_map(vcpu, write, map_writable, level, gfn, pfn, prefault);
- spin_unlock(&vcpu->kvm->mmu_lock);
-
- return r;
-
+ transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
+ r = __direct_map(vcpu, v, write, map_writable, level, pfn, prefault);
out_unlock:
spin_unlock(&vcpu->kvm->mmu_lock);
kvm_release_pfn_clean(pfn);
- return RET_PF_RETRY;
+ return r;
}
static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
return kvm_setup_async_pf(vcpu, gva, kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
}
- bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
- {
- if (unlikely(!lapic_in_kernel(vcpu) ||
- kvm_event_needs_reinjection(vcpu) ||
- vcpu->arch.exception.pending))
- return false;
-
- if (!vcpu->arch.apf.delivery_as_pf_vmexit && is_guest_mode(vcpu))
- return false;
-
- return kvm_x86_ops->interrupt_allowed(vcpu);
- }
-
static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
gva_t gva, kvm_pfn_t *pfn, bool write, bool *writable)
{
if (handle_abnormal_pfn(vcpu, 0, gfn, pfn, ACC_ALL, &r))
return r;
+ r = RET_PF_RETRY;
spin_lock(&vcpu->kvm->mmu_lock);
if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
goto out_unlock;
if (make_mmu_pages_available(vcpu) < 0)
goto out_unlock;
if (likely(!force_pt_level))
- transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
- r = __direct_map(vcpu, write, map_writable, level, gfn, pfn, prefault);
- spin_unlock(&vcpu->kvm->mmu_lock);
-
- return r;
-
+ transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
+ r = __direct_map(vcpu, gpa, write, map_writable, level, pfn, prefault);
out_unlock:
spin_unlock(&vcpu->kvm->mmu_lock);
kvm_release_pfn_clean(pfn);
- return RET_PF_RETRY;
+ return r;
}
static void nonpaging_init_context(struct kvm_vcpu *vcpu,
*/
shadow_zero_check = &context->shadow_zero_check;
__reset_rsvds_bits_mask(vcpu, shadow_zero_check,
- boot_cpu_data.x86_phys_bits,
+ shadow_phys_bits,
context->shadow_root_level, uses_nx,
guest_cpuid_has(vcpu, X86_FEATURE_GBPAGES),
is_pse(vcpu), true);
if (boot_cpu_is_amd())
__reset_rsvds_bits_mask(vcpu, shadow_zero_check,
- boot_cpu_data.x86_phys_bits,
+ shadow_phys_bits,
context->shadow_root_level, false,
boot_cpu_has(X86_FEATURE_GBPAGES),
true, true);
else
__reset_rsvds_bits_mask_ept(shadow_zero_check,
- boot_cpu_data.x86_phys_bits,
+ shadow_phys_bits,
false);
if (!shadow_me_mask)
struct kvm_mmu *context, bool execonly)
{
__reset_rsvds_bits_mask_ept(&context->shadow_zero_check,
- boot_cpu_data.x86_phys_bits, execonly);
+ shadow_phys_bits, execonly);
}
#define BYTE_MASK(access) \
int nr_to_scan = sc->nr_to_scan;
unsigned long freed = 0;
- spin_lock(&kvm_lock);
+ mutex_lock(&kvm_lock);
list_for_each_entry(kvm, &vm_list, vm_list) {
int idx;
break;
}
- spin_unlock(&kvm_lock);
+ mutex_unlock(&kvm_lock);
return freed;
}
kmem_cache_destroy(mmu_page_header_cache);
}
+ static void kvm_set_mmio_spte_mask(void)
+ {
+ u64 mask;
+
+ /*
+ * Set the reserved bits and the present bit of a paging-structure
+ * entry to generate page fault with PFER.RSV = 1.
+ */
+
+ /*
+ * Mask the uppermost physical address bit, which would be reserved as
+ * long as the supported physical address width is less than 52.
+ */
+ mask = 1ull << 51;
+
+ /* Set the present bit. */
+ mask |= 1ull;
+
+ /*
+ * If reserved bit is not supported, clear the present bit to disable
+ * mmio page fault.
+ */
+ if (IS_ENABLED(CONFIG_X86_64) && shadow_phys_bits == 52)
+ mask &= ~1ull;
+
+ kvm_mmu_set_mmio_spte_mask(mask, mask);
+ }
+
int kvm_mmu_module_init(void)
{
int ret = -ENOMEM;
kvm_mmu_reset_all_pte_masks();
+ kvm_set_mmio_spte_mask();
+
pte_list_desc_cache = kmem_cache_create("pte_list_desc",
sizeof(struct pte_list_desc),
0, SLAB_ACCOUNT, NULL);
#include "lapic.h"
#include "pmu.h"
+ /* This keeps the total size of the filter under 4k. */
+ #define KVM_PMU_EVENT_FILTER_MAX_EVENTS 63
+
/* NOTE:
* - Each perf counter is defined as "struct kvm_pmc";
* - There are two types of perf counters: general purpose (gp) and fixed.
{
unsigned config, type = PERF_TYPE_RAW;
u8 event_select, unit_mask;
+ struct kvm *kvm = pmc->vcpu->kvm;
+ struct kvm_pmu_event_filter *filter;
+ int i;
+ bool allow_event = true;
if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL)
printk_once("kvm pmu: pin control bit is ignored\n");
if (!(eventsel & ARCH_PERFMON_EVENTSEL_ENABLE) || !pmc_is_enabled(pmc))
return;
+ filter = srcu_dereference(kvm->arch.pmu_event_filter, &kvm->srcu);
+ if (filter) {
+ for (i = 0; i < filter->nevents; i++)
+ if (filter->events[i] ==
+ (eventsel & AMD64_RAW_EVENT_MASK_NB))
+ break;
+ if (filter->action == KVM_PMU_EVENT_ALLOW &&
+ i == filter->nevents)
+ allow_event = false;
+ if (filter->action == KVM_PMU_EVENT_DENY &&
+ i < filter->nevents)
+ allow_event = false;
+ }
+ if (!allow_event)
+ return;
+
event_select = eventsel & ARCH_PERFMON_EVENTSEL_EVENT;
unit_mask = (eventsel & ARCH_PERFMON_EVENTSEL_UMASK) >> 8;
ctr_val = rdtsc();
break;
case VMWARE_BACKDOOR_PMC_REAL_TIME:
- ctr_val = ktime_get_boot_ns();
+ ctr_val = ktime_get_boottime_ns();
break;
case VMWARE_BACKDOOR_PMC_APPARENT_TIME:
- ctr_val = ktime_get_boot_ns() +
+ ctr_val = ktime_get_boottime_ns() +
vcpu->kvm->arch.kvmclock_offset;
break;
default:
{
kvm_pmu_reset(vcpu);
}
+
+ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
+ {
+ struct kvm_pmu_event_filter tmp, *filter;
+ size_t size;
+ int r;
+
+ if (copy_from_user(&tmp, argp, sizeof(tmp)))
+ return -EFAULT;
+
+ if (tmp.action != KVM_PMU_EVENT_ALLOW &&
+ tmp.action != KVM_PMU_EVENT_DENY)
+ return -EINVAL;
+
+ if (tmp.nevents > KVM_PMU_EVENT_FILTER_MAX_EVENTS)
+ return -E2BIG;
+
+ size = struct_size(filter, events, tmp.nevents);
+ filter = kmalloc(size, GFP_KERNEL_ACCOUNT);
+ if (!filter)
+ return -ENOMEM;
+
+ r = -EFAULT;
+ if (copy_from_user(filter, argp, size))
+ goto cleanup;
+
+ /* Ensure nevents can't be changed between the user copies. */
+ *filter = tmp;
+
+ mutex_lock(&kvm->lock);
+ rcu_swap_protected(kvm->arch.pmu_event_filter, filter,
+ mutex_is_locked(&kvm->lock));
+ mutex_unlock(&kvm->lock);
+
+ synchronize_srcu_expedited(&kvm->srcu);
+ r = 0;
+ cleanup:
+ kfree(filter);
+ return r;
+ }
#define vmx_vmread_bitmap (vmx_bitmap[VMX_VMREAD_BITMAP])
#define vmx_vmwrite_bitmap (vmx_bitmap[VMX_VMWRITE_BITMAP])
- static u16 shadow_read_only_fields[] = {
- #define SHADOW_FIELD_RO(x) x,
+ struct shadow_vmcs_field {
+ u16 encoding;
+ u16 offset;
+ };
+ static struct shadow_vmcs_field shadow_read_only_fields[] = {
+ #define SHADOW_FIELD_RO(x, y) { x, offsetof(struct vmcs12, y) },
#include "vmcs_shadow_fields.h"
};
static int max_shadow_read_only_fields =
ARRAY_SIZE(shadow_read_only_fields);
- static u16 shadow_read_write_fields[] = {
- #define SHADOW_FIELD_RW(x) x,
+ static struct shadow_vmcs_field shadow_read_write_fields[] = {
+ #define SHADOW_FIELD_RW(x, y) { x, offsetof(struct vmcs12, y) },
#include "vmcs_shadow_fields.h"
};
static int max_shadow_read_write_fields =
memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE);
for (i = j = 0; i < max_shadow_read_only_fields; i++) {
- u16 field = shadow_read_only_fields[i];
+ struct shadow_vmcs_field entry = shadow_read_only_fields[i];
+ u16 field = entry.encoding;
if (vmcs_field_width(field) == VMCS_FIELD_WIDTH_U64 &&
(i + 1 == max_shadow_read_only_fields ||
- shadow_read_only_fields[i + 1] != field + 1))
+ shadow_read_only_fields[i + 1].encoding != field + 1))
pr_err("Missing field from shadow_read_only_field %x\n",
field + 1);
clear_bit(field, vmx_vmread_bitmap);
- #ifdef CONFIG_X86_64
if (field & 1)
+ #ifdef CONFIG_X86_64
continue;
+ #else
+ entry.offset += sizeof(u32);
#endif
- if (j < i)
- shadow_read_only_fields[j] = field;
- j++;
+ shadow_read_only_fields[j++] = entry;
}
max_shadow_read_only_fields = j;
for (i = j = 0; i < max_shadow_read_write_fields; i++) {
- u16 field = shadow_read_write_fields[i];
+ struct shadow_vmcs_field entry = shadow_read_write_fields[i];
+ u16 field = entry.encoding;
if (vmcs_field_width(field) == VMCS_FIELD_WIDTH_U64 &&
(i + 1 == max_shadow_read_write_fields ||
- shadow_read_write_fields[i + 1] != field + 1))
+ shadow_read_write_fields[i + 1].encoding != field + 1))
pr_err("Missing field from shadow_read_write_field %x\n",
field + 1);
+ WARN_ONCE(field >= GUEST_ES_AR_BYTES &&
+ field <= GUEST_TR_AR_BYTES,
+ "Update vmcs12_write_any() to drop reserved bits from AR_BYTES");
+
/*
* PML and the preemption timer can be emulated, but the
* processor cannot vmwrite to fields that don't exist
clear_bit(field, vmx_vmwrite_bitmap);
clear_bit(field, vmx_vmread_bitmap);
- #ifdef CONFIG_X86_64
if (field & 1)
+ #ifdef CONFIG_X86_64
continue;
+ #else
+ entry.offset += sizeof(u32);
#endif
- if (j < i)
- shadow_read_write_fields[j] = field;
- j++;
+ shadow_read_write_fields[j++] = entry;
}
max_shadow_read_write_fields = j;
}
static void vmx_disable_shadow_vmcs(struct vcpu_vmx *vmx)
{
- vmcs_clear_bits(SECONDARY_VM_EXEC_CONTROL, SECONDARY_EXEC_SHADOW_VMCS);
+ secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_SHADOW_VMCS);
vmcs_write64(VMCS_LINK_POINTER, -1ull);
}
free_loaded_vmcs(&vmx->nested.vmcs02);
}
+ static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx,
+ struct loaded_vmcs *prev)
+ {
+ struct vmcs_host_state *dest, *src;
+
+ if (unlikely(!vmx->guest_state_loaded))
+ return;
+
+ src = &prev->host_state;
+ dest = &vmx->loaded_vmcs->host_state;
+
+ vmx_set_host_fs_gs(dest, src->fs_sel, src->gs_sel, src->fs_base, src->gs_base);
+ dest->ldt_sel = src->ldt_sel;
+ #ifdef CONFIG_X86_64
+ dest->ds_sel = src->ds_sel;
+ dest->es_sel = src->es_sel;
+ #endif
+ }
+
static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
+ struct loaded_vmcs *prev;
int cpu;
if (vmx->loaded_vmcs == vmcs)
return;
cpu = get_cpu();
- vmx_vcpu_put(vcpu);
+ prev = vmx->loaded_vmcs;
vmx->loaded_vmcs = vmcs;
- vmx_vcpu_load(vcpu, cpu);
+ vmx_vcpu_load_vmcs(vcpu, cpu);
+ vmx_sync_vmcs_host_state(vmx, prev);
put_cpu();
- vm_entry_controls_reset_shadow(vmx);
- vm_exit_controls_reset_shadow(vmx);
vmx_segment_cache_clear(vmx);
}
* If PAE paging and EPT are both on, CR3 is not used by the CPU and
* must not be dereferenced.
*/
- if (!is_long_mode(vcpu) && is_pae(vcpu) && is_paging(vcpu) &&
- !nested_ept) {
+ if (is_pae_paging(vcpu) && !nested_ept) {
if (!load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3)) {
*entry_failure_code = ENTRY_FAIL_PDPTE;
return -EINVAL;
vmx->nested.msrs.misc_low = data;
vmx->nested.msrs.misc_high = data >> 32;
- /*
- * If L1 has read-only VM-exit information fields, use the
- * less permissive vmx_vmwrite_bitmap to specify write
- * permissions for the shadow VMCS.
- */
- if (enable_shadow_vmcs && !nested_cpu_has_vmwrite_any_field(&vmx->vcpu))
- vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap));
-
return 0;
}
case MSR_IA32_VMX_VMCS_ENUM:
vmx->nested.msrs.vmcs_enum = data;
return 0;
+ case MSR_IA32_VMX_VMFUNC:
+ if (data & ~vmx->nested.msrs.vmfunc_controls)
+ return -EINVAL;
+ vmx->nested.msrs.vmfunc_controls = data;
+ return 0;
default:
/*
* The rest of the VMX capability MSRs do not support restore.
}
/*
- * Copy the writable VMCS shadow fields back to the VMCS12, in case
- * they have been modified by the L1 guest. Note that the "read-only"
- * VM-exit information fields are actually writable if the vCPU is
- * configured to support "VMWRITE to any supported field in the VMCS."
+ * Copy the writable VMCS shadow fields back to the VMCS12, in case they have
+ * been modified by the L1 guest. Note, "writable" in this context means
+ * "writable by the guest", i.e. tagged SHADOW_FIELD_RW; the set of
+ * fields tagged SHADOW_FIELD_RO may or may not align with the "read-only"
+ * VM-exit information fields (which are actually writable if the vCPU is
+ * configured to support "VMWRITE to any supported field in the VMCS").
*/
static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx)
{
- const u16 *fields[] = {
- shadow_read_write_fields,
- shadow_read_only_fields
- };
- const int max_fields[] = {
- max_shadow_read_write_fields,
- max_shadow_read_only_fields
- };
- int i, q;
- unsigned long field;
- u64 field_value;
struct vmcs *shadow_vmcs = vmx->vmcs01.shadow_vmcs;
+ struct vmcs12 *vmcs12 = get_vmcs12(&vmx->vcpu);
+ struct shadow_vmcs_field field;
+ unsigned long val;
+ int i;
preempt_disable();
vmcs_load(shadow_vmcs);
- for (q = 0; q < ARRAY_SIZE(fields); q++) {
- for (i = 0; i < max_fields[q]; i++) {
- field = fields[q][i];
- field_value = __vmcs_readl(field);
- vmcs12_write_any(get_vmcs12(&vmx->vcpu), field, field_value);
- }
- /*
- * Skip the VM-exit information fields if they are read-only.
- */
- if (!nested_cpu_has_vmwrite_any_field(&vmx->vcpu))
- break;
+ for (i = 0; i < max_shadow_read_write_fields; i++) {
+ field = shadow_read_write_fields[i];
+ val = __vmcs_readl(field.encoding);
+ vmcs12_write_any(vmcs12, field.encoding, field.offset, val);
}
vmcs_clear(shadow_vmcs);
static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx)
{
- const u16 *fields[] = {
+ const struct shadow_vmcs_field *fields[] = {
shadow_read_write_fields,
shadow_read_only_fields
};
max_shadow_read_write_fields,
max_shadow_read_only_fields
};
- int i, q;
- unsigned long field;
- u64 field_value = 0;
struct vmcs *shadow_vmcs = vmx->vmcs01.shadow_vmcs;
+ struct vmcs12 *vmcs12 = get_vmcs12(&vmx->vcpu);
+ struct shadow_vmcs_field field;
+ unsigned long val;
+ int i, q;
vmcs_load(shadow_vmcs);
for (q = 0; q < ARRAY_SIZE(fields); q++) {
for (i = 0; i < max_fields[q]; i++) {
field = fields[q][i];
- vmcs12_read_any(get_vmcs12(&vmx->vcpu), field, &field_value);
- __vmcs_writel(field, field_value);
+ val = vmcs12_read_any(vmcs12, field.encoding,
+ field.offset);
+ __vmcs_writel(field.encoding, val);
}
}
* evmcs->host_gdtr_base = vmcs12->host_gdtr_base;
* evmcs->host_idtr_base = vmcs12->host_idtr_base;
* evmcs->host_rsp = vmcs12->host_rsp;
- * sync_vmcs12() doesn't read these:
+ * sync_vmcs02_to_vmcs12() doesn't read these:
* evmcs->io_bitmap_a = vmcs12->io_bitmap_a;
* evmcs->io_bitmap_b = vmcs12->io_bitmap_b;
* evmcs->msr_bitmap = vmcs12->msr_bitmap;
bool from_launch)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- struct hv_vp_assist_page assist_page;
+ bool evmcs_gpa_changed = false;
+ u64 evmcs_gpa;
if (likely(!vmx->nested.enlightened_vmcs_enabled))
return 1;
- if (unlikely(!kvm_hv_get_assist_page(vcpu, &assist_page)))
- return 1;
-
- if (unlikely(!assist_page.enlighten_vmentry))
+ if (!nested_enlightened_vmentry(vcpu, &evmcs_gpa))
return 1;
- if (unlikely(assist_page.current_nested_vmcs !=
- vmx->nested.hv_evmcs_vmptr)) {
-
+ if (unlikely(evmcs_gpa != vmx->nested.hv_evmcs_vmptr)) {
if (!vmx->nested.hv_evmcs)
vmx->nested.current_vmptr = -1ull;
nested_release_evmcs(vcpu);
- if (kvm_vcpu_map(vcpu, gpa_to_gfn(assist_page.current_nested_vmcs),
+ if (kvm_vcpu_map(vcpu, gpa_to_gfn(evmcs_gpa),
&vmx->nested.hv_evmcs_map))
return 0;
}
vmx->nested.dirty_vmcs12 = true;
- /*
- * As we keep L2 state for one guest only 'hv_clean_fields' mask
- * can't be used when we switch between them. Reset it here for
- * simplicity.
- */
- vmx->nested.hv_evmcs->hv_clean_fields &=
- ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
- vmx->nested.hv_evmcs_vmptr = assist_page.current_nested_vmcs;
+ vmx->nested.hv_evmcs_vmptr = evmcs_gpa;
+ evmcs_gpa_changed = true;
/*
* Unlike normal vmcs12, enlightened vmcs12 is not fully
* reloaded from guest's memory (read only fields, fields not
}
}
+
+ /*
+ * Clean fields data can't be used on VMLAUNCH and when we switch
+ * between different L2 guests as KVM keeps a single VMCS12 per L1.
+ */
+ if (from_launch || evmcs_gpa_changed)
+ vmx->nested.hv_evmcs->hv_clean_fields &=
+ ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
+
return 1;
}
- void nested_sync_from_vmcs12(struct kvm_vcpu *vcpu)
+ void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
copy_vmcs12_to_shadow(vmx);
}
- vmx->nested.need_vmcs12_sync = false;
+ vmx->nested.need_vmcs12_to_shadow_sync = false;
}
static enum hrtimer_restart vmx_preemption_timer_fn(struct hrtimer *timer)
if (cpu_has_vmx_msr_bitmap())
vmcs_write64(MSR_BITMAP, __pa(vmx->nested.vmcs02.msr_bitmap));
- if (enable_pml)
+ /*
+ * The PML address never changes, so it is constant in vmcs02.
+ * Conceptually we want to copy the PML index from vmcs01 here,
+ * and then back to vmcs01 on nested vmexit. But since we flush
+ * the log and reset GUEST_PML_INDEX on each vmexit, the PML
+ * index is also effectively constant in vmcs02.
+ */
+ if (enable_pml) {
vmcs_write64(PML_ADDRESS, page_to_phys(vmx->pml_pg));
+ vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
+ }
+
+ if (cpu_has_vmx_encls_vmexit())
+ vmcs_write64(ENCLS_EXITING_BITMAP, -1ull);
/*
* Set the MSR load/store lists to match L0's settings. Only the
vmx_set_constant_host_state(vmx);
}
- static void prepare_vmcs02_early_full(struct vcpu_vmx *vmx,
+ static void prepare_vmcs02_early_rare(struct vcpu_vmx *vmx,
struct vmcs12 *vmcs12)
{
prepare_vmcs02_constant_state(vmx);
u64 guest_efer = nested_vmx_calc_efer(vmx, vmcs12);
if (vmx->nested.dirty_vmcs12 || vmx->nested.hv_evmcs)
- prepare_vmcs02_early_full(vmx, vmcs12);
+ prepare_vmcs02_early_rare(vmx, vmcs12);
/*
* PIN CONTROLS
*/
- exec_control = vmcs12->pin_based_vm_exec_control;
-
- /* Preemption timer setting is computed directly in vmx_vcpu_run. */
- exec_control |= vmcs_config.pin_based_exec_ctrl;
- exec_control &= ~PIN_BASED_VMX_PREEMPTION_TIMER;
- vmx->loaded_vmcs->hv_timer_armed = false;
+ exec_control = vmx_pin_based_exec_ctrl(vmx);
+ exec_control |= (vmcs12->pin_based_vm_exec_control &
+ ~PIN_BASED_VMX_PREEMPTION_TIMER);
/* Posted interrupts setting is only taken from vmcs12. */
if (nested_cpu_has_posted_intr(vmcs12)) {
} else {
exec_control &= ~PIN_BASED_POSTED_INTR;
}
- vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, exec_control);
+ pin_controls_set(vmx, exec_control);
/*
* EXEC CONTROLS
exec_control &= ~CPU_BASED_TPR_SHADOW;
exec_control |= vmcs12->cpu_based_vm_exec_control;
- /*
- * Write an illegal value to VIRTUAL_APIC_PAGE_ADDR. Later, if
- * nested_get_vmcs12_pages can't fix it up, the illegal value
- * will result in a VM entry failure.
- */
- if (exec_control & CPU_BASED_TPR_SHADOW) {
- vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, -1ull);
+ if (exec_control & CPU_BASED_TPR_SHADOW)
vmcs_write32(TPR_THRESHOLD, vmcs12->tpr_threshold);
- } else {
#ifdef CONFIG_X86_64
+ else
exec_control |= CPU_BASED_CR8_LOAD_EXITING |
CPU_BASED_CR8_STORE_EXITING;
#endif
- }
/*
* A vmexit (to either L1 hypervisor or L0 userspace) is always needed
* for I/O port accesses.
*/
- exec_control &= ~CPU_BASED_USE_IO_BITMAPS;
exec_control |= CPU_BASED_UNCOND_IO_EXITING;
- vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);
+ exec_control &= ~CPU_BASED_USE_IO_BITMAPS;
+
+ /*
+ * This bit will be computed in nested_get_vmcs12_pages, because
+ * we do not have access to L1's MSR bitmap yet. For now, keep
+ * the same bit as before, hoping to avoid multiple VMWRITEs that
+ * only set/clear this bit.
+ */
+ exec_control &= ~CPU_BASED_USE_MSR_BITMAPS;
+ exec_control |= exec_controls_get(vmx) & CPU_BASED_USE_MSR_BITMAPS;
+
+ exec_controls_set(vmx, exec_control);
/*
* SECONDARY EXEC CONTROLS
/* VMCS shadowing for L2 is emulated for now */
exec_control &= ~SECONDARY_EXEC_SHADOW_VMCS;
- if (exec_control & SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY)
- vmcs_write16(GUEST_INTR_STATUS,
- vmcs12->guest_intr_status);
-
/*
- * Write an illegal value to APIC_ACCESS_ADDR. Later,
- * nested_get_vmcs12_pages will either fix it up or
- * remove the VM execution control.
+ * Preset *DT exiting when emulating UMIP, so that vmx_set_cr4()
+ * will not have to rewrite the controls just for this bit.
*/
- if (exec_control & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)
- vmcs_write64(APIC_ACCESS_ADDR, -1ull);
+ if (!boot_cpu_has(X86_FEATURE_UMIP) && vmx_umip_emulated() &&
+ (vmcs12->guest_cr4 & X86_CR4_UMIP))
+ exec_control |= SECONDARY_EXEC_DESC;
- if (exec_control & SECONDARY_EXEC_ENCLS_EXITING)
- vmcs_write64(ENCLS_EXITING_BITMAP, -1ull);
+ if (exec_control & SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY)
+ vmcs_write16(GUEST_INTR_STATUS,
+ vmcs12->guest_intr_status);
- vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
+ secondary_exec_controls_set(vmx, exec_control);
}
/*
if (guest_efer != host_efer)
exec_control |= VM_ENTRY_LOAD_IA32_EFER;
}
- vm_entry_controls_init(vmx, exec_control);
+ vm_entry_controls_set(vmx, exec_control);
/*
* EXIT CONTROLS
exec_control = vmx_vmexit_ctrl();
if (cpu_has_load_ia32_efer() && guest_efer != host_efer)
exec_control |= VM_EXIT_LOAD_IA32_EFER;
- vm_exit_controls_init(vmx, exec_control);
-
- /*
- * Conceptually we want to copy the PML address and index from
- * vmcs01 here, and then back to vmcs01 on nested vmexit. But,
- * since we always flush the log on each vmexit and never change
- * the PML address (once set), this happens to be equivalent to
- * simply resetting the index in vmcs02.
- */
- if (enable_pml)
- vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
+ vm_exit_controls_set(vmx, exec_control);
/*
* Interrupt/Exception Fields
}
}
- static void prepare_vmcs02_full(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
+ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
{
struct hv_enlightened_vmcs *hv_evmcs = vmx->nested.hv_evmcs;
vmcs_write32(GUEST_TR_LIMIT, vmcs12->guest_tr_limit);
vmcs_write32(GUEST_GDTR_LIMIT, vmcs12->guest_gdtr_limit);
vmcs_write32(GUEST_IDTR_LIMIT, vmcs12->guest_idtr_limit);
+ vmcs_write32(GUEST_CS_AR_BYTES, vmcs12->guest_cs_ar_bytes);
+ vmcs_write32(GUEST_SS_AR_BYTES, vmcs12->guest_ss_ar_bytes);
vmcs_write32(GUEST_ES_AR_BYTES, vmcs12->guest_es_ar_bytes);
vmcs_write32(GUEST_DS_AR_BYTES, vmcs12->guest_ds_ar_bytes);
vmcs_write32(GUEST_FS_AR_BYTES, vmcs12->guest_fs_ar_bytes);
vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
}
+
+ if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
+ (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
+ vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
}
if (nested_cpu_has_xsaves(vmcs12))
vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
set_cr4_guest_host_mask(vmx);
-
- if (kvm_mpx_supported()) {
- if (vmx->nested.nested_run_pending &&
- (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
- vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
- else
- vmcs_write64(GUEST_BNDCFGS, vmx->nested.vmcs01_guest_bndcfgs);
- }
}
/*
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct hv_enlightened_vmcs *hv_evmcs = vmx->nested.hv_evmcs;
+ bool load_guest_pdptrs_vmcs12 = false;
- if (vmx->nested.dirty_vmcs12 || vmx->nested.hv_evmcs) {
- prepare_vmcs02_full(vmx, vmcs12);
+ if (vmx->nested.dirty_vmcs12 || hv_evmcs) {
+ prepare_vmcs02_rare(vmx, vmcs12);
vmx->nested.dirty_vmcs12 = false;
- }
- /*
- * First, the fields that are shadowed. This must be kept in sync
- * with vmcs_shadow_fields.h.
- */
- if (!hv_evmcs || !(hv_evmcs->hv_clean_fields &
- HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2)) {
- vmcs_write32(GUEST_CS_AR_BYTES, vmcs12->guest_cs_ar_bytes);
- vmcs_write32(GUEST_SS_AR_BYTES, vmcs12->guest_ss_ar_bytes);
+ load_guest_pdptrs_vmcs12 = !hv_evmcs ||
+ !(hv_evmcs->hv_clean_fields &
+ HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1);
}
if (vmx->nested.nested_run_pending &&
kvm_set_dr(vcpu, 7, vcpu->arch.dr7);
vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.vmcs01_debugctl);
}
+ if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending ||
+ !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
+ vmcs_write64(GUEST_BNDCFGS, vmx->nested.vmcs01_guest_bndcfgs);
vmx_set_rflags(vcpu, vmcs12->guest_rflags);
/* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
entry_failure_code))
return -EINVAL;
+ /* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */
+ if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) &&
+ is_pae_paging(vcpu)) {
+ vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
+ vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
+ vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
+ vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
+ }
+
if (!enable_ept)
vcpu->arch.walk_mmu->inject_page_fault = vmx_inject_page_fault_nested;
!kvm_pat_valid(vmcs12->host_ia32_pat))
return -EINVAL;
+ ia32e = (vmcs12->vm_exit_controls &
+ VM_EXIT_HOST_ADDR_SPACE_SIZE) != 0;
+
+ if (vmcs12->host_cs_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK) ||
+ vmcs12->host_ss_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK) ||
+ vmcs12->host_ds_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK) ||
+ vmcs12->host_es_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK) ||
+ vmcs12->host_fs_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK) ||
+ vmcs12->host_gs_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK) ||
+ vmcs12->host_tr_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK) ||
+ vmcs12->host_cs_selector == 0 ||
+ vmcs12->host_tr_selector == 0 ||
+ (vmcs12->host_ss_selector == 0 && !ia32e))
+ return -EINVAL;
+
+ #ifdef CONFIG_X86_64
+ if (is_noncanonical_address(vmcs12->host_fs_base, vcpu) ||
+ is_noncanonical_address(vmcs12->host_gs_base, vcpu) ||
+ is_noncanonical_address(vmcs12->host_gdtr_base, vcpu) ||
+ is_noncanonical_address(vmcs12->host_idtr_base, vcpu) ||
+ is_noncanonical_address(vmcs12->host_tr_base, vcpu))
+ return -EINVAL;
+ #endif
+
/*
* If the load IA32_EFER VM-exit control is 1, bits reserved in the
* IA32_EFER MSR must be 0 in the field for that register. In addition,
* the host address-space size VM-exit control.
*/
if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER) {
- ia32e = (vmcs12->vm_exit_controls &
- VM_EXIT_HOST_ADDR_SPACE_SIZE) != 0;
if (!kvm_valid_efer(vcpu, vmcs12->host_ia32_efer) ||
ia32e != !!(vmcs12->host_ia32_efer & EFER_LMA) ||
ia32e != !!(vmcs12->host_ia32_efer & EFER_LME))
[launched]"i"(offsetof(struct loaded_vmcs, launched)),
[host_state_rsp]"i"(offsetof(struct loaded_vmcs, host_state.rsp)),
[wordsize]"i"(sizeof(ulong))
- : "cc", "memory"
+ : "memory"
);
if (vmx->msr_autoload.host.nr)
hpa = page_to_phys(vmx->nested.apic_access_page);
vmcs_write64(APIC_ACCESS_ADDR, hpa);
} else {
- vmcs_clear_bits(SECONDARY_VM_EXEC_CONTROL,
- SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
+ secondary_exec_controls_clearbit(vmx,
+ SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
}
}
if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) {
map = &vmx->nested.virtual_apic_map;
- /*
- * If translation failed, VM entry will fail because
- * prepare_vmcs02 set VIRTUAL_APIC_PAGE_ADDR to -1ull.
- */
if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->virtual_apic_page_addr), map)) {
vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, pfn_to_hpa(map->pfn));
} else if (nested_cpu_has(vmcs12, CPU_BASED_CR8_LOAD_EXITING) &&
* _not_ what the processor does but it's basically the
* only possibility we have.
*/
- vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL,
- CPU_BASED_TPR_SHADOW);
+ exec_controls_clearbit(vmx, CPU_BASED_TPR_SHADOW);
} else {
- printk("bad virtual-APIC page address\n");
- dump_vmcs();
+ /*
+ * Write an illegal value to VIRTUAL_APIC_PAGE_ADDR to
+ * force VM-Entry to fail.
+ */
+ vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, -1ull);
}
}
}
}
if (nested_vmx_prepare_msr_bitmap(vcpu, vmcs12))
- vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL,
- CPU_BASED_USE_MSR_BITMAPS);
+ exec_controls_setbit(vmx, CPU_BASED_USE_MSR_BITMAPS);
else
- vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL,
- CPU_BASED_USE_MSR_BITMAPS);
+ exec_controls_clearbit(vmx, CPU_BASED_USE_MSR_BITMAPS);
}
/*
u32 exit_reason = EXIT_REASON_INVALID_STATE;
u32 exit_qual;
- evaluate_pending_interrupts = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL) &
+ evaluate_pending_interrupts = exec_controls_get(vmx) &
(CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_VIRTUAL_NMI_PENDING);
if (likely(!evaluate_pending_interrupts) && kvm_vcpu_apicv_active(vcpu))
evaluate_pending_interrupts |= vmx_has_apicv_interrupt(vcpu);
!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
vmx->nested.vmcs01_guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
+ /*
+ * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled *and*
+ * nested early checks are disabled. In the event of a "late" VM-Fail,
+ * i.e. a VM-Fail detected by hardware but not KVM, KVM must unwind its
+ * software model to the pre-VMEntry host state. When EPT is disabled,
+ * GUEST_CR3 holds KVM's shadow CR3, not L1's "real" CR3, which causes
+ * nested_vmx_restore_host_state() to corrupt vcpu->arch.cr3. Stuffing
+ * vmcs01.GUEST_CR3 results in the unwind naturally setting arch.cr3 to
+ * the correct value. Smashing vmcs01.GUEST_CR3 is safe because nested
+ * VM-Exits, and the unwind, reset KVM's MMU, i.e. vmcs01.GUEST_CR3 is
+ * guaranteed to be overwritten with a shadow CR3 prior to re-entering
+ * L1. Don't stuff vmcs01.GUEST_CR3 when using nested early checks as
+ * KVM modifies vcpu->arch.cr3 if and only if the early hardware checks
+ * pass, and early VM-Fails do not reset KVM's MMU, i.e. the VM-Fail
+ * path would need to manually save/restore vmcs01.GUEST_CR3.
+ */
+ if (!enable_ept && !nested_early_check)
+ vmcs_writel(GUEST_CR3, vcpu->arch.cr3);
+
vmx_switch_vmcs(vcpu, &vmx->nested.vmcs02);
prepare_vmcs02_early(vmx, vmcs12);
vmcs12->vm_exit_reason = exit_reason | VMX_EXIT_REASONS_FAILED_VMENTRY;
vmcs12->exit_qualification = exit_qual;
if (enable_shadow_vmcs || vmx->nested.hv_evmcs)
- vmx->nested.need_vmcs12_sync = true;
+ vmx->nested.need_vmcs12_to_shadow_sync = true;
return 1;
}
if (!nested_vmx_check_permission(vcpu))
return 1;
- if (!nested_vmx_handle_enlightened_vmptrld(vcpu, true))
+ if (!nested_vmx_handle_enlightened_vmptrld(vcpu, launch))
return 1;
if (!vmx->nested.hv_evmcs && vmx->nested.current_vmptr == -1ull)
return value >> VMX_MISC_EMULATED_PREEMPTION_TIMER_RATE;
}
- /*
- * Update the guest state fields of vmcs12 to reflect changes that
- * occurred while L2 was running. (The "IA-32e mode guest" bit of the
- * VM-entry controls is also updated, since this is really a guest
- * state bit.)
- */
- static void sync_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
- {
- vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12);
- vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12);
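+ /*
+  * Fields that are only synced from vmcs02 to vmcs12 on demand, via
+  * sync_vmcs02_to_vmcs12_rare().
+  */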
+ static bool is_vmcs12_ext_field(unsigned long field)
+ {
+ switch (field) {
+ case GUEST_ES_SELECTOR:
+ case GUEST_CS_SELECTOR:
+ case GUEST_SS_SELECTOR:
+ case GUEST_DS_SELECTOR:
+ case GUEST_FS_SELECTOR:
+ case GUEST_GS_SELECTOR:
+ case GUEST_LDTR_SELECTOR:
+ case GUEST_TR_SELECTOR:
+ case GUEST_ES_LIMIT:
+ case GUEST_CS_LIMIT:
+ case GUEST_SS_LIMIT:
+ case GUEST_DS_LIMIT:
+ case GUEST_FS_LIMIT:
+ case GUEST_GS_LIMIT:
+ case GUEST_LDTR_LIMIT:
+ case GUEST_TR_LIMIT:
+ case GUEST_GDTR_LIMIT:
+ case GUEST_IDTR_LIMIT:
+ case GUEST_ES_AR_BYTES:
+ case GUEST_DS_AR_BYTES:
+ case GUEST_FS_AR_BYTES:
+ case GUEST_GS_AR_BYTES:
+ case GUEST_LDTR_AR_BYTES:
+ case GUEST_TR_AR_BYTES:
+ case GUEST_ES_BASE:
+ case GUEST_CS_BASE:
+ case GUEST_SS_BASE:
+ case GUEST_DS_BASE:
+ case GUEST_FS_BASE:
+ case GUEST_GS_BASE:
+ case GUEST_LDTR_BASE:
+ case GUEST_TR_BASE:
+ case GUEST_GDTR_BASE:
+ case GUEST_IDTR_BASE:
+ case GUEST_PENDING_DBG_EXCEPTIONS:
+ case GUEST_BNDCFGS:
+ return true;
+ default:
+ break;
+ }
- vmcs12->guest_rsp = kvm_rsp_read(vcpu);
- vmcs12->guest_rip = kvm_rip_read(vcpu);
- vmcs12->guest_rflags = vmcs_readl(GUEST_RFLAGS);
+ return false;
+ }
+
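+ /*
+  * Read the rarely-used guest state fields from the currently loaded
+  * vmcs02 into vmcs12 and clear the pending-sync flag.
+  */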
+ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
+ struct vmcs12 *vmcs12)
+ {
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
vmcs12->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
vmcs12->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
vmcs12->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT);
vmcs12->guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT);
vmcs12->guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES);
- vmcs12->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES);
- vmcs12->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES);
vmcs12->guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES);
vmcs12->guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES);
vmcs12->guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES);
vmcs12->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
vmcs12->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
vmcs12->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
+ vmcs12->guest_pending_dbg_exceptions =
+ vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
+ if (kvm_mpx_supported())
+ vmcs12->guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
+
+ vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
+ }
+
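+ /*
+  * If a sync is pending, temporarily switch to vmcs02 so the rare
+  * fields can be read, then switch back to vmcs01.
+  */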
+ static void copy_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
+ struct vmcs12 *vmcs12)
+ {
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
+ int cpu;
+
+ if (!vmx->nested.need_sync_vmcs02_to_vmcs12_rare)
+ return;
+
+ WARN_ON_ONCE(vmx->loaded_vmcs != &vmx->vmcs01);
+
+ cpu = get_cpu();
+ vmx->loaded_vmcs = &vmx->nested.vmcs02;
+ vmx_vcpu_load(&vmx->vcpu, cpu);
+
+ sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12);
+
+ vmx->loaded_vmcs = &vmx->vmcs01;
+ vmx_vcpu_load(&vmx->vcpu, cpu);
+ put_cpu();
+ }
+
+ /*
+ * Update the guest state fields of vmcs12 to reflect changes that
+ * occurred while L2 was running. (The "IA-32e mode guest" bit of the
+ * VM-entry controls is also updated, since this is really a guest
+ * state bit.)
+ */
+ static void sync_vmcs02_to_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+ {
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+ if (vmx->nested.hv_evmcs)
+ sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12);
+
+ vmx->nested.need_sync_vmcs02_to_vmcs12_rare = !vmx->nested.hv_evmcs;
+
+ vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12);
+ vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12);
+
+ vmcs12->guest_rsp = kvm_rsp_read(vcpu);
+ vmcs12->guest_rip = kvm_rip_read(vcpu);
+ vmcs12->guest_rflags = vmcs_readl(GUEST_RFLAGS);
+
+ vmcs12->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES);
+ vmcs12->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES);
+
+ vmcs12->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS);
+ vmcs12->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP);
+ vmcs12->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP);
vmcs12->guest_interruptibility_info =
vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
- vmcs12->guest_pending_dbg_exceptions =
- vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
+
if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED)
vmcs12->guest_activity_state = GUEST_ACTIVITY_HLT;
else
*/
if (enable_ept) {
vmcs12->guest_cr3 = vmcs_readl(GUEST_CR3);
- vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
- vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
- vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
- vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+ if (nested_cpu_has_ept(vmcs12) && is_pae_paging(vcpu)) {
+ vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+ vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+ vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+ vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+ }
}
vmcs12->guest_linear_address = vmcs_readl(GUEST_LINEAR_ADDRESS);
(vmcs12->vm_entry_controls & ~VM_ENTRY_IA32E_MODE) |
(vm_entry_controls_get(to_vmx(vcpu)) & VM_ENTRY_IA32E_MODE);
- if (vmcs12->vm_exit_controls & VM_EXIT_SAVE_DEBUG_CONTROLS) {
+ if (vmcs12->vm_exit_controls & VM_EXIT_SAVE_DEBUG_CONTROLS)
kvm_get_dr(vcpu, 7, (unsigned long *)&vmcs12->guest_dr7);
- vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
- }
- /* TODO: These cannot have changed unless we have MSR bitmaps and
- * the relevant bit asks not to trap the change */
- if (vmcs12->vm_exit_controls & VM_EXIT_SAVE_IA32_PAT)
- vmcs12->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
if (vmcs12->vm_exit_controls & VM_EXIT_SAVE_IA32_EFER)
vmcs12->guest_ia32_efer = vcpu->arch.efer;
- vmcs12->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS);
- vmcs12->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP);
- vmcs12->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP);
- if (kvm_mpx_supported())
- vmcs12->guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
}
/*
u32 exit_reason, u32 exit_intr_info,
unsigned long exit_qualification)
{
- /* update guest state fields: */
- sync_vmcs12(vcpu, vmcs12);
-
/* update exit information fields: */
-
vmcs12->vm_exit_reason = exit_reason;
vmcs12->exit_qualification = exit_qualification;
vmcs12->vm_exit_intr_info = exit_intr_info;
vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
nested_ept_uninit_mmu_context(vcpu);
-
- /*
- * This is only valid if EPT is in use, otherwise the vmcs01 GUEST_CR3
- * points to shadow pages! Fortunately we only get here after a WARN_ON
- * if EPT is disabled, so a VMabort is perfectly fine.
- */
- if (enable_ept) {
- vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
- __set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
- } else {
- nested_vmx_abort(vcpu, VMX_ABORT_VMCS_CORRUPTED);
- }
+ vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
+ __set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
/*
* Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
* VMFail, like everything else we just need to ensure our
* software model is up-to-date.
*/
- ept_save_pdptrs(vcpu);
+ if (enable_ept)
+ ept_save_pdptrs(vcpu);
kvm_mmu_reset_context(vcpu);
vcpu->arch.tsc_offset -= vmcs12->tsc_offset;
if (likely(!vmx->fail)) {
- if (exit_reason == -1)
- sync_vmcs12(vcpu, vmcs12);
- else
+ sync_vmcs02_to_vmcs12(vcpu, vmcs12);
+
+ if (exit_reason != -1)
prepare_vmcs12(vcpu, vmcs12, exit_reason, exit_intr_info,
exit_qualification);
/*
- * Must happen outside of sync_vmcs12() as it will
+ * Must happen outside of sync_vmcs02_to_vmcs12() as it will
* also be used to capture vmcs12 cache as part of
* capturing nVMX state for snapshot (migration).
*
kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
if ((exit_reason != -1) && (enable_shadow_vmcs || vmx->nested.hv_evmcs))
- vmx->nested.need_vmcs12_sync = true;
+ vmx->nested.need_vmcs12_to_shadow_sync = true;
/* in case we halted in L2 */
vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
* #UD or #GP.
*/
int get_vmx_mem_address(struct kvm_vcpu *vcpu, unsigned long exit_qualification,
- u32 vmx_instruction_info, bool wr, gva_t *ret)
+ u32 vmx_instruction_info, bool wr, int len, gva_t *ret)
{
gva_t off;
bool exn;
*/
if (!(s.base == 0 && s.limit == 0xffffffff &&
((s.type & 8) || !(s.type & 4))))
- exn = exn || (off + sizeof(u64) > s.limit);
+ exn = exn || ((u64)off + len - 1 > s.limit);
}
if (exn) {
kvm_queue_exception_e(vcpu,
struct x86_exception e;
if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
- vmcs_read32(VMX_INSTRUCTION_INFO), false, &gva))
+ vmcs_read32(VMX_INSTRUCTION_INFO), false,
+ sizeof(*vmpointer), &gva))
return 1;
if (kvm_read_guest_virt(vcpu, gva, vmpointer, sizeof(*vmpointer), &e)) {
if (vmx->nested.current_vmptr == -1ull)
return;
+ copy_vmcs02_to_vmcs12_rare(vcpu, get_vmcs12(vcpu));
+
if (enable_shadow_vmcs) {
/* copy to memory all shadowed fields in case
they were modified */
copy_shadow_to_vmcs12(vmx);
- vmx->nested.need_vmcs12_sync = false;
+ vmx->nested.need_vmcs12_to_shadow_sync = false;
vmx_disable_shadow_vmcs(vmx);
}
vmx->nested.posted_intr_nv = -1;
struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 zero = 0;
gpa_t vmptr;
+ u64 evmcs_gpa;
if (!nested_vmx_check_permission(vcpu))
return 1;
return nested_vmx_failValid(vcpu,
VMXERR_VMCLEAR_VMXON_POINTER);
- if (vmx->nested.hv_evmcs_map.hva) {
- if (vmptr == vmx->nested.hv_evmcs_vmptr)
- nested_release_evmcs(vcpu);
- } else {
+ /*
+ * When Enlightened VMEntry is enabled on the calling CPU we treat
+	 * memory area pointed to by vmptr as an Enlightened VMCS (as there's no good
+ * way to distinguish it from VMCS12) and we must not corrupt it by
+ * writing to the non-existent 'launch_state' field. The area doesn't
+ * have to be the currently active EVMCS on the calling CPU and there's
+ * nothing KVM has to do to transition it from 'active' to 'non-active'
+ * state. It is possible that the area will stay mapped as
+ * vmx->nested.hv_evmcs but this shouldn't be a problem.
+ */
+ if (likely(!vmx->nested.enlightened_vmcs_enabled ||
+ !nested_enlightened_vmentry(vcpu, &evmcs_gpa))) {
if (vmptr == vmx->nested.current_vmptr)
nested_release_vmcs12(vcpu);
u64 field_value;
unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+ int len;
gva_t gva = 0;
struct vmcs12 *vmcs12;
+ short offset;
if (!nested_vmx_check_permission(vcpu))
return 1;
/* Decode instruction info and find the field to read */
field = kvm_register_readl(vcpu, (((vmx_instruction_info) >> 28) & 0xf));
- /* Read the field, zero-extended to a u64 field_value */
- if (vmcs12_read_any(vmcs12, field, &field_value) < 0)
+
+ offset = vmcs_field_to_offset(field);
+ if (offset < 0)
return nested_vmx_failValid(vcpu,
VMXERR_UNSUPPORTED_VMCS_COMPONENT);
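+	/*
+	 * Rarely-synced fields may be stale in vmcs12 while L1 is running;
+	 * refresh them from vmcs02 before reading.
+	 */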
+ if (!is_guest_mode(vcpu) && is_vmcs12_ext_field(field))
+ copy_vmcs02_to_vmcs12_rare(vcpu, vmcs12);
+
+ /* Read the field, zero-extended to a u64 field_value */
+ field_value = vmcs12_read_any(vmcs12, field, offset);
+
/*
* Now copy part of this value to register or memory, as requested.
* Note that the number of bits actually copied is 32 or 64 depending
kvm_register_writel(vcpu, (((vmx_instruction_info) >> 3) & 0xf),
field_value);
} else {
+ len = is_64_bit_mode(vcpu) ? 8 : 4;
if (get_vmx_mem_address(vcpu, exit_qualification,
- vmx_instruction_info, true, &gva))
+ vmx_instruction_info, true, len, &gva))
return 1;
/* _system ok, nested_vmx_check_permission has verified cpl=0 */
- kvm_write_guest_virt_system(vcpu, gva, &field_value,
- (is_long_mode(vcpu) ? 8 : 4), NULL);
+ kvm_write_guest_virt_system(vcpu, gva, &field_value, len, NULL);
}
return nested_vmx_succeed(vcpu);
}
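+ /*
+  * Helpers to test whether a field is in the read/write or read-only
+  * shadow VMCS field list (vmcs_shadow_fields.h).
+  */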
+ static bool is_shadow_field_rw(unsigned long field)
+ {
+ switch (field) {
+ #define SHADOW_FIELD_RW(x, y) case x:
+ #include "vmcs_shadow_fields.h"
+ return true;
+ default:
+ break;
+ }
+ return false;
+ }
+
+ static bool is_shadow_field_ro(unsigned long field)
+ {
+ switch (field) {
+ #define SHADOW_FIELD_RO(x, y) case x:
+ #include "vmcs_shadow_fields.h"
+ return true;
+ default:
+ break;
+ }
+ return false;
+ }
static int handle_vmwrite(struct kvm_vcpu *vcpu)
{
unsigned long field;
+ int len;
gva_t gva;
struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
u64 field_value = 0;
struct x86_exception e;
struct vmcs12 *vmcs12;
+ short offset;
if (!nested_vmx_check_permission(vcpu))
return 1;
field_value = kvm_register_readl(vcpu,
(((vmx_instruction_info) >> 3) & 0xf));
else {
+ len = is_64_bit_mode(vcpu) ? 8 : 4;
if (get_vmx_mem_address(vcpu, exit_qualification,
- vmx_instruction_info, false, &gva))
+ vmx_instruction_info, false, len, &gva))
return 1;
- if (kvm_read_guest_virt(vcpu, gva, &field_value,
- (is_64_bit_mode(vcpu) ? 8 : 4), &e)) {
+ if (kvm_read_guest_virt(vcpu, gva, &field_value, len, &e)) {
kvm_inject_page_fault(vcpu, &e);
return 1;
}
return nested_vmx_failValid(vcpu,
VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT);
- if (!is_guest_mode(vcpu))
+ if (!is_guest_mode(vcpu)) {
vmcs12 = get_vmcs12(vcpu);
- else {
+
+ /*
+ * Ensure vmcs12 is up-to-date before any VMWRITE that dirties
+		 * vmcs12, else we may clobber a field or consume a stale value.
+ */
+ if (!is_shadow_field_rw(field))
+ copy_vmcs02_to_vmcs12_rare(vcpu, vmcs12);
+ } else {
/*
* When vmcs->vmcs_link_pointer is -1ull, any VMWRITE
* to shadowed-field sets the ALU flags for VMfailInvalid.
vmcs12 = get_shadow_vmcs12(vcpu);
}
- if (vmcs12_write_any(vmcs12, field, field_value) < 0)
+ offset = vmcs_field_to_offset(field);
+ if (offset < 0)
return nested_vmx_failValid(vcpu,
VMXERR_UNSUPPORTED_VMCS_COMPONENT);
/*
- * Do not track vmcs12 dirty-state if in guest-mode
- * as we actually dirty shadow vmcs12 instead of vmcs12.
+ * Some Intel CPUs intentionally drop the reserved bits of the AR byte
+ * fields on VMWRITE. Emulate this behavior to ensure consistent KVM
+ * behavior regardless of the underlying hardware, e.g. if an AR_BYTE
+ * field is intercepted for VMWRITE but not VMREAD (in L1), then VMREAD
+ * from L1 will return a different value than VMREAD from L2 (L1 sees
+ * the stripped down value, L2 sees the full value as stored by KVM).
*/
- if (!is_guest_mode(vcpu)) {
- switch (field) {
- #define SHADOW_FIELD_RW(x) case x:
- #include "vmcs_shadow_fields.h"
- /*
- * The fields that can be updated by L1 without a vmexit are
- * always updated in the vmcs02, the others go down the slow
- * path of prepare_vmcs02.
- */
- break;
- default:
- vmx->nested.dirty_vmcs12 = true;
- break;
+ if (field >= GUEST_ES_AR_BYTES && field <= GUEST_TR_AR_BYTES)
+ field_value &= 0x1f0ff;
+
+ vmcs12_write_any(vmcs12, field, offset, field_value);
+
+ /*
+ * Do not track vmcs12 dirty-state if in guest-mode as we actually
+ * dirty shadow vmcs12 instead of vmcs12. Fields that can be updated
+ * by L1 without a vmexit are always updated in the vmcs02, i.e. don't
+ * "dirty" vmcs12, all others go down the prepare_vmcs02() slow path.
+ */
+ if (!is_guest_mode(vcpu) && !is_shadow_field_rw(field)) {
+ /*
+ * L1 can read these fields without exiting, ensure the
+ * shadow VMCS is up-to-date.
+ */
+ if (enable_shadow_vmcs && is_shadow_field_ro(field)) {
+ preempt_disable();
+ vmcs_load(vmx->vmcs01.shadow_vmcs);
+
+ __vmcs_writel(field, field_value);
+
+ vmcs_clear(vmx->vmcs01.shadow_vmcs);
+ vmcs_load(vmx->loaded_vmcs->vmcs);
+ preempt_enable();
}
+ vmx->nested.dirty_vmcs12 = true;
}
return nested_vmx_succeed(vcpu);
{
vmx->nested.current_vmptr = vmptr;
if (enable_shadow_vmcs) {
- vmcs_set_bits(SECONDARY_VM_EXEC_CONTROL,
- SECONDARY_EXEC_SHADOW_VMCS);
+ secondary_exec_controls_setbit(vmx, SECONDARY_EXEC_SHADOW_VMCS);
vmcs_write64(VMCS_LINK_POINTER,
__pa(vmx->vmcs01.shadow_vmcs));
- vmx->nested.need_vmcs12_sync = true;
+ vmx->nested.need_vmcs12_to_shadow_sync = true;
}
vmx->nested.dirty_vmcs12 = true;
}
if (unlikely(to_vmx(vcpu)->nested.hv_evmcs))
return 1;
- if (get_vmx_mem_address(vcpu, exit_qual, instr_info, true, &gva))
+ if (get_vmx_mem_address(vcpu, exit_qual, instr_info,
+ true, sizeof(gpa_t), &gva))
return 1;
/* *_system ok, nested_vmx_check_permission has verified cpl=0 */
if (kvm_write_guest_virt_system(vcpu, gva, (void *)¤t_vmptr,
* operand is read even if it isn't needed (e.g., for type==global)
*/
if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
- vmx_instruction_info, false, &gva))
+ vmx_instruction_info, false, sizeof(operand), &gva))
return 1;
if (kvm_read_guest_virt(vcpu, gva, &operand, sizeof(operand), &e)) {
kvm_inject_page_fault(vcpu, &e);
switch (type) {
case VMX_EPT_EXTENT_GLOBAL:
+ case VMX_EPT_EXTENT_CONTEXT:
/*
- * TODO: track mappings and invalidate
- * single context requests appropriately
+ * TODO: Sync the necessary shadow EPT roots here, rather than
+ * at the next emulated VM-entry.
*/
- case VMX_EPT_EXTENT_CONTEXT:
- kvm_mmu_sync_roots(vcpu);
- kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
break;
default:
BUG_ON(1);
* operand is read even if it isn't needed (e.g., for type==global)
*/
if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
- vmx_instruction_info, false, &gva))
+ vmx_instruction_info, false, sizeof(operand), &gva))
return 1;
if (kvm_read_guest_virt(vcpu, gva, &operand, sizeof(operand), &e)) {
kvm_inject_page_fault(vcpu, &e);
vmx = to_vmx(vcpu);
vmcs12 = get_vmcs12(vcpu);
- if (nested_vmx_allowed(vcpu) && vmx->nested.enlightened_vmcs_enabled)
- kvm_state.flags |= KVM_STATE_NESTED_EVMCS;
-
if (nested_vmx_allowed(vcpu) &&
(vmx->nested.vmxon || vmx->nested.smm.vmxon)) {
kvm_state.hdr.vmx.vmxon_pa = vmx->nested.vmxon_ptr;
if (vmx_has_valid_vmcs12(vcpu)) {
kvm_state.size += sizeof(user_vmx_nested_state->vmcs12);
+ if (vmx->nested.hv_evmcs)
+ kvm_state.flags |= KVM_STATE_NESTED_EVMCS;
+
if (is_guest_mode(vcpu) &&
nested_cpu_has_shadow_vmcs(vmcs12) &&
vmcs12->vmcs_link_pointer != -1ull)
* When running L2, the authoritative vmcs12 state is in the
* vmcs02. When running L1, the authoritative vmcs12 state is
* in the shadow or enlightened vmcs linked to vmcs01, unless
- * need_vmcs12_sync is set, in which case, the authoritative
+ * need_vmcs12_to_shadow_sync is set, in which case, the authoritative
* vmcs12 state is in the vmcs12 already.
*/
if (is_guest_mode(vcpu)) {
- sync_vmcs12(vcpu, vmcs12);
- } else if (!vmx->nested.need_vmcs12_sync) {
+ sync_vmcs02_to_vmcs12(vcpu, vmcs12);
+ sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12);
+ } else if (!vmx->nested.need_vmcs12_to_shadow_sync) {
if (vmx->nested.hv_evmcs)
copy_enlightened_to_vmcs12(vmx);
else if (enable_shadow_vmcs)
if (kvm_state->hdr.vmx.vmcs12_pa != -1ull)
return -EINVAL;
+ /*
+		 * KVM_STATE_NESTED_EVMCS used to signal that KVM should
+		 * enable the eVMCS capability on the vCPU. However, the code
+		 * has since been changed such that the flag signals that vmcs12
+		 * should be copied into the eVMCS in guest memory.
+		 *
+		 * To preserve backwards compatibility, allow userspace to set
+		 * this flag even when there is no VMXON region.
+ */
if (kvm_state->flags & ~KVM_STATE_NESTED_EVMCS)
return -EINVAL;
} else {
if (!page_address_valid(vcpu, kvm_state->hdr.vmx.vmxon_pa))
return -EINVAL;
- }
+ }
if ((kvm_state->hdr.vmx.smm.flags & KVM_STATE_NESTED_SMM_GUEST_MODE) &&
(kvm_state->flags & KVM_STATE_NESTED_GUEST_MODE))
* nor can VMLAUNCH/VMRESUME be pending. Outside SMM, SMM flags
* must be zero.
*/
- if (is_smm(vcpu) ? kvm_state->flags : kvm_state->hdr.vmx.smm.flags)
+ if (is_smm(vcpu) ?
+ (kvm_state->flags &
+ (KVM_STATE_NESTED_GUEST_MODE | KVM_STATE_NESTED_RUN_PENDING))
+ : kvm_state->hdr.vmx.smm.flags)
return -EINVAL;
if ((kvm_state->hdr.vmx.smm.flags & KVM_STATE_NESTED_SMM_GUEST_MODE) &&
!(kvm_state->hdr.vmx.smm.flags & KVM_STATE_NESTED_SMM_VMXON))
return -EINVAL;
- vmx_leave_nested(vcpu);
- if (kvm_state->flags & KVM_STATE_NESTED_EVMCS) {
- if (!nested_vmx_allowed(vcpu))
+ if ((kvm_state->flags & KVM_STATE_NESTED_EVMCS) &&
+ (!nested_vmx_allowed(vcpu) || !vmx->nested.enlightened_vmcs_enabled))
return -EINVAL;
- nested_enable_evmcs(vcpu, NULL);
- }
+ vmx_leave_nested(vcpu);
if (kvm_state->hdr.vmx.vmxon_pa == -1ull)
return 0;
* Sync eVMCS upon entry as we may not have
* HV_X64_MSR_VP_ASSIST_PAGE set up yet.
*/
- vmx->nested.need_vmcs12_sync = true;
+ vmx->nested.need_vmcs12_to_shadow_sync = true;
} else {
return -EINVAL;
}
void nested_vmx_vcpu_setup(void)
{
if (enable_shadow_vmcs) {
- /*
- * At vCPU creation, "VMWRITE to any supported field
- * in the VMCS" is supported, so use the more
- * permissive vmx_vmread_bitmap to specify both read
- * and write permissions for the shadow VMCS.
- */
vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap));
- vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmread_bitmap));
+ vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap));
}
}
msrs->secondary_ctls_low = 0;
msrs->secondary_ctls_high &=
SECONDARY_EXEC_DESC |
+ SECONDARY_EXEC_RDTSCP |
SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
+ SECONDARY_EXEC_WBINVD_EXITING |
SECONDARY_EXEC_APIC_REGISTER_VIRT |
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
- SECONDARY_EXEC_WBINVD_EXITING;
+ SECONDARY_EXEC_RDRAND_EXITING |
+ SECONDARY_EXEC_ENABLE_INVPCID |
+ SECONDARY_EXEC_RDSEED_EXITING |
+ SECONDARY_EXEC_XSAVES;
/*
* We can emulate "VMCS shadowing," even if the hardware
{
int i;
- /*
- * Without EPT it is not possible to restore L1's CR3 and PDPTR on
- * VMfail, because they are not available in vmcs01. Just always
- * use hardware checks.
- */
- if (!enable_ept)
- nested_early_check = 1;
-
if (!cpu_has_vmx_shadow_vmcs())
enable_shadow_vmcs = 0;
if (enable_shadow_vmcs) {
#include <asm/mshyperv.h>
#include <asm/hypervisor.h>
#include <asm/intel_pt.h>
+#include <clocksource/hyperv_timer.h>
#define CREATE_TRACE_POINTS
#include "trace.h"
gfn_t gfn;
int r;
- if (is_long_mode(vcpu) || !is_pae(vcpu) || !is_paging(vcpu))
+ if (!is_pae_paging(vcpu))
return false;
if (!test_bit(VCPU_EXREG_PDPTR,
if (is_long_mode(vcpu) &&
(cr3 & rsvd_bits(cpuid_maxphyaddr(vcpu), 63)))
return 1;
- else if (is_pae(vcpu) && is_paging(vcpu) &&
- !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))
+ else if (is_pae_paging(vcpu) &&
+ !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))
return 1;
kvm_mmu_new_cr3(vcpu, cr3, skip_tlb_flush);
MSR_AMD64_VIRT_SPEC_CTRL,
MSR_IA32_POWER_CTL,
+ /*
+ * The following list leaves out MSRs whose values are determined
+ * by arch/x86/kvm/vmx/nested.c based on CPUID or other MSRs.
+ * We always support the "true" VMX control MSRs, even if the host
+	 * processor does not, so these registers are placed here rather
+	 * than in msrs_to_save.
+ */
+ MSR_IA32_VMX_BASIC,
+ MSR_IA32_VMX_TRUE_PINBASED_CTLS,
+ MSR_IA32_VMX_TRUE_PROCBASED_CTLS,
+ MSR_IA32_VMX_TRUE_EXIT_CTLS,
+ MSR_IA32_VMX_TRUE_ENTRY_CTLS,
+ MSR_IA32_VMX_MISC,
+ MSR_IA32_VMX_CR0_FIXED0,
+ MSR_IA32_VMX_CR4_FIXED0,
+ MSR_IA32_VMX_VMCS_ENUM,
+ MSR_IA32_VMX_PROCBASED_CTLS2,
+ MSR_IA32_VMX_EPT_VPID_CAP,
+ MSR_IA32_VMX_VMFUNC,
+
MSR_K7_HWCR,
+ MSR_KVM_POLL_CONTROL,
};
static unsigned num_emulated_msrs;
static unsigned int num_msr_based_features;
- u64 kvm_get_arch_capabilities(void)
+ static u64 kvm_get_arch_capabilities(void)
{
- u64 data;
+ u64 data = 0;
- rdmsrl_safe(MSR_IA32_ARCH_CAPABILITIES, &data);
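+	/* IA32_ARCH_CAPABILITIES reads as zero if the CPU does not enumerate it. */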
+ if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
+ rdmsrl(MSR_IA32_ARCH_CAPABILITIES, data);
/*
* If we're doing cache flushes (either "always" or "cond")
return data;
}
- EXPORT_SYMBOL_GPL(kvm_get_arch_capabilities);
static int kvm_get_msr_feature(struct kvm_msr_entry *msr)
{
vcpu->arch.tsc_always_catchup = 1;
return 0;
} else {
- WARN(1, "user requested TSC rate below hardware speed\n");
+ pr_warn_ratelimited("user requested TSC rate below hardware speed\n");
return -1;
}
}
user_tsc_khz, tsc_khz);
if (ratio == 0 || ratio >= kvm_max_tsc_scaling_ratio) {
- WARN_ONCE(1, "Invalid TSC scaling ratio - virtual-tsc-khz=%u\n",
- user_tsc_khz);
+ pr_warn_ratelimited("Invalid TSC scaling ratio - virtual-tsc-khz=%u\n",
+ user_tsc_khz);
return -1;
}
raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
offset = kvm_compute_tsc_offset(vcpu, data);
- ns = ktime_get_boot_ns();
+ ns = ktime_get_boottime_ns();
elapsed = ns - kvm->arch.last_tsc_nsec;
if (vcpu->arch.virtual_tsc_khz) {
spin_lock(&ka->pvclock_gtod_sync_lock);
if (!ka->use_master_clock) {
spin_unlock(&ka->pvclock_gtod_sync_lock);
- return ktime_get_boot_ns() + ka->kvmclock_offset;
+ return ktime_get_boottime_ns() + ka->kvmclock_offset;
}
hv_clock.tsc_timestamp = ka->master_cycle_now;
&hv_clock.tsc_to_system_mul);
ret = __pvclock_read_cycles(&hv_clock, rdtsc());
} else
- ret = ktime_get_boot_ns() + ka->kvmclock_offset;
+ ret = ktime_get_boottime_ns() + ka->kvmclock_offset;
put_cpu();
}
if (!use_master_clock) {
host_tsc = rdtsc();
- kernel_ns = ktime_get_boot_ns();
+ kernel_ns = ktime_get_boottime_ns();
}
tsc_timestamp = kvm_read_l1_tsc(v, host_tsc);
}
break;
case MSR_IA32_MISC_ENABLE:
- vcpu->arch.ia32_misc_enable_msr = data;
+ if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
+ ((vcpu->arch.ia32_misc_enable_msr ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
+ return 1;
+ vcpu->arch.ia32_misc_enable_msr = data;
+ kvm_update_cpuid(vcpu);
+ } else {
+ vcpu->arch.ia32_misc_enable_msr = data;
+ }
break;
case MSR_IA32_SMBASE:
if (!msr_info->host_initiated)
return 1;
vcpu->arch.smbase = data;
break;
+ case MSR_IA32_POWER_CTL:
+ vcpu->arch.msr_ia32_power_ctl = data;
+ break;
case MSR_IA32_TSC:
kvm_write_tsc(vcpu, msr_info);
break;
return 1;
break;
+ case MSR_KVM_POLL_CONTROL:
+ /* only enable bit supported */
+ if (data & (-1ULL << 1))
+ return 1;
+
+ vcpu->arch.msr_kvm_poll_control = data;
+ break;
+
case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
return 1;
msr_info->data = vcpu->arch.arch_capabilities;
break;
+ case MSR_IA32_POWER_CTL:
+ msr_info->data = vcpu->arch.msr_ia32_power_ctl;
+ break;
case MSR_IA32_TSC:
msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + vcpu->arch.tsc_offset;
break;
case MSR_KVM_PV_EOI_EN:
msr_info->data = vcpu->arch.pv_eoi.msr_val;
break;
+ case MSR_KVM_POLL_CONTROL:
+ msr_info->data = vcpu->arch.msr_kvm_poll_control;
+ break;
case MSR_IA32_P5_MC_ADDR:
case MSR_IA32_P5_MC_TYPE:
case MSR_IA32_MCG_CAP:
case KVM_CAP_SET_BOOT_CPU_ID:
case KVM_CAP_SPLIT_IRQCHIP:
case KVM_CAP_IMMEDIATE_EXIT:
+ case KVM_CAP_PMU_EVENT_FILTER:
case KVM_CAP_GET_MSR_FEATURES:
case KVM_CAP_MSR_PLATFORM_INFO:
case KVM_CAP_EXCEPTION_PAYLOAD:
r = KVM_CLOCK_TSC_STABLE;
break;
case KVM_CAP_X86_DISABLE_EXITS:
- r |= KVM_X86_DISABLE_EXITS_HLT | KVM_X86_DISABLE_EXITS_PAUSE;
+ r |= KVM_X86_DISABLE_EXITS_HLT | KVM_X86_DISABLE_EXITS_PAUSE |
+ KVM_X86_DISABLE_EXITS_CSTATE;
if(kvm_can_mwait_in_guest())
r |= KVM_X86_DISABLE_EXITS_MWAIT;
break;
kvm->arch.hlt_in_guest = true;
if (cap->args[0] & KVM_X86_DISABLE_EXITS_PAUSE)
kvm->arch.pause_in_guest = true;
+ if (cap->args[0] & KVM_X86_DISABLE_EXITS_CSTATE)
+ kvm->arch.cstate_in_guest = true;
r = 0;
break;
case KVM_CAP_MSR_PLATFORM_INFO:
r = kvm_vm_ioctl_hv_eventfd(kvm, &hvevfd);
break;
}
+ case KVM_SET_PMU_EVENT_FILTER:
+ r = kvm_vm_ioctl_set_pmu_event_filter(kvm, argp);
+ break;
default:
r = -ENOTTY;
}
vcpu->arch.db);
if (dr6 != 0) {
- vcpu->arch.dr6 &= ~15;
+ vcpu->arch.dr6 &= ~DR_TRAP_BITS;
vcpu->arch.dr6 |= dr6 | DR6_RTM;
kvm_queue_exception(vcpu, DB_VECTOR);
*r = EMULATE_DONE;
struct kvm_vcpu *vcpu;
int cpu;
- spin_lock(&kvm_lock);
+ mutex_lock(&kvm_lock);
list_for_each_entry(kvm, &vm_list, vm_list)
kvm_make_mclock_inprogress_request(kvm);
spin_unlock(&ka->pvclock_gtod_sync_lock);
}
- spin_unlock(&kvm_lock);
+ mutex_unlock(&kvm_lock);
}
#endif
smp_call_function_single(cpu, tsc_khz_changed, freq, 1);
- spin_lock(&kvm_lock);
+ mutex_lock(&kvm_lock);
list_for_each_entry(kvm, &vm_list, vm_list) {
kvm_for_each_vcpu(i, vcpu, kvm) {
if (vcpu->cpu != cpu)
continue;
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
- if (vcpu->cpu != smp_processor_id())
+ if (vcpu->cpu != raw_smp_processor_id())
send_ipi = 1;
}
}
- spin_unlock(&kvm_lock);
+ mutex_unlock(&kvm_lock);
if (freq->old < freq->new && send_ipi) {
/*
.handle_intel_pt_intr = kvm_handle_intel_pt_intr,
};
- static void kvm_set_mmio_spte_mask(void)
- {
- u64 mask;
- int maxphyaddr = boot_cpu_data.x86_phys_bits;
-
- /*
- * Set the reserved bits and the present bit of an paging-structure
- * entry to generate page fault with PFER.RSV = 1.
- */
-
- /*
- * Mask the uppermost physical address bit, which would be reserved as
- * long as the supported physical address width is less than 52.
- */
- mask = 1ull << 51;
-
- /* Set the present bit. */
- mask |= 1ull;
-
- /*
- * If reserved bit is not supported, clear the present bit to disable
- * mmio page fault.
- */
- if (IS_ENABLED(CONFIG_X86_64) && maxphyaddr == 52)
- mask &= ~1ull;
-
- kvm_mmu_set_mmio_spte_mask(mask, mask);
- }
-
#ifdef CONFIG_X86_64
static void pvclock_gtod_update_fn(struct work_struct *work)
{
struct kvm_vcpu *vcpu;
int i;
- spin_lock(&kvm_lock);
+ mutex_lock(&kvm_lock);
list_for_each_entry(kvm, &vm_list, vm_list)
kvm_for_each_vcpu(i, vcpu, kvm)
kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
atomic_set(&kvm_guest_has_master_clock, 0);
- spin_unlock(&kvm_lock);
+ mutex_unlock(&kvm_lock);
}
static DECLARE_WORK(pvclock_gtod_work, pvclock_gtod_update_fn);
if (r)
goto out_free_percpu;
- kvm_set_mmio_spte_mask();
-
kvm_x86_ops = ops;
kvm_mmu_set_mask_ptes(PT_USER_MASK, PT_ACCESSED_MASK,
kvm_x86_ops->refresh_apicv_exec_ctrl(vcpu);
}
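+ /*
+  * Look up the vCPU with the given destination APIC ID in the APIC map
+  * and yield to it, if found.
+  */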
+ static void kvm_sched_yield(struct kvm *kvm, unsigned long dest_id)
+ {
+ struct kvm_vcpu *target = NULL;
+ struct kvm_apic_map *map;
+
+ rcu_read_lock();
+ map = rcu_dereference(kvm->arch.apic_map);
+
+ if (likely(map) && dest_id <= map->max_apic_id && map->phys_map[dest_id])
+ target = map->phys_map[dest_id]->vcpu;
+
+ rcu_read_unlock();
+
+ if (target)
+ kvm_vcpu_yield_to(target);
+ }
+
int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
{
unsigned long nr, a0, a1, a2, a3, ret;
case KVM_HC_SEND_IPI:
ret = kvm_pv_send_ipi(vcpu->kvm, a0, a1, a2, a3, op_64_bit);
break;
+ case KVM_HC_SCHED_YIELD:
+ kvm_sched_yield(vcpu->kvm, a0);
+ ret = 0;
+ break;
default:
ret = -KVM_ENOSYS;
break;
}
trace_kvm_entry(vcpu->vcpu_id);
- if (lapic_in_kernel(vcpu) &&
- vcpu->arch.apic->lapic_timer.timer_advance_ns)
- wait_lapic_expire(vcpu);
guest_enter_irqoff();
fpregs_assert_state_consistent();
vcpu->mode = OUTSIDE_GUEST_MODE;
smp_wmb();
- kvm_before_interrupt(vcpu);
- kvm_x86_ops->handle_external_intr(vcpu);
- kvm_after_interrupt(vcpu);
+ kvm_x86_ops->handle_exit_irqoff(vcpu);
+ /*
+ * Consume any pending interrupts, including the possible source of
+ * VM-Exit on SVM and any ticks that occur between VM-Exit and now.
+ * An instruction is required after local_irq_enable() to fully unblock
+	 * interrupts on processors that implement an interrupt shadow; the
+	 * stat.exits increment will do nicely.
+ */
+ kvm_before_interrupt(vcpu);
+ local_irq_enable();
++vcpu->stat.exits;
+ local_irq_disable();
+ kvm_after_interrupt(vcpu);
guest_exit_irqoff();
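+	/*
+	 * If the lapic timer advance logic recorded how far the actual
+	 * expiration was from the target, trace the delta and reset it.
+	 */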
+ if (lapic_in_kernel(vcpu)) {
+ s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
+ if (delta != S64_MIN) {
+ trace_kvm_wait_lapic_expire(vcpu->vcpu_id, delta);
+ vcpu->arch.apic->lapic_timer.advance_expire_delta = S64_MIN;
+ }
+ }
local_irq_enable();
preempt_enable();
kvm_update_cpuid(vcpu);
idx = srcu_read_lock(&vcpu->kvm->srcu);
- if (!is_long_mode(vcpu) && is_pae(vcpu) && is_paging(vcpu)) {
+ if (is_pae_paging(vcpu)) {
load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu));
mmu_reset_needed = 1;
}
msr.host_initiated = true;
kvm_write_tsc(vcpu, &msr);
vcpu_put(vcpu);
+
+ /* poll control enabled by default */
+ vcpu->arch.msr_kvm_poll_control = 1;
+
mutex_unlock(&vcpu->mutex);
if (!kvmclock_periodic_sync)
* before any KVM threads can be running. Unfortunately, we can't
* bring the TSCs fully up to date with real time, as we aren't yet far
* enough into CPU bringup that we know how much real time has actually
- * elapsed; our helper function, ktime_get_boot_ns() will be using boot
* elapsed; our helper function, ktime_get_boottime_ns(), will be using boot
* variables that haven't been updated yet.
*
* So we simply find the maximum observed TSC above, then record the
kvm_x86_ops->hardware_unsetup();
}
- void kvm_arch_check_processor_compat(void *rtn)
+ int kvm_arch_check_processor_compat(void)
{
- kvm_x86_ops->check_processor_compatibility(rtn);
+ return kvm_x86_ops->check_processor_compatibility();
}
bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
mutex_init(&kvm->arch.apic_map_lock);
spin_lock_init(&kvm->arch.pvclock_gtod_sync_lock);
- kvm->arch.kvmclock_offset = -ktime_get_boot_ns();
+ kvm->arch.kvmclock_offset = -ktime_get_boottime_ns();
pvclock_update_vm_gtod_copy(kvm);
kvm->arch.guest_can_read_msr_platform_info = true;
kvm_ioapic_destroy(kvm);
kvm_free_vcpus(kvm);
kvfree(rcu_dereference_check(kvm->arch.apic_map, 1));
+ kfree(srcu_dereference_check(kvm->arch.pmu_event_filter, &kvm->srcu, 1));
kvm_mmu_uninit_vm(kvm);
kvm_page_track_cleanup(kvm);
kvm_hv_destroy_vm(kvm);
sizeof(u32));
}
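+ /*
+  * An async "page not present" fault can be delivered only if the guest
+  * has enabled async PF via its MSR; for L2 it also requires delivery
+  * as a #PF VM-exit, and with send_user_only set it is suppressed at
+  * CPL 0.
+  */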
+ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
+ {
+ if (!vcpu->arch.apf.delivery_as_pf_vmexit && is_guest_mode(vcpu))
+ return false;
+
+ if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) ||
+ (vcpu->arch.apf.send_user_only &&
+ kvm_x86_ops->get_cpl(vcpu) == 0))
+ return false;
+
+ return true;
+ }
+
+ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
+ {
+ if (unlikely(!lapic_in_kernel(vcpu) ||
+ kvm_event_needs_reinjection(vcpu) ||
+ vcpu->arch.exception.pending))
+ return false;
+
+ if (kvm_hlt_in_guest(vcpu->kvm) && !kvm_can_deliver_async_pf(vcpu))
+ return false;
+
+ /*
+ * If interrupts are off we cannot even use an artificial
+ * halt state.
+ */
+ return kvm_x86_ops->interrupt_allowed(vcpu);
+ }
+
void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
struct kvm_async_pf *work)
{
trace_kvm_async_pf_not_present(work->arch.token, work->gva);
kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
- if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) ||
- (vcpu->arch.apf.send_user_only &&
- kvm_x86_ops->get_cpl(vcpu) == 0))
- kvm_make_request(KVM_REQ_APF_HALT, vcpu);
- else if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
+ if (kvm_can_deliver_async_pf(vcpu) &&
+ !apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
fault.vector = PF_VECTOR;
fault.error_code_valid = true;
fault.error_code = 0;
fault.address = work->arch.token;
fault.async_page_fault = true;
kvm_inject_page_fault(vcpu, &fault);
+ } else {
+ /*
+ * It is not possible to deliver a paravirtualized asynchronous
+ * page fault, but putting the guest in an artificial halt state
+ * can be beneficial nevertheless: if an interrupt arrives, we
+		 * can deliver it promptly and perhaps the guest will schedule
+ * another process. When the instruction that triggered a page
+ * fault is retried, hopefully the page will be ready in the host.
+ */
+ kvm_make_request(KVM_REQ_APF_HALT, vcpu);
}
}
}
EXPORT_SYMBOL_GPL(kvm_vector_hashing_enabled);
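+ /* Halt polling is disabled when the guest clears bit 0 of MSR_KVM_POLL_CONTROL. */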
+ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
+ {
+ return (vcpu->arch.msr_kvm_poll_control & 1) == 0;
+ }
+ EXPORT_SYMBOL_GPL(kvm_arch_no_poll);
+
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
/* Restore state in a new VM. */
kvm_vm_restart(vm, O_RDWR);
- vm_vcpu_add(vm, VCPU_ID, 0, 0);
+ vm_vcpu_add(vm, VCPU_ID);
vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
+ vcpu_ioctl(vm, VCPU_ID, KVM_ENABLE_CAP, &enable_evmcs_cap);
vcpu_load_state(vm, VCPU_ID, state);
run = vcpu_state(vm, VCPU_ID);
free(state);