1= sPAPR Dynamic Reconfiguration =
2
3sPAPR/"pseries" guests make use of a facility called dynamic-reconfiguration
4to handle hotplugging of dynamic "physical" resources like PCI cards, or
5"logical"/paravirtual resources like memory, CPUs, and "physical"
6host-bridges, which are generally managed by the host/hypervisor and provided
7to guests as virtualized resources. The specifics of dynamic-reconfiguration
8are documented extensively in PAPR+ v2.7, Section 13.1. This document
9provides a summary of that information as it applies to the implementation
10within QEMU.
11
12== Dynamic-reconfiguration Connectors ==
13
14To manage hotplug/unplug of these resources, a firmware abstraction known as
15a Dynamic Resource Connector (DRC) is used to assign a particular dynamic
16resource to the guest, and provide an interface for the guest to manage
17configuration/removal of the resource associated with it.
18
19== Device-tree description of DRCs ==
20
21A set of 4 Open Firmware device tree array properties is used to describe
22the name/index/power-domain/type of each DRC allocated to a guest at
23boot-time. There may be multiple sets of these arrays, rooted at different
24paths in the device tree depending on the type of resource the DRCs manage.
25
26In some cases, the DRCs themselves may be provided by a dynamic resource,
27such as the DRCs managing PCI slots on a hotplugged PHB. In this case the
28arrays would be fetched as part of the device tree retrieval interfaces
29for hotplugged resources described under "Guest->Host interface".
30
31The array properties are described below. Each entry/element in an array
32describes the DRC identified by the element in the corresponding position
33of ibm,drc-indexes:
34
35ibm,drc-names:
36 first 4-bytes: BE-encoded integer denoting the number of entries
37 each entry: a NULL-terminated <name> string encoded as a byte array
38
39 <name> values for logical/virtual resources are defined in PAPR+ v2.7,
40 Section 13.5.2.4, and basically consist of the type of the resource
41 followed by a space and a numerical value that's unique across resources
42 of that type.
43
44 <name> values for "physical" resources such as PCI or VIO devices are
45 defined as being "location codes", which are the "location labels" of
46 each encapsulating device, starting from the chassis down to the
47 individual slot for the device, concatenated by a hyphen. This provides
48 a mapping of resources to a physical location in a chassis for debugging
49 purposes. For QEMU, this mapping is less important, so we assign a
50 location code that conforms to naming specifications, but is simply a
51 location label for the slot by itself to simplify the implementation.
52 The naming convention for location labels is documented in detail in
53 PAPR+ v2.7, Section 12.3.1.5, and in our case amounts to using "C<n>"
54 for PCI/VIO device slots, where <n> is unique across all PCI/VIO
55 device slots.
56
57ibm,drc-indexes:
58 first 4-bytes: BE-encoded integer denoting the number of entries
59 each 4-byte entry: BE-encoded <index> integer that is unique across all DRCs
60 in the machine
61
62 <index> is arbitrary, but in the case of QEMU we try to maintain the
63 convention used to assign them to pSeries guests on pHyp:
64
65 bit[31:28]: integer encoding of <type>, where <type> is:
66 1 for CPU resource
67 2 for PHB resource
68 3 for VIO resource
69 4 for PCI resource
70 8 for Memory resource
71 bit[27:0]: integer encoding of <id>, where <id> is unique across
72 all resources of specified type
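
The index convention above can be sketched as a pair of helpers. This is an illustration only: the function names make_drc_index and split_drc_index are hypothetical and do not correspond to QEMU symbols; only the bit layout and type codes come from the table above.

```python
# Type codes from the table above (stored in bits 31:28 of a DRC index).
DRC_TYPE_CODES = {"CPU": 1, "PHB": 2, "VIO": 3, "PCI": 4, "MEM": 8}
DRC_ID_MASK = 0x0FFFFFFF  # bits 27:0


def make_drc_index(type_code, drc_id):
    """Pack <type> into bits 31:28 and <id> into bits 27:0."""
    if not 0 <= type_code <= 0xF:
        raise ValueError("type code must fit in 4 bits")
    if not 0 <= drc_id <= DRC_ID_MASK:
        raise ValueError("id must fit in 28 bits")
    return (type_code << 28) | drc_id


def split_drc_index(index):
    """Return (type_code, id) for a 32-bit DRC index."""
    return (index >> 28) & 0xF, index & DRC_ID_MASK
```

For example, a PCI slot with <id> 3 would get the index 0x40000003 under this convention.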
73
74ibm,drc-power-domains:
75 first 4-bytes: BE-encoded integer denoting the number of entries
76 each 4-byte entry: 32-bit, BE-encoded <index> integer that specifies the
77 power domain the resource will be assigned to. In the case of QEMU
78 we associate all resources with a "live insertion" domain, where the
79 power is assumed to be managed automatically. The integer value for
80 this domain is a special value of -1.
81
82
83ibm,drc-types:
84 first 4-bytes: BE-encoded integer denoting the number of entries
85 each entry: a NULL-terminated <type> string encoded as a byte array
86
87 <type> is assigned as follows:
88 "CPU" for a CPU
89 "PHB" for a physical host-bridge
90 "SLOT" for a VIO slot
91 "28" for a PCI slot
92 "MEM" for memory resource
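
As a rough illustration of how the four arrays line up, the sketch below decodes raw property values and joins them by position. It assumes each property value is available as a bytes object; the helper names (parse_u32_array, parse_string_array, describe_drcs) are hypothetical. Power domains are decoded as signed integers so that the "live insertion" domain reads back as -1.

```python
import struct


def parse_u32_array(blob, signed=False):
    """Decode <4-byte BE count> followed by <count> 4-byte BE integers."""
    (count,) = struct.unpack_from(">I", blob, 0)
    fmt = ">%d%s" % (count, "i" if signed else "I")
    return list(struct.unpack_from(fmt, blob, 4))


def parse_string_array(blob):
    """Decode <4-byte BE count> followed by <count> NULL-terminated strings."""
    (count,) = struct.unpack_from(">I", blob, 0)
    strings, pos = [], 4
    for _ in range(count):
        end = blob.index(b"\0", pos)  # raises ValueError if truncated
        strings.append(blob[pos:end].decode("ascii"))
        pos = end + 1
    return strings


def describe_drcs(names, indexes, power_domains, types):
    """Join the four arrays by element position, one dict per DRC."""
    rows = zip(parse_u32_array(indexes),
               parse_string_array(names),
               parse_u32_array(power_domains, signed=True),
               parse_string_array(types))
    return [{"index": i, "name": n, "power_domain": p, "type": t}
            for i, n, p, t in rows]
```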
93
94== Guest->Host interface to manage dynamic resources ==
95
96Each DRC is given a globally unique DRC Index, and resources associated with
97a particular DRC are configured/managed by the guest via a number of RTAS
98calls which reference individual DRCs based on the DRC index. This can be
99considered the guest->host interface.
100
101rtas-set-power-level:
102 arg[0]: integer identifying power domain
103 arg[1]: new power level for the domain, 0-100
104 output[0]: status, 0 on success
105 output[1]: power level after command
106
107 Set the power level for a specified power domain
108
109rtas-get-power-level:
110 arg[0]: integer identifying power domain
111 output[0]: status, 0 on success
112 output[1]: current power level
113
114 Get the power level for a specified power domain
115
116rtas-set-indicator:
117 arg[0]: integer identifying sensor/indicator type
118 arg[1]: index of sensor, for DR-related sensors this is generally the
119 DRC index
120 arg[2]: desired sensor value
121 output[0]: status, 0 on success
122
123 Set the state of an indicator or sensor. For the purpose of this document we
124 focus on the indicator/sensor types associated with a DRC. The types are:
125
126 9001: isolation-state, controls/indicates whether a device has been made
127 accessible to a guest
128
129 supported sensor values:
130 0: isolate, device is made inaccessible to guest OS
131 1: unisolate, device is made available to guest OS
132
133 9002: dr-indicator, controls "visual" indicator associated with device
134
135 supported sensor values:
136 0: inactive, resource may be safely removed
137 1: active, resource is in use and cannot be safely removed
138 2: identify, used to visually identify slot for interactive hotplug
139 3: action, in most cases, used in the same manner as identify
140
141 9003: allocation-state, generally only used for "logical" DR resources to
142 request the allocation/deallocation of a resource prior to acquiring
143 it via isolation-state->unisolate, or after releasing it via
144 isolation-state->isolate, respectively. For "physical" DR (like PCI
145 hotplug/unplug) the pre-allocation of the resource is implied and
146 this sensor is unused.
147
148 supported sensor values:
149 0: unusable, tell firmware/system the resource can be
150 unallocated/reclaimed and added back to the system resource pool
151 1: usable, request the resource be allocated/reserved for use by
152 guest OS
153 2: exchange, used to allocate a spare resource to use for fail-over
154 in certain situations. Unused in QEMU.
155 3: recover, used to reclaim a previously allocated resource that's
156 not currently allocated to the guest OS. Unused in QEMU.
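
The ordering of allocation-state and isolation-state described above can be sketched as follows. The functions only build the list of (indicator type, DRC index, value) arguments a guest would pass to rtas-set-indicator; acquire_sequence and release_sequence are hypothetical names, and no actual RTAS call is made.

```python
# Indicator types and values from the rtas-set-indicator description above.
ISOLATION_STATE = 9001
ALLOCATION_STATE = 9003
ISOLATE, UNISOLATE = 0, 1
UNUSABLE, USABLE = 0, 1


def acquire_sequence(drc_index, logical):
    """set-indicator calls, in order, to take ownership of a resource."""
    calls = []
    if logical:
        # Logical resources are allocated first, then unisolated.
        calls.append((ALLOCATION_STATE, drc_index, USABLE))
    calls.append((ISOLATION_STATE, drc_index, UNISOLATE))
    return calls


def release_sequence(drc_index, logical):
    """set-indicator calls, in order, to give a resource back."""
    calls = [(ISOLATION_STATE, drc_index, ISOLATE)]
    if logical:
        # Deallocation follows isolation for logical resources.
        calls.append((ALLOCATION_STATE, drc_index, UNUSABLE))
    return calls
```

For "physical" DR the allocation-state step is omitted, matching the note that pre-allocation is implied in that case.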
157
158rtas-get-sensor-state:
159 arg[0]: integer identifying sensor/indicator type
160 arg[1]: index of sensor, for DR-related sensors this is generally the
161 DRC index
162 output[0]: status, 0 on success
163
164 Used to read an indicator or sensor value.
165
166 For DR-related operations, the only noteworthy sensor is dr-entity-sense,
167 which has a type value of 9003, as allocation-state does in the case of
168 rtas-set-indicator. The semantics/encodings of the sensor values are distinct
169 however:
170
171 supported sensor values for dr-entity-sense (9003) sensor:
172 0: empty,
173 for physical resources: DRC/slot is empty
174 for logical resources: unused
175 1: present,
176 for physical resources: DRC/slot is populated with a device/resource
177 for logical resources: resource has been allocated to the DRC
178 2: unusable,
179 for physical resources: unused
180 for logical resources: DRC has no resource allocated to it
181 3: exchange,
182 for physical resources: unused
183 for logical resources: resource available for exchange (see
184 allocation-state sensor semantics above)
185 4: recovery,
186 for physical resources: unused
187 for logical resources: resource available for recovery (see
188 allocation-state sensor semantics above)
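
A small lookup table makes the physical/logical split of dr-entity-sense values easier to see. This is a sketch that restates the table above; DrEntitySense and describe_entity_sense are hypothetical names.

```python
from enum import IntEnum


class DrEntitySense(IntEnum):
    EMPTY = 0
    PRESENT = 1
    UNUSABLE = 2
    EXCHANGE = 3
    RECOVERY = 4


# state -> (meaning for physical resources, meaning for logical resources);
# None marks a value the table lists as unused for that kind of resource.
_MEANING = {
    DrEntitySense.EMPTY: ("DRC/slot is empty", None),
    DrEntitySense.PRESENT: ("DRC/slot is populated with a device/resource",
                            "resource has been allocated to the DRC"),
    DrEntitySense.UNUSABLE: (None, "DRC has no resource allocated to it"),
    DrEntitySense.EXCHANGE: (None, "resource available for exchange"),
    DrEntitySense.RECOVERY: (None, "resource available for recovery"),
}


def describe_entity_sense(value, logical):
    """Return 'state: meaning' for a dr-entity-sense (9003) reading."""
    state = DrEntitySense(value)  # raises ValueError for unknown values
    meaning = _MEANING[state][1 if logical else 0]
    return "%s: %s" % (state.name.lower(), meaning or "unused")
```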
189
190rtas-ibm-configure-connector:
191 arg[0]: guest physical address of 4096-byte work area buffer
192 arg[1]: 0, or address of additional 4096-byte work area buffer. only non-zero
193 if a prior RTAS response indicated a need for additional memory
194 output[0]: status:
195 0: completed transmittal of device-tree node
196 1: instruct guest to prepare for next DT sibling node
197 2: instruct guest to prepare for next DT child node
198 3: instruct guest to prepare for next DT property
199 4: instruct guest to ascend to parent DT node
200 5: instruct guest to provide additional work-area buffer
201 via arg[1]
202 990x: instruct guest that operation took too long and to try
203 again later
204
205 Used to fetch an OF device-tree description of the resource associated with
206 a particular DRC. The DRC index is encoded in the first 4-bytes of the first
207 work area buffer.
208
209 Work area layout, using 4-byte offsets:
210 wa[0]: DRC index of the DRC to fetch device-tree nodes from
211 wa[1]: 0 (hard-coded)
212 wa[2]: for next-sibling/next-child response:
213 wa offset of null-terminated string denoting the new node's name
214 for next-property response:
215 wa offset of null-terminated string denoting new property's name
216 wa[3]: for next-property response (unused otherwise):
217 byte-length of new property's value
218 wa[4]: for next-property response (unused otherwise):
219 new property's value, encoded as an OFDT-compatible byte array
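
The status codes above effectively drive a tree walk on the guest side. The sketch below shows one way a guest could rebuild the device-tree fragment from a sequence of responses. It is a simplification under stated assumptions: each response is modelled as an already-decoded (status, name, value) tuple rather than a raw work area, the Node class and build_tree function are hypothetical, and a synthetic root stands in for the attachment point in the guest's tree.

```python
CC_DONE = 0
CC_NEXT_SIBLING = 1
CC_NEXT_CHILD = 2
CC_NEXT_PROPERTY = 3
CC_PREV_PARENT = 4
CC_MORE_WORK_AREA = 5


class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.props = {}
        self.children = []


def build_tree(responses):
    """Replay (status, name, value) responses; return the top-level nodes."""
    root = Node("")  # synthetic container, not part of the transmitted tree
    current = root
    for status, name, value in responses:
        if status == CC_DONE:
            return root.children
        if status in (CC_NEXT_CHILD, CC_NEXT_SIBLING):
            if status == CC_NEXT_CHILD or current.parent is None:
                parent = current
            else:
                parent = current.parent
            node = Node(name, parent)
            parent.children.append(node)
            current = node
        elif status == CC_NEXT_PROPERTY:
            current.props[name] = value
        elif status == CC_PREV_PARENT:
            if current.parent is None:
                raise ValueError("ascended past the top of the fragment")
            current = current.parent
        elif status // 10 == 990:
            continue  # 990x: a real guest would wait and reissue the call
        else:
            raise NotImplementedError("status %d not handled" % status)
    raise ValueError("responses ended before a completion status")
```

A real guest would reissue rtas-ibm-configure-connector for each step and decode the name and value from the work area; that part is left out here.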
220
221== hotplug/unplug events ==
222
223For most DR operations, the hypervisor will issue host->guest add/remove events
224using the EPOW/check-exception notification framework, where the host issues a
225check-exception interrupt, then provides an RTAS event log via an
226rtas-check-exception call issued by the guest in response. This framework is
227documented by PAPR+ v2.7, and is already in use by QEMU for generating powerdown
228requests via EPOW events.
229
230For DR, this framework has been extended to include hotplug events, which were
231previously unneeded due to direct manipulation of DR-related guest userspace
232tools by host-level management such as an HMC. This level of management is not
233applicable to PowerKVM, hence the reason for extending the notification
234framework to support hotplug events.
235
236Note that these events are not yet formally part of the PAPR+ specification,
237but support for this format has already been implemented in DR-related
238guest tools such as powerpc-utils/librtas, as well as kernel patches that have
239been submitted to handle in-kernel processing of memory/cpu-related hotplug
240events[1], and is planned for formal inclusion in the PAPR+ specification. The
241hotplug-specific payload is implemented in QEMU as follows (with all values
242encoded in big-endian format):
243
244struct rtas_event_log_v6_hp {
245#define SECTION_ID_HOTPLUG 0x4850 /* HP */
246 struct section_header {
247 uint16_t section_id; /* set to SECTION_ID_HOTPLUG */
248 uint16_t section_length; /* sizeof(rtas_event_log_v6_hp),
249 * plus the length of the DRC name
250 * if a DRC name identifier is
251 * specified for hotplug_identifier
252 */
253 uint8_t section_version; /* version 1 */
254 uint8_t section_subtype; /* unused */
255 uint16_t creator_component_id; /* unused */
256 } hdr;
257#define RTAS_LOG_V6_HP_TYPE_CPU 1
258#define RTAS_LOG_V6_HP_TYPE_MEMORY 2
259#define RTAS_LOG_V6_HP_TYPE_SLOT 3
260#define RTAS_LOG_V6_HP_TYPE_PHB 4
261#define RTAS_LOG_V6_HP_TYPE_PCI 5
262 uint8_t hotplug_type; /* type of resource/device */
263#define RTAS_LOG_V6_HP_ACTION_ADD 1
264#define RTAS_LOG_V6_HP_ACTION_REMOVE 2
265 uint8_t hotplug_action; /* action (add/remove) */
266#define RTAS_LOG_V6_HP_ID_DRC_NAME 1
267#define RTAS_LOG_V6_HP_ID_DRC_INDEX 2
268#define RTAS_LOG_V6_HP_ID_DRC_COUNT 3
269 uint8_t hotplug_identifier; /* type of the resource identifier,
270 * which serves as the discriminator
271 * for the 'drc' union field below
272 */
273 uint8_t reserved;
274 union {
275 uint32_t index; /* DRC index of resource to take action
276 * on
277 */
278 uint32_t count; /* number of DR resources to take
279 * action on (guest chooses which)
280 */
281 char name[1]; /* string representing the name of the
282 * DRC to take action on
283 */
284 } drc;
285} QEMU_PACKED;
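
To make the packed layout concrete, the sketch below serializes a hotplug section that identifies its target by DRC index or by count. It assumes the packed structure is 16 bytes (8-byte header, four single-byte fields, 4-byte union); the name-identifier variant is left out because its exact length accounting is not spelled out here. pack_hp_section is a hypothetical helper, not a QEMU function. Note that the hotplug_type values differ from the type codes used in DRC indexes (memory is 2 here, 8 there).

```python
import struct

SECTION_ID_HOTPLUG = 0x4850  # "HP"
HP_TYPE_CPU, HP_TYPE_MEMORY, HP_TYPE_SLOT, HP_TYPE_PHB, HP_TYPE_PCI = 1, 2, 3, 4, 5
HP_ACTION_ADD, HP_ACTION_REMOVE = 1, 2
HP_ID_DRC_NAME, HP_ID_DRC_INDEX, HP_ID_DRC_COUNT = 1, 2, 3

# section_id, section_length, section_version, section_subtype,
# creator_component_id, hotplug_type, hotplug_action, hotplug_identifier,
# reserved, drc.index / drc.count -- big-endian, no padding.
_HP_SECTION = struct.Struct(">HHBBHBBBBI")


def pack_hp_section(hp_type, action, id_type, drc_value):
    """Serialize a hotplug section identified by DRC index or count."""
    if id_type not in (HP_ID_DRC_INDEX, HP_ID_DRC_COUNT):
        raise NotImplementedError("only index/count identifiers are sketched")
    return _HP_SECTION.pack(SECTION_ID_HOTPLUG, _HP_SECTION.size,
                            1,  # section_version
                            0,  # section_subtype (unused)
                            0,  # creator_component_id (unused)
                            hp_type, action, id_type,
                            0,  # reserved
                            drc_value)
```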
286
287== ibm,lrdr-capacity ==
288
289ibm,lrdr-capacity is a property in the /rtas device tree node that identifies
290the dynamic reconfiguration capabilities of the guest. It is a triple
291consisting of <phys>, <size> and <maxcpus>.
292
293 <phys>, encoded in BE format represents the maximum address in bytes and
294 hence the maximum memory that can be allocated to the guest.
295
296 <size>, encoded in BE format represents the size increments in which
297 memory can be hot-plugged to the guest.
298
299 <maxcpus>, a BE-encoded integer, represents the maximum number of
300 processors that the guest can have.
301
302pseries guests use this property to note the maximum allowed CPUs for the
303guest.
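
A decoding sketch for the triple follows. The field widths are an assumption, since the text above does not state them: <phys> and <size> are taken to be 64-bit values (two 32-bit cells each) and <maxcpus> a single 32-bit cell, for 20 bytes in total. parse_lrdr_capacity is a hypothetical helper name.

```python
import struct

# Assumed layout: 64-bit <phys>, 64-bit <size>, 32-bit <maxcpus>, big-endian.
_LRDR = struct.Struct(">QQI")


def parse_lrdr_capacity(blob):
    """Decode ibm,lrdr-capacity into a dict (widths are an assumption)."""
    if len(blob) != _LRDR.size:
        raise ValueError("expected %d bytes, got %d" % (_LRDR.size, len(blob)))
    phys, size, maxcpus = _LRDR.unpack(blob)
    return {"phys": phys, "size": size, "maxcpus": maxcpus}
```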
304
305== ibm,dynamic-reconfiguration-memory ==
306
307ibm,dynamic-reconfiguration-memory is a device tree node that represents
308dynamically reconfigurable logical memory blocks (LMB). This node
309is generated only when the guest advertises the support for it via
310ibm,client-architecture-support call. Memory that is not dynamically
311reconfigurable is represented by /memory nodes. The properties of this
312node that are of interest to the sPAPR memory hotplug implementation
313in QEMU are described here.
314
315ibm,lmb-size
316
317This 64bit integer defines the size of each dynamically reconfigurable LMB.
318
319ibm,associativity-lookup-arrays
320
321This property defines a lookup array in which the NUMA associativity
322information for each LMB can be found. It is a property-encoded array
323that begins with an integer M, the number of associativity lists, followed
324by an integer N, the number of entries per associativity list, and then
325M associativity lists, each N integers long.
326
327This property provides the same information as given by the ibm,associativity
328property in a /memory node. Each assigned LMB has an index value between
3290 and M-1 which is used as an index into this table to select which
330associativity list to use for the LMB. This index value for each LMB
331is defined in ibm,dynamic-memory property.
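
The M-by-N layout can be decoded as sketched below, assuming every integer is a 32-bit big-endian cell (the usual Open Firmware cell size; the text above does not state the width). parse_assoc_lookup_arrays is a hypothetical helper name.

```python
import struct


def parse_assoc_lookup_arrays(blob):
    """Return M associativity lists, each a list of N integers."""
    m, n = struct.unpack_from(">II", blob, 0)
    cells = struct.unpack_from(">%dI" % (m * n), blob, 8)
    return [list(cells[i * n:(i + 1) * n]) for i in range(m)]
```

An LMB whose associativity list index is k would then use element k of the returned list.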
332
333ibm,dynamic-memory
334
335This property describes the dynamically reconfigurable memory. It is a
336property encoded array that has an integer N, the number of LMBs followed
337by N LMB list entries.
338
339Each LMB list entry consists of the following elements:
340
341- Logical address of the start of the LMB encoded as a 64bit integer. This
342 corresponds to reg property in /memory node.
343- DRC index of the LMB that corresponds to ibm,my-drc-index property
344 in a /memory node.
345- Four bytes reserved for expansion.
346- Associativity list index for the LMB that is used as an index into
347 ibm,associativity-lookup-arrays property described earlier. This
348 is used to retrieve the right associativity list to be used for this
349 LMB.
350- A 32bit flags word. The bit at bit position 0x00000008 defines whether
351 the LMB is assigned to the partition as of boot time.
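
The LMB list entry layout above can be decoded as in the following sketch. It assumes the leading count, the DRC index and the associativity list index are all 32-bit big-endian integers, giving 24 bytes per entry; parse_dynamic_memory and LMB_FLAG_ASSIGNED are hypothetical names.

```python
import struct

# address (64-bit), DRC index, reserved, associativity index, flags
_LMB_ENTRY = struct.Struct(">QIIII")
LMB_FLAG_ASSIGNED = 0x00000008


def parse_dynamic_memory(blob):
    """Decode ibm,dynamic-memory into a list of per-LMB dicts."""
    (count,) = struct.unpack_from(">I", blob, 0)
    lmbs = []
    for i in range(count):
        addr, drc_index, _reserved, assoc_index, flags = \
            _LMB_ENTRY.unpack_from(blob, 4 + i * _LMB_ENTRY.size)
        lmbs.append({
            "addr": addr,
            "drc_index": drc_index,
            "assoc_index": assoc_index,
            "flags": flags,
            "assigned": bool(flags & LMB_FLAG_ASSIGNED),
        })
    return lmbs
```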
352
353[1] http://thread.gmane.org/gmane.linux.ports.ppc.embedded/75350/focus=106867