Commit | Line | Data |
---|---|---|
11eec063 MR |
1 | = sPAPR Dynamic Reconfiguration = |
2 | ||
3 | sPAPR/"pseries" guests make use of a facility called dynamic-reconfiguration | |
4 | to handle hotplugging of dynamic "physical" resources like PCI cards, or | |
5 | "logical"/paravirtual resources like memory, CPUs, and "physical" | |
6 | host-bridges, which are generally managed by the host/hypervisor and provided | |
7 | to guests as virtualized resources. The specifics of dynamic-reconfiguration | |
8 | are documented extensively in PAPR+ v2.7, Section 13.1. This document | |
9 | provides a summary of that information as it applies to the implementation | |
10 | within QEMU. | |
11 | ||
12 | == Dynamic-reconfiguration Connectors == | |
13 | ||
14 | To manage hotplug/unplug of these resources, a firmware abstraction known as | |
15 | a Dynamic Resource Connector (DRC) is used to assign a particular dynamic | |
16 | resource to the guest, and provide an interface for the guest to manage | |
17 | configuration/removal of the resource associated with it. | |
18 | ||
19 | == Device-tree description of DRCs == | |
20 | ||
21 | A set of 4 Open Firmware device tree array properties are used to describe | |
22 | the name/index/power-domain/type of each DRC allocated to a guest at | |
23 | boot-time. There may be multiple sets of these arrays, rooted at different | |
24 | paths in the device tree depending on the type of resource the DRCs manage. | |
25 | ||
26 | In some cases, the DRCs themselves may be provided by a dynamic resource, | |
27 | such as the DRCs managing PCI slots on a hotplugged PHB. In this case the | |
28 | arrays would be fetched as part of the device tree retrieval interfaces | |
29 | for hotplugged resources described under "Guest->Host interface". | |
30 | ||
31 | The array properties are described below. Each entry/element in an array | |
32 | describes the DRC identified by the element in the corresponding position | |
33 | of ibm,drc-indexes: | |
34 | ||
35 | ibm,drc-names: | |
36 | first 4-bytes: BE-encoded integer denoting the number of entries | |
37 | each entry: a NULL-terminated <name> string encoded as a byte array | |
38 | ||
39 | <name> values for logical/virtual resources are defined in PAPR+ v2.7, | |
40 | Section 13.5.2.4, and basically consist of the type of the resource | |
41 | followed by a space and a numerical value that's unique across resources | |
42 | of that type. | |
43 | ||
44 | <name> values for "physical" resources such as PCI or VIO devices are | |
45 | defined as being "location codes", which are the "location labels" of | |
46 | each encapsulating device, starting from the chassis down to the | |
47 | individual slot for the device, concatenated by a hyphen. This provides | |
48 | a mapping of resources to a physical location in a chassis for debugging | |
49 | purposes. For QEMU, this mapping is less important, so we assign a | |
50 | location code that conforms to naming specifications, but is simply a | |
51 | location label for the slot by itself to simplify the implementation. | |
52 | The naming convention for location labels is documented in detail in | |
53 | PAPR+ v2.7, Section 12.3.1.5, and in our case amounts to using "C<n>" | |
54 | for PCI/VIO device slots, where <n> is unique across all PCI/VIO | |
55 | device slots. | |
56 | ||
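The ibm,drc-names layout described above (a 4-byte BE count followed by NULL-terminated strings) can be walked with a few lines of C. This is only a sketch: be32_read() and drc_names_get() are hypothetical helpers introduced for illustration and are not QEMU functions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit big-endian integer from a byte buffer. */
static uint32_t be32_read(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: return the n-th name (0-based) stored in an
 * ibm,drc-names property value, or NULL if n is out of range or the
 * buffer is truncated.
 */
static const char *drc_names_get(const uint8_t *prop, size_t len, uint32_t n)
{
    size_t off = 4;             /* names start right after the count */
    uint32_t count, i;

    if (len < 4) {
        return NULL;
    }
    count = be32_read(prop);
    if (n >= count) {
        return NULL;
    }
    for (i = 0; ; i++) {
        const uint8_t *end;

        if (off >= len) {
            return NULL;
        }
        end = memchr(prop + off, '\0', len - off);
        if (!end) {
            return NULL;        /* missing NULL terminator */
        }
        if (i == n) {
            return (const char *)(prop + off);
        }
        off = (size_t)(end - prop) + 1;
    }
}
```

Because the names are variable-length, the n-th name can only be found by scanning past the preceding ones. The position n is the same position used in ibm,drc-indexes for that DRC.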
57 | ibm,drc-indexes: | |
58 | first 4-bytes: BE-encoded integer denoting the number of entries | |
59 | each 4-byte entry: BE-encoded <index> integer that is unique across all DRCs | |
60 | in the machine | |
61 | ||
62 | <index> is arbitrary, but in the case of QEMU we try to maintain the | |
63 | convention used to assign them to pSeries guests on pHyp: | |
64 | ||
65 | bit[31:28]: integer encoding of <type>, where <type> is: | |
66 | 1 for CPU resource | |
67 | 2 for PHB resource | |
68 | 3 for VIO resource | |
69 | 4 for PCI resource | |
70 | 8 for Memory resource | |
71 | bit[27:0]: integer encoding of <id>, where <id> is unique across | |
72 | all resources of specified type | |
73 | ||
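As a minimal sketch of the index convention above, the <type> and <id> fields can be packed and unpacked as follows. The enum and function names are illustrative assumptions, not identifiers taken from QEMU.

```c
#include <stdint.h>

/* <type> values from the table above (names are illustrative). */
enum drc_type_id {
    DRC_TYPE_ID_CPU    = 1,
    DRC_TYPE_ID_PHB    = 2,
    DRC_TYPE_ID_VIO    = 3,
    DRC_TYPE_ID_PCI    = 4,
    DRC_TYPE_ID_MEMORY = 8,
};

/* Pack <type> into bit[31:28] and <id> into bit[27:0]. */
static uint32_t drc_index_make(uint32_t type, uint32_t id)
{
    return ((type & 0xfu) << 28) | (id & 0x0fffffffu);
}

/* Extract <type> from a DRC index. */
static uint32_t drc_index_type(uint32_t index)
{
    return index >> 28;
}

/* Extract <id> from a DRC index. */
static uint32_t drc_index_id(uint32_t index)
{
    return index & 0x0fffffffu;
}
```

For example, the memory resource with <id> 5 gets index 0x80000005, and index 0x40000010 decodes to a PCI resource with <id> 0x10.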
74 | ibm,drc-power-domains: | |
75 | first 4-bytes: BE-encoded integer denoting the number of entries | |
76 | each 4-byte entry: 32-bit, BE-encoded <index> integer that specifies the | |
77 | power domain the resource will be assigned to. In the case of QEMU | |
78 | we associate all resources with a "live insertion" domain, where the | |
79 | power is assumed to be managed automatically. The integer value for | |
80 | this domain is a special value of -1. | |
81 | ||
82 | ||
83 | ibm,drc-types: | |
84 | first 4-bytes: BE-encoded integer denoting the number of entries | |
85 | each entry: a NULL-terminated <type> string encoded as a byte array | |
86 | ||
87 | <type> is assigned as follows: | |
88 | "CPU" for a CPU | |
89 | "PHB" for a physical host-bridge | |
90 | "SLOT" for a VIO slot | |
91 | "28" for a PCI slot | |
92 | "MEM" for memory resource | |
93 | ||
94 | == Guest->Host interface to manage dynamic resources == | |
95 | ||
96 | Each DRC is given a globally unique DRC Index, and resources associated with | |
97 | a particular DRC are configured/managed by the guest via a number of RTAS | |
98 | calls which reference individual DRCs based on the DRC index. This can be | |
99 | considered the guest->host interface. | |
100 | ||
101 | rtas-set-power-level: | |
102 | arg[0]: integer identifying power domain | |
103 | arg[1]: new power level for the domain, 0-100 | |
104 | output[0]: status, 0 on success | |
105 | output[1]: power level after command | |
106 | ||
107 | Set the power level for a specified power domain | |
108 | ||
109 | rtas-get-power-level: | |
110 | arg[0]: integer identifying power domain | |
111 | output[0]: status, 0 on success | |
112 | output[1]: current power level | |
113 | ||
114 | Get the power level for a specified power domain | |
115 | ||
116 | rtas-set-indicator: | |
117 | arg[0]: integer identifying sensor/indicator type | |
118 | arg[1]: index of sensor, for DR-related sensors this is generally the | |
119 | DRC index | |
120 | arg[2]: desired sensor value | |
121 | output[0]: status, 0 on success | |
122 | ||
123 | Set the state of an indicator or sensor. For the purpose of this document we | |
124 | focus on the indicator/sensor types associated with a DRC. The types are: | |
125 | ||
126 | 9001: isolation-state, controls/indicates whether a device has been made | |
127 | accessible to a guest | |
128 | ||
129 | supported sensor values: | |
130 | 0: isolate, device is made inaccessible to guest OS | |
131 | 1: unisolate, device is made available to guest OS | |
132 | ||
133 | 9002: dr-indicator, controls "visual" indicator associated with device | |
134 | ||
135 | supported sensor values: | |
136 | 0: inactive, resource may be safely removed | |
137 | 1: active, resource is in use and cannot be safely removed | |
138 | 2: identify, used to visually identify slot for interactive hotplug | |
139 | 3: action, in most cases, used in the same manner as identify | |
140 | ||
141 | 9003: allocation-state, generally only used for "logical" DR resources to | |
142 | request the allocation/deallocation of a resource prior to acquiring | |
143 | it via isolation-state->unisolate, or after releasing it via | |
144 | isolation-state->isolate, respectively. For "physical" DR (like PCI | |
145 | hotplug/unplug) the pre-allocation of the resource is implied and | |
146 | this sensor is unused. | |
147 | ||
148 | supported sensor values: | |
149 | 0: unusable, tell firmware/system the resource can be | |
150 | unallocated/reclaimed and added back to the system resource pool | |
151 | 1: usable, request the resource be allocated/reserved for use by | |
152 | guest OS | |
153 | 2: exchange, used to allocate a spare resource to use for fail-over | |
154 | in certain situations. unused in QEMU | |
155 | 3: recover, used to reclaim a previously allocated resource that's | |
156 | not currently allocated to the guest OS. unused in QEMU | |
157 | ||
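The ordering implied by the allocation-state description for "logical" resources (allocate, then unisolate; isolate, then deallocate) can be sketched as below. The set_indicator_fn callback is a hypothetical stand-in for however the guest issues rtas-set-indicator, and trace_set_indicator() exists only to record the call order for demonstration. The macro names are illustrative.

```c
#include <stdint.h>

#define SENSOR_TYPE_ISOLATION_STATE  9001
#define SENSOR_TYPE_ALLOCATION_STATE 9003

/* Hypothetical stand-in for rtas-set-indicator; returns status, 0 on success. */
typedef int (*set_indicator_fn)(uint32_t type, uint32_t drc_index,
                                uint32_t value, void *opaque);

/* Acquire a logical resource: allocation-state=usable, then unisolate. */
static int logical_drc_acquire(set_indicator_fn set, void *opaque,
                               uint32_t drc_index)
{
    int status = set(SENSOR_TYPE_ALLOCATION_STATE, drc_index, 1, opaque);

    if (status != 0) {
        return status;
    }
    return set(SENSOR_TYPE_ISOLATION_STATE, drc_index, 1, opaque);
}

/* Release a logical resource: isolate, then allocation-state=unusable. */
static int logical_drc_release(set_indicator_fn set, void *opaque,
                               uint32_t drc_index)
{
    int status = set(SENSOR_TYPE_ISOLATION_STATE, drc_index, 0, opaque);

    if (status != 0) {
        return status;
    }
    return set(SENSOR_TYPE_ALLOCATION_STATE, drc_index, 0, opaque);
}

/* Demonstration-only recorder of the indicator calls that were made. */
struct indicator_trace {
    uint32_t type[4];
    uint32_t value[4];
    int n;
};

static int trace_set_indicator(uint32_t type, uint32_t drc_index,
                               uint32_t value, void *opaque)
{
    struct indicator_trace *t = opaque;

    (void)drc_index;
    if (t->n >= 4) {
        return -1;
    }
    t->type[t->n] = type;
    t->value[t->n] = value;
    t->n++;
    return 0;
}
```

For "physical" DR the allocation-state steps would simply be skipped, since pre-allocation is implied.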
158 | rtas-get-sensor-state: | |
159 | arg[0]: integer identifying sensor/indicator type | |
160 | arg[1]: index of sensor, for DR-related sensors this is generally the | |
161 | DRC index | |
162 | output[0]: status, 0 on success | |
163 | ||
164 | Used to read an indicator or sensor value. | |
165 | ||
166 | For DR-related operations, the only noteworthy sensor is dr-entity-sense, | |
167 | which has a type value of 9003, as allocation-state does in the case of | |
168 | rtas-set-indicator. The semantics/encodings of the sensor values are distinct | |
169 | however: | |
170 | ||
171 | supported sensor values for dr-entity-sense (9003) sensor: | |
172 | 0: empty, | |
173 | for physical resources: DRC/slot is empty | |
174 | for logical resources: unused | |
175 | 1: present, | |
176 | for physical resources: DRC/slot is populated with a device/resource | |
177 | for logical resources: resource has been allocated to the DRC | |
178 | 2: unusable, | |
179 | for physical resources: unused | |
180 | for logical resources: DRC has no resource allocated to it | |
181 | 3: exchange, | |
182 | for physical resources: unused | |
183 | for logical resources: resource available for exchange (see | |
184 | allocation-state sensor semantics above) | |
185 | 4: recovery, | |
186 | for physical resources: unused | |
187 | for logical resources: resource available for recovery (see | |
188 | allocation-state sensor semantics above) | |
189 | ||
190 | rtas-ibm-configure-connector: | |
191 | arg[0]: guest physical address of 4096-byte work area buffer | |
192 | arg[1]: 0, or address of additional 4096-byte work area buffer. only non-zero | |
193 | if a prior RTAS response indicated a need for additional memory | |
194 | output[0]: status: | |
195 | 0: completed transmittal of device-tree node | |
196 | 1: instruct guest to prepare for next DT sibling node | |
197 | 2: instruct guest to prepare for next DT child node | |
198 | 3: instruct guest to prepare for next DT property | |
199 | 4: instruct guest to ascend to parent DT node | |
200 | 5: instruct guest to provide additional work-area buffer | |
201 | via arg[1] | |
202 | 990x: instruct guest that operation took too long and to try | |
203 | again later | |
204 | ||
205 | Used to fetch an OF device-tree description of the resource associated with | |
206 | a particular DRC. The DRC index is encoded in the first 4-bytes of the first | |
207 | work area buffer. | |
208 | ||
209 | Work area layout, using 4-byte offsets: | |
210 | wa[0]: DRC index of the DRC to fetch device-tree nodes from | |
211 | wa[1]: 0 (hard-coded) | |
212 | wa[2]: for next-sibling/next-child response: | |
213 | wa offset of null-terminated string denoting the new node's name | |
214 | for next-property response: | |
215 | wa offset of null-terminated string denoting new property's name | |
216 | wa[3]: for next-property response (unused otherwise): | |
217 | byte-length of new property's value | |
218 | wa[4]: for next-property response (unused otherwise): | |
219 | new property's value, encoded as an OFDT-compatible byte array | |
220 | ||
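A rough sketch of how a guest might prepare the work area for the first call and read back the name offset from wa[2] is shown below. It assumes the 4-byte work area fields are big-endian, like the other values in this document. The helper names and the CC_WA_LEN macro are introduced here for illustration only.

```c
#include <stdint.h>
#include <string.h>

#define CC_WA_LEN 4096  /* work area size given for arg[0] above */

static uint32_t be32_read(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static void be32_write(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Prepare the work area for the first call: wa[0]=DRC index, wa[1]=0. */
static void cc_work_area_init(uint8_t *wa, uint32_t drc_index)
{
    memset(wa, 0, CC_WA_LEN);
    be32_write(wa + 0 * 4, drc_index);
    be32_write(wa + 1 * 4, 0);
}

/*
 * After a next-sibling/next-child/next-property response, wa[2] holds the
 * work area offset of a null-terminated name. Returns NULL if the offset
 * is out of bounds or the string is not terminated inside the work area.
 */
static const char *cc_work_area_name(const uint8_t *wa)
{
    uint32_t off = be32_read(wa + 2 * 4);

    if (off >= CC_WA_LEN || !memchr(wa + off, '\0', CC_WA_LEN - off)) {
        return NULL;
    }
    return (const char *)(wa + off);
}
```

The guest would call rtas-ibm-configure-connector repeatedly with the same work area, inspecting output[0] after each call to decide how to interpret wa[2] through wa[4].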
221 | == hotplug/unplug events == | |
222 | ||
223 | For most DR operations, the hypervisor will issue host->guest add/remove events | |
224 | using the EPOW/check-exception notification framework, where the host issues a | |
225 | check-exception interrupt, then provides an RTAS event log via an | |
226 | rtas-check-exception call issued by the guest in response. This framework is | |
227 | documented by PAPR+ v2.7, and is already in use by QEMU for generating powerdown | |
228 | requests via EPOW events. | |
229 | ||
230 | For DR, this framework has been extended to include hotplug events, which were | |
231 | previously unneeded due to direct manipulation of DR-related guest userspace | |
232 | tools by host-level management such as an HMC. This level of management is not | |
233 | applicable to PowerKVM, hence the reason for extending the notification | |
234 | framework to support hotplug events. | |
235 | ||
9f992cca MR |
236 | The format for these EPOW-signalled events is described below under |
237 | "hotplug/unplug event structure". Note that these events are not | |
238 | formally part of the PAPR+ specification, and have been superseded by a | |
239 | newer format, also described below under "hotplug/unplug event structure", | |
240 | and so are now deemed a "legacy" format. The formats are similar, but the | |
241 | "modern" format contains additional fields/flags, which are denoted for the | |
242 | purposes of this documentation with "#ifdef GUEST_SUPPORTS_MODERN" guards. | |
243 | ||
244 | QEMU should assume support only for "legacy" fields/flags unless the guest | |
245 | advertises support for the "modern" format via ibm,client-architecture-support | |
246 | hcall by setting byte 5, bit 6 of its ibm,architecture-vec-5 option vector | |
247 | structure (as described by LoPAPR v11, B.6.2.3). As with "legacy" format events, | |
248 | "modern" format events are surfaced to the guest via check-exception RTAS calls, | |
249 | but use a dedicated event source to signal the guest. This event source is | |
250 | advertised to the guest by the addition of a "hot-plug-events" node under the | |
251 | "/event-sources" node of the guest's device tree, using the standard format | |
252 | described in LoPAPR v11, B.6.12.1. | |
253 | ||
254 | == hotplug/unplug event structure == | |
255 | ||
256 | The hotplug-specific payload in QEMU is implemented as follows (with all values | |
11eec063 MR |
257 | encoded in big-endian format): |
258 | ||
259 | struct rtas_event_log_v6_hp { | |
260 | #define SECTION_ID_HOTPLUG 0x4850 /* HP */ | |
261 | struct section_header { | |
262 | uint16_t section_id; /* set to SECTION_ID_HOTPLUG */ | |
263 | uint16_t section_length; /* sizeof(rtas_event_log_v6_hp), | |
264 | * plus the length of the DRC name | |
265 | * if a DRC name identifier is | |
266 | * specified for hotplug_identifier | |
267 | */ | |
268 | uint8_t section_version; /* version 1 */ | |
269 | uint8_t section_subtype; /* unused */ | |
270 | uint16_t creator_component_id; /* unused */ | |
271 | } hdr; | |
272 | #define RTAS_LOG_V6_HP_TYPE_CPU 1 | |
273 | #define RTAS_LOG_V6_HP_TYPE_MEMORY 2 | |
274 | #define RTAS_LOG_V6_HP_TYPE_SLOT 3 | |
275 | #define RTAS_LOG_V6_HP_TYPE_PHB 4 | |
276 | #define RTAS_LOG_V6_HP_TYPE_PCI 5 | |
277 | uint8_t hotplug_type; /* type of resource/device */ | |
278 | #define RTAS_LOG_V6_HP_ACTION_ADD 1 | |
279 | #define RTAS_LOG_V6_HP_ACTION_REMOVE 2 | |
280 | uint8_t hotplug_action; /* action (add/remove) */ | |
9f992cca MR |
281 | #define RTAS_LOG_V6_HP_ID_DRC_NAME 1 |
282 | #define RTAS_LOG_V6_HP_ID_DRC_INDEX 2 | |
283 | #define RTAS_LOG_V6_HP_ID_DRC_COUNT 3 | |
284 | #ifdef GUEST_SUPPORTS_MODERN | |
285 | #define RTAS_LOG_V6_HP_ID_DRC_COUNT_INDEXED 4 | |
286 | #endif | |
11eec063 MR |
287 | uint8_t hotplug_identifier; /* type of the resource identifier, |
288 | * which serves as the discriminator | |
289 | * for the 'drc' union field below | |
290 | */ | |
9f992cca MR |
291 | #ifdef GUEST_SUPPORTS_MODERN |
292 | uint8_t capabilities; /* capability flags, currently unused | |
293 | * by QEMU | |
294 | */ | |
295 | #else | |
11eec063 | 296 | uint8_t reserved; |
9f992cca | 297 | #endif |
11eec063 MR |
298 | union { |
299 | uint32_t index; /* DRC index of resource to take action | |
300 | * on | |
301 | */ | |
302 | uint32_t count; /* number of DR resources to take | |
303 | * action on (guest chooses which) | |
304 | */ | |
9f992cca MR |
305 | #ifdef GUEST_SUPPORTS_MODERN |
306 | struct { | |
307 | uint32_t count; /* number of DR resources to take | |
308 | * action on | |
309 | */ | |
310 | uint32_t index; /* DRC index of first resource to take | |
311 | * action on. guest will take action | |
312 | * on DRC index <index> through | |
313 | * DRC index <index + count - 1> in | |
314 | * sequential order | |
315 | */ | |
316 | } count_indexed; | |
317 | #endif | |
11eec063 MR |
318 | char name[1]; /* string representing the name of the |
319 | * DRC to take action on | |
320 | */ | |
321 | } drc; | |
322 | } QEMU_PACKED; | |
323 | ||
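To make the byte layout concrete, the following sketch serializes a "modern" count_indexed event into a flat big-endian buffer. It assumes the packed layout shown above with GUEST_SUPPORTS_MODERN defined, which gives 8 header bytes, 4 single-byte fields and an 8-byte union, or 20 bytes in total. The helper names and HP_COUNT_INDEXED_LEN are hypothetical; the RTAS_LOG_V6_HP_* values are copied from the structure definition.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTION_ID_HOTPLUG                  0x4850 /* HP */
#define RTAS_LOG_V6_HP_TYPE_MEMORY          2
#define RTAS_LOG_V6_HP_ACTION_ADD           1
#define RTAS_LOG_V6_HP_ID_DRC_COUNT_INDEXED 4
#define HP_COUNT_INDEXED_LEN                20     /* assumed packed size */

static void be16_write(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)v;
}

static void be32_write(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Encode a count_indexed hotplug section; out must hold 20 bytes. */
static size_t hp_encode_count_indexed(uint8_t *out, uint8_t type,
                                      uint8_t action, uint32_t count,
                                      uint32_t first_index)
{
    be16_write(out + 0, SECTION_ID_HOTPLUG);    /* section_id */
    be16_write(out + 2, HP_COUNT_INDEXED_LEN);  /* section_length */
    out[4] = 1;                                 /* section_version */
    out[5] = 0;                                 /* section_subtype, unused */
    be16_write(out + 6, 0);                     /* creator_component_id */
    out[8] = type;                              /* hotplug_type */
    out[9] = action;                            /* hotplug_action */
    out[10] = RTAS_LOG_V6_HP_ID_DRC_COUNT_INDEXED;
    out[11] = 0;                                /* capabilities, unused */
    be32_write(out + 12, count);                /* drc.count_indexed.count */
    be32_write(out + 16, first_index);          /* drc.count_indexed.index */
    return HP_COUNT_INDEXED_LEN;
}
```

An event built this way for 4 LMBs starting at DRC index 0x80000010 asks the guest to act on indexes 0x80000010 through 0x80000013 in order.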
db4ef288 BR |
324 | == ibm,lrdr-capacity == |
325 | ||
326 | ibm,lrdr-capacity is a property in the /rtas device tree node that identifies | |
327 | the dynamic reconfiguration capabilities of the guest. It is a triple | |
328 | consisting of <phys>, <size> and <maxcpus>. | |
329 | ||
330 | <phys>, encoded in BE format, represents the maximum address in bytes and | |
331 | hence the maximum memory that can be allocated to the guest. | |
332 | ||
333 | <size>, encoded in BE format, represents the size increments in which | |
334 | memory can be hot-plugged to the guest. | |
335 | ||
336 | <maxcpus>, a BE-encoded integer, represents the maximum number of | |
337 | processors that the guest can have. | |
338 | ||
339 | pseries guests use this property to note the maximum allowed CPUs for the | |
340 | guest. | |
341 | ||
03d196b7 BR |
342 | == ibm,dynamic-reconfiguration-memory == |
343 | ||
344 | ibm,dynamic-reconfiguration-memory is a device tree node that represents | |
345 | dynamically reconfigurable logical memory blocks (LMB). This node | |
346 | is generated only when the guest advertises support for it via the | |
347 | ibm,client-architecture-support call. Memory that is not dynamically | |
348 | reconfigurable is represented by /memory nodes. The properties of this | |
349 | node that are of interest to the sPAPR memory hotplug implementation | |
350 | in QEMU are described here. | |
351 | ||
352 | ibm,lmb-size | |
353 | ||
354 | This 64bit integer defines the size of each dynamically reconfigurable LMB. | |
355 | ||
356 | ibm,associativity-lookup-arrays | |
357 | ||
358 | This property defines a lookup array in which the NUMA associativity | |
359 | information for each LMB can be found. It is a property encoded array | |
360 | that begins with an integer M, the number of associativity lists followed | |
361 | by an integer N, the number of entries per associativity list and terminated | |
362 | by M associativity lists each of length N integers. | |
363 | ||
364 | This property provides the same information as given by ibm,associativity | |
365 | property in a /memory node. Each assigned LMB has an index value between | |
366 | 0 and M-1 which is used as an index into this table to select which | |
367 | associativity list to use for the LMB. This index value for each LMB | |
368 | is defined in ibm,dynamic-memory property. | |
369 | ||
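The lookup described above might be sketched as follows, assuming each integer in the property (M, N and every list element) is a 32-bit big-endian cell. That width is an assumption of this example, and assoc_lookup() is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t be32_read(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Copy associativity list number idx out of an
 * ibm,associativity-lookup-arrays property value.
 * Returns N (entries per list) on success, or -1 if idx is out of range,
 * out is too small, or the property is truncated.
 */
static int assoc_lookup(const uint8_t *prop, size_t len, uint32_t idx,
                        uint32_t *out, uint32_t out_cap)
{
    uint32_t m, n, i;
    uint64_t cells_needed;
    const uint8_t *list;

    if (len < 8) {
        return -1;
    }
    m = be32_read(prop);        /* number of associativity lists */
    n = be32_read(prop + 4);    /* entries per list */
    if (idx >= m || n > out_cap) {
        return -1;
    }
    /* Cells required to reach the end of list idx. */
    cells_needed = (uint64_t)idx * n + n;
    if ((len - 8) / 4 < cells_needed) {
        return -1;
    }
    list = prop + 8 + (size_t)idx * n * 4;
    for (i = 0; i < n; i++) {
        out[i] = be32_read(list + (size_t)i * 4);
    }
    return (int)n;
}
```

The idx argument is the per-LMB associativity list index taken from the ibm,dynamic-memory property.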
370 | ibm,dynamic-memory | |
371 | ||
372 | This property describes the dynamically reconfigurable memory. It is a | |
373 | property encoded array that has an integer N, the number of LMBs followed | |
374 | by N LMB list entries. | |
375 | ||
376 | Each LMB list entry consists of the following elements: | |
377 | ||
378 | - Logical address of the start of the LMB encoded as a 64bit integer. This | |
379 | corresponds to reg property in /memory node. | |
380 | - DRC index of the LMB that corresponds to ibm,my-drc-index property | |
381 | in a /memory node. | |
382 | - Four bytes reserved for expansion. | |
383 | - Associativity list index for the LMB that is used as an index into | |
384 | ibm,associativity-lookup-arrays property described earlier. This | |
385 | is used to retrieve the right associativity list to be used for this | |
386 | LMB. | |
387 | - A 32bit flags word. The bit at bit position 0x00000008 defines whether | |
388 | the LMB is assigned to the partition as of boot time. | |
389 | ||
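Taken together, the elements listed above make each LMB list entry 24 bytes long (8 + 4 + 4 + 4 + 4). A minimal decoding sketch, assuming the leading count N and the DRC index and associativity index fields are 32-bit big-endian integers, could look like this. The struct, macro and function names are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

#define LMB_ENTRY_LEN     24
#define LMB_FLAG_ASSIGNED 0x00000008u

struct lmb_entry {
    uint64_t addr;          /* logical start address of the LMB */
    uint32_t drc_index;     /* DRC index of the LMB */
    uint32_t assoc_index;   /* index into associativity lookup arrays */
    uint32_t flags;         /* flags word */
};

static uint32_t be32_read(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode LMB list entry i; returns 0 on success, -1 on bad input. */
static int dynamic_memory_get(const uint8_t *prop, size_t len, uint32_t i,
                              struct lmb_entry *e)
{
    const uint8_t *p;

    if (len < 4 || i >= be32_read(prop) || (len - 4) / LMB_ENTRY_LEN <= i) {
        return -1;
    }
    p = prop + 4 + (size_t)i * LMB_ENTRY_LEN;
    e->addr = ((uint64_t)be32_read(p) << 32) | be32_read(p + 4);
    e->drc_index = be32_read(p + 8);
    /* p + 12: four bytes reserved for expansion, skipped */
    e->assoc_index = be32_read(p + 16);
    e->flags = be32_read(p + 20);
    return 0;
}

/* True if the LMB was assigned to the partition at boot time. */
static int lmb_is_assigned(const struct lmb_entry *e)
{
    return (e->flags & LMB_FLAG_ASSIGNED) != 0;
}
```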
a324d6f1 BR |
390 | ibm,dynamic-memory-v2 |
391 | ||
392 | This property describes the dynamically reconfigurable memory. This is | |
393 | an alternate and newer way to describe dynamically reconfigurable memory. | |
394 | It is a property encoded array that has an integer N (the number of | |
395 | LMB set entries) followed by N LMB set entries. There is an LMB set entry | |
396 | for each sequential group of LMBs that share common attributes. | |
397 | ||
398 | Each LMB set entry consists of the following elements: | |
399 | ||
400 | - Number of sequential LMBs in the entry represented by a 32bit integer. | |
401 | - Logical address of the first LMB in the set encoded as a 64bit integer. | |
402 | - DRC index of the first LMB in the set. | |
403 | - Associativity list index that is used as an index into | |
404 | ibm,associativity-lookup-arrays property described earlier. This | |
405 | is used to retrieve the right associativity list to be used for all | |
406 | the LMBs in this set. | |
407 | - A 32bit flags word that applies to all the LMBs in the set. | |
408 | ||