Commit | Line | Data |
---|---|---|
47402400 ZY |
1 | The PCI Express Advanced Error Reporting Driver Guide HOWTO |
2 | T. Long Nguyen <[email protected]> | |
3 | Yanmin Zhang <[email protected]> | |
4 | 07/29/2006 | |
5 | ||
6 | ||
7 | 1. Overview | |
8 | ||
9 | 1.1 About this guide | |
10 | ||
11 | This guide describes the basics of the PCI Express Advanced Error | |
12 | Reporting (AER) driver and provides information on how to use it, as | |
13 | well as how to enable the drivers of endpoint devices to conform with | |
14 | the PCI Express AER driver. | |
15 | ||
be2a608b | 16 | 1.2 Copyright © Intel Corporation 2006. |
47402400 ZY |
17 | |
18 | 1.3 What is the PCI Express AER Driver? | |
19 | ||
20 | PCI Express error signaling can occur on the PCI Express link itself | |
21 | or on behalf of transactions initiated on the link. PCI Express | |
22 | defines two error reporting paradigms: the baseline capability and | |
23 | the Advanced Error Reporting capability. The baseline capability is | |
24 | required of all PCI Express components providing a minimum defined | |
25 | set of error reporting requirements. Advanced Error Reporting | |
26 | capability is implemented with a PCI Express advanced error reporting | |
27 | extended capability structure providing more robust error reporting. | |
28 | ||
29 | The PCI Express AER driver provides the infrastructure to support PCI | |
30 | Express Advanced Error Reporting capability. The PCI Express AER | |
31 | driver provides three basic functions: | |
32 | ||
33 | - Gathers comprehensive error information when errors occur. | |
34 | - Reports errors to the user. | |
35 | - Performs error recovery actions. | |
36 | ||
37 | The AER driver attaches only to root ports that support the PCI Express | |
38 | AER capability. | |
39 | ||
40 | ||
41 | 2. User Guide | |
42 | ||
43 | 2.1 Include the PCI Express AER Root Driver into the Linux Kernel | |
44 | ||
45 | The PCI Express AER Root driver is a Root Port service driver attached | |
46 | to the PCI Express Port Bus driver. To use it, the driver must be | |
47 | compiled into the kernel. Option CONFIG_PCIEAER enables it. It | |
48 | depends on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y and | |
49 | CONFIG_PCIEAER=y. | |
50 | ||
51 | 2.2 Load PCI Express AER Root Driver | |
52 | Some systems have AER support in the BIOS. Enabling the AER Root driver | |
53 | while the BIOS also handles AER may result in unpredictable behavior. | |
54 | To avoid this conflict, a successful load of the AER Root driver | |
55 | requires ACPI _OSC support in the BIOS so that the AER Root driver can | |
56 | request native control of AER. See the PCI FW 3.0 Specification for | |
57 | details regarding _OSC usage. Currently, much firmware does not provide | |
58 | _OSC support even though it uses PCI Express. To support such firmware, | |
59 | forceload, a parameter of type bool, lets AER initialization continue | |
60 | even when the firmware has no _OSC support. To enable this | |
61 | workaround, please add aerdriver.forceload=y to the kernel boot parameter | |
62 | line when booting the kernel. Note that forceload=n by default. | |
63 | ||
64 | 2.3 AER error output | |
65 | When a PCI Express AER error is captured, an error message is printed to | |
66 | the console. A correctable error is printed as a warning. | |
67 | Otherwise, it is printed as an error, so users can choose a | |
68 | log level that filters out correctable error messages. | |
69 | ||
70 | An example is shown below. | |
71 | +------ PCI-Express Device Error -----+ | |
72 | Error Severity : Uncorrected (Fatal) | |
73 | PCIE Bus Error type : Transaction Layer | |
74 | Unsupported Request : First | |
75 | Requester ID : 0500 | |
76 | VendorID=8086h, DeviceID=0329h, Bus=05h, Device=00h, Function=00h | |
77 | TLB Header: | |
78 | 04000001 00200a03 05010000 00050100 | |
79 | ||
80 | In the example, 'Requester ID' means the ID of the device that sent | |
81 | the error message to the root port. Please refer to the PCI Express | |
82 | specifications for the other fields. | |
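
The 'Requester ID' value in the example can be split into the bus, device and function numbers printed on the following line. The sketch below assumes the conventional 16-bit layout (bus in bits 15:8, device in bits 7:3, function in bits 2:0) without ARI. The names struct bdf and decode_requester_id are hypothetical and are not part of the kernel.

```c
#include <stdint.h>

/* Bus/device/function triple packed into a 16-bit requester ID. */
struct bdf {
	uint8_t bus;	/* bits 15:8 */
	uint8_t dev;	/* bits 7:3  */
	uint8_t fn;	/* bits 2:0  */
};

/*
 * Hypothetical helper for illustration only (not a kernel API).
 * Assumes the non-ARI layout described above.
 */
static struct bdf decode_requester_id(uint16_t id)
{
	struct bdf r;

	r.bus = (uint8_t)(id >> 8);
	r.dev = (uint8_t)((id >> 3) & 0x1f);
	r.fn  = (uint8_t)(id & 0x07);
	return r;
}
```

With the value from the example, decode_requester_id(0x0500) yields bus 05h, device 00h, function 00h, which matches the Bus/Device/Function fields in the error output.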
83 | ||
84 | ||
85 | 3. Developer Guide | |
86 | ||
87 | Enabling AER-aware support requires a software driver to configure | |
88 | the AER capability structure within its device and to provide callbacks. | |
89 | ||
90 | To support AER better, developers first need to understand how AER | |
91 | works. | |
92 | ||
93 | PCI Express errors are classified into two types: correctable errors | |
94 | and uncorrectable errors. This classification is based on the impacts | |
95 | of those errors, which may result in degraded performance or function | |
96 | failure. | |
97 | ||
98 | Correctable errors have no impact on the functionality of the | |
99 | interface. The PCI Express protocol can recover without any software | |
100 | intervention or any loss of data. These errors are detected and | |
101 | corrected by hardware. Unlike correctable errors, uncorrectable | |
102 | errors impact functionality of the interface. Uncorrectable errors | |
103 | can cause a particular transaction or a particular PCI Express link | |
104 | to be unreliable. Depending on those error conditions, uncorrectable | |
105 | errors are further classified into non-fatal errors and fatal errors. | |
106 | Non-fatal errors cause the particular transaction to be unreliable, | |
107 | but the PCI Express link itself is fully functional. Fatal errors, on | |
108 | the other hand, cause the link to be unreliable. | |
109 | ||
110 | When AER is enabled, a PCI Express device will automatically send an | |
111 | error message to the PCIE root port above it when the device captures | |
112 | an error. The Root Port, upon receiving an error reporting message, | |
113 | internally processes and logs the error message in its PCI Express | |
114 | capability structure. Error information being logged includes storing | |
115 | the error reporting agent's requester ID into the Error Source | |
116 | Identification Registers and setting the error bits of the Root Error | |
117 | Status Register accordingly. If AER error reporting is enabled in Root | |
118 | Error Command Register, the Root Port generates an interrupt if an | |
119 | error is detected. | |
120 | ||
121 | Note that the errors as described above are related to the PCI Express | |
122 | hierarchy and links. These errors do not include any device specific | |
123 | errors because device specific errors will still get sent directly to | |
124 | the device driver. | |
125 | ||
126 | 3.1 Configure the AER capability structure | |
127 | ||
128 | AER-aware drivers of PCI Express components need to change the device | |
129 | control registers to enable AER. They can also change AER registers, | |
130 | including the mask and severity registers. The helper function | |
131 | pci_enable_pcie_error_reporting can be used to enable AER. See | |
132 | section 3.3. | |
133 | ||
134 | 3.2 Provide callbacks | |
135 | ||
136 | 3.2.1 callback reset_link to reset PCI Express link | |
137 | ||
138 | This callback is used to reset the PCI Express physical link when a | |
139 | fatal error happens. The root port AER service driver provides a | |
140 | default reset_link function, but different upstream ports might | |
141 | have different requirements for resetting the PCI Express link, so all | |
142 | upstream ports should provide their own reset_link functions. | |
143 | ||
144 | In struct pcie_port_service_driver, a new pointer, reset_link, is | |
145 | added. | |
146 | ||
147 | pci_ers_result_t (*reset_link) (struct pci_dev *dev); | |
148 | ||
149 | Section 3.2.2.2 provides more detailed info on when to call | |
150 | reset_link. | |
151 | ||
152 | 3.2.2 PCI error-recovery callbacks | |
153 | ||
154 | The PCI Express AER Root driver uses error callbacks to coordinate | |
155 | with downstream device drivers associated with a hierarchy in question | |
156 | when performing error recovery actions. | |
157 | ||
158 | Data struct pci_driver has a pointer, err_handler, that points to | |
159 | pci_error_handlers, which consists of a set of callback function | |
160 | pointers. The AER driver follows the rules defined in | |
161 | pci-error-recovery.txt except for the PCI Express specific parts (e.g. | |
162 | reset_link). Please refer to pci-error-recovery.txt for detailed | |
163 | definitions of the callbacks. | |
164 | ||
165 | The sections below specify when the error callback functions are called. | |
166 | ||
167 | 3.2.2.1 Correctable errors | |
168 | ||
169 | Correctable errors have no impact on the functionality of | |
170 | the interface. The PCI Express protocol can recover without any | |
171 | software intervention or any loss of data. These errors do not | |
172 | require any recovery actions. The AER driver clears the device's | |
173 | correctable error status register accordingly and logs these errors. | |
174 | ||
175 | 3.2.2.2 Uncorrectable (non-fatal and fatal) errors | |
176 | ||
177 | If an error message indicates a non-fatal error, performing link reset | |
178 | at upstream is not required. The AER driver calls error_detected(dev, | |
179 | pci_channel_io_normal) for all drivers associated with the hierarchy in | |
180 | question. For example, | |
181 | EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort. | |
182 | If Upstream port A captures an AER error, the hierarchy consists of | |
183 | Downstream port B and EndPoint. | |
184 | ||
185 | A driver may return PCI_ERS_RESULT_CAN_RECOVER, | |
186 | PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on | |
187 | whether it can recover. After CAN_RECOVER, the AER driver calls mmio_enabled next. | |
188 | ||
189 | If an error message indicates a fatal error, the kernel broadcasts | |
190 | error_detected(dev, pci_channel_io_frozen) to all drivers within | |
191 | the hierarchy in question. Then, performing link reset at upstream is | |
192 | necessary. As different kinds of devices might use different approaches | |
193 | to reset the link, the AER port service driver is required to provide the | |
194 | function to reset the link. First, the kernel checks whether the upstream | |
195 | component has an AER driver. If it does, the kernel uses the reset_link | |
196 | callback of that driver. If the upstream component has no AER driver | |
197 | and the port is a downstream port, the kernel uses the AER driver of the | |
198 | root port that reports the AER error. As for upstream ports, | |
199 | they should provide their own AER service drivers with a reset_link | |
200 | function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and | |
201 | reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes | |
202 | to mmio_enabled. | |
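
The decisions described in sections 3.2.2.1 and 3.2.2.2 can be summarized as a small user-space model. This is only a sketch of the rules stated in this guide, not the AER driver's code. Every type and function name below (the sketch_ prefix, SEV_*, CH_*, RESET_*) is a local stand-in invented for illustration, and RESET_UNAVAILABLE models the case where an upstream port supplies no reset_link of its own.

```c
#include <stdbool.h>

/* Local stand-ins for illustration only; these are not the kernel's types. */
enum sketch_severity { SEV_CORRECTABLE, SEV_NONFATAL, SEV_FATAL };
enum sketch_channel  { CH_NOT_CALLED, CH_IO_NORMAL, CH_IO_FROZEN };

struct sketch_plan {
	enum sketch_channel error_detected_state; /* state passed to error_detected */
	bool needs_link_reset;                    /* reset_link must run upstream */
};

/* Map an error severity to the recovery steps described in the text. */
static struct sketch_plan sketch_plan_recovery(enum sketch_severity sev)
{
	struct sketch_plan p = { CH_NOT_CALLED, false };

	switch (sev) {
	case SEV_CORRECTABLE:
		/* Only logged and cleared; no driver callback is needed. */
		break;
	case SEV_NONFATAL:
		/* Drivers are notified; the link itself is still usable. */
		p.error_detected_state = CH_IO_NORMAL;
		break;
	case SEV_FATAL:
		/* Drivers are notified, then the upstream link is reset. */
		p.error_detected_state = CH_IO_FROZEN;
		p.needs_link_reset = true;
		break;
	}
	return p;
}

enum sketch_reset_source {
	RESET_FROM_UPSTREAM_DRIVER,  /* upstream component's own AER driver */
	RESET_FROM_ROOT_PORT_DRIVER, /* AER driver of the reporting root port */
	RESET_UNAVAILABLE            /* upstream port without its own reset_link */
};

/* Choose whose reset_link callback would be used for a fatal error. */
static enum sketch_reset_source
sketch_pick_reset_link(bool upstream_has_aer_driver, bool port_is_downstream)
{
	if (upstream_has_aer_driver)
		return RESET_FROM_UPSTREAM_DRIVER;
	if (port_is_downstream)
		return RESET_FROM_ROOT_PORT_DRIVER;
	return RESET_UNAVAILABLE;
}
```

For example, sketch_plan_recovery(SEV_FATAL) reports that error_detected is called with the frozen channel state and that a link reset is needed, while sketch_pick_reset_link(false, true) selects the root port's AER driver.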
203 | ||
204 | 3.3 helper functions | |
205 | ||
206 | 3.3.1 int pci_find_aer_capability(struct pci_dev *dev); | |
207 | pci_find_aer_capability locates the PCI Express AER capability | |
208 | in the device configuration space. If the device doesn't support | |
209 | PCI-Express AER, the function returns 0. | |
210 | ||
211 | 3.3.2 int pci_enable_pcie_error_reporting(struct pci_dev *dev); | |
212 | pci_enable_pcie_error_reporting enables the device to send error | |
213 | messages to the root port when an error is detected. Note that devices | |
214 | don't enable error reporting by default, so device drivers need to | |
215 | call this function to enable it. | |
216 | ||
217 | 3.3.3 int pci_disable_pcie_error_reporting(struct pci_dev *dev); | |
218 | pci_disable_pcie_error_reporting stops the device from sending error | |
219 | messages to the root port when an error is detected. | |
220 | ||
221 | 3.3.4 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev); | |
222 | pci_cleanup_aer_uncorrect_error_status clears the uncorrectable | |
223 | error status register. | |
224 | ||
225 | 3.4 Frequently Asked Questions | |
226 | ||
227 | Q: What happens if a PCI Express device driver does not provide an | |
228 | error recovery handler (pci_driver->err_handler is equal to NULL)? | |
229 | ||
230 | A: The devices bound to the driver won't be recovered. If the | |
231 | error is fatal, the kernel will print warning messages. Please refer | |
232 | to section 3 for more information. | |
233 | ||
234 | Q: What happens if an upstream port service driver does not provide | |
235 | callback reset_link? | |
236 | ||
237 | A: Fatal error recovery will fail if the errors are reported by the | |
238 | upstream ports to which the service driver is attached. | |
239 | ||
240 | Q: How does this infrastructure deal with a driver that is not PCI | |
241 | Express aware? | |
242 | ||
243 | A: This infrastructure calls the error callback functions of the | |
244 | driver when an error happens. But if the driver is not aware of | |
245 | PCI Express, the device might not report its own errors to root | |
246 | port. | |
247 | ||
248 | Q: What modifications will that driver need to make it compatible | |
249 | with the PCI Express AER Root driver? | |
250 | ||
251 | A: It could call the helper functions to enable AER in devices and | |
252 | clean up the uncorrectable status register. Please refer to section 3.3. | |
253 |