1 | Rocker Network Switch Register Programming Guide |
2 | Copyright (c) Scott Feldman <[email protected]> | |
3 | Copyright (c) Neil Horman <[email protected]> | |
4 | Version 0.11, 12/29/2014 | |
5 | ||
6 | LICENSE | |
7 | ======= | |
8 | ||
9 | This program is free software; you can redistribute it and/or modify | |
10 | it under the terms of the GNU General Public License as published by | |
11 | the Free Software Foundation; either version 2 of the License, or | |
12 | (at your option) any later version. | |
13 | ||
14 | This program is distributed in the hope that it will be useful, | |
15 | but WITHOUT ANY WARRANTY; without even the implied warranty of | |
16 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
17 | GNU General Public License for more details. | |
18 | ||
19 | SECTION 1: Introduction | |
20 | ======================= | |
21 | ||
22 | Overview | |
23 | -------- | |
24 | ||
25 | This document describes the hardware/software interface for the Rocker switch | |
26 | device. The intended audience is authors of OS drivers and device emulation | |
27 | software. | |
28 | ||
29 | Notations and Conventions | |
30 | ------------------------- | |
31 | ||
32 | o In register descriptions, [n:m] indicates a range from bit n to bit m, | |
33 | inclusive. | |
34 | o Use of leading 0x indicates a hexadecimal number. | |
35 | o Use of leading 0b indicates a binary number. | |
36 | o The use of RSVD or Reserved indicates that a bit or field is reserved for | |
37 | future use. | |
38 | o Field width is in bytes, unless otherwise noted. | |
39 | o Registers are (R) read-only, (R/W) read/write, (W) write-only, or (COR) clear | |
40 | on read. | |
41 | o TLV values in network-byte-order are designated with (N). | |
42 | ||
43 | ||
44 | SECTION 2: PCI Configuration Registers | |
45 | ====================================== | |
46 | ||
47 | PCI Configuration Space | |
48 | ----------------------- | |
49 | ||
50 | Each switch instance registers as a PCI device with PCI configuration space: | |
51 | ||
52 | offset width description value | |
53 | --------------------------------------------- | |
54 | 0x0 2 Vendor ID 0x1b36 | |
55 | 0x2 2 Device ID 0x0006 | |
56 | 0x4 4 Command/Status | |
57 | 0x8 1 Revision ID 0x01 | |
58 | 0x9 3 Class code 0x2800 | |
59 | 0xC 1 Cache line size | |
60 | 0xD 1 Latency timer | |
61 | 0xE 1 Header type | |
62 | 0xF 1 Built-in self test | |
63 | 0x10 4 Base address low | |
64 | 0x14 4 Base address high | |
65 | 0x18-28 Reserved | |
66 | 0x2C 2 Subsystem vendor ID * | |
67 | 0x2E 2 Subsystem ID * | |
68 | 0x30-38 Reserved | |
69 | 0x3C 1 Interrupt line | |
70 | 0x3D 1 Interrupt pin 0x00 | |
71 | 0x3E 1 Min grant 0x00 | |
72 | 0x3F 1 Max latency 0x00 | |
73 | 0x40 1 TRDY timeout | |
74 | 0x41 1 Retry count | |
75 | 0x42 2 Reserved | |
76 | ||
77 | ||
78 | * Assigned by sub-system implementation | |
79 | ||
80 | SECTION 3: Memory-Mapped Register Space | |
81 | ======================================= | |
82 | ||
83 | There are two memory-mapped BARs. BAR0 maps device register space and is | |
84 | 0x2000 in size. BAR1 maps MSI-X vector and PBA tables and is also 0x2000 in | |
85 | size, allowing for 256 MSI-X vectors. | |
86 | ||
87 | All registers are 4 or 8 bytes long. It is assumed host software will access 4 | |
88 | byte registers with one 4-byte access, and 8 byte registers with either two | |
89 | 4-byte accesses or a single 8-byte access. In the case of two 4-byte accesses, | |
90 | access must be lower and then upper 4-bytes, in that order. | |
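The access ordering for 8-byte registers can be sketched as follows. This is an illustration only: `write32` is a hypothetical callback standing in for whatever 4-byte MMIO write primitive the host provides, and the example records accesses in a list instead of touching hardware.

```python
def write_reg64(write32, offset, value):
    """Write an 8-byte register as two 4-byte accesses, lower half first.

    write32(offset, value) is a hypothetical 4-byte MMIO write primitive.
    """
    write32(offset, value & 0xFFFFFFFF)              # lower 4 bytes first
    write32(offset + 4, (value >> 32) & 0xFFFFFFFF)  # then upper 4 bytes


# Record the accesses instead of touching hardware.
log = []
write_reg64(lambda off, val: log.append((off, val)), 0x0018, 0x1122334455667788)
```

Here `log` ends up holding the write to offset 0x0018 (lower half) followed by the write to offset 0x001c (upper half).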
91 | ||
92 | BAR0 device register space is organized as follows: | |
93 | ||
94 | offset description | |
95 | ------------------------------------------------------ | |
96 | 0x0000-0x000f Bogus registers to catch misbehaving | |
97 | drivers. Writes do nothing. Reads | |
98 | back as 0xDEADBABE. | |
99 | 0x0010-0x00ff Test registers | |
100 | 0x0300-0x03ff General purpose registers | |
101 | 0x1000-0x1fff Descriptor control | |
102 | ||
103 | Holes in register space are reserved. Writes to reserved registers do nothing. | |
104 | Reads to reserved registers read back as 0. | |
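For emulation authors, the BAR0 layout above can be sketched as a small decoder. The function and region names are illustrative, not part of the device interface, and the decoder only classifies whole regions; it does not model individual registers or holes inside a region.

```python
BOGUS_VALUE = 0xDEADBABE

# (start, end) offsets of the implemented BAR0 ranges, inclusive.
REGIONS = (
    ("bogus", 0x0000, 0x000F),
    ("test", 0x0010, 0x00FF),
    ("general", 0x0300, 0x03FF),
    ("desc_ctrl", 0x1000, 0x1FFF),
)


def classify_bar0_offset(offset):
    """Return the name of the BAR0 region that contains offset."""
    for name, start, end in REGIONS:
        if start <= offset <= end:
            return name
    return "reserved"


def read_unbacked(offset):
    """Value read from an offset that has no real register behind it."""
    region = classify_bar0_offset(offset)
    if region == "bogus":
        return BOGUS_VALUE
    if region == "reserved":
        return 0
    raise ValueError("offset 0x%04x lies in the %s region" % (offset, region))
```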
105 | ||
106 | Write-combining is not enabled on any of the registers. | |
107 | ||
108 | BAR1 MSI-X register space is organized as follows: | |
109 | ||
110 | offset description | |
111 | ------------------------------------------------------ | |
112 | 0x0000-0x0fff MSI-X vector table (256 vectors total) | |
113 | 0x1000-0x1fff MSI-X PBA table | |
114 | ||
115 | ||
116 | SECTION 4: Interrupts, DMA, and Endianness | |
117 | ========================================== | |
118 | ||
119 | PCI Interrupts | |
120 | -------------- | |
121 | ||
122 | The device supports only MSI-X interrupts. BAR1 memory-mapped region contains | |
123 | the MSI-X vector and PBA tables, with support for up to 256 MSI-X vectors. | |
124 | ||
125 | The vector assignment is: | |
126 | ||
127 | vector description | |
128 | ----------------------------------------------------- | |
129 | 0 Command descriptor ring completion | |
130 | 1 Event descriptor ring completion | |
131 | 2 Test operation completion | |
132 | 3 RSVD | |
133 | 4-255 Tx and Rx descriptor ring completion | |
134 | Tx vector is even | |
135 | Rx vector is odd | |
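The vector assignment can be expressed as a lookup. This is a minimal sketch; the function name and the returned labels are illustrative only.

```python
def classify_vector(vector):
    """Map an MSI-X vector number to the ring or function it signals."""
    if not 0 <= vector <= 255:
        raise ValueError("vector out of range")
    fixed = {0: "cmd", 1: "event", 2: "test", 3: "rsvd"}
    if vector in fixed:
        return fixed[vector]
    # Vectors 4-255: even vectors are Tx, odd vectors are Rx.
    return "tx" if vector % 2 == 0 else "rx"
```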
136 | ||
137 | An MSI-X vector table entry is 16 bytes: | |
138 | ||
139 | field offset width description | |
140 | ------------------------------------------------------------- | |
141 | lower_addr 0x0 4 [31:2] message address[31:2] | |
142 | [1:0] Rsvd (4 byte alignment | |
143 | required) | |
144 | upper_addr 0x4 4 [31:15] Rsvd | |
145 | [14:0] message address[46:32] | |
146 | data 0x8 4 message data[31:0] | |
147 | control 0xc 4 [31:1] Rsvd | |
148 | [0] mask (0 = enable, | |
149 | 1 = masked) | |
150 | ||
151 | Software should install the Interrupt Service Routine (ISR) before any ports | |
152 | are enabled or any commands are issued on the command ring. | |
153 | ||
154 | DMA Operations | |
155 | -------------- | |
156 | ||
157 | DMA operations are used for packet DMA to/from the CPU, command and event | |
158 | processing. Command processing includes statistical counters and table dumps, | |
159 | table insertion/deletion, and more. Event processing provides an async | |
160 | notification method for device-originating events. Each DMA operation has a | |
161 | set of control registers to manage a descriptor ring. The descriptor rings are | |
162 | allocated from contiguous host DMA-able memory and registers specify the ring's | |
163 | base address, size and current head and tail indices. Software always writes | |
164 | the head, and hardware always writes the tail. | |
165 | ||
166 | The high-order bit of DMA_DESC_COMP_ERR is used to mark hardware completion | |
167 | of a descriptor. Software will clear this bit when posting a descriptor to the | |
168 | ring, and hardware will set this bit when the descriptor is complete. | |
169 | ||
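The completion bit handling can be sketched as below. DMA_DESC_COMP_ERR is a 2-byte field, so its high-order bit is bit 15. The helper names are illustrative, and treating the remaining 15 bits as the status value is an assumption of this sketch.

```python
COMP_ERR_GEN = 1 << 15  # high-order bit of the 2-byte DMA_DESC_COMP_ERR field


def desc_is_complete(comp_err):
    """True once hardware has set the completion bit."""
    return bool(comp_err & COMP_ERR_GEN)


def clear_completion(comp_err):
    """Value software stores in COMP_ERR when posting the descriptor."""
    return comp_err & ~COMP_ERR_GEN & 0xFFFF
```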
170 | Descriptor ring sizes must be a power of 2 and range from 2 to 64K entries. | |
171 | Descriptor rings' base address must be 8-byte aligned. Descriptors must be | |
172 | packed within the ring. Each descriptor in each ring must also be aligned on an 8 | |
173 | byte boundary. Each descriptor ring will have these registers: | |
174 | ||
175 | DMA_DESC_xxx_BASE_ADDR, offset 0x1000 + (x * 32), 64-bit, (R/W) | |
176 | DMA_DESC_xxx_SIZE, offset 0x1008 + (x * 32), 32-bit, (R/W) | |
177 | DMA_DESC_xxx_HEAD, offset 0x100c + (x * 32), 32-bit, (R/W) | |
178 | DMA_DESC_xxx_TAIL, offset 0x1010 + (x * 32), 32-bit, (R) | |
179 | DMA_DESC_xxx_CTRL, offset 0x1014 + (x * 32), 32-bit, (W) | |
180 | DMA_DESC_xxx_CREDITS, offset 0x1018 + (x * 32), 32-bit, (R/W) | |
181 | DMA_DESC_xxx_RSVD1, offset 0x101c + (x * 32), 32-bit, (R/W) | |
182 | ||
183 | Where x is descriptor ring index: | |
184 | ||
185 | index ring | |
186 | -------------------- | |
187 | 0 CMD | |
188 | 1 EVENT | |
189 | 2 TX (port 0) | |
190 | 3 RX (port 0) | |
191 | 4 TX (port 1) | |
192 | 5 RX (port 1) | |
193 | . | |
194 | . | |
195 | . | |
196 | 124 TX (port 61) | |
197 | 125 RX (port 61) | |
198 | 126 Resv | |
199 | 127 Resv | |
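The register offsets and ring indices above follow a regular pattern, sketched here. The helper names are illustrative. `port_index` is the zero-based port number used in the ring index table (0 to 61).

```python
DESC_REG_BASE = 0x1000
DESC_REG_STRIDE = 32

# Offset of each register within a ring's 32-byte register block.
RING_REGS = {
    "BASE_ADDR": 0x00,
    "SIZE": 0x08,
    "HEAD": 0x0C,
    "TAIL": 0x10,
    "CTRL": 0x14,
    "CREDITS": 0x18,
    "RSVD1": 0x1C,
}


def ring_reg_offset(ring_index, reg):
    """BAR0 offset of register `reg` for descriptor ring `ring_index`."""
    if not 0 <= ring_index <= 127:
        raise ValueError("ring index out of range")
    return DESC_REG_BASE + ring_index * DESC_REG_STRIDE + RING_REGS[reg]


def tx_ring_index(port_index):
    """Ring index of the TX ring for a zero-based port (0..61)."""
    return 2 + 2 * port_index


def rx_ring_index(port_index):
    """Ring index of the RX ring for a zero-based port (0..61)."""
    return 3 + 2 * port_index
```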
200 | ||
201 | Writing BASE_ADDR or SIZE will reset HEAD and TAIL to zero. HEAD cannot be | |
202 | written past TAIL. To do so would wrap the ring. An empty ring is when HEAD | |
203 | == TAIL. A full ring is when HEAD is one position behind TAIL. Both HEAD and | |
204 | TAIL increment and modulo wrap at the ring size. | |
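The HEAD/TAIL rules can be sketched as index arithmetic. The helper names are illustrative; `size` is the ring size in entries.

```python
def is_valid_ring_size(size):
    """Ring sizes must be a power of 2 between 2 and 64K entries."""
    return 2 <= size <= 65536 and size & (size - 1) == 0


def ring_is_empty(head, tail):
    return head == tail


def ring_is_full(head, tail, size):
    """Full when HEAD is one position behind TAIL (modulo ring size)."""
    return (head + 1) % size == tail


def ring_free_slots(head, tail, size):
    """Number of descriptors software may still post before the ring is full."""
    return (tail - head - 1) % size


def ring_advance(index, size):
    """Increment an index with wrap at the ring size."""
    return (index + 1) % size
```

One consequence of this scheme is that a ring of N entries can hold at most N - 1 outstanding descriptors, since HEAD == TAIL is reserved for the empty state.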
205 | ||
206 | CTRL register bits: | |
207 | ||
208 | bit name description | |
209 | ------------------------------------------------------------------------ | |
210 | [0] CTRL_RESET Reset the descriptor ring | |
211 | [31:1] Reserved | |
212 | ||
213 | All descriptor types share some common fields: | |
214 | ||
215 | field width description | |
216 | ------------------------------------------------------------------- | |
217 | DMA_DESC_BUF_ADDR 8 Phys addr of desc payload, 8-byte | |
218 | aligned | |
219 | DMA_DESC_COOKIE 8 Desc cookie for completion matching, | |
220 | upper-most bit is reserved | |
221 | DMA_DESC_BUF_SIZE 2 Desc payload size in bytes | |
222 | DMA_DESC_TLV_SIZE 2 Desc payload total size in bytes | |
223 | used for TLVs. Must be <= | |
224 | DMA_DESC_BUF_SIZE. | |
225 | DMA_DESC_COMP_ERR 2 Completion status of associated | |
226 | desc payload. High order bit is | |
227 | clear on new descs, toggled by | |
228 | hw for completed items. | |
229 | ||
230 | To support forward- and backward-compatibility, descriptor and completion | |
231 | payloads are specified in TLV format. Fields are packed with Type=field name, | |
232 | Length=field length, and Value=field value. Software will ignore unknown fields | |
233 | filled in by the switch. Likewise, the switch will ignore unknown fields | |
234 | filled in by software. | |
235 | ||
236 | Descriptor payload buffer is 8-byte aligned and TLVs are 8-byte aligned. The | |
237 | value within a TLV is also 8-byte aligned. The (packed, 8 byte) TLV header is: | |
238 | ||
239 | field width description | |
240 | ----------------------------- | |
241 | type 4 TLV type | |
242 | len 2 TLV value length | |
243 | pad 2 Reserved | |
244 | ||
245 | The alignment requirements for descriptors and TLVs are to avoid unaligned | |
246 | access exceptions in software. Note that the payload for each TLV is also | |
247 | 8 byte aligned. | |
248 | ||
249 | Figure 1 shows an example descriptor buffer with two TLVs. | |
250 | ||
251 | <------- 8 bytes -------> | |
252 | ||
253 | 8-byte +––––+ +–––––––––––+–––––+–––––+ +–+ | |
254 | align | type | len | pad | TLV#1 hdr | | |
255 | +–––––––––––+–––––+–––––+ (len=22) | | |
256 | | | | | |
257 | | value | TLV#1 value | |
258 | | | (padded to 8-byte | | |
259 | | +–––––+ alignment) | | |
260 | | |/////| | | |
261 | 8-byte +––––+ +–––––––––––+–––––––––––+ | | |
262 | align | type | len | pad | TLV#2 hdr DESC_BUF_SIZE | |
263 | +–––––+–––––+–––––+–––––+ (len=2) | | |
264 | |value|/////////////////| TLV#2 value | | |
265 | +–––––+/////////////////| | | |
266 | |///////////////////////| | | |
267 | |///////////////////////| | | |
268 | |///////////////////////| | | |
269 | |////////unused/////////| | | |
270 | |////////space//////////| | | |
271 | |///////////////////////| | | |
272 | |///////////////////////| | | |
273 | |///////////////////////| | | |
274 | +–––––––––––––––––––––––+ +–+ | |
275 | ||
276 | fig. 1 | |
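The layout in Figure 1 can be reproduced with a short packing sketch. The TLV type numbers 1 and 2 are placeholders, not values defined by this document, and filling the padding with zero bytes is an assumption. The header is packed little-endian, with `len` holding the unpadded value length.

```python
import struct

TLV_HDR = struct.Struct("<IHH")  # type (4), len (2), pad (2); little-endian
ALIGN = 8


def pad8(n):
    """Round n up to the next multiple of 8."""
    return (n + ALIGN - 1) & ~(ALIGN - 1)


def pack_tlv(tlv_type, value):
    """Return header + value, padded so the next TLV is 8-byte aligned."""
    padding = b"\x00" * (pad8(len(value)) - len(value))
    return TLV_HDR.pack(tlv_type, len(value), 0) + value + padding


def unpack_tlvs(buf):
    """Yield (type, value) pairs from a buffer of back-to-back TLVs."""
    offset = 0
    while offset + TLV_HDR.size <= len(buf):
        tlv_type, length, _pad = TLV_HDR.unpack_from(buf, offset)
        start = offset + TLV_HDR.size
        yield tlv_type, bytes(buf[start:start + length])
        offset = start + pad8(length)


# TLV#1 carries a 22-byte value, TLV#2 a 2-byte value, as in Figure 1.
buf = pack_tlv(1, bytes(range(22))) + pack_tlv(2, b"\xAB\xCD")
```

With these inputs TLV#1 occupies 32 bytes (8-byte header plus the 22-byte value padded to 24) and TLV#2 occupies 16 bytes, so DMA_DESC_TLV_SIZE for this buffer would be 48.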
277 | ||
278 | TLVs can be nested within the NEST TLV type. | |
279 | ||
280 | Interrupt credits | |
281 | ^^^^^^^^^^^^^^^^^ | |
282 | ||
283 | MSI-X vectors used for descriptor ring completions use a credit mechanism for | |
284 | efficient device, PCIe bus, OS and driver operations. Each descriptor ring has | |
285 | a credit count which represents the number of outstanding descriptors to be | |
286 | processed by the driver. As the device marks descriptors complete, the credit | |
287 | count is incremented. As the driver processes those outstanding descriptors, | |
288 | it returns credits back to the device. This way, the device knows the driver's | |
289 | progress and can make decisions about when to fire the next interrupt or not. | |
290 | When the credit count is zero, and the first descriptors are posted for the | |
291 | driver, a single interrupt is fired. Once the interrupt is fired, the | |
292 | interrupt is disabled (auto-masked*). In response to the interrupt, the driver | |
293 | will process descriptors and PIO write a returned credit value for that | |
294 | descriptor ring. If the driver returns all credits (the driver caught up with | |
295 | the device and there is no outstanding work), then the interrupt is unmasked, | |
296 | but not fired. If only partial credits are returned, the interrupt remains | |
297 | masked but the device generates an interrupt, signaling the driver that more | |
298 | outstanding work is available. | |
299 | ||
300 | (* this masking is unrelated to the MSI-X interrupt mask register) | |
301 | |
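The credit mechanism can be modeled as a small state machine. This is one reading of the description above, not a definitive model of the device; the class and attribute names are illustrative.

```python
class CreditModel:
    """Toy model of one descriptor ring's interrupt credit logic."""

    def __init__(self):
        self.credits = 0      # completed descriptors not yet returned
        self.masked = False   # auto-mask state (not the MSI-X mask bit)
        self.fired = 0        # number of interrupts generated so far

    def complete(self, count=1):
        """Device marks `count` descriptors complete."""
        was_zero = self.credits == 0
        self.credits += count
        if was_zero and not self.masked:
            self.fired += 1
            self.masked = True

    def return_credits(self, count):
        """Driver writes `count` returned credits for this ring."""
        if count > self.credits:
            raise ValueError("returning more credits than outstanding")
        self.credits -= count
        if self.credits == 0:
            self.masked = False   # caught up: unmask, do not fire
        else:
            self.fired += 1       # work remains: stay masked, fire again


ring = CreditModel()
ring.complete(3)         # first work posted: one interrupt, now masked
ring.complete(2)         # more work while masked: no new interrupt
ring.return_credits(2)   # partial return: interrupt fires again
ring.return_credits(3)   # all credits returned: unmasked, nothing fired
```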
302 | Endianness | |
303 | ---------- | |
304 | ||
305 | Device registers are hard-coded to little-endian (LE). The driver should | |
306 | convert between host endianness and LE for device register accesses. | |
307 | |
308 | Descriptors are LE. Descriptor buffer TLVs will have LE type and length | |
309 | fields, but the value field can either be LE or network-byte-order, depending | |
310 | on context. TLV values containing network packet data will be in network-byte | |
311 | order. A TLV value containing a field or mask used to compare against network | |
312 | packet data is network-byte order. For example, flow match fields (and masks) | |
313 | are network-byte-order since they're matched directly, byte-by-byte, against | |
314 | network packet data. All non-network-packet TLV multi-byte values will be LE. | |
315 | ||
316 | TLV values in network-byte-order are designated with (N). | |
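As an illustration of the two byte orders, the sketch below encodes a 4-byte port number (not marked (N), so LE) and a 2-byte VLAN ID (marked (N), so network-byte-order). The helper names are illustrative.

```python
import struct


def le32(value):
    """Encode a 4-byte little-endian TLV value, e.g. a physical port number."""
    return struct.pack("<I", value)


def net16(value):
    """Encode a 2-byte network-byte-order TLV value, e.g. a VLAN ID."""
    return struct.pack("!H", value)


pport_bytes = le32(5)     # least significant byte first
vlan_bytes = net16(100)   # most significant byte first
```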
317 | ||
318 | ||
319 | SECTION 5: Test Registers | |
320 | ========================= | |
321 | ||
322 | Rocker has several test registers to support troubleshooting register access, | |
323 | interrupt generation, and DMA operations: | |
324 | ||
325 | TEST_REG, offset 0x0010, 32-bit (R/W) | |
326 | TEST_REG64, offset 0x0018, 64-bit (R/W) | |
327 | TEST_IRQ, offset 0x0020, 32-bit (R/W) | |
328 | TEST_DMA_ADDR, offset 0x0028, 64-bit (R/W) | |
329 | TEST_DMA_SIZE, offset 0x0030, 32-bit (R/W) | |
330 | TEST_DMA_CTRL, offset 0x0034, 32-bit (R/W) | |
331 | ||
332 | Reads to TEST_REG and TEST_REG64 will read a value equal to twice the last | |
333 | value written to the register. The 32-bit and 64-bit versions are for testing | |
334 | 32-bit and 64-bit host accesses. | |
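The doubling behaviour can be modeled as follows, which may help when writing either an emulator or a driver self-test. Truncating the doubled value to the register width is an assumption of this sketch; the class name is illustrative.

```python
class DoublingReg:
    """Model of TEST_REG / TEST_REG64: reads return twice the last write."""

    def __init__(self, width_bits):
        self.mask = (1 << width_bits) - 1
        self.value = 0

    def write(self, value):
        self.value = value & self.mask

    def read(self):
        # Assumption: the result wraps at the register width.
        return (self.value * 2) & self.mask


reg32 = DoublingReg(32)
reg32.write(0x21)
reg64 = DoublingReg(64)
reg64.write(0x80000000)
```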
335 | ||
336 | A vector can be written to TEST_IRQ and the device will generate an interrupt | |
337 | for that vector. | |
338 | ||
339 | To test basic DMA operations, allocate a DMA-able host buffer and put the | |
340 | buffer address into TEST_DMA_ADDR and size into TEST_DMA_SIZE. Then, write to | |
341 | TEST_DMA_CTRL to manipulate the buffer contents. TEST_DMA_CTRL operations are: | |
342 | ||
343 | operation value description | |
344 | ----------------------------------------------------------- | |
345 | TEST_DMA_CTRL_CLEAR 1 clear buffer | |
346 | TEST_DMA_CTRL_FILL 2 fill buffer bytes with 0x96 | |
347 | TEST_DMA_CTRL_INVERT 4 invert bytes in buffer | |
348 | ||
349 | Various buffer address and sizes should be tested to verify no address boundary | |
350 | issue exists. In particular, buffers that start on odd-8-byte boundary and/or | |
351 | span multiple PAGE sizes should be tested. | |
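A driver self-test needs the expected buffer contents after each operation. The sketch below computes them, assuming that "clear" means filling the buffer with zero bytes; the function name is illustrative.

```python
TEST_DMA_CTRL_CLEAR = 1
TEST_DMA_CTRL_FILL = 2
TEST_DMA_CTRL_INVERT = 4


def expected_after(op, buf):
    """Expected buffer contents after the device performs `op` on `buf`."""
    if op == TEST_DMA_CTRL_CLEAR:
        return bytes(len(buf))            # assumption: cleared to zero
    if op == TEST_DMA_CTRL_FILL:
        return b"\x96" * len(buf)
    if op == TEST_DMA_CTRL_INVERT:
        return bytes(b ^ 0xFF for b in buf)
    raise ValueError("unknown TEST_DMA_CTRL operation")
```

Running FILL followed by INVERT should leave every byte equal to 0x69, which gives a quick check that both operations reached the whole buffer.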
352 | ||
353 | ||
354 | SECTION 6: Ports | |
355 | ================ | |
356 | ||
357 | Physical and Logical Ports | |
358 | -------------------------- | |
359 | ||
360 | The switch supports up to 62 physical (front-panel) ports. Register | |
361 | PORT_PHYS_COUNT returns the actual number of physical ports available: | |
362 | ||
363 | PORT_PHYS_COUNT, offset 0x0304, 32-bit, (R) | |
364 | ||
365 | In addition to front-panel ports, the switch supports logical ports for | |
366 | tunnels. | |
367 | ||
368 | Front-panel ports and logical tunnel ports are mapped into a single 32-bit port | |
369 | space. A special CPU port is assigned port 0. The front-panel ports are | |
370 | mapped to ports 1-62. A special loopback port is assigned port 63. Logical | |
371 | tunnel ports are assigned ports 0x00010000-0x0001ffff. | |
372 | To summarize the port assignments: | |
373 | ||
374 | port mapping | |
375 | ------------------------------------------------------- | |
376 | 0 CPU port (for packets to/from host CPU) | |
377 | 1-62 front-panel physical ports | |
378 | 63 loopback port | |
379 | 64-0x0000ffff RSVD | |
380 | 0x00010000-0x0001ffff logical tunnel ports | |
381 | 0x00020000-0xffffffff RSVD | |
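The port space mapping can be sketched as a classifier. The function name and returned labels are illustrative.

```python
def classify_port(port):
    """Classify a 32-bit port number according to the port space table."""
    if not 0 <= port <= 0xFFFFFFFF:
        raise ValueError("port number does not fit in 32 bits")
    if port == 0:
        return "cpu"
    if 1 <= port <= 62:
        return "front-panel"
    if port == 63:
        return "loopback"
    if 0x00010000 <= port <= 0x0001FFFF:
        return "tunnel"
    return "rsvd"
```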
382 | ||
383 | Physical Port Mode | |
384 | ------------------ | |
385 | ||
386 | Switch front-panel ports operate in a mode. Currently, the only mode is | |
387 | OF-DPA. OF-DPA[1] mode is based on OpenFlow Data Plane Abstraction (OF-DPA) | |
388 | Abstract Switch Specification, Version 1.0, from Broadcom Corporation. To | |
389 | set/get the mode for front-panel ports, see port settings, below. | |
390 | ||
391 | Port Settings | |
392 | ------------- | |
393 | ||
394 | Link status for all front-panel ports is available via PORT_PHYS_LINK_STATUS: | |
395 | ||
396 | PORT_PHYS_LINK_STATUS, offset 0x0310, 64-bit, (R) | |
397 | ||
398 | Value is port bitmap. Bits 0 and 63 always read 0. Bits 1-62 | |
399 | read 1 for link UP and 0 for link DOWN for respective front-panel ports. | |
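Decoding the link status bitmap can be sketched as follows; the function name is illustrative.

```python
def ports_with_link_up(status):
    """Return the front-panel ports (1..62) whose link bit is set."""
    return [port for port in range(1, 63) if status & (1 << port)]
```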
400 | ||
401 | Other properties for front-panel ports are available via DMA CMD descriptors: | |
402 | ||
403 | Get PORT_SETTINGS descriptor: | |
404 | ||
405 | field width description | |
406 | ---------------------------------------------- | |
407 | PORT_SETTINGS 2 CMD_GET | |
408 | PPORT 4 Physical port # | |
409 | ||
410 | Get PORT_SETTINGS completion: | |
411 | ||
412 | field width description | |
413 | ---------------------------------------------- | |
414 | PPORT 4 Physical port # | |
415 | SPEED 4 Current port interface speed, in Mbps | |
416 | DUPLEX 1 1 = Full, 0 = Half | |
417 | AUTONEG 1 1 = enabled, 0 = disabled | |
418 | MACADDR 6 Port MAC address | |
419 | MODE 1 0 = OF-DPA | |
420 | LEARNING 1 MAC address learning on port | |
421 | 1 = enabled | |
422 | 0 = disabled | |
423 | PHYS_NAME <var> Physical port name (string) | |
424 | |
425 | Set PORT_SETTINGS descriptor: | |
426 | ||
427 | field width description | |
428 | ---------------------------------------------- | |
429 | PORT_SETTINGS 2 CMD_SET | |
430 | PPORT 4 Physical port # | |
431 | SPEED 4 Port interface speed, in Mbps | |
432 | DUPLEX 1 1 = Full, 0 = Half | |
433 | AUTONEG 1 1 = enabled, 0 = disabled | |
434 | MACADDR 6 Port MAC address | |
435 | MODE 1 0 = OF-DPA | |
436 | ||
437 | Port Enable | |
438 | ----------- | |
439 | ||
440 | Front-panel ports are initially disabled, which means port ingress and egress | |
441 | packets will be dropped. To enable or disable a port, use PORT_PHYS_ENABLE: | |
442 | ||
443 | PORT_PHYS_ENABLE: offset 0x0318, 64-bit, (R/W) | |
444 | ||
445 | Value is bitmap of first 64 ports. Bits 0 and 63 are ignored | |
446 | and always read as 0. Write 1 to enable port; write 0 to disable it. | |
447 | Default is 0. | |
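Because PORT_PHYS_ENABLE is a bitmap covering all ports, changing one port is a read-modify-write. The sketch below computes the new bitmap value from the current one; the helper names are illustrative, and the register read and write themselves are left to the caller.

```python
FRONT_PANEL_PORTS = range(1, 63)  # bits 0 and 63 are ignored by the device


def enable_port(bitmap, port):
    """Return `bitmap` with the bit for front-panel `port` set."""
    if port not in FRONT_PANEL_PORTS:
        raise ValueError("not a front-panel port")
    return bitmap | (1 << port)


def disable_port(bitmap, port):
    """Return `bitmap` with the bit for front-panel `port` cleared."""
    if port not in FRONT_PANEL_PORTS:
        raise ValueError("not a front-panel port")
    return bitmap & ~(1 << port)
```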
448 | ||
449 | ||
450 | SECTION 7: Switch Control | |
451 | ========================= | |
452 | ||
453 | This section covers switch-wide register settings. | |
454 | ||
455 | Control | |
456 | ------- | |
457 | ||
458 | This register is used for low level control of the switch. | |
459 | ||
460 | CONTROL: offset 0x0300, 32-bit, (W) | |
461 | ||
462 | bit name description | |
463 | ------------------------------------------------------------------------ | |
464 | [0] CONTROL_RESET If set, device will perform reset | |
465 | [31:1] Reserved | |
466 | ||
467 | Switch ID | |
468 | --------- | |
469 | ||
470 | The switch has a SWITCH_ID to be used by software to uniquely identify the | |
471 | switch: | |
472 | ||
473 | SWITCH_ID: offset 0x0320, 64-bit, (R) | |
474 | ||
475 | Value is opaque to switch software and no special encoding is implied. | |
476 | ||
477 | ||
478 | SECTION 8: Events | |
479 | ================= | |
480 | ||
481 | Non-I/O asynchronous events from the device are reported to the host using the | |
482 | event ring. The TLV structure for events is: | |
483 | ||
484 | field width description | |
485 | --------------------------------------------------- | |
486 | TYPE 4 Event type, one of: | |
487 | 1: LINK_CHANGED | |
488 | 2: MAC_VLAN_SEEN | |
489 | INFO <nest> Event info (details below) | |
490 | ||
491 | Link Changed Event | |
492 | ------------------ | |
493 | ||
494 | When link status changes on a physical port, this event is generated. | |
495 | ||
496 | field width description | |
497 | --------------------------------------------------- | |
498 | INFO <nest> | |
499 | PPORT 4 Physical port | |
500 | LINKUP 1 Link status: | |
501 | 0: down | |
502 | 1: up | |
503 | ||
504 | MAC VLAN Seen Event | |
505 | ------------------- | |
506 | ||
507 | When a packet ingresses on a port and the source MAC/VLAN isn't known to the | |
508 | device, the device will generate this event. In response to the event, the | |
509 | driver should install the MAC/VLAN for the port into the device's bridge | |
510 | table. Once installed, the MAC/VLAN is known on the port and this event will | |
511 | no longer be generated. | |
512 | ||
513 | field width description | |
514 | --------------------------------------------------- | |
515 | INFO <nest> | |
516 | PPORT 4 Physical port | |
517 | MAC 6 MAC address | |
518 | VLAN 2 VLAN ID | |
519 | ||
520 | ||
521 | SECTION 9: CPU Packet Processing | |
522 | ================================ | |
523 | ||
524 | Ingress packets directed to the host CPU for further processing are delivered | |
525 | in the DMA RX ring. Likewise, host CPU originating packets destined to egress | |
526 | on switch ports are scheduled by software using the DMA TX ring. | |
527 | ||
528 | Tx Packet Processing | |
529 | -------------------- | |
530 | ||
531 | Software schedules packets for egress on switch ports using the DMA TX ring. A | |
532 | TX descriptor buffer describes the packet location and size in host DMA-able | |
533 | memory, the destination port, and any hardware-offload functions (such as L3 | |
534 | payload checksum offload). Software then bumps the descriptor head to signal | |
535 | hardware of new Tx work. In response, hardware will DMA read Tx descriptors up | |
536 | to head, DMA read descriptor buffer and packet data, perform offloading | |
537 | functions, and finally frame packet on wire (network). Once packet processing | |
538 | is complete, hardware will write back status to descriptor(s) to signal to | |
539 | software that Tx is complete and software resources (e.g. skb) backing packet | |
540 | can be released. | |
541 | ||
542 | Figure 2 shows an example 3-fragment packet queued with one Tx descriptor. A | |
543 | TLV is used for each packet fragment. | |
544 | ||
545 | pkt frag 1 | |
546 | +–––––––+ +–+ | |
547 | +–––+ | | | |
548 | desc buf | | | | | |
549 | +––––––––+ | | | | | |
550 | Tx ring +–––+ +–––––+ | | | | |
551 | +–––––––––+ | | TLVs | +–––––––+ | | |
552 | | +–––+ +––––––––+ pkt frag 2 | | |
553 | | desc 0 | | +–––––+ +–––––––+ | | |
554 | +–––––––––+ | TLVs | +–––+ | | | |
555 | head+–+ | +––––––––+ | | | | |
556 | | desc 1 | | +–––––+ +–––––––+ |pkt | |
557 | +–––––––––+ | TLVs | | | | |
558 | | | +––––––––+ | pkt frag 3 | | |
559 | | | | +–––––––+ | | |
560 | +–––––––––+ +–––+ | | | |
561 | | | | | | | |
562 | | | | | | | |
563 | +–––––––––+ | | | | |
564 | | | | | | | |
565 | | | | | | | |
566 | +–––––––––+ | | | | |
567 | | | +–––––––+ +–+ | |
568 | | | | |
569 | +–––––––––+ | |
570 | ||
571 | fig 2. | |
572 | ||
573 | The TLVs for Tx descriptor buffer are: | |
574 | ||
575 | field width description | |
576 | --------------------------------------------------------------------- | |
577 | PPORT 4 Destination physical port # | |
578 | TX_OFFLOAD 1 Hardware offload modes: | |
579 | 0: no offload | |
580 | 1: insert IP csum (ipv4 only) | |
581 | 2: insert TCP/UDP csum | |
582 | 3: L3 csum calc and insert | |
583 | into csum offset (TX_L3_CSUM_OFF) | |
584 | 16-bit 1's complement csum value. | |
585 | IPv4 pseudo-header and IP | |
586 | already calculated by OS | |
587 | and inserted. | |
588 | 4: TSO (TCP Segmentation Offload) | |
589 | TX_L3_CSUM_OFF 2 For L3 csum offload mode, the offset, | |
590 | from the beginning of the packet, | |
591 | of the csum field in the L3 header | |
592 | TX_TSO_MSS 2 For TSO offload mode, the | |
593 | Maximum Segment Size in bytes | |
594 | TX_TSO_HDR_LEN 2 For TSO offload mode, the | |
595 | length of ethernet, IP, and | |
596 | TCP/UDP headers, including IP | |
597 | and TCP options. | |
598 | TX_FRAGS <array> Packet fragments | |
599 | TX_FRAG <nest> Packet fragment | |
600 | TX_FRAG_ADDR 8 DMA address of packet fragment | |
601 | TX_FRAG_LEN 2 Packet fragment length | |
602 | ||
603 | Possible status return codes in descriptor on completion are: | |
604 | ||
605 | DESC_COMP_ERR reason | |
606 | -------------------------------------------------------------------- | |
607 | 0 OK | |
608 | -ROCKER_ENXIO address or data read err on desc buf or packet | |
609 | fragment | |
610 | -ROCKER_EINVAL bad pport or TSO or csum offloading error | |
611 | -ROCKER_ENOMEM no memory for internal staging tx fragment | |
612 | ||
613 | Rx Packet Processing | |
614 | -------------------- | |
615 | ||
616 | Packets ingressing on switch ports that are not forwarded by the switch, but | |
617 | rather directed to the host CPU for further processing, are delivered in the DMA | |
618 | RX ring. Rx descriptor buffers are allocated by software and placed on the | |
619 | ring. Hardware will fill Rx descriptor buffers with packet data, write the | |
620 | completion, and signal to software that a new packet is ready. Since Rx packet | |
621 | size is not known a priori, the Rx descriptor buffer must be allocated for | |
622 | worst-case packet size. A single Rx descriptor will contain the entire Rx | |
623 | packet data in one RX_FRAG. Other Rx TLVs describe any hardware offloads | |
624 | performed on the packet, such as checksum validation. | |
625 | ||
626 | The TLVs for Rx descriptor buffer are: | |
627 | ||
628 | field width description | |
629 | --------------------------------------------------- | |
630 | PPORT 4 Source physical port # | |
631 | RX_FLAGS 2 Packet parsing flags: | |
632 | (1 << 0): IPv4 packet | |
633 | (1 << 1): IPv6 packet | |
634 | (1 << 2): csum calculated | |
635 | (1 << 3): IPv4 csum good | |
636 | (1 << 4): IP fragment | |
637 | (1 << 5): TCP packet | |
638 | (1 << 6): UDP packet | |
639 | (1 << 7): TCP/UDP csum good | |
640 | (1 << 8): Offload forward | |
641 | RX_CSUM 2 IP calculated checksum: | |
642 | IPv4: IP payload csum | |
643 | IPv6: header and payload csum | |
644 | (Only valid if RX_FLAGS:csum calc is set) | |
645 | RX_FRAG_ADDR 8 DMA address of packet fragment | |
646 | RX_FRAG_MAX_LEN 2 Packet maximum fragment length | |
647 | RX_FRAG_LEN 2 Actual packet fragment length after receive | |
648 | ||
649 | The Offload forward RX_FLAG indicates the device has already forwarded the packet | |
650 | so the host CPU should not also forward the packet. | |
651 | ||
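Decoding RX_FLAGS can be sketched as below. The bit positions come from the table above; the short flag names are labels chosen for this example, not identifiers defined by the device.

```python
RX_FLAG_NAMES = {
    0: "ipv4",
    1: "ipv6",
    2: "csum_calc",
    3: "ipv4_csum_good",
    4: "ip_frag",
    5: "tcp",
    6: "udp",
    7: "tcp_udp_csum_good",
    8: "fwd_offload",
}


def decode_rx_flags(flags):
    """Return the names of all flags set in RX_FLAGS, lowest bit first."""
    return [name for bit, name in sorted(RX_FLAG_NAMES.items())
            if flags & (1 << bit)]


def host_should_forward(flags):
    """False when the device reports it has already forwarded the packet."""
    return not flags & (1 << 8)
```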
652 | Possible status return codes in descriptor on completion are: | |
653 | ||
654 | DESC_COMP_ERR reason | |
655 | -------------------------------------------------------------------- | |
656 | 0 OK | |
657 | -ROCKER_ENXIO address or data read err on desc buf | |
658 | -ROCKER_ENOMEM no memory for internal staging desc buf | |
659 | -ROCKER_EMSGSIZE Rx descriptor buffer wasn't big enough to contain | |
660 | packet data TLV and other TLVs. | |
661 | ||
662 | ||
663 | SECTION 10: OF-DPA Mode | |
664 | ======================= | |
665 | ||
666 | OF-DPA mode allows the switch to offload flow packet processing functions to | |
667 | hardware. An OpenFlow controller would communicate with an OpenFlow agent | |
668 | installed on the switch. The OpenFlow agent would (directly or indirectly) | |
669 | communicate with the Rocker switch driver, which in turn would program switch | |
670 | hardware with flow functionality, as defined in OF-DPA. The block diagram is: | |
671 | ||
672 | +–––––––––––––––----–––+ | |
673 | | OF | | |
674 | | Remote Controller | | |
675 | +––––––––+––----–––––––+ | |
676 | | | |
677 | | | |
678 | +––––––––+–––––––––+ | |
679 | | OF | | |
680 | | Local Agent | | |
681 | +––––––––––––––––––+ | |
682 | | | | |
683 | | Rocker Driver | | |
684 | +––––––––––––––––––+ | |
685 | <this spec> | |
686 | +––––––––––––––––––+ | |
687 | | | | |
688 | | Rocker Switch | | |
689 | +––––––––––––––––––+ | |
690 | ||
691 | To participate in flow functions, ports must be configured for OF-DPA mode | |
692 | during switch initialization. | |
693 | ||
694 | OF-DPA Flow Table Interface | |
695 | --------------------------- | |
696 | ||
697 | There are commands to add, modify, delete, and get stats of flow table entries. | |
698 | The commands are issued using the DMA CMD descriptor ring. The following | |
699 | commands are defined: | |
700 | ||
701 | CMD_ADD: add an entry to flow table | |
702 | CMD_MOD: modify an entry in flow table | |
703 | CMD_DEL: delete an entry from flow table | |
704 | CMD_GET_STATS: get stats for flow entry | |
705 | ||
706 | TLVs for add and modify commands are: | |
707 | ||
708 | field width description | |
709 | ---------------------------------------------------- | |
710 | OF_DPA_CMD 2 CMD_[ADD|MOD] | |
711 | OF_DPA_TBL 2 Flow table ID | |
712 | 0: ingress port | |
713 | 10: vlan | |
714 | 20: termination mac | |
715 | 30: unicast routing | |
716 | 40: multicast routing | |
717 | 50: bridging | |
718 | 60: ACL policy | |
719 | OF_DPA_PRIORITY 4 Flow priority | |
720 | OF_DPA_HARDTIME 4 Hard timeout for flow | |
721 | OF_DPA_IDLETIME 4 Idle timeout for flow | |
722 | OF_DPA_COOKIE 8 Cookie | |
723 | ||
724 | Additional TLVs based on flow table ID: | |
725 | ||
726 | Table ID 0: ingress port | |
727 | ||
728 | field width description | |
729 | ---------------------------------------------------- | |
730 | OF_DPA_IN_PPORT 4 ingress physical port number | |
731 | OF_DPA_GOTO_TBL 2 goto table ID; zero to drop | |
732 | ||
733 | Table ID 10: vlan | |
734 | ||
735 | field width description | |
736 | ---------------------------------------------------- | |
737 | OF_DPA_IN_PPORT 4 ingress physical port number | |
738 | OF_DPA_VLAN_ID 2 (N) vlan ID | |
739 | OF_DPA_VLAN_ID_MASK 2 (N) vlan ID mask | |
740 | OF_DPA_GOTO_TBL 2 goto table ID; zero to drop | |
741 | OF_DPA_NEW_VLAN_ID 2 (N) new vlan ID | |
742 | ||
743 | Table ID 20: termination mac | |
744 | ||
745 | field width description | |
746 | ---------------------------------------------------- | |
747 | OF_DPA_IN_PPORT 4 ingress physical port number | |
748 | OF_DPA_IN_PPORT_MASK 4 ingress physical port number mask | |
749 | OF_DPA_ETHERTYPE 2 (N) must be either 0x0800 or 0x86dd | |
750 | OF_DPA_DST_MAC 6 (N) destination MAC | |
751 | OF_DPA_DST_MAC_MASK 6 (N) destination MAC mask | |
752 | OF_DPA_VLAN_ID 2 (N) vlan ID | |
753 | OF_DPA_VLAN_ID_MASK 2 (N) vlan ID mask | |
754 | OF_DPA_GOTO_TBL 2 only acceptable values are | |
755 | unicast or multicast routing | |
756 | table IDs | |
757 | OF_DPA_OUT_PPORT 2 if specified, must be | |
758 | controller, set zero otherwise | |
759 | ||
760 | Table ID 30: unicast routing | |
761 | ||
762 | field width description | |
763 | ---------------------------------------------------- | |
764 | OF_DPA_ETHERTYPE 2 (N) must be either 0x0800 or 0x86dd | |
765 | OF_DPA_DST_IP 4 (N) destination IPv4 address. | |
766 | Must be unicast address | |
767 | OF_DPA_DST_IP_MASK 4 (N) IP mask. Must be prefix mask | |
768 | OF_DPA_DST_IPV6 16 (N) destination IPv6 address. | |
769 | Must be unicast address | |
770 | OF_DPA_DST_IPV6_MASK 16 (N) IPv6 mask. Must be prefix mask | |
771 | OF_DPA_GOTO_TBL 2 goto table ID; zero to drop | |
772 | OF_DPA_GROUP_ID 4 data for GROUP action must | |
773 | be an L3 Unicast group entry | |
774 | ||
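The unicast routing table requires the IPv4 mask to be a prefix mask and the destination to be a unicast address. The sketch below shows one way to test both conditions in C. The helper names are hypothetical, the values are taken in host byte order (convert before writing the (N) TLVs), and the address check covers only the 224.0.0.0/4 multicast range, which is an assumption about what a device would reject rather than something this guide specifies.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if mask is a run of one bits followed only by zero bits. */
static bool ipv4_is_prefix_mask(uint32_t mask)
{
    uint32_t inv = ~mask;           /* trailing zeros become trailing ones */

    return (inv & (inv + 1)) == 0;  /* inv must have the form 2^k - 1 */
}

/* True if addr falls in the IPv4 multicast range 224.0.0.0/4. */
static bool ipv4_is_multicast(uint32_t addr)
{
    return (addr & 0xf0000000u) == 0xe0000000u;
}

/* Combined check for an IPv4 unicast routing flow. */
static bool ucast_route_ipv4_ok(uint32_t dst_ip, uint32_t dst_mask)
{
    return ipv4_is_prefix_mask(dst_mask) && !ipv4_is_multicast(dst_ip);
}
```

The mask test relies on unsigned wrap-around: for an all-zero mask, `inv + 1` wraps to 0, so a /0 default route is accepted as a valid prefix mask.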
775 | Table ID 40: multicast routing | |
776 | ||
777 | field width description | |
778 | ---------------------------------------------------- | |
779 | OF_DPA_ETHERTYPE 2 (N) must be either 0x0800 or 0x86dd | |
780 | OF_DPA_VLAN_ID 2 (N) vlan ID | |
781 | OF_DPA_SRC_IP 4 (N) source IPv4. Optional, | |
782 | can contain IPv4 address, | |
783 | must be completely masked | |
784 | if not used | |
785 | OF_DPA_SRC_IP_MASK 4 (N) IP Mask | |
786 | OF_DPA_DST_IP 4 (N) destination IPv4 address. | |
787 | Must be multicast address | |
788 | OF_DPA_SRC_IPV6 16 (N) source IPv6 Address. Optional. | |
789 | Can contain IPv6 address, | |
790 | must be completely masked | |
791 | if not used | |
792 | OF_DPA_SRC_IPV6_MASK 16 (N) IPv6 mask. | |
793 | OF_DPA_DST_IPV6 16 (N) destination IPv6 Address. Must | |
794 | be multicast address | |
796 | OF_DPA_GOTO_TBL 2 goto table ID; zero to drop | |
797 | OF_DPA_GROUP_ID 4 data for GROUP action must | |
798 | be an L3 multicast group entry | |
799 | ||
800 | Table ID 50: bridging | |
801 | ||
802 | field width description | |
803 | ---------------------------------------------------- | |
804 | OF_DPA_VLAN_ID 2 (N) vlan ID | |
805 | OF_DPA_TUNNEL_ID 4 tunnel ID | |
806 | OF_DPA_DST_MAC 6 (N) destination MAC | |
807 | OF_DPA_DST_MAC_MASK 6 (N) destination MAC mask | |
808 | OF_DPA_GOTO_TBL 2 goto table ID; zero to drop | |
809 | OF_DPA_GROUP_ID 4 data for GROUP action must | |
810 | be a L2 Interface, L2 | |
811 | Multicast, L2 Flood, | |
812 | or L2 Overlay group entry | |
813 | as appropriate | |
814 | OF_DPA_TUNNEL_LPORT 4 unicast Tenant Bridging | |
815 | flows specify a tunnel | |
816 | logical port ID | |
817 | OF_DPA_OUT_PPORT 2 data for OUTPUT action, | |
818 | restricted to CONTROLLER, | |
819 | set to 0 otherwise | |
820 | ||
821 | Table ID 60: acl policy | |
822 | ||
823 | field width description | |
824 | ---------------------------------------------------- | |
825 | OF_DPA_IN_PPORT 4 ingress physical port number | |
826 | OF_DPA_IN_PPORT_MASK 4 ingress physical port number mask | |
827 | OF_DPA_ETHERTYPE 2 (N) ethertype | |
828 | OF_DPA_VLAN_ID 2 (N) vlan ID | |
829 | OF_DPA_VLAN_ID_MASK 2 (N) vlan ID mask | |
830 | OF_DPA_VLAN_PCP 2 (N) vlan Priority Code Point | |
831 | OF_DPA_VLAN_PCP_MASK 2 (N) vlan Priority Code Point mask | |
832 | OF_DPA_SRC_MAC 6 (N) source MAC | |
833 | OF_DPA_SRC_MAC_MASK 6 (N) source MAC mask | |
834 | OF_DPA_DST_MAC 6 (N) destination MAC | |
835 | OF_DPA_DST_MAC_MASK 6 (N) destination MAC mask | |
836 | OF_DPA_TUNNEL_ID 4 tunnel ID | |
837 | OF_DPA_SRC_IP 4 (N) source IPv4. Optional, | |
838 | can contain IPv4 address, | |
839 | must be completely masked | |
840 | if not used | |
841 | OF_DPA_SRC_IP_MASK 4 (N) IP Mask | |
842 | OF_DPA_DST_IP 4 (N) destination IPv4 address. | |
843 | Must be multicast address | |
844 | OF_DPA_DST_IP_MASK 4 (N) IP Mask | |
845 | OF_DPA_SRC_IPV6 16 (N) source IPv6 Address. Optional. | |
846 | Can contain IPv6 address, | |
847 | must be completely masked | |
848 | if not used | |
849 | OF_DPA_SRC_IPV6_MASK 16 (N) IPv6 mask | |
850 | OF_DPA_DST_IPV6 16 (N) destination IPv6 Address. Must | |
851 | be multicast address. | |
852 | OF_DPA_DST_IPV6_MASK 16 (N) IPv6 mask | |
853 | OF_DPA_SRC_ARP_IP 4 (N) source IPv4 address in the ARP | |
854 | payload. Only used if ethertype | |
855 | == 0x0806. | |
856 | OF_DPA_SRC_ARP_IP_MASK 4 (N) IP Mask | |
857 | OF_DPA_IP_PROTO 1 IP protocol | |
858 | OF_DPA_IP_PROTO_MASK 1 IP protocol mask | |
859 | OF_DPA_IP_DSCP 1 DSCP | |
860 | OF_DPA_IP_DSCP_MASK 1 DSCP mask | |
861 | OF_DPA_IP_ECN 1 ECN | |
862 | OF_DPA_IP_ECN_MASK 1 ECN mask | |
863 | OF_DPA_L4_SRC_PORT 2 (N) L4 source port, only for | |
864 | TCP, UDP, or SCTP | |
865 | OF_DPA_L4_SRC_PORT_MASK 2 (N) L4 source port mask | |
866 | OF_DPA_L4_DST_PORT 2 (N) L4 destination port, only for | |
867 | TCP, UDP, or SCTP | |
868 | OF_DPA_L4_DST_PORT_MASK 2 (N) L4 destination port mask | |
869 | OF_DPA_ICMP_TYPE 1 ICMP type, only if IP | |
870 | protocol is 1 | |
871 | OF_DPA_ICMP_TYPE_MASK 1 ICMP type mask | |
872 | OF_DPA_ICMP_CODE 1 ICMP code | |
873 | OF_DPA_ICMP_CODE_MASK 1 ICMP code mask | |
874 | OF_DPA_IPV6_LABEL 4 (N) IPv6 flow label | |
875 | OF_DPA_IPV6_LABEL_MASK 4 (N) IPv6 flow label mask | |
876 | OF_DPA_GROUP_ID 4 data for GROUP action | |
877 | OF_DPA_QUEUE_ID_ACTION 1 write the queue ID | |
878 | OF_DPA_NEW_QUEUE_ID 1 queue ID | |
879 | OF_DPA_VLAN_PCP_ACTION 1 write the VLAN priority | |
880 | OF_DPA_NEW_VLAN_PCP 1 VLAN priority | |
881 | OF_DPA_IP_DSCP_ACTION 1 write the DSCP | |
882 | OF_DPA_NEW_IP_DSCP 1 new DSCP | |
883 | OF_DPA_TUNNEL_LPORT 4 restrict to valid tunnel | |
884 | logical port, set to 0 | |
885 | otherwise. | |
886 | OF_DPA_OUT_PPORT 2 data for OUTPUT action, | |
887 | restricted to CONTROLLER, | |
888 | set to 0 otherwise | |
889 | OF_DPA_CLEAR_ACTIONS 4 if 1, packets matching flow are | |
890 | dropped (all other instructions | |
891 | ignored) | |
892 | ||
893 | TLVs for flow delete and get stats commands are: | |
894 | ||
895 | field width description | |
896 | --------------------------------------------------- | |
897 | OF_DPA_CMD 2 CMD_[DEL|GET_STATS] | |
898 | OF_DPA_COOKIE 8 Cookie | |
899 | ||
900 | On completion of get stats command, the descriptor buffer is written back with | |
901 | the following TLVs: | |
902 | ||
903 | field width description | |
904 | --------------------------------------------------- | |
905 | OF_DPA_STAT_DURATION 4 Flow duration | |
906 | OF_DPA_STAT_RX_PKTS 8 Received packets | |
907 | OF_DPA_STAT_TX_PKTS 8 Transmit packets | |
908 | ||
909 | Possible status return codes in descriptor on completion are: | |
910 | ||
911 | DESC_COMP_ERR command reason | |
912 | -------------------------------------------------------------------- | |
913 | 0 all OK | |
914 | -ROCKER_EFAULT all head or tail index outside | |
915 | of ring | |
916 | -ROCKER_ENXIO all address or data read err on | |
917 | desc buf | |
918 | -ROCKER_EMSGSIZE GET_STATS cmd descriptor buffer wasn't | |
919 | big enough to contain write-back | |
920 | TLVs | |
921 | -ROCKER_EINVAL all invalid parameters passed in | |
922 | -ROCKER_EEXIST ADD entry already exists | |
923 | -ROCKER_ENOSPC ADD no space left in flow table | |
924 | -ROCKER_ENOENT MOD|DEL|GET_STATS cookie invalid | |
925 | ||
926 | Group Table Interface | |
927 | --------------------- | |
928 | ||
929 | There are commands to add, modify, delete, and get stats of group table | |
930 | entries. The commands are issued using the DMA CMD descriptor ring. The | |
931 | following commands are defined: | |
932 | ||
933 | CMD_ADD: add an entry to group table | |
934 | CMD_MOD: modify an entry in group table | |
935 | CMD_DEL: delete an entry from group table | |
936 | CMD_GET_STATS: get stats for group entry | |
937 | ||
938 | TLVs for add and modify commands are: | |
939 | ||
940 | field width description | |
941 | ----------------------------------------------------------- | |
942 | FLOW_GROUP_CMD 2 CMD_[ADD|MOD] | |
943 | FLOW_GROUP_ID 2 Flow group ID | |
944 | FLOW_GROUP_TYPE 1 Group type: | |
945 | 0: L2 interface | |
946 | 1: L2 rewrite | |
947 | 2: L3 unicast | |
948 | 3: L2 multicast | |
949 | 4: L2 flood | |
950 | 5: L3 interface | |
951 | 6: L3 multicast | |
952 | 7: L3 ECMP | |
953 | 8: L2 overlay | |
954 | FLOW_VLAN_ID 2 Vlan ID (types 0, 3, 4, 6) | |
955 | FLOW_L2_PORT 2 Port (type 0) | |
956 | FLOW_INDEX 4 Index (all types but 0) | |
957 | FLOW_OVERLAY_TYPE 1 Overlay sub-type (type 8): | |
958 | 0: Flood unicast tunnel | |
959 | 1: Flood multicast tunnel | |
960 | 2: Multicast unicast tunnel | |
961 | 3: Multicast multicast tunnel | |
962 | FLOW_GROUP_ACTION nest | |
963 | FLOW_GROUP_ID 2 next group ID in chain (all | |
964 | types except 0) | |
965 | FLOW_OUT_PORT 4 egress port (types 0, 8) | |
966 | FLOW_POP_VLAN_TAG 1 strip outer VLAN tag (type 1 | |
967 | only) | |
968 | FLOW_VLAN_ID 2 (types 1, 5) | |
969 | FLOW_SRC_MAC 6 (types 1, 2, 5) | |
970 | FLOW_DST_MAC 6 (types 1, 2) | |
971 | ||
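The table above makes several top-level TLVs conditional on the group type. A minimal C sketch of that mapping follows. The enum and function names are hypothetical and are not taken from a Rocker header. Only the type numbers and the "types" annotations come from the table.

```c
#include <stdbool.h>
#include <stdint.h>

/* Group type values as listed for FLOW_GROUP_TYPE; names are illustrative. */
enum group_type {
    GROUP_L2_INTERFACE = 0,
    GROUP_L2_REWRITE   = 1,
    GROUP_L3_UNICAST   = 2,
    GROUP_L2_MCAST     = 3,
    GROUP_L2_FLOOD     = 4,
    GROUP_L3_INTERFACE = 5,
    GROUP_L3_MCAST     = 6,
    GROUP_L3_ECMP      = 7,
    GROUP_L2_OVERLAY   = 8,
};

/* Top-level FLOW_VLAN_ID applies to types 0, 3, 4, and 6. */
static bool group_uses_vlan_id(uint8_t type)
{
    return type == GROUP_L2_INTERFACE || type == GROUP_L2_MCAST ||
           type == GROUP_L2_FLOOD || type == GROUP_L3_MCAST;
}

/* FLOW_L2_PORT applies to type 0 only. */
static bool group_uses_l2_port(uint8_t type)
{
    return type == GROUP_L2_INTERFACE;
}

/* FLOW_INDEX applies to every defined type except 0. */
static bool group_uses_index(uint8_t type)
{
    return type >= GROUP_L2_REWRITE && type <= GROUP_L2_OVERLAY;
}

/* FLOW_OVERLAY_TYPE applies to type 8 only. */
static bool group_uses_overlay_type(uint8_t type)
{
    return type == GROUP_L2_OVERLAY;
}
```

A driver could consult these predicates while assembling an add or modify command so that it emits only the TLVs relevant to the group type being programmed.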
972 | TLVs for group delete and get stats commands are: | |
973 | ||
974 | field width description | |
975 | ----------------------------------------------------------- | |
976 | FLOW_GROUP_CMD 2 CMD_[DEL|GET_STATS] | |
977 | FLOW_GROUP_ID 2 Flow group ID | |
978 | ||
979 | On completion of get stats command, the descriptor buffer is written back with | |
980 | the following TLVs: | |
981 | ||
982 | field width description | |
983 | --------------------------------------------------- | |
984 | FLOW_GROUP_ID 2 Flow group ID | |
985 | FLOW_STAT_DURATION 4 Flow duration | |
986 | FLOW_STAT_REF_COUNT 4 Flow reference count | |
987 | FLOW_STAT_BUCKET_COUNT 4 Flow bucket count | |
988 | ||
989 | Possible status return codes in descriptor on completion are: | |
990 | ||
991 | DESC_COMP_ERR command reason | |
992 | -------------------------------------------------------------------- | |
993 | 0 all OK | |
994 | -ROCKER_EFAULT all head or tail index outside | |
995 | of ring | |
996 | -ROCKER_ENXIO all address or data read err on | |
997 | desc buf | |
998 | -ROCKER_EMSGSIZE GET_STATS cmd descriptor buffer wasn't | |
999 | big enough to contain write-back | |
1000 | TLVs | |
1001 | -ROCKER_EINVAL ADD|MOD invalid parameters passed in | |
1002 | -ROCKER_EEXIST ADD entry already exists | |
1003 | -ROCKER_ENOSPC ADD no space left in group table | |
1004 | -ROCKER_ENOENT MOD|DEL|GET_STATS group ID invalid | |
1005 | -ROCKER_EBUSY DEL group reference count non-zero | |
1006 | -ROCKER_ENODEV ADD next group ID doesn't exist | |
1007 | ||
1008 | ||
1009 | ||
1010 | References | |
1011 | ========== | |
1012 | ||
1013 | [1] OpenFlow Data Plane Abstraction (OF-DPA) Abstract Switch Specification, | |
1014 | Version 1.0, from Broadcom Corporation, February 21, 2014. |