Commit | Line | Data |
---|---|---|
bbc53c7e SF |
1 | Rocker Network Switch Register Programming Guide |
2 | Copyright (c) Scott Feldman <[email protected]> | |
3 | Copyright (c) Neil Horman <[email protected]> | |
4 | Version 0.11, 12/29/2014 | |
5 | ||
6 | LICENSE | |
7 | ======= | |
8 | ||
9 | This program is free software; you can redistribute it and/or modify | |
10 | it under the terms of the GNU General Public License as published by | |
11 | the Free Software Foundation; either version 2 of the License, or | |
12 | (at your option) any later version. | |
13 | ||
14 | This program is distributed in the hope that it will be useful, | |
15 | but WITHOUT ANY WARRANTY; without even the implied warranty of | |
16 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
17 | GNU General Public License for more details. | |
18 | ||
19 | SECTION 1: Introduction | |
20 | ======================= | |
21 | ||
22 | Overview | |
23 | -------- | |
24 | ||
25 | This document describes the hardware/software interface for the Rocker switch | |
26 | device. The intended audience is authors of OS drivers and device emulation | |
27 | software. | |
28 | ||
29 | Notations and Conventions | |
30 | ------------------------- | |
31 | ||
32 | o In register descriptions, [n:m] indicates a range from bit n to bit m, | |
33 | inclusive. | |
34 | o Use of leading 0x indicates a hexadecimal number. | |
35 | o Use of leading 0b indicates a binary number. | |
36 | o The use of RSVD or Reserved indicates that a bit or field is reserved for | |
37 | future use. | |
38 | o Field width is in bytes, unless otherwise noted. | |
39 | o Registers are (R) read-only, (R/W) read/write, (W) write-only, or (COR) clear | |
40 | on read. | |
41 | o TLV values in network-byte-order are designated with (N). | |
42 | ||
43 | ||
44 | SECTION 2: PCI Configuration Registers | |
45 | ====================================== | |
46 | ||
47 | PCI Configuration Space | |
48 | ----------------------- | |
49 | ||
50 | Each switch instance registers as a PCI device with PCI configuration space: | |
51 | ||
52 | offset width description value | |
53 | --------------------------------------------- | |
54 | 0x0 2 Vendor ID 0x1b36 | |
55 | 0x2 2 Device ID 0x0006 | |
56 | 0x4 4 Command/Status | |
57 | 0x8 1 Revision ID 0x01 | |
58 | 0x9 3 Class code 0x2800 | |
59 | 0xC 1 Cache line size | |
60 | 0xD 1 Latency timer | |
61 | 0xE 1 Header type | |
62 | 0xF 1 Built-in self test | |
63 | 0x10 4 Base address low | |
64 | 0x14 4 Base address high | |
65 | 0x18-28 Reserved | |
66 | 0x2C 2 Subsystem vendor ID * | |
67 | 0x2E 2 Subsystem ID * | |
68 | 0x30-38 Reserved | |
69 | 0x3C 1 Interrupt line | |
70 | 0x3D 1 Interrupt pin 0x00 | |
71 | 0x3E 1 Min grant 0x00 | |
72 | 0x3F 1 Max latency 0x00 | |
73 | 0x40 1 TRDY timeout | |
74 | 0x41 1 Retry count | |
75 | 0x42 2 Reserved | |
76 | ||
77 | ||
78 | * Assigned by sub-system implementation | |
79 | ||
80 | SECTION 3: Memory-Mapped Register Space | |
81 | ======================================= | |
82 | ||
83 | There are two memory-mapped BARs. BAR0 maps device register space and is | |
84 | 0x2000 in size. BAR1 maps MSI-X vector and PBA tables and is also 0x2000 in | |
85 | size, allowing for 256 MSI-X vectors. | |
86 | ||
87 | All registers are 4 or 8 bytes long. It is assumed host software will access 4 | |
88 | byte registers with one 4-byte access, and 8 byte registers with either two | |
89 | 4-byte accesses or a single 8-byte access. In the case of two 4-byte accesses, | |
90 | access must be lower and then upper 4-bytes, in that order. | |
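
As a rough illustration of the ordering rule above, the following Python sketch splits a 64-bit register value into two 4-byte writes, lower half first. The `write32` callback and the helper name are hypothetical stand-ins for whatever MMIO accessor the host environment provides; they are not defined by this specification.

```python
def write64_as_two_32(write32, offset, value):
    """Write a 64-bit register as two 32-bit accesses, lower half first."""
    if not 0 <= value < (1 << 64):
        raise ValueError("value does not fit in 64 bits")
    write32(offset, value & 0xFFFFFFFF)               # lower 4 bytes first
    write32(offset + 4, (value >> 32) & 0xFFFFFFFF)   # then upper 4 bytes


# Usage with a fake register file that only records the accesses.
log = []
write64_as_two_32(lambda off, val: log.append((off, val)),
                  0x0018, 0x1122334455667788)
```

Recording the accesses in a list makes the required order easy to check in a unit test before touching real hardware or an emulated device.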
91 | ||
92 | BAR0 device register space is organized as follows: | |
93 | ||
94 | offset description | |
95 | ------------------------------------------------------ | |
96 | 0x0000-0x000f Bogus registers to catch misbehaving | |
97 | drivers. Writes do nothing. Reads | |
98 | back as 0xDEADBABE. | |
99 | 0x0010-0x00ff Test registers | |
100 | 0x0300-0x03ff General purpose registers | |
101 | 0x1000-0x1fff Descriptor control | |
102 | ||
103 | Holes in register space are reserved. Writes to reserved registers do nothing. | |
104 | Reads to reserved registers read back as 0. | |
105 | ||
106 | No fancy stuff like write-combining is enabled on any of the registers. | |
107 | ||
108 | BAR1 MSI-X register space is organized as follows: | |
109 | ||
110 | offset description | |
111 | ------------------------------------------------------ | |
112 | 0x0000-0x0fff MSI-X vector table (256 vectors total) | |
113 | 0x1000-0x1fff MSI-X PBA table | |
114 | ||
115 | ||
116 | SECTION 4: Interrupts, DMA, and Endianness | |
117 | ========================================== | |
118 | ||
119 | PCI Interrupts | |
120 | -------------- | |
121 | ||
122 | The device supports only MSI-X interrupts. BAR1 memory-mapped region contains | |
123 | the MSI-X vector and PBA tables, with support for up to 256 MSI-X vectors. | |
124 | ||
125 | The vector assignment is: | |
126 | ||
127 | vector description | |
128 | ----------------------------------------------------- | |
129 | 0 Command descriptor ring completion | |
130 | 1 Event descriptor ring completion | |
131 | 2 Test operation completion | |
132 | 3 RSVD | |
133 | 4-255 Tx and Rx descriptor ring completion | |
134 | Tx vector is even | |
135 | Rx vector is odd | |
136 | ||
137 | An MSI-X vector table entry is 16 bytes: | |
138 | ||
139 | field offset width description | |
140 | ------------------------------------------------------------- | |
141 | lower_addr 0x0 4 [31:2] message address[31:2] | |
142 | [1:0] Rsvd (4 byte alignment | |
143 | required) | |
144 | upper_addr 0x4 4 [31:19] Rsvd | |
145 | [14:0] message address[46:32] | |
146 | data 0x8 4 message data[31:0] | |
147 | control 0xc 4 [31:1] Rsvd | |
148 | [0] mask (0 = enable, | |
149 | 1 = masked) | |
150 | ||
151 | Software should install the Interrupt Service Routine (ISR) before any ports | |
152 | are enabled or any commands are issued on the command ring. | |
153 | ||
154 | DMA Operations | |
155 | -------------- | |
156 | ||
157 | DMA operations are used for packet DMA to/from the CPU, command and event | |
158 | processing. Command processing includes statistical counters and table dumps, | |
159 | table insertion/deletion, and more. Event processing provides an async | |
160 | notification method for device-originating events. Each DMA operation has a | |
161 | set of control registers to manage a descriptor ring. The descriptor rings are | |
162 | allocated from contiguous host DMA-able memory and registers specify the ring's | |
163 | base address, size and current head and tail indices. Software always writes | |
164 | the head, and hardware always writes the tail. | |
165 | ||
166 | The high-order bit of DMA_DESC_COMP_ERR is used to mark hardware completion | |
167 | of a descriptor. Software will clear this bit when posting a descriptor to the | |
168 | ring, and hardware will set this bit when the descriptor is complete. | |
169 | ||
170 | Descriptor ring sizes must be a power of 2 and range from 2 to 64K entries. | |
171 | Descriptor rings' base address must be 8-byte aligned. Descriptors must be | |
172 | packed within ring. Each descriptor in each ring must also be aligned on an 8 | |
173 | byte boundary. Each descriptor ring will have these registers: | |
174 | ||
175 | DMA_DESC_xxx_BASE_ADDR, offset 0x1000 + (x * 32), 64-bit, (R/W) | |
176 | DMA_DESC_xxx_SIZE, offset 0x1008 + (x * 32), 32-bit, (R/W) | |
177 | DMA_DESC_xxx_HEAD, offset 0x100c + (x * 32), 32-bit, (R/W) | |
178 | DMA_DESC_xxx_TAIL, offset 0x1010 + (x * 32), 32-bit, (R) | |
179 | DMA_DESC_xxx_CTRL, offset 0x1014 + (x * 32), 32-bit, (W) | |
180 | DMA_DESC_xxx_CREDITS, offset 0x1018 + (x * 32), 32-bit, (R/W) | |
181 | DMA_DESC_xxx_RSVD1, offset 0x101c + (x * 32), 32-bit, (R/W) | |
182 | ||
183 | Where x is descriptor ring index: | |
184 | ||
185 | index ring | |
186 | -------------------- | |
187 | 0 CMD | |
188 | 1 EVENT | |
189 | 2 TX (port 0) | |
190 | 3 RX (port 0) | |
191 | 4 TX (port 1) | |
192 | 5 RX (port 1) | |
193 | . | |
194 | . | |
195 | . | |
196 | 124 TX (port 61) | |
197 | 125 RX (port 61) | |
198 | 126 Resv | |
199 | 127 Resv | |
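
The register offsets and the ring index table above can be combined into a small lookup. This is a minimal sketch; the function and dictionary names are made up for illustration, and `port` here is the zero-based port number used in the ring index table (0 through 61).

```python
# Offsets of the per-ring registers for ring index 0, from the list above.
RING_REGS = {
    "BASE_ADDR": 0x1000,
    "SIZE": 0x1008,
    "HEAD": 0x100C,
    "TAIL": 0x1010,
    "CTRL": 0x1014,
    "CREDITS": 0x1018,
}
RING_STRIDE = 32   # each ring's register block is 32 bytes
RING_PORTS = 62    # ring indices 2..125 cover ports 0..61


def tx_ring_index(port):
    """Ring index of the TX ring for a zero-based port number."""
    if not 0 <= port < RING_PORTS:
        raise ValueError("port out of range")
    return 2 + 2 * port


def rx_ring_index(port):
    """Ring index of the RX ring for a zero-based port number."""
    return tx_ring_index(port) + 1


def ring_reg_offset(name, ring):
    """BAR0 offset of register `name` for descriptor ring index `ring`."""
    if not 0 <= ring <= 127:
        raise ValueError("ring index out of range")
    return RING_REGS[name] + ring * RING_STRIDE
```

For example, `ring_reg_offset("HEAD", tx_ring_index(0))` gives the HEAD register of the first TX ring.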
200 | ||
201 | Writing BASE_ADDR or SIZE will reset HEAD and TAIL to zero. HEAD cannot be | |
202 | written past TAIL. To do so would wrap the ring. An empty ring is when HEAD | |
203 | == TAIL. A full ring is when HEAD is one position behind TAIL. Both HEAD and | |
204 | TAIL increment and modulo wrap at the ring size. | |
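
The ring size constraint and the empty/full conditions above can be written down directly. The helper names below are illustrative only and do not come from any driver.

```python
def valid_ring_size(size):
    """True if size is a power of 2 in the range 2..64K entries."""
    return 2 <= size <= 64 * 1024 and (size & (size - 1)) == 0


def ring_is_empty(head, tail):
    """An empty ring is when HEAD == TAIL."""
    return head == tail


def ring_is_full(head, tail, size):
    """A full ring is when HEAD is one position behind TAIL."""
    return (head + 1) % size == tail


def ring_advance(index, size):
    """Increment HEAD or TAIL with modulo wrap at the ring size."""
    return (index + 1) % size
```

Because one slot is always left unused to tell a full ring from an empty one, a ring of N entries can hold at most N - 1 outstanding descriptors.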
205 | ||
206 | CTRL register bits: | |
207 | ||
208 | bit name description | |
209 | ------------------------------------------------------------------------ | |
210 | [0] CTRL_RESET Reset the descriptor ring | |
211 | [31:1] Reserved | |
212 | ||
213 | All descriptor types share some common fields: | |
214 | ||
215 | field width description | |
216 | ------------------------------------------------------------------- | |
217 | DMA_DESC_BUF_ADDR 8 Phys addr of desc payload, 8-byte | |
218 | aligned | |
219 | DMA_DESC_COOKIE 8 Desc cookie for completion matching, | |
220 | upper-most bit is reserved | |
221 | DMA_DESC_BUF_SIZE 2 Desc payload size in bytes | |
222 | DMA_DESC_TLV_SIZE 2 Desc payload total size in bytes | |
223 | used for TLVs. Must be <= | |
224 | DMA_DESC_BUF_SIZE. | |
225 | DMA_DESC_COMP_ERR 2 Completion status of associated | |
226 | desc payload. High order bit is | |
227 | clear on new descs, toggled by | |
228 | hw for completed items. | |
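
A short sketch of how software might handle the completion bit and the size constraint from the table above. The names are hypothetical. Treating the low 15 bits of DMA_DESC_COMP_ERR as the status value is an assumption made for this example; this section only defines the meaning of the high-order bit.

```python
COMP_ERR_DONE = 1 << 15   # high-order bit of the 2-byte DMA_DESC_COMP_ERR


def prepare_for_post(comp_err):
    """Clear the completion bit before posting a descriptor to the ring."""
    return comp_err & ~COMP_ERR_DONE & 0xFFFF


def is_complete(comp_err):
    """True once hardware has marked the descriptor complete."""
    return bool(comp_err & COMP_ERR_DONE)


def status_bits(comp_err):
    """Remaining bits of the field (assumed here to carry the status)."""
    return comp_err & 0x7FFF


def sizes_ok(buf_size, tlv_size):
    """DMA_DESC_TLV_SIZE must be <= DMA_DESC_BUF_SIZE; both are 2 bytes."""
    return 0 <= tlv_size <= buf_size <= 0xFFFF
```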
229 | ||
230 | To support forward- and backward-compatibility, descriptor and completion | |
231 | payloads are specified in TLV format. Fields are packed with Type=field name, | |
232 | Length=field length, and Value=field value. Software will ignore unknown fields | |
233 | filled in by the switch. Likewise, the switch will ignore unknown fields | |
234 | filled in by software. | |
235 | ||
236 | Descriptor payload buffer is 8-byte aligned and TLVs are 8-byte aligned. The | |
237 | value within a TLV is also 8-byte aligned. The (packed, 8 byte) TLV header is: | |
238 | ||
239 | field width description | |
240 | ----------------------------- | |
241 | type 4 TLV type | |
242 | len 2 TLV value length | |
243 | pad 2 Reserved | |
244 | ||
245 | The alignment requirements for descriptors and TLVs are to avoid unaligned | |
246 | access exceptions in software. Note that the payload for each TLV is also | |
247 | 8 byte aligned. | |
248 | ||
249 | Figure 1 shows an example descriptor buffer with two TLVs. | |
250 | ||
251 | <------- 8 bytes -------> | |
252 | ||
253 | 8-byte +––––+ +–––––––––––+–––––+–––––+ +–+ | |
254 | align | type | len | pad | TLV#1 hdr | | |
255 | +–––––––––––+–––––+–––––+ (len=22) | | |
256 | | | | | |
257 | | value | TLV#1 value | |
258 | | | (padded to 8-byte | | |
259 | | +–––––+ alignment) | | |
260 | | |/////| | | |
261 | 8-byte +––––+ +–––––––––––+–––––––––––+ | | |
262 | align | type | len | pad | TLV#2 hdr DESC_BUF_SIZE | |
263 | +–––––+–––––+–––––+–––––+ (len=2) | | |
264 | |value|/////////////////| TLV#2 value | | |
265 | +–––––+/////////////////| | | |
266 | |///////////////////////| | | |
267 | |///////////////////////| | | |
268 | |///////////////////////| | | |
269 | |////////unused/////////| | | |
270 | |////////space//////////| | | |
271 | |///////////////////////| | | |
272 | |///////////////////////| | | |
273 | |///////////////////////| | | |
274 | +–––––––––––––––––––––––+ +–+ | |
275 | ||
276 | fig. 1 | |
277 | ||
278 | TLVs can be nested within the NEST TLV type. | |
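
The TLV layout in Figure 1 can be reproduced with a few lines of Python. This is a sketch under the rules stated above (8-byte header, little-endian type and length, value padded to 8-byte alignment); the type numbers 1 and 2 are placeholders, not real TLV types.

```python
import struct

# type (4 bytes), len (2 bytes), pad (2 bytes), all little-endian
TLV_HDR = struct.Struct("<IHH")


def align8(n):
    """Round n up to the next multiple of 8."""
    return (n + 7) & ~7


def pack_tlv(tlv_type, value):
    """Pack one TLV: header, value, then zero padding to 8-byte alignment."""
    hdr = TLV_HDR.pack(tlv_type, len(value), 0)
    padding = bytes(align8(len(value)) - len(value))
    return hdr + value + padding


# The two TLVs from Figure 1: value lengths 22 and 2.
buf = pack_tlv(1, bytes(22)) + pack_tlv(2, b"\x01\x02")
```

The resulting buffer is 48 bytes long: 8 + 24 for the first TLV and 8 + 8 for the second, so the second header starts at offset 32.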
279 | ||
280 | Interrupt credits | |
281 | ^^^^^^^^^^^^^^^^^ | |
282 | ||
283 | MSI-X vectors used for descriptor ring completions use a credit mechanism for | |
284 | efficient device, PCIe bus, OS and driver operations. Each descriptor ring has | |
285 | a credit count which represents the number of outstanding descriptors to be | |
286 | processed by the driver. As the device marks descriptors complete, the credit | |
287 | count is incremented. As the driver processes those outstanding descriptors, | |
288 | it returns credits back to the device. This way, the device knows the driver's | |
289 | progress and can decide whether and when to fire the next interrupt. | |
290 | When the credit count is zero, and the first descriptors are posted for the | |
291 | driver, a single interrupt is fired. Once the interrupt is fired, the | |
292 | interrupt is disabled (auto-masked*). In response to the interrupt, the driver | |
293 | will process descriptors and PIO write a returned credit value for that | |
294 | descriptor ring. If the driver returns all credits (the driver caught up with | |
295 | the device and there is no outstanding work), then the interrupt is unmasked, | |
296 | but not fired. If only partial credits are returned, the interrupt remains | |
297 | masked but the device generates an interrupt, signaling the driver that more | |
298 | outstanding work is available. | |
299 | ||
300 | (* this masking is unrelated to the MSI-X interrupt mask register) | |
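
The credit mechanism described above can be modeled as a small state machine. The class below is a toy model written for this explanation only; it is one reading of the paragraph, not a description of how any device or emulator implements it.

```python
class CreditModel:
    """Toy model of one descriptor ring's interrupt credit state."""

    def __init__(self):
        self.credits = 0       # completed descriptors not yet returned
        self.masked = False    # auto-mask state, not the MSI-X mask bit
        self.interrupts = 0    # number of interrupts fired so far

    def device_completes(self, count=1):
        """Device marks `count` descriptors complete."""
        if self.credits == 0 and not self.masked:
            self.interrupts += 1   # first work after idle: fire once
            self.masked = True     # then auto-mask
        self.credits += count

    def driver_returns(self, count):
        """Driver returns `count` credits after processing descriptors."""
        if not 0 < count <= self.credits:
            raise ValueError("cannot return more credits than outstanding")
        self.credits -= count
        if self.credits == 0:
            self.masked = False    # caught up: unmask, do not fire
        else:
            self.interrupts += 1   # work remains: stay masked, fire again
```

In this model, completing five descriptors while the driver is busy produces a single interrupt; a partial return of two credits produces a second one, and returning the remaining three unmasks the vector without firing.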
301 | ||
302 | Endianness | |
303 | ---------- | |
304 | ||
305 | Device registers are hard-coded to little-endian (LE). The driver should | |
306 | convert to/from host endianness to LE for device register accesses. | |
307 | ||
308 | Descriptors are LE. Descriptor buffer TLVs will have LE type and length | |
309 | fields, but the value field can either be LE or network-byte-order, depending | |
310 | on context. TLV values containing network packet data will be in network-byte | |
311 | order. A TLV value containing a field or mask used to compare against network | |
312 | packet data is network-byte order. For example, flow match fields (and masks) | |
313 | are network-byte-order since they're matched directly, byte-by-byte, against | |
314 | network packet data. All non-network-packet TLV multi-byte values will be LE. | |
315 | ||
316 | TLV values in network-byte-order are designated with (N). | |
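
To make the two byte orders concrete, the sketch below encodes one value of each kind using Python's `struct` module: a 4-byte physical port number, which is not packet data and is therefore LE, and a 2-byte VLAN ID, which is marked (N) in the flow tables. The sample values are arbitrary.

```python
import struct

pport_value = struct.pack("<I", 5)     # non-packet value: little-endian
vlan_value = struct.pack(">H", 100)    # (N) value: network byte order
```

The port number 5 is laid out as bytes 05 00 00 00, while VLAN ID 100 (0x0064) is laid out as bytes 00 64, the same order in which it appears in a packet.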
317 | ||
318 | ||
319 | SECTION 5: Test Registers | |
320 | ========================= | |
321 | ||
322 | Rocker has several test registers to support troubleshooting register access, | |
323 | interrupt generation, and DMA operations: | |
324 | ||
325 | TEST_REG, offset 0x0010, 32-bit (R/W) | |
326 | TEST_REG64, offset 0x0018, 64-bit (R/W) | |
327 | TEST_IRQ, offset 0x0020, 32-bit (R/W) | |
328 | TEST_DMA_ADDR, offset 0x0028, 64-bit (R/W) | |
329 | TEST_DMA_SIZE, offset 0x0030, 32-bit (R/W) | |
330 | TEST_DMA_CTRL, offset 0x0034, 32-bit (R/W) | |
331 | ||
332 | Reads to TEST_REG and TEST_REG64 will read a value equal to twice the last | |
333 | value written to the register. The 32-bit and 64-bit versions are for testing | |
334 | 32-bit and 64-bit host accesses. | |
335 | ||
336 | A vector can be written to TEST_IRQ and the device will generate an interrupt | |
337 | for that vector. | |
338 | ||
339 | To test basic DMA operations, allocate a DMA-able host buffer and put the | |
340 | buffer address into TEST_DMA_ADDR and size into TEST_DMA_SIZE. Then, write to | |
341 | TEST_DMA_CTRL to manipulate the buffer contents. TEST_DMA_CTRL operations are: | |
342 | ||
343 | operation value description | |
344 | ----------------------------------------------------------- | |
345 | TEST_DMA_CTRL_CLEAR 1 clear buffer | |
346 | TEST_DMA_CTRL_FILL 2 fill buffer bytes with 0x96 | |
347 | TEST_DMA_CTRL_INVERT 4 invert bytes in buffer | |
348 | ||
349 | Various buffer addresses and sizes should be tested to verify no address boundary | |
350 | issue exists. In particular, buffers that start on odd-8-byte boundary and/or | |
351 | span multiple PAGE sizes should be tested. | |
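
When writing a DMA self-test, it helps to have a reference model of what the buffer should contain after each operation. The sketch below is such a model. It assumes that "clear" means filling the buffer with zero bytes, which the table above does not state explicitly; the function name is made up for this example.

```python
TEST_DMA_CTRL_CLEAR = 1
TEST_DMA_CTRL_FILL = 2
TEST_DMA_CTRL_INVERT = 4


def expected_after(op, buf):
    """Return the expected buffer contents after a TEST_DMA_CTRL operation."""
    if op == TEST_DMA_CTRL_CLEAR:
        return bytes(len(buf))                 # assumed: all zero bytes
    if op == TEST_DMA_CTRL_FILL:
        return bytes([0x96]) * len(buf)        # every byte set to 0x96
    if op == TEST_DMA_CTRL_INVERT:
        return bytes(b ^ 0xFF for b in buf)    # bitwise invert each byte
    raise ValueError("unknown TEST_DMA_CTRL operation")
```

A test can run FILL followed by INVERT and compare the host buffer against a run of 0x69 bytes, the bitwise inverse of 0x96.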
352 | ||
353 | ||
354 | SECTION 6: Ports | |
355 | ================ | |
356 | ||
357 | Physical and Logical Ports | |
358 | -------------------------- | |
359 | ||
360 | The switch supports up to 62 physical (front-panel) ports. Register | |
361 | PORT_PHYS_COUNT returns the actual number of physical ports available: | |
362 | ||
363 | PORT_PHYS_COUNT, offset 0x0304, 32-bit, (R) | |
364 | ||
365 | In addition to front-panel ports, the switch supports logical ports for | |
366 | tunnels. | |
367 | ||
368 | Front-panel ports and logical tunnel ports are mapped into a single 32-bit port | |
369 | space. A special CPU port is assigned port 0. The front-panel ports are | |
370 | mapped to ports 1-62. A special loopback port is assigned port 63. Logical | |
371 | tunnel ports are assigned ports 0x00010000-0x0001ffff. | |
372 | To summarize the port assignments: | |
373 | ||
374 | port mapping | |
375 | ------------------------------------------------------- | |
376 | 0 CPU port (for packets to/from host CPU) | |
377 | 1-62 front-panel physical ports | |
378 | 63 loopback port | |
379 | 64-0x0000ffff RSVD | |
380 | 0x00010000-0x0001ffff logical tunnel ports | |
381 | 0x00020000-0xffffffff RSVD | |
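
The port assignment table above translates into a simple classifier. The function name and the returned strings are illustrative choices, not part of the specification.

```python
def classify_port(port):
    """Classify a 32-bit port number according to the port mapping table."""
    if not 0 <= port <= 0xFFFFFFFF:
        raise ValueError("port number is not a 32-bit value")
    if port == 0:
        return "cpu"
    if 1 <= port <= 62:
        return "physical"
    if port == 63:
        return "loopback"
    if 0x00010000 <= port <= 0x0001FFFF:
        return "tunnel"
    return "reserved"
```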
382 | ||
383 | Physical Port Mode | |
384 | ------------------ | |
385 | ||
386 | Switch front-panel ports operate in a mode. Currently, the only mode is | |
387 | OF-DPA. OF-DPA[1] mode is based on OpenFlow Data Plane Abstraction (OF-DPA) | |
388 | Abstract Switch Specification, Version 1.0, from Broadcom Corporation. To | |
389 | set/get the mode for front-panel ports, see port settings, below. | |
390 | ||
391 | Port Settings | |
392 | ------------- | |
393 | ||
394 | Link status for all front-panel ports is available via PORT_PHYS_LINK_STATUS: | |
395 | ||
396 | PORT_PHYS_LINK_STATUS, offset 0x0310, 64-bit, (R) | |
397 | ||
398 | Value is port bitmap. Bits 0 and 63 always read 0. Bits 1-62 | |
399 | read 1 for link UP and 0 for link DOWN for respective front-panel ports. | |
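
Decoding the link status bitmap is a matter of testing bits 1 through 62. A minimal sketch, with a hypothetical helper name:

```python
def ports_with_link_up(status):
    """Return the front-panel port numbers whose link bit is set."""
    return [port for port in range(1, 63) if (status >> port) & 1]
```

Bits 0 and 63 are skipped because they never correspond to a front-panel port.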
400 | ||
401 | Other properties for front-panel ports are available via DMA CMD descriptors: | |
402 | ||
403 | Get PORT_SETTINGS descriptor: | |
404 | ||
405 | field width description | |
406 | ---------------------------------------------- | |
407 | PORT_SETTINGS 2 CMD_GET | |
408 | PPORT 4 Physical port # | |
409 | ||
410 | Get PORT_SETTINGS completion: | |
411 | ||
412 | field width description | |
413 | ---------------------------------------------- | |
414 | PPORT 4 Physical port # | |
415 | SPEED 4 Current port interface speed, in Mbps | |
416 | DUPLEX 1 1 = Full, 0 = Half | |
417 | AUTONEG 1 1 = enabled, 0 = disabled | |
418 | MACADDR 6 Port MAC address | |
419 | MODE 1 0 = OF-DPA | |
420 | LEARNING 1 MAC address learning on port | |
421 | 1 = enabled | |
422 | 0 = disabled | |
423 | ||
424 | Set PORT_SETTINGS descriptor: | |
425 | ||
426 | field width description | |
427 | ---------------------------------------------- | |
428 | PORT_SETTINGS 2 CMD_SET | |
429 | PPORT 4 Physical port # | |
430 | SPEED 4 Port interface speed, in Mbps | |
431 | DUPLEX 1 1 = Full, 0 = Half | |
432 | AUTONEG 1 1 = enabled, 0 = disabled | |
433 | MACADDR 6 Port MAC address | |
434 | MODE 1 0 = OF-DPA | |
435 | ||
436 | Port Enable | |
437 | ----------- | |
438 | ||
439 | Front-panel ports are initially disabled, which means port ingress and egress | |
440 | packets will be dropped. To enable or disable a port, use PORT_PHYS_ENABLE: | |
441 | ||
442 | PORT_PHYS_ENABLE: offset 0x0318, 64-bit, (R/W) | |
443 | ||
444 | Value is bitmap of first 64 ports. Bits 0 and 63 are ignored | |
445 | and always read as 0. Write 1 to enable port; write 0 to disable it. | |
446 | Default is 0. | |
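
Since PORT_PHYS_ENABLE is written as a whole bitmap, software will typically build the full 64-bit value before writing it. The helper below is a sketch with a made-up name; whether a driver caches the value or reads the register back first is left to the implementation.

```python
def enable_mask(ports):
    """Build a PORT_PHYS_ENABLE bitmap with the given front-panel ports set."""
    mask = 0
    for port in ports:
        if not 1 <= port <= 62:
            raise ValueError("not a front-panel port")
        mask |= 1 << port
    return mask
```

To enable one more port without disturbing the others, OR its bit into the value previously read from (or last written to) the register.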
447 | ||
448 | ||
449 | SECTION 7: Switch Control | |
450 | ========================= | |
451 | ||
452 | This section covers switch-wide register settings. | |
453 | ||
454 | Control | |
455 | ------- | |
456 | ||
457 | This register is used for low level control of the switch. | |
458 | ||
459 | CONTROL: offset 0x0300, 32-bit, (W) | |
460 | ||
461 | bit name description | |
462 | ------------------------------------------------------------------------ | |
463 | [0] CONTROL_RESET If set, device will perform reset | |
464 | [31:1] Reserved | |
465 | ||
466 | Switch ID | |
467 | --------- | |
468 | ||
469 | The switch has a SWITCH_ID to be used by software to uniquely identify the | |
470 | switch: | |
471 | ||
472 | SWITCH_ID: offset 0x0320, 64-bit, (R) | |
473 | ||
474 | Value is opaque to switch software and no special encoding is implied. | |
475 | ||
476 | ||
477 | SECTION 8: Events | |
478 | ================= | |
479 | ||
480 | Non-I/O asynchronous events from the device are notified to the host using the | |
481 | event ring. The TLV structure for events is: | |
482 | ||
483 | field width description | |
484 | --------------------------------------------------- | |
485 | TYPE 4 Event type, one of: | |
486 | 1: LINK_CHANGED | |
487 | 2: MAC_VLAN_SEEN | |
488 | INFO <nest> Event info (details below) | |
489 | ||
490 | Link Changed Event | |
491 | ------------------ | |
492 | ||
493 | When link status changes on a physical port, this event is generated. | |
494 | ||
495 | field width description | |
496 | --------------------------------------------------- | |
497 | INFO <nest> | |
498 | PPORT 4 Physical port | |
499 | LINKUP 1 Link status: | |
500 | 0: down | |
501 | 1: up | |
502 | ||
503 | MAC VLAN Seen Event | |
504 | ------------------- | |
505 | ||
506 | When a packet ingresses on a port and the source MAC/VLAN isn't known to the | |
507 | device, the device will generate this event. In response to the event, the | |
508 | driver should install to the device the MAC/VLAN on the port into the bridge | |
509 | table. Once installed, the MAC/VLAN is known on the port and this event will | |
510 | no longer be generated. | |
511 | ||
512 | field width description | |
513 | --------------------------------------------------- | |
514 | INFO <nest> | |
515 | PPORT 4 Physical port | |
516 | MAC 6 MAC address | |
517 | VLAN 2 VLAN ID | |
518 | ||
519 | ||
520 | SECTION 9: CPU Packet Processing | |
521 | ================================ | |
522 | ||
523 | Ingress packets directed to the host CPU for further processing are delivered | |
524 | in the DMA RX ring. Likewise, host CPU originating packets destined to egress | |
525 | on switch ports are scheduled by software using the DMA TX ring. | |
526 | ||
527 | Tx Packet Processing | |
528 | -------------------- | |
529 | ||
530 | Software schedules packets for egress on switch ports using the DMA TX ring. A | |
531 | TX descriptor buffer describes the packet location and size in host DMA-able | |
532 | memory, the destination port, and any hardware-offload functions (such as L3 | |
533 | payload checksum offload). Software then bumps the descriptor head to signal | |
534 | hardware of new Tx work. In response, hardware will DMA read Tx descriptors up | |
535 | to head, DMA read descriptor buffer and packet data, perform offloading | |
536 | functions, and finally frame packet on wire (network). Once packet processing | |
537 | is complete, hardware will writeback status to descriptor(s) to signal to | |
538 | software that Tx is complete and software resources (e.g. skb) backing packet | |
539 | can be released. | |
540 | ||
541 | Figure 2 shows an example 3-fragment packet queued with one Tx descriptor. A | |
542 | TLV is used for each packet fragment. | |
543 | ||
544 | pkt frag 1 | |
545 | +–––––––+ +–+ | |
546 | +–––+ | | | |
547 | desc buf | | | | | |
548 | +––––––––+ | | | | | |
549 | Tx ring +–––+ +–––––+ | | | | |
550 | +–––––––––+ | | TLVs | +–––––––+ | | |
551 | | +–––+ +––––––––+ pkt frag 2 | | |
552 | | desc 0 | | +–––––+ +–––––––+ | | |
553 | +–––––––––+ | TLVs | +–––+ | | | |
554 | head+–+ | +––––––––+ | | | | |
555 | | desc 1 | | +–––––+ +–––––––+ |pkt | |
556 | +–––––––––+ | TLVs | | | | |
557 | | | +––––––––+ | pkt frag 3 | | |
558 | | | | +–––––––+ | | |
559 | +–––––––––+ +–––+ | | | |
560 | | | | | | | |
561 | | | | | | | |
562 | +–––––––––+ | | | | |
563 | | | | | | | |
564 | | | | | | | |
565 | +–––––––––+ | | | | |
566 | | | +–––––––+ +–+ | |
567 | | | | |
568 | +–––––––––+ | |
569 | ||
570 | fig 2. | |
571 | ||
572 | The TLVs for Tx descriptor buffer are: | |
573 | ||
574 | field width description | |
575 | --------------------------------------------------------------------- | |
576 | PPORT 4 Destination physical port # | |
577 | TX_OFFLOAD 1 Hardware offload modes: | |
578 | 0: no offload | |
579 | 1: insert IP csum (ipv4 only) | |
580 | 2: insert TCP/UDP csum | |
581 | 3: L3 csum calc and insert | |
582 | into csum offset (TX_L3_CSUM_OFF) | |
583 | 16-bit 1's complement csum value. | |
584 | IPv4 pseudo-header and IP | |
585 | already calculated by OS | |
586 | and inserted. | |
587 | 4: TSO (TCP Segmentation Offload) | |
588 | TX_L3_CSUM_OFF 2 For L3 csum offload mode, the offset, | |
589 | from the beginning of the packet, | |
590 | of the csum field in the L3 header | |
591 | TX_TSO_MSS 2 For TSO offload mode, the | |
592 | Maximum Segment Size in bytes | |
593 | TX_TSO_HDR_LEN 2 For TSO offload mode, the | |
594 | length of ethernet, IP, and | |
595 | TCP/UDP headers, including IP | |
596 | and TCP options. | |
597 | TX_FRAGS <array> Packet fragments | |
598 | TX_FRAG <nest> Packet fragment | |
599 | TX_FRAG_ADDR 8 DMA address of packet fragment | |
600 | TX_FRAG_LEN 2 Packet fragment length | |
601 | ||
602 | Possible status return codes in descriptor on completion are: | |
603 | ||
604 | DESC_COMP_ERR reason | |
605 | -------------------------------------------------------------------- | |
606 | 0 OK | |
607 | -ROCKER_ENXIO address or data read err on desc buf or packet | |
608 | fragment | |
609 | -ROCKER_EINVAL bad pport or TSO or csum offloading error | |
610 | -ROCKER_ENOMEM no memory for internal staging tx fragment | |
611 | ||
612 | Rx Packet Processing | |
613 | -------------------- | |
614 | ||
615 | Packets ingressing on switch ports that are not forwarded by the switch but | |
616 | rather directed to the host CPU for further processing are delivered in the DMA | |
617 | RX ring. Rx descriptor buffers are allocated by software and placed on the | |
618 | ring. Hardware will fill Rx descriptor buffers with packet data, write the | |
619 | completion, and signal to software that a new packet is ready. Since Rx packet | |
620 | size is not known a-priori, the Rx descriptor buffer must be allocated for | |
621 | worst-case packet size. A single Rx descriptor will contain the entire Rx | |
622 | packet data in one RX_FRAG. Other Rx TLVs describe any hardware offloads | |
623 | performed on the packet, such as checksum validation. | |
624 | ||
625 | The TLVs for Rx descriptor buffer are: | |
626 | ||
627 | field width description | |
628 | --------------------------------------------------- | |
629 | PPORT 4 Source physical port # | |
630 | RX_FLAGS 2 Packet parsing flags: | |
631 | (1 << 0): IPv4 packet | |
632 | (1 << 1): IPv6 packet | |
633 | (1 << 2): csum calculated | |
634 | (1 << 3): IPv4 csum good | |
635 | (1 << 4): IP fragment | |
636 | (1 << 5): TCP packet | |
637 | (1 << 6): UDP packet | |
638 | (1 << 7): TCP/UDP csum good | |
639 | RX_CSUM 2 IP calculated checksum: | |
640 | IPv4: IP payload csum | |
641 | IPv6: header and payload csum | |
642 | (Only valid if RX_FLAGS:csum calc is set) | |
643 | RX_FRAG_ADDR 8 DMA address of packet fragment | |
644 | RX_FRAG_MAX_LEN 2 Packet maximum fragment length | |
645 | RX_FRAG_LEN 2 Actual packet fragment length after receive | |
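
The RX_FLAGS bits listed above can be decoded with a lookup table. The dictionary keys and function names below are chosen for readability and are not defined by this specification.

```python
RX_FLAGS = {
    "ipv4": 1 << 0,
    "ipv6": 1 << 1,
    "csum_calc": 1 << 2,
    "ipv4_csum_good": 1 << 3,
    "ip_frag": 1 << 4,
    "tcp": 1 << 5,
    "udp": 1 << 6,
    "tcp_udp_csum_good": 1 << 7,
}


def decode_rx_flags(flags):
    """Return the names of all RX_FLAGS bits set in `flags`."""
    return [name for name, bit in RX_FLAGS.items() if flags & bit]


def rx_csum_valid(flags):
    """RX_CSUM is only meaningful when the csum-calculated flag is set."""
    return bool(flags & RX_FLAGS["csum_calc"])
```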
646 | ||
647 | Possible status return codes in descriptor on completion are: | |
648 | ||
649 | DESC_COMP_ERR reason | |
650 | -------------------------------------------------------------------- | |
651 | 0 OK | |
652 | -ROCKER_ENXIO address or data read err on desc buf | |
653 | -ROCKER_ENOMEM no memory for internal staging desc buf | |
654 | -ROCKER_EMSGSIZE Rx descriptor buffer wasn't big enough to contain | |
655 | packet data TLV and other TLVs. | |
656 | ||
657 | ||
658 | SECTION 10: OF-DPA Mode | |
659 | ======================= | |
660 | ||
661 | OF-DPA mode allows the switch to offload flow packet processing functions to | |
662 | hardware. An OpenFlow controller would communicate with an OpenFlow agent | |
663 | installed on the switch. The OpenFlow agent would (directly or indirectly) | |
664 | communicate with the Rocker switch driver, which in turn would program switch | |
665 | hardware with flow functionality, as defined in OF-DPA. The block diagram is: | |
666 | ||
667 | +–––––––––––––––----–––+ | |
668 | | OF | | |
669 | | Remote Controller | | |
670 | +––––––––+––----–––––––+ | |
671 | | | |
672 | | | |
673 | +––––––––+–––––––––+ | |
674 | | OF | | |
675 | | Local Agent | | |
676 | +––––––––––––––––––+ | |
677 | | | | |
678 | | Rocker Driver | | |
679 | +––––––––––––––––––+ | |
680 | <this spec> | |
681 | +––––––––––––––––––+ | |
682 | | | | |
683 | | Rocker Switch | | |
684 | +––––––––––––––––––+ | |
685 | ||
686 | To participate in flow functions, ports must be configured for OF-DPA mode | |
687 | during switch initialization. | |
688 | ||
689 | OF-DPA Flow Table Interface | |
690 | --------------------------- | |
691 | ||
692 | There are commands to add, modify, delete, and get stats of flow table entries. | |
693 | The commands are issued using the DMA CMD descriptor ring. The following | |
694 | commands are defined: | |
695 | ||
696 | CMD_ADD: add an entry to flow table | |
697 | CMD_MOD: modify an entry in flow table | |
698 | CMD_DEL: delete an entry from flow table | |
699 | CMD_GET_STATS: get stats for flow entry | |
700 | ||
701 | TLVs for add and modify commands are: | |
702 | ||
703 | field width description | |
704 | ---------------------------------------------------- | |
705 | OF_DPA_CMD 2 CMD_[ADD|MOD] | |
706 | OF_DPA_TBL 2 Flow table ID | |
707 | 0: ingress port | |
708 | 10: vlan | |
709 | 20: termination mac | |
710 | 30: unicast routing | |
711 | 40: multicast routing | |
712 | 50: bridging | |
713 | 60: ACL policy | |
714 | OF_DPA_PRIORITY 4 Flow priority | |
715 | OF_DPA_HARDTIME 4 Hard timeout for flow | |
716 | OF_DPA_IDLETIME 4 Idle timeout for flow | |
717 | OF_DPA_COOKIE 8 Cookie | |
718 | ||
719 | Additional TLVs based on flow table ID: | |
720 | ||
721 | Table ID 0: ingress port | |
722 | ||
723 | field width description | |
724 | ---------------------------------------------------- | |
725 | OF_DPA_IN_PPORT 4 ingress physical port number | |
726 | OF_DPA_GOTO_TBL 2 goto table ID; zero to drop | |
727 | ||
728 | Table ID 10: vlan | |
729 | ||
730 | field width description | |
731 | ---------------------------------------------------- | |
732 | OF_DPA_IN_PPORT 4 ingress physical port number | |
733 | OF_DPA_VLAN_ID 2 (N) vlan ID | |
734 | OF_DPA_VLAN_ID_MASK 2 (N) vlan ID mask | |
735 | OF_DPA_GOTO_TBL 2 goto table ID; zero to drop | |
736 | OF_DPA_NEW_VLAN_ID 2 (N) new vlan ID | |
737 | ||
738 | Table ID 20: termination mac | |
739 | ||
740 | field width description | |
741 | ---------------------------------------------------- | |
742 | OF_DPA_IN_PPORT 4 ingress physical port number | |
743 | OF_DPA_IN_PPORT_MASK 4 ingress physical port number mask | |
744 | OF_DPA_ETHERTYPE 2 (N) must be either 0x0800 or 0x86dd | |
745 | OF_DPA_DST_MAC 6 (N) destination MAC | |
746 | OF_DPA_DST_MAC_MASK 6 (N) destination MAC mask | |
747 | OF_DPA_VLAN_ID 2 (N) vlan ID | |
748 | OF_DPA_VLAN_ID_MASK 2 (N) vlan ID mask | |
749 | OF_DPA_GOTO_TBL 2 only acceptable values are | |
750 | unicast or multicast routing | |
751 | table IDs | |
752 | OF_DPA_OUT_PPORT 2 if specified, must be | |
753 | controller, set zero otherwise | |
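
Software can check the termination MAC constraints before posting a command. The sketch below collects the flow table IDs listed earlier and encodes the two restrictions from the table above; the names are hypothetical.

```python
OF_DPA_TABLE_IDS = {
    "ingress_port": 0,
    "vlan": 10,
    "termination_mac": 20,
    "unicast_routing": 30,
    "multicast_routing": 40,
    "bridging": 50,
    "acl_policy": 60,
}


def term_mac_goto_ok(goto_tbl):
    """Termination MAC entries may only go to a routing table."""
    return goto_tbl in (OF_DPA_TABLE_IDS["unicast_routing"],
                        OF_DPA_TABLE_IDS["multicast_routing"])


def routing_ethertype_ok(ethertype):
    """Ethertype must be IPv4 (0x0800) or IPv6 (0x86dd)."""
    return ethertype in (0x0800, 0x86DD)
```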
754 | ||
755 | Table ID 30: unicast routing | |
756 | ||
757 | field width description | |
758 | ---------------------------------------------------- | |
759 | OF_DPA_ETHERTYPE 2 (N) must be either 0x0800 or 0x86dd | |
760 | OF_DPA_DST_IP 4 (N) destination IPv4 address. | |
761 | Must be unicast address | |
762 | OF_DPA_DST_IP_MASK 4 (N) IP mask. Must be prefix mask | |
763 | OF_DPA_DST_IPV6 16 (N) destination IPv6 address. | |
764 | Must be unicast address | |
765 | OF_DPA_DST_IPV6_MASK 16 (N) IPv6 mask. Must be prefix mask | |
766 | OF_DPA_GOTO_TBL 2 goto table ID; zero to drop | |
767 | OF_DPA_GROUP_ID 4 data for GROUP action must | |
768 | be an L3 Unicast group entry | |
769 | ||
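The unicast routing table requires OF_DPA_DST_IP_MASK and OF_DPA_DST_IPV6_MASK to be prefix masks, that is, a run of 1 bits followed only by 0 bits. A driver could check this before issuing the command. The sketch below is one way to do it; the function name is hypothetical, and it treats an all-zero mask as a valid /0 prefix, which is an assumption.

```python
def is_prefix_mask(mask, bits=32):
    """Return True if mask is contiguous 1 bits followed only by 0 bits.

    mask is an integer in host representation; use bits=128 for IPv6.
    """
    full = (1 << bits) - 1
    if not 0 <= mask <= full:
        return False
    # Inverting a prefix mask gives 0...01...1, which is one less than a
    # power of two, so (inverted & (inverted + 1)) is zero exactly then.
    inverted = ~mask & full
    return inverted & (inverted + 1) == 0
```

For example, is_prefix_mask(0xFFFFFF00) accepts a /24 mask, while is_prefix_mask(0xFF00FF00) rejects a mask with a hole in it.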
770 | Table ID 40: multicast routing | |
771 | ||
772 | field width description | |
773 | ---------------------------------------------------- | |
774 | OF_DPA_ETHERTYPE 2 (N) must be either 0x0800 or 0x86dd | |
775 | OF_DPA_VLAN_ID 2 (N) vlan ID | |
776 | OF_DPA_SRC_IP 4 (N) source IPv4. Optional, | |
777 | can contain IPv4 address, | |
778 | must be completely masked | |
779 | if not used | |
780 | OF_DPA_SRC_IP_MASK 4 (N) IP Mask | |
781 | OF_DPA_DST_IP 4 (N) destination IPv4 address. | |
782 | Must be multicast address | |
783 | OF_DPA_SRC_IPV6 16 (N) source IPv6 Address. Optional. | |
784 | Can contain IPv6 address, | |
785 | must be completely masked | |
786 | if not used | |
787 | OF_DPA_SRC_IPV6_MASK 16 (N) IPv6 mask. | |
788 | OF_DPA_DST_IPV6 16 (N) destination IPv6 Address. Must | |
789 | be multicast address | |
791 | OF_DPA_GOTO_TBL 2 goto table ID; zero to drop | |
792 | OF_DPA_GROUP_ID 4 data for GROUP action must | |
793 | be an L3 multicast group entry | |
794 | ||
795 | Table ID 50: bridging | |
796 | ||
797 | field width description | |
798 | ---------------------------------------------------- | |
799 | OF_DPA_VLAN_ID 2 (N) vlan ID | |
800 | OF_DPA_TUNNEL_ID 4 tunnel ID | |
801 | OF_DPA_DST_MAC 6 (N) destination MAC | |
802 | OF_DPA_DST_MAC_MASK 6 (N) destination MAC mask | |
803 | OF_DPA_GOTO_TBL 2 goto table ID; zero to drop | |
804 | OF_DPA_GROUP_ID 4 data for GROUP action must | |
805 | be an L2 Interface, L2 | |
806 | Multicast, L2 Flood, | |
807 | or L2 Overlay group entry | |
808 | as appropriate | |
809 | OF_DPA_TUNNEL_LPORT 4 unicast Tenant Bridging | |
810 | flows specify a tunnel | |
811 | logical port ID | |
812 | OF_DPA_OUT_PPORT 2 data for OUTPUT action, | |
813 | restricted to CONTROLLER, | |
814 | set to 0 otherwise | |
815 | ||
816 | Table ID 60: acl policy | |
817 | ||
818 | field width description | |
819 | ---------------------------------------------------- | |
820 | OF_DPA_IN_PPORT 4 ingress physical port number | |
821 | OF_DPA_IN_PPORT_MASK 4 ingress physical port number mask | |
822 | OF_DPA_ETHERTYPE 2 (N) ethertype | |
823 | OF_DPA_VLAN_ID 2 (N) vlan ID | |
824 | OF_DPA_VLAN_ID_MASK 2 (N) vlan ID mask | |
825 | OF_DPA_VLAN_PCP 2 (N) vlan Priority Code Point | |
826 | OF_DPA_VLAN_PCP_MASK 2 (N) vlan Priority Code Point mask | |
827 | OF_DPA_SRC_MAC 6 (N) source MAC | |
828 | OF_DPA_SRC_MAC_MASK 6 (N) source MAC mask | |
829 | OF_DPA_DST_MAC 6 (N) destination MAC | |
830 | OF_DPA_DST_MAC_MASK 6 (N) destination MAC mask | |
831 | OF_DPA_TUNNEL_ID 4 tunnel ID | |
832 | OF_DPA_SRC_IP 4 (N) source IPv4. Optional, | |
833 | can contain IPv4 address, | |
834 | must be completely masked | |
835 | if not used | |
836 | OF_DPA_SRC_IP_MASK 4 (N) IP Mask | |
837 | OF_DPA_DST_IP 4 (N) destination IPv4 address. | |
838 | Must be multicast address | |
839 | OF_DPA_DST_IP_MASK 4 (N) IP Mask | |
840 | OF_DPA_SRC_IPV6 16 (N) source IPv6 Address. Optional. | |
841 | Can contain IPv6 address, | |
842 | must be completely masked | |
843 | if not used | |
844 | OF_DPA_SRC_IPV6_MASK 16 (N) IPv6 mask | |
845 | OF_DPA_DST_IPV6 16 (N) destination IPv6 Address. Must | |
846 | be multicast address. | |
847 | OF_DPA_DST_IPV6_MASK 16 (N) IPv6 mask | |
848 | OF_DPA_SRC_ARP_IP 4 (N) source IPv4 address in the ARP | |
849 | payload. Only used if ethertype | |
850 | == 0x0806. | |
851 | OF_DPA_SRC_ARP_IP_MASK 4 (N) IP Mask | |
852 | OF_DPA_IP_PROTO 1 IP protocol | |
853 | OF_DPA_IP_PROTO_MASK 1 IP protocol mask | |
854 | OF_DPA_IP_DSCP 1 DSCP | |
855 | OF_DPA_IP_DSCP_MASK 1 DSCP mask | |
856 | OF_DPA_IP_ECN 1 ECN | |
857 | OF_DPA_IP_ECN_MASK 1 ECN mask | |
858 | OF_DPA_L4_SRC_PORT 2 (N) L4 source port, only for | |
859 | TCP, UDP, or SCTP | |
860 | OF_DPA_L4_SRC_PORT_MASK 2 (N) L4 source port mask | |
861 | OF_DPA_L4_DST_PORT 2 (N) L4 destination port, only for | |
862 | TCP, UDP, or SCTP | |
863 | OF_DPA_L4_DST_PORT_MASK 2 (N) L4 destination port mask | |
864 | OF_DPA_ICMP_TYPE 1 ICMP type, only if IP | |
865 | protocol is 1 | |
866 | OF_DPA_ICMP_TYPE_MASK 1 ICMP type mask | |
867 | OF_DPA_ICMP_CODE 1 ICMP code | |
868 | OF_DPA_ICMP_CODE_MASK 1 ICMP code mask | |
869 | OF_DPA_IPV6_LABEL 4 (N) IPv6 flow label | |
870 | OF_DPA_IPV6_LABEL_MASK 4 (N) IPv6 flow label mask | |
871 | OF_DPA_GROUP_ID 4 data for GROUP action | |
872 | OF_DPA_QUEUE_ID_ACTION 1 write the queue ID | |
873 | OF_DPA_NEW_QUEUE_ID 1 queue ID | |
874 | OF_DPA_VLAN_PCP_ACTION 1 write the VLAN priority | |
875 | OF_DPA_NEW_VLAN_PCP 1 VLAN priority | |
876 | OF_DPA_IP_DSCP_ACTION 1 write the DSCP | |
877 | OF_DPA_NEW_IP_DSCP 1 new DSCP | |
878 | OF_DPA_TUNNEL_LPORT 4 restrict to valid tunnel | |
879 | logical port, set to 0 | |
880 | otherwise. | |
881 | OF_DPA_OUT_PPORT 2 data for OUTPUT action, | |
882 | restricted to CONTROLLER, | |
883 | set to 0 otherwise | |
884 | OF_DPA_CLEAR_ACTIONS 4 if 1, packets matching flow are | |
885 | dropped (all other instructions | |
886 | ignored) | |
887 | ||
888 | TLVs for flow delete and get stats commands are: | |
889 | ||
890 | field width description | |
891 | --------------------------------------------------- | |
892 | OF_DPA_CMD 2 CMD_[DEL|GET_STATS] | |
893 | OF_DPA_COOKIE 8 Cookie | |
894 | ||
895 | On completion of get stats command, the descriptor buffer is written back with | |
896 | the following TLVs: | |
897 | ||
898 | field width description | |
899 | --------------------------------------------------- | |
900 | OF_DPA_STAT_DURATION 4 Flow duration | |
901 | OF_DPA_STAT_RX_PKTS 8 Received packets | |
902 | OF_DPA_STAT_TX_PKTS 8 Transmit packets | |
903 | ||
904 | Possible status return codes in descriptor on completion are: | |
905 | ||
906 | DESC_COMP_ERR command reason | |
907 | -------------------------------------------------------------------- | |
908 | 0 all OK | |
909 | -ROCKER_EFAULT all head or tail index outside | |
910 | of ring | |
911 | -ROCKER_ENXIO all address or data read err on | |
912 | desc buf | |
913 | -ROCKER_EMSGSIZE GET_STATS cmd descriptor buffer wasn't | |
914 | big enough to contain write-back | |
915 | TLVs | |
916 | -ROCKER_EINVAL all invalid parameters passed in | |
917 | -ROCKER_EEXIST ADD entry already exists | |
918 | -ROCKER_ENOSPC ADD no space left in flow table | |
919 | -ROCKER_ENOENT MOD|DEL|GET_STATS cookie invalid | |
920 | ||
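The status table above pairs each error with the commands that can return it. As a sketch, a driver or test harness could encode that pairing and flag a completion status that the table does not list for the command just issued. The names FLOW_STATUS_COMMANDS and is_documented_status are hypothetical, and the numeric values of the ROCKER_E* codes are deliberately left out because they are not given in this section.

```python
ALL_COMMANDS = frozenset({"ADD", "MOD", "DEL", "GET_STATS"})

# Error name -> commands that may return it, copied from the status table.
FLOW_STATUS_COMMANDS = {
    "ROCKER_EFAULT": ALL_COMMANDS,
    "ROCKER_ENXIO": ALL_COMMANDS,
    "ROCKER_EMSGSIZE": frozenset({"GET_STATS"}),
    "ROCKER_EINVAL": ALL_COMMANDS,
    "ROCKER_EEXIST": frozenset({"ADD"}),
    "ROCKER_ENOSPC": frozenset({"ADD"}),
    "ROCKER_ENOENT": frozenset({"MOD", "DEL", "GET_STATS"}),
}


def is_documented_status(command, error_name):
    """Return True if the table lists error_name as possible for command."""
    return command in FLOW_STATUS_COMMANDS.get(error_name, frozenset())
```

A completion such as ROCKER_EEXIST on a DEL command would fail this check, which suggests a driver or emulation bug rather than a normal error path.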
921 | Group Table Interface | |
922 | --------------------- | |
923 | ||
924 | There are commands to add, modify, delete, and get stats of group table | |
925 | entries. The commands are issued using the DMA CMD descriptor ring. The | |
926 | following commands are defined: | |
927 | ||
928 | CMD_ADD: add an entry to group table | |
929 | CMD_MOD: modify an entry in group table | |
930 | CMD_DEL: delete an entry from group table | |
931 | CMD_GET_STATS: get stats for group entry | |
932 | ||
933 | TLVs for add and modify commands are: | |
934 | ||
935 | field width description | |
936 | ----------------------------------------------------------- | |
937 | FLOW_GROUP_CMD 2 CMD_[ADD|MOD] | |
938 | FLOW_GROUP_ID 2 Flow group ID | |
939 | FLOW_GROUP_TYPE 1 Group type: | |
940 | 0: L2 interface | |
941 | 1: L2 rewrite | |
942 | 2: L3 unicast | |
943 | 3: L2 multicast | |
944 | 4: L2 flood | |
945 | 5: L3 interface | |
946 | 6: L3 multicast | |
947 | 7: L3 ECMP | |
948 | 8: L2 overlay | |
949 | FLOW_VLAN_ID 2 Vlan ID (types 0, 3, 4, 6) | |
950 | FLOW_L2_PORT 2 Port (type 0) | |
951 | FLOW_INDEX 4 Index (all types but 0) | |
952 | FLOW_OVERLAY_TYPE 1 Overlay sub-type (type 8): | |
953 | 0: Flood unicast tunnel | |
954 | 1: Flood multicast tunnel | |
955 | 2: Multicast unicast tunnel | |
956 | 3: Multicast multicast tunnel | |
957 | FLOW_GROUP_ACTION nest | |
958 | FLOW_GROUP_ID 2 next group ID in chain (all | |
959 | types except 0) | |
960 | FLOW_OUT_PORT 4 egress port (types 0, 8) | |
961 | FLOW_POP_VLAN_TAG 1 strip outer VLAN tag (type 1 | |
962 | only) | |
963 | FLOW_VLAN_ID 2 (types 1, 5) | |
964 | FLOW_SRC_MAC 6 (types 1, 2, 5) | |
965 | FLOW_DST_MAC 6 (types 1, 2) | |
966 | ||
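Several of the top-level group TLVs above apply only to certain group types. The sketch below encodes the "(types ...)" annotations from the list and returns the type-specific TLVs for a given FLOW_GROUP_TYPE. The function and table names are hypothetical, and the sketch covers only the top-level TLVs, not those nested under FLOW_GROUP_ACTION.

```python
# Group type numbers and names from the FLOW_GROUP_TYPE description.
GROUP_TYPES = {
    0: "L2 interface",
    1: "L2 rewrite",
    2: "L3 unicast",
    3: "L2 multicast",
    4: "L2 flood",
    5: "L3 interface",
    6: "L3 multicast",
    7: "L3 ECMP",
    8: "L2 overlay",
}

# Top-level TLV -> group types it applies to, in the order listed above.
TOP_LEVEL_TLVS = [
    ("FLOW_VLAN_ID", {0, 3, 4, 6}),
    ("FLOW_L2_PORT", {0}),
    ("FLOW_INDEX", set(GROUP_TYPES) - {0}),   # "all types but 0"
    ("FLOW_OVERLAY_TYPE", {8}),
]


def type_specific_tlvs(group_type):
    """Return the names of the top-level TLVs that apply to group_type."""
    if group_type not in GROUP_TYPES:
        raise ValueError("unknown group type: %r" % (group_type,))
    return [name for name, types in TOP_LEVEL_TLVS if group_type in types]
```

For an L2 interface group (type 0) this yields FLOW_VLAN_ID and FLOW_L2_PORT; for an L2 overlay group (type 8) it yields FLOW_INDEX and FLOW_OVERLAY_TYPE.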
967 | TLVs for group delete and get stats commands are: | |
968 | ||
969 | field width description | |
970 | ----------------------------------------------------------- | |
971 | FLOW_GROUP_CMD 2 CMD_[DEL|GET_STATS] | |
972 | FLOW_GROUP_ID 2 Flow group ID | |
973 | ||
974 | On completion of get stats command, the descriptor buffer is written back with | |
975 | the following TLVs: | |
976 | ||
977 | field width description | |
978 | --------------------------------------------------- | |
979 | FLOW_GROUP_ID 2 Flow group ID | |
980 | FLOW_STAT_DURATION 4 Flow duration | |
981 | FLOW_STAT_REF_COUNT 4 Flow reference count | |
982 | FLOW_STAT_BUCKET_COUNT 4 Flow bucket count | |
983 | ||
984 | Possible status return codes in descriptor on completion are: | |
985 | ||
986 | DESC_COMP_ERR command reason | |
987 | -------------------------------------------------------------------- | |
988 | 0 all OK | |
989 | -ROCKER_EFAULT all head or tail index outside | |
990 | of ring | |
991 | -ROCKER_ENXIO all address or data read err on | |
992 | desc buf | |
993 | -ROCKER_ENOSPC GET_STATS cmd descriptor buffer wasn't | |
994 | big enough to contain write-back | |
995 | TLVs | |
996 | -ROCKER_EINVAL ADD|MOD invalid parameters passed in | |
997 | -ROCKER_EEXIST ADD entry already exists | |
998 | -ROCKER_ENOSPC ADD no space left in group table | |
999 | -ROCKER_ENOENT MOD|DEL|GET_STATS group ID invalid | |
1000 | -ROCKER_EBUSY DEL group reference count non-zero | |
1001 | -ROCKER_ENODEV ADD next group ID doesn't exist | |
1002 | ||
1003 | ||
1004 | ||
1005 | References | |
1006 | ========== | |
1007 | ||
1008 | [1] OpenFlow Data Plane Abstraction (OF-DPA) Abstract Switch Specification, | |
1009 | Version 1.0, from Broadcom Corporation, February 21, 2014. |