Rocker Network Switch Register Programming Guide
Copyright (c) Scott Feldman
Copyright (c) Neil Horman
Version 0.11, 12/29/2014

LICENSE
=======

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

SECTION 1: Introduction
=======================

Overview
--------

This document describes the hardware/software interface for the Rocker switch
device. The intended audience is authors of OS drivers and device emulation
software.

Notations and Conventions
-------------------------

o In register descriptions, [n:m] indicates a range from bit n to bit m,
  inclusive.
o Use of leading 0x indicates a hexadecimal number.
o Use of leading 0b indicates a binary number.
o The use of RSVD or Reserved indicates that a bit or field is reserved for
  future use.
o Field width is in bytes, unless otherwise noted.
o Registers are (R) read-only, (R/W) read/write, (W) write-only, or (COR)
  clear on read.
o TLV values in network-byte-order are designated with (N).


SECTION 2: PCI Configuration Registers
======================================

PCI Configuration Space
-----------------------

Each switch instance registers as a PCI device with PCI configuration space:

        offset    width   description             value
        ---------------------------------------------------
        0x0       2       Vendor ID               0x1b36
        0x2       2       Device ID               0x0006
        0x4       4       Command/Status
        0x8       1       Revision ID             0x01
        0x9       3       Class code              0x2800
        0xC       1       Cache line size
        0xD       1       Latency timer
        0xE       1       Header type
        0xF       1       Built-in self test
        0x10      4       Base address low
        0x14      4       Base address high
        0x18-0x28         Reserved
        0x2C      2       Subsystem vendor ID     *
        0x2E      2       Subsystem ID            *
        0x30-0x38         Reserved
        0x3C      1       Interrupt line
        0x3D      1       Interrupt pin           0x00
        0x3E      1       Min grant               0x00
        0x3F      1       Max latency             0x00
        0x40      1       TRDY timeout
        0x41      1       Retry count
        0x42      2       Reserved


* Assigned by sub-system implementation

SECTION 3: Memory-Mapped Register Space
=======================================

There are two memory-mapped BARs. BAR0 maps device register space and is
0x2000 in size. BAR1 maps the MSI-X vector and PBA tables and is also 0x2000
in size, allowing for 256 MSI-X vectors.

All registers are 4 or 8 bytes long. Host software is assumed to access
4-byte registers with one 4-byte access, and 8-byte registers with either two
4-byte accesses or a single 8-byte access. When two 4-byte accesses are used,
the lower 4 bytes must be accessed first, then the upper 4 bytes.
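The ordering rule for split 8-byte accesses can be sketched in C. In this sketch the `bar0` array and the `mmio_write32`/`mmio_read32` helpers are hypothetical stand-ins for a real MMIO mapping, so it runs without hardware; a real driver would use its platform's own accessors, and the LE conversion required for device registers is omitted on the assumption of a little-endian host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BAR0_SIZE 0x2000u

/* Hypothetical stand-in for the BAR0 mapping so the sketch runs without hardware */
static uint32_t bar0[BAR0_SIZE / 4];

/* Offsets of 32-bit writes in issue order, to make the access sequence visible */
static uint32_t write_log[8];
static size_t write_count;

static void mmio_write32(uint32_t offset, uint32_t val)
{
    assert(offset % 4 == 0 && offset < BAR0_SIZE);
    bar0[offset / 4] = val;
    if (write_count < 8)
        write_log[write_count++] = offset;
}

static uint32_t mmio_read32(uint32_t offset)
{
    assert(offset % 4 == 0 && offset < BAR0_SIZE);
    return bar0[offset / 4];
}

/* 8-byte register written as two 4-byte accesses: lower 4 bytes first, then upper */
static void rocker_write64(uint32_t offset, uint64_t val)
{
    mmio_write32(offset, (uint32_t)val);
    mmio_write32(offset + 4, (uint32_t)(val >> 32));
}

/* 8-byte register read as two 4-byte accesses, in the same order */
static uint64_t rocker_read64(uint32_t offset)
{
    uint64_t lo = mmio_read32(offset);
    uint64_t hi = mmio_read32(offset + 4);
    return lo | (hi << 32);
}
```

For example, `rocker_write64(0x1000, addr)` would program the 64-bit BASE_ADDR register of descriptor ring 0 (the CMD ring, see Section 4) low word first.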

BAR0 device register space is organized as follows:

        offset          description
        ------------------------------------------------------
        0x0000-0x000f   Bogus registers to catch misbehaving
                        drivers. Writes do nothing. Reads
                        back as 0xDEADBABE.
        0x0010-0x00ff   Test registers
        0x0300-0x03ff   General purpose registers
        0x1000-0x1fff   Descriptor control

Holes in register space are reserved. Writes to reserved registers do nothing.
Reads from reserved registers return 0.

Write-combining and similar optimizations are not enabled on any register.

BAR1 MSI-X register space is organized as follows:

        offset          description
        ------------------------------------------------------
        0x0000-0x0fff   MSI-X vector table (256 vectors total)
        0x1000-0x1fff   MSI-X PBA table


SECTION 4: Interrupts, DMA, and Endianness
==========================================

PCI Interrupts
--------------

The device supports only MSI-X interrupts. The BAR1 memory-mapped region
contains the MSI-X vector and PBA tables, with support for up to 256 MSI-X
vectors.

The vector assignment is:

        vector          description
        -----------------------------------------------------
        0               Command descriptor ring completion
        1               Event descriptor ring completion
        2               Test operation completion
        3               RSVD
        4-255           Tx and Rx descriptor ring completion
                          Tx vector is even
                          Rx vector is odd
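The even/odd rule can be turned into per-port vector numbers. This is a minimal sketch that assumes ports are indexed from 0, matching the TX (port 0) ... RX (port 61) numbering of the descriptor ring table later in this section; the enum and helper names are hypothetical.

```c
#include <assert.h>

/* Fixed MSI-X vectors from the table above */
enum {
    ROCKER_VEC_CMD       = 0,
    ROCKER_VEC_EVENT     = 1,
    ROCKER_VEC_TEST      = 2,
    ROCKER_VEC_RSVD      = 3,
    ROCKER_VEC_PORT_BASE = 4, /* first Tx/Rx vector */
};

#define ROCKER_MAX_PORTS 62

/* port_index is zero-based (assumption): index 0 is the first front-panel port */
static unsigned rocker_vec_tx(unsigned port_index)
{
    assert(port_index < ROCKER_MAX_PORTS);
    return ROCKER_VEC_PORT_BASE + 2 * port_index;     /* even */
}

static unsigned rocker_vec_rx(unsigned port_index)
{
    assert(port_index < ROCKER_MAX_PORTS);
    return ROCKER_VEC_PORT_BASE + 2 * port_index + 1; /* odd */
}

/* Number of vectors to request so that every port's Rx vector is covered */
static unsigned rocker_vec_count(unsigned nports)
{
    assert(nports >= 1 && nports <= ROCKER_MAX_PORTS);
    return rocker_vec_rx(nports - 1) + 1;
}
```

With the maximum of 62 ports this uses vectors 0-127, well within the 256 supported by BAR1.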

An MSI-X vector table entry is 16 bytes:

        field           offset  width   description
        -------------------------------------------------------------
        lower_addr      0x0     4       [31:2] message address[31:2]
                                        [1:0] Rsvd (4 byte alignment
                                              required)
        upper_addr      0x4     4       [31:15] Rsvd
                                        [14:0] message address[46:32]
        data            0x8     4       message data[31:0]
        control         0xc     4       [31:1] Rsvd
                                        [0] mask (0 = enable,
                                                  1 = masked)

Software should install the Interrupt Service Routine (ISR) before any ports
are enabled or any commands are issued on the command ring.

DMA Operations
--------------

DMA operations are used for packet DMA to/from the CPU, and for command and
event processing. Command processing includes statistical counters and table
dumps, table insertion/deletion, and more. Event processing provides an
asynchronous notification method for device-originated events. Each DMA
operation has a set of control registers to manage a descriptor ring. The
descriptor rings are allocated from contiguous host DMA-able memory, and
registers specify each ring's base address, size, and current head and tail
indices. Software always writes the head, and hardware always writes the
tail.

The high-order bit of DMA_DESC_COMP_ERR is used to mark hardware completion
of a descriptor. Software clears this bit when posting a descriptor to the
ring, and hardware sets this bit when the descriptor is complete.

Descriptor ring sizes must be a power of 2 and range from 2 to 64K entries.
A descriptor ring's base address must be 8-byte aligned. Descriptors must be
packed within the ring, and each descriptor in each ring must also be aligned
on an 8-byte boundary. Each descriptor ring has these registers:

        DMA_DESC_xxx_BASE_ADDR, offset 0x1000 + (x * 32), 64-bit, (R/W)
        DMA_DESC_xxx_SIZE,      offset 0x1008 + (x * 32), 32-bit, (R/W)
        DMA_DESC_xxx_HEAD,      offset 0x100c + (x * 32), 32-bit, (R/W)
        DMA_DESC_xxx_TAIL,      offset 0x1010 + (x * 32), 32-bit, (R)
        DMA_DESC_xxx_CTRL,      offset 0x1014 + (x * 32), 32-bit, (W)
        DMA_DESC_xxx_CREDITS,   offset 0x1018 + (x * 32), 32-bit, (R/W)
        DMA_DESC_xxx_RSVD1,     offset 0x101c + (x * 32), 32-bit, (R/W)

Where x is the descriptor ring index:

        index   ring
        --------------------
        0       CMD
        1       EVENT
        2       TX (port 0)
        3       RX (port 0)
        4       TX (port 1)
        5       RX (port 1)
        .
        .
        .
        124     TX (port 61)
        125     RX (port 61)
        126     Resv
        127     Resv
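Putting the register list and the ring index table together, a driver can compute the BAR0 offset of any ring register. A sketch with hypothetical names, numbering ports from 0 as the table does:

```c
#include <assert.h>
#include <stdint.h>

#define DMA_DESC_REGS_START 0x1000u
#define DMA_DESC_STRIDE     32u   /* bytes of registers per ring */
#define DMA_DESC_RING_COUNT 128u

/* Register offsets within one ring's 32-byte register block */
enum dma_desc_reg {
    DMA_DESC_BASE_ADDR = 0x00, /* 64-bit */
    DMA_DESC_SIZE      = 0x08,
    DMA_DESC_HEAD      = 0x0c,
    DMA_DESC_TAIL      = 0x10,
    DMA_DESC_CTRL      = 0x14,
    DMA_DESC_CREDITS   = 0x18,
    DMA_DESC_RSVD1     = 0x1c,
};

enum { RING_CMD = 0, RING_EVENT = 1 };

/* Ring index of a port's Tx/Rx ring; port is numbered from 0 as in the table */
static unsigned ring_tx(unsigned port) { assert(port <= 61); return 2 + 2 * port; }
static unsigned ring_rx(unsigned port) { assert(port <= 61); return 3 + 2 * port; }

static uint32_t dma_desc_reg_offset(unsigned ring, enum dma_desc_reg reg)
{
    assert(ring < DMA_DESC_RING_COUNT);
    return DMA_DESC_REGS_START + ring * DMA_DESC_STRIDE + (uint32_t)reg;
}
```

The 128 rings of 32 bytes each exactly fill the 0x1000-0x1fff descriptor control range of BAR0.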

Writing BASE_ADDR or SIZE resets HEAD and TAIL to zero. HEAD cannot be
written past TAIL, since doing so would wrap the ring. The ring is empty when
HEAD == TAIL, and full when HEAD is one position behind TAIL. Both HEAD and
TAIL increment and wrap modulo the ring size.
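The index rules above can be modeled on the software side as follows. The `ring_idx` struct and helpers are hypothetical, and the sketch takes "HEAD one position behind TAIL" to mean (HEAD + 1) mod size == TAIL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Software-side view of one ring's indices (hypothetical struct) */
struct ring_idx {
    uint32_t size; /* entries: power of 2, 2..65536 */
    uint32_t head; /* advanced by software */
    uint32_t tail; /* advanced by hardware */
};

static bool ring_size_valid(uint32_t size)
{
    return size >= 2 && size <= 65536 && (size & (size - 1)) == 0;
}

/* Increment with modulo wrap; masking works because size is a power of 2 */
static uint32_t ring_next(const struct ring_idx *r, uint32_t i)
{
    return (i + 1) & (r->size - 1);
}

static bool ring_empty(const struct ring_idx *r)
{
    return r->head == r->tail;
}

/* Full when HEAD is one position behind TAIL */
static bool ring_full(const struct ring_idx *r)
{
    return ring_next(r, r->head) == r->tail;
}

/* Post one descriptor by advancing HEAD; refuse rather than write HEAD past TAIL */
static bool ring_post(struct ring_idx *r)
{
    assert(ring_size_valid(r->size));
    if (ring_full(r))
        return false;
    r->head = ring_next(r, r->head);
    return true;
}
```

Under these rules one slot always stays empty, so a ring of N entries holds at most N - 1 outstanding descriptors.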

CTRL register bits:

        bit     name            description
        ------------------------------------------------------------------------
        [0]     CTRL_RESET      Reset the descriptor ring
        [1:31]  Reserved

All descriptor types share some common fields:

        field                   width   description
        -------------------------------------------------------------------
        DMA_DESC_BUF_ADDR       8       Phys addr of desc payload, 8-byte
                                        aligned
        DMA_DESC_COOKIE         8       Desc cookie for completion matching,
                                        upper-most bit is reserved
        DMA_DESC_BUF_SIZE       2       Desc payload size in bytes
        DMA_DESC_TLV_SIZE       2       Desc payload total size in bytes
                                        used for TLVs. Must be <=
                                        DMA_DESC_BUF_SIZE.
        DMA_DESC_COMP_ERR       2       Completion status of associated
                                        desc payload. High order bit is
                                        clear on new descs, toggled by
                                        hw for completed items.
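The table gives field widths but not byte offsets. The sketch below assumes the fields appear in table order, with reserved padding that makes the descriptor 32 bytes and puts DMA_DESC_COMP_ERR in the last two bytes; that layout, the `DESC_COMP_ERR_GEN` name, and the reading that the low 15 bits of COMP_ERR carry the magnitude of the -ROCKER_Exxx code are assumptions to check against the device model. It also assumes a little-endian host, so the LE fields need no conversion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed 32-byte layout: fields in table order, reserved words before COMP_ERR */
struct rocker_desc {
    uint64_t buf_addr;  /* DMA_DESC_BUF_ADDR, 8-byte aligned */
    uint64_t cookie;    /* DMA_DESC_COOKIE, upper-most bit reserved */
    uint16_t buf_size;  /* DMA_DESC_BUF_SIZE */
    uint16_t tlv_size;  /* DMA_DESC_TLV_SIZE, must be <= buf_size */
    uint16_t rsvd[5];   /* assumed padding */
    uint16_t comp_err;  /* DMA_DESC_COMP_ERR */
};

_Static_assert(sizeof(struct rocker_desc) == 32, "assumed 32-byte descriptor");

#define DESC_COMP_ERR_GEN 0x8000u /* high-order bit: set by hardware on completion */

/* Fill a descriptor before posting it; COMP_ERR, including its high bit, starts clear */
static void desc_prepare(struct rocker_desc *d, uint64_t buf_addr,
                         uint16_t buf_size, uint16_t tlv_size, uint64_t cookie)
{
    assert((buf_addr & 7) == 0);
    assert(tlv_size <= buf_size);
    assert((cookie >> 63) == 0); /* upper-most cookie bit is reserved */
    memset(d, 0, sizeof(*d));
    d->buf_addr = buf_addr;
    d->cookie = cookie;
    d->buf_size = buf_size;
    d->tlv_size = tlv_size;
}

static bool desc_is_complete(const struct rocker_desc *d)
{
    return (d->comp_err & DESC_COMP_ERR_GEN) != 0;
}

/* Assumption: low 15 bits hold the magnitude of the -ROCKER_Exxx code; 0 means OK */
static int desc_err(const struct rocker_desc *d)
{
    return -(int)(d->comp_err & 0x7fffu);
}
```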

To support forward and backward compatibility, descriptor and completion
payloads are specified in TLV format. Fields are packed with Type=field name,
Length=field length, and Value=field value. Software will ignore unknown
fields filled in by the switch. Likewise, the switch will ignore unknown
fields filled in by software.

The descriptor payload buffer is 8-byte aligned, and TLVs are 8-byte aligned.
The value within a TLV is also 8-byte aligned. The (packed, 8-byte) TLV
header is:

        field   width   description
        -----------------------------
        type    4       TLV type
        len     2       TLV value length
        pad     2       Reserved

The alignment requirements for descriptors and TLVs exist to avoid unaligned
access exceptions in software.
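A minimal sketch of appending one TLV under the alignment rules above. It follows the header table literally and stores the value length in len; because TLV formats differ on this point (netlink-style formats, for example, count the header in the length), confirm the convention against the device model in use. The function and macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TLV_ALIGNTO 8u
#define TLV_ALIGN(n) (((n) + TLV_ALIGNTO - 1) & ~(size_t)(TLV_ALIGNTO - 1))
#define TLV_HDR_LEN 8u /* type (4) + len (2) + pad (2) */

/* Append a TLV at buf[*pos] and advance *pos to the next 8-byte boundary.
 * Returns 0 on success, -1 if the TLV does not fit in cap bytes.
 * len is written as the value length, as the header table describes. */
static int tlv_put(uint8_t *buf, size_t cap, size_t *pos,
                   uint32_t type, const void *val, uint16_t len)
{
    size_t total = TLV_HDR_LEN + TLV_ALIGN((size_t)len);
    uint8_t *p;

    assert(*pos % TLV_ALIGNTO == 0);
    if (*pos > cap || cap - *pos < total)
        return -1;

    p = buf + *pos;
    /* type and len are little-endian */
    p[0] = (uint8_t)type;
    p[1] = (uint8_t)(type >> 8);
    p[2] = (uint8_t)(type >> 16);
    p[3] = (uint8_t)(type >> 24);
    p[4] = (uint8_t)len;
    p[5] = (uint8_t)(len >> 8);
    p[6] = 0; /* pad */
    p[7] = 0;
    if (len)
        memcpy(p + TLV_HDR_LEN, val, len);
    /* zero the value padding up to the next 8-byte boundary */
    memset(p + TLV_HDR_LEN + len, 0, total - TLV_HDR_LEN - len);
    *pos += total;
    return 0;
}
```

Two calls with 22-byte and 2-byte values, as in Figure 1, place the TLVs at offsets 0 and 32; the final position (48) is then the value to use for DMA_DESC_TLV_SIZE.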

Figure 1 shows an example descriptor buffer with two TLVs.

                     <------- 8 bytes ------->

        8-byte  +--> +-----------+-----+-----+            +-+
        align        |   type    | len | pad |  TLV#1 hdr   |
                     +-----------+-----+-----+  (len=22)    |
                     |                       |              |
                     |         value         |  TLV#1 value |
                     |                       |  (padded to  |
                     |                 +-----+  8-byte      |
                     |                 |/////|  alignment)  |
        8-byte  +--> +-----------+-----+-----+              |
        align        |   type    | len | pad |  TLV#2 hdr   DESC_BUF_SIZE
                     +-----+-----+-----+-----+  (len=2)     |
                     |value|/////////////////|  TLV#2 value |
                     +-----+/////////////////|              |
                     |///////////////////////|              |
                     |///////////////////////|              |
                     |///////////////////////|              |
                     |////////unused/////////|              |
                     |////////space//////////|              |
                     |///////////////////////|              |
                     |///////////////////////|              |
                     |///////////////////////|              |
                     +-----------------------+            +-+

                                 fig. 1

TLVs can be nested within the NEST TLV type.

Interrupt credits
^^^^^^^^^^^^^^^^^

MSI-X vectors used for descriptor ring completions use a credit mechanism for
efficient device, PCIe bus, OS, and driver operation. Each descriptor ring has
a credit count, which represents the number of outstanding descriptors to be
processed by the driver. As the device marks descriptors complete, the credit
count is incremented. As the driver processes those outstanding descriptors,
it returns credits back to the device. This way, the device knows the driver's
progress and can decide whether, and when, to fire the next interrupt. When
the credit count is zero and the first descriptors are posted for the driver,
a single interrupt is fired. Once the interrupt is fired, the interrupt is
disabled (auto-masked*). In response to the interrupt, the driver processes
descriptors and, with a PIO write, returns a credit value for that descriptor
ring. If the driver returns all credits (the driver has caught up with the
device and there is no outstanding work), the interrupt is unmasked but not
fired. If only some of the credits are returned, the interrupt remains masked,
but the device generates an interrupt to signal the driver that more
outstanding work is available.

(* this masking is unrelated to the MSI-X interrupt mask register)
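The credit rules can be captured in a small behavioral model, which can help when writing an emulator or checking driver logic. This is a sketch of the behavior described above, not device code: the struct and function names are hypothetical, and returning more credits than are outstanding is treated here as a driver bug (an assumption; this document does not say how the device reacts).

```c
#include <assert.h>
#include <stdint.h>

/* Behavioral model of one ring's interrupt credits (hypothetical) */
struct credit_model {
    uint32_t outstanding; /* completed descriptors not yet credited back */
    unsigned irqs_fired;  /* interrupts the device has generated */
};

/* Device marks n descriptors complete */
static void model_complete(struct credit_model *m, uint32_t n)
{
    if (n == 0)
        return;
    if (m->outstanding == 0)
        m->irqs_fired++; /* first work after idle: fire once, then auto-mask */
    m->outstanding += n;
}

/* Driver PIO-writes n to the ring's DMA_DESC_xxx_CREDITS register */
static void model_return_credits(struct credit_model *m, uint32_t n)
{
    assert(n <= m->outstanding); /* over-return treated as a driver bug here */
    m->outstanding -= n;
    if (m->outstanding > 0)
        m->irqs_fired++; /* partial return: signal that work remains */
    /* full return: interrupt is unmasked but not fired */
}
```

In this model the auto-masked state is simply "outstanding > 0": while credits are outstanding, new completions never fire an interrupt on their own.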

Endianness
----------

Device registers are hard-coded to little-endian (LE). The driver should
convert between host endianness and LE for device register accesses.

Descriptors are LE. Descriptor buffer TLVs have LE type and length fields,
but the value field can be either LE or network-byte-order, depending on
context. TLV values containing network packet data are in network-byte-order.
A TLV value containing a field or mask used to compare against network packet
data is also network-byte-order. For example, flow match fields (and masks)
are network-byte-order since they are matched directly, byte-by-byte, against
network packet data. All other multi-byte TLV values are LE.

TLV values in network-byte-order are designated with (N).
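To illustrate the two byte orders, the sketch below stores a PPORT value (LE) and an (N) field such as OF_DPA_VLAN_ID using explicit byte shifts, which produces the same bytes on any host. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Little-endian store: ordinary multi-byte TLV values such as PPORT */
static void put_le32(uint8_t *p, size_t avail, uint32_t v)
{
    assert(avail >= 4);
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Network-byte-order (big-endian) store: values marked (N), e.g. OF_DPA_VLAN_ID */
static void put_be16(uint8_t *p, size_t avail, uint16_t v)
{
    assert(avail >= 2);
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)v;
}

static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static uint16_t get_be16(const uint8_t *p)
{
    return (uint16_t)(p[0] << 8 | p[1]);
}
```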


SECTION 5: Test Registers
=========================

Rocker has several test registers to support troubleshooting register access,
interrupt generation, and DMA operations:

        TEST_REG,       offset 0x0010, 32-bit (R/W)
        TEST_REG64,     offset 0x0018, 64-bit (R/W)
        TEST_IRQ,       offset 0x0020, 32-bit (R/W)
        TEST_DMA_ADDR,  offset 0x0028, 64-bit (R/W)
        TEST_DMA_SIZE,  offset 0x0030, 32-bit (R/W)
        TEST_DMA_CTRL,  offset 0x0034, 32-bit (R/W)

Reads from TEST_REG and TEST_REG64 return a value equal to twice the last
value written to the register. The 32-bit and 64-bit versions are for testing
32-bit and 64-bit host accesses.

A vector can be written to TEST_IRQ, and the device will generate an interrupt
for that vector.

To test basic DMA operations, allocate a DMA-able host buffer, and write the
buffer address to TEST_DMA_ADDR and its size to TEST_DMA_SIZE. Then write to
TEST_DMA_CTRL to manipulate the buffer contents. TEST_DMA_CTRL operations are:

        operation               value   description
        -----------------------------------------------------------
        TEST_DMA_CTRL_CLEAR     1       clear buffer
        TEST_DMA_CTRL_FILL      2       fill buffer bytes with 0x96
        TEST_DMA_CTRL_INVERT    4       invert bytes in buffer

Various buffer addresses and sizes should be tested to verify that no address
boundary issue exists. In particular, buffers that start on an odd 8-byte
boundary and/or span multiple pages should be tested.
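A DMA test is easiest to verify against a host-side reference copy. The sketch below applies each documented TEST_DMA_CTRL operation to such a copy and lists a few placement cases of the kind recommended above; the helper names and the 4 KiB page size assumed in the cases are illustrative choices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    TEST_DMA_CTRL_CLEAR  = 1,
    TEST_DMA_CTRL_FILL   = 2,
    TEST_DMA_CTRL_INVERT = 4,
};

/* Example (offset, length) placements inside a page-aligned allocation, assuming
 * 4 KiB pages: starts on odd multiples of 8, and one buffer crossing a page */
static const struct { size_t off, len; } test_dma_cases[] = {
    { 8, 64 }, { 24, 100 }, { 4096 - 8, 16 },
};

/* Apply the documented TEST_DMA_CTRL operation to a host-side reference copy */
static void test_dma_expect(uint8_t *ref, size_t len, uint32_t op)
{
    switch (op) {
    case TEST_DMA_CTRL_CLEAR:
        memset(ref, 0, len);
        break;
    case TEST_DMA_CTRL_FILL:
        memset(ref, 0x96, len);
        break;
    case TEST_DMA_CTRL_INVERT:
        for (size_t i = 0; i < len; i++)
            ref[i] = (uint8_t)~ref[i];
        break;
    default:
        assert(!"unknown TEST_DMA_CTRL operation");
    }
}

/* Nonzero if the buffer the device modified matches the reference copy */
static int test_dma_matches(const uint8_t *dma_buf, const uint8_t *ref, size_t len)
{
    return memcmp(dma_buf, ref, len) == 0;
}
```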


SECTION 6: Ports
================

Physical and Logical Ports
--------------------------

The switch supports up to 62 physical (front-panel) ports. Register
PORT_PHYS_COUNT returns the actual number of physical ports available:

        PORT_PHYS_COUNT, offset 0x0304, 32-bit, (R)

In addition to front-panel ports, the switch supports logical ports for
tunnels.

Front-panel ports and logical tunnel ports are mapped into a single 32-bit
port space. A special CPU port is assigned port 0. The front-panel ports are
mapped to ports 1-62. A special loopback port is assigned port 63. Logical
tunnel ports are assigned ports 0x00010000-0x0001ffff. To summarize the port
assignments:

        port                    mapping
        -------------------------------------------------------
        0                       CPU port (for packets to/from host CPU)
        1-62                    front-panel physical ports
        63                      loopback port
        64-0x0000ffff           RSVD
        0x00010000-0x0001ffff   logical tunnel ports
        0x00020000-0xffffffff   RSVD
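The port map translates directly into a classifier. A sketch with hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

enum rocker_port_kind {
    ROCKER_PORT_CPU,
    ROCKER_PORT_PHYS,
    ROCKER_PORT_LOOPBACK,
    ROCKER_PORT_TUNNEL,
    ROCKER_PORT_RSVD,
};

/* Classify a 32-bit port number according to the mapping table */
static enum rocker_port_kind rocker_classify_port(uint32_t port)
{
    if (port == 0)
        return ROCKER_PORT_CPU;
    if (port <= 62)
        return ROCKER_PORT_PHYS;
    if (port == 63)
        return ROCKER_PORT_LOOPBACK;
    if (port >= 0x00010000u && port <= 0x0001ffffu)
        return ROCKER_PORT_TUNNEL;
    return ROCKER_PORT_RSVD;
}

/* Hypothetical helper: port number of the n-th logical tunnel port */
static uint32_t rocker_tunnel_port(uint32_t n)
{
    assert(n <= 0xffffu);
    return 0x00010000u + n;
}
```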

Physical Port Mode
------------------

Switch front-panel ports operate in a mode. Currently, the only mode is
OF-DPA. OF-DPA[1] mode is based on the OpenFlow Data Plane Abstraction
(OF-DPA) Abstract Switch Specification, Version 1.0, from Broadcom
Corporation. To set/get the mode for front-panel ports, see Port Settings,
below.

Port Settings
-------------

Link status for all front-panel ports is available via PORT_PHYS_LINK_STATUS:

        PORT_PHYS_LINK_STATUS, offset 0x0310, 64-bit, (R)

        Value is a port bitmap. Bits 0 and 63 always read 0. Bits 1-62
        read 1 for link UP and 0 for link DOWN for the respective
        front-panel ports.
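Reading link state comes down to testing one bit of the bitmap. A sketch, assuming the 64-bit value has already been read from PORT_PHYS_LINK_STATUS and converted to host order; the function names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* status: 64-bit value read from PORT_PHYS_LINK_STATUS (offset 0x0310) */
static bool rocker_link_up(uint64_t status, unsigned port)
{
    assert(port >= 1 && port <= 62); /* bits 0 and 63 always read 0 */
    return ((status >> port) & 1u) != 0;
}

/* nports: value read from PORT_PHYS_COUNT */
static unsigned rocker_links_up(uint64_t status, unsigned nports)
{
    unsigned n = 0;

    assert(nports <= 62);
    for (unsigned p = 1; p <= nports; p++)
        n += rocker_link_up(status, p);
    return n;
}
```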

Other properties for front-panel ports are available via DMA CMD descriptors:

        Get PORT_SETTINGS descriptor:

                field           width   description
                ----------------------------------------------
                PORT_SETTINGS   2       CMD_GET
                PPORT           4       Physical port #

        Get PORT_SETTINGS completion:

                field           width   description
                ----------------------------------------------
                PPORT           4       Physical port #
                SPEED           4       Current port interface speed, in Mbps
                DUPLEX          1       1 = Full, 0 = Half
                AUTONEG         1       1 = enabled, 0 = disabled
                MACADDR         6       Port MAC address
                MODE            1       0 = OF-DPA
                LEARNING        1       MAC address learning on port
                                          1 = enabled
                                          0 = disabled

        Set PORT_SETTINGS descriptor:

                field           width   description
                ----------------------------------------------
                PORT_SETTINGS   2       CMD_SET
                PPORT           4       Physical port #
                SPEED           4       Port interface speed, in Mbps
                DUPLEX          1       1 = Full, 0 = Half
                AUTONEG         1       1 = enabled, 0 = disabled
                MACADDR         6       Port MAC address
                MODE            1       0 = OF-DPA

Port Enable
-----------

Front-panel ports are initially disabled, which means port ingress and egress
packets will be dropped. To enable or disable a port, use PORT_PHYS_ENABLE:

        PORT_PHYS_ENABLE: offset 0x0318, 64-bit, (R/W)

        Value is a bitmap of the first 64 ports. Bits 0 and 63 are ignored
        and always read as 0. Write 1 to enable a port; write 0 to disable
        it. Default is 0.


SECTION 7: Switch Control
=========================

This section covers switch-wide register settings.

Control
-------

This register is used for low-level control of the switch.

        CONTROL: offset 0x0300, 32-bit, (W)

        bit     name            description
        ------------------------------------------------------------------------
        [0]     CONTROL_RESET   If set, device will perform reset
        [1:31]  Reserved

Switch ID
---------

The switch has a SWITCH_ID to be used by software to uniquely identify the
switch:

        SWITCH_ID: offset 0x0320, 64-bit, (R)

        Value is opaque to switch software and no special encoding is implied.


SECTION 8: Events
=================

Non-I/O asynchronous events from the device are notified to the host using the
event ring. The TLV structure for events is:

        field           width   description
        ---------------------------------------------------
        TYPE            4       Event type, one of:
                                  1: LINK_CHANGED
                                  2: MAC_VLAN_SEEN
        INFO            <nest>  Event info (details below)

Link Changed Event
------------------

When link status changes on a physical port, this event is generated.

        field           width   description
        ---------------------------------------------------
        INFO            <nest>
          PPORT         4       Physical port
          LINKUP        1       Link status:
                                  0: down
                                  1: up

MAC VLAN Seen Event
-------------------

When a packet ingresses on a port and the source MAC/VLAN isn't known to the
device, the device will generate this event. In response to the event, the
driver should install the MAC/VLAN for that port into the device's bridge
table. Once installed, the MAC/VLAN is known on the port and this event will
no longer be generated.

        field           width   description
        ---------------------------------------------------
        INFO            <nest>
          PPORT         4       Physical port
          MAC           6       MAC address
          VLAN          2       VLAN ID


SECTION 9: CPU Packet Processing
================================

Ingress packets directed to the host CPU for further processing are delivered
in the DMA RX ring. Likewise, host-CPU-originated packets destined to egress
on switch ports are scheduled by software using the DMA TX ring.

Tx Packet Processing
--------------------

Software schedules packets for egress on switch ports using the DMA TX ring.
A TX descriptor buffer describes the packet location and size in host DMA-able
memory, the destination port, and any hardware-offload functions (such as L3
payload checksum offload). Software then bumps the descriptor head to signal
hardware of new Tx work. In response, hardware will DMA-read Tx descriptors up
to head, DMA-read the descriptor buffer and packet data, perform offloading
functions, and finally transmit the packet on the wire (network). Once packet
processing is complete, hardware will write back status to the descriptor(s)
to signal to software that Tx is complete and that the software resources
(e.g. skb) backing the packet can be released.

Figure 2 shows an example 3-fragment packet queued with one Tx descriptor. A
TLV is used for each packet fragment.

                                                    pkt frag 1
                                              +--->+----------+  -+
                            desc buf          |    |          |   |
                           +--------+         |    |          |   |
         Tx ring      +--->|  TLVs  |---------+    +----------+   |
        +---------+   |    +--------+               pkt frag 2    |
        | desc 0  |---+    |  TLVs  |------------->+----------+   |
        +---------+        +--------+              |          |   | pkt
 head-->| desc 1  |        |  TLVs  |---------+    +----------+   |
        +---------+        +--------+         |     pkt frag 3    |
        |         |                           +--->+----------+   |
        +---------+                                |          |   |
        |         |                                +----------+  -+
        +---------+

                                 fig. 2

The TLVs for Tx descriptor buffer are:

        field                   width   description
        ---------------------------------------------------------------------
        PPORT                   4       Destination physical port #
        TX_OFFLOAD              1       Hardware offload modes:
                                          0: no offload
                                          1: insert IP csum (ipv4 only)
                                          2: insert TCP/UDP csum
                                          3: L3 csum calc and insert
                                             into csum offset (TX_L3_CSUM_OFF)
                                             16-bit 1's complement csum value.
                                             IPv4 pseudo-header and IP
                                             already calculated by OS
                                             and inserted.
                                          4: TSO (TCP Segmentation Offload)
        TX_L3_CSUM_OFF          2       For L3 csum offload mode, the offset,
                                        from the beginning of the packet,
                                        of the csum field in the L3 header
        TX_TSO_MSS              2       For TSO offload mode, the
                                        Maximum Segment Size in bytes
        TX_TSO_HDR_LEN          2       For TSO offload mode, the
                                        length of ethernet, IP, and
                                        TCP/UDP headers, including IP
                                        and TCP options.
        TX_FRAGS                <array> Packet fragments
          TX_FRAG               <nest>  Packet fragment
            TX_FRAG_ADDR        8       DMA address of packet fragment
            TX_FRAG_LEN         2       Packet fragment length
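TX_OFFLOAD modes 1-3 all insert a 16-bit one's complement checksum. The sketch below shows that arithmetic using the standard RFC 1071 algorithm. It also reflects one plausible reading of mode 3: the OS seeds the checksum field (for example with the pseudo-header sum), and the device sums from some start offset to the end of the packet and writes the complement at TX_L3_CSUM_OFF. The table does not say where summing starts, so `start` is a hypothetical parameter, as are the function names. The check below uses a 20-byte IPv4 header, i.e. the mode 1 case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 one's complement sum of big-endian 16-bit words, folded to 16 bits.
 * The 32-bit accumulator is ample for packets well under 128 KiB. */
static uint16_t csum16(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)p[i] << 8 | p[i + 1];
    if (len & 1)
        sum += (uint32_t)p[len - 1] << 8; /* odd trailing byte, zero-padded */
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Sum pkt[start..len) and store the complement, big-endian, at csum_off.
 * Whatever the checksum field already holds (zero, or a seed written by
 * the OS) is included in the sum. */
static void csum_insert(uint8_t *pkt, size_t len, size_t start, size_t csum_off)
{
    uint16_t c;

    assert(start <= len && csum_off >= start && csum_off + 2 <= len);
    c = (uint16_t)~csum16(pkt + start, len - start);
    pkt[csum_off] = (uint8_t)(c >> 8);
    pkt[csum_off + 1] = (uint8_t)c;
}
```

After insertion, summing the covered range again yields 0xffff, which is also how a receiver checks a one's complement checksum.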

Possible status return codes in descriptor on completion are:

        DESC_COMP_ERR   reason
        --------------------------------------------------------------------
        0               OK
        -ROCKER_ENXIO   address or data read err on desc buf or packet
                        fragment
        -ROCKER_EINVAL  bad pport or TSO or csum offloading error
        -ROCKER_ENOMEM  no memory for internal staging tx fragment

Rx Packet Processing
--------------------

Packets ingressing on switch ports that are not forwarded by the switch, but
are instead directed to the host CPU for further processing, are delivered in
the DMA RX ring. Rx descriptor buffers are allocated by software and placed on
the ring. Hardware will fill Rx descriptor buffers with packet data, write the
completion, and signal to software that a new packet is ready. Since Rx packet
size is not known a priori, the Rx descriptor buffer must be allocated for the
worst-case packet size. A single Rx descriptor will contain the entire Rx
packet data in one RX_FRAG. Other Rx TLVs describe any hardware offloads
performed on the packet, such as checksum validation.

The TLVs for Rx descriptor buffer are:

        field           width   description
        ---------------------------------------------------
        PPORT           4       Source physical port #
        RX_FLAGS        2       Packet parsing flags:
                                  (1 << 0): IPv4 packet
                                  (1 << 1): IPv6 packet
                                  (1 << 2): csum calculated
                                  (1 << 3): IPv4 csum good
                                  (1 << 4): IP fragment
                                  (1 << 5): TCP packet
                                  (1 << 6): UDP packet
                                  (1 << 7): TCP/UDP csum good
        RX_CSUM         2       IP calculated checksum:
                                  IPv4: IP payload csum
                                  IPv6: header and payload csum
                                (Only valid if RX_FLAGS:csum calc is set)
        RX_FRAG_ADDR    8       DMA address of packet fragment
        RX_FRAG_MAX_LEN 2       Packet maximum fragment length
        RX_FRAG_LEN     2       Actual packet fragment length after receive
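A driver typically turns RX_FLAGS into a checksum decision for its network stack. The sketch below decodes the bits from the table; the trust policy in `rx_l4_csum_trusted` is one reasonable choice rather than something this document mandates, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
    RX_FLAG_IPV4              = 1u << 0,
    RX_FLAG_IPV6              = 1u << 1,
    RX_FLAG_CSUM_CALC         = 1u << 2,
    RX_FLAG_IPV4_CSUM_GOOD    = 1u << 3,
    RX_FLAG_IP_FRAG           = 1u << 4,
    RX_FLAG_TCP               = 1u << 5,
    RX_FLAG_UDP               = 1u << 6,
    RX_FLAG_TCP_UDP_CSUM_GOOD = 1u << 7,
};

/* Example policy: trust the L4 checksum only for unfragmented TCP/UDP packets
 * whose device-reported checksums are all good */
static bool rx_l4_csum_trusted(uint16_t flags)
{
    if (flags & RX_FLAG_IP_FRAG)
        return false;
    if (!(flags & (RX_FLAG_TCP | RX_FLAG_UDP)))
        return false;
    if ((flags & RX_FLAG_IPV4) && !(flags & RX_FLAG_IPV4_CSUM_GOOD))
        return false;
    return (flags & RX_FLAG_TCP_UDP_CSUM_GOOD) != 0;
}

/* RX_CSUM is only valid when the "csum calculated" flag is set */
static bool rx_csum_get(uint16_t flags, uint16_t rx_csum, uint16_t *out)
{
    assert(out != NULL);
    if (!(flags & RX_FLAG_CSUM_CALC))
        return false;
    *out = rx_csum;
    return true;
}
```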

Possible status return codes in descriptor on completion are:

        DESC_COMP_ERR           reason
        --------------------------------------------------------------------
        0                       OK
        -ROCKER_ENXIO           address or data read err on desc buf
        -ROCKER_ENOMEM          no memory for internal staging desc buf
        -ROCKER_EMSGSIZE        Rx descriptor buffer wasn't big enough to
                                contain packet data TLV and other TLVs.


SECTION 10: OF-DPA Mode
=======================

OF-DPA mode allows the switch to offload flow packet processing functions to
hardware. An OpenFlow controller would communicate with an OpenFlow agent
installed on the switch. The OpenFlow agent would (directly or indirectly)
communicate with the Rocker switch driver, which in turn would program switch
hardware with flow functionality, as defined in OF-DPA. The block diagram is:

        +----------------------+
        |          OF          |
        |   Remote Controller  |
        +----------+-----------+
                   |
                   |
        +----------+-----------+
        |          OF          |
        |      Local Agent     |
        +----------------------+
        |                      |
        |     Rocker Driver    |
        +----------------------+
                                  <this spec>
        +----------------------+
        |                      |
        |     Rocker Switch    |
        +----------------------+

To participate in flow functions, ports must be configured for OF-DPA mode
during switch initialization.

OF-DPA Flow Table Interface
---------------------------

There are commands to add, modify, delete, and get stats of flow table
entries. The commands are issued using the DMA CMD descriptor ring. The
following commands are defined:

        CMD_ADD:        add an entry to flow table
        CMD_MOD:        modify an entry in flow table
        CMD_DEL:        delete an entry from flow table
        CMD_GET_STATS:  get stats for flow entry

TLVs for add and modify commands are:

        field                   width   description
        ----------------------------------------------------
        OF_DPA_CMD              2       CMD_[ADD|MOD]
        OF_DPA_TBL              2       Flow table ID
                                          0: ingress port
                                          10: vlan
                                          20: termination mac
                                          30: unicast routing
                                          40: multicast routing
                                          50: bridging
                                          60: ACL policy
        OF_DPA_PRIORITY         4       Flow priority
        OF_DPA_HARDTIME         4       Hard timeout for flow
        OF_DPA_IDLETIME         4       Idle timeout for flow
        OF_DPA_COOKIE           8       Cookie

Additional TLVs based on flow table ID:

Table ID 0: ingress port

        field                   width   description
        ----------------------------------------------------
        OF_DPA_IN_PPORT         4       ingress physical port number
        OF_DPA_GOTO_TBL         2       goto table ID; zero to drop

Table ID 10: vlan

        field                   width   description
        ----------------------------------------------------
        OF_DPA_IN_PPORT         4       ingress physical port number
        OF_DPA_VLAN_ID          2 (N)   vlan ID
        OF_DPA_VLAN_ID_MASK     2 (N)   vlan ID mask
        OF_DPA_GOTO_TBL         2       goto table ID; zero to drop
        OF_DPA_NEW_VLAN_ID      2 (N)   new vlan ID

Table ID 20: termination mac

        field                   width   description
        ----------------------------------------------------
        OF_DPA_IN_PPORT         4       ingress physical port number
        OF_DPA_IN_PPORT_MASK    4       ingress physical port number mask
        OF_DPA_ETHERTYPE        2 (N)   must be either 0x0800 or 0x86dd
        OF_DPA_DST_MAC          6 (N)   destination MAC
        OF_DPA_DST_MAC_MASK     6 (N)   destination MAC mask
        OF_DPA_VLAN_ID          2 (N)   vlan ID
        OF_DPA_VLAN_ID_MASK     2 (N)   vlan ID mask
        OF_DPA_GOTO_TBL         2       only acceptable values are
                                        unicast or multicast routing
                                        table IDs
        OF_DPA_OUT_PPORT        2       if specified, must be
                                        controller, set zero otherwise

Table ID 30: unicast routing

        field                   width   description
        ----------------------------------------------------
        OF_DPA_ETHERTYPE        2 (N)   must be either 0x0800 or 0x86dd
        OF_DPA_DST_IP           4 (N)   destination IPv4 address.
                                        Must be unicast address
        OF_DPA_DST_IP_MASK      4 (N)   IP mask. Must be prefix mask
        OF_DPA_DST_IPV6         16 (N)  destination IPv6 address.
                                        Must be unicast address
        OF_DPA_DST_IPV6_MASK    16 (N)  IPv6 mask. Must be prefix mask
        OF_DPA_GOTO_TBL         2       goto table ID; zero to drop
        OF_DPA_GROUP_ID         4       data for GROUP action must
                                        be an L3 Unicast group entry
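The "must be prefix mask" rule for OF_DPA_DST_IP_MASK and OF_DPA_DST_IPV6_MASK can be checked before the command is issued. A sketch with a hypothetical helper; the masks are (N), so bytes are examined in wire order:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* mask: (N) mask bytes, 4 for OF_DPA_DST_IP_MASK or 16 for OF_DPA_DST_IPV6_MASK.
 * Returns the prefix length, or -1 if the mask is not a prefix mask
 * (a run of 1 bits followed only by 0 bits). */
static int prefix_len(const uint8_t *mask, size_t len)
{
    size_t i = 0;
    int bits = 0;

    assert(len == 4 || len == 16);
    while (i < len && mask[i] == 0xff) {
        bits += 8;
        i++;
    }
    if (i == len)
        return bits;

    /* The partial byte must look like 1..10..0, i.e. its inverse is 0..01..1 */
    uint8_t inv = (uint8_t)~mask[i];
    if ((inv & (uint8_t)(inv + 1)) != 0)
        return -1;
    for (uint8_t b = mask[i]; b & 0x80; b = (uint8_t)(b << 1))
        bits++;

    /* Every byte after the partial byte must be zero */
    for (i++; i < len; i++)
        if (mask[i] != 0)
            return -1;
    return bits;
}
```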

Table ID 40: multicast routing

        field                   width   description
        ----------------------------------------------------
        OF_DPA_ETHERTYPE        2 (N)   must be either 0x0800 or 0x86dd
        OF_DPA_VLAN_ID          2 (N)   vlan ID
        OF_DPA_SRC_IP           4 (N)   source IPv4. Optional,
                                        can contain IPv4 address,
                                        must be completely masked
                                        if not used
        OF_DPA_SRC_IP_MASK      4 (N)   IP Mask
        OF_DPA_DST_IP           4 (N)   destination IPv4 address.
                                        Must be multicast address
        OF_DPA_SRC_IPV6         16 (N)  source IPv6 Address. Optional.
                                        Can contain IPv6 address,
                                        must be completely masked
                                        if not used
        OF_DPA_SRC_IPV6_MASK    16 (N)  IPv6 mask.
        OF_DPA_DST_IPV6         16 (N)  destination IPv6 Address. Must
                                        be multicast address
        OF_DPA_GOTO_TBL         2       goto table ID; zero to drop
        OF_DPA_GROUP_ID         4       data for GROUP action must
                                        be an L3 multicast group entry

Table ID 50: bridging

        field                   width   description
        ----------------------------------------------------
        OF_DPA_VLAN_ID          2 (N)   vlan ID
        OF_DPA_TUNNEL_ID        4       tunnel ID
        OF_DPA_DST_MAC          6 (N)   destination MAC
        OF_DPA_DST_MAC_MASK     6 (N)   destination MAC mask
        OF_DPA_GOTO_TBL         2       goto table ID; zero to drop
        OF_DPA_GROUP_ID         4       data for GROUP action must
                                        be a L2 Interface, L2
                                        Multicast, L2 Flood,
                                        or L2 Overlay group entry
                                        as appropriate
        OF_DPA_TUNNEL_LPORT     4       unicast Tenant Bridging
                                        flows specify a tunnel
                                        logical port ID
        OF_DPA_OUT_PPORT        2       data for OUTPUT action,
                                        restricted to CONTROLLER,
                                        set to 0 otherwise

Table ID 60: acl policy

        field                   width   description
        ----------------------------------------------------
        OF_DPA_IN_PPORT         4       ingress physical port number
        OF_DPA_IN_PPORT_MASK    4       ingress physical port number mask
        OF_DPA_ETHERTYPE        2 (N)   ethertype
        OF_DPA_VLAN_ID          2 (N)   vlan ID
        OF_DPA_VLAN_ID_MASK     2 (N)   vlan ID mask
        OF_DPA_VLAN_PCP         2 (N)   vlan Priority Code Point
        OF_DPA_VLAN_PCP_MASK    2 (N)   vlan Priority Code Point mask
        OF_DPA_SRC_MAC          6 (N)   source MAC
        OF_DPA_SRC_MAC_MASK     6 (N)   source MAC mask
        OF_DPA_DST_MAC          6 (N)   destination MAC
        OF_DPA_DST_MAC_MASK     6 (N)   destination MAC mask
        OF_DPA_TUNNEL_ID        4       tunnel ID
        OF_DPA_SRC_IP           4 (N)   source IPv4. Optional,
                                        can contain IPv4 address,
                                        must be completely masked
                                        if not used
        OF_DPA_SRC_IP_MASK      4 (N)   IP Mask
        OF_DPA_DST_IP           4 (N)   destination IPv4 address.
                                        Must be multicast address
        OF_DPA_DST_IP_MASK      4 (N)   IP Mask
        OF_DPA_SRC_IPV6         16 (N)  source IPv6 Address. Optional.
                                        Can contain IPv6 address,
                                        must be completely masked
                                        if not used
        OF_DPA_SRC_IPV6_MASK    16 (N)  IPv6 mask
        OF_DPA_DST_IPV6         16 (N)  destination IPv6 Address. Must
                                        be multicast address.
        OF_DPA_DST_IPV6_MASK    16 (N)  IPv6 mask
        OF_DPA_SRC_ARP_IP       4 (N)   source IPv4 address in the ARP
                                        payload. Only used if ethertype
                                        == 0x0806.
        OF_DPA_SRC_ARP_IP_MASK  4 (N)   IP Mask
        OF_DPA_IP_PROTO         1       IP protocol
        OF_DPA_IP_PROTO_MASK    1       IP protocol mask
        OF_DPA_IP_DSCP          1       DSCP
        OF_DPA_IP_DSCP_MASK     1       DSCP mask
        OF_DPA_IP_ECN           1       ECN
        OF_DPA_IP_ECN_MASK      1       ECN mask
        OF_DPA_L4_SRC_PORT      2 (N)   L4 source port, only for
                                        TCP, UDP, or SCTP
        OF_DPA_L4_SRC_PORT_MASK 2 (N)   L4 source port mask
        OF_DPA_L4_DST_PORT      2 (N)   L4 destination port, only for
                                        TCP, UDP, or SCTP
        OF_DPA_L4_DST_PORT_MASK 2 (N)   L4 destination port mask
        OF_DPA_ICMP_TYPE        1       ICMP type, only if IP
                                        protocol is 1
        OF_DPA_ICMP_TYPE_MASK   1       ICMP type mask
        OF_DPA_ICMP_CODE        1       ICMP code
        OF_DPA_ICMP_CODE_MASK   1       ICMP code mask
        OF_DPA_IPV6_LABEL       4 (N)   IPv6 flow label
        OF_DPA_IPV6_LABEL_MASK  4 (N)   IPv6 flow label mask
        OF_DPA_GROUP_ID         4       data for GROUP action
        OF_DPA_QUEUE_ID_ACTION  1       write the queue ID
        OF_DPA_NEW_QUEUE_ID     1       queue ID
        OF_DPA_VLAN_PCP_ACTION  1       write the VLAN priority
        OF_DPA_NEW_VLAN_PCP     1       VLAN priority
        OF_DPA_IP_DSCP_ACTION   1       write the DSCP
        OF_DPA_NEW_IP_DSCP      1       new DSCP
        OF_DPA_TUNNEL_LPORT     4       restrict to valid tunnel
                                        logical port, set to 0
                                        otherwise.
        OF_DPA_OUT_PPORT        2       data for OUTPUT action,
                                        restricted to CONTROLLER,
                                        set to 0 otherwise
        OF_DPA_CLEAR_ACTIONS    4       if 1 packets matching flow are
                                        dropped (all other instructions
                                        ignored)

TLVs for flow delete and get stats commands are:

        field                   width   description
        ---------------------------------------------------
        OF_DPA_CMD              2       CMD_[DEL|GET_STATS]
        OF_DPA_COOKIE           8       Cookie

On completion of get stats command, the descriptor buffer is written back with
the following TLVs:

        field                   width   description
        ---------------------------------------------------
        OF_DPA_STAT_DURATION    4       Flow duration
        OF_DPA_STAT_RX_PKTS     8       Received packets
        OF_DPA_STAT_TX_PKTS     8       Transmit packets

Possible status return codes in descriptor on completion are:

        DESC_COMP_ERR     command            reason
        --------------------------------------------------------------------
        0                 all                OK
        -ROCKER_EFAULT    all                head or tail index outside
                                             of ring
        -ROCKER_ENXIO     all                address or data read err on
                                             desc buf
        -ROCKER_EMSGSIZE  GET_STATS          cmd descriptor buffer wasn't
                                             big enough to contain write-back
                                             TLVs
        -ROCKER_EINVAL    all                invalid parameters passed in
        -ROCKER_EEXIST    ADD                entry already exists
        -ROCKER_ENOSPC    ADD                no space left in flow table
        -ROCKER_ENOENT    MOD|DEL|GET_STATS  cookie invalid

Group Table Interface
---------------------

There are commands to add, modify, delete, and get stats of group table
entries. The commands are issued using the DMA CMD descriptor ring. The
following commands are defined:

        CMD_ADD:        add an entry to group table
        CMD_MOD:        modify an entry in group table
        CMD_DEL:        delete an entry from group table
        CMD_GET_STATS:  get stats for group entry

TLVs for add and modify commands are:

        field                   width   description
        -----------------------------------------------------------
        FLOW_GROUP_CMD          2       CMD_[ADD|MOD]
        FLOW_GROUP_ID           2       Flow group ID
        FLOW_GROUP_TYPE         1       Group type:
                                          0: L2 interface
                                          1: L2 rewrite
                                          2: L3 unicast
                                          3: L2 multicast
                                          4: L2 flood
                                          5: L3 interface
                                          6: L3 multicast
                                          7: L3 ECMP
                                          8: L2 overlay
        FLOW_VLAN_ID            2       Vlan ID (types 0, 3, 4, 6)
        FLOW_L2_PORT            2       Port (type 0)
        FLOW_INDEX              4       Index (all types but 0)
        FLOW_OVERLAY_TYPE       1       Overlay sub-type (type 8):
                                          0: Flood unicast tunnel
                                          1: Flood multicast tunnel
                                          2: Multicast unicast tunnel
                                          3: Multicast multicast tunnel
        FLOW_GROUP_ACTION       nest
          FLOW_GROUP_ID         2       next group ID in chain (all
                                        types except 0)
          FLOW_OUT_PORT         4       egress port (types 0, 8)
          FLOW_POP_VLAN_TAG     1       strip outer VLAN tag (type 1
                                        only)
          FLOW_VLAN_ID          2       (types 1, 5)
          FLOW_SRC_MAC          6       (types 1, 2, 5)
          FLOW_DST_MAC          6       (types 1, 2)

TLVs for group delete and get stats commands are:

        field                   width   description
        -----------------------------------------------------------
        FLOW_GROUP_CMD          2       CMD_[DEL|GET_STATS]
        FLOW_GROUP_ID           2       Flow group ID

On completion of get stats command, the descriptor buffer is written back with
the following TLVs:

        field                   width   description
        ---------------------------------------------------
        FLOW_GROUP_ID           2       Flow group ID
        FLOW_STAT_DURATION      4       Flow duration
        FLOW_STAT_REF_COUNT     4       Flow reference count
        FLOW_STAT_BUCKET_COUNT  4       Flow bucket count

Possible status return codes in descriptor on completion are:

        DESC_COMP_ERR     command            reason
        --------------------------------------------------------------------
        0                 all                OK
        -ROCKER_EFAULT    all                head or tail index outside
                                             of ring
        -ROCKER_ENXIO     all                address or data read err on
                                             desc buf
        -ROCKER_EMSGSIZE  GET_STATS          cmd descriptor buffer wasn't
                                             big enough to contain write-back
                                             TLVs
        -ROCKER_EINVAL    ADD|MOD            invalid parameters passed in
        -ROCKER_EEXIST    ADD                entry already exists
        -ROCKER_ENOSPC    ADD                no space left in group table
        -ROCKER_ENOENT    MOD|DEL|GET_STATS  group ID invalid
        -ROCKER_EBUSY     DEL                group reference count non-zero
        -ROCKER_ENODEV    ADD                next group ID doesn't exist



References
==========

[1] OpenFlow Data Plane Abstraction (OF-DPA) Abstract Switch Specification,
    Version 1.0, from Broadcom Corporation, February 21, 2014.