TL;DR
- PFC (IEEE 802.1Qbb) provides per-priority pause frames that stop upstream senders when switch buffers fill — the lowest-level loss prevention.
- ECN (RFC 3168) marks packets in the IP header when queues build, letting receivers tell senders to slow down before drops occur.
- DCQCN (Microsoft/Mellanox, SIGCOMM 2015) is the end-host congestion control algorithm that translates ECN marks into RDMA QP rate cuts.
- Together these form the standard recipe for running RoCEv2 at AI-fabric scale — with PFC as the safety net and ECN/DCQCN as the steady-state regulator.
Overview
RDMA over Ethernet performs well only if the fabric almost never drops packets, because RoCE NICs typically recover from loss far less gracefully than TCP does. Lossless Ethernet is the umbrella term for the combination of mechanisms that makes commodity Ethernet near-lossless. The standard recipe in production AI fabrics is three-layered: PFC at Layer 2, ECN at Layer 3, and DCQCN at the end host.
Each layer operates at a different timescale and scope. PFC is the brake of last resort: per-priority and hop-by-hop, reacting within microseconds of a buffer threshold being crossed. ECN is the warning system: marks are applied packet by packet and signalled end to end. DCQCN is the controller: it adjusts the sending rate of each RDMA queue pair over round-trip timescales. When all three are tuned correctly, congestion is absorbed long before any packet is dropped.
PFC: Priority Flow Control
PFC is defined by IEEE 802.1Qbb. It extends the original PAUSE frame (802.3x) to operate on a per-traffic-class basis using the eight 802.1p priorities. When a switch's ingress buffer for a given priority exceeds a configured threshold, it emits a PFC frame upstream telling the sender to halt that priority for a specified number of pause quanta (one quantum is 512 bit times at the port speed). The upstream switch in turn may emit its own PFC frame if its buffers fill.
The strength of PFC is that it prevents congestion drops on the paused priority, provided enough buffer headroom is reserved to absorb the data still in flight while the pause takes effect. The weakness is head-of-line blocking: a single congested flow can pause an entire priority queue, including unrelated flows that share it. Pause propagation can also cause 'PFC storms' in which pauses cascade upstream and freeze the fabric.
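To make the "quantum of time" concrete, here is a minimal sketch that converts a PFC timer value into wall-clock pause time. It assumes the standard definition of one quantum as 512 bit times and a 16-bit timer field per priority; the function name `pause_duration_us` is illustrative and not part of any tool.

```python
PAUSE_QUANTUM_BITS = 512   # one pause quantum = 512 bit times at the port speed
MAX_PAUSE_QUANTA = 0xFFFF  # the per-priority timer field is 16 bits wide


def pause_duration_us(quanta: int, link_gbps: float) -> float:
    """Return how long one PFC frame pauses a priority, in microseconds."""
    if not 0 <= quanta <= MAX_PAUSE_QUANTA:
        raise ValueError("pause timer must fit in 16 bits")
    if link_gbps <= 0:
        raise ValueError("link speed must be positive")
    bit_time_ns = 1.0 / link_gbps  # at N Gb/s, one bit lasts 1/N nanoseconds
    return quanta * PAUSE_QUANTUM_BITS * bit_time_ns / 1000.0


# The longest single pause gets shorter as links get faster.
for speed in (100, 400):
    print(speed, "Gb/s:", round(pause_duration_us(MAX_PAUSE_QUANTA, speed), 1), "us")
```

Because the maximum pause shrinks with link speed, a congested switch has to keep refreshing the pause for as long as its buffer stays above the threshold, which is one reason sustained pausing shows up so clearly in pause counters.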
ECN: Explicit Congestion Notification
ECN (RFC 3168) uses two bits in the IP header. When queue depth crosses a threshold, the switch sets the Congestion Experienced (CE) codepoint on ECN-capable packets instead of dropping them. In RoCEv2, the receiving NIC echoes this back to the sender with a Congestion Notification Packet (CNP); the CNP belongs to RoCEv2 congestion management rather than to RFC 3168, which describes the equivalent echo for TCP.
ECN signalling is end-to-end and does not stop traffic, so it avoids PFC's head-of-line blocking: only flows whose packets pass through the congested queue are marked, and only their senders slow down. It is also proactive: marks appear before queues are full, giving the end host time to back off.
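The two ECN bits sit in the low-order bits of the byte that also carries the DSCP value. The sketch below shows the four RFC 3168 codepoints and what a marking switch does to that byte; the helper names `split_tos` and `mark_ce` are illustrative only.

```python
# RFC 3168 codepoints carried in the two low-order bits of the TOS / Traffic Class byte
NOT_ECT = 0b00  # sender is not ECN-capable
ECT_1 = 0b01    # ECN-capable transport, codepoint 1
ECT_0 = 0b10    # ECN-capable transport, codepoint 0
CE = 0b11       # Congestion Experienced, set by a congested switch


def split_tos(tos: int) -> tuple:
    """Split the 8-bit TOS / Traffic Class byte into (dscp, ecn)."""
    return tos >> 2, tos & 0b11


def mark_ce(tos: int) -> int:
    """Return the byte as a marking switch would rewrite it."""
    dscp, ecn = split_tos(tos)
    if ecn == NOT_ECT:
        # Not ECN-capable: marking is not allowed, so the byte is unchanged
        # (under congestion such a packet can only be queued or dropped).
        return tos
    return (dscp << 2) | CE


roce_tos = (26 << 2) | ECT_0       # DSCP 26 with ECT(0)
print(split_tos(mark_ce(roce_tos)))  # DSCP is preserved, ECN becomes CE
```

Note that marking never changes the DSCP bits, so a marked packet stays in the same priority queue for the rest of its path.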
DCQCN: Data Centre Quantised Congestion Notification
DCQCN is the end-host algorithm that translates ECN marks into rate adjustments per RDMA queue pair. Published by Microsoft and Mellanox at SIGCOMM 2015, it has no slow start: a flow begins at line rate, takes a multiplicative rate cut when a CNP arrives, and then climbs back through QCN-style fast-recovery and additive-increase phases.
Three switch-side marking parameters dominate behaviour: Kmin (queue depth at which marking begins), Kmax (queue depth at which the marking probability reaches Pmax; above it every packet is marked), and Pmax (maximum marking probability between Kmin and Kmax). Default values work for small fabrics; large clusters require tuning to balance aggressiveness against fairness.
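The sender-side behaviour can be sketched in a few lines. This is a simplified model based on the update rules in the SIGCOMM 2015 paper, not a NIC implementation: it merges the paper's separate timer and byte counter into one "increase event", omits the hyper-increase phase, and uses illustrative values for `rate_ai` and `fast_recovery_rounds`. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DcqcnFlow:
    """Simplified DCQCN sender state for one queue pair (rates in Gb/s)."""
    line_rate: float
    g: float = 1 / 256             # gain used to update alpha
    rate_ai: float = 0.04          # additive-increase step (illustrative value)
    fast_recovery_rounds: int = 5  # increase events spent in fast recovery
    current: float = 0.0           # current sending rate
    target: float = 0.0            # rate the flow is recovering towards
    alpha: float = 1.0             # running estimate of congestion severity
    rounds_since_cut: int = 0

    def __post_init__(self):
        # No slow start: a new flow begins at line rate.
        self.current = self.target = self.line_rate

    def on_cnp(self):
        """A CNP arrived: remember the old rate, then cut multiplicatively."""
        self.target = self.current
        self.current *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.rounds_since_cut = 0

    def on_alpha_timer(self):
        """No CNP arrived during the alpha timer interval: decay alpha."""
        self.alpha *= 1 - self.g

    def on_increase_event(self):
        """Timer or byte counter fired without a CNP: move towards target."""
        self.rounds_since_cut += 1
        if self.rounds_since_cut > self.fast_recovery_rounds:
            # Additive increase: probe for bandwidth above the old target.
            self.target = min(self.target + self.rate_ai, self.line_rate)
        # In both phases the current rate moves halfway to the target.
        self.current = (self.target + self.current) / 2
```

With `alpha` starting at 1, the first CNP halves the rate, and each subsequent CNP-free increase event closes half of the remaining gap to the previous rate. As `alpha` decays during quiet periods, later cuts become gentler.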
| Parameter | Role | Typical Value |
|---|---|---|
| Kmin | Start ECN marking at this queue depth | 150-500 KB |
| Kmax | Marking probability reaches Pmax at this queue depth; all packets are marked above it | 1500-5000 KB |
| Pmax | Max marking probability | 0.1-0.5 |
| Rate target | DCQCN steady-state target rate | 95% of link |
| Rate first | Initial rate after CNP | 50% of last |
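How Kmin, Kmax, and Pmax interact is easiest to see as a function of queue depth. The sketch below follows the RED-style profile described in the DCQCN paper: no marking up to Kmin, a linear ramp to Pmax at Kmax, and every packet marked beyond Kmax. Switch implementations may differ in detail, and the function name `mark_probability` is illustrative.

```python
def mark_probability(queue_kb: float, kmin_kb: float, kmax_kb: float, pmax: float) -> float:
    """Probability that an ECN-capable packet is marked CE at a given queue depth."""
    if not kmin_kb < kmax_kb:
        raise ValueError("Kmin must be smaller than Kmax")
    if not 0.0 <= pmax <= 1.0:
        raise ValueError("Pmax is a probability")
    if queue_kb <= kmin_kb:
        return 0.0   # queue is shallow: never mark
    if queue_kb > kmax_kb:
        return 1.0   # queue is beyond Kmax: mark everything
    # Linear ramp from 0 at Kmin up to Pmax at Kmax.
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)


# Example with values taken from the ranges in the table above.
for depth in (100, 1100, 2000, 2500):
    print(depth, "KB ->", mark_probability(depth, kmin_kb=200, kmax_kb=2000, pmax=0.2))
```

The jump from Pmax to 1.0 just above Kmax is the reason a Kmax set too low causes abrupt throttling: once a burst pushes the queue past it, every flow through that queue receives marks at once.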
Tuning Recipe
- Pick a single DSCP value for RoCE (commonly 26) and map it to a dedicated priority queue (commonly priority 3).
- Enable PFC only on the RoCE priority — never on default or storage priorities.
- Enable ECN on the same priority with Kmin/Kmax sized to absorb expected incast burst durations.
- Set DCQCN parameters per cluster scale — defaults trend conservative; large clusters need higher Kmax to avoid premature throttling.
- Always test under synthetic congestion (`ib_send_bw` with many-to-one) before production traffic.
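The first three items of the recipe can be written down as a pre-deployment sanity check. The sketch below assumes the intended QoS settings have been collected into a plain dictionary; the key names (`roce_dscp`, `dscp_to_priority`, `pfc_priorities`, `ecn`) are hypothetical and do not correspond to any vendor's configuration schema.

```python
def check_roce_qos(cfg: dict) -> list:
    """Return a list of problems found in a (hypothetical) QoS description."""
    problems = []

    # Step 1: the RoCE DSCP must map to a dedicated priority.
    prio = cfg["dscp_to_priority"].get(cfg["roce_dscp"])
    if prio is None:
        return ["RoCE DSCP is not mapped to any priority"]

    # Step 2: PFC should be enabled on the RoCE priority and nowhere else.
    if set(cfg["pfc_priorities"]) != {prio}:
        problems.append(f"PFC should be enabled only on priority {prio}")

    # Step 3: ECN must be enabled on the same priority with sane thresholds.
    ecn = cfg["ecn"].get(prio)
    if ecn is None:
        problems.append(f"ECN is not enabled on priority {prio}")
    elif not ecn["kmin_kb"] < ecn["kmax_kb"]:
        problems.append("Kmin must be smaller than Kmax")

    return problems


example = {
    "roce_dscp": 26,
    "dscp_to_priority": {26: 3},
    "pfc_priorities": [3],
    "ecn": {3: {"kmin_kb": 200, "kmax_kb": 2000}},
}
print(check_roce_qos(example))  # an empty list means the recipe is satisfied
```

A check like this only confirms that the intended settings are self-consistent. It does not replace the many-to-one congestion test, which is what shows whether the thresholds actually suit the fabric.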
The most common production incident is PFC pause storms: a misbehaving NIC continuously emits PAUSE frames, the upstream switch propagates them, and a swathe of the fabric freezes. Always instrument per-port PFC pause counters and alarm on sustained pause rates above ~1%.
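One way to turn that advice into an alarm is sketched below. It assumes the switch exposes a cumulative per-port "time spent paused" counter in microseconds, and it reads the ~1% threshold as the fraction of each polling interval spent paused; both are assumptions, and if only pause-frame counts are available the ratio would have to be defined differently. The function name and sample format are hypothetical.

```python
from typing import Iterable, Tuple


def sustained_pause_alarm(
    samples: Iterable[Tuple[float, int]],
    threshold: float = 0.01,
    min_intervals: int = 3,
) -> bool:
    """Return True if the paused-time ratio stays above threshold.

    samples: (timestamp_seconds, cumulative_paused_microseconds) per poll,
    in time order, for a single port and priority.
    """
    streak = 0
    prev = None
    for ts, paused_us in samples:
        if prev is not None:
            interval_us = (ts - prev[0]) * 1e6
            if interval_us > 0:
                ratio = (paused_us - prev[1]) / interval_us
                # Count consecutive bad intervals; one quiet interval resets.
                streak = streak + 1 if ratio > threshold else 0
                if streak >= min_intervals:
                    return True
        prev = (ts, paused_us)
    return False


# Paused 2% of every 10-second interval, three intervals in a row.
storm = [(0, 0), (10, 200_000), (20, 400_000), (30, 600_000)]
print(sustained_pause_alarm(storm))
```

Requiring several consecutive intervals keeps short incast bursts, which PFC is meant to absorb, from raising the alarm, while a NIC that pauses continuously trips it within a few polls.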
References
- IEEE 802.1Qbb — Priority-based Flow Control · IEEE
- RFC 3168 — Addition of Explicit Congestion Notification to IP · IETF
- Congestion Control for Large-Scale RDMA Deployments · Microsoft / SIGCOMM 2015
- RoCE Deployment Best Practices · NVIDIA