SystemVerilog · Cocotb · Vivado 2025.2 · Target: Xilinx xcvu9p
Overview
This project is inspired by low-latency/HFT-style design constraints.
- Runtime-configurable AXI-stream Ethernet/IPv4/UDP parser
- Single-pass payload discrimination and variable-offset price extraction
- Implemented in SystemVerilog, verified with cocotb/Verilator
- Meets 500 MHz on Xilinx VU9P with +0.367 ns WNS
The payload format is configured at runtime using a small host-programmable table, where packet discriminators select extraction offsets dynamically. Header checks and format resolution are performed in-stream with no multi-cycle lookup stage. The parser supports AXI-Stream with non-contiguous tvalid (upstream bubbles) and price fields that straddle the beat boundary between byte_counter == 40 and byte_counter == 48.
Beat-by-Beat Pipeline
The AXI-Stream bus is 64 bits wide — 8 bytes per beat. A byte_counter tracks position in units of 8, advancing only on beat_fire = tvalid && tready. This makes the parser transparent to upstream bubbles: idle cycles are simply not counted.
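The bubble-transparency property can be illustrated with a small behavioural model. This is a minimal sketch in plain Python, not the RTL: the function name `count_beats` and the list-based traces are hypothetical, and end-of-packet reset of the counter is not modelled.

```python
def count_beats(tvalid_trace, tready_trace):
    """Return the byte_counter value observed at each accepted beat."""
    byte_counter = 0
    seen = []
    for tvalid, tready in zip(tvalid_trace, tready_trace):
        beat_fire = bool(tvalid and tready)
        if beat_fire:
            seen.append(byte_counter)
            byte_counter += 8  # 64-bit bus: 8 bytes per accepted beat
        # Idle cycles (beat_fire low) leave byte_counter unchanged.
    return seen


# Three beats back to back, then the same three beats with upstream bubbles.
no_bubbles = count_beats([1, 1, 1], [1, 1, 1])
with_bubbles = count_beats([1, 0, 1, 0, 0, 1], [1, 1, 1, 1, 1, 1])
print(no_bubbles, with_bubbles)
```

Both traces yield the same sequence of counter values, which is the sense in which idle cycles are invisible to the parser.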
[Figure: beat-by-beat pipeline. Stages in order: dst/src MAC → EtherType 0x0800 → Protocol 0x11 → src/dst IP → Dst Port (cfg) → lookup + buf 42..47 → price + trigger]
Fail-fast semantics apply: if EtherType, Protocol, or Dst Port checks fail, a drop_packet flag is set and all subsequent parsing for that packet is suppressed. The format table lookup and price extraction are entirely skipped for dropped packets.
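The check order can be sketched as a software reference. The sketch below assumes an untagged Ethernet frame and a 20-byte IPv4 header without options, which places EtherType at bytes 12..13, Protocol at byte 23 and the UDP destination port at bytes 36..37. The helper names and the port value 12345 are hypothetical, chosen only for illustration.

```python
ETHERTYPE_IPV4 = 0x0800
IP_PROTO_UDP = 0x11


def first_failed_check(frame: bytes, cfg_dst_port: int):
    """Return the name of the first failing header check, or None if all pass."""
    if int.from_bytes(frame[12:14], "big") != ETHERTYPE_IPV4:
        return "ethertype"
    if frame[23] != IP_PROTO_UDP:
        return "protocol"
    if int.from_bytes(frame[36:38], "big") != cfg_dst_port:
        return "dst_port"
    return None


def make_frame(ethertype=ETHERTYPE_IPV4, protocol=IP_PROTO_UDP,
               dst_port=12345, length=64):
    """Build a zero-filled test frame with only the checked fields set."""
    frame = bytearray(length)
    frame[12:14] = ethertype.to_bytes(2, "big")
    frame[23] = protocol
    frame[36:38] = dst_port.to_bytes(2, "big")
    return bytes(frame)


good = first_failed_check(make_frame(), cfg_dst_port=12345)
bad_type = first_failed_check(make_frame(ethertype=0x86DD), cfg_dst_port=12345)
print(good, bad_type)
```

A frame that fails an early check never reaches the later ones, mirroring how drop_packet suppresses the rest of the packet.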
Verification Strategy
Why two test types are sufficient
The verification strategy separates two independent questions that are often conflated: is the logic correct? and is the design fast enough?
Behavioural simulation (cocotb + Verilator) answers the first question by running the RTL with zero propagation delay. If the design produces correct outputs under all tested stimulus conditions, the logic is sound. Static Timing Analysis (Vivado implementation) answers the second: it exhaustively checks every register-to-register path against the clock constraint, under worst-case Process/Voltage/Temperature conditions.
For a synchronous, single-clock-domain FPGA design with complete timing constraints, this combination covers what post-implementation timing simulation would check, and it is standard practice. Gate-level simulation with SDF back-annotation adds little value here and would be orders of magnitude slower.
What WNS means
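Worst Negative Slack (WNS) is the setup slack of the slowest timing path in the design, measured after place-and-route. It is the margin left between the clock constraint and that path's total delay, which includes setup time and clock skew in addition to logic and net delay: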
fmax = 1 / (clock_period − WNS)
WNS = +0.367 ns @ 2.000 ns constraint → fmax ≈ 612 MHz
A positive WNS means every path meets setup timing with margin to spare. Because Vivado's STA already accounts for worst-case PVT corners, a positive WNS at the target frequency, together with a positive hold slack (WHS), is a sufficient timing guarantee for production use.
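The fmax figure follows directly from the formula above. A minimal sketch of the arithmetic, with `fmax_mhz` as a hypothetical helper name:

```python
def fmax_mhz(clock_period_ns: float, wns_ns: float) -> float:
    """fmax = 1 / (clock_period - WNS), converted from ns to MHz."""
    return 1e3 / (clock_period_ns - wns_ns)


reported = fmax_mhz(2.000, 0.367)
print(f"{reported:.0f} MHz")  # 2.000 ns constraint with +0.367 ns WNS
```

The same formula applies when WNS is negative: the achievable frequency then falls below the constraint instead of above it.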
Behavioural test suite (cocotb)
The tests below each target a distinct failure mode:
| Test | What it catches |
|---|---|
| Wrong EtherType / Protocol / Dst Port | Fail-fast logic — dropped packets must not assert valid_packet or price_updated |
| No table match | Header-valid packets with unrecognised discriminator must assert valid_packet but not price_updated |
| Price crosses beat boundary (offset 45, 46, 47) | Assembly logic that spans two beats — the most error-prone path in the design |
| Bubble injection (50% tvalid gap probability) | Parser must be transparent to upstream idle cycles; byte_counter must only advance on beat_fire |
| Deterministic bubble before discriminator beat | Gap specifically before bc=40, the most timing-sensitive beat for format lookup |
| Randomised sweep (100 packets) | Random price_offset ∈ {44,45,46,47}, random disc, random price, random bubble probability — Python reference model computes expected output independently |
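Bubble injection can be driven by a seeded generator so that failures are reproducible. The sketch below is plain Python rather than the project's cocotb driver; `tvalid_pattern` and its parameters are hypothetical names used only for illustration.

```python
import random


def tvalid_pattern(num_beats: int, gap_probability: float, rng: random.Random):
    """Return a 0/1 tvalid sequence containing exactly num_beats valid cycles."""
    if not 0.0 <= gap_probability < 1.0:
        raise ValueError("gap_probability must be in [0, 1)")
    pattern = []
    sent = 0
    while sent < num_beats:
        if rng.random() < gap_probability:
            pattern.append(0)  # bubble: upstream idle for one cycle
        else:
            pattern.append(1)  # valid beat
            sent += 1
    return pattern


# 7 beats cover byte_counter 0..48; a fixed seed makes the run repeatable.
pattern = tvalid_pattern(7, 0.5, random.Random(1))
print(pattern)
```

Whatever the gap probability, the number of valid cycles is fixed, so the expected parser output does not depend on where the bubbles fall.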
Design Decisions & Trade-offs
Format table lookup: parallel combinational logic, not a state machine
The format table has four entries. A priority encoder selects the lowest-index matching entry in a single always_comb block:
    for (i = 0; i < TABLE_SIZE; i++) begin
      hit_c[i] = disc_valid_c && table_entry_valid[i] &&
                 ((payload_disc & mask_i) == (value_i & mask_i));
      if (!match_found_c && hit_c[i]) begin
        match_found_c     = 1'b1;
        selected_offset_c = price_offset_i;
      end
    end
After synthesis, Vivado maps this priority encoder onto LUTs plus three MUXF7 primitives. A MUXF7 is a dedicated 2:1 multiplexer inside the CLB that combines the outputs of two LUT6s, so the if-not-found-yet selection across all four entries resolves without adding another LUT level. This is visible directly in the primitives report.
The alternative — a state machine that checks one entry per cycle — would take up to four cycles to resolve a match, which is unacceptable: the discriminator appears at bc=40, and the price must be assembled at bc=48. There is exactly one beat of margin. The parallel approach uses more LUTs but keeps the latency at zero additional cycles.
The trade-off is that a larger table would lengthen this combinational path and eventually fail timing. TABLE_SIZE=4 is therefore not just a functional choice — it is a timing constraint. If more entries were needed, the lookup would need to be pipelined and started earlier.
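The lowest-index-wins matching rule can be mirrored in a software reference model. This is a sketch under assumptions: the dictionary layout and the example mask, value and offset numbers are hypothetical, and only the masked-compare and priority behaviour is taken from the description above.

```python
def lookup_offset(payload_disc: int, table):
    """Return price_offset of the lowest-index matching valid entry, else None."""
    for entry in table:
        masked_hit = (payload_disc & entry["mask"]) == (entry["value"] & entry["mask"])
        if entry["valid"] and masked_hit:
            return entry["price_offset"]  # first hit wins, like the priority encoder
    return None


# Hypothetical four-entry table; entries 1 and 2 both match 0x41.
table = [
    {"valid": True,  "mask": 0xFF, "value": 0x58, "price_offset": 44},
    {"valid": True,  "mask": 0xF0, "value": 0x40, "price_offset": 45},
    {"valid": True,  "mask": 0xFF, "value": 0x41, "price_offset": 46},
    {"valid": False, "mask": 0x00, "value": 0x00, "price_offset": 47},
]

print(lookup_offset(0x41, table), lookup_offset(0x99, table))
```

The hardware evaluates all four comparisons in parallel, while the loop here is sequential. The selected entry is the same in both cases.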
Unconditional buffering of bytes 42..47
At bc=40, the parser unconditionally captures bytes 42–47 into a six-byte buffer regardless of whether a table match has been found:
    // Always capture; do not gate on match_found_c
    buf42_47 <= {
      s_axis_tdata[63:56],  // byte 47
      s_axis_tdata[55:48],  // byte 46
      s_axis_tdata[47:40],  // byte 45
      s_axis_tdata[39:32],  // byte 44
      s_axis_tdata[31:24],  // byte 43
      s_axis_tdata[23:16]   // byte 42
    };
The alternative — gating the write enable on match_found_c — would chain the table lookup combinational path directly into the register write enable, lengthening the critical path. Unconditional capture parallelises the two operations at the cost of writing six bytes that are sometimes unused. The area cost is negligible; the timing benefit is real.
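The byte-lane mapping used by the capture can be checked with a few lines of Python. The sketch assumes the lane ordering shown in the snippet above, where the lowest-numbered byte of a beat sits in tdata[7:0]; `capture_buf42_47` is a hypothetical name for the model.

```python
def capture_buf42_47(tdata: int) -> int:
    """Model the 48-bit buffer: byte 42 in bits [7:0] up to byte 47 in bits [47:40]."""
    buf = 0
    for lane in range(2, 8):  # lanes 2..7 of the bc=40 beat hold bytes 42..47
        byte = (tdata >> (8 * lane)) & 0xFF
        buf |= byte << (8 * (lane - 2))
    return buf


# Beat at bc=40 where every byte equals its own position in the frame.
beat40 = int.from_bytes(bytes(range(40, 48)), "little")
buf = capture_buf42_47(beat40)
print(hex(buf))
```

The loop is equivalent to dropping the two lowest lanes, which is why the capture costs no logic beyond the registers themselves.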
Pending register for price output
Price assembly at bc=48 involves a dynamic loop that selects bytes from either buf42_47 or the current beat depending on active_price_offset. Rather than driving the output signals directly from this combinational block, the result is written to a price_pending register and published one cycle later.
The timing report confirms this was necessary: the critical path runs from active_price_offset_reg to price_pending_data_reg/CE with a fanout of 103, consuming 1.460 ns of the 2.000 ns budget. Driving the final outputs directly from this node would push the path over budget.
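The byte selection performed at bc=48 can be sketched as follows. Note the assumptions: the source does not state the price width or byte order, so the sketch assumes a 4-byte big-endian field, and `assemble_price` is a hypothetical name. Only the rule of taking bytes below 48 from buf42_47 and the rest from the current beat comes from the description above.

```python
PRICE_BYTES = 4  # assumption: field width is not given in the source


def assemble_price(buf42_47: bytes, beat48: bytes, price_offset: int) -> int:
    """Gather PRICE_BYTES bytes starting at price_offset across the beat boundary."""
    out = bytearray()
    for i in range(PRICE_BYTES):
        pos = price_offset + i
        if pos < 48:
            out.append(buf42_47[pos - 42])  # byte captured at bc=40
        else:
            out.append(beat48[pos - 48])    # byte from the current beat
    return int.from_bytes(out, "big")       # assumption: big-endian field


# Every byte equals its own frame position, so the result is easy to read.
buf = bytes(range(42, 48))
beat = bytes(range(48, 56))
print(hex(assemble_price(buf, beat, 46)))
```

With offset 44 all bytes come from the buffer, and with offsets 45 to 47 the field straddles the two beats, which matches the boundary cases exercised by the test suite.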
Fail-fast ordering
Header checks are ordered by byte position: EtherType at bc=8, Protocol at bc=16, Dst Port at bc=32. Once any check fails, drop_packet is asserted and the unique case body is suppressed for all subsequent beats. This means the format table lookup at bc=40 is never entered for non-UDP or wrong-port packets, eliminating any risk of spurious price updates from malformed traffic.
Implementation Results
| Metric | Value | Note |
|---|---|---|
| Target device | xcvu9p-flgb2104-2-i | UltraScale+, speed grade −2 |
| Clock constraint | 500 MHz (2.000 ns) | Used here as an aggressive timing target |
| WNS (setup) | +0.367 ns | All 384 endpoints pass |
| WHS (hold) | +0.057 ns | All 384 endpoints pass |
| fmax | ≈ 612 MHz | 1 / (2.000 − 0.367) ns |
| Throughput @ fmax | ≈ 39 Gbps | 612 MHz × 64 bits/beat (8 bytes) |
| LUT | 298 / 1,182,240 | 0.03% |
| FF | 198 / 2,364,480 | 0.01% |
| DSP / BRAM | 0 / 0 | Pure LUT/FF implementation |
| Latches inferred | 0 | Verified in synthesis report |
| Critical path | active_price_offset → price_pending_data/CE | Logic 0.37 ns, Net 1.09 ns, Fanout 103 |
| Logic levels (critical) | 3 | Net delay dominates, not logic |
The critical path is net-delay dominated (1.09 ns net vs 0.37 ns logic), which is typical for high-fanout control signals on UltraScale+. active_price_offset fans out to 103 endpoints — every bit of the price assembly loop — making it the natural bottleneck. Further frequency improvement would require replicating this register to reduce fanout, at the cost of a small area increase.
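The figures in the table can be cross-checked with simple arithmetic. A short sketch using only the reported numbers (variable names are illustrative):

```python
logic_ns = 0.37   # logic delay on the critical path
net_ns = 1.09     # net (routing) delay on the critical path
fmax_mhz = 612    # from 1 / (2.000 - 0.367) ns
bus_bits = 64     # AXI-Stream data width

data_path_ns = logic_ns + net_ns
net_share = net_ns / data_path_ns
throughput_gbps = fmax_mhz * bus_bits / 1e3

print(f"data path {data_path_ns:.3f} ns, net share {net_share:.0%}, "
      f"throughput {throughput_gbps:.1f} Gbps")
```

Roughly three quarters of the data-path delay is routing, which is why reducing fanout is the suggested lever for further frequency gains.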