Low-Latency UDP Market-Data Parser with Runtime Format Tables

April 2026

SystemVerilog · Cocotb · Vivado 2025.2 · Target: Xilinx xcvu9p

612 MHz fmax
+0.37 ns WNS @ 500 MHz
298 LUTs used
0.03% device util.
0 latches inferred
39 Gbps throughput

Overview

This project is inspired by low-latency/HFT-style design constraints.

  • Runtime-configurable AXI-stream Ethernet/IPv4/UDP parser
  • Single-pass payload discrimination and variable-offset price extraction
  • Implemented in SystemVerilog, verified with cocotb/Verilator
  • Meets 500 MHz on Xilinx VU9P with +0.367 ns WNS

The payload format is configured at runtime using a small host-programmable table, where packet discriminators select extraction offsets dynamically. Header checks and format resolution are performed in-stream with no multi-cycle lookup stage. The parser supports AXI-Stream with non-contiguous tvalid (upstream bubbles) and price fields that straddle the beat boundary between byte_counter == 40 and byte_counter == 48.

Beat-by-Beat Pipeline

The AXI-Stream bus is 64 bits wide — 8 bytes per beat. A byte_counter tracks the byte position within the packet, advancing by 8 only on beat_fire = tvalid && tready. This makes the parser transparent to upstream bubbles: idle cycles are simply not counted.
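The counting rule can be sketched as a few lines of Python, mirroring how the cocotb testbench sees the bus (function and variable names here are illustrative, not the project's actual code):

```python
def step_byte_counter(byte_counter: int, tvalid: bool, tready: bool) -> int:
    """Advance the beat-aligned counter only on beat_fire = tvalid && tready."""
    beat_fire = tvalid and tready
    return byte_counter + 8 if beat_fire else byte_counter

# A stream with two bubbles: only the four fired beats advance the counter.
bc = 0
for tvalid in [True, False, True, True, False, True]:
    bc = step_byte_counter(bc, tvalid, tready=True)
# bc == 32 after four fired beats; the two bubbles are invisible
```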

bc=0    Eth dst/src MAC
bc=8    Check EtherType 0x0800
bc=16   Check Protocol 0x11
bc=24   IPv4 src/dst IP
bc=32   Check Dst Port (cfg)
bc=40 ★ Table lookup + buf 42..47
bc=48 ★ Assemble price + trigger

Fail-fast semantics apply: if EtherType, Protocol, or Dst Port checks fail, a drop_packet flag is set and all subsequent parsing for that packet is suppressed. The format table lookup and price extraction are entirely skipped for dropped packets.
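The fail-fast ordering can be modelled as a pure function over beats — a Python sketch assuming the standard Ethernet/IPv4/UDP layout (the byte offsets and constant names below are mine, derived from that layout, not taken from the RTL):

```python
ETHERTYPE_IPV4 = 0x0800
PROTO_UDP = 0x11

def header_check(bc: int, beat: bytes, cfg_dst_port: int, drop: bool) -> bool:
    """Return the updated drop flag for the 8-byte beat at byte_counter == bc.
    Once drop is set it is sticky for the rest of the packet."""
    if drop:
        return True
    if bc == 8:    # EtherType at packet bytes 12..13 -> lanes 4..5 of this beat
        return int.from_bytes(beat[4:6], "big") != ETHERTYPE_IPV4
    if bc == 16:   # IPv4 protocol field at packet byte 23 -> lane 7
        return beat[7] != PROTO_UDP
    if bc == 32:   # UDP destination port at packet bytes 36..37 -> lanes 4..5
        return int.from_bytes(beat[4:6], "big") != cfg_dst_port
    return False
```

A wrong EtherType at bc=8 sets the flag, and every later call returns True without looking at the data — exactly the suppression behaviour described above.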

Verification Strategy

Why two test types are sufficient

The verification strategy separates two independent questions that are often conflated: is the logic correct? and is the design fast enough?

Behavioural simulation (cocotb + Verilator) answers the first question by running the RTL with zero propagation delay. If the design produces correct outputs under all tested stimulus conditions, the logic is sound. Static Timing Analysis (Vivado implementation) answers the second: it exhaustively checks every register-to-register path against the clock constraint, under worst-case Process/Voltage/Temperature conditions.

The combination is equivalent to post-implementation timing simulation and is standard practice for synchronous single-clock-domain FPGA designs. Gate-level simulation with SDF back-annotation adds little value here and would be orders of magnitude slower.

What WNS means

Worst Negative Slack (WNS) is the margin between the clock period and the delay of the slowest combinational path in the design, measured after place-and-route:

WNS = clock_period − max_path_delay − clock_uncertainty

fmax = 1 / (clock_period − WNS)

WNS = +0.367 ns @ 2.000 ns constraint → fmax ≈ 612 MHz

A positive WNS means every path meets timing with margin to spare. Because Vivado's STA already accounts for worst-case PVT corners, a positive WNS at the target frequency is a sufficient timing guarantee for production use.
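Plugging in the reported numbers:

```python
clock_period_ns = 2.000  # 500 MHz constraint
wns_ns = 0.367           # positive setup slack after place-and-route

# fmax = 1 / (clock_period - WNS); 1/ns gives GHz, so scale to MHz.
fmax_mhz = 1e3 / (clock_period_ns - wns_ns)
# fmax_mhz ≈ 612 MHz
```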

Behavioural test suite (cocotb)

Tests are structured in three layers, each targeting a distinct failure mode:

Test                                        What it catches
Wrong EtherType / Protocol / Dst Port       Fail-fast logic — dropped packets must not assert valid_packet or price_updated
No table match                              Header-valid packets with unrecognised discriminator must assert valid_packet but not price_updated
Price crosses beat boundary                 Assembly logic that spans two beats (offsets 45, 46, 47) — the most error-prone path in the design
Bubble injection (50% tvalid gap prob.)     Parser must be transparent to upstream idle cycles; byte_counter must only advance on beat_fire
Deterministic bubble before disc. beat      Gap specifically before bc=40, the most timing-sensitive beat for format lookup
Randomised sweep (100 packets)              Random price_offset ∈ {44,45,46,47}, random disc, random price, random bubble probability — Python reference model computes expected output independently
Key point
The random sweep uses Python as a reference model rather than a hand-written golden vector. The same Python code that generates stimulus also computes the expected price and trigger values from first principles. Any RTL divergence from the specification is caught automatically, including off-by-one errors in byte/lane indexing and incorrect big-endian assembly for specific offset values.
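The core of such a reference model is small. A sketch under the assumptions stated above (big-endian price at a runtime offset; the function name and 4-byte field width here are illustrative):

```python
PRICE_BYTES = 4  # assumed field width for illustration

def expected_price(packet: bytes, price_offset: int) -> int:
    """Compute the expected price from first principles: big-endian bytes
    starting at the configured packet offset."""
    field = packet[price_offset:price_offset + PRICE_BYTES]
    return int.from_bytes(field, "big")

pkt = bytes(range(64))   # synthetic packet with distinct byte values
expected_price(pkt, 45)  # bytes 45..48 assembled big-endian
```

Because the model is computed independently of the RTL's lane indexing, an off-by-one in either the buffer capture or the big-endian assembly shows up as a mismatch on the very first affected packet.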

Design Decisions & Trade-offs

Format table lookup: parallel combinational logic, not a state machine

The format table has four entries. A priority encoder selects the lowest-index matching entry in a single always_comb block:

always_comb begin
    match_found_c     = 1'b0;
    selected_offset_c = '0;
    for (int i = 0; i < TABLE_SIZE; i++) begin
        hit_c[i] = disc_valid_c
                 && table_entry_valid[i]
                 && ((payload_disc & table_mask[i]) == (table_value[i] & table_mask[i]));

        // Priority encode: keep the lowest-index match only.
        if (!match_found_c && hit_c[i]) begin
            match_found_c     = 1'b1;
            selected_offset_c = table_price_offset[i];
        end
    end
end

After synthesis, Vivado maps this priority encoder to three MUXF7 primitives — dedicated 2:1 muxes that combine adjacent LUT6 outputs — so the if-not-found-yet selection across all four entries resolves in a single logic level. This is visible directly in the primitives report.

The alternative — a state machine that checks one entry per cycle — would take up to four cycles to resolve a match, which is unacceptable: the discriminator appears at bc=40, and the price must be assembled at bc=48. There is exactly one beat of margin. The parallel approach uses more LUTs but keeps the latency at zero additional cycles.

The trade-off is that a larger table would lengthen this combinational path and eventually fail timing. TABLE_SIZE=4 is therefore not just a functional choice — it is a timing constraint. If more entries were needed, the lookup would need to be pipelined and started earlier.
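Functionally, the lookup behaves like this Python model (a sketch; the dict fields mirror the mask/value/offset entries described above, and the table contents are made up for illustration):

```python
TABLE_SIZE = 4

def table_lookup(disc: int, table: list) -> tuple:
    """Lowest-index matching entry wins, as in the RTL priority encoder."""
    for entry in table[:TABLE_SIZE]:
        if entry["valid"] and (disc & entry["mask"]) == (entry["value"] & entry["mask"]):
            return True, entry["price_offset"]
    return False, None

table = [
    {"valid": True, "mask": 0xFF, "value": 0x41, "price_offset": 44},
    {"valid": True, "mask": 0xF0, "value": 0x40, "price_offset": 46},
]
table_lookup(0x41, table)  # (True, 44): entry 0 wins even though entry 1 also matches
table_lookup(0x4C, table)  # (True, 46): only the masked entry 1 matches
table_lookup(0x99, table)  # (False, None): header-valid packet, no price update
```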

Unconditional buffering of bytes 42..47

At bc=40, the parser unconditionally captures bytes 42–47 into a six-byte buffer regardless of whether a table match has been found:

// Always capture — do not gate on match_found_c
buf42_47 <= {
    s_axis_tdata[63:56], // byte 47
    s_axis_tdata[55:48], // byte 46
    s_axis_tdata[47:40], // byte 45
    s_axis_tdata[39:32], // byte 44
    s_axis_tdata[31:24], // byte 43
    s_axis_tdata[23:16]  // byte 42
};

The alternative — gating the write enable on match_found_c — would chain the table lookup combinational path directly into the register write enable, lengthening the critical path. Unconditional capture parallelises the two operations at the cost of writing six bytes that are sometimes unused. The area cost is negligible; the timing benefit is real.
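With the buffer always populated, the bc=48 assembly reduces to a pure merge of two sources, sketched here in Python (names are illustrative; byte numbering follows the text above):

```python
def assemble_straddling_price(buf42_47: bytes, beat48_55: bytes, offset: int) -> int:
    """Assemble a 4-byte big-endian price starting at packet offset 44..47:
    bytes below 48 come from the buffered copy captured at bc=40, bytes
    48 and above from the beat arriving at bc=48."""
    view = {42 + i: b for i, b in enumerate(buf42_47)}
    view.update({48 + i: b for i, b in enumerate(beat48_55)})
    return int.from_bytes(bytes(view[offset + k] for k in range(4)), "big")

buf  = bytes(range(42, 48))  # captured unconditionally at bc=40
beat = bytes(range(48, 56))  # arriving at bc=48
assemble_straddling_price(buf, beat, 46)  # bytes 46,47 from buf + 48,49 from beat
```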

Pending register for price output

Price assembly at bc=48 involves a dynamic loop that selects bytes from either buf42_47 or the current beat depending on active_price_offset. Rather than driving the output signals directly from this combinational block, the result is written to a price_pending register and published one cycle later.

The timing report confirms this was necessary: the critical path runs from active_price_offset_reg to price_pending_data_reg/CE with a fanout of 103, consuming 1.460 ns of the 2.000 ns budget. Driving the final outputs directly from this node would push the path over budget.

Fail-fast ordering

Header checks are ordered by byte position: EtherType at bc=8, Protocol at bc=16, Dst Port at bc=32. Once any check fails, drop_packet is asserted and the unique case body is suppressed for all subsequent beats. This means the format table lookup at bc=40 is never entered for non-UDP or wrong-port packets, eliminating any risk of spurious price updates from malformed traffic.

Implementation Results

Metric                   Value                                         Note
Target device            xcvu9p-flgb2104-2-i                           UltraScale+, speed grade −2
Clock constraint         500 MHz (2.000 ns)                            Used here as an aggressive timing target
WNS (setup)              +0.367 ns                                     All 384 endpoints pass
WHS (hold)               +0.057 ns                                     All 384 endpoints pass
fmax                     ≈ 612 MHz                                     1 / (2.000 − 0.367) ns
Throughput @ fmax        ≈ 39 Gbps                                     612 MHz × 8 bytes/beat
LUT                      298 / 1,182,240                               0.03%
FF                       198 / 2,364,480                               0.01%
DSP / BRAM               0 / 0                                         Pure LUT/FF implementation
Latches inferred         0                                             Verified in synthesis report
Critical path            active_price_offset → price_pending_data/CE   Logic 0.37 ns, Net 1.09 ns, Fanout 103
Logic levels (critical)  3                                             Net delay dominates, not logic

The critical path is net-delay dominated (1.09 ns net vs 0.37 ns logic), which is typical for high-fanout control signals on UltraScale+. active_price_offset fans out to 103 endpoints — every bit of the price assembly loop — making it the natural bottleneck. Further frequency improvement would require replicating this register to reduce fanout, at the cost of a small area increase.

Built with SystemVerilog · Verified with cocotb + Verilator · Implemented in Vivado 2025.2 · Target: Xilinx xcvu9p