Emerging Thoughts on Reconfigurable AI Hardware

February 2026

As I continue studying computer architecture and deep-learning acceleration, I have become increasingly interested in how hardware flexibility should be exposed for modern AI workloads. Rather than a finished proposal, what follows is a set of ongoing observations, intuitions, and open-ended directions I find worth exploring.

One recurring impression is that many contemporary AI workloads exhibit a somewhat paradoxical structure. The underlying computational primitives — matrix multiplications, reductions, pooling, sorting, and elementwise transforms — tend to be highly regular and repeatable across models. At the same time, the way these primitives are arranged, connected, and scheduled often varies considerably between architectures, deployment scenarios, and model families.

This raises a broader question about whether current hardware platforms expose flexibility at the most appropriate level.

The Spectrum of Flexibility

Highly reconfigurable fabrics such as FPGAs provide extremely fine-grained configurability at both logic and routing levels. This flexibility is powerful and broadly applicable, yet for structured dataflow workloads it sometimes appears more general than strictly necessary. Configuration bitstreams can be large, compilation flows can be lengthy, and timing closure may become nontrivial when workloads evolve frequently. These observations do not diminish the value of reconfigurable logic, but they do suggest that the granularity of flexibility may not always align perfectly with the structure of modern machine learning pipelines.

At the opposite extreme, fixed-function accelerators achieve remarkable efficiency precisely by reducing flexibility. Dedicated dataflow engines, systolic arrays, and domain-specific accelerators often deliver excellent performance and energy efficiency, but adapting them to new models or evolving operator patterns can require substantial redesign or careful software workarounds.

Between these two ends of the spectrum — maximal flexibility and near-complete specialization — there appears to be a broad design space that is still actively being explored.

Why the Mismatch Feels Pronounced

One dimension worth dwelling on is why this mismatch feels so pronounced for deep learning in particular. FPGAs were originally designed for logic correctness — to implement arbitrary digital circuits at LUT-level granularity. For a workload like neural network inference, where the same handful of operations repeat relentlessly across billions of parameters, that level of generality starts to feel like the wrong unit of abstraction. Every time a dataflow pattern changes, a conventional FPGA flow must still rebuild and reload a large bitstream that re-specifies both the logic and the physical routing fabric, even if the underlying compute did not change at all. The overhead is not a bug in the FPGA design; it is a natural consequence of offering flexibility at a granularity finer than the workload actually requires.

CGRAs attempt to address this by moving reconfigurability up to the word or ALU level, which is a meaningful step. But most CGRAs still preserve fairly general-purpose ALU tiles and configure them using instruction-like contexts. What I find myself wondering is whether, for deep learning specifically, even this level of generality may be more than is needed — and whether it comes at a cost in the form of configuration complexity, routing overhead, and the difficulty of achieving ASIC-like efficiency in the compute fabric itself.

There is also something deeper going on here that I keep thinking about. PE programmability and statically derivable dataflow seem to be in fundamental tension with each other. CGRAs resolve this tension by preserving PE programmability and accepting that the dataflow cannot be fully determined at compile time — the generality of the ALU tiles makes static analysis of the whole system harder. What I find myself drawn to is the opposite resolution: give up PE programmability entirely, and in exchange gain the ability to reason about the entire dataflow statically. Fixed-function PEs have no instruction stream to schedule, no branch behavior to reason about, no data-dependent control flow. That rigidity is a cost — but it is also what makes it possible to prove things about the system at compile time that would otherwise require runtime mechanisms. Whether this tradeoff is actually worth making is something I am still thinking through, but the logic of it feels compelling to me.

Systolic arrays and dedicated dataflow engines sit at the other end. A well-designed systolic array for matrix multiplication is remarkably efficient precisely because its connectivity and execution rhythm are completely fixed. But that same rigidity makes it awkward to support, say, the attention patterns in a Transformer with its varying sequence lengths, or emerging operators that do not map cleanly onto a regular grid flow. The hardware is highly optimized for a specific dataflow, and anything that deviates from that dataflow requires workarounds at the software level.

There is also a subtler limitation in something like a TPU-style systolic array that I find interesting. The effective matrix size is fixed by the hardware — the systolic array has a specific dimension, and models that use smaller or differently shaped matrix multiplications will leave part of the array idle or require padding and tiling that does not perfectly match the hardware geometry. An architecture where the matmul tile is not a fixed physical structure but is instead assembled from smaller PEs through a programmable interconnect could in principle support different effective matrix sizes for different models or layers, which would be a real improvement, at least on paper. Whether this flexibility can be realized without the interconnect becoming a bottleneck — without the programmable NoC eating into the efficiency gains that fixed-function PEs would otherwise provide — is a real engineering question I do not have a clear answer to yet.
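To make the padding cost concrete, here is a back-of-the-envelope sketch. The 128x128 array dimension and the example shapes are illustrative assumptions, not measurements of any particular chip:

```python
# Cost of a fixed-geometry matmul tile: mapping an m x k weight block onto a
# fixed T x T systolic array pads every tile up to the full array size.
from math import ceil

def utilization(m, k, T):
    """Fraction of PE slots doing useful work when tiling m x k onto T x T."""
    tiles = ceil(m / T) * ceil(k / T)
    return (m * k) / (tiles * T * T)

# A 128x128 array is fully used by matching shapes, badly used by small ones.
full  = utilization(128, 128, 128)   # 1.0
small = utilization(96, 40, 128)     # 0.234...: most of the array sits idle
```

The second case is the one the fixed-geometry design cannot escape: a 96x40 block occupies a single 128x128 tile, so roughly three quarters of the array does nothing for that layer.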

What strikes me about this landscape is that none of these approaches seem to be targeting flexibility at the place where deep learning workloads actually vary most.

Communication as the Locus of Variation

One perspective I find intriguing is the possibility that, for many AI workloads, the most dynamic aspect of computation may not be the primitive operations themselves but rather the way data moves between them. If the computational building blocks remain relatively stable — and I think there is a reasonable case that dot products, reductions, pooling, and elementwise transforms cover a substantial fraction of what modern models actually compute — then the variation between models is largely a variation in how those primitives are wired together, staged, and pipelined. In that framing, architectural flexibility focused on communication and interconnect behavior might prove particularly valuable, while compute structures could afford to remain fixed and deeply optimized.

This asymmetry is what I keep returning to as an organizing intuition.

A Speculative Direction: Routing Microcode

A speculative direction I have been thinking about follows fairly directly from this: what if, instead of reconfiguring compute logic, a hardware platform exposed a programmable on-chip network whose routing behavior could be specified at compile time? The compute fabric would consist of fixed-function processing elements — units that implement dot products, reductions, pooling, sorting, and similar primitives, each optimized for its specific function rather than for generality. The variation between workloads would be captured not by changing what the PEs compute, but by changing how they are connected and in what order data flows through them.

Concretely, I have been wondering whether the routing state for such a network might be made surprisingly compact. Each router in the network really only needs to know which direction to forward data for a given dataflow configuration — a small number of bits per router. Summed across a large array, the total state required to describe a complete routing configuration might be on the order of kilobytes, not megabytes. This is several orders of magnitude smaller than a typical FPGA bitstream. And if the routing state is that small, it might be feasible to switch between configurations with very low overhead — perhaps at the granularity of model layers or operators, rather than requiring a full chip reprogramming cycle. One could even imagine storing multiple routing snapshots simultaneously in small SRAM blocks distributed across the chip, so that switching between dataflow modes is just a pointer update rather than a data movement operation.
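As a sanity check on the "kilobytes, not megabytes" intuition, a quick calculation. Every number here (grid size, port count, bits per entry, bitstream size) is an assumption chosen only for illustration:

```python
# Back-of-the-envelope estimate of routing-snapshot size for a 2D mesh.
# All constants are illustrative assumptions, not measurements.

GRID = 64            # assumed 64 x 64 array of routers
PORTS = 5            # N, S, E, W, plus the local PE port
BITS_PER_PORT = 3    # an output direction per input port: ceil(log2(5)) = 3 bits

bits_per_router = PORTS * BITS_PER_PORT        # 15 bits
total_bits = GRID * GRID * bits_per_router     # 61,440 bits for the whole array
total_kib = total_bits / 8 / 1024              # ~7.5 KiB per snapshot

fpga_bitstream_mib = 30                        # large-FPGA bitstream, order of magnitude
ratio = (fpga_bitstream_mib * 1024 * 1024 * 8) / total_bits

print(f"snapshot: {total_kib:.1f} KiB, about {ratio:,.0f}x smaller "
      f"than a {fpga_bitstream_mib} MiB bitstream")
```

Under these assumptions a complete snapshot is around 7.5 KiB, three to four orders of magnitude below a bitstream, which is what makes the "multiple snapshots resident in SRAM" idea plausible at all.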

The notion I find most appealing is treating these routing configurations as a kind of microcode for dataflow. Just as machine code describes what a CPU should compute instruction by instruction, routing microcode would describe how data should move through the PE fabric for a given model or operator phase. Crucially, this microcode would be generated at compile time, from a high-level description of the computation graph, rather than written manually or discovered at runtime. A compiler taking an ONNX graph as input would be responsible for decomposing operators into PE-level primitives, planning the routes that data takes through the network, and emitting a set of routing snapshots — one per dataflow phase — that the hardware can switch between efficiently. The mental model I find helpful is that routing microcode is to dataflow what machine code is to instruction execution: a low-level, hardware-facing representation that hides complexity from the user while remaining close enough to the hardware to be efficient.
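A toy sketch of what the snapshot-emitting step of such a compiler might look like, assuming dimension-ordered (X-then-Y) routing on a 2D mesh. The placement, the operator names, and the data structures are all invented for illustration; this is nothing like a real toolchain, only the shape of the idea:

```python
# Toy "routing microcode" generation: place each primitive on a PE in a mesh,
# then derive deterministic X-then-Y routes from producers to consumers.
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    src: str
    dst: str

def xy_route(a, b):
    """Dimension-ordered path between PE coordinates a and b."""
    (x0, y0), (x1, y1) = a, b
    path = [(x0, y0)]
    x, y = x0, y0
    while x != x1:                       # walk X first
        x += 1 if x1 > x else -1
        path.append((x, y))
    while y != y1:                       # then walk Y
        y += 1 if y1 > y else -1
        path.append((x, y))
    return path

def compile_snapshot(placement, edges):
    """One snapshot = per-router forwarding entries for one dataflow phase."""
    table = {}                           # router coord -> [(flow, next hop)]
    for e in edges:
        path = xy_route(placement[e.src], placement[e.dst])
        for here, nxt in zip(path, path[1:]):
            table.setdefault(here, []).append((f"{e.src}->{e.dst}", nxt))
    return table

# A three-operator phase: matmul feeds bias, bias feeds relu.
placement = {"matmul": (0, 0), "bias": (2, 0), "relu": (2, 2)}
snapshot = compile_snapshot(placement, [Edge("matmul", "bias"),
                                        Edge("bias", "relu")])
```

The point of the sketch is that everything here is static: the forwarding table is fully determined before execution, and a per-phase snapshot is just this table serialized into the routers' configuration bits.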

Predictability and Determinism

There is something appealing about this framing from a predictability standpoint as well. One of the less-discussed costs of runtime-arbitrated networks-on-chip is that they introduce congestion and latency variability that is difficult to reason about statically. If all routing decisions are made at compile time, then communication timing becomes deterministic by construction. The hardware does not need runtime arbitration logic on the hot paths, and the compiler can reason about buffer requirements and pipeline timing without conservative worst-case assumptions about contention. This connects to a broader body of theory around dataflow models of computation that I have been studying: Kahn process networks guarantee deterministic results for any network of communicating processes, while the more restricted synchronous dataflow models — SDF and its cyclo-static extension CSDF — additionally fix the production and consumption rates of each node ahead of time, which makes deadlock freedom and bounded buffer usage decidable at compile time. An architecture built around fixed-function PEs and compile-time routing would seem to sit naturally within this theoretical framework, which I find reassuring from a correctness standpoint. Whether this predictability would translate into meaningful practical benefits — or whether the static scheduling problem would turn out to be intractable for realistically complex models — is something I remain genuinely uncertain about.
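A minimal illustration of why fixed rates buy static guarantees: for a chain of SDF actors, the balance equations can be solved directly for a repetition vector, and the existence of a positive integer solution is what bounds the buffers. (A general SDF graph needs a null-space solve of the topology matrix; this sketch handles only a chain, and the names are mine.)

```python
# Minimal synchronous-dataflow (SDF) consistency check for a chain of actors.
# For an edge where the producer emits p tokens per firing and the consumer
# eats c, the balance equation rate[src] * p == rate[dst] * c must have a
# positive integer solution (the repetition vector) for buffers to stay bounded.
from fractions import Fraction
from math import lcm

def repetition_vector(actors, edges):
    """edges: list of (src, dst, produced_per_firing, consumed_per_firing),
    given in chain order starting from actors[0]."""
    rate = {actors[0]: Fraction(1)}
    for src, dst, p, c in edges:
        rate[dst] = rate[src] * p / c          # solve the balance equation
    denom = lcm(*(r.denominator for r in rate.values()))
    return {a: int(r * denom) for a, r in rate.items()}

# A emits 2 tokens per firing; B consumes 3 per firing.
# Balance: 3 firings of A produce 6 tokens, exactly feeding 2 firings of B.
reps = repetition_vector(["A", "B"], [("A", "B", 2, 3)])
```

One steady-state iteration fires A three times and B twice, after which every buffer returns to its initial occupancy; that periodicity is what the compiler would schedule against.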

One constraint that seems practically important, though I am still working out the implications, is that snapshot switching probably cannot happen at arbitrary moments during execution. If data is mid-flight through the network when the routing configuration changes, the results could be unpredictable. The natural solution is to only switch snapshots at well-defined drain points — moments when the current dataflow phase has completed and the network has been flushed. This is a meaningful restriction on when reconfiguration can occur, but for many workloads it seems reasonable: the natural boundaries between model layers or operator phases are exactly the points where a switch would be semantically meaningful anyway. I am curious whether this constraint turns out to be practically limiting, or whether it aligns well enough with real workload structure that it rarely matters.
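The drain-point rule can be stated almost mechanically. Here is a hypothetical controller that tracks in-flight traffic and treats a switch as a simple pointer update, but only once the count reaches zero; the class and method names are invented for illustration:

```python
# Sketch of the drain-point rule: a controller that refuses to swap the
# active routing snapshot while any flits are still in flight.
class SnapshotController:
    def __init__(self, snapshots):
        self.snapshots = snapshots   # preloaded configs (the SRAM-resident idea)
        self.active = 0              # index of the live snapshot
        self.in_flight = 0           # flits currently traversing the network

    def inject(self):                # data enters the network
        self.in_flight += 1

    def eject(self):                 # data reaches its destination PE
        self.in_flight -= 1

    def try_switch(self, target):
        """Switch only at a drain point; otherwise report failure."""
        if self.in_flight != 0:
            return False             # network not drained: defer the switch
        self.active = target         # just a pointer update, no data movement
        return True

ctl = SnapshotController(["layer0_routes", "layer1_routes"])
ctl.inject()
assert not ctl.try_switch(1)         # mid-flight: refused
ctl.eject()
assert ctl.try_switch(1)             # drained: switch succeeds
```

In hardware the in-flight count would presumably be tracked per region rather than globally, but the invariant is the same: reconfiguration is legal exactly when the affected part of the network is provably empty.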

Scaling and System Integration

The question of how this architecture could scale beyond a single chip is also something I have been thinking about, though only in a preliminary way. One possibility is that inter-chip links could be treated as additional router ports in the routing microcode — no different in principle from intra-chip connections, just with different latency and bandwidth characteristics. A compiler generating routing snapshots for a multi-chip system would plan routes that span chip boundaries in the same way it plans routes across the on-chip PE grid. Whether the latency asymmetry between on-chip and off-chip communication would cause problems for a static schedule is unclear to me, but it seems like a natural direction to explore if single-chip capacity turns out to be insufficient for the models one cares about.
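One attraction of the "extra port" framing is that a static scheduler can still just add up hop latencies along a planned route. A trivial sketch, where both latency constants are pure assumptions:

```python
# Treating an off-chip link as just another port with a larger static latency,
# so a compile-time scheduler can sum hop delays deterministically.
ON_CHIP_HOP_CYCLES = 1
OFF_CHIP_HOP_CYCLES = 40   # assumed serdes + link latency, illustrative only

def route_latency(hops):
    """hops: list of 'on' / 'off' hop kinds along a statically planned route."""
    return sum(OFF_CHIP_HOP_CYCLES if h == "off" else ON_CHIP_HOP_CYCLES
               for h in hops)

# Five on-chip hops, one chip-boundary crossing, three more on-chip hops.
lat = route_latency(["on"] * 5 + ["off"] + ["on"] * 3)   # 5 + 40 + 3 = 48
```

The open question from the text survives the sketch: the arithmetic is easy, but whether a schedule dominated by a few 40-cycle hops still pipelines well is exactly the latency-asymmetry concern.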

Open Questions and Tradeoffs

The question of where compute and communication should separate in the architecture also connects to a broader tension I have been thinking about. Fixing the PE functions too aggressively risks missing important future operator types — if a novel attention variant or a new normalization primitive requires computation that does not decompose naturally into the existing PE family, the architecture would have no clean way to support it. This is a real cost that I want to be honest about. A chip with fixed-function PEs is also a chip that permanently allocates area to those functions regardless of whether a given model uses them. A workload that does not need, say, sorting or top-k operations still has those PE units sitting on the chip, consuming area and potentially power even when idle. On the other hand, making the PEs more general-purpose pushes back toward the CGRA regime, with all the configuration overhead and loss of static analyzability that entails. I do not have a principled answer to where the right boundary is, and I suspect it depends heavily on how stable the core computational vocabulary of deep learning turns out to be over the next several years.

There is also a question about how this kind of fabric would relate to the control logic that manages model execution — loading weights, scheduling batches, handling multi-model deployments. In current systems, there is often a hard boundary between a CPU running complex control code and an accelerator executing pre-defined kernels. The CPU is the orchestrator; the accelerator is the engine. Data exchange between them flows through fixed buses and DMA engines with limited patterns, and the CPU cannot easily participate in fine-grained changes to the accelerator's internal dataflow. One thing I find potentially interesting about a routing-microcode approach is that it might suggest a somewhat different relationship — not "CPU + accelerator" as two separate subsystems, but a single fabric where the control layer manages routing microcode alongside data movement, and can reshape internal communication patterns at a granularity that the traditional host/device model does not easily permit. Whether this tighter coupling would actually be useful in practice, or whether the existing CPU-accelerator boundary is adequate for most real workloads, is another question I am still thinking about. But there is something philosophically appealing about the idea that the boundary between "control" and "compute" might be less fixed than current system architectures tend to assume.

These questions remain open for me, and I expect my thinking about them to evolve with further study and experimentation. For now, they serve mainly as a compass guiding what I read, what problems I pay attention to, and what directions I find intellectually compelling.

Ultimately, I keep returning to a broader curiosity:

Is there a "natural" granularity of reconfigurability for AI hardware — one that preserves efficiency while still accommodating the pace at which models and workloads evolve? And if that granularity turns out to be the interconnect rather than the compute logic, what would a hardware and compiler stack built around that insight actually look like?

I do not have confident answers. But the question feels like it might be pointing somewhere interesting.