Retiming, Explained.

Extreme register retiming is the fundamental optimization enabled by Intel’s HyperFlex architecture (starting with Stratix 10). Retiming is an old concept that’s mentioned in many resources, but a detailed explanation can be tricky to find. Getting maximum performance from a HyperFlex FPGA requires a solid understanding of how retiming works, so I’ll provide a detailed explanation in this article.

Let’s review a few fundamentals of FPGA circuit performance. FPGA designers will roll their eyes at this, I know, but stick with me. The propagation delay of each register-to-register path in a design is the sum of the delays through the logic and interconnect on that path. The maximum frequency of the design (its FMAX) is dictated by the single path with the longest delay (called the critical path), because the clock period cannot be shorter than that delay. So it doesn’t matter if most paths in the design are short and fast: the FMAX of the whole design is ultimately limited by the one path that’s long.
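
To make the critical-path rule concrete, here is a minimal Python sketch. The delay values and the function name fmax_mhz are made up for illustration, and the model is deliberately simplified: it takes the clock period to be exactly the longest path delay and ignores register setup time, clock-to-output delay, and clock skew.

```python
def fmax_mhz(path_delays_ns):
    """FMAX in MHz, taking the clock period to be the longest path delay (ns)."""
    critical_path_ns = max(path_delays_ns)
    return 1000.0 / critical_path_ns


# Hypothetical design: three fast paths and one slow one.
delays_ns = [1.0, 1.2, 4.0, 0.8]
print(fmax_mhz(delays_ns))  # 250.0, set entirely by the 4.0 ns path
```

Speeding up any of the three short paths changes nothing here; only shortening the 4.0 ns path raises the result.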

Let’s now consider a hypothetical, simple FPGA design, consisting of a straight sequence of register-to-register paths (called a chain in Intel FPGA terminology), shown in Figure 1.

Figure 1. Simple FPGA circuit with unequal path delays. [1]

Here is the key concept: the maximum frequency at which this design can operate is achieved when all of the path delays are equal. If that’s not immediately intuitive, start by imagining that all the path delays are indeed equal. Now suppose (for illustrative purposes) that we redistribute the delay slightly, shortening one path to make it faster at the expense of lengthening another by the same amount. The design’s FMAX would decrease, since it is now dictated by this new longest path. For a given design with a given amount of logic and interconnect, if all the path delays are equal, none of them can become smaller without another becoming larger, so the circuit performance cannot increase. This is why we say that the best possible FMAX is achieved when all of the path delays are equal.
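
The redistribution argument can be checked with a few lines of Python. This is only a sketch under the same simplified model (clock period equals the longest path delay); the four-path chain, its 8 ns total delay, and the function names are hypothetical.

```python
def fmax_mhz(path_delays_ns):
    """FMAX in MHz, taking the clock period to be the longest path delay (ns)."""
    return 1000.0 / max(path_delays_ns)


def balanced_fmax_mhz(path_delays_ns):
    """FMAX in MHz if the same total delay were spread evenly over the same paths."""
    average_ns = sum(path_delays_ns) / len(path_delays_ns)
    return 1000.0 / average_ns


# Hypothetical chain: four paths, 8 ns of total delay.
balanced = [2.0, 2.0, 2.0, 2.0]
# Move 0.5 ns from the first path onto the second; the total is unchanged.
skewed = [1.5, 2.5, 2.0, 2.0]

print(fmax_mhz(balanced))         # 500.0
print(fmax_mhz(skewed))           # 400.0
print(balanced_fmax_mhz(skewed))  # 500.0
```

Because the longest delay can never be smaller than the average delay, the balanced figure is an upper bound for any distribution of the same total.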

This concept can be extended beyond the simple ‘chain’ design to a design of any configuration.

Conceptually, in order to achieve balanced path delays, you could try to shuffle around the logic and routing so that an equal amount exists between registers. Or you could approach it backwards and leave the logic and routing in place and just move the registers around to achieve balanced delays, which is what the Hyper-Retimer does.

Now we come to the definition of retiming. Retiming is the practice of relocating registers in an already placed-and-routed design with the goal of balancing path delays between registers. As path delays get closer to being fully balanced, the FMAX of the circuit increases. One by one, the Hyper-Retimer takes the performance-limiting longest paths in the design and attempts to reduce their delay. It does this by moving registers in an effort to balance that delay across the chain of paths to which each path belongs. All register movements are made in such a way that the functionality of the circuit remains exactly the same.
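
Here is a small Python sketch of what retiming a single chain means. It is not the Hyper-Retimer’s algorithm, which I don’t describe here; it is a brute-force toy that tries every register placement. The assumptions are mine: the chain is a list of indivisible delay elements (in picoseconds, with invented values), registers may sit at any boundary between elements, and the number of registers on the chain never changes, which is what keeps the chain’s behavior (its latency in clock cycles) the same.

```python
from itertools import combinations


def stage_delays(elements_ps, cuts):
    """Delay of each register-to-register stage for registers placed at `cuts`."""
    bounds = [0, *cuts, len(elements_ps)]
    return [sum(elements_ps[a:b]) for a, b in zip(bounds, bounds[1:])]


def retime_chain(elements_ps, num_registers):
    """Try every register placement and keep the one with the shortest critical path."""
    best = None
    for cuts in combinations(range(1, len(elements_ps)), num_registers):
        stages = stage_delays(elements_ps, cuts)
        if best is None or max(stages) < max(best[1]):
            best = (cuts, stages)
    return best


# Hypothetical chain of logic and routing delays, in picoseconds (6000 ps total).
elements_ps = [400, 600, 1000, 1000, 500, 500, 1000, 1000]

# Original placement: both internal registers bunched at the start of the chain.
before = stage_delays(elements_ps, (1, 2))
print(before, 1e6 / max(before))  # [400, 600, 5000] 200.0

# Same two registers, relocated to balance the stages.
cuts, after = retime_chain(elements_ps, 2)
print(cuts, after, 1e6 / max(after))  # (3, 6) [2000, 2000, 2000] 500.0
```

The logic and routing are untouched in both cases; only the register positions differ, and the toy FMAX goes from 200 MHz to 500 MHz. Real designs are graphs rather than single chains, so a real retimer also has to keep the register count consistent along every path, which this sketch does not model.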

For many years Quartus has performed some retiming in the early stages of the Fitter, but its effectiveness is limited by the fact that it’s difficult to predict path delays before the design is fully placed and routed. With HyperFlex, Quartus now also performs extreme retiming after place-and-route, which is a very effective time to do so since all of the path delays are known with high accuracy.

Gains in FMAX are achieved even if you don’t achieve exact delay equality, as shown in Figure 2.

Figure 2. Same circuit but with higher FMAX because delays are more closely balanced. [1]

I’ll end with a note on interconnect delay. In today’s 14 nm FPGAs, the delay through a path can easily be dominated by interconnect delay rather than logic delay. Since the goal is to balance overall path delays, it follows that much of this balancing act comes down to balancing interconnect delay. It is therefore very advantageous to be able to relocate registers into the interconnect itself, which is exactly what the HyperFlex architecture allows. As a result, the Hyper-Retimer can make fine-grained register relocation moves in its quest to balance path delays (Figure 3).

Figure 3. Register locations throughout the interconnect. [1]
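
To illustrate why register locations in the interconnect matter, here is a toy Python comparison. Everything in it is a hypothetical model of my own, not a description of any real device: a chain of three logic elements (300 ps each) joined by routing hops (800 ps each), two registers to place, and two sets of legal register positions. The "coarse" set allows registers only at the input or output of a logic element; the "fine" set allows them at every boundary, including between routing hops.

```python
from itertools import combinations


def best_critical_path(elements_ps, num_registers, allowed_cuts):
    """Shortest achievable critical path when registers may sit only at allowed_cuts."""
    best = None
    for cuts in combinations(sorted(allowed_cuts), num_registers):
        bounds = [0, *cuts, len(elements_ps)]
        worst = max(sum(elements_ps[a:b]) for a, b in zip(bounds, bounds[1:]))
        if best is None or worst < best:
            best = worst
    return best


# Hypothetical chain: logic (300 ps) at indices 0, 4, 8; routing hops (800 ps) between.
elements_ps = [300, 800, 800, 800, 300, 800, 800, 800, 300]

# Coarse: registers only next to a logic element (its input or output).
coarse_cuts = {1, 4, 5, 8}
# Fine: registers at any boundary, including inside the routing.
fine_cuts = set(range(1, len(elements_ps)))

coarse = best_critical_path(elements_ps, 2, coarse_cuts)
fine = best_critical_path(elements_ps, 2, fine_cuts)
print(coarse, fine)  # 2700 1900
```

With the coarse positions, the best the two registers can do is a 2700 ps critical path. With positions available inside the routing, the same two registers reach the perfectly balanced 1900 ps.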
References

[1] Hutton, Mike. “Understanding How the New Intel® HyperFlex™ FPGA Architecture Enables Next-Generation High-Performance Systems”.