High Fanout Optimization

We’ve previously discussed why the effect of performance-limiting high fanout signals can go unreported on the HyperFlex Architecture. In this article I’ll explain the tried-and-tested strategy for optimizing such signals and some key points to keep in mind when working with HyperFlex.

The most effective and long-standing method for optimizing a performance-limiting high-fanout signal is to replicate its source register. In a nutshell, replication works because it divides the fanouts among the copies, reducing the fanout of each one. This in turn makes it easier for Quartus to find a high-performance place-and-route solution.

Figure 1. Reduce fanout by replicating the source register.

Intel/Altera posted a nice PDF several years ago explaining a few replication methods and their tradeoffs [1]. In this article I will add a few important points to consider.

Locality

For register duplication to be effective, it is critical that the fanouts be grouped by locality. This means that a given copy should feed a set of fanouts that are placed close together. Otherwise the copies will create a spaghetti mess of routing.

If you are replicating manually in the RTL, how can you predict which fanouts will be placed together? One very good, and intuitive, heuristic is hierarchy. You can simply group together fanouts that are in the same module or section of the design hierarchy, since Quartus is likely to place related logic close together. Remember to use the dont_merge synthesis attribute to prevent your duplicates from being optimized away.
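As a rough illustration of the hierarchy heuristic, here is a minimal Python sketch that buckets the fanouts of one signal by their top-level hierarchy and assigns one duplicate per bucket. The instance paths, the `enable_dup` naming, and the `group_by_hierarchy` helper are all hypothetical; this is a planning aid under those assumptions, not something Quartus provides.

```python
from collections import defaultdict


def group_by_hierarchy(fanout_paths, depth=1):
    """Group fanout instance paths by their first `depth` hierarchy levels."""
    groups = defaultdict(list)
    for path in fanout_paths:
        prefix = "/".join(path.split("/")[:depth])
        groups[prefix].append(path)
    return dict(groups)


# Hypothetical fanout destinations of a single enable signal.
fanouts = [
    "dma/rd_engine/valid_q",
    "dma/wr_engine/valid_q",
    "mac/rx/fifo_wr_q",
    "mac/tx/fifo_rd_q",
    "mac/tx/crc_en_q",
]

groups = group_by_hierarchy(fanouts, depth=1)

# One duplicate of the source register per hierarchy group.
for index, (prefix, members) in enumerate(sorted(groups.items())):
    print(f"enable_dup{index} -> {prefix}: {len(members)} fanouts")
```

Increasing `depth` gives finer groups (and more duplicates) if a single module still holds too many fanouts.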

Doesn’t Quartus Replicate Automatically?

The answer is yes: Quartus will detect high-fanout signals and perform replication, but sometimes not enough. Effective automatic replication depends on two things. First, Quartus needs to know whether a high-fanout signal is going to be on the critical chain; otherwise replication just consumes area unnecessarily. Second, it needs to know where the fanouts have been placed so it can group them by locality.

The challenge is that replication happens early in the compilation flow, before placement and routing have occurred, so Quartus understandably has a tougher time predicting which signals will be critical, where they’ll be placed, and how they’ll be routed. Because it tries to maximize performance at the lowest possible area, you might find it didn’t do enough replication on a signal that eventually became critical.

Nothing is needed to enable automatic replication. Just compile your design and see if a high fanout signal is still limiting performance. If so, you may need to manually replicate or try the MAXFAN strategy (described below).

Keep in mind that while we can complain that Quartus should solve this problem automatically and perfectly, our ultimate job as designers is to get the job done. This article explains how to do that.

MAXFAN on HyperFlex

What about the famous MAXFAN synthesis directive described in [1]? Its name wonderfully implies that you can easily constrain the maximum fanout of a given signal, thereby forcing Quartus to automatically create duplicates. But long-time Quartus users have probably found that it sometimes doesn’t work well.

Why? Because MAXFAN does not take locality into account! It basically groups the fanouts randomly instead of by locality, which can lead to the “spaghetti mess” I mentioned above. I’ll repeat why: replication happens early in the compilation process (in Synthesis), and Quartus does not have good information about placement at that time.
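To make the locality point concrete, here is a toy Python model, not a description of what Quartus actually does. It splits eight hypothetical fanouts into groups of four in two ways: in netlist order (standing in for a placement-blind split) and after sorting by position (a crude stand-in for grouping by locality). The coordinates and the half-perimeter metric are assumptions chosen only for illustration.

```python
def half_perimeter(points):
    """Half-perimeter of the bounding box around a group of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_spread(groups):
    """Sum of bounding-box half-perimeters, a rough proxy for routing length."""
    return sum(half_perimeter(group) for group in groups)


def chunk(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical placed fanouts: two tight clusters far apart on the die.
cluster_a = [(0, 0), (1, 0), (0, 1), (1, 1)]
cluster_b = [(50, 50), (51, 50), (50, 51), (51, 51)]
# Interleave them so netlist order carries no placement information.
fanouts = [point for pair in zip(cluster_a, cluster_b) for point in pair]

max_fanout = 4
blind_spread = total_spread(chunk(fanouts, max_fanout))          # netlist order
local_spread = total_spread(chunk(sorted(fanouts), max_fanout))  # by position

print(blind_spread, local_spread)  # 202 4
```

Both splits respect the same fanout limit, yet each placement-blind copy has to reach across the whole die while each locality-aware copy stays inside one cluster.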

In my experience it’s hard to find applications where MAXFAN is truly a winner, but here’s one scenario where it can be helpful: setting MAXFAN=1 on a pipelined synchronous reset. Reset is often the worst kind of high-fanout signal since its targets are geographically dispersed. But if you can add pipelining on the reset signal (i.e. your design can tolerate some latency on reset), you can specify MAXFAN=1 on the entire pipeline. This results in the high-fanout register being replicated maximally (i.e. every reset destination gets its own dedicated reset source register). Yes, this “kicks the can” upstream in that the first stage of the reset pipeline now presents the same high fanout to its preceding stage. But the replicated pipeline acts as “insulation” between the reset source and its final targets. Quartus can focus on placing the high-fanout first stage close together since it can use the pipelining to reach the final targets. This strategy might sound crazy in terms of area usage, but it is actually a very effective and reasonable approach if the number of fanouts isn’t too large (say, in the few hundreds). Furthermore, if the source register can be retimed, then the duplicates can occupy the abundant Hyper Register locations, making the area penalty tolerable.
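The area cost of this approach is easy to estimate. The Python sketch below is a simplified model of the structure described above, assuming every pipeline stage is replicated once per reset destination. The register names and the `replicated_reset_pipeline` helper are hypothetical, and real results depend on what Quartus actually builds.

```python
def replicated_reset_pipeline(num_destinations, num_stages):
    """Map each register name to its fanout in a fully replicated reset pipeline."""
    # The original reset source still drives every first-stage copy.
    fanout = {"reset_src": num_destinations}
    for dest in range(num_destinations):
        for stage in range(num_stages):
            # Each copy drives exactly one load: the next stage in its own
            # chain, or the final reset destination for the last stage.
            fanout[f"rst_s{stage}_d{dest}"] = 1
    return fanout


fanout = replicated_reset_pipeline(num_destinations=300, num_stages=3)
pipeline_regs = [name for name in fanout if name != "reset_src"]

print(len(pipeline_regs))   # 900 registers added for 300 destinations
print(fanout["reset_src"])  # 300: the high fanout moves to the first stage
```

In this model the register count grows as destinations times stages, which is why the approach stays reasonable only while the number of fanouts is modest.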

Figure 3. Automatic replication when you add pipelining to a synchronous reset and specify MAXFAN=1.

Replication by the Hyper-Retimer

One of the powerful features of the HyperFlex architecture is the ability for the Hyper-Retimer to replicate high-fanout registers after place-and-route. It can retime the source register forward into the interconnect and create copies of it to balance the delays on the potentially long fanout paths. It often replicates so magically that you’re not even aware that you had a performance-limiting high-fanout signal at all: the Hyper-Retimer just took care of it for you.

Tip: if you want to see if the Hyper-Retimer did any replication at all, use the post-Fit Tech-Map viewer to find your signals of interest. This netlist viewer shows registers created by the Hyper-Retimer.

BUT as we’ve said before, since the Hyper-Retimer runs after place-and-route it can’t always compensate for poor placement. What if the design was originally placed so poorly due to a high fanout signal that replication by the Hyper-Retimer isn’t good enough to solve the resulting performance bottleneck? This is a common problem, especially with resets.

Replication by the Hyper-Retimer is an excellent step towards fully automatic management of high-fanout signals, but you may still find yourself using the strategies in this article. Manual replication and MAXFAN=1 ensure that the duplicates are made in Synthesis, which means that they exist in the netlist before the Placer runs. The Placer is then tasked with smaller, more manageable fanouts and has an easier time finding a high-performance placement for all the registers. This is in contrast to the Hyper-Retimer, which is trying to fix up a potentially poorly placed high-fanout situation after the fact.

References
[1] Scoville, Ryan. “Register Duplication for Timing Closure”.