High fanout signals are a classic killer of performance in FPGA acceleration. Why? For the simple reason that when a register fans out to many nodes, it’s difficult for the Placer to find a single location for that register where all of the fanout paths can be short and fast; at least a few may end up long. If these paths end up bottlenecking performance, the effect is usually obvious in TimeQuest, so you’ll be alerted that you need to optimize them. But in the ‘era of retiming’ on the Intel HyperFlex architecture, it’s surprisingly easy for a high fanout signal to become a silent and unreported bottleneck. I’ll explain why in this article.
What follows is a sequence of events that can easily occur. I don’t mean to say that this happens all the time but rather it’s something you must be aware of when doing high performance FPGA design on the HyperFlex architecture.
- First, the high fanout signal causes poor placement, since the Placer is faced with the difficult task of placing all the fanouts (and their related logic) close together; inevitably some of the fanout paths end up being long. It’s important to note that this is an old and fundamental FPGA design challenge.
- What is new, however, is that the Hyper Retimer comes in after place-and-route, sees these long fanout paths, and may successfully retime them just enough that they’re no longer critical (they may still be slow in the grand scheme of things, just no longer the slowest paths in the design).
- The Hyper Retimer then moves on to the next slowest paths in the design, does its best to retime them, but eventually gets stuck due to a retiming restriction, and then quits. The key point is that the critical chain reported by the Hyper Retimer will likely not contain the original high fanout signal, but may contain logic that is a few connections away.
- So as a designer you’re led to believe the reported critical chain is the bottleneck that requires your attention. But the truth is that the reported critical chain is only a slow part of your design because its nodes were placed poorly, due to the tough constraints imposed on the Placer by the original high fanout signal! Meanwhile, that high fanout signal got retimed so that its fanout paths are just slightly faster than whatever was eventually marked as critical, so they were never reported as a bottleneck. The result is that you’re left scratching your head as to why the reported critical chain is as it is, with no idea that the real problem is the high fanout signal.
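The sequence above can be illustrated with a toy model. This is only a sketch to show the masking effect, not how the Hyper Retimer actually works: the path names, delays, the per-path `floor_ns`, the `retimable` flag and the 0.05 ns margin are all made-up assumptions.

```python
from dataclasses import dataclass, replace


@dataclass
class Path:
    name: str
    delay_ns: float   # delay after place-and-route (hypothetical)
    floor_ns: float   # best delay retiming could reach (hypothetical)
    retimable: bool   # False models a retiming restriction


def toy_retime(paths, margin_ns=0.05):
    """Speed up the slowest path 'just enough' until the slowest one is stuck."""
    paths = [replace(p) for p in paths]  # work on copies
    while True:
        paths.sort(key=lambda p: p.delay_ns, reverse=True)
        worst = paths[0]
        if not worst.retimable or worst.delay_ns <= worst.floor_ns:
            return worst, paths  # stuck: this is what gets reported
        runner_up = paths[1].delay_ns if len(paths) > 1 else worst.floor_ns
        # Drop slightly below the next slowest path, never below the floor.
        worst.delay_ns = max(worst.floor_ns, runner_up - margin_ns)


paths = [
    Path("reset_fanout", 4.0, 2.0, True),           # long high fanout path
    Path("logic_near_reset", 3.2, 3.2, False),      # poorly placed, restricted
    Path("unrelated_pipeline", 2.5, 1.5, True),
]

critical, final = toy_retime(paths)
print("reported critical chain:", critical.name)
for p in final:
    print(f"  {p.name}: {p.delay_ns:.2f} ns")
```

In this toy run the reported critical chain is `logic_near_reset`, while `reset_fanout` ends up only slightly faster and so never appears as the bottleneck, even though in the scenario described above it is the reason the neighbouring logic was placed poorly.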
Wouldn’t these high fanout signals show up in TimeQuest? Well, we’ve already talked about why you shouldn’t look at the top failing paths in TimeQuest, but if you did, they may still be absent from the list since they were retimed to be faster than the eventual critical chain.
A typical case where this happens is your synchronous reset signal. Not only does it usually have high fanout but that fanout is geographically dispersed on the chip.
I want to say it again for clarity: high fanout signals have always been a performance bottleneck in FPGAs since they can cause generally poor placement, not only of the fanouts, but of related logic. But what’s new is that the fanout paths can get retimed so they’re not the absolute slowest paths in the design and the related logic can end up being the bottleneck, not because it was necessarily poorly designed, but because it was placed poorly due to that high fanout signal that pulled the placement apart.
So how can you be alerted to this effect? Look at the list of high fanout signals in the Fitter report. It’s hard to develop a good rule of thumb about maximum fanout, but I would suggest having no more than 100 physical fanouts, since it’s hard to push the limits of FPGA performance if you have more than that. The geography of the fanouts is important too: 100 very localized fanouts are not as bad as 100 geographically dispersed fanouts. You need to reduce the fanout as much as possible to achieve high performance, and this is not a new concept. What’s new is that poor placement caused by high fanout signals can go unreported because the retimer may cover it up (ironically, by doing its job properly). Fundamentally, the Hyper Retimer can’t overcome bad placement.
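As a rough sketch of that screening step, the snippet below flags signals above the 100-fanout rule of thumb and ranks them by how dispersed their destinations are. It assumes you have copied the signal names, fanout counts and a few destination coordinates out of the report by hand; the data, the tuple layout and the bounding-box measure of spread are all hypothetical choices for illustration, not a Quartus format or API.

```python
MAX_FANOUT = 100  # rule-of-thumb limit suggested in the text

# Hypothetical rows, transcribed by hand: (signal, fanout, sampled (x, y)
# coordinates of a few destinations). None of this is real report data.
signals = [
    ("sync_reset", 5200, [(2, 3), (180, 10), (95, 210), (10, 190)]),
    ("fifo_enable", 140, [(40, 40), (44, 42), (41, 47)]),
    ("state_valid", 24, [(10, 10), (12, 11)]),
]


def spread(locations):
    """Half-perimeter of the bounding box around the sampled destinations."""
    xs = [x for x, _ in locations]
    ys = [y for _, y in locations]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def screen(rows, max_fanout=MAX_FANOUT):
    """Return (signal, fanout, spread) for rows over the limit, widest first."""
    over = [
        (name, fanout, spread(locs))
        for name, fanout, locs in rows
        if fanout > max_fanout
    ]
    return sorted(over, key=lambda row: row[2], reverse=True)


flagged = screen(signals)
for name, fanout, box in flagged:
    print(f"{name}: fanout={fanout}, spread={box}")
```

Here both `sync_reset` and `fifo_enable` exceed the limit, but the widely dispersed reset lands at the top of the list, which matches the point that localized fanout is the lesser evil.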
In the next article I will explain some effective strategies for optimizing high fanout signals and discuss why Quartus doesn’t automatically solve this problem for us.