Why use FPGAs for Computing?

This paper outlines the reasons why FPGA technology is better suited to high performance computing than competing compute technologies.

FPGAs have been around for over 25 years, and from the beginning people have been using them for computing. Performance across many different algorithms has ranged from 0.1x to 1,000x relative to top-of-the-line processors. At first FPGAs could only beat processors on integer-based algorithms; they started to beat processors on single precision floating point algorithms in the 2001 time frame and on double precision around 2003, all without hardwired floating point support. There are four major reasons to use FPGA technology.

Dark silicon:

Dark silicon is the phenomenon of having to leave part of a chip unpowered because running everything at once would exceed the chip’s power budget.

From the group that coined the term “dark silicon”:

“Our team was among the first to demonstrate the existence of a utilization wall which says that with the progression of Moore’s Law, the percentage of a chip that we can actively use within a chip’s power budget is dropping exponentially! The remaining silicon that must be left unpowered is now referred to as Dark Silicon.”

This means you can’t use all the silicon you want whenever you want it. Processors have this problem; FPGAs don’t. Why?

FPGAs do not have “dark silicon” the way multicore processors do. Multicore designers pushed frequencies so high that there is not enough power available to feed all of the high density cores at once. Multicores also have caches and network-on-chip structures that consume a lot of power, while FPGAs use direct pipelined connections and spend far less power moving data. Think of dark silicon as an extension of the frequency wall: if a device is 1 square centimeter and takes 1 watt, then when you shrink the geometry by half you have that same watt in ¼ of a square centimeter, a 4x increase in power per unit area. This is a fundamental barrier to shrinking traditional compute architectures like von Neumann cores, whether they are multi-core, many-core or SIMD.
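To make that arithmetic concrete, here is a minimal sketch in Python, assuming the design’s total power stays at 1 watt while only its area shrinks:

    # Back-of-the-envelope sketch of the dark silicon arithmetic above:
    # halving the linear geometry quarters the area, so the same total power
    # lands in one quarter of the area, a 4x jump in power density.
    power_watts = 1.0
    linear_scale = 1.0
    for step in range(3):
        area_cm2 = linear_scale ** 2      # area scales with the square of the linear dimension
        density = power_watts / area_cm2  # watts per square centimeter
        print(f"linear scale {linear_scale:.2f}: {area_cm2:.4f} cm^2, {density:.0f} W/cm^2")
        linear_scale /= 2                 # "shrink the geometry by half"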

Central Processing Unit (CPU) based compute devices:

CPUs have very high speed, power hungry compute units. They use a register file and an ALU to implement their data flow graphs. A large portion of the device is dedicated to fetching and decoding instructions, thereby implementing the control flow graph; this subsystem must move billions of instructions per second and generates a lot of heat. CPUs have a hierarchy of memory levels that data has to travel through, and some of the caches are shared among the cores, which causes memory contention.

Graphical Processing Unit (GPU) based compute devices:

GPUs have many small, fast compute units tied to a single instruction sequencing control unit. GPUs also have an instruction decode subsystem that moves billions of instructions per second, generating a lot of heat. GPUs are primarily floating point devices that are best at data parallel algorithms that do the same thing to every data point; branching algorithms do not perform well on them. GPUs have hundreds of cores running at high speed and are the most power hungry devices used in computing today.

Field Programmable Gate Array (FPGA) based devices:

FPGAs have uncommitted logic and routing that gets “personalized” at run time. Because the logic and internal memory are spread out across the die, there is always enough power to feed the parts you configure; FPGAs don’t suffer from the dark silicon bottleneck that CPUs and GPUs do. FPGAs put the burden of instruction generation on the compiler and have no power hungry instruction fetch and decode subsystem. FPGAs are the most power efficient compute devices.

Rent’s Rule:

Rent’s rule describes the relationship between the amount of logic in a partition and the amount of communication into that partition: the number of terminals crossing the partition boundary grows as a power of the number of logic blocks inside it. FPGAs are architected around Rent’s rule; CPUs and GPUs are not. FPGAs are designed to get the data to the logic and so have far more routing and bisection bandwidth than multicore systems such as CPUs and GPUs. You can see how this works in figure 2. The logic cores of CPUs and GPUs are connected to caches through which the data must pass. FPGAs, on the other hand, have thousands of wires coming into a logic partition from all directions. In FPGAs data is managed through hundreds to thousands of multi-ported memories instead of a hierarchical memory with different levels of cache.
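Rent’s rule is usually written T = t * g^p, where T is the number of terminals (wires) crossing a partition boundary, g is the number of logic blocks inside the partition, t is the average number of terminals per block and p is the Rent exponent. The sketch below shows how the terminal count grows with partition size; the constants are illustrative assumptions, not measurements of any particular device:

    # Rent's rule: T = t * g**p, wires crossing the boundary of a partition of g blocks.
    # The constants t and p below are illustrative assumptions, not measured values.
    def rent_terminals(g, t=4.0, p=0.7):
        """Estimate the number of terminals entering a partition of g logic blocks."""
        return t * g ** p

    for g in (64, 1024, 16384):
        print(f"{g:6d} blocks -> ~{rent_terminals(g):,.0f} terminals")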

Because FPGAs have many more wires than CPUs or GPUs, the overall bisection bandwidth of an FPGA can’t be beat. If you make the partition the whole chip, you see that FPGAs have multiple PCIe buses as well as 400 Gbit transceiver capability, something CPUs and GPUs don’t have.

Result Reuse:

“First defined two decades ago, the memory wall remains a fundamental limitation to system performance. Recent innovations in 3D-stacking technology enable DRAM devices with much higher bandwidths than traditional DIMMs. The first such products will soon hit the market, and some of the publicity claims that they will break through the memory wall. Here we summarize our analysis and expectations of how such 3D-stacked DRAMs will affect the memory wall for a set of representative HPC applications. We conclude that although 3D-stacked DRAM is a major technological innovation, it cannot eliminate the memory wall.” [1]

FPGAs attack the memory wall from a different direction. Because the fabric is a pipelined dataflow machine, an intermediate result can be handed straight to the logic that consumes it and reused in place, instead of being written out to a cache or DRAM and read back in, so far less of the computation ever touches the memory hierarchy.

 

Circuit Specialization:

FPGAs can be specialized even more than application specific integrated circuits (ASICs). This is because when you make an ASIC you have to cover all the bases and make sure it can do everything you might ever want it to do. For example, consider DNA sequencing. A sequence of base pairs is compared against one or more databases to identify it or to see how close it comes to a known sequence. Each comparison produces a “how close am I?” score, and the highest score is the best matched sequence. An ASIC has to keep its compare logic general enough to handle any query, while an FPGA can be reconfigured with the query sequence folded directly into the circuit, specializing the logic for that one search.
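As a rough illustration of the kind of scoring being described, here is a toy sketch in Python; the position-by-position match count stands in for the real scoring schemes used in sequence comparison, and the sequences and names are made up for the example:

    # Toy "how close am I?" scoring against a small database.
    # A simple position-by-position match count stands in for a real scoring scheme.
    def similarity_score(query, candidate):
        """Count positions where the two sequences have the same base."""
        return sum(1 for q, c in zip(query, candidate) if q == c)

    database = {"seq_a": "GATTACAGG", "seq_b": "GATTCCAGG", "seq_c": "TTTTACAGG"}
    query = "GATTACAGG"

    best = max(database, key=lambda name: similarity_score(query, database[name]))
    print(best, similarity_score(query, database[best]))  # the highest score is the best match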

Raw performance:

FPGAs have always been better at integer performance than other technologies. FPGA performance also grows faster than Moore’s law predicts: Moore’s law has CPU performance doubling roughly every 18 months, while FPGAs are on a path to grow performance by an order of magnitude (10x) every 4 years on average. These performance increases arrive every other year or so, when FPGAs move to new process nodes. FPGAs have a great deal of untapped architectural potential, and this combined with Moore’s law gives increases in performance greater than Moore’s law alone.
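For comparison, here is the arithmetic behind those two growth curves, using only the figures quoted above (doubling every 18 months versus 10x every 4 years):

    # Compare the two growth rates quoted above: CPU performance doubling every
    # 18 months versus FPGA performance growing 10x every 4 years.
    for years in (2, 4, 8):
        cpu_gain = 2 ** (years * 12 / 18)   # doubling every 18 months
        fpga_gain = 10 ** (years / 4)       # order of magnitude every 4 years
        print(f"{years} years: CPU ~{cpu_gain:.1f}x, FPGA ~{fpga_gain:.1f}x")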

The current node is 20 nm; the next node, due in 2016, is 14 nm. While the performance increases have come at different times, usually when devices move to new process nodes, the trend is clear: FPGAs are poised to become the performance leader for all algorithms. This is even clearer when you realize that GPUs can no longer take advantage of geometry shrinks without addressing the dark silicon problem.

 

Performance per watt:

It would take 100,000 devices rated at 1 Teraflops each to get to 100 Petaflops. With a system requirement of only 20 Megawatts, that means each node has a power budget of 200 watts.

The fastest CPU cores deliver about 100 Gigaflops, so it would take 1,000,000 cores to get to 100 Petaflops. At 20 watts per core that is 20 Megawatts for the cores alone, leaving nothing for anything else; CPUs by themselves are no longer in the running for Exascale computer systems.

The fastest GPU is the newly released Titan, which delivers 1.3 Teraflops and consumes 210 watts. To get to 100 Petaflops you would need about 77,000 GPUs, and it would take roughly 16 Megawatts just to power them.

The Stratix 10 will have 10 Teraflops of single precision performance and an estimated 4 Teraflops of double precision. Reaching 100 Petaflops would take 25,000 parts. If we give each compute node a 250 watt budget, the compute infrastructure takes 6.25 Megawatts, and that includes the communication infrastructure. That still leaves 13.75 Megawatts for storage and all other required functions.
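The node counts and power totals above follow directly from the per-device numbers quoted in this section; here is that arithmetic as a short sketch (the performance and wattage figures are the ones given above, not independent measurements):

    # Node count and power arithmetic for a 100 Petaflops machine, using the
    # per-device figures quoted in this section (not independent measurements).
    TARGET_TFLOPS = 100_000  # 100 Petaflops expressed in Teraflops

    devices = {
        # name: (Teraflops per device, watts per node)
        "CPU":        (0.1, 20),
        "GPU Titan":  (1.3, 210),
        "Stratix 10": (4.0, 250),  # estimated double precision figure
    }

    for name, (tflops, watts) in devices.items():
        count = TARGET_TFLOPS / tflops
        megawatts = count * watts / 1_000_000
        print(f"{name:10s}: {count:>9,.0f} nodes, {megawatts:5.2f} MW")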

Large supercomputer systems do more than just compute. They need communication between nodes and they need large storage subsystems. CPUs have those capabilities, but as we have seen, CPUs alone cannot power an Exascale computer system. GPUs will have to be coupled with CPUs to incorporate all the functionality needed to run an Exascale system, which adds to their already tight power budget, not to mention the routers and storage systems.

The Stratix 10 will have a quad-core 64-bit high performance CPU subsystem. The four 64-bit ARM cores can perform all of the functions that are needed other than high performance computation and communication. In current high performance systems it takes several of a multicore CPU’s cores just to run the OS, handle power management and perform other housekeeping tasks.

Large supercomputers need a great deal of communication. CPUs don’t have this kind of capability built in; they rely on expensive I/O cards and routers to carry information from node to node and to the disk subsystem.

GPUs are no better off than CPUs when it comes to communicating with other compute nodes or with the disk I/O subsystem.

The Stratix 10 will have 100G/400G capability built directly into the device. This is incredibly important for a well architected supercomputer. Imagine a 1U 19 inch rack board with 8 nodes on it. Each Stratix 10 node could have 100G coming off it to route to other boards and still have an enormous amount of bandwidth left for inter-node communication.

Conclusion:

Computing is at a crossroads. On one side are computer architectures that have been around for 50 or more years: the von Neumann architecture of multi-core CPUs and the SIMD architecture of GPUs are facing insurmountable limits of physics that will not carry us into the future. Only FPGA SoC based devices that combine CPUs, FPGA acceleration and communications in one package have a chance of delivering systems capable of Exascale performance and beyond.

About the Author:

Steve Casselman is CEO of HotWright Inc. In 1987 he won his first SBIR contract, to build an FPGA based supercomputer. Steve has 14 issued patents, including patents on putting an FPGA in the processor socket and on combining FPGAs and CPUs on a single die.