Comparing Hardware for Artificial Intelligence: FPGAs vs. GPUs vs. ASICs

July 24th, 2018

By Lynnette Reese, Editor-in-Chief, Embedded Intel Solutions

Compare the general pros and cons of hardware commonly used for AI applications. FPGAs are looking good: a free command-line tool makes FPGAs that much easier to work with and dramatically lowers the size of an inference model, all in one go.

The Artificial Intelligence (AI) deep learning chipset market is forecast to reach $66.3 billion by 2025[i]. Although Graphics Processing Units (GPUs) and Central Processing Units (CPUs) lead in AI sockets today, Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), and System-on-Chip (SoC) accelerators are also part of the expanding AI hardware market. Deep Neural Networks (DNNs), a subset of machine learning, are all about completing repetitive math algorithms or functions on a massive scale at blazing speeds. CPUs, GPUs, ASICs, FPGAs, and even SoCs each bring their own advantages and design trade-offs to that task.

Latency Comparison
FPGAs offer lower latency than GPUs or CPUs. All else being equal, FPGAs and ASICs are faster than GPUs and CPUs because they run on “bare metal,” as the saying goes: the algorithm is implemented directly in logic, with no Operating System (OS) or software stack between the data and the hardware. The same operation can be completed either by dedicated logic or by a software program, but dedicated logic avoids the overhead of fetching and scheduling instructions. That lower latency makes ASICs and FPGAs the better fit for applications that require real-time AI.

Figure 1: Panch Chandrasekaran is the FPGA Portfolio Marketing Director at Intel’s Programmable Solutions Group where he leads outbound marketing activities for Intel PSG products. Chandrasekaran has a Master’s in Electrical Engineering from University of Central Florida and an MBA from University of California, Haas School of Business.

Panch Chandrasekaran, FPGA Marketing Director, Programmable Solutions Group at Intel, places an emphasis on latency, where FPGAs outperform GPUs. “For applications that require real-time performance, like streaming video object detection and identification, for example, latency matters. For example, if an application has to perform image classification in an automotive setting, or if there’s a large venue where stress profiling is the goal, latency can be critical. For real-time, low latency requirements, FPGAs are the more suitable strategy. GPUs, being software in nature, can accommodate changes, but when the application is also performance-, power-, and latency-critical, FPGAs really shine versus GPUs.”

Power Comparison
Another area where FPGAs outperform GPUs (and CPUs) is in applications with a constrained power envelope. Running an algorithm on bare metal takes less power.

Flexibility: FPGAs vs. ASICs
FPGAs are similar to ASICs, with two notable differences: FPGAs are notoriously difficult to program, and ASICs have a typical production cycle time of 12 – 18 months. Changing a design on an ASIC therefore takes much longer, whereas a design change on an FPGA requires only reprogramming, which can take anywhere from several hours to several weeks. FPGAs have also been steadily growing more competitive with ASICs on price.

FPGAs are superior in terms of flexibility and are proving especially useful in rapidly growing and changing AI applications. Neural networks can improve significantly over the course of months. For instance, their architectures, also referred to as topologies, can undergo changes. As more or different data comes in, companies want the ability to retrain or tune neural nets as their applications develop. For cases requiring maximum flexibility, or where neural networks are still evolving, FPGAs make sense.

Figure 2: Tony Kau is the Software and AI Marketing Director in charge of Product Line Management and Marketing of PSG’s Artificial Intelligence and Workload Acceleration SW products with Intel. Kau earned a bachelor’s and master’s degree in electrical engineering from the University of Southern California and an MBA from INSEAD.

Tony Kau, Marketing Director, Artificial Intelligence, Software and IP solutions at Intel, states, “So much is changing in the wildly evolving space of AI. There are many hundreds of existing topologies for various industries and use cases and the [AI] industry is constantly adding new ones. The neural networks vary in precision from 32-bit to binary. And FPGAs can accommodate all of that.”

Kau goes on to add, “To give an idea of how fast AI is evolving, consider that the original GoogleNet and ResNet versions were created back in 2015. There were some follow-on versions of both networks, but these topologies are now considered “old” and almost “classical.” We currently have 150+ different topologies in our benchmark. FPGAs allow data centers to process workloads on a hyper-scale for real-time AI. For example, a system of FPGAs can run 8 billion calculations that are required for ResNet 50 (an industry-standard DNN) without having to batch or queue loads.”[ii]
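To see why avoiding batching matters for real-time work, consider the rough arithmetic below. It is only a sketch under stated assumptions: the function names, the 40 ms frame interval, and the 10 ms compute time are hypothetical values chosen for illustration, not measurements of any FPGA or GPU.

```python
def first_item_wait_ms(batch_size: int, arrival_interval_ms: float) -> float:
    """Time the first input waits while the rest of its batch arrives.

    Assumes inputs arrive at a steady interval and that processing starts
    only once the batch is full (a simplifying assumption).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return (batch_size - 1) * arrival_interval_ms


def worst_case_latency_ms(batch_size: int,
                          arrival_interval_ms: float,
                          compute_ms: float) -> float:
    """Queueing delay for the first input plus the compute time of the batch."""
    return first_item_wait_ms(batch_size, arrival_interval_ms) + compute_ms


if __name__ == "__main__":
    # Hypothetical camera delivering one frame every 40 ms (25 frames/second).
    for batch in (1, 8):
        latency = worst_case_latency_ms(batch, 40.0, 10.0)
        print(f"batch size {batch}: worst-case latency {latency} ms")
```

With a batch size of one, the only delay is the compute time itself; with a batch of eight, the first frame sits idle for seven frame intervals before any work begins.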

Parallel Computing
DNNs can employ parallel computing on a massive scale. But parallel computing has introduced execution complexities as programs running through one of several pipelines must be coordinated across cores. Computational hardware imbalances can occur if irregular parallelisms evolve. FPGAs are also better than GPUs wherever custom data types exist or irregular parallelism tends to develop.

Both GPUs and FPGAs can process in parallel on a massive scale, as can CPUs and ASICs (some better than others). However, FPGAs surpass GPUs in parallel-processing efficiency. The analogy of a bottling factory helps to compare FPGAs and GPUs on the concept of parallelism. Imagine that a soda bottling factory has a three-step process of filling a bottle, capping it, and then labeling it. You could do one bottle at a time in a long series, but to dramatically expand capacity you would process rows of several bottles in parallel.

Figure 3: Example of parallel computing, where several payrolls are processed concurrently on separate cores. Only programs and problems that are suitable for processing in parallel will complete faster than if accomplished serially. (Source:

The bottle factory as run by a GPU might fill a row of one hundred million bottles every clock cycle. Then the whole row would get capped, then labelled, before the next massive row of bottles is advanced to be filled, capped, and labelled. An FPGA would also process in parallel on a massive scale, but several steps are performed at each cycle. The FPGA does not waste bandwidth in its one hundred million-wide (10^8) pipeline of bottles. By the time the first row of 10^8 bottles is getting labeled, the second row of 10^8 bottles is being capped, and the third row in the FPGA bottling factory is filling bottles at the same time. The 10^8-wide FPGA pipeline is fully utilized. GPUs can only do one row at a time before moving on to the next row of bottles to perform the same set of repetitive operations. Yes, both FPGAs and GPUs can operate in parallel on a massive scale, but FPGAs are inherently more efficient (with a full pipeline), faster (running an algorithm on bare metal), and are more flexible than GPUs in terms of architecture and programming changes.
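The bottling analogy can be put into numbers with a short sketch. It models only the analogy as described here (one cycle per step, a row-at-a-time factory versus a pipelined one) and is not a model of real GPU or FPGA hardware; the function names and the stage names "fill", "cap", and "label" are illustrative assumptions.

```python
from typing import Dict, List


def cycles_row_at_a_time(rows: int, stages: int) -> int:
    """Each row passes through every stage before the next row starts."""
    return rows * stages


def cycles_pipelined(rows: int, stages: int) -> int:
    """A new row enters every cycle; the last row needs `stages` cycles to drain."""
    if rows == 0:
        return 0
    return stages + (rows - 1)


def pipeline_schedule(rows: int, stages: List[str]) -> List[Dict[str, int]]:
    """For each cycle, map each busy stage to the (1-based) row it is working on."""
    schedule = []
    for cycle in range(cycles_pipelined(rows, len(stages))):
        active = {}
        for position, name in enumerate(stages):
            row = cycle - position  # row index currently at this stage
            if 0 <= row < rows:
                active[name] = row + 1
        schedule.append(active)
    return schedule


if __name__ == "__main__":
    steps = ["fill", "cap", "label"]
    for cycle, active in enumerate(pipeline_schedule(3, steps), start=1):
        print(f"cycle {cycle}: {active}")
    print("row at a time:", cycles_row_at_a_time(1000, 3), "cycles")
    print("pipelined:    ", cycles_pipelined(1000, 3), "cycles")
```

In the third cycle of the pipelined schedule all three stages are busy at once, which is the "full pipeline" of the analogy. For a long run of rows the pipelined factory approaches one finished row per cycle, while the row-at-a-time factory needs one cycle per step for every row.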

Flexibility: FPGAs vs. GPUs
FPGAs can be programmed to add different steps or outputs altogether, allowing growth beyond what existing GPUs support without a physical change to the hardware architecture. If the bottles need to be cleaned before they are filled, the FPGA can be programmed to add that step. FPGAs go even further in flexibility, however. Assume that the owner of the bottling factory wants to make bicycles as well. The factory running on the FPGA needs to be re-programmed to add bicycle-making steps, and it can be running within hours or weeks, depending on the complexity of the bicycle. The GPU-run factory would need additional GPUs as well as programming. The ASIC-run factory would not be able to add bicycles to the manufacturing line for 12 – 18 months but would require little programming upon release.

As Kau points out, “FPGAs are massively programmable in a way that enables different outputs. And with GPUs you can only do one thing at a time. That is why there have been talks within the industry that a GPU is great for looking at training data or unstructured data. But once the data is trained, the GPU goes through a very deterministic inference model, and this is where the FPGA brings tremendous value.” The inference model is the resulting model from all that training and what AI uses to make decisions once it’s up and running.

Intel introduced the Intel Programmable Acceleration Card (PAC) with Intel Stratix 10 SX FPGA in September 2018. The card leverages the Acceleration Stack for Intel Xeon CPU with FPGAs, providing data center developers a robust platform to deploy FPGA-based accelerated workloads. (Credit: Intel Corporation)

Analogies aside, FPGAs can be customized without redesigning, fabricating, packaging, testing, and shipping a new chip. With FPGAs, engineers can add networking, pre- or post-processing, or other customizations that have nothing to do with AI. The GPU factory might be able to change one bottle cap for another of a different color, but the architecture of GPUs is simply not as flexible as that of FPGAs when it comes to changing an existing system.

Historically, FPGAs have been complicated to program, with a steeper learning curve than traditional programming. However, Intel® has recently released a development tool that allows effective execution of a neural network model from several deep learning training frameworks on any Intel AI hardware engine, including FPGAs. Intel’s free Open Visual Inference & Neural Network Optimization (OpenVINO™) toolkit can convert and optimize a TensorFlow™, MXNet, or Caffe model for use with any of Intel’s standard hardware targets and accelerators. OpenVINO creates instant portability, allows the addition of custom C++ and OpenCL deep learning layers via an extension mechanism, and provides a means of quantizing models. For instance, converting a model from 32-bit floating point to 16-bit floating point lowers the memory the model requires and thus improves power efficiency. Models don’t need re-training, and the Application Programming Interface (API) of OpenVINO is the same across all platforms. Developers can execute the same DNN model across several Intel targets and accelerators (e.g., CPU, CPU with integrated graphics, Movidius, and FPGA) by converting with OpenVINO, experimenting for the best fit in cost and performance on actual hardware.
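The memory saving from moving 32-bit floating-point weights to 16-bit can be illustrated with plain Python. This sketch does not use or imitate the OpenVINO API; it relies only on the standard `struct` module, and the 25-million-weight model size is a hypothetical figure chosen for round numbers.

```python
import struct

# Storage size of one value in each format, as reported by the struct module.
BYTES_FP32 = struct.calcsize("<f")  # IEEE 754 single precision
BYTES_FP16 = struct.calcsize("<e")  # IEEE 754 half precision


def weight_bytes(num_weights: int, bytes_per_weight: int) -> int:
    """Memory needed to store the weights alone (no activations or metadata)."""
    return num_weights * bytes_per_weight


def to_fp16_and_back(value: float) -> float:
    """Round a value to half precision, showing what the narrower format keeps."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


if __name__ == "__main__":
    num_weights = 25_000_000  # hypothetical model size
    fp32 = weight_bytes(num_weights, BYTES_FP32)
    fp16 = weight_bytes(num_weights, BYTES_FP16)
    print(f"FP32 weights: {fp32 / 1e6:.0f} MB")
    print(f"FP16 weights: {fp16 / 1e6:.0f} MB")
    print("0.1 stored as FP16 reads back as", to_fp16_and_back(0.1))
```

Halving the bytes per weight halves the storage for the weights, at the cost of a small rounding error in each value. Whether that error is acceptable depends on the model, which is why testing the converted model on actual hardware is worthwhile.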

One of the most important features of the OpenVINO toolkit is a “model zoo,” which contains free, public, optimized models. Developers can use these models for rapid prototyping and to expedite development and production of applications without having to search for or train their own models.

AI opens an untapped frontier for technology to solve problems, provide extreme productivity boosts, entertain us, and so much more. As AI rapidly grows and changes, FPGAs offer flexibility, performance, and extensibility in comparison to other accelerators. Although historically complex to program, FPGAs are carving out their own space in AI technology, with new tools that make programming AI applications that much easier.

[i] Tractica. (2018, May/June). Deep Learning Chipsets. Retrieved June 26, 2018, from

[ii] Reese, L., Embedded Systems Engineering, & Embedded Intel Solutions. (2017, October). Where FPGAs Surpass GPUs. Embedded Systems Engineering Magazine. Retrieved June 26, 2018, from

Lynnette Reese is Editor-in-Chief, Embedded Intel Solutions and Embedded Systems Engineering, and has been working in various roles as an electrical engineer for over two decades. She is interested in open source software and hardware, the maker movement, and in increasing the number of women working in STEM so she has a greater chance of talking about something other than football at the water cooler.