FPGAs are becoming more and more popular these days, and everyone is talking about how to utilize them in real-world applications, in either public or private clouds. But you may ask, why? Before diving into the pros and cons of this new member of the family, let me talk a little bit about what an FPGA is and how it became so popular in the software community.
FPGA and its history
So what exactly is an FPGA? FPGA stands for Field-Programmable Gate Array. These are fancy chips that can be programmed for one specific purpose, unlike CPUs and GPUs. In other words, CPUs and GPUs are mapped into hardware once, and are then used by feeding them the instructions of a specific application. Logic blocks in an FPGA, by contrast, can be configured to perform complex combinational functions directly. The concept of programmable logic devices existed for a while before the introduction of FPGAs: in the old days, people used PROMs and PLDs. Unfortunately, these devices were not re-programmable and could only be programmed once; custom chips had to be manufactured in a factory in large quantities. Imagine: you could only develop your design and hand it to a company in China to build the physical device. Well, that's not really convenient. As a result, researchers and engineers developed FPGAs, which can be reprogrammed in-house. The first commercial FPGAs were introduced between 1983 and 1985 by Altera and Xilinx, which remained the major players in the FPGA industry for many years.
FPGAs have several configurable elements, each serving a specific purpose. Today's FPGAs also include complementary elements for common operations, but let us first look at the basic required elements of an FPGA (shown in Figure 1):
- Configurable Logic Blocks (CLBs): The core array of configurable logic blocks that perform user-specified logic functions. The interconnections available in the device carry signals between these CLBs.
- Input/Output (IO) Blocks: These provide a programmable interface between the internal logic of the FPGA and external components.
Modern FPGAs have other complementary pieces, such as DSP (digital signal processing) slices and block RAMs (BRAMs). For the sake of this article, we will not go into the details of all the complex features available in FPGAs.
Modern FPGAs are known for their low power consumption and programmability. Imagine you have one specific, complicated function to implement. Each invocation of this function may require hundreds of instructions to be executed on a GPU or CPU, while it can be done in one clock cycle on an FPGA. FPGAs do not provide the same clock frequency as other types of processors: an Intel Xeon CPU can operate at up to 4 GHz, and an Nvidia Titan Xp can go up to about 1.5 GHz, while an efficient FPGA implementation typically reaches at most 500 MHz. Despite this big frequency gap, FPGAs can still beat other processors in many applications.
Applications of FPGA
FPGAs are being widely adopted across industries. A lot of electrical equipment, such as medical devices, security cameras, and automobiles, is equipped with FPGAs. For example, the Airbus A380 contains more than 1000 FPGA chips.
FPGAs have been broadly adopted by different industries because they reduce both the delivery time and the total cost of a project. Hardware companies do not need to spend tons of money and wait a long time for chip manufacturers to receive their designs and produce ASICs. There are still many companies shipping their designs as ASICs, due to their lowest power consumption and highest possible performance. For example, Apple uses dedicated silicon for motion processing and for its new technology, the Neural Engine.
With the rise of machine learning and AI applications, the research community and industry are getting more and more interested in accelerators other than the CPU. This is due to the heavy processing requirements of such applications: training a deep neural network may take hours or even days, and even inference in DNNs may take seconds. There are many other algorithms widely used in different applications, with completely different behavior but similar computational intensity. One available option is the GPU. GPUs are widely adopted in many applications, and there are tons of libraries for mathematical operations on them, such as BLAS and DNN libraries. As a result, they are supported by many available simulation and AI frameworks. Unfortunately, GPUs are power hungry and generate a considerable amount of heat. Considering the cumulative amount of generated heat and consumed power, they are a huge burden on data centers from a total-cost-of-operation perspective. As a result, there is a strong drive toward power-efficient processors. Interestingly, FPGAs meet this constraint: they operate at low power, which reduces both overall power consumption and heat generation. But how can these fancy devices be used in practice?
In order to program logic on an FPGA, one traditionally needs to code it in either Verilog or VHDL, which are hardware description languages. There are other approaches to designing hardware as well: for example, it is common today to write the logic in C and let a high-level synthesis compiler generate the equivalent Verilog code, which is then synthesized for the target hardware. Recently, with the introduction of PCIe form-factor FPGA cards from both Xilinx and Intel (yes, Intel has acquired Altera and is in the FPGA business now :D), one can buy these cards and attach them to the motherboard, just like GPUs. Furthermore, a parallel programming language can be used to write and deploy applications onto the card. Both Xilinx and Intel offer OpenCL support on their boards. OpenCL is a standard, first introduced by Apple and later adopted by the Khronos Group. Different vendors support OpenCL, such as Nvidia on its GPUs and Intel on its CPUs. OpenCL is supposed to be generic enough for all kinds of accelerators. So one can write OpenCL code and use the vendor compiler to generate the Verilog code, which is then synthesized and mapped onto the card along with all the necessary blocks, such as the PCIe driver, the memory transfer block, etc. The host code can then use the OpenCL API to initiate and execute the OpenCL kernels on the FPGA.
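To make this concrete, here is a minimal plain-C sketch of the kind of logic one would express as an OpenCL kernel. In actual OpenCL the loop disappears: each iteration becomes one work-item, the index comes from `get_global_id(0)`, and the FPGA compiler turns the kernel body into a deep hardware pipeline rather than parallel threads. The function name and harness here are illustrative, not a real API.

```c
#include <assert.h>

/* Plain-C sketch of a vector-add, the "hello world" of OpenCL kernels.
 * In OpenCL C this body would be a __kernel function; each loop
 * iteration would be one work-item, with i = get_global_id(0). */
void vector_add(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {  /* one work-item per element in OpenCL */
        c[i] = a[i] + b[i];
    }
}
```

On an FPGA, the compiler would unroll and pipeline this loop so that, once the pipeline is full, one result is produced per clock cycle.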
Working with OpenCL on an FPGA, almost everything is the same as on other devices. There is one extra feature, called channels, which exists only in the Intel FPGA OpenCL SDK. Channels are similar to OpenCL pipes, but seem to provide better performance. As a result, one can hope to take generic OpenCL code and execute it, as is, on an FPGA. Nevertheless, due to the architectural differences between GPUs and FPGAs, optimizations are required to prepare OpenCL GPU code for an FPGA. The same parallel code that performs well on a GPU may need to be rewritten sequentially and then pipelined to achieve reasonable execution time on an FPGA.
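The idea behind channels can be illustrated with a toy model in plain C: a small FIFO wired directly between two "kernels", so intermediate data never touches off-chip DRAM. The `chan_write`/`chan_read` names below are hypothetical stand-ins; the real Intel SDK uses `write_channel_intel()` and `read_channel_intel()` inside kernel code.

```c
#include <assert.h>

/* Toy model of an Intel FPGA OpenCL "channel": a FIFO connecting a
 * producer kernel to a consumer kernel on-chip. Illustrative only. */
#define CHAN_DEPTH 16
static int chan_buf[CHAN_DEPTH];
static int chan_head = 0, chan_tail = 0;

static void chan_write(int v) { chan_buf[chan_tail++ % CHAN_DEPTH] = v; }
static int  chan_read(void)   { return chan_buf[chan_head++ % CHAN_DEPTH]; }

/* "Producer kernel": squares each input and pushes it into the channel. */
void producer(const int *in, int n) {
    for (int i = 0; i < n; i++) chan_write(in[i] * in[i]);
}

/* "Consumer kernel": drains the channel and accumulates a sum. */
int consumer(int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) sum += chan_read();
    return sum;
}
```

On real hardware, both kernels run concurrently and the channel becomes a wire-level FIFO between their pipelines, which is why channels tend to outperform DRAM-based communication.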
With all the above efforts from vendors to support OpenCL, there is still one huge difference between FPGAs and devices such as CPUs or GPUs. Compiling even simple code for an FPGA takes hours to finish, while binary generation for a CPU or GPU is fast. As a result, FPGAs are unappealing to the software engineering community. Software engineers are used to building and testing their code back-to-back; they are definitely not patient enough to wait many hours for even simple code to be ready. Many frameworks today enable interactive data processing, analysis, and visualization. For example, you can open a Python console and start developing and testing TensorFlow model training or inference, and the Spark shell lets users interactively play with their data. Doing the same in an environment equipped with FPGAs currently seems infeasible.
Summarizing all of the above, we need a new programming interface for FPGAs in the cloud. But what should it look like?
Transforming the FPGA interface
So far we have concluded that compiling every new piece of code into an FPGA bitstream is not feasible for a production environment. As a result, we need a framework that acts like LEGO: there are different primitive pieces for different sets of operations, and these operations can be either transformations or actions. More specifically, looking at HPC, Big Data, and AI applications, a single execution process operates on a stream of data, where at each stage the data is either transformed into another shape or has a specific kind of operation applied to it. This sounds like Apache Spark, right? In Apache Spark, users take a set of available operations and apply them one by one to get the final desired result. It should be mentioned that Spark provides a kind of freedom by letting users specify their own lambda functions and apply them to the data using the map operation. Every single lambda function is transformed into its Java equivalent and can be executed on the CPU right away. This cannot be the case with an FPGA. But the same idea can still be applied to FPGAs by having completely predefined operations that users can compose.
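The LEGO model above can be sketched in a few lines of C: a fixed library of operations (here plain functions standing in for pre-compiled FPGA bitstream blocks) that the user chains together, Spark-style, over a stream of data. All names here are illustrative, not a real API.

```c
#include <assert.h>

/* A "library" of predefined transformation blocks. On an FPGA each
 * would be a pre-compiled kernel; here they are plain C functions. */
typedef int (*op_fn)(int);

static int op_double(int x) { return 2 * x; }   /* predefined block */
static int op_inc(int x)    { return x + 1; }   /* predefined block */

/* Apply a user-chosen chain of predefined ops to every element,
 * mimicking a Spark-like pipeline of transformations. */
void run_pipeline(const op_fn *ops, int n_ops, int *data, int n) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n_ops; k++)
            data[i] = ops[k](data[i]);
}
```

The key point is that the user only selects and orders blocks; no new bitstream ever has to be compiled, which is what makes the approach interactive.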
Thinking more about this approach, one can consider a set of predefined operations, assuming almost any logic can be written using them. Each operation is already written in OpenCL and pre-compiled for the FPGAs available in the cloud; as a result, the user has access to a large library of FPGA bitstreams. These blocks can then be chained serially to build a deep pipeline that applies all the operations to the input data. This idea could easily be layered on something like Scala: we can define a set of primitive data types, then define all the available operations on them. These data types would be used only for the FPGA, and a series of operations on combinations of them would be converted into a series of FPGA blocks. It should be mentioned that loading a bitstream onto the FPGA is fast enough for this to work.
Hmmm, still doesn't work
The whole idea above seems promising, but it still does not work, due to a limitation of current OpenCL support. When you compile OpenCL code for an FPGA, all the pipes between different kernels must also be defined statically. This means you cannot simply map different blocks onto the FPGA and wire them together dynamically at runtime. One possible hack is to use global DRAM memory as the communication channel between kernels, but this means loading and unloading kernels one after another, which certainly introduces huge overhead.
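Here is a sketch of that DRAM-staging workaround, again in plain C with illustrative names: kernel A runs to completion and spills its entire result to global memory, the FPGA is reprogrammed (stubbed out below, but slow in reality), and only then does kernel B read the data back. The reprogram-and-reload step between every pair of blocks is exactly where the overhead lives.

```c
#include <assert.h>

#define N 4
static int global_dram[N];   /* stand-in for the board's off-chip DRAM */

/* "Kernel A": transforms its input and spills everything to DRAM. */
void kernel_a(const int *in) {
    for (int i = 0; i < N; i++) global_dram[i] = in[i] + 10;
}

/* Reconfiguration boundary: on real hardware, loading the next
 * bitstream here can take a noticeable amount of time. */
static void reprogram_fpga(void) { /* slow in reality */ }

/* "Kernel B": reads the staged data back and reduces it. */
int kernel_b(void) {
    int sum = 0;
    for (int i = 0; i < N; i++) sum += global_dram[i];
    return sum;
}
```

Contrast this with channels, where both kernels coexist on the fabric and stream data directly; that is the dynamic wiring the current BSPs cannot yet express.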
Combining all of the above, we can conclude that FPGAs still have a way to go before they are practical in cloud environments. There is a huge opportunity in thinking about new programming models for FPGAs that make them fast and suitable for interactive programming. To get there, fundamental things have to change in FPGA OpenCL compilers. First of all, we need a new set of Board Support Packages (BSPs) that increase the flexibility of the FPGA by making features dynamic rather than requiring them statically. For example, BSPs should be able to compile different OpenCL kernel functions independently and wire them together on the FPGA at runtime. Well, I'm not an FPGA expert and I'm not sure how feasible this idea is, but after talking to a friend of mine, it seems reasonable, although it needs significant effort. After that, we can think of higher-level programs that convert user code into a series of FPGA kernels.
We still need to wait and see more studies comparing FPGAs against the other accelerators available in the cloud. I will keep you updated with more information about OpenCL on FPGAs, since it is also my own active research track.