The Beginning of the FPGA Era in Data Centers

FPGAs are becoming more and more popular these days, and everyone is talking about how to utilize them in real-world applications in both public and private clouds. But you may ask: why? Before diving into the pros and cons of this new member of the family, let me talk a little bit about what an FPGA is and how it became so popular in the software community.

FPGAs and their history

So what exactly is an FPGA? FPGA stands for Field-Programmable Gate Array. These are chips whose hardware can be programmed for one specific purpose, unlike CPUs and GPUs. In other words, CPUs and GPUs are mapped into hardware once and are then used by feeding them the instructions of a specific application, while the logic blocks in an FPGA can be configured to perform complex combinational functions directly. The concept of programmable logic devices existed for a while before the introduction of FPGAs: in the old days people used PROMs and PLDs. Unfortunately, these devices were not re-programmable and could only be programmed once, at a factory and in large quantities. Imagine only being able to develop your design and then hand it off to a company in China to build the physical device; it's not really that convenient. As a result, researchers and engineers developed FPGAs, which can be reprogrammed in-house. The first commercial FPGAs were introduced around 1983-1985 by Altera and Xilinx, which remained the major players in the FPGA industry for many years.

FPGAs contain several types of configurable elements, each used for a specific purpose. Today's FPGAs also include complementary elements for common operations, but let us first start with the basic required elements of an FPGA (shown in Figure 1):

  1. Configurable Logic Blocks (CLBs): These form the core array of configurable logic blocks that perform user-specified logic functions. The available interconnections in the device carry signals between these CLBs.
  2. Input/Output (IO) Blocks: These provide a programmable interface between the FPGA's internal logic and external components.

Figure 1: Basic building blocks of an FPGA: configurable logic blocks, programmable interconnect, and IO blocks.

Modern FPGAs have other complementary pieces as well, such as DSP (digital signal processing) blocks, BRAMs (block RAMs), and so on. For the sake of this article, we will not go into the details of all the complex features available in FPGAs.

Modern FPGAs are known for their low power consumption and programmability. Imagine you have one specific, complicated function to implement. Each evaluation of this function may require hundreds of instructions to execute on a GPU or CPU, while it can be done in one clock cycle on an FPGA. FPGAs do not provide the same clock frequencies as other processors: an Intel Xeon CPU can operate at up to 4 GHz, and an Nvidia Titan Xp can go up to about 1.5 GHz, while an efficient FPGA implementation typically reaches at most around 500 MHz. Despite this large frequency gap, FPGAs can still beat other processors in many applications.
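
To make the frequency argument concrete, here is a back-of-the-envelope calculation in Python. All the numbers are illustrative assumptions, not measurements: a 3 GHz CPU that needs roughly 100 instructions per result, versus a 300 MHz FPGA pipeline that produces one result per clock once the pipeline is full.

```python
# Illustrative throughput comparison; every number here is an assumption, not a benchmark.
cpu_clock_hz = 3.0e9            # assumed CPU clock
instructions_per_result = 100   # assumed instruction count per function evaluation
fpga_clock_hz = 300e6           # assumed FPGA pipeline clock

cpu_results_per_sec = cpu_clock_hz / instructions_per_result  # ~30 million results/s
fpga_results_per_sec = fpga_clock_hz                          # one result per cycle when pipelined
print(fpga_results_per_sec / cpu_results_per_sec)             # ~10x despite the lower clock
```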

Applications of FPGAs

FPGAs are widely adopted across industries. A lot of electronic equipment, such as medical devices, security cameras, and automobiles, is equipped with FPGAs. For example, the Airbus A380 contains more than 1000 FPGA chips.

FPGAs have been broadly adopted by different industries because they reduce a product's time to delivery and the total cost of a project. Hardware companies do not need to spend a fortune and wait a long time for chip manufacturers to receive their hardware designs and produce ASICs. That said, many companies still ship their designs as ASICs, since ASICs offer the lowest power consumption and the highest possible performance; for example, Apple uses dedicated chips for motion processing and for its new Neural Engine.

With the rise of machine learning and AI applications, the research community and industry are getting more and more interested in accelerators beyond the CPU. This is due to the heavy processing requirements of such applications: training a deep neural network may take hours or even days, and even inference in DNNs may take seconds. Many other widely used algorithms have completely different behavior but a similar execution intensity. One available option is the GPU. GPUs are widely adopted in many applications, and there are plenty of libraries for mathematical operations, such as cuBLAS and cuDNN, so they are supported by many simulation and AI frameworks. Unfortunately, GPUs are power hungry and generate a considerable amount of heat. Considering the cumulative amount of generated heat and consumed power, they are a huge burden on data centers from a total-operation-cost perspective. As a result, there is a strong push toward power-efficient processors, and interestingly, FPGAs meet this constraint: they operate at low power, which reduces overall power consumption and heat generation. But how do we use these fancy devices in practice?

Programming FPGAs

In order to program logic on an FPGA, one traditionally needs to write it in either Verilog or VHDL, which are hardware description languages. There are other approaches to designing hardware as well; for example, it is common today to write the logic in C and let a high-level synthesis compiler generate the equivalent Verilog code, which is then synthesized for the target hardware. Recently, with the introduction of PCIe form-factor FPGA cards from both Xilinx and Intel (yes, Intel has acquired Altera and is in the FPGA business now :D), one can buy these cards and attach them to a motherboard, just like GPUs. A parallel programming language can then be used to write and deploy applications onto the card. Both Xilinx and Intel offer OpenCL support on their boards. OpenCL is a standard, first introduced by Apple and later adopted by the Khronos Group, and it is supported by different vendors, such as Nvidia on its GPUs and Intel on its CPUs. OpenCL is meant to be generic enough to cover the whole range of accelerators. Today, one can write OpenCL code and use the vendor compiler to generate Verilog, which is then synthesized and mapped onto the card along with all the necessary blocks such as the PCIe driver, the memory-transfer logic, and so on. The host code can then use the OpenCL API to initialize and execute the OpenCL kernels on the FPGA.
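
As a rough illustration of what the host side looks like, here is a minimal sketch using PyOpenCL on a generic OpenCL device. For an FPGA card, the kernel would be compiled offline by the vendor toolchain into a bitstream and loaded as a pre-built binary rather than built from source at runtime, but the host-side flow (context, buffers, kernel launch, copy-back) is essentially the same.

```python
# Minimal OpenCL host-side sketch (PyOpenCL). On an FPGA the program would be
# created from an offline-compiled binary instead of being built from source here.
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

kernel_src = """
__kernel void scale(__global const float *in, __global float *out, const float factor) {
    int i = get_global_id(0);
    out[i] = in[i] * factor;
}
"""
prg = cl.Program(ctx, kernel_src).build()

x = np.arange(1024, dtype=np.float32)
mf = cl.mem_flags
in_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, x.nbytes)

# Launch the kernel over 1024 work-items and copy the result back to the host.
prg.scale(queue, x.shape, None, in_buf, out_buf, np.float32(2.0))
result = np.empty_like(x)
cl.enqueue_copy(queue, result, out_buf)
```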

Working with OpenCL on FPGAs, almost 95% of everything is the same. There is one extra feature, called channels, which exists only in the Intel FPGA OpenCL SDK. Channels are similar to OpenCL pipes but seem to provide better performance. As a result, one can hope to take generic OpenCL code and execute it, as is, on an FPGA. Nevertheless, due to the architectural differences between GPUs and FPGAs, optimizations are required to prepare OpenCL GPU code for an FPGA. The same parallel code that performs well on a GPU may need to be rewritten sequentially and then pipelined to achieve reasonable execution time on an FPGA.

Despite all these efforts from vendors to support OpenCL, there is still one huge difference between FPGAs and devices such as CPUs or GPUs: compiling even a simple kernel for an FPGA can take hours, whereas binary generation for a CPU or GPU is fast. As a result, FPGAs do not seem pleasant to the software engineering community. Software engineers are used to building and testing their code back-to-back; they are definitely not patient enough to wait many hours for a simple piece of code to be ready. Many frameworks today enable interactive data processing, analysis, and visualization: you can open a Python console and start developing and testing TensorFlow model training or inference, or use the Spark shell to interactively play with your data. Doing the same in an environment equipped with FPGAs does not seem feasible.

Summarizing all of the above, we need a new programming interface for FPGAs in the cloud. But what should it look like?

Transforming the FPGA interface

So far we have concluded that compiling every new piece of code into an FPGA bitstream is not feasible for a production environment. As a result, we need a framework that acts like LEGO: there are primitive pieces for different sets of operations, and these operations can be either transformations or actions. More specifically, looking at HPC, Big Data, and AI applications, a single execution process operates on a stream of data, where at each stage the data is either transformed into another shape or has a specific kind of operation applied to it. This sounds like Apache Spark, right? In Apache Spark, users take a set of available operations and apply them one by one to get the final desired result. Spark also provides a kind of freedom by letting users specify their own lambda functions and apply them to the data using the map operation. Every lambda function is compiled to its JVM equivalent and can be executed on the CPU right away. This cannot be the case with an FPGA, but the same model can still be applied by giving users a set of completely predefined operations.

Thinking more about this approach, one can consider a set of predefined operations and assume that almost any logic can be written using them. Each operation is already written in OpenCL and pre-compiled for the FPGAs available in the cloud; as a result, users have access to a large library of FPGA bitstreams. These blocks can then be attached serially to build a deep pipeline that applies all the operations to the input data. The idea maps naturally onto something like Scala: we can define a set of primitive data types and then define all the available operations on them. These data types would be reserved for FPGA use, and a series of operations on combinations of them would be converted into a series of FPGA blocks. It is worth mentioning that loading a bitstream onto the FPGA is fast enough. A minimal host-side sketch of this idea is shown below.
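
Here is a minimal sketch of that "LEGO" idea in Python. Everything in it is hypothetical: the operation names, the bitstream paths, and the pipeline class are made up to illustrate the programming model, and the "kernels" are simulated on the host instead of being loaded onto a real FPGA.

```python
# Hypothetical sketch of a pipeline of pre-compiled FPGA operations.
class FpgaOp:
    def __init__(self, name, bitstream_path):
        self.name = name
        self.bitstream_path = bitstream_path  # pre-compiled offline, fetched from a library

class FpgaPipeline:
    def __init__(self):
        self.stages = []

    def then(self, op):
        self.stages.append(op)
        return self

    def run(self, data):
        # A real runtime would load each bitstream and wire the stages into a
        # deep pipeline on the device; here we simply simulate them on the host.
        for op in self.stages:
            data = SIMULATED_KERNELS[op.name](data)
        return data

SIMULATED_KERNELS = {
    "scale":  lambda xs: [2 * x for x in xs],
    "offset": lambda xs: [x + 1 for x in xs],
}

result = (FpgaPipeline()
          .then(FpgaOp("scale",  "bitstreams/scale.aocx"))
          .then(FpgaOp("offset", "bitstreams/offset.aocx"))
          .run([1, 2, 3]))
print(result)  # [3, 5, 7]
```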

Hmmm, it still doesn't work

The whole idea above seems promising, but it still does not work due to limitations of current OpenCL support. When you compile even a simple OpenCL design for an FPGA, all the pipes between different kernels must be defined statically, which means you cannot just map different blocks onto the FPGA and wire them dynamically at runtime. One possible hack is to use global DRAM as the communication channel between kernels, but this means loading and unloading kernels one after another, which certainly introduces a huge overhead. The sketch below illustrates this workaround.
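
Below is a small PyOpenCL sketch of that workaround, again on a generic OpenCL device: two kernels communicate through an intermediate global-memory buffer rather than through a statically wired channel. On an FPGA, each stage would come from its own offline-compiled bitstream, so switching between them is where the reconfiguration overhead appears.

```python
# Two kernels chained through a global-memory buffer (the "global DRAM as channel" workaround).
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

src = """
__kernel void stage_a(__global const float *in, __global float *mid) {
    int i = get_global_id(0);
    mid[i] = in[i] * 2.0f;
}
__kernel void stage_b(__global const float *mid, __global float *out) {
    int i = get_global_id(0);
    out[i] = mid[i] + 1.0f;
}
"""
prg = cl.Program(ctx, src).build()

x = np.arange(16, dtype=np.float32)
mf = cl.mem_flags
in_buf  = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
mid_buf = cl.Buffer(ctx, mf.READ_WRITE, x.nbytes)   # global DRAM acting as the "channel"
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, x.nbytes)

prg.stage_a(queue, x.shape, None, in_buf, mid_buf)   # on an FPGA: load bitstream A, run
prg.stage_b(queue, x.shape, None, mid_buf, out_buf)  # then unload A, load bitstream B, run

y = np.empty_like(x)
cl.enqueue_copy(queue, y, out_buf)
print(y)  # x * 2 + 1
```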

Combining all of the above, we can conclude that FPGAs still have a way to go before they are feasible for cloud environments. There is a huge opportunity in designing new programming models for FPGAs that make them fast and suitable for interactive programming. To get there, fundamental things have to change in FPGA OpenCL compilers. First of all, we need a new set of Board Support Packages (BSPs) that increase the flexibility of the FPGA by making features dynamic rather than requiring them statically. For example, BSPs should be able to compile different OpenCL kernel functions independently and wire them together on the FPGA at runtime. Well, I am not an FPGA expert and I am not sure how feasible this idea is, but after talking to a friend of mine it seems reasonable, although it would need significant effort. After that, we can think of higher-level programs that convert user code into a series of FPGA kernels.

We still need to wait and see more studies comparing FPGAs against the other accelerators available in cloud environments. I will keep you updated with more information about OpenCL on FPGAs, since it is also my own active research track.

 


(Part 1) Migrate Deep Learning Training onto Mobile Devices!

Recently I have started a new project that aims to port the training phase of deep convolutional neural networks to mobile phones. Training neural networks is a hard and time-consuming task, and it requires powerful machines to finish a reasonable training run in a timely manner. Successful models such as GoogLeNet, VGG, and Inception are built from tens of convolutional layers. These models are heavy enough that you certainly need a large amount of memory and a powerful GPU to train one in even a day (and it may still take days to reach a reasonable accuracy).

The nature of neural network training almost prevents it from being deployed on embedded and mobile systems. These small systems are based on an SoC architecture with a small GPU and a medium-sized DRAM, a combination designed for mobile-scale applications. To be fair, mobile devices today are much more powerful than PCs from ten years ago, and they can even be considered replacements for laptops or desktops for everyday tasks, but they still cannot afford to run heavy AI workloads.

Despite all the arguments above, having AI capabilities on your mobile phone is a necessity for future applications. We are almost past the era of simple functional apps and moving toward more intelligent and sophisticated user applications. These applications may need statistical machine learning techniques to provide unique functionality to users. Even right now you can see many AI-backed apps on your phone, such as Google Assistant and Apple's Siri. The philosophy behind these applications is to put a clean and useful interface on the user's device and run all the heavy AI work in a data center, so that every user input (1) goes to the data center, (2) is processed by the servers, and (3) the output is returned to the user. That seems good enough, right? Well, maybe not. Imagine you have purchased an iPhone and you are excited to use Siri for all your daily tasks, but English is not your first language and you speak with some accent. This may make it hard for Siri to fully understand what you are saying much of the time. There is an obvious solution to this problem: custom-training an AI model for every single user in the cloud. Unfortunately, this brings a whole lot of challenges for the provider. Here are some of them:

  • Doing inference per user request is cheap, fast, and affordable. It does not need a massive amount of computation on the servers, and it does not generate too much heat, which is the number-one problem for big data centers. Training, on the other hand, is expensive and time-consuming. It requires the provider to allocate a considerable amount of resources for each user, which turns out not to be cost-effective, and it generates more heat, which makes cooling harder. As a result, running continuous training for each user is not an option for providers, at least with current technology.
  • Holding every user's customized model requires a large amount of disk storage. Providers would need to add more disks, preferably SSDs, to hold each user's final model and the relevant snapshots, which increases the cost of every data center.
  • Security is another issue for cloud service providers. Imagine all the data and models for every user being stored in a centralized data center; this makes security even more challenging. Besides, a user-specific AI model says a lot about that user's private information, which makes protection and encryption even more sensitive.

As a result, I believe customizing user models in the cloud is doable, but at a high cost. Having AI capabilities integrated inside the mobile application would reduce operating costs and also bring real-time responsiveness to mobile apps. Unfortunately, current mobile systems are not capable of training a network such as Inception locally, so it is not practical to port AI code to mobile as is.

Recently one of my colleagues came up with an idea: training a new neural network from scratch while it receives mentorship from an already trained network. One can use this technique to train a neural network from scratch much more easily than the non-mentored version. Now, what if the new network could be smaller than the original network but still represent the same knowledge? This would be great, since it would make it possible to adopt the knowledge of a heavy neural network while making training easier and faster. This is the idea we are going to build on in order to bring training to mobile phones. You can find his paper draft here: https://arxiv.org/pdf/1604.08220.pdf

So far there has been a lot of related work targeting only already-trained networks: you take the model parameters and then apply specific techniques to reduce the size of the model, such as (1) weight pruning and quantization, or (2) converting 32- and 64-bit floating-point values into 8-bit versions. Unfortunately, none of these techniques helps the training phase. Shrinking the model during training can make the loss diverge badly and prevent the model from reaching a reasonable accuracy step by step; our proposed technique addresses this issue. All of these techniques are related to an idea called Dark Knowledge.
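
As a concrete example of the inference-side techniques mentioned above, here is a tiny NumPy sketch of symmetric 8-bit weight quantization. It is only an illustration of the general idea, not the method used in our work.

```python
# Symmetric linear quantization of float32 weights to int8 (illustration only).
import numpy as np

def quantize_int8(w):
    scale = max(np.abs(w).max(), 1e-8) / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
print("max reconstruction error:", np.abs(w - dequantize(q, s)).max())
```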

So far I have talked about the problem and why it is important. Now let's talk more about the technique described above.

Consider a large Mentor network with n layers and a smaller Mentee network with m layers. We assume the large network is well trained and stable on a sufficiently general dataset, and we want the smaller network to classify a new dataset that may be less general than, or as general as, the Mentor's. We map each layer (filter) of the Mentee network to a filter in the Mentor and compute the error between them using RMSE (other metrics could be used too). While training, the Mentee not only learns from the difference between the true and predicted labels, but also tries to adopt nearly the same representations as the intermediate Mentor layers. This keeps the Mentee from deviating from the Mentor's knowledge representation and lets it emulate that knowledge at a smaller scale. Users can specify the contributions of the final softmax loss and of the intermediate losses, which control how far the Mentee may deviate from the Mentor. A minimal sketch of this training loop is shown below.
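
To make the description concrete, here is a minimal TensorFlow sketch of one training step, written against the Keras API. It assumes hypothetical `mentor_base` and `mentee_base` models and layer-name lists (placeholders I introduce for illustration), and it assumes the mapped layers produce feature maps of the same shape; the real code I will share later differs in the details.

```python
# Minimal sketch of mentor-supervised training; model names and layer lists are
# placeholders, and mapped layers are assumed to have matching output shapes.
import tensorflow as tf

def with_intermediates(base, layer_names):
    """Expose selected intermediate feature maps alongside the final logits."""
    outs = [base.get_layer(n).output for n in layer_names] + [base.output]
    return tf.keras.Model(base.input, outs)

mentor = with_intermediates(mentor_base, mentor_layer_names)  # pre-trained, frozen
mentee = with_intermediates(mentee_base, mentee_layer_names)  # small, trainable
mentor.trainable = False

ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
opt = tf.keras.optimizers.Adam(1e-3)
alpha, beta = 1.0, 0.1   # user-chosen weights for the label loss vs. the mentor losses

@tf.function
def train_step(x, y):
    mentor_outs = mentor(x, training=False)
    with tf.GradientTape() as tape:
        *mentee_feats, logits = mentee(x, training=True)
        label_loss = ce(y, logits)
        # RMSE between each mapped pair of intermediate feature maps.
        mentor_losses = [tf.sqrt(tf.reduce_mean(tf.square(a - b)))
                         for a, b in zip(mentee_feats, mentor_outs[:-1])]
        loss = alpha * label_loss + beta * tf.add_n(mentor_losses)
    grads = tape.gradient(loss, mentee.trainable_variables)
    opt.apply_gradients(zip(grads, mentee.trainable_variables))
    return loss
```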

So far I have tested the idea on MNIST with a VGG16 model, and the accuracy numbers are interesting: a Mentee supervised by the Mentor network produces much higher accuracy than an independent Mentee. The size chosen for the Mentee certainly affects performance, but it can be tuned based on the computing limitations and the user's tolerance for model accuracy.

Here is a schematic of the connection between the Mentor and Mentee networks.

The figure shows how the Mentee is supervised by the Mentor during the training session. Later on I will share my code, written in TensorFlow, which has more detail about how these two graphs are connected.

Now, how could this solve the mobile problem? Well, you can have a general brain in the cloud that is responsible for learning a really big model representing global knowledge. You use this service through your phone and want to inject some information about your own usage habits and customize the model for your own needs. You can keep a small representation of the model on the phone and train it in the background while it receives supervision from the Mentor model. As a result, the cloud service provider can focus only on the global knowledge, while your device takes care of your own input data.

I think we have had enough discussion about the background of the idea. It's time to get our hands dirty and show how all of this is possible with the technology we have today. The next part of this article will discuss the implementation details of the Mentee-Mentor network.

Leaving to join EITR Systems

I am going to work for EITR Systems for the whole summer of '16. I will be the first technical person working there, and I am really excited to join a startup from scratch instead of working at a well-established big company. This is an interesting opportunity to learn about all the different aspects of a young business, whether technical issues or communication and expense issues. We are going to develop a unique and inexpensive kind of product that can easily handle the data replication problem in current data centers. This problem is only going to get bigger, especially as 3D XPoint devices are on their way to market. With these new storage devices in use, the data center community is concerned with how to achieve both replication guarantees and high-throughput, low-latency data access. My goal is to come up with a final product that is a complete solution to the problems mentioned above.

I used to work for software development companies while doing my bachelor's degree, so I know how it feels to do something simply straightforward for a long time, with no particular innovation or challenge. That may be interesting for some people, but not for me.

I started working on this startup about a year ago with Dr. Raju Rangaswami. It began simply: we came up with a funny idea in the Storage Systems class, and I followed up with him to see how we could extend it and get a paper out of it. After several weeks of discussion, we arrived at a modified version that was surprisingly clean and unique. We found ourselves with a great solution to the data center replication problem. Since then, we have researched the required hardware and software components, and after a year the money was successfully raised. It is sufficient for the whole summer, and the mission is to deliver a working prototype by the end of the summer in order to move on to the second stage of fundraising.

Now that I compare my current work with the work I was doing previously, I can easily say there is a huge difference in every aspect you can think of. Running a startup and developing something completely new is a big challenge. You need to learn about a lot of things, and sometimes you have to talk to knowledgeable people in different areas to get accurate information. It is hardly like working in a well-established company with strongly defined objectives. The process is hard enough that it turns you into a thinker and an innovator. You learn not to do the things everyone else is doing, and to put the work first instead of money and profit. Of course, we all do this hoping for a big payoff, but at the beginning your main assignment is to build something that earns people's attention and appreciation.

Right now, I am geared up to fully dedicate myself to this journey and see what happens next!