Recently I started a new project that aims to port the training phase of deep convolutional neural networks to the mobile phone. Training neural networks is a hard and time-consuming task, and it requires powerful machines to finish a reasonable training run in a timely manner. Current successful models such as GoogLeNet, VGG and Inception are built from tens of convolutional layers. These models are heavy enough that you need a large amount of memory and a very powerful GPU to train them in about a day. (Even then, it may still take several days to reach a reasonable accuracy.)
The nature of training neural networks almost prevents it from being done on embedded and mobile systems. These small systems are based on an SoC architecture with a small GPU on it. They also have a medium-sized DRAM, and the combination is designed to serve mobile-sized applications. It is worth mentioning that mobile devices today are much more powerful than PCs from 10 years ago. They could also be considered a replacement for laptops or desktops for everyday tasks. But they still cannot afford to perform heavy AI tasks.
Despite all the arguments above, having AI capabilities on your mobile phone is a necessity for future applications. We are almost at the end of simple functional applications and are moving toward more intelligent and sophisticated ones. These applications may need statistical machine learning techniques to provide unique functionality to users. Even right now you can see many AI-backed apps on your phone, such as Google Assistant, Apple Siri and so on. The philosophy behind these applications is to put a clean and useful interface on the user’s device and run all the heavy AI tasks in a data center, such that every user input (1) goes to the data center, (2) is processed by the servers, and (3) the output is returned to the user. That seems to be enough, right? Well, that may not be true. Imagine you have purchased an iPhone and you are excited to use Siri for all your daily tasks. English may not be your first language, and you may have an accent. This may make it hard for Siri to fully understand what you are saying, almost all the time. There is an obvious and seemingly easy solution to this problem, which is to custom-train an AI model in the cloud for every single person. Well, this brings a whole lot of challenges for the provider. Here are some of the existing challenges:
- Doing inference per user request is cheap, fast and affordable. It doesn’t need a massive amount of computation on the servers. It also does not generate too much heat, which is the #1 problem for big data centers. Training, on the other hand, is expensive and time consuming. It requires the provider to allocate a considerable amount of resources for each user, which turns out not to be cost effective. It can also generate more heat, which makes cooling harder. As a result, running continuous training for each user is not an option for providers, at least with current technology.
- Holding every user’s customized model may require large disk storage. Providers need to add more disks, preferably SSDs, in order to hold each user’s final model and the relevant snapshots. This increases the cost of every data center.
- Security is another issue for cloud service providers. Imagine the data and models of every user being stored in a centralized data center. This makes security more challenging. Besides, a user-specific AI model says a lot about that user’s private information, which makes protection and encryption more critical.
As a result, I believe customizing user models in the cloud is totally doable, but at a high cost. Integrating AI capabilities into the mobile application itself would reduce operating costs and also bring real-time responsiveness to mobile apps. Unfortunately, current mobile systems are not capable of training a network such as Inception locally. So it might not be practical to port AI code to mobile as is.
Recently one of my colleagues came up with an idea: train a new neural network from scratch while it receives mentorship from an already trained network. With this technique, training a neural network from scratch becomes much easier than in the non-mentored case. Now what if the new network could be smaller than the original network, but at the same time be able to represent the same knowledge? This may be a great idea, since it makes it possible to adopt the knowledge of a heavy neural network while making training easier and faster. This is basically the idea we are going to expand on in order to bring training to mobile phones. Here you can find his paper draft: https://arxiv.org/pdf/1604.08220.pdf
So far there has been a lot of related work that targets only already-trained networks: you take the model parameters and then apply specific techniques to reduce the size of the model. These may be (1) weight pruning and quantization, (2) converting 32- and 64-bit floating point values into 8-bit versions, and so on. Unfortunately none of these techniques helps the training phase. Shrinking the model for the training phase can make the loss value diverge badly, and prevents the model from gradually reaching a reasonable accuracy. Our proposed technique can solve this issue. All of these techniques are predecessors of an idea called Dark Knowledge.
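To make the post-training techniques mentioned above a bit more concrete, here is a minimal sketch of magnitude pruning and 8-bit linear quantization in plain Python. It is only an illustration under simple assumptions (a flat list of weights, one scale per list); the function names `prune_weights`, `quantize_uint8` and `dequantize_uint8` are made up for this post and are not taken from any of the related work.

```python
def prune_weights(weights, threshold):
    """Magnitude pruning: zero out weights whose absolute value is below threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def quantize_uint8(weights):
    """Map floats linearly onto integer codes 0..255.

    Returns the codes plus the scale and minimum needed to map them back.
    """
    lo, hi = min(weights), max(weights)
    # Guard against a constant list, where hi == lo would give a zero scale.
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    codes = [int(round((w - lo) / scale)) for w in weights]
    return codes, scale, lo


def dequantize_uint8(codes, scale, lo):
    """Recover approximate float weights from the 8-bit codes."""
    return [c * scale + lo for c in codes]


# Example: prune small weights, then store the rest as 8-bit codes.
trained = [-1.0, -0.02, 0.0, 0.5, 1.0]
pruned = prune_weights(trained, threshold=0.1)
codes, scale, lo = quantize_uint8(pruned)
restored = dequantize_uint8(codes, scale, lo)
```

Both steps work on weights that already exist, which is why they shrink a trained model but do nothing to make training itself cheaper.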
So far I have talked about the problem and why it is important. Now let’s talk more about the technique described above.
Consider a large Mentor network with n layers. Now consider a smaller Mentee network with m layers. We assume the large network is well trained and stable on a general-enough dataset. We want the smaller network to classify a new dataset, which may be less general than, or as general as, the Mentor’s. We map each layer (filter) of the Mentee network to a filter in the Mentor, and we calculate the error between them using RMSE (other metrics could be used too). While training, the Mentee not only learns from the difference between the real and predicted labels, but also tries to adopt almost the same representation as the intermediate Mentor layers. This helps the Mentee not to deviate from the Mentor’s knowledge representation and to emulate its knowledge at a smaller scale. Users can specify the contribution of the final softmax loss and of the intermediate losses, which controls how far the Mentee may deviate from the Mentor.
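The combined objective described above can be sketched in a few lines of plain Python. This is a rough illustration, not the actual implementation: the names `mentee_loss`, `layer_map`, `alpha` and `beta` are hypothetical, and the sketch assumes that each mapped pair of Mentee and Mentor layers produces the same number of activations, which a real, smaller Mentee may not satisfy without an extra projection step.

```python
import math


def rmse(a, b):
    """Root mean squared error between two equally sized activation vectors."""
    if len(a) != len(b):
        raise ValueError("mapped layers must have the same number of activations")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def softmax_cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def mentee_loss(logits, label, mentee_acts, mentor_acts, layer_map,
                alpha=1.0, beta=0.5):
    """Weighted sum of the label loss and the layer-wise Mentor losses.

    layer_map is a list of (mentee_layer_index, mentor_layer_index) pairs.
    alpha weights the softmax loss, beta weights the intermediate losses;
    both are assumed knobs standing in for the user-specified contributions.
    """
    label_loss = softmax_cross_entropy(logits, label)
    mentor_loss = sum(rmse(mentee_acts[i], mentor_acts[j]) for i, j in layer_map)
    return alpha * label_loss + beta * mentor_loss


# Example: a 2-layer Mentee supervised by layers 1 and 3 of a 4-layer Mentor.
mentor_acts = [[0.1, 0.2], [0.4, 0.6], [0.9, 0.1], [0.3, 0.7]]
mentee_acts = [[0.5, 0.5], [0.2, 0.8]]
loss = mentee_loss([2.0, 0.5], 0, mentee_acts, mentor_acts,
                   layer_map=[(0, 1), (1, 3)])
```

Setting `beta` to 0 gives the independent Mentee, while a larger `beta` pulls the Mentee’s intermediate representations closer to the Mentor’s.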
So far I have tested the idea using MNIST and the VGG16 model, and the accuracy numbers are interesting. A Mentee supervised by the Mentor network is able to reach much higher accuracy than an independent Mentee. The choice of Mentee size definitely affects the performance, but it can be tuned based on the computing limitations and the user’s tolerance for lower model accuracy.
Here is a schematic of the connection between the Mentor and Mentee networks.
The graph makes clear how the Mentee is supervised by the Mentor during a training session. Later on I will share my code, written in TensorFlow, which has more detail about the connection between these two graphs.
Now, how could this solve the mobile issue? Well, you can have a general brain in the cloud that is responsible for learning a really big model representing global knowledge. Now suppose you are using this service through your phone and want to inject some more information about your usage habits and customize the model for your own needs. You can keep a small representation of the model on the phone and keep training it in the background while it receives supervision from the Mentor model. As a result, the cloud service provider can focus only on the global knowledge while your device takes care of your own input data.
I think we have had enough discussion about the background of the idea. It’s time to get our hands dirty and show how all of this is possible with the technology we currently have. The next part of this article will discuss the implementation details of the Mentee-Mentor network.