Abstract
Keywords
Introduction
Over the last several years, with the proliferation of data and devices, machine learning algorithms have become unavoidable in almost every aspect of human life,1 especially when deep learning (DL) techniques are used. DL techniques, such as deep neural networks (DNNs), are among the best solutions for many problems in image understanding. DNNs are used in applications like face recognition, document analysis, image classification, labeling, and many others.2–4
Training and inference of DNNs can be done in the cloud, with enormous computational power but hundreds of kilometers away from data sources; in the fog, with less computational power but much closer to data sources; and in the dew, at the data source itself, where processing is closely coupled with Internet of Things (IoT) devices and Wireless Sensor Networks (WSNs) connected to end devices.
Cloud computing provides enormous computational power and storage resources available on demand, where users can scale services to fit their needs. Fog computing provides less computational power and storage resources, but with a wider geographical distribution than cloud computing, which makes it well suited for data preprocessing. Fog computing is performed by edge devices (routers, routing switches, multiplexers, and integrated access devices), which provide an entry point into the enterprise network. Dew computing provides minimal computational power and storage resources in close coupling with IoT and WSN at the very edge of the system. Dew computing is performed by end devices like personal computers, mobile phones, and sensors.
DNNs can include hundreds of millions of connections between nodes, which makes them computationally and memory intensive and therefore difficult to deploy on end devices with limited resources. While embedded processors enable arithmetic logic unit (ALU)-based computations with reasonable power consumption, fetching weights from dynamic random-access memory (DRAM), static random-access memory (SRAM), or block random-access memory (BRAM) is several orders of magnitude more power-hungry than ALU operations.
Besides deployment locations, it is hard to determine which underlying hardware is the optimal solution for a wide spectrum of DNNs. Depending on the application type and the amount of data, the following architectural approaches can be used: general-purpose processors (multicore central processing units (CPUs)), graphics processing units (manycore GPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs).
This article discusses advantages and disadvantages of different deployment locations and different architectural approaches used for training and inference of DNNs, comparing them in terms of aspects such as speed, power consumption, and complexity. It then proposes a novel classification of deployment locations, from the viewpoint of underlying hardware. The goal of this article is to provide insights into design choices related to hardware used for DNNs at different deployment locations.
In our search, which we conducted using the Google Scholar database, we focused on keywords related to DNNs and hardware acceleration. Specifically, we searched for the following terms: “an Analysis of DNNs,” “Hardware Acceleration of DNNs,” “Deep Learning in Neural Networks,” “Tensorflow,” “Neuron Programmable Chip,” “Convolution Neural Networks,” “Binarized Neural Networks,” “Training Deep FeedForward Neural Networks,” “Image Segmentation with Deep Neural Networks,” “Memristors for Neural Networks,” “Image Classification using Deep Neural Networks,” “Cloud Computing,” “Fog Computing,” and “Edge Computing.”
The “Compute intensive training and strict latency inference of DNNs” section discusses problems with training and inference of DNNs at different deployment locations, defines the big data problem, and explains why the problem will grow over time. The “Analysis of existing solutions” section gives an overview of existing articles available in the open literature, presents the advantages of state-of-the-art solutions for training and inference of DNNs, and discusses major comparative differences. The “Classification criteria” section classifies existing solutions into 12 categories with examples, based on the underlying hardware and architectural approach in connection with various deployment locations. The “Conclusion” section concludes this article, discusses the main advantages of the proposed classification, defines what would be the optimal solution for the future, and presents newly opened research problems.
Compute intensive training and strict latency inference of DNNs
Training of DNNs is a much more compute-intensive process than inference,5 requiring enormous computational power, especially for big data applications. Inference of DNNs often has strict latency demands, especially in real-time scenarios such as driverless cars and search engines, where, in both cases, the latency budget is less than a second per sample. While academia is focused on faster training, industry is often more focused on faster inference, creating custom solutions both in software and in hardware.
For big data applications, in environments with limited resources such as embedded systems, it is hard to train a DNN, due to the lack of computational power and energy resources. For real-time applications with strict latency demands, inference in the cloud environment is a non-trivial problem, since processing could be done hundreds of kilometers away. Currently, big data applications use powerful GPUs for DNNs in the cloud environment, while FPGAs, ASICs, and embedded processors of various types are on the rise in fog and dew environments suited for DL.6,7
Analysis of existing solutions
Companies like Amazon, Google, Microsoft, Baidu, and many others recently created web services for image understanding. Nowadays, they are aiming to create artificial intelligence cloud-based platforms for image understanding which could be used for almost any type of data, 8 referred to as Machine-Learning-as-a-Service (MLaaS). MLaaS presents a set of machine learning tools which are a part of cloud computing, including frameworks, data visualization, application programming interfaces (APIs), and predictive analysis.
Cloud computing currently offers the best solution for training and inference of DNNs if execution time is the primary concern. Even though cloud computing offers immense opportunities, the technology is still struggling with many issues which have to be addressed, such as latency and security concerns,9,10 especially in real-time scenarios. The amount of power required to send data to the cloud is enormous, which is why even the most powerful cloud solutions must make a great effort to overcome latency concerns.11
There is another trend which shifts data and processing significantly away from the cloud, toward end devices. Huang et al.12 propose a dew machine learning framework which demonstrates superiority by reducing network traffic and running time under certain conditions of interest. This could be of crucial importance in environments such as drones, ships, and research stations, where execution time is not of primary concern and where ultra-low-power solutions are beneficial. Even data mining could be ported to end devices.13
Schmidhuber14 presents the essence of DNNs, how they have evolved over time, and which cutting-edge DNNs exist today. In the open literature, there are many papers which discuss implementations of DNNs using various architectural approaches. The history of DNNs for image understanding with convolutional neural networks (CNNs) began with AlexNet,15 a DNN which won the ImageNet competition, where the goal was to classify 1.2 million high-resolution images into 1000 different classes. The authors used an efficient GPU implementation of the convolution operation which relied on two GTX 580 GPUs with 3 GB of memory. State-of-the-art CNNs, such as the VGG, Inception, GoogleNet, ResNet, and other families,16 are suitable for training on highly parallelizable GPUs. These DNNs are very powerful; however, they demand substantial computational and memory resources.
While training is mostly dominated by GPUs, FPGAs provide superior efficiency over GPUs for binarized NNs—NNs with binary weights—which are often used in environments with limited resources, particularly embedded systems. A group of researchers18 present power-efficient training of a binarized NN, where most arithmetic operations are replaced with bitwise operations. That paper further evaluates the opportunity to improve the execution efficiency of a binarized NN through hardware acceleration.
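As a concrete illustration of the bitwise substitution: when weights and activations are constrained to {−1, +1} and packed into machine words, a multiply-accumulate reduces to XNOR followed by a population count. The sketch below (the function name and bit encoding are ours, not taken from the cited paper) shows the idea:

```python
# Sketch: binarized dot product. Vectors over {-1, +1} are encoded as
# bits (1 -> +1, 0 -> -1), so multiply-accumulate becomes XNOR + popcount.

def binarized_dot(a_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two n-element {-1,+1} vectors packed into integers."""
    # XNOR marks positions where the signs agree (elementwise product = +1).
    agree = ~(a_bits ^ w_bits) & ((1 << n) - 1)
    matches = bin(agree).count("1")
    # Each agreement contributes +1, each disagreement contributes -1.
    return 2 * matches - n

# Example: a = [+1, -1, +1, +1], w = [+1, +1, -1, +1] (LSB first)
result = binarized_dot(0b1101, 0b1011, 4)   # products: +1, -1, -1, +1 -> 0
```

On an FPGA, the XNOR and popcount map directly onto lookup tables, which is the source of the efficiency advantage discussed above.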
In the open literature, there are a number of ASICs developed for DNNs such as neuromorphic chips. 19 IBM TrueNorth 20 and Intel Loihi 21 are neuromorphic chips which enable parallel computations used for learning and inference of DNNs needing high efficiency. Another trend implies analog computational units for running DNNs 22 such as memristors. Quantum computing, the emerging interdisciplinary research area of quantum physics and machine learning, could also be used for training of DNNs. 23
Classification criteria
The classification proposed in this article is illustrated in Table 1. Each category contains examples from the open literature, if such examples exist; each example describes the essence of the paradigm and its advantages and disadvantages in comparison to other categories. An empty category in the classification can be seen as an opportunity for further research and development. Creativity in this article follows the path referred to as generalization.24
Proposed classification with 12 different categories and examples of different underlying hardware used at deployment locations.
CPU: central processing unit; FRD: Fog Reference Design; GPU: graphics processing unit; FPGA: field programmable gate array; ASIC: application-specific integrated circuit; TPU: tensor processing unit.
We immediately found that different architectural approaches do not allow for a full direct comparison of performance, power consumption, latency, and data processing bandwidth of DNNs described in the open literature, due to different testing conditions and different DNN designs. Therefore, this section partially compares different architectural approaches through aspects such as speed, power, and speed per watt, using the data available in the open literature.
DNNs in cloud computing
Cloud computing is the most powerful massive workload computing model today. The main advantage of cloud computing is its flexibility where users can scale resources and services to fit their needs and access them from anywhere. It provides the enormous computational power needed for training and inference of DNNs. The most important aspect of performance evaluation for training of DNNs is the computational power needed for such massive workload, while inference heavily depends on power consumption.
Cloud-based solutions are often implemented on powerful manycore processors, which make the massive parallel calculations for training of DNNs possible, while multicore processors orchestrate data movements into and from the memory of GPUs. Nvidia recently introduced mixed-precision arithmetic, where calculations are performed using FP16 (half precision) instead of FP32 (single precision) or FP64 (double precision); the reduced precision is not critical for DNN accuracy, but it reduces the time required for calculations and for moving data into and from the shared memory.25,26 Such an approach reduces memory usage and allows for training of larger DNNs. Tensor cores, introduced in Nvidia's Volta architecture, are another optimization construct which accelerates the training of DNNs.27
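The mixed-precision pattern works because storage precision and accumulation precision are decoupled: operands live in FP16 (halving memory traffic), while sums are accumulated at higher precision. A minimal NumPy sketch of why that pairing matters (the values are illustrative, not a benchmark):

```python
import numpy as np

# FP16 storage uses half the bytes of FP32.
x = np.full(10000, 0.1, dtype=np.float16)
assert x.nbytes == x.astype(np.float32).nbytes // 2

# Accumulating in FP16 stalls once the rounding step of the running sum
# exceeds the addend; accumulating in FP32 stays close to the true sum.
fp16_sum = np.float16(0)
for v in x:
    fp16_sum = np.float16(fp16_sum + v)
fp32_sum = x.astype(np.float32).sum()   # FP16 storage, FP32 accumulation
```

The FP16 accumulator gets stuck far below the true total of roughly 1000, while the FP32 accumulator does not; tensor cores institutionalize exactly this FP16-multiply/FP32-accumulate split.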
Besides powerful GPUs, cloud services for DNNs can be based on multicore processors, usually with 2 to 8 cores, or up to 18 cores in the case of the Intel i9. There are also attempts to create processors with up to 72 cores, like the Intel Xeon Phi.28 A group of researchers29 trained a DNN on a cluster of 1024 CPUs. Intel also introduced a new instruction set (AVX-512), available on the latest Xeon Phi and Skylake-X CPUs, which can accelerate performance for massive DNN workloads. The new instruction set enables lower precision operations using Fused Multiply-Add (FMA) core instructions.30
Table 2 presents the performance comparison between Titan X GPU and Xeon E5-2698 CPU processors for AlexNet. It is shown that for inference, the GPU solution achieves a better performance per watt.
Today, calculations are usually bandwidth-limited, which means that storing and reading data can take more energy than the computation itself. To solve this problem, many new developments are switching from the control-flow paradigm to the dataflow paradigm.32,33 Cloud services which rely on FPGAs, such as Baidu XPU, still cannot match the performance of today's GPUs or TPUs (tensor processing units), but they offer better energy efficiency for training and inference of DNNs. Companies like Baidu, Intel, and Microsoft have cloud services which rely on FPGAs and, under certain conditions of interest, can achieve better performance per watt. Nurvitadhi et al.18 compare the performance per watt of DNNs, such as AlexNet and VGG, which rely on floating-point matrix multiplication. Table 3 shows that Stratix 10, with a far greater number of DSPs used for multiplication, offers improved FP32 performance compared to Arria 10 but is still behind the powerful Titan X GPU. Stratix 10 can be up to 40% more power-hungry than Titan X at the theoretical peak.
Dense matrix multiplication (theoretical peak) with FP32 data type, on Arria 10 FPGA, Stratix 10 FPGA, and Titan X GPU.
FPGA: field programmable gate array; GPU: graphics processing unit.
In the case of binarized NNs with 1-bit data types, the Stratix 10 FPGA can deliver 10 times better performance per watt than the Titan X GPU, as shown in Table 4.
Dense matrix multiplication for binarized DNNs with 1 bit data types, on Arria 10 FPGA, Stratix 10 FPGA, and Titan X GPU (multiply and add operations are replaced with XOR and BitCount).
DNNs: deep neural networks; FPGA: field programmable gate array; GPU: graphics processing unit.
In terms of performance on ResNet-50 (CNN), two cloud-based services, Google TPU and Amazon EC2, which rely on TPUv2 ASIC and Nvidia V100 GPU, are almost equally fast on the ImageNet dataset, as shown in Table 5. 34 Due to their ability to be fully tuned to DNN design, these two architectural approaches are in many cases used for training DNNs.
ResNet-50 results for training the ImageNet dataset, on Google TPUv2 ASIC and Nvidia V100 GPU for various batch sizes.
TPU: tensor processing unit; ASIC: application-specific integrated circuit; GPU: graphics processing unit.
Owing to their massive parallelism, GPUs are quite often used in cloud services. However, companies like Intel and Amazon also offer CPU cloud-based services. When the performance evaluation is scaled to the cloud level, with a number of processors running simultaneously, GPU and ASIC solutions would definitely be faster than CPU and FPGA solutions.
Table 6 presents a comparison of different architectural approaches commonly used in cloud-based services, focusing on attributes such as the year of production, platform type, technology, clock rate, memory type, and transistor count.
Comparison of different architectural approaches commonly used in cloud-based services.
TPU: tensor processing unit; CPU: central processing unit; GPU: graphics processing unit; FPGA: field programmable gate array; ASIC: application-specific integrated circuit; DRAM: dynamic random-access memory; BRAM: block random-access memory; SRAM: static random-access memory
Cloud/multicore
Intel AI DevCloud35 is a full-stack platform for DL based on Xeon Scalable processors, which enables fast training and inference of DNNs by splitting calculations across multiple dedicated nodes. A group of researchers29 trained ResNet-50 and AlexNet in record time (presented in their publication) by scaling computations to more dedicated CPU nodes.
Cloud/manycore
Amazon EC2 provides GPU cloud-based acceleration with up to 8 GB of GPU memory, which could be used for almost any massive workload. In the open literature, there are a number of DNN implementations which exploit the above-mentioned cloud-based platform. Strom 36 solves the well-known communication bottleneck problem, which arises for data-parallel stochastic gradient descent. This reduction in communication bandwidth enables efficient scaling to more parallel GPU nodes than any other method, thus achieving better accuracy in the resulting DNN.
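The communication-reduction idea can be sketched as threshold-based gradient compression in the spirit of Strom's approach: only gradient components whose magnitude exceeds a threshold are transmitted, as (index, sign) pairs, while the untransmitted remainder accumulates in a local residual. The function names and the threshold value below are ours, chosen for illustration:

```python
# Sketch: threshold-based gradient compression for data-parallel SGD.
# Large components are sent as compact (index, sign) messages; the rest
# is carried forward in a residual so no gradient mass is lost.

def compress(grad, residual, tau):
    g = [gi + ri for gi, ri in zip(grad, residual)]
    msgs, new_residual = [], []
    for i, gi in enumerate(g):
        if gi >= tau:
            msgs.append((i, +1))          # transmit +tau
            new_residual.append(gi - tau)
        elif gi <= -tau:
            msgs.append((i, -1))          # transmit -tau
            new_residual.append(gi + tau)
        else:
            new_residual.append(gi)       # keep locally, send nothing
    return msgs, new_residual

grad = [0.9, -0.05, 0.3, -1.2]
msgs, residual = compress(grad, [0.0] * 4, tau=0.5)
# only the two large components are transmitted
```

Because each worker sends a handful of small messages instead of a dense gradient vector, the bandwidth per synchronization step shrinks dramatically, which is what enables scaling to more parallel GPU nodes.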
Cloud/FPGA
Baidu XPU is an FPGA cloud-based accelerator with 256 cores, running at 600 MHz, with a shared memory for data synchronization. In the open literature, there are many papers which propose FPGA-based DNN designs. A group of researchers from Intel evaluated the performance of algorithms for training DNNs on Intel FPGAs (Arria 10, Stratix 10) against the high-performance Titan X Pascal GPU.18 Their results show that in environments where execution time is not a prime concern, FPGAs could become the optimal platform for accelerating the next generation of DNNs.
Cloud/ASIC
Google TPU6 is an ASIC accelerator developed by Google, especially suited for DNNs. It runs at 700 MHz and consumes 40 W, which is extremely low for cloud services. The TPU is connected to its host via a peripheral component interconnect (PCI) Express bus which provides high bandwidth (up to 12 GB/s). TensorFlow, a well-known framework for machine learning, supports the TPU as an execution target; nodes of a dataflow execution graph are mapped across many machines in a cluster, including CPUs, GPUs, and ASICs.
DNNs in fog computing
Fog computing is an infrastructure which provides compute and storage distributed in the most logical and efficient place, between end devices and cloud servers. Fog computing is performed by edge devices, at local area network level, at the edge of the network. It enables localization, therefore enabling low and admissible latency for real-time applications, which is necessary in smart grids and vehicular ad hoc networks (VANETs).
One of the first attempts to define fog computing was made by Cisco,37 which described fog computing as a mini-cloud located at the edge of the network. It consists of a variety of interconnected edge devices. Edge devices which contain processing units and storage are used not only to process their own data but also to process external requests. In the open literature, there are several definitions of fog computing, from Cloudlets and Mobile Edge Computing to Intelligent Transport Systems Clouds.37
Fog to cloud computing is an extension of cloud computing in the way which brings processing closer to end devices, thus fulfilling essential latency and security requirements, while minimizing traffic load in the network. 38 This new emerging field addresses applications and services which do not fit well in the existing paradigms.
Fog-based solutions often rely on low-power processors, enabling computations in environments with limited resources. In the open literature, there are attempts to create fog solutions for DL which implement intercommunication between deployed edge devices, often based on low-power CPUs and FPGAs. Yi et al.39 introduce and define fog computing, and also present its benefits over cloud computing by comparing latency, bandwidth, and migration performance. An example shows that the response time for face recognition could be drastically reduced by using fog computing instead of cloud computing, with similar time spent on the computation task itself.
Fog/multicore
Intel built the Fog Reference Design (FRD), a reference design for testing and demonstrating fog computing across many use cases, in order to accelerate market adoption of fog technologies.40 It is still used at universities for platform testing and research. Since end devices do not have enough computing and storage resources to perform analytics, and since cloud servers, on the contrary, are too far away to process data and respond in time, data processing could be done by edge devices.
Fog/manycore
In the open literature, there is no paper which introduces fog computing using the manycore approach. Most papers are focused on dew computing, which is discussed in the next section.
Fog/FPGA
An FPGA fog-based platform 41 allocates less than 15% of resources in FPGA, while the rest of the resources are available for user-defined applications. It allows high throughput computations and enables post-deployment updates.
Fog/ASIC
In the open literature, there is no paper which introduces fog computing using ASICs. Most papers are focused on dew computing, which is discussed in the next section.
DNNs in dew computing
Dew computing brings processing close to the data source, where data are processed by end devices, without sending data to a remote cloud. Using such an approach, applications which have strict latency demands could have significantly better performance by eliminating latency. In dew computing, data are processed on end devices (IoT and WSN), addressing the concerns of the strict latency demands, data safety, and privacy. 42
Table 7 presents the performance comparison between Tegra X1 (FP16) GPU and Core i7 6700K CPU low-voltage processors for AlexNet. 15 It is shown that the GPU solution achieves better performance per watt for inference.
In contrast to cloud services, end devices have limited power resources, which means that massive workloads cannot be handled there, making the training of DNNs on end devices almost impossible. Despite the limited resources, an NN can efficiently perform inference in the dew. End devices have to be ultra power efficient in order to perform the inference of DNNs. In the open literature, there are many ASIC- and FPGA-based end devices specialized for machine learning.43,44 FPGAs provide low-precision data types and sparsity, which drive the adoption of FPGAs over GPUs for inference of DNNs. TrueNorth45 can be the optimal solution for inference in the dew, due to its energy-efficient architecture, consuming 70 mW. ASICs, such as the TPU, often deliver up to 30× faster inference than CPUs or GPUs, and even better performance per watt. Han et al.46 give a comprehensive comparison of different architectural approaches for DNNs and propose an ASIC-based efficient inference engine for compressed DNNs. The article further compares the proposed inference engine against a CPU and a mobile GPU on nine benchmarks selected from AlexNet, VGG16, and NeuralTalk. The proposed engine outperforms the CPU and mobile GPU by 189× and 307× on average, respectively.
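One ingredient of the DNN compression behind such inference engines is magnitude-based weight pruning: small weights are zeroed and only the sparse remainder is stored and multiplied. The sketch below is a simplified illustration of that single step (function names and the threshold are ours, not the cited engine's actual pipeline):

```python
# Sketch: magnitude pruning plus a sparse dot product. Only surviving
# weights are stored as (index, value) pairs, so both memory fetches
# and multiply-accumulates scale with the number of nonzeros.

def prune(weights, threshold=0.1):
    return [(i, w) for i, w in enumerate(weights) if abs(w) >= threshold]

def sparse_dot(sparse_w, x):
    return sum(w * x[i] for i, w in sparse_w)

w = [0.8, 0.02, -0.5, 0.01, 0.3]
sw = prune(w)                          # keeps 3 of 5 weights
y = sparse_dot(sw, [1.0, 1.0, 1.0, 1.0, 1.0])
```

Since DRAM accesses dominate the energy budget, skipping the pruned weights saves energy roughly in proportion to the sparsity, which is why compression pays off most on power-constrained end devices.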
Besides ASICs and FPGAs, in the open literature, there are several ultra-low-voltage chips aimed at DL which are based on multicore and manycore architectures. Intel Atom and Nvidia Tegra X microprocessors are ultra-low-voltage chips consuming less than 20 W. The ability to process data locally with limited power is extremely useful when connectivity bandwidth is limited, when latency is critical in real-time scenarios, or where privacy and security are a concern. Nvidia recently launched the energy-efficient T4 inference accelerator with low-precision calculations, supporting INT4 precision, which is about two times faster than other Nvidia architectures with higher precision calculations. Intel also provides inference acceleration on Xeon through its custom instruction set, available both for multicore processors and for FPGA cards.
Hubara et al.47 discuss quantized neural networks (QNNs)—NNs with extremely low precision. QNNs are extremely useful in environments where resources are limited, such as end devices. In a QNN, arithmetic operations are replaced with bitwise operations, so it allocates only a small amount of resources.
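To make the resource savings concrete, here is a minimal sketch of uniform symmetric quantization, one common scheme among several used in the quantization literature (the function names, bit width, and example weights are illustrative, not from the cited paper):

```python
# Sketch: symmetric per-tensor quantization of FP32 weights to int8,
# shrinking storage 4x at the cost of bounded rounding error.

def quantize(weights, bits=8):
    qmax = 2 ** (bits - 1) - 1            # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.51, -0.32, 0.08, -1.27]
q, scale = quantize(w)
w_hat = dequantize(q, scale)   # each value recovered within one step
```

Each dequantized weight differs from the original by at most half a quantization step, which is why low-bit inference can retain accuracy while using a fraction of the memory bandwidth.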
Table 8 presents a comparison of different architectural approaches commonly used in dew solutions, for attributes such as the year of production, platform type, technology, clock rate, memory type, and transistor count.
Comparison of different architectural approaches commonly used in dew-based solutions.
CPU: central processing unit; GPU: graphics processing unit; FPGA: field programmable gate array; ASIC: application-specific integrated circuit; DRAM: dynamic random-access memory; BRAM: block random-access memory; SRAM: static random-access memory.
Dew/multicore
Intel has created a library for DNNs, 48 which contains vectorized and threaded building blocks that can be used for implementing DNNs. The library is optimized for ultra-low-voltage processors such as Intel Atom and Intel Core Solo. These ultra-low-voltage processors are often embedded into IoT and WSN systems ranging from health care to VANETs.
Dew/manycore
Nvidia Tegra X is a power-efficient processor intended for end devices (IoT and WSN) and suitable for DNN inference, due to its low power consumption, which is of crucial importance in edge computing. Han et al.49 measure the total power consumption of a compressed DNN implemented on the Jetson development board, which is based on the Tegra X processor. The results show that the Tegra X processor achieves three to seven times higher energy efficiency on the compressed NN.
Dew/FPGA
FPGAs are a mainstream architectural approach in embedded systems, due to their low power consumption and their ability to be configured for a specific NN. With specifically designed hardware, FPGAs can beat powerful GPUs in speed and energy efficiency. Guo et al.50 propose various FPGA-based accelerator designs with software and hardware optimization techniques which achieve high speed and energy efficiency compared to existing GPU solutions.
Dew/ASIC
IBM TrueNorth 45 is a neuromorphic chip with 4096 cores and transistor count of about 5.4 billion. The chip contains over a million neurons, where each core contains 256 simulated neurons. In turn, each neuron has 256 programmable synapses, which convey signals among them. Using IBM TrueNorth, a group of researchers 45 trained a DNN with accuracy over 95%, consuming under 200 mW, to recognize different hand gestures. In the open literature, there are other ASICs which could be used for DNNs in IoT and WSN such as Intel Loihi 21 or EIE. 46
Proposed classification summary
From everything presented above, from the general viewpoint of this research, it can be concluded that the optimal solution among existing solutions mostly depends on the application and the environment. In dew computing, where computational and power resources are limited, FPGA and ASIC solutions can be treated as optimal for inference, due to their power efficiency and flexibility. In cloud computing, where massive workloads are done, GPU and ASIC solutions are optimal for training, due to their enormous processing power and ability to parallelize a massive workload.
It could be concluded that hybrid fog computing is the optimal platform for the future development of DNNs, where inference could be done in the dew, preprocessing and updating in the fog, and training in the cloud. Teerapittayanon et al. 51 propose distributed DNNs consisting of the cloud, the fog, and the dew, where inference is done in dew and fog, while training is done in the cloud.
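The hybrid division of labor can be sketched as an early-exit decision: a small model on the end device answers when it is confident, and escalates the sample toward the fog or cloud otherwise. The confidence measure (normalized entropy), the threshold, and the function names below are illustrative placeholders, not the cited system's design:

```python
import math

# Sketch: confidence-gated inference for a cloud/fog/dew hierarchy.
# A local (dew) classifier answers high-confidence samples; the rest
# are escalated to a larger remote model.

def confidence(probs):
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))   # 1 = certain, 0 = uniform

def classify(probs_dew, escalate, threshold=0.8):
    if confidence(probs_dew) >= threshold:
        local = max(range(len(probs_dew)), key=probs_dew.__getitem__)
        return local, "dew"
    return escalate(), "cloud"

# A confident local prediction is answered in the dew, without any
# network traffic; an uncertain one triggers the remote model.
label, where = classify([0.98, 0.01, 0.01], escalate=lambda: 2)
```

Most easy samples then never leave the end device, which is precisely how distributed DNNs cut network traffic and latency while keeping the heavyweight training and the hardest inferences in the cloud.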
Conclusion
This article classifies architectural approaches (CPU, GPU, FPGA, and ASIC) used at different deployment locations (cloud, fog, and dew) for training and inference of DNNs. The classification defines 12 different categories, where each category is illustrated with its most representative example. For further development of DNNs, it is of crucial importance to consider which type of underlying hardware is the most suitable one for a particular deployment location under specific conditions of interest.
Traditional cloud-based infrastructures are not strong enough for the current demands of IoT and WSN applications, due to limitations in latency and network bandwidth. Yet when data processing moves to the edge of the system, closer to data sources, the available computational power is often not strong enough to solve big data problems in DNNs.
Bearing in mind the enormous growth of IoT and WSN, we consider that its full potential could be exploited by combining cloud, fog, and dew computing with different architectural approaches. Which of the three options is the most effective one depends on the application and the environment. New research problems which arise nowadays are how to migrate cloud-based solutions to the fog layer, and possibly further out into the dew layer, and, additionally, how to implement hybrid solutions that could provide an optimal trade-off between data processing bandwidth, data processing latency, and processing power consumption.
