Abstract
Keywords
Introduction
Over the last several years, with the proliferation of data and devices, machine learning algorithms have become unavoidable in almost every aspect of human life,1 especially when deep learning (DL) techniques are used. DL techniques, such as deep neural networks (DNNs), are among the best solutions for many problems in image understanding. DNNs are used in applications like face recognition, document analysis, image classification, labeling, and many others.2–4
Training and inference of DNNs can be done in the cloud, with enormous computational power but hundreds of kilometers away from data sources; in the fog, with less computational power but much closer to data sources; and in the dew, at the data source itself, where processing is closely coupled with Internet of Things (IoT) devices and Wireless Sensor Networks (WSNs) connected to end devices.
Cloud computing provides enormous computational power and storage resources available on demand, where users can scale services to fit their needs. Fog computing provides less computational power and storage resources, but with a wider geographical distribution than cloud computing, which makes it well suited for data preprocessing. Fog computing is performed by edge devices (routers, routing switches, multiplexers, and integrated access devices), which provide an entry point into the enterprise network. Dew computing provides minimal computational power and storage resources in close coupling with IoT and WSN at the very edge of the system. Dew computing is performed by end devices like personal computers, mobile phones, and sensors.
DNNs can include hundreds of millions of connections between nodes, which makes them computationally and memory intensive and therefore difficult to deploy on end devices with limited resources. While embedded processors enable arithmetic logic unit (ALU)-based computations with reasonable power consumption, fetching weights from dynamic random-access memory (DRAM), static random-access memory (SRAM), or block random-access memory (BRAM) is several orders of magnitude more power-hungry than ALU operations.
Besides deployment locations, it is hard to determine which underlying hardware is the optimal solution for a wide spectrum of DNNs. Depending on the application type and the amount of data, the following architectural approaches can be used: general-purpose processors (multicore central processing units (CPUs)), graphics processing units (manycore GPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs).
This article discusses advantages and disadvantages of different deployment locations and different architectural approaches used for training and inference of DNNs, comparing them in terms of aspects such as speed, power consumption, and complexity. It then proposes a novel classification of deployment locations, from the viewpoint of underlying hardware. The goal of this article is to provide insights into design choices related to hardware used for DNNs at different deployment locations.
In our search, which we conducted using the Google Scholar database, we focused on keywords related to DNNs and hardware acceleration. Specifically, we searched for the following terms: “an Analysis of DNNs,” “Hardware Acceleration of DNNs,” “Deep Learning in Neural Networks,” “Tensorflow,” “Neuron Programmable Chip,” “Convolution Neural Networks,” “Binarized Neural Networks,” “Training Deep FeedForward Neural Networks,” “Image Segmentation with Deep Neural Networks,” “Memristors for Neural Networks,” “Image Classification using Deep Neural Networks,” “Cloud Computing,” “Fog Computing,” and “Edge Computing.”
The “Compute intensive training and strict latency inference of DNNs” section discusses problems with training and inference of DNNs at different deployment locations, defines the big data problem, and explains why the problem will grow over time. The “Analysis of existing solutions” section gives an overview of existing articles available in the open literature, presents the advantages of state-of-the-art solutions for training and inference of DNNs, and discusses major comparative differences. The “Classification criteria” section classifies existing solutions into 12 categories with examples, based on the underlying hardware and architectural approach in connection with various deployment locations. The “Conclusion” section concludes this article, discusses the main advantages of the proposed classification, defines what would be the optimal solution for the future, and presents newly opened research problems.
Compute intensive training and strict latency inference of DNNs
Training of DNNs is a much more compute-intensive process than inference,5 requiring enormous computational power, especially for big data applications. Inference of DNNs often has strict latency demands, especially in real-time scenarios such as driverless cars and search engines, where, in both cases, the latency budget is less than a second per sample. While academia is focused on faster training, industry is often more focused on faster inference, creating custom solutions both in software and in hardware.
For big data applications, in environments with limited resources such as embedded systems, it is hard to train a DNN, due to the lack of computational power and energy resources. For real-time applications with strict latency demands, inference in the cloud environment is a non-trivial problem, since processing could be done hundreds of kilometers away. Currently, big data applications use powerful GPUs for DNNs in the cloud environment, while FPGAs, ASICs, and embedded processors of various types are on the rise in fog and dew environments suited for DL.6,7
Analysis of existing solutions
Companies like Amazon, Google, Microsoft, Baidu, and many others recently created web services for image understanding. Nowadays, they are aiming to create artificial intelligence cloud-based platforms for image understanding which could be used for almost any type of data, 8 referred to as Machine-Learning-as-a-Service (MLaaS). MLaaS presents a set of machine learning tools which are a part of cloud computing, including frameworks, data visualization, application programming interfaces (APIs), and predictive analysis.
Cloud computing currently offers the best solution for training and inference of DNNs if execution time is the primary concern. Even though cloud computing offers immense opportunities, the technology is still struggling with many issues which have to be addressed, such as latency and security concerns,9,10 especially in real-time scenarios. The amount of power required to send data to the cloud is enormous, which is why even the most powerful cloud solutions must make a great effort to overcome latency concerns.11
There is another trend which shifts data and processing significantly away from the cloud, toward end devices. Huang et al.12 propose a dew machine learning framework which demonstrates superiority by reducing network traffic and running time under certain conditions of interest. This could be of crucial importance in environments such as drones, ships, and research stations, where execution time is not of primary concern and where ultra-low-power solutions are beneficial. Even data mining could be ported to end devices.13
Schmidhuber14 presents the essence of DNNs, how they have evolved over time, and which cutting-edge DNNs exist today. In the open literature, there are many papers which discuss implementations of DNNs using various architectural approaches. The history of DNNs for image understanding with convolutional neural networks (CNNs) began with AlexNet,15 a DNN which won the ImageNet competition, where the goal was to classify 1.2 million high-resolution images into 1000 different classes. The authors used an efficient GPU implementation of the convolution operation which relied on two GTX 580 GPUs with 3 GB of memory. State-of-the-art CNNs, such as the VGG, Inception, GoogleNet, ResNet, and other families,16 are suitable for training on highly parallelizable GPUs. These DNNs are very powerful; however, they demand substantial computational and memory resources.
While training is mostly dominated by GPUs, FPGAs provide superior efficiency over GPUs for binarized NNs—NNs with binary weights—which are often used in environments with limited resources, particularly embedded systems. A group of researchers18 present power-efficient training of a binarized NN, where most arithmetic operations are replaced with bitwise operations. That paper further evaluates the opportunity to improve the execution efficiency of a binarized NN through hardware acceleration.
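As a concrete illustration of the bitwise substitution: when weights and activations are constrained to {−1, +1} and packed into machine words, a multiply-accumulate reduces to XNOR followed by a population count. The sketch below (the function name and bit encoding are ours, not taken from the cited paper) shows the idea:

```python
# Sketch: binarized dot product. Vectors over {-1, +1} are encoded as
# bits (1 -> +1, 0 -> -1), so multiply-accumulate becomes XNOR + popcount.

def binarized_dot(a_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two n-element {-1,+1} vectors packed into integers."""
    # XNOR marks positions where the signs agree (elementwise product = +1).
    agree = ~(a_bits ^ w_bits) & ((1 << n) - 1)
    matches = bin(agree).count("1")
    # Each agreement contributes +1, each disagreement contributes -1.
    return 2 * matches - n

# Example: a = [+1, -1, +1, +1], w = [+1, +1, -1, +1] (LSB first)
result = binarized_dot(0b1101, 0b1011, 4)   # products: +1, -1, -1, +1 -> 0
```

On an FPGA, the XNOR and popcount map directly onto lookup tables, which is the source of the efficiency advantage discussed above.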
In the open literature, there are a number of ASICs developed for DNNs such as neuromorphic chips. 19 IBM TrueNorth 20 and Intel Loihi 21 are neuromorphic chips which enable parallel computations used for learning and inference of DNNs needing high efficiency. Another trend implies analog computational units for running DNNs 22 such as memristors. Quantum computing, the emerging interdisciplinary research area of quantum physics and machine learning, could also be used for training of DNNs. 23
Classification criteria
The classification proposed in this article is illustrated in Table 1. Each category contains examples from the open literature, if such examples exist; each example describes the essence of the paradigm and its advantages and disadvantages in comparison to other categories. An empty category in the classification can be seen as an opportunity for further research and development. Creativity in this article follows the path referred to as generalization.24
Proposed classification with 12 different categories and examples of different underlying hardware used at deployment locations.
CPU: central processing unit; FRD: Fog Reference Design; GPU: graphics processing unit; FPGA: field programmable gate array; ASIC: application-specific integrated circuit; TPU: tensor processing unit.
We immediately found that different architectural approaches do not allow for a full direct comparison of performance, power consumption, latency, and data processing bandwidth of DNNs described in the open literature, due to different testing conditions and different DNN designs. Therefore, this section partially compares different architectural approaches through aspects such as speed, power, and speed per watt, using the data available in the open literature.
DNNs in cloud computing
Cloud computing is the most powerful massive workload computing model today. The main advantage of cloud computing is its flexibility where users can scale resources and services to fit their needs and access them from anywhere. It provides the enormous computational power needed for training and inference of DNNs. The most important aspect of performance evaluation for training of DNNs is the computational power needed for such massive workload, while inference heavily depends on power consumption.
Cloud-based solutions are often implemented on powerful manycore processors, which make the massive parallel calculations for training of DNNs possible, while multicore processors orchestrate data movements into and from the memory of GPUs. Nvidia recently introduced mixed-precision arithmetic, where calculations are performed using FP16 (half precision) instead of FP32 (single precision) or FP64 (double precision); the reduced precision is not critical for DNN accuracy, but it reduces the time required for calculations and for moving data into and from the shared memory.25,26 Such an approach reduces memory usage and allows for training of larger DNNs. Tensor cores, introduced in Nvidia's Volta architecture, are another optimization construct which accelerates the training of DNNs.27
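The mixed-precision pattern works because storage precision and accumulation precision are decoupled: operands live in FP16 (halving memory traffic), while sums are accumulated at higher precision. A minimal NumPy sketch of why that pairing matters (the values are illustrative, not a benchmark):

```python
import numpy as np

# FP16 storage uses half the bytes of FP32.
x = np.full(10000, 0.1, dtype=np.float16)
assert x.nbytes == x.astype(np.float32).nbytes // 2

# Accumulating in FP16 stalls once the rounding step of the running sum
# exceeds the addend; accumulating in FP32 stays close to the true sum.
fp16_sum = np.float16(0)
for v in x:
    fp16_sum = np.float16(fp16_sum + v)
fp32_sum = x.astype(np.float32).sum()   # FP16 storage, FP32 accumulation
```

The FP16 accumulator gets stuck far below the true total of roughly 1000, while the FP32 accumulator does not; tensor cores institutionalize exactly this FP16-multiply/FP32-accumulate split.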
Besides powerful GPUs, cloud services for DNNs can be based on multicore processors, usually with 2 to 8 cores, or up to 18 cores in the case of the Intel i9. There are also attempts to create processors with up to 72 cores, like the Intel Xeon Phi.28 A group of researchers29 trained a DNN on a cluster of 1024 CPUs. Intel also introduced a new instruction set (AVX-512), available on the latest Xeon Phi and Skylake-X CPUs, which can accelerate performance for massive DNN workloads. The new instruction set enables lower precision operations using Fused Multiply-Add (FMA) core instructions.30
Table 2 presents the performance comparison between Titan X GPU and Xeon E5-2698 CPU processors for AlexNet. It is shown that for inference, the GPU solution achieves a better performance per watt.
Today, calculations are usually bandwidth-limited, which means that storing and reading data can take more energy than the computation itself. To solve this problem, many new developments are switching from the control-flow paradigm to the dataflow paradigm.32,33 Cloud services which rely on FPGAs, such as Baidu XPU, still cannot match the performance of today's GPUs or TPUs (tensor processing units), but they offer better energy efficiency for training and inference of DNNs. Companies like Baidu, Intel, and Microsoft have cloud services which rely on FPGAs and, under certain conditions of interest, can achieve better performance per watt. Nurvitadhi et al.18 compare the performance per watt of DNNs, such as AlexNet and VGG, which rely on floating-point matrix multiplication. Table 3 shows that Stratix 10, with a far greater number of DSPs used for multiplication, offers improved FP32 performance compared to Arria 10 but is still behind the powerful Titan X GPU. Stratix 10 can be up to 40% more power-hungry than Titan X at the theoretical peak.
Dense matrix multiplication (theoretical peak) with FP32 data type, on Arria 10 FPGA, Stratix 10 FPGA, and Titan X GPU.
FPGA: field programmable gate array; GPU: graphics processing unit.
In the case of binarized NNs with 1-bit data types, the Stratix 10 FPGA can deliver 10 times better performance per watt than the Titan X GPU, as shown in Table 4.
Dense matrix multiplication for binarized DNNs with 1 bit data types, on Arria 10 FPGA, Stratix 10 FPGA, and Titan X GPU (multiply and add operations are replaced with XOR and BitCount).
DNNs: deep neural networks; FPGA: field programmable gate array; GPU: graphics processing unit.
In terms of performance on ResNet-50 (CNN), two cloud-based services, Google TPU and Amazon EC2, which rely on TPUv2 ASIC and Nvidia V100 GPU, are almost equally fast on the ImageNet dataset, as shown in Table 5. 34 Due to their ability to be fully tuned to DNN design, these two architectural approaches are in many cases used for training DNNs.
ResNet-50 results for training the ImageNet dataset, on Google TPUv2 ASIC and Nvidia V100 GPU for various batch sizes.
TPU: tensor processing unit; ASIC: application-specific integrated circuit; GPU: graphics processing unit.
Owing to their massive parallelism, GPUs are quite often used in cloud services. However, companies like Intel and Amazon also offer CPU cloud-based services. When the performance evaluation is scaled to the cloud level, with a number of processors running simultaneously, GPU and ASIC solutions would definitely be faster than CPU and FPGA solutions.
Table 6 presents a comparison of different architectural approaches commonly used in cloud-based services, focusing on attributes such as the year of production, platform type, technology, clock rate, memory type, and transistor count.
Comparison of different architectural approaches commonly used in cloud-based services.
TPU: tensor processing unit; CPU: central processing unit; GPU: graphics processing unit; FPGA: field programmable gate array; ASIC: application-specific integrated circuit; DRAM: dynamic random-access memory; BRAM: block random-access memory; SRAM: static random-access memory
Cloud/multicore
Intel AI DevCloud35 is a full-stack platform for DL based on Xeon Scalable processors, which enables fast training and inference of DNNs by splitting calculations across multiple dedicated nodes. A group of researchers29 trained ResNet-50 and AlexNet in record time (presented in their publication) by scaling computations to more dedicated CPU nodes.
Cloud/manycore
Amazon EC2 provides GPU cloud-based acceleration with up to 8 GB of GPU memory, which could be used for almost any massive workload. In the open literature, there are a number of DNN implementations which exploit the above-mentioned cloud-based platform. Strom 36 solves the well-known communication bottleneck problem, which arises for data-parallel stochastic gradient descent. This reduction in communication bandwidth enables efficient scaling to more parallel GPU nodes than any other method, thus achieving better accuracy in the resulting DNN.
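The communication-reduction idea can be sketched as threshold-based gradient compression in the spirit of Strom's approach: only gradient components whose magnitude exceeds a threshold are transmitted, as (index, sign) pairs, while the untransmitted remainder accumulates in a local residual. The function names and the threshold value below are ours, chosen for illustration:

```python
# Sketch: threshold-based gradient compression for data-parallel SGD.
# Large components are sent as compact (index, sign) messages; the rest
# is carried forward in a residual so no gradient mass is lost.

def compress(grad, residual, tau):
    g = [gi + ri for gi, ri in zip(grad, residual)]
    msgs, new_residual = [], []
    for i, gi in enumerate(g):
        if gi >= tau:
            msgs.append((i, +1))          # transmit +tau
            new_residual.append(gi - tau)
        elif gi <= -tau:
            msgs.append((i, -1))          # transmit -tau
            new_residual.append(gi + tau)
        else:
            new_residual.append(gi)       # keep locally, send nothing
    return msgs, new_residual

grad = [0.9, -0.05, 0.3, -1.2]
msgs, residual = compress(grad, [0.0] * 4, tau=0.5)
# only the two large components are transmitted
```

Because each worker sends a handful of small messages instead of a dense gradient vector, the bandwidth per synchronization step shrinks dramatically, which is what enables scaling to more parallel GPU nodes.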
Cloud/FPGA
Baidu XPU is an FPGA cloud-based accelerator with 256 cores, running at 600 MHz, with a shared memory for data synchronization. In the open literature, there are many papers which propose FPGA-based DNN designs. A group of researchers from Intel evaluated the performance of algorithms for training DNNs on Intel FPGAs (Arria 10, Stratix 10) against the high-performance Titan X Pascal GPU.18 Their results show that in environments where execution time is not a prime concern, FPGAs could become the optimal platform for accelerating the next generation of DNNs.
Cloud/ASIC
Google TPU6 is an ASIC accelerator developed by Google, especially suited for DNNs. It runs at 700 MHz and consumes 40 W, which is extremely low for cloud services. The TPU is connected to its host via a peripheral component interconnect (PCI) Express bus which provides high bandwidth (up to 12 GB/s). TensorFlow, a well-known framework for machine learning, supports the TPU as an execution target; nodes of a dataflow execution graph are mapped across many machines in a cluster, including CPUs, GPUs, and ASICs.
DNNs in fog computing
Fog computing is an infrastructure which provides compute and storage distributed in the most logical and efficient place, between end devices and cloud servers. Fog computing is performed by edge devices, at local area network level, at the edge of the network. It enables localization, therefore enabling low and admissible latency for real-time applications, which is necessary in smart grids and vehicular ad hoc networks (VANETs).
One of the first attempts to define fog computing was made by Cisco,37 which described fog computing as a mini-cloud located at the edge of the network. It consists of a variety of interconnected edge devices. Edge devices which contain processing units and storage are used not only to process their own data but also to process external requests. In the open literature, there are several definitions of fog computing, from Cloudlets and Mobile Edge Computing to Intelligent Transport Systems Clouds.37
Fog to cloud computing is an extension of cloud computing in the way which brings processing closer to end devices, thus fulfilling essential latency and security requirements, while minimizing traffic load in the network. 38 This new emerging field addresses applications and services which do not fit well in the existing paradigms.
Fog-based solutions often rely on low-power processors, enabling computations in environments with limited resources. In the open literature, there are attempts to create fog solutions for DL which implement intercommunication between deployed edge devices, often based on low-power CPUs and FPGAs. Yi et al.39 introduce and define fog computing, and also present its benefits over cloud computing by comparing latency, bandwidth, and migration performance. An example shows that the response time for face recognition could be drastically reduced by using fog computing instead of cloud computing, with similar time spent on the computation task itself.
Fog/multicore
Intel built the Fog Reference Design (FRD), a reference design for testing and demonstrating fog computing across many use cases, in order to accelerate market adoption of fog technologies.40 It is still used at universities for platform testing and research. Since end devices do not have enough computing and storage resources to perform analytics, and since cloud servers, on the contrary, are too far away to process data and respond in time, data processing could be done by edge devices.
Fog/manycore
In the open literature, there is no paper which introduces fog computing using the manycore approach. Most papers are focused on dew computing, which is discussed in the next section.
Fog/FPGA
An FPGA fog-based platform 41 allocates less than 15% of resources in FPGA, while the rest of the resources are available for user-defined applications. It allows high throughput computations and enables post-deployment updates.
Fog/ASIC
In the open literature, there is no paper which introduces fog computing using ASICs. Most papers are focused on dew computing, which is discussed in the next section.
DNNs in dew computing
Dew computing brings processing close to the data source, where data are processed by end devices, without sending data to a remote cloud. Using such an approach, applications which have strict latency demands could have significantly better performance by eliminating latency. In dew computing, data are processed on end devices (IoT and WSN), addressing the concerns of the strict latency demands, data safety, and privacy. 42
Table 7 presents the performance comparison between Tegra X1 (FP16) GPU and Core i7 6700K CPU low-voltage processors for AlexNet. 15 It is shown that the GPU solution achieves better performance per watt for inference.
In contrast to cloud services, end devices have limited power resources, which means that massive workloads cannot be handled there, making the training of DNNs on end devices almost impossible. Despite the limited resources, an NN can efficiently perform inference in the dew. End devices have to be ultra power efficient in order to perform the inference of DNNs. In the open literature, there are many ASIC- and FPGA-based end devices specialized for machine learning.43,44 FPGAs provide low-precision data types and sparsity, which drive the adoption of FPGAs over GPUs for inference of DNNs. TrueNorth45 can be the optimal solution for inference in the dew, due to its energy-efficient architecture, consuming 70 mW. ASICs, such as the TPU, often deliver up to 30× faster inference than CPUs or GPUs, and even better performance per watt. Han et al.46 give a comprehensive comparison of different architectural approaches for DNNs and propose an ASIC-based efficient inference engine for compressed DNNs. The article further compares the proposed inference engine against a CPU and a mobile GPU on nine benchmarks selected from AlexNet, VGG16, and NeuralTalk. The proposed engine outperforms the CPU and mobile GPU by 189× and 307× on average, respectively.
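One ingredient of the DNN compression behind such inference engines is magnitude-based weight pruning: small weights are zeroed and only the sparse remainder is stored and multiplied. The sketch below is a simplified illustration of that single step (function names and the threshold are ours, not the cited engine's actual pipeline):

```python
# Sketch: magnitude pruning plus a sparse dot product. Only surviving
# weights are stored as (index, value) pairs, so both memory fetches
# and multiply-accumulates scale with the number of nonzeros.

def prune(weights, threshold=0.1):
    return [(i, w) for i, w in enumerate(weights) if abs(w) >= threshold]

def sparse_dot(sparse_w, x):
    return sum(w * x[i] for i, w in sparse_w)

w = [0.8, 0.02, -0.5, 0.01, 0.3]
sw = prune(w)                          # keeps 3 of 5 weights
y = sparse_dot(sw, [1.0, 1.0, 1.0, 1.0, 1.0])
```

Since DRAM accesses dominate the energy budget, skipping the pruned weights saves energy roughly in proportion to the sparsity, which is why compression pays off most on power-constrained end devices.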
Besides ASICs and FPGAs, in the open literature, there are several ultra-low-voltage chips aimed at DL which are based on multicore and manycore architectures. Intel Atom and Nvidia Tegra X microprocessors are ultra-low-voltage chips consuming less than 20 W. The ability to process data locally with limited power is extremely useful when connectivity bandwidth is limited, when latency is critical in real-time scenarios, or where privacy and security are a concern. Nvidia recently launched the energy-efficient T4 inference accelerator with low-precision calculations, supporting INT4 precision, which is about two times faster than other Nvidia architectures with higher precision calculations. Intel also provides inference acceleration on Xeon through its custom instruction set, available both for multicore processors and for FPGA cards.
Hubara et al.47 discuss quantized neural networks (QNNs)—NNs with extremely low precision. QNNs are extremely useful in environments where resources are limited, such as end devices. In a QNN, arithmetic operations are replaced with bitwise operations, so it allocates only a small amount of resources.
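To make the resource savings concrete, here is a minimal sketch of uniform symmetric quantization, one common scheme among several used in the quantization literature (the function names, bit width, and example weights are illustrative, not from the cited paper):

```python
# Sketch: symmetric per-tensor quantization of FP32 weights to int8,
# shrinking storage 4x at the cost of bounded rounding error.

def quantize(weights, bits=8):
    qmax = 2 ** (bits - 1) - 1            # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.51, -0.32, 0.08, -1.27]
q, scale = quantize(w)
w_hat = dequantize(q, scale)   # each value recovered within one step
```

Each dequantized weight differs from the original by at most half a quantization step, which is why low-bit inference can retain accuracy while using a fraction of the memory bandwidth.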
Table 8 presents a comparison of different architectural approaches commonly used in dew solutions, for attributes such as the year of production, platform type, technology, clock rate, memory type, and transistor count.
Comparison of different architectural approaches commonly used in dew-based solutions.
CPU: central processing unit; GPU: graphics processing unit; FPGA: field programmable gate array; ASIC: application-specific integrated circuit; DRAM: dynamic random-access memory; BRAM: block random-access memory; SRAM: static random-access memory.
Dew/multicore
Intel has created a library for DNNs, 48 which contains vectorized and threaded building blocks that can be used for implementing DNNs. The library is optimized for ultra-low-voltage processors such as Intel Atom and Intel Core Solo. These ultra-low-voltage processors are often embedded into IoT and WSN systems ranging from health care to VANETs.
Dew/manycore
Nvidia Tegra X is a power-efficient processor intended for end devices (IoT and WSN) and suitable for DNN inference, due to its low power consumption, which is of crucial importance in edge computing. Han et al.49 measure the total power consumption of a compressed DNN implemented on the Jetson development board, which is based on the Tegra X processor. The results show that the Tegra X processor achieves three to seven times higher energy efficiency on the compressed NN.
Dew/FPGA
FPGAs are a mainstream architectural approach in embedded systems, due to their low power consumption and their ability to be configured for a specific NN. With specifically designed hardware, FPGAs can beat powerful GPUs in speed and energy efficiency. Guo et al.50 propose various FPGA-based accelerator designs with software and hardware optimization techniques which achieve high speed and energy efficiency compared to existing GPU solutions.
Dew/ASIC
IBM TrueNorth 45 is a neuromorphic chip with 4096 cores and transistor count of about 5.4 billion. The chip contains over a million neurons, where each core contains 256 simulated neurons. In turn, each neuron has 256 programmable synapses, which convey signals among them. Using IBM TrueNorth, a group of researchers 45 trained a DNN with accuracy over 95%, consuming under 200 mW, to recognize different hand gestures. In the open literature, there are other ASICs which could be used for DNNs in IoT and WSN such as Intel Loihi 21 or EIE. 46
Proposed classification summary
From everything presented above, from the general viewpoint of this research, it can be concluded that the optimal solution among existing solutions mostly depends on the application and the environment. In dew computing, where computational and power resources are limited, FPGA and ASIC solutions can be treated as optimal for inference, due to their power efficiency and flexibility. In cloud computing, where massive workloads are done, GPU and ASIC solutions are optimal for training, due to their enormous processing power and ability to parallelize a massive workload.
It could be concluded that hybrid fog computing is the optimal platform for the future development of DNNs, where inference could be done in the dew, preprocessing and updating in the fog, and training in the cloud. Teerapittayanon et al. 51 propose distributed DNNs consisting of the cloud, the fog, and the dew, where inference is done in dew and fog, while training is done in the cloud.
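The hybrid division of labor can be sketched as an early-exit decision: a small model on the end device answers when it is confident, and escalates the sample toward the fog or cloud otherwise. The confidence measure (normalized entropy), the threshold, and the function names below are illustrative placeholders, not the cited system's design:

```python
import math

# Sketch: confidence-gated inference for a cloud/fog/dew hierarchy.
# A local (dew) classifier answers high-confidence samples; the rest
# are escalated to a larger remote model.

def confidence(probs):
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))   # 1 = certain, 0 = uniform

def classify(probs_dew, escalate, threshold=0.8):
    if confidence(probs_dew) >= threshold:
        local = max(range(len(probs_dew)), key=probs_dew.__getitem__)
        return local, "dew"
    return escalate(), "cloud"

# A confident local prediction is answered in the dew, without any
# network traffic; an uncertain one triggers the remote model.
label, where = classify([0.98, 0.01, 0.01], escalate=lambda: 2)
```

Most easy samples then never leave the end device, which is precisely how distributed DNNs cut network traffic and latency while keeping the heavyweight training and the hardest inferences in the cloud.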
Conclusion
This article classifies architectural approaches (CPU, GPU, FPGA, and ASIC) used at different deployment locations (cloud, fog, and dew) for training and inference of DNNs. The classification defines 12 different categories, where each category is illustrated with its most representative example. For further development of DNNs, it is of crucial importance to consider which type of underlying hardware is the most suitable one for a particular deployment location under specific conditions of interest.
Traditional cloud-based infrastructures are not strong enough for the current demands of IoT and WSN applications, due to limitations in latency and network bandwidth. Yet when data processing moves to the edge of the system, closer to data sources, the available computational power is often not strong enough to solve big data problems in DNNs.
Bearing in mind the enormous growth of IoT and WSN, we consider that its full potential could be exploited by combining cloud, fog, and dew computing with different architectural approaches. Which of the three options is the most effective one depends on the application and the environment. New research problems which arise nowadays are how to migrate cloud-based solutions to the fog layer, and possibly further out into the dew layer, and, additionally, how to implement hybrid solutions that could provide an optimal trade-off between data processing bandwidth, data processing latency, and processing power consumption.
