Abstract
Keywords
Introduction
With the rapid development of multimedia technologies and network communication, recent applications must meet stringent constraints on time to market as well as on speed, area, and power consumption. Such demanding applications are well served by field-programmable gate arrays (FPGAs) or graphics processing units (GPUs). In an FPGA, many computation blocks such as logic gates are wired together in different configurations.
At the same time, the general-purpose nature of these basic blocks makes them slower than special-purpose blocks such as the arithmetic and logic units (ALUs) of a GPU.
In fact, GPUs supply hundreds of ALUs coupled with small memory blocks (16 kB). Although initially designed and optimized for real-time rendering, their massively parallel architecture has allowed them to be used in other areas such as data analysis, video and audio processing, and engineering applications.
This is what makes FPGAs and GPUs primary choices for speeding up challenging applications such as voice algorithms with high computational requirements. The challenge is to take advantage of these breakthroughs, although exploiting parallelism on both systems still requires considerable effort.
This paper addresses the utility of GPUs for voice processing. To that end, we implemented the linear prediction coding (LPC) algorithm on a GPU and compared its performance against a state-of-the-art FPGA. The windowing and autocorrelation computation is used as a case study throughout the paper and is implemented on both types of accelerators.
The paper is organized in five sections: the next section details the LPC computation, an algorithm widely used in voice decoders. We then review related work and present the FPGA implementation results. Later sections describe the GPU implementation and the optimizations applied at various levels, and provide an experimental comparison of the proposed parallel implementation. The last section concludes the paper.
LPC computation
In this section, we study a speech compression technique used in many voice coders, particularly in the ITU-T G.729 codec, known as LPC.
LPC analysis is based on the assumption that a predictor model, which looks only at past values of the output, can characterize the speech signal; hence, it is an all-pole model in the Z-transform domain. 1
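For reference, the all-pole synthesis filter commonly assumed in LPC analysis has the following generic form (a sketch in our own notation; the order p and the coefficient symbols a_i are not taken from the standard):

```latex
H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{i=1}^{p} a_i z^{-i}}
```

The analysis estimates the coefficients a_i that minimize the prediction error over each frame.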
A further simplification of the analysis is introduced by the assumption that the vocal tract can be excited by either random noise or an impulse train (Figure 1). The coder acts on blocks of 10 ms, corresponding to 80 samples.
LPC model.
The speech signal is analyzed to evaluate the code-excited linear-prediction (CELP) parameters 2 for every 10 ms frame.
The operations involved in the LPC analysis are summarized in five subtasks. To identify the most critical ones, we profiled the tasks activated during algorithm execution using the Visual C++ profiler on an Intel® Core™ i7-3770 CPU (3.4 GHz, 8 GB main memory).
The complexity of the different blocks as a percentage of the global execution time.
LPC: linear prediction coding.
One can easily notice that the LSP quantization block and the windowing/autocorrelation block are the most expensive. Hence, we can envisage migrating both to the GPU, but we started with the windowing and autocorrelation sub-blocks because the quantization technique used is not very common and is specific to the G.729 standard.
The windowed speech and the autocorrelation computation are given in equations (1) and (2), respectively.
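The generic forms of these two computations can be sketched as follows (the exact window w(n), the frame length L, and the order p are standard-specific; the symbols below are our own notation):

```latex
s_w(n) = w(n)\, s(n), \qquad n = 0, \dots, L-1 \qquad (1)
```

```latex
r(k) = \sum_{n=k}^{L-1} s_w(n)\, s_w(n-k), \qquad k = 0, \dots, p \qquad (2)
```

Each lag k of the autocorrelation is independent of the others, which is what makes this stage attractive for parallelization.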
The most time-consuming stage of the LPC algorithm is the windowing and autocorrelation stage; fortunately, this stage is well suited to a parallel implementation. That is why we propose parallel processing to accelerate it by fully exploiting the GPU. In fact, the peak floating-point operations per second of the best GPUs exceed those of FPGAs with the maximum DSP capabilities.
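As a baseline for the discussion that follows, a minimal sequential sketch of the windowing/autocorrelation stage might look as follows (L_WINDOW, ORDER, and the Hamming window are illustrative assumptions, not the exact G.729 parameters):

```cpp
#include <cmath>
#include <vector>

// Sequential reference sketch of the windowing + autocorrelation stage.
// L_WINDOW, ORDER, and the Hamming window are illustrative assumptions,
// not the exact G.729 parameters.
constexpr int L_WINDOW = 240;   // assumed analysis window length
constexpr int ORDER = 10;       // LPC order
constexpr double PI = 3.14159265358979323846;

// Apply the analysis window to the input frame (cf. equation (1)).
std::vector<double> window_frame(const std::vector<double>& s) {
    std::vector<double> sw(L_WINDOW);
    for (int n = 0; n < L_WINDOW; ++n) {
        double w = 0.54 - 0.46 * std::cos(2.0 * PI * n / (L_WINDOW - 1));
        sw[n] = w * s[n];
    }
    return sw;
}

// Compute autocorrelation coefficients r[0..ORDER] (cf. equation (2)).
std::vector<double> autocorrelate(const std::vector<double>& sw) {
    std::vector<double> r(ORDER + 1, 0.0);
    for (int k = 0; k <= ORDER; ++k)
        for (int n = k; n < L_WINDOW; ++n)
            r[k] += sw[n] * sw[n - k];
    return r;
}
```

The nested loops over k and n are the computation that the GPU implementation later distributes over threads.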
Related work and FPGA's implementation results
Several publications deal with the optimization of speech encoding algorithms; however, these efforts are targeted towards DSP or FPGA platforms. 3 Audio-related GPGPU applications, on the other hand, are rare. In Tsingos, 4 for example, the audio signal is not processed directly; rather, the GPU is employed for audio rendering based on room acoustics models. In Brent and Bill, 5 the authors present a real-time GPU implementation of the convolution operation for large input vectors, programmed via a graphics API. Our research presented in Atri et al. 6 gives a significant speed-up, obtained through an FPGA implementation of the windowing and autocorrelation blocks on a Spartan-3 Starter Kit. We have drawn on this previous study to reimplement these two blocks on a Spartan-6 FPGA, whose performance is estimated at about 200 GFLOPS.
Profiling results with hardware implementation vs. software implementation.
CPU: central processing unit; FPGA: field-programmable gate array.
Clearly, the pure hardware implementation gives a significant speedup over the software implementation. This outcome is valid only if the data transfer time is less than the software execution time; hence, we estimated the different transfer times. The transmission and reception times were measured for different numbers of samples. The reception time is relatively small since at most about 10 coefficients are transmitted back. The total execution time on the FPGA including transfers is obtained by summing the emission, reception, and FPGA computing times, and is given in Table 2.
Table 2 shows that the computation time on the FPGA is negligible compared to the transfer times. We note that, even with the transfer time included, the hardware solution outperforms the software one. These results will be compared with those obtained by a parallel GPU implementation in CUDA in order to assess the capabilities of this new generation of processors.
Parallel implementation on GPU
Hundreds of applications increasingly rely on parallelism to achieve higher performance.7,8 Consequently, recent years have seen the emergence of GPGPU standards as sources of massive computing power. This trend toward parallel computing motivated NVIDIA to promote the CUDA programming model, which offers a simple framework 9 for running parallel programs.
To exploit the full potential of GPUs, a radical change must be made to the software infrastructure. Once the critical portion has been identified, a parallel implementation written in CUDA C/C++ can be launched as a CUDA kernel onto the GPU. Execution is performed by "threads" arranged in thread blocks. The threads within a block can collaborate and synchronize their execution to coordinate their memory accesses, as shown in Figure 2.
Hardware structure and memory hierarchy of CUDA.
Hardware selection
Our experiments were carried out on an NVIDIA GeForce GTX 480, which has 480 CUDA cores clocked at 1401 MHz, 48 kB of per-block shared memory per SM, and 1536 MB of device memory. We also used an Intel® Core™ i7-3770 CPU with a base clock of 3.4 GHz and 8 GB of main memory. We developed our GPU code using NVIDIA driver version 378.66 and CUDA 7.5.
Parallel execution and optimization strategies
We write a sequential program that invokes parallel kernels. When calling a kernel, we specify the characteristics of the grid, which can be divided into one or more blocks, each of which runs multiple threads. 10
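The mapping from this grid/block hierarchy to a flat sample index is the usual one; a small sketch of the index arithmetic (CUDA's built-ins blockIdx.x, blockDim.x, and threadIdx.x are emulated here as plain integer parameters) is:

```cpp
// Emulation of CUDA's global thread index computation:
//   tid = blockIdx.x * blockDim.x + threadIdx.x
// blockIdx, blockDim, and threadIdx are plain integers standing in
// for the CUDA built-ins.
int global_thread_id(int blockIdx, int blockDim, int threadIdx) {
    return blockIdx * blockDim + threadIdx;
}
```

For example, with 256 threads per block, thread 3 of block 2 handles sample index 2 * 256 + 3 = 515.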
To reduce computing times, the autocorrelation coefficients are next computed as shown in the following equation.
An excerpt of the first parallel code version performing this processing, using CUDA, is given in Figure 3.
Kernel of autocorrelation function on CUDA.
A function declared with the __global__ qualifier is callable from the host and executed only on the device. To call a kernel function, we use the <<<dimGrid, dimBlock>>>(parameter list…) syntax, where dimGrid and dimBlock give the number of blocks per grid and the number of threads per block, respectively.
The L_WINDOW samples are acquired by the host, copied into GPU memory, and then processed by the device. Although the serial and parallel versions of this code look similar, the parallel program requires fine tuning. In fact, performance of both the CPU and GPU codes can be enhanced by introducing optimization strategies.11,12
These strategies include loop unrolling, vector reordering, register blocking, memory optimization, and instruction optimization.
As mentioned above, the ultimate CUDA achievement lies in the many code optimizations and in compacting the computation into a single kernel, thereby eliminating as much speed-reducing overhead as possible. As a first step, we divide the outer loop among the available threads. 13
In fact, to eliminate the outer loop, we replace the first "for" loop with an "if" statement, as shown in Figure 4.
Instruction optimization.
This implementation entirely omits the outer loop and instead uses the thread index; we do the same for the second outer loop. As a second step, we manually unrolled the inner loop. This technique attempts to improve execution speed by rewriting the body of a suitable loop as a repeated sequence of similar independent statements.
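The two transformations can be illustrated with a sequential emulation: each iteration of the former outer loop over the lag k becomes one "thread", guarded by an if on its index, and the inner loop is unrolled by a factor of four. This is a sketch of the idea, not the actual device code; L_WINDOW and ORDER are illustrative values:

```cpp
#include <vector>

// Sequential emulation of the optimized kernel: the outer loop over the
// lag k is replaced by a per-"thread" guard on tid, and the inner loop
// is manually unrolled by 4. L_WINDOW and ORDER are illustrative.
constexpr int L_WINDOW = 240;
constexpr int ORDER = 10;

void autocorr_thread(int tid, const std::vector<double>& sw,
                     std::vector<double>& r) {
    if (tid <= ORDER) {             // guard replacing "for (k = 0; ...)"
        double acc = 0.0;
        int n = tid;
        // inner loop unrolled by a factor of 4
        for (; n + 3 < L_WINDOW; n += 4) {
            acc += sw[n]     * sw[n - tid];
            acc += sw[n + 1] * sw[n + 1 - tid];
            acc += sw[n + 2] * sw[n + 2 - tid];
            acc += sw[n + 3] * sw[n + 3 - tid];
        }
        for (; n < L_WINDOW; ++n)   // remainder iterations
            acc += sw[n] * sw[n - tid];
        r[tid] = acc;
    }
}
```

On the GPU, one such call runs per thread with tid derived from the block and thread indices; threads whose tid exceeds ORDER simply do nothing, which is why the guard is needed.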
Being one of the bottlenecks of most parallel GPU-based algorithms, memory should be carefully addressed and well chosen to optimize transfer time.14,15
Hence, the parallel windowing/autocorrelation kernel outlined above is re-implemented by exploiting the per-block shared memory, a very low-latency on-chip memory exposed by CUDA as a software-managed cache.
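The shared-memory optimization amounts to staging the windowed samples once into fast on-chip storage, after which every lag computation reads from that copy instead of global memory. Sequentially this can be sketched with a local tile standing in for __shared__ memory (an emulation, not the real device code; the size is illustrative):

```cpp
#include <cstring>
#include <vector>

// Emulation of the shared-memory staging: all "threads" of a block first
// cooperatively copy the frame into a fast tile (standing in for
// __shared__ memory), then every lag computation reads from the tile
// instead of global memory. The size is illustrative.
constexpr int L_WINDOW = 240;

void stage_tile(const std::vector<double>& sw, double* tile) {
    // In CUDA, each thread would copy a strided subset and then call
    // __syncthreads(); sequentially this reduces to a single copy.
    std::memcpy(tile, sw.data(), L_WINDOW * sizeof(double));
}

double autocorr_from_tile(const double* tile, int k) {
    double acc = 0.0;
    for (int n = k; n < L_WINDOW; ++n)
        acc += tile[n] * tile[n - k];
    return acc;
}
```

Because every lag re-reads the same L_WINDOW samples, staging them once amortizes the expensive global-memory accesses across all lags.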
Profiling results with GPU implementation
GPU: graphic processing unit.
The speed-up shows that the larger the amount of data to be analyzed, the higher the performance gain. This is because, for smaller window sizes, the data are insufficient to fully exploit the computing units of the graphics card. The results given in Table 3 show that the gain in execution time grows as the number of samples increases; the ratio reaches up to 4.
Results interpretation
For the comparison between the three platforms, we focused on the windowing/autocorrelation block. It was chosen because it is the bottleneck of the voice compression technique and because it was easily portable between all three devices. Figure 5 shows the timings plotted on a logarithmic scale.
The execution time of windowing/autocorrelation on different platforms: (a) without transfer time; (b) with transfer time. This time is plotted in logarithmic scale.
From the timings given in Figure 5(a), it is clear that the FPGA performs the windowing/autocorrelation fastest: it computes the windowing/autocorrelation up to five times faster than the optimized GPU implementation. What is missing, however, is the transfer time to send and receive the data. For the FPGA, this time can reach up to 100 µs, against 4 µs for the GPU. The underlying reason is that the FPGA is connected to the PC through an Adept USB port, where the data exchange rate can reach up to 38 MB/s, while direct read and write accesses to the GPU device memory go through PCI-Express, which has a theoretical peak throughput of 16 GB/s.
As shown in Figure 5(b), with this time added in, the FPGA becomes slower than the GPU, which is now the fastest: 4 times faster than the FPGA and 48 times faster than the CPU.
Conclusion
We have compared the performance of a GPU with an FPGA and a CPU using a well-known audio processing algorithm. We note that the GPU has the potential to achieve impressive speedups of up to 4× compared to the FPGA and around 48× compared to a sequential execution. To attain these gains, careful consideration must be given to optimizing the computation for the GPU architecture.
In fact, instruction optimization and efficient use of the on-chip shared memory lead to highly parallel, cost-effective computations.
These findings are helpful for comparing the three systems discussed, as well as for comparison with other devices in future research, especially as recent work has enabled designers to place both CPUs and GPUs on the same die, thus reducing data transfer times and promising a substantial performance increase.
