Abstract
Keywords
Introduction
With the rapid development of multimedia technologies and network communication, recent applications must meet stringent constraints on time to market as well as on speed, area, and power consumption. Such demanding applications are well served by field-programmable gate arrays (FPGAs) or graphics processing units (GPUs). In an FPGA, many computation blocks such as logic gates are wired together in different configurations.
At the same time, the general-purpose nature of these basic blocks makes them slower than special-purpose blocks such as the arithmetic and logic units (ALUs) of a GPU.
In fact, GPUs supply hundreds of ALUs coupled with small memory blocks (16 kB). Although initially designed and optimized for real-time rendering, their massively parallel architecture has allowed them to be used in other areas such as data analysis, video and audio processing, and engineering applications.
This is what makes FPGAs and GPUs primary choices for speeding up challenging applications such as voice algorithms with high computational requirements. The challenge is to take advantage of these breakthroughs, although exploiting parallelism on both systems still requires considerable effort.
This paper addresses the utility of GPUs for voice processing. To that end, we implemented the linear prediction coding (LPC) algorithm on a GPU and compared its performance against a state-of-the-art FPGA. The windowing and autocorrelation computation is used as a case study throughout the paper and is implemented on both types of accelerators.
The paper is organized in five sections: the next section details the LPC computation, an algorithm widely used in voice decoders. We then review related work and present the FPGA implementation results. Later sections describe the GPU implementation and the optimizations applied at various levels, and provide an experimental comparison of the proposed parallel implementation. The last section concludes the paper.
LPC computation
In this section, we study a speech compression technique used in many voice coders, particularly in the ITU-T G.729 codec, known as LPC.
LPC analysis is based on the assumption that a predictor model, which looks only at past values of the output, can characterize the speech signal; hence, it is an all-pole model in the Z-transform domain. 1
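For reference, the all-pole synthesis filter commonly assumed in LPC analysis has the following generic form (a sketch in our own notation; the order p and the coefficient symbols a_i are not taken from the standard):

```latex
H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{i=1}^{p} a_i z^{-i}}
```

The analysis estimates the coefficients a_i that minimize the prediction error over each frame.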
A further simplification of the analysis is introduced by the assumption that the vocal tract can be excited by either random noise or an impulse train (Figure 1). The coder acts on blocks of 10 ms, corresponding to 80 samples.
LPC model.
The speech signal is analyzed to evaluate the code-excited linear-prediction (CELP) parameters 2 for every 10 ms frame.
The operations involved in the LPC analysis are summarized in five subtasks. To identify the most critical ones, we profiled the tasks activated during algorithm execution using the Visual C++ profiler on an Intel® Core™ i7-3770 CPU (3.4 GHz, 8 GB main memory).
The complexity of the different blocks as a percentage of the global execution time.
LPC: linear prediction coding.
One can easily notice that the LSP quantization block and the windowing/autocorrelation block are the most expensive. Hence, we can envisage migrating both to the GPU, but we started with the windowing and autocorrelation sub-blocks because the quantization technique used is not very common and is specific to the G.729 standard.
The windowed speech and the autocorrelation computation are given in equations (1) and (2), respectively.
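The generic forms of these two computations can be sketched as follows (the exact window w(n), the frame length L, and the order p are standard-specific; the symbols below are our own notation):

```latex
s_w(n) = w(n)\, s(n), \qquad n = 0, \dots, L-1 \qquad (1)
```

```latex
r(k) = \sum_{n=k}^{L-1} s_w(n)\, s_w(n-k), \qquad k = 0, \dots, p \qquad (2)
```

Each lag k of the autocorrelation is independent of the others, which is what makes this stage attractive for parallelization.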
The most time-consuming stage of the LPC algorithm is the windowing and autocorrelation stage; fortunately, this stage is well suited to a parallel implementation. That is why we propose parallel processing to accelerate it by fully exploiting the GPU. In fact, the peak floating-point operations per second of the best GPUs exceed those of FPGAs with the maximum DSP capabilities.
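As a baseline for the discussion that follows, a minimal sequential sketch of the windowing/autocorrelation stage might look as follows (L_WINDOW, ORDER, and the Hamming window are illustrative assumptions, not the exact G.729 parameters):

```cpp
#include <cmath>
#include <vector>

// Sequential reference sketch of the windowing + autocorrelation stage.
// L_WINDOW, ORDER, and the Hamming window are illustrative assumptions,
// not the exact G.729 parameters.
constexpr int L_WINDOW = 240;   // assumed analysis window length
constexpr int ORDER = 10;       // LPC order
constexpr double PI = 3.14159265358979323846;

// Apply the analysis window to the input frame (cf. equation (1)).
std::vector<double> window_frame(const std::vector<double>& s) {
    std::vector<double> sw(L_WINDOW);
    for (int n = 0; n < L_WINDOW; ++n) {
        double w = 0.54 - 0.46 * std::cos(2.0 * PI * n / (L_WINDOW - 1));
        sw[n] = w * s[n];
    }
    return sw;
}

// Compute autocorrelation coefficients r[0..ORDER] (cf. equation (2)).
std::vector<double> autocorrelate(const std::vector<double>& sw) {
    std::vector<double> r(ORDER + 1, 0.0);
    for (int k = 0; k <= ORDER; ++k)
        for (int n = k; n < L_WINDOW; ++n)
            r[k] += sw[n] * sw[n - k];
    return r;
}
```

The nested loops over k and n are the computation that the GPU implementation later distributes over threads.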
Related work and FPGA's implementation results
Several publications deal with the optimization of speech encoding algorithms; however, these efforts are targeted towards DSP or FPGA platforms. 3 Audio-related GPGPU applications, on the other hand, are rare. In Tsingos, 4 for example, the audio signal is not processed directly; rather, the GPU is employed for audio rendering based on room acoustics models. In Brent and Bill, 5 the authors present a real-time GPU implementation of the convolution operation for large input vectors, programmed via a graphics API. Our research presented in Atri et al. 6 gives a significant speed-up, obtained through an FPGA implementation of the windowing and autocorrelation blocks on a Spartan-3 Starter Kit. We have drawn on this previous study to reimplement these two blocks on a Spartan-6 FPGA, whose performance is estimated at about 200 GFLOPS.
Profiling results with hardware implementation vs. software implementation.
CPU: central processing unit; FPGA: field-programmable gate array.
Clearly, the pure hardware implementation gives a significant speedup over the software implementation. This outcome is valid only if the data transfer time is less than the software execution time; hence, we estimated the different transfer times. The transmission and reception times were measured for different numbers of samples. The reception time is relatively small since at most about 10 coefficients are transmitted back. The total execution time on the FPGA including transfers is obtained by summing the emission, reception, and FPGA computing times, and is given in Table 2.
Table 2 shows that the computation time on the FPGA is negligible compared to the transfer times. We note that, even with the transfer time included, the hardware solution outperforms the software one. These results will be compared with those obtained by a parallel GPU implementation in CUDA in order to assess the capabilities of this new generation of processors.
Parallel implementation on GPU
Hundreds of applications increasingly rely on parallelism to achieve higher performance.7,8 Consequently, recent years have seen the emergence of GPGPU standards as sources of massive computing power. This trend toward parallel computing motivated NVIDIA to promote the CUDA programming model, which offers a simple framework 9 for running parallel programs.
To exploit the full potential of GPUs, a radical change must be made to the software infrastructure. Once the critical portion has been identified, a parallel implementation written in CUDA C/C++ can be launched as a CUDA kernel onto the GPU. Execution is performed by "threads" arranged in thread blocks. The threads within a block can collaborate and synchronize their execution to coordinate their memory accesses, as shown in Figure 2.
Hardware structure and memory hierarchy of CUDA.
Hardware selection
Our experiments were carried out on an NVIDIA GeForce GTX 480, which has 480 CUDA cores clocked at 1401 MHz, 48 kB of per-block shared memory per SM, and 1536 MB of device memory. We also used an Intel® Core™ i7-3770 CPU with a base clock of 3.4 GHz and 8 GB of main memory. We developed our GPU code using NVIDIA driver version 378.66 and CUDA 7.5.
Parallel execution and optimization strategies
We write a sequential program that invokes parallel kernels. When calling a kernel, we specify the characteristics of the grid, which can be divided into one or more blocks, each of which runs multiple threads. 10
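The mapping from this grid/block hierarchy to a flat sample index is the usual one; a small sketch of the index arithmetic (CUDA's built-ins blockIdx.x, blockDim.x, and threadIdx.x are emulated here as plain integer parameters) is:

```cpp
// Emulation of CUDA's global thread index computation:
//   tid = blockIdx.x * blockDim.x + threadIdx.x
// blockIdx, blockDim, and threadIdx are plain integers standing in
// for the CUDA built-ins.
int global_thread_id(int blockIdx, int blockDim, int threadIdx) {
    return blockIdx * blockDim + threadIdx;
}
```

For example, with 256 threads per block, thread 3 of block 2 handles sample index 2 * 256 + 3 = 515.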
To reduce computing times, the autocorrelation coefficients are next computed as shown in the following equation.
An excerpt of the first parallel code version performing this processing, using CUDA, is given in Figure 3.
Kernel of autocorrelation function on CUDA.
A function declared with the __global__ qualifier is callable from the host and executed only on the device. To call a kernel function, we use the <<<dimGrid, dimBlock>>>(parameter list…) syntax, where dimGrid and dimBlock give the number of blocks per grid and the number of threads per block, respectively.
The L_WINDOW samples are acquired by the host, copied into GPU memory, and then processed by the device. Although the serial and parallel versions of this code look similar, the parallel program requires fine tuning. In fact, performance of both the CPU and GPU codes can be enhanced by introducing optimization strategies.11,12
These strategies include loop unrolling, vector reordering, register blocking, memory optimization, and instruction optimization.
As mentioned above, the ultimate CUDA achievement lies in the many code optimizations and in compacting the computation into a single kernel, thereby eliminating as much speed-reducing overhead as possible. As a first step, we divide the outer loop among the available threads. 13
In fact, to eliminate the outer loop, we replace the first "for" loop with an "if" statement, as shown in Figure 4.
Instruction optimization.
This implementation entirely omits the outer loop and instead uses the thread index; we do the same for the second outer loop. As a second step, we manually unrolled the inner loop. This technique attempts to improve execution speed by rewriting the body of a suitable loop as a repeated sequence of similar independent statements.
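The two transformations can be illustrated with a sequential emulation: each iteration of the former outer loop over the lag k becomes one "thread", guarded by an if on its index, and the inner loop is unrolled by a factor of four. This is a sketch of the idea, not the actual device code; L_WINDOW and ORDER are illustrative values:

```cpp
#include <vector>

// Sequential emulation of the optimized kernel: the outer loop over the
// lag k is replaced by a per-"thread" guard on tid, and the inner loop
// is manually unrolled by 4. L_WINDOW and ORDER are illustrative.
constexpr int L_WINDOW = 240;
constexpr int ORDER = 10;

void autocorr_thread(int tid, const std::vector<double>& sw,
                     std::vector<double>& r) {
    if (tid <= ORDER) {             // guard replacing "for (k = 0; ...)"
        double acc = 0.0;
        int n = tid;
        // inner loop unrolled by a factor of 4
        for (; n + 3 < L_WINDOW; n += 4) {
            acc += sw[n]     * sw[n - tid];
            acc += sw[n + 1] * sw[n + 1 - tid];
            acc += sw[n + 2] * sw[n + 2 - tid];
            acc += sw[n + 3] * sw[n + 3 - tid];
        }
        for (; n < L_WINDOW; ++n)   // remainder iterations
            acc += sw[n] * sw[n - tid];
        r[tid] = acc;
    }
}
```

On the GPU, one such call runs per thread with tid derived from the block and thread indices; threads whose tid exceeds ORDER simply do nothing, which is why the guard is needed.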
Being one of the bottlenecks of most parallel GPU-based algorithms, memory should be carefully addressed and well chosen to optimize transfer time.14,15
Hence, the parallel windowing/autocorrelation kernel outlined above is re-implemented by exploiting the per-block shared memory, a very low-latency on-chip memory exposed by CUDA as a software-managed cache.
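The shared-memory optimization amounts to staging the windowed samples once into fast on-chip storage, after which every lag computation reads from that copy instead of global memory. Sequentially this can be sketched with a local tile standing in for __shared__ memory (an emulation, not the real device code; the size is illustrative):

```cpp
#include <cstring>
#include <vector>

// Emulation of the shared-memory staging: all "threads" of a block first
// cooperatively copy the frame into a fast tile (standing in for
// __shared__ memory), then every lag computation reads from the tile
// instead of global memory. The size is illustrative.
constexpr int L_WINDOW = 240;

void stage_tile(const std::vector<double>& sw, double* tile) {
    // In CUDA, each thread would copy a strided subset and then call
    // __syncthreads(); sequentially this reduces to a single copy.
    std::memcpy(tile, sw.data(), L_WINDOW * sizeof(double));
}

double autocorr_from_tile(const double* tile, int k) {
    double acc = 0.0;
    for (int n = k; n < L_WINDOW; ++n)
        acc += tile[n] * tile[n - k];
    return acc;
}
```

Because every lag re-reads the same L_WINDOW samples, staging them once amortizes the expensive global-memory accesses across all lags.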
Profiling results with GPU implementation
GPU: graphic processing unit.
The speed-up shows that the larger the amount of data to be analyzed, the higher the performance gain. This is because, for smaller window sizes, the data are insufficient to fully exploit the computing units of the graphics card. The results given in Table 3 show that the gain in execution time grows as the number of samples increases; the ratio reaches up to 4.
Results interpretation
For the comparison between the three platforms, we focused on the windowing/autocorrelation block. It was chosen because it is the bottleneck of the voice compression technique and because it was easily portable between all three devices. Figure 5 shows the timings plotted on a logarithmic scale.
The execution time of windowing/autocorrelation on different platforms: (a) without transfer time; (b) with transfer time. This time is plotted in logarithmic scale.
From the timings given in Figure 5(a), it is clear that the FPGA performs the windowing/autocorrelation fastest: it computes the windowing/autocorrelation up to five times faster than the optimized GPU implementation. What is missing, however, is the transfer time to send and receive the data. For the FPGA, this time can reach up to 100 µs, against 4 µs for the GPU. The underlying reason is that the FPGA is connected to the PC through an Adept USB port, where the data exchange rate can reach up to 38 MB/s, while direct read and write accesses to the GPU device memory go through PCI-Express, which has a theoretical peak throughput of 16 GB/s.
As shown in Figure 5(b), with this time added in, the FPGA becomes slower than the GPU, which is now the fastest: 4 times faster than the FPGA and 48 times faster than the CPU.
Conclusion
We have compared the performance of a GPU with an FPGA and a CPU using a well-known audio processing algorithm. We note that the GPU has the potential to achieve impressive speedups of up to 4× compared to the FPGA and around 48× compared to a sequential execution. To attain these gains, careful consideration must be given to optimizing the computation for the GPU architecture.
In fact, instruction optimization and efficient use of the on-chip shared memory lead to highly parallel, cost-effective computations.
These findings are helpful for comparing the three systems discussed, as well as for comparison with other devices in future research, especially as recent work has enabled designers to place both CPUs and GPUs on the same die, thus reducing data transfer times and promising a substantial performance increase.
