Abstract
1. Introduction
Computation of regional descriptors based on gray value, color, texture, and geometrical features for segmented image components is a basic step in many machine vision systems. The need to detect image objects and calculate their features arises in a large number of applications, for example, medical science [1], optical navigation [2], thin film inspection [3], and smart cameras [4], as well as industrial process monitoring [5]. Hardware architecture design for machine vision systems is thus of great importance and is an active research topic. The architecture developed in this work ensures a high frame speed, low latency, modularity, and low power consumption. The developed architecture is suitable for smart camera applications, where only refined and processed information is sent over a low bandwidth communication channel rather than whole video frames. Such a smart camera can form part of a visual sensor network, sending processed results to any other node or to the base station.
Machine vision algorithms are often divided into the steps shown in Figure 1 [6]. Video is acquired from an image sensor in the image acquisition step. Image objects are extracted from the preprocessed video data in the segmentation step.

Fundamental steps of a machine vision system.
In this work, the main focus is on image component labeling and feature extraction. Component labeling assigns a unique label to each connected component in an image. The classical 2-pass labeling of image components is described in [7] and is shown in Figure 2. The pixel neighborhood is analyzed for 8-connectivity. A neighboring pixel either gives its label to the currently processed pixel or, alternatively, labels are recorded as equivalent, which is referred to as label merging. In Figure 2, the second pixel in the third row has two labeled neighbors, 2 and 3. The smaller label, 2, is assigned to the current pixel, and labels 2 and 3 are recorded as equivalent. If there is no labeled pixel in the neighborhood, a new label is assigned to the currently processed pixel [8]. In a typical image, many equivalent label pairs are merged, and these pairs form linked lists in the equivalence table. Resolving equivalent labels means that all the merged labels linked into the same list must be replaced by one unique label [9]. The second scan then assigns the unique label to each pixel in the image component. In our implementation, only equivalent labels are stored and resolved but the resolved labels are not written back to the image; this is further referred to as single-pass labeling in this paper. The equivalence table can be implemented in a memory where all cells are initialized to their own address [10]. The label merging is illustrated by means of a graph in Figure 3, which shows the merging of four label pairs from the example in Figure 2. After the decision is made that labels 2 and 3 are equivalent, the content of cell 3 is replaced by the content of cell 2, so that label 3 points to label 2. At the end of each frame these equivalences must be resolved to the root of the linked list. Five labels in a linked list can be seen after the labeling of the pair (4, 5). The final graph shows how this linked list has been resolved so that all labels of the same image component point to a unique value. A software sketch of this single-pass scheme is given after Figure 3.

Classical 2-scan 8-connectivity labeling algorithm.

Label resolution graph.
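The labeling and merging procedure described above can be summarized in software. The following is a minimal Python sketch for illustration only; the presented design is an RTL implementation, and all names here are our own. Unlike the hardware, which performs a single write per merged pair (Section 4), this model simply chases linked-list pointers when merging.

```python
def label_components(image):
    """image: 2D list of 0/1 pixels. Returns (provisional label map,
    resolved equivalence table); resolved labels are not written back."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    table = [0]          # cell 0 is the unused background entry
    next_label = 1

    def root(x):
        while table[x] != x:   # follow the linked list to its root
            x = table[x]
        return x

    for r in range(rows):
        for c in range(cols):
            if not image[r][c]:
                continue
            # Already-labeled 8-neighbors: west, NW, N, NE.
            neigh = []
            if c > 0 and labels[r][c - 1]:
                neigh.append(labels[r][c - 1])
            if r > 0:
                for cc in range(max(c - 1, 0), min(c + 2, cols)):
                    if labels[r - 1][cc]:
                        neigh.append(labels[r - 1][cc])
            if not neigh:
                table.append(next_label)     # new cell points to itself
                labels[r][c] = next_label
                next_label += 1
            else:
                smallest = min(neigh)
                labels[r][c] = smallest      # take the smaller label
                for n in set(neigh):         # record equivalences
                    ra, rb = root(smallest), root(n)
                    if ra != rb:
                        table[max(ra, rb)] = min(ra, rb)

    # End-of-frame resolution: links always point to smaller labels, so
    # one ascending pass collapses every linked list to its root.
    for i in range(1, next_label):
        table[i] = table[table[i]]
    return labels, table
```

After the final pass, all provisional labels of one image component map to a unique root, as in the final graph of Figure 3.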
The key design metrics that were considered while developing the architecture for component labeling and feature extraction for real time machine vision systems are as follows:
high frame speed and low latency,
modularization, that is, a clear separation between component labeling and the extraction of different image component feature descriptors,
parameterizable (frame size, label code word length, etc.),
low power consumption.
The developed architecture is suitable for applications that require a high frame speed, low power consumption, and programmability. Battery operated smart camera nodes for surveillance [11, 12] and camera based navigation for blind persons [13] require a low power architecture as well as programmability in order to incorporate changes depending on the external environment. Camera based industrial process monitoring requires a high frame speed [5]. Our developed architecture addresses the issue of modularity, and there is a clear separation between the image component labeling and the feature extraction. This work presents three basic image component feature descriptors, namely, Centre Of Gravity (COG), area, and bounding box. Future work may extend this architecture with more complex descriptors, for example, geometrical shape, color, and texture. We believe that this work is a valuable scientific contribution in those areas for which the power consumption or speed of a machine vision system is of the highest importance.
This paper is organized as follows. Section 2 describes the related work. Section 3 describes the hardware architecture for image component labeling and feature extraction. Section 4 discusses the memory accesses and storage requirements for the different features. Section 5 presents the experiments and results, including the memory implementation, area and simulated power consumption, functional verification, performance, and a quantitative comparison with other architectures. Section 6 is a discussion of the results, and Section 7 concludes the paper.
2. Related Work
This paper presents a new hardware architecture for connected component analysis, and its relevance is highlighted by referring to applications and state-of-the-art research within the computer vision community. With regard to the recognition of image objects, labeled image components (blobs) are described based on features such as texture, color, or geometrical shape. This paper includes hardware models for three basic feature descriptors, namely, COG, area, and bounding box. More complex descriptors are often derived as histograms accumulated over the described image regions [14]. The architecture presented in this paper can be extended by means of additional hardware modules to support the accumulation of histograms. An Edge Histogram Descriptor (EHD) can be accumulated to describe the shape of the regions [15]. Local Directional Pattern (LDP) and Local Binary Pattern (LBP) are coded descriptors of gradient and gray-value differences used to describe the texture of image regions [16, 17]. LBP was used by Mohammed and Yampolskiy for face recognition [17]. Bounding box, COG, and area of blobs were used by Yoon et al. for the recognition of a car's license plate and its characters [18]. A Gabor filter bank was used to segment fabric defects on textile as blobs [19], a good example of a real-time application that could benefit from additional analysis of the geometrical shape, position, and size of defect regions. Chan and Pun present work on character recognition based on connected component analysis [20]. We have conducted successful experiments using connected component analysis for the reading of speed limit signs [21].
Connected component analysis is very much dependent on an efficient method for segmenting images into regions. Too many spurious regions will increase the workload on the processors and will also increase the requirements with regards to memory storage. Luis-García et al. defined a success score for the evaluation of efficiency of segmentation [22]. Cao et al. provide an example regarding how a color gradient tuned to a specific pair of colors can be exploited for the efficient segmentation of road signs [23]. We have published similar work for the segmentation of colored labels [24].
For real-time image component labeling and feature extraction, several methods and proposed architectures were studied. The first connected component labeling method was proposed by Rosenfeld and Pfaltz [25]. The labeling was performed in raster scan order, which is the same as the output order of a camera. However, this method requires a large global table for equivalent label pairs.
A parallel hardware architecture was proposed in [26], utilizing two processing units on an FPGA; in this case, 2-pass labeling was used and each processing unit performs image component labeling on its own part of the image.
Aggressive approaches were presented in [9, 27, 28], which use only one labeling pass and in which it is proposed that the image features can be calculated on the fly. The main emphasis of that work was to minimize memory usage, and the authors additionally report the worst case scenario for the number of labels. A row buffer is used to store the labels assigned to the previous row, and a merger table is used to retain a record of the equivalences. The resolving of the merged labels is conducted during the horizontal blanking period. The maximum clock frequency achieved was 40.63 MHz for the image size reported in their work.
There are still differences that are worth mentioning. Firstly, we resolve merged labels after scanning each video frame, and no additional clock cycles are required for row or frame synchronization, thus increasing both the throughput and the frame speed. It should be kept in mind that the row and frame synchronization overhead is a mechanism inherited from the blanking of the old Cathode Ray Tubes (CRT); this unwanted overhead only reduces the throughput of a digital video system. Secondly, we propose to compute feature descriptors in well-separated hardware modules. The labeling process continues on incoming pixels, while the feature extraction continues uninterrupted in separate hardware modules. This approach allows for the easy inclusion of combinations of descriptors depending on the application. The massive memory bandwidth on the FPGA is efficiently exploited when each descriptor is accumulated in a separate set of block RAM memories. Thirdly, the architecture was developed at the RT level rather than using a higher-level abstraction model, for example, Handel-C.
Real time component labeling with boundary tracking was described in [10]. Two-pass labeling is used such that a locally labeled image must be stored in the FIFO memory, thus increasing the latency, power, and hardware utilization.
An FPGA implementation of component labeling is presented in [29], which uses Handel-C as the development language and off-chip memories to store the images; this is not a power-efficient technique. Another image component labeling implementation in Handel-C is described in [30]. Handel-C can support the parallel execution of a program, but a more optimized hardware design can be achieved by using an HDL. A Handel-C implementation of an encryption protocol is compared with a VHDL implementation by Mylonas et al. [31], where the authors report that the VHDL based design requires fewer resources and can be operated at a higher clock speed than the Handel-C design.
Component labeling using contour tracking was implemented in [32]. The proposed architecture is not suitable for real-time processing, and the required extra memories increase the power consumption.
Image component labeling on a one-dimensional DSP array was described in [33]. In this case, the majority of the execution time is spent on interprocessor communication, and the latency of the implementation is quite high.
None of the research we discovered on connected component analysis on FPGAs focuses on a comprehensive investigation of both power consumption and speed.
3. Hardware Architecture
This section describes the hardware architecture for image component labeling and feature extraction, proposed as suitable for implementation on an FPGA. Image component labeling assigns unique labels to the different image components in an image, as described in Section 1. Pixels are assigned labels in raster scan order as they arrive from the sensor; the pixel neighborhood, the bounding box parameters, and the labeling kernel with its delay line are illustrated in Figure 4.

(a) Neighborhood of pixels, (b) bounding Box parameters, and (c) labeling kernel and delay line.
The architecture has no dependency on row synchronization for either assigning labels or label merging. Data required for the extraction of features accumulates in parallel with the labeling process, and more feature extraction modules can be added without modifying the labeling unit.
In this paper the computation of three basic image component features, namely, COG, area, and bounding box, has been implemented. The Centre of Gravity of an image component $k$ is given by (1):

$$\mathrm{COG}_k = (\bar{x}_k, \bar{y}_k) = \left( \frac{\sum_{i \in k} x_i}{A_k}, \frac{\sum_{i \in k} y_i}{A_k} \right). \quad (1)$$

In (1), $x_i$ and $y_i$ denote the column and row coordinates of pixel $i$ belonging to component $k$, so the numerators are the accumulated coordinate sums and the denominator $A_k$ is the pixel count; these are accumulated separately during the scan, and the division is postponed until the labels have been resolved.
The area of an image component equals the sum of the pixels it contains; the area of the image component $k$ is thus $A_k = \sum_{i \in k} 1$, that is, its pixel count.
The bounding box is the minimum rectangle enclosing an object. It is parameterized by its minimum and maximum coordinates in the row and column directions, that is, the coordinates of the upper left corner (ULC) and the lower right corner (LRC). An example of a bounding box is shown in Figure 4(b), where the ULC is given by the minimum column and row coordinates of the component and the LRC by the maximum ones.
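As a hedged illustration of how these three descriptors can be accumulated on the fly, consider the following Python sketch; the names are ours, and in the hardware these fields are held in separate block RAM tables indexed by the label code.

```python
def new_entry(width, height):
    # Initial values chosen so that the first pixel updates every field.
    return {"sum_x": 0, "sum_y": 0, "area": 0,
            "x_min": width, "y_min": height, "x_max": -1, "y_max": -1}

def accumulate(entry, x, y):
    entry["sum_x"] += x                      # COG numerator, column direction
    entry["sum_y"] += y                      # COG numerator, row direction
    entry["area"] += 1                       # area = COG denominator
    entry["x_min"] = min(entry["x_min"], x)  # bounding box ULC
    entry["y_min"] = min(entry["y_min"], y)
    entry["x_max"] = max(entry["x_max"], x)  # bounding box LRC
    entry["y_max"] = max(entry["y_max"], y)

def centre_of_gravity(entry):
    # Post-processing after label resolution, as in (1).
    return (entry["sum_x"] / entry["area"], entry["sum_y"] / entry["area"])
```

Because every update is a simple add or min/max on the table entry of the current label, all three descriptors can be maintained in parallel hardware modules at the pixel rate.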
3.1. Image Component Feature Calculation
The architecture for image component labeling and feature calculation is described in Figure 6. The computation of the image component features is controlled by a sequencer, whose flow graph is shown in Figure 5.

Sequencer flow graph.

Architecture of image feature calculation.

Image feature computation module running in parallel.
4. Memory Access for Equivalence Table
This section describes how memory accesses are managed for the equivalence table during the merging of labels. Label merging is performed in such a manner that linked lists of equivalent labels are mapped onto a linear memory array as described in Figure 3. The equivalence table is fed with pairs of equivalent labels, here denoted $a$ and $b$, and is updated as follows.
Initialize all memory cells to their own address at the start of every frame.
If labels $a$ and $b$ are marked as equivalent, read the contents $M[a]$ and $M[b]$ of the two addressed cells in parallel.
If $M[a] < M[b]$, write $M[a]$ to address $b$; otherwise, write $M[b]$ to address $a$, so that the cell holding the larger content is redirected to the smaller label.
This algorithm requires three memory accesses for every pair of merged labels. In addition, the merging of the two labels is required to be performed in one single clock cycle in order to keep the system clock frequency low, which is essential since the clock nets dominate the power consumption [35]. The conclusion to be drawn is that it is necessary to pipeline the merging algorithm and to introduce a triple-port memory. At the first clock cycle, the contents of the memory cells addressed by the two labels are read through the two read ports; at the second clock cycle, the contents are compared and the smaller content is written back through the write port, so that a new label pair can be accepted every clock cycle.
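The pipelined merge can be modeled behaviorally as follows. This Python sketch is illustrative only: the two reads that share a clock cycle in hardware appear as sequential statements, and corner cases of long merge chains, which the full design handles through the frame-end resolution, are simplified here.

```python
class EquivalenceTable:
    """Behavioral model of the merge step: two parallel reads followed by
    one conditional write, i.e., three memory accesses per merged pair."""

    def __init__(self, depth):
        # Every cell is initialized to its own address at frame start.
        self.mem = list(range(depth))

    def merge(self, a, b):
        ca = self.mem[a]       # read port 1 (pipeline stage 1)
        cb = self.mem[b]       # read port 2 (pipeline stage 1)
        if ca < cb:            # compare and single write (stage 2)
            self.mem[b] = ca
        elif cb < ca:
            self.mem[a] = cb

    def resolve(self):
        # Frame-end resolution: every link points to a smaller label, so
        # one ascending pass collapses each linked list to its root.
        for i in range(1, len(self.mem)):
            self.mem[i] = self.mem[self.mem[i]]
```

Since a write always places the smaller content into the cell whose content was larger, every link points to a smaller label, which is what makes the single ascending resolution pass at the end of the frame sufficient.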
4.1. Memory Storage Requirements
In this section the memory requirements for the different modules used in our design are discussed. According to Section 3 and (1), the numerators N and denominators D are accumulated in the memory storage. The worst case scenario for the COG calculation occurs when a single large image component spans the entire frame, since this maximizes both the accumulated coordinate sums and the pixel count.

(a) Memory requirement for image component feature storage.
where $W$ and $H$ denote the image width and height. The memory requirement for the denominator in the COG unit is given in (8); since the denominator is the pixel count of a component, it is bounded by the frame size, so that $\lceil \log_2(WH + 1) \rceil$ bits per table entry are sufficient. The numerators are bounded by the coordinate sums of a frame-covering component and require correspondingly wider words, and each bounding box parameter requires $\lceil \log_2 \max(W, H) \rceil$ bits. The concrete word lengths used in our implementation are listed in Section 5.
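These bounds can be made concrete with a small helper. This is an illustrative Python sketch under our own naming; the 640 x 480 frame size is an assumption on our part, consistent with the word lengths reported in Section 5, where the 19-bit denominator and 10-bit bounding box widths match these bounds while the implemented 34-bit numerator includes margin beyond the lower bound computed here.

```python
from math import ceil, log2

def word_lengths(width, height):
    # Worst case: a single component covers the entire W x H frame.
    area_max = width * height                     # COG denominator / area
    denom_bits = ceil(log2(area_max + 1))
    # COG numerator: largest coordinate sum of a frame-covering component.
    num_max = max(height * width * (width - 1) // 2,
                  width * height * (height - 1) // 2)
    num_bits = ceil(log2(num_max + 1))
    bbox_bits = ceil(log2(max(width, height)))    # one bbox coordinate
    return denom_bits, num_bits, bbox_bits

print(word_lengths(640, 480))   # (19, 27, 10) for a 640 x 480 frame
```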
A good segmentation will result in rather few objects, which is necessary in a real-time system in order to reduce the computational burden. A table depth of 1024 labels is sufficient for the applications presented above. The depth can be further customized based on the nature of the specific application and of the objects of interest.
5. Experiments and Results
Experiments and results on the implementation of triple-port memories, FPGA area utilization, power consumption, maximum system throughput, latency, and functional verification are reported in this section. Power consumption is both measured on a hardware prototype and compared with simulations. The hardware prototype is shown in Figure 13. This prototype is a Digilent Atlys Spartan-6 board [38], connected to a CMOS imaging sensor from Aptina, the MT9V032 [39].
5.1. Memory Implementation
From Section 4 it is evident that the label merging process requires three memory accesses per clock cycle, that is, one write and two reads. The equivalence tables in the labeling process are therefore implemented as triple-port memories. The Xilinx tool set duplicates the memory storage during synthesis; that is, it uses two block RAMs. It is possible to use multitap delay lines if the memory accesses are regular and occur within the same clock cycle. However, for the equivalence table, the memory accesses are irregular, and an independently addressed multiport memory is required in order to implement three memory accesses in one single clock cycle. Another approach to providing increased memory bandwidth would be to raise the clock frequency and serialize the memory accesses. However, the dynamic power dissipation is then expected to increase dramatically, since the clock nets dissipate the majority of the dynamic power [35].
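The duplication trick can be illustrated with a small behavioral model; this Python sketch is ours and only mimics what the synthesis tool builds: the single write port is mirrored into both block RAM copies, while each copy serves one of the two independent read ports.

```python
class TriplePortRAM:
    """Behavioral model of a one-write, two-read memory realized by
    duplicating a dual-port block RAM (illustrative model only)."""

    def __init__(self, depth):
        self.copy_a = [0] * depth   # block RAM #1
        self.copy_b = [0] * depth   # block RAM #2 (mirror)

    def clock(self, waddr, wdata, raddr0, raddr1, write_enable):
        r0 = self.copy_a[raddr0]    # read port 0, served by copy A
        r1 = self.copy_b[raddr1]    # read port 1, served by copy B
        if write_enable:            # one write, mirrored into both copies
            self.copy_a[waddr] = wdata
            self.copy_b[waddr] = wdata
        return r0, r1
```

Both copies always hold identical contents, so the two read ports can be addressed independently in the same clock cycle at the cost of one extra block RAM.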
For this purpose, an experiment has been performed comparing a doubled clock speed with a dual-port block RAM and serialized memory accesses against a triple-port memory implemented as duplicated block RAMs. The memory bandwidth was set to 103 MB/sec in both cases. Synthetic data was used for reading and writing in both memory configurations. The block RAMs were enabled only during clock cycles in which memory was accessed for writing or reading. The results for the dynamic power consumption are summarized in Figure 9. The results show that the clock nets are power hungry, and increasing the clock frequency to achieve a certain bandwidth consumes more power than using additional block RAMs.

Power consumption (dual port versus triple port).
5.2. FPGA Area Utilization and Power Consumption
The utilization of logic cells and power consumption are two important design metrics for an FPGA based hardware system. Power simulations have been conducted for different configurations based on different label code word lengths. The hardware architecture was implemented on a Digilent Atlys Spartan-6 board, which has the capacity to monitor the power consumption on the different available voltage rails. The FPGA core is connected to the 1.2 V rail. The monitors are based on Linear Technology's LTC2481C sigma-delta analog-to-digital converters, which return 16-bit samples for each channel [38]. The clock frequency for the experiments was chosen to be 27 MHz.
The experiments were firstly performed on component labeling alone in order to determine how different label code word lengths affect the area and power consumption. The number of bits assigned for label codes must be based on the estimated maximum number of image components and their complexity. Figures 10 and 11 show the results as reported by the Xilinx ISE design tool set after post place and route simulations. The power simulations were conducted using the power analysis tool included in ISE. Experiments have also been performed for dynamic power consumption with different numbers of objects for labeling along with a COG calculation. Four, 8, 16, and 30 objects of 6000 pixels each were chosen for this task. These four cases correspond to 8, 16, 31, and 58 percent of the total image area, respectively. The power consumption was shown to increase as the number of objects increased. The results of the simulated and measured values are shown in Figure 12. The measured values show an increase in dynamic power dissipation as the number of objects increases.
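The reported coverage figures can be reproduced with simple arithmetic; the 640 x 480 frame size used below is an assumption on our part, chosen because it is consistent with the reported percentages.

```python
# Coverage arithmetic behind the object experiments (assumed 640 x 480 frame).
frame_pixels = 640 * 480
for n in (4, 8, 16, 30):
    pct = 100 * n * 6000 / frame_pixels
    print(f"{n} objects -> {pct:.1f}% of the frame")
# Prints 7.8, 15.6, 31.2, and 58.6 percent, i.e., roughly the reported
# 8, 16, 31, and 58 percent of the total image area.
```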

Percentage LUT utilization versus label code word length for labeling unit.

Power consumption versus label code word length for labeling unit.

(a), (b), (c), and (d) Simulated power consumption with increasing number of objects and (e) measured power with increasing number of objects.

Experimental Setup for verification of architecture.
5.3. Functional Verification, Performance, and Device Utilization
The hardware architecture was modeled in VHDL, and this model was simulated at the register transfer level. The architecture was implemented on a Digilent Spartan-6 board as shown in Figure 13. The experiments were conducted using a variety of input stimuli, and two sample images are shown in Figure 14. The first image shows the segmented image of magnetic particles in the fluid of a hydraulic machine. The presence of magnetic particles in the fluid indicates deterioration inside the machine. The particles are correctly detected in the image, and their features are extracted correctly. The number of magnetic particles and their occupied area can provide an indication of the status of the machine. The second image shows a speed sign; in this case, the objects have been correctly counted, and the extracted features can assist in its correct classification.

Examples of input stimuli used for functional verification.
Both the device utilization and power consumption for the whole architecture, as shown in Figure 6, are presented in Table 1. The pixel clock frequency was set to 27 MHz at a frame speed of 86 frames per second. Ten bits are assigned for label codes, 34 bits are used for the numerator and 19 for the denominator in the COG calculation, and 17 bits are used for the area of each component and 10 bits for each of the bounding box parameters. The Spartan-6 device that was selected for implementation is based on a 45 nm CMOS process, which is optimized for low power and cost. Table 1 shows the simulated power consumption for different elements of the FPGA, the total power consumption, the maximum clock frequency on which the whole system can be operated, and the number of block RAMs and the FPGA slices’ utilization. For comparison, a measured value for the dynamic power consumption is also reported. This measured value was derived after firstly measuring the total power and then subtracting another measured value for the static power consumption.
Power and area utilization for whole design.
5.4. Quantitative Comparison with Other Architectures
In this section, a quantitative comparison of the proposed architecture with already published work is presented. The comparison includes the maximum achievable frame speed and the power consumption results. The comparison is presented for the labeling unit only, in order to provide a fair correspondence of the results for the maximum frame speed and power consumption with related published work. This result is reported for the Xilinx Virtex-II Pro FPGA so as to compare it appropriately with the work of Bailey et al. The results for the maximum achievable clock speed are presented in Table 2. The maximum frame speed can be calculated from the maximum clock speed values. For the proposed architecture, there is no dependence on a row blanking period, so for a given frame size the maximum frame speed follows directly from the maximum clock frequency, as illustrated in the sketch following Table 2.
Comparison of maximum clock frequency and resource utilization.
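As a hedged illustration of the relationship between clock frequency and frame speed when no blanking cycles are inserted, consider the following; the 640 x 480 frame size is our assumption, consistent with the frame speeds reported elsewhere in this paper.

```python
# One pixel is processed per clock cycle, so with no blanking dependence the
# frame speed is simply the clock frequency divided by the frame size.
def max_frame_speed(clock_hz, width=640, height=480):
    return clock_hz / (width * height)

print(max_frame_speed(120e6))   # ~390 frames per second (cf. Section 6)
```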
5.4.1. Latency
Latency is important in time-critical machine vision systems. For a video frame, it is data dependent, since the pixel data originating from the sensor array is sequential in nature. The latency of computation for a labeling process can be defined as the time from the moment the last pixel of an image object has been read out of the sensor until the object's resolved label and features are available.
The latency of our proposed architecture is, for the worst case scenario, equal to one video frame. The latency value reported in [9, 27, 28] is equal to one image row. This claim of latency is only true if an image object occupies one row. In this scenario, the labeled object can be obtained after a row blanking period. It is very unlikely that such an object occupies only one row. For example, for a "U"-shaped object, the equivalence between its two vertical branches is not discovered until its bottom row is reached, so the object cannot be resolved until it has been scanned completely; in the worst case this means a latency of almost an entire frame.
5.4.2. Power Consumption
Low power consumption is important for battery operated machine vision systems [11–13]. There is no reported data on power consumption in the work of Bailey et al. [9, 27, 28]. However, the row blanking period used to resolve equivalences does have an impact on the frame speed. Since our proposed architecture does not have any dependence on a row blanking period, it is possible to achieve the same frame speed at a lower clock frequency. This lower clock frequency will result in lower power consumption as the dynamic power consumption is directly proportional to the operating frequency.
Post place and route simulations have been performed to show this effect more clearly, and the results for power consumption with a row blanking period of zero clock cycles and of 128 clock cycles are presented. The frame speed chosen for the study is 86 frames per second. To achieve 86 frames per second, a 27 MHz clock frequency is sufficient with zero blanking period, while a 128-clock-cycle blanking period requires a 32 MHz clock; the arithmetic is sketched after Table 3. The results in Table 3 clearly show that achieving the same frame speed with 128 cycles of row blanking requires a higher clock frequency than the design without any dependence on the row blanking period. The power consumption subsequently increases as the clock frequency is increased.
Power consumption comparison.
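The two operating points compared in Table 3 follow from simple cycle counting; as before, the 640 x 480 frame size is our assumption, and it reproduces the reported 27 MHz and 32 MHz values.

```python
# With B blanking cycles appended to each row, every frame costs
# (W + B) * H clock cycles at the required frame rate.
def required_clock_hz(fps, blanking_cycles, width=640, height=480):
    return fps * (width + blanking_cycles) * height

print(required_clock_hz(86, 0) / 1e6)     # ~26.4 -> a 27 MHz clock suffices
print(required_clock_hz(86, 128) / 1e6)   # ~31.7 -> a 32 MHz clock is needed
```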
6. Discussion
In this work, a hardware architecture is presented in which the image component features are computed in parallel with single-pass labeling of connected image components. Since there is no requirement for intermediate storage of image data, the latency and overall memory utilization are kept at a minimum. A characteristic of the developed architecture is that the majority of the data required to calculate the image features is collected in data tables in parallel with the labeling process. This single-pass labeling with simultaneous data accumulation is similar to that presented previously by Bailey et al. [9, 27, 28]. When the equivalences are resolved in the equivalence table, the data, which is stored under different label codes, is accumulated and postprocessed into feature descriptors that are used as the input for subsequent object recognition.
Our work is similar to that of Bailey et al. [9, 27, 28] in that single-pass labeling is used and feature extraction is performed in separate hardware modules. Parallel memory accesses within multiple hardware modules to banks of block RAM memories efficiently exploit the massive memory bandwidth available on the FPGA. In addition, the separation of computations is desirable from the perspective of RT-level modeling of machine vision systems. It allows different combinations of hardware modules, for the parallel computation of an object perimeter, ellipse features, feature descriptors computed as histograms, and so forth, to be included in the connected component analysis. These feature descriptors are all relevant to the connected component analysis used for applications such as those presented in [18, 20, 21, 36, 37]. We believe that the modularity and the clear separation between the descriptor computations and the labeling make the presented architecture suitable for design automation. A CAD tool could be developed to read a high-level system description in which a user includes feature descriptors of his or her own choice; from this system description, the CAD tool could generate synthesizable VHDL code. The Embedded Development Kit (EDK) from Xilinx is a similar source code generator for the modeling of general signal processing systems. The EDK allows a user to add IP components of his or her own choice and to generate synthesizable VHDL code for Xilinx platforms. Visual Applets is a tool for graphical modeling and synthesis of machine vision systems on an FPGA [40].
Our work differs from that of Bailey et al. [9, 27, 28] in the scheduling of the resolution of merged labels. The presented architecture resolves labels at the end of each video frame, while Bailey et al. present a method for resolving labels during row synchronization by exploiting the additional clock cycles during the blanking periods. Row and frame synchronization are inherited from the old analog TV systems using CRT displays. Blanking periods are an unnecessary overhead for digital video systems and will thus reduce the system throughput. The architecture presented in this paper operates independently of blanking periods and will thus allow for higher frame speeds, but at the cost of longer latency. Latency is a very important issue for real-time systems, in particular for applications such as optical navigation and modern optical-sensor-based video games.
Such a system should deliver its result for decision making as quickly as possible. Our architecture provides this low latency, which is almost equal to one image frame period. The latency can be reduced to 2.5 msec if the system is operated at the maximum clock frequency and the synchronization overhead is avoided. This performance is achieved thanks to the massive parallelism available in FPGAs.
Area and power consumption for component labeling together with the computation of COG, area, and bounding box are summarized in Table 1. Only 12% of the available slices and 7% of the block RAMs are used. One block RAM is used for the row delay line of the labeling kernel.
In order to store the 34-bit numerator values of one centroid coordinate, two 16-Kbit block RAMs (RAMB16) are required. Four RAMB16s are required to maintain two such numerator tables, one for each coordinate direction.
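A hedged sketch of the block RAM budget arithmetic follows; it assumes that the parity bits of the RAMB16 primitive are used, so that each primitive offers 18 kbit (16 kbit of data plus 2 kbit of parity) and a widest configuration of 1024 x 18.

```python
# Two 1024 x 18 primitives side by side store 1024 words of up to 36 bits,
# which is enough for the 34-bit numerator table of one coordinate.
depth, word_bits, ram_width = 1024, 34, 18
rams_per_table = -(-word_bits // ram_width)   # ceiling division -> 2
print(rams_per_table, "RAMB16 per coordinate,",
      2 * rams_per_table, "RAMB16 in total")
```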
The maximum clock frequency computed by the tool set is 120 MHz, which means that it is possible to process up to 390 frames per second for image component labeling and feature extraction. This frame speed is calculated without including the synchronization overhead present in a video stream. If control of the camera parameters is limited, the suggestion is to employ an asynchronous FIFO buffer on the input video stream to store the incoming pixel data and to reduce the output clock speed such that blanking periods with nonvalid pixels are removed. This means that the pixel clock frequency for the computations on the FPGA can be lower than the pixel clock connected to the image sensor. Keeping the FPGA pixel clock frequency as low as possible is efficient for reducing the dynamic power consumption. This becomes clear from Figure 12, which shows that the clock nets are the dominant source of power dissipation. A smart camera system usually has good control of all sensor parameters through a register programming model. This allows the synchronization overhead of the sensor readout to be reduced to a minimum. For the Aptina sensor MT9V032 used for the experiments in this paper, the overhead can be reduced to only a few clock cycles per frame [39]. The values for the dynamic power reported in Table 1 and Figure 12 are measured for a sensor readout having a synchronization overhead of 16 percent.
Figure 12 shows the increase in power consumption as the number of image objects increases. This is due to the higher power consumption in block RAMs, signal nets, and logic while memory transactions are occurring. The block RAMs are only activated when data transactions to or from the memory occur. If the block RAMs remained enabled at all times, their power consumption would become even greater than that of the clock nets. The fact that the power dissipation stemming from the clock nets is dominant explains why the total dynamic power increases by only 36 percent while the amount of data to be accumulated for the feature descriptors increases by 650 percent in Figure 12. The dominant power dissipation from the clock nets also explains why the serialization of memory accesses is less power efficient than using triple-port memories: serializing memory accesses for increased memory bandwidth also requires an increase in the clock frequency, as shown in Figure 9. The dominant static power, shown in Table 1, can be reduced by selecting a more appropriate device, since for the current setup of the architecture only 12% of the available slices are being used.
7. Conclusion
The hardware architecture presented in this work is suitable for the efficient implementation of machine vision systems. This architecture supports robust, high speed, low latency, and low power smart camera applications.
Configuring the imaging sensor for a reduced synchronization overhead can either increase the maximum frame speed while simultaneously reducing the latency, or reduce the power consumption at a maintained frame speed.
The majority of the dissipated dynamic power stems from the clock nets. Aggressive parallelization of computation and memory accesses, while maintaining the clock nets at a lower frequency, appears to be a good strategy for achieving low dynamic power for DSP systems on an FPGA. Static power can be controlled by selecting an FPGA device whose size matches the size of the application.
