A new approach for real time object detection and tracking onhigh resolution and multi-camera surveillance videos using GPU
来源期刊:中南大学学报(英文版)2016年第1期
论文作者:Mohammad Farukh Hashmi Ritu Pal RajatSaxena Avinash G. Keskar
文章页码:130 - 144
Key words:central processing unit (CPU); graphics processing unit (GPU); morphology; connected component labelling (CCL)
Abstract: High resolution cameras and multi camera systems are being used in areas of video surveillance like security of public places, traffic monitoring, and military and satellite imaging. This leads to a demand for computational algorithms for real time processing of high resolution videos. Motion detection and background separation play a vital role in capturing the object of interest in surveillance videos, but as we move towards high resolution cameras, the time-complexity of the algorithm increases and thus fails to be a part of real time systems. Parallel architecture provides a surpass platform to work efficiently with complex algorithmic solutions. In this work, a method was proposed for identifying the moving objects perfectly in the videos using adaptive background making, motion detection and object estimation. The pre-processing part includes an adaptive block background making model and a dynamically adaptive thresholding technique to estimate the moving objects. The post processing includes a competent parallel connected component labelling algorithm to estimate perfectly the objects of interest. New parallel processing strategies are developed on each stage of the algorithm to reduce the time-complexity of the system. This algorithm has achieved a average speedup of 12.26 times for lower resolution video frames (320×240, 720×480, 1024×768) and 7.30 times for higher resolution video frames (1360×768, 1920×1080, 2560×1440) on GPU, which is superior to CPU processing. Also, this algorithm was tested by changing the number of threads in a thread block and the minimum execution time has been achieved for 16×16 thread block. And this algorithm was tested on a night sequence where the amount of light in the scene is very less and still the algorithm has given a significant speedup and accuracy in determining the object.
J. Cent. South Univ. (2016) 23: 130-144
DOI: 10.1007/s11771-016-3056-6
Mohammad Farukh Hashmi, Ritu Pal, Rajat Saxena, Avinash G. Keskar
Department of Electronics and Communication Engineering, Visvesvaraya National Institute of Technology,
Nagpur, 440010, India
Central South University Press and Springer-Verlag Berlin Heidelberg 2016
Abstract: High resolution cameras and multi camera systems are being used in areas of video surveillance like security of public places, traffic monitoring, and military and satellite imaging. This leads to a demand for computational algorithms for real time processing of high resolution videos. Motion detection and background separation play a vital role in capturing the object of interest in surveillance videos, but as we move towards high resolution cameras, the time-complexity of the algorithm increases and thus fails to be a part of real time systems. Parallel architecture provides a surpass platform to work efficiently with complex algorithmic solutions. In this work, a method was proposed for identifying the moving objects perfectly in the videos using adaptive background making, motion detection and object estimation. The pre-processing part includes an adaptive block background making model and a dynamically adaptive thresholding technique to estimate the moving objects. The post processing includes a competent parallel connected component labelling algorithm to estimate perfectly the objects of interest. New parallel processing strategies are developed on each stage of the algorithm to reduce the time-complexity of the system. This algorithm has achieved a average speedup of 12.26 times for lower resolution video frames (320×240, 720×480, 1024×768) and 7.30 times for higher resolution video frames (1360×768, 1920×1080, 2560×1440) on GPU, which is superior to CPU processing. Also, this algorithm was tested by changing the number of threads in a thread block and the minimum execution time has been achieved for 16×16 thread block. And this algorithm was tested on a night sequence where the amount of light in the scene is very less and still the algorithm has given a significant speedup and accuracy in determining the object.
Key words: central processing unit (CPU); graphics processing unit (GPU); morphology; connected component labelling (CCL)
1 Introduction
Real time processing of video surveillance (VS) data is important due to wide requirement of monitoring namely for security purposes. Currently, these areas include traffic monitoring, security of public places and critical infrastructures like dams and bridges, preventing cross-border infiltration, identification of military targets, medical image processing and providing crucial evidences in the trials of unlawful activities. Detection of any suspicious activity in these areas is a prime area of concern. In military activities, detecting any suspicious movement of object and tracking its path are very relevant. Moreover, tracking the paths or trajectories of satellites and other objects to calculate its motion-related phenomenon to name its speed, velocity, acceleration, size, distance, is also a growing need today. In medical image processing [1], many topics like functional magnetic resonance imaging (fMRI), diffusion tensor imaging (DTI), image denoising algorithms, filtering, interpolation, histogram estimation and distance transforms, etc. require heavy computations. In addition, in other fields like traffic areas, managing the traffic and locating a particular object in a high-traffic zone are sometimes tough tasks. All these require high computation on data and perfect results in real time. Real-time processing is also required in emerging fields of automated video content analysis and robotics.
Surveillance deals with large amount of visual data pilling up each second. Processing of such large amount of data in real time is becoming computationally challenging and costly. VS algorithms represent a class of problems that are both computationally and time complex. Real time image and video processing in the last decade has become more challenging due to altering need in scientific, defence and consumer needs in which inexpensive high resolution digital cameras and various networking compute devices are used.
The recent development of multi core central processing unit (CPU) and graphics processing unit (GPU) is a path breaking achievement in computer vision. They provide parallel hardware and match the data processing requirement in video processing. Optimized utilization of this parallel hardware, suiting the algorithms to be run, can alleviate the time consumption.
In this work, motion estimation based object detection on parallel architecture is discussed in detail. It mainly focuses on the parallel techniques to implement motion detection algorithms involving (1) background estimation, (2) thresholding and (3) connected component labelling (CCL).
2 Related work
Computer vision and image analysis is becoming very important in many areas of life. Video surveillance is one of such areas. A practical 24 h monitoring by humans in many areas is becoming crucial and thus the need of camera and video surveillance is increasing. As far as the previous efforts in this field are concerned, BUCH et al [2] suggested that the demand for such field is increasing in intelligent transport systems due to the decreased hardware cost. Also, the increasing use of cameras has opened a wide area for video analytics and many problems related to areas like traffic violation, congestion, pre-traffic warnings, etc. can be solved very easily. Thus, a numerous works have been done in traffic related areas and many literatures related to it also exist. ATTARD and FARRUGIA [3] proposed a vision based automated surveillance system to detect humans and vehicles from a video. HASHMI and KESKAR [4-5] suggested a statistical approach for determining traffic parameters at heavily crowded intersections. Their algorithm offered accurate tracking and counting of objects with high efficiency at high traffic density T intersection. Their approach gave the count of the objects in real time video sequences. Our work is related to motion detection of objects in any type of scenario for high resolution videos. Motion detection is very important in video sequences. A lot of literature review exists in motion detection in videos. HUANG [6] proposed the detection of moving objects using background model, alarm trigger and object extraction. The background modeling (BM) model used a two-phase background matching procedure by rapid matching and accurate matching to get optimum background. The alarm trigger (AT) module eliminated the unnecessary examination of the entire background region and object extraction (OE) module finally estimated moving objects based on binary object detection mask. Also with some other methods, background subtraction and foreground estimation can be done. FRADI and DUGELAY [7] proposed foreground segmentation based on uniform motion model with GMM background segmentation. Similarly, CHENG and RUAN [8] used background matching method with absolute difference estimation and Cauchy distribution model for motion detection method. Some works have also been done in pre-processing algorithms on GPU. RIHA and HODA [9] proposed a comparison between CPU and GPU algorithms for red-green-blue (RGB) to colour conversion. In our algorithm, two of the most important parts are morphology and CCL. Morphology is an important image processing tool for analysis of spatial structure based on pre-defined structuring elements. A lot of algorithms exist in this area. RANE [10] presented various techniques used for morphological operations. THURLEY and DANELL [11] presented a similar approach for the vHGW algorithm for erosion and dilation independent of structuring element size for horizontal, vertical, and 45° line structuring elements with significant performance improvements over NVIDIA Performance Primitives Library (NPP). DOMANSKI et al [12] also explained parallel technique for morphological erosion and dilation by NVIDIA. As for CCL, SOH et al [13] proposed two different CCL techniques for matrices. They have shown that their approach is better than the mask based method, as proposed by WU et al [14]. We have used an improved version of their technique on images for object estimation and tracking in our proposed algorithm.
3 Methods
3.1 GPU architecture
A digital image is a 2D matrix of data. Video is a set of images placed sequentially on time axis called frames. All operations of video processing are just convolution functions to be performed on each pixel and each frame. Thus, video processing can be best suited to single instruction multiple data (SIMD) architectures. Graphics processing unit (GPU) is a SIMD device.
Programmable GPUs are parallel multi-cored coprocessors specialized for parallel computing large chunks of data. CUDA (compute unified device architecture) developed by NVIDIA is a platform for parallel computation of algorithms by utilizing the power of CPU and GPU together.
The CPU and the GPU are separate devices with their separate memories. Thus, simultaneous working of CPU and GPU is possible without any memory contentions. GPUs have hundreds of cores which can run the same code on thousand of threads simultaneously dealing with their dedicated registers and shared memory, as shown in Fig. 1. Threads within a block work together by sharing the data through shared memory and are synchronized to coordinate memory accesses. Threads from different blocks are not synchronized and can execute in any order. The number of threads per block is limited by the memory resources available to the processor core. The GPU creates threads, schedules, and executes groups of 32 parallel threads called warps.
Figure 1 shows the GPU architecture and the memory organisation. Every thread has its own personal registers. Every block shares memory where every thread inside a block can access and thus inter block interaction is possible. Also, every thread can access with global memory to move data to and fro from CPU. The speed of memory transfer is faster with registers for a thread than with the shared memory and is the least with the global memory. Thus, optimized usage of various memory elements, when required, is necessary for optimization of timing required by the running algorithms. CPU is the ‘host’ here and the GPU is referred to as the ‘device’. Memory transfer between CPU and GPU is done in concordance with CPUs host memory and GPUs global as well as constant memory. Constant memory is accessible to threads only for reading purpose.
3.2 Real time motion detection
Motion detection is an action of estimating objects based on physical movement in given video frames. In general, motion detection includes inter frame subtraction, removal of noise, identification of the moving pixels belonging to the object of interest among all the moving pixels and thus identifying and detecting the object in a frame. To perform the task of motion detection in real time, as stated in Gems 3 by Wen-Mei et al [15], for a PAL (phase alternating line), 25 frames must be completely processed within l s, i.e. each frame must be processed in less than 40 ms, prior to the arrival of next subsequent frames. Various film and television companies use this rate for direct compatibility with television frame rates. Similarly, for NTSC (National Television System Committee), 30 frames should beprocessed within l s, i.e. each frame should be processed in less than 33.33 ms, prior to next frames. For high resolution cameras or multiple camera systems, the time for processing each frame gets correspondingly scaled down, making the task of computation more complicated. A parallel processing approach thus becomes noteworthy for real time motion detection.
The best serial approach to a problem is often different form the best parallel approach to solve it. Thus, many algorithms need to be redesigned to fetch the parallel goal. This work illustrates the implementation of the algorithm on Intel i7-4770 CPU system running at 3.40 GHz having CUDA 5.5 enabled NVIDIA GEFORCE GT 640 GPU having 384 cores, on Windows 8 (64 bit) operating system. It mainly focuses on making the motion detection algorithm more parallel by optimization of logical memory usage, and intelligible flow of data within the wraps so as to achieve real time processing.
4 Proposed methodology
Figure 2 shows the different stages of the video object detection algorithm. The stages are illustrated as follows.
The input frames are given to background modelling unit where an initial background is made and then it is updated as per the consecutive frames. These frames are subtracted from the background and we get frames that have moving elements in them which also contains potential objects. Then, according to the frames, dynamic threshold is calculated and applied on it. Then,we remove noise by applying different morphological techniques and pass these frames to CCL unit to connect different components and estimate the objects.
Fig. 1 CUDA GPU architecture and organization of internal memory hierarchy
Fig. 2 Block diagram for proposed algorithm
4.1 Self adaptive background model making
In general, identifying a moving foreground object can be done by taking absolute difference between each frame and the background. Good quality motion detection requires a superior quality of background frame to be subtracted. Issues involved here are gradual illumination changes altering the background from time to time, dynamic background movements such as moving trees; modifying the background model and many more. Often the non steady and rotating nature of cameras makes the background scene variable. Thus, a dynamic and self adaptive background is needed.
The adaptive background model differencing involves 3 steps: initial background model, reference pixels selection, and background revision. The initial background model is generated by taking an average of first N frames, as given by
(1)
where Bi(x, y) represents the initial background model of N frames (experimentally set to be 50 here) and Ci(x, y) represents the current incoming video frame.
To adapt the background without the moving object, new reference pixels are selected. They are selected based on the following function:
(2)
where Bt-1(x, y) represents a current reference pixel, Rt-1(x, y) represents corresponding past reference pixel, C(x, y) represents each pixel value of the current input frame, and Rt(x, y) is pixel initialized by the initial background frame generated for N frames.
The initial background model gets modified by the modification of only those pixels generated as reference pixels in the previous step:
(3)
where Bt(x, y) represents a current background frame, Bt-1(x, y) represents past background frame, and Rt(x, y) represents current reference frame generated by taking reference pixels together.
Since pixel by pixel modification of background is done, noise pixels get apart from the background model (except the constant noise which plays no significant role here). Thus, a dynamic and adaptive background is generated.
4.2 Gaussian modelled dynamic thresholding
Suitable threshold is required to get a proper binary image. Here, threshold is deduced by approximating two Gaussian curves into the gray frame intensity histogram. Dynamic threshold is calculated based on these two Gaussian curves (this is equivalent to the deepest valley between two modes of a bimodal histogram). The complete process is as follows.
The histogram of the intensity values of the frame is shown in Fig.3.
Fig. 3 Histogram of intensity values of all pixels in a frame
T1 and T2 are start and end intensity values for the valley bin in the histogram respectively:
(4)
The dynamic threshold may be taken as the intensity value for the valley point or can be calculated by the equation given below. Let the two Gaussian probability density functions be G(μ1, σ1) and G(μ2, σ2) and their variances are considered as equal i.e. σ2= and then the threshold T is given by
(5)
where p1 and p2 are the probabilities of occurrences of two classes of pixels.
This threshold is applied on the current frame and the current background model. The binary background is then subtracted from the binary frame to get only the motion pixel present in binary format. Due to pixel by pixel modeled adaptive background and the dynamic threshold, the resultant binary image consists of foreground motion pixels and minor noise pixels.
4.3 Fast morphological noise removal
Even after identification of the motion foreground pixels from a frame, some kinds of noise elements cross the threshold. There is a vital need to remove these noise elements before proceeding to CCL method. This is achieved by morphological operations of closing and opening [10]. Every pixel in the output is based on its comparison with its neighbours, on the basis of the shape and size of the structuring element. A structuring element (SE) is a small binary image (made of 1 and 0) used to analyze bigger image based on properties of interest. The structuring element has both a shape and an origin. Figure 4 shows a rectangular structuring element. The origin of SEs is marked by a circle and shaded square denotes a member of the SE (i.e. 1 or 0). If the SE is rectangular, smaller sized background elements are appended to form rectangular shape.
Fig. 4 5×5 rectangular structuring elements
The output pixel of erosion operation is obtained by placing the centre element of the structuring element on the current pixel and replacing it with the minimum of the neighbouring pixels which are overlapped by 1 by the structuring element. The output pixel of dilation operation is obtained by placing the centre element of the structuring element on the current pixel and replacing it with the maximum of the neighbouring pixels which are overlapped by 1 by the structuring element.
Opening is erosion followed by dilation and closing is dilation followed by erosion. Opening is useful to remove minute false holes and isolate touching objects, and closing would unite broken responses.
The opening of a set A by structuring element B [10] is defined as
AB=(A-B)B (6)
Similarly, closing of a set A by structuring element B [10] is defined as
A·B=(AB)-B (7)
These operations being pixel by pixel operated are unworthy if done sequentially. A parallel approach is not only instinctive but also a requisite for real time processing. The parallel approach is explained briefly in CUDA implementation section.
This approach of implementation of morphological operation is as referred in Ref. [12]. The complexity of this method is independent of the size of the structuring element. This method is for 1D SE. Let the structuring element be of size p. The image matrix rows are then divided into segments of length p with (p-1)/2 columns overlap on consecutive segments to form a window of size 2p-1, centered at p-1, 2p-1, 3p-1, 4p-1, …, as shown in Fig. 5.
In a given window (w), all pixels p in that window are scanned and two arrays are created, namely, suffix(max) array and prefix (max) array. All the pixels left to the centre pixel overlapped with 1 in SE are used to calculate members for suffix max array and the right pixels overlapped with 1 in SE are used to calculate the members of prefix max array. The array members of suffix max array and prefix max array are nothing but the max element among the left and right pixels to the centre pixel, respectively, in their corresponding SE window, as shown in Fig. 6. After all the elements in given image are scanned, these two arrays are merged together.
Fig. 5 Row segmented into 2p-1 sized windows with two sides overlapping of (p-1)/2 elements
S[k]=max(w[j]), j=k, …, (p-1)
P[k]=max(w[p-1+j]), j=0, …, K
where S[k] is suffix max array; P[k] is prefix max array.
As shown in Fig. 7, For each pixel j, the output of dilation is given by
Dilate[j]=max(R [j-m], S[j+m])
where m=(p-1)/2.
Fig. 6 Performing suffix/prefix max for left/right half of array
Fig. 7 Writing final results to output image
For erosion, the above same procedure is followed by taking suffix min array and prefix min array of the element in a given window overlapped by 1 in SE using following equations:
S[k]=min (w[j]), j=k, …, (p-1)
P[k]=min(w[p-1+j]), j=0, …, K
For each pixel j, the output of dilation is given by
Erode[j]=min(R[j-m], S[j+m])
where m=(p-1)/2.
For dealing with 2D SE, we cascade the 1D procedure on rows followed by that on columns. For this, we first apply dilation on image rows and the result is then transposed and then again dilation is done by the 1D structuring elements. This gives us dilation by 2D structuring element. But, we can conclude during experiments that 1D structuring element gives better result compared to 2D structuring element and hence we have done morphology by 1D structuring element only.
Thus, after applying morphology, we get pure thresholded frame that contains only moving object pixels and no noise is present. This helps us in further steps to perfectly estimate the object of interest.
4.4 Parallel connected component labeling
Connected component labeling is the algorithm which works on binary images to identify various objects in the image by checking the pixel connectivity. CCL method is basically an image segmentation to identify and give labels to foreground pixels from the background. The algorithm mainly includes scanning the binary image for each pixel, from top to bottom, left to right, along diagonals, collectively to identify any connectivity among its neighbours. This is done by assigning labels to the pixels and then replacing the connected pixels with a single label (this single labeled connected region is called a map).
CCL algorithms can be divided depending upon pass into three classes: (1) one-way pass; (2) two-way pass; (3) multi-way pass.
In one-way pass algorithm, the whole image is scanned, once from top to bottom then left to right, and a new label is given to an unlabeled pixel. Then, the connected pixels within the labeled pixels are searched and given the same label. This is recursively done till no more modifications in labels are further possible. The searching algorithm is sequential with heavy computational overhead, since the neighbouring labels can possibly change. The two-way pass involves three steps. These are scanning, investigation and labeling. In scanning, a label is given to each pixel and relationships with its neighbours are recorded. In investigation step, it tries to find out the final label based on the equivalence chains. In labeling, final label is given to the each label. In multi-way pass, a local neighbourhood is assumed to search minimum label and memorize the labels equivalences.
Normally, CCL algorithm implementation is done by using mask technique. We apply different mask on the pixels and find out the minimum label in that mask. This method is applied recursively on the frame to successfully detect the foreground object. But only on a single pixel, many different masks should be applied recursively to get the best result. However, each of the above algorithms is sequential due to large amount of spatial dependency in the algorithms. Moreover, the computational load increases as the image size increases and the number of recursions rises with the complexity of the shape of connected objects. A parallel approach is thus of great need to improvise the algorithm and increase its reliability.
This method of CCL is carried on the threshold binary output. The algorithm consists of two passes. In the first pass, labeling, the pixels which are 1 are given label which is nothing but the sequential index of the pixel in the image matrix. These pixels are known as focused pixels. In the next pass, analysis, all the 8 directions of the focused pixels till the image limits, are scanned and the focused label is replaced by the minima. The example (Fig. 8) illustrates the CCL algorithm in pictorial format. The Matrix 1 represents the binary thresholded image. The shaded regions show the connected components in the image. After the first pass, the “1” is replaced with its sequential indices, as shown in Matrix 2. In the analysis pass, the 8 direction minima of the focused pixels are searched and the element is replaced by the corresponding minimum. Recursive iterations are carried on the output matrix with the same 8 directional minima search, until the output remains unchanged even after a pass. For the complex shaped connected component, as shown in Matrix 4, six iterations are required.
Fig. 8 Example for CCL algorithm in pictorial format:
In this method, 8 direction minima search is carried on each pixel based on the input image. Every pixel operation is self-determining. Spatial dependencies get reduced to zero. Thus, parallelism can be easily implemented with this method.
4.5 Overall process flow
The complete process of motion detection in pictorial format is shown in Fig. 9.
5 CUDA implementation
This section explains the CUDA algorithm which gears up the motion detection on GPU. In the dynamic partitioning of streaming multiprocessors (SM) [16], resources are very important. The execution resources in a SM include registers, thread block slots and thread blocks. In optimized utilization of GPU, it is important that many threads get executed simultaneously, which depends on the available hardware on GPU. NVIDIA Geforce GT 640 is available with maximum 1024 threads per block and 2048 threads per multiprocessor. That is to say, if we assign 32×32 block size, then only 2 blocks get executed at a time leaving the 6 block slots idle. If we use 8×8 block size, then 16 blocks are to be executed by the multiprocessor but executes only 8 at a time due to limitation in block slots. However, if 16×16 block size is taken, then all the 2048 threads of a streaming multiprocessor get equally distributed into the 8 blocks slots and get executed simultaneously. Hence, the 16×16 block size emerges the optimized one.
Maximum accessible register per multiprocessor is limited to 65536, which limits each thread in a multiprocessor to access maximum 32 registers. If as per the code, a thread requires 32 or less registers, calculating registers per block which is 32×16×16×8= 65536, all the blocks can execute properly, as shown in Fig. 10(a). But if each thread requires more registers, for example, if each thread requires 35 registers, calculating it for a bock, which is 35×16×16×8=71680 which exceeds the memory hardware available, then less number of blocks will get scheduled by the GPU, as shown in Fig. 10(b).
5.1 Adaptive background and threshold implementation
Our approach to background modeling involves first summing the N frames together and calculating the average of these frames. It further involves the revision of the background procedure pixel by pixel. This is done by assigning M frames and the current background together to the kernel function, among which reference pixels are chosen parallel according to the function mentioned in Section 4.1.
The reference frame gets generated over every M frames and so the background is updated on each cycle.
Thresholding follows the background modeling where we apply the same threshold generated per frame as per Section 4.2 to the frame and the background. Threshold is applied to large number of pixels in parallel.
Fig. 9 Pictorial representation of proposed algorithm
Fig. 10 Interaction of resource limitation:
The binary background is then subtracted from the binary frame.
Algorithm 1: Pseudo code for self-adaptive background model making and Gaussian modeled dynamic thresholding
//Making initial background
Init Back (frames, N)
for i=1 to N
add all the first N frames and store them in a temporary memory T
calculate the initial background frame by dividing all the pixel values in T by N
end
//making reference frames
Initial reference frame is the initial background calculated
As per current frame and the last reference frame the new reference frame is calculated
//updating //update background
for i=1 to (rows*cols)
if reference pixel available for the pixel in current background
then current background pixel gets replaced and current background is updated to new one.
else
keep the background pixel as it is
end
end
//thresholding
for i<=no of frames
calculate threshold for current frame
apply the threshold to current frame and the current background
subtract the binary background from the binary frame
5.2 Morphological implementation
In our morphology algorithm, we have two stages: pre-processing stage and merging stage.
1) In pre processing stage, the maximum number of comparisons required to compute P[k], S[k] and the maximum of them is 2(p-1).
2) In merging stage, the total number of comparison required to merge P[k] and S[k] is p-2.
Here, maximum of p windows is computed, hence total number of comparisons per window is
(8)
Also, we would like to mention that for large value of p, the number of comparisons required for pre- processing stage becomes two and for merging stage it is one more such comparison.
Algorithm 2: Pseudo code for fast morphological noise removal
Divide all rows into equal p segments and form (p-1)/2 overlapping columns on consecutive segment.
for every window
do
for all p pixels form two arrays namely suffix and prefix
// for dilation
if structuring element is 1
Suffix array <- max element overlapping with structuring element to the left to centre pixel
Prefix array <- max element overlapping with structuring element to the right to centre pixel
end if
for each pixel j
dilate [j]=max(R[j-m], S[j+m]), where m= (p-1)/2
end for
// for erosion
if structuring element is 1
Suffix array <- min element overlapping with structuring element to the left to centre pixel
Prefix array <- min element overlapping with structuring element to the right to centre pixel
end if
for each pixel j
Erode[j]=min(R[j-m], S[j+m]), where m=(p-1)/2
end for
exit
5.3 CCL pseudo code and performance analysis
There are two stages in CCL: one is map formation and the other is recursion to make the minimum maps in the frame.
1) In map formation step, a current pixel label is compared and the minimum label is stored for all the 8 directions until a background pixel is found. So, the number of comparison steps per pixel varies according to the complexities of the shape. Normally, the number of comparisons is very large here, so the number of branches that we get in parallel is very large. But, since it is done for all the pixels at a time, the number of steps for comparison reduces from serial implementation to parallel.
2) Since we do comparison in all the 8 directions till we get a background pixel and not only for a 3×3 neighbourhood as done in mask methods, the number of iterations needed for arriving at the lowest label for every pixel in the frame reduces to a large extent. Hence, the number of recursion of CCL barely increases above 5 even for a panoramic (2560×1440) frame size and hence the number of computation per frame for calculating accurate CCL is reduced.
Algorithm 3: Pseudo code for parallel CCL
for i =1 to n
do
if p is a foreground pixel
then give it a label according to its position
else no label assignment
a: for j=1 to 8
for every focused pixel, compare its label with all the labels in all 8 directions until a background pixel is found and store it in variable minj
end for
for 1<=j<=8
m=minimum(minj)
replace current pixel label with m
end for
if label change then go to point a
else exit
end for
6 Simulation results and discussion
The above algorithms of moving object detection and tracking are implemented on NVIDIA GEFORCE GT 640 with 3.40 GHz Intel (R) Core (TM) i7-4770 with 4 core and 32 GB RAM. The GPU has 384 cores with 2 GB of dedicated memory for storage of data sets in local memory. The development environment used was Visual Studio 2012 and CUDA version 5.5 for profiling the applications. Some of the important features of GT640 that can affect the implementation are specified as compute capability: 3.0 X, graphics Clock: 901 MHz, memory Clock: 1782 MHz, memory bandwidth 28.5 GB/s.
The image sizes on which we have tested our algorithm are 320×240, 720×480, XGA size of 1024× 768, standard half HD 1360×768, full HD 1920×1080 and beyond HD like panoramic images 2560×1440.
Table 1 shows the tabular information about the relative speedup of GPU over CPU for different video dataset for first 300 frames. We have tested our algorithm on different video datasets from different sources like OTCBVS benchmark database, CAVIAR database, i-LIDS image library supplied to AVSS 2007, Delhi traffic database and MIT traffic database. We have tested different frame size videos ranging from 320×240 to 854×480. Table 2 shows information about time taken by different parts of the algorithm in GPU, GPU memory transfer and CPU for different frame sizes. Table 3 shows information about time taken by different segments of algorithm for different block sizes for 1920×1080 frame size. We can see that the total time is minimum in case of 16×16 block size, whereas for other block sizes, the time required is more. Hence, the optimum block size is experimentally proven to be 16×16, the same as concluded in CUDA implementation section. Table 4 shows information taken from CUDA profiler about different parts of the algorithm.
Figure 11 shows the entire process for different video datasets. Here, we have taken certain frame from the sequence of frames to show the results. The first row shows the gray frames that we have generated from the colour frames. The second row shows the sequence of gray adaptive background model that we have developed from the adaptive background model differencing
algorithm. The third row shows the sequence of thresholded frames. The results are generated by applying the dynamic threshold and subtracting the background model from the current frame. The results contain certain noise in visual and to get better results we need to remove noise and hence, we apply morphological opening and closing. The fourth row of sequence shows the results after applying connected component labelling and extracting the object of interest based on the count of maps. We can see that only the object of interest is there in the frames. The fifth row of sequence shows the bounding box around the object of our interest according to the frames obtained after applying CCL.
Table 1 CPU and GPU time comparison for various videos
Table 2 Background modelling time for GPU, CPU and data transfer for various frame sizes
Table 3 Dynamic thresholding time for GPU, CPU and data transfer for various frame sizes
Table 4 Morphology operation time for GPU, CPU and data transfer for various frame sizes
The algorithm works well for all the database videos mentioned above. The relative time between CPU and GPU are given in Table 1.
We can see that for CPU, our algorithm takes 4.02125 s for 300 frames and for GPU it takes 0.401812 s for 300 frames. Thus, our algorithm gives overall speedup of 10. Also, as the frame data become more and more complex, the processing time will go on increasing. Since for real time processing, scenes which are much more complicated in nature need to be processed, the CPU will not give feasible and effective result. Thus, GPU will give us much faster results.
6.1 Parallel background modeling performance
Figure 12 shows the comparison between serial and parallel implementation of our background modeling for different frame sizes. The results show that speedup of 17.54 for frame size 320×240 and the peak speedup of 17.85 for a frame size of 720×480. Even for high resolution, the speedup is quite good of 12.95 for a frame size of 1920×1080. The speedup of 12.63 is gained for panoramic size frame 2560×1440. The memory transfer time taken by background modeling increases as the frame size increases because the data to be transferred in the global memory for the GPU processing increase. Also from profiler output (Table 4), we can see that background modeling is active for 2.28% on GPU and has occupancy of 82.87%. It takes 326400 total memory transfers between global memory and kernel.
Fig. 11 Results on video sequence:
Fig. 12 Speedup between GPU and CPU for background timing for various frame sizes
6.2 Parallel dynamic thresholding performance
Figure 13 shows the comparison of timing of serial and parallel implementation on CPU and GPU for different frame sizes. Here, we can see that the maximum speedup can go up to 18.18 for frame 320×240, which gradually reduces to 3.189 for the frame size of 2560×1440. For high resolution frames, the speedup is about 8.5 which is pretty good. Here also, as the size of frame increases, the time taken for memory transfer goes on increasing. From Table 4, we can see that it is active only for 0.56% on GPU and it has occupancy of 80.36%. It only takes 261120 total memory transfers between global memory and kernel.
Fig. 13 Speedup between GPU and CPU for dynamic thresholding for various frame sizes
6.3 Parallel morphological performance
Figure 14 shows the comparison between serial and parallel implementation of morphology algorithm on CPU and GPU for different frame size. Here, we have used a 5×5 structuring element for opening and closing process. We can see that here maximum speedup of 36.84 is achieved for frame size 320×240 and minimum speedup is 3.98 for frame size 2560×1440. For high resolution frame, the speedup is about 8. Here also, the time taken for data transfer increases as size of the frames goes on increasing. But the time taken here is less as compared to time taken by threshold part because the data from thresholding are only binary data which need very less memory for storing the entire frame and hence the time for memory copy operation is reduced. It is active for 0.48% only and occupies 79.97%, as shown in Table 4. It only takes 195840 global memory transfers with kernel. This part of the algorithm is made generic so that it can run for all sizes of structuring element.
Fig. 14 Speedup between GPU and CPU for morphology operations for various frame sizes
6.4 Parallel CCL performance
Figure 15 shows the comparison of serial and parallel implementation of our CCL algorithm on CPU and GPU for different frame sizes. Here, the maximum speedup we achieve is 52.63 on frame size 320×240 and the least speedup that we achieve is 1.59 on frame size 720×480. For high resolution frames, the speedup is about 3.4. Here, since this process is recursive, we have shown time taken by the CCL algorithm only for the first iteration. The time for memory transfer increases as the frame size goes on increasing (Table 5). This is the reason why the activity of CCL kernel on GPU is 2.65% and the occupancy is 88.62%, both of which are high as compared to all other segments of algorithm. Also, it takes the maximum global memory transfer with kernel which is 3050826.
6.5 Overall performance
Figure 16 shows the comparison of serial and parallel implementation of overall algorithm on CPU and GPU for different frame sizes. Here, the maximumspeedup we achieve is 11.48 on frame size 320×240 and the least speedup that we achieve is 8.10 on frame size 1360×768. For high resolution frames, the speedup is around 7.5. Table 6 gives the overall time required on CPU and GPU for the complete execution of the algorithm.
Fig. 15 Speedup between GPU and CPU for CCL for various frame sizes
Table 5 CCL and object estimation time for GPU, CPU and data transfer for various frame sizes
Fig. 16 Speedup of CPU vs GPU for different frame sizes
Table 6 GPU and CPU time for overall algorithm for different frame sizes
6.6 Block size for GPU kernels
As explained in the CUDA implementation section, the thread block size assigned to a kernel function in execution is one of the major factors in utilization of the SIMD system efficiently. Experimental time for various parts of the algorithm is recorded while giving different thread block sizes to the same kernel function running on a video having frame size of 1920×1080, as shown in Table 7.
Table 7 Comparison of time (μs) for different functions with various sizes of thread blocks on 1920×1080 size frame
The performance of GPU from Fig. 17 shows that the time required for all processes reaches minima when the block size is 16×16. This is due to full utilization of the multiprocessor as explained in CUDA implementation section. The theory results are verified by the experimental results.
6.7 Profiler output
The CUDA profiler outputs showing various parameters about the run time parameters are given in Table 8. We see that the occupancy in all parts of the algorithm is about 80%-85%. That is to say, the GPU system is occupied by about 80%-85% with data in execution of the various algorithms. The device (GPU) time is very less of about 0.5% for threshold and morphology operations, while it is about 2.5% for the background generation and CCL. This is because in adaptive background modeling, the background is first made for initial set of frames which are loaded to and fro from global memory and afterwards it is updated as per the consecutive frames. Also in CCL, the algorithm is recursive till we get the minimum label on every pixel position. This is why these kernel functions are active for more time compared to thresholding and morphology kernels. The total global memory usage is shown in terms of loading data on the global memory by read operations done by GPU threads and setting data on global memory by write operations done by the GPU threads.
Fig. 17 Performance of differnt functions in algorithm by changing block dimensions
Table 8 CUDA profiler outputs for 1920×1080 frame size
6.8 Results on night video
The above algorithm is also tested on various adverse conditions. The algorithm gives good tracking results even for night videos. The tested video results are shown in Fig. 18. Detecting the object is not as per mark but can be improved by further additions of artefacts in the algorithm.
Fig. 18 Experimental results for night video
Figure 18 shows that even without adding any specific artefacts about night motion detection, the algorithm performs well enough to track the motion objects. The complete algorithm being independent of classifiers and based on motion, easy detection and tracking is possible. The self adaptive background and threshold mechanisms make the overall system robust and authentic.
7 Conclusions and future work
1) The results show that we have achieved significant speedup to a factor of about 13.14 for self-adaptive background model making, about 5.543 for Gaussian modeled dynamic thresholding, about 6.636 for fast morphological noise removal, about 3.168 times for parallel CCL and about 7.30 for the overall algorithm for parallel implementation in comparison to sequential implementation. Hence, our algorithm provides a significant speedup compared to any algorithm currently existing. We have examined the performance on only GEFORCE GT640 in this work.
2) Also we tested our algorithm for 8 different sets of videos with different frame sizes and frame rates, and we have achieved a minimum speedup of about 3.70274 and maximum speedup of about 25.32801.
3) Also we have verified from our results that the minimum execution time for our algorithm is achieved by 16×16 block threads.
4) We also tested our algorithm on the night sequence where the amount of light in the scene is very less. Though our algorithm is still unable to provide the exact shape of the object, it is able to properly track the object.
5) As future extensions, we shall include testing the proposed implementation on other GPUs like AMD processors using open CL and compare the results with those of NVIDIA GPUs. Further, we want to expand the algorithm’s limits to give efficient detection on night videos. Other future work includes addition of various artefacts in the proposed algorithm to generalize it highly and make it as much robust as possible.
References
[1] EKLUND A, DUFORT P, FORSBERG D, LACONTE S M. Medical image processing on the GPU–Past, present and future [J]. Medical Image Analysis, 2013, 17(8): 1073-1094.
[2] BUCH N, VELASTIN S A, ORWELL J. A review of computer vision techniques for the analysis of urban traffic [J]. IEEE Transactions on Intelligent Transportation Systems, 2011, 12(3): 920-939.
[3] ATTARD L, FARRUGIA R A. Vision based surveillance system [C]// Proceedings of IEEE EUROCON-International Conference on Computer as a Tool (EUROCON 2011). 2011: 1-4.
[4] MOHAMMAD FARUKH HASHMI, KESKAR A G. Analysis and monitoring of a high density traffic flow at t-intersection using statistical computer vision based approach [C]// Proceedings of 12th IEEE International Conference on Intelligent Systems Design and Applications (ISDA-2012). Kochi India, 2012: 52-57.
[5] MOHAMMAD FARUKH HASHMI, KESKAR A G. Video surveillance for disorganized traffic flow at T–intersections [C]// Proceedings of Elsevier, Seventh International Conference on Image and Signal Processing (ICISP-2013). Bangalore India, Book Series Elsevier Science and Technology, Elsevier India, 2013: 51-61.
[6] HUANG Shih-chia. An advanced motion detection algorithm with video quality analysis for video surveillance systems [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2011, 21(1): 1-14.
[7] FRADI H, DUGELAY J. Robust foreground segmentation using improved Gaussian mixture model and optical flow [C]// Proceedings of IEEE International Conference on Informatics, Electronics & Vision (ICIEV-2012). Dhaka, Bangladesh, 2012: 248-253.
[8] CHENG Fan-Chieh, RUAN Shanq-Jang. Accurate motion detection using a self-adaptive background matching framework [J]. IEEE Transactions on Intelligent Transportation Systems, 2012, 13(2): 671-679.
[9] RIHA L, HODA El-Sayed. Real-time motion object tracking using GPU [C]// Proceedings of IEEE 9th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA-2011). Sharm El-Sheikh, Egypt, 2011: 301-304.
[10] RANE M A. Fast morphological image processing on GPU using CUDA [D]. Pune: College of Engineering, 2013.
[11] THURLEY M J, DANELL V. Fast morphological image processing open-source extensions for GPU processing with CUDA [J]. IEEE Journal of Selected Topics in Signal Processing, 2012, 6(7): 849-855.
[12] DOMANSKI L, VALLOTTON P, WANG Da-dong. Parallel van Herk/Gil-Werman image morphology on GPUs using CUDA [C]// Proceedings of Conference Posters GTC. Santa Clara, CA, USA, 2009.
[13] SOH Youngsung, ASHRAF H, HAE Yongsuk, KIM Intaek. Fast parallel connected component labeling algorithms using CUDA based on 8-directional label selection [J]. International Journal of Latest Research in Science and Technology, 2014, 3(2): 187-190.
[14] WU K, OTOO E, SUZUKI K. Optimizing two-pass connected component labelling algorithms [J]. Pattern Analysis & Applications, 2009, 2: 117-135.
[15] Wen-Mei W Hwu. GPU Computing Gems Emerald Edition [M]. Elsevier Press, 2011: 1-831.
[16] KIRK D B, Wen-mei W HWU. Programming massively parallel processors: A hands-on approach [M]. Elsevier Press, 2012: 1-487.
(Edited by YANG Bing)
Received date: 2015-01-27; Accepted date: 2015-05-14
Corresponding author: Mohammad Farukh Hashmi; Tel: +91–7122801355, +91–9665610484; E-mail: farooq78699@gmail.com