Performance and energy-efficient implementation of a smart city application on FPGAs

The continuous growth of modern cities and the demand for a better quality of life, coupled with the increased availability of computing resources, have led to increased attention to smart city services. Smart cities promise to deliver a better life to their inhabitants while simultaneously reducing resource requirements and pollution. They are thus perceived as a key enabler of sustainable growth. Among many issues, one of the major concerns for most cities in the world is traffic, which leads to a huge waste of time and energy, and to increased pollution. To optimize traffic in cities, one of the first steps is to get accurate real-time information about the traffic flows in the city. This can be achieved by applying automated video analytics to the video streams provided by a set of cameras distributed throughout the city. Image sequence processing can be performed both peripherally and centrally. In this paper, we argue that, since centralized processing has several advantages in terms of availability, maintainability and cost, it is a very promising strategy to enable effective traffic management even in large cities. However, the computational costs are enormous, and thus require an energy-efficient High-Performance Computing approach. Field Programmable Gate Arrays (FPGAs) provide computational resources comparable to CPUs and GPUs, yet require much lower amounts of energy per operation (around 6× and 10× for the application considered in this case study). They are thus preferred resources to reduce both energy supply and cooling costs in the huge datacenters that will be needed by smart cities. In this paper, we describe efficient implementations of high-performance algorithms that can process traffic camera image sequences to provide traffic flow information in real time at a low energy and power cost.
Keywords Smart city · Image processing · Background subtraction · Lucas–Kanade · High-level synthesis · Field programmable gate array · Graphical processing unit

Journal of Real-Time Image Processing

* Arslan Arif, arslan.arif@polito.it

Politecnico di Torino, Turin, Italy

ACCIONA Infrastructure S.A., Madrid, Spain

1 Introduction

Cities are seeing massive urbanization worldwide, thus increasing the pressure on infrastructure to sustain private and public transportation. Adding intelligence to traditional traffic management and city planning strategies is essential to preserve and even improve quality of life for citizens under this enormous increase of population. Traffic causes increased delays, thus reducing the opportunity for city dwellers to earn money by performing productive activities. It also poses health hazards due to pollution and accidents. Several public and private entities (ranging from public transportation providers, to city planners, to traffic light control, to taxi and car sharing providers, to individual drivers) can profit from the widespread availability of real-time information about traffic flows.

The main aim of this paper is to present a computer vision application which operates in the smart city context. This application will provide cost-effective and scalable real-time analysis of traffic in cities that can then be harnessed by other smart city services and applications (e.g., intelligent traffic management tools) to reduce traffic-related impacts on the quality of life of citizens. Videos obtained from cameras can provide reliable information about the traffic flow on roads. The basic idea, as shown in Fig. 1, is that the cameras acquire the images, which are then processed using image-processing algorithms. After that, the data are stored in a database and accessed on demand.

Fig. 1 Application overview

However, the use of cameras poses some disadvantages. The first major drawback is the breach of privacy. Citizens usually feel uncomfortable and insecure when their movements are being monitored and they tend to oppose any such system. To overcome this disadvantage, the end users of our application are not given the raw data. Rather, they are provided with only the result of the processing of the images recorded by the cameras. This ensures both the protection of personal information and the value of data.

Another difficulty in the use of such systems is the huge effort required to compute and process data by image analysis algorithms. For instance, cameras should be deployed every 50 m or so to obtain a density that can provide complete information for a city. A big city with an urban area of 360 km² would require the use of about 100,000 active cameras. This can be supported only by extreme parallel computing techniques.

Two commonly used accelerators in the field of parallel computing are Graphical Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). They provide a good solution to achieve high computational power. Both options have their advantages and disadvantages. GPUs are power hungry, whereas developing complex applications for FPGAs using Hardware Description Languages (HDL) is difficult and time consuming. With the introduction of techniques such as High-Level Synthesis (HLS), the effort of programming FPGAs has been significantly reduced and their low energy consumption makes them a great candidate for such large-scale applications.

One more point to keep in mind when planning for such systems is that cities tend to grow. Therefore, our system architecture is designed to be scalable, i.e., to allow cameras to be added as needed. Scaling the number of cameras is crucial to make this system practical.

The rest of the paper is organized as follows. Section 2 discusses the previous work in this field. Sections 3 and 4 give an overview of the application and explain the selection of specific image processing algorithms. Section 5 discusses the application constraints, whereas Sect. 6 discusses the hardware computation performance and costs. The work is concluded in Sect. 7.

2 Related work

A lot of work has been carried out on smart cities in the last 20 years [1]. For some reviewers, the concept of a smart city is still confusing [2]. Definitions range from information and communication technology (ICT) networks in city environments [3] to various ICT attributes in a city [4]. Some relate the term to indexes such as the level of education of citizens or financial security [5], while others think about it in terms of urban living labs [6]. All of these implications are alternative schools of thought and most researchers point towards the complexity and scale of the smart city domain [7].

The monitoring of roads for security and traffic management purposes is one of the main topics in this domain. Modern smart cities measure the traffic so that they can optimize the utilization of the roads and streets by taking actions which can improve traffic flow. Video-based approaches have been researched to monitor the flow of vehicles to obtain rich information about vehicles on roads (speed, type of vehicle, plate number, color, etc.) [8].

Vision-based traffic monitoring applications have seen many advances thanks to several research projects that were aimed at improving them. In 1986, the European automotive industry launched the PROMETHEUS European Research Program [9]. It was a pioneer project which intended to improve traffic efficiency and reduce road fatalities [10]. Later, the Defense Advanced Research Projects Agency introduced the VSAM project to create an automated video understanding technology which can be used in urban and battlefield surveillance applications of the future [11]. Within this structural framework, a number of advanced surveillance techniques were demonstrated in an end-to-end testbed system which included tracking from moving and stationary camera platforms and real-time moving object detection as well as multi-camera and active camera control tracking techniques. The cooperative effort of these two pioneering projects remained active for about two decades. As a result, new European frameworks evolved to cover a variety of visual monitoring systems for road safety and intelligent transportation. In the early 2000s, the ADVISOR project was implemented successfully to spot abnormal user behaviors and develop a monitoring system for public transportation [12–14].

There are several methods which can extract and classify raw images of vehicles. These methods are chiefly feature-based and require hand-coding for detection and classification of specific features of each kind of vehicle. Tian et al. [15] and Buch et al. [8] surveyed some of these methods. In the fields of intelligent transportation systems and computer vision, intelligent visual surveillance plays a key role [16]. An important early task is foreground detection, which is also known as background subtraction. Many applications such as object recognition, tracking, and anomaly detection can be implemented based on foreground detection [17, 18].

An application was proposed in the Artemis Arrowhead Project [19] that can detect patterns of pedestrians and vehicles. According to the authors, based on this information, the application can also extract a set of parameters such as the density of vehicles and people, the average time during which the elements remain stationary, the trajectories followed by the objects, etc. Subsequently, these parameters are offered as a service to external parties, such as public administrations or private companies that are interested in using the data to optimize the efficiency of existing systems (e.g., traffic control systems or streetlight management) or develop other potential applications that can take advantage of them (e.g., tourism or security).

Many existing systems, which are concerned about the privacy of the citizens, employ some sort of censorship so that human or AI users are not able to see and inadvertently recognize any person in the camera footage. This can be done either in the form of a superimposed black box, which blocks out the eyes or face of the person, masking each person in each frame, or blocking images of certain places altogether [20–25]. However, this approach cannot achieve full privacy. Most of the time we do not require any sort of information related to individuals while working with applications related to computer vision. Thus, the developer should be aware of the information being collected, either advertently or inadvertently, and of the real requirements of the application [26].

Extraction and categorization of vast amounts of data require expensive and sophisticated software. Processing the live feed for even a single camera requires a dedicated CPU [27]. More performance requires computer accelerators. The most commonly used computer accelerator in this domain is the Graphical Processing Unit (GPU). GPUs provide higher memory bandwidth, higher floating point throughput and a more favorable architecture for data parallelism than processors. Due to these properties, they are used in modern high-performance computing (HPC) systems as accelerators [28]. However, the main drawback of HPC systems based on GPU accelerators is that they consume a large amount of power [29].

To overcome the power inefficiency of GPU-based HPC systems, modern field programmable gate arrays (FPGAs) can be used. FPGA devices require less operating power and energy per operation while providing reasonable processing speed as compared to GPUs [30]. When comparing them with multi-core CPUs, especially with regard to data center applications, it was observed that the performance gap keeps widening between the two. In summary, FPGAs are known to be more energy efficient than both CPUs and GPUs [31]. Acknowledging these capabilities, Microsoft, Baidu and Amazon now also use FPGAs as accelerators rather than GPUs in their data centers [32].

FPGAs are, however, complex to program. Hardware description languages (HDL) such as Verilog or VHDL are commonly used for this task. A technique called high-level synthesis (HLS) provides the capability to program FPGAs through the use of high-level languages, e.g., C, C++, OpenCL or SystemC, consequently reducing the time spent on design, debugging and analysis [33, 34].

3 The application

The main goal of the application described in this paper is to extract data from video surveillance cameras and make it available to different services. The objective is to provide real-time information which can be used to optimize, for example, the street lighting and traffic light systems installed in cities. The application will analyze the images recorded by the cameras installed in cities and will apply a set of algorithms to detect the presence of people and vehicles and to compute the density of traffic at each specific location.

For this purpose, cameras are installed on roads (Fig. 2). Their parameters such as height from ground and angle of elevation, and road parameters such as width, are assumed to be already available for processing, as shown in Fig. 3, together with other constants such as the minimum value for detecting a change of speed.

Fig. 2 Camera view

Fig. 3 Road parameters w.r.t. camera

In most places, cameras cannot be positioned directly above a road. Most of the time they will have a perspective view, as shown in Fig. 2. So we need input values to map the road with respect to the camera pixels (Fig. 4). We need three types of information: (1) whether a pixel covers a road area, (2) how much area each pixel covers and (3) how much distance each pixel covers in the direction of the camera. The presence or absence of the road allows us to apply the algorithm only on the part of the camera frame that we are interested in and hence save computational resources. The area value is used to find the percentage of the road occupied by moving objects. Finally, the distance is used to compute the velocity of the vehicles. All of them can be calculated from camera resolution, aperture, focal length and height over the road. Another important thing to note here is that, as we move away from the camera, the distance represented by one pixel increases. Therefore, the distance value for each pixel is different. It is calculated once for each stationary camera and then used repeatedly to save time and computational resources.

Fig. 4 Video frame vs ground reality

Figure 5 shows the general workflow of the image analysis module in detail. Two configuration files containing road and camera parameters are used as inputs, in addition to the image to be analyzed. This module can be instantiated as many times as needed, once for each descriptor that is desired, so that it is possible to detect many kinds of objects at the same time.

Fig. 5 General workflow of image analysis module

3.1 Implementation model

Two types of implementation are possible for this system on the basis of the location of computational and storage units. One is decentralized, where each camera has its own processing unit. The other is centralized, where all the processing for a set of closely situated cameras is done on one single server.

3.1.1 Decentralized architecture

Figure 6 represents the decentralized architecture version of the application. Due to the high computational requirements, a dedicated CPU would be needed for each camera installed in the monitored scenario. Once the image (which must be processed in real time) is captured, the pre-processing unit associated with that camera processes the signal to detect the elements present in the image. Afterwards, it sends a picture with some metadata to the central processing unit, in which all of the information is processed and stored to be offered to the customers within a cloud architecture.

Fig. 6 Decentralized model

3.1.2 Centralized architecture

On the other hand, Fig. 7 depicts an architecture in which one processing unit is used by a number of cameras. The idea is to combine the processing unit with the central database where all the data are offered to the customer. This means that no camera has a dedicated processing unit attached, which dramatically increases the amount of data to be processed centrally in real time.

Fig. 7 Centralized model

After analyzing both options, the second alternative is considered more appropriate because of the costs of implementation, application software management, maintenance costs to resolve hardware failures, improved safety, etc. In Fig. 8, the scheme for the proposed solution is presented. A major factor for choosing a centralized system would be the achievable energy efficiency using latest generation FPGA devices, which are very power-efficient but too expensive to be deployed in a decentralized architecture.

Most of the operations carried out in image processing are pixel based, with no or very few dependencies on other pixel output values. This provides a very good basis for a parallel implementation of image-processing algorithms that work on each pixel either simultaneously or in a pipelined fashion (Fig. 9). In this way, we can reduce the frame processing time and hence achieve a real-time processing frequency, which is about 25 fps for the target application.

Fig. 9 Overview of parallelism in image-processing algorithms

3.2 Proposed architecture

We aim to provide an energy-efficient architecture by sharing numerous reconfigurable accelerators. To provide a scalable approach, the architecture should be tailored to the needs of the HPC applications as well as to the characteristics of the hardware platform. Energy-efficient heterogeneous COmputing at exaSCALE (ECOSCALE) is a project under the H2020 European research framework. The main goal of this project is to provide a hybrid MPI + OpenCL programming environment, a hierarchical architecture, a runtime system and middleware, and a shared distributed reconfigurable FPGA-based acceleration [35].

ECOSCALE offers a hierarchical heterogeneous architecture with the purpose of achieving exascale performance in an energy-efficient manner. It proposes to adopt two key architectural features to achieve this goal: UNIMEM and UNILOGIC. UNIMEM was first proposed by the EUROSERVER project [36] and provides efficient uniform access, including low-overhead ultra-scalable cache coherency, within each partition of a shared Partitioned Global Address Space (PGAS). UNILOGIC, first proposed by ECOSCALE, extends UNIMEM to offer shared partitioned reconfigurable resources on FPGAs. The proposed HPC design flow, supported by implementation tools and a runtime software layer, partitions the HPC application design into several nodes. These nodes communicate through a hierarchical communication infrastructure as shown in Fig. 10. Each Worker node (basically, an HPC board) includes processing units, programmable logic, and memory. Within a PGAS domain (several Worker nodes), this architecture offers shared partitioned reconfigurable resources and a shared partitioned global address space which can be accessed through regular load and store instructions by both the processors and the programmable logic.
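Returning to the per-pixel distance map described in Sect. 3 (computed once per stationary camera and then reused), the idea can be illustrated with a minimal sketch. It is not the paper's calibration procedure: it assumes a flat road, an ideal pinhole camera, and that the lowest and highest rays both hit the ground in front of the camera (tilt plus half the vertical field of view below 90°). The function name and its parameters are hypothetical.

```python
import math


def row_ground_distances(height_m, tilt_deg, vfov_deg, rows):
    """Ground length (m) covered by each image row, top row first.

    Assumptions (not from the paper): flat road, pinhole camera mounted
    height_m above the road, optical axis tilt_deg below the horizontal,
    vertical field of view vfov_deg. Rows whose upper edge looks at or
    above the horizon get None.
    """
    # Focal length expressed in pixels, derived from the vertical field of view.
    f_px = (rows / 2) / math.tan(math.radians(vfov_deg) / 2)
    tilt = math.radians(tilt_deg)

    def ground_distance(edge):
        # edge runs from 0 (top of the image) to rows (bottom of the image).
        angle = tilt + math.atan((edge - rows / 2) / f_px)  # angle below horizontal
        return height_m / math.tan(angle) if angle > 0 else None

    table = []
    for r in range(rows):
        far = ground_distance(r)       # upper edge of the row, farther away
        near = ground_distance(r + 1)  # lower edge of the row, closer
        table.append(None if far is None else far - near)
    return table
```

With a camera 10 m above the road, tilted 30° downwards and a 40° vertical field of view, the top rows of a 720-row image cover far more road than the bottom rows, which is why the table is precomputed per camera rather than assumed constant.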
A key goal of this architecture is to be transparently programmable with a high-level language such as OpenCL.

Fig. 8 From decentralized to centralized architecture

Fig. 10 Hierarchical partitioning (tasks, memory, communication) of an HPC application [35]

4 Implemented algorithms

As discussed before, computational accelerators must be used to extract the required information from videos with sufficient performance and energy efficiency. The computing power of the hardware accelerators will focus on the vision algorithms for recognition and measurement of traffic, as they are the most expensive part of the application. Two approaches which are best suited for our application have been identified for processing the images streamed from fixed cameras.

The algorithms are coded in the OpenCL language. OpenCL is a programming language for parallel architectures which is built upon C/C++ and thus can be easily learned and ported [37]. The basic advantage of OpenCL is that it can exploit the architectural features of accelerators more easily than C or C++. It provides the programmer with a clear distinction between different kinds of memory, such as global DRAM, local on-chip SRAM and private register files. This allows programmers to optimize code much better than with the flat memory models of Java, C and C++.

4.1 Vehicular density on the roads

Algorithm 1 is based on a background subtraction and object tracking method. One popular implementation was made available by Laurence Bender et al. as part of the SCENE package [38], available in the SourceForge repository (Fig. 11). The algorithm performs motion detection by calculating the change in the corresponding pixel values with respect to the reference stationary background. The portion of the road where movement is detected gives an idea about the amount of traffic. Moreover, the algorithm also constantly updates the reference background image (in case a moving object is now at rest).

Fig. 11 Output of the background subtraction algorithm [38]

Our chosen algorithm takes four frames (images) as input: the reference stationary background, the frame under consideration, the preceding frame and the succeeding frame. For each pixel, it performs a weighted difference on the corresponding pixels of three consecutive frames. If this difference is zero, it implies that there is no movement in the corresponding pixel, hence no update is needed for the total moving area or the reference background. On the other hand, a non-zero value corresponds to some change in the consecutive video frames around the pixel. The value can be a positive or a negative number according to the direction of movement with respect to the camera. If the absolute value is larger than the threshold set for movement detection and some change is also detected in the current frame pixel w.r.t. the reference background, then the global accumulator of the moving area is updated by adding the area of the road occupied by the current pixel. If the weighted difference is less than the threshold for N − 1 frames, then the algorithm updates the reference background pixel with the current pixel. N is the minimum number of frames required to declare the pixel to be part of the stationary background. The value of N can be set according to the application.

Algorithm 1 Background subtraction algorithm

Require: four grayscale images image_{-1}, image_0, image_1 and image_bg, and the Count array
Ensure: image_out, updated image_bg and Count array, and total Area with Movement

    for j = 0 to HEIGHT − 1 do
        for i = 0 to WIDTH − 1 do
            PIX = (j ∗ WIDTH) + i
            lat = 0
            if PIX is on ROAD then
                center ← PIX
                left ← PIX − 10
                right ← PIX + 10
                lat ← Abs(sum of weighted difference of left, right and center pixels of all three images)
            end if
            if (lat < threshold) & (Count[PIX] ≥ N) then
                image_bg[center] ← image_0[center]
            else
                Count[PIX]++
            end if
            if ((image_0[center] − image_bg[center] > Background threshold) & (lat > threshold)) then
                image_out[center] ← image_0[center]
                Increment Area with Movement
            else
                image_out[center] ← 0
            end if
        end for
    end for

4.2 Vehicular velocity on the roads

Since the background subtraction module can only find the area occupied by moving objects on the roads, another method is needed to measure the velocity of vehicles, based on the Lucas–Kanade algorithm for optical flow [39]. An implementation of the Lucas–Kanade optical flow algorithm developed by Altera [40] in OpenCL with a 52 × 52 window size is shown in Fig. 12.

Fig. 12 Altera's implementation of Lucas–Kanade algorithm [40]

A window size of N × N means that the optical flow for one pixel is computed with respect to the neighboring N/2 pixels on each side of that pixel, i.e., the pixel under consideration is in the center of a matrix of pixels having (N+1) rows and columns. For each pixel in the window, a partial derivative with respect to its horizontal (I_x) and vertical (I_y) neighbors is computed. The size of the window is a compromise between true negative and false positive change detection. Therefore, it should be chosen by an expert with respect to the area covered by each pixel and other parameters. In this paper, we use a 15 × 15 window.

A pyramidal implementation [41] is used to refine the optical flow calculation and the iterative Lucas–Kanade optical flow computation is used for the core calculations. For each pixel, the partial derivatives computed within the window and the difference between the pixel values in the current and next frames are used to calculate the velocity of each moving object (it is zero if the area covered by the pixel is stationary). The magnitude is the speed of the object, whereas the sign shows whether it moves towards the camera or away from it.

In our implementation of the algorithm (Algorithm 2), the optical flow is computed for all the pixels of the image (in this case for a 1280 × 720 resolution). Two images using 8 bits per pixel are compared with a window size of 15. Moreover, the obtained values are mapped to a single color representing both relative velocity and direction, as shown in Fig. 13.

Algorithm 2 Lucas–Kanade algorithm

Require: two frames of images image_0 and image_1 and other coefficients
Ensure: v_opt

    for j = 0 to HEIGHT − 1 do
        for i = 0 to WIDTH − 1 do
            G_{2×2} ← 0
            b_{2×1} ← 0
            for w_j = −w_y to w_y do
                for w_i = −w_x to w_x do
                    center ← Pos(i + w_i, j + w_j)
                    left ← Pos(i + w_i − 1, j + w_j)
                    right ← Pos(i + w_i + 1, j + w_j)
                    up ← Pos(i + w_i, j + w_j − 1)
                    down ← Pos(i + w_i, j + w_j + 1)
                    im0_val ← image_0[center]
                    im1_val ← image_1[center]
                    δI ← d(im0_val, im1_val)
                    im0_left ← image_0[left]
                    im0_right ← image_0[right]
                    I_x ← (im0_right − im0_left)/2
                    im0_up ← image_0[up]
                    im0_down ← image_0[down]
                    I_y ← (im0_down − im0_up)/2
                    G_{2×2} ← G_{2×2} + g(I_x, I_y)
                    b_{2×1} ← b_{2×1} + f(δI, I_x, I_y)
                end for
            end for
            G ← inverse(G)
            v_opt[j][i] ← G × b
        end for
    end for

Fig. 13 Lucas–Kanade's disparity map

To calculate the average velocity of traffic with the optical flow algorithm, one needs to know the distance between the camera and the recorded objects. To avoid expensive and complex solutions for real-time depth measurement, an approximation for calculating the distance corresponding to each pixel of the image is used, based on static camera parameters such as road plane inclination, camera orientation and field of view.

In addition to the capabilities summarized above, additional features for user interaction are included in the application. For example, a module for defining the target areas where the recognition is performed and setting up the parameters of the different cameras has been developed. All these parameters can be given as an input in the configuration file.

5 Application constraints

As discussed before, we are dealing with live video streaming in our application. The cameras that we are using produce 25 frames per second (fps) with an image resolution of 1280 × 720 pixels. These frames are given as input to both image-processing algorithms explained in Sect. 4, one for moving object detection and one for speed estimation. A sample frame from one of the cameras is shown in Fig. 14.

Fig. 14 Sample frame

5.1 Background subtraction algorithm

The background subtraction algorithm needs three consecutive frames and a reference stationary background image to distinguish between moving and stationary objects. After the computation of one set of frames, the next frame is fed to the kernel and the oldest one is removed from the set. The result is shown in Fig. 15.

Here the static areas are detected as background and converted to black, while pixels where movements have been detected are shown as gray-scale pixels of the original frame.
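The per-pixel decision rule of the background subtraction step can be sketched in a few lines of Python. This is a sequential illustration only, not the OpenCL kernel used in the paper. Several details are assumptions: the weighted difference is taken as a temporal central difference (next minus previous) summed over the left, center and right pixels, the stillness counter is reset whenever motion is seen (following the prose description of N), and the comparison with the background uses an absolute value. All names are hypothetical.

```python
def background_subtract(prev, cur, nxt, bg, count, road, area,
                        motion_thr, bg_thr, n_still, offset=10):
    """Process one frame given as flat grayscale lists.

    Returns (out, moving_area); bg and count are updated in place.
    road[p] says whether pixel p lies on the road, area[p] is the road
    area covered by pixel p.
    """
    size = len(cur)
    out = [0] * size
    moving_area = 0.0
    for pix in range(size):
        if not road[pix]:
            continue  # off-road pixels are skipped entirely
        # Left, center and right pixels, clamped to the frame.
        neighbours = (max(pix - offset, 0), pix, min(pix + offset, size - 1))
        # Assumed weighting: next frame minus previous frame, summed.
        lat = abs(sum(nxt[k] - prev[k] for k in neighbours))
        if lat < motion_thr:
            count[pix] += 1
            if count[pix] >= n_still:
                bg[pix] = cur[pix]  # pixel has been still long enough
        else:
            count[pix] = 0  # assumption: motion resets the stillness counter
        # Foreground: differs from the background and shows temporal change.
        if abs(cur[pix] - bg[pix]) > bg_thr and lat > motion_thr:
            out[pix] = cur[pix]
            moving_area += area[pix]
    return out, moving_area
```

Each pixel is handled independently apart from the shared `moving_area` accumulator, which is the kind of global read/modify/write that a pipelined hardware implementation has to treat carefully.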
We also compute the portion of the road that is occupied by moving objects. In this set of frames, it is equal to 11.2 m² on the side where traffic is coming towards the camera, and it is 6.55 m² on the side where traffic is moving away from the camera.

Fig. 15 Output of background subtraction

5.2 Lucas–Kanade algorithm

In our implementation of the Lucas–Kanade algorithm, for each set of calculations, we need two consecutive image frames and a set of input parameters depending on the road conditions and camera angles. Similar to background subtraction, each new frame replaces the older one. The graphical output from these images is shown in Fig. 16. The stationary regions are represented by white pixels, while moving objects are mapped to colors according to their speed and direction.

Fig. 16 Output of Lucas–Kanade algorithm

An interesting result is the speed of moving objects (vehicles) on the road. Taking the current frame as reference, the average velocity of vehicles coming towards the camera is about 118 km/h, while the velocity of those moving away is −67 km/h. The direction of the vehicles is also evident from the color in Fig. 16, in accordance with the encoding shown in Fig. 13. We can also find the speed in any specific lane of the road by dividing the pictures into separate lanes instead of two parts as we did in Fig. 14. This can be achieved, if required, by minor adjustments in the input configuration file.

Note that the processed images and the data extracted from them contain no personal information, thus we can safely say that we have achieved the objective of personal data integrity and we are not forwarding any sort of personal or privileged information to any third party.

6 Implementation results and algorithm optimization

After testing the basic functionality of the algorithms, we optimized them to get the maximum efficiency with a minimum use of resources in the smallest amount of computational time. Performance analysis was carried out using RTL simulation on a virtual board including a Virtex 7 FPGA from Xilinx and then on real hardware, using the Amazon Web Services (AWS) Elastic Compute Cloud (Amazon EC2). The available resources on these boards are shown in Table 1. Note that to complete RTL simulations (for Virtex 7) in a reasonable amount of time, we used an image resolution of 1280 × 4 and we extrapolated the simulation results to the real image size. On AWS, on the other hand, the complete frame was used to verify the results. For high-level synthesis, we used SDAccel v2016.4 and 2017.1 from Xilinx.

Table 1 Target FPGAs and boards

| Target device name | ADM-PCIE-7V3:1ddr:3.0 | AWS-F1:4ddr-xpr-2pr:4.0 |
| --- | --- | --- |
| FPGA part (Xilinx) | Virtex-7 XC7VX690T-2 | Virtex UltraScale+ xcvu9p-2-i |
| Clock frequency | 200 MHz | 250 MHz |
| Memory bandwidth | 9.6 GB/s | 11.25 GB/s |
| BRAMs | 2940 | 4320 |
| URAMs | – | 960 |
| DSPs | 3600 | 6840 |
| FFs | 866,400 | 2,364,480 |
| LUTs | 433,200 | 1,182,240 |

Table 2 Kernel execution time and resource utilization (per compute unit) of background subtraction algorithm

| Implementation version | Time per frame (ms) | BRAM | DSP | FF | LUT |
| --- | --- | --- | --- | --- | --- |
| Basic | 7313.112 | 5 | 3 | 14,447 | 35,019 |
| Optimized v.1 | 1467.108 | 22 | 5 | 10,979 | 31,700 |
| Optimized v.2 | 103.8096 | 24 | 5 | 9,165 | 15,978 |
| Virtex 7 (3 CU) | 34.6032 | 72 | 15 | 27,495 | 47,934 |
| UltraScale+ (3 CU) | 27.8082 | 65 | 15 | 18,723 | 17,859 |

Fig. 17 Line buffers for Lucas–Kanade

Moreover, simulations were carried out for a single compute unit and then a suitable number of compute units that could fit on the FPGA were used for each algorithm. In contrast to a CPU or GPU, an FPGA does not have a fixed architecture; instead, the HLS tool generates a custom computation and memory architecture for each application. The term “compute unit” (CU) refers to a specialized hardware architecture (processing core) for a given application.
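The core of the Lucas–Kanade step for a single pixel, as outlined in Algorithm 2, can be sketched as follows. The functions g, f and d are not spelled out in the text, so this sketch assumes the standard forms: g accumulates the products of the spatial derivatives, d is the difference image_0 − image_1, and f accumulates d times the derivatives. It is a single-level, non-iterative illustration (no pyramid), the function name is hypothetical, and the caller is assumed to keep the window plus one pixel of margin inside the image.

```python
def lucas_kanade_pixel(img0, img1, i, j, half):
    """Estimate the flow (vx, vy) at column i, row j.

    img0 and img1 are grayscale frames given as lists of rows; the
    window spans half pixels on each side of (i, j).
    """
    gxx = gxy = gyy = bx = by = 0.0
    for wj in range(-half, half + 1):
        for wi in range(-half, half + 1):
            x, y = i + wi, j + wj
            ix = (img0[y][x + 1] - img0[y][x - 1]) / 2  # horizontal derivative
            iy = (img0[y + 1][x] - img0[y - 1][x]) / 2  # vertical derivative
            d = img0[y][x] - img1[y][x]                 # temporal difference
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
            bx += d * ix
            by += d * iy
    det = gxx * gyy - gxy * gxy
    if abs(det) < 1e-12:
        return 0.0, 0.0  # no usable texture in the window
    # v = inverse(G) * b for the symmetric 2x2 matrix G.
    vx = (gyy * bx - gxy * by) / det
    vy = (gxx * by - gxy * bx) / det
    return vx, vy
```

The result is a displacement in pixels per frame; converting it to a road speed requires the per-pixel distance values and the frame rate.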
The designer can use multiple parallel CUs (within the available resources) to boost the performance of each application. The application needs to process 25 frames per second to meet the requirement of real time video processing. This means that each kernel iteration (processing one frame) should be completed in a maximum time of 40 m. 6.1 Background subtraction algorithm Fig. 18 Basic vs final implementation of Lucas Kanade The initial implementation of the background subtraction algorithm was faster than the optical flow algorithm, but Table 3 Kernel execution time and resource utilization (per compute still did not match the real time requirements. The bottleneck unit) of Lucas–Kanade algorithm for this algorithm was global memory access. To solve this issue, a line buffer was introduced. The kernel fetches all the Implementation Time (ms) Resource utilization version pixel values required for each work group and stores them in Per frame BRAM DSP FF LUT a line buffer in local memory. This fetching is implemented Basic 44,209.98 31 56 21,367 37,080 using the OpenCL asynchronous work group copy operation, Optimized v. 1 14,883.876 122 51 18,410 24,613 which is implemented as a burst read operation from DRAM Optimized v. 2 3751.2 182 52 51,025 100,777 to on-chip memory (much faster than single transfers). The Optimized v. 3 207.313 178 175 35,683 36,072 same mechanism is used for burst writes. This reduces the Virtex 7 (6 CU) 34.5522 1068 1050 214,098 216432 kernel execution time by a factor of 5 but increases BRAM UltraScale+ (6 39.4512 1386 576 274,770 246,786 utilization. The results are good but still the desired pipelin- CU) ing of work items is not achieved due to the read/modify/ writes required to update global variables such as the total moving area. accumulator of the moving area which were causing the bot- In the second version, local buffers are also used for tleneck in the first place. 
In this way, we are able to achieve the standard background image, the array accounting for the expected performance, a speed gain of more than 70x the number of frames with a slight change and the global 1 3 Journal of Real-Time Image Processing Table 4 Total resource Algorithm Compute units Total resources utilized utilization for Virtex 7 (CU) BRAM DSP FF LUT Background subtraction 3 72 15 27,495 47,934 Lucas–Kanade 6 1068 1050 214,098 216,432 Total 9 1140 1065 241,593 264,366 Available – 2940 3600 866,400 433,200 Table 5 Total resource Algorithm Compute units Total resources utilized utilization for UltraScale+ (CU) (AWS-EC2) BRAM DSP FF LUT Background subtraction 3 65 15 18,723 17,859 Lucas–Kanade 6 812 246 176,970 168,280 Total 13 877 261 195,693 186,139 Available – 4320 6840 2,364,480 1,182,240 % Utilization – 20.30% 3.81% 8.27% 15.74% from the basic implementation and more than 14x from our Table 6 Power consumption per frame for background subtraction first optimized version. The extra resources consumed are Parameters FPGA GPU CPU only two BRAMs. Ultrascale+ Virtex 7 However, the best time that we achieved using Hardware emulation was 103 ms per frame, hence not sufficient to Device time (ms) 27.80 34.6 28.16 47.68 achieve 25 fps. For this purpose, we need to use at least three Device power (W) 4.55 2.760 26 10 parallel compute units, which multiplies all the resources by Energy (mJ) 126.49 95.496 732.16 476.8 a factor of 3 as shown in Table 2. This still uses only about 12% of the resources of a Virtex 7 FPGA, which can thus processes frames from five cameras. The results obtained sliding window as shown in Fig. 17), in the second opti- from AWS EC2 board show an increase in performance mized version we removed this repetitive computation by which was expected as Ultrascale+ is a newer generation calculating it only once and reusing it (Line Buffer 2 in FPGA than Virtex 7. These results are shown in the last row Fig. 17). 
In this way, we not only saved computations per of Table 2. work group, but also were able to split the loop nest (line 4 and 5 of Algorithm 2) into two single loops as shown in Fig 6.2 Lucas–Kanade algorithm 18. This reduces the iterations from 225 (15 × 15 ) to 30 (15 + 15). A work-group size of 1280 was also used, as it avoids The basic implementation of the Lucas–Kanade Algorithm is not only the repetitive fetching of neighbors among work even more costly than the background subtraction algorithm. groups (along the width of image) but also eliminates repeti- Three main opportunities for optimizing were global memory tive calculations for each WG. This gives us a performance access, avoiding repeated calculations for the same pixel and boost of 4 × but also requires a lot more resources (Table 3). optimizing trigonometric calculations for the output colors. The analysis of the second optimized version revealed The first optimized version of the kernel uses a line that the algorithm is not able to pipeline the inner loop buffer for burst reading and writing of the image data from because of the trigonometric functions for the output global to local memory (similar to what we have seen in color encoding. These calculations were required only the background subtraction algorithm). For Lucas–Kanade for debugging. Since the information provided to end this line buffer is about five times larger than what we used users is purely average velocity on each lane, therefore in background subtraction because more neighboring pixels the Lucas–Kanade algorithm debug image is calculated are required for computation. This can be easily seen by the in a simpler way. For debugging, the most interesting part increase in the number of BRAMs (about four times) in the of the image is the one that closest to the line of sight. first version as compared to the basic implementation. 
Hence, it is possible to use a linear pixel mapping, rather Since the partial derivative calculated for a pixel in the than using trigonometric functions, which are expensive window is also required by the next 14 windows (using a to compute just for debugging and system monitoring 1 3 Journal of Real-Time Image Processing Table 7 Power consumption per frame for Lucas–Kanade algorithm and energy consumption, are much better than on a CPU and energy consumption is much better than on a GPU. Parameters FPGA GPU CPU Ultrascale+ Virtex 7 7 Conclusion Device time (ms) 37.31 36.34 42.68 5925.78 Device power (W) 8.0 8.385 75 10 This paper presents a high performance yet energy efficient Energy (mJ) 298.48 304.7 3201 59,257.8 smart city application implementation. The application pro- vides not only the velocity of the vehicles in real time but also purposes on an FPGA. This resulted in a degraded depic- the density of traffic on roads. This information can be used tion of sideways motion, but overall improved the FPGA by different stake holders such as public transportation, taxis execution time by 15×  (Table  3). Hence, it shows that and city planners. Real-time benefits of these data can save floating point computations are FPGA’s weakest point. To time spent on roads and can help to reduce pollution where in satisfy real-time requirements, we have to use six Com- long run these data can be used for better planning of city and pute Units for the core calculations of the Lucas–Kanade road infrastructure. The computational capabilities and power algorithm. efficiency of FPGAs makes them a very suitable candidate As we witnessed from background subtraction as well, for applications that require large amounts of data process- the results obtained from AWS EC2 for the Lucas–Kanade ing, especially in real time. 
Furthermore, high-level synthesis algorithm are very comparable to the hardware emulation provides an excellent platform for designers to exploit the results as shown in Fig.  18. In both cases performance capabilities of FPGAs without the long design times entailed improved and the amount of available resources increase by the use of in hardware description languages. significantly on a Virtex Ultrascale+ with respect to the Acknowledgements This work is supported by the European Com- Virtex 7. Hence, we were able to feed the data from four mission through the H2020 ECOSCALE project (Project ID 671632). cameras in real-time to the EC2 board. Open Access This article is distributed under the terms of the Crea- 6.3 T otal resource utilization and power tive Commons Attribution 4.0 International License (http://creat iveco consumption mmons.or g/licenses/b y/4.0/), which permits unrestricted use, distribu- tion, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Summing up all the results discussed above, we achieved Creative Commons license, and indicate if changes were made. our goal of real time calculation of the portion of the road that is used by traffic and of average vehicular velocity. Moreover, Table  4 shows that we have not exceeded our References resource utilization limit, while performing the full process- 1. Akçura, M.T., Avci, S.B.: How to make global cities: information ing of the data from one camera on a relatively old Virtex communication technologies and macro-level variables. Technol. 7 FPGA. The results of actual Hardware implementation Forecast. Soc. Chang. 89, 68–79 (2014) on the Amazon EC2 cloud platform are shown in Table 5. 2. Anderson, J., et al.: Getting smart about smart cities: understand- The final aspect to consider is what advantage we have ing the market opportunity in the cities of tomorrow (2012). 
http:// www.alcat elluc ent.com achieved in terms of power and energy consumption (per 3. Allwinkle, S., Cruickshank, P.: Creating smart-er cities: an over- computation) with respect to GPUs and CPUs. We are con- view. J. Urban Technol. 18(2), 1–16 (2011) sidering an NVIDIA GeForce GTX960 GPU. It has 2GB of 4. Anthopoulos, L., Fitsilis, P.: Using classification and roadmapping global memory and bandwidth of 112 GB/s with a maxi- techniques for smart city viability’s realization. Electr. J. e-Gov. 11(2), 326–336 (2013) mum power consumption of 120 Watt. The CPU that we are 5. Anthopoulos, L.G., Tsoukalas, I.A.: The implementation model of considering is an Intel Xeon E3-1241 (v3) with a clock fre- a digital city. the case study of the digital city of Trikala, Greece: quency of 3.5 GHz and maximum power consumption of 80 e-Trikala. J e-Gov. 2(2), 91–109 (2006) Watts. The power consumption for the FPGAs was estimated 6. Komninos, N.: Intelligent cities: innovation, knowledge systems, and digital spaces. Taylor and Francis, Boca Raton (2002) using the Xilinx Power Estimator (XPE) tool while for the 7. Anthopoulos, L.G.: Understanding the smart city domain: a GPU it was measured using NVIDIA System Management literature review. In: Transforming city governments for suc- Interface (NVIDIA-SMI). cessful smart cities. Springer, Berlin, pp 9–21 (2015) As we can see from Tables 6 and 7, the FPGA is much 8. Buch, N., Velastin, S.A., Orwell, J.: A review of computer vision techniques for the analysis of urban traffic. IEEE Trans. more energy efficient as compared to both CPU and GPU. Intell. Transp. Syst. 12(3), 920–939 (2011) Moreover, the computation of Lucas Kanade is not possible 9. Williams, M.: The prometheus programme. In: Towards safer in real time using only a single CPU, as it takes around 6 road transport-engineering solutions, IEE colloquium on. IET, s to process each frame. 
As we can see both, performance London, pp 4-1 (1992) 1 3 Journal of Real-Time Image Processing 10. Ulmer, B.: Vita-an autonomous road vehicle (ARV) for colli- 29. De Schryver, C., Shcherbakov, I., Kienle, F., Wehn, N., Marxen, sion avoidance in traffic. In: Intelligent vehicles’ 92 symposium. H., Kostiuk, A., Korn, R.: An energy efficient fpga accelera- Proceedings of the IEEE, Detroit, MI, pp 36–41 (1992) tor for Monte Carlo option pricing with the Heston model, In: 11. Collins, R.T., Lipton, A.J., Kanade, T., Fujiyoshi, H., Duggins, Reconfigurable computing and FPGAs (ReConFig), 2011 inter - D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O., Burt, P., national conference. IEEE, Cancun, pp 468–474 (2011) et al.: A system for video surveillance and monitoring. VSAM 30. Ovtcharov, K., Ruwase, O., Kim, J.-Y., Fowers, J., Strauss, K., Final Report, pp 1–68. Carnegie Mellon University (2000) Chung, E.S.: Accelerating deep convolutional neural networks 12. Morris, B.T., Trivedi, M.M.: A survey of vision-based trajec- using specialized hardware. Microsoft Research Whitepaper tory learning and analysis for surveillance. IEEE Trans. Circuits 2(11) (2015) Syst. Video Technol. 18(8), 1114–1127 (2008) 31. Sundararajan, P.: High performance computing using FPGAs. 13. Morris, B.T., Trivedi, M.M.: Understanding vehicular traffic Xilinx white paper: FPGAs, pp 1–15 (2010) behavior from video: a survey of unsupervised approaches. J. 32. Ouyang, J., Lin, S., Qi, W., Wang, Y., Yu, B., Jiang, S.: SDA: Electron. Imaging 22(4), 041 113–041 113 (2013) Software-den fi ed accelerator for large-scale DNN systems. In: Hot 14. Datondji, S.R.E., Dupuis, Y., Subirats, P., Vasseur, P.: A survey chips 26 symposium (HCS), 2014 IEEE. IEEE, Cupertino, pp. of vision-based traffic monitoring of road intersections. IEEE 1–23 (2014) Trans. Intell. Transp. Syst. 17(10), 2681–2698 (2016) 33. Muslim, F.B., Ma, L., Roozmeh, M., Lavagno, L.: Efficient 15. 
Tian, B., Morris, B.T., Tang, M., Liu, Y., Yao, Y., Gou, C., FPGA implementation of OpenCL high-performance com- Shen, D., Tang, S.: Hierarchical and networked vehicle surveil- puting applications via high-level synthesis. IEEE Access 5, lance in ITS: a survey. IEEE Trans. Intell. Transp. Syst. 16(2), 2747–2762 (2017) 557–580 (2015) 34. Coussy, P., Gajski, D.D., Meredith, M., Takach, A.: An introduc- 16. Zhang, J., Wang, F.-Y., Wang, K., Lin, W.-H., Xu, X., Chen, C.: tion to high-level synthesis. IEEE Des Test Comput 26(4), 8–17 Data-driven intelligent transportation systems: a survey. IEEE (2009) Trans. Intell. Transp. Syst. 12(4), 1624–1639 (2011) 35. Ecoscale project. http://www .ecosc ale.eu/pr oje ct-descr ip tio n.html . 17. Yilmaz, A., Javed, O., Shah, M.: Object tracking: a survey. Accessed 1 Nov 2018 ACM Comput. Surv. CSUR 38(4), 13 (2006) 36. Durand, Y., Carpenter, P.M., Adami, S., Bilas, A., Dutoit, D., 18. Wang, K., Liu, Y., Gou, C., Wang, F.-Y.: A multi-view learning Farcy, A., Gaydadjiev, G., Goodacre, J., Katevenis, M., Marazakis, approach to foreground detection for traffic surveillance applica- M., et al.: Euroserver: energy efficient node for european micro- tions. IEEE Trans. Veh. Technol. 65(6), 4144–4158 (2016) servers. In: Digital system design (DSD), 2014 17th Euromicro 19. Jokinen, J., Latvala, T., Lastra, J.L.M.: Integrating smart city conference. IEEE, Verona, pp 206–213 (2014) services using arrowhead framework. In: Industrial Electron- 37. L. Struyf, S. De Beugher, D. H. Van Uytsel, F. Kanters, T. ics Society, IECON 2016-42nd annual conference of the IEEE. Goedemé: The battle of the giants: a case study of GPU vs FPGA IEEE, Florence, pp 5568–5573 (2016) optimisation for real-time image processing. In: Proceedings 20. Blažević, M., Brkić, K., Hrkać, T.: Towards reversible de-identi- PECCS, vol 1. VISIGRAPP 2014, 112–119 (2014) fication in video sequences using 3d avatars and steganography. 38. 
Scene 1.0—background subtraction and object tracking with arXiv preprint arXiv :1510.04861 (2015) TUIO. http://scene .sourc eforg e.net/. Accessed 1 Nov 2018 21. Newton, E.M., Sweeney, L., Malin, B.: Preserving privacy by 39. Lucas, B.D., Kanade, T., et al.: An iterative image registration de-identifying face images. IEEE Trans. Knowl. Data Eng. technique with an application to stereo vision. In: Proceedings of 17(2), 232–243 (2005) DARPA Image Understanding Workshop, April 1981, pp 121–130 22. Rashwan, H.A., Solanas, A., Puig, D., Martínez-Ballesté, A.: (1981) Understanding trust in privacy-aware video surveillance sys- 40. Optical flow design example. https ://www .alter a.com/suppo r t/ tems. Int. J. Inf. Secur. 15(3), 225–234 (2016)suppor t-resour ces/design-e xamples/desig n-sof twar e/opencl/op tic 23. Raval, N., Srivastava, A., Lebeck, K., Cox, L., Machanavajjhala, al-flow.html. Accessed 1 Oct 2018 A.: Markit: Privacy markers for protecting visual secrets, In: 41. Bouguet, J.-Y.: Pyramidal implementation of the affine Lucas– Proceedings of the 2014 ACM international joint conference Kanade feature tracker description of the algorithm. Intel Corp on pervasive and ubiquitous computing: adjunct publication. 5(1–10), 4 (2001) ACM, Seattle, pp 1289–1295 (2014) 42. Yeshwanth, C., Sooraj, P.A., Sudhakaran, V., Raveendran, V.: 24. Roesner, F., Molnar, D., Moshchuk, A., Kohno, T., Wang, H.J.: Estimation of intersection traffic density on decentralized archi- World-driven access control for continuous sensing, In: Proceed- tectures with deep networks. In: Smart cities conference (ISC2), ings of the 2014 ACM SIGSAC conference on computer and com- 2017 international. IEEE, Wuxi, pp 1–6 (2017) munications security. ACM, Scottsdate, pp 1169–1181 (2014) 25. Schiff, J., Meingast, M., Mulligan, D.K., Sastry, S., Goldberg, K.: Respectful cameras: detecting visual markers in real-time Arslan Arif has done his masters to address privacy concerns. 
In: Protecting privacy in video from NUST Pakistan. Currently surveillance. Springer, Berlin, pp 65–89 (2009) he is pursuing his PhD degree 26. Chen, A.T.-Y., Biglari-Abhari, M., Kevin, I., Wang, K.: Trusting with Department of Electronics the computer in computer vision: a privacy-affirming frame- and Telecommunication (DET) work. In: Computer vision and pattern recognition workshops Politecnico Di Torino, Italy. His (CVPRW), 2017 IEEE conference. IEEE, pp 1360–1367 (2017) current research interests include 27. Engel, J.I., Martin, J., Barco, R.: A low-complexity vision-based high-level synthesis (HLS), com- system for real-time traffic monitoring. IEEE Trans. Intell. putation accelerators (FPGA and Transp. Syst. 18(5), 1279–1288 (2017) GPU) and internet of things 28. Weber, R., Gothandaraman, A., Hinde, R.J., Peterson, G.D.: (IoT) Comparing hardware accelerators in scientific applications: A case study. IEEE Trans. Parallel Distrib. Syst. 22(1), 58–68 (2011) 1 3 Journal of Real-Time Image Processing Felipe A. Barrigon is a qualified Luciano Lavagno received his Electrician, Mechanical and Elec- PhD in EECS from U.C. Berke- trical Engineer graduated at Car- ley in 1992. He co-authored four los III University in Madrid books and over 200 scientific (Spain). He currently works at papers. He was the architect of Agustin de Betancourt Founda- the POLIS HW/SW co-design tion as in-house researcher for tool. Between 2003 and 2014, he ACCIONA Construccin Technol- was an architect of the Cadence ogy & Innovation Division, in the Cto- Silicon high-level synthesis field of new technologies devel- tool. Since 1993 he is a professor opment for the civil engineering with Politecnico di Torino, Italy. sector, e.g. development of auto- His research interests include mated systems for tunneling guid- synthesis of asynchronous cir- ance and other underground con- cuits, HW/SW co-design, high- struction works. 
He is currently level synthesis, and design tools involved in ECOSCALE project. for wireless sensor networks. Francesco Gregoretti graduated Mihai Teodor Lazarescu Mihai in 1975 from Politecnico di Teodor Lazarescu received his Torino, Italy where is now a Pro- PhD from Politecnico di Torino fessor in Microelectronics. From (Italy) in 1998. He was Senior 1976 to 1977, he was an Assis- Engineer at Cadence Design tant Professor at the Swiss Fed- Systems, founded several start- eral Institute of Technology in ups and serves now as Assistant Lausanne (Switzerland) and Professor at Politecnico di from 1983 to 1985 Visiting Sci- Torino. He coauthored more entist at the Department of Com- than 40 scientific publications puter Science of Carnegie Mel- and several books. His research lon University, Pittsburgh interests include sensing and (USA). His main research inter- data processing for IoT, WSN ests have been in digital elec- platforms, and high-level hard- tronics, VLSI circuits,massively ware/software co-design and parallel multi-microprocessor highlevel synthesis. systems for VLSI CAD tools and in image processing architectures. More recently, his research has been focused to co-design methodolo- Liang Ma received the M.S. gies for complex electronic systems, to methodologies for reduction of degree (with Hons.) from electromagnetic emissions and power consumption of processing archi- Politecnico di Torino, Italy, tectures by the use of asynchronous methodologies. where he is currently pursuing the PhD degree with the Depart- Javed Iqbal received the M.S. ment of Electronics and Tele- degree in telecommunications communications under the engineering from the Politecnico supervision of Prof. L. Lavagno. di Torino, Torino, Italy, in 2014, His research interests focus on where he is currently pursuing high-level synthesis, electronic the PhD degree with the Depart- system level design and low- ment of Electronics and Tele- power high-performance communications. 
His current computing. research interests include instru- mentation and measurements, statistical signal processing, con- trol systems, and the design and implementation of low-power sensors for indoor human detec- tion, localization, tracking, and identification. 1 3 Journal of Real-Time Image Processing Manuel Palomino is a Telecom- Intelligence, SW, HW and embedded systems development, satellite munication Engineer and Project integration, and High energy Physics. Main projects completed: Manager (PMP) within Acciona HormiH: AI software for composite materials optimization, ENH: AI Technology and Innovation Divi- software for optimization of No. of tests needed for characterization of sion. He joined Acciona Con- composite materials through multidimensional algorithms, BIOFER: struccin in 2009 and since then, System design for control of biological reactors, LOAD and EDIANA: he has been involved in several Control and monitoring systems for fuel cells, solar panels and other National and European R&D energy management systems in buildings, ACCES: design of embedded Projects related to ICT and RF-based guidance system for blind persons, MIROR: NDT system to Robotics such as MIROR, automate testing of fiber composites, BLAST: Test system designed to CABLEBOT, TITAM, MEGA- test structures subjected to extreme TNT blasts to prevent terrorist ROB and ECOSCALE. attacks. He is currently involved in ECOSCALE project Javier Luis Lopez Segura gradu- ated in Physics at University of Granada (1988). He currently works at Acciona Construccin ICT research group. His research skills and interests include, among others, Parallel comput- ing, Computer vision, Artificial 1 3 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Journal of Real-Time Image Processing Springer Journals

Performance and energy-efficient implementation of a smart city application on FPGAs

Publisher
Springer Berlin Heidelberg
Copyright
Copyright © 2018 by The Author(s)
Subject
Computer Science; Image Processing and Computer Vision; Multimedia Information Systems; Computer Graphics; Pattern Recognition; Signal, Image and Speech Processing
ISSN
1861-8200
eISSN
1861-8219
D.O.I.
10.1007/s11554-018-0792-x

Abstract

The continuous growth of modern cities and the request for better quality of life, coupled with the increased availability of computing resources, lead to an increased attention to smart city services. Smart cities promise to deliver a better life to their inhabitants while simultaneously reducing resource requirements and pollution. They are thus perceived as a key enabler to sustainable growth. Out of many other issues, one of the major concerns for most cities in the world is traffic, which leads to a huge waste of time and energy, and to increased pollution. To optimize traffic in cities, one of the first steps is to get accurate information in real time about the traffic flows in the city. This can be achieved through the application of automated video analytics to the video streams provided by a set of cameras distributed throughout the city. Image sequence processing can be performed both peripherally and centrally. In this paper, we argue that, since centralized processing has several advantages in terms of availability, maintainability and cost, it is a very promising strategy to enable effective traffic management even in large cities. However, the computational costs are enormous, and thus require an energy-efficient High-Performance Computing approach. Field Programmable Gate Arrays (FPGAs) provide comparable computational resources to CPUs and GPUs, yet require much lower amounts of energy per operation (around 6× and 10× for the application considered in this case study). They are thus preferred resources to reduce both energy supply and cooling costs in the huge datacenters that will be needed by Smart Cities. In this paper, we describe efficient implementations of high-performance algorithms that can process traffic camera image sequences to provide traffic flow information in real-time at a low energy and power cost.
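The two quantitative claims in the abstract, real-time operation and lower energy per operation, can be checked with simple arithmetic on figures reported later in the paper (Section 6 and Tables 2, 3, 6 and 7). The sketch below is only an illustration: the function and variable names are ours, and the compute-unit estimate assumes ideal scaling, i.e. that a frame is split evenly across compute units with no overhead. The energy per frame is taken as device time multiplied by device power, which matches the tabulated values.

```python
import math

# Real-time target stated in Section 6: 25 frames per second.
FRAME_RATE_FPS = 25
FRAME_BUDGET_MS = 1000.0 / FRAME_RATE_FPS  # 40 ms per frame


def compute_units_needed(single_cu_ms: float, budget_ms: float = FRAME_BUDGET_MS) -> int:
    """Smallest number of parallel compute units that meets the frame budget.

    Assumption (ours, not stated by the paper): the per-frame time scales
    as single_cu_ms / n for n compute units, with no overhead.
    """
    return math.ceil(single_cu_ms / budget_ms)


def energy_per_frame_mj(device_time_ms: float, device_power_w: float) -> float:
    """Energy per frame in millijoules: milliseconds multiplied by watts."""
    return device_time_ms * device_power_w


# (device time in ms, device power in W), copied from Tables 6 and 7.
BACKGROUND_SUBTRACTION = {
    "UltraScale+": (27.80, 4.55),
    "Virtex 7": (34.6, 2.760),
    "GPU": (28.16, 26.0),
    "CPU": (47.68, 10.0),
}
LUCAS_KANADE = {
    "UltraScale+": (37.31, 8.0),
    "Virtex 7": (36.34, 8.385),
    "GPU": (42.68, 75.0),
    "CPU": (5925.78, 10.0),
}


def energy_ratio(table: dict, device: str, reference: str = "UltraScale+") -> float:
    """How many times more energy per frame `device` uses than `reference`."""
    return energy_per_frame_mj(*table[device]) / energy_per_frame_mj(*table[reference])


if __name__ == "__main__":
    # Best single-compute-unit times per frame, from Tables 2 and 3.
    print(compute_units_needed(103.8096))  # background subtraction: prints 3
    print(compute_units_needed(207.313))   # Lucas-Kanade: prints 6
    for name, table in (("background subtraction", BACKGROUND_SUBTRACTION),
                        ("Lucas-Kanade", LUCAS_KANADE)):
        ratio = energy_ratio(table, "GPU")
        print(f"{name}: GPU / UltraScale+ energy per frame = {ratio:.1f}x")
```

Under these assumptions the sketch gives three and six compute units, in line with the counts used in Section 6, and GPU to UltraScale+ energy ratios of about 5.8 and 10.7. The abstract does not say which pair of devices its "around 6× and 10×" refers to, so this correspondence is an observation rather than a confirmed reading.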
Keywords Smart city · Image processing · Background subtraction · Lucas–Kanade · High-level synthesis · Field programmable gate array · Graphical processing unit

* Arslan Arif, arslan.arif@polito.it
Politecnico Di Torino, Turin, Italy
ACCIONA Infrastructure S.A., Madrid, Spain

1 Introduction

Cities are seeing massive urbanization worldwide, thus increasing the pressure on infrastructure to sustain private and public transportation. Adding intelligence to traditional traffic management and city planning strategies is essential to preserve and even improve quality of life for citizens under this enormous increase of population. Traffic causes increased delays, thus reducing the opportunity for city dwellers to earn money by performing productive activities. It also poses health hazards due to pollution and accidents. Several public and private entities (ranging from public transportation providers, to city planners, to traffic light control, to taxi and car sharing providers, to individual drivers) can profit from the widespread availability of real-time information about traffic flows.

The main aim of this paper is to present a computer vision application, which operates in the Smart city context. This application will provide cost-effective and scalable real time analysis of traffic in cities that can then be harnessed by other smart city services and applications (e.g., intelligent traffic management tools) to reduce traffic-related impacts on the quality of life of citizens. Videos obtained from cameras can provide reliable information about the traffic flow on roads. The basic idea, as shown in Fig. 1, is that the cameras acquire the images, which are then processed using image-processing algorithms. After that, the data are stored in a database and accessed on demand.

Fig. 1 Application overview

However, the use of cameras poses some disadvantages. The first major drawback is the breach of privacy. Citizens usually feel uncomfortable and insecure when their movements are being monitored and they tend to oppose any such system. To overcome this disadvantage, the end users of our application are not given the raw data. Rather they are provided with only the result of the processing of the images recorded by the cameras. This ensures both the protection of personal information and the value of data.

Another difficulty in the use of such systems is the huge effort required to compute and process data by image analysis algorithms. For instance, cameras should be deployed every 50 m or so to obtain a density that can provide complete information for a city. A big city with an urban area of 360 km² would require the use of about 100,000 active cameras. This can be supported only by extreme parallel computing techniques.

Two commonly used accelerators in the field of parallel computing are Graphical Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). They provide a good solution to achieve high computational power. Both options have their advantages and disadvantages. GPUs are power hungry, whereas developing complex applications for FPGAs using Hardware Description Languages (HDL) is difficult and time consuming. With the introduction of techniques such as High-Level Synthesis (HLS), the effort of programming FPGAs has been significantly reduced and their low energy consumption makes them a great candidate for such large scale applications.

One more point to keep in mind when planning for such systems is that cities tend to grow. Therefore, our system architecture is designed to be scalable, i.e., to allow cameras to be added as needed. Scaling the number of cameras is crucial to make this system practical.

The rest of the paper is organized as follows. Section 2 discusses the previous work in this field. Sections 3 and 4 give an overview of the application and explain the selection of specific image processing algorithms. Section 5 discusses the application constraints, whereas Sect. 6 discusses the hardware computation performance and costs. The work is concluded in Sect. 7.

2 Related work

A lot of work has been carried out on smart cities in the last 20 years [1]. For some reviewers, smart cities are still confusing [2]. Definitions range from information and communication technology (ICT) networks in city environments [3] to various ICT attributes in a city [4]. Some relate the term with indexes such as the level of education of citizens or in terms of financial security [5], while others think about it in terms of urban living labs [6]. All of these implications are alternative schools of thought and most researchers point towards the complexity and scale of the smart city domain [7].

The monitoring of roads for security and traffic management purposes is one of the main topics in this domain. Modern smart cities measure the traffic so that they can optimize the utilization of the roads and streets by taking actions which can improve traffic flow. Video-based approaches have been researched to monitor the flow of vehicles to obtain rich information about vehicles on roads (speed, type of vehicle, plate number, color, etc.) [8].

Vision-based traffic monitoring applications have seen many advances thanks to several research projects that were aimed at improving them. In 1986, the European automotive industry launched the PROMETHEUS European Research Program [9]. It was a pioneer project which intended to improve traffic efficiency and reduce road fatalities [10]. Later, the Defense Advanced Research Projects Agency introduced the VSAM project to create an automated video understanding technology which can be used in urban and battlefield surveillance applications of the future [11]. Within this structural framework, a number of advanced surveillance techniques were demonstrated in an end-to-end testbed system which included tracking from moving and stationary camera platforms and real-time moving object detection as well as multi-camera and active camera control tracking techniques. The cooperative effort of these two pioneering projects remained active for about two decades. As a result, new European frameworks evolved to cover a variety of visual monitoring systems for road safety and intelligent transportation. In the early 2000s, the ADVISOR project was implemented successfully to spot abnormal user behaviors and develop a monitoring system for public transportation [12–14].

There are several methods which can extract and classify raw images of vehicles. These methods are chiefly feature-based and require hand-coding for detection and classification of specific features of each kind of vehicle. Tian et al. [15] and Buch et al. [8] surveyed some of these methods. In the fields of intelligent transportation systems and computer vision, intelligent visual surveillance plays a key role [16]. An important early task is foreground detection, which is also known as background subtraction. Many applications such as object recognition, tracking, and anomaly detection can be implemented based on foreground detection [17, 18].

An application was proposed in the Artemis Arrowhead Project [19] that can detect patterns of pedestrians and vehicles. According to the authors, based on this information, the application can also extract a set of parameters such as the density of vehicles and people, the average time during which the elements remain stationary, the trajectories followed by the objects, etc. Subsequently, these parameters are offered as a service to external parties, such as public administrations or private companies that are interested in using the data to optimize the efficiency of existing systems (e.g., traffic control systems or streetlight management) or develop other potential applications that can take advantage of them (e.g., tourism or security).

Many existing systems, which are concerned about privacy of the citizens, employ some sort of censorship so that human or AI users are not able to see and inadvertently recognize any person in the camera footage. This can be done either in the form of a superimposed black box, which blocks out the eyes or face of the person, masking each person in each frame or blocking images of certain places altogether [20–25]. However, this approach cannot achieve full privacy. Most of the time we do not require any sort of information related to individuals while working with applications related to computer vision. Thus, the developer should be aware of the information being collected either advertently or inadvertently and of what the real requirements for the application are [26].

Extraction and categorization of vast amounts of data require expensive and sophisticated software. Processing the live feed for even a single camera requires a dedicated CPU [27]. More performance requires computer accelerators. The most commonly used computer accelerator in this domain is the Graphical Processing Unit (GPU). GPUs provide higher memory bandwidth, higher floating point throughput and a more favorable architecture for data parallelism than processors. Due to these properties, they are used in modern high-performance computing (HPC) systems as accelerators [28]. However, the main drawback of HPC systems based on GPU accelerators is that they consume large amounts of power [29].

To overcome the power inefficiency of GPU-based HPC systems, modern field programmable gate arrays (FPGAs) can be used. FPGA devices require less operating power and energy per operation while providing reasonable processing speed as compared to GPUs [30]. [...] widening between the two. In summary, FPGAs are known to be more energy efficient than both CPUs and GPUs [31]. Acknowledging these capabilities, Microsoft, Baidu and Amazon now also use FPGAs as accelerators rather than GPUs in their data centers [32].

FPGAs are, however, complex to program. Hardware description languages (HDL) such as Verilog or VHDL are commonly used for this task. A technique called high-level synthesis (HLS) provides the capability to program FPGAs through the use of high-level languages, e.g., C, C++, OpenCL or SystemC, consequently reducing the time for design, debugging and analysis [33, 34].

3 The application

The main goal of the application described in this paper is to extract data from video surveillance cameras and make it available to different services. The objective is to provide real-time information which can be used to optimize, for example, the street lighting and traffic light systems installed in cities. The application will analyze the images recorded by the cameras installed in cities and will apply a set of algorithms to detect the presence of people and vehicles and to compute the density of traffic at each specific location.

For this purpose, cameras are installed on roads (Fig. 2). Their parameters such as height from ground and angle of elevation, and road parameters such as width, are already assumed to be available for processing, as shown in Fig. 3, together with other constants such as the minimum value for detecting a change of speed.

In most places, cameras cannot be positioned directly above a road. Most of the time they will have a perspective view, as shown in Fig. 2. So we need input values to map the road with respect to the camera pixels (Fig. 4). We need three types of information: (1) whether a pixel covers a road area, (2) how much area each pixel covers and (3) how much distance each pixel covers in the direction of the camera. The presence or absence of the road
When comparing them with multi-core CPUs, especially with regards to data center applications, it was observed that the performance gap keeps Fig. 2 Camera view 1 3 Journal of Real-Time Image Processing Fig. 5 General workflow of image analysis module Fig. 3 Road parameters w.r.t camera Fig. 6 Decentralized model 3.1 Implementation model Two types of implementation are possible for this system on the basis of the location of computational and storage Fig. 4 Video frame vs ground reality units. One is decentralized, where each camera has its own processing unit. The other is centralized, where all the pro- cessing by a set of closely situated cameras is done on one single server. allows us to apply the algorithm only on the part of the camera frame that we are interested in and hence save computational resources. The area value is used to find 3.1.1 Decentralized architecture the percentage of the road occupied by moving objects. Finally the distance is used to compute the velocity of Figure 6 represents the decentralized architecture version of the vehicles. All of them can be calculated from camera the application. Due to the high computational requirements, resolution, aperture, focal length and height over the road. a dedicated CPU would be needed for each camera installed Another important thing to note here is that, as we move in the monitored scenario. Once the image (which must be away from the camera, the distance represented by one processed in real time) is captured, the pre-processing unit pixel increases. Therefore, the distance value for each associated to that camera processes the signal for detecting pixel is different. It is calculated once for each stationary the elements present in the image. Afterwards, it sends a camera and then used repeatedly to save time and com- picture with some metadata to the central processing unit in putational resources. 
which all of the information are processed and stored to be Figure  5 shows the general workflow of the image offered to the customers within a cloud architecture. analysis module in detail. Two configuration files con- taining road and camera parameters are used as inputs, in addition to the image to be analyzed. This module can 3.1.2 Centralized architecture be instantiated, as many times as needed, once for each descriptor that is desired, so that it is possible to detect On the other hand, Fig. 7 depicts an architecture in which many kinds of objects at the same time. one processing unit is used by a number of cameras. The idea is to combine the processing unit with the central data- base where all the data are offered to the customer. This 1 3 Journal of Real-Time Image Processing Fig. 9 Overview of parallelism in image-processing algorithms Fig. 7 Centralized model needs of the HPC applications as well to the characteristics of the hardware platform. Energy-efficient heterogeneous means that no camera has a dedicated processing unit COmputing at exaSCALE (ECOSCALE) is a project under attached, which dramatically increases the amount of data the H2020 Eurpeon research framework. The main goal of to be processed centrally in real time. this project is to provide a hybrid MPI + OpenCL program- After analyzing both options, the second alternative is ming environment, a hierarchical architecture, a runtime sys- considered more appropriate because of the costs of imple- tem and middleware, and a shared distributed reconfigurable mentation, application software management, maintenance FPGA-based acceleration [35]. costs to resolve hardware failures, improved safety, etc. In ECOSCALE offers a hierarchical heterogeneous archi - Fig. 8, the scheme for the proposed solution is presented. A tecture with the purpose of achieving exascale performance major factor for choosing a centralized system would be the in an energy-efficient manner. 
It proposes to adopt two key achievable energy efficiency using latest generation FPGA architectural features to achieve this goal: UNIMEM and devices, which are very power-efficient but too expensive to UNILOGIC. UNIMEM was first proposed by the EURO- be deployed in a decentralized architecture. SERVER project [36] and provides efficient uniform access, Most of the operations carried out in image processing are including low-overhead ultra-scalable cache coherency, pixel based, with no or very few dependencies on other pixel within each partition of a shared Partitioned Global Address output values. This provides a very good basis for a parallel Space (PGAS). UNILOGIC, which is first being proposed by implementation of image-processing algorithms that work ECOSCALE, extends UNIMEM to offer shared partitioned on each pixel either simultaneously or in a pipelined fash- reconfigurable resources on FPGAs. The proposed HPC ion (Fig. 9). In this way, we can reduce the frame process- design flow, supported by implementation tools and a run- ing time and hence we can achieve a real time processing time software layer, partitions the HPC application design frequency, which is about 25 fps for the target application. into several nodes. These nodes communicate through a hierarchical communication infrastructure as shown 3.2 Proposed architecture in Fig.  10. Each Worker node (basically, an HPC board) includes processing units, programmable logic, and memory. We target to provide an energy-efficient architecture by Within a PGAS domain (several Worker nodes), this archi- sharing numerous reconfigurable accelerators. To provide a tecture offers shared partitioned reconfigurable resources scalable approach, the architecture should be tailored to the and a shared partitioned global address space which can be accessed through regular load and store instructions by both the processors and the programmable logic. 
A key goal of this architecture is to be transparently programmable with a high-level language such as OpenCL.

Fig. 8 From decentralized to centralized architecture

Fig. 10 Hierarchical partitioning (tasks, memory, communication) of an HPC application [35]

4 Implemented algorithms

As discussed before, computational accelerators must be used to extract the required information from videos with sufficient performance and energy efficiency. The computing power of the hardware accelerators will focus on the vision algorithms for recognition and measurement of traffic, as they are the most expensive part of the application. Two approaches which are best suited for our application have been identified for processing the images streamed from fixed cameras.

The algorithms are coded in the OpenCL language. OpenCL is a programming language for parallel architectures which is built upon C/C++ and thus can be easily learned and ported [37]. The basic advantage of OpenCL is that it can exploit the architectural features of accelerators more easily than C or C++. It provides the programmer with a clear distinction between different kinds of memory, such as global DRAM, local on-chip SRAM and private register files. This allows programmers to optimize code much better than with the flat memory models of Java, C and C++.

4.1 Vehicular density on the roads

Algorithm 1 is based on a background subtraction and object tracking method. One popular implementation was made available by Laurence Bender et al. as part of the SCENE package [38], available in the SourceForge repository (Fig. 11). The algorithm performs motion detection by calculating the change in the corresponding pixel values with respect to the reference stationary background. The portion of the road where movement is detected gives an idea about the amount of traffic. Moreover, the algorithm also constantly updates the reference background image (in case a moving object is now at rest).

Fig. 11 Output of the background subtraction algorithm [38]

Our chosen algorithm takes four frames (images) as input: the reference stationary background, the frame under consideration, the preceding frame and the succeeding frame. For each pixel, it performs a weighted difference on the corresponding pixels of three consecutive frames. If this difference is zero, it implies that there is no movement in the corresponding pixel, hence no update is needed for the total moving area or the reference background. On the other hand, non-zero values correspond to some change in the consecutive video frames around the pixel. The value can be a positive or a negative number according to the direction of movement with respect to the camera. If the absolute value of this difference is larger than the threshold set for movement detection, and some change is also detected in the current frame pixel w.r.t. the reference background, then the global accumulator of the moving area is updated by adding the area of the road occupied by the current pixel. If the weighted difference is less than the threshold for N − 1 frames, then the algorithm updates the reference background pixel with the current pixel. N is the minimum number of frames required to declare the pixel to be part of the stationary background. The value of N can be set according to the application.

Algorithm 1 Background Subtraction algorithm
Require: Four grayscale images image_{−1}, image_0, image_1 and image_bg & Count array
Ensure: image_out, Updated image_bg and Count array & Total Area with Movement
 1: for j = 0 to HEIGHT − 1 do
 2:   for i = 0 to WIDTH − 1 do
 3:     PIX = (j ∗ WIDTH) + i
 4:     lat = 0
 5:     if PIX is on ROAD then
 6:       center ← PIX
 7:       left ← PIX − 10
 8:       right ← PIX + 10
 9:       lat ← Abs(sum of weighted difference of left, right and center pixels of all three images)
10:     end if
11:     if (lat < threshold) & (Count[PIX] ≥ N) then
12:       image_bg[center] ← image_0[center]
13:     else
14:       Count[PIX]++
15:     end if
16:     if ((image_0[center] − image_bg[center] > Background threshold) & (lat > threshold)) then
17:       image_out[center] ← image_0[center]
18:       Increment Area with Movement
19:     else
20:       image_out[center] ← 0
21:     end if
22:   end for
23: end for

4.2 Vehicular velocity on the roads

Since the background subtraction module can only find the area occupied by moving objects on the roads, another method is needed to measure the velocity of vehicles, based on the Lucas–Kanade algorithm for optical flow [39]. An implementation of the Lucas–Kanade optical flow algorithm developed by Altera [40] in OpenCL with a 52 × 52 window size is shown in Fig. 12.

A window size of N × N means that the optical flow for one pixel is computed with respect to the neighboring N/2 pixels on each side of that pixel, i.e., the pixel under consideration is in the center of a matrix of pixels having (N+1) rows and columns. For each pixel in the window, a partial derivative with respect to its horizontal (I_x) and vertical (I_y) neighbors is computed. The size of the window is a compromise between true negative and false positive change detection. Therefore, it should be chosen by an expert with respect to the area covered by each pixel and other parameters. In this paper, we use a 15 × 15 window.

A pyramidal implementation [41] is used to refine the optical flow calculation, and the iterative Lucas–Kanade optical flow computation is used for the core calculations. For each pixel, the partial derivatives computed within the window and the difference between the pixel values in the current and next frames are used to calculate the velocity of each moving object (it is zero if the area covered by the pixel is stationary). The magnitude is the speed of the object, whereas the sign shows whether it moves towards the camera or away from it.

In our implementation of the algorithm (Algorithm 2), the optical flow is computed for all the pixels of the image (in this case for a 1280 × 720 resolution). Two images using 8 bits per pixel are compared with a window size of 15. Moreover, the obtained values are mapped to a single color representing both relative velocity and direction, as shown in Fig. 13.

To calculate the average velocity of traffic with the optical flow algorithm, one needs to know the distance between the camera and the recorded objects. To avoid expensive and complex solutions for a real-time depth measurement, an approximation for calculating the distance corresponding to each pixel of the image is used, based on static camera parameters such as road plane inclination, camera orientation and field of view.

In addition to the capabilities summarized above, additional features for user interaction are included in the application. For example, a module for defining the target areas where the recognition is performed and setting up the parameters of the different cameras has been developed. All these parameters can be given as an input in the configuration file.

Fig. 12 Altera’s implementation of Lucas–Kanade algorithm [40]

Algorithm 2 Lucas-Kanade algorithm
Require: two frames of images image_0 and image_1 and other coefficients
Ensure: v_opt
 1: for j = 0 to HEIGHT − 1 do
 2:   for i = 0 to WIDTH − 1 do
 3:     G_{2×2} ← 0
 4:     b_{2×1} ← 0
 5:     for w_j = −w_y to w_y do
 6:       for w_i = −w_x to w_x do
 7:         center ← Pos(i + w_i, j + w_j)
 8:         left ← Pos(i + w_i − 1, j + w_j)
 9:         right ← Pos(i + w_i + 1, j + w_j)
10:         up ← Pos(i + w_i, j + w_j − 1)
11:         down ← Pos(i + w_i, j + w_j + 1)
12:         im^0_val ← image_0[center]
13:         im^1_val ← image_1[center]
14:         δI ← d(im^0_val, im^1_val)
15:         im^0_left ← image_0[left]
16:         im^0_right ← image_0[right]
17:         I_x ← (im^0_right − im^0_left)/2
18:         im^0_up ← image_0[up]
19:         im^0_down ← image_0[down]
20:         I_y ← (im^0_down − im^0_up)/2
21:         G_{2×2} ← G_{2×2} + g(I_x, I_y)
22:         b_{2×1} ← b_{2×1} + f(δI, I_x, I_y)
23:       end for
24:     end for
25:     G ← inverse(G)
26:     v_opt[j][i] ← G × b
27:   end for
28: end for

Fig. 13 Lucas–Kanade’s disparity map

Fig. 14 Sample frame

5 Application constraints

As discussed before, we are dealing with live video streaming in our application. The cameras that we are using produce 25 frames per second (fps) with an image resolution of 1280 × 720 pixels. These frames are given as input to both image-processing algorithms explained in Sect. 4, one for moving object detection and one for speed estimation. A sample frame from one of the cameras is shown in Fig. 14.

5.1 Background subtraction algorithm

The background subtraction algorithm needs three consecutive frames and a reference stationary background image to distinguish between moving and stationary objects. After the computation of one set of frames, the next frame is fed to the kernel and the oldest one is removed from the set. The result is shown in Fig. 15.

Here the static areas are detected as background and converted to black, while pixels where movements have been detected are shown as gray-scale pixels of the original frame.
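The per-pixel decision described in Sect. 4.1 can be illustrated with a short Python sketch. It is not the authors' OpenCL kernel. The weights of the "weighted difference" are not given in the text, so the sketch assumes the simplest choice (next frame minus previous frame, summed over the left, centre and right pixels), and it follows the prose rule of counting consecutive still frames before refreshing the background. The function name, the thresholds and the neighbour offset are hypothetical.

```python
def background_subtract(prev, cur, nxt, bg, count, on_road, area,
                        motion_thr=15, bg_thr=20, n_still=3, offset=1):
    """One pass over flattened grayscale frames.

    prev, cur, nxt: previous, current and next frame (lists of ints).
    bg, count: reference background and still-frame counters,
               both updated in place.
    on_road, area: per-pixel road mask and road area per pixel.
    Returns (output frame, total road area with movement).
    """
    out = [0] * len(cur)
    moving_area = 0.0
    for pix in range(len(cur)):
        if not on_road[pix]:
            continue
        # Assumed weights: (-1, 0, +1) over the three frames, summed
        # over the left, centre and right pixels of the flattened frame.
        neighbours = [p for p in (pix - offset, pix, pix + offset)
                      if 0 <= p < len(cur)]
        lat = abs(sum(nxt[p] - prev[p] for p in neighbours))
        if lat < motion_thr:
            count[pix] += 1
            if count[pix] >= n_still:
                bg[pix] = cur[pix]   # still long enough: refresh background
        else:
            count[pix] = 0           # movement resets the counter
        if lat > motion_thr and abs(cur[pix] - bg[pix]) > bg_thr:
            out[pix] = cur[pix]      # keep the original gray value
            moving_area += area[pix]
    return out, moving_area


# Toy example: a bright object sits on pixel 2 and moves to pixel 3.
prev = [10, 10, 10, 10, 10]
cur = [10, 10, 200, 10, 10]
nxt = [10, 10, 10, 200, 10]
bg = [10] * 5
count = [0] * 5
out, moving_area = background_subtract(prev, cur, nxt, bg, count,
                                       [True] * 5, [0.5] * 5)
```

In the toy example only pixel 2 differs from the background in the current frame while its neighbourhood changes over time, so it is the only pixel kept in the output and the only one whose road area is accumulated.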
We also compute the portion of the road that is occupied by moving objects. In this set of frames, it is equal to 11.2 m² on the side where traffic is coming towards the camera, and it is 6.55 m² on the side where traffic is moving away from the camera.

Fig. 15 Output of background subtraction

5.2 Lucas–Kanade algorithm

In our implementation of the Lucas–Kanade algorithm, for each set of calculations, we need two consecutive image frames and a set of input parameters depending on the road conditions and camera angles. Similar to background subtraction, each new frame replaces the older one. The graphical output from these images is shown in Fig. 16. The stationary regions are represented by white pixels, while moving objects are mapped to colors according to their speed and direction.

Fig. 16 Output of Lucas–Kanade algorithm

An interesting result is the speed of moving objects (vehicles) on the road. Taking the current frame as reference, the average velocity of vehicles coming towards the camera is about 118 km/h, while the velocity of vehicles moving away is −67 km/h. The direction of the vehicles is evident also from the color in Fig. 16, in accordance with the encoding shown in Fig. 13. We can also find the speed in any specific lane of the road, by dividing the pictures in separate lanes instead of two parts as we did in Fig. 14. This can be achieved, if required, by minor adjustments in the input configuration file.

Note that the processed images or data extracted from them contain no personal information, thus we can safely say that we have achieved the objective of personal data integrity and we are not forwarding any sort of personal or privileged information to any third party.

6 Implementation results and algorithm optimization

After testing the basic functionality of the algorithms, we optimized them to get the maximum efficiency with a minimum use of resources in the smallest amount of computational time. Performance analysis was carried out using RTL simulation on a virtual board including a Virtex 7 FPGA from Xilinx and then on real hardware, using the Amazon Web Services (AWS) Elastic Compute Cloud (Amazon EC2). The available resources on these boards are shown in Table 1. Note that to complete RTL simulations (for Virtex 7) in a reasonable amount of time, we used an image resolution of 1280 × 4 and we extrapolated the simulation results to the real image size. On AWS, on the other hand, the complete frame was used to verify the results. For high level synthesis, we used SDAccel v2016.4 and 2017.1 from Xilinx.

Table 1 Target FPGAs and boards
Target device name     ADM-PCIE-7V3:1ddr:3.0    AWS-F1:4ddr-xpr-2pr:4.0
FPGA part (Xilinx)     Virtex-7 XC7VX690T-2     Virtex UltraScale+ xcvu9p-2-i
Clock frequency        200 MHz                  250 MHz
Memory bandwidth       9.6 GB/s                 11.25 GB/s
BRAMs                  2940                     4320
URAMs                  –                        960
DSPs                   3600                     6840
FFs                    866,400                  2,364,480
LUTs                   433,200                  1,182,240

Table 2 Kernel execution time and resource utilization (per compute unit) of background subtraction algorithm
Implementation version   Time per frame (ms)   BRAM   DSP   FF       LUT
Basic                    7313.112              5      3     14,447   35,019
Optimized v.1            1467.108              22     5     10,979   31,700
Optimized v.2            103.8096              24     5     9165     15,978
Virtex 7 (3 CU)          34.6032               72     15    27,495   47,934
UltraScale+ (3 CU)       27.8082               65     15    18,723   17,859

Fig. 17 Line buffers for Lucas–Kanade

Moreover, simulations were carried out for a single compute unit and then a suitable number of compute units that could fit on the FPGA were used for each algorithm. In contrast to a CPU or GPU, an FPGA does not have a fixed architecture; instead, the HLS tool generates a custom computation and memory architecture from each application. The term “compute unit” (CU) refers to a specialized hardware architecture (processing core) for a given application.
The designer can use multiple parallel CUs (within the available resources) to boost the performance of each application. The application needs to process 25 frames per second to meet the requirement of real time video processing. This means that each kernel iteration (processing one frame) should be completed in a maximum time of 40 ms.

6.1 Background subtraction algorithm

The initial implementation of the background subtraction algorithm was faster than the optical flow algorithm, but still did not match the real time requirements. The bottleneck for this algorithm was global memory access. To solve this issue, a line buffer was introduced. The kernel fetches all the pixel values required for each work group and stores them in a line buffer in local memory. This fetching is implemented using the OpenCL asynchronous work group copy operation, which is implemented as a burst read operation from DRAM to on-chip memory (much faster than single transfers). The same mechanism is used for burst writes. This reduces the kernel execution time by a factor of 5 but increases BRAM utilization. The results are good, but the desired pipelining of work items is still not achieved due to the read/modify/writes required to update global variables such as the total moving area.

In the second version, local buffers are also used for the standard background image, the array accounting for the number of frames with a slight change and the global accumulator of the moving area, which were causing the bottleneck in the first place. In this way, we are able to achieve the expected performance, a speed gain of more than 70× from the basic implementation and more than 14× from our first optimized version. The extra resources consumed are only two BRAMs.

However, the best time that we achieved using hardware emulation was 103 ms per frame, hence not sufficient to achieve 25 fps. For this purpose, we need to use at least three parallel compute units, which multiplies all the resources by a factor of 3, as shown in Table 2. This still uses only about 12% of the resources of a Virtex 7 FPGA, which can thus process frames from five cameras. The results obtained from the AWS EC2 board show an increase in performance, which was expected as UltraScale+ is a newer generation FPGA than Virtex 7. These results are shown in the last row of Table 2.

6.2 Lucas–Kanade algorithm

The basic implementation of the Lucas–Kanade algorithm is even more costly than the background subtraction algorithm. Three main opportunities for optimization were global memory access, avoiding repeated calculations for the same pixel and optimizing trigonometric calculations for the output colors.

The first optimized version of the kernel uses a line buffer for burst reading and writing of the image data from global to local memory (similar to what we have seen in the background subtraction algorithm). For Lucas–Kanade this line buffer is about five times larger than what we used in background subtraction because more neighboring pixels are required for computation. This can be easily seen by the increase in the number of BRAMs (about four times) in the first version as compared to the basic implementation.

Since the partial derivative calculated for a pixel in the window is also required by the next 14 windows (using a sliding window as shown in Fig. 17), in the second optimized version we removed this repetitive computation by calculating it only once and reusing it (Line Buffer 2 in Fig. 17). In this way, we not only saved computations per work group, but also were able to split the loop nest (lines 5 and 6 of Algorithm 2) into two single loops as shown in Fig. 18. This reduces the iterations from 225 (15 × 15) to 30 (15 + 15). A work-group size of 1280 was also used, as it not only avoids the repetitive fetching of neighbors among work groups (along the width of the image) but also eliminates repetitive calculations for each work group. This gives us a performance boost of 4× but also requires a lot more resources (Table 3).

Fig. 18 Basic vs final implementation of Lucas–Kanade

Table 3 Kernel execution time and resource utilization (per compute unit) of Lucas–Kanade algorithm
Implementation version   Time per frame (ms)   BRAM   DSP    FF        LUT
Basic                    44,209.98             31     56     21,367    37,080
Optimized v.1            14,883.876            122    51     18,410    24,613
Optimized v.2            3751.2                182    52     51,025    100,777
Optimized v.3            207.313               178    175    35,683    36,072
Virtex 7 (6 CU)          34.5522               1068   1050   214,098   216,432
UltraScale+ (6 CU)       39.4512               1386   576    274,770   246,786

The analysis of the second optimized version revealed that the algorithm is not able to pipeline the inner loop because of the trigonometric functions for the output color encoding. These calculations were required only for debugging. Since the information provided to end users is purely the average velocity on each lane, the Lucas–Kanade debug image is calculated in a simpler way. For debugging, the most interesting part of the image is the one that is closest to the line of sight. Hence, it is possible to use a linear pixel mapping, rather than using trigonometric functions, which are expensive to compute just for debugging and system monitoring purposes on an FPGA. This resulted in a degraded depiction of sideways motion, but overall improved the FPGA execution time by 15× (Table 3). Hence, it shows that floating point computations are the FPGA’s weakest point. To satisfy real-time requirements, we have to use six Compute Units for the core calculations of the Lucas–Kanade algorithm.

As we witnessed for background subtraction as well, the results obtained from AWS EC2 for the Lucas–Kanade algorithm are very comparable to the hardware emulation results, as shown in Fig. 18. In both cases performance improved, and the amount of available resources increases significantly on a Virtex UltraScale+ with respect to the Virtex 7. Hence, we were able to feed the data from four cameras in real time to the EC2 board.

6.3 Total resource utilization and power consumption

Summing up all the results discussed above, we achieved our goal of real time calculation of the portion of the road that is used by traffic and of average vehicular velocity. Moreover, Table 4 shows that we have not exceeded our resource utilization limit, while performing the full processing of the data from one camera on a relatively old Virtex 7 FPGA. The results of the actual hardware implementation on the Amazon EC2 cloud platform are shown in Table 5.

Table 4 Total resource utilization for Virtex 7
Algorithm                Compute units (CU)   BRAM   DSP    FF        LUT
Background subtraction   3                    72     15     27,495    47,934
Lucas–Kanade             6                    1068   1050   214,098   216,432
Total                    9                    1140   1065   241,593   264,366
Available                –                    2940   3600   866,400   433,200

Table 5 Total resource utilization for UltraScale+ (AWS-EC2)
Algorithm                Compute units (CU)   BRAM     DSP     FF          LUT
Background subtraction   3                    65       15      18,723      17,859
Lucas–Kanade             6                    812      246     176,970     168,280
Total                    13                   877      261     195,693     186,139
Available                –                    4320     6840    2,364,480   1,182,240
% Utilization            –                    20.30%   3.81%   8.27%       15.74%

The final aspect to consider is what advantage we have achieved in terms of power and energy consumption (per computation) with respect to GPUs and CPUs. We are considering an NVIDIA GeForce GTX960 GPU. It has 2 GB of global memory and a bandwidth of 112 GB/s, with a maximum power consumption of 120 W. The CPU that we are considering is an Intel Xeon E3-1241 (v3) with a clock frequency of 3.5 GHz and a maximum power consumption of 80 W. The power consumption for the FPGAs was estimated using the Xilinx Power Estimator (XPE) tool, while for the GPU it was measured using the NVIDIA System Management Interface (NVIDIA-SMI).

Table 6 Power consumption per frame for background subtraction
Parameters         FPGA UltraScale+   FPGA Virtex 7   GPU      CPU
Device time (ms)   27.80              34.6            28.16    47.68
Device power (W)   4.55               2.760           26       10
Energy (mJ)        126.49             95.496          732.16   476.8

Table 7 Power consumption per frame for Lucas–Kanade algorithm
Parameters         FPGA UltraScale+   FPGA Virtex 7   GPU     CPU
Device time (ms)   37.31              36.34           42.68   5925.78
Device power (W)   8.0                8.385           75      10
Energy (mJ)        298.48             304.7           3201    59,257.8

As we can see from Tables 6 and 7, the FPGA is much more energy efficient than both the CPU and the GPU. Moreover, the computation of Lucas–Kanade is not possible in real time using only a single CPU, as it takes around 6 s to process each frame. As we can see, both performance and energy consumption are much better than on a CPU, and energy consumption is much better than on a GPU.

7 Conclusion

This paper presents a high performance yet energy efficient smart city application implementation. The application provides not only the velocity of the vehicles in real time but also the density of traffic on roads. This information can be used by different stakeholders such as public transportation, taxis and city planners. In real time, these data can save time spent on roads and can help to reduce pollution, whereas in the long run they can be used for better planning of city and road infrastructure. The computational capabilities and power efficiency of FPGAs make them a very suitable candidate for applications that require large amounts of data processing, especially in real time. Furthermore, high-level synthesis provides an excellent platform for designers to exploit the capabilities of FPGAs without the long design times entailed by the use of hardware description languages.

Acknowledgements This work is supported by the European Commission through the H2020 ECOSCALE project (Project ID 671632).

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

References

1. Akçura, M.T., Avci, S.B.: How to make global cities: information communication technologies and macro-level variables. Technol. Forecast. Soc. Chang. 89, 68–79 (2014)
2. Anderson, J., et al.: Getting smart about smart cities: understanding the market opportunity in the cities of tomorrow (2012). http://www.alcatellucent.com
3. Allwinkle, S., Cruickshank, P.: Creating smart-er cities: an overview. J. Urban Technol. 18(2), 1–16 (2011)
4. Anthopoulos, L., Fitsilis, P.: Using classification and roadmapping techniques for smart city viability’s realization. Electr. J. e-Gov. 11(2), 326–336 (2013)
5. Anthopoulos, L.G., Tsoukalas, I.A.: The implementation model of a digital city. The case study of the digital city of Trikala, Greece: e-Trikala. J. e-Gov. 2(2), 91–109 (2006)
6. Komninos, N.: Intelligent cities: innovation, knowledge systems, and digital spaces. Taylor and Francis, Boca Raton (2002)
7. Anthopoulos, L.G.: Understanding the smart city domain: a literature review. In: Transforming city governments for successful smart cities. Springer, Berlin, pp 9–21 (2015)
8. Buch, N., Velastin, S.A., Orwell, J.: A review of computer vision techniques for the analysis of urban traffic. IEEE Trans. Intell. Transp. Syst. 12(3), 920–939 (2011)
9. Williams, M.: The prometheus programme. In: Towards safer road transport-engineering solutions, IEE colloquium on. IET, London, pp 4-1 (1992)
10. Ulmer, B.: Vita-an autonomous road vehicle (ARV) for collision avoidance in traffic. In: Intelligent vehicles’ 92 symposium. Proceedings of the IEEE, Detroit, MI, pp 36–41 (1992)
11. Collins, R.T., Lipton, A.J., Kanade, T., Fujiyoshi, H., Duggins, D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O., Burt, P., et al.: A system for video surveillance and monitoring. VSAM Final Report, pp 1–68. Carnegie Mellon University (2000)
12. Morris, B.T., Trivedi, M.M.: A survey of vision-based trajectory learning and analysis for surveillance. IEEE Trans. Circuits Syst. Video Technol. 18(8), 1114–1127 (2008)
13. Morris, B.T., Trivedi, M.M.: Understanding vehicular traffic behavior from video: a survey of unsupervised approaches. J. Electron. Imaging 22(4), 041113–041113 (2013)
14. Datondji, S.R.E., Dupuis, Y., Subirats, P., Vasseur, P.: A survey of vision-based traffic monitoring of road intersections. IEEE Trans. Intell. Transp. Syst. 17(10), 2681–2698 (2016)
29. De Schryver, C., Shcherbakov, I., Kienle, F., Wehn, N., Marxen, H., Kostiuk, A., Korn, R.: An energy efficient FPGA accelerator for Monte Carlo option pricing with the Heston model. In: Reconfigurable computing and FPGAs (ReConFig), 2011 international conference. IEEE, Cancun, pp 468–474 (2011)
30. Ovtcharov, K., Ruwase, O., Kim, J.-Y., Fowers, J., Strauss, K., Chung, E.S.: Accelerating deep convolutional neural networks using specialized hardware. Microsoft Research Whitepaper 2(11) (2015)
31. Sundararajan, P.: High performance computing using FPGAs. Xilinx white paper: FPGAs, pp 1–15 (2010)
32. Ouyang, J., Lin, S., Qi, W., Wang, Y., Yu, B., Jiang, S.: SDA: Software-defined accelerator for large-scale DNN systems. In: Hot chips 26 symposium (HCS), 2014 IEEE. IEEE, Cupertino, pp 1–23 (2014)
33. Muslim, F.B., Ma, L., Roozmeh, M., Lavagno, L.: Efficient
15.
Arslan Arif received his master's degree from NUST, Pakistan. He is currently pursuing his PhD degree with the Department of Electronics and Telecommunication (DET), Politecnico di Torino, Italy. His current research interests include high-level synthesis (HLS), computation accelerators (FPGA and GPU) and the internet of things (IoT).

Felipe A. Barrigon is a qualified Electrician, Mechanical and Electrical Engineer who graduated at Carlos III University in Madrid (Spain). He currently works at the Agustin de Betancourt Foundation as in-house researcher for the ACCIONA Construcción Technology & Innovation Division, in the field of new technologies development for the civil engineering sector, e.g. development of automated systems for tunneling guidance and other underground construction works. He is currently involved in the ECOSCALE project.

Luciano Lavagno received his PhD in EECS from U.C. Berkeley in 1992. He co-authored four books and over 200 scientific papers. He was the architect of the POLIS HW/SW co-design tool. Between 2003 and 2014, he was an architect of the Cadence C-to-Silicon high-level synthesis tool. Since 1993 he has been a professor with Politecnico di Torino, Italy. His research interests include synthesis of asynchronous circuits, HW/SW co-design, high-level synthesis, and design tools for wireless sensor networks.

Francesco Gregoretti graduated in 1975 from Politecnico di Torino, Italy, where he is now a Professor in Microelectronics. From 1976 to 1977, he was an Assistant Professor at the Swiss Federal Institute of Technology in Lausanne (Switzerland) and from 1983 to 1985 Visiting Scientist at the Department of Computer Science of Carnegie Mellon University, Pittsburgh (USA). His main research interests have been in digital electronics, VLSI circuits, massively parallel multi-microprocessor systems for VLSI CAD tools and in image processing architectures. More recently, his research has focused on co-design methodologies for complex electronic systems, and on methodologies for the reduction of electromagnetic emissions and power consumption of processing architectures by the use of asynchronous methodologies.

Mihai Teodor Lazarescu received his PhD from Politecnico di Torino (Italy) in 1998. He was Senior Engineer at Cadence Design Systems, founded several startups and now serves as Assistant Professor at Politecnico di Torino. He coauthored more than 40 scientific publications and several books. His research interests include sensing and data processing for IoT, WSN platforms, and high-level hardware/software co-design and high-level synthesis.

Liang Ma received the M.S. degree (with Hons.) from Politecnico di Torino, Italy, where he is currently pursuing the PhD degree with the Department of Electronics and Telecommunications under the supervision of Prof. L. Lavagno. His research interests focus on high-level synthesis, electronic system level design and low-power high-performance computing.

Javed Iqbal received the M.S. degree in telecommunications engineering from the Politecnico di Torino, Torino, Italy, in 2014, where he is currently pursuing the PhD degree with the Department of Electronics and Telecommunications. His current research interests include instrumentation and measurements, statistical signal processing, control systems, and the design and implementation of low-power sensors for indoor human detection, localization, tracking, and identification.

Manuel Palomino is a Telecommunication Engineer and Project Manager (PMP) within the Acciona Technology and Innovation Division. He joined Acciona Construcción in 2009 and since then, he has been involved in several National and European R&D Projects related to ICT and Robotics such as MIROR, CABLEBOT, TITAM, MEGAROB and ECOSCALE.

Javier Luis Lopez Segura graduated in Physics at the University of Granada (1988). He currently works in the Acciona Construcción ICT research group. His research skills and interests include, among others, Parallel computing, Computer vision, Artificial Intelligence, SW, HW and embedded systems development, satellite integration, and High energy Physics. Main projects completed: HormiH: AI software for composite materials optimization, ENH: AI software for optimization of the number of tests needed for characterization of composite materials through multidimensional algorithms, BIOFER: System design for control of biological reactors, LOAD and EDIANA: Control and monitoring systems for fuel cells, solar panels and other energy management systems in buildings, ACCES: design of embedded RF-based guidance system for blind persons, MIROR: NDT system to automate testing of fiber composites, BLAST: Test system designed to test structures subjected to extreme TNT blasts to prevent terrorist attacks. He is currently involved in the ECOSCALE project.

Journal

Journal of Real-Time Image Processing, Springer Journals

Published: Jun 5, 2018
