# Introduction of the Research Based on FPGA at NICS

### Rong Luo

Nano Integrated Circuits and Systems Lab, Department of Electronic Engineering, Tsinghua University, Beijing, 100084, China, luorong@tsinghua.edu.cn

Abstract: This document introduces the FPGA-based research work at the Nano Integrated Circuits and Systems (NICS) Lab, Department of Electronic Engineering, Tsinghua University.

Keywords: WSN Digital Baseband SOC, Two-dimensional Bar Code, Dynamic Time Warping Distance, Real Time Image Processing, FPGA

## I. INTRODUCTION

The Nano Integrated Circuits and Systems Lab at the Department of Electronic Engineering, Tsinghua University, Beijing, China focuses on the following research fields: Chips for Communications and Digital Media Processing, Electronic System Design Automation, Analog and Mixed-Signal Integrated Circuits Design, and Design of Radio-Frequency and Microwave Integrated Circuits.

In the field of Chips for Communications and Digital Media Processing, research activities include ASIC design for wireless communication, digital broadcasting, and digital media applications.

A field-programmable gate array (FPGA) is a chip designed to be configured by a customer or a designer after manufacturing.<sub>[1]</sub> Being "field-programmable" is therefore its biggest advantage: researchers can update the functionality after shipping and partially reconfigure a portion of their design. Moreover, with low non-recurring engineering costs relative to an ASIC design, FPGAs offer advantages for many applications.

The rest of this paper is organized as follows. Section II introduces the prototype system for our wireless sensor network digital baseband system-on-a-chip design based on DE2-70. Section III presents the implementation of an embedded two-dimensional bar code recognition system based on DE2. Section IV accelerates a subsequence similarity search algorithm based on Dynamic Time Warping (DTW) distance. Section V builds a hardware platform for a real time image processing system. Finally, the conclusions and acknowledgements are given.

## II. BUILDING THE PROTOTYPE SYSTEM FOR WSN DIGITAL BASEBAND SOC DESIGN

Wireless Sensor Networks (WSNs) are widely used as information acquisition and processing platforms in many applications.

To design an all-digital WSN baseband SOC, a prototype based on DE2-70 is built to verify our design, as shown in Figure 1.<sub>[2]</sub>



Figure 1. Prototype based on DE2-70<sub>[3]</sub> for WSN digital baseband SOC design

To meet the requirements of low complexity, low power and high flexibility, algorithms are proposed for modulation, demodulation, spreading and synchronization, digital frequency conversion, and interpolation and decimation filtering. Based on the proposed algorithms, the architecture of our baseband circuit design is implemented in the prototype.
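To make the spreading step concrete, the following Python sketch shows direct-sequence spreading and correlation-based despreading in their simplest form. It is only an illustration under stated assumptions: the 8-chip sequence `CHIPS` and the function names are hypothetical, and they are taken neither from our baseband design nor from the IEEE 802.15.4 chip tables.

```python
# Hypothetical 8-chip spreading code in +/-1 form (an assumption for
# illustration; not the code used in the design or in IEEE 802.15.4).
CHIPS = [1, -1, 1, 1, -1, -1, 1, -1]


def spread(bits):
    """Map each bit to +/-1 and multiply it by the chip sequence."""
    out = []
    for bit in bits:
        symbol = 1 if bit else -1
        out.extend(symbol * c for c in CHIPS)
    return out


def despread(chips):
    """Correlate each block of chips with the code; the sign gives the bit."""
    n = len(CHIPS)
    bits = []
    for start in range(0, len(chips) - n + 1, n):
        block = chips[start:start + n]
        corr = sum(x * c for x, c in zip(block, CHIPS))
        bits.append(1 if corr > 0 else 0)
    return bits


if __name__ == "__main__":
    tx = spread([1, 0, 1, 1, 0])
    tx[3] = -tx[3]  # flip one chip to mimic a channel error
    print(despread(tx))
```

Because each bit is decided by a correlation over all chips, a single flipped chip lowers the correlation magnitude but does not change its sign, which is the robustness that spreading is meant to provide.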

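As shown in Figure 2, the 8051 MCU core consumes 3427 LUTs, while the baseband circuits consume 3282 LUTs. The whole design occupies 13,830 logic cells, about 20% of the 68,416 logic elements of the Cyclone II EP2C70 on the DE2-70, so the board's resources are more than enough for our design.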

| Entity                   | Logic Cells | Dedicated Logic Registers | Memory Bits | M4Ks | Pins | LUT-Only LCs | Register-Only LCs |
|--------------------------|-------------|---------------------------|-------------|------|------|--------------|-------------------|
| Cyclone II: EP2C70F896C6 |             |                           |             |      |      |              |                   |
| acu_con                  | 13830 (6)   | 7078 (5)                  | 133632      | 36   | 142  | 6752 (1)     | 1084 (0)          |
| SEGT_LUT_2:SEGT          | 14 (0)      | 0 (0)                     | 0           | 0    | 0    | 14 (0)       | 0 (0)             |
| SEGT_LUT_2:SEGT2         | 14 (0)      | 0 (0)                     | 0           | 0    | 0    | 14 (0)       | 0 (0)             |
| SEGT_LUT_2:SEGT3         | 14 (0)      | 0 (0)                     | 0           | 0    | 0    | 14 (0)       | 0 (0)             |
| mc8051_top:b2v_inst1     | 3955 (0)    | 520 (0)                   | 132096      | 33   | 0    | 3427 (0)     | 24 (0)            |
|                          | 0           | 0                         | 0           | 0    | 0    | 0            | 0                 |
| spi_baseband:b2v_inst5   | 9827 (0)    | 6545 (0)                  | 1536        | 3    | 0    | 3282 (0)     | 1060 (0)          |
| flash_cov:b2v_inst8      | 8 (8)       | 8 (8)                     | 0           | 0    | 0    | 0 (0)        | 0 (0)             |

Figure 2. Resource Consumption of Cyclone II

## III. IMPLEMENTING AN EMBEDDED TWO-DIMENSIONAL BAR CODE RECOGNITION SYSTEM BASED ON FPGA

A two-dimensional bar code is a planar symbol that carries a particular pattern in both its horizontal and vertical directions. It can therefore carry more information in a smaller area, with higher error tolerance and better scalability, than a one-dimensional bar code.[4] As an open standard, the PDF417 code is adopted to design an embedded two-dimensional bar code recognition system based on FPGA using the NIOS II processor<sub>[5]</sub>, as shown in Figure 3.

Device Family: Cyclone II

| Module Name     | Port               | Description                 | Clock | Base         | End          |
|-----------------|--------------------|-----------------------------|-------|--------------|--------------|
| cpu             |                    | Nios II Processor           |       |              |              |
|                 | instruction_master | Avalon Master               | clk   |              |              |
|                 | data_master        | Avalon Master               |       |              |              |
|                 | jtag_debug_module  | Avalon Slave                |       | 0x00888800   | 0x00888fff   |
| onchip_mem      |                    | On-Chip Memory (RAM or ROM) |       |              |              |
|                 | s1                 | Avalon Slave                | clk   | 0x00884000   | (illegible)  |
| pio             |                    | PIO (Parallel I/O)          |       |              |              |
|                 | s1                 | Avalon Slave                | clk   | 0x00889320   | 0x0088932f   |
| pio_sram_irq    |                    | PIO (Parallel I/O)          |       |              |              |
|                 | s1                 | Avalon Slave                | clk   | 0x00889330   | 0x0088933f   |
| pio_sram_sel    |                    | PIO (Parallel I/O)          |       |              |              |
|                 | s1                 | Avalon Slave                | clk   | 0x00889340   | 0x0088934f   |
| lcd             |                    | Character LCD               |       |              |              |
|                 | control_slave      | Avalon Slave                | clk   | 0x00889350   | 0x0088935f   |
| cfi_flash       |                    | Flash Memory (CFI)          |       |              |              |
|                 | s1                 | Avalon Tristate Slave       | clk   | 0x00400000   | 0x007fffff   |
| tristate_bridge |                    | Avalon-MM Tristate Bridge   |       |              |              |
|                 | avalon_slave       | Avalon Slave                | clk   |              |              |
|                 | tristate_master    | Avalon Tristate Master      |       |              |              |
| pio_point1      |                    | PIO (Parallel I/O)          |       |              |              |
|                 | s1                 | Avalon Slave                | clk   | (illegible)  | 0x0088936f   |
| pio_point2      |                    | PIO (Parallel I/O)          |       |              |              |
|                 | s1                 | Avalon Slave                | clk   | 0x00889370   | 0x0088937f   |
| pio_point3      |                    | PIO (Parallel I/O)          |       |              |              |
|                 | s1                 | Avalon Slave                | clk   | 0x00889380   | (illegible)  |
| pio_point4      |                    | PIO (Parallel I/O)          |       |              |              |
|                 | s1                 | Avalon Slave                | clk   | 0x00889390   | 0x0088939f   |
| pio_key         |                    | PIO (Parallel I/O)          |       |              |              |
|                 | s1                 | Avalon Slave                | clk   | 0x008893a0   | 0x008893af   |
| pio_time        |                    | PIO (Parallel I/O)          |       |              |              |
|                 | s1                 | Avalon Slave                | clk   | 0x008893b0   | (illegible)  |
| RS232           |                    | UART (RS-232 Serial Port)   |       |              |              |
|                 | s1                 | Avalon Slave                | clk   | 0x00889300   | 0x0088931f   |
| DPRAM_inst      |                    | DPRAM                       |       |              |              |
|                 | avalon_slave_0     | Avalon Slave                | clk   | 0x00889000   | 0x008891ff   |
| pio_hw_done     |                    | PIO (Parallel I/O)          |       |              |              |
|                 | s1                 | Avalon Slave                | clk   | 0x008893c0   | 0x008893cf   |
| pio_hw_start    |                    | PIO (Parallel I/O)          |       |              |              |
|                 | s1                 | Avalon Slave                | clk   | (illegible)  | (illegible)  |
| SRAM_inst       |                    | MYSRAM                      |       |              |              |
|                 | avalon_slave_0     | Avalon Slave                |       | 0x00800000   | (illegible)  |

Figure 3. The whole configuration for NIOS II processor

Figure 4 shows the resource utilization report of our design based on DE2. The software-based system can finish the recognition in 28 ms with 90% accuracy, given a 320 \* 240 two-dimensional bar code image.

| Item                               | Value                                    |
|------------------------------------|------------------------------------------|
| Flow Status                        | Successful - Thu Jun 02 19:47:02 2011    |
| Quartus II Version                 | 7.2 Build 151 09/26/2007 SJ Full Version |
| Revision Name                      | DE2_TV                                   |
| Top-level Entity Name              | DE2_TV                                   |
| Family                             | Cyclone II                               |
| Device                             | EP2C35F672C8                             |
| Timing Models                      | Final                                    |
| Met timing requirements            | No                                       |
| Total logic elements               | 28,631 / 33,216 (86 %)                   |
| Total combinational functions      | 25,378 / 33,216 (76 %)                   |
| Dedicated logic registers          | 14,078 / 33,216 (42 %)                   |
| Total registers                    | 14127                                    |
| Total pins                         | 430 / 475 (91 %)                         |
| Total virtual pins                 | 0                                        |
| Total memory bits                  | 242,256 / 483,840 (50 %)                 |
| Embedded Multiplier 9-bit elements | 70 / 70 (100 %)                          |
| Total PLLs                         | 1 / 4 (25 %)                             |

Figure 4. The resource utilization report of our design based on DE2<sub>[6]</sub>

## IV. ACCELERATING SUBSEQUENCE SIMILARITY SEARCH BASED ON DYNAMIC TIME WARPING DISTANCE WITH FPGA

Subsequence search, especially subsequence similarity search, is one of the most important subroutines in time series data mining algorithms, and Dynamic Time Warping (DTW) distance is regarded as the best similarity measure for it. Although many software speedup techniques exist, including early abandoning strategies, lower bounding, indexing, and computation reuse, DTW still costs about 80% of the total time in most applications. Moreover, DTW is hard to accelerate with parallel hardware because it is a two-dimensional sequential dynamic programming search with quite high data dependency.
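For readers unfamiliar with DTW, the following Python sketch shows the textbook dynamic programming recurrence and a brute-force subsequence search built on it. It is a minimal software illustration, not the implementation evaluated in this work: the function names, the squared-difference point cost, and the optional band parameter `window` are assumptions made for the example, and the speedup techniques listed above (early abandoning, lower bounds, indexing) are deliberately left out.

```python
import math


def dtw_distance(a, b, window=None):
    """Textbook DTW between sequences a and b with squared point cost.

    window is an optional band half-width that limits how far the
    warping path may leave the diagonal (None means no limit).
    """
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)
    window = max(window, abs(n - m))  # the end cell must stay reachable
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(max(1, i - window), min(m, i + window) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Each cell depends on its upper, left and upper-left neighbours.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


def best_subsequence(stream, pattern, window=None):
    """Slide the pattern over the stream and return (position, distance)."""
    m = len(pattern)
    best_pos, best_dist = -1, math.inf
    for start in range(len(stream) - m + 1):
        d = dtw_distance(stream[start:start + m], pattern, window)
        if d < best_dist:
            best_pos, best_dist = start, d
    return best_pos, best_dist


if __name__ == "__main__":
    print(best_subsequence([0, 0, 1, 2, 3, 0, 0], [1, 2, 3]))
```

The three-neighbour dependency in the inner loop is the data dependency referred to above: a cell cannot be computed before the cells above it and to its left, which is why a straightforward row-by-row evaluation is sequential.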

A novel framework for FPGA-based subsequence similarity search and a novel PE-ring structure for DTW calculation are proposed.<sup>[7]</sup> Figure 5 illustrates the framework. The framework exploits the data reusability of continuous DTW calculations to reduce the bandwidth and to exploit coarse-grained parallelism, and it guarantees accuracy with a two-phase precision reduction. The PE-ring supports on-line updating of patterns of various lengths, and uses the hard-wired synchronization of the FPGA to realize fine-grained parallelism, which can only be exploited by FPGAs.
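The kind of fine-grained parallelism available inside one DTW calculation can be seen from a well-known property of the recurrence: all cells on the same anti-diagonal of the cost matrix are independent of each other. The Python sketch below evaluates DTW in that wavefront order. It is not the PE-ring itself, whose details are given in [7]; the function name and the squared-difference cost are assumptions for illustration only.

```python
import math


def dtw_wavefront(a, b):
    """DTW evaluated anti-diagonal by anti-diagonal (wavefront order)."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    # Cells with i + j == k form one anti-diagonal of the matrix.
    for k in range(2, n + m + 1):
        i_lo = max(1, k - m)
        i_hi = min(n, k - 1)
        # Every cell on diagonal k reads only diagonals k-1 and k-2, so the
        # iterations of this inner loop are independent; in hardware each
        # one could be handled by a separate processing element.
        for i in range(i_lo, i_hi + 1):
            j = k - i
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


if __name__ == "__main__":
    print(dtw_wavefront([1, 2, 3], [2, 2, 5]))
```

With one processing element per cell of a diagonal, the n\*m cell updates collapse into n + m - 1 sequential steps, provided the elements exchange their results in lockstep every step. That tight, cycle-level synchronization is cheap in hard-wired logic and costly on processors that synchronize through memory.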

Our system is implemented on TERASIC Company's Altera DE4 Board with a Stratix IV GX EP4SGX530 FPGA.[8]

The resource cost of our system is shown in Figure 6.

| Resource                  | Used / Available       | Utilization |
|---------------------------|------------------------|-------------|
| Combinational ALUTs       | 362,568 / 424,960      | 85%         |
| Dedicated logic registers | 230,160 / 424,960      | 54%         |
| Memory bits               | 1,902,512 / 21,233,664 | 9%          |

Figure 6. The report for the resource cost of our system based on DE4




Figure 5. Hardware framework of the DTW system

The experimental results show that this work achieves a speedup of one to four orders of magnitude over the best software implementation on different datasets, three orders of magnitude over the current GPU implementations, and two orders of magnitude over the current FPGA implementation.

## V. CONSTRUCTING AN FPGA-BASED DETECTION AND 3D MEASUREMENT SYSTEM FOR REAL-TIME IMAGE PROCESSING

Electronic Road Pricing (ERP) systems are widely used around the world. The first large-scale free-flow road pricing system has been in successful operation in Singapore since April 1998. Its enforcement system (vehicle detection and license plate recognition) has achieved a license plate recognition success rate of 96% over 1.5 million vehicle units. However, this performance relies on a large amount of costly, high-quality equipment, such as in-vehicle units for every vehicle, complex gantries with many sensors, high-luminance lighting, and high-resolution cameras. As a result, reducing system cost is the most important issue for expanding sales to other countries. [9]

In some markets, using only video cameras in the enforcement system to identify vehicles might be a cost reduction option, since the complex gantries could be removed. A simple enforcement system with video cameras offers a high-performance solution with very low initial investment. Stereo cameras are needed to assure the tolling accuracy. Because cameras, especially stereo ones, produce a large volume of real-time video from which the tolling system must detect and classify vehicles, a real-time, high-throughput image processor is necessary.

An FPGA has up to millions of LEs (Logic Elements) that operate in parallel, which provides the high parallelism and throughput needed by real-time vehicle detection and classification algorithms on video. However, to achieve high performance and low cost with an FPGA, the key problem is how to analyze the computation cost of the vehicle detection and classification algorithms and allocate the computation load reasonably. We therefore design an FPGA-based detection and tracking processing system.

Based on simulation with ModelSim SE 6.5 and synthesis with Quartus II 10.0, the hardware implementation on an Altera Stratix IV FPGA can reach 125 MHz. It achieves about 43 fps for video of 1392\*1040 resolution with a large disparity range of 256, and 400 fps for video of 640\*480 resolution with a disparity range of 128. The detection and tracking part only gives the tracked area, so the image for these modules is resized to 320\*256. We keep the 1392\*1040 resolution for stereo matching in order to achieve more accurate size extraction.
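The reported frame rates can be related to the 125 MHz clock with a short calculation. The Python sketch below converts each frame rate into a pixel rate and then into clock cycles available per pixel; the function names are illustrative, and the calculation ignores blanking intervals and pipeline fill, so it is only a rough consistency check.

```python
CLOCK_HZ = 125e6  # clock rate reported in the text


def pixels_per_second(width, height, fps):
    """Pixel rate implied by a resolution and a frame rate."""
    return width * height * fps


def cycles_per_pixel(width, height, fps, clock_hz=CLOCK_HZ):
    """Clock cycles available per pixel at the given clock rate."""
    return clock_hz / pixels_per_second(width, height, fps)


if __name__ == "__main__":
    for width, height, fps in [(1392, 1040, 43), (640, 480, 400)]:
        rate = pixels_per_second(width, height, fps) / 1e6
        cpp = cycles_per_pixel(width, height, fps)
        print(f"{width}*{height} @ {fps} fps: "
              f"{rate:.1f} Mpixel/s, {cpp:.2f} cycles/pixel")
```

Under these assumptions the 640\*480 case corresponds to roughly one clock cycle per pixel and the 1392\*1040 case to roughly two, which matches the doubling of the disparity range from 128 to 256.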

The software results are measured on an i7 930 2.8 GHz CPU using the OpenCV library. The results in Table I indicate that our FPGA implementation achieves a speedup of 71.38 times over software for the whole system, and of 63.91 times for stereo matching.

| Module          | Software (ms) | Hardware (ms) | Speedup |
|-----------------|---------------|---------------|---------|
| Detection       | 161.42        | 2.42          | 66.70   |
| Stereo Matching | 1380.40       | 21.60         | 63.91   |
| Total           | 1541.82       | 21.60         | 71.38   |

TABLE I SW AND HW RESULTS FOR ONE 1392\*1040 IMAGE
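The speedup column of Table I follows directly from the two timing columns, as the short Python check below shows. The names `RESULTS_MS` and `speedup` are illustrative and not part of the project. Note that the reported total hardware time equals the stereo matching time rather than the sum of the two modules; this would be consistent with the two modules running concurrently in hardware, although this is an interpretation rather than a measured fact.

```python
# (software ms, hardware ms) per module, copied from Table I.
RESULTS_MS = {
    "Detection": (161.42, 2.42),
    "Stereo Matching": (1380.40, 21.60),
    "Total": (1541.82, 21.60),
}


def speedup(software_ms, hardware_ms):
    """Speedup of hardware over software for one module."""
    return software_ms / hardware_ms


if __name__ == "__main__":
    for module, (sw, hw) in RESULTS_MS.items():
        print(f"{module}: {speedup(sw, hw):.2f}x")
```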

## CONCLUSIONS

In summary, thanks to the outstanding capabilities of FPGAs, much of the research work at the NICS lab is based on them.

This paper introduces four projects. The prototype system for our wireless sensor network digital baseband system-on-a-chip design based on DE2-70 helps us to verify our proposed circuits and algorithms. The implementation of an embedded two-dimensional bar code recognition system based on DE2 shows that the proposed design can finish the recognition in 28 ms with 90% accuracy, given a 320 \* 240 two-dimensional bar code image. A subsequence similarity search algorithm based on Dynamic Time Warping (DTW) distance is accelerated on DE4. Compared with other software and hardware methods, it achieves a speedup of one to four orders of magnitude. A hardware platform for a real time image processing system is built on an Altera Stratix IV FPGA, and experimental results demonstrate that our FPGA implementation achieves a speedup of 71.38 times over software for the whole system, and of 63.91 times for stereo matching.

## ACKNOWLEDGMENT

Many thanks to Altera Corp. for supplying excellent FPGA chips, and to TERASIC Company for their powerful DE series boards.

Many thanks to the National Science and Technology Major Project of the Ministry of Science and Technology of China, the National Natural Science Foundation of China, Microsoft, and Mitsubishi Heavy Industries for their financial support.

Many thanks to Prof. Huazhong Yang, Yu Wang and Yongpan Liu for their contributions to the research work at NICS.

## REFERENCES

- [1]. FPGA, http://en.wikipedia.org/wiki/Field-programmable\_gate\_array, 2012-10-12
- [2]. Yihao Zhu, Medium/high speed WSN digital baseband SOC design with IEEE 802.15.4 PHY layer compatible, master thesis, Tsinghua University, 2012.6.
- [3]. DE2-70 Datasheets, www.altera.com, 2009.
- [4]. Xiao Chen. The two-dimensional bar code recognition system based on FPGA, bachelor thesis, Tsinghua University, 2011.6.
- [5]. NIOS II Datasheets, www.altera.com, 2010.
- [6]. DE2 Datasheets, www.altera.com, 2010.
- [7]. Zilong Wang, Yu Wang, et al. Accelerating Subsequence Similarity Search Based on Dynamic Time Warping Distance with FPGA. paper submitted to FPGA 2013.
- [8]. DE4 Datasheets, www.altera.com, 2011.
- [9]. Yu Wang, Rong Luo, et al. Development of an effective design tool of a real-time image processing hardware, final report for cooperation with Mitsubishi Heavy Industries, Ltd, 2012-3-19.