Project

| # | Title | Team Members | TA | Documents | Sponsor |
|---|---|---|---|---|---|
| 5 | An event-based smart vision node for ultra-low-latency motion detection | Luying Wang, Shuke Wang, Yaxing Zhang, Yueyao Si | | design_document1.pdf, final_paper1.pdf, final_paper2.pdf, proposal1.docx | Aili Wang |

# Problem



Traditional motion detection systems usually rely on frame-based cameras, which capture full images at fixed intervals. Consecutive frames are often nearly identical, so the system stores and processes a large amount of redundant information. This increases the data load and leads to higher power consumption. In addition, motion can only be analyzed after a batch of frames has been collected and processed, which is not ideal for applications that require very low latency.



The main problem is therefore how to build a vision system that responds to motion more efficiently by using only meaningful visual changes instead of full frames, while showing potential advantages in latency, resource usage, and power consumption compared with a conventional frame-based approach.

# Solution Overview



Our solution is to build an event-based vision system using a DVS camera, an FPGA, and SNN-inspired processing. Instead of capturing and processing full image frames, the system works directly on the input event stream.



The system first captures event-based visual data from a DVS camera. These events are sent to the FPGA, where they are received, parsed, and temporarily buffered in real time without reconstructing full frames. The formatted event stream is then passed to a software-based SNN-inspired module, which analyzes motion patterns over time and generates a detection result when meaningful activity is observed. When motion is detected, the result is sent to the output subsystem for display with minimal latency.



If time allows, a frame-based baseline may be used as a comparison so that our system can be evaluated in terms of end-to-end latency, event throughput, and power consumption.

# Solution Components & Distribution of work



### Event-Based Vision Sensor (Shuke Wang – EE)



- Dynamic Vision Sensor (DVS) Camera: Employs a neuromorphic event-based sensor that captures visual information as asynchronous spikes of pixel-level brightness changes. Each event includes pixel coordinates, polarity, and a precise microsecond timestamp, enabling ultra‑low‑latency motion detection without the need for full frame readout.



- High‑Speed Data Interface: Outputs event streams using the Address‑Event Representation (AER) protocol over a high‑bandwidth link. This interface allows direct, real‑time transmission of raw events to the FPGA processing platform, minimizing additional latency and preserving the temporal precision of the sensor.



- Optics and Mounting: The camera is equipped with a suitable lens to match the target field of view and application scenario. It is rigidly mounted on an adjustable stage to facilitate precise alignment and stable imaging conditions during experiments.
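
The event fields listed above (pixel coordinates, polarity, microsecond timestamp) can be illustrated with a small software model. This is only a sketch: the bit layout of the address word, the field widths, and the names `Event`, `pack_address`, and `unpack_event` are assumptions made for illustration, since the real AER word format depends on the chosen DVS camera and its interface.

```python
from dataclasses import dataclass

# Hypothetical address-word layout (an assumption, not the camera's real format):
#   bit 0        polarity (1 = brightness increase, 0 = decrease)
#   bits 1-10    x coordinate
#   bits 11-20   y coordinate
X_BITS = 10
Y_BITS = 10


@dataclass(frozen=True)
class Event:
    x: int         # pixel column
    y: int         # pixel row
    polarity: int  # 1 or 0
    t_us: int      # timestamp in microseconds


def pack_address(ev: Event) -> int:
    """Pack coordinates and polarity into one address word."""
    return (ev.y << (1 + X_BITS)) | (ev.x << 1) | ev.polarity


def unpack_event(word: int, t_us: int) -> Event:
    """Recover the event fields from an address word and its timestamp."""
    polarity = word & 0x1
    x = (word >> 1) & ((1 << X_BITS) - 1)
    y = (word >> (1 + X_BITS)) & ((1 << Y_BITS) - 1)
    return Event(x=x, y=y, polarity=polarity, t_us=t_us)
```

Packing an event and unpacking the resulting word with the same timestamp returns the original event, which is a convenient check when the parser on the FPGA side is written against the same layout.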

### FPGA Subsystem (Yaxing Zhang – EE)



- The FPGA subsystem serves as the real-time processing platform of the system. It receives the event stream from the DVS camera through a high-speed interface and parses each event into pixel coordinates, polarity, and timestamp.



- The parsed events are temporarily stored in on-chip buffers to maintain stable data flow and handle burst event traffic. The FPGA can also perform lightweight pre-processing such as basic filtering before passing the formatted event stream to the motion detection module.



- This hardware platform ensures low latency and efficient handling of asynchronous event data in the system pipeline.
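
The buffering and lightweight filtering described above can be prototyped in software before being written in HDL. The sketch below is a behavioral model only, not the FPGA design: the class names `EventFifo` and `RefractoryFilter`, the FIFO depth, and the choice of a per-pixel refractory filter as the "basic filtering" step are all assumptions.

```python
from collections import deque


class EventFifo:
    """Bounded FIFO that counts events dropped when a burst overflows it."""

    def __init__(self, depth):
        self.depth = depth
        self.queue = deque()
        self.dropped = 0

    def push(self, event):
        # Reject the new event when the buffer is full, as a hardware FIFO would.
        if len(self.queue) >= self.depth:
            self.dropped += 1
            return False
        self.queue.append(event)
        return True

    def pop(self):
        # Return the oldest event, or None when the buffer is empty.
        return self.queue.popleft() if self.queue else None


class RefractoryFilter:
    """Discard events from a pixel that fired less than refractory_us ago."""

    def __init__(self, refractory_us):
        self.refractory_us = refractory_us
        self.last_t = {}  # (x, y) -> timestamp of the last accepted event

    def accept(self, x, y, t_us):
        last = self.last_t.get((x, y))
        if last is not None and t_us - last < self.refractory_us:
            return False
        self.last_t[(x, y)] = t_us
        return True
```

Tracking `dropped` gives a direct way to see whether a chosen buffer depth is enough for the burst traffic seen in experiments.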

### SNN-Based Motion Detection Subsystem (Luying Wang – ECE)



- An SNN-inspired module analyzes the incoming events. It updates neural activity with each event spike so that motion activity accumulates in the regions where events occur, and it generates an output when the activity in a region exceeds a threshold.
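
One way to picture this accumulate-and-threshold behavior is a grid of leaky integrate-and-fire units, one per image region. The sketch below is an illustration under assumptions, not the project's actual module: the class name `RegionLIF`, the 16-pixel cell size, the exponential leak, the time constant, and the threshold value are all hypothetical choices.

```python
import math


class RegionLIF:
    """One leaky integrate-and-fire unit per cell x cell block of pixels."""

    def __init__(self, width, height, cell=16, tau_us=10_000.0, threshold=5.0):
        self.cell = cell
        self.tau_us = tau_us          # leak time constant (assumed value)
        self.threshold = threshold    # firing threshold (assumed value)
        self.cols = math.ceil(width / cell)
        self.rows = math.ceil(height / cell)
        self.potential = [[0.0] * self.cols for _ in range(self.rows)]
        self.last_t = [[0] * self.cols for _ in range(self.rows)]

    def update(self, x, y, t_us):
        """Feed one event; return (row, col) if that region fires, else None."""
        r, c = y // self.cell, x // self.cell
        # Decay the stored activity for the time since the region's last event.
        dt = max(0, t_us - self.last_t[r][c])
        v = self.potential[r][c] * math.exp(-dt / self.tau_us) + 1.0
        self.last_t[r][c] = t_us
        if v >= self.threshold:
            self.potential[r][c] = 0.0  # reset after firing
            return (r, c)
        self.potential[r][c] = v
        return None
```

With this model, a dense burst of events in one region fires quickly, while isolated events spaced far apart in time leak away and never reach the threshold.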

### Output Subsystem (Yueyao Si – ME)



- The output subsystem is responsible for presenting the final motion detection result generated by the SNN-inspired module. Once motion activity exceeds the predefined threshold, a detection signal is produced and forwarded to the output controller.



- In the current implementation, the FPGA receives the detection result and triggers a visual indicator such as an LED or display module. When motion is detected, the indicator is activated in real time; otherwise it remains off.



- This subsystem provides a simple and low-latency way to demonstrate the system response to motion events. The output interface can also be extended to support other devices, such as a monitor display, UART logging interface, or external control signals for robotic or embedded applications.

# Criteria of Success

### Functionality

- The complete pipeline runs successfully from event input to final output.



- The motion detection module can correctly identify motion regions from the event stream.



- The output responds correctly to motion: the display turns on when motion is detected and remains off otherwise.



### Performance

- The end-to-end latency is less than 50 ms.



- The measured FPGA board power during operation is less than 5 W.



- The FPGA resource utilization remains below 80% of available logic and memory resources.
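
The three performance criteria above can be checked mechanically once measurements are available. The following is a minimal sketch: the function name `check_performance` and its inputs are hypothetical, and treating the latency criterion as a worst-case bound over all measured events is an assumption, since the criterion does not say whether it refers to the mean or the maximum.

```python
LATENCY_LIMIT_MS = 50.0    # end-to-end latency criterion
POWER_LIMIT_W = 5.0        # FPGA board power criterion
UTILIZATION_LIMIT = 0.80   # logic and memory utilization criterion


def check_performance(pairs_us, power_w, logic_util, memory_util):
    """Compare measurements against the three performance criteria.

    pairs_us: list of (t_in_us, t_out_us) timestamps for the same motion
    event at the sensor input and at the output indicator.
    """
    latencies_ms = [(t_out - t_in) / 1000.0 for t_in, t_out in pairs_us]
    return {
        # Assumption: the criterion is applied to the worst measured case.
        "latency": max(latencies_ms) < LATENCY_LIMIT_MS,
        "power": power_w < POWER_LIMIT_W,
        "resources": max(logic_util, memory_util) < UTILIZATION_LIMIT,
    }
```

Any numbers passed to this helper should come from actual measurements, for example timestamp pairs captured on the board and utilization figures from the synthesis report.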

# References

- [Event-based Vision: A Survey](https://arxiv.org/pdf/1904.08405)



- [Event-based vision on FPGAs – a survey](https://arxiv.org/html/2407.08356v1#bib.bib60)



- [Neuro-Inspired Spike-Based Motion: From Dynamic Vision Sensor to Robot Motor Open-Loop Control through Spike-VITE](https://www.mdpi.com/1424-8220/13/11/15805)



- [A Reconfigurable Architecture for Real-time Event-based Multi-Object Tracking | ACM Transactions on Reconfigurable Technology and Systems](https://dl.acm.org/doi/10.1145/3593587)

A Wearable Device Outputting Scene Text For Blind People

Hangtao Jin, Youchuan Liu, Xiaomeng Yang, Changyu Zhu


Featured Project

# Revised

We discussed the problem with our mentor, Prof. Gaoang Wang, and arrived at a solution.

## TEAM MEMBERS (NETID)

Xiaomeng Yang (xy20), Youchuan Liu (yl38), Changyu Zhu (changyu4), Hangtao Jin (hangtao2)

## INSTRUCTOR

Prof. Gaoang Wang

## LINK

This idea was pitched on Web Board by Xiaomeng Yang.

https://courses.grainger.illinois.edu/ece445zjui/pace/view-topic.asp?id=64684

## PROBLEM DESCRIPTION

There are about 12 million visually impaired people in China, yet blind people are rarely seen on the street. One reason is that when they travel to an unfamiliar location, it is difficult for them to figure out where they are. Blind travelers usually carry navigation equipment, but it is not accurate enough to lead them to the exact position of the destination once they arrive nearby. We would therefore like to make a device that reads out the scene text around the destination so that blind people can reach the exact place.

## SOLUTION OVERVIEW

We'd like to make a device with a micro camera and an earphone. When the user presses a button, the camera takes a picture, and the communication subsystem sends it to a remote server for processing. Text is then extracted and recognized from the picture using a neural network and converted to speech by the Google text-to-speech API. The speech is sent back and played to the user through the earphone. The device can be attached to the glasses that blind people wear.

Navigation equipment can tell blind users the location and general direction of their destination, but they still need detailed directions once they are close. Our wearable device helps solve this problem. The camera is fixed to the head, like our eyes, so when the user turns their head, the camera captures the scene text in different directions. Our target scenario is identifying the names of stores along the street. These store signs are generally no more than about two stories high, so the user can look up and down to let the camera capture the whole storefront. Therefore, wherever the store name is, it can be recognized.

For example, if a blind person wants to go to a bookstore, the navigation app will tell him that he has arrived and that the store is on his right when he is near the destination. However, there are several stores on his right. He can then face right, take a photo in that direction, and find out whether the store is there. If not, he can turn his head a little and take another photo in the new direction.

![figure1](https://courses.grainger.illinois.edu/ece445zjui/pace/getfile/18612)

![figure2](https://courses.grainger.illinois.edu/ece445zjui/pace/getfile/18614)

## SOLUTION COMPONENTS

### Interactive Subsystem

The interactive subsystem interacts with the blind user and the environment.

- 3-D printed frame that attaches to the glasses through a snap-fit structure and holds all the accessories in place

- Micro camera that can take pictures

- Earphone that can output the speech

### Communication Subsystem

The communication subsystem is used to connect the interactive subsystem with the software processing subsystem.

- A Raspberry Pi (RPi) gets the images taken by the camera and sends them to the remote server through a WiFi module. After the remote server finishes processing, the RPi receives the speech information (an .mp3 file).

### Software Processing Subsystem

The software processing subsystem processes the images and outputs speech. It includes two parts: text recognition and text-to-speech.

- An OCR neural network that extracts and recognizes Chinese text from the environmental images sent by the communication subsystem.

- The Google text-to-speech API is used to convert the recognized text to speech.
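
The two parts can be chained on the server as a single function that takes an image and returns audio. The sketch below only shows the data flow under assumptions: `describe_scene`, `recognize_text`, `synthesize_speech`, and the fallback message are hypothetical names introduced for illustration, and the OCR network and the text-to-speech call are passed in as plain callables rather than tied to any specific library API.

```python
from typing import Callable, List


def describe_scene(
    image_bytes: bytes,
    recognize_text: Callable[[bytes], List[str]],
    synthesize_speech: Callable[[str], bytes],
    fallback: str = "未识别到文字",  # hypothetical "no text recognized" message
) -> bytes:
    """Run OCR on an image and return the recognized text as audio bytes."""
    # Drop empty or whitespace-only results from the OCR step.
    lines = [line.strip() for line in recognize_text(image_bytes)]
    lines = [line for line in lines if line]
    # Join the recognized strings into one sentence for the speech step.
    sentence = "，".join(lines) if lines else fallback
    return synthesize_speech(sentence)
```

Keeping both steps behind callables lets the pipeline be exercised with stand-in functions before the real OCR model and text-to-speech service are connected.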

## CRITERIA FOR SUCCESS

- Use a neural network to recognize Chinese scene text successfully.

- Use the Google text-to-speech API to convert the recognized text to speech.

- The device can send environment pictures or video to the server and receive the speech information correctly.

- Blind people can use the speech information to locate their position.