Project

No. 28
Title: Extended Reality Based Robotic Desktop Assistant
Team Members: Cheng Zheng, Yuxuan Wu, Zhewei Zhang, Ziyang Jin
Documents: design_document1.pdf, final_paper1.pdf, final_paper2.pdf, other1.docx, proposal1.pdf, video1.mp4
Sponsor: Liangjing Yang
#People
Cheng Zheng: cz77
Yuxuan Wu: yuxuan59
Ziyang Jin: ziyang3
Zhewei Zhang: zheweiz3

#Problem:
Portable robotic assistants have strong potential for everyday use, yet their compact form factor severely limits on-board user interfaces. As a result, users often cannot quickly understand what the robot can do, what it is currently doing, and how to interact with it efficiently.

Most palm-size robots rely on a mobile app or a few physical buttons, which leads to a narrow and less intuitive interaction style. Users frequently need to switch between checking the phone, issuing commands, and observing the robot’s response, increasing both the learning curve and operational friction. Some existing solutions enhance interaction by adding external displays or extra devices, but this increases system bulk and setup complexity, undermining portability and “grab-and-go” usability.

By turning any flat surface (e.g., a desk or a wall) into an interactive projection-based interface and integrating gesture recognition with dynamic visual feedback, the robot can provide a more natural and direct human–robot interaction experience without requiring an additional screen. This approach improves usability by making robot status and functions easier to perceive and operate, while also enabling an “interface anywhere” form factor that better fits real-world daily-assistance scenarios and enhances user engagement.

#Solution Overview:
Core function:
##Dynamic Projection Interface:
The robot projects an interactive user interface onto any flat surface (e.g., a desk or a wall), converting the surrounding physical space into an operable interaction area without adding an external display.

##Gesture-Based Interaction Control:
Users interact with the projected interface using hand gestures. The system performs real-time gesture detection and recognition, maps gestures to commands, and triggers corresponding robot responses, enabling a natural and intuitive interaction flow.

##Interface Navigation:
The main projected interface provides basic feature entries (e.g., Weather, Clock, Exit) and supports page switching and function invocation through a “point-and-click” interaction style.

##Information Query Functions:
Selecting the Weather icon switches the interface to display current weather information. Selecting the Clock icon switches the interface to display the current time. Each sub-page includes an Exit icon that returns the user to the main interface, ensuring a consistent and easy-to-learn navigation logic.
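
The navigation logic above can be sketched as a small lookup-table state machine. This is only an illustration under assumptions, not the team's implementation: the names `Page`, `TRANSITIONS`, and `next_page`, and the icon labels, are hypothetical.

```python
from enum import Enum


class Page(Enum):
    MAIN = "main"
    WEATHER = "weather"
    CLOCK = "clock"


# (current page, selected icon) -> next page.
# Icon labels are assumed; the behavior of Exit on the main page is not
# specified in the text, so it is left out and simply keeps the main page.
TRANSITIONS = {
    (Page.MAIN, "weather"): Page.WEATHER,
    (Page.MAIN, "clock"): Page.CLOCK,
    (Page.WEATHER, "exit"): Page.MAIN,
    (Page.CLOCK, "exit"): Page.MAIN,
}


def next_page(current, icon):
    """Return the page to show after `icon` is selected on `current`.

    Selections with no matching transition leave the page unchanged.
    """
    return TRANSITIONS.get((current, icon), current)
```

Keeping every transition in one table means each sub-page returns to the main interface through the same Exit rule, which is the consistency the design asks for.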

##Affective (Emotional) Interaction:
Simple gesture-triggered feedback is included to improve engagement and user friendliness.
A thumbs-up gesture triggers a 👍 animation with a cheerful sound;
a thumbs-down gesture triggers a 👎 animation with a sad sound;
and a heart gesture triggers a ❤️ animation with a warm, gentle sound.

#Components:
1. Mechanical Module
The mechanical module is designed to meet the overall goal of a palm-size, mobile robot with a projection-based interactive interface. A compact and lightweight structure is adopted to ensure stable desktop mobility, provide attitude adjustment, and support proper mounting and viewing angles for the projector–camera system. The module consists of three main parts:
(1) Omni-wheel Mobile Base: a three-omni-wheel chassis is used, with each wheel diameter no larger than 5 cm, enabling agile planar motion and maneuverability on desktop surfaces;
(2) Mini Gimbal: a 2-DoF gimbal provides orientation adjustment with a pitch range of approximately ±30°, allowing the system to align the projection and vision direction under different usage conditions and improving projection/recognition robustness;
(3) Lightweight Enclosure: the enclosure will be fabricated via 3D printing to support rapid iteration and assembly optimization. The overall robot size is constrained within 15 cm × 15 cm × 15 cm to maintain portability and desktop friendliness.
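
As a rough illustration of how a three-omni-wheel base turns a desired body velocity into wheel speeds, the sketch below uses the standard inverse kinematics for wheels spaced 120° apart. The wheel placement angles and the base radius are assumptions, not values from the design. The wheel radius uses the 5 cm diameter upper bound stated above.

```python
import math

# Assumed geometry: wheels mounted 120 degrees apart around the chassis center.
WHEEL_ANGLES_DEG = (90.0, 210.0, 330.0)
WHEEL_RADIUS_M = 0.025  # from the 5 cm maximum wheel diameter
BASE_RADIUS_M = 0.06    # assumed distance from chassis center to each wheel


def wheel_speeds(vx, vy, omega):
    """Convert a body velocity into wheel angular speeds (rad/s).

    vx, vy: desired chassis velocity in m/s in the robot frame.
    omega:  desired yaw rate in rad/s, counter-clockwise positive.
    Each wheel drives along the tangent (-sin(theta), cos(theta)) at its mount.
    """
    speeds = []
    for deg in WHEEL_ANGLES_DEG:
        theta = math.radians(deg)
        rim_speed = (-math.sin(theta) * vx
                     + math.cos(theta) * vy
                     + BASE_RADIUS_M * omega)
        speeds.append(rim_speed / WHEEL_RADIUS_M)
    return speeds
```

For a pure rotation all three wheels turn at the same speed, and for a pure translation the three wheel speeds sum to zero, which gives two quick sanity checks when wiring up the motors.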

2. Electronic Module
The electronic module is selected and integrated to support a low-power, portable, and fully self-contained system. It provides computation and control, vision sensing, projection display, audio feedback, wireless connectivity, and power management to ensure reliable standalone operation in desktop scenarios. The main components include:
(1) Main Controller: Raspberry Pi Zero 2 W (512 MB RAM) serves as the core computing and control unit, capable of running basic vision and interaction logic (with OpenCV support);
(2) Micro Projector: a DLP2000-based module with 854×480 resolution and short-throw projection, used to project the interactive UI onto flat surfaces (desk/wall) to form an interaction area;
(3) Camera: an OV5640 camera (5 MP, autofocus supported) captures gestures, objects, and environmental cues, enabling gesture recognition, interface registration, and task execution;
(4) Speaker: a compact speaker module with PWM audio output provides sound cues and affective feedback;
(5) Power System: two 18650 Li-ion cells (capacity ≥2000 mAh) target a battery life of at least 1 hour for demos and mobile usage;
(6) Communication: Wi-Fi 2.4 GHz and Bluetooth 4.2 enable connection with a phone or external devices for control, debugging, and data transfer.
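
The 1-hour battery target can be checked with a simple energy budget. The sketch below is a back-of-the-envelope estimate under assumptions: the 3.7 V nominal cell voltage and the 85% conversion efficiency are typical figures, not measurements from this design, and the average load is left as a parameter because no per-component power figures are given.

```python
CELL_COUNT = 2            # two 18650 cells, from the component list
CELL_CAPACITY_AH = 2.0    # 2000 mAh minimum capacity, from the component list
CELL_NOMINAL_V = 3.7      # assumed typical Li-ion nominal voltage
CONVERTER_EFFICIENCY = 0.85  # assumed regulator efficiency


def pack_energy_wh():
    """Nominal pack energy; the same whether cells are in series or parallel."""
    return CELL_COUNT * CELL_CAPACITY_AH * CELL_NOMINAL_V


def runtime_hours(avg_load_w):
    """Estimated runtime for a given average load in watts."""
    if avg_load_w <= 0:
        raise ValueError("average load must be positive")
    return pack_energy_wh() * CONVERTER_EFFICIENCY / avg_load_w


def max_load_for_runtime_w(hours):
    """Largest average load (W) that still reaches the target runtime."""
    if hours <= 0:
        raise ValueError("target runtime must be positive")
    return pack_energy_wh() * CONVERTER_EFFICIENCY / hours
```

Under these assumptions the pack stores about 14.8 Wh, so the combined average draw of the controller, projector, camera, speaker, and motors would need to stay below roughly 12.6 W to reach one hour.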

3. Software Modules
The software modules integrate projection display, visual perception, interaction comprehension, and motion execution into a closed-loop system. The design is modular to allow parallel development and future expansion, and it comprises the following submodules:
(1) Gesture Recognition Module: Implements gesture detection and recognition using MediaPipe/OpenCV, supporting inputs such as tap, thumbs-up, thumbs-down, and heart gestures for interaction commands.
(2) Interface Rendering Module: Dynamically generates and renders main and sub-interface content (e.g., weather, clock), outputting corresponding graphical interfaces to the projection display.
(3) Interaction Logic Engine: Maps gestures to commands and triggers events, manages interface state machines and interaction flows (main/sub-interface switching, exit/return), ensuring consistent and maintainable interaction logic.
(4) Image Correction Module: Performs geometric correction and alignment on projected images to enhance stability at varying angles and distances. Works with the camera to implement auto-focus and alignment strategies, ensuring a clearer and more reliable interface display.
(5) Sound Effect Generation Module: Plays corresponding audio cues (e.g., thumbs-up, thumbs-down, heart feedback sounds) based on interaction events, providing clearer feedback.
(6) Data Acquisition Module: Retrieves real-time weather, time, and other information via network APIs, updates projected interfaces, and enables information lookup functionality.
(7) Motion Control Module: Manages chassis movement control and task execution, including fundamental speed/attitude control interfaces and higher-level behaviors like line-following navigation and moving to designated zones. This module integrates with the interaction logic engine, allowing users to trigger motion-related tasks via projected interfaces or gestures.
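
One common way to do the geometric correction in the Image Correction Module is a planar homography estimated from four corner correspondences. The sketch below is a dependency-free illustration of that idea, not the project's code: the function names and the camera corner coordinates in the usage note are hypothetical. In practice OpenCV's `cv2.getPerspectiveTransform` computes the same 3×3 matrix.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src, dst):
    """Return the 3x3 homography (h33 fixed to 1) mapping 4 src points to 4 dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h, point):
    """Map a camera-image point into projected-UI coordinates."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return (u, v)
```

For example, if the camera sees the projected 854×480 frame with corners at the hypothetical pixels `[(100, 80), (540, 60), (600, 420), (60, 400)]`, passing those as `src` and `[(0, 0), (854, 0), (854, 480), (0, 480)]` as `dst` gives a matrix that converts a detected fingertip position into UI coordinates for icon hit-testing.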

#Criteria for success:
## F1: Main UI Projection Clarity & Icon Size
• Success Criteria: The main interface projects two clearly visible icons (Weather and Clock). Each icon has a visible size of at least 3 cm × 3 cm.
• Verification Method: Visually check projection clarity and measure the icon size using a ruler.
## F2: Weather Page Switching Latency
• Success Criteria: After the user completes a click on the Weather icon, the UI switches to the weather page within 2 s.
• Verification Method: Time the interval from click completion to page switch completion.
## F3: Weather Information Field Completeness
• Success Criteria: The weather page displays, at minimum, the following fields: city, temperature, and weather condition.
• Verification Method: Visually verify the presence of these fields on the projected page.
## F4: Clock Page Switching Latency
• Success Criteria: After the user completes a click on the Clock icon, the UI switches to the time page within 2 s.
• Verification Method: Time the interval from click completion to page switch completion.
## F5: Time Display Format & Refresh
• Success Criteria: The time page displays the current time in “HH:MM:SS” format and updates continuously.
• Verification Method: Visually check the format and observe continuous time updates.
## F6: Return/Exit Entry Consistency on Sub-Pages
• Success Criteria: Sub-pages (Weather/Time) always show an Exit/Back icon (or an equivalent return entry) with a consistent, recognizable placement.
• Verification Method: Visually check that the return entry remains present and consistent across pages.
## F7: Return-to-Main Page Latency
• Success Criteria: After clicking the Exit/Back icon, the UI returns to the main page within 1.5 s.
• Verification Method: Time the interval from click completion to main page display completion.
## F8: Thumbs-Up Gesture Response & Feedback
• Success Criteria: Upon a thumbs-up gesture, the system displays a 👍 animation and plays a cheerful sound cue, with a total response time under 2 s.
• Verification Method: Record a video and measure the latency frame-by-frame from gesture completion to animation/audio onset.
## F9: Thumbs-Down Gesture Response & Feedback
• Success Criteria: Upon a thumbs-down gesture, the system displays a 👎 animation and plays a sad sound cue, with a total response time under 2 s.
• Verification Method: Record a video and measure the latency frame-by-frame from gesture completion to animation/audio onset.
## F10: Heart Gesture Response & Feedback
• Success Criteria: Upon a heart gesture, the system displays a ❤️ animation and plays a warm sound cue, with a total response time under 2 s.
• Verification Method: Record a video and measure the latency frame-by-frame from gesture completion to animation/audio onset.
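
The frame-by-frame latency measurement used for F8 to F10 reduces to a frame-count difference divided by the recording frame rate. A minimal sketch, with hypothetical function names and the 2 s limit taken from the criteria above:

```python
def latency_seconds(gesture_end_frame, feedback_onset_frame, fps):
    """Latency between two frame indices of a video recorded at `fps`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    if feedback_onset_frame < gesture_end_frame:
        raise ValueError("feedback cannot start before the gesture ends")
    return (feedback_onset_frame - gesture_end_frame) / fps


def meets_latency_limit(latency_s, limit_s=2.0):
    """True when the response time is strictly under the limit."""
    return latency_s < limit_s
```

The measurement resolution is one frame period (about 33 ms at 30 fps), so results close to the limit are better confirmed with a higher frame rate recording.
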
## F11: Gesture Recognition Accuracy
• Success Criteria: Gesture recognition accuracy is at least 85% (at least 20 trials per gesture type).
• Verification Method: Log the number of correct recognitions and total trials per gesture, then compute accuracy.
## F12: False-Trigger Rate (Robustness)
• Success Criteria: False-trigger rate is no more than 10% (no response should be triggered by non-target gestures or no-interaction motions).
• Verification Method: During a fixed-duration or fixed-count negative test (non-gesture/disturbance motions), record false triggers and compute the rate.
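
The F11 and F12 checks can be computed directly from the logged counts. The sketch below is illustrative: the function names are hypothetical, and it assumes that the 85% threshold applies to each gesture type separately and that the false-trigger rate is measured over a fixed number of negative trials.

```python
MIN_TRIALS_PER_GESTURE = 20   # from F11
MIN_ACCURACY_PERCENT = 85     # from F11
MAX_FALSE_TRIGGER_PERCENT = 10  # from F12


def accuracy(correct, total):
    """Fraction of trials recognized correctly."""
    if total <= 0:
        raise ValueError("total trials must be positive")
    return correct / total


def passes_f11(results):
    """`results` maps gesture name -> (correct, total).

    Integer arithmetic avoids floating-point edge cases at exactly 85%.
    """
    return all(
        total >= MIN_TRIALS_PER_GESTURE
        and correct * 100 >= MIN_ACCURACY_PERCENT * total
        for correct, total in results.values()
    )


def passes_f12(false_triggers, negative_trials):
    """True when false triggers are at most 10% of the negative trials."""
    if negative_trials <= 0:
        raise ValueError("negative trials must be positive")
    return false_triggers * 100 <= MAX_FALSE_TRIGGER_PERCENT * negative_trials
```

With 20 trials per gesture, each gesture needs at least 17 correct recognitions to pass F11, and a 20-trial negative test passes F12 with at most 2 false triggers.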

Autonomous Behavior Supervisor

Shengjian Chen, Xiaolu Liu, Zhuping Liu, Huili Tao

Featured Project

## Team members

- Xiaolu Liu (xiaolul2)

- Zhuping Liu (zhuping2)

- Shengjian Chen (sc54)

- Huili Tao (huilit2)

## Problem:

In many real-life scenarios, AI systems need not only to detect people but also to monitor their behavior. However, many current systems only detect faces and do not analyze movement, so their results are not comprehensive enough. For example, in a high-risk laboratory, we need to ensure not only that each person entering is identified, but also that he or she follows the regulations to avoid danger. Such a system can also help supervise students during online exams: combining a student's facial expressions, eye movements, and body movements helps maintain the fairness of the test.

## Solution Overview:

Our solution to the problem above is an Autonomous Behavior Supervisor. The system mainly consists of a camera and an alarm device. Using real-time photos taken by the camera, the system performs face verification. Once a person is successfully verified, the camera starts to monitor the person's behavior and interactions with the surroundings, and the system determines whether any action is dangerous or non-compliant. As soon as the system detects abnormal behavior, the alarm rings. Conversely, if the person fails verification (i.e., does not have permission), the message "You do not have permission" is displayed on the computer screen.

## Solution Components:

### Identification Subsystem:

- Locate each person's face

- Determine whether the face is recorded in our system

The camera captures a person's face as image input to the system. Python libraries such as OpenCV provide many useful tools for this. Identification has three steps. First, we build a record of enrolled users and store each encoded faceprint. Second, we use the camera to capture the current face image and generate its face encoding. Finally, we compare the current encoding with the stored ones against a threshold: when the similarity exceeds the threshold, we regard the person as recorded. Otherwise, the person is denied access until he or she registers facial information with the system.
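
The final comparison step can be sketched as a similarity check between face encodings. The example below is a minimal illustration under assumptions: it uses cosine similarity and a threshold of 0.8, neither of which is specified in the proposal, and the function names and toy three-element encodings are hypothetical. Real face encodings are much longer vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length encoding vectors."""
    if len(a) != len(b):
        raise ValueError("encodings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("encodings must be non-zero")
    return dot / norm


def verify(encoding, enrolled, threshold=0.8):
    """Return the best-matching enrolled name, or None if nobody exceeds the threshold.

    `enrolled` maps a person's name to a stored encoding.
    The threshold value is an assumption and would need tuning on real data.
    """
    best_name, best_score = None, threshold
    for name, reference in enrolled.items():
        score = cosine_similarity(encoding, reference)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Raising the threshold reduces the chance of accepting an unregistered person at the cost of rejecting registered users more often, so the value is a trade-off to settle during testing.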

### Supervising Subsystem

- Capture people's behavior

- Recognize interactions between people and objects

- Identify what people are doing

This subsystem captures and analyzes people's behavior, specifically the interactions between people and objects. Our initial plan is to build on VSG-Net or another established human-object interaction (HOI) model, analyzing and adjusting it to suit our system. The algorithm is a multi-branch network:

- Visual Branch: extracts visual features from people, objects, and the surrounding environment.

- Spatial Attention Branch: models the spatial relationship between human-object pairs.

- Graph Convolutional Branch: treats the scene as a graph, with people and objects as nodes, and models the structural interactions.

This is computationally intensive work: the model must be trained on a dataset and then applied to the real system. The accuracy may not reach 100%, but we will do our best to improve the performance.

### Alarming Subsystem

- Stay silent when normal behaviors are detected

- Sound an alarm when dangerous or non-compliant behaviors are detected

The alarm device is connected to the output of our system and reports dangerous or prohibited actions. If the supervising subsystem detects actions such as "harming people", "illegal experimental operation", or "cheating in exams", the alarm sounds a warning to alert people. To achieve this, a "dangerous action library" containing dangerous behaviors is prepared in advance; when an action identified by the supervising subsystem matches an entry in the library, the alarm is triggered.
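
The library lookup can be as simple as a set-membership test on the interactions reported by the supervising subsystem. The sketch below assumes each detected interaction is a (verb, object) pair, as HOI models typically output. The library entries and function names are hypothetical placeholders, not the project's actual list.

```python
# Hypothetical entries; the real library would be defined per deployment.
DANGEROUS_ACTIONS = frozenset({
    ("pour", "unlabeled_chemical"),
    ("use", "phone"),
})


def matched_actions(detected, library=DANGEROUS_ACTIONS):
    """Return the detected (verb, object) pairs that appear in the library."""
    return [pair for pair in detected if pair in library]


def should_alarm(detected, library=DANGEROUS_ACTIONS):
    """True when at least one detected interaction is in the library."""
    return bool(matched_actions(detected, library))
```

Returning the matched pairs as well as the yes/no decision lets the system report which behavior triggered the alarm.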

## Criteria of Success:

- The system must recognize human faces and determine whether the person is in the backend database.

- The system must detect the person and the surrounding objects shown on the screen and analyze the possible interactions between them.

- Based on the interactions, the system must detect potentially dangerous actions and issue warnings.

## Division of Labor and Responsibilities

All members contribute to the design and development of the project, and we meet regularly to discuss and advance the design. Each member is responsible for a specific part, but his or her work is not limited to that part.

- Shengjian Chen: Responsible for the facial recognition part of the project.

- Huili Tao: Responsible for modifying the HOI algorithm and applying it to our project.

- Zhuping Liu: Responsible for hardware design and system connectivity.

- Xiaolu Liu: Responsible for detail optimization and functional testing.
