# Project 37: Visual chatting and Real-time acting Robot

Sponsor: Gaoang Wang
Documents: design_document1.pdf, design_document2.pdf, final_paper1.pdf, final_paper2.pdf, final_paper3.pdf, proposal1.pdf, proposal2.pdf, proposal3.pdf, video1.mp4
Group members:
Haozhe Chi, haozhe4
Minghua Yang, minghua3
Zonghai Jing, zonghai2
Jiatong Li, jl180
Problem:
With the rise of large language models (LLMs), large visual language models (LVLMs) have achieved great success in recent AI development. However, configuring an LVLM system on a robot and making all of the hardware work together around that system remains a major challenge. We aim to design an LVLM-based robot that can react to multimodal inputs.
Solution overview:
We aim to deliver an LVLM system (software); a robot arm for actions such as grabbing objects (hardware); mobility equipment for moving through the environment (hardware); a camera for real-time visual input (hardware); a laser tracker for pointing out target objects (hardware); and audio equipment for voice input and output (hardware).
Solution components:
LVLM system:
We will deploy a BLIP-2-based AI model for visual language processing. We will incorporate the strengths of several recent visual-language models, including LLaVA, VideoChat, and Video-LLaMA, to design a better real-time visual language processing system. This system should realize real-time visual chatting with less object hallucination.
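As a sketch of how the BLIP-2 backbone might be queried (assuming the Hugging Face transformers implementation and the Salesforce/blip2-opt-2.7b checkpoint; the chat-history helper and function names are our own, not part of the design):

```python
def build_prompt(history, question):
    """Flatten a (question, answer) chat history into BLIP-2's
    'Question: ... Answer: ...' prompt convention."""
    turns = [f"Question: {q} Answer: {a}" for q, a in history]
    turns.append(f"Question: {question} Answer:")
    return " ".join(turns)

def answer(image_path, history, question,
           model_name="Salesforce/blip2-opt-2.7b"):
    """Answer one question about an image, conditioned on history."""
    # Imports deferred: loading transformers and the checkpoint is
    # heavy, and build_prompt is usable without them.
    from PIL import Image
    from transformers import Blip2Processor, Blip2ForConditionalGeneration
    processor = Blip2Processor.from_pretrained(model_name)
    model = Blip2ForConditionalGeneration.from_pretrained(model_name)
    inputs = processor(images=Image.open(image_path),
                       text=build_prompt(history, question),
                       return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40)
    return processor.decode(out[0], skip_special_tokens=True).strip()
```

For real-time use, the model and processor would be loaded once at startup rather than per call as in this sketch.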
Robot arm and wheels:
We will use the ROS environment to control robot movements. We will apply to use the robot arms from the ZJUI ECE470 labs and purchase wheels for locomotion; we may use a four-wheel or a tracked design.
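One way to close the loop between perception and the wheels is a small proportional controller whose output feeds a geometry_msgs/Twist message on /cmd_vel; a minimal sketch, assuming a ROS 1 (rospy) differential-drive setup, with all gains, limits, and the topic name being placeholder assumptions:

```python
def velocity_command(bearing_rad, distance_m,
                     max_linear=0.3, max_angular=1.0, k_ang=1.5):
    """Proportional steering toward a target: turn to cancel the
    bearing error, and slow forward motion while turning."""
    angular = max(-max_angular, min(max_angular, k_ang * bearing_rad))
    linear = max_linear if distance_m > 0.1 else 0.0  # stop when close
    linear *= max(0.0, 1.0 - abs(bearing_rad))        # slow while turning
    return linear, angular

def publish_command(bearing_rad, distance_m):
    # rospy / geometry_msgs come from the ROS install; imported lazily
    # so the controller above stays testable without ROS.
    import rospy
    from geometry_msgs.msg import Twist
    pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
    msg = Twist()
    msg.linear.x, msg.angular.z = velocity_command(bearing_rad, distance_m)
    pub.publish(msg)
```

Keeping the control law a pure function makes it easy to tune and unit-test before any hardware is attached.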
Camera:
We will configure cameras for real-time image input. 3D reconstruction may be needed, depending on our LVLM system design. If multi-view inputs are needed, we will design a suitable multi-camera configuration.
Audio processing:
We will use two audio processing systems: speech recognition and text-to-speech generation, responsible for audio input and output, respectively. We will use audio broadcast components, such as a speaker, to make the robot talk.
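The two audio paths could be wired together roughly as below (a sketch assuming the third-party speech_recognition and pyttsx3 packages; the command grammar in parse_command is entirely our own assumption):

```python
def parse_command(utterance):
    """Map a recognized utterance to an (action, argument) pair;
    anything without a known verb falls through to free-form chat."""
    words = utterance.lower().split()
    for verb in ("grab", "find", "track"):
        if verb in words:
            i = words.index(verb)
            return verb, " ".join(words[i + 1:]) or None
    return "chat", utterance

def listen_once():
    # speech_recognition needs a microphone; imported lazily so
    # parse_command stays testable without audio hardware.
    import speech_recognition as sr
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio)

def speak(text):
    import pyttsx3
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()
```

For example, parse_command("Please grab the red cup") yields ("grab", "the red cup"), which the robot-arm controller could act on while "chat" utterances go to the LVLM.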
Criterion for success:
Success means the robot demonstrates voice recognition, laser tracking, real-time visual chatting, a multimodal processing system, a multi-view camera configuration, and the ability to identify a specified object, move to it, and grab it. All hardware parts should cooperate well in the final demo: every single component must not only function well on its own, but also combine to perform more advanced behaviors. For instance, the robot should be able to move toward a target object while chatting with a human.

# Wireless IntraNetwork

Featured Project

There is a drastic lack of networking infrastructure in unstable or remote areas, where businesses don't think they can reliably recoup the large initial cost of construction. Our goal is to bring the internet to these areas. We will use a network of extremely affordable (<$20 per node, made possible by IoT technology) solar-powered nodes that communicate via Wi-Fi with one another and with personal devices, donated through organizations such as OLPC, creating an intranet. Each node covers an area of roughly 600-800 ft in every direction with 4 MB/s access and 16 GB of cached data, saving valuable bandwidth. Internal communication applications will be provided, minimizing expensive and slow connections to the global internet. Several solutions exist, but all have failed due to per-node costs of over $200 or a lack of networking capability.
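A back-of-envelope check on the node budget helps here (a sketch: the 700 ft radius splits the quoted 600-800 ft range, and the overlap factor is our assumption to keep neighboring nodes within Wi-Fi range of each other; only the <$20 node cost comes from the text above):

```python
import math

def nodes_needed(area_sq_miles, node_radius_ft=700.0, overlap=1.5):
    """Rough count of mesh nodes to blanket an area, padding the
    ideal one-circle-per-node count by an overlap factor."""
    area_sq_ft = area_sq_miles * 5280.0 ** 2   # 5280 ft per mile
    per_node_sq_ft = math.pi * node_radius_ft ** 2
    return math.ceil(overlap * area_sq_ft / per_node_sq_ft)

def deployment_cost(area_sq_miles, cost_per_node=20.0):
    """Hardware-only cost at the <$20/node target."""
    return nodes_needed(area_sq_miles) * cost_per_node
```

At these assumed parameters, one square mile needs on the order of 28 nodes, i.e. roughly $560 of node hardware, which is comparable to just two or three nodes of the $200-per-node alternatives mentioned above.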

To connect to the internet at large, a more powerful “server” may be added. This server hooks into the network like other nodes, but contains a cellular connection to connect to the global internet. Any device on the network will be able to access the web via the server’s connection, effectively spreading the cost of a single cellular data plan (which is too expensive for individuals in rural areas). The server also contains a continually-updated several-terabyte cache of educational data and programs, such as Wikipedia and Project Gutenberg. This data gives students and educators high-speed access to resources. Working in harmony, these two components foster economic growth and education, while significantly reducing the costs of adding future infrastructure.