Project
| # | Title | Team Members | TA | Documents | Sponsor |
|---|---|---|---|---|---|
| 80 | Edge-AI based audio classifier | Ahaan Joishy, Kavin Manivasagam, Om Dhingra | Weijie Liang | proposal1.pdf, proposal2.pdf | |
**Problem Overview**

Most audio-based embedded systems today collect large amounts of raw sensor data but use only simple threshold-based logic for classification. These methods are highly sensitive to noise, fail to perform accurately across varying conditions, and break down without external computation or cloud services. There is therefore a need for a method that converts raw captured signals into meaningful classifications locally, under tight power and memory constraints.

**Solution Overview**

The proposed project is an Edge-AI embedded system that classifies audio signals (e.g., a clap, laugh, snap, stomp, or speech) in real time using a small neural network. The system uses a single sensor, a MEMS I2S digital microphone, to collect audio data, and reports each classification to the user through an LED-based output. The system thus eliminates the need for cloud computation and demonstrates the strength of machine learning even under tight constraints.

**Solution Components**

*Sensor Subsystem:* A MEMS microphone with an I2S interface will collect raw audio signals (e.g., a clap, speech, or a snap). Audio will be sampled at a target rate of 16 kHz, which is sufficient for speech and common environmental sounds and is the industry standard used in voice assistants and speech recognition. We use a digital microphone because it removes the need for an analog amplifier.

*Processing Subsystem:* We will use an STM32F411 microcontroller. We chose it because it offers 512 kB of flash and 128 kB of RAM, which is crucial for running the arithmetic of a neural network; it has built-in DSP instructions, which are crucial for converting raw audio into MFCC (spectrogram-like) features in real time; and it includes the floating-point unit (FPU) that a neural network requires.

The signal chain (to capture signals) is as follows: the microphone captures audio and sends it digitally over I2S to the microcontroller, which uses DMA to store the data in memory with minimal CPU involvement; the audio frames are converted into MFCC features; and the features are fed into our neural-network model. (Illustrative firmware sketches of these stages appear after the Alternatives section below.)

The ML pipeline is as follows: the obtained MFCC features are fed into our small, dense neural network for classification into predefined types. TensorFlow Lite Micro will be used to facilitate deployment on the microcontroller without an OS or internet connection (we may also try ExecuTorch if time permits). The model size will be kept under 20 kB to ensure real-time performance.

*Power Subsystem:* A 5 V USB input will power the board, stepped down to 3.3 V by an on-board voltage regulator. Decoupling capacitors and filtering components will be used to reduce electrical noise that could interfere with stable operation.

**Criterion for Success**

- The device classifies at least 3 different sound types correctly with more than 85% accuracy on the recorded test set.
- End-to-end latency (from sound to LED output) is less than 100 ms.
- Current draw stays under 60 mA.

*Test protocol:* The test set will consist of around 50 samples per class, gathered from a variety of noisy and quiet environments. We will aim for the model to correctly classify 3 different sound types, extending to 5 if time permits.
**Alternatives**

Many existing sound classification systems use cloud-based processing or rely on high-power computing platforms such as smartphones and computers; these approaches require a continuous internet connection. Other methods use threshold-based audio detection, but these cannot classify different types of sounds accurately in varying environments. Our solution differs by performing audio classification on a low-power embedded device with a simple neural network, without external computing or complex hardware.
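To make the capture stage concrete, below is a minimal double-buffered I2S capture sketch. It assumes the STM32 HAL with an I2S handle `hi2s1` already configured (e.g., by STM32CubeMX) for 16 kHz mono capture over a circular DMA stream; the buffer sizes and names are illustrative choices, not part of the proposal.

```cpp
// Double-buffered I2S capture sketch (STM32 HAL; illustrative).
// Assumes hi2s1 is configured for 16 kHz mono with circular DMA.
#include "stm32f4xx_hal.h"

#define FRAME_SAMPLES 512                      // 32 ms of audio at 16 kHz
static int16_t audio_buf[2 * FRAME_SAMPLES];   // ping-pong halves
extern I2S_HandleTypeDef hi2s1;                // provided by CubeMX init code

volatile int16_t *ready_frame = nullptr;       // set when a half fills

void audio_start(void) {
  // Circular DMA fills audio_buf continuously; the CPU only reacts
  // to the half/full-transfer callbacks below.
  HAL_I2S_Receive_DMA(&hi2s1, (uint16_t *)audio_buf, 2 * FRAME_SAMPLES);
}

extern "C" void HAL_I2S_RxHalfCpltCallback(I2S_HandleTypeDef *hi2s) {
  ready_frame = &audio_buf[0];                 // first half is ready
}

extern "C" void HAL_I2S_RxCpltCallback(I2S_HandleTypeDef *hi2s) {
  ready_frame = &audio_buf[FRAME_SAMPLES];     // second half is ready
}
```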
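The MFCC stage can use CMSIS-DSP's MFCC module, which exploits the F411's DSP instructions and FPU. A sketch follows; the coefficient tables (`mfcc_dct_coefs`, `mfcc_filter_pos`, and so on) must be generated offline (CMSIS-DSP ships Python helper scripts for this), and the frame and filterbank sizes here are assumptions rather than final design values.

```cpp
// MFCC extraction sketch using CMSIS-DSP (illustrative values).
#include "arm_math.h"

#define FFT_LEN  512   // one 32 ms frame at 16 kHz
#define N_MELS    20   // mel filterbank size (assumed)
#define N_MFCC    10   // DCT outputs fed to the network (assumed)

// Tables generated offline with CMSIS-DSP's MFCC helper scripts.
extern const float32_t mfcc_dct_coefs[];
extern const uint32_t  mfcc_filter_pos[];
extern const uint32_t  mfcc_filter_len[];
extern const float32_t mfcc_filter_coefs[];
extern const float32_t mfcc_window[];

static arm_mfcc_instance_f32 mfcc;
static float32_t scratch[FFT_LEN + 2];  // temp buffer for the internal FFT

void mfcc_setup(void) {
  arm_mfcc_init_f32(&mfcc, FFT_LEN, N_MELS, N_MFCC,
                    mfcc_dct_coefs, mfcc_filter_pos, mfcc_filter_len,
                    mfcc_filter_coefs, mfcc_window);
}

// Convert one int16 PCM frame into N_MFCC float features.
void mfcc_frame(const int16_t *pcm, float32_t *features) {
  static float32_t f32[FFT_LEN];
  for (int i = 0; i < FFT_LEN; i++)
    f32[i] = pcm[i] / 32768.0f;                 // scale to [-1, 1)
  arm_mfcc_f32(&mfcc, f32, features, scratch);  // note: f32 is clobbered
}
```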
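For the inference stage, a TensorFlow Lite Micro setup for a small dense network might look like the sketch below. `g_model_data` stands in for the exported model flatbuffer, and the arena size and operator list are assumptions that depend on the final trained model; a float (non-quantized) model is assumed for simplicity.

```cpp
// Inference sketch with TensorFlow Lite Micro (illustrative).
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

extern const unsigned char g_model_data[];  // model flatbuffer (e.g., xxd dump)

constexpr int kArenaSize = 16 * 1024;       // tensor scratch memory (assumed)
static uint8_t tensor_arena[kArenaSize];

static const tflite::Model *model;
static tflite::MicroInterpreter *interpreter;

void net_setup() {
  model = tflite::GetModel(g_model_data);

  // Register only the ops a small dense net needs.
  static tflite::MicroMutableOpResolver<3> resolver;
  resolver.AddFullyConnected();
  resolver.AddRelu();
  resolver.AddSoftmax();

  static tflite::MicroInterpreter static_interpreter(
      model, resolver, tensor_arena, kArenaSize);
  interpreter = &static_interpreter;
  interpreter->AllocateTensors();
}

// Run one classification over a feature vector; returns the argmax class.
int net_classify(const float *features, int n_features) {
  TfLiteTensor *in = interpreter->input(0);
  for (int i = 0; i < n_features; ++i) in->data.f[i] = features[i];

  if (interpreter->Invoke() != kTfLiteOk) return -1;

  TfLiteTensor *out = interpreter->output(0);
  int n_classes = out->dims->data[out->dims->size - 1];
  int best = 0;
  for (int i = 1; i < n_classes; ++i)
    if (out->data.f[i] > out->data.f[best]) best = i;
  return best;  // index of the predicted sound type
}
```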
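Finally, the sub-100 ms latency criterion can be checked on-target with the Cortex-M4's DWT cycle counter, as in this sketch; the 96 MHz core clock is an assumed configuration for the F411.

```cpp
// Latency check sketch using the Cortex-M4 DWT cycle counter.
#include "stm32f4xx.h"

#define CORE_CLOCK_HZ 96000000u  // assumed SYSCLK for the F411

static inline void cycle_counter_init(void) {
  CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable trace/DWT block
  DWT->CYCCNT = 0;
  DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             // start the counter
}

static inline uint32_t cycles_to_ms(uint32_t cycles) {
  return cycles / (CORE_CLOCK_HZ / 1000u);
}

// Usage: wrap the frame -> MFCC -> inference -> LED path.
//   uint32_t t0 = DWT->CYCCNT;
//   ... process one frame and update the LED ...
//   uint32_t ms = cycles_to_ms(DWT->CYCCNT - t0);  // expect < 100
```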