Live Demonstration: Cloud-based Audio-Visual Speech Enhancement in Multimodal Hearing-aids

Abhijeet Bishnu, Ankit Gupta, Mandar Gogate, Kia Dashtipour, Tughrul Arslan, Ahsan Adeel, Amir Hussain, Mathini Sellathurai, Tharmalingam Ratnarajah

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Hearing loss is among the most serious public health problems, affecting up to 20% of the worldwide population. Even cutting-edge multi-channel audio-only speech enhancement (SE) algorithms used in modern hearing aids face significant hurdles, since they typically amplify noise while failing to improve speech intelligibility in crowded social environments. Recently, we proposed, for the first time, a novel integration of a 5G cloud-radio access network, the Internet of Things (IoT), and strong privacy algorithms to develop a 5G IoT-enabled hearing aid (HA) [1]. In this demonstration, we show the first-ever transceiver (PHY-layer) model for cloud-based audio-visual (AV) SE, which meets the high-data-rate and low-latency requirements of forthcoming multi-modal HAs (such as Google glasses with integrated HAs). Even in highly noisy conditions such as cafés, clubs, conferences, and meetings, the transceiver [2] transmits raw AV information from a hearing aid system to a cloud-based platform and receives back a clean signal.

In Fig. 1, we illustrate an example of our cloud-based AV SE hearing aid demonstration. Here, the left-side computer and Universal Software Radio Peripheral (USRP) X310 function as the IoT system (hearing aid), the right-side USRP serves as an access point or base station, and the right-side computer serves as a cloud server running the NN-based SE models. Note that the channel from the HA device to the cloud is defined as the uplink channel, whereas the channel from the access point (cloud) to the HA device is defined as the downlink channel. Given the time-varying sensitivity of the data received at HA devices, the uplink channel must support a range of data rates. A customized long-term evolution (LTE)-based frame structure is therefore developed for uplink data transmission; it provides error-correction codes in the 1.4 MHz and 3 MHz bandwidths with a variety of modulations and code rates.
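The rate adaptation described above can be sketched as a simple modulation-and-coding-scheme (MCS) lookup. The table entries, `pick_mcs` helper, and SNR switching points below are purely illustrative assumptions, not the actual LTE configuration used in the demonstration:

```python
# Hypothetical MCS table: (name, bits per symbol, code rate).
# Entries and SNR thresholds are illustrative only, not the paper's values.
MCS_TABLE = [("QPSK", 2, 1 / 3), ("QPSK", 2, 1 / 2), ("16QAM", 4, 1 / 2),
             ("16QAM", 4, 3 / 4), ("64QAM", 6, 3 / 4)]
SNR_FLOORS_DB = [0.0, 5.0, 10.0, 15.0, 20.0]  # assumed switching points (dB)

def pick_mcs(snr_db):
    """Return the highest-rate scheme whose assumed SNR floor is met."""
    best = MCS_TABLE[0]  # fall back to the most robust scheme
    for mcs, floor in zip(MCS_TABLE, SNR_FLOORS_DB):
        if snr_db >= floor:
            best = mcs
    return best
```

Under this sketch, a weak uplink falls back to QPSK with a 1/3 code rate, while a strong link can carry 64-QAM at a 3/4 rate, trading robustness for throughput.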
Conversely, the cloud access point needs only a limited transmission rate, because it transmits audio data alone to the HA equipment; to support real-time AV SE, a modified LTE frame structure with 1.4 MHz of bandwidth is developed for the downlink. The AV SE algorithm receives cropped lip images of the target speaker and a noisy speech spectrogram, and produces an ideal binary mask that suppresses noise-dominant regions while enhancing speech-dominant ones. To reduce processing latency, we use depth-wise separable convolutions, a reduced STFT window size of 32 ms, a smaller STFT window shift of 8 ms, and 64 convolutions in the audio feature extraction layers of our CochleaNet [3] multi-modal AV SE neural network architecture. The visual feature extraction framework of [3] is likewise employed. Our proposed architecture can handle streaming data frame-by-frame. Thus, users will experience, for the first time, a real-world physical-layer transceiver that performs AV SE in real time under strict latency and data-rate requirements. For this demonstration, we will bring two computers and two USRP X310 devices.
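The masking step can be illustrated with a minimal NumPy sketch. Here the ideal binary mask is derived from oracle speech and noise spectrograms (in the demonstration, the mask is predicted by the neural network from AV inputs); the 32 ms window and 8 ms shift match the values above, while the test signals and the 0 dB local-SNR threshold are stand-in assumptions:

```python
import numpy as np

def stft(x, win_len, hop):
    """Single-sided STFT with a Hann window (no padding)."""
    w = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * w
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

fs = 16000
win = int(0.032 * fs)   # 32 ms STFT window, as in the text
hop = int(0.008 * fs)   # 8 ms window shift, as in the text

t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 440 * t)   # stand-in for clean speech
noise = 0.5 * np.random.randn(fs)      # stand-in for babble noise

S, N = stft(speech, win, hop), stft(noise, win, hop)
mask = (np.abs(S) > np.abs(N)).astype(float)      # 0 dB local-SNR binary mask
enhanced = stft(speech + noise, win, hop) * mask  # keep speech-dominant bins
```

Zeroing the noise-dominant time-frequency bins in this way attenuates the interference while leaving the speech-dominant regions of the spectrogram intact.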

Original language: English
Title of host publication: 56th IEEE International Symposium on Circuits and Systems
ISBN (Electronic): 9781665451093
Publication status: Published - 21 Jul 2023
Event: 56th IEEE International Symposium on Circuits and Systems 2023 - Monterey, United States
Duration: 21 May 2023 - 25 May 2023


Conference: 56th IEEE International Symposium on Circuits and Systems 2023
Abbreviated title: ISCAS 2023
Country/Territory: United States

ASJC Scopus subject areas

  • Electrical and Electronic Engineering


