TY - GEN
T1 - Live Demonstration
T2 - 56th IEEE International Symposium on Circuits and Systems 2023
AU - Bishnu, Abhijeet
AU - Gupta, Ankit
AU - Gogate, Mandar
AU - Dashtipour, Kia
AU - Arslan, Tughrul
AU - Adeel, Ahsan
AU - Hussain, Amir
AU - Sellathurai, Mathini
AU - Ratnarajah, Tharmalingam
N1 - References:
[1] A. Adeel, J. Ahmad, H. Larijani, and A. Hussain, “A novel real-time, lightweight chaotic-encryption scheme for next-generation audio-visual hearing aids,” Cognitive Computation, vol. 12, pp. 589–601, 2020.
[2] A. Bishnu, A. Gupta, M. Gogate, K. Dashtipour, A. Adeel, A. Hussain, M. Sellathurai, and T. Ratnarajah, “A novel frame structure for cloud-based audio-visual speech enhancement in multimodal hearing-aids,” in 2022 IEEE International Conference on E-health Networking, Application & Services (HealthCom), 2022, in press.
[3] M. Gogate, K. Dashtipour, A. Adeel, and A. Hussain, “CochleaNet: A robust language-independent audio-visual model for real-time speech enhancement,” Information Fusion, vol. 63, pp. 273–285, 2020.
Funding Information:
This work was funded by the UK Engineering and Physical Sciences Research Council (EPSRC) programme grant: COG-MHEAR (Grant Ref. No. EP/T021063/1).
Publisher Copyright:
© 2023 IEEE.
PY - 2023/7/21
Y1 - 2023/7/21
N2 - Hearing loss is among the most serious public health problems, affecting up to 20% of the world's population. Even cutting-edge multi-channel audio-only speech enhancement (SE) algorithms used in modern hearing aids face significant hurdles, since they typically amplify noise without improving speech intelligibility in crowded social environments. Recently, we proposed, for the first time, a novel integration of a 5G cloud radio access network, the Internet of Things (IoT), and strong privacy algorithms to develop a 5G IoT-enabled hearing aid (HA) [1]. In this demonstration, we present the first transceiver (PHY-layer) model for cloud-based audio-visual (AV) SE, which meets the high data rate and low latency requirements of forthcoming multi-modal HAs (such as Google glasses with integrated HAs). Even in highly noisy environments such as cafés, clubs, conferences, and meetings, the transceiver [2] transmits raw AV information from the hearing-aid system to a cloud-based platform and receives back a clean speech signal. In Fig. 1, we illustrate an example of our cloud-based AV SE hearing-aid demonstration: the left-side computer and Universal Software Radio Peripheral (USRP) X310 function as the IoT system (hearing aid), the right-side USRP serves as the access point or base station, and the right-side computer serves as the cloud server running neural-network-based SE models. Note that the channel from the HA device to the cloud is defined as the uplink channel, whereas the channel from the access point (cloud) to the HA device is defined as the downlink channel. Because the sensitivity of the data received at HA devices varies over time, the uplink channel must support a variety of data rates. Accordingly, a customized long-term evolution (LTE)-based frame structure is developed for uplink data transmission, providing error-correction coding over the 1.4 MHz and 3 MHz bandwidths with a variety of modulation schemes and code rates. The cloud access point, by contrast, needs only a limited transmission rate because it sends back audio data alone to the HA device; to support real-time AV SE, a modified LTE frame structure with 1.4 MHz bandwidth is therefore developed. The AV SE algorithm takes cropped lip images of the target speaker and a noisy speech spectrogram as input and produces an ideal binary mask that suppresses noise-dominant regions while enhancing speech-dominant regions. To reduce processing latency, we use depth-wise separable convolutions, a reduced STFT window size of 32 ms, a smaller STFT window shift of 8 ms, and 64 convolutions in the audio feature extraction layers of our CochleaNet [3] multi-modal AV SE neural network architecture. In addition, a visual feature extraction framework is employed. The proposed architecture processes streaming data frame by frame. Users will thus experience, for the first time, a real-world physical-layer transceiver that performs AV SE in real time under strict latency and data rate requirements. For this demonstration, we will bring two computers and two USRP X310 devices.
AB - Hearing loss is among the most serious public health problems, affecting up to 20% of the world's population. Even cutting-edge multi-channel audio-only speech enhancement (SE) algorithms used in modern hearing aids face significant hurdles, since they typically amplify noise without improving speech intelligibility in crowded social environments. Recently, we proposed, for the first time, a novel integration of a 5G cloud radio access network, the Internet of Things (IoT), and strong privacy algorithms to develop a 5G IoT-enabled hearing aid (HA) [1]. In this demonstration, we present the first transceiver (PHY-layer) model for cloud-based audio-visual (AV) SE, which meets the high data rate and low latency requirements of forthcoming multi-modal HAs (such as Google glasses with integrated HAs). Even in highly noisy environments such as cafés, clubs, conferences, and meetings, the transceiver [2] transmits raw AV information from the hearing-aid system to a cloud-based platform and receives back a clean speech signal. In Fig. 1, we illustrate an example of our cloud-based AV SE hearing-aid demonstration: the left-side computer and Universal Software Radio Peripheral (USRP) X310 function as the IoT system (hearing aid), the right-side USRP serves as the access point or base station, and the right-side computer serves as the cloud server running neural-network-based SE models. Note that the channel from the HA device to the cloud is defined as the uplink channel, whereas the channel from the access point (cloud) to the HA device is defined as the downlink channel. Because the sensitivity of the data received at HA devices varies over time, the uplink channel must support a variety of data rates. Accordingly, a customized long-term evolution (LTE)-based frame structure is developed for uplink data transmission, providing error-correction coding over the 1.4 MHz and 3 MHz bandwidths with a variety of modulation schemes and code rates. The cloud access point, by contrast, needs only a limited transmission rate because it sends back audio data alone to the HA device; to support real-time AV SE, a modified LTE frame structure with 1.4 MHz bandwidth is therefore developed. The AV SE algorithm takes cropped lip images of the target speaker and a noisy speech spectrogram as input and produces an ideal binary mask that suppresses noise-dominant regions while enhancing speech-dominant regions. To reduce processing latency, we use depth-wise separable convolutions, a reduced STFT window size of 32 ms, a smaller STFT window shift of 8 ms, and 64 convolutions in the audio feature extraction layers of our CochleaNet [3] multi-modal AV SE neural network architecture. In addition, a visual feature extraction framework is employed. The proposed architecture processes streaming data frame by frame. Users will thus experience, for the first time, a real-world physical-layer transceiver that performs AV SE in real time under strict latency and data rate requirements. For this demonstration, we will bring two computers and two USRP X310 devices.
UR - http://www.scopus.com/inward/record.url?scp=85167666679&partnerID=8YFLogxK
U2 - 10.1109/ISCAS46773.2023.10182060
DO - 10.1109/ISCAS46773.2023.10182060
M3 - Conference contribution
AN - SCOPUS:85167666679
BT - 56th IEEE International Symposium on Circuits and Systems
PB - IEEE
Y2 - 21 May 2023 through 25 May 2023
ER -