TY - JOUR
T1 - On Improved Training of CNN for Acoustic Source Localisation
AU - Vargas, Elizabeth
AU - Hopgood, James R.
AU - Brown, Keith
AU - Subr, Kartic
N1 - Funding Information:
Manuscript received June 1, 2020; revised September 21, 2020 and December 4, 2020; accepted December 22, 2020. Date of publication January 8, 2021; date of current version January 23, 2021. The work of Elizabeth Vargas was supported by the School of Engineering & Physical Sciences at Heriot-Watt University with the EPS PG Research James Watt Scholarship. The work of Kartic Subr was supported by Royal Society’s University Research Fellowship. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Timo Gerkmann. (Corresponding author: Elizabeth Vargas.) Elizabeth Vargas and Keith Brown are with the Institute of Sensors, Signals, and Systems, Heriot-Watt University, EH14 4AS Edinburgh, U.K. (e-mail: [email protected]; [email protected]).
Publisher Copyright:
© 2014 IEEE.
PY - 2021
Y1 - 2021
N2 - Convolutional Neural Networks (CNNs) are a popular choice for estimating Direction of Arrival (DoA) without explicitly estimating delays between multiple microphones. The CNN method first optimises unknown filter weights (of a CNN) by using observations and ground-truth directional information. This trained CNN is then used to predict incident directions given test observations. Most existing methods train using spectrally flat random signals and test using speech. In this paper, which focuses on single-source DoA estimation, we find that training with speech or music signals produces a relative improvement in DoA accuracy for a variety of audio classes across 16 acoustic conditions and 9 DoAs, amounting to an average improvement of around 17% and 19% respectively when compared to training with spectrally flat random signals. This improvement is also observed in scenarios in which the speech and music signals are synthesised using, for example, a Generative Adversarial Network (GAN). When the acoustic environments during test and training are similar and reverberant, training a CNN with speech outperforms Generalized Cross Correlation (GCC) methods by about 125%. When the test conditions are different, a CNN performs comparably. This paper takes a step towards answering open questions in the literature regarding the nature of the signals used during training, as well as the amount of data required for estimating DoA using CNNs.
KW - Direction of arrival
KW - convolutional neural network (CNN)
KW - generative adversarial network (GAN)
KW - microphone arrays
KW - neural networks
UR - http://www.scopus.com/inward/record.url?scp=85099537167&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2021.3049337
DO - 10.1109/TASLP.2021.3049337
M3 - Article
AN - SCOPUS:85099537167
SN - 2329-9290
VL - 29
SP - 720
EP - 732
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
ER -