Abstract
There is growing interest in using Machine Translation Quality Estimation (MTQE) metrics, models, and Large Language Models (LLMs) to automatically assess the information fidelity of interpreting. However, studies to date have been limited to text-based assessment. This study compares speech-to-speech assessment with text-based assessment. The experiment begins by segmenting audio recordings of simultaneous interpreting into one-minute intervals, isolating the source speech and the target interpretation within each segment. We use LLMs, BLASER, and speech embeddings and last hidden states from HuBERT and Wav2Vec to assess interpreting quality at the speech level. Additionally, we explore the use of ASR to transcribe the segments, coupled with human verification, and apply an LLM along with MTQE models such as COMET and TransQuest for minute-level text-based assessment. The findings indicate that LLMs cannot directly conduct speech-based assessment of interpreting quality but demonstrate some capability in text-based assessment when evaluating transcriptions, showing a moderately high correlation with human ratings (Pearson r = 0.66). In contrast, BLASER operates directly at the speech level and demonstrates a comparable correlation (r = 0.63) with human judgments, confirming its potential for speech-based quality assessment. A combined metric proposed in this study, integrating both speech-to-speech and text-based assessments, accounts for approximately 47% of the variance in human judgment scores, highlighting the potential of integrated metrics to advance the development of machine learning models for interpreting quality assessment. Such metrics offer an automated, cost-effective, and labour-saving method for evaluating simultaneous interpreting, demonstrating their utility in enhancing assessment techniques.
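To illustrate the speech-level comparison the abstract describes, the sketch below shows one plausible way to score a one-minute segment pair from encoder outputs: mean-pool the frame-level last hidden states of the source and target audio (as a HuBERT- or Wav2Vec-style encoder would produce) and take their cosine similarity. This is a minimal illustration, not the paper's actual pipeline; the function name, pooling choice, and the random stand-in arrays are all assumptions.

```python
import numpy as np

def segment_similarity(src_states: np.ndarray, tgt_states: np.ndarray) -> float:
    """Cosine similarity between mean-pooled frame embeddings.

    src_states / tgt_states: (n_frames, hidden_dim) arrays standing in for
    the last hidden states a speech encoder (e.g. HuBERT) returns for one
    one-minute source/interpretation segment. Shapes are illustrative.
    """
    src_vec = src_states.mean(axis=0)  # pool frames into one segment vector
    tgt_vec = tgt_states.mean(axis=0)
    return float(
        np.dot(src_vec, tgt_vec)
        / (np.linalg.norm(src_vec) * np.linalg.norm(tgt_vec))
    )

# Toy stand-ins for encoder outputs (768-dim, as in HuBERT-base).
rng = np.random.default_rng(0)
src = rng.normal(size=(2999, 768))   # ~60 s at ~50 frames/s
tgt = rng.normal(size=(2987, 768))   # interpretation segment, slightly shorter
score = segment_similarity(src, tgt)
assert -1.0 <= score <= 1.0
```

In a real setting the per-segment similarity scores would then be compared against human ratings (e.g. via Pearson correlation), as done for BLASER and the text-based metrics in the study.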
| Original language | English |
|---|---|
| Pages (from-to) | 23-54 |
| Number of pages | 32 |
| Journal | Linguistica Antverpiensia New Series – Themes in Translation Studies |
| Volume | 24 |
| Publication status | Published - 16 Dec 2025 |
Keywords
- simultaneous interpreting
- SI
- speech-to-speech assessment
- S2S
- text-based assessment
- automatic assessment
- large language model
- LLM
- machine translation quality estimation
- MTQE