Speech-to-Speech Assessment vs. Text-based Assessment of Simultaneous Interpreting. Tapping the potential of LLM and MTQE metrics in realising automatic assessment

Xiaoman Wang, Binhua Wang

Research output: Contribution to journal › Article › peer-review

Abstract

There is growing interest in using Machine Translation Quality Estimation (MTQE) metrics and models, as well as Large Language Models (LLMs), for automatically assessing the information fidelity of interpreting. However, studies have so far been limited to text-based assessment. This study compares speech-to-speech assessment with text-based assessment. The experiment begins by segmenting audio recordings of simultaneous interpreting into one-minute intervals, isolating the source speech and target interpretation within each segment. We utilise LLMs, BLASER, and speech embeddings and last hidden states from HuBERT and Wav2Vec to assess interpreting quality at the speech level. Additionally, we explore the use of ASR for transcribing segments, coupled with human verification, and apply an LLM along with MTQE models such as COMET and TransQuest for minute-level text-based assessment. The findings indicate that LLMs cannot directly conduct speech-based assessment of interpreting quality but demonstrate certain capabilities in text-based assessment when evaluating transcriptions, showing a moderately high correlation with human ratings (Pearson r = 0.66). In contrast, BLASER operates directly at the speech level and demonstrates a comparable correlation (r = 0.63) with human judgments, confirming its potential for speech-based quality assessment. A combined metric proposed in this study, integrating both speech-to-speech and text-based assessments, accounts for approximately 47% of the variance in human judgment scores, highlighting the potential of integrated metrics to advance the development of machine learning models for interpreting quality assessment. Such metrics offer an automated, cost-effective, and labour-saving method for evaluating simultaneous interpreting, demonstrating their utility in enhancing assessment techniques.
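The combined metric described above can be understood as a least-squares combination of speech-based and text-based scores fitted against human ratings, with the variance explained given by R². The sketch below is a minimal, hypothetical illustration of that idea (the variable names and toy data are assumptions, not the authors' actual data or model):

```python
import numpy as np

def combined_metric_r2(speech_scores, text_scores, human_ratings):
    """Fit a linear combination of speech-based and text-based scores
    to human ratings via ordinary least squares and return R^2,
    i.e. the fraction of variance in human ratings explained.
    Illustrative sketch only; not the paper's actual model."""
    s = np.asarray(speech_scores, dtype=float)
    t = np.asarray(text_scores, dtype=float)
    y = np.asarray(human_ratings, dtype=float)
    # Design matrix with an intercept column.
    X = np.column_stack([np.ones_like(s), s, t])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot
```

With hypothetical per-segment scores, `combined_metric_r2(blaser_scores, llm_scores, human_ratings)` would return a value such as 0.47, the proportion of variance in human judgments captured by the fitted combination.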
Original language: English
Pages (from-to): 23-54
Number of pages: 32
Journal: Linguistica Antverpiensia New Series – Themes in Translation Studies
Volume: 24
Publication status: Published - 16 Dec 2025

Keywords

  • simultaneous interpreting
  • SI
  • speech-to-speech assessment
  • S2S
  • text-based assessment
  • automatic assessment
  • large language model
  • LLM
  • machine translation quality estimation
  • MTQE
