Neural network models vs. MT evaluation metrics: a comparison between two approaches to automatic assessment of information fidelity in consecutive interpreting

Research output: Contribution to journal › Article › peer-review

Abstract

Informational fidelity is the most important aspect of interpreting quality and can be assessed in two ways: human assessment and automated machine translation (MT) evaluation metrics. Although prevalent in interpreter training, human assessment is time-consuming, labour-intensive, and mentally demanding. Among automated methods, MT metrics assess fidelity through inter-lingual comparisons between the interpretation and multiple reference translations, whereas reference-free neural network models assess fidelity through direct cross-lingual comparison. This study compares the automated scores derived from these two approaches with human scores at the sentence level to explore the correlation between machine and human assessments. The applicability of machine evaluation was further substantiated through three-cluster and four-cluster analyses. The results demonstrate that neural network models trained to generate cross-lingual embeddings, together with large language models (LLMs), outperform traditional MT evaluation metrics. Moderate correlations were observed with embeddings from LLMs such as GPT (Pearson's r = 0.47) and LLaMA (r = 0.46), while stronger correlations were noted with GPT scores (r = 0.53) and MUSE embeddings (r = 0.55). These correlations are statistically significant and meaningful, though not so strong as to suggest near-perfect predictability. Cluster analysis reveals that aggregating multiple machine evaluation metrics can effectively approximate human judgments, highlighting distinct levels of fidelity and the need for a composite metric. The study posits that pre-trained neural network models have good potential as a viable method for automated assessment of information fidelity in interpreting. This approach holds promise for broader application in low-stakes interpreting assessments and can supplement human scoring, particularly in large-scale assessment tasks.
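The sentence-level comparison described above can be sketched as a Pearson correlation between two lists of fidelity scores. The sketch below uses hypothetical placeholder scores, not the study's data, and implements Pearson's r directly rather than reproducing the authors' actual pipeline.

```python
# Minimal sketch: Pearson's r between human and machine sentence-level
# fidelity scores. All score values here are hypothetical placeholders.
from math import sqrt

def pearson(x, y):
    """Pearson's r between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical fidelity scores on a 0-1 scale, one per sentence.
human_scores = [0.90, 0.40, 0.70, 0.20, 0.80]
metric_scores = [0.85, 0.50, 0.60, 0.30, 0.90]

r = pearson(human_scores, metric_scores)
print(f"Pearson's r = {r:.2f}")
```

In the study, an r around 0.5 (as reported for GPT scores and MUSE embeddings) would indicate a moderate-to-strong, statistically meaningful agreement between machine and human judgments without implying near-perfect predictability.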
Original language: English
Journal: Humanities and Social Sciences Communications
Early online date: 12 Mar 2026
DOIs
Publication status: E-pub ahead of print - 12 Mar 2026

Keywords

  • Education
  • Language and linguistics
  • Science, technology and society
