TY - JOUR
T1 - Multi-Modal Multi-Stage Multi-Task Learning for Occlusion-Aware Facial Landmark Localisation
AU - Ng, Yean Chun
AU - Belyaev, Alexander G.
AU - Choong, Florence
AU - Suandi, Shahrel Azmin
AU - Chuah, Joon Huang
AU - Rudrusamy, Bhuvendhraa
PY - 2026/1
Y1 - 2026/1
N2 - Thermal facial imaging enables non-contact measurements of face heat patterns that are valuable for healthcare and affective computing, but common occluders (glasses, masks, scarves) and the single-channel, texture-poor nature of thermal frames make robust landmark localisation and visibility estimation challenging. We propose M3MSTL, a multi-modal, multi-stage, multi-task framework for occlusion-aware landmarking on thermal faces. M3MSTL pairs a ResNet-50 backbone with two lightweight heads: a compact fully connected landmark regressor and a Vision Transformer occlusion classifier that explicitly fuses per-landmark temperature cues. A three-stage curriculum (mask-based backbone pretraining, head specialisation with a frozen trunk, and final joint fine-tuning) stabilises optimisation and improves generalisation from limited thermal data. On the TFD68 dataset, M3MSTL substantially improves both visibility and localisation: the occlusion accuracy reaches 91.8% (baseline 89.7%), the mean NME reaches 0.246 (baseline 0.382), the ROC–AUC reaches 0.974, and the AP is 0.966. Paired statistical tests confirm that these gains are significant. Our approach aims to improve the reliability of temperature-based biometric and clinical measurements in the presence of realistic occluders.
AB - Thermal facial imaging enables non-contact measurements of face heat patterns that are valuable for healthcare and affective computing, but common occluders (glasses, masks, scarves) and the single-channel, texture-poor nature of thermal frames make robust landmark localisation and visibility estimation challenging. We propose M3MSTL, a multi-modal, multi-stage, multi-task framework for occlusion-aware landmarking on thermal faces. M3MSTL pairs a ResNet-50 backbone with two lightweight heads: a compact fully connected landmark regressor and a Vision Transformer occlusion classifier that explicitly fuses per-landmark temperature cues. A three-stage curriculum (mask-based backbone pretraining, head specialisation with a frozen trunk, and final joint fine-tuning) stabilises optimisation and improves generalisation from limited thermal data. On the TFD68 dataset, M3MSTL substantially improves both visibility and localisation: the occlusion accuracy reaches 91.8% (baseline 89.7%), the mean NME reaches 0.246 (baseline 0.382), the ROC–AUC reaches 0.974, and the AP is 0.966. Paired statistical tests confirm that these gains are significant. Our approach aims to improve the reliability of temperature-based biometric and clinical measurements in the presence of realistic occluders.
KW - thermal imaging
KW - occlusion-aware landmark
KW - biometrics
KW - multi-stage training
KW - multi-task learning
KW - ResNet-50
KW - vision transformer
UR - https://www.scopus.com/pages/publications/105028883166
U2 - 10.3390/ai7010028
DO - 10.3390/ai7010028
M3 - Article
SN - 2673-2688
VL - 7
JO - AI
JF - AI
IS - 1
M1 - 28
ER -