TY - JOUR
T1 - ERNet: An Efficient and Reliable Human-Object Interaction Detection Network
AU - Lim, Junyi
AU - Baskaran, Vishnu Monn
AU - Lim, Joanne Mun-Yee
AU - Wong, Koksheik
AU - See, John
AU - Tistarelli, Massimo
N1 - Funding Information:
This work was supported in part by the Malaysian Ministry of Education's Fundamental Research Grant Scheme under Grant FRGS/1/2018/ICT02/MUSM/02/3 and in part by the Advanced Computing Platform at Monash University Malaysia.
Publisher Copyright:
© 1992-2012 IEEE.
PY - 2023/1/26
Y1 - 2023/1/26
AB - Human-Object Interaction (HOI) detection recognizes how persons interact with objects, which is advantageous in autonomous systems such as self-driving vehicles and collaborative robots. However, current HOI detectors are often plagued by model inefficiency and unreliability when making a prediction, which consequently limits their potential for real-world scenarios. In this paper, we address these challenges by proposing ERNet, an end-to-end trainable convolutional-transformer network for HOI detection. The proposed model employs an efficient multi-scale deformable attention to effectively capture vital HOI features. We also put forward a novel detection attention module to adaptively generate semantically rich instance and interaction tokens. These tokens undergo pre-emptive detections to produce initial region and vector proposals that also serve as queries, which enhances the feature refinement process in the transformer decoders. Several impactful enhancements are also applied to improve the HOI representation learning. Additionally, we utilize a predictive uncertainty estimation framework in the instance and interaction classification heads to quantify the uncertainty behind each prediction. By doing so, we can accurately and reliably predict HOIs even under challenging scenarios. Experimental results on the HICO-Det, V-COCO, and HOI-A datasets demonstrate that the proposed model achieves state-of-the-art performance in detection accuracy and training efficiency. Code is publicly available at https://github.com/Monash-CyPhi-AI-Research-Lab/ernet.
KW - Human-object interaction detection
KW - deformable attention
KW - transformer
KW - uncertainty estimation
UR - http://www.scopus.com/inward/record.url?scp=85148230707&partnerID=8YFLogxK
U2 - 10.1109/TIP.2022.3231528
DO - 10.1109/TIP.2022.3231528
M3 - Article
C2 - 37022006
SN - 1057-7149
VL - 32
SP - 964
EP - 979
JO - IEEE Transactions on Image Processing
JF - IEEE Transactions on Image Processing
ER -