Video advertising is a thriving industry that has recently turned its attention to the use of intelligent algorithms for automating tasks. In advertisement insertion, the integration of contextual relevance is essential in influencing the viewer’s experience. Despite the wide spectrum of audio-visual semantic modalities available, there is a lack of research that analyzes their individual and complementary strengths in a systematic manner. In this paper, we propose an ad-insertion framework that maximizes the contextual relevance between advertisement and content video by employing high-level multi-modal semantic features. Prediction vectors are derived via clip-level and image-level extractors, which are then matched accordingly to yield relevance scores. We also established a new user study methodology that produces gold standard annotations based on multiple expert selections. By comprehensive human-centered approaches and analysis, we demonstrate that automatic ad-insertion can be improved by exploiting effective combinations of semantic modalities.