TY - GEN
T1 - EchoScript: Enhancing AI Music Generation for Cinematic Scoring via Script-Aware Fine-Tuning
AU - Sartaee, Mohammad Kasra
AU - Karim, Kayvan
PY - 2025/8/1
Y1 - 2025/8/1
N2 - Recent advancements in artificial intelligence (AI) have significantly transformed the landscape of music generation, enabling context-sensitive and emotionally expressive soundtracks for diverse media applications such as film, gaming, and therapeutic environments. However, existing AI models continue to face persistent challenges in maintaining melodic coherence, thematic continuity, and emotional depth—qualities essential for professional soundtrack production. This research addresses these limitations by fine-tuning MusicGen, a transformer-based generative AI model, to create EchoScript—an optimized variant specifically tailored for cinematic soundtrack composition through script-driven conditioning. A curated dataset enriched with detailed metadata, including genre, mood, instrumentation, tempo, and narrative context, was employed to guide the fine-tuning process. Evaluation results demonstrate substantial improvements over the baseline model. EchoScript achieved a lower Fréchet Audio Distance (FAD) score (4.3738 vs. 4.5492) and outperformed the baseline in structured listening tests, with participants consistently preferring EchoScript for musical quality and narrative alignment. Beyond these empirical findings, the study critically examines technical constraints and outlines key future directions, including symbolic-audio integration, enhanced audio mixing, and the development of standardized evaluation metrics. Collectively, these contributions advance the pursuit of AI-generated music that closely approximates human-level expressiveness and narrative coherence, offering meaningful benefits for creative industries reliant on adaptive and emotionally resonant soundtracks.
U2 - 10.1609/aaaiss.v6i1.36069
DO - 10.1609/aaaiss.v6i1.36069
M3 - Conference contribution
SN - 9781577358992
T3 - Proceedings of the AAAI Symposium Series
SP - 323
EP - 332
BT - Proceedings of the 2025 AAAI Summer Symposium Series
PB - AAAI Press
T2 - Association for the Advancement of Artificial Intelligence Symposium on Human-AI Collaboration 2025
Y2 - 20 May 2025 through 22 May 2025
ER -