Investigating the Proficiency of Large Language Models in Formative Feedback Generation for Student Programmers

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Generative AI has considerably altered traditional workplace practices across numerous industries. Since the emergence of large language models (LLMs), their potential to generate formative feedback for introductory programming courses has been extensively researched. However, most of these studies have focused on Python. In this work, we examine the bug-fixing and feedback-generation abilities of Code Llama and ChatGPT on Java programming assignments using our new Java benchmark, CodeWBugs. The results indicate that ChatGPT performs reasonably well, fixing 94.33% of the programs. By comparison, we observed high variability in the results from Code Llama. We further analyzed the impact of different prompt types and observed that prompts including the task description and test inputs yielded better results. In most cases, the LLMs precisely localized the bugs and also offered guidance on how to proceed. Nevertheless, we also noticed incorrect responses generated by the LLMs, emphasizing the need to validate responses before disseminating feedback to learners.
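
The abstract does not reproduce the prompt templates used in the study. As a rough illustration only (an assumption, not the authors' actual template), a "task description + test input" prompt for a buggy Java submission might be assembled along these lines:

// Illustrative sketch: builds a formative-feedback prompt from a task description,
// a student's buggy Java code, and a failing test case. The wording and structure
// are hypothetical and do not reflect the paper's actual prompts.
public class PromptSketch {

    static String buildPrompt(String taskDescription, String buggyCode,
                              String testInput, String expectedOutput) {
        return "You are a tutor for an introductory Java course.\n"
             + "Task description:\n" + taskDescription + "\n\n"
             + "Student submission:\n" + buggyCode + "\n\n"
             + "Failing test input:\n" + testInput + "\n"
             + "Expected output:\n" + expectedOutput + "\n\n"
             + "Locate the bug and give formative feedback without revealing the full fix.";
    }

    public static void main(String[] args) {
        // Hypothetical example: an off-by-one bug in a summation exercise.
        String task = "Return the sum of the integers from 1 to n.";
        String code = "int sum(int n) { int s = 0; for (int i = 1; i < n; i++) s += i; return s; }";
        System.out.println(buildPrompt(task, code, "n = 3", "6"));
    }
}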
Original language: English
Title of host publication: 2024 International Workshop on Large Language Models for Code (LLM4Code)
Publisher: Association for Computing Machinery
Publication status: Accepted/In press - 15 Jan 2024
