Investigating the Proficiency of Large Language Models in Formative Feedback Generation for Student Programmers

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Abstract

Generative AI has considerably altered traditional workplace practices across numerous industries. Since the emergence of large language models (LLMs), their potential to generate formative feedback for introductory programming courses has been extensively researched. However, most of these studies have focused on Python. In this work, we examine the bug-fixing and feedback-generation abilities of Code Llama and ChatGPT on Java programming assignments using our new Java benchmark, CodeWBugs. The results indicate that ChatGPT performs reasonably well, fixing 94.33% of the programs. By comparison, we observed high variability in the results from Code Llama. We further analyzed the impact of different types of prompts and found that prompts including the task description and test inputs yielded better results. In most cases, the LLMs precisely localized the bugs and also offered guidance on how to proceed. Nevertheless, we also observed incorrect responses generated by the LLMs, emphasizing the need to validate responses before disseminating feedback to learners.
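As a rough illustration of the prompt style the abstract refers to (a task description plus failing test inputs alongside the buggy submission), the following Java sketch shows one plausible way such a prompt could be assembled. The class name, template wording, and example data are assumptions for illustration only and are not taken from the paper or the CodeWBugs benchmark.

```java
/**
 * Hypothetical sketch: assembles a feedback-generation prompt from the
 * assignment description, the student's (buggy) Java source, and the
 * inputs of failing tests. All names and the template text are illustrative.
 */
public class PromptBuilder {

    public static String buildPrompt(String taskDescription,
                                     String studentCode,
                                     String failingTestInputs) {
        return String.join("\n",
                "You are a tutor for an introductory Java course.",
                "Task description:",
                taskDescription,
                "",
                "Student submission:",
                "---- begin student code ----",
                studentCode,
                "---- end student code ----",
                "",
                "The submission fails on these test inputs:",
                failingTestInputs,
                "",
                "Explain where the bug is and give a hint on how to fix it,",
                "without writing the full corrected solution.");
    }

    public static void main(String[] args) {
        // Toy example: off-by-one loop bound in a sum-of-evens exercise.
        String prompt = buildPrompt(
                "Return the sum of all even numbers in an array.",
                "int sum(int[] a){int s=0;for(int i=0;i<=a.length;i++){if(a[i]%2==0)s+=a[i];}return s;}",
                "a = {2, 4, 5}  (expected 6, got ArrayIndexOutOfBoundsException)");
        System.out.println(prompt);
    }
}
```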
Original language: English
Title of host publication: LLM4Code '24: Proceedings of the 1st International Workshop on Large Language Models for Code
Publisher: Association for Computing Machinery
Pages: 88-93
Number of pages: 6
ISBN (Print): 9798400705793
DOIs
Publication status: Published - 10 Sept 2024
