Abstract
Generative AI has considerably altered traditional workplace practices across numerous industries. Since the emergence of large language models (LLMs), their potential to generate formative feedback for introductory programming courses has been researched extensively. However, most of these studies have focused on Python. In this work, we examine the bug-fixing and feedback-generation abilities of Code Llama and ChatGPT for Java programming assignments using our new Java benchmark, CodeWBugs. The results indicate that ChatGPT performs reasonably well, fixing 94.33% of the programs. By comparison, we observed high variability in the results from Code Llama. We further analyzed the impact of different types of prompts and found that prompts including the task description and test inputs yielded better results. In most cases, the LLMs localized the bugs precisely and offered guidance on how to proceed. Nevertheless, we also observed incorrect responses from the LLMs, underscoring the need to validate responses before disseminating feedback to learners.
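To make the prompt-design finding concrete, the sketch below shows, purely as a hypothetical illustration, how a bug-fixing prompt that combines a task description, a student's buggy Java submission, and a failing test input might be assembled before being sent to an LLM. The class and method names (`FeedbackPrompt`, `build`) and the example submission are assumptions for illustration and are not taken from the paper or from CodeWBugs.

```java
// Hypothetical illustration only: assembles a bug-fixing prompt from a task
// description, a buggy student submission, and a failing test case, in the
// spirit of the prompt variants compared in the study. All names here are
// assumptions, not part of CodeWBugs or the paper's tooling.
public class FeedbackPrompt {

    static String build(String taskDescription, String buggyCode,
                        String failingInput, String expectedOutput) {
        return "Task description:\n" + taskDescription + "\n\n"
             + "Student submission (Java):\n" + buggyCode + "\n\n"
             + "Failing test input:\n" + failingInput + "\n"
             + "Expected output:\n" + expectedOutput + "\n\n"
             + "Locate the bug and explain, as formative feedback, "
             + "how the student could fix it.";
    }

    public static void main(String[] args) {
        // Example buggy submission: the loop starts at index 1 and skips a[0].
        String prompt = build(
            "Return the sum of all even numbers in the array a.",
            "int sum = 0;\nfor (int i = 1; i < a.length; i++) {\n"
                + "  if (a[i] % 2 == 0) sum += a[i];\n}\nreturn sum;",
            "a = {2, 4, 6}",
            "12");
        System.out.println(prompt);
    }
}
```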
Original language | English |
---|---|
Title of host publication | LLM4Code '24: Proceedings of the 1st International Workshop on Large Language Models for Code |
Publisher | Association for Computing Machinery |
Pages | 88-93 |
Number of pages | 6 |
ISBN (Print) | 9798400705793 |
DOIs | |
Publication status | Published - 10 Sept 2024 |