TY - GEN
T1 - The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python
T2 - 61st Annual Meeting of the Association for Computational Linguistics 2023
AU - Miceli-Barone, Antonio Valerio
AU - Barez, Fazl
AU - Konstas, Ioannis
AU - Cohen, Shay B.
N1 - Funding Information:
We thank the reviewers for their helpful comments. We thank the Inverse Scaling Prize competition organisers (McKenzie et al., 2022a) for organising the challenge and donating part of the OpenAI API credits that were used in our experiments. We are grateful to Apart Research for their donation that supported the purchase of additional OpenAI API credits and provided personal financial support to Antonio Valerio Miceli-Barone. This work was supported by the UKRI Research Node on Trustworthy Autonomous Systems Governance and Regulation (grant EP/V026607/1), which provided funding for Antonio Valerio Miceli-Barone. The experiments in this work on open-source LLMs were supported by a compute grant (UKRI HPC) from the Baskerville service at the University of Birmingham.
Publisher Copyright:
© 2023 Association for Computational Linguistics.
PY - 2023/7
Y1 - 2023/7
AB - Large Language Models (LLMs) have successfully been applied to code generation tasks, raising the question of how well these models understand programming. Typical programming languages have invariances and equivariances in their semantics that human programmers intuitively understand and exploit, such as the (near) invariance to the renaming of identifiers. We show that LLMs not only fail to generate correct Python code when default function names are swapped, but some of them even become more confident in their incorrect predictions as the model size increases. This is an instance of the recently discovered phenomenon of Inverse Scaling, which runs contrary to the commonly observed trend of prediction quality improving with model size. Our findings indicate that, despite their astonishing typical-case performance, LLMs still lack a deep, abstract understanding of the content they manipulate, making them unsuitable for tasks that statistically deviate from their training data, and that mere scaling is not enough to achieve such capability.
UR - http://www.scopus.com/inward/record.url?scp=85173256418&partnerID=8YFLogxK
U2 - 10.18653/v1/2023.findings-acl.19
DO - 10.18653/v1/2023.findings-acl.19
M3 - Conference contribution
AN - SCOPUS:85173256418
SP - 272
EP - 292
BT - Findings of the Association for Computational Linguistics: ACL 2023
PB - Association for Computational Linguistics
Y2 - 9 July 2023 through 14 July 2023
ER -