The Effectiveness of Pre-Trained Code Embeddings

Benjamin Trevett, Donald Shewan Reay, Nick Taylor

Research output: Contribution to conference › Paper › peer-review

Abstract

Few machine learning applications in the domain of programming languages make use of transfer learning. In other domains, such as natural language processing, transfer learning has been shown to improve performance on various tasks and to lead to faster convergence. This paper investigates the use of transfer learning in machine learning models for programming languages, focusing on two tasks: method name prediction and code retrieval. We find that, for these tasks, transfer learning provides improved performance, as it does for natural languages. We also find that these models can be pre-trained on programming languages that differ from the downstream task language, and that even pre-training on English language data is sufficient to provide performance similar to pre-training on programming languages. We believe this is because these models ignore syntax and instead look for semantic similarity between the named variables in source code.
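Below is a minimal sketch of the general idea of pre-trained code embeddings described in the abstract: token embeddings are learned on a corpus of source-code tokens (or, per the paper's finding, English text) and then transferred into the embedding layer of a downstream model for a task such as method name prediction. The corpus, vector size, and use of gensim's Word2Vec are illustrative assumptions, not the authors' exact pipeline.

```python
# Sketch: pre-train token embeddings on code tokens, then transfer them
# into a downstream model's embedding layer (assumed setup, not the
# authors' exact method).
from gensim.models import Word2Vec
import torch
import torch.nn as nn

# Hypothetical tokenised method bodies; in practice this would be a large
# corpus of source code or English text.
corpus = [
    ["def", "get", "user", "name", "return", "self", "name"],
    ["def", "set", "user", "name", "self", "name", "=", "name"],
]

# Pre-train skip-gram embeddings on the tokens (gensim 4.x API).
w2v = Word2Vec(sentences=corpus, vector_size=64, window=5, min_count=1, sg=1)

# The learned vectors, in the order of gensim's vocabulary.
weights = torch.tensor(w2v.wv.vectors)

# Initialise the downstream model's embedding layer with the pre-trained
# vectors; freeze=False allows fine-tuning on the target task
# (e.g. method name prediction or code retrieval).
embedding = nn.Embedding.from_pretrained(weights, freeze=False)
```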
Original language: English
Publication status: Published - 27 Jul 2020
Event: 16th International Conference on Data Science 2020 - Las Vegas, United States
Duration: 27 Jul 2020 - 30 Jul 2020
Conference number: 16
https://icdatascience.org/

Conference

Conference: 16th International Conference on Data Science 2020
Abbreviated title: ICDATA'20
Country/Territory: United States
City: Las Vegas
Period: 27/07/20 - 30/07/20
Internet address: https://icdatascience.org/

