Comparisons among several methods for handling missing data in principal component analysis (PCA)

Sebastien Loisel, Yoshio Takane

Research output: Contribution to journal › Article


Abstract

Missing data are prevalent in many data analytic situations, and those in which principal component analysis (PCA) is applied are no exception. The performance of five methods for handling missing data in PCA is investigated: the missing data passive method, the weighted low rank approximation (WLRA) method, the regularized PCA (RPCA) method, the trimmed scores regression method, and the data augmentation (DA) method. Three complete data sets of varying sizes were selected, in which missing data were created randomly and non-randomly. These data were then analyzed by the five methods, and their parameter recovery capability, as measured by the mean congruence coefficient between loadings obtained from full and missing data, was compared as a function of the number of extracted components (dimensionality) and the proportion of missing data (censor rate). For randomly censored data, all five methods worked well when the dimensionality and censor rate were small. Their performance deteriorated as the dimensionality and censor rate increased, and the deterioration was distinctly faster with the WLRA method. The RPCA method worked best, and the DA method came a close second in terms of parameter recovery. However, the latter, as implemented here, was found to be extremely time-consuming. For non-randomly censored data, the recovery was also affected by the degree of non-randomness in the censoring process. Again the RPCA method worked best, maintaining good to excellent recovery when the censor rate was small and the dimensionality of the solution was not excessive.
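The recovery measure described in the abstract can be illustrated with a minimal sketch. It assumes Tucker's congruence coefficient computed column by column between two loading matrices whose components are already in matching order, with the absolute value taken to ignore arbitrary sign flips; the abstract does not say how components were matched or rotated, so that part is an assumption. The synthetic data, the helper names (`pca_loadings`, `mean_congruence`, `censor_at_random`), and the column-mean imputation are all hypothetical stand-ins for illustration. Mean imputation in particular is a naive baseline and is not one of the five methods compared in the paper.

```python
import numpy as np

def pca_loadings(X, r):
    """Loadings of the first r principal components of column-centered X."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Scale right singular vectors by singular values / sqrt(n - 1).
    return Vt[:r].T * (s[:r] / np.sqrt(X.shape[0] - 1))

def mean_congruence(A, B):
    """Mean Tucker congruence coefficient between matching columns of A and B."""
    num = np.abs((A * B).sum(axis=0))  # abs() ignores arbitrary sign flips
    den = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0)
    return float(np.mean(num / den))

def censor_at_random(X, rate, rng):
    """Return a copy of X with roughly `rate` of its entries set to NaN."""
    Xm = X.astype(float).copy()
    Xm[rng.random(X.shape) < rate] = np.nan
    return Xm

rng = np.random.default_rng(0)

# Hypothetical complete data set with two dominant components plus noise.
n, p, r = 200, 8, 2
scores = rng.normal(size=(n, r)) * np.array([5.0, 2.0])
W = np.linalg.qr(rng.normal(size=(p, r)))[0]
X = scores @ W.T + 0.1 * rng.normal(size=(n, p))

# Randomly censor about 10% of the entries (censor rate = 0.1).
Xm = censor_at_random(X, 0.1, rng)

# Naive baseline (NOT one of the paper's methods): fill with column means.
col_means = np.nanmean(Xm, axis=0)
X_imp = np.where(np.isnan(Xm), col_means, Xm)

# Compare loadings from the full data with loadings from the imputed data.
phi = mean_congruence(pca_loadings(X, r), pca_loadings(X_imp, r))
print(f"mean congruence at dimensionality {r}: {phi:.3f}")
```

Repeating the last few lines over a grid of `rate` and `r` values would give the kind of recovery-versus-censor-rate and recovery-versus-dimensionality curves that the abstract refers to, under the assumptions stated above.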
Original language: English
Journal: Advances in Data Analysis and Classification
Early online date: 18 Jan 2018
Publication status: E-pub ahead of print - 18 Jan 2018
