Abstract
Support vector domain description (SVDD) is a useful tool in data mining, used
to analyse the within-class distribution of multi-class data and to ascertain
membership of a class with a known training distribution. An important property
of the method is its inner-product based formulation, which makes it applicable to reproducing kernel Hilbert spaces via the “kernel trick”. This practice relies on full knowledge of feature values in the training set, so incomplete data must be pre-processed via imputation, which sometimes introduces unnecessary or incorrect data into the classifier. Based on an existing study of support vector machine (SVM) classification with structurally missing data, we present a method of domain description for incomplete data without imputation, and generalise it to some types of kernel space. We review statistical techniques for dealing with missing data, and explore the properties and limitations of the SVM procedure. We present two methods to achieve this aim: the first provides an input space solution, and the second uses a given imputation of a dataset to calculate an improved solution. We apply our methods first to synthetic and commonly used datasets, then to non-destructive assay (NDA) data provided by a third party. We compare our classification machines with a standard SVDD boundary, and highlight where performance improves upon the use of imputation.
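For orientation, the inner-product structure referred to in the abstract can be seen in the standard SVDD dual problem as it is usually stated in the literature. This is a sketch in assumed notation, not the thesis's own: training points $x_i$, multipliers $\alpha_i$, trade-off parameter $C$, kernel $K$ with feature map $\phi$, sphere centre $a$ and radius $R$ are all labels chosen here for illustration.

```latex
% Standard SVDD dual (assumed notation; the thesis may use different symbols)
\begin{align}
  \max_{\alpha} \quad
    & \sum_{i} \alpha_i \, K(x_i, x_i)
      - \sum_{i,j} \alpha_i \alpha_j \, K(x_i, x_j) \\
  \text{subject to} \quad
    & \sum_{i} \alpha_i = 1, \qquad 0 \le \alpha_i \le C .
\end{align}

% A test point z is accepted when its kernel-space distance
% to the centre a = \sum_i \alpha_i \phi(x_i) is at most R:
\begin{equation}
  \lVert \phi(z) - a \rVert^2
    = K(z, z) - 2 \sum_{i} \alpha_i \, K(x_i, z)
      + \sum_{i,j} \alpha_i \alpha_j \, K(x_i, x_j)
    \le R^2 .
\end{equation}
```

Every term depends on the data only through $K$, so a missing feature value prevents direct evaluation of the kernel; this is why imputation is the usual workaround.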
Original language | English |
---|---|
Qualification | Ph.D. |
Awarding Institution | |
Supervisors/Advisors | |
Award date | 23 Jun 2011 |
Publisher | |
Publication status | Published - Jun 2011 |