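Generalizations of partial least squares with application to supervised classification (original title: Generalizaciones de mínimos cuadrados parciales con aplicación en clasificación supervisada)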

Vega-Vilca, Jose C.

Abstract

The development of technologies such as microarrays has generated large amounts of data. The main characteristic of this kind of data is a large number of predictors (genes) and few observations (experiments). Thus, the data matrix X is of order n×p, where n is much smaller than p. Before applying any multivariate statistical technique, such as regression or classification, to analyze the information contained in these data, we need to apply feature selection methods, dimensionality reduction using orthogonal variables, or both. This eliminates multicollinearity among the predictor variables, which can lead to severe prediction errors, and reduces the computational burden of building and validating the classifier. Principal component analysis (PCA) has been used for some time to reduce dimensionality. However, the first components, which capture the most variability in the data, do not necessarily improve prediction when used for regression and classification (Yeung and Ruzzo, 2001). Partial least squares (PLS), introduced by Wold (1975), was an important contribution to dimensionality reduction in a regression context using orthogonal components. The confidence that the first PLS components improve prediction has made PLS a widely used technique, particularly in the area of chemistry known as chemometrics. Nguyen and Rocke (2002), working on supervised classification methods for microarray data, reduced the dimensionality by first applying feature selection using statistical techniques such as difference of means and analysis of variance, and then applying PLS regression with the vector of classes (a categorical variable) treated as a continuous response vector. This procedure is not adequate, since the predictions are not necessarily integers and must be rounded, which loses accuracy. In spite of these shortcomings, PLS regression yields reasonable results.
In this thesis we implement generalizations of PLS regression as a dimensionality reduction technique for supervised classification. We extend a technique introduced by Bastien et al. (2002), who combined PLS with ordinal logistic regression.
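
The procedure the abstract criticizes, PLS regression on a class vector treated as a continuous response with the predictions then rounded, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the data are synthetic (n = 40, p = 500), the function names `make_data`, `pls1_fit`, `pls1_predict` and `to_class` are hypothetical, the fit is a plain single-response NIPALS variant of PLS, and the feature selection step used by Nguyen and Rocke is omitted.

```python
import numpy as np


def make_data(rng, n_per_class, p, n_informative, shift):
    """Synthetic two-class data with n << p (an assumption, not real microarray data)."""
    y = np.repeat([0, 1], n_per_class)
    X = rng.normal(size=(2 * n_per_class, p))
    # Only the first few "genes" differ between the two classes.
    X[y == 1, :n_informative] += shift
    return X, y


def pls1_fit(X, y, n_components):
    """Single-response PLS regression via NIPALS; returns (x_mean, y_mean, beta)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    p = X.shape[1]
    W = np.zeros((p, n_components))   # weight vectors
    P = np.zeros((p, n_components))   # X loadings
    q = np.zeros(n_components)        # y loadings
    for a in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)
        t = Xk @ w                    # score vector (orthogonal to earlier scores)
        tt = t @ t
        p_a = Xk.T @ t / tt
        q_a = yk @ t / tt
        Xk = Xk - np.outer(t, p_a)    # deflate X
        yk = yk - q_a * t             # deflate y
        W[:, a], P[:, a], q[a] = w, p_a, q_a
    # Regression coefficients in the original predictor space.
    beta = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, beta


def pls1_predict(model, X):
    x_mean, y_mean, beta = model
    return y_mean + (X - x_mean) @ beta


def to_class(y_hat):
    """Round continuous predictions to the nearest label and clip to {0, 1}."""
    return np.clip(np.rint(y_hat), 0, 1).astype(int)


rng = np.random.default_rng(0)
X_train, y_train = make_data(rng, 20, 500, 10, 2.0)
X_test, y_test = make_data(rng, 20, 500, 10, 2.0)

model = pls1_fit(X_train, y_train.astype(float), n_components=2)
raw_train = pls1_predict(model, X_train)   # continuous values, not integers
pred_train = to_class(raw_train)
pred_test = to_class(pls1_predict(model, X_test))

print("first raw predictions:", np.round(raw_train[:5], 3))
print("training accuracy:", (pred_train == y_train).mean())
print("held-out accuracy:", (pred_test == y_test).mean())
```

The raw predictions are real numbers that must be forced back onto the class labels, which is the shortcoming noted in the abstract. Replacing the least squares step with a logistic model, as in the approach of Bastien et al. (2002), avoids that rounding; that extension is not shown here.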