Aparicio-Carrasco, Roxana K.
2 results
Publication Search Results
Publication: Semi-supervised document classification using ontologies (2011)
Aparicio-Carrasco, Roxana K.; Acuña-Fernández, Edgar; College of Engineering; Urintsev, Alexander; Lozano, Elio; Manian, Vidya; Department of Electrical and Computer Engineering; Calderón, Andrés
Many modern applications of automatic document classification require learning accurately from little training data. Semi-supervised classification has been proposed to reduce the manual labeling effort. This technique uses both labeled and unlabeled data for training and has been shown to be effective in many cases. However, using unlabeled data for training is not always beneficial, and it is difficult to know a priori whether it will work for a particular document collection. Meanwhile, the emergence of web technologies has given rise to the collaborative development of ontologies. Ontologies are formal, explicit, detailed structures of concepts. In this thesis, we propose using ontologies to improve automatic document classification when little training data is available. We propose that using ontologies to assist semi-supervised document classification can substantially improve the accuracy and efficiency of the semi-supervised technique. Many learning algorithms have been studied for text. One of the most effective is the Support Vector Machine, which is the basis of this work. Our algorithm enhances the performance of Transductive Support Vector Machines through the use of ontologies. We report experimental results from applying our algorithm to three real-world text classification datasets.
Our experimental results show an average accuracy increase of 4%, and up to 20% on some datasets, compared with the traditional semi-supervised model.

Publication: Unsupervised classification of text documents (2007)
Aparicio-Carrasco, Roxana K.; Acuña-Fernández, Edgar; College of Arts and Sciences - Sciences; González, Ana C.; Urintsev, Alexander; Department of Mathematics; Hernández-Rivera, William
The automatic extraction of knowledge from very large document collections is becoming an important issue in exploiting the growing volume of information stored in text form. A significant part of this knowledge extraction consists of organizing the collection into clusters of related documents; this task is known as unsupervised classification or clustering. Preprocessing the collection with the vector space model yields a vector representation of each document. The main characteristics of these vectors are their high dimensionality and sparsity. In this thesis we studied and implemented algorithms for clustering large document collections that fully exploit these characteristics. We propose a sparse representation of the document vectors stored in a relational database and develop SQL implementations of two clustering algorithms: PAM and EM using Multinomial Naive Bayes Mixtures.
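
The idea of storing sparse document vectors in a relational table can be illustrated with a minimal sketch. It is not the schema or SQL from the thesis: the table name `doc_term`, its columns, the toy documents, and the use of SQLite through Python are all assumptions made for illustration. Only nonzero term counts are stored, one row per (document, term) pair, and pairwise cosine similarities (the kind of quantity a medoid-based method such as PAM could build its dissimilarities from) are computed with a self-join.

```python
import math
import sqlite3
from collections import Counter

# Toy corpus; these documents are made up for illustration.
docs = {
    1: "sparse vectors in a database",
    2: "sparse vectors and clustering",
    3: "ontologies describe concepts",
}

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per nonzero (document, term) entry,
# so zero entries of the high-dimensional vectors are never stored.
conn.execute(
    "CREATE TABLE doc_term (doc_id INTEGER, term TEXT, tf REAL, "
    "PRIMARY KEY (doc_id, term))"
)
for doc_id, text in docs.items():
    counts = Counter(text.split())
    conn.executemany(
        "INSERT INTO doc_term VALUES (?, ?, ?)",
        [(doc_id, term, float(n)) for term, n in counts.items()],
    )


def cosine_similarities(conn):
    """Return {(i, j): cosine} for each pair i < j sharing at least one term."""
    # Euclidean norm of each document vector, from its nonzero entries only.
    norms = {
        doc_id: math.sqrt(sq_sum)
        for doc_id, sq_sum in conn.execute(
            "SELECT doc_id, SUM(tf * tf) FROM doc_term GROUP BY doc_id"
        )
    }
    # The self-join on term touches only terms that both documents contain.
    rows = conn.execute(
        "SELECT a.doc_id, b.doc_id, SUM(a.tf * b.tf) "
        "FROM doc_term a JOIN doc_term b ON a.term = b.term "
        "WHERE a.doc_id < b.doc_id "
        "GROUP BY a.doc_id, b.doc_id"
    )
    return {(i, j): dot / (norms[i] * norms[j]) for i, j, dot in rows}


sims = cosine_similarities(conn)
```

Pairs of documents with no term in common do not appear in the result at all, so their similarity is implicitly zero; this is one way the sparsity of the vectors carries over into the query.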