A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm

dc.contributor.author: Uguz, Harun
dc.date.accessioned: 2020-03-26T18:13:42Z
dc.date.available: 2020-03-26T18:13:42Z
dc.date.issued: 2011
dc.department: Selçuk Üniversitesi [en_US]
dc.description.abstract: Text categorization is widely used to organize documents in digital form. Due to the increasing number of documents in digital form, automated text categorization has become more promising in the last ten years. A major problem of text categorization is its large number of features. Most of these are irrelevant noise that can mislead the classifier. Therefore, feature selection is often used in text categorization to reduce the dimensionality of the feature space and to improve performance. In this study, two-stage feature selection and feature extraction is used to improve the performance of text categorization. In the first stage, each term within the document is ranked according to its importance for classification using the information gain (IG) method. In the second stage, a genetic algorithm (GA) for feature selection and principal component analysis (PCA) for feature extraction are applied separately to the terms ranked in decreasing order of importance, and a dimension reduction is carried out. Thereby, during text categorization, terms of less importance are ignored, and the feature selection and extraction methods are applied only to the terms of highest importance; thus, the computational time and complexity of categorization are reduced. To evaluate the effectiveness of the dimension reduction methods in the proposed model, experiments are conducted using the k-nearest neighbour (KNN) and C4.5 decision tree algorithms on the Reuters-21578 and Classic3 dataset collections for text categorization. The experimental results show that the proposed model is able to achieve high categorization effectiveness as measured by precision, recall and F-measure. © 2011 Elsevier B.V. All rights reserved. [en_US]
dc.description.sponsorship: Selcuk University [en_US]
dc.description.sponsorship: This study has been supported by the Scientific Research Project of Selcuk University. [en_US]
dc.identifier.doi: 10.1016/j.knosys.2011.04.014 [en_US]
dc.identifier.endpage: 1032 [en_US]
dc.identifier.issn: 0950-7051 [en_US]
dc.identifier.issn: 1872-7409 [en_US]
dc.identifier.issue: 7 [en_US]
dc.identifier.scopusquality: Q1 [en_US]
dc.identifier.startpage: 1024 [en_US]
dc.identifier.uri: https://dx.doi.org/10.1016/j.knosys.2011.04.014
dc.identifier.uri: https://hdl.handle.net/20.500.12395/26083
dc.identifier.volume: 24 [en_US]
dc.identifier.wos: WOS:000293683500009 [en_US]
dc.identifier.wosquality: Q1 [en_US]
dc.indekslendigikaynak: Web of Science [en_US]
dc.indekslendigikaynak: Scopus [en_US]
dc.language.iso: en [en_US]
dc.publisher: ELSEVIER [en_US]
dc.relation.ispartof: KNOWLEDGE-BASED SYSTEMS [en_US]
dc.relation.publicationcategory: Article - International Refereed Journal - Institutional Faculty Member [en_US]
dc.rights: info:eu-repo/semantics/closedAccess [en_US]
dc.selcuk: 20240510_oaig [en_US]
dc.subject: Text categorization [en_US]
dc.subject: Feature selection [en_US]
dc.subject: Genetic algorithm [en_US]
dc.subject: Principal component analysis [en_US]
dc.subject: Information gain [en_US]
dc.title: A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm [en_US]
dc.type: Article [en_US]
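
Note on the first stage described in the abstract: ranking terms by information gain can be sketched as below. This is a minimal illustration, not the article's implementation. It assumes each document is reduced to a set of terms, that a term is treated as a binary present/absent feature, and that entropy is measured in bits. The toy documents, the labels, and the function names `entropy`, `information_gain` and `rank_terms` are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """IG of a term: class entropy minus the entropy left after
    splitting the documents by presence or absence of the term."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def rank_terms(docs, labels):
    """Return (score, term) pairs in decreasing order of information gain.
    Ties are broken alphabetically so the ranking is deterministic."""
    vocabulary = set().union(*docs)
    scored = [(information_gain(docs, labels, t), t) for t in vocabulary]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical toy corpus: each document is a set of terms.
docs = [
    {"wheat", "price"},
    {"wheat", "export"},
    {"oil", "price"},
    {"oil", "barrel"},
]
labels = ["grain", "grain", "crude", "crude"]

ranking = rank_terms(docs, labels)
top_terms = [term for _, term in ranking[:2]]  # terms kept for stage two
```

In this toy corpus, "wheat" and "oil" each separate the two classes perfectly and rank first, while "price" occurs equally in both classes and scores zero. Under the scheme the abstract describes, only the highest-ranked terms would then be passed on to the second-stage GA or PCA step.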
