N-gram Based Text Categorization Method for Improved Data Mining

Kennedy Ogada; Waweru Mwangi; Wilson Cheruiyot

N-gram Based Text Categorization Method for Improved Data Mining

Kennedy Ogada, Waweru Mwangi, Wilson Cheruiyot

Abstract

Though naïve Bayes text classifiers are widely used because of its simplicity and effectiveness, the techniques for improving performances of these classifiers have been rarely studied. Naïve Bayes classifiers which are widely used for text classification in machine learning are based on the conditional probability of features belonging to a class, which the features are selected by feature selection methods. However, its performance is often imperfect because it does not model text well, and by inappropriate feature selection and some disadvantages of the Naive Bayes itself. Sentiment Classification or Text Classification is the act of taking a set of labeled text documents, learning a correlation between a document’s contents and its corresponding labels and then predicting the labels of a set of unlabeled test documents as best as possible. Text Classification is also sometimes called Text Categorization. Text classification has many applications in natural language processing tasks such as E-mail filtering, Intrusion detection systems, news filtering, prediction of user preferences, and organization of documents. The Naive Bayes model makes strong assumptions about the data: it assumes that words in a document are independent. This assumption is clearly violated in natural language text: there are various types of dependences between words induced by the syntactic, semantic, pragmatic and conversational structure of a text. Also, the particular form of the probabilistic model makes assumptions about the distribution of words in documents that are violated in practice. We address this problem and show that it can be solved by modeling text data differently using N-Grams. N-gram Based Text Categorization is a simple method based on statistical information about the usage of sequences of words. We conducted an experiment to demonstrate that our simple modification is able to improve the performance of Naive Bayes for text classification significantly.

Keywords: Data Mining, Text Classification, Text Categorization, Naïve Bayes, N-Grams

Full Text: PDF

Download the IISTE publication guideline!

To list your conference here. Please contact the administrator of this platform.

Paper submission email: JIEA@iiste.org

ISSN (Paper)2224-5782 ISSN (Online)2225-0506

Please add our address "contact@iiste.org" into your email contact list.

This journal follows ISO 9001 management standard and licensed under a Creative Commons Attribution 3.0 License.

Journal of Information Engineering and Applications

N-gram Based Text Categorization Method for Improved Data Mining

Abstract