Appropriate Machine Learning Algorithm for Big Data Processing

Emmanuel Boachie; Chunlin Li

Abstract

MLlib is Spark’s library of machine learning functions, developed to operate in parallel on clusters. MLlib comprises different types of learning algorithms and is available from all of Spark’s programming languages. It is relevant to data scientists with a machine learning background who are considering using Spark, as well as to engineers working with machine learning professionals. Many algorithms in MLlib achieve better predictive accuracy with regularization when that option is available. In addition, many of the SGD-based algorithms require around 100 iterations to obtain good results. The paper presents the types of algorithms that operate on distributed data sets, representing all data as RDDs, and recommends the one that is most appropriate and effective for processing huge data sets. The algorithms are assessed on their strengths and weaknesses in order to identify one that is effective for big data processing. The appropriate and effective machine learning algorithm is HashingTF, as it takes the hash code of each word modulo a desired vector size, S, and thus maps each word to a number between 0 and S–1. This always yields an S-dimensional vector, and in practice it is quite robust even if multiple words map to the same hash code. The MLlib inventors recommend setting S between 2 […]. HashingTF can run either on one document at a time or on a whole RDD. It requires each “document” to be represented as an iterable sequence of objects, for example a list in Python or a Collection in Java.
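The hashing step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not MLlib's implementation: the function name `hashing_tf`, the parameter `num_features` (standing in for S), and the use of CRC32 as the hash function are assumptions made here for illustration, chosen so the example runs deterministically without a Spark cluster.

```python
import zlib
from collections import Counter
from typing import Dict, Iterable


def hashing_tf(document: Iterable[str], num_features: int) -> Dict[int, float]:
    """Hypothetical helper: term-frequency vector via the hashing trick.

    Each term is hashed, reduced modulo num_features (S in the abstract),
    and the count at that index is incremented. The result is a sparse
    S-dimensional vector stored as {index: count}.
    """
    if num_features <= 0:
        raise ValueError("num_features must be positive")
    counts: Counter = Counter()
    for term in document:
        # CRC32 is an assumed stand-in hash; MLlib uses its own hash function.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1.0
    return dict(counts)


# A "document" is any iterable of terms, here a list of words.
doc = "spark makes big data processing simple and spark scales".split()
vector = hashing_tf(doc, num_features=16)
```

With a small S such as 16, distinct words may collide on the same index, and their counts are then simply added together. This is the collision behaviour the abstract describes as robust in practice; a larger S makes collisions rarer at the cost of a wider vector.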

ISSN (Paper) 2222-1727, ISSN (Online) 2222-2863

This journal follows the ISO 9001 management standard and is licensed under a Creative Commons Attribution 3.0 License.

Computer Engineering and Intelligent Systems
