Appropriate Machine Learning Algorithm for Big Data Processing
Abstract
MLlib is Spark’s library of machine learning functions, designed to run in parallel on clusters. It comprises several types of learning algorithms and is available from all of Spark’s programming languages. The library is relevant to data scientists with a machine learning background who are considering Spark, as well as to engineers working with machine learning professionals. Many MLlib algorithms achieve better predictive accuracy with regularization when that option is available. In addition, many of the SGD-based algorithms need around 100 iterations to obtain good results. This paper presents the types of algorithms that operate on distributed data sets, with all data represented as RDDs, and recommends the one that is most appropriate and effective for processing very large data sets. A number of machine learning algorithms are assessed on their strengths and weaknesses in order to identify the one that is effective for big data processing. The algorithm identified as appropriate and effective is HashingTF: it takes the hash code of each word modulo a desired vector size, S, and thus maps each word to a number between 0 and S–1. This always yields an S-dimensional vector, and in practice it is quite robust even if multiple words map to the same hash code. The MLlib inventors recommend setting S between 2 […]. HashingTF can run either on one document at a time or on a whole RDD. It requires each “document” to be represented as an iterable sequence of objects, for example a list in Python or a Collection in Java.
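The hashing step described in the abstract can be illustrated with a minimal sketch in plain Python. This is not MLlib's implementation: the function name `hashing_tf`, the parameter `num_features` (standing in for S), and the use of `zlib.crc32` as the hash function are assumptions chosen so that the example is deterministic and runs without Spark.

```python
import zlib
from collections import Counter


def hashing_tf(document, num_features):
    """Map an iterable of words to a sparse term-frequency vector.

    Hypothetical helper for illustration only. Each word is hashed and
    reduced modulo num_features (the vector size S), so every index
    falls between 0 and S-1. Words that collide share one index and
    their counts are added together.
    """
    counts = Counter()
    for word in document:
        # Assumption: crc32 stands in for whatever hash MLlib uses.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        counts[index] += 1
    # Sparse representation: {index: term count}
    return dict(counts)


# A "document" is any iterable of words, for example a Python list.
doc = ["spark", "mllib", "spark", "hashing"]
vector = hashing_tf(doc, num_features=16)
```

Whatever the input, the total of the counts equals the number of words in the document, and no index ever reaches `num_features`. In PySpark itself, the corresponding class is `HashingTF` in `pyspark.mllib.feature`, whose `transform` method accepts either a single document or an RDD of documents.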
ISSN (Paper)2222-1727 ISSN (Online)2222-2863
This journal follows the ISO 9001 management standard and is licensed under a Creative Commons Attribution 3.0 License.