Czech University of Life Sciences Prague
Faculty of Economics and Management
Department of Information Engineering
Bachelor Thesis
Data science
Nihar Lathiya
© 2021 CULS Prague
Table of Contents
3. Literature review
3.1 Machine learning
3.1.1 Data
3.1.2 Features
3.1.3 Machine learning algorithms
3.1.3.1 Linear regression
3.1.3.2 Lasso Regression
3.1.3.3 Decision Tree
3.2 Cross-Validation in Model Selection
3.2.1 K-fold cross-validation
3.3 Hyperparameter Tuning in Machine Learning
3.3.1 Grid search CV
3.4 Python for Data Science
3.5 Essential Python Libraries for Data Science
3.5.1 NumPy
3.5.2 Pandas
3.5.3 Matplotlib
3.5.4 Scikit-Learn
3.6 Project Tool
References
Appendix
List of tables
Table 1: Parameters in Grid Search CV
Table 2: Attributes in Grid Search CV
Table 3: ndarray Attributes in NumPy
Table 4: Arithmetic and Statistical Function in NumPy
Table 5: Basic Functionalities of Pandas Data Frame
Table 6: Basic Functions in Matplotlib
List of figures
Figure 1: Algorithms employed in Machine Learning
Figure 2: Linear Regression Graph
List of equations
3. Literature review
3.1 Machine learning
Machine learning is a well-known concept in the domains of data science, artificial intelligence, and computer science, also known as statistical learning and predictive analytics. Arthur Samuel of IBM first used the term "machine learning" in 1952. In the 1950s, he wrote one of the first learning programs: a game of checkers that developed winning strategies and incorporated them into its moves. Later, Samuel also designed various mechanisms allowing his program to improve (Foote, 2019).
As a data scientist, one must become familiar with machine learning concepts because both data science and machine learning overlap. We can say:
- Data science is used to gain insights from data and understanding of data patterns.
- Machine learning is used for making predictions based on available data.
The above predictions employ a set of artificial intelligence techniques focused on designing systems that use statistical experience to improve parameter tuning, performance indices, and predictions, where experience can be prior information or data from broad and specific fields, pooled at dataset hubs such as Kaggle and made available for research. As in other data science projects, critically measuring the validity and quality of these algorithms, for example their time complexity, yields a robust system. Still, the additional notion of sample complexity is required for an algorithm to learn data patterns. In short, the theoretical learning guarantees for an algorithm depend on the complexity of the algorithm and the size of the training data sample (Mehryar Mohri, 2018).
Machine learning is all about getting computers to make data-driven decisions rather than being explicitly programmed to carry out specific tasks. It enables computing devices to employ embedded programs and generated algorithms to predict instantaneous states just like humans do. Since programming is automation, machine learning is at the core of the process: it makes programming scalable. In conventional programming, data is fed as the input, while a program at the core manipulates it and gives an output, which is also a dataset. This concept is carried further in machine learning, but with improved datasets and robust algorithms to achieve similar goals (Brownlee, 2015).
There are three main types of machine learning algorithms: supervised learning, unsupervised learning, and reinforcement learning. Their differences stem from the way they treat the training and testing datasets.
Figure 1: Algorithms employed in Machine Learning.
Source: Author
Supervised machine learning
A supervised learning algorithm comprises an outcome variable, also referred to as a dependent variable, which is to be predicted by applying a learning technique to the independent variables. The technique's overall goal is to generate a model resembling a function that can map a given set of inputs to an output. Model training runs until the computer finds a suitable model that does not compromise the accuracy of the results. The critical points of this technique are that it uses labeled data, gives direct feedback, and predicts the outcome or future.
Algorithms under this category include Random Forest, Linear Regression (the most common), k-Nearest Neighbors, and Decision Tree.
Unsupervised learning
The learning algorithms, in this case, do not contain an outcome or dependent variable to predict. The data is unlabeled, the algorithm receives no direct feedback, and it helps find hidden structure in the data. Unsupervised learning algorithms cannot be directly applied to regression or classification problems because we do not know what the output values should be, which makes it impossible to train the model the way we do in supervised learning. It can be used, for example, for clustering a population into different groups for specific interventions.
Examples include k-means clustering and Gaussian mixture models.
Reinforcement learning
In this category, the machines are trained to make only specific decisions while ignoring others. It has a reward system and learns a series of actions, implying that the variables in question are manipulated through the trial and error mechanism until they fit the desired output. It can be observed that this form of learning uses experience as the core of its decision making. The result is that only the best decisions are employed; hence reliability is optimal.
An example is the Markov decision process.
There are three main components of the machine learning system: data, features, and ML algorithms.
In our study, we seek to employ a Supervised Learning algorithm to achieve our project’s objectives. Since we seek to create a model that can easily predict the price of houses in India, we must have data to train and test the model. Supervised learning allows us the degree of freedom to choose the appropriate model that predicts house prices accurately. Data for this project is downloaded from Kaggle (Chakraborty, 2017).
3.1.1 Data
Data can be collected manually or automatically. It can be in any unprocessed form: structured records, text, values, images, audio, etc. Data is a crucial factor in data analysis, machine learning, and A.I.; it is not possible to train a model without it. In our case, we used the dataset below:
Dataset Name: Bengaluru-house-price-data
Type: Comma Separated Values (CSV)
Location: https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data (Chakraborty, 2017)
Therefore, big enterprises nowadays spend vast amounts of money on acquiring access to such data. Generally, in machine learning, we divide data into two parts: "training data" and "testing data." Since we downloaded the data from an online repository, Kaggle.com, we know that it should be split into the two sets given above. However, we cannot split the data until we have cleaned it. Data cleaning ensures the data is in good form, with only the required variables put into play, and lets us observe the dataset's features.
Much time is bound to be spent on data cleaning; it is the most time-consuming part. The modeling data should be logically viable and free of outliers, which are the most common error sources in the modeling paradigm. Data cleaning involves eliminating all the cells with null or NA values. This is done either by replacing the missing values with the median value of the column in which they reside, or by excluding the rows with missing values entirely from the dataset. The latter is reasonable when the number of rows with missing values is small compared to the entire population.
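To make the cleaning step concrete, the following is a minimal pandas sketch, assuming the dataset has been loaded into a data frame df; the file name and the column name "bath" are illustrative:

import pandas as pd

df = pd.read_csv("Bengaluru_House_Data.csv")   # file name is illustrative

# Option 1: replace missing values with the median of the column they reside in
df["bath"] = df["bath"].fillna(df["bath"].median())

# Option 2: drop rows that still contain missing values entirely
df = df.dropna()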
Training data is the portion of our data that we show to our model as input and output; based on this data, we train the model. The training dataset needs to form the majority of the sample: it should be large enough to overcome the bias introduced by smaller data samples. In our project, we split the primary dataset so that 80% forms the training set, which was used to train the models to predict the price of houses given several input parameters. A considerable amount of time may be spent during training; this is to be treated as usual, and the model should be left to complete its training.
Testing data – after our model is thoroughly trained and ready for prediction, we feed it the testing data values as input and obtain predicted outputs. These outputs might not match the actual outputs of the testing data, as the model has not seen it, so we compare both and check the accuracy of the model (Gupta, 2018). Model testing forms the last step in this project. It uses the pseudo-random remainder of the population, which is 20%; the sample should be randomized to give a clear picture of what the model portrays. Testing is done using any house from any location, together with parameters such as the number of bedrooms and bathrooms, with the expectation of a reasonable house price prediction.
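A minimal sketch of the 80/20 split described above, using scikit-learn's train_test_split and assuming the cleaned data frame df from the previous sketch, with an illustrative "price" column as the target:

from sklearn.model_selection import train_test_split

X = df.drop("price", axis=1)   # feature columns
y = df["price"]                # target column

# Randomized 80/20 split; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10)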
3.1.2 Features
Features are the measurable and observable properties in the data that we are interested in analyzing. In datasets, features often appear as columns, each forming a distinctive characteristic. Features are also referred to as "variables" or "attributes." The feature selection process varies depending on what we need to analyze in our model. Features are the building blocks of a dataset, and their quality significantly affects the quality of the insight we will gain from it. The quality of features can be improved with feature selection and feature engineering. This is typically tricky and tedious, but done well it yields an optimal dataset containing all the essential features that might carry beneficial insights for solving a specific business problem (LLC, 2019). These features are extracted during the data cleaning phase, in which the dataset's characteristics, variables, and individual entries are observed. These observations shape how the required features are selected for the next phase in the data cleaning pipeline.
Once the required variables have been defined, all the other columns are dropped, as our goal is a data frame containing only the factors on which the model will depend. Where intermediate variables help reduce other variables, they are created, used until no longer required, and then dropped. As features are extracted, the output data frame shrinks accordingly at every stage until we are comfortable using it. Since we are employing a supervised learning technique, we have to ensure the proper data statistics are in order; for instance, we visualize the data to confirm that it is normally distributed. Once this criterion is met, we further confirm that no anomalies exist in the data, using scatter plots and the distribution of house prices across locations.
3.1.3 Machine learning algorithms
Algorithms employed in machine learning exist in massive numbers, with many new ones sprouting every day. Generally, a data scientist applies more than one algorithm to the model to check which one scores higher and gives better accuracy. In the practical part of this thesis, we are going to focus on the following three algorithms.
These algorithms serve as a litmus test of the viability of a model. Since we seek the best-fit model in this project, we will not test just a single model but rather apply all the algorithms discussed below and find the best model and its parameters. This approach is accompanied by parameter tuning to identify the algorithm viable for the project, together with its corresponding parameters for optimal model performance. We therefore employ hyperparameter tuning during training to obtain the best parameters and order them by their score index; we then select the algorithm with the best score and note the model's tuning parameters.
3.1.3.1 Linear regression
This is an algorithm in the supervised machine learning family where the predicted output value is continuous and exhibits a constant slope. It is used for predictive analysis of quantities such as a house's cost, total sales, or call volumes. The primary goal of this algorithm is usually to answer the following questions:
(1) Does a set of predictor variables (independent) predict the outcome variable satisfactorily? And
(2) Which among the predictor variables have great significance in predicting the outcome variable, and to what extent do they impact the outcome variable?
In linear regression, we establish a relationship between the dependent and independent variables using the best-fit line, statistically referred to as the regression line. The line is represented by the following general equation:

$$y = mx + b$$

Equation 1: Equation of a straight line

Source: (Pierce, 2018)

Here y is the value under prediction, x is the independent (predictor) variable, m is the gradient, and b is the y-intercept.
There are two main types of linear regression: simple regression and multivariable regression. Simple regression has only one independent variable (x) and one dependent variable (y), whereas multivariable regression has one dependent variable (y) and more than one independent variable (x). Taking the example of house prices, price prediction based only on square feet is simple regression; price prediction based on square feet, bedrooms, bathrooms, area, balcony, etc., is multivariable linear regression.
Let us visualize linear regression by the following graph:
Figure 2: Linear Regression Graph
In the figure above, we have two variables, x and y, with multiple data points defining the relation between them. The blue line, called the regression line, passes through the data points. Now the question is: why can there be only one such line? Multiple lines might pass through the data points. That is precisely the aim of linear regression: to find the best-fit line for our model. The principle behind the regression line is to minimize the error, which is done with the sum of squared errors equation:
$$\mathrm{SSE} = \sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$

Equation 2: Sum of square error
Source: (Weisberg, 2005)
The equation sums the squared individual errors of all data points from 1 to N, where y_i is the actual value and ŷ_i the value predicted by the line. By minimizing this equation, we find the single blue line that best fits the model.
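The following minimal sketch, using illustrative synthetic data, shows simple linear regression with scikit-learn, whose fit method finds the best-fit line by least squares:

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny illustrative data: price (in lakhs) against square feet
X = np.array([[1000], [1500], [2000], [2500]])
y = np.array([50.0, 76.0, 99.0, 126.0])

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # fitted gradient m and intercept b
print(model.predict([[1800]]))            # predicted price for 1800 sq. ft.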
3.1.3.2 Lasso Regression
Lasso stands for "least absolute shrinkage and selection operator." It is a supervised machine learning algorithm that employs the concept of shrinkage, where the data values are compressed towards a central point. The procedure encourages simple and sparse models with fewer parameters. This algorithm is well suited to models with a high level of multicollinearity, or to automating parts of model selection, such as variable selection or parameter elimination (Glen, 2015).
The coefficients are established by minimizing the following penalized least-squares equation:

$$\sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{p}\left|\beta_j\right|$$

Source: (The group lasso for logistic regression, 2008)

where λ is a constant, positive amount of shrinkage, which regulates the strength of the imposed penalty, and the β_j are the slope coefficients.

When λ = 0, the estimate is at a steady state, equal to the one found in standard linear regression.

As λ increases, more and more coefficients are set to zero and eliminated; when λ = ∞, all coefficients are eliminated.

As λ increases, bias increases and variance decreases.
The technique uses penalized least squares as the basis for its modeling and parameter sub-selection approach. Lasso regression helps with feature selection through the magnitude of the slope coefficients: wherever a slope value is close to zero, the corresponding feature is removed as unimportant for our prediction, keeping only the essential features. It is useful for fitting high-dimensional data exhibiting high correlations among the predictors, and can therefore be thought of as a hybrid variable selection procedure (Bayesian and LASSO Regressions for Comparative Permeability Modeling of Sandstone Reservoirs, 2018).
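A minimal sketch of this behavior on illustrative synthetic data; in scikit-learn, the parameter alpha plays the role of λ:

import numpy as np
from sklearn.linear_model import Lasso

# Two informative features and one pure-noise feature
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)   # the noise feature's coefficient is driven to (near) zero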
3.1.3.3 Decision Tree
The decision tree is a supervised machine learning algorithm suitable for both regression and classification tasks. Its primary goal is to predict the value or class of a target variable based on decision rules the algorithm generates from the training dataset. As the name suggests, the algorithm has a tree-like structure in which a root node holds the variable to test on; branches from the root node represent the outcomes of the test and end in leaf nodes representing a class or label. There are two decision tree types: the categorical variable decision tree (the target variable is categorical) and the continuous variable decision tree (the target variable is continuous).
Splitting:
A decision tree decides by splitting its root node into sub-nodes. This splitting process continues until only homogeneous nodes are left. There are multiple splitting methods, depending on the type of the target variable. For categorical variables, we can use Gini impurity, information gain, or chi-square; for continuous variables, we can use reduction in variance. As we focus on a continuous variable in the practical part, let us understand reduction in variance.
Reduction in variance:
This splitting method can be used for continuous variables only. It uses the generic statistical formulas of standard deviation and variance to find the best split.
$$\mathrm{Variance} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

Source: (Suzuki, 2019)

where x̄ is the mean of the values, x_i is an actual value, and n is the number of values.
Variance is used to calculate the purity of a node: the lower the variance, the purer the node. If a node is entirely pure, its variance is 0.
Steps to split a decision tree using reduction in variance (a minimal code sketch follows the list):
- Calculate the variance of each node for each split.
- Calculate the weighted average variance of each split of child nodes.
- Pick the split with the lowest variance.
- Repeat until uniform and homogeneous nodes are realized. (Sharma, 2020)
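As a minimal sketch of these steps for a single candidate split, with illustrative values (the helper name is hypothetical):

import numpy as np

def weighted_split_variance(left, right):
    # Weighted average of the child-node variances (step 2 above)
    n = len(left) + len(right)
    return (len(left) / n) * np.var(left) + (len(right) / n) * np.var(right)

# Target values falling into the two children of a candidate split
y_left = np.array([98.0, 100.0, 102.0])
y_right = np.array([148.0, 150.0, 155.0])
print(weighted_split_variance(y_left, y_right))   # lower is a better split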
Pruning:
Overfitting is the most common and significant problem with decision trees. It is a situation where the model gives 100% accuracy on the training dataset but not on the testing dataset, where there may be a considerable variance between actual and predicted values. The reason is that there is no limit on the tree's growth; in the worst case, it reaches 100% training accuracy by making one leaf for each observation. This hurts accuracy when predicting on the actual testing dataset. Pruning is one of the well-known ways to avoid overfitting. Pruning methods remove decision nodes, working back from the leaf nodes, without affecting the model's overall accuracy. They use statistical measures to eliminate the least reliable branches, which leads to faster classification and improved prediction on independent test data (Evaluation of Decision Tree Pruning Algorithms for Complexity and Classification Accuracy, 2010).
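In scikit-learn, a minimal sketch of limiting tree growth looks as follows: max_depth and min_samples_leaf pre-prune the tree, while ccp_alpha applies cost-complexity (post-)pruning. The data is illustrative:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 * X[:, 0] + rng.normal(scale=1.0, size=200)

# Without limits, the tree could grow one leaf per observation
tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=10, ccp_alpha=0.01)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())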
3.2 Cross-Validation in Model Selection
Generally, in machine learning, we break our dataset into training and testing data for model creation. Even with the same algorithm, the model will give different accuracies for different test sets. It is therefore best practice to apply cross-validation to a machine learning model for better accuracy estimates. Cross-validation is a popular statistical technique for algorithm selection. Its main goal is to assess how the model will perform on a different dataset. The idea is to split the data once or several times to estimate the skill of each algorithm or model: the training set is used to train each algorithm, and the validation set is used to estimate its risk. In the end, we select the algorithm with the smallest estimated risk (A survey of cross-validation procedures, 2010).
There are plenty of cross-validation methods, such as k-fold cross-validation, the holdout method, leave-p-out cross-validation, and leave-one-out cross-validation. We will focus on k-fold cross-validation, as it is prevalent, suitable for extensive datasets, and the method we are going to use in our practical part.
Cross-validation serves to verify that the selected algorithm is robust over random draws of the test data: the expected score should not wander far from the observed values after hyperparameter tuning.
3.2.1 K-fold cross-validation
This technique is a widely adopted method for model selection. As the name suggests, it randomly divides the dataset into K folds of approximately equal size. One fold is used as the testing dataset, and the remaining (K-1) folds are used as the training dataset. The model is trained and tested K times, and at the end of the process we obtain K scores. The average of these scores is taken as the model's average accuracy; the highest score is its maximum accuracy, and the lowest its minimum accuracy.
In K-fold cross-validation, K is a number we choose, representing the number of folds. The choice depends on our data size and available computation power; ideally the number is between 5 and 10. We must choose it carefully, because a poor choice can lead our model to high variance and high bias; K < 5 can cause such issues (A Comparative Study of Ordinary Cross-Validation, v-Fold Cross-Validation, and the Repeated Learning-Testing Methods, 1989).
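A minimal sketch with K = 5 on illustrative synthetic data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # K = 5 folds
scores = cross_val_score(LinearRegression(), X, y, cv=kf)
print(scores)          # one score per fold
print(scores.mean())   # average accuracy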
Advantages of K-fold CV
- As the value of K increases, the bias of the accuracy estimate decreases.
- The process is repeated at most K times, so computation time is bounded.
- Every part of the data gets to be trained on and tested exactly once.
Disadvantages of K-fold CV
- It takes a considerable amount of time to evaluate, as the algorithm must rerun from scratch K times.
3.3 Hyperparameter Tuning in Machine Learning
In machine learning, hyperparameter tuning is an important task: finding the optimal values of a model's parameters that give the maximum accuracy for a particular model. Different datasets call for different hyperparameter settings, so they must be tuned for each dataset. A hyperparameter is an element of the machine learning model that cannot be learned automatically by the model itself; it is set by a meta-process called hyperparameter tuning. Manually keeping track of hyperparameters and repeatedly fitting them on training datasets is difficult and time-consuming. Grid search CV solves this problem.
3.3.1 Grid search CV
Grid search CV is one of the well-known methods for hyperparameter tuning. It is a function of the Scikit-learn library that loops through predefined hyperparameter values and fits our model on the training dataset. The method then lists the score of each parameter combination, and we can select the best parameters from it. According to (Chih-Wei Hsu, 2003), it is highly recommended to use grid search with cross-validation to achieve the best parameter values. The grid is structured like a dictionary (keys = parameter names, values = the candidate settings to combine) and is passed to our estimator object.
Grid Search CV serves its purpose well in our model, driven from a Python dictionary. It is the easiest way to validate and classify the parameters used in modeling the final price prediction model.
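A minimal sketch on illustrative synthetic data; the parameter grid is an assumption for demonstration:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# The grid is a dictionary: keys are parameter names,
# values are the candidate settings to try
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(Lasso(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)   # best parameter combination
print(search.best_score_)    # its cross-validated score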
3.4 Python for Data Science
There are plenty of programming languages used in data science projects, such as Python, Java, R, SAS, and SQL. Python is an open-source, interpreted, dynamic object-oriented language, first made publicly available in 1991 (Hsu, 2018). It is widely used and well suited to data science tools and applications. According to the Stack Overflow survey of 2019, Python is the fastest-growing and second most loved programming language.
Python holds a unique position: when it comes to performing analytical and quantitative tasks, it is easier than other programming languages. According to engineers in academia and industry, Python APIs are available for the major deep learning frameworks, and its scientific packages have made Python incredibly productive and versatile (Bhatia, 2012).
Hence, now that we know the importance Python has in the data science field, let us focus on some of its elegant features.
- Python supports various platforms such as Windows, Linux, Mac, etc.
- It makes the program easy to read and write. It is also easy to perform various machine learning algorithms and complex scientific calculations, thanks to elegant and simple syntax.
- Python has the ultimate collection of libraries to perform various tasks like data manipulation, data analysis, and data visualization.
- Python is an expressive language that makes it possible for applications to offer a programmable interface (Eppler, 2015).
- In Python, it is simple to extend code by appending new modules implemented in other compiled languages such as C or C++.
Machine learning scientists prefer Python in terms of application areas as well. When it comes to developing applications for NLP and machine analysis, developers switch to Python due to the huge collection of libraries it provides, which helps solve complex business problems efficiently and construct robust data applications.
3.5 Essential Python Libraries for Data Science
Python libraries are reusable bunches of functions and methods that we can include in our program to perform various actions without writing the code ourselves. Python's library support has improved in recent years, making it the best alternative for data manipulation techniques. It is among the favorites of full-stack developers and is also highly recommended for general-purpose programming (McKinney, 2012).
In a data science project, we need to go through all the stages like data cleaning, data visualization, model building, etc. Python has plenty of popular libraries for these tasks. Let us focus on some of them, which we are going to use in our practical part.
3.5.1 NumPy
NumPy (Numeric Python) is one of the most powerful open-source Python libraries, used primarily for numeric analysis. NumPy deals with numerical data and provides algorithms, data structures, and other utilities for scientific calculations and data storage. It is well suited to fast operations on arrays: sorting, selecting, mathematical functions, statistical operations, linear algebra, random simulation, etc. Its predecessor, Numeric, was created by Jim Hugunin; Travis Oliphant created NumPy in 2005 by incorporating features of the competing Numarray into Numeric (Oliphant, 2015).
NumPy basics
- Create NumPy arrays and array attributes
- Array indexing and slicing
- Reshaping and concatenation
NumPy arithmetic and statistics basics
- Computations and aggregations
- Comparison and Boolean masks
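A minimal sketch illustrating these basics, with illustrative values:

import numpy as np

a = np.arange(12)            # create an array holding 0..11
print(a.shape, a.dtype)      # array attributes
print(a[2:5])                # indexing and slicing

m = a.reshape(3, 4)          # reshape into a 3x4 matrix
print(m.sum(axis=0))         # aggregation: column sums
print(m > 5)                 # comparison yields a Boolean mask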
The NumPy package has a central object called "ndarray" (n-dimensional array). It is a homogeneous container, holding elements of a single data type, and performs many operations in compiled code (Leo (Liang-Huan) Chin, 2016). Now the question is: why would we use a NumPy array when we could just use a Python list? Lists are flexible, versatile, and excellent in Python, but there are a few significant benefits to using NumPy arrays over Python lists.
Saves coding time
- No for loops: many vector and matrix operations save coding time. We do not need to iterate through an array to apply a mathematical operation to each of its elements; we can do it with a single line of code.
Example:
Using a Python list, we need a for loop to iterate through the list in order to apply the *= 6 operation to each element:

my_list = [1, 2, 3]              # an ordinary Python list
for i in range(len(my_list)):
    my_list[i] *= 6              # multiply each element one at a time

Using a NumPy array, we can apply the operation directly to the entire array with a single line of code; NumPy takes care of the rest behind the scenes:

my_array = np.array([1, 2, 3])   # assumes import numpy as np
my_array *= 6                    # multiplies every element at once
Faster execution
- Uses a single data type for all elements in the array (they must all be the same type), avoiding type checking at runtime.
- Uses contiguous blocks of memory.
Uses less memory
- No pointers, so type and item sizes are the same for each column.
- A Python list is an array of pointers to Python objects (4+ bytes per pointer and 16+ bytes for a numerical object).
- Compact data types such as uint8 and float16 are available; the choice depends on our task and the required precision of the data.
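A minimal sketch of the memory effect of compact data types:

import numpy as np

a64 = np.arange(1000, dtype=np.float64)
a16 = a64.astype(np.float16)    # compact type, lower precision
print(a64.nbytes, a16.nbytes)   # 8000 vs. 2000 bytes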
3.5.2 Pandas
The name pandas is shorthand for "panel data," a term for datasets with a multidimensional structure. This important machine learning library provides functions and rich data structures that make data analysis tasks more manageable, fast, and expressive. Pandas has various methods for combining data, time-series functionality, grouping, and filtering. It also provides indexing functionality, which simplifies reshaping, data slicing, aggregation, and selecting subsets of a dataset (McKinney, 2012).
Pandas is built on top of NumPy, which means pandas requires NumPy. Pandas does not require other libraries such as Matplotlib or SciPy, but it can be handy when combined with them. It is an excellent tool for data wrangling thanks to its robust design coupled with quick and easy data manipulation features. It has two handy data structures, the "pandas Series" and the "pandas data frame." They are the core components of pandas and allow us to reshape, merge, split, and aggregate data.
Pandas Series
It is a one-dimensional labeled array capable of holding data of any type: strings, doubles, integers, and other Python objects. The axis labels constitute the index. In short, it is a column in memory that is either independent or belongs to a pandas data frame. The labels need not be unique, but they must be of a hashable type. The object supports label-based indexing and provides a host of methods for performing index operations.
Pandas Data frame
This is a two-dimensional, tabular data structure with more detail along each axis. The concept of a data frame is borrowed from spreadsheets: it logically corresponds to an Excel sheet with both rows and columns. The data frame object contains an ordered collection of columns, like a spreadsheet. Each column holds a single data type, but different columns may have different types (Bernd Klein, 2011).
Data frame operations in pandas (a short sketch follows the list):
- Reading, viewing, and extracting information
- Grouping and sorting
- Dealing with duplicate and missing values
- Selection, filtering, and slicing
- Pivot tables and functions
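A minimal sketch of a few of these operations on an illustrative data frame:

import pandas as pd

df = pd.DataFrame({
    "location": ["A", "A", "B"],
    "price": [100.0, 120.0, 90.0],
})

print(df.head())                               # view the data
print(df.groupby("location")["price"].mean())  # grouping and aggregation
print(df[df["price"] > 95])                    # selection and filtering
prices = df["price"]                           # one column is a pandas Series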
The pandas library has been under continuous development since it was first made public. It was established to narrow, and possibly close, the gap in available data analysis tools between Python and the conventional domain-specific statistical computing platforms, software, and database languages (McKinney, January 2011). Pandas releases are frequent, each including plenty of new features, enhancements, bug fixes, and API changes. Data analysis takes the highlight among everything pandas is used for, and it ensures high functionality and superb flexibility when combined with other libraries and tools.
3.5.3 Matplotlib
Matplotlib is generally conceptualized as a plotting library for data visualization. It is a Python package for 2D plotting that generates production-quality graphics. Every organization needs data visualization and descriptive analysis, and matplotlib provides very effective methods for these tasks. It supports both interactive and non-interactive plotting, and can save graphics in several formats such as .pdf, .png, and .jpeg. It can also employ multi-window toolkits (GTK+, wxWidgets) and provides a conglomerate of plot types: line graphs, pie charts, bar charts, histograms, and other professional-grade figures. Besides, it boasts high customization, flexibility, and convenience of use (Tosi, November 2009).
Features of Matplotlib
Pylab interface
It allows users to create plots with code, just as figures are generated in the MathWorks package MATLAB™.
Matplotlib API
Acts as the abstract interface through which plots are rendered. It is responsible for tuning the parameters and ensuring that what is given in the code is translated properly to the intended interface.
Backends
These play the primary role of rendering the graphics to devices connected to the computer or intended for display services.
Most professionals employ Matplotlib to generate PostScript files for printing or publishing automatically. Some prefer to deploy the graphics in web applications that dynamically generate files of a specific nature. The Matplotlib library can also be called interactively from the Python shell, for example under Tkinter on the Windows platform (John Hunter, May 27, 2007).
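A minimal plotting sketch with illustrative values, saved to one of the supported formats:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1.1, 2.3, 2.9, 4.2]

plt.scatter(x, y, label="data points")
plt.plot(x, x, label="illustrative line")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.savefig("figure.png")   # save the graphic
plt.show()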
3.5.4 Scikit-Learn
This is a Python library that makes available various algorithms and functions used in machine learning. David Cournapeau initially developed it as a Google Summer of Code project in 2007. It is considered one of the best libraries for working with complex data.
Scikit-learn is built on NumPy, SciPy, and matplotlib. It contains several algorithms for data mining and machine learning tasks, such as:
- Dimensionality reduction
- Data reduction methods (e.g., principal component analysis, feature selection)
- Regression analysis (e.g., linear, logistic, and ridge)
- Classification and clustering models (e.g., random forest, support vector machine, k-means)
- Model tuning and selection (e.g., grid search, cross-validation)
It also provides modules for pre-processing data, extracting features, optimizing hyperparameters, and evaluating models to solve real-world problems (Hackeling, 2017).
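As a minimal sketch of how these pieces compose, on illustrative synthetic data, pre-processing can be chained with a ridge regressor in a Pipeline:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=100)

# Pre-processing and regression combined in a single estimator object
model = Pipeline([("scale", StandardScaler()), ("ridge", Ridge(alpha=1.0))])
model.fit(X, y)
print(model.score(X, y))   # R^2 on the training data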
3.6 Project Tool
Python is straightforward to learn, as it is an interpreted programming language. It has extensive documentation and vibrant online support groups where help is easily found. It is also easy to use, with minimal coding needed to realize maximum computational power in exploring data, producing graphics, and manipulating almost all variables on the go. We have employed Python through the interface of Jupyter Notebook, a very convenient tool for Python beginners and experts alike. This brings the convenience of installing Python and pandas in one go and accessing them through library imports.
References
Matplotlib: A 2D graphics environment. Hunter, J. D. 2007. 3, s.l.: IEEE Computer Soc, 2007, Vol. 9, pp. 90-95.
A Comparative Study of Ordinary Cross-Validation, v-Fold Cross-Validation, and the Repeated Learning-Testing Methods. Burman, Prabir. 1989. 3, Davis, California: Oxford university press, 1989, Vol. 76. ISSN: 00063444.
A survey of cross-validation procedures. Celisse, Sylvain Arlot & Alain. 2010. 40-79, s.l. : Statist. Surv., 2010, Vol. 4. ISSN : 1935-7516.
Bayesian and LASSO Regressions for Comparative Permeability Modeling of Sandstone Reservoirs. AI-Mudhafar, watheq, J., 2018. 1, s.l. : Springer US, February 10, 2018, Natural Resources Research, Vol. 28, pp. 47-62. ISSN: 1520-7439.
Bernd Klein, Bodensee. 2011. Numerical & scientific computing with python. introduction into pandas. [Online], 2011. https://www.python-course.eu/pandas.php.
Bhatia, Richa. 2012. Analytics India Magazine. WHY DO DATA SCIENTISTS PREFER PYTHON OVER JAVA? [Online] Analytics India Magazine, 2012. https://analyticsindiamag.com/why-do-data-scientists-prefer-python-over-java/.
Brownlee, Jason. 2015. Start machine learning. Basic concepts in machine learning. December 25, 2015.
Chakraborty, Amitabha. 2017. Bengaluru house price data. Kaggle. [Online] October 2017. [Cited: October 31, 2020.] https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data.
Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. 2003. A Practical Guide to Support Vector Classification. Taiwan: ResearchGate, 2003.
Documentation, NumPy. NumPy v1.19 Manual. NumPy. [Online] The SciPy community. [Cited: September 16 2020.] https://numpy.org/doc/stable/.
Documentation, pandas. API reference. Pandas. [Online] the pandas development team. [Cited: September 17 2020.] https://pandas.pydata.org/docs/reference/index.html.
Eppler, Jochen Martin. 2015. A convenient interface to the NEST simulator. [book auth.] Eilif Muller. Python in Neuroscience. Nijmegen: Rolf Kotter, 2015.
Evaluation of Decision Tree Pruning Algorithms for Complexity and Classification Accuracy. Patil, Dipti, D., 2010. 02, s.l. : International Journal of Computer Applications, 2010, International Journal of Computer Applications, Vol. 11.
Foote, Keith D., 2019. A Brief History of Machine Learning. Data topics. [Online] Dataversity, March 26, 2019. [Cited: August 05 2020.] https://www.dataversity.net/a-brief-history-of-machine-learning/.
Glen, Stephanie. 2015. Lasso Regression: Simple Definition. StatisticsHowTo. [Online] September 24, 2015. [Cited: August 06 2020.] https://www.statisticshowto.com/lasso-regression/.
Gupta, Mohit. 2018. ML | Introduction to Data in Machine Learning. Geeksforgeeks. [Online] May 01, 2018. [Cited: August 01 2020.] https://www.geeksforgeeks.org/ml-introduction-data-machine-learning/?ref=lbp.
Hackeling, Gavin. 2017. Mastering Machine Learning with scikit-learn. Birmingham: Packt Publishing Ltd., 2017. ISBN: 978-1-78829-987-9.
Hsu, Hansen. 2018. 2018 MUSEUM FELLOW GUIDO VAN ROSSUM, PYTHON CREATOR & BENEVOLENT DICTATOR FOR LIFE. s.l. : computer history museum, 2018.
John Hunter, Darren Dale. May 27, 2007. The Matplotlib User's Guide. May 27, 2007.
Leo (Liang-Huan) Chin, Tanmay Dutta. 2016. NumPy Essentials. Birmingham: Packt publishing ltd., 2016. p. 11. ISBN: 978-1-78439-367-0.
LLC, Cogito Tech. 2019. What are Features in Machine Learning and Why it is Important? Medium. [Online] Medium, July 15, 2019. [Cited: August 01 2020.] https://medium.com/@cogitotech/what-are-features-in-machine-learning-and-why-it-is-important-e72f9905b54d.
McKinney, Wes. January 2011. pandas: a Foundational Python Library for Data Analysis and Statistics. [Article] s.l.: ResearchGate, January 2011.
—. 2012. Python for data analysis. 1st edition. Sebastopol: O’Reilly Media, 2012. ISBN: 978-1-449-31979-3.
Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar. 2018. Foundations of Machine Learning. 2nd edition. Cambridge, MA: The MIT Press, 2018. ISBN: 978-0-262-03940-6.
Oliphant, Travis E. 2015. Guide to NumPy. 2nd edition. s.l.: CreateSpace, 2015. ISBN: 978-1-51730-007-4.
Pierce, Rod. 2018. Equation of a Straight Line. MathsIsFun.com. [Online] Rod Pierce DipCE BEng, October 12, 2018. [Cited: July 31 2020.] http://www.mathsisfun.com/equation_of_line.html.
Scikit-learn: Machine Learning in Python. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., et al. 2011. s.l.: Journal of Machine Learning Research, 2011, Vol. 12, pp. 2825-2830.
Sharma, Abhishek. 2020. 4 Simple Ways to Split a Decision Tree in Machine Learning. decision tree split methods. [Online] June 30, 2020. [Cited: August 30, 2020.] https://www.analyticsvidhya.com/blog/2020/06/4-ways-split-decision-tree/.
Suzuki, Kunihiro. 2019. Statistics – The Fundamentals. New York: Nova Science Publishers, Incorporated, 2019. p. 162. Vol. 1. ISBN: 9781536144628.
The group lasso for logistic regression. Meier, Lukas. 2008. 1, Zurich: J.R. statist, January 04, 2008, Journal of the royal statistical society, Vol. 70, pp. 53-71. 1369-7412/08/70053.
Tosi, Sandro. November 2009. Matplotlib for python developers. [ed.] Rakesh Shejwal. Birmingham, UK: Packt publishing ltd., November 2009. ISBN: 978-1-847197-90-0.
Weisberg, Stanford. 2005. Applied Linear Regression. 3rd edition. Hoboken, New Jersey: John Wiley & sons, 2005. p. 24. ISBN: 0-471-66379-4.
Appendix
Table 1: Parameters in Grid Search CV
Source: (Scikit-learn: Machine Learning in Python, 2011)
Table 2: Attributes in Grid Search CV
Source: (Scikit-learn: Machine Learning in Python, 2011)
Table 3: ndarray Attributes in NumPy
Source: (Documentation, NumPy)
Table 4: Arithmetic and Statistical Function in NumPy
Source: (Documentation, NumPy)
Table 5: Basic Functionalities of Pandas Data Frame
Source: (Documentation, pandas)
Table 6: Basic Functions in Matplotlib
Source: (Matplotlib: A 2D graphics environment, 2007)