Czech University of Life Sciences Prague
Faculty of Economics and Management
Department of Information Engineering
Bachelor Thesis
Data science
Nihar Lathiya
© 2021 CULS Prague
Table of Contents
3. Literature review
3.1 Machine learning
3.1.1 Data
3.1.2 Features
3.1.3 Machine learning algorithms
3.1.3.1 Linear regression
3.1.3.2 Lasso Regression
3.1.3.3 Decision Tree
3.2 Cross-Validation in Model Selection
3.2.1 K-fold cross-validation
3.3 Hyperparameter Tuning in Machine Learning
3.3.1 Grid search CV
3.4 Python for Data Science
3.5 Essential Python Libraries for Data Science
3.5.1 NumPy
3.5.2 Pandas
3.5.3 Matplotlib
3.5.4 Scikit-Learn
3.6 Project Tool
References
Appendix
List of tables
Table 1: parameters in grid search CV
Table 2: attributes in grid search CV
Table 3: ndarray attributes in NumPy
Table 4: arithmetic and statistical functions in NumPy
Table 5: basic functionalities of the pandas data frame
Table 6: basic functions in Matplotlib
List of figures
Figure 1: linear regression graph
List of equations
Equation 1: linear equation
Equation 2: sum of squared errors
Equation 3: lasso regression
3.Literature review
3.1 Machine learning
Machine learning is a well-known concept at the intersection of computer science, artificial intelligence, and statistics – also known as statistical learning and predictive analytics. Arthur Samuel of IBM first used the term machine learning in 1952. In the early 1950s, he wrote a checkers-playing program that built winning strategies by incorporating the moves it observed, and he later designed various mechanisms that allowed his program to improve with experience (Foote, 2019).
As a data scientist, one must become familiar with machine learning concepts because both data science and machine learning overlap. We can say:
- Data science is used to gain insights from data and understanding of data patterns.
- Machine learning is used for making predictions based on available data.
Machine learning is a subset of artificial intelligence that focuses on designing systems that use experience to improve performance or make accurate predictions, where experience is the previous information or data available to the learner. Machine learning consists of designing efficient and accurate prediction algorithms. In other fields of computer science, the critical measures of an algorithm's quality are its time and space complexity; in machine learning, the additional notion of sample complexity is needed to describe how much data the algorithm requires to learn the patterns in it. In short, the theoretical learning guarantees for an algorithm depend on the complexity and the size of the training data sample (Mehryar Mohri, 2018).
Machine learning is all about getting computers to make data-driven decisions rather than being explicitly programmed to carry out specific tasks. It enables computers to program themselves: if programming is automation, then machine learning is the automation of the automation process, and it is a way to make programming scalable. In traditional programming, we feed data and a program to the computer as input and get an output; in machine learning, we feed data and the desired output to the computer and obtain the program that would otherwise have to be written by hand (Brownlee, 2015).
There are three main types of machine learning algorithms: supervised learning, unsupervised learning, and reinforcement learning. Their differences stem from the way they treat the training and testing datasets.
Figure 1: Machine Learning Algorithms
Supervised machine learning
A supervised learning algorithm consists of an outcome or dependent variable, which we need to predict from an available set of predictors or independent variables. From this set of independent variables, we can generate a function that maps inputs to the desired output. The training process keeps going until the model reaches the desired level of accuracy on the training data. The critical points of supervised learning are that it works with labeled data, gives direct feedback, and predicts an outcome or future value.
E.g., linear regression, decision tree, random forest, k-nearest neighbors, etc.
Unsupervised machine learning
An unsupervised learning algorithm does not contain an outcome or dependent variable to predict. In these algorithms the data is unlabeled, the model receives no feedback, and the goal is to find hidden structure in the data. Unsupervised learning algorithms cannot be directly applied to regression or classification problems because we do not have a clear idea of what the output values should be, which makes it impossible to train the model the way we usually do in supervised learning. It can be used, for example, for clustering a population into different groups for specific interventions.
E.g., k-means clustering, Gaussian mixture models, etc.
Reinforcement learning
In a reinforcement learning algorithm, the machine is trained to make some specific decisions. It has a reward system and learns a series of actions, which means the machine is exposed to an environment where it continually uses trial and error. It learns from experience and tries to collect the best knowledge to make accurate business decisions.
E.g., Markov decision process
There are three main components of the machine learning system: data, features, and ML algorithms.
In our study, we seek to employ a Supervised Learning algorithm to achieve our project’s objectives. Since we seek to create a model that can easily predict the price of houses in India, we must have data to train and test the model. Supervised learning allows us the degree of freedom to choose the appropriate model that predicts house prices accurately. Data for this project is downloaded from Kaggle (Chakraborty, 2017).
3.1.1 Data
Data can be collected manually or automatically. It can be in any unprocessed form: structured records, text, numeric values, images, audio, etc. Data is a very important factor in data analysis, machine learning, and A.I., and it is not possible to train our model without it. In our case, we used the dataset below:
Dataset Name: Bengaluru-house-price-data
Type: Comma Separated Values (CSV)
Location: https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data (Chakraborty, 2017)
Therefore, big enterprises nowadays spend huge amounts of money on getting access to such data. Generally, in machine learning, we split the data into two parts – "training data" and "testing data". Since we downloaded the data from an online repository, Kaggle.com, we are aware that it should be split into the two sections given above. However, we cannot split this data until we have cleaned it. Data cleaning ensures the data is in good shape, with only the required variables being put into play, and lets us observe the features of the dataset.
Much time is bound to be spent on data cleaning. This is by far the most time-consuming part, as the data used in the modelling ought to be logically viable and without outliers, which are the most common sources of error in the modelling paradigm. Data cleaning involves eliminating all the data values with null or NA values in their cells. This is done either by replacing the missing values with the median value of the column they reside in, or by excluding the rows with missing values entirely from the dataset. The latter choice becomes reasonable when the number of missing values is small compared to the size of the whole dataset.
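As a minimal sketch of the two cleaning strategies just described (the file name and the column names such as "bath" are assumptions about the Kaggle dataset, not confirmed by it):

import pandas as pd

# Assumed file and column names; the real dataset may differ.
df = pd.read_csv("Bengaluru_House_Data.csv")

# Option 1: replace missing values in a numeric column with its median.
df["bath"] = df["bath"].fillna(df["bath"].median())

# Option 2: drop the remaining rows that contain any missing values,
# reasonable when they are a small fraction of the whole dataset.
df = df.dropna()

print(df.isnull().sum())  # confirm that no missing values remain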
Training data – it is the portion of our data that we show to our model as both input and output; based on this data we train our model. The training dataset needs to form the majority of the sample and should be large enough to overcome the bias that would be introduced by the use of a smaller data sample. In our project, we split the main dataset so that 80% forms the training set. This set was used to train the models on how to predict the price of houses given a number of input parameters. It is expected that training will take a considerable amount of time; this is to be treated as normal and the model should be left to finish its training.
Testing data – after our model is completely trained and ready for prediction, we feed the values of the testing data as input and get the predicted output from our model. This output might not be the same as the actual output of the testing data, as our model has not seen it, so we compare both outputs and check the accuracy of the model (Gupta, 2018). Model testing forms the last step in this project. It requires a pseudo-random sample of the remaining data, which is 20% of the population. It should be noted that the sample should be randomized to give a clear picture of what the model does. Testing is done using any house from any location, with the specification of the number of bedrooms and bathrooms, with the expectation of a reasonable house price prediction.
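The 80/20 split described above can be performed with scikit-learn's train_test_split. The sketch below assumes a cleaned file and a "price" target column, which are illustrative assumptions rather than the thesis' exact layout:

import pandas as pd
from sklearn.model_selection import train_test_split

# 'price' as the target and the remaining columns as features is an assumption
# about the cleaned dataframe, not the thesis' exact column layout.
df = pd.read_csv("cleaned_house_data.csv")
X = df.drop("price", axis=1)
y = df["price"]

# 80% training set, 20% testing set, shuffled with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)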
3.1.2 Features
A feature is a measurable property of the object we are trying to analyze. In datasets, features appear as columns; they can also be referred to as "variables" or "attributes". The feature selection process varies depending on what we need to analyze in our model. Basically, features are the building blocks of a dataset, and the quality of the features in a dataset has a major effect on the quality of the insights we will gain from it. It is possible to improve this quality through feature selection and feature engineering. This is typically difficult and tedious, but if it is done well, we obtain an optimal dataset that contains all the essential features and can yield beneficial insights for solving a specific business problem (LLC, 2019). These features are extracted during the data cleaning phase. It is also in order to state that the characteristics of the dataset, including the variables and individual entries, are observed; these observations form the basis on which the required features are selected for the next phase in the data cleaning pipeline.
Once the required variables have been defined, all the other columns are dropped, as our goal is to have a dataframe containing only the factors on which our model will depend. Where additional variables can help in the further reduction of some of the variables, they are created, used until they are no longer required, and then dropped. As the features are extracted, the output dataframe shrinks to the selected features; the data shrinks at every stage until we are finally comfortable using the dataframe. Since we are employing a supervised learning technique, we have to ensure the proper data statistics are in order; for instance, we visualize the data to confirm that it is normally distributed. Once this criterion is met, we further confirm that there are no anomalies in the data. This is achieved through scatter plots and by observing the distribution of house prices across locations.
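A hedged sketch of this feature extraction step, assuming typical column names from the Kaggle dataset (they may differ in the actual data):

import pandas as pd

# Assumed file and column names; the real dataset may differ.
df = pd.read_csv("Bengaluru_House_Data.csv")

# Keep only the factors the model will depend on and drop everything else.
keep = ["location", "size", "total_sqft", "bath", "price"]
df = df[keep]

# Alternatively, drop unwanted columns explicitly:
# df = df.drop(["area_type", "availability", "society", "balcony"], axis=1)

print(df.shape)       # the frame shrinks as features are extracted
print(df.describe())  # quick statistics to spot anomalies before plotting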
3.1.3 Machine learning algorithms
Currently, there are thousands of machine learning algorithms available, and hundreds of new algorithms are developed every year. Generally, a data scientist applies more than one algorithm to the model to check which one scores higher and gives better accuracy. In the practical part of this thesis, we are going to focus on the following three algorithms.
These algorithms serve as the litmus test of the viability of a model. In this project, since we seek to find the best-fitting model, we are not going to test just a single model, but rather apply all the algorithms discussed below and find the best model with its parameters. This approach is accompanied by parameter tuning to identify the algorithm most viable for the project, together with the parameter values that give optimal model performance. We therefore employ hyperparameter tuning in the training of this model to obtain the best parameters and order the candidates by their score. From the score index, we then select the algorithm with the best score and note the tuning parameters of the model.
3.1.3.1 Linear regression
Linear regression is a supervised machine learning algorithm where the predicted output is continuous and has a constant slope. It is basically used for predictive analysis, such as the cost of a house, total sales, number of calls, etc. The main goal of regression is to examine two things:
(1) does the set of predictor variables do a good job of predicting the outcome variable? And
(2) which variables are significant predictors of the outcome variable, and in what way do they impact the outcome variable?
In linear regression, we establish a relationship between the dependent and independent variable by fitting the best line, which is called the regression line and represented by a linear equation:
y = mx + b    Equation 1
Here y is the dependent variable, x is the independent variable, m is the gradient (slope) of the line, and b is the y-intercept.
Source: (Pierce, 2018)
There are two main types of linear regression: simple regression and multivariable regression. Simple regression has only one independent variable (x) and one dependent variable (y), whereas multivariable regression has one dependent variable (y) and more than one independent variable (x). If we take the example of house prices, then price prediction based only on square feet is simple regression, and price prediction based on square feet, bedrooms, bathrooms, area, balcony, etc. is multivariable linear regression.
Let us visualize linear regression by the following graph:
Figure 2: linear regression graph
In the figure above, we have two variables, x and y, with multiple data points defining the relation between them. The blue line is called the regression line and passes through the data points. Now the question is, why should it be only one line? There could be multiple lines passing through the data points. Finding the single best-fitting line for our model is exactly the goal of linear regression. The principle behind the regression line is to minimize the error, which is expressed by the sum of squared errors equation:
SSE = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
Equation 2: sum of squared errors
Source: (Weisberg, 2005)
Here (y_i - \hat{y}_i) is the distance between a data point and the blue line, shown as the red line in the figure above. The equation sums the squared individual errors of all data points from 1 to N. By minimizing this quantity, we find the single blue line that best fits the model.
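In scikit-learn, the line minimizing the sum of squared errors is fitted by LinearRegression. The sketch below uses small, made-up numbers purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: price roughly proportional to area (illustration only).
area = np.array([[500], [750], [1000], [1250], [1500]])   # square feet
price = np.array([40, 55, 72, 88, 105])                   # made-up prices

model = LinearRegression()
model.fit(area, price)                 # estimates the slope m and intercept b

print(model.coef_, model.intercept_)   # fitted m and b
print(model.predict([[1100]]))         # predicted price for 1100 square feet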
3.1.3.2 Lasso Regression
Lasso stands for "least absolute shrinkage and selection operator." It is a supervised machine learning method and a type of linear regression that uses shrinkage, where the coefficient values are shrunk towards a central point. The lasso regression procedure encourages simple and sparse models, which have fewer parameters. This algorithm is well suited for models with a high level of multicollinearity, or when it comes to automating certain parts of model selection, such as variable selection or parameter elimination (Glen, 2015).
The coefficients of parameters are estimated through lasso by minimizing the following equation:
\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
Equation 3: lasso regression
Source: (The group lasso for logistic regression, 2008)
Here \lambda is a positive constant amount of shrinkage which controls the strength of the penalty, and \beta_j are the slope coefficients.
When \lambda = 0, no parameters are eliminated, and the estimates are equal to the ones found in ordinary linear regression.
As \lambda increases, more and more coefficients are set to zero and eliminated; when \lambda = \infty, all coefficients are eliminated.
As \lambda increases, the bias increases and the variance decreases.
The lasso uses a penalized least squares approach for modelling and parameter sub-selection. Basically, lasso regression helps us perform feature selection by shrinking the magnitude of the slope coefficients: wherever a coefficient is driven close to zero, the corresponding feature is removed as it is not important for our prediction, and only the important features are kept. It is used for fitting high-dimensional data with highly correlated predictors. The lasso is a hybrid of a penalized estimation procedure and a variable selection procedure (Bayesian and LASSO Regressions for Comparative Permeability Modeling of Sandstone Reservoirs, 2018).
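A minimal sketch of lasso in scikit-learn, where the constructor argument alpha plays the role of the shrinkage parameter \lambda; the data is randomly generated for illustration:

import numpy as np
from sklearn.linear_model import Lasso

# Toy data with two informative features and one irrelevant one (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1)   # alpha corresponds to the shrinkage parameter lambda
lasso.fit(X, y)

# The coefficient of the irrelevant third feature is shrunk to (or near) zero.
print(lasso.coef_)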
3.1.3.3 Decision Tree
A decision tree is a supervised machine learning algorithm, suitable for both regression and classification tasks. The main goal of a decision tree is to predict the value or class of a specific variable using decision rules generated by the algorithm from the training dataset. As the name suggests, this algorithm is represented by a tree-like structure in which a variable serves as the root node to be tested. Branches from the root node represent the outcomes of the test, and they lead to leaf nodes that represent the class or label. There are two types of decision trees – the categorical variable decision tree (the target variable is categorical) and the continuous variable decision tree (the target variable is continuous).
Splitting:
A decision tree makes decisions by splitting its root node into sub-nodes. This splitting process continues until only homogeneous nodes are left. There are multiple splitting methods, which depend on the type of the target variable: for a categorical variable we can use Gini impurity, information gain, or the chi-square test, and for a continuous variable we can use reduction in variance. As we are going to focus on a continuous variable in the practical part, let us understand reduction in variance.
Reduction in variance:
This splitting method can be used for continuous variables only. It uses the standard formula of variance to get the best split.
Variance = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
Source: (Suzuki, 2019)
Here \bar{x} is the mean of the values, x_i is an actual value, and n is the number of values.
Variance is used to calculate the purity of a node: the lower the variance, the purer the node. If a node is entirely pure, its variance is 0.
Steps to split a decision tree using reduction in variance (a small numeric sketch follows this list):
- Calculate the variance of each child node individually for each split.
- Calculate the variance of each split as the weighted average variance of child nodes.
- Select the split with the lowest variance.
- Perform steps 1-3 until completely homogeneous nodes are received. (Sharma, 2020)
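A small numeric sketch of the steps above, computing the weighted variance of two candidate splits on made-up price values:

import numpy as np

def variance(values):
    # Population variance of the target values in a node.
    return np.mean((values - values.mean()) ** 2)

def split_variance(left, right):
    # Weighted average variance of the two child nodes produced by a split.
    n = len(left) + len(right)
    return (len(left) / n) * variance(left) + (len(right) / n) * variance(right)

# Illustrative target values (house prices) and two candidate splits.
prices = np.array([40.0, 42.0, 45.0, 90.0, 95.0, 100.0])
split_a = split_variance(prices[:3], prices[3:])   # split between cheap and expensive
split_b = split_variance(prices[:1], prices[1:])   # unbalanced split

print(split_a, split_b)   # the split with the lower weighted variance is chosen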
Pruning:
Overfitting is the most common and significant problem in decision trees. It is a situation where the model gives 100% accuracy on the training data set but not on the testing data set, where there may be a considerable difference between the predicted and actual values. The reason is that there is no limit on the growth of the decision tree; in the worst case it achieves 100% accuracy on the training data set by making one leaf for each observation. This situation hurts accuracy when predicting on the actual testing data set. Pruning is one of the well-known ways to avoid overfitting. Pruning methods convert a large tree into a smaller tree by removing decision nodes, starting from the leaf nodes, without affecting the overall accuracy of the model. Such methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly predict independent test data (Evaluation of Decision Tree Pruning Algorithms for Complexity and Classification Accuracy, 2010).
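In scikit-learn, tree growth can be limited up front or the grown tree can be pruned with cost-complexity pruning (the ccp_alpha parameter). The toy example below only illustrates the idea on randomly generated data:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(500, 3000, size=(200, 1))            # square feet (toy data)
y = 0.05 * X[:, 0] + rng.normal(scale=5, size=200)   # noisy made-up price

# An unconstrained tree can grow one leaf per observation and overfit.
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Limiting the depth and applying cost-complexity pruning keeps the tree small.
pruned_tree = DecisionTreeRegressor(max_depth=3, ccp_alpha=1.0, random_state=0).fit(X, y)

print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())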
3.2 Cross-Validation in Model Selection
Generally, in machine learning, we split our data set into training and testing data for model creation. Even with the same algorithm, the model will give us different accuracy for different test sets. Therefore, it is best practice to apply cross-validation to the machine learning model for better accuracy. Cross-validation is a popular statistical technique for algorithm selection. The main goal of cross-validation is to assess how the model will perform with different data sets. The idea behind cross-validation is to split the data once or several times for estimating the skill of each algorithm or model. The training sample is used for training each algorithm, and the validation sample is used for estimating the risk of the algorithm. In the end, we select the algorithm with the smallest estimated risk (A survey of cross-validation procedures, 2010).
There are plenty of cross-validation methods, such as k-fold cross-validation, the holdout method, leave-p-out cross-validation, and leave-one-out cross-validation. Among them, we are going to focus on k-fold cross-validation, as it is very popular, suitable for large data sets, and we are going to use it in our practical part as well.
Cross-validation serves to verify that the selected algorithm is robust over random tests with the test data set; the expected score is not supposed to wander far from the observed values after hyperparameter tuning.
3.2.1 K-fold cross-validation
K-fold cross-validation is a widely adopted method for model selection. As the name suggests, this technique randomly divides the data set into K folds of approximately equal size. In each round, one fold is used as the testing dataset, and the remaining (K-1) folds are used as the training dataset. The model is therefore trained and tested K times, and at the end of the process we get K scores. The average of these scores is considered the average accuracy of the model; the highest score can be taken as its maximum accuracy and the lowest score as its minimum accuracy.
In K-fold cross-validation, K is a number chosen by us, representing the number of folds. The choice depends on the size of our data and the computational power of our system; ideally the number is between 5 and 10. We must choose it carefully, because a poor choice might lead our model to high variance and high bias – K < 5 might cause issues like that (A Comparative Study of Ordinary Cross-Validation, v-Fold Cross-Validation, and the Repeated Learning-Testing Methods, 1989).
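A minimal sketch of K-fold cross-validation with scikit-learn, using K = 5 folds on randomly generated data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=100)

# 5 folds: each fold is used once for testing and four times for training.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv)

print(scores)          # one score per fold
print(scores.mean())   # average accuracy across the K folds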
Advantages of K-fold CV
- As the K value increases, the estimated variance and bias are reduced.
- The repetition of the process is limited to K runs, so the computation time stays manageable.
- Each part of the data gets to be used for training and is tested exactly once.
Disadvantages of K-fold CV
- It takes K times as much computation to make an evaluation, as the algorithm must be rerun from scratch K times.
3.3 Hyperparameter Tuning in Machine Learning
In machine learning, hyperparameter tuning is an important task for obtaining the optimal values of a model's parameters, i.e. those which give the maximum accuracy for a particular model. Different datasets require different hyperparameter settings, so they must be tuned for each dataset. A hyperparameter is an element of machine learning that cannot be learned by the model automatically; it is set by a meta-process called "hyperparameter tuning." Manually, it is difficult and time-consuming to keep track of the hyperparameters and repeatedly fit the model to the training dataset. This problem can be solved by grid search CV.
3.3.1 Grid search CV
Grid search CV is one of the well-known methods for hyperparameter tuning. It is a function of the Scikit-learn library which loops through predefined hyperparameters and fits our model on the training dataset. The method then lists the score of each parameter combination, and we can select the best parameters from it. According to (Chih-Wei Hsu, 2003), it is highly recommended to use grid search together with cross-validation to achieve the best parameter values. The parameter grid passed to grid search CV is structured like a dictionary (keys = parameter names, values = the possibilities we want to combine), and it is passed along with our estimator object.
Grid search CV serves its purpose well in our model, as a Python dictionary of candidate values is the easiest way to validate and classify the parameters used in the modelling of the final price prediction model.
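A hedged sketch of grid search CV in scikit-learn; the parameter grid is a dictionary of candidate values (here for the lasso alpha), and the data is randomly generated for illustration:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.1, size=100)

# The grid is a dictionary: keys are parameter names, values are candidates to try.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

grid = GridSearchCV(Lasso(), param_grid, cv=5)   # grid search combined with 5-fold CV
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)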
3.4 Python for Data Science
There are plenty of programming languages that can be used in a data science project, such as Python, Java, R, SAS, SQL, etc. Python is an open-source, interpreted, and dynamic object-oriented programming language created by Guido van Rossum in the early 1990s; it became publicly available in 1991 (Hsu, 2018). It is widely used and well suited for data science tools and applications. According to the Stack Overflow survey in 2019, Python is the fastest-growing and second most loved programming language.
Python holds a unique position: when it comes to performing analytical and quantitative tasks, it is easier to use than many other programming languages. According to engineers in academia and industry, Python APIs are available for the major deep learning frameworks, and its scientific packages have made Python incredibly productive and versatile (Bhatia, 2012).
Now that we know the importance of Python in the data science field, let us focus on some of its elegant features.
- Python supports various platforms such as Windows, Linux, Mac, etc.
- It makes the program easy to read and write. In addition, it is easy to perform various machine learning algorithms and complex scientific calculations, all thanks to elegant and simple syntax.
- Python has the ultimate collection of libraries to perform various tasks like data manipulation, data analysis, and data visualization.
- Python is an expressive language, which makes it possible for an application to offer a programmable interface (Eppler, 2015).
- In Python it is simple to extend code by appending new modules implemented in other, compiled languages like C or C++.
Machine learning scientists also prefer Python in terms of application areas. When it comes to app development for NLP (natural language processing) and sentiment analysis, developers switch to Python because of the huge collection of libraries it provides, which is very helpful for solving complex business problems easily and for constructing robust systems and data applications.
3.5 Essential Python Libraries for Data Science
Python libraries are reusable bundles of functions and methods which we can include in our program to perform several actions without rewriting code. Python has improved its library support in recent years and has become the best alternative for data manipulation techniques. It is an excellent choice as a single language for building a data-centric application and is also highly recommended for general-purpose programming (McKinney, 2012).
In a data science project, we need to go through all the stages like data cleaning, data visualization, model building, etc. Python has plenty of popular libraries for these tasks. Let us focus on some of them, which we are going to use in our practical part.
3.5.1 NumPy:
NumPy (Numerical Python) is one of the most powerful open-source Python libraries, primarily used for numerical analysis. NumPy deals with numerical data and provides algorithms, data structures, and other utilities for scientific calculations and data storage. It is highly recommended when it comes to fast operations on arrays: sorting, selecting, mathematical functions, statistical operations, linear algebra, random simulation, etc. It was originally created by Jim Hugunin and was reworked by Travis Oliphant in 2005, incorporating features of the competing Numarray package into Numeric (Oliphant, 2015).
NumPy basics
- Create NumPy arrays and array attributes
- Array indexing and slicing
- Reshaping and concatenation
NumPy arithmetic and statistics basics
- Computations and aggregations
- Comparison and Boolean masks
The NumPy package is built around a significant object called "ndarray" (n-dimensional array). It is a homogeneous collection of elements of a single data type, and many of its operations are performed in compiled code (Leo (Liang-Huan) Chin, 2016). Now the question is, why would we use a NumPy array when we can just use a Python list? The list is a very flexible, versatile, and great data structure in Python, but there are a few major benefits of using NumPy arrays over a Python list.
Saves coding time
- No for loops: many vector and matrix operations save coding time. We do not need to iterate through an array to apply a mathematical operation to each element of that array; we can do it with a single line of code.
Example:
Using a Python list, we need a for loop to iterate through the list before we can apply the *= 6 operation to each element:

for i in range(len(my_list)):
    my_list[i] *= 6

Using a NumPy array, we can apply the operation directly to the entire array with a single line of code; NumPy takes care of the rest behind the scenes:

my_array *= 6
Faster execution
- Uses a single data type for each element in the array (all elements must be the same data type), avoiding type checking at runtime.
- Uses contiguous blocks of memory
Uses less memory
- No pointers, so the type and item size are the same for each column.
- A Python list is an array of pointers to Python objects (4+ bytes per pointer and 16+ bytes per numerical object).
- NumPy offers compact data types like uint8 and float16,
- the choice of which depends on our task and the precision of the data (a quick memory check follows this list).
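As a minimal illustration of the memory claims above (the exact byte counts depend on the platform and Python version and are only indicative):

import sys
import numpy as np

numbers = list(range(256))
array = np.array(numbers, dtype=np.uint8)   # one compact, fixed-size element per value

# A list stores pointers to full Python int objects; the array stores raw bytes.
list_bytes = sys.getsizeof(numbers) + sum(sys.getsizeof(n) for n in numbers)
print(list_bytes, array.nbytes, array.itemsize)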
3.5.2 Pandas
Pandas' name is derived from "panel data" – an econometrics term for multidimensional structured data sets. It is another important machine learning library and provides functions and rich data structures to make our data analysis tasks easier, faster, and more expressive. Pandas has various methods for combining data, time-series functionality, grouping, and filtering. It also provides indexing functionality that makes it easy to reshape, slice and dice, perform aggregation, and select subsets of data (McKinney, 2012).
Pandas is built on top of NumPy, which means NumPy is required by pandas. Other libraries like Matplotlib and SciPy are not required by pandas, but they can be very useful in combination with it. Pandas is an excellent tool for data wrangling, designed for quick and easy data manipulation and visualization. It has two handy data structures known as the "pandas Series" and the "pandas DataFrame." They are the core components of pandas that allow us to reshape, merge, split, and aggregate data.
Pandas Series
It is a one-dimensional labeled array that holds data of types such as string, integer, float, Python objects, etc. The axis labels form the index. In short, a pandas Series is just a column in memory that is either independent or belongs to a pandas DataFrame. The labels do not need to be unique, but they must be of a hashable type. The object supports both integer-based and label-based indexing and provides a host of methods for performing operations involving the index.
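A short, illustrative example of a pandas Series; the labels and values below are made up:

import pandas as pd

# A labeled one-dimensional array; the labels form the index.
prices = pd.Series([72.5, 120.0, 95.3], index=["house_a", "house_b", "house_c"])

print(prices["house_b"])    # label-based indexing
print(prices.iloc[0])       # integer-based indexing
print(prices.mean())        # operations over the values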
Pandas Data frame
A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes. The concept of a data frame is based on spreadsheets: it logically corresponds to a sheet of an Excel workbook that contains both rows and columns. The DataFrame object contains an ordered collection of columns, like a spreadsheet or an Excel sheet. Each column holds a single data type, but different columns may have different types (Bernd Klein, 2011).
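A small, made-up DataFrame illustrating the points above (the column names are hypothetical, not the thesis dataset):

import pandas as pd

# Hypothetical house records; each column has its own data type.
df = pd.DataFrame({
    "location": ["Whitefield", "Hebbal", "Whitefield"],
    "total_sqft": [1200, 950, 1500],
    "price": [72.5, 55.0, 95.3],
})

print(df.dtypes)                                  # one dtype per column
print(df.groupby("location")["price"].mean())     # aggregation by label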
Data frame operation in pandas
- Read, view, and extract information
- Grouping and sorting
- Deals with duplicate and missing values
- Selection, filtering, and slicing
- Pivot table and functions
The pandas library has been under development since 2008. It is intended to close the gap in the richness of available data analysis tools between Python, a general-purpose systems and scientific computing language, and the numerous domain-specific statistical computing platforms and database languages (McKinney, January 2011). Frequent releases of the pandas library bring plenty of new features, enhancements, bug fixes, and changes in the API. Data analysis above everything else is the highlight when it comes to the use of pandas; it offers high functionality and great flexibility when combined with other libraries and tools.
3.5.3 Matplotlib
Matplotlib is a plotting library for data visualization. It is a Python package for 2D plotting that generates production-quality graphs. Matplotlib was originally written by John D. Hunter; later, in 2012, Michael Droettboom was nominated as the lead developer of matplotlib. Visualizing data and descriptive analysis are very important for any organization, and matplotlib provides very effective methods for these tasks. It supports interactive and non-interactive plotting and can save images in several output formats such as PNG, PS, etc. It can use multiple window toolkits (GTK+, wxWidgets, Qt, etc.) and provides a wide variety of plot types such as line graphs, pie charts, bar charts, histograms, and other professional-grade figures. In addition, it is highly customizable, flexible, and easy to use (Tosi, November 2009).
Matplotlib has basically three parts:
The Pylab interface
It is a set of functions that allow users to create plots with code quite similar to MATLAB figure-generating code.
The matplotlib frontend (matplotlib API)
It is a set of classes that do the heavy lifting, creating and managing figures, text, lines, plots, etc. This is an abstract interface that knows nothing about the output.
The backends
These are device-dependent drawing devices, aka renderers, which transform the frontend representation to hardcopy or a display device.
Matplotlib is used by many people in many different contexts. Some want to automatically generate PostScript files to send to a printer or publisher, while others deploy it on a web application server to generate PNG output for inclusion in dynamically generated web pages. Matplotlib can also be used interactively from the Python shell, with Tkinter, on Windows (John Hunter, May 27, 2007).
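As a brief sketch of typical matplotlib usage in this context (the data below is randomly generated purely for illustration):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
sqft = rng.uniform(500, 3000, size=200)
price = 0.05 * sqft + rng.normal(scale=10, size=200)

plt.scatter(sqft, price, s=8)            # scatter plot of price versus area
plt.xlabel("Total square feet")
plt.ylabel("Price")
plt.title("Toy house-price data")
plt.savefig("prices.png")                # save to a PNG file
plt.show()                               # or display interactively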
3.5.4 Scikit-Learn
Scikit-learn is a Python library that provides the various algorithms and functions used in machine learning. It was originally developed by David Cournapeau as a Google Summer of Code project in 2007, and it is considered one of the best libraries for working with complex data.
Scikit-learn is built on NumPy, SciPy, and matplotlib. It contains a number of algorithms for data mining and machine learning tasks, such as the following (a short sketch follows this list):
- dimensionality reduction
- data reduction methods (e.g., principal component analysis, feature selection)
- regression analysis (e.g., linear, logistic, and ridge)
- classification models (e.g., random forest, support vector machine)
- Clustering methods (e.g., K-means)
- model selection (e.g. grid-search, cross-validation)
- It also provides modules for pre-processing data, extracting features, optimizing hyperparameters, and evaluating models to solve real-world problems (Hackeling, 2017).
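The following minimal sketch strings several of the listed capabilities together (pre-processing, dimensionality reduction, regression, and cross-validation) on randomly generated data; it is an illustration, not the thesis model:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 6))
y = X[:, 0] * 2.0 - X[:, 1] + rng.normal(scale=0.2, size=150)

# Pre-processing, dimensionality reduction, and regression combined in one pipeline.
model = make_pipeline(StandardScaler(), PCA(n_components=3), Ridge(alpha=1.0))
print(cross_val_score(model, X, y, cv=5).mean())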
3.6 Project Tool
Python is straightforward to learn, as it is an interpreted programming language with extensive documentation and vibrant online support groups where help is easily found. It is also easy to use: with minimal coding it realizes maximum computational power in exploring data, using graphics, and manipulating almost all variables on the go. We have used Python from the interface of Jupyter Notebook, which is a very convenient tool for Python beginners and experts alike, as it brings the convenience of installing Python and pandas in one go and accessing them through library imports.
References
Matplotlib: A 2D graphics environment. Hunter, J. D. 2007. 3, s.l. : IEEE Computer Soc, 2007, Vol. 9, pp. 90-95.
A Comparative Study of Ordinary Cross-Validation, v-Fold Cross-Validation and the Repeated Learning-Testing Methods. Burman, Prabir. 1989. 3, Davis, California : Oxford university press, 1989, Vol. 76. ISSN: 00063444.
A survey of cross-validation procedures. Celisse, Sylvain Arlot & Alain. 2010. 40-79, s.l. : Statist. Surv., 2010, Vol. 4. ISSN : 1935-7516.
Bayesian and LASSO Regressions for Comparative Permeability Modeling of Sandstone Reservoirs. Al-Mudhafar, Watheq J. 2018. 1, s.l. : Springer US, February 10, 2018, Natural Resources Research, Vol. 28, pp. 47-62. ISSN: 1520-7439.
Bernd Klein, Bodenseo. 2011. Numerical & Scientific Computing with Python. Introduction into Pandas. [Online] 2011. https://www.python-course.eu/pandas.php.
Bhatia, Richa. 2012. Analytics India Magazine. WHY DO DATA SCIENTISTS PREFER PYTHON OVER JAVA? [Online] Analytics India Magazine, 2012. https://analyticsindiamag.com/why-do-data-scientists-prefer-python-over-java/.
Brownlee, Jason. 2015. Start machine learning. Basic concepts in machine learning. December 25 2015.
Chakraborty, Amitabha. 2017. Bengaluru House price data. Kaggle. [Online] October 2017. [Cited: October 31, 2020.] https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data.
Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. 2003. A Practical Guide to Support Vector Classification. Taiwan : research gate, research gate, 2003.
documentation, NumPy. NumPy v1.19 Manual. NumPy. [Online] The SciPy community. [Cited: September 16 2020.] https://numpy.org/doc/stable/.
documentation, pandas. API reference. Pandas. [Online] the pandas development team. [Cited: September 17 2020.] https://pandas.pydata.org/docs/reference/index.html.
Eppler, Jochen Martin. 2015. A convenient interface to the NEST simulator. [book auth.] Eilif Muller. Python in Neuroscience. Nijmegen : Rolf Kotter, 2015.
Evaluation of Decision Tree Pruning Algorithms for Complexity and Classification Accuracy. Patil, Dipti D. 2010. 02, s.l. : International Journal of Computer Applications, 2010, International Journal of Computer Applications, Vol. 11.
Foote, Keith D. 2019. A Brief History of Machine Learning. Data topics. [Online] Dataversity, March 26 2019. [Cited: August 05 2020.] https://www.dataversity.net/a-brief-history-of-machine-learning/.
Glen, Stephanie. 2015. Lasso Regression: Simple Definition. StatisticsHowTo. [Online] September 24 2015. [Cited: August 06 2020.] https://www.statisticshowto.com/lasso-regression/.
Gupta, Mohit. 2018. ML | Introduction to Data in Machine Learning. Geeksforgeeks. [Online] 01 may 2018. [Cited: August 01 2020.] https://www.geeksforgeeks.org/ml-introduction-data-machine-learning/?ref=lbp.
Hackeling, Gavin. 2017. Mastering Machine Learning with Scikit-Learn. Birmingham : Packt Publishing Ltd., 2017. ISBN: 978-1-78829-987-9.
Hsu, Hansen. 2018. 2018 MUSEUM FELLOW GUIDO VAN ROSSUM, PYTHON CREATOR & BENEVOLENT DICTATOR FOR LIFE. s.l. : computer history museum, 2018.
John Hunter, Darren Dale. 2007. The Matplotlib User's Guide. May 27, 2007.
Leo (Liang-Huan) Chin, Tanmay Dutta. 2016. NumPy Essentials. Birmingham : Packt publishing ltd., 2016. p. 11. ISBN: 978-1-78439-367-0.
LLC, Cogito Tech. 2019. What are Features in Machine Learning and Why it is Important? Medium. [Online] Medium, July 15 2019. [Cited: August 01 2020.] https://medium.com/@cogitotech/what-are-features-in-machine-learning-and-why-it-is-important-e72f9905b54d.
McKinney, Wes. January 2011. pandas: a Foundational Python Library for Data Analysis and Statistics. [Article] s.l. : ResearchGate, January 2011.
—. 2012. Python for Data Analysis. 1st edition. Sebastopol : O'Reilly Media, 2012. ISBN: 978-1-449-31979-3.
Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar. 2018. Foundations of Machine Learning. 2nd edition. Cambridge, England : The MIT Press, 2018. ISBN: 978-0-262-03940-6.
Oliphant, Travis E. 2015. Guide to NumPy. 2nd edition. s.l. : CreateSpace, 2015. ISBN: 978-1-51730-007-4.
Pierce, Rod. 2018. Equation of a Straight Line. MathsIsFun.com. [Online] Rod Pierce DipCE BEng, October 12 2018. [Cited: July 31 2020.] http://www.mathsisfun.com/equation_of_line.html.
Scikit-learn: Machine Learning in Python. Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. 2011. s.l. : Journal of Machine Learning Research, 2011, Vol. 12, pp. 2825–2830.
Sharma, Abhishek. 2020. 4 Simple Ways to Split a Decision Tree in Machine Learning. decision tree split methods. [Online] June 30 2020. [Cited: August 30 2020.] https://www.analyticsvidhya.com/blog/2020/06/4-ways-split-decision-tree/.
Suzuki, Kunihiro. 2019. Statistics – The Fundamentals. New York : Nova Science Publishers, Incorporated, 2019. p. 162. Vol. 1. ISBN: 9781536144628.
The group lasso for logistic regression. Meier, Lukas. 2008. 1, Zurich : J.R. statist, January 04 2008, Journal of the royal statistical society , Vol. 70, pp. 53-71. 1369-7412/08/70053.
Tosi, Sandro. November 2009. Matplotlib for python developers. [ed.] Rakesh Shejwal. Birmingham, UK : Packt publishing ltd., November 2009. ISBN: 978-1-847197-90-0.
Weisberg, Stanford. 2005. Applied Linear Regression. 3rd edition. Hoboken, New Jersey : John Wiley & sons, 2005. p. 24. ISBN : 0-471-66379-4.
Appendix
| Parameters | Types | Default | Tasks |
| Estimator | Estimator object | — | Selected estimator object, which provides a score |
| param_grid | List of dictionaries | — | Dictionaries with the lists of parameter values to try |
| Scoring | Str, callable, list | None | To evaluate our predictions on the test data set |
| n_jobs | Int | None | Number of jobs to run in parallel |
| pre_dispatch | Int or str | n_jobs | Controls the number of jobs that get dispatched during parallel execution. Reducing this number can help avoid an explosion of memory consumption when more jobs get dispatched than the CPU can process |
| CV | Int, cross-validation generator | None | Number of cross-validation folds to try for each selected hyperparameter |
| Refit | Bool, str or callable | True | Refit an estimator using the best-found parameters on the whole dataset |
| Verbose | Int | — | Controls the verbosity; the higher, the more messages |
| error_score | Raise or numeric | np.nan | Value to assign to the score if an error occurs in estimator fitting |
| return_train_score | Bool | False | To get insights on how different parameter settings impact the over-fitting/under-fitting trade-off |
Table 1: parameters in grid search CV
Source: (Scikit-learn: Machine Learning in Python, 2011)
| Attributes | Types | Tasks |
| cv_results_ | Dict of numpy ndarrays | It is a dict with keys as column header and values as columns, it can be imported into pandas dataframe |
| best_estimator_ | Estimator | Estimator that has been chosen by the search, i.e. the one which gave the highest score on the left-out data |
| best_score_ | Float | Mean cross-validated score of the best estimator |
| best_params_ | Dict | Parameter setting that gave the best result on hold out data |
| best_index_ | Int | The index which correspond to the best candidate parameter setting |
| scorer_ | Function or a dict | Used on the held-out data to choose the best parameters for the model |
| n_splits_ | Int | The number of cross validation splits |
| refit_time_ | Float | Seconds used for refitting best model on the whole dataset |
Table 2: attributes in grid search CV
Source: (Scikit-learn: Machine Learning in Python, 2011)
| ndarray attributes | Description |
| ndarray.ndim | Number of axis/dimensions of the array |
| ndarray.shape | The dimensions of the array. Indicates the size of the array in each dimension; e.g., the shape of a matrix with n rows and m columns will be (n, m) |
| ndarray.size | Total number of elements of the array. It is equal to the product of the elements of shape |
| ndarray.dtype | An object describing the type of the elements |
| ndarray.itemsize | The size in bytes of each element in the array |
| ndarray.data | The buffer containing the actual elements of the array. |
Table 3: ndarray attributes in NumPy
Source: (documentation)
| Operation & functions | Description |
| import numpy as np | To import the NumPy library in the IDE |
| np.add | For addition of arrays |
| np.subtract | For subtraction of array |
| np.multiply | For multiplication of arrays |
| np.divide | For division of arrays |
| np.power | This function treats elements in the first input array as base and returns it raised to the power of corresponding element in the second input array |
| np.mod / np.remainder | This function returns the remainder of the division of the corresponding elements in the input arrays |
| np.amin | This function returns the minimum of the elements in the given array along the specified axis |
| np.amax | This function returns the maximum of the elements in the given array along the specified axis |
| np.ptp | This function returns the range of values along an axis |
| np.percentile | This function computes the q-th percentile of the data along the specified axis |
| np.quantile | This function computes the q-th quantile of the data along the specified axis |
| np.mean | This function computes the arithmetic mean along the specified axis |
| np.average | This function computes the weighted average along the specified axis |
| np.median | This function computes the median along the specified axis |
| np.std | This function computes the standard deviation along the specified axis |
| np.var | This function computes the variance along the specified axis |
Table 4: arithmetic and statistical function in NumPy
Source: (documentation)
| Functions | Description |
| import pandas as pd | To import the pandas library in the IDE |
| pd.read_csv('file name.csv') | To read CSV files |
| df.to_csv('name of file to save.csv') | To write CSV files |
| xlsx = pd.ExcelFile('excel file name.xlsx')  df = pd.read_excel(xlsx, 'Sheet 1') | To open Excel files |
| df = pd.DataFrame(data) | To create a data frame |
| df.drop([0,5]) | To remove rows corresponding to index |
| df[‘new column name’] = value of column | To add new column in existing data frame |
| df.drop(‘column name’,axis=1) | To remove specific columns |
| df.shape | To see number of rows and columns |
| df.index | Description of index in data frame |
| df.columns | list of columns in data frame |
| df.head() | First 5(as default) sample of data frame |
| df.tail() | Last 5(as default) sample of data frame |
| df.count() | Counts non-null data of data frame |
| df.sum() | Sum of values in data frame |
| df.min() | Lowest values in data frame |
| df.max() | Highest values in data frame |
| df.idxmin() | Index of lowest value |
| df.idxmax() | Index of highest value |
| df.describe() | Summary statistics of data frame (including quartiles, mean, median etc) |
| df.sort_values() | Sort values in ascending order |
| df.sort_values(ascending=False) | Sort values in descending order |
| df.apply(lambda x: x.replace('m', 'n')) | To apply a function that replaces 'm' by 'n' |
| df[df['column name'] % 2 == 0] | Filter the data frame to show only even numbers |
| df.loc[[row index], ['column name']] | Select the chosen rows from the chosen columns |
Table 5: basic functionalities of pandas data frame
Source: (documentation)
| Functions | Description |
| import matplotlib.pyplot as plt | To import the matplotlib plotting library in the IDE |
| bar (x, height [, width, bottom, align, data]) | To make a bar plot |
| barh (y, width [,height, left, align]) | To make a horizontal bar plot |
| boxplot (x [,notch, sym, vert, whis, …]) | To make a box and whisker plot |
| hist (x[,bins, range, density, weight, …]) | To plot a histogram |
| hist2d (x,y [,bins, range, density, …]) | To make 2d histogram plot |
| pie (x [, explode, labels, colors, autopct, … ]) | To plot a pie chart |
| plot (*args [, scalex, scaley, data]) | To plot y versus x as line and /or markers |
| polar (*args, **kwargs) | To make a polar plot |
| scatter (x, y [, s, c, marker, cmap, norm, …]) | A scatter plot of x vs y |
| stackplot (x, *args [, labels, colors, …]) | To draw a stacked area plot |
| stem (*args [, linefmt, markerfmt, basefmt, …]) | To create a stem plot |
| step (x, y, *args [, where, data]) | To make a step plot |
| quiver (*args [,data]) | To Plot a 2d field of arrows |
| axes ([args]) | To add an axes to the current figure and make it the current axes |
| legend (*args, **kwargs) | To place a legend on the axes |
| table ([cellText, cellcolors, cellLoc, …]) | To add a table on axes |
| text(x, y, s [, fontdict]) | To add a text to the axes |
| title (label [, fontdict, loc, pad, y]) | To set a title for the axes |
| xlabel (xlabel [, fontdict, labelpad, loc]) | To set a label for the x axes |
| xlim (*args, **kwargs) | To set x limits of the current axes |
| xscale (value, **kwargs) | To set the x-axes scale |
| xticks ([ticks,labels]) | To set current tick location and labels of x axes |
| ylabel (ylabel [, fontdict, labelpad, loc]) | To set a label for the y axes |
| ylim (*args, **kwargs) | To set y limits of the current axes |
| yscale (value, **kwargs) | To set the y-axes scale |
| yticks ([ticks,labels]) | To set current tick location and labels of y axes |
| imread (fname[,format]) | To read an image from a file into an array |
| imsave (fname, arr, **kwargs) | To save an array as an image file |
| imshow (X [, cmap, norm, aspect, …]) | To display data as an image |
| figure ([num, figsize, dpi, facecolor, …]) | To create a new figure or activate an existing one |
| figtext (x, y, s [, fontdict]) | To add a text to figure |
| figlegend (*args, **kwargs) | To place legend on figure |
| show (*[, block]) | To display all open figure |
| savefig (*args, **kwargs) | To save current figure |
| close ([fig]) | To close a figure window |
Table 6: basic functions in Matplotlib
Source: (Matplotlib: A 2D graphics environment, 2007)