Python for Beginners ~ Part 1 ~

Python is free and one of the most popular programming languages. It is simple, flexible, readable, and comes with rich functionality. Python can also use external libraries, so it can be applied in a wide range of fields; the most popular of these are machine learning and deep learning.

Through a series of posts, we will see the basics of Python from scratch. Let’s take a look at the basic concepts and grammar, keeping in mind their use in data science. Python runs a script line by line, so it tells you exactly where you went wrong. Don’t be afraid to make mistakes. Rather, make plenty of them and build your code up in small steps with small modifications.

In this post, we will learn the following contents.

  • Variables
  • Comment “#”
  • Arithmetic operations
  • Boolean
  • Comparison operator
  • List

Learning the basics of programming can feel tedious. However, creative work awaits beyond it!!

Note) The sample code in this post is supposed to run on Jupyter Notebook.

Variables

A variable is like a box for storing information. We assign a name to a variable, such as “x” or “y”, and put a value or a string into it. Let’s take a look.

We first put “1” into “x” and then check the content with the print() function. The “print()” function displays the content on your monitor.

x = 1
print(x)

>>  1

Similarly, strings can also be stored in variables. Strings are written inside quotation marks, " ".

y = "This is a test."
print(y)

>>  This is a test.

Variables have a data type, e.g. “int”, “float”, and “str”. “int” is for integers, “float” is for decimals, and “str” is for strings. Let’s check it with the type() function.

type(x)  # x = 1
type(y)  # y = "This is a test."

>>  int
>>  str

z = 1.5
type(z)

>>  float

Python has the functions to convert variables into “int”, “float”, or “str” types. These functions are “int()”, “float()”, and “str()”.

a = 10

type( int(a) )
>>  int

type( float(a) )
>>  float

type( str(a) )
>>  str

Here are some basic rules for variable names. Knowing the following is enough to get started, but see the official Python documentation for more details.

Rules of variable names

  • Names are case sensitive.
  • Only letters, numbers, and “_” can be used.
  • The first character must NOT be a number.

For example, “abc” and “ABC” are distinguished. “number_6” is OK, but “number-6” is prohibited because “-” is not allowed in a name; including “-” is a frequent mistake. Besides, “5_abc” is also prohibited: the first character must be a letter or “_”. However, some names starting with “_” have a special meaning for Python itself, so the author highly recommends that beginners start variable names with a letter.
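
As a quick illustration (the names below are arbitrary examples), valid and invalid variable names look like this:

number_6 = 10    # OK: letters, digits, and "_" only
NUMBER_6 = 20    # OK, and distinct from number_6 because names are case sensitive
# number-6 = 10  # NG: "-" is not allowed in a name (SyntaxError)
# 5_abc = 10     # NG: a name cannot start with a number (SyntaxError)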

Comment “#”

Note that “#” marks a comment, which is NOT executed by Python. Python ignores the contents after “#”. In the following example, Python treats print(“Hello world!”) as code, while the other contents, such as “This is for a tutorial of Python.” and “print the first message!”, are treated as comments.

# This is for a tutorial of Python.
print("Hello world!")  # print the first message!

There is another way to write comments. Python effectively skips a block placed between """ and """ (a triple-quoted string that is not assigned to anything). In the following example, the sentence “This is a comment.” is skipped by Python.

"""
   This is a comment.
"""
print("test")

>>  test

Comments are important because we forget what the code we wrote in the past was for. Therefore, programmers leave explanations and their reasoning as comments. Concise and clear comments are very important.

Arithmetic operations

Here, let’s see simple examples of arithmetic operations. Complex calculations are built from simple arithmetic operations, so most programming does not require complex mathematics but rather combinations of simple operations.

Operator    Description
+           addition
-           subtraction
*           multiplication
/           division
**          power

2 + 3  # addition
>>  5

9 - 3  # subtraction
>>  6

5 * 20  # multiplication
>>  100

100 / 20  # division
>>  5.0

2**3  # power
>>  8

Boolean

A boolean is just True or False. It is important, however, because we build algorithms by controlling the flow with True and False. For example, assume the following situation: when an examination score is 70 or higher, it is a pass; otherwise, it is a fail. To express such conditions in code, booleans are needed.
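
As a minimal sketch of that pass/fail judgment (the variable names “score” and “passed” are just examples for illustration):

score = 85
passed = (score >= 70)
print(passed)

>>  True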

The concept of a boolean may be unfamiliar to beginners, but Python makes it intuitive. In the example below, the result of evaluating x > 0 (or x < 0) is assigned to the variable “boolean”.

x = 1   # Assign 1 to the variable "x"

boolean = x > 0
print(boolean)

>>  True

boolean = x < 0
print(boolean)

>>  False

There is no problem with the above code. However, the author recommends the following style for readability. With it, we can clearly see what is assigned to the variable “boolean”.

boolean = (x > 0)

Comparison operator

Here, let me introduce the comparison operators related to booleans. We have already seen examples such as “<” and “>”. The typical ones we use frequently are listed below.

Operator    Description
>           [A > B] A is greater than B.
<           [A < B] A is less than B.
>=          [A >= B] A is greater than or equal to B.
<=          [A <= B] A is less than or equal to B.
==          [A == B] A equals B.
!=          [A != B] A does NOT equal B.

Examples with numerical values.

1 > 2
>>  False

5 < 9
>>  True

5 >= 5
>>  True

10 <= 8
>>  False

10 == 10
>>  True

10 != 10
>>  False

Examples with strings. Recall that Python is case sensitive.

"apple" == "APPLE"
>>  False

"apple" != "APPLE"
>>  True

List

A list is a data structure that holds a sequence of data. For example, we can treat 1, 2, 3, ... as one group with a list. A list is written with “[]”; let’s see an example.

A = [1,  2,  3]
print(A)

>>  [1, 2, 3]

“A” is a list that stores the sequence 1, 2, 3. Each element can be accessed by its index, e.g. A[0]. Note that, in Python, indices start from 0.

A[0]

>>  1

It is easy to replace, add, or delete any element. This property is called “mutable”.

Let’s replace the 2 at A[1] with 5. You will see that A[1] is rewritten from 2 to 5.

A[1] = 5  # A = [1,  2,  3]
print(A)

>>  [1, 5, 3]

We can easily add a new element to the end of the list by the “.append()” method. Let’s add 99 to the list “A”.

A.append(99)  # A = [1,  5,  3]
print(A)

>>  [1, 5, 3, 99]

Of course, it is also easy to delete an element, using the “del” statement. Let’s delete the element 3 at A[2]. We will see that the list “A” changes from “[1, 5, 3, 99]” to “[1, 5, 99]”.

del A[2]  # A = [1,  5,  3,  99]
print(A)

>>  [1, 5, 99]

One more point: a list can hold numbers and strings together. Let’s add the string “apple” to the list “A”.

A.append("apple")
print(A)

>>  [1, 5, 99, 'apple']

Actually, this is one point where Python differs from other famous programming languages such as C(C++) and Fortran. If you know such languages, recall that you first declare a variable name and its type; an array (a “list” in Python) can then hold only a sequence of data of the same type.

This is one of the reasons Python is said to be a flexible language. Of course, there are some disadvantages; one is slower processing speed. Therefore, when performing only numerical operations, you should use a so-called NumPy array, which handles only arrays composed of the same type of values. NumPy is introduced in another post.
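
For reference, here is a minimal sketch of that difference, assuming NumPy is already installed; a NumPy array forces all of its elements into a single data type.

import numpy as np

a = np.array([1, 2, 3])
print(a.dtype)   # a single numeric type, e.g. int64

b = np.array([1, 2, "apple"])
print(b.dtype)   # all elements are coerced to one type (a string type here)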

We can easily get the length of a list by the “len()” function.

len(A)  # A = [1, 5, 99, 'apple']

>>  4

Finally, the author would like to introduce one frequent mistake. Let’s generate a new list that is supposed to be the same as the list A.

B = A  # A = [1, 5, 99, 'apple']
print(B)

>>  [1, 5, 99, 'apple']

B has the same elements as A. Next, let’s replace one element of B.

B[3] = "orange"  # B[3] = "apple"
print(B)

>>  [1, 5, 99, 'orange']

We have confirmed that the element of B[3] is replaced from “apple” to “orange”. Then, let’s check A too.

A

>>  [1, 5, 99, 'orange']

Surprisingly, A has also changed! Because B was created by “B = A”, B refers to the same object in memory as A. We can avoid this mistake with the code below.

A = [1, 5, 99, 'apple']
##-- Generate the list "B"
B = A.copy()
B[3] = "orange"  # B[3] = "apple"

print(A) # Check the original list "A"

>>  [1, 5, 99, 'apple']

We have confirmed that the list A was NOT changed. When we want an independent copy of a list, we should use the “.copy()” method.

Summary

We have seen the basics of Python through sample code. The author would be glad if the reader feels that Python is flexible and intuitive. Python is a good choice for beginners.

In the next post, we will see the other contents for learning the basics of Python.

GitHub Beginner’s Guide for Personal Use

Git, a version control system, is one of the essential skills for programmers and software engineers. Especially, GitHub, a version control service based on Git, is becoming the standard skill for such engineers.

GitHub is a famous service for version control of software development projects. On GitHub, you can host your repository as a web page and keep your code there. Besides, GitHub has many rich functions that make it easier to manage versions of code, so it is very practical for large-scale project management. On the other hand, the hurdle for personal use is a little high for beginners because of its peculiar concepts, such as “commit” and “push”.

In this post, we will see the basic usage of GitHub, especially the process of creating a new repository and pushing your code. It is intended for beginners. After reading this post, you will be able to keep your code on GitHub, making your work more efficient.

What is GitHub?

Git, the core system behind GitHub, is an open-source tool for version control. In particular, its ability to track changes between versions is very useful, making it easier to run a software development project as a team.

GitHub is a well-known service built on Git. Roughly speaking, GitHub is a platform for managing your own code and for utilizing code others have written. We can manage not only personal code but also open-source projects; therefore, many open-source projects in the world are published through GitHub. The Python library you are using may also be published through GitHub.

The basic concept of GitHub is to synchronize a repository, which is like a directory containing your code, between your PC and the GitHub server. The key feature is that not only the code but also the change history is synchronized. This is why GitHub is a powerful tool for team development.

Try Git

First of all, if you have NOT installed Git, you have to install it. Please refer to the Git official site.

When Git is successfully installed, you can see the following message after the execution of the “git” command on the terminal or the command prompt.

git

>>  usage: git [--version] [--help] [-C <path>] [-c <name>=<value>]
>>             [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
>>             [-p | --paginate | -P | --no-pager] [--no-replace-objects] [--bare]
>>             [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
>>             <command> [<args>]
>>  
>>  These are common Git commands used in various situations:
>>  ...
>>  ...

Git command

Here, we will use the Git commands. The basic format is “git @@@”, where “@@@” is a subcommand such as “clone” or “add”. In this article, we will use just the following 5 commands.

git clone
git status
git add
git commit -m "<comment>"
git push

For personal use, these five commands are all you need. Each command will be explained below.

Create a New Repository

First, you create a new repository for your project. A repository is like a folder of your codes. It is easy to create a new repository on your GitHub account.

1. Visit the GitHub site and log in to your account. If you don’t have an account, please create one.

2. Go to the “Repositories” tab, and click the “New” button.

3. Fill in the necessary information, “Repository name”, “Public or Private”, and “Initialized files”.

Note that, in “Public or Private”, “Private” is for paid members. If it’s okay to publish the repository worldwide like a web page, select “Public”.

Whether you check “Add a README file” depends on you. The “README” file is for the description of your project. Of course, you can manually add the “README” file later.

Clone the Repository

The “clone” command synchronizes the local repository at your PC with the repository at GitHub. You can clone with just only the URL of your repository at GitHub.

1. Click the green “Code” button.

2. Copy the URL shown on the HTTPS tab. You can copy it by clicking the copy icon next to the URL. Note that the HTTPS tab is selected by default.

3. Execute the following command at the working directory on your terminal.

git clone <URL>

When the clone succeeds, a directory with the same name as the repository has been created. The version history of the repository is stored in the “.git” directory. “.git” is a hidden directory, so it does not appear in the plain “ls” command; you have to use “ls -a”, where “-a” is the option to show hidden files and directories.

Confirm the “Status”

First of all, we have to specify the files to synchronize with the repository on GitHub. Create a new script “sample.py” in the directory you cloned. For example, we can create it with the “touch” command.

touch sample.py

Next, use the “git add” command to put the target file in the staging state. Before executing the “git add” command, let’s confirm the staging condition of the file by the “git status” command.

git status

>>  On branch master
>>  Your branch is up to date with 'origin/master'.
>>  
>>  Untracked files:
>>    (use "git add <file>..." to include in what will be committed)
>>  
>>  	sample.py
>>  
>>  nothing added to commit but untracked files present (use "git add" to track)

“Untracked files:” indicates that “sample.py” is a new file. Note that the file is NOT staged yet, so “sample.py” is displayed in red. Next, let’s change the status of “sample.py”; we will see its color change.

Change the “Status”

We change the status of the file by the “git add” command and check the status again by the “git status” command.

git add sample.py
git status

>>  On branch master
>>  Your branch is up to date with 'origin/master'.
>>  
>>  Changes to be committed:
>>    (use "git reset HEAD <file>..." to unstage)
>>  
>>  	new file:   sample.py
>>  

Git recognized “sample.py” as a new file!

And we can see the change of color: “sample.py” has turned from red to green. The green indicates that the file is now staged!

Note that you can also cancel the “git add” operation for “sample.py”. After the following command, you will see that “sample.py” is unstaged again.

git reset sample.py

Why is staging needed?

Beginners may be unfamiliar with the concept of “staging”. Why is staging needed? The answer is to prepare for committing: Git records the changes of a file in the version history when you commit, and to distinguish which files to commit, you specify them explicitly with the “git add” command.

“Commit” the staging files

Next, we will reflect the changes of the staged file to the local repository. This operation is called “commit”. The command is as follows.

git commit -m "This is comment."

“-m” is the option for attaching a comment. The comment makes it possible to understand later what the intention of the change was.

The concept of “commit” might also be unfamiliar to beginners. Why is a commit needed? At the commit stage, Git does NOT reflect the modified files in the GitHub repository but in your local repository. Therefore, when developing as a team, you don’t have to worry at this point about your modifications conflicting with a teammate’s. Only at the “push” stage are your modified files and the version history synchronized with the GitHub repository. This is also how your teammates can distinguish your changes from those of other people!

“Push” the committed files

The final step is to synchronize your local repository with your GitHub repository. This operation is called “push”. After the “git push” command, the committed changes will be reflected in your GitHub repository.

The command is as follows.

git push

If successfully done, you can confirm on your GitHub web page that your new file “sample.py” exists in your GitHub repository.

Congratulations! This is the main flow of managing files on GitHub.

When you modify a file

From the above, we have seen how to add a new file. Here, let’s see the case of a modified file.

Add some change to “sample.py”, then execute the “git status” command. You will see that Git recognizes the file as modified.

The file is NOT staged yet, so “sample.py” is displayed in red.

git status

>>  On branch master
>>  Your branch is up to date with 'origin/master'.
>>  
>>  Changes not staged for commit:
>>    (use "git add <file>..." to update what will be committed)
>>    (use "git checkout -- <file>..." to discard changes in working directory)
>>  
>>  	modified:   sample.py
>>  

That is the only difference. From here, you proceed just as you have already seen.

  1. git add
  2. git commit -m
  3. git push

Summary

We have learned the basic GitHub skills. For a data scientist, GitHub is one of the essential skills in addition to programming skills. GitHub not only makes it easier to manage versions of code but also gives you opportunities to interact with other programmers.

GitHub hosts a huge amount of code and knowledge. Why not use GitHub? You can get a chance to utilize the knowledge of great programmers from around the world.

Tutorial of PyCaret, Regression Analysis

PyCaret is a powerful tool for comparing models built with different machine learning methods. Its biggest feature is that it is a low-code machine learning library for Python. PyCaret is a wrapper around famous machine learning libraries such as scikit-learn, LightGBM, CatBoost, XGBoost, and more.

To create even one model, you may have to write a fair amount of code. So, you can imagine how laborious it is to compare models across different methods. PyCaret makes such comparisons easy, making it possible to experiment efficiently.

In this post, we will see a tutorial of PyCaret through a regression analysis of the Boston house prices dataset. The author explains it with a step-by-step guide in mind!!

The complete notebook can be found on GitHub.

Library version

PyCaret is still under very active development, so you should check the version of the library.

pycaret == 2.2.2
pandas == 1.0.5
scikit-learn == 0.23.2
matplotlib == 3.2.2

PyCaret can be easily installed by pip command as follows:

$pip install pycaret

If you want to specify the version of PyCaret, you can use the following command.

$pip install pycaret==2.2.2

Import Library

To start, we import the following libraries.

##-- PyCaret
import pycaret
from pycaret.regression import *
##-- Pandas
import pandas as pd
from pandas import Series, DataFrame
##-- Scikit-learn
import sklearn

Dataset

In this article, we use the “Boston house prices dataset” from the scikit-learn library, published by Carnegie Mellon University. This dataset is one of the famous open datasets for testing new models.

You can easily load the dataset from the scikit-learn library. For convenience, we convert it into a pandas DataFrame, the fundamental data structure in pandas.

from sklearn.datasets import load_boston
dataset = load_boston()

df = pd.DataFrame(dataset.data)
df.columns = dataset.feature_names
df["PRICES"] = dataset.target
df.head()

>>     CRIM     ZN    INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX    PTRATIO  B       LSTAT  PRICES
>>  0  0.00632  18.0  2.31   0.0   0.538  6.575  65.2  4.0900  1.0  296.0  15.3     396.90  4.98   24.0
>>  1  0.02731   0.0  7.07   0.0   0.469  6.421  78.9  4.9671  2.0  242.0  17.8     396.90  9.14   21.6
>>  2  0.02729   0.0  7.07   0.0   0.469  7.185  61.1  4.9671  2.0  242.0  17.8     392.83  4.03   34.7
>>  3  0.03237   0.0  2.18   0.0   0.458  6.998  45.8  6.0622  3.0  222.0  18.7     394.63  2.94   33.4
>>  4  0.06905   0.0  2.18   0.0   0.458  7.147  54.2  6.0622  3.0  222.0  18.7     396.90  5.33   36.2

The details of the Boston house prices dataset are introduced in another post.

Before training the model, divide the dataset into training and test datasets. This is because we have to confirm whether the trained model can predict an unknown dataset. Here, we split the dataset into training and test parts at a ratio of 8:2.

split_rate = 0.8
data = df.iloc[ : int(split_rate*len(df)), :]
data_pre = df.iloc[ int(split_rate*len(df)) :, :]

Set up the environment for PyCaret by the “setup()” function

Here, we set up the environment for the model with the “setup()” function. The arguments of the setup function are the input data, the name of the target column, and the “session_id”. The “session_id” acts as a random seed.

model_reg = setup(data = data, target = "PRICES", session_id=99)

>>             Data Type
>>     CRIM      Numeric
>>       ZN      Numeric
>>    INDUS      Numeric
>>     CHAS  Categorical
>>      NOX      Numeric
>>       RM      Numeric
>>      AGE      Numeric
>>      DIS      Numeric
>>      RAD      Numeric
>>      TAX      Numeric
>>  PTRATIO      Numeric
>>        B      Numeric
>>    LSTAT      Numeric
>>   PRICES        Label

Conveniently, PyCaret infers the data type of each column, as shown above. “Numeric” indicates that the data is continuous values. “Categorical”, on the other hand, means the data is NOT continuous, for example, the season (spring, summer, fall, winter).

This is a very convenient function for quick data analysis!

We have seen that “CHAS” is NOT “Numeric”, and such categorical data is NOT suitable for this regression analysis as is. So we drop the “CHAS” column from the dataset and set up the environment again with the “setup()” function.

Note that “RAD” is actually NOT continuous either. We can learn this fact from exploratory data analysis; you can check the details in the post linked above, “Brief EDA for Boston House Prices Dataset”.

data = data.drop('CHAS', axis=1)  # axis=1 indicates we drop a column
model_reg = setup(data = data, target = "PRICES", session_id=99)

>>             Data Type
>>     CRIM      Numeric
>>       ZN      Numeric
>>    INDUS      Numeric
>>      NOX      Numeric
>>       RM      Numeric
>>      AGE      Numeric
>>      DIS      Numeric
>>      RAD      Numeric
>>      TAX      Numeric
>>  PTRATIO      Numeric
>>        B      Numeric
>>    LSTAT      Numeric
>>   PRICES        Label

Comparison Between All Models

PyCaret makes it possible to compare models easily with just one command, as follows. We can compare the models by their evaluation metrics. Judging by the error metrics (MAE, MSE, RMSE, ...), the superior models turn out to be decision-tree-based methods.

compare_models()

In this case, “CatBoost Regressor” is the best model, so let’s construct the CatBoost regressor model. We use the “create_model” function with the argument “catboost”. The other argument, “fold”, is the number of cross-validation folds; here we adopt “fold = 4”, so there are 4 (0-3) calculated results for each metric.

catboost = create_model("catboost", fold=4)

Optimize Hyperparameter by Tune Model Module

The tune-model module optimizes the created model by tuning its hyperparameters. You pass the created model to the “tune_model()” function; the optimization is performed by a random grid search. After tuning, we can clearly see the improvement of the MAE from 2.0685 to 1.9311.

tuned_model = tune_model(catboost)
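
Although the original post does not show this step, the hold-out data “data_pre” created earlier can be scored with PyCaret’s “predict_model()” function. The following is a minimal sketch under that assumption; since the environment was set up without “CHAS”, the column is dropped from the hold-out data as well.

predictions = predict_model(tuned_model, data=data_pre.drop('CHAS', axis=1))
predictions.head()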

Visualization

It is also possible to visualize the results with PyCaret. First, we check the contributions of the features with the “interpret_model()” function. The vertical axis shows the explanatory features, and the horizontal axis shows the SHAP values, i.e., the contributions of the features to the output. Each circle shows the value for one sample. From this figure, we can clearly see that RM and LSTAT are the key features for predicting house prices.

interpret_model(tuned_model)

Summary

We have seen the basic usage of PyCaret. PyCaret actually has various other functions, but the author has the impression that they are frequently renewed because development moves so fast.

The author’s recommended usage is to first check if there is a significant difference in accuracy between decision tree analysis and linear regression analysis. In general, decision tree-based methods tend to be more accurate than linear regression analysis. However, if you can expect some accuracy in the linear regression model, it is a good idea to try to understand the dataset from the linear regression analysis as the next step. This is because the model interpretability is higher in the linear regression analysis.

The essence of data science requires a deep understanding of datasets. To do so, it is important to repeat small trials quickly. PyCaret makes it possible to do such a thing more efficiently.

The author hopes this blog helps readers a little.

Brief Introduction of Descriptive Statistics

Descriptive statistics carry important information because they summarize a dataset. For example, from descriptive statistics we can know the scale, the variation, and the minimum and maximum values. With this information, you can develop a sense of whether a given data point is large or small, or whether it deviates greatly from the average value.

In this post, we will see the basic descriptive statistics along with their definitions. Understanding descriptive statistics not only helps you develop a sense of a dataset but is also useful for understanding dataset preprocessing.

The complete notebook can be found on GitHub.

Dataset

Here, we use the Boston house prices dataset to calculate descriptive statistics such as the mean, variance, and standard deviation. We adopt this dataset because it can be loaded very easily from the scikit-learn library.

The code for using the dataset as Pandas DataFrame is as follows.

import numpy as np              ##-- Numpy
import pandas as pd             ##-- Pandas
import sklearn                  ##-- Scikit-learn
import matplotlib.pylab as plt  ##-- Matplotlib

from sklearn.datasets import load_boston
dataset = load_boston()

df = pd.DataFrame(dataset.data)
df.columns = dataset.feature_names
df["PRICES"] = dataset.target
df.head()

>>     CRIM     ZN    INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX    PTRATIO  B       LSTAT  PRICES
>>  0  0.00632  18.0  2.31   0.0   0.538  6.575  65.2  4.0900  1.0  296.0  15.3     396.90  4.98   24.0
>>  1  0.02731  0.0   7.07   0.0   0.469  6.421  78.9  4.9671  2.0  242.0  17.8     396.90  9.14   21.6
>>  2  0.02729  0.0   7.07   0.0   0.469  7.185  61.1  4.9671  2.0  242.0  17.8     392.83  4.03   34.7
>>  3  0.03237  0.0   2.18   0.0   0.458  6.998  45.8  6.0622  3.0  222.0  18.7     394.63  2.94   33.4
>>  4  0.06905  0.0   2.18   0.0   0.458  7.147  54.2  6.0622  3.0  222.0  18.7     396.90  5.33   36.2

The details of this dataset are introduced in another post below. In this post, let’s calculate the mean, the variance, and the standard deviation, focusing on “df["PRICES"]”, the housing prices.

Mean

The mean $\mu$ is the average of the data. It must be one of the most familiar concepts. Still, the mean is important because it gives us a sense of whether a given data point is large or small, and such a sense is important for a data scientist.

Now, assuming that the $N$ data points are $x_{1}$, $x_{2}$, …, $x_{N}$, the mean $\mu$ is defined by the following formula.
$$\begin{eqnarray*}
\mu
=
\frac{1}{N}
\sum^{N}_{i=1}
x_{i}
,
\end{eqnarray*}$$
where $x_{i}$ is the value of $i$-th data.

It may seem a little difficult when expressed in mathematical symbols. However, as you know, we just take the sum of all the data and divide it by the number of data points. Once we have defined the mean, we can define the variance.

Then, let’s calculate the mean of each column. We can easily calculate by the “mean()” method.

df.mean()

>>  CRIM         3.613524
>>  ZN          11.363636
>>  INDUS       11.136779
>>  CHAS         0.069170
>>  NOX          0.554695
>>  RM           6.284634
>>  AGE         68.574901
>>  DIS          3.795043
>>  RAD          9.549407
>>  TAX        408.237154
>>  PTRATIO     18.455534
>>  B          356.674032
>>  LSTAT       12.653063
>>  PRICES      22.532806
>>  dtype: float64

When you want the mean value of just one column, for example the “PRICES”, the code is as follows.

df["PRICES"].mean()

>>  22.532806324110677

Variance

The variance $\sigma^{2}$ reflects the dispersion of data from the mean value. The definition is as follows.

$$\begin{eqnarray*}
\sigma^{2}
=
\frac{1}{N}
\sum^{N}_{i=1}
\left(
x_{i} - \mu
\right)^{2},
\end{eqnarray*}$$

where $N$, $x_{i}$, and $\mu$ are the number of the data, the value of $i$-th data, and the mean of $x$, respectively.

Expressed in words, the variance is the mean of the squared deviations from the mean of the data. It is no exaggeration to say that the information in the data lives in its variance! In other words, data with ZERO variance is not worth paying attention to.

For example, let’s consider predicting math skill from exam scores. The exam scores of three people (A, B, and C) are shown in the table below.

Person    Math    Physics    Chemistry
A         100     90         60
B         60      70         60
C         20      40         60
The exam scores for each subject

From the above table, we can clearly see that those who are good at physics are also good at math. On the other hand, it is impossible to infer whether a person is good at mathematics from the chemistry scores, because all three chemistry scores equal the average score of 60. Namely, the variance of chemistry is ZERO!! This indicates that the chemistry scores carry no information and are not worth paying attention to. We should drop the “Chemistry” column from the dataset when analyzing. This is one example of data preprocessing.
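
The same reasoning can be checked directly with pandas; the small DataFrame below just reproduces the table above. Note that ddof=0 matches the 1/N definition used in this post, while pandas’ default is ddof=1 (sample variance).

import pandas as pd

scores = pd.DataFrame({
    "Math":      [100, 60, 20],
    "Physics":   [ 90, 70, 40],
    "Chemistry": [ 60, 60, 60],
}, index=["A", "B", "C"])

scores.var(ddof=0)  # population variance (1/N)

>>  Math         1066.666667
>>  Physics       422.222222
>>  Chemistry       0.000000
>>  dtype: float64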

Then, let’s calculate the variance of each column of the Boston house prices dataset. We can easily calculate it with the “var()” method.

df.var()

>>  CRIM          73.986578
>>  ZN           543.936814
>>  INDUS         47.064442
>>  CHAS           0.064513
>>  NOX            0.013428
>>  RM             0.493671
>>  AGE          792.358399
>>  DIS            4.434015
>>  RAD           75.816366
>>  TAX        28404.759488
>>  PTRATIO        4.686989
>>  B           8334.752263
>>  LSTAT         50.994760
>>  PRICES        84.586724
>>  dtype: float64

Standard Deviation

The standard deviation $\sigma$ is defined by the root of the variance as follows.

$$\begin{eqnarray*}
\sigma
=
\sqrt{
\frac{1}{N}
\sum^{N}_{i=1}
\left(
x_{i} - \mu
\right)^{2}
},
\end{eqnarray*}$$

where $N$, $x_{i}$, and $\mu$ are the number of the data, the value of $i$-th data, and the mean of $x$, respectively.

Why do we introduce the standard deviation in addition to the variance? Because the standard deviation $\sigma$ has the same unit as the original data, we can interpret $\sigma$ directly as the typical variation from the mean.

Then, let’s calculate the standard deviation of each column of the Boston house prices dataset. We can easily calculate by the “std()” method.

df.std()

>> CRIM         8.601545
>> ZN          23.322453
>> INDUS        6.860353
>> CHAS         0.253994
>> NOX          0.115878
>> RM           0.702617
>> AGE         28.148861
>> DIS          2.105710
>> RAD          8.707259
>> TAX        168.537116
>> PTRATIO      2.164946
>> B           91.294864
>> LSTAT        7.141062
>> PRICES       9.197104
>> dtype: float64

In fact, once these three concepts are defined, we can also define the Gaussian distribution. However, the Gaussian distribution will be introduced in another post.

Other descriptive statistics can be calculated in the same way. The methods you are likely to use most often are listed below.

Method    Description
mean      Average value
var       Variance value
std       Standard deviation value
min       Minimum value
max       Maximum value
median    Median value, the value at the center of the data
sum       Total value

Confirm all at once

Pandas has the useful “describe()” method, which computes the basic descriptive statistics all at once. It is very convenient as a starting point.

df.describe()

>>         CRIM        ZN          INDUS       CHAS        NOX         RM          AGE         DIS         RAD         TAX         PTRATIO     B           LSTAT       PRICES
>>  count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000
>>  mean     3.613524   11.363636   11.136779    0.069170    0.554695    6.284634   68.574901    3.795043    9.549407  408.237154   18.455534  356.674032   12.653063  22.532806
>>  std      8.601545   23.322453    6.860353    0.253994    0.115878    0.702617   28.148861    2.105710    8.707259  168.537116    2.164946   91.294864    7.141062  9.197104
>>  min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000    2.900000    1.129600    1.000000  187.000000   12.600000    0.320000    1.730000  5.000000
>>  25%      0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   45.025000    2.100175    4.000000  279.000000   17.400000  375.377500    6.950000  17.025000
>>  50%      0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   77.500000    3.207450    5.000000  330.000000   19.050000  391.440000   11.360000  21.200000
>>  75%      3.677083   12.500000   18.100000    0.000000    0.624000    6.623500   94.075000    5.188425   24.000000  666.000000   20.200000  396.225000   16.955000  25.000000
>>  max     88.976200  100.000000   27.740000    1.000000    0.871000    8.780000  100.000000   12.126500   24.000000  711.000000   22.000000  396.900000   37.970000  50.000000

Note that:

“count”: number of data points in each column
“25%”: value at the 25% position of the data
“50%”: value at the 50% position of the data, equal to the median
“75%”: value at the 75% position of the data
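
If you need a single percentile by itself, pandas also provides the “quantile()” method; for example, the 25% point of “PRICES” from the DataFrame “df” above:

df["PRICES"].quantile(0.25)

>>  17.025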

Summary

We have seen a brief explanation of the basic descriptive statistics and how to calculate them. Understanding descriptive statistics is essential to understanding a dataset. Keep in mind that the information in a dataset is contained in its descriptive statistics.

The author hopes this blog helps readers a little.

Step-by-step guide of Linear Regression for Boston House Prices dataset

Linear regression is one of the basic techniques for machine learning analyses. As you may know, other methods are often superior to linear regression in terms of prediction accuracy. However, linear regression has the advantage that the model is simple and highly interpretable.

For a data scientist, understanding the dataset is highly important. Therefore, linear regression plays a powerful role as a first step toward understanding the dataset.

In this post, we will see the process of a linear regression analysis against the Boston house prices dataset. The author will explain with the step-by-step guide in mind !!

What is a Linear Regression?

Linear regression is based on the assumption of a linear relationship between a target variable and the independent variables. If this assumption represents your dataset well, you can expect a proportional relationship between the independent variables and the target variable.

In the mathematical expression, the representation of a linear regression is as follows.

$$y =\omega_{0}+\omega_{1}x_{1}+\omega_{2}x_{2}+…+\omega_{N}x_{N},$$

where $y$, $x_{i}$, and $\omega_{i}$ are a target variable, an independent variable, and a coefficient, respectively.

The details of the theory are explained in another post below.

Also, for reference, another post provides an example of linear regression with short code.

From here, let’s perform a linear regression analysis on the Boston house prices dataset!

Prepare the Dataset

In this analysis, we adopt the Boston house prices dataset, one of the famous open datasets, published in the StatLib library maintained at Carnegie Mellon University. We choose it because it is so easy to use: we can load it from the scikit-learn library without downloading any file.

from sklearn.datasets import load_boston
dataset = load_boston()

The details of the Boston house prices dataset are introduced in another post, but you can follow the analysis below without referring to it.

Confirm the Dataset as Pandas DataFrame

Here, we get 3 types of data from “dataset”, described below, as the Pandas DataFrame.

dataset.data: values of the explanatory variables
dataset.target: values of the target variable (house prices)
dataset.feature_names: the column names

import pandas as pd

f = pd.DataFrame(dataset.data)
f.columns = dataset.feature_names
f["PRICES"] = dataset.target
f.head()

>>     CRIM     ZN    INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX    PTRATIO  B       LSTAT  PRICES
>>  0  0.00632  18.0  2.31   0.0   0.538  6.575  65.2  4.0900  1.0  296.0  15.3     396.90  4.98   24.0
>>  1  0.02731  0.0   7.07   0.0   0.469  6.421  78.9  4.9671  2.0  242.0  17.8     396.90  9.14   21.6
>>  2  0.02729  0.0   7.07   0.0   0.469  7.185  61.1  4.9671  2.0  242.0  17.8     392.83  4.03   34.7
>>  3  0.03237  0.0   2.18   0.0   0.458  6.998  45.8  6.0622  3.0  222.0  18.7     394.63  2.94   33.4
>>  4  0.06905  0.0   2.18   0.0   0.458  7.147  54.2  6.0622  3.0  222.0  18.7     396.90  5.33   36.2

Let’s try to check the correlation between only “PRICES” and “TAX”.

import matplotlib.pylab as plt  #-- "Matplotlib" for Plotting

f.plot(x="TAX", y="PRICES", style="o")
plt.ylabel("PRICES")
plt.show()

At first glance, there seems to be no simple proportional relationship. The EDA (exploratory data analysis) for this dataset, including the other variables, is introduced in another post.

Pick up the Variables we use

Explicitly define the variables to extract from the data frame.

TargetName = "PRICES"
FeaturesName = [\
              #-- "Crime occurrence rate per unit population by town"
              "CRIM",\
              #-- "Percentage of 25000-squared-feet-area house"
              'ZN',\
              #-- "Percentage of non-retail land area by town"
              'INDUS',\
              #-- "Index for Charlse river: 0 is near, 1 is far"
              'CHAS',\
              #-- "Nitrogen compound concentration"
              'NOX',\
              #-- "Average number of rooms per residence"
              'RM',\
              #-- "Percentage of buildings built before 1940"
              'AGE',\
              #-- 'Weighted distance from five employment centers'
              "DIS",\
              ##-- "Index for easy access to highway"
              'RAD',\
              ##-- "Tax rate per $100,000"
              'TAX',\
              ##-- "Percentage of students and teachers in each town"
              'PTRATIO',\
              ##-- "1000(Bk - 0.63)^2, where Bk is the percentage of Black people"
              'B',\
              ##-- "Percentage of low-class population"
              'LSTAT',\
              ]

Get from the data frame into “X” and “Y”.

X = f[FeaturesName]
Y = f[TargetName]

Standardize the Variables

For numerical variables, we should standardize because the scales of variables are different.

Mathematically, the definition of the standardization transform is as follows.

$$\begin{eqnarray*}
\tilde{x}=
\frac{x-\mu}{\sigma}
,
\end{eqnarray*}$$

where $\mu$ and $\sigma$ are the mean and the standard deviation, respectively.

With scikit-learn, the execution code is just the following 4 lines.

from sklearn import preprocessing
sscaler = preprocessing.StandardScaler()
sscaler.fit(X)
X_std = sscaler.transform(X)

Regarding standardization, the details are explained in another post. Standardization is an important preprocessing step for numerical variables; if you are not familiar with it, the author recommends checking the details once.

Split the Dataset

Here, we split the dataset into training data and test data. Why do we have to split? Because we must evaluate the generalization performance of the model on unknown data.

This idea is natural because our purpose is to predict new data.

Then, let’s split the dataset. Of course, it is easy with scikit-learn!

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_std, Y, test_size=0.2, random_state=99)

We pass the dataset “(X_std, Y)” to the “train_test_split()” function. The ratio of training data to test data is defined by the argument “test_size”; here, it is set to 8:2. “random_state” is set for reproducibility; you can use any number. The author often uses “99” because it is his favorite NFL player’s uniform number!

At this point, data preparation and preprocessing are fully completed!
Finally, we can perform the linear regression analysis!

Create an Instance for Linear Regression

Here, let’s create the model for linear regression. We can do this with just 3 lines of code. The role of each line is as follows.

1. Import the “LinearRegression()” function from scikit-learn
2. Create the model as an instance “regressor” by “LinearRegression()”
3. Train the model “regressor” with train data “(X_train, Y_train)”

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

Predict the train and test data

To check the performance of the model, we get the predicted values for the train and test data.

y_pred_train = regressor.predict(X_train)
y_pred_test = regressor.predict(X_test)

Then, let’s visualize the result by matplotlib.

import seaborn as sns

plt.figure(figsize=(5, 5), dpi=100)
sns.set()
plt.xlabel("PRICES")
plt.ylabel("Predicted PRICES")
plt.xlim(0, 60)
plt.ylim(0, 60)
plt.scatter(Y_train, y_pred_train, lw=1, color="r", label="train data")
plt.scatter(Y_test, y_pred_test, lw=1, color="b", label="test data")
plt.legend()
plt.show()

In the above figure, the red and blue circles show the results for the training and test data, respectively. We can see that the prediction accuracy decreases as the price increases.

Here, we check the $R^{2}$ score, the coefficient of determination. $R^{2}$ is an index of how well the model fits the dataset. When $R^{2}$ is close to $1$, the model accuracy is good; conversely, when $R^{2}$ approaches $0$, the model accuracy is poor.
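
For reference, although the post does not write it out explicitly, the usual definition of the coefficient of determination is

$$\begin{eqnarray*}
R^{2}
=
1 -
\frac{\sum^{N}_{i=1} \left( y_{i} - \hat{y}_{i} \right)^{2}}
{\sum^{N}_{i=1} \left( y_{i} - \bar{y} \right)^{2}}
,
\end{eqnarray*}$$

where $y_{i}$, $\hat{y}_{i}$, and $\bar{y}$ are the observed value, the predicted value, and the mean of the observed values, respectively.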

We can calculate $R^{2}$ by the “r2_score()” function in scikit-learn.

from sklearn.metrics import r2_score
R2 = r2_score(Y_test, y_pred_test)
R2

>>  0.6674690355194665

The score $0.67$ is not bad, but also not good.

How to Improve the Score?

Here, one easy way to improve the score is introduced: convert the target variable “PRICES” to a logarithmic scale. Converting to a logarithmic scale reduces the effect of errors in the high “PRICES” range, and reducing the effect of errors between the training data and the predicted values leads to a better model. Logarithmic conversion is simple, often effective, and worth remembering.

Then, let’s try!

First, converting the target variable “PRICES” to a logarithmic scale.

##-- Logarithmic scaling
import numpy as np
Y_log = np.log(Y)

Next, we split the dataset again.

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_std, Y_log, test_size=0.2, random_state=99)

And, retrain the model and predict again.

regressor.fit(X_train, Y_train)
y_pred_train = regressor.predict(X_train)
y_pred_test = regressor.predict(X_test)

Plot the result again as follows. Note that the predicted values are on a logarithmic scale, so they must be converted back with “np.exp()” when plotting.

import numpy as np

plt.figure(figsize=(5, 5), dpi=100)
sns.set()
plt.xlabel("PRICES")
plt.ylabel("Predicted PRICES")
plt.xlim(0, 60)
plt.ylim(0, 60)
plt.scatter(np.exp(Y_train), np.exp(y_pred_train), lw=1, color="r", label="train data")
plt.scatter(np.exp(Y_test), np.exp(y_pred_test), lw=1, color="b", label="test data")
plt.legend()
plt.show()

It may be hard to see the improvement from the figure, but when you compare $R^{2}$, you can see that it has improved clearly.

R2 = r2_score(Y_test, y_pred_test)
R2

>>  0.7531747761424288

$R^{2}$ has improved from 0.67 to 0.75!

Summary

We have seen how to perform the linear regression analysis against the Boston house prices dataset. The basic approach to regression analysis is as described here. So, we can apply this approach to other datasets.

Note that the important thing is to have a good understanding of the dataset, making it possible to perform an analysis reflecting the essence.

Certainly, there are several methods that can be expected to be more accurate, such as random forest and neural net. However, linear regression analysis can be a good first step to understanding datasets deeper.

The author hopes this blog helps readers a little.

Standardization by scikit-learn in Python

Scaling variables is important, especially for regression analysis. Roughly speaking, analyzing unscaled variables means performing the analysis without considering the units.

In this blog, we will see the process of standardization, one of the basic scaling methods and practically indispensable for regression analysis.

Why is standardization needed?

For example, the coefficients of a regression analysis become larger when the scale of a variable becomes larger. This means the regression coefficients are not comparable when the scales differ.

Here, as an example, we consider building a regression model that mixes the length units “m” and “cm”. This is an example of ignoring the units.

More specifically, suppose you have the following formula in meters.

$$y = a_{m}x_{m}.$$

When the above formula is converted from meters to centimeters, it becomes as follows.

$$\begin{eqnarray*}
y
&&= \left( \frac{a_{m}}{100}\right) \times (100x_{m}),\\
&&=:a_{cm}x_{cm},
\end{eqnarray*}$$

where $a_{cm}=a_{m}/100$ and $x_{cm}=100x_{m}$.

From the above, we can see that the scales of variables and coefficients change even between equivalent expressions.

In fact, the problem in this case is that the values of the gradients are no longer equivalent. Specifically, consider the following example.

$$\begin{eqnarray*}
\frac{\partial y}{\partial x_{m}} &&= a_{m}.
\end{eqnarray*}$$

And,

$$\begin{eqnarray*}
\frac{\partial y}{\partial x_{cm}} &&= a_{cm}.
\end{eqnarray*}$$

From the fact that $a_{m} \neq a_{cm}(a_{m}=100a_{cm})$,

$$\begin{eqnarray*}
\frac{\partial y}{\partial x_{m}}
\neq
\frac{\partial y}{\partial x_{cm}}.
\end{eqnarray*}$$

Namely, one step based on the gradient is not equivalent between the two representations.
This fact may cause the optimization to fail when optimizing the model with a gradient-based approach (e.g., the steepest descent method).

Mathematically speaking, it means that the gradient space is distorted.
But here, let's just say that one step should always be equivalent.

Therefore, standardization is needed.

Definition of Standardization

Standardization is the operation of converting the mean and standard deviation of data into 0 and 1, respectively.

The mathematical definition is as follows.

$$\begin{eqnarray*}
\tilde{x}=
\frac{x-\mu}{\sigma}
,
\end{eqnarray*}$$

where $\mu$ and $\sigma$ are the mean and the standard deviation, respectively.

How to Standardize in Python?

It’s easy. With scikit-learn, standardization can be done in just 4 lines!

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
data_std = scaler.transform(data)

The code description is as follows.

1. Import “StandardScaler()” from “sklearn.preprocessing” in scikit-learn
2. Create the instance for Standardization by the “StandardScaler()” function
3. Calculate the mean and the standard deviation
4. Convert “data” into the standardized data “data_std”
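
As a quick sanity check, the sketch below applies the same steps to a small hypothetical array (the values are made up for illustration) and confirms that each column ends up with mean ~0 and standard deviation ~1. Note that fit_transform() is simply a shortcut for fit() followed by transform().

import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical toy data: two features with very different scales
data = np.array([[1.0, 100.0],
                 [2.0, 200.0],
                 [3.0, 300.0]])

scaler = StandardScaler()
data_std = scaler.fit_transform(data)  # same as fit() then transform()

print(data_std.mean(axis=0))  # approximately [0. 0.]
print(data_std.std(axis=0))   # approximately [1. 1.]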

Summary

The above 4-line snippet is used very often, so this post can serve as a cheat sheet. However, the important thing is to understand why standardization is needed.

Sometimes people compare the values of regression coefficients without scaling to judge the magnitude of each contribution. People who understand the meaning of standardization can see that this is wrong!

Standardization is so popular that the author hopes this post helps readers a little.

Brief EDA for Boston House Prices Dataset

Exploratory data analysis (EDA) is one of the most important processes in data analysis. It is often neglected because it is rarely visible in the final code. However, without appropriate EDA, there is no successful analysis.

Understanding the nature of the dataset.

This is the purpose of EDA. With that understanding, you will be able to select models effectively and perform feature engineering. This is why EDA is the first step in data analysis.

In this post, we will see the basic skills of EDA using the well-known open Boston house prices dataset as an example.

Prepare the Dataset

For performing EDA, we adopt the Boston house prices dataset, an open dataset for regression analysis. The details of this dataset are introduced in another post.

Import Library

##-- Numpy
import numpy as np
##-- Pandas
import pandas as pd
##-- Scikit-learn
import sklearn
##-- Matplotlib
import matplotlib.pylab as plt
##-- Seaborn
import seaborn as sns

Load the Boston House Prices Dataset from Scikit-learn

It is easy to load this dataset from scikit-learn. Just 2 lines!

from sklearn.datasets import load_boston
dataset = load_boston()

The values of the dataset are stored in the variable “dataset”. Note that “dataset” holds several kinds of information, i.e., the explanatory-variable values, the target-variable values, and the column names of the explanatory variables, so we have to take them out separately as follows.

dataset.data: values of the explanatory variables
dataset.target: values of the target variable (house prices)
dataset.feature_names: the column names

For convenience, we obtain the above data as the Pandas DataFrame type. Pandas is very useful for matrix-type data.

Here, let’s put all the data together into one Pandas DataFrame “f”.

f = pd.DataFrame(dataset.data)
f.columns = dataset.feature_names
f["PRICES"] = dataset.target
f.head()

>>     CRIM     ZN    INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX    PTRATIO  B       LSTAT  PRICES
>>  0  0.00632  18.0  2.31   0.0   0.538  6.575  65.2  4.0900  1.0  296.0  15.3     396.90  4.98   24.0
>>  1  0.02731  0.0   7.07   0.0   0.469  6.421  78.9  4.9671  2.0  242.0  17.8     396.90  9.14   21.6
>>  2  0.02729  0.0   7.07   0.0   0.469  7.185  61.1  4.9671  2.0  242.0  17.8     392.83  4.03   34.7
>>  3  0.03237  0.0   2.18   0.0   0.458  6.998  45.8  6.0622  3.0  222.0  18.7     394.63  2.94   33.4
>>  4  0.06905  0.0   2.18   0.0   0.458  7.147  54.2  6.0622  3.0  222.0  18.7     396.90  5.33   36.2

From here, we perform the EDA and understand the dataset!

Summary information

First, we should look at the entire dataset; going from the whole to the details, in that order, is important. The “info()” method in Pandas makes it easy to confirm the column names, the non-null counts, and the data types.

f.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  PRICES   506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB

Missing Values

We have already checked how many non-null values there are. Here, conversely, we check how many missing values there are. We can check this with the combination of the “isnull()” and “sum()” methods in Pandas.

f.isnull().sum()

>>  CRIM       0
>>  ZN         0
>>  INDUS      0
>>  CHAS       0
>>  NOX        0
>>  RM         0
>>  AGE        0
>>  DIS        0
>>  RAD        0
>>  TAX        0
>>  PTRATIO    0
>>  B          0
>>  LSTAT      0
>>  PRICES     0
>>  dtype: int64

Fortunately, there are no missing values! This is because this dataset was created carefully. Note, however, that a real dataset usually comes with many such problems we have to deal with.

Basic Descriptive Statistics Value

We can calculate the basic descriptive statistics values with just 1 sentence!

f.describe()

>>               CRIM          ZN       INDUS        CHAS         NOX          RM         AGE         DIS         RAD         TAX     PTRATIO           B       LSTAT      PRICES
>>  count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000
>>   mean    3.613524   11.363636   11.136779    0.069170    0.554695    6.284634   68.574901    3.795043    9.549407  408.237154   18.455534  356.674032   12.653063   22.532806
>>    std    8.601545   23.322453    6.860353    0.253994    0.115878    0.702617   28.148861    2.105710    8.707259  168.537116    2.164946   91.294864    7.141062    9.197104
>>    min    0.006320    0.000000    0.460000    0.000000    0.385000    3.561000    2.900000    1.129600    1.000000  187.000000   12.600000    0.320000    1.730000    5.000000
>>    25%    0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   45.025000    2.100175    4.000000  279.000000   17.400000  375.377500    6.950000   17.025000
>>    50%    0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   77.500000    3.207450    5.000000  330.000000   19.050000  391.440000   11.360000   21.200000
>>    75%    3.677083   12.500000   18.100000    0.000000    0.624000    6.623500   94.075000    5.188425   24.000000  666.000000   20.200000  396.225000   16.955000   25.000000
>>    max   88.976200  100.000000   27.740000    1.000000    0.871000    8.780000  100.000000   12.126500   24.000000  711.000000   22.000000  396.900000   37.970000   50.000000

Although each value is important, it is worth focusing on “mean” and “std” first.

From “mean” we know the average, which makes it possible to judge whether a given value is high or low. This sense is important for a data scientist.

Next, “std” represents the standard deviation, an indicator of how much the data is scattered around the “mean”. For example, “std” will be small if every value lies close to the average. Note that the variance equals the square of the standard deviation, and the word “variance” may be more common for a data scientist. It is no exaggeration to say that the information in a dataset is contained in its variance. For instance, if every value in “AGE” were 30, the column would carry no information and would not be worth attention!

Histogram Distribution

Data with variance is the data that is worth paying attention to. So let’s actually visualize the distribution of the data. Seeing is believing!

We can plot histograms with “plt.hist()” from “matplotlib”, a famous library for visualization. The argument “bins” controls the fineness of the plot.

for name in f.columns:
    plt.title(name)
    plt.hist(f[name], bins=50)
    plt.show()

The distribution of “PRICES”, the target variable, is shown below. On the right side, we can see some very high-priced houses. Note that including such high-priced data may worsen the prediction accuracy.

The distributions of the explanatory variables are below. We can see the difference in variance between the explanatory variables.

From the above figure, we can infer that the data of “CHAS” and “RAD” are NOT continuous values. Generally, such data that is not continuous is called a categorical variable.

Be careful when handling categorical variables, because the magnitude relationship itself has no meaning. For example, when a condominium and a house are represented by 0 and 1, respectively, there is no essential meaning in the relationship 0 < 1.

For the above reasons, let’s check the categorical variables individually.

We can easily confirm the count of each unique value with the “value_counts()” method in Pandas. The first column shows the unique values, and the second column shows how many times each value appears.

f['RAD'].value_counts()

>>  24.0    132
>>  5.0     115
>>  4.0     110
>>  3.0      38
>>  6.0      26
>>  8.0      24
>>  2.0      24
>>  1.0      20
>>  7.0      17
>>  Name: RAD, dtype: int64

It is also important to check the data visually. It is easy to plot the counts of the unique values for each column.

f['CHAS'].value_counts().plot.bar(title="CHAS")
plt.show()
f['RAD'].value_counts().plot.bar(title="RAD")
plt.show()

Correlation of Variables

Here, we confirm the correlation between variables. Correlation is important because, roughly speaking, a variable with a higher correlation to the target contributes more to the prediction. Conversely, you have the option of removing variables that contribute less; dropping such variables reduces the risk of overfitting.

We can easily calculate the correlation matrix between the variables (columns) with the “corr()” method in Pandas.

f.corr()


             CRIM        ZN     INDUS      CHAS       NOX        RM       AGE       DIS       RAD       TAX   PTRATIO         B     LSTAT    PRICES
   CRIM  1.000000 -0.200469  0.406583 -0.055892  0.420972 -0.219247  0.352734 -0.379670  0.625505  0.582764  0.289946 -0.385064  0.455621 -0.388305
     ZN -0.200469  1.000000 -0.533828 -0.042697 -0.516604  0.311991 -0.569537  0.664408 -0.311948 -0.314563 -0.391679  0.175520 -0.412995  0.360445
  INDUS  0.406583 -0.533828  1.000000  0.062938  0.763651 -0.391676  0.644779 -0.708027  0.595129  0.720760  0.383248 -0.356977  0.603800 -0.483725
   CHAS -0.055892 -0.042697  0.062938  1.000000  0.091203  0.091251  0.086518 -0.099176 -0.007368 -0.035587 -0.121515  0.048788 -0.053929  0.175260
    NOX  0.420972 -0.516604  0.763651  0.091203  1.000000 -0.302188  0.731470 -0.769230  0.611441  0.668023  0.188933 -0.380051  0.590879 -0.427321
     RM -0.219247  0.311991 -0.391676  0.091251 -0.302188  1.000000 -0.240265  0.205246 -0.209847 -0.292048 -0.355501  0.128069 -0.613808  0.695360
    AGE  0.352734 -0.569537  0.644779  0.086518  0.731470 -0.240265  1.000000 -0.747881  0.456022  0.506456  0.261515 -0.273534  0.602339 -0.376955
    DIS -0.379670  0.664408 -0.708027 -0.099176 -0.769230  0.205246 -0.747881  1.000000 -0.494588 -0.534432 -0.232471  0.291512 -0.496996  0.249929
    RAD  0.625505 -0.311948  0.595129 -0.007368  0.611441 -0.209847  0.456022 -0.494588  1.000000  0.910228  0.464741 -0.444413  0.488676 -0.381626
    TAX  0.582764 -0.314563  0.720760 -0.035587  0.668023 -0.292048  0.506456 -0.534432  0.910228  1.000000  0.460853 -0.441808  0.543993 -0.468536
PTRATIO  0.289946 -0.391679  0.383248 -0.121515  0.188933 -0.355501  0.261515 -0.232471  0.464741  0.460853  1.000000 -0.177383  0.374044 -0.507787
      B -0.385064  0.175520 -0.356977  0.048788 -0.380051  0.128069 -0.273534  0.291512 -0.444413 -0.441808 -0.177383  1.000000 -0.366087  0.333461
  LSTAT  0.455621 -0.412995  0.603800 -0.053929  0.590879 -0.613808  0.602339 -0.496996  0.488676  0.543993  0.374044 -0.366087  1.000000 -0.737663
 PRICES -0.388305  0.360445 -0.483725  0.175260 -0.427321  0.695360 -0.376955  0.249929 -0.381626 -0.468536 -0.507787  0.333461 -0.737663  1.000000

As shown above, it is easy to calculate the correlation matrix; however, it is difficult to grasp the tendency from the raw numbers alone.

In such a case, let’s utilize the heat-map function included in “seaborn”, a library for plotting. Then, we can confirm the correlation visually.
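A minimal sketch is below; it assumes “seaborn” is imported as “sns” and “matplotlib” as “plt”, and it reuses the correlation matrix computed above.

import matplotlib.pyplot as plt
import seaborn as sns

##-- Visualize the correlation matrix as a heat map
plt.figure(figsize=(10, 8), dpi=100)
sns.heatmap(f.corr(), annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.show()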

Here, we focus on the relationship between “PRICES” and the other variables (“CRIM”, “ZN”, “INDUS”, etc.). We can clearly see that “RM” and “LSTAT” are highly correlated with “PRICES”.
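For example, below is a minimal sketch of keeping only the explanatory variables whose absolute correlation with “PRICES” exceeds a threshold; the threshold of 0.5 is just an illustration.

##-- Correlation of each column with the target "PRICES"
corr_with_target = f.corr()["PRICES"].drop("PRICES")

##-- Keep only the columns whose absolute correlation exceeds 0.5
selected = corr_with_target[corr_with_target.abs() > 0.5].index.tolist()
print(selected)  # e.g. ['RM', 'PTRATIO', 'LSTAT'] with this threshold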

Summary

So far, we have briefly seen how to perform EDA. The purpose of EDA is to properly identify the nature of the dataset. Proper EDA makes it possible to take the next steps effectively, e.g. feature engineering and choosing modeling methods.

Open Dataset for Regression Analysis

You will sometimes meet a case where a new method has emerged and you want to try it out quickly.

At such times, an open dataset is so useful.

Of course, there are many open datasets for machine learning. In this post, the Boston house prices dataset, one of the standard datasets for regression analysis, is introduced.

Preparing a dataset is the first step of any data analysis. Let’s see the process.

Boston house prices dataset

This dataset is one of the famous open datasets, originally published by StatLib, a data archive maintained at Carnegie Mellon University.

Fortunately, we can use this dataset very easily from scikit-learn: just import the scikit-learn library and load the dataset. (Note that “load_boston” has been removed from recent versions of scikit-learn, so an older version may be required to run the code below as-is.)

Load the dataset from scikit-learn

from sklearn.datasets import load_boston
dataset = load_boston()
data, target, feature_names = dataset.data, dataset.target, dataset.feature_names

In the first line, we import the dataset loader “load_boston” from the “sklearn.datasets” module. The sklearn.datasets module includes many datasets; you can find detailed information in the scikit-learn reference.

In the second line, the dataset is assigned to the variable “dataset” by calling the “load_boston()” function.

The variable “dataset” stores several kinds of information, so in the third line each piece of information is extracted into its own variable.

A brief explanation of the dataset

As in the above code, we can access each piece of information through the attributes “dataset.data”, “dataset.target”, and “dataset.feature_names”. The explanation of each is as follows (a sketch of combining them into a single DataFrame follows the list):

dataset.data: array of the explanatory variables
dataset.target: house prices
dataset.feature_names: names of the explanatory variables
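If you prefer to handle them as a single table, below is a minimal sketch of combining them into a Pandas DataFrame (assuming Pandas is imported as “pd”; the name “f” and the target column “PRICES” follow the EDA section earlier in this series).

import pandas as pd

##-- Combine the explanatory variables and the target into one DataFrame
f = pd.DataFrame(data, columns=feature_names)
f["PRICES"] = target

print(f.shape)  # (506, 14): 13 explanatory variables + the target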

And also, you can see the complete explanation of the dataset as follows.

dataset.DESCR

Since the output content of the above line is so large, I will not post it here…
Please check it out for yourself.

Contents of the Variables

Here, let’s confirm the contents of the variables.

data (dataset.data)
The values of the explanatory variables are stored as an array.
The shape of data is (506, 13): there are 506 samples and 13 columns.

data.shape

>>  (506, 13)

If you check the contents, you will find that it is as follows.

data

>>  array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,


target (dataset.target)
The prices of the Boston houses, i.e. the target variable for supervised learning.
Since the number of samples is 506, as seen above, the shape of target is (506,).

target.shape

>> (506,)


feature_names (dataset.feature_names)
Name of each column for “data (dataset.data)”.

feature_names

>>  array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

We can confirm the explanation of each column in the user guide for scikit-learn.

“CRIM”: per capita crime rate by town
“ZN”: proportion of residential land zoned for lots over 25,000 sq.ft.
“INDUS”: proportion of non-retail business acres per town
“CHAS“: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
“NOX”: nitric oxides concentration (parts per 10 million)
“RM”: average number of rooms per dwelling
“AGE”: proportion of owner-occupied units built prior to 1940
“DIS”: weighted distances to five Boston employment centres
“RAD”: index of accessibility to radial highways
“TAX”: full-value property-tax rate per $10,000
“PTRATIO”: pupil-teacher ratio by town
“B”: 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
“LSTAT”: % lower status of the population
“MEDV”: Median value of owner-occupied homes in $1000’s (this is the target variable, called “PRICES” in this post, and it is not included in “feature_names”)

Cited from the scikit-learn user guide

Summary

We have seen the process of importing and loading the Boston house prices dataset from the scikit-learn library. Besides, we have seen the description of each explanatory variable.

You can prepare a dataset with just 3 lines, so please consider using it!

Even having just one dataset that you can use quickly is useful.
The author hopes this dataset becomes part of your repertoire after reading this blog.

Brief Explanation of the Python Code of Polynomial Regression

Polynomial regression is a technique to express a dataset with a linear regression model consisting of polynomial terms. Although the model is linear in its coefficients, each term itself can be a non-linear function of the input, so the model can capture non-linearity. This makes polynomial regression a powerful tool in some cases.

It is an especially powerful technique if you already have an idea of the concrete functional form to fit.

In this post, we look at a process of polynomial-regression analyses with Python. The details of the theory are introduced in another post.

Import the Libraries

The code is written in Python, so first we import the necessary libraries.

##-- For Numerical analyses
import numpy as np
##-- For Handling the dataset as a DataFrame
import pandas as pd
##-- For Plot
import matplotlib.pyplot as plt
import seaborn as sns
##-- For Linear Regression Analyses
from sklearn.linear_model import LinearRegression

Model Function

In this post, we adopt the following function as a polynomial regression model.

$$y =\omega_{0}+\omega_{1}x+\omega_{2}x^{2}+\omega_{3}e^{x},$$

where $\omega$ is the coefficient. Here, we assume $\omega$ as follows:

$$\begin{eqnarray*}
{\bf w^{T}}&&=\left(\omega_{0}\ \omega_{1}\ \omega_{2}\ \omega_{3}\right),\\
&&=\left(-1\ 1\ 2\ 3\right).
\end{eqnarray*}$$

Namely,

$$y =-1+x+2x^{2}+3e^{x}.$$

Create the Training Dataset

Next, we prepare the training dataset by adding noise generated from “the Gaussian distribution”, which is also called “the Normal distribution”. With the noise term $\varepsilon(x)$, we can rewrite the model function as follows:

$$\begin{eqnarray*}
y =-1+x+2x^{2}+3e^{x} + \varepsilon(x),
\end{eqnarray*}$$

where $\varepsilon(x)$ is noise drawn at each $x$ from the Gaussian distribution whose probability density is

$$\begin{eqnarray*}
p(\varepsilon)=\dfrac{1}{\sqrt{2\pi\sigma^{2}}}
e^{-\dfrac{(\varepsilon-\mu)^{2}}{2\sigma^{2}}},
\end{eqnarray*}$$

with mean $\mu$ and standard deviation $\sigma$ (in the code below, $\mu=0$ and $\sigma=1$).

##-- Model Function
def func(param, X):
    return param[0] + param[1]*X + param[2]*np.power(X, 2) + param[3]*np.exp(X)

x = np.arange(0, 1, 0.01)
param = [-1.0, 1.0, 2.0, 3.0]

np.random.seed(seed=99) # Set Random Seed
y_model = func(param, x)
y_train = func(param, x) + np.random.normal(loc=0, scale=1.0, size=len(x))

We can check the training dataset ($x$, $y_{\text{train}}$) as follows:

plt.figure(figsize=(5, 5), dpi=100)
sns.set()
plt.xlabel("x")
plt.ylabel("y")
plt.scatter(x, y_train, lw=1, color="b", label="toy dataset")
plt.plot(x, y_model, lw=5, color="r", label="model function")
plt.legend()
plt.show()

The details of the way to create the training dataset with the above model function are explained in another post.

Prepare the Dataset as Pandas DataFrame

Here, we prepare the dataset as a Pandas DataFrame. First, we create an empty DataFrame with “pd.DataFrame()“. Next, we put the values of each polynomial term into its own column.

x_train = pd.DataFrame()

x_train["x"] = x
x_train["x^2"] = np.power(x, 2)
x_train["exp(x)"] = np.exp(x)

If you check only the first 5 rows of the created DataFrame, it looks as follows.

print( x_train.head() )

>>       x     x^2    exp(x)
>> 0  0.00  0.0000  1.000000
>> 1  0.01  0.0001  1.010050
>> 2  0.02  0.0004  1.020201
>> 3  0.03  0.0009  1.030455
>> 4  0.04  0.0016  1.040811

Note that since the constant term is handled as the intercept by the linear regression model (“fit_intercept=True” is the default of “LinearRegression()”), it is not necessary to create a column of constant terms (all values “1”) here.

Polynomial Regression

Finally, we perform the linear regression analysis on the polynomial terms!!

Now we can apply a polynomial regression to the above training dataset (x_train, y_train). Here, we use the “LinearRegression()” class from the scikit-learn library and create an instance called “regressor”.

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()

Then, we give the training dataset (x_train, y_train) to “regressor“ to train it.

regressor.fit(x_train, y_train)

Model training is over, so let’s predict.

y_pred = regressor.predict(x_train)

Let’s see the prediction result.

plt.figure(figsize=(5, 5), dpi=100)
sns.set()
plt.xlabel("x")
plt.ylabel("y")
plt.scatter(x, y_train, lw=1, color="b", label="training dataset")
plt.plot(x, y_model, lw=3, color="r", label="model function")
plt.plot(x, y_pred, lw=3, color="g", label="Polynomial regression")
plt.legend()
plt.show()

In the above plot, the red solid line, the green solid line, and the blue circles represent the model function, the polynomial regression, and the training dataset, respectively. As you can see, the polynomial regression is in good agreement with the model function.

Coefficients of Polynomial Regression

Now that we have confirmed that the model function is reproduced well, let’s check the coefficients of the polynomial regression model.

We can confirm the intercept and the coefficients through the attributes “intercept_” and “coef_” of the regression instance “regressor”.

print("w0", regressor.intercept_, "\n", \
      "w1", regressor.coef_[0], "\n", \
      "w2", regressor.coef_[0], "\n", \
      "w3", regressor.coef_[0], "\n" )

>> w0 -5.385823038595163 
>> w1 -4.479990435844182 
>> w2 -0.187758224924017
>> w3  7.606988274352645 

The estimated ${\bf w^{T}}$,

$$\begin{eqnarray*}
{\bf w^{T}}&&=\left(-5.4\ -4.5\ -0.2\ 7.6\right),
\end{eqnarray*}$$

deviates from that of the model function,

$$\begin{eqnarray*}
{\bf w^{T}}&&=\left(-1\ 1\ 2\ 3\right).
\end{eqnarray*}$$.

This deviation comes from the fact that the range of the training data (0 to 1) is narrow; on such a narrow interval the terms $x$, $x^{2}$, and $e^{x}$ behave very similarly, so the noise can easily shift the estimated coefficients. Therefore, if we widen the range of the training data, the estimated values get closer to those of the model function.

For example, when the range of the training data is 0 to 2, ${\bf w^{T}}$ becomes $\left(-0.6\ 1.6\ 2.2\ 2.6\right)$, which is much closer to the correct answer.

Also, if the range of the training data is 0 to 10, you can get an almost exact result.
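As a check, below is a minimal sketch of repeating the fit with the training range widened to 0 to 2; everything else is the same as above, and the exact estimates will depend on the random noise.

##-- Recreate the training dataset on the wider range 0 to 2
x_wide = np.arange(0, 2, 0.01)
y_train_wide = func(param, x_wide) + np.random.normal(loc=0, scale=1.0, size=len(x_wide))

##-- Polynomial terms as a DataFrame, as before
x_train_wide = pd.DataFrame()
x_train_wide["x"] = x_wide
x_train_wide["x^2"] = np.power(x_wide, 2)
x_train_wide["exp(x)"] = np.exp(x_wide)

##-- Fit and check the estimated coefficients
regressor_wide = LinearRegression()
regressor_wide.fit(x_train_wide, y_train_wide)
print(regressor_wide.intercept_, regressor_wide.coef_)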

Summary

We have briefly looked at the process of polynomial regression. Polynomial regression is a powerful and practical technique. However, without proper verification of the results, there is a risk of making a big mistake in predicting extrapolated values.

The author hopes this blog helps readers a little.

Beginner Guide to Gaussian Process by GPy

Gaussian Process is a powerful method for regression analyses. We can use it for regression in the same way as linear regression. However, the Gaussian Process is based on the calculation of a probability distribution, and thanks to its Bayesian approach it can be applied flexibly.

The main differences between Gaussian Process and linear regression are below.

1. It can model non-linear relationships.
2. The model carries information on both the estimate and its uncertainty.

The first point is that since this method can handle non-linearity, it is a highly expressive model. For example, linear regression analysis assumes a linear relationship between the input variables and the output, so in principle it is not suitable for datasets with non-linear relationships. In contrast, Gaussian Process regression can be applied to such datasets.

The second point is that the output of the model contains both the regression value and the confidence of that value. In other words, a Gaussian Process model tells us how confident it is in its results. This is completely different from linear regression.

Let’s see these differences concretely with a simple example. The complete notebook can be found on GitHub.

Create Training Dataset

Here, we create the training dataset by adding noise to the base function.

$$y =-1+x+2x^{2}+3e^{x} + \varepsilon(x).$$

$\varepsilon(x)$ is noise drawn at each $x$ from the Gaussian distribution whose probability density is

$$\begin{eqnarray*}
p(\varepsilon)=\dfrac{1}{\sqrt{2\pi\sigma^{2}}}
e^{-\dfrac{(\varepsilon-\mu)^{2}}{2\sigma^{2}}},
\end{eqnarray*}$$

where $\mu$ is the mean and $\sigma$ the standard deviation ($\mu=0$ and $\sigma=1$ in the code below).

The code is below. The details are introduced in another post.

import numpy as np
##-- Model Function for creating the train dataset
def func(param, X):
    return param[0] + param[1]*X + param[2]*np.power(X, 2) + param[3]*np.exp(X)

##-- Set Random Seed
np.random.seed(seed=99)

x = np.arange(0, 1, 0.01)
param = [-1.0, 1.0, 2.0, 3.0]

y_model = func(param, x)
y_train = func(param, x) + np.random.normal(loc=0, scale=1.0, size=len(x))

Gaussian Process Regression

In this analysis, we use “GPy“, a Gaussian Process library for Python. In this post, the version of GPy is assumed to be 1.9.9.

import GPy
print(GPy.__version__)

>> 1.9.9

Define “the kernel function”

First, we define the kernel function. There are many kinds of kernel functions; here, we adopt “Matern52”. The theory behind kernel functions is deep, so beginners should start with a representative one for a first try.

kernel = GPy.kern.Matern52(input_dim=1)

The option “input_dim” is the number of input variables. In this case, the input variable is just one “x”, so “input_dim = 1”.
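For reference, other representative kernels in GPy are defined in the same way, and kernels can also be combined, for example by addition. Below is a minimal sketch; the particular choices are just examples.

##-- Other representative kernels (one input variable in this example)
kernel_rbf = GPy.kern.RBF(input_dim=1)
kernel_m32 = GPy.kern.Matern32(input_dim=1)

##-- Kernels can be combined, e.g. by addition
kernel_sum = GPy.kern.Matern52(input_dim=1) + GPy.kern.White(input_dim=1)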

Define the Model

Second, we define the Gaussian Process model as follows:

model = GPy.models.GPRegression(x.reshape(-1, 1), y_train.reshape(-1, 1), kernel=kernel)

There are three arguments: the input x, the target y_train, and the kernel function.

Note that the arrays passed for x and y must be two-dimensional. This is where beginners often get stuck when debugging. We can easily confirm the number of dimensions as follows:

x.ndim  # the same check applies to y_train.ndim

>> 1

Whereas,

x.reshape(-1, 1).ndim  # the same applies to y_train.reshape(-1, 1).ndim

>> 2

Optimize the Model

Now that we have defined the model, we can optimize it.

model.optimize()

GPy wraps matplotlib internally, so you can quickly see the optimized model in just one line. Note that this convenient plot is available when there is a single input variable; plotting models with multiple input variables in this way is not supported.

model.plot(figsize=(5, 5), dpi=100, xlabel="x", ylabel="y")

Prediction

Now that the model has been optimized, let’s check the predicted values and the confidence (uncertainty) on data outside the training range. If all goes well, you’ll see a wider confidence interval in the region outside the training data range.

Since the training data was in the range of 0 to 1, we will prepare the test data in the range of 0 to 2.

x_test = np.arange(0, 2, 0.01)

Then, predict the test data by the “.predict()” method.

y_mean, y_var = model.predict(x_test.reshape(-1, 1))
y_std = np.sqrt(y_var)  # predictive standard deviation

Note that “model.predict()” returns two values. The first one is the regression value (the predictive mean) and the second one is the predictive variance. Taking its square root gives the standard deviation, which we use as the confidence interval below.

Finally, let’s plot the result.

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(5, 5), dpi=100)
sns.set()
plt.xlabel("x")
plt.ylabel("y")
plt.scatter(x, y_train, lw=1, color="b", label="training dataset")
plt.plot(x_test, func(param, x_test), lw=3, color="r", label="model function")
plt.plot(x_test, y_mean, lw=3, color="g", label="GP mean")
plt.fill_between(x_test, (y_mean + y_std).reshape(y_mean.shape[0]), (y_mean - y_std).reshape(y_mean.shape[0]), facecolor="b", alpha=0.3, label="confidence")
plt.legend(loc="upper left")
plt.show()

Congratulations!! We can confirm that as the test data moves away from the training range, the predicted value deviates from the model function and the confidence interval becomes wider.

Summary

So far, we have seen the process of Gaussian process regression with a simple example. Gaussian process regression may seem hard to approach because of its mathematical complexity. However, it provides information about confidence, making it possible to judge the validity of a prediction.

The author would be happy if this post helps the reader to try Gaussian process analyses.