Step-by-step to a Data Scientist

October 24, 2020 / December 24, 2020

Python Shortcode for Linear Regression

Linear regression is one of the most basic data analysis methods. It models the relationship between the independent variables and the target values in a dataset as a linear function. Therefore, if a linear model fits the dataset well, the analysis is highly interpretable.

Purpose of this post

The purpose is to introduce a short code example for linear regression.

Explanation of Linear Regression

In this post, we only review the basic concept of linear regression. The details are covered in another post; please refer to it.

Brief Explanation of the Theory of Linear Regression

A representation of a linear regression is as follows:

$$y =\omega_{0}+\omega_{1}x_{1}+\omega_{2}x_{2}+\ldots+\omega_{N}x_{N},$$

where $x_{i}$ is an independent variable and $\omega_{i}$ is a coefficient.

Here, for convenience, we adopt a model with just one independent variable. This is the simplest and most familiar form.

$$y =\omega_{0}+\omega_{1}x_{1}.$$

Training Dataset

We create the training dataset with the following code. How to create it is explained in another post.

import numpy as np
##-- Model Function for creating the toy dataset
def func(param, X):
    return param[0] + param[1]*X + param[2]*np.power(X, 2) + param[3]*np.exp(X)

x = np.arange(0, 1, 0.01)
param = [-1.0, 1.0, 2.0, 3.0]

np.random.seed(seed=99) # Set Random Seed
y_model = func(param, x)
y_train = func(param, x) + np.random.normal(loc=0, scale=1.0, size=len(x))

Create a Toy Dataset by the Noise Function

Linear Regression Analyses

You can easily perform a linear regression analysis with the scikit-learn class “LinearRegression()“. The procedure is as follows:

1. Create an instance of “LinearRegression()” named “lr“
2. Train the model instance “lr” on the dataset with the “.fit()” method
3. Predict on the training dataset with the “.predict()” method

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x.reshape(-1, 1), y_train)
y_pred = lr.predict(x.reshape(-1, 1))

You may wonder what “x.reshape(-1, 1)” means. The input variable “x” must be passed as a two-dimensional array of shape (number of samples, number of features). You will see the following error message if you pass “x” as a one-dimensional array.

ValueError: Expected 2D array, got 1D array instead:
*************** This part depends on your data ***************
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In such a case, convert “x” into a two-dimensional array with “x.reshape(-1, 1)“, where “-1” lets NumPy infer the number of rows.
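As a quick check, the following sketch shows how the shape changes. The variable names “x_1d” and “x_2d” are used only for this illustration and do not appear elsewhere in the post.

```python
import numpy as np

x_1d = np.arange(0, 1, 0.01)   # one-dimensional array of 100 values
x_2d = x_1d.reshape(-1, 1)     # 100 rows (samples), 1 column (feature)

print(x_1d.shape)  # (100,)
print(x_2d.shape)  # (100, 1)
```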

Congratulations! The linear regression analysis is now finished. You can check the result of your model as follows:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(5, 5), dpi=100)
sns.set()
plt.xlabel("x")
plt.ylabel("y")
plt.ylim(-1.5, 12.5)
plt.scatter(x, y_train, lw=1, color="b", label="dataset")
plt.plot(x, y_pred, lw=5, color="r", label="linear regression")
plt.legend()
plt.show()
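Besides the plot, it may help to look at the fitted numbers themselves. The sketch below re-creates the same toy dataset so that it runs on its own, then prints the fitted intercept and slope through the “intercept_” and “coef_” attributes and the R² value through the “.score()” method. How good a given score is depends on your data, so no particular value is claimed here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Re-create the toy dataset used in this post
def func(param, X):
    return param[0] + param[1]*X + param[2]*np.power(X, 2) + param[3]*np.exp(X)

x = np.arange(0, 1, 0.01)
param = [-1.0, 1.0, 2.0, 3.0]
np.random.seed(seed=99)
y_train = func(param, x) + np.random.normal(loc=0, scale=1.0, size=len(x))

lr = LinearRegression()
lr.fit(x.reshape(-1, 1), y_train)

print("intercept (omega_0):", lr.intercept_)
print("slope (omega_1):", lr.coef_[0])
print("R^2 on the training data:", lr.score(x.reshape(-1, 1), y_train))
```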

Summary

Contrary to the author’s intention, this post has become a bit long. However, the essential part is the following.

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x.reshape(-1, 1), y_train)
y_pred = lr.predict(x.reshape(-1, 1))

The point is that you can perform a linear regression analysis with just four lines of code! The role of each line is as follows.

1. Import the module
2. Create the model
3. Train the model with the dataset
4. Predict

I would be glad if this post makes you feel that linear regression is easy!

October 23, 2020 / December 24, 2020

Brief Explanation of the Theory of Linear Regression

Linear regression is one of the basic analysis methods, so it is often the first attempt to investigate the relationship between independent variables and the target values of a dataset.

In this post, we briefly look through the theory of linear regression. This article focuses on the concept of linear regression, so that readers can build an intuitive picture of it.

Code implemented by Python

The implementation of the linear-regression model in Python is introduced in another post. This post covers only the brief theory with a simplified concept.

Python Shortcode for Linear Regression

What is the linear regression?

Linear regression is a technique to investigate the relationship between independent variables and the target values of a dataset, under the assumption that this relationship is linear. The most typical function would be:

$$y=ax+b,$$

where $a$ is the slope and $b$ is the intercept.

Brief theory of linear regression

A representation of a linear regression is as follows:

$$y =\omega_{0}+\omega_{1}x_{1}+\omega_{2}x_{2}+\ldots+\omega_{D}x_{D},$$

where $x_{i}$ is an independent variable, $\omega_{i}$ is a coefficient, and $D$ is the number of independent variables.

Besides, by setting $x_{1}=x$, $x_{2}=x^{2}$, and $x_{3}=e^{x}$, we can treat a function that is nonlinear in $x$, such as a polynomial, in the same linear-regression form, because it is still linear in the coefficients $\omega_{i}$:

$$y =\omega_{0}+\omega_{1}x+\omega_{2}x^{2}+\omega_{3}e^{x}.$$
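To make this idea concrete, here is a minimal sketch that builds the three columns $x$, $x^{2}$, and $e^{x}$ by hand and fits them with scikit-learn. It assumes noise-free data generated with the coefficients $(-1, 1, 2, 3)$, chosen only so that the recovered values are easy to check; with noisy data the estimates would deviate from them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(0, 1, 0.01)
y = -1.0 + 1.0*x + 2.0*x**2 + 3.0*np.exp(x)   # noise-free, for clarity

# Each column is one "independent variable": x_1 = x, x_2 = x^2, x_3 = e^x
X = np.column_stack([x, x**2, np.exp(x)])

lr = LinearRegression()   # the fitted intercept plays the role of omega_0
lr.fit(X, y)

print(lr.intercept_)  # approximately -1
print(lr.coef_)       # approximately [1, 2, 3]
```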

The next step is to vectorize the above equation.

$$y=\left(\omega_{0}\ \omega_{1}\ \ldots\ \omega_{D}\right)\left(\begin{array}{ccc}1\\x_{1}\\\vdots\\x_{D}\\\end{array}\right).$$

$$\therefore y={\bf w^{T}}{\bf x},$$

where ${\bf w}$ is the coefficient vector and ${\bf x}$ is the independent-variable vector with a leading 1. For a dataset of $N$ samples,

$$\left(\begin{array}{ccc}y_{1}\\y_{2}\\\vdots\\y_{N}\\\end{array}\right)=\left(\begin{array}{ccc}{\bf w^{T}}{\bf x_{1}}\\{\bf w^{T}}{\bf x_{2}}\\\vdots\\{\bf w^{T}}{\bf x_{N}}\\\end{array}\right).$$

In matrix form, we can rewrite the above equation as follows:

$$\left(\begin{array}{ccc}y_{1}\\y_{2}\\\vdots\\y_{N}\\\end{array}\right)=\left(\begin{array}{ccccc}1&x_{11} &x_{12}&\ldots&x_{1D} \\1&x_{21} &x_{22}&\ldots&x_{2D}\\\vdots&&&&\vdots\\1&x_{N1} &x_{N2}&\ldots&x_{ND}\\\end{array}\right)\left(\begin{array}{ccc}\omega_{0}\\\omega_{1}\\\vdots\\\omega_{D}\\\end{array}\right).$$

$$\therefore{\bf y}\ ={\bf X}{\bf w}.$$

Our purpose is to estimate ${\bf w}$. We look for the ${\bf w}$ that fits the dataset best, namely the one that minimizes the loss between the training data and the model-predicted data. We introduce the loss function $E$ as follows:

$$\begin{eqnarray*}
E\left[ {\bf y}, {\bf \hat{y}}, {\bf X}; {\bf w} \right]\ &&=\sum^{N}_{n=1}\left(\hat{y}_{n} - y_{n} \right)^{2},\\ &&=\left({\bf \hat{y}} - {\bf X}{\bf w}\right)^{T}\left({\bf \hat{y}} - {\bf X}{\bf w}\right),
\end{eqnarray*}$$

where ${\bf \hat{y}}$ is the training data, and $E\left[ {\bf y}, {\bf \hat{y}}, {\bf X}; {\bf w} \right]$ is the loss between the training data (${\bf \hat{y}}$) and the model-predicted data (${\bf y}={\bf X}{\bf w}$). Note that this type of loss function is the sum of squared errors; dividing it by $N$ gives the “mean squared error”, which has the same minimizer.

Here, we write ${\bf w^{*}}$ for the ${\bf w}$ that minimizes the loss $E\left[ {\bf y}, {\bf \hat{y}}, {\bf X}; {\bf w} \right]$. It is derived from the following condition.

\begin{eqnarray*}
\dfrac{\partial E\left[ {\bf y}, {\bf \hat{y}}, {\bf X}; {\bf w} \right]}
{\partial {\bf w}} = 0
.
\end{eqnarray*}

From the above equation, we can obtain the solution ${\bf w^{*}}$ as follows:

\begin{eqnarray*}
\therefore
{\bf w^{*}}
=
\left(
{\bf X}^{T}{\bf X}
\right)^{-1}
{\bf X}^{T}
{\bf \hat{y}}
.
\end{eqnarray*}

Note that although the details of the derivation were skipped here, we can check the omitted process in the famous textbooks of machine learning or linear algebra.
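For readers who want to see the skipped steps, one standard way to fill them in is sketched below. It uses the notation of this post and assumes that ${\bf X}^{T}{\bf X}$ is invertible.

$$\begin{eqnarray*}
E &=& \left({\bf \hat{y}} - {\bf X}{\bf w}\right)^{T}\left({\bf \hat{y}} - {\bf X}{\bf w}\right)
= {\bf \hat{y}}^{T}{\bf \hat{y}} - 2{\bf w}^{T}{\bf X}^{T}{\bf \hat{y}} + {\bf w}^{T}{\bf X}^{T}{\bf X}{\bf w},\\
\dfrac{\partial E}{\partial {\bf w}} &=& -2{\bf X}^{T}{\bf \hat{y}} + 2{\bf X}^{T}{\bf X}{\bf w} = 0,\\
{\bf X}^{T}{\bf X}{\bf w^{*}} &=& {\bf X}^{T}{\bf \hat{y}}.
\end{eqnarray*}$$

Multiplying both sides of the last line by $\left({\bf X}^{T}{\bf X}\right)^{-1}$ from the left gives the solution shown above.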

The key point is that all the information required to estimate ${\bf w^{*}}$ is the training dataset (${\bf X}$ and ${\bf \hat{y}}$). In addition, when we assume the form of the function, e.g. a polynomial, the remaining loss lets us judge the appropriateness of the assumed model.
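The formula for ${\bf w^{*}}$ can be tried directly with NumPy. The following is a minimal sketch, not the implementation used by any particular library: the training data “y_hat” is hypothetical (a line with intercept 2 and slope 3 plus small noise), and the seed value is arbitrary. The result is compared with NumPy's own least-squares routine “np.linalg.lstsq()”.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
x = np.arange(0, 1, 0.01)
y_hat = 2.0 + 3.0 * x + rng.normal(0, 0.1, size=len(x))   # hypothetical training data

# Design matrix X: a column of ones (for omega_0) and the column of x
X = np.column_stack([np.ones_like(x), x])

# Normal equation; solve() avoids forming the inverse matrix explicitly
w_star = np.linalg.solve(X.T @ X, X.T @ y_hat)

# Reference solution from NumPy's least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X, y_hat, rcond=None)

print(w_star)                        # close to [2, 3]
print(np.allclose(w_star, w_lstsq))  # True
```

Solving the linear system with “np.linalg.solve()” instead of computing the inverse is a common choice for numerical stability; the mathematical content is the same as the formula above.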

Summary

We have briefly looked at the theory of linear regression. From the form of ${\bf w^{*}}=\left({\bf X}^{T}{\bf X}\right)^{-1}{\bf X}^{T}{\bf \hat{y}}$, you can understand that a linear regression analysis is based on matrix calculations.

In practice, there are few opportunities to be conscious of the calculation process. However, once you understand that linear regression is a matrix calculation, you can understand that the amount of calculation becomes very large as the amount of data increases.

October 23, 2020 / December 24, 2020

Create a Toy Dataset by the Noise Function

A toy dataset is useful when we want to try a new analysis method quickly, especially in regression analysis. In this post, we briefly see how to create a toy dataset with NumPy.

Let’s get started.

Import the Libraries

The code is written in Python. First, we import the necessary libraries.

##-- For Numerical analyses
import numpy as np
##-- For Plot
import matplotlib.pyplot as plt
import seaborn as sns

Define the Model Function

Here, we adopt the following function as an example.

$$y =-1+x+2x^{2}+3e^{x}.$$

The Python code for the above function is below.

def func(param, X):
    return param[0] + param[1]*X + param[2]*np.power(X, 2) + param[3]*np.exp(X)

For convenience, let’s create continuous data of x in the range 0 to 1 and check the behavior of the function.

x = np.arange(0, 1, 0.01)
param = [-1.0, 1.0, 2.0, 3.0]

y = func(param, x)

The behavior of (x, y) is as follows. The model function “y” is drawn as the red solid line. Note that, in the original figure, the four terms that make up the function are also drawn as dashed lines for visual clarity; the code below plots only the model function.

plt.figure(figsize=(5, 5), dpi=100)
sns.set()
plt.xlabel("x")
plt.ylabel("y")
plt.ylim(-1.5, 12.5)
plt.plot(x, y, lw=5, color="r", label="model function")
plt.legend()
plt.show()

Generate the Noise

Next, we prepare the toy dataset by adding noise to the above function. The noise term $\varepsilon(x)$ added at each $x$ is a random value drawn from the Gaussian distribution, which is also called the normal distribution. Its probability density function $p(\varepsilon)$ can be written as follows:

$$\begin{eqnarray*}
p(\varepsilon)=\frac{1}{\sqrt{2\pi\sigma^{2}}}
e^{-\dfrac{(\varepsilon-\mu)^{2}}{2\sigma^{2}}},
\end{eqnarray*}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation.

We can easily draw such noise with the NumPy function “np.random.normal()“.

noise = np.random.normal(
                            loc   = 0,
                            scale = 1,
                            size  = 10000000,
                        )
plt.hist(noise, bins=100, color="r")

“loc“, “scale“, and “size“, the arguments of “np.random.normal()“, are the mean ($\mu$), the standard deviation ($\sigma$), and the number of samples to draw, respectively. The histogram drawn by the code above shows the bell shape of the Gaussian distribution.
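Apart from the histogram, a numerical check is also possible. The sketch below draws a smaller sample than above (100,000 values, an arbitrary choice to keep it fast) and compares the sample mean and standard deviation with “loc” and “scale”.

```python
import numpy as np

np.random.seed(seed=99)  # same seeding style as the rest of this post
noise = np.random.normal(loc=0, scale=1, size=100000)

print(noise.mean())  # close to loc = 0
print(noise.std())   # close to scale = 1
```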

The Model Function with Noise

The model function with the noise is given below, followed by the corresponding code:

$$y =-1+x+2x^{2}+3e^{x} + \varepsilon(x).$$

##-- Set Random Seed
np.random.seed(seed=99)

y_toy = func(param, x) + np.random.normal(loc=0, scale=1.0, size=len(x))

Finally, you get the toy dataset $(x, y_{\rm toy})$!
You can check the behavior of the toy dataset as follows:

plt.figure(figsize=(5, 5), dpi=100)
sns.set()
plt.xlabel("x")
plt.ylabel("y")
plt.ylim(-1.5, 12.5)
plt.scatter(x, y_toy, lw=1, color="b", label="toy dataset")
plt.plot(x, y, lw=5, color="r", label="model function")
plt.legend()
plt.show()

October 8, 2020 / December 24, 2020

Lambda Function with Pandas

In the previous post, the basics of the “lambda function“ were introduced. In this post, the author introduces a practical use case, i.e., lambda functions × Pandas.

Lambda Function for Python Beginners

What is Pandas?

Pandas, a library for data structures, is known as one of the essential libraries for data analysis, alongside NumPy, SciPy, and Scikit-learn. Pandas is designed to handle table data, such as an Excel sheet, easily and flexibly.

Pandas can read and write files in various formats and provides a rich set of methods. These features make it possible to analyze table data efficiently. If you look at a data science competition (e.g. Kaggle), you will see that Pandas is an essential tool for data scientists.

Lambda Function × Pandas

Pandas is used for table data analysis. So, there are situations where you would like to apply the same operation to each element of sequence data (e.g. one column of table data).

That’s exactly where the combination of Pandas and lambda functions comes into play.

Ex. Categorize the Age Group

We first prepare a list of ages: 18, 50, 28, 78, and 33. Second, we convert the list “age_list” into a Pandas DataFrame with the column name “Age”.

import pandas as pd
age_list = [18, 50, 28, 78, 33]
age_list = pd.DataFrame(age_list, columns=["Age"])
print(age_list)

>>    Age
>> 0   18
>> 1   50
>> 2   28
>> 3   78
>> 4   33

Next, we categorize each element of the column “age_list[“Age”]”. Note that you must define the function for classification beforehand.

Here, we prepare a function that categorizes ages into the groups “unknown”, “Under 20”, “20-40”, “41-60”, and “Over 60”. Note that “unknown” is for invalid inputs such as negative ages.

def categorize_age(x):
    """Return the age-group label for the age x."""
    x = int(x)
    if x < 0:
        return "unknown"
    elif x < 20:
        return "Under 20"
    elif x <= 40:
        return "20-40"
    elif x <= 60:
        return "41-60"
    else:
        return "Over 60"

Then, let’s apply the above function “categorize_age()” to each element of the column “age_list[“Age”]”. The result is assigned to the newly generated column “Generation”.

Note that, to apply the function, we use the “apply()” method and a “lambda function“.

Syntax: DataFrame[column].apply( lambda x: function(x) )

age_list["Generation"] = age_list["Age"].apply( lambda x: categorize_age(x) )
print(age_list)

>>    Age Generation
>> 0   18   Under 20
>> 1   50      41-60
>> 2   28      20-40
>> 3   78    Over 60
>> 4   33      20-40
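For comparison, the same grouping can be reached in two other ways, sketched below. First, “apply()” also accepts a named function directly, so the lambda wrapper is optional when the function takes a single argument. Second, “pd.cut()” bins the whole column at once. The DataFrame name “df” and the bin edges are assumptions made for this sketch; the edges reproduce the rules of “categorize_age()” only for integer ages.

```python
import numpy as np
import pandas as pd

def categorize_age(x):
    x = int(x)
    if x < 0:
        return "unknown"
    elif x < 20:
        return "Under 20"
    elif x <= 40:
        return "20-40"
    elif x <= 60:
        return "41-60"
    return "Over 60"

df = pd.DataFrame({"Age": [18, 50, 28, 78, 33]})

# 1) apply() accepts the function itself, without a lambda wrapper
by_apply = df["Age"].apply(categorize_age)

# 2) pd.cut() bins the whole column at once (integer ages assumed)
bins = [-np.inf, -1, 19, 40, 60, np.inf]
labels = ["unknown", "Under 20", "20-40", "41-60", "Over 60"]
by_cut = pd.cut(df["Age"], bins=bins, labels=labels)

print(by_apply.tolist())
print(by_cut.astype(str).tolist())
```

A lambda remains handy when the processing is short enough to write inline, or when extra arguments have to be fixed at the call site.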

Summary

In this article, we have seen that a lambda function becomes a powerful tool when it is used with Pandas. When analyzing table data, you will often need to apply arbitrary processing to each element of a column or a row of a Pandas DataFrame.

That is the time to use a lambda function!

October 7, 2020 / December 24, 2020

Lambda Function for Python Beginners

A lambda function may be unfamiliar to Python beginners. The lambda function is never strictly necessary, but if it is used appropriately, it lets you express small pieces of processing compactly.

What is a lambda function? Its two main features are as follows.

1. Anonymous function with a return value
2. Written as a single expression

Let’s start with a simple example to get the idea!
We will see two code examples. Note that both do the same thing: they return the square of the input variable.

Standard Style

This example is the one you might be familiar with.

def square(x):
    return x*x

ret = square(2)
print(ret)

>> 4

Lambda function Style

In the lambda-function style, you can express the same function in just one line!

ret = (lambda x: x*x)(2)
print(ret)

>> 4

The syntax of lambda functions is below.

Syntax:    lambda x: f(x)

“lambda” just declares that “this is a function”. The part before the colon, “x”, is the argument, and the expression after the colon, “f(x)”, is the returned value.

Since the whole expression “lambda x: f(x)” is the function itself, we wrap it in parentheses and pass an argument to x as in the above example.
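Because a lambda is an ordinary function object, it can also be bound to a name or handed to another function. The short sketch below shows both; the names “square” and “words” are chosen only for this illustration.

```python
# A lambda can be bound to a name and called like any other function ...
square = lambda x: x * x
print(square(3))  # 9

# ... or passed directly to another function, here as the sort key
words = ["banana", "fig", "cherry"]
print(sorted(words, key=lambda w: len(w)))  # ['fig', 'banana', 'cherry']
```

When a function needs a name, a regular “def” is usually the clearer choice; the lambda shines in the second case, where the function is used once and on the spot.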

Example

Let’s square each element of the list.

A standard implementation with a “for loop” is as follows.

num_list = [1, 2, 3, 4]
for i in range(len(num_list)):
    num_list[i] = num_list[i]*num_list[i]
print(num_list)

>> [1, 4, 9, 16]

On the other hand, we can rewrite the “for loop” as a single line with a lambda function.

num_list = [1, 2, 3, 4]
num_list = list( map(lambda x: x*x, num_list) )

print(num_list)

>> [1, 4, 9, 16]

The map() function applies the same processing to each element of a list.

Syntax:    map(function, iterable)

You can interpret “function” and “iterable” as “f(x)” and the collection of “x” values, respectively. Note that map() returns an iterator, so it is wrapped in list() to obtain a list.
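The following sketch illustrates two points about map(): it returns a lazy iterator rather than a list, and it leaves the original list unchanged. A list comprehension, shown for comparison, gives the same result. The variable names are chosen only for this example.

```python
num_list = [1, 2, 3, 4]

squared_map = map(lambda x: x * x, num_list)   # a lazy map object, nothing computed yet
squared_list = list(squared_map)               # list() pulls the values out

# A list comprehension gives the same result
squared_comp = [x * x for x in num_list]

print(squared_list)  # [1, 4, 9, 16]
print(squared_comp)  # [1, 4, 9, 16]
print(num_list)      # [1, 2, 3, 4]  (the original list is unchanged)
```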

There is much more to the map() function, so a detailed explanation is NOT given here. However, like map(), lambda functions become a powerful tool when used together with Pandas.

Summary

In this article, we saw a brief introduction to the lambda function. Using it makes compact expressions possible. Consequently, the habit of writing less code may improve the readability and the maintainability of your code.

October 3, 2020 / December 24, 2020

Convert Jupyter Notebook into Python Script

The Jupyter Notebook is a useful editor because the notebook style makes it possible to code interactively. It is especially powerful for creating a prototype.

However, the script style (“***.py”) is often better than the notebook style (“***.ipynb”) when the work grows to project size. For example, imagine a data science competition such as Kaggle.

But there is nothing to worry about. We can convert a notebook into a script with just one command.

Python tip command

To see an example, we prepare the “work” directory, which contains the following Jupyter Notebook file, “sample.ipynb”.

:~/work$ ls
>>sample.ipynb

The content of “sample.ipynb” is below.

Convert Command: “jupyter nbconvert”

Just run one line of command!

jupyter nbconvert --to script sample.ipynb

Then, you get the Python script “sample.py”, together with the success message “Converting notebook sample.ipynb to script”.

:~/work$ ls
>>sample.ipynb  sample.py

The contents of “sample.py” are as follows. In[1] and In[2] denote the first and second cells in “sample.ipynb”.

#!/usr/bin/env python
# coding: utf-8

# In[1]:

import numpy as np
a = np.array([0, 1, 2, 3])

# In[2]:

print(a)
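To get a feeling for what such a conversion involves, recall that an “.ipynb” file is a JSON document whose “cells” list stores each cell with a “cell_type” and a “source”. The sketch below is only a simplified illustration, not how “jupyter nbconvert” is implemented: it builds a tiny notebook by hand (an assumption standing in for “sample.ipynb”) and extracts the code cells with a hypothetical helper “extract_code()”. The real command also adds the header and the “# In[n]:” comments shown above.

```python
import json

# A minimal notebook written by hand for illustration (nbformat 4 layout)
notebook_json = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "code", "source": ["import numpy as np\n", "a = np.array([0, 1, 2, 3])"]},
        {"cell_type": "markdown", "source": ["A note"]},
        {"cell_type": "code", "source": ["print(a)"]},
    ],
})

def extract_code(nb_text):
    """Return the source of all code cells, joined by blank lines (hypothetical helper)."""
    nb = json.loads(nb_text)
    chunks = ["".join(cell["source"]) for cell in nb["cells"] if cell["cell_type"] == "code"]
    return "\n\n".join(chunks) + "\n"

print(extract_code(notebook_json))
```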

In summary, we saw that a single command converts a Jupyter Notebook into a Python script. I hope you will use it!

September 26, 2020 / January 10, 2021

List and Tuple, an explanation for Python beginners

There are many opportunities to handle data structures that group numbers or strings, such as [1, 2, 3, ...] and [a, b, c, ...]. In general, such an ordered collection of data is called an “array”. In Python, the array-like structures are “list” and “tuple”. Why are there two kinds of array-like structures? Because they are used differently. Python beginners may be confused at first, but no problem! After reading this short article, you will know which one to use. Just remember the one point of difference.

One point you should know

My understanding is as follows:

The difference between list and tuple is, in a word, “whether you can change the contents or not”, for example by adding a new element or changing an existing one. With a list, you can. In contrast, you cannot modify a tuple once you have defined it.

That is all you should know.

List

List is a data structure that holds a sequence of elements. You can change, add, and delete any element. In Python, we call this property “mutable”. Let’s confirm that list is mutable with a simple example.

>>> A = [1,  2,  3]
>>> A[0]
1

“A” is a list that stores 1, 2, and 3. A list is written with “[]”. A[0] represents the first element, 1. Note that, in Python, list indices start from 0.
Since the list is mutable, let’s rewrite and add elements.

>>> A[1] = 5  # before: A = [1, 2, 3]
>>> A
[1, 5, 3]

You can see that the second element A[1] was rewritten from “2” to “5”.

Next, let’s add a new element. Use append() to add a new element to the end of the list.

>>> A.append(10)  # before: A = [1, 5, 3]
>>> A
[1, 5, 3, 10]

You can see that “10” was added to the end of A by “A.append(10)”.

Now, let’s delete the element “10” that was just added.

>>> del A[3]  # before: A = [1, 5, 3, 10]
>>> A
[1, 5, 3]

The element “10” that we just added has been deleted.

As in the example above, we have confirmed that the elements of a list can be changed. Next, let’s see that a tuple cannot be changed.

Tuple

Tuple is, like list, a sequence-type data structure. In contrast to list, however, we cannot change the elements of a tuple once it is defined. In Python, we call this property “immutable”. Let’s take a simple example to see how immutable it really is.

>>> A = (1, 2, 3)
>>> A[0]
1

A is a tuple that stores 1, 2, and 3. A tuple is written with “()”, which is different from list. A[0] represents the first element of A, that is, A[0]=1. Note that tuple also uses “[]” when retrieving an element.

Now, let’s confirm that the elements of A cannot be changed, so that you can experience that the tuple is immutable.

>>> A[1] = 50
TypeError: 'tuple' object does not support item assignment

If you try to assign 50 to the second element A[1], Python stops and raises the TypeError shown above.

Next, let’s try deleting the element A[1].

>>> del A[1]
TypeError: 'tuple' object does not support item deletion

Again, Python stops and raises a TypeError.
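If you would like to try this in a script without stopping the program, the error can be caught with “try/except”. The sketch below also shows one common way to obtain a “changed” tuple: build a new tuple from slices of the old one. The name “B” is introduced only for this example.

```python
A = (1, 2, 3)

try:
    A[1] = 50
except TypeError as err:
    print("TypeError:", err)   # the tuple itself is left untouched

# To get a "changed" tuple, build a new one from the old one
B = A[:1] + (50,) + A[2:]
print(A)  # (1, 2, 3)
print(B)  # (1, 50, 3)
```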

Benefits of tuple

You may think that list is more practical than tuple. It is true that list is more flexible, so list is a reasonable default. Does that mean tuple is not needed? No. There are situations where you should use tuple.

The advantages of tuple that Python beginners should know are as follows:

1. Small memory usage
2. No risk of unintentional rewriting
3. Usage as a dictionary key

The first one is basic knowledge for programmers: flexibility usually comes at a cost. Because a tuple never changes its size, it typically needs less memory than a list with the same elements. Python is not a particularly fast language, so it is worth knowing how to make your code lighter.
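The memory difference can be observed with “sys.getsizeof()”, as in the sketch below. It reports the size of the container object itself in bytes. The exact numbers depend on the Python version and platform, so only the comparison is shown; the statement that the tuple is smaller is made for the standard CPython interpreter.

```python
import sys

as_list = [1, 2, 3]
as_tuple = (1, 2, 3)

# getsizeof() reports the size of the container object itself in bytes;
# the exact numbers depend on the Python version and platform
print(sys.getsizeof(as_list))
print(sys.getsizeof(as_tuple))
print(sys.getsizeof(as_tuple) < sys.getsizeof(as_list))
```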

The second one is useful for reducing bugs. It is said that programmers spend more time debugging than writing code. Humans always make mistakes, and we should take this as a premise. If you know in advance that a sequence does not need to change, a tuple is a good choice.

The third one is the usage of tuple as dictionary keys. A dictionary is another built-in data structure, which maps keys to values. This article hasn’t covered dictionaries, but one thing to keep in mind is that you can’t use a list as a dictionary key, whereas you can use a tuple (as long as its elements are themselves hashable). Therefore, it helps to understand tuple before learning about dictionaries.
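As a small preview, the sketch below uses tuples as dictionary keys and shows the error raised when a list is used instead. The dictionary “grid” and its contents are hypothetical and serve only as an illustration.

```python
# A tuple can be used as a dictionary key (here: hypothetical grid coordinates)
grid = {(0, 0): "origin", (1, 2): "point A"}
print(grid[(1, 2)])  # point A

# A list cannot, because it is mutable and therefore unhashable
try:
    bad = {[0, 0]: "origin"}
except TypeError as err:
    print("TypeError:", err)
```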