Brief Explanation of the Python Code of Polynomial Regression

Polynomial regression is a technique that expresses a dataset with a linear regression model whose terms are polynomial (and other nonlinear) functions of the input. Although the model is linear in its coefficients, each term can carry a nonlinear effect, so polynomial regression can capture nonlinearity and is sometimes a powerful tool.

It is an especially powerful technique if you already have an idea of the concrete functional form of your data.

In this post, we walk through the process of a polynomial-regression analysis in Python. The details of the theory are introduced in another post.

Import the Libraries

The code is written in Python, so first we import the necessary libraries.

##-- For Numerical analyses
import numpy as np
##-- For Plot
import matplotlib.pylab as plt
import seaborn as sns
##-- For Linear Regression Analyses
from sklearn.linear_model import LinearRegression

Model Function

In this post, we adopt the following function as a polynomial regression model.

$$y =\omega_{0}+\omega_{1}x+\omega_{2}x^{2}+\omega_{3}e^{x},$$

where the $\omega_{i}$ are the coefficients. Here, we set them as follows:

$$\begin{eqnarray*}
{\bf w^{T}}&&=\left(\omega_{0}\ \omega_{1}\ \omega_{2}\ \omega_{3}\right),\\
&&=\left(-1\ 1\ 2\ 3\right).
\end{eqnarray*}$$

Namely,

$$y =-1+x+2x^{2}+3e^{x}.$$

Create the Training Dataset

Next, we prepare the training dataset by adding noise drawn from “the Gaussian distribution”, which is also called “the Normal distribution”. With the noise term $\varepsilon(x)$, we can rewrite the model function as follows:

$$\begin{eqnarray*}
y =-1+x+2x^{2}+3e^{x} + \varepsilon(x),
\end{eqnarray*}$$

where $\varepsilon(x)$ is noise drawn from a Gaussian distribution whose probability density is

$$\begin{eqnarray*}
\varepsilon(x)=\dfrac{1}{\sqrt{2\pi\sigma^{2}}}
e^{-\dfrac{(x-\mu)^{2}}{2\sigma^{2}}}.
\end{eqnarray*}$$

##-- Model Function
def func(param, X):
    return param[0] + param[1]*X + param[2]*np.power(X, 2) + param[3]*np.exp(X)

x = np.arange(0, 1, 0.01)
param = [-1.0, 1.0, 2.0, 3.0]

np.random.seed(seed=99) # Set Random Seed
y_model = func(param, x)
y_train = func(param, x) + np.random.normal(loc=0, scale=1.0, size=len(x))

We can check the training dataset ($x$, $y_{\text{train}}$) as follows:

plt.figure(figsize=(5, 5), dpi=100)
sns.set()
plt.xlabel("x")
plt.ylabel("y")
plt.scatter(x, y_train, lw=1, color="b", label="toy dataset")
plt.plot(x, y_model, lw=5, color="r", label="model function")
plt.legend()
plt.show()

The details of the way to create the training dataset with the above model function are explained in another post.

Prepare the Dataset as Pandas DataFrame

Here, we prepare the dataset as a Pandas DataFrame. First, we import Pandas and create an empty DataFrame with “pd.DataFrame()”. Next, we store the values of each polynomial term in its own column.

import pandas as pd

x_train = pd.DataFrame()

x_train["x"] = x
x_train["x^2"] = np.power(x, 2)
x_train["exp(x)"] = np.exp(x)

Checking the first 5 rows of the created DataFrame gives the following.

print( x_train.head() )

>>       x     x^2    exp(x)
>> 0  0.00  0.0000  1.000000
>> 1  0.01  0.0001  1.010050
>> 2  0.02  0.0004  1.020201
>> 3  0.03  0.0009  1.030455
>> 4  0.04  0.0016  1.040811

Note that since the constant term is treated as the intercept in the linear regression analysis, it is not necessary to create a column of constant terms (all values “1”) here.
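
Incidentally, if you prefer to include the constant column explicitly, scikit-learn lets you disable the automatic intercept with “fit_intercept=False”. Below is a minimal sketch; the names “x_train_const” and “regressor_noint” are just for illustration.

##-- Optional sketch: explicit constant column instead of the automatic intercept
x_train_const = x_train.copy()
x_train_const["const"] = 1.0  # column of ones playing the role of w0

regressor_noint = LinearRegression(fit_intercept=False)
regressor_noint.fit(x_train_const, y_train)  # w0 now appears in coef_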

Polynomial Regression

Finally, we perform the linear regression on the polynomial terms!!

We can now apply polynomial regression to the above training dataset (x_train, y_train). Here, we use the “LinearRegression()” class from the scikit-learn library and create the instance “regressor” from it.

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()

Then, we give the training dataset (x_train, y_train) to “regressor” to train it.

regressor.fit(x_train, y_train)

Model training is over, so let’s predict.

y_pred = regressor.predict(x_train)

Let’s see the prediction result.

plt.figure(figsize=(5, 5), dpi=100)
sns.set()
plt.xlabel("x")
plt.ylabel("y")
plt.scatter(x, y_train, lw=1, color="b", label="training dataset")
plt.plot(x, y_model, lw=3, color="r", label="model function")
plt.plot(x, y_pred, lw=3, color="g", label="Polynomial regression")
plt.legend()
plt.show()

In the plot above, the red solid line, the green solid line, and the blue circles represent the model function, the polynomial regression, and the training dataset, respectively. As you can see, the polynomial regression agrees well with the model function.

Coefficients of Polynomial Regression

Now that we have confirmed that the model function is reproduced well, let’s check the coefficients of the polynomial regression model.

We can read the intercept and coefficients from the attributes “intercept_” and “coef_” of the regression instance “regressor”.

print("w0", regressor.intercept_, "\n", \
      "w1", regressor.coef_[0], "\n", \
      "w2", regressor.coef_[1], "\n", \
      "w3", regressor.coef_[2], "\n" )

>> w0 -5.385823038595163 
>> w1 -4.479990435844182 
>> w2 -0.187758224924017
>> w3  7.606988274352645 

The estimated ${\bf w^{T}}$,

$$\begin{eqnarray*}
{\bf w^{T}}&&=\left(-5.4\ -4.5\ -0.2\ 7.6\right),
\end{eqnarray*}$$

deviates from that of the model function,

$$\begin{eqnarray*}
{\bf w^{T}}&&=\left(-1\ 1\ 2\ 3\right).
\end{eqnarray*}$$

This deviation comes from the fact that the range of the training data (0 to 1) is narrow. Therefore, if we widen the range of the training data, the estimated values will get closer to those of the model function.

For example, when the range of the training data is 0 to 2, ${\bf w^{T}}$ becomes $\left(-0.6\ 1.6\ 2.2\ 2.6\right)$, which is already quite close to the correct answer.

Also, if the range of training data is 0 to 10, you can get the almost correct result.
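
If you want to check this yourself, the following sketch re-runs the same analysis with the training range extended to 0 to 2 (the exact estimates depend on the random seed):

##-- Re-fit with a wider training range (a sketch)
x2 = np.arange(0, 2, 0.01)

x2_train = pd.DataFrame()
x2_train["x"] = x2
x2_train["x^2"] = np.power(x2, 2)
x2_train["exp(x)"] = np.exp(x2)

np.random.seed(seed=99)
y2_train = func(param, x2) + np.random.normal(loc=0, scale=1.0, size=len(x2))

regressor2 = LinearRegression()
regressor2.fit(x2_train, y2_train)
print(regressor2.intercept_, regressor2.coef_)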

Summary

We have briefly looked at the process of polynomial regression. Polynomial regression is a powerful and practical technique. However, without proper verification of the results, there is a risk of making a big mistake in predicting extrapolated values.

The author hopes this blog helps readers a little.

Beginner Guide to Gaussian Process by GPy

Gaussian Process regression is a powerful method for regression analyses. We can use it in the same way as linear regression. However, the Gaussian Process is based on computing a probability distribution, and its Bayesian approach makes it flexible to apply.

The main differences between Gaussian Process and linear regression are below.

1. It can model nonlinear relationships
2. The model carries information on both the estimate and its uncertainty

The first point is that since this method can handle non-linearity, it is a highly expressive model. For example, linear regression assumes a linear relationship between the input variables and the output, so in principle it is not suitable for datasets with non-linear relationships. In contrast, Gaussian Process regression can be applied to such datasets.

The second point is that the output of the model contains both the regression values and the confidence of the output. In other words, the Gaussian Process model tells us how confident it is in its results. This is completely different from linear regression.

Let’s see these differences concretely with a simple example. The complete notebook can be found on GitHub.

Create Training Dataset

Here, we create the training dataset by adding noise to the base function.

$$y =-1+x+2x^{2}+3e^{x} + \varepsilon(x).$$

$\varepsilon(x)$ is noise drawn from a Gaussian distribution whose probability density is:

$$\begin{eqnarray*}
\varepsilon(x)=\frac{1}{\sqrt{2\pi\sigma^{2}}}
e^{-\dfrac{(x-\mu)^{2}}{2\sigma^{2}}},
\end{eqnarray*}$$

where $\mu$ is the mean and $\sigma$ the standard deviation.

The code is below. The details are introduced in another post.

import numpy as np
##-- Model Function for creating the train dataset
def func(param, X):
    return param[0] + param[1]*X + param[2]*np.power(X, 2) + param[3]*np.exp(X)

##-- Set Random Seed
np.random.seed(seed=99)

x = np.arange(0, 1, 0.01)
param = [-1.0, 1.0, 2.0, 3.0]

y_model = func(param, x)
y_train = func(param, x) + np.random.normal(loc=0, scale=1.0, size=len(x))

Gaussian Process Regression

In this analysis, we use “GPy”, a Gaussian Process library for Python. This post assumes GPy version 1.9.9.

import GPy
print(GPy.__version__)

>> 1.9.9

Define the Kernel Function

First, we define the kernel function. There are many kinds of kernel functions; here, we adopt “Matern52”. The theory of kernel functions is deep, so beginners should start with a representative one.

kernel = GPy.kern.Matern52(input_dim=1)

The option “input_dim” is the number of input variables. In this case, the input variable is just one “x”, so “input_dim = 1”.
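
As a side note, GPy provides other representative kernels too; for example, swapping in the RBF kernel is a one-line change (a sketch, not used below):

kernel_rbf = GPy.kern.RBF(input_dim=1)  # alternative to Matern52 (not used below)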

Define the Model

Second, we define the Gaussian Process model as follows:

model = GPy.models.GPRegression(x.reshape(-1, 1), y_train.reshape(-1, 1), kernel=kernel)

There are three arguments: the input variable x, the target variable y, and the kernel function.

Note that the input arrays (x, y) must be two-dimensional. This is where beginners often get stuck when debugging. We can easily confirm the number of dimensions as follows:

x.ndim  # the raw 1-D array

>> 1

Whereas,

x.reshape(-1, 1).ndim  # after reshape: 2-D (the same applies to y_train)

>> 2

Optimize the Model

Now that we have defined the model, we can optimize it.

model.optimize()

GPy wraps matplotlib internally, so you can quickly see the optimized model in just one line. Note that this method can be used when there is a single input variable; plotting in multiple dimensions is not supported.

model.plot(figsize=(5, 5), dpi=100, xlabel="x", ylabel="y")
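
If you also want to inspect the optimized hyperparameters as numbers (the kernel variance, the lengthscale, and the noise variance), printing the model displays them; the exact values depend on your data:

print(model)  # shows the optimized kernel and noise parameters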

Prediction

Now that the model has been optimized, let’s check the predicted values and the confidence (uncertainty) on data outside the training range. If all goes well, you’ll see a wider confidence interval outside the training data range.

Since the training data was in the range of 0 to 1, we will prepare the test data in the range of 0 to 2.

x_test = np.arange(0, 2, 0.01)

Then, predict the test data by the “.predict()” method.

y_mean, y_var = model.predict(x_test.reshape(-1, 1))
y_std = np.sqrt(y_var)  # predictive standard deviation

Note that “model.predict()” returns two arrays: the predictive mean and the predictive variance. Taking the square root of the variance gives the standard deviation, which we use below to draw the confidence band.

Finally, let’s plot the result.

import matplotlib.pylab as plt
import seaborn as sns

plt.figure(figsize=(5, 5), dpi=100)
sns.set()
plt.xlabel("x")
plt.ylabel("y")
plt.scatter(x, y_train, lw=1, color="b", label="training dataset")
plt.plot(x_test, func(param, x_test), lw=3, color="r", label="model function")
plt.plot(x_test, y_mean, lw=3, color="g", label="GP mean")
plt.fill_between(x_test, (y_mean + y_std).reshape(y_mean.shape[0]), (y_mean - y_std).reshape(y_mean.shape[0]), facecolor="b", alpha=0.3, label="confidence")
plt.legend(loc="upper left")
plt.show()

Congratulations!! You can confirm that as the test data moves outside the training range, the prediction deviates from the model function and the confidence interval widens.

Summary

So far, we have seen the process of Gaussian process regression with a simple example. Gaussian process regression may seem hard to approach because of its mathematical complexity. However, it carries confidence information, which makes it possible to judge the validity of a prediction.

The author would be happy if this post helps the reader to try Gaussian process analyses.

Python Shortcode for Linear Regression

Linear regression analysis is one of the most basic data analyses. The method models the relationship between the independent variables and the dataset, so if the linear model fits the dataset well, the analysis offers high interpretability.

Purpose of this post

The purpose of this post is to introduce a short code example for linear regression.

Explanation of Linear Regression

In this post, we just check the brief concept of linear regression. The details are introduced in another post. Please refer to it.

A representation of a linear regression is as follows:

$$y =\omega_{0}+\omega_{1}x_{1}+\omega_{2}x_{2}+\ldots+\omega_{N}x_{N},$$

where $x_{i}$ is an independent variable and $\omega_{i}$ is a coefficient.

Here, for convenience, we adopt a model with just one independent variable. This is the simplest form, which everyone knows.

$$y =\omega_{0}+\omega_{1}x_{1}.$$

Training Dataset

We create the training dataset with the following code. How to create it is introduced in another post.

import numpy as np
##-- Model Function for creating the toy dataset
def func(param, X):
    return param[0] + param[1]*X + param[2]*np.power(X, 2) + param[3]*np.exp(X)

x = np.arange(0, 1, 0.01)
param = [-1.0, 1.0, 2.0, 3.0]

np.random.seed(seed=99) # Set Random Seed
y_model = func(param, x)
y_train = func(param, x) + np.random.normal(loc=0, scale=1.0, size=len(x))

Linear Regression Analyses

You can easily perform a linear regression analysis with the scikit-learn class “LinearRegression()”. The procedure is as follows:

1. Create an instance of “LinearRegression()” named “lr”
2. Train the model instance “lr” on the dataset with the “.fit()” method
3. Predict on the training dataset with the “.predict()” method

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x.reshape(-1, 1), y_train)
y_pred = lr.predict(x.reshape(-1, 1))

You may wonder what “x.reshape(-1, 1)” means. The answer is that we have to prepare the input variable “x” as a two-dimensional array. If you pass “x” as a one-dimensional array, you will see the following error message.

ValueError: Expected 2D array, got 1D array instead:
*************** (the array contents here depend on your code) ***************
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In such a case, convert “x” into a two-dimensional array with “x.reshape(-1, 1)”.
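
A quick way to see what the reshape does is to compare the array shapes before and after (here “x” has 100 elements):

print(x.shape)                 # (100,)  -> one-dimensional
print(x.reshape(-1, 1).shape)  # (100, 1) -> two-dimensional: 100 samples, 1 feature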

Congratulations!! Until here, the linear regression analysis is finished. You can confirm the result of your model as follows:

import matplotlib.pylab as plt
import seaborn as sns

plt.figure(figsize=(5, 5), dpi=100)
sns.set()
plt.xlabel("x")
plt.ylabel("y")
plt.ylim(-1.5, 12.5)
plt.scatter(x, y_train, lw=1, color="b", label="dataset")
plt.plot(x, y_pred, lw=5, color="r", label="linear regression")
plt.legend()
plt.show()

Summary

Contrary to the author’s intention, this post has become a bit long. However, the important content is as follows.

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x.reshape(-1, 1), y_train)
y_pred = lr.predict(x.reshape(-1, 1))

The point is that you can perform a linear regression analysis with JUST 4 lines of code! The role of each line is below.

1. Import the module
2. Create the model
3. Train the model with the dataset
4. Predict
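
As a quick sanity check, assuming the variables above, you can also print the coefficient of determination ($R^{2}$) with the “.score()” method:

print(lr.score(x.reshape(-1, 1), y_train))  # R^2 on the training data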

I would be glad if you think a linear regression is easy!

Brief Explanation of the Theory of Linear Regression

Linear regression is one of the most basic analysis methods, so it is often the first attempt to investigate the relationship between independent variables and a dataset.

In this post, we briefly look through the theory of linear regression. This article focuses on the concept, to help readers build an intuitive understanding.

Code implemented by Python

The implementation of the linear-regression model in Python is introduced in another post. This post covers just the brief theory with a simplified concept.

What is linear regression?

Linear regression is a technique to investigate the relationship between independent variables and a dataset, under the assumption that this relationship is linear. The most typical function would be:

$$y=ax+b,$$

where $a$ is the slope and $b$ is the intercept.

(Figure: an example of linear regression.)

Brief theory of linear regression

A representation of a linear regression is as follows:

$$y =\omega_{0}+\omega_{1}x_{1}+\omega_{2}x_{2}+\ldots+\omega_{N}x_{N},$$

where $x_{i}$ is an independent variable and $\omega_{i}$ is a coefficient.

Besides, setting $x_{i}$ as $x_{1}=x$, $x_{2}=x^{2}$, and $x_{3}=e^{x}$, we can treat polynomial regression as the linear regression form:

$$y =\omega_{0}+\omega_{1}x+\omega_{2}x^{2}+\omega_{3}e^{x}.$$

The next step is to vectorize the above equation.

$$y=\left(\omega_{0}\ \omega_{1}\ \ldots\ \omega_{D}\right)\left(\begin{array}{ccc}1\\x_{1}\\\vdots\\x_{D}\\\end{array}\right).$$

$$\therefore y={\bf w^{T}}{\bf x},$$

where ${\bf w^{T}}$ is the coefficient vector and ${\bf x}$ is the independent-variable vector. For a dataset of $N$ samples,

$$\left(\begin{array}{ccc}y_{1}\\y_{2}\\\vdots\\y_{N}\\\end{array}\right)=\left(\begin{array}{ccc}{\bf w^{T}}{\bf x_{1}}\\{\bf w^{T}}{\bf x_{2}}\\\vdots\\{\bf w^{T}}{\bf x_{N}}\\\end{array}\right).$$

As the matrix form, we can rewrite the above equation as follows:

$$\left(\begin{array}{ccc}y_{1}\\y_{2}\\\vdots\\y_{N}\\\end{array}\right)=\left(\begin{array}{ccccc}1&x_{11} &x_{12}&\ldots&x_{1D} \\1&x_{21} &x_{22}&\ldots&x_{2D}\\\vdots&&&&\vdots\\1&x_{N1} &x_{N2}&\ldots&x_{ND}\\\end{array}\right)\left(\begin{array}{ccc}\omega_{0}\\\omega_{1}\\\vdots\\\omega_{D}\\\end{array}\right).$$

$$\therefore{\bf y}\ ={\bf X}{\bf w}.$$

Our purpose is to estimate ${\bf w}$. We therefore look for the ${\bf w}$ that best fits the dataset, namely the one that minimizes the loss between the training data and the model-predicted data. We introduce the loss function $E$ as follows:

$$\begin{eqnarray*}
E\left[ {\bf y}, {\bf \hat{y}}, {\bf X}; {\bf w} \right]\ &&=\sum^{N}_{n=1}\left(\hat{y}_{n} - y_{n} \right)^{2},\\ &&=\left({\bf \hat{y}} - {\bf X}{\bf w}\right)^{T}\left({\bf \hat{y}} - {\bf X}{\bf w}\right),
\end{eqnarray*}$$

where ${\bf \hat{y}}$ is the training data, and $E\left[ {\bf y}, {\bf \hat{y}}, {\bf X}; {\bf w} \right]$ is the loss between the training data (${\bf \hat{y}}$) and the model-predicted data (${\bf y}$). Note that this type of loss function is called the “sum of squared errors”; dividing it by $N$ gives the “mean squared error”.

Here, we write the ${\bf w}$ that minimizes the loss $E\left[ {\bf y}, {\bf \hat{y}}, {\bf X}; {\bf w} \right]$ as ${\bf w^{*}}$; it is derived from the following condition.

$$\begin{eqnarray*}
\dfrac{\partial E\left[ {\bf y}, {\bf \hat{y}}, {\bf X}; {\bf w} \right]}{\partial {\bf w}} = 0.
\end{eqnarray*}$$

From the above equation, we can obtain the solution ${\bf w^{*}}$ as follows:

$$\begin{eqnarray*}
\therefore
{\bf w^{*}}=\left({\bf X}^{T}{\bf X}\right)^{-1}{\bf X}^{T}{\bf \hat{y}}.
\end{eqnarray*}$$

Note that although the details of the derivation were skipped here, we can check the omitted process in the famous textbooks of machine learning or linear algebra.

The key point is that all the information required to estimate ${\bf w^{*}}$ is the training dataset (${\bf X}$ and ${\bf \hat{y}}$). In addition, when we assume the form of the polynomial function, we can also judge the appropriateness of the assumed model.
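
As a sanity check, here is a minimal NumPy sketch of the normal equation, using the same toy model as the code posts (the seed and coefficients are carried over from there):

import numpy as np

np.random.seed(99)
x = np.arange(0, 1, 0.01)
y_hat = -1 + x + 2*np.power(x, 2) + 3*np.exp(x) \
        + np.random.normal(loc=0, scale=1.0, size=len(x))

##-- Design matrix X: columns 1, x, x^2, exp(x)
X = np.column_stack([np.ones_like(x), x, np.power(x, 2), np.exp(x)])

##-- w* = (X^T X)^{-1} X^T y_hat ; solve() is numerically safer than inv()
w_star = np.linalg.solve(X.T @ X, X.T @ y_hat)
print(w_star)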

Summary

We have briefly looked at the theory of linear regression. From the form of ${\bf w^{*}}=\left({\bf X}^{T}{\bf X}\right)^{-1}{\bf X}^{T}{\bf \hat{y}}$, you can see that a linear regression analysis rests on matrix calculations.

In practice, there are few opportunities to be conscious of the calculation process. However, once you understand that linear regression is a matrix calculation, you can also understand why the amount of computation grows quickly as the number of data points increases.

Create a Toy Dataset by the Noise Function

A toy dataset is useful when we want to try a new analysis method quickly, especially in regression analyses. In this post, we briefly see how to create a toy dataset with NumPy.

Then, let’s get started.

Import the Libraries

The code is written in Python, so first we import the necessary libraries.

##-- For Numerical analyses
import numpy as np
##-- For Plot
import matplotlib.pylab as plt
import seaborn as sns

Define the Model Function

Here, we adopt the following function as an example.

$$y =-1+x+2x^{2}+3e^{x}.$$

The Python code for the above function is below.

def func(param, X):
    return param[0] + param[1]*X + param[2]*np.power(X, 2) + param[3]*np.exp(X)

For convenience, let’s create continuous data for x in the range 0 to 1 and check the behavior of the function.

x = np.arange(0, 1, 0.01)
param = [-1.0, 1.0, 2.0, 3.0]

y = func(param, x)

The behavior of (x, y) is as follows. The model function “y” is drawn as the red solid line. Note that, for visual clarity, the four polynomial terms that construct the function are also drawn with dashed lines in the original figure.

plt.figure(figsize=(5, 5), dpi=100)
sns.set()
plt.xlabel("x")
plt.ylabel("y")
plt.ylim(-1.5, 12.5)
plt.plot(x, y, lw=5, color="r", label="model function")
plt.legend()
plt.show()

Generate the Noise

Next, we prepare the toy dataset by adding noise to the above function. The noise is generated from “the Gaussian distribution”, which is also called “the Normal distribution”. Its probability density $\varepsilon(x)$ can be written as follows:

$$\begin{eqnarray*}
\varepsilon(x)=\frac{1}{\sqrt{2\pi\sigma^{2}}}
e^{-\dfrac{(x-\mu)^{2}}{2\sigma^{2}}},
\end{eqnarray*}$$

where $\mu$ is the mean and $\sigma$ the standard deviation.

We can easily generate such noise with the NumPy function “np.random.normal()”.

noise = np.random.normal(
                            loc   = 0,
                            scale = 1,
                            size  = 10000000,
                        )
plt.hist(noise, bins=100, color="r")

“loc”, “scale”, and “size”, the arguments of “np.random.normal()”, are the mean ($\mu$), the standard deviation ($\sigma$), and the size of the output, respectively. Plotting the histogram confirms the Gaussian distribution.
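
As a quick numerical check, the sample mean and standard deviation of the generated noise should be close to the requested values of 0 and 1:

print(noise.mean(), noise.std())  # close to 0 and 1 for a large sample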

The Model Function with Noise

The model function with the noise added is as follows, and the corresponding code is below:

$$y =-1+x+2x^{2}+3e^{x} + \varepsilon(x).$$

##-- Set Random Seed
np.random.seed(seed=99)

y_toy = func(param, x) + np.random.normal(loc=0, scale=1.0, size=len(x))

Finally, you can get the toy dataset (x, y_toy)!!
You can confirm its behavior as follows:

plt.figure(figsize=(5, 5), dpi=100)
sns.set()
plt.xlabel("x")
plt.ylabel("y")
plt.ylim(-1.5, 12.5)
plt.scatter(x, y_toy, lw=1, color="b", label="toy dataset")
plt.plot(x, y, lw=5, color="r", label="model function")
plt.legend()
plt.show()

Lambda Function with Pandas

In the previous post, the basics of the lambda function were introduced. In this post, the author introduces a practical situation: lambda functions × Pandas.

What is Pandas?

Pandas, a library for data structures, is known as one of the essential libraries for data analysis, alongside NumPy, SciPy, and Scikit-learn. Pandas is designed to handle spreadsheet-like tables easily, so that we can work with table data flexibly.

Pandas can read files in various formats and has rich methods. These features make it possible to analyze table data efficiently. If you look at data science competitions (e.g., Kaggle), you can see that Pandas is an essential tool for data scientists.

Lambda Function × Pandas

Pandas is used for table-data analyses, so there will be situations where you would like to apply the same manipulation to each element of sequence data (e.g., one column of a table).

That’s exactly where the combination of Pandas and lambda functions comes into play.

Ex. Categorize the Age Group

We first prepare the age list: 18, 50, 28, 78, and 33. Second, we convert the list “age_list” into a Pandas DataFrame with the column name “Age”.

import pandas as pd
age_list = [18, 50, 28, 78, 33]
age_list = pd.DataFrame(age_list, columns=["Age"])
print(age_list)

>>    Age
>> 0   18
>> 1   50
>> 2   28
>> 3   78
>> 4   33

Next, we categorize each element of the column “age_list["Age"]”. Note that you must define the function for classification beforehand.

Here, we prepare a function that categorizes ages into the groups “unknown”, “Under 20”, “20-40”, “41-60”, and “Over 60”. Note that “unknown” is for invalid inputs such as negative ages.

def categorize_age(x):
  x = int(x)
  if x < 0:
    x = "unknown"
  elif x < 20:
    x = "Under 20"
  elif x <= 40:
    x = "20-40"
  elif x <= 60:
    x = "41-60"
  else:
    x = "Over 60"
  return x

Then, let’s apply the above function “categorize_age()” to each element of the column “age_list["Age"]” and assign the result to the newly created column “Generation”.

To do so, we use the “apply()” method together with a lambda function.

Syntax: DataFrame[column].apply( lambda x: function(x) )

age_list["Generation"] = age_list["Age"].apply( lambda x: categorize_age(x) )
print(age_list)

>>    Age Generation
>> 0   18   Under 20
>> 1   50      41-60
>> 2   28      20-40
>> 3   78    Over 60
>> 4   33      20-40
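
Incidentally, since “categorize_age()” takes a single argument, the lambda wrapper is optional here; passing the function object directly gives the same result:

age_list["Generation"] = age_list["Age"].apply(categorize_age)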

Summary

In this article, we have seen that a lambda function becomes a powerful tool when used with Pandas. When analyzing table data, you will often need to apply arbitrary processing to each element of a column or a row of a Pandas DataFrame.

It is such a time to use a lambda function!

Lambda Function for Python Beginners

A lambda function may be unfamiliar to Python beginners. Certainly, the lambda function is not always necessary, but used adequately, it makes it possible to execute arbitrary processing with a compact description.

What is a lambda function? Its two main features are as follows.

1. An anonymous function with a return value
2. Written as a single expression

Let’s start with a simple example!
We will see two example codes below. Note that both examples define the same function, which returns the square of the input variable.

Standard Style

This example is the one you might be familiar with.

def square(x):
    return x*x

ret = square(2)
print(ret)

>> 4

Lambda Function Style

With the lambda-function style, you can express the same function as above in just one line!

ret = (lambda x: x*x)(2)
print(ret)

>> 4

The syntax of lambda functions is below.

Syntax:    lambda x: f(x)

“lambda” just declares that “this is a function”. The function body is “f(x)” with the argument “x”, and its value is the returned value.

Since the whole expression “lambda x: f(x)” is the function itself, we use parentheses to pass an argument to x, as in the example above.

Example

Let’s square each element of the list.

A standard expression by “for loop” is as follows.

num_list = [1, 2, 3, 4]
for i in range(len(num_list)):
    num_list[i] = num_list[i]*num_list[i]
print(num_list)

>> [1, 4, 9, 16]

On the other hand, we can rewrite the “for loop” into one line with a lambda function.

num_list = [1, 2, 3, 4]
num_list = list( map(lambda x: x*x, num_list) )

print(num_list)

>> [1, 4, 9, 16]

The map() function applies the same processing to each element of a list.

Syntax:    map(function, iterator)

You can interpret “function” and “iterator” as just like “f(x)” and “x”, respectively.
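
For reference, a list comprehension achieves the same result and is often considered more idiomatic in Python:

num_list = [1, 2, 3, 4]
num_list = [x*x for x in num_list]  # same result as map() + lambda
print(num_list)

>> [1, 4, 9, 16]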

Actually, the map() function has a deep world of its own, so we will not go into detail here. However, in relation to map(), lambda functions become a powerful tool when used together with Pandas.

Summary

In this article, we saw a brief introduction to the lambda function. Using it makes it possible to adopt a compact expression. Consequently, a concise coding habit may improve the readability and maintainability of your code.

Convert Jupyter Notebook into Python Script

The Jupyter Notebook is a useful editor because the notebook style makes it possible to code interactively. Especially for creating a prototype, the notebook style is powerful.

However, a script style (“*.py”) is often better than a notebook style (“*.ipynb”) when the work grows to project size. For example, imagine a case such as a data science competition, e.g., Kaggle.

But there is nothing to worry about. We can convert with just one command.

Python tip command

To see an example, we prepare the “work” directory, in which the following Jupyter notebook file, “sample.ipynb”, is stored.

:~/work$ls
>>sample.ipynb

The content of “sample.ipynb” consists of two cells: the first imports NumPy and creates an array, and the second prints it.

Convert Command: “jupyter nbconvert”

Just run one command!

jupyter nbconvert --to script sample.ipynb

Then, you can get the Python script, “sample.py”, with the success message “Converting notebook sample.ipynb to script”.

:~/work$ls
>>sample.ipynb  sample.py
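
If several notebooks live in the directory, the shell’s wildcard expansion lets you convert them all at once:

jupyter nbconvert --to script *.ipynb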

The contents of “sample.py” are as follows. “In[1]” and “In[2]” denote the first and second cells of “sample.ipynb”.

#!/usr/bin/env python
# coding: utf-8

# In[1]:

import numpy as np
a = np.array([0, 1, 2, 3])

# In[2]:

print(a)

In summary, we saw that one command can convert a Jupyter notebook into a Python script. I hope you will use it!