Brief Explanation of the Python Code of Polynomial Regression

Polynomial regression is a technique for expressing a dataset with a linear regression model built from polynomial terms. Although the model is linear in its coefficients, each polynomial term carries a non-linear effect, so the model as a whole can capture non-linear behavior. This is why polynomial regression can be a powerful tool.

It is especially powerful when you already have an idea of the concrete functional form of the relationship.

In this post, we walk through the process of a polynomial-regression analysis with Python. The details of the theory are introduced in another post.

Import the Libraries

The code is written in Python. So first, we import the necessary libraries.

##-- For Numerical analyses
import numpy as np
##-- For DataFrames
import pandas as pd
##-- For Plot
import matplotlib.pyplot as plt
import seaborn as sns
##-- For Linear Regression Analyses
from sklearn.linear_model import LinearRegression

Model Function

In this post, we adopt the following function as a polynomial regression model.

$$y =\omega_{0}+\omega_{1}x+\omega_{2}x^{2}+\omega_{3}e^{x},$$

where the $\omega_{i}$ are the coefficients. Here, we set them as follows:

$$\begin{eqnarray*}
{\bf w^{T}}&&=\left(\omega_{0}\ \omega_{1}\ \omega_{2}\ \omega_{3}\right),\\
&&=\left(-1\ 1\ 2\ 3\right).
\end{eqnarray*}$$

Namely,

$$y =-1+x+2x^{2}+3e^{x}.$$

Create the Training Dataset

Next, we prepare the training dataset by adding noise generated from “the Gaussian distribution”, which is also called “the Normal distribution”. With the noise term $\varepsilon(x)$, we can rewrite the model function as follows:

$$\begin{eqnarray*}
y =-1+x+2x^{2}+3e^{x} + \varepsilon(x),
\end{eqnarray*}$$

where $\varepsilon(x)$ is, at each $x$, a random value drawn from the normal distribution with mean $\mu$ and standard deviation $\sigma$ (in this post, $\mu=0$ and $\sigma=1$), whose probability density is

$$\begin{eqnarray*}
p(\varepsilon)=\dfrac{1}{\sqrt{2\pi\sigma^{2}}}
e^{-\dfrac{(\varepsilon-\mu)^{2}}{2\sigma^{2}}}.
\end{eqnarray*}$$

##-- Model Function
def func(param, X):
    return param[0] + param[1]*X + param[2]*np.power(X, 2) + param[3]*np.exp(X)

x = np.arange(0, 1, 0.01)
param = [-1.0, 1.0, 2.0, 3.0]

np.random.seed(seed=99) # Set Random Seed
y_model = func(param, x)
y_train = func(param, x) + np.random.normal(loc=0, scale=1.0, size=len(x))

We can check the training dataset ($x$, $y_{\text{train}}$) as follows:

plt.figure(figsize=(5, 5), dpi=100)
sns.set()
plt.xlabel("x")
plt.ylabel("y")
plt.scatter(x, y_train, lw=1, color="b", label="toy dataset")
plt.plot(x, y_model, lw=5, color="r", label="model function")
plt.legend()
plt.show()

The details of the way to create the training dataset with the above model function are explained in another post.

Prepare the Dataset as Pandas DataFrame

Here, we prepare the dataset as a pandas DataFrame. First, we create an empty DataFrame with “pd.DataFrame()“. Next, we store the values of each polynomial term in its own column.

x_train = pd.DataFrame()

x_train["x"] = x
x_train["x^2"] = np.power(x, 2)
x_train["exp(x)"] = np.exp(x)

If you check only the first five rows of the created DataFrame, it looks as follows.

print( x_train.head() )

>>       x     x^2    exp(x)
>> 0  0.00  0.0000  1.000000
>> 1  0.01  0.0001  1.010050
>> 2  0.02  0.0004  1.020201
>> 3  0.03  0.0009  1.030455
>> 4  0.04  0.0016  1.040811

Note that since the constant term is treated as the intercept during the linear regression analysis, it is not necessary to create a column for the constant term (all values equal to “1”) here.
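If you do want to keep the constant column explicitly, one option is to disable the automatic intercept of the scikit-learn estimator used in the next section. This is a minimal sketch (not part of the original analysis); both approaches should give the same fit.

##-- Sketch: keep an explicit constant column and disable the automatic intercept
x_train_c = x_train.copy()
x_train_c.insert(0, "const", 1.0)   # column of all ones for w0

regressor_c = LinearRegression(fit_intercept=False)   # w0 is estimated as a normal coefficient
regressor_c.fit(x_train_c, y_train)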

Polynomial Regression

Finally, we perform the linear regression analysis on the polynomial terms!

We apply it to the training dataset (x_train, y_train) prepared above. Here, we use the “LinearRegression()” class from the scikit-learn library and create an instance named “regressor”.

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()

Then, we give the training dataset (x_train, y_train) to “regressor“ and train it.

regressor.fit(x_train, y_train)

Model training is over, so let’s predict.

y_pred = regressor.predict(x_train)

Let’s see the prediction result.

plt.figure(figsize=(5, 5), dpi=100)
sns.set()
plt.xlabel("x")
plt.ylabel("y")
plt.scatter(x, y_train, lw=1, color="b", label="training dataset")
plt.plot(x, y_model, lw=3, color="r", label="model function")
plt.plot(x, y_pred, lw=3, color="g", label="Polynomial regression")
plt.legend()
plt.show()

In the plot above, the red solid line, the green solid line, and the blue circles show the model function, the polynomial regression, and the training dataset, respectively. As you can see, the polynomial regression is in good agreement with the model function.
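Beyond the visual check, it can help to quantify the agreement. As a small addition (not in the original analysis), scikit-learn provides the coefficient of determination $R^{2}$ through “score()”, and the mean squared error is available from “sklearn.metrics”.

##-- Sketch: quantify the fit on the training data
from sklearn.metrics import mean_squared_error

print("R^2 :", regressor.score(x_train, y_train))     # coefficient of determination
print("MSE :", mean_squared_error(y_train, y_pred))   # mean squared error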

Coefficients of Polynomial Regression

We have confirmed that the model function is reproduced well, so let’s now check the coefficients of the polynomial regression model.

We can confirm the intercept and coefficients through the attributes “intercept_” and “coef_” of the regression instance “regressor”.

print("w0", regressor.intercept_, "\n", \
      "w1", regressor.coef_[0], "\n", \
      "w2", regressor.coef_[0], "\n", \
      "w3", regressor.coef_[0], "\n" )

>> w0 -5.385823038595163 
>> w1 -4.479990435844182 
>> w2 -0.187758224924017
>> w3  7.606988274352645 

The estimated ${\bf w^{T}}$,

$$\begin{eqnarray*}
{\bf w^{T}}&&=\left(-5.4\ -4.5\ -0.2\ 7.6\right),
\end{eqnarray*}$$

deviates from that of the model function,

$$\begin{eqnarray*}
{\bf w^{T}}&&=\left(-1\ 1\ 2\ 3\right).
\end{eqnarray*}$$

This deviation comes from the fact that the range of the training data (0 to 1) is narrow. Therefore, if we widen the range of the training data, the estimated values will move closer to those of the model function.

For example, when the range of training data is 0 to 2, ${\bf w^{T}}$ becomes $\left(-0.6\ 1.6\ 2.2\ 2.6\right)$. How is it? The estimation is very close to the correct answer.

Also, if the range of training data is 0 to 10, you can get an almost correct result.
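Widening the range only requires regenerating the data and repeating the same steps. Below is a minimal sketch for the range 0 to 2 (the exact estimates will vary with the random noise).

##-- Sketch: refit on a wider range (0 to 2)
x_wide = np.arange(0, 2, 0.01)
y_wide = func(param, x_wide) + np.random.normal(loc=0, scale=1.0, size=len(x_wide))

x_train_wide = pd.DataFrame()
x_train_wide["x"] = x_wide
x_train_wide["x^2"] = np.power(x_wide, 2)
x_train_wide["exp(x)"] = np.exp(x_wide)

regressor_wide = LinearRegression()
regressor_wide.fit(x_train_wide, y_wide)
print(regressor_wide.intercept_, regressor_wide.coef_)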

Summary

We have briefly looked at the process of polynomial regression. Polynomial regression is a powerful and practical technique. However, without proper verification of the results, there is a risk of making a big mistake in predicting extrapolated values.

The author hopes this blog helps readers a little.

Brief Explanation of the Theory of Linear Regression

The linear-regression technique is one of the most basic analysis methods, so it is often the first attempt at investigating the relationship between independent variables and a dataset.

In this post, we briefly look through the theory of linear regression. This article focuses on the concept of linear regression, so it should help readers build an intuitive understanding.

Code implemented by Python

The implementation of the linear-regression model in Python is introduced in another post. This post covers just the brief theory with a simplified concept.

What is the linear regression?

Linear regression is a technique to investigate the relationship between independent variables and a dataset, where we assume that this relationship is linear. The most typical function would be:

$$y=ax+b,$$

where $a$ is the slope and $b$ is the intercept.

(Figure: example of linear regression)

Brief theory of linear regression

A representation of a linear regression is as follows:

$$y =\omega_{0}+\omega_{1}x_{1}+\omega_{2}x_{2}+\ldots+\omega_{D}x_{D},$$

where $x_{i}$ is an independent variable and $\omega_{i}$ is a coefficient.

Besides, setting $x_{i}$ as $x_{1}=x$, $x_{2}=x^{2}$, and $x_{3}=e^{x}$, we can treat polynomial regression as the linear regression form:

$$y =\omega_{0}+\omega_{1}x+\omega_{2}x^{2}+\omega_{3}e^{x}.$$

The next step is to vectorize the above equation.

$$y=\left(\omega_{0}\ \omega_{1}\ \ldots\ \omega_{D}\right)\left(\begin{array}{ccc}1\\x_{1}\\\vdots\\x_{D}\\\end{array}\right).$$

$$\therefore y={\bf w^{T}}{\bf x},$$

where ${\bf w^{T}}$ is a coefficient vector and ${\bf x}$ is an independent variable vector. For a dataset with N samples,

$$\left(\begin{array}{ccc}y_{1}\\y_{2}\\\vdots\\y_{N}\\\end{array}\right)=\left(\begin{array}{ccc}{\bf w^{T}}{\bf x_{1}}\\{\bf w^{T}}{\bf x_{2}}\\\vdots\\{\bf w^{T}}{\bf x_{N}}\\\end{array}\right).$$

In matrix form, we can rewrite the above equation as follows:

$$\left(\begin{array}{ccc}y_{1}\\y_{2}\\\vdots\\y_{N}\\\end{array}\right)=\left(\begin{array}{ccccc}1&x_{11} &x_{12}&\ldots&x_{1D} \\1&x_{21} &x_{22}&\ldots&x_{2D}\\\vdots&&&&\vdots\\1&x_{N1} &x_{N2}&\ldots&x_{ND}\\\end{array}\right)\left(\begin{array}{ccc}\omega_{0}\\\omega_{1}\\\vdots\\\omega_{D}\\\end{array}\right).$$

$$\therefore{\bf y}\ ={\bf X}{\bf w}.$$
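To connect this matrix form with the polynomial example from the first half of this post, here is a small NumPy sketch (assuming the same feature map $1$, $x$, $x^{2}$, $e^{x}$) that builds the design matrix ${\bf X}$ and evaluates ${\bf y}={\bf X}{\bf w}$.

##-- Sketch: design matrix for the polynomial example (columns: 1, x, x^2, exp(x))
import numpy as np

x = np.arange(0, 1, 0.01)
X = np.column_stack([np.ones_like(x), x, x**2, np.exp(x)])   # shape (N, D+1)
w = np.array([-1.0, 1.0, 2.0, 3.0])

y = X @ w   # same values as y = -1 + x + 2x^2 + 3exp(x) evaluated at each sample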

Our purpose is to estimate ${\bf w}$. We therefore look for the ${\bf w}$ that fits the dataset, namely the one that minimizes the loss between the training data and the model-predicted data. We introduce the loss function $E$ as follows:

$$\begin{eqnarray*}
E\left[ {\bf y}, {\bf \hat{y}}, {\bf X}; {\bf w} \right]\ &&=\sum^{N}_{n=1}\left(\hat{y}_{n} - y_{n} \right)^{2},\\ &&=\left({\bf \hat{y}} - {\bf X}{\bf w}\right)^{T}\left({\bf \hat{y}} - {\bf X}{\bf w}\right),
\end{eqnarray*}$$

where ${\bf \hat{y}}$ is the training data, and $E\left[ {\bf y}, {\bf \hat{y}}, {\bf X}; {\bf w} \right]$ is the loss between the training data (${\bf \hat{y}}$) and the model-predicted data (${\bf y}$). Note that this loss is the sum of squared errors; minimizing it is equivalent to minimizing the “mean squared error”.

Here, we write ${\bf w^{*}}$ for the ${\bf w}$ that minimizes the loss $E\left[ {\bf y}, {\bf \hat{y}}, {\bf X}; {\bf w} \right]$; it is derived from the following condition.

$$\begin{eqnarray*}
\dfrac{\partial E\left[ {\bf y}, {\bf \hat{y}}, {\bf X}; {\bf w} \right]}
{\partial {\bf w}} = 0.
\end{eqnarray*}$$

From the above equation, we can obtain the solution ${\bf w^{*}}$ as follows:

$$\begin{eqnarray*}
\therefore
{\bf w^{*}}
=
\left(
{\bf X}^{T}{\bf X}
\right)^{-1}
{\bf X}^{T}
{\bf \hat{y}}.
\end{eqnarray*}$$

Note that although the details of the derivation are skipped here, the omitted steps can be found in standard textbooks on machine learning or linear algebra.
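As a quick sketch of the main step (assuming ${\bf X}^{T}{\bf X}$ is invertible), differentiating the quadratic form with respect to ${\bf w}$ and setting the result to zero gives the normal equations:

$$\begin{eqnarray*}
\dfrac{\partial E}{\partial {\bf w}}&&=-2{\bf X}^{T}\left({\bf \hat{y}}-{\bf X}{\bf w}\right)=0,\\
&&\Rightarrow\ {\bf X}^{T}{\bf X}{\bf w}={\bf X}^{T}{\bf \hat{y}},\\
&&\Rightarrow\ {\bf w^{*}}=\left({\bf X}^{T}{\bf X}\right)^{-1}{\bf X}^{T}{\bf \hat{y}}.
\end{eqnarray*}$$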

The key point is that all the information required to estimate ${\bf w^{*}}$ is the training dataset (${\bf X}$ and ${\bf \hat{y}}$). In addition, when we assume the form of the polynomial function, we can also judge the appropriateness of the assumed model.
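As a small numerical check (a sketch using a toy dataset like the one in the first half of this post; the array names are only for illustration), the closed-form solution can be computed directly with NumPy.

##-- Sketch: closed-form solution w* = (X^T X)^{-1} X^T y_hat with NumPy
import numpy as np

x = np.arange(0, 1, 0.01)
X = np.column_stack([np.ones_like(x), x, x**2, np.exp(x)])   # design matrix
rng = np.random.default_rng(99)
y_hat = X @ np.array([-1.0, 1.0, 2.0, 3.0]) + rng.normal(0, 1.0, size=len(x))

# solve the normal equations X^T X w = X^T y_hat
w_star = np.linalg.solve(X.T @ X, X.T @ y_hat)
print(w_star)

In practice, np.linalg.lstsq (or scikit-learn itself) is numerically safer than forming ${\bf X}^{T}{\bf X}$ explicitly, but the result is the same solution.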

Summary

We have briefly looked at the theory of linear regression. From the form of ${\bf w^{*}}=\left({\bf X}^{T}{\bf X}\right)^{-1}{\bf X}^{T}{\bf \hat{y}}$, you can see that linear regression analysis rests on matrix calculations.

In practice, there are few opportunities to be conscious of the calculation process. However, once you understand that linear regression is a matrix calculation, you can also understand that the amount of computation grows quickly as the number of data points increases.