SunPy – Planet Positions in the Solar System

SunPy is a useful Python library for accessing solar physics data. With it, we can also easily visualize planet positions in the solar system.

In this post, we will quickly see how to visualize the orbits of specified planets in a specified time series.

This post assumes the use of a Jupyter notebook.

The full code is available on Google Colab.

Install SunPy and Astropy

Before the analysis, we have to install SunPy if it is not installed yet. We can easily install it with a pip command. If you use an Anaconda environment, please use the conda command instead.

$ pip install sunpy

In this post, the version of SunPy is 3.1.8.

In addition to SunPy, we use data structures from Astropy, a core community Python package for astronomy.

$ pip install astropy

In this post, the version of Astropy is 4.3.1.

Import libraries

First, we import the necessary libraries.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from astropy.coordinates import SkyCoord
from astropy.time import Time
from sunpy.coordinates import get_body_heliographic_stonyhurst

Quick Try

Let’s try SunPy with a quick example. We specify the planets and the observation time as Astropy’s “Time” type. Then, we can get each planet’s coordinate with SunPy’s “get_body_heliographic_stonyhurst()” function.

Note that the coordinate system is NOT Cartesian. We get the coordinates as latitude, longitude, and radius, where the units are degrees and AU. AU (astronomical unit) is a distance unit; 1 AU is the mean distance between the Earth and the Sun.

obstime = Time('2022-11-22T07:54:00.005')
planet_list = ['earth', 'venus', 'mars']
planet_coord = [get_body_heliographic_stonyhurst(this_planet, time=obstime) for this_planet in planet_list]
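To get a feel for the returned values, we can simply print the attributes of each coordinate (a minimal check reusing the variables above):

for this_planet, this_coord in zip(planet_list, planet_coord):
    # lon and lat are Astropy quantities in degrees, radius is in AU
    print(this_planet, this_coord.lon, this_coord.lat, this_coord.radius)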

We can visualize them as follows.

fig = plt.figure(figsize=(6, 6))
ax1 = plt.subplot(1, 1, 1, projection='polar')
for this_planet, this_coord in zip(planet_list, planet_coord):
    plt.plot(np.deg2rad(this_coord.lon), this_coord.radius, 'o', label=this_planet)
plt.legend(loc='lower left')
plt.show()

Obtain the Orbit Information of the Planets in the Solar System

We will prepare the practical functions to visualize the orbits of specified planets in a specified time series.

First, we define a function to get a list of coordinate information instances for a specified planet at a specified time.

def get_planet_coord_list(timestamp, planet_list):
    """
    Get a list of coordinate information instances 
    for a specified planet at a specified time
    """
    # convert into the Time type of astropy
    timestamp = Time(timestamp)
    
    # get a coordinate of a specified planet
    planet_coord_list = [get_body_heliographic_stonyhurst(_planet, time=timestamp) for _planet in planet_list]

    return planet_coord_list

Second, we define a function to get coordinates of specified time and planets. In this function, we use the function get_planet_coord_list() defined above.

def get_planet_coord(timestamp, planet_list):
    """
    Get coordinates of specified time and planet

    Return: dict
        key(str): planet name
        value(dict): latitude(deg), longitude(deg), radius(AU)
            key: 'lon', 'lat', 'radius'
    """
    # a list of coordinate information instances
    # for a specified planet at a specified time
    _planet_coord_list = get_planet_coord_list(timestamp, planet_list)

    dict_planet_coord = {}
    for _planet, _coord in zip(planet_list, _planet_coord_list):
        # latitude(deg), longitude(deg), radius(AU)
        lon, lat, radius = _coord.lon, _coord.lat, _coord.radius
        # dict_planet_coord[_planet] = [lon, lat, radius]
        dict_planet_coord[_planet] = {'lon':lon, 'lat':lat, 'radius':radius}
    
    return dict_planet_coord

Third, we define a function to get the coordinates of a specified planet in a specified time series. By obtaining the coordinates for the time series, we can plot the orbit at the specified period. And, in this function, we use the function get_planet_coord() defined above.

def get_planet_coord_timeseries(timeseries, planet_list):
    """
    Get coordinates of a specified planet in a specified time series
    """
    # initialization
    dict_planet_coord_timeseries = {}
    for _planet in planet_list:
        dict_planet_coord_timeseries[_planet] = {'lon':[], 'lat':[], 'radius':[]}
    
    # Obtain coordinates of each planet in time series
    for _timestamp in timeseries:
        """
        Coordinates of the specified planet at the specified time
        
        key(str): planet name
        value(dict): latitude(deg), longitude(deg), radius(AU)
            key: 'lon', 'lat', 'radius'
        """
        dict_planet_coord = get_planet_coord(_timestamp, planet_list)
        for _planet in planet_list:
            for _key in ['lon', 'lat', 'radius']:
                dict_planet_coord_timeseries[_planet][_key].append(np.array(dict_planet_coord[_planet][_key]))

    # Convert list into ndarray
    for _planet in planet_list:
        for _key in ['lon', 'lat', 'radius']:
            dict_planet_coord_timeseries[_planet][_key] = np.array(dict_planet_coord_timeseries[_planet][_key])
    
    return dict_planet_coord_timeseries

Now all the information on planetary orbits can be obtained. Let’s actually plot the orbits of the planets.

Visualization of the Orbits of the Planets in the Solar System

To visualize the planetary orbits more easily, we define a plot function. Its argument is the return value of the function get_planet_coord_timeseries() defined above.

def plot_planet_position(dict_planet_coord_timeseries):
    fig = plt.figure(figsize=(8, 8))
    ax = plt.subplot(1, 1, 1, projection='polar')
    for _planet in dict_planet_coord_timeseries.keys():
        # longitude(deg), radius(AU)
        lon = np.deg2rad(dict_planet_coord_timeseries[_planet]['lon'])
        radius = dict_planet_coord_timeseries[_planet]['radius']
        # plot
        plt.plot(lon, radius, label=_planet, linewidth=2)
        plt.scatter(lon[0], radius[0], color='black', s=40)  # initial point
        plt.scatter(lon[-1], radius[-1], color='red', s=40)  # final point
    plt.legend()
    plt.show()
    plt.close(fig)

Then, let’s plot the orbits!

Using the functions defined so far, we can easily draw orbits for given planets and a given period. The information to specify in advance is the period of the target data (start and end) and the planets of the solar system.

First, let’s specify the planets near Earth. Set the period appropriately; the author encourages readers to try changing it in various ways.

start, end = '2022-01-01', '2022-08-01'
planet_list = ['venus', 'earth', 'mars']

timeseries = pd.date_range(start, end, freq='D')
dict_planet_coord_timeseries = get_planet_coord_timeseries(timeseries, planet_list)
plot_planet_position(dict_planet_coord_timeseries)

Note that the graph uses a coordinate system (Stonyhurst heliographic) in which the Earth is fixed in the angular direction. Therefore, the position of the Earth appears to be motionless except in the radial direction. The change of the Earth in the radial direction is due to the fact that the Earth’s orbit is not strictly circular but elliptical.
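As a quick sanity check (a minimal sketch reusing “dict_planet_coord_timeseries” from the code above), the Stonyhurst longitude of the Earth should stay approximately zero at every timestamp:

# Earth's Stonyhurst longitude is ~0 deg by definition of the frame
print(dict_planet_coord_timeseries['earth']['lon'][:5])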

Next, we will include planets far from Earth.

planet_list = ['mercury', 'venus', 'earth', 'mars', 'neptune', 'jupiter', 'uranus']
dict_planet_coord_timeseries = get_planet_coord_timeseries(timeseries, planet_list)

plot_planet_position(dict_planet_coord_timeseries)

The orbits of the planets near the Earth can be seen to be quite densely packed. In this way, the structure of the solar system can be visually confirmed.

Summary

We have seen how to visualize the orbits of specified planets over a specified time series with SunPy. With SunPy, any Python user can check the orbits of the planets of the solar system.

The author hopes this blog helps readers a little.

PyCaret 2.3.6, incredible update; Convert model and Web App

PyCaret is a useful AutoML Python library because we can build and deploy machine learning models with little code. We can also perform preprocessing, compare models, and tune hyperparameters, again with little code.

Recently, PyCaret version 2.3.6 was released. This is big news because several wonderful new functions were implemented in this release! The details are described in this article written by the PyCaret creator.

In this article, we will check the summary of this release and introduce some of the new features.

New Features

  • Dashboard: interactive dashboard for a trained model.
  • EDA: Exploratory Data Analysis
  • Convert Model: converting a trained model from Python into other programming languages, such as C, Java, Go, JavaScript, Visual Basic, C#, PowerShell, R, PHP, Dart, Haskell, Ruby, F#.
  • Fairness
  • Web API based on FastAPI
  • Create Dockerfile and requirements file for an API
  • Web Application based on Gradio
  • Monitor Data Drift
  • Optimize Threshold for classification
  • New Documentation

As the above list of new features shows, PyCaret is evolving dramatically!

In the following, we will introduce some of the new functions using normal regression analysis as an example.

Installation

If you have NOT installed PyCaret yet, you can easily install it with the following command. Note that you should specify the PyCaret version!

$ pip install pycaret==2.3.6

From here on, the sample code in this post is assumed to run on a Jupyter Notebook.

Import Libraries

First, we load all the modules for regression analysis from PyCaret.

from pycaret.regression import *

Dataset

We use the diamond dataset for regression analysis.

# load dataset
from pycaret.datasets import get_data

df = get_data('diamond')

Set up the environment by the “setup()” function

PyCaret needs to initialize its environment with the “setup()” function. Conveniently, PyCaret infers the data types of the variables in the dataset.

The arguments of setup() are the dataset as a Pandas DataFrame, the target-column name, and the “session_id”. The “session_id” acts as a random seed.

s = setup(df, target='Price', session_id = 20220121)

Create a Model

Due to the simplicity of the technique and the interpretability of the model, we adopt “lr” (Linear Regression) as the model used below.

We can create the selected model with create_model() by passing the argument “lr”. The other argument, “fold”, is the number of cross-validation folds. “fold=4” means we split the dataset into four folds and train and validate the model on each split.

lr = create_model("lr", fold=4)

Introduction of New Features of PyCaret 2.3.6

From here, we will introduce two of the new features.

Convert model

With this new function, we can convert a trained model into another language, e.g., from Python to C. This function is very useful when putting the created model into operation.

Note that we need to install a dependency library.

$ pip install m2cgen

Then, we can convert a model.

model_c = convert_model(lr, language='c')
print(model_c)
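If you want to keep the generated source, the returned string can simply be written to a file. This is a minimal sketch, assuming convert_model() returns the code as a plain string; the file name “lr_model.c” is an arbitrary choice.

# Save the generated C source code to a file (hypothetical file name)
with open('lr_model.c', 'w') as f:
    f.write(model_c)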

Create Web App

This new function requires the “gradio” library.

$ pip install gradio

With just one line of code, we can create a web app.

create_app(lr)

Summary

We have seen the new features of PyCaret 2.3.6.

In this article, we saw how to convert the model into another language and how to create a web app.

Just 1 line.

Wouldn’t it be great? If you find it appealing, please give it a try.

The author hopes this blog helps readers a little.

PyCaret 2.3.6, incredible update; Dashboard and EDA functions

PyCaret is a useful AutoML Python library because we can build and deploy machine learning models with little code. We can also perform preprocessing, compare models, and tune hyperparameters, again with little code.

Recently, PyCaret version 2.3.6 was released. This is big news because several wonderful new functions were implemented in this release! The details are described in this article written by the PyCaret creator.

In this article, we will check the summary of this release and introduce some of the new features.

New Features

  • Dashboard: interactive dashboard for a trained model.
  • EDA: Exploratory Data Analysis
  • Convert Model: converting a trained model from Python into other programming languages, such as C, Java, Go, JavaScript, Visual Basic, C#, PowerShell, R, PHP, Dart, Haskell, Ruby, F#.
  • Fairness
  • Web API based on FastAPI
  • Create Dockerfile and requirements file for an API
  • Web Application based on Gradio
  • Monitor Data Drift
  • Optimize Threshold for classification
  • New Documentation

As the above list of new features shows, PyCaret is evolving dramatically!

In the following, we will introduce some of the new functions using normal regression analysis as an example.

Installation

If you have NOT installed PyCaret yet, you can easily install it with the following command. Note that you should specify the PyCaret version!

$ pip install pycaret==2.3.6

From here on, the sample code in this post is assumed to run on a Jupyter Notebook.

Import Libraries

First, we load all the modules for regression analysis from PyCaret.

from pycaret.regression import *

Dataset

We use the diamond dataset for regression analysis.

# load dataset
from pycaret.datasets import get_data

df = get_data('diamond')

Set up the environment by the “setup()” function

PyCaret needs to initialize its environment with the “setup()” function. Conveniently, PyCaret infers the data types of the variables in the dataset.

The arguments of setup() are the dataset as a Pandas DataFrame, the target-column name, and the “session_id”. The “session_id” acts as a random seed.

s = setup(df, target='Price', session_id = 20220121)

Create a Model

Due to the simplicity of the technique and the interpretability of the model, we adopt “lr” (Linear Regression) as the model used below.

We can create the selected model with create_model() by passing the argument “lr”. The other argument, “fold”, is the number of cross-validation folds. “fold=4” means we split the dataset into four folds and train and validate the model on each split.

lr = create_model("lr", fold=4)

Introduction of New Features of PyCaret 2.3.6

From here, we will introduce two of the new features.

Dashboard

With this new function, we can create a dashboard for a trained model.

The dashboard function is implemented with ExplainerDashboard, so we need the “explainerdashboard” library. We can install it with the pip command.

$ pip install explainerdashboard

Then, we can create a dashboard.

dashboard(lr)

Parts of the dashboard screen are introduced in the figure below.

EDA(Exploratory Data Analysis)

This new function requires the “autoviz” library.

$ pip install autoviz

With just one line of code, we can perform the EDA.

eda()

Summary

We have seen the new features of PyCaret 2.3.6.

In this article, we saw the Dashboard and EDA functions.

Just 1 line.

We can create a dashboard and perform EDA for a trained model. Wouldn’t it be great? If you find it appealing, please give it a try.

The author hopes this blog helps readers a little.

3D Scatter by Plotly

3D visualization is practical because it helps us understand relationships between variables in a dataset. For example, during EDA (Exploratory Data Analysis), it is a powerful tool for examining a dataset from various angles.

For visualizing a 3D scatter plot, we use Plotly, a famous open-source Python library. Although Plotly has rich functionality, it is a little difficult for beginners. Compared to the famous Python libraries matplotlib and seaborn, it is often less intuitive. For example, not only do object-oriented types appear in the interfaces, but we also often have to pass arguments as dictionaries.

Therefore, in this post, the basic skills for 3D visualization with Plotly are introduced. We can learn the basic functions required for drawing a graph through a simple example.

The full code is in the GitHub repository.

Import Libraries

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
import plotly.graph_objects as go

In this post, it is confirmed that the code works with the following library versions:

numpy==1.19.5
pandas==1.1.5
scikit-learn==1.0.1
plotly==4.4.1

Prepare a Dataset

In this post, we use the “California Housing dataset” included in scikit-learn.

# Get a dataset instance
dataset = fetch_california_housing()

The “dataset” variable is an instance of the dataset. It stores several kinds of information, i.e., the explanatory-variable values, the target-variable values, the names of the explanatory variables, and the name of the target variable.

We can take and assign them separately as follows.

dataset.data: values of the explanatory variables
dataset.target: values of the target variable (house prices)
dataset.feature_names: the column names

Note that we store the dataset as a pandas DataFrame because it is convenient for manipulating data. The target variable “MedHouseVal” indicates the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).

# Store the dataset as pandas DataFrame
df = pd.DataFrame(dataset.data)
# Assign the explanatory-variable names
df.columns = dataset.feature_names
# Assign the target-variable name
df[dataset.target_names[0]] = dataset.target

df.head()

Variables to be used in each axis

Here, we prepare the variables to be used on each axis: “Latitude” and “Longitude” for the x- and y-axes, and “MedHouseVal”, the target variable, for the z-axis.

xlbl = 'Latitude'
ylbl = 'Longitude'
zlbl = 'MedHouseVal' 

x = df[xlbl]
y = df[ylbl]
z = df[zlbl]

Basic Format for 3D Scatter

To get started, let’s create a 3D scatter figure. We use the “graph_objects” module in Plotly. First, we create a graph instance “fig”. Next, we add a 3D scatter created by “go.Scatter3d()” to “fig” with the “add_traces()” method. Finally, we can visualize it with “fig.show()”.

# import plotly.graph_objects as go

# Create a graph instance
fig = go.Figure()
# Add 3D Scatter to the graph instance
fig.add_traces(go.Scatter3d(
    x=x, y=y, z=z,
))
# Show the figure
fig.show()

However, you can see that the default settings are often inadequate.

Therefore, we will make changes to the following items to create a good-looking graph.

  • Marker size
  • Marker color
  • Plot Style
  • Axis label
  • Figure size
  • Save a figure as a HTML file

Marker size and color

We change the marker size and color. Note that we have to pass the argument as a dictionary.

# Create a graph instance
fig = go.Figure()
# Add 3D Scatter to the graph instance
fig.add_traces(go.Scatter3d(
    x=x, y=y, z=z,
    # marker size and color
    marker=dict(color='red', size=1),
))
# Show the figure
fig.show()

The marker size has been changed from 3 to 1. And, the color has also been changed from blue to red.

Here, by reducing the marker size, we can see that not only points but also lines are drawn. To make a points-only graph, you need to explicitly pass “markers” to the mode argument.

Marker Style

We can easily specify the markers-only style.

# Create a graph instance
fig = go.Figure()
# Add 3D Scatter to the graph instance
fig.add_traces(go.Scatter3d(
    x=x, y=y, z=z,
    # marker size and color
    marker=dict(color='red', size=1),
    # marker style
    mode='markers',
))
# Show the figure
fig.show()

Of course, you can just as easily switch to a line plot: change the “mode” argument from “markers” to “lines”, as shown below.
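Here is a minimal variation of the previous code block with mode='lines'; everything else is unchanged:

# Create a graph instance
fig = go.Figure()
# Add 3D Scatter to the graph instance
fig.add_traces(go.Scatter3d(
    x=x, y=y, z=z,
    # line style instead of markers
    mode='lines',
))
# Show the figure
fig.show()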

Axis Label

Next, we add the label to each axis.

Although we frequently need to set axis labels, Plotly is less intuitive here than matplotlib. Therefore, it is convenient to check how to do it once here.

We use the “fig.update_layout()” method to change the figure layout. As an argument, we pass “scene” as a dictionary, where each axis label is given as the value of the corresponding key.

# Create a graph instance
fig = go.Figure()
# Add 3D Scatter to the graph instance
fig.add_traces(go.Scatter3d(
    x=x, y=y, z=z,
    # marker size and color
    marker=dict(color='red', size=1),
    # marker style
    mode='markers',
))
# Axis Labels
fig.update_layout(
    scene=dict(
        xaxis_title=xlbl,
        yaxis_title=ylbl,
        zaxis_title=zlbl,
        )
    )
# Show the figure
fig.show()

Figure Size

We sometimes need to change the figure size. It can be done in just one line with the “fig.update_layout()” method.

fig.update_layout(height=600, width=600)

Save a figure as HTML format

We can save the created figure with the “fig.write_html()” method.

fig.write_html('3d_scatter.html')

Since the figure is saved as an HTML file, we can view it interactively in a web browser, e.g., Chrome or Firefox.

Cheat Sheet for 3D Scatter

# Create a graph instance
fig = go.Figure()
# Add 3D Scatter to the graph instance
fig.add_traces(go.Scatter3d(
    x=x, y=y, z=z,
    # marker size and color
    marker=dict(color='red', size=1),
    # marker style
    mode='markers',
))
# Axis Labels
fig.update_layout(
    scene=dict(
        xaxis_title=xlbl,
        yaxis_title=ylbl,
        zaxis_title=zlbl,
        )
    )
# Figure size
fig.update_layout(height=600, width=600)
# Save the figure
fig.write_html('3d_scatter.html')
# Show the figure
fig.show()

Summary

We have seen how to create a 3D scatter graph. With Plotly, we can create it easily.

The author believes that the code examples in this post make it easy for readers to understand and implement a 3D scatter graph.

The author hopes this blog helps readers a little.

Tuning Hyperparameters by Grid Search in Decision Tree Regression

The decision tree is an important method in machine learning because famous algorithms, such as Random Forest and Gradient Boosting Decision Trees (GBDT), are based on it.

In the previous post, we saw a regression analysis with the decision tree method on the Boston house prices dataset.

In this post, to improve the accuracy of the model, we will tune the hyperparameters of the model by Grid search.

The full code is in the GitHub repository.

Grid Search

Grid search is a method that explores all possible combinations of hyperparameter values.

For example, consider two variables, $x_1$ and $x_2$, where $x_1 = [1, 2, 3]$ and $x_2 = [4, 5, 6]$. In this case, all possible combinations are as follows.

$$[x_1, x_2] = [1, 4], [1, 5], [1, 6], [2, 4], [2, 5], [2, 6], [3, 4], [3, 5], [3, 6]$$

Therefore, the computational cost increases in proportion to the number of combinations, which is the product of the numbers of levels of all variables.

As you can imagine, the disadvantage of grid search is that the computational cost grows rapidly as the number of variables and their levels becomes larger.
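To illustrate what grid search actually enumerates, here is a minimal sketch in Python using itertools.product with the $x_1$ and $x_2$ values from the example above:

from itertools import product

x1 = [1, 2, 3]
x2 = [4, 5, 6]
# Enumerate all 3 x 3 = 9 combinations
for combination in product(x1, x2):
    print(combination)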

Baseline of Analysis without tuning the Hyperparameters

First, we import the necessary libraries. And set the random seed.

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import matplotlib.pylab as plt
import seaborn as sns
sns.set()

random_state = 20211006

In this post, we use the Boston house prices dataset included in the scikit-learn library. We can easily load the dataset with just the line below.

dataset = load_boston()

The details of the Boston house prices dataset, including an exploratory data analysis, are introduced in another post.

Read the Dataset as Pandas DataFrame

It is convenient to handle the data as a pandas DataFrame, which makes it easy to manipulate table data.

# explanatory data
df = pd.DataFrame(dataset.data)
df.columns = dataset.feature_names

# target data
df["PRICES"] = dataset.target

print(df.head())

Variables to be used

Here, we prepare the list of variable names. A description of each variable is also given in the comments.

TargetName = "PRICES"
FeaturesName = [
              #-- "Crime occurrence rate per unit population by town"
              "CRIM",
              #-- "Percentage of 25000-squared-feet-area house"
              'ZN',
              #-- "Percentage of non-retail land area by town"
              'INDUS',
              #-- "Index for Charlse river: 0 is near, 1 is far"
              'CHAS',
              #-- "Nitrogen compound concentration"
              'NOX',
              #-- "Average number of rooms per residence"
              'RM',
              #-- "Percentage of buildings built before 1940"
              'AGE',
              #-- 'Weighted distance from five employment centers'
              "DIS",
              ##-- "Index for easy access to highway"
              'RAD',
              ##-- "Tax rate per $100,000"
              'TAX',
              ##-- "Percentage of students and teachers in each town"
              'PTRATIO',
              ##-- "1000(Bk - 0.63)^2, where Bk is the percentage of Black people"
              'B',
              ##-- "Percentage of low-class population"
              'LSTAT',
              ]

We prepare the explanatory and target variables as “X” and “y”.

X = df[FeaturesName]
y = df[TargetName]

No need to perform standardization

We don’t need to standardize or normalize the numerical variables in a decision tree analysis. This is because a decision tree splits the cases based only on the magnitude relationships of the values. Therefore, differences in the scales of the variables do NOT affect the final result.

Split the Dataset

To validate the performance of the trained model against unseen data, we have to split the dataset into the train data and the test data.

We pass the dataset “(X, y)” to the “train_test_split()” function. The ratio of the train data to the test data is defined by the argument “test_size”. Here, the ratio is set to 8:2. And “random_state” is set for reproducibility; you can use any number.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)

Create a model instance and Train the model

We create a decision tree instance as “regressor”, and pass the training dataset to it.

regressor = DecisionTreeRegressor(random_state=random_state)
regressor.fit(X_train, y_train)

Evaluate the accuracy of the model

To validate the performance of the model, first, we predict the training and validation data.

y_pred_train = regressor.predict(X_train)
y_pred_test = regressor.predict(X_test)

As an indicator of the accuracy of the model, we use the $R^{2}$ score, which is an index of how well the model fits the dataset. The value typically ranges from 0 to 1: when the value is close to $1$, the model accuracy is good; conversely, when $R^{2}$ approaches $0$, the model accuracy is poor.
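For reference, $R^{2}$ (the coefficient of determination) is defined as follows, where $y_{i}$ are the observed values, $\hat{y}_{i}$ the predicted values, and $\bar{y}$ the mean of the observed values.

$$R^{2} = 1 - \frac{\sum_{i}(y_{i} - \hat{y}_{i})^{2}}{\sum_{i}(y_{i} - \bar{y})^{2}}$$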

We can calculate $R^{2}$ by the “r2_score()” function in scikit-learn.

R2 = r2_score(y_test, y_pred_test)
print("R2 value: {:.2}".format(R2))

>>  R2 value: 0.84

The score $0.84$ is not bad.

Besides, visualizing the result is a good choice. The red and blue circles show the results of the training and validation data, respectively.

plt.figure(figsize=(4, 4), dpi=100)
plt.xlabel("PRICES")
plt.ylabel("Predicted PRICES")
plt.xlim(0, 60)
plt.ylim(0, 60)
plt.scatter(y_train, y_pred_train, lw=1, color="r", label="train data")
plt.scatter(y_test, y_pred_test, lw=1, color="b", label="test data")
plt.legend()
plt.show()

Regarding the above result, what we should pay attention to is the difference in accuracy between the train and test data.

Tune the hyperparameters by Grid Search

Then, to improve the accuracy of the model, we tune the hyperparameters of the model by a grid search method.

Before tuning, let’s confirm the hyperparameters of the model.

# Confirm the hyper parameters
print(regressor.get_params)

>>  <bound method BaseEstimator.get_params of DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
>>                        max_features=None, max_leaf_nodes=None,
>>                        min_impurity_decrease=0.0, min_impurity_split=None,
>>                        min_samples_leaf=1, min_samples_split=2,
>>                        min_weight_fraction_leaf=0.0, presort='deprecated',
>>                        random_state=20211006, splitter='best')>

In this post, we tune one of the most important parameters, “max_depth”. With the default setting “None”, the decision tree can branch without restrictions. This setting makes it overly easy to adapt to the training data, i.e., to overfit.

Therefore, we will try to find the optimal value by searching “max_depth” in the range of 1 to 9.

Define the argument name and search range as a dictionary.

# Prepare hyperparameter candidates
max_depth = np.arange(1, 10)
params = {'max_depth':max_depth}

Next, we define a grid-search instance, to which we pass the decision-tree-model instance and the above dictionary. Note that “cv” and “scoring” indicate the number of cross-validation folds and the evaluation metric, respectively.

from sklearn.model_selection import GridSearchCV

# Define a Grid Search as gs
gs = GridSearchCV(regressor, params, cv=5, scoring='neg_mean_squared_error', return_train_score=True)

Now, we are ready to go. Execute the grid search!

# Execute a grid search
gs.fit(X, y)

After it finishes, we confirm the evaluation of the grid search by checking the mean train and test scores for each candidate value.

# Evaluate the score
score_train = gs.cv_results_['mean_train_score']
score_test = gs.cv_results_['mean_test_score']
print(f"train_score: {score_train}")
print(f"test_score: {score_test}")

plt.plot(max_depth, score_train, label="train")
plt.plot(max_depth, score_test, label="test")

plt.title('train_score vs test_score')
plt.xlabel('max_depth')
plt.ylabel('Mean squared error')

plt.legend()
plt.show()

The error for the training data decreases as max_depth increases, while the error for the validation data shows no significant improvement once max_depth reaches about 4.

We can easily check the best parameter, i.e., optimized “max_depth”.

# Best parameters
print(gs.best_params_ )

>>  {'max_depth': 7}

The result suggests that the “max_depth” of 7 is the best.

In addition, we can also get the best model, i.e., the model with a “max_depth” of 7.

# Best-parameter model
regressor_best = gs.best_estimator_

Here, let’s evaluate the $R^{2}$ value again. We will see that the model has been improved.

y_pred_train = regressor_best.predict(X_train)
y_pred_test = regressor_best.predict(X_test)

R2 = r2_score(y_test, y_pred_test)
print("R2 value: {:.2}".format(R2))

>>  R2 value: 0.95

It is obvious if you visualize the result.
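For reference, the same scatter plot as before can be drawn with the predictions of the tuned model, reusing the plotting code from above:

plt.figure(figsize=(4, 4), dpi=100)
plt.xlabel("PRICES")
plt.ylabel("Predicted PRICES")
plt.xlim(0, 60)
plt.ylim(0, 60)
plt.scatter(y_train, y_pred_train, lw=1, color="r", label="train data")
plt.scatter(y_test, y_pred_test, lw=1, color="b", label="test data")
plt.legend()
plt.show()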

By adjusting the hyperparameters, the accuracy for the training data is reduced, but the accuracy for the validation data is improved.

In other words, the model was overfitting the training data, but by setting appropriate hyperparameters, its generalization performance was improved.

Summary

We have seen how to tune the hyperparameters of a decision tree model. In this post, we adopted the grid search method, which is easy to understand and implement.

This time we tried with only one hyperparameter, but the case of multiple hyperparameters can be handled in the same way. We have defined the hyperparameters as a dictionary, so we just need to add more entries there.

The author hopes this blog helps readers a little.

Web app for PCA by Streamlit

With Streamlit, we can easily deploy our Python script as a web app.

This post shows how to deploy our principal-component-analysis (PCA) code as a web app. In a web-app format, we can try it interactively. The original PCA analysis in this post is introduced in another post.

Web App. Demo

The full code is available in my GitHub repo.

From the repo, we can easily prepare an environment with Docker and try Streamlit.

It is easy to install streamlit by pip, just like any other Python module.

$ pip install streamlit

Run the web app

The web app will open in the web browser after running the following command.

$ streamlit run main.py

Appendix: Dockerfile

If you use docker, you can use the Dockerfile described below. You can try the code in this post immediately.

FROM python:3.9

WORKDIR /opt
RUN pip install --upgrade pip
RUN pip install numpy==1.21.0 \
				pandas==1.3.0 \
				scikit-learn==0.24.2 \
				plotly==5.1.0 \
				matplotlib==3.4.2 \
				seaborn==0.11.1 \
				streamlit==0.84.1

WORKDIR /work

You can build a Docker image from the Dockerfile and run a container from it.

Execute the following commands.

$ docker run -it -p 8888:8888 -v ~/(your work directory):/work <Image ID> bash
$ streamlit run main.py --server.port 8888

Note that “-p 8888:8888” is an instruction to connect the host (your local PC) with the Docker container. The first and second 8888 indicate the host’s and the container’s port numbers, respectively.

Besides, by running “streamlit run” with the option “--server.port 8888”, we can access the web app from a web browser at “localhost:8888”.

Please refer to the following post for details on how to execute your Python and Streamlit scripts in a Docker container.

Import libraries

import streamlit as st
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler 
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
import plotly.graph_objects as go

Title

You can create the title quickly by ‘st.title()’.

‘st.title()’: creates a title box

# Title
st.title('PCA applied to Wine dataset')

Load and Show the dataset

First, we load the dataset by ‘load_wine()’ and store it as a pandas DataFrame with ‘pd.DataFrame’.

Second, we assign the columns of the dataset and the target variable. The columns of the dataset are stored in ‘dataset.feature_names’. Similarly, the target variable is also stored in ‘dataset.target’.

Third, we show the dataset in a table format if the checkbox is checked. The checkbox is created by ‘st.checkbox()’, and the table data is shown by ‘st.dataframe()’.

‘st.checkbox()’: creates a check box, which returns True when checked.
‘st.dataframe()’: displays the data frame passed as the argument.

# load wine dataset
dataset = load_wine()
# Prepare explanatory variable as DataFrame in pandas
df = pd.DataFrame(dataset.data)
# Assign the names of explanatory variables
df.columns = dataset.feature_names
# Add the target variable (wine class),
# where its column name is set to "target".
df["target"] = dataset.target

# Show the table data when checkbox is ON.
if st.checkbox('Show the dataset as table data'):
    st.dataframe(df)

NOTE: Markdown

Here, we should note ‘Markdown’, since we will use the markdown format in the following descriptions.

The markdown style is useful! For example, the following triple-quoted string is rendered in the web app as shown below. With the markdown style, we can easily display such statements.

"""
# Markdown 1
## Markdown 2
### Markdown 3
"""

Standardization

Before performing PCA, we have to standardize the explanatory variables.

# Prepare the explanatory and target variables
x = df.drop(columns=['target'])
y = df['target']

# Standardization
scaler = StandardScaler()
x_std = scaler.fit_transform(x)

You can refer to the details of standardization in the following post.

PCA Section

In this section, we perform the PCA and deploy it on streamlit.

We will create a UI that interactively selects the number of principal components to be considered in the analysis. Then, we select which principal components to plot.

Note that we create these widgets in the sidebar. This is easy with ‘st.sidebar’.

Number of principal components

First, we create a widget that interactively selects the number of principal components to be considered in the analysis.

With “st.sidebar”, the panels are created in the sidebar.

And,

‘st.number_input()’: creates a panel where we select a number.

# Number of principal components
st.sidebar.markdown(
    r"""
    ### Select the number of principal components to include in the result
    Note: The number is a nonnegative integer.
    """
    )
num_pca = st.sidebar.number_input(
                            'The minimum value is an integer of 3 or more.',
                            value=3, # default value
                            step=1,
                            min_value=3)

Note that the part created by the above code is the red frame part in the figure below.

Perform the PCA

It is easy to perform PCA with the “sklearn.decomposition.PCA” class in scikit-learn. We pass the number of principal components to the argument “n_components”.

# Perform PCA
# from sklearn.decomposition import PCA
pca = PCA(n_components=num_pca)
x_pca = pca.fit_transform(x_std)

Select the principal-component index to plot

We create panels in the sidebar to select the principal-component indices to plot. These panels can be created by “st.sidebar.selectbox()”.

Note that we create the description as markdown in the sidebar using “st.sidebar”.

st.sidebar.markdown(
    r"""
    ### Select the principal components to plot
    ex. Choose '1' for PCA 1
    """
    )
# Index of PCA, e.g. 1 for PCA 1, 2 for PCA 2, etc..
idx_x_pca = st.sidebar.selectbox("x axis is the principal component of ", np.arange(1, num_pca+1), 0)
idx_y_pca = st.sidebar.selectbox("y axis is the principal component of ", np.arange(1, num_pca+1), 1)
idx_z_pca = st.sidebar.selectbox("z axis is the principal component of ", np.arange(1, num_pca+1), 2)

3D Plot by Plotly

Finally, let’s visualize the PCA result as a 3D plot. We can confirm the result interactively, e.g., zoom and scroll!

First, we prepare the label of each axis and the data to plot. We have already selected the principal-component index to plot.

# Axis label
x_lbl, y_lbl, z_lbl = f"PCA {idx_x_pca}", f"PCA {idx_y_pca}", f"PCA {idx_z_pca}"
# data to plot
x_plot, y_plot, z_plot = x_pca[:,idx_x_pca-1], x_pca[:,idx_y_pca-1], x_pca[:,idx_z_pca-1]

Second, we visualize the result as a 3D plot with Plotly. To display the result on the Streamlit dashboard, we use the ‘st.plotly_chart()’ function, where the first argument is the graph instance created by Plotly.

# Create an object for 3d scatter
trace1 = go.Scatter3d(
    x=x_plot, y=y_plot, z=z_plot,
    mode='markers',
    marker=dict(
        size=5,
        color=y,
        # colorscale='Viridis'
        )
)
# Create an object for graph layout
fig = go.Figure(data=[trace1])
fig.update_layout(scene = dict(
                    xaxis_title = x_lbl,
                    yaxis_title = y_lbl,
                    zaxis_title = z_lbl),
                    width=700,
                    margin=dict(r=20, b=10, l=10, t=10),
                    )
"""### 3d plot of the PCA result by plotly"""
# Plot on the dashboard on streamlit
st.plotly_chart(fig, use_container_width=True)

Since the graph is rendered as an interactive Plotly chart, we can explore it in the browser (zoom, rotate, and so on).

Summary

In this post, we have seen how to deploy PCA as a web app. I hope you experienced how easy it is to implement.

The author hopes you take advantage of Streamlit.

News: The Book has been published

A new tutorial book on Streamlit has been published on Amazon Kindle. It is registered in Kindle Unlimited, so any member can read it.

Visualising the Wine Classification Dataset by PCA

Visualizing a dataset is very important for understanding it.

However, the larger the number of explanatory variables, the more difficult it is to create a visualization that reflects the characteristics of the dataset. In the case of classification problems, it would be ideal to be able to classify the dataset with a small number of variables.

Principal Component Analysis (PCA) is one of the practical methods for visualizing a high-dimensional dataset. This is because PCA is a technique for reducing the dimensionality of a dataset, i.e., aggregating its information.

In this post, we will see how PCA can help you aggregate information and visualize the dataset. We use the wine classification dataset, one of the famous open datasets. We can easily use this dataset because it is already included in scikit-learn.

In the previous blog post, exploratory data analysis (EDA) of the wine classification dataset was introduced, so you can check the details of this dataset there.

https://machine-learning.tokyo/brief-eda-for-wine-classification-dataset/

Import Library

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler 
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
import plotly.graph_objects as go

Load the dataset

dataset = load_wine()

Confirm the content of the dataset

The contents of the dataset are stored in the variable “dataset”. In this variable, several kinds of information are stored, i.e., the target-variable name and values, the explanatory-variable names and values, and the description of the dataset. Therefore, we have to extract each of them separately.

dataset.target_names: the class labels of the target variable
dataset.target: values of the target variable (class label)
dataset.feature_names: the explanatory-variable names
dataset.data: the explanatory-variable values

We can get the class labels and the unique values of the target variable.

"""target-variable name"""
print(dataset.target_names)
"""target-variable values"""
print(np.unique(dataset.target))


>>  ['class_0' 'class_1' 'class_2']
>>  [0 1 2]

There are three classes in the target variable (‘class_0’, ‘class_1’, ‘class_2’). These classes correspond to [0, 1, 2].

In other words, the problem is classifying wine into three categories from the explanatory variables.

Prepare the dataset as DataFrame in pandas

For convenience, we convert the dataset into the Pandas DataFrame type. With the DataFrame type, we can easily manipulate the table-type dataset and perform the preprocessing.

Here, let’s put all the data together into one Pandas DataFrame “df”.

(NOTE) df is an abbreviation for data frame.

"""Prepare explanatory variable as DataFrame in pandas"""
df = pd.DataFrame(dataset.data)
df.columns = dataset.feature_names
"""Add the target variable to df"""
df["target"] = dataset.target
print(df.head())


>>     alcohol  malic_acid   ash  ...  od280/od315_of_diluted_wines  proline  target
>>  0    14.23        1.71  2.43  ...                          3.92   1065.0       0
>>  1    13.20        1.78  2.14  ...                          3.40   1050.0       0
>>  2    13.16        2.36  2.67  ...                          3.17   1185.0       0
>>  3    14.37        1.95  2.50  ...                          3.45   1480.0       0
>>  4    13.24        2.59  2.87  ...                          2.93    735.0       0
>>  
>>  [5 rows x 14 columns]

In this dataset, there are 13 explanatory variables. Therefore, to visualize the dataset, we have to reduce its dimensionality with PCA.

Prepare the Explanatory variables and the Target variable

First, we prepare the explanatory variables and the target variable, separately.

"""Prepare the explanatory and target variables"""
x = df.drop(columns=['target'])
y = df['target']

Standardize the Variables

Before performing PCA, we should standardize the numerical variables because the scales of variables are different. We can perform it easily by scikit-learn as follows.

"""Standardization"""
sc = StandardScaler()
x_std = sc.fit_transform(x)

The details of standardization are described in another post. If you are unfamiliar with standardization, refer to the following post.

PCA

Here, let’s perform the PCA analysis. It is easy to perform it using scikit-learn.

"""PCA: principal component analysis"""
# from sklearn.decomposition import PCA

pca = PCA(n_components=3)
x_pca = pca.fit_transform(x_std)

PCA can be done in just two lines.

The first line creates an instance to execute PCA. The argument “n_components” represents the number of principal components held by the instance. If “n_components = 3”, the instance holds the first to third principal components.

The second line executes PCA on the explanatory variables with the instance created in the first line. The return value is the data transformed into the principal components, and in this case, it contains three components.

Just in case, let’s check the shape of the obtained “x_pca”. You can see that there are 178 samples and 3 components.

print(x_pca.shape)

>>  (178, 3)
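As a supplementary check (a minimal sketch reusing the “pca” instance above), the explained variance ratio tells us how much of the total variance the three principal components retain:

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
# Total fraction captured by the first three components
print(pca.explained_variance_ratio_.sum())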

Visualize the dataset

Finally, we visualize the dataset. We have already obtained the 3 principal components, so it is a good choice to create a 3D scatter plot. To create it, we use Plotly, one of the famous Python libraries.

# import plotly.graph_objects as go

"""axis-label name"""
x_lbl, y_lbl, z_lbl = 'PCA 1', 'PCA 2', 'PCA 3'
"""data at eact axis to plot"""
x_plot, y_plot, z_plot = x_pca[:,0], x_pca[:,1], x_pca[:,2]

"""Create an object for 3d scatter"""
trace1 = go.Scatter3d(
    x=x_plot, y=y_plot, z=z_plot,
    mode='markers',
    marker=dict(
        size=5,
        color=y, # distinguish the class by color
        )
)
"""Create an object for graph layout"""
fig = go.Figure(data=[trace1])
fig.update_layout(scene = dict(
                    xaxis_title = x_lbl,
                    yaxis_title = y_lbl,
                    zaxis_title = z_lbl),
                    width=700,
                    margin=dict(r=20, b=10, l=10, t=10),
                    )
fig.show()

The colors correspond to the classification classes. It can be seen from the graph that the classes can be roughly separated using the information-aggregated principal components.

If it is still difficult to separate the classes after applying principal component analysis, the dataset may lack important features. Therefore, even if it is fed into a classification model, there is a high possibility that the accuracy will be insufficient. In this way, PCA helps us examine the dataset through visualization.

Summary

We have seen how to perform PCA and visualize its results. One of the reasons to perform PCA is to assess the complexity of the dataset. When the PCA results are insufficient for classification, it is recommended to perform feature engineering.

The author hopes this blog helps readers a little.

You may also be interested in:

Brief EDA for Wine Classification Dataset
Standardization by scikit-learn in Python

Brief EDA for Wine Classification Dataset

Exploratory data analysis (EDA) is one of the most important processes in data analysis. To construct a machine-learning model adequately, understanding the dataset is essential; without appropriate EDA, there is no success. After the EDA, you will be able to select models and perform feature engineering effectively.

In this post, we use the wine classification dataset, one of the famous open datasets. We can easily use this dataset because it is already included in scikit-learn.

It’s very important to get used to working with open datasets. This is because, through an open dataset, we can quickly try out a new method we have learned.

In a previous blog post, an open dataset for regression analyses was introduced. This time, the author introduces an open dataset that can be used for classification problems.

Import Library

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine

Load the dataset

dataset = load_wine()

We can confirm the details of the dataset with the DESCR attribute.

print(dataset.DESCR)


>> .. _wine_dataset:
>> 
>> Wine recognition dataset
>> ------------------------
>> 
>> **Data Set Characteristics:**
>> 
>>     :Number of Instances: 178 (50 in each of three classes)
>>     :Number of Attributes: 13 numeric, predictive attributes and the class
>>     :Attribute Information:
>>  		- Alcohol
>>  		- Malic acid
>>  		- Ash
>> 		- Alcalinity of ash  
>>  		- Magnesium
>> 		- Total phenols
>>  		- Flavanoids
>>  		- Nonflavanoid phenols
>>  		- Proanthocyanins
>> 		- Color intensity
>>  		- Hue
>>  		- OD280/OD315 of diluted wines
>>  		- Proline
>> 
>>      - class:
>>             - class_0
>>             - class_1
>>             - class_2
>> 		
>>     :Summary Statistics:
>> 
>>     .
>>     .
>>     .

Confirm the content of the dataset

The contents of the dataset are stored in the variable “dataset”. In this variable, several kinds of information are stored, i.e., the target-variable name and values, the explanatory-variable names and values, and the description of the dataset. Therefore, we have to extract each of them separately.

dataset.target_names: the class labels of the target variable
dataset.target: values of the target variable (class label)
dataset.feature_names: the explanatory-variable names
dataset.data: the explanatory-variable values

We can get the class labels and the values of the target variable.

"""target-variable name"""
print(dataset.target_names)
"""target-variable values"""
print(dataset.target)


>>  ['class_0' 'class_1' 'class_2']
>>  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>>   1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>>   1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
>>   2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

There are three classes in the target variable (‘class_0’, ‘class_1’, ‘class_2’). In other words, this is a problem of classifying wine into three categories from the explanatory variables.

Next, let’s take the explanatory variables.

"""explanatory-variable name"""
print(dataset.feature_names)
"""explanatory-variable values"""
print(dataset.data)


>>  ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
>>  [[1.423e+01 1.710e+00 2.430e+00 ... 1.040e+00 3.920e+00 1.065e+03]
>>   [1.320e+01 1.780e+00 2.140e+00 ... 1.050e+00 3.400e+00 1.050e+03]
>>   [1.316e+01 2.360e+00 2.670e+00 ... 1.030e+00 3.170e+00 1.185e+03]
>>   ...
>>   [1.327e+01 4.280e+00 2.260e+00 ... 5.900e-01 1.560e+00 8.350e+02]
>>   [1.317e+01 2.590e+00 2.370e+00 ... 6.000e-01 1.620e+00 8.400e+02]
>>   [1.413e+01 4.100e+00 2.740e+00 ... 6.100e-01 1.600e+00 5.600e+02]]

You can see that there are multiple kinds of explanatory variables. A description of each variable can be found in “dataset.DESCR”.

Convert the dataset into DataFrame in pandas

For convenience, we convert the dataset into the Pandas DataFrame type. With the DataFrame type, we can easily manipulate the table-type dataset and perform the preprocessing.

Here, let’s put all the data together into one Pandas DataFrame “df”.

(NOTE) df is an abbreviation for data frame.

"""Prepare explanatory variable as DataFrame in pandas"""
df = pd.DataFrame(dataset.data)
df.columns = dataset.feature_names
"""Add the target variable to df"""
df["target"] = dataset.target
print(df.head())


>>     alcohol  malic_acid   ash  ...  od280/od315_of_diluted_wines  proline  target
>>  0    14.23        1.71  2.43  ...                          3.92   1065.0       0
>>  1    13.20        1.78  2.14  ...                          3.40   1050.0       0
>>  2    13.16        2.36  2.67  ...                          3.17   1185.0       0
>>  3    14.37        1.95  2.50  ...                          3.45   1480.0       0
>>  4    13.24        2.59  2.87  ...                          2.93    735.0       0
>>  
>>  [5 rows x 14 columns]

From here, we perform the EDA and understand the dataset!

Summary information

First, we should look at the entire dataset. Going from the whole to the details, this order is important.

First, let’s confirm the data type of each explanatory variable.

We can easily confirm it by the .info() method in pandas.

print(df.info())


>>  <class 'pandas.core.frame.DataFrame'>
>>  RangeIndex: 178 entries, 0 to 177
>>  Data columns (total 14 columns):
>>   #   Column                        Non-Null Count  Dtype  
>>  ---  ------                        --------------  -----  
>>   0   alcohol                       178 non-null    float64
>>   1   malic_acid                    178 non-null    float64
>>   2   ash                           178 non-null    float64
>>   3   alcalinity_of_ash             178 non-null    float64
>>   4   magnesium                     178 non-null    float64
>>   5   total_phenols                 178 non-null    float64
>>   6   flavanoids                    178 non-null    float64
>>   7   nonflavanoid_phenols          178 non-null    float64
>>   8   proanthocyanins               178 non-null    float64
>>   9   color_intensity               178 non-null    float64
>>   10  hue                           178 non-null    float64
>>   11  od280/od315_of_diluted_wines  178 non-null    float64
>>   12  proline                       178 non-null    float64
>>   13  target                        178 non-null    int64  
>>  dtypes: float64(13), int64(1)
>>  memory usage: 19.6 KB
>>  None

While the data type of the target variable is int type, the explanatory variables are all float-type numeric variables.

The fact that the target variable is of type int is valid because it is a classification problem.

Here, we know that there is no need to perform preprocessing for categorical variables because all explanatory variables are numeric.

Note that continuous numeric variables require scaling preprocessing. Please refer to the following post for details.

Missing Values

Here, we check how many missing values there are. We can check this with the combination of the “isnull()” and “sum()” methods in pandas.

print(df.isnull().sum())


>>  alcohol                         0
>>  malic_acid                      0
>>  ash                             0
>>  alcalinity_of_ash               0
>>  magnesium                       0
>>  total_phenols                   0
>>  flavanoids                      0
>>  nonflavanoid_phenols            0
>>  proanthocyanins                 0
>>  color_intensity                 0
>>  hue                             0
>>  od280/od315_of_diluted_wines    0
>>  proline                         0
>>  target                          0
>>  dtype: int64

Fortunately, there are no missing values! This is because this dataset was created carefully. Note, however, that there are usually many such problems we have to deal with in a real dataset.

Confirm the basic Descriptive Statistics values

We can calculate the basic descriptive statistics with just one line!

print(df.describe())


>>            alcohol  malic_acid  ...      proline      target
>>  count  178.000000  178.000000  ...   178.000000  178.000000
>>  mean    13.000618    2.336348  ...   746.893258    0.938202
>>  std      0.811827    1.117146  ...   314.907474    0.775035
>>  min     11.030000    0.740000  ...   278.000000    0.000000
>>  25%     12.362500    1.602500  ...   500.500000    0.000000
>>  50%     13.050000    1.865000  ...   673.500000    1.000000
>>  75%     13.677500    3.082500  ...   985.000000    2.000000
>>  max     14.830000    5.800000  ...  1680.000000    2.000000
>>  
>>  [8 rows x 14 columns]

In particular, it is worth focusing on “mean” and “std” first.

We can learn the average from “mean”, which makes it possible to judge whether a value is relatively high or low. This sense is important for a data scientist.

Next, “std” represents the standard deviation, which is an indicator of how much the data is scattered around the “mean”. For example, “std” will be small if most values are close to the mean.

It should be noted that the variance equals the square of the standard deviation, and the word “variance” may be more common for a data scientist. It is no exaggeration to say that the information in a dataset is contained in the variance. In other words, we cannot get any information if all values are the same. Therefore, it is okay to delete variables with zero variance (see the sketch below).
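For example, here is a minimal sketch (assuming the DataFrame “df” defined above) of how to find and drop columns with zero variance:

# Columns whose variance is zero carry no information
zero_var_cols = [col for col in df.columns if df[col].var() == 0]
print(zero_var_cols)
# Drop them (for this dataset, the list is expected to be empty)
df = df.drop(columns=zero_var_cols)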

Histogram Distribution

Data with variance is worth paying attention to. So let’s actually visualize the distribution of the data.

Seeing is believing!

We can plot histograms with “plt.hist()” in “matplotlib”, a famous library for visualization. The argument “bins” controls the fineness of the plot.

for name in df.columns:
    plt.title(name)
    plt.hist(df[name], bins=50)
    plt.show()

The distribution of the target variable is as follows. You can see that each category has almost the same amount of data. If the amount of data is biased toward certain categories, you should pay attention to the decrease in accuracy due to the imbalance.
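For a quick numerical check of the class balance, a pandas one-liner is enough (reusing the “df” defined above):

# Number of samples per class in the target variable
print(df["target"].value_counts())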

The distributions of the explanatory variables are below. We can see the difference in variance between the explanatory variables.

Summary

We have seen how to perform EDA briefly. The purpose of EDA is to properly identify the nature of the dataset. Proper EDA makes it possible to approach the next steps effectively, e.g., feature engineering and choosing modeling methods.

In the case of classification problems, principal component analysis can be considered as a deeper analysis method. I will introduce it in another post.

Recursive function for a Python beginner

A recursive function may be unfamiliar, especially to a beginner. Because the concept is abstract, we often don’t notice opportunities to use it.

In this post, a brief description and an example of a recursive function are introduced. Through this example, readers may realize that it is helpful for keeping your code simple.

What is a Recursive function?

A recursive function is a function that calls itself.

For example, the following code is a minimal example.

It should be noted that, as a sensible reader may notice, running this function leads to infinite recursion; in practice, Python stops it with a RecursionError once the recursion limit is exceeded.

def func():
  print("This is a recursive function!")
  """Call itself!!"""
  func()

func()  # calling it starts the (endless) recursion

When to use it?

A function is a converter that performs one process. Therefore, a recursive function is practical when you create a function that repeats the same processing multiple times.

Replacing similar processing with a recursive function leads to code simplification. Let’s experience it with the following simple example.

Example: factorial calculation

Here, one example is introduced. We will create a function for factorial calculation, which is defined as follows.

$$N!=N\times(N-1)\times…\times2\times1$$

Without Recursive function

First, let’s implement without a recursive function. The code is below.

"""factorial calculation without recursive function"""
def factorial_nonrecursive(x):
  result = 1
  """N!=1*2*...*(N-1)*N"""
  for i in range(1, x+1):
    result = i*result
  return result


"""ex. 4! = 1*2*3*4 = 24"""
result = factorial_nonrecursive(4)
print(f"factorial: {result}")


>> factorial: 24

It may seem complicated at first glance, but the above code just multiplies the numbers in order from 1 using a for loop.

With Recursive function

Next, we perform refactoring of the above code with a recursive function. The code after refactoring is as follows.

"""factorial calculation with recursive function"""
def factorial_recursive(x):
  if x == 1:
    return 1
  else:
    return x*factorial_recursive(x-1)


"""ex. 4! = 1*2*3*4 = 24"""
result = factorial_recursive(4)
print(f"factorial: {result}")


>> factorial: 24

Did you notice how simple the code has become?

Since the code is simple, it is easy to see that “factorial_recursive(x)” multiplies the argument “x” by the result of calling itself with “x-1”.
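For instance, the call “factorial_recursive(4)” unfolds as follows (written as comments, for illustration only).

# factorial_recursive(4)
# = 4 * factorial_recursive(3)
# = 4 * 3 * factorial_recursive(2)
# = 4 * 3 * 2 * factorial_recursive(1)
# = 4 * 3 * 2 * 1
# = 24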

Appendix: about Computational Cost

Here, we check the computational cost. We compare the calculation times of the two functions we created above.

We import the necessary modules.

from time import time  # calculate time
import sys
sys.setrecursionlimit(50000) # Set the upper limit of the number of recursion
import numpy as np
from matplotlib import pyplot as plt

With the following code, we measure the time it takes to calculate $N!$ for each $N$ from $1$ to $10000$.

The first is the case with a recursive function.

N = 10000
cal_time_recursive = []
for i in range(1, N+1):
  begin_time = time()
  result = factorial_recursive(i)
  end_time = time()
  cal_time_recursive.append(end_time - begin_time)

The second is the case without a recursive function.

N = 10000
cal_time_nonrecursive = [] # calculation time
for i in range(1, N+1):
  begin_time = time()
  result = factorial_nonrecursive(i)
  end_time = time()
  cal_time_nonrecursive.append(end_time - begin_time)

Finally, let’s check the graph of the calculation time of $N!$ against $N$. The red and blue lines indicate the recursive and non-recursive cases, respectively.
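A minimal plotting sketch, assuming the lists “cal_time_recursive” and “cal_time_nonrecursive” computed above, could look like the following.

# a minimal sketch of the comparison plot (red: recursive, blue: non-recursive)
n_values = np.arange(1, N+1)
plt.plot(n_values, cal_time_recursive, color="r", label="recursive")
plt.plot(n_values, cal_time_nonrecursive, color="b", label="non-recursive")
plt.xlabel("N")
plt.ylabel("calculation time [s]")
plt.legend()
plt.show()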

You can see that the calculation time tends to be longer overall with recursion. This is likely due to the overhead of the function repeatedly calling itself. As you can see, recursive functions can simplify your code, but in some cases they can take longer than an ordinary loop.

Summary

A recursive function is a function that calls itself. Although it may be unfamiliar to a Python beginner, it is helpful for keeping your code simple.

If you come across a situation where you can use one, please give it a try.

Neural network approach for Boston House Prices dataset by PyTorch

Machine learning approaches such as decision-tree-based methods and linear regression have already been introduced in other posts. These approaches are practical in real data science tasks.

However, deep learning approaches are also an essential skill for a data scientist. In practice, deep learning becomes a more powerful approach when the dataset is larger than the Boston house prices dataset.

In this post, we will see a brief description of how to apply a neural network to the Boston house prices dataset.

Related posts are below. Please refer to those.

Import libraries

from sklearn.datasets import load_boston
from sklearn import preprocessing
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
import seaborn as sns
sns.set()
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

Load the Dataset

In this post, we use the Boston house prices dataset in the scikit-learn library. We can easily load the dataset by just two lines below.

# from sklearn.datasets import load_boston
dataset = load_boston()

Read the Dataset as Pandas DataFrame

# import pandas as pd
df = pd.DataFrame(dataset.data)
df.columns = dataset.feature_names
df["PRICES"] = dataset.target
df.head()

Example: RM vs PRICES

As an example, let’s check the correlation between “PRICES” and “RM”.

# import matplotlib.pylab as plt
df.plot(x="RM", y="PRICES", style="o")
plt.ylabel("PRICES")
plt.show()

Variables to be used

TargetName = "PRICES"
FeaturesName = [
              #-- "Crime occurrence rate per unit population by town"
              "CRIM",
              #-- "Percentage of 25000-squared-feet-area house"
              'ZN',
              #-- "Percentage of non-retail land area by town"
              'INDUS',
              #-- "Index for Charlse river: 0 is near, 1 is far"
              'CHAS',
              #-- "Nitrogen compound concentration"
              'NOX',
              #-- "Average number of rooms per residence"
              'RM',
              #-- "Percentage of buildings built before 1940"
              'AGE',
              #-- 'Weighted distance from five employment centers'
              "DIS",
              ##-- "Index for easy access to highway"
              'RAD',
              ##-- "Tax rate per $100,000"
              'TAX',
              ##-- "Percentage of students and teachers in each town"
              'PTRATIO',
              ##-- "1000(Bk - 0.63)^2, where Bk is the percentage of Black people"
              'B',
              ##-- "Percentage of low-class population"
              'LSTAT',
              ]

We prepare the input and target variables as “X” and “Y”.

X = df[FeaturesName]
Y = df[TargetName]

Standardize the Variables

We need to standardize or normalize the numerical variables in neural network analysis. This is because the magnitude of each variable affects the scale of the corresponding parameters in the network, so differences in variable scale would make training the model difficult.

For the above reason, we apply standardization to the variables. Mathematically, standardization is defined as follows.

$$\begin{eqnarray*}
\tilde{x}=
\frac{x-\mu}{\sigma}
,
\end{eqnarray*}$$

where $\mu$ and $\sigma$ are the mean and the standard deviation, respectively.

# from sklearn import preprocessing
sscaler = preprocessing.StandardScaler()
sscaler.fit(X)
X_std = sscaler.transform(X)

Split the Dataset

To validate the performance of the trained model against unseen data, we have to split the dataset into the train data and the test data.

We pass the dataset “(X_std, Y)” to the “train_test_split()” function. The ratio of the train data to the test data is defined by the argument “test_size”. Here, the ratio is set to 8:2. In addition, “random_state” is set for reproducibility; you can use any number you like.

# from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_std, Y, test_size=0.2, random_state=99)

Define a Neural network model by PyTorch

We define a neural network model. The model has three fully connected layers.

In “__init__()”, we define the layers we will use. For example, the first layer “self.layer1” is defined by “nn.Linear()” in PyTorch, where the input and output sizes are 13 (X.shape[1]) and 32, respectively. Note that the input size 13 is determined by the number of input variables in the dataset, whereas the output size is arbitrary and up to you. Similarly, in the second (third) layer, the input size must match the output size of the previous layer, while the output size is again up to you. In other words, you design the neural network structure through the output sizes.

In “forward()”, we define the forward pass of the network. The input data “x” is passed into “self.layer1(x)”, and its output is fed into the activation function “F.relu()”. Note that the output of the final (third) layer is not passed through an activation function, because the final-layer output is the predicted housing price itself!

# import torch
# import torch.nn as nn
# import torch.nn.functional as F

class NeuralNetwork(nn.Module):
        def __init__(self):
            super(NeuralNetwork, self).__init__()
            self.layer1 = nn.Linear(X.shape[1], 32) # input: X.shape[1]=13, output: 32
            self.layer2 = nn.Linear(32, 16)
            self.layer3 = nn.Linear(16, 1)

        def forward(self, x):
            x = F.relu( self.layer1(x) )
            x = F.relu( self.layer2(x) )
            x = self.layer3(x)

            return x

model = NeuralNetwork()
print(model)

Convert data into a Tensor

In PyTorch, we have to explicitly convert the NumPy array (or pandas DataFrame) into a tensor. The conversion can be performed by “torch.tensor()”, where the parameter “dtype” specifies the data type.

It must be noted that the shape of the model output will be (***, 1), whereas the shape of “Y_train” is (***,). This mismatch would cause a problem when calculating the loss during training. Therefore, we reshape “Y_train” before converting it into a tensor.

# Convert into tensor
x = torch.tensor(np.array(X_train), dtype=torch.float)
y = torch.tensor(np.array(Y_train).reshape(-1, 1), dtype=torch.float)

Define an Optimizer

We define an optimizer. PyTorch covers many optimization algorithms. The popular and basic ones are SGD and Adam.

Here, we choose the SGD algorithm as an optimizer.

# import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=0.005)

We passed the two arguments “model.parameters()” and “lr=0.005”.

The first one is the parameters of the neural network model. The optimizer updates these parameters in each training cycle.

The second one is the learning rate, a parameter that controls how much the model parameters are updated at each step. A small learning rate updates the parameters gradually and approaches a good solution more reliably, but training takes longer. Therefore, we need to tune the learning rate and find an appropriate value.

If you would like to use Adam as an optimizer, instead of the above codes, specify as follows.

optimizer = optim.Adam(model.parameters(), lr=0.005)

Define Loss function

We define a loss function. PyTorch covers many types of loss functions. Here, we use the mean squared error as a loss function.

# define loss function
loss_function = nn.MSELoss()

Train the Model

Finally, we can train the model!

At each epoch, we perform the following steps:

  • Initialize the gradient of the model parameters
  • Calculate the loss
  • Calculate the gradient of the model parameters by backpropagation
  • Update the model parameters

We set the number of epochs to 200, i.e., we repeat the above steps 200 times.

The loss should decrease gradually, which indicates that the training is going well!

# Epoch
epochs = 200

for i in range(epochs):
    # initialize the gradient of model parameters
    optimizer.zero_grad()

    # calculate the loss
    y_val = model(x)
    loss = loss_function(y_val, y)
    # Backpropagation
    loss.backward()
    # Update parameters
    optimizer.step()
    if (i % 5) == 0:
      print('epoch: {}, loss: {:.5f}'.format(i, loss.item()))

Validation

To validate the performance of the model, we make predictions on both the training and test data. Note that we have to convert the resulting tensors into NumPy arrays after prediction.

# Prediction
Y_train_pred = model(torch.tensor(X_train, dtype=torch.float))
Y_test_pred = model(torch.tensor(X_test, dtype=torch.float))

# Convert into numpy array
Y_train_pred = Y_train_pred.detach().numpy()
Y_test_pred = Y_test_pred.detach().numpy()

Accuracy: R2

We calculate $R^{2}$ score to confirm the prediction accuracy.

$R^{2}$ is an index of how well the model fits the dataset. When $R^{2}$ is close to $1$, the model accuracy is good. Conversely, when $R^{2}$ approaches $0$, the model accuracy is poor.
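For reference, the definition of $R^{2}$ used by “r2_score()” is

$$\begin{eqnarray*}
R^{2} = 1 - \frac{\sum_{i}\left(y_{i}-\hat{y}_{i}\right)^{2}}{\sum_{i}\left(y_{i}-\bar{y}\right)^{2}},
\end{eqnarray*}$$

where $y_{i}$, $\hat{y}_{i}$, and $\bar{y}$ are the observed values, the predicted values, and the mean of the observed values, respectively.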

We can calculate $R^{2}$ by the “r2_score()” function in scikit-learn.

# from sklearn.metrics import r2_score
R2 = r2_score(Y_test, Y_test_pred)
print(R2)

>> 0.8048130761552106

The score of $0.80$ is better than the $0.74$ obtained from linear regression in another post. The accuracy has improved!

Visualize the Results

Finally, let’s visualize the results by matplotlib.

The red and blue circles show the results for the training and test data, respectively.

plt.figure(figsize=(5, 5), dpi=100)
sns.set()
plt.xlabel("PRICES")
plt.ylabel("Predicted PRICES")
plt.xlim(0, 60)
plt.ylim(0, 60)
plt.scatter(Y_train, Y_train_pred, lw=1, color="r", label="train data")
plt.scatter(Y_test, Y_test_pred, lw=1, color="b", label="test data")
plt.legend()
plt.show()

Summary

We have seen a neural network analysis of the Boston house prices dataset constructed with PyTorch. Although we used a very simple network structure, the accuracy on the test data improved over that of linear regression.

The author hopes this blog helps readers a little.

You may also be interested in:

Step-by-step guide of Linear Regression for Boston House Prices dataset
Step-by-step guide of Decision Tree Regression for Boston House Prices dataset
Brief EDA for Boston House Prices Dataset
Standardization by scikit-learn in Python