3D Scatter by Plotly

3D visualization is practical because it helps us understand the relationships between variables in a dataset. For example, during EDA (Exploratory Data Analysis), it is a powerful tool for examining a dataset from various angles.

To create the 3D scatter plot, we use Plotly, a well-known open-source Python library. Although Plotly is rich in functionality, it can be a little difficult for beginners. Compared to the popular Python libraries matplotlib and seaborn, it is often less intuitive; for example, its interfaces are object-oriented, and arguments are frequently passed as dictionaries.

Therefore, in this post, the basic skills for 3D visualization with Plotly are introduced. We will learn the basic functions required to draw a graph through a simple example.

The full code is in the GitHub repository.

Import Libraries

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
import plotly.graph_objects as go

The code in this post has been confirmed to work with the following library versions:

numpy==1.19.5
pandas==1.1.5
scikit-learn==1.0.1
plotly==4.4.1

Prepare a Dataset

In this post, we use the “California Housing dataset” included in scikit-learn.

# Get a dataset instance
dataset = fetch_california_housing()

The “dataset” variable is an instance of the dataset. It stores several kinds of information, i.e., the explanatory-variable values, the target-variable values, the names of the explanatory variables, and the name of the target variable.

We can take and assign them separately as follows.

dataset.data: values of the explanatory variables
dataset.target: values of the target variable (house prices)
dataset.feature_names: the explanatory-variable (column) names
dataset.target_names: the name of the target variable

Note that we store the dataset as a pandas DataFrame because it is convenient for manipulating data. The target variable “MedHouseVal” indicates the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).

# Store the dataset as pandas DataFrame
df = pd.DataFrame(dataset.data)
# Assign the explanatory-variable names
df.columns = dataset.feature_names
# Assign the target-variable name
df[dataset.target_names[0]] = dataset.target

df.head()

Variables to be used in each axis

Here, we prepare the variables to be used for each axis. We use “Latitude” and “Longitude” for the x- and y-axes, and “MedHouseVal”, the target variable, for the z-axis.

xlbl = 'Latitude'
ylbl = 'Longitude'
zlbl = 'MedHouseVal' 

x = df[xlbl]
y = df[ylbl]
z = df[zlbl]

Basic Format for 3D Scatter

To get started, let’s create a 3D scatter figure. We use the “graph_objects” module in Plotly. First, we create a graph instance named “fig”. Next, we add a 3D scatter created by “go.Scatter3d()” to “fig” with the “fig.add_traces()” method. Finally, we display the figure with “fig.show()”.

# import plotly.graph_objects as go

# Create a graph instance
fig = go.Figure()
# Add 3D Scatter to the graph instance
fig.add_traces(go.Scatter3d(
    x=x, y=y, z=z,
))
# Show the figure
fig.show()

However, you can see that the default settings are often inadequate.

Therefore, we will make changes to the following items to create a good-looking graph.

  • Marker size
  • Marker color
  • Plot Style
  • Axis label
  • Figure size
  • Save a figure as an HTML file

Marker size and color

We change the marker size and color. Note that we have to pass the settings as a dictionary.

# Create a graph instance
fig = go.Figure()
# Add 3D Scatter to the graph instance
fig.add_traces(go.Scatter3d(
    x=x, y=y, z=z,
    # marker size and color
    marker=dict(color='red', size=1),
))
# Show the figure
fig.show()

The marker size has been reduced to 1, and the color has been changed from the default blue to red.

By reducing the marker size, we can see that the plot mixes not only points but also lines connecting them. To get a points-only graph, you need to explicitly set the “mode” argument to “markers”.

Marker Style

We can easily specify the marker style.

# Create a graph instance
fig = go.Figure()
# Add 3D Scatter to the graph instance
fig.add_traces(go.Scatter3d(
    x=x, y=y, z=z,
    # marker size and color
    marker=dict(color='red', size=1),
    # marker style
    mode='markers',
))
# Show the figure
fig.show()

Of course, you can just as easily switch to a line style: simply change the “mode” argument from “markers” to “lines”.
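
For reference, here is a minimal sketch of the line-style variant, reusing the x, y, z prepared above:

# Create a graph instance
fig = go.Figure()
# Add a 3D trace drawn with lines instead of markers
fig.add_traces(go.Scatter3d(
    x=x, y=y, z=z,
    # line color and width
    line=dict(color='red', width=1),
    # line style
    mode='lines',
))
# Show the figure
fig.show()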

Axis Label

Next, we add the label to each axis.

Although we set axis labels frequently, Plotly handles them less intuitively than matplotlib, so it is convenient to go over the procedure once here.

We use the “fig.update_layout()” method to change the figure layout. As an argument, we pass “scene” as a dictionary, where each axis label is given as a value to its corresponding key.

# Create a graph instance
fig = go.Figure()
# Add 3D Scatter to the graph instance
fig.add_traces(go.Scatter3d(
    x=x, y=y, z=z,
    # marker size and color
    marker=dict(color='red', size=1),
    # marker style
    mode='markers',
))
# Axis Labels
fig.update_layout(
    scene=dict(
        xaxis_title=xlbl,
        yaxis_title=ylbl,
        zaxis_title=zlbl,
        )
    )
# Show the figure
fig.show()

Figure Size

We sometimes need to change the figure size. This can be done in a single line with the “fig.update_layout()” method.

fig.update_layout(height=600, width=600)

Save a figure in HTML format

We can save the created figure with the “fig.write_html()” method.

fig.write_html('3d_scatter.html')

Since the figure is saved as an HTML file, we can explore it interactively in a web browser, e.g., Chrome or Firefox.

Cheat Sheet for 3D Scatter

# Create a graph instance
fig = go.Figure()
# Add 3D Scatter to the graph instance
fig.add_traces(go.Scatter3d(
    x=x, y=y, z=z,
    # marker size and color
    marker=dict(color='red', size=1),
    # marker style
    mode='markers',
))
# Axis Labels
fig.update_layout(
    scene=dict(
        xaxis_title=xlbl,
        yaxis_title=ylbl,
        zaxis_title=zlbl,
        )
    )
# Figure size
fig.update_layout(height=600, width=600)
# Save the figure
fig.write_html('3d_scatter.html')
# Show the figure
fig.show()

Summary

We have seen how to create a 3D scatter graph. With Plotly, we can create it easily.

The author believes that the code examples in this post make it easy for readers to understand and implement a 3D scatter graph.

This time we plotted only one trace, but the case with multiple traces can be implemented in the same way. Since the settings are passed as dictionaries, we just add another trace with its own settings (see the sketch below).
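
A minimal sketch of the multi-trace case (purely illustrative; the split into the first and next 1,000 districts is an arbitrary choice for demonstration):

# Create a graph instance
fig = go.Figure()
# First trace: the first 1000 rows in red
fig.add_traces(go.Scatter3d(
    x=x[:1000], y=y[:1000], z=z[:1000],
    marker=dict(color='red', size=1),
    mode='markers',
    name='first 1000 districts',
))
# Second trace: the next 1000 rows in blue
fig.add_traces(go.Scatter3d(
    x=x[1000:2000], y=y[1000:2000], z=z[1000:2000],
    marker=dict(color='blue', size=1),
    mode='markers',
    name='next 1000 districts',
))
# Show the figure
fig.show()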

The author hopes this blog helps readers a little.

Open Toy datasets for Data science

We need a dataset to quickly try out a new method we have learned. Therefore, it’s very important to get used to working with open datasets.

In this post, several open datasets included in scikit-learn will be introduced. With scikit-learn, we can easily and quickly use these datasets for regression and classification analyses.

The link to the code in the GitHub repository is here.

What is scikit-learn?

scikit-learn is one of the most famous Python libraries for machine learning. It is easy to use, powerful, and includes a wide variety of techniques, so it can be used for everything from statistical analysis to machine learning.

Open Toy datasets in scikit-learn

scikit-learn is not only used to implement machine learning but also contains various datasets. It should be noted that the toy datasets introduced below can be used offline once scikit-learn is installed.

The list of the datasets in scikit-learn is as follows. The reference is in the scikit-learn document.

  • Boston house prices dataset (regression)
  • Iris dataset (classification)
  • Diabetes dataset (regression)
  • Digits dataset (classification)
  • Physical exercise Linnerud dataset (multi-output regression)
  • Wine dataset (classification)
  • Breast cancer wisconsin dataset (classification)

With scikit-learn, we can access all of the above datasets in the same way.

Let’s first take the Boston house prices dataset as an example. Once you understand this example, you can handle the rest of the datasets in the same way.

Import common libraries

Here, we import the commonly used Python library.

import pandas as pd

Boston house prices dataset

This dataset is for regression analysis. Therefore, we can use it to try out a regression method we have learned.

First, we import the dataset module from scikit-learn. The dataset is included in the sklearn.datasets module.

from sklearn.datasets import load_boston

Second, we create the dataset instance. This instance stores various information, i.e., the explanatory-variable data, the feature names, the regression target data, and the description of the dataset. We can then extract and use this information as needed.

# instance of the boston house-prices dataset
dataset = load_boston()

We can confirm the details of the dataset through the DESCR attribute.

print(dataset.DESCR)


>> .. _boston_dataset:
>> 
>> Boston house prices dataset
>> ---------------------------
>> 
>> **Data Set Characteristics:**  
>> 
>>     :Number of Instances: 506 
>> 
>>     :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
>> 
>>     :Attribute Information (in order):
>>         - CRIM     per capita crime rate by town
>>         - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
>>         - INDUS    proportion of non-retail business acres per town
>>         - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
>>         - NOX      nitric oxides concentration (parts per 10 million)
>>         - RM       average number of rooms per dwelling
>>         - AGE      proportion of owner-occupied units built prior to 1940
>>         - DIS      weighted distances to five Boston employment centres
>>         - RAD      index of accessibility to radial highways
>>         - TAX      full-value property-tax rate per $10,000
>>         - PTRATIO  pupil-teacher ratio by town
>>         - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
>>         - LSTAT    % lower status of the population
>>         - MEDV     Median value of owner-occupied homes in $1000's
>> 
>>     :Missing Attribute Values: None
>> 
>>     :Creator: Harrison, D. and Rubinfeld, D.L.
>> 
>>     .
>>     .
>>     .

Next, we confirm the contents of the dataset.

The contents of the variable “dataset” can be accessed through specific attributes. Several kinds of information are stored in this variable. The list is as follows.

dataset.data: the explanatory-variable values
dataset.feature_names: the explanatory-variable names
dataset.target: values of the target variable

First, we take the data and the feature names of the explanatory variables.

# explanatory variables
X = dataset.data
# feature names
feature_names = dataset.feature_names

The data type of the above variable X is a NumPy array. For convenience, we convert it into a pandas DataFrame. With the DataFrame type, we can easily manipulate the tabular dataset and perform preprocessing.

# convert the data type from numpy array into pandas DataFrame
X = pd.DataFrame(X, columns=feature_names)
# display the first five rows.
print(X.head())


>>         CRIM    ZN  INDUS  CHAS    NOX  ...  RAD    TAX  PTRATIO       B  LSTAT
>>   0  0.00632  18.0   2.31   0.0  0.538  ...  1.0  296.0     15.3  396.90   4.98
>>   1  0.02731   0.0   7.07   0.0  0.469  ...  2.0  242.0     17.8  396.90   9.14
>>   2  0.02729   0.0   7.07   0.0  0.469  ...  2.0  242.0     17.8  392.83   4.03
>>   3  0.03237   0.0   2.18   0.0  0.458  ...  3.0  222.0     18.7  394.63   2.94
>>   4  0.06905   0.0   2.18   0.0  0.458  ...  3.0  222.0     18.7  396.90   5.33

Finally, we take the target-variable data.

# target variable
y = dataset.target
# display the first five elements.
print(y[0:5])


>>   [24.  21.6 34.7 33.4 36.2]

Now the explanatory variables X and the target variable y are ready. In a real analysis, the next steps are preprocessing the data, building the model, and validating the model’s accuracy.
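
As a rough sketch of those next steps (an illustrative example with an assumed 80/20 train-test split and a plain linear regression, not a tuned analysis), the flow could look like this:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Split the data into training and test sets (the 80/20 split is an arbitrary choice)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Fit a plain linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate with the coefficient of determination (R^2) on the test set
print(model.score(X_test, y_test))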

You can follow the same procedure for the other datasets. The code for each dataset is described below.
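
As a side note, because all the loaders share the same interface, the steps above could also be wrapped in a small helper. Note that “dataset_to_dataframe” below is a hypothetical convenience function written for illustration, not part of scikit-learn:

import pandas as pd
from sklearn.datasets import load_iris


def dataset_to_dataframe(loader):
    """Load a scikit-learn toy dataset and return (X as DataFrame, y)."""
    dataset = loader()
    X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
    y = dataset.target
    return X, y


# usage example with the iris dataset
X, y = dataset_to_dataframe(load_iris)
print(X.shape)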

Iris dataset

This dataset is for classification analysis.

from sklearn.datasets import load_iris


# instance of the iris dataset
dataset = load_iris()
# explanatory variables
X = dataset.data
# feature names
feature_names = dataset.feature_names
# convert the data type from numpy array into pandas DataFrame
X = pd.DataFrame(X, columns=feature_names)
# target variable
y = dataset.target

Diabetes dataset

This dataset is for regression analysis.

from sklearn.datasets import load_diabetes


# instance of the Diabetes dataset
dataset = load_diabetes()
# explanatory variables
X = dataset.data
# feature names
feature_names = dataset.feature_names
# convert the data type from numpy array into pandas DataFrame
X = pd.DataFrame(X, columns=feature_names)
# target variable
y = dataset.target

Digits dataset

This dataset is for classification analysis. Note that it consists of image data, so the attributes used to access the data differ from those of the other datasets.

dataset.images: the raw image data
dataset.feature_names: the explanatory-variable names
dataset.target: values of the target variable

from sklearn.datasets import load_digits


# instance of the digits dataset
dataset = load_digits()
# explanatory variables
X = dataset.images  # X.shape is (1797, 8, 8)

# target variable(0, 1, 2, .., 8, 9)
y = dataset.target

# Display the image
import matplotlib.pyplot as plt 
plt.gray() 
plt.matshow(X[0]) 
plt.show() 
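
For reference, when feeding the digits to a standard classifier we usually need a flat feature matrix rather than 8x8 images. scikit-learn already provides this as “dataset.data”, so a short sketch is:

# Flattened pixel values, one row per image
X_flat = dataset.data  # shape is (1797, 64): each 8x8 image flattened into 64 values
print(X_flat.shape)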

Physical exercise Linnerud dataset

This dataset is for multi-target regression analysis. In this dataset, the target variable has three outputs.

from sklearn.datasets import load_linnerud


# instance of the physical exercise linnerud dataset
dataset = load_linnerud()
# explanatory variables
X = dataset.data
# feature names
feature_names = dataset.feature_names
# convert the data type from numpy array into pandas DataFrame
X = pd.DataFrame(X, columns=feature_names)
# target variable
y = dataset.target

X.head()
print(y)
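
Since the target here has three outputs, it can also be convenient to store it as a DataFrame. A small sketch, taking the column names from “dataset.target_names”:

# The three target outputs as a DataFrame, with columns taken from dataset.target_names
y_df = pd.DataFrame(y, columns=dataset.target_names)
print(y_df.head())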

Wine dataset

This dataset is for classification analysis.

from sklearn.datasets import load_wine


# instance of the wine dataset
dataset = load_wine()
# explanatory variables
X = dataset.data
# feature names
feature_names = dataset.feature_names
# convert the data type from numpy array into pandas DataFrame
X = pd.DataFrame(X, columns=feature_names)
# target variable
y = dataset.target

Breast cancer wisconsin dataset

This dataset is for classification analysis (binary classification).

from sklearn.datasets import load_breast_cancer


# instance of the Breast cancer wisconsin dataset
dataset = load_breast_cancer()
# explanatory variables
X = dataset.data
# feature names
feature_names = dataset.feature_names
# convert the data type from numpy array into pandas DataFrame
X = pd.DataFrame(X, columns=feature_names)
# target variable
y = dataset.target

Summary

In this post, we have seen several famous datasets included in scikit-learn. Open datasets are important for quickly trying out new methods we have learned.

Therefore, let’s get used to working with open datasets.

UPDATE: the Book Published on Amazon Kindle, Tutorial of a Deployment of a Web App by Streamlit and Python

The book about Streamlit, published on Amazon Kindle, has received a major update. In this update, content about PCA (principal component analysis) has been added.

The book is entitled “Tutorial of a Deployment of a Web app by Python and Streamlit for a Data Scientist”.

This book is registered on Kindle Unlimited, so any member can read it!!

Features of this book

  • For beginners of Streamlit
  • Written with simple explanations in mind
  • All with sample code
  • Introducing data analysis as a web application as an example

What is Streamlit?

Streamlit is a wonderful library that makes it easier and faster to build a web app for your data science project. With Streamlit, we can easily convert a Python script into a web app. In other words, we can publish our data analyses as a web app.

Articles about Streamlit have been posted on this blog in the past; the book was created from them, with detailed explanations added. If you want to study everything at once, please check it out!

Visualising the Wine Classification Dataset by PCA

Visualizing a dataset is very important for understanding it.

However, the larger the number of explanatory variables, the more difficult it is to create a visualization that reflects the characteristics of the dataset. In the case of classification problems, it would be ideal to be able to separate the classes with a small number of variables.

Principal Component Analysis (PCA) is one of the practical methods for visualizing a high-dimensional dataset. This is because PCA is a technique for reducing the dimensionality of a dataset, i.e., for aggregating its information into fewer variables.

In this post, we will see how PCA can help you aggregate information and visualize the dataset. We use the wine classification dataset, one of the famous open datasets. We can easily use this dataset because it is already included in scikit-learn.

In the previous blog post, exploratory data analysis (EDA) of the wine classification dataset was introduced, so you can check the details of the dataset there.

https://machine-learning.tokyo/brief-eda-for-wine-classification-dataset/

Import Library

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler 
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
import plotly.graph_objects as go

Load the dataset

dataset = load_wine()

Confirm the content of the dataset

The contents of the dataset are stored in the variable “dataset”. In this variable, several kinds of information are stored, i.e., the target-variable name and values, the explanatory-variable names and values, and the description of the dataset. Then, we have to take each of them separately.

dataset.target_names: the class labels of the target variable
dataset.target: values of the target variable (class label)
dataset.feature_names: the explanatory-variable names
dataset.data: the explanatory-variable values

We can take the class labels and the unique values of the target variable.

"""target-variable name"""
print(dataset.target_names)
"""target-variable values"""
print(np.unique(dataset.target))


>>  ['class_0' 'class_1' 'class_2']
>>  [0 1 2]

There are three classes in the target variable (‘class_0’, ‘class_1’, ‘class_2’). These classes correspond to [0 1 2].

In other words, the problem is classifying wine into three categories from the explanatory variables.

Prepare the dataset as DataFrame in pandas

For convenience, we convert the dataset into the Pandas DataFrame type. With the DataFrame type, we can easily manipulate the table-type dataset and perform the preprocessing.

Here, let’s put all the data together into one Pandas DataFrame “df”.

(NOTE) df is an abbreviation for data frame.

"""Prepare explanatory variable as DataFrame in pandas"""
df = pd.DataFrame(dataset.data)
df.columns = dataset.feature_names
"""Add the target variable to df"""
df["target"] = dataset.target
print(df.head())


>>     alcohol  malic_acid   ash  ...  od280/od315_of_diluted_wines  proline  target
>>  0    14.23        1.71  2.43  ...                          3.92   1065.0       0
>>  1    13.20        1.78  2.14  ...                          3.40   1050.0       0
>>  2    13.16        2.36  2.67  ...                          3.17   1185.0       0
>>  3    14.37        1.95  2.50  ...                          3.45   1480.0       0
>>  4    13.24        2.59  2.87  ...                          2.93    735.0       0
>>  
>>  [5 rows x 14 columns]

In this dataset, there are 13 explanatory variables. Therefore, to visualize the dataset, we reduce its dimensionality with PCA.

Prepare the Explanatory variables and the Target variable

First, we prepare the explanatory variables and the target variable, separately.

"""Prepare the explanatory and target variables"""
x = df.drop(columns=['target'])
y = df['target']

Standardize the Variables

Before performing PCA, we should standardize the numerical variables because their scales differ. We can do this easily with scikit-learn as follows.

"""Standardization"""
sc = StandardScaler()
x_std = sc.fit_transform(x)

The details of standardization are described in another post. If you are unfamiliar with standardization, refer to the following post.

PCA

Here, let’s perform the PCA analysis. It is easy to perform it using scikit-learn.

"""PCA: principal component analysis"""
# from sklearn.decomposition import PCA

pca = PCA(n_components=3)
x_pca = pca.fit_transform(x_std)

PCA can be done in just two lines.

The first line creates an instance to execute PCA. The argument “n_components” represents the number of principal components held by the instance. If “n_components = 3”, the instance holds the first to third principal components.

The second line fits the instance created in the first line to the explanatory variables and transforms them. The return value is the data projected onto the principal components; in this case, it contains three components.

Just in case, let’s check the shape of the obtained “x_pca”. You can see that there are 178 samples and 3 components.

print(x_pca.shape)

>>  (178, 3)
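
As a supplementary check (not needed for the plot itself), the fitted PCA instance has the attribute “explained_variance_ratio_”, which tells us how much of the total variance each principal component captures:

"""Proportion of the total variance explained by each principal component"""
print(pca.explained_variance_ratio_)
"""Cumulative proportion over the three components"""
print(pca.explained_variance_ratio_.cumsum())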

Visualize the dataset

Finally, we visualize the dataset. We already obtained the 3 principal components, so it is a good choice to create the 3D scatter plot. To create the 3D scatter plot, we use plotly, one of the famous python libraries.

# import plotly.graph_objects as go

"""axis-label name"""
x_lbl, y_lbl, z_lbl = 'PCA 1', 'PCA 2', 'PCA 3'
"""data at eact axis to plot"""
x_plot, y_plot, z_plot = x_pca[:,0], x_pca[:,1], x_pca[:,2]

"""Create an object for 3d scatter"""
trace1 = go.Scatter3d(
    x=x_plot, y=y_plot, z=z_plot,
    mode='markers',
    marker=dict(
        size=5,
        color=y, # distinguish the class by color
        )
)
"""Create an object for graph layout"""
fig = go.Figure(data=[trace1])
fig.update_layout(scene = dict(
                    xaxis_title = x_lbl,
                    yaxis_title = y_lbl,
                    zaxis_title = z_lbl),
                    width=700,
                    margin=dict(r=20, b=10, l=10, t=10),
                    )
fig.show()

The colors correspond to the classes. The graph shows that the classes can be roughly separated in the space of the information-aggregated principal components.

If the classes are still difficult to separate after applying principal component analysis, the dataset may lack important features. In that case, even if the data is fed to a classification model, there is a high possibility that the accuracy will be insufficient. In this way, PCA helps us assess a dataset through visualization.

Summary

We have seen how to perform PCA and visualize its results. One of the reasons to perform PCA is to assess the complexity of a dataset. When the PCA results are insufficient for classification, feature engineering is recommended.

The author hopes this blog helps readers a little.

You may also be interested in:

Brief EDA for Wine Classification Dataset
Standardization by scikit-learn in Python

Brief EDA for Wine Classification Dataset

Exploratory data analysis (EDA) is one of the most important processes in data analysis. To construct a machine-learning model properly, understanding the dataset is essential; without appropriate EDA, there is no success. After EDA, you will be able to select models and perform feature engineering effectively.

In this post, we use the wine classification dataset, one of the famous open datasets. We can easily use this dataset because it is already included in scikit-learn.

It’s very important to get used to working with open datasets, because they let us quickly try out a new method we have learned.

In the previous blog post, open datasets for regression analyses were introduced. This time, the author will introduce an open dataset that can be used for classification problems.

Import Library

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine

Load the dataset

dataset = load_wine()

We can confirm the details of the dataset through the .DESCR attribute.

print(dataset.DESCR)


>> .. _wine_dataset:
>> 
>> Wine recognition dataset
>> ------------------------
>> 
>> **Data Set Characteristics:**
>> 
>>     :Number of Instances: 178 (50 in each of three classes)
>>     :Number of Attributes: 13 numeric, predictive attributes and the class
>>     :Attribute Information:
>>  		- Alcohol
>>  		- Malic acid
>>  		- Ash
>> 		- Alcalinity of ash  
>>  		- Magnesium
>> 		- Total phenols
>>  		- Flavanoids
>>  		- Nonflavanoid phenols
>>  		- Proanthocyanins
>> 		- Color intensity
>>  		- Hue
>>  		- OD280/OD315 of diluted wines
>>  		- Proline
>> 
>>      - class:
>>             - class_0
>>             - class_1
>>             - class_2
>> 		
>>     :Summary Statistics:
>> 
>>     .
>>     .
>>     .

Confirm the content of the dataset

The contents of the dataset are stored in the variable “dataset”. In this variable, several kinds of information are stored, i.e., the target-variable name and values, the explanatory-variable names and values, and the description of the dataset. Then, we have to take each of them separately.

dataset.target_names: the class labels of the target variable
dataset.target: values of the target variable (class label)
dataset.feature_names: the explanatory-variable names
dataset.data: the explanatory-variable values

We can take the class labels and the values of the target variable.

"""target-variable name"""
print(dataset.target_names)
"""target-variable values"""
print(dataset.target)


>>  ['class_0' 'class_1' 'class_2']
>>  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>>   1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>>   1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
>>   2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

There are three classes in the target variable (‘class_0’, ‘class_1’, ‘class_2’). In other words, this is a problem of classifying wine into three categories from the explanatory variables.

Next, let’s take the explanatory variables.

"""explanatory-variable name"""
print(dataset.feature_names)
"""explanatory-variable values"""
print(dataset.data)


>>  ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
>>  [[1.423e+01 1.710e+00 2.430e+00 ... 1.040e+00 3.920e+00 1.065e+03]
>>   [1.320e+01 1.780e+00 2.140e+00 ... 1.050e+00 3.400e+00 1.050e+03]
>>   [1.316e+01 2.360e+00 2.670e+00 ... 1.030e+00 3.170e+00 1.185e+03]
>>   ...
>>   [1.327e+01 4.280e+00 2.260e+00 ... 5.900e-01 1.560e+00 8.350e+02]
>>   [1.317e+01 2.590e+00 2.370e+00 ... 6.000e-01 1.620e+00 8.400e+02]
>>   [1.413e+01 4.100e+00 2.740e+00 ... 6.100e-01 1.600e+00 5.600e+02]]

You can see that there are several kinds of explanatory variables. A description of each variable can be found in “dataset.DESCR”.

Convert the dataset into DataFrame in pandas

For convenience, we convert the dataset into the Pandas DataFrame type. With the DataFrame type, we can easily manipulate the table-type dataset and perform the preprocessing.

Here, let’s put all the data together into one Pandas DataFrame “df”.

(NOTE) df is an abbreviation for data frame.

"""Prepare explanatory variable as DataFrame in pandas"""
df = pd.DataFrame(dataset.data)
df.columns = dataset.feature_names
"""Add the target variable to df"""
df["target"] = dataset.target
print(df.head())


>>     alcohol  malic_acid   ash  ...  od280/od315_of_diluted_wines  proline  target
>>  0    14.23        1.71  2.43  ...                          3.92   1065.0       0
>>  1    13.20        1.78  2.14  ...                          3.40   1050.0       0
>>  2    13.16        2.36  2.67  ...                          3.17   1185.0       0
>>  3    14.37        1.95  2.50  ...                          3.45   1480.0       0
>>  4    13.24        2.59  2.87  ...                          2.93    735.0       0
>>  
>>  [5 rows x 14 columns]

From here, we perform the EDA and understand the dataset!

Summary information

First, we should look at the dataset as a whole. Moving from the overall picture to the details is an important order of work.

First, let’s confirm the data type of each explanatory variable.

We can easily confirm it by the .info() method in pandas.

print(df.info())


>>  <class 'pandas.core.frame.DataFrame'>
>>  RangeIndex: 178 entries, 0 to 177
>>  Data columns (total 14 columns):
>>   #   Column                        Non-Null Count  Dtype  
>>  ---  ------                        --------------  -----  
>>   0   alcohol                       178 non-null    float64
>>   1   malic_acid                    178 non-null    float64
>>   2   ash                           178 non-null    float64
>>   3   alcalinity_of_ash             178 non-null    float64
>>   4   magnesium                     178 non-null    float64
>>   5   total_phenols                 178 non-null    float64
>>   6   flavanoids                    178 non-null    float64
>>   7   nonflavanoid_phenols          178 non-null    float64
>>   8   proanthocyanins               178 non-null    float64
>>   9   color_intensity               178 non-null    float64
>>   10  hue                           178 non-null    float64
>>   11  od280/od315_of_diluted_wines  178 non-null    float64
>>   12  proline                       178 non-null    float64
>>   13  target                        178 non-null    int64  
>>  dtypes: float64(13), int64(1)
>>  memory usage: 19.6 KB
>>  None

While the data type of the target variable is int, the explanatory variables are all float-type numeric variables.

The fact that the target variable is of type int makes sense because this is a classification problem.

From this, we know there is no need to preprocess categorical variables, because all explanatory variables are numeric.

Note that continuous numeric variables require scaling as preprocessing. Please refer to the following post for details.

Missing Values

Here, we check how many missing values there are. We can check this with a combination of the “isnull()” and “sum()” methods in pandas.

print(df.isnull().sum())


>>  alcohol                         0
>>  malic_acid                      0
>>  ash                             0
>>  alcalinity_of_ash               0
>>  magnesium                       0
>>  total_phenols                   0
>>  flavanoids                      0
>>  nonflavanoid_phenols            0
>>  proanthocyanins                 0
>>  color_intensity                 0
>>  hue                             0
>>  od280/od315_of_diluted_wines    0
>>  proline                         0
>>  target                          0
>>  dtype: int64

Fortunately, there are no missing values! This is because the dataset was created carefully. Note, however, that real datasets usually have many such problems to deal with.

Confirm the basic Descriptive Statistics values

We can calculate the basic descriptive statistics with just one line!

print(df.describe())


>>            alcohol  malic_acid  ...      proline      target
>>  count  178.000000  178.000000  ...   178.000000  178.000000
>>  mean    13.000618    2.336348  ...   746.893258    0.938202
>>  std      0.811827    1.117146  ...   314.907474    0.775035
>>  min     11.030000    0.740000  ...   278.000000    0.000000
>>  25%     12.362500    1.602500  ...   500.500000    0.000000
>>  50%     13.050000    1.865000  ...   673.500000    1.000000
>>  75%     13.677500    3.082500  ...   985.000000    2.000000
>>  max     14.830000    5.800000  ...  1680.000000    2.000000
>>  
>>  [8 rows x 14 columns]

In particular, it is worth focusing on “mean” and “std” first.

From “mean” we learn the average, which makes it possible to judge whether a given value is high or low. This sense is important for a data scientist.

Next, “std” represents the standard deviation, an indicator of how much the data is scattered around the “mean”. For example, “std” will be small if most values lie close to the average.

It should be noted that the variance equals the square of the standard deviation, and the word “variance” may be more common for a data scientist. It is no exaggeration to say that the information in a dataset is contained in its variance. In other words, we cannot get any information if all values are the same. Therefore, it’s okay to delete a variable with zero variance.
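
As a small supplementary check (not part of the original analysis above), we can compute the variance of each column with pandas and look for zero-variance columns that could be dropped:

# Variance of each numeric column
print(df.var())
# Columns with exactly zero variance carry no information and could be dropped
print(df.var()[df.var() == 0].index.tolist())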

Histogram Distribution

Data with variance is worth paying attention to. So let’s actually visualize the distribution of the data.

Seeing is believing!

We can plot histograms with “plt.hist()” in “matplotlib”, a famous library for visualization. The argument “bins” controls the fineness of the plot.

for name in df.columns:
    plt.title(name)
    plt.hist(df[name], bins=50)
    plt.show()

The distribution of the target variable is as follows. You can see that each category has roughly the same amount of data. If the number of samples is biased by category, you should watch out for a drop in accuracy due to class imbalance.
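
To check the class balance numerically as well (a small supplementary check), we can count the samples per class:

# Number of samples in each class of the target variable
print(df['target'].value_counts())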

The distributions of the explanatory variables are below. We can see the difference in variance between the explanatory variables.

Summary

We have seen how to perform EDA briefly. The purpose of EDA is to properly identify the nature of the dataset. Proper EDA can make it possible to explore the next step effectively, e.g. feature engineering and modeling methods.

In the case of classification problems, principal component analysis can be considered as a deeper analysis method. I will introduce it in another post.

Streamlit Tutorial Book has been published on Amazon Kindle

I have published a book that serves as a tutorial for Streamlit: “Tutorial of a Deployment of a Web app by Python and Streamlit for a Data Scientist”.

This new book is registered on Kindle Unlimited, so any member can read it!!

Features of this book

  • For beginners of Streamlit
  • Written with simple explanations in mind
  • All with sample code
  • Introducing data analysis as a web application as an example

What is Streamlit?

Streamlit is a wonderful library that makes it easier and faster to build a web app for your data science project. With Streamlit, we can easily convert a Python script into a web app. In other words, we can publish our data analyses as a web app.

Articles about Streamlit have been posted on this blog in the past; the book was created from them, with detailed explanations added. If you want to study everything at once, please check it out!

Standardization by scikit-learn in Python

Scaling variables is important, especially for regression analysis. Roughly speaking, analysis without scaling means that the analysis is performed without considering the units of the variables.

In this blog, we will see the process of standardization, one of the basic scaling methods, which is practically indispensable for regression analysis.

Why standardization is needed

For example, the coefficients of a regression model change when the scale of the variables changes. This means that the magnitudes of the regression coefficients are not comparable if the scales are different.

Here, as an example, we consider a regression model that mixes the length units “m” and “cm”. This is an example of not considering the units.

More specifically, suppose you have the following formula in meters.

$$y = a_{m}x_{m}.$$

When the above formula is converted from meters to centimeters, it becomes as follows.

$$\begin{eqnarray*}
y
&&= \left( \frac{a_{m}}{100}\right) \times (100x_{m}),\\
&&=:a_{cm}x_{cm},
\end{eqnarray*}$$

where $a_{cm}=a_{m}/100$ and $x_{cm}=100x_{m}$.

From the above, we can see that the scales of variables and coefficients change even between equivalent expressions.

In fact, the problem in this case is that the values of the gradients are no longer equivalent. Specifically, consider the following example.

$$\begin{eqnarray*}
\frac{\partial y}{\partial x_{m}} &&= a_{m}.
\end{eqnarray*}$$

And,

$$\begin{eqnarray*}
\frac{\partial y}{\partial x_{cm}} &&= a_{cm}.
\end{eqnarray*}$$

From the fact that $a_{m} \neq a_{cm}(a_{m}=100a_{cm})$,

$$\begin{eqnarray*}
\frac{\partial y}{\partial x_{m}}
\neq
\frac{\partial y}{\partial x_{cm}}.
\end{eqnarray*}$$

Namely, one step based on the gradient is not equivalent. This fact may cause the optimization to fail when optimizing the model with a gradient-based approach (e.g., the steepest descent method).

Mathematically speaking, it means that the gradient space is distorted. But here, I will just say that one step should always be equivalent.

Therefore, standardization is needed.

Definition of Standardization

Standardization is the operation of converting the mean and standard deviation of data into 0 and 1, respectively.

The mathematical definition is as follows.

$$\begin{eqnarray*}
\tilde{x}=
\frac{x-\mu}{\sigma}
,
\end{eqnarray*}$$

where $\mu$ and $\sigma$ are the mean and the standard deviation, respectively.

How to Standardize in Python?

It’s easy. By using scikit-learn, standardization can be done in just 4 lines!

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
data_std = scaler.transform(data)

The code description is as follows.

1. Import “StandardScaler” from “sklearn.preprocessing” in scikit-learn
2. Create the instance for standardization with the “StandardScaler()” class
3. Calculate the mean and the standard deviation of “data” with “scaler.fit()”
4. Convert “data” into the standardized data “data_std” with “scaler.transform()”
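
As a quick sanity check (a minimal sketch, assuming “data” is a numeric NumPy array or DataFrame), we can confirm that the standardized data has zero mean and unit standard deviation:

import numpy as np

# After standardization, each column should have mean ~0 and standard deviation ~1
print(np.mean(data_std, axis=0))  # approximately 0 for every column
print(np.std(data_std, axis=0))   # approximately 1 for every column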

Summary

The above four-line snippet is used often, so this post can serve as a cheat sheet. However, the important thing is to understand why standardization is needed.

Sometimes people compare the values of unscaled regression coefficients to judge the magnitude of each variable’s contribution. However, anyone who understands the meaning of standardization can tell that this is wrong!

Standardization is so popular that the author hopes this post helps readers a little.