Brief Introduction of Descriptive Statistics

Descriptive statistics carry important information because they summarize a dataset. For example, from descriptive statistics we can learn the scale, the variation, and the minimum and maximum values. With this information, you can get a sense of whether a particular value is large or small, or whether it deviates greatly from the average.

In this post, we will look at the descriptive statistics together with their definitions. Understanding descriptive statistics not only helps you develop a feel for a dataset but is also useful for understanding how to preprocess it.

The complete notebook can be found on GitHub.

Dataset

Here, we use the Boston house prices dataset to calculate descriptive statistics such as the mean, variance, and standard deviation. We adopt this dataset because it can be loaded so easily from the scikit-learn library.

The code for using the dataset as Pandas DataFrame is as follows.

import numpy as np              ##-- Numpy
import pandas as pd             ##-- Pandas
import sklearn                  ##-- Scikit-learn
import matplotlib.pylab as plt  ##-- Matplotlib

from sklearn.datasets import load_boston
dataset = load_boston()

df = pd.DataFrame(dataset.data)
df.columns = dataset.feature_names
df["PRICES"] = dataset.target
df.head()

>>         CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  PRICES
>>  0   0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98    24.0
>>  1   0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14    21.6
>>  2   0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03    34.7
>>  3   0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94    33.4
>>  4   0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33    36.2

The details of this dataset are introduced in another post. In this post, let's calculate the mean, the variance, and the standard deviation of df["PRICES"], the housing prices.

Mean

The mean $\mu$ is the average of the data. It must be one of the most familiar concepts. Still, the mean is important because it gives us a sense of whether a particular value is large or small. Such a feeling is important for a data scientist.

Now, assuming that the $N$ data points are $x_{1}$, $x_{2}$, …, $x_{N}$, the mean $\mu$ is defined by the following formula.
$$\begin{eqnarray*}
\mu
=
\frac{1}{N}
\sum^{N}_{i=1}
x_{i}
,
\end{eqnarray*}$$
where $x_{i}$ is the value of $i$-th data.

It may look a little difficult when expressed in mathematical symbols. However, as you know, we just sum all the data and divide by the number of data points. Once we have defined the mean, we can define the variance.

Then, let's calculate the mean of each column. We can easily do so with the "mean()" method.

df.mean()

>>  CRIM         3.613524
>>  ZN          11.363636
>>  INDUS       11.136779
>>  CHAS         0.069170
>>  NOX          0.554695
>>  RM           6.284634
>>  AGE         68.574901
>>  DIS          3.795043
>>  RAD          9.549407
>>  TAX        408.237154
>>  PTRATIO     18.455534
>>  B          356.674032
>>  LSTAT       12.653063
>>  PRICES      22.532806
>>  dtype: float64

When you want the mean value of just one column, for example "PRICES", the code is as follows.

df["PRICES"].mean()

>>  22.532806324110677
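
We can also verify this value directly from the definition: summing the column and dividing by the number of data points gives the same result (a minimal check, reusing "df" from above).

##-- Verify the definition of the mean: the sum of the data divided by N
df["PRICES"].sum() / len(df["PRICES"])   ##-- matches df["PRICES"].mean() up to floating-point error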

Variance

The variance $\sigma^{2}$ reflects the dispersion of data from the mean value. The definition is as follows.

$$\begin{eqnarray*}
\sigma^{2}
=
\frac{1}{N}
\sum^{N}_{i=1}
\left(
x_{i} - \mu
\right)^{2},
\end{eqnarray*}$$

where $N$, $x_{i}$, and $\mu$ are the number of the data, the value of $i$-th data, and the mean of $x$, respectively.

Expressed in words, the variance is the mean of the squared deviations from the mean of the data. It is no exaggeration to say that the information in the data lives in its variance! In other words, a variable with ZERO variance is not worth paying attention to.

For example, let's consider predicting math skills from exam scores. The exam scores of three people (A, B, and C) are shown in the table below.

Person  Math  Physics  Chemistry
A        100       90         60
B         60       70         60
C         20       40         60
The exam scores for each subject

From the above table, we can clearly see that those who are good at physics are also good at math. On the other hand, it is impossible to infer whether a person is good at mathematics from the chemistry scores, because all three chemistry scores equal the average score of 60. Namely, the variance of chemistry is ZERO!! This fact indicates that the chemistry scores carry no information and are not worth paying attention to. We should drop the "Chemistry" column from the dataset when analyzing! This is one example of data preprocessing.
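
We can reproduce this small example in code. The DataFrame below is constructed here only for illustration and is not part of the Boston dataset. Note that Pandas' "var()" divides by N-1 by default, so we pass "ddof=0" to match the 1/N definition above; either way, the variance of "Chemistry" is zero.

scores = pd.DataFrame({"Math":      [100, 60, 20],
                       "Physics":   [ 90, 70, 40],
                       "Chemistry": [ 60, 60, 60]},
                      index=["A", "B", "C"])
scores.var(ddof=0)   ##-- "Chemistry" has zero variance: no information

>>  Math         1066.666667
>>  Physics       422.222222
>>  Chemistry       0.000000
>>  dtype: float64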

Then, let's calculate the variance of each column of the Boston house prices dataset. We can easily do so with the "var()" method.

df.var()

>>  CRIM          73.986578
>>  ZN           543.936814
>>  INDUS         47.064442
>>  CHAS           0.064513
>>  NOX            0.013428
>>  RM             0.493671
>>  AGE          792.358399
>>  DIS            4.434015
>>  RAD           75.816366
>>  TAX        28404.759488
>>  PTRATIO        4.686989
>>  B           8334.752263
>>  LSTAT         50.994760
>>  PRICES        84.586724
>>  dtype: float64

Standard Deviation

The standard deviation $\sigma$ is defined as the square root of the variance, as follows.

$$\begin{eqnarray*}
\sigma
=
\sqrt{
\frac{1}{N}
\sum^{N}_{i=1}
\left(
x_{i} - \mu
\right)^{2}
},
\end{eqnarray*}$$

where $N$, $x_{i}$, and $\mu$ are the number of the data, the value of $i$-th data, and the mean of $x$, respectively.

Why do we introduce the standard deviation in addition to the variance? Because the standard deviation $\sigma$ has the same unit as the original data, we can directly interpret $\sigma$ as the typical variation from the mean.

Then, let's calculate the standard deviation of each column of the Boston house prices dataset. We can easily do so with the "std()" method.

df.std()

>> CRIM         8.601545
>> ZN          23.322453
>> INDUS        6.860353
>> CHAS         0.253994
>> NOX          0.115878
>> RM           0.702617
>> AGE         28.148861
>> DIS          2.105710
>> RAD          8.707259
>> TAX        168.537116
>> PTRATIO      2.164946
>> B           91.294864
>> LSTAT        7.141062
>> PRICES       9.197104
>> dtype: float64
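
As a consistency check, the standard deviation is just the square root of the variance, so the following sketch should reproduce "df.std()" up to floating-point error.

##-- The square root of the variance equals the standard deviation
np.sqrt(df.var())   ##-- same values as df.std()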

In fact, once these three concepts are defined, we can define the Gaussian distribution. However, I'll introduce the Gaussian distribution in another post.

Other descriptive statistics can be calculated in the same way. The methods you will use most often are listed below.

Method   Description
mean     Average value
var      Variance value
std      Standard deviation value
min      Minimum value
max      Maximum value
median   Median value, the value at the center of the data
sum      Total value
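
For example, a few of these methods applied to the "PRICES" column look like the following (a minimal sketch; the other methods work the same way).

print(df["PRICES"].min())      ##-- Minimum price: 5.0
print(df["PRICES"].max())      ##-- Maximum price: 50.0
print(df["PRICES"].median())   ##-- Median price: 21.2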

Confirm all at once

Pandas has the useful method "describe()", which computes the basic descriptive statistics all at once. The "describe()" method is very convenient as a starting point.

df.describe()

>>         CRIM        ZN          INDUS       CHAS        NOX         RM          AGE         DIS         RAD         TAX         PTRATIO     B           LSTAT       PRICES
>>  count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000
>>  mean     3.613524   11.363636   11.136779    0.069170    0.554695    6.284634   68.574901    3.795043    9.549407  408.237154   18.455534  356.674032   12.653063  22.532806
>>  std      8.601545   23.322453    6.860353    0.253994    0.115878    0.702617   28.148861    2.105710    8.707259  168.537116    2.164946   91.294864    7.141062  9.197104
>>  min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000    2.900000    1.129600    1.000000  187.000000   12.600000    0.320000    1.730000  5.000000
>>  25%      0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   45.025000    2.100175    4.000000  279.000000   17.400000  375.377500    6.950000  17.025000
>>  50%      0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   77.500000    3.207450    5.000000  330.000000   19.050000  391.440000   11.360000  21.200000
>>  75%      3.677083   12.500000   18.100000    0.000000    0.624000    6.623500   94.075000    5.188425   24.000000  666.000000   20.200000  396.225000   16.955000  25.000000
>>  max     88.976200  100.000000   27.740000    1.000000    0.871000    8.780000  100.000000   12.126500   24.000000  711.000000   22.000000  396.900000   37.970000  50.000000

Note that,

"count": the number of data points in each column
"25%": the value at the 25% position of the data (the first quartile)
"50%": the value at the 50% position of the data, i.e., the median
"75%": the value at the 75% position of the data (the third quartile)
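
These percentile rows correspond to the "quantile()" method in Pandas. For example, the following sketch reproduces the 75% value for "PRICES" shown by "describe()".

df["PRICES"].quantile(0.75)

>>  25.0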

Summary

We have seen a brief explanation of the basic descriptive statistics and how to calculate them. Understanding descriptive statistics is essential to understanding a dataset. Keep in mind that the information in a dataset is captured by its descriptive statistics.

The author hopes this blog helps readers a little.

Step-by-step guide of Linear Regression for Boston House Prices dataset

Linear regression is one of the basic techniques for machine learning analyses. As you may know, other methods are often superior to linear regression in terms of prediction accuracy. However, linear regression has the advantage that the model is simple and highly interpretable.

For a data scientist, understanding a dataset is highly important. Therefore, linear regression plays a powerful role as a first step toward understanding a dataset.

In this post, we will walk through a linear regression analysis of the Boston house prices dataset. The author will explain it as a step-by-step guide!!

What is a Linear Regression?

Linear regression assumes a linear relationship between a target variable and the independent variables. If this assumption represents your dataset well, you can expect a proportional relationship between the independent variables and the target variable.

In mathematical terms, a linear regression is expressed as follows.

$$y = \omega_{0} + \omega_{1}x_{1} + \omega_{2}x_{2} + \ldots + \omega_{N}x_{N},$$

where $y$, $x_{i}$, and $\omega_{i}$ are a target variable, an independent variable, and a coefficient, respectively.

The details of the theory are explained in another post below.

Also, for reference, another post provides an example of linear regression with short code.

From here, let’s perform a linear regression analysis on the Boston house prices dataset!

Prepare the Dataset

In this analysis, we adopt the Boston house prices dataset, one of the famous open datasets, published by the StatLib library maintained at Carnegie Mellon University. We choose it because it is so easy to use: we can load it from the scikit-learn library without downloading any file.

from sklearn.datasets import load_boston
dataset = load_boston()

The details of the Boston house prices dataset are introduced in another post, but you can follow the analysis below without referring to it.

Confirm the Dataset as Pandas DataFrame

Here, we get 3 types of data from “dataset”, described below, as the Pandas DataFrame.

dataset.data: values of the explanatory variables
dataset.target: values of the target variable (house prices)
dataset.feature_names: the column names

import pandas as pd

f = pd.DataFrame(dataset.data)
f.columns = dataset.feature_names
f["PRICES"] = dataset.target
f.head()

>>         CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  PRICES
>>  0   0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98    24.0
>>  1   0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14    21.6
>>  2   0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03    34.7
>>  3   0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94    33.4
>>  4   0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33    36.2

Let’s try to check the correlation between only “PRICES” and “TAX”.

import matplotlib.pylab as plt  #-- "Matplotlib" for Plotting

f.plot(x="TAX", y="PRICES", style="o")
plt.ylabel("PRICES")
plt.show()

At first glance, there seems to be no simple proportional relationship. The EDA (exploratory data analysis) for this dataset, including the other variables, is introduced in another post.

Pick up the Variables we use

Explicitly define the variables to extract from the data frame.

TargetName = "PRICES"
FeaturesName = [\
              #-- "Crime occurrence rate per unit population by town"
              "CRIM",\
              #-- "Percentage of 25000-squared-feet-area house"
              'ZN',\
              #-- "Percentage of non-retail land area by town"
              'INDUS',\
              #-- "Index for Charlse river: 0 is near, 1 is far"
              'CHAS',\
              #-- "Nitrogen compound concentration"
              'NOX',\
              #-- "Average number of rooms per residence"
              'RM',\
              #-- "Percentage of buildings built before 1940"
              'AGE',\
              #-- 'Weighted distance from five employment centers'
              "DIS",\
              ##-- "Index for easy access to highway"
              'RAD',\
              ##-- "Tax rate per $100,000"
              'TAX',\
              ##-- "Percentage of students and teachers in each town"
              'PTRATIO',\
              ##-- "1000(Bk - 0.63)^2, where Bk is the percentage of Black people"
              'B',\
              ##-- "Percentage of low-class population"
              'LSTAT',\
              ]

Get from the data frame into “X” and “Y”.

X = f[FeaturesName]
Y = f[TargetName]

Standardize the Variables

Since the scales of the numerical variables differ, we should standardize them.

Mathematically, the standardization transform is defined as follows.

$$\begin{eqnarray*}
\tilde{x}=
\frac{x-\mu}{\sigma}
,
\end{eqnarray*}$$

where $\mu$ and $\sigma$ are the mean and the standard deviation, respectively.

The scikit-learn code is just four lines, as follows.

from sklearn import preprocessing
sscaler = preprocessing.StandardScaler()
sscaler.fit(X)
X_std = sscaler.transform(X)
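
As a quick check (a sketch, reusing "X_std" from above), each standardized column should now have a mean of approximately 0 and a standard deviation of approximately 1.

print(X_std.mean(axis=0))   ##-- approximately 0 for every column
print(X_std.std(axis=0))    ##-- approximately 1 for every column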

Regarding standardization, the details are explained in another post. Standardization is an important preprocessing step for numerical variables. If you are not familiar with it, the author recommends checking the details once.

Split the Dataset

Here, we split the dataset into train data and test data. Why do we have to split? Because we must evaluate the generalization performance of the model on unknown data.

This idea makes sense because our purpose is to predict new data.

Then, let’s split the dataset. Of course, it is easy with scikit-learn!

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_std, Y, test_size=0.2, random_state=99)

We pass the dataset "(X_std, Y)" to the "train_test_split()" function. The ratio of train data to test data is controlled by the argument "test_size". Here, the ratio is set to "8:2". And "random_state" is set for reproducibility. You can use any number. The author often uses "99" because "99" is my favorite NFL player's uniform number!
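
As a quick check of the split (a sketch, reusing the variables above), the shapes should reflect the 8:2 ratio of the 506 samples.

print(X_train.shape, X_test.shape)   ##-- e.g. (404, 13) and (102, 13)
print(Y_train.shape, Y_test.shape)   ##-- e.g. (404,) and (102,)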

At this point, data preparation and preprocessing are fully completed!
Finally, we can perform the linear regression analysis!

Create an Instance for Linear Regression

Here, let's create the model for linear regression. We can do this with just three lines of code. The role of each line is as follows.

1. Import the “LinearRegression()” function from scikit-learn
2. Create the model as an instance “regressor” by “LinearRegression()”
3. Train the model “regressor” with train data “(X_train, Y_train)”

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
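
After fitting, the learned weights correspond to the coefficients $\omega_{i}$ in the formula above. A minimal sketch to inspect them (reusing "FeaturesName" from above):

print(regressor.intercept_)                        ##-- the constant term
print(dict(zip(FeaturesName, regressor.coef_)))    ##-- one coefficient per explanatory variable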

Predict the train and test data

To check the performance of the model, we get the predicted values for the train and test data.

y_pred_train = regressor.predict(X_train)
y_pred_test = regressor.predict(X_test)

Then, let’s visualize the result by matplotlib.

import seaborn as sns

plt.figure(figsize=(5, 5), dpi=100)
sns.set()
plt.xlabel("PRICES")
plt.ylabel("Predicted PRICES")
plt.xlim(0, 60)
plt.ylim(0, 60)
plt.scatter(Y_train, y_pred_train, lw=1, color="r", label="train data")
plt.scatter(Y_test, y_pred_test, lw=1, color="b", label="test data")
plt.legend()
plt.show()

In the above figure, the red and blue circles show the results for the train and test data, respectively. We can see that the prediction accuracy decreases as the price increases.

Here, we check the $R^{2}$ score, the coefficient of determination. $R^{2}$ is an index of how well the model fits the dataset. When $R^{2}$ is close to $1$, the model accuracy is good; conversely, when $R^{2}$ approaches $0$, the model accuracy is poor.
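
For reference, the usual definition of the coefficient of determination is

$$\begin{eqnarray*}
R^{2}
=
1 -
\frac{\sum^{N}_{i=1}\left(y_{i} - \hat{y}_{i}\right)^{2}}
{\sum^{N}_{i=1}\left(y_{i} - \bar{y}\right)^{2}},
\end{eqnarray*}$$

where $y_{i}$, $\hat{y}_{i}$, and $\bar{y}$ are the observed value, the predicted value, and the mean of the observed values, respectively.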

We can calculate $R^{2}$ by the “r2_score()” function in scikit-learn.

from sklearn.metrics import r2_score
R2 = r2_score(Y_test, y_pred_test)
R2

>>  0.6674690355194665

The score $0.67$ is not bad, but also not good.

How to Improve the Score?

Here, one easy way to improve the score is introduced: convert the target variable "PRICES" to a logarithmic scale. Converting to a logarithmic scale reduces the influence of errors in the high "PRICES" range, and reducing the influence of those errors leads to a better model. Logarithmic conversion is a simple and often effective technique that is worth remembering.

Then, let’s try!

First, convert the target variable "PRICES" to a logarithmic scale.

import numpy as np

##-- Logarithmic scaling
Y_log = np.log(Y)

Next, we split the dataset again.

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_std, Y_log, test_size=0.2, random_state=99)

And, retrain the model and predict again.

regressor.fit(X_train, Y_train)
y_pred_train = regressor.predict(X_train)
y_pred_test = regressor.predict(X_test)

Plot the result again as follows. Note that the predicted values are on the logarithmic scale, so they must be converted back by "np.exp()" when plotting.

import numpy as np

plt.figure(figsize=(5, 5), dpi=100)
sns.set()
plt.xlabel("PRICES")
plt.ylabel("Predicted PRICES")
plt.xlim(0, 60)
plt.ylim(0, 60)
plt.scatter(np.exp(Y_train), np.exp(y_pred_train), lw=1, color="r", label="train data")
plt.scatter(np.exp(Y_test), np.exp(y_pred_test), lw=1, color="b", label="test data")
plt.legend()
plt.show()

It may be hard to see the improvement from the figure, but when you compare $R^{2}$, you can see that it has improved clearly.

R2 = r2_score(Y_test, y_pred_test)
R2

>>  0.7531747761424288

$R^{2}$ has improved from 0.67 to 0.75!

Summary

We have seen how to perform the linear regression analysis against the Boston house prices dataset. The basic approach to regression analysis is as described here. So, we can apply this approach to other datasets.

Note that the important thing is to understand the dataset well; this makes it possible to perform an analysis that reflects its essence.

Certainly, there are several methods that can be expected to be more accurate, such as random forests and neural networks. However, linear regression analysis can be a good first step toward understanding a dataset more deeply.

The author hopes this blog helps readers a little.

Brief EDA for Boston House Prices Dataset

Exploratory data analysis (EDA) is one of the most important processes in data analysis. This process is often neglected because it is largely invisible in the final code. However, without appropriate EDA, there is no success.

Understanding the nature of the dataset.

This is the purpose of EDA. With that understanding, you will be able to effectively select models and perform feature engineering. This is why EDA is the first step in data analysis.

In this post, we will see the basic skills of EDA, using the well-known open Boston house prices dataset as an example.

Prepare the Dataset

For performing EDA, we adopt the Boston house prices dataset, an open dataset for regression analysis. The details of this dataset are introduced in another post.

Import Library

##-- Numpy
import numpy as np
##-- Pandas
import pandas as pd
##-- Scikit-learn
import sklearn
##-- Matplotlib
import matplotlib.pylab as plt
##-- Seaborn
import seaborn as sns

Load the Boston House Prices Dataset from Scikit-learn

It is easy to load this dataset from scikit-learn. Just 2 lines!

from sklearn.datasets import load_boston
dataset = load_boston()

The values of the dataset are stored in the variable "dataset". Note that "dataset" stores several kinds of information, i.e., the explanatory-variable values, the target-variable values, and the column names of the explanatory variables. Therefore, we have to take them out separately as follows.

dataset.data: values of the explanatory variables
dataset.target: values of the target variable (house prices)
dataset.feature_names: the column names

For convenience, we obtain the above data as a Pandas DataFrame. Pandas is very useful for table-type data.

Here, let’s put all the data together into one Pandas DataFrame “f”.

f = pd.DataFrame(dataset.data)
f.columns = dataset.feature_names
f["PRICES"] = dataset.target
f.head()

>>         CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  PRICES
>>  0   0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98    24.0
>>  1   0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14    21.6
>>  2   0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03    34.7
>>  3   0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94    33.4
>>  4   0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33    36.2

From here, we perform the EDA and understand the dataset!

Summary information

First, we should look at the entire dataset. Going from the whole to the details, in that order, is important. The "info()" method in Pandas makes it easy to confirm the column names, the non-null counts, and the data types.

f.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  PRICES   506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB

Missing Values

We have already checked how many non-null values there are. Here, conversely, we check how many missing values there are. We can do this with the combination of the "isnull()" and "sum()" methods in Pandas.

f.isnull().sum()

>>  CRIM       0
>>  ZN         0
>>  INDUS      0
>>  CHAS       0
>>  NOX        0
>>  RM         0
>>  AGE        0
>>  DIS        0
>>  RAD        0
>>  TAX        0
>>  PTRATIO    0
>>  B          0
>>  LSTAT      0
>>  PRICES     0
>>  dtype: int64

Fortunately, there are no missing values! This is because this dataset was created carefully. Note, however, that a real dataset usually has many such problems we have to deal with.

Basic Descriptive Statistics Value

We can calculate the basic descriptive statistics with just one line!

f.describe()

>>               CRIM          ZN       INDUS        CHAS         NOX          RM         AGE         DIS         RAD         TAX     PTRATIO           B       LSTAT      PRICES
>>  count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000
>>   mean    3.613524   11.363636   11.136779    0.069170    0.554695    6.284634   68.574901    3.795043    9.549407  408.237154   18.455534  356.674032   12.653063   22.532806
>>    std    8.601545   23.322453    6.860353    0.253994    0.115878    0.702617   28.148861    2.105710    8.707259  168.537116    2.164946   91.294864    7.141062    9.197104
>>    min    0.006320    0.000000    0.460000    0.000000    0.385000    3.561000    2.900000    1.129600    1.000000  187.000000   12.600000    0.320000    1.730000    5.000000
>>    25%    0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   45.025000    2.100175    4.000000  279.000000   17.400000  375.377500    6.950000   17.025000
>>    50%    0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   77.500000    3.207450    5.000000  330.000000   19.050000  391.440000   11.360000   21.200000
>>    75%    3.677083   12.500000   18.100000    0.000000    0.624000    6.623500   94.075000    5.188425   24.000000  666.000000   20.200000  396.225000   16.955000   25.000000
>>    max   88.976200  100.000000   27.740000    1.000000    0.871000    8.780000  100.000000   12.126500   24.000000  711.000000   22.000000  396.900000   37.970000   50.000000

Although each value is important, I think "mean" and "std" are worth focusing on first.

From "mean" we know the average, which makes it possible to judge whether a particular value is relatively high or low. This sense is important for a data scientist.

Next, "std" represents the standard deviation, an indicator of how much the data are scattered around "mean". For example, "std" will be small if every value is close to the average. Note that the variance equals the square of the standard deviation, and the word "variance" may be more common for a data scientist. It is no exaggeration to say that the information in a dataset is contained in its variance. For instance, if all values in "AGE" were 30, the column would carry no information and would not be worth paying attention to!

Histogram Distribution

Data with variance is the data that is worth paying attention to. So let’s actually visualize the distribution of the data. Seeing is believing!

We can plot histograms with "plt.hist()" in "matplotlib", a famous library for visualization. The argument "bins" controls the fineness of the plot.

for name in f.columns:
    plt.title(name)
    plt.hist(f[name], bins=50)
    plt.show()

The distribution of "PRICES", the target variable, is below. On the right side, we can see some very high-priced houses. Note that including such high-priced data may make the prediction accuracy worse.

The distributions of the explanatory variables are below. We can see the difference in variance between the explanatory variables.

From the above figure, we can infer that the values of "CHAS" and "RAD" are NOT continuous. Generally, such non-continuous data is called a categorical variable.

Be careful when handling categorical variables, because the magnitude relationship itself has no meaning. For example, when a condominium and a house are represented by 0 and 1, respectively, there is no essential meaning in the magnitude relationship (0 < 1).

For the above reasons, let’s check the categorical variables individually.

We can easily confirm the number of occurrences of each unique value with the "value_counts()" method in Pandas. The first column shows the unique values, and the second column shows their counts.

f['RAD'].value_counts()

>>  24.0    132
>>  5.0     115
>>  4.0     110
>>  3.0      38
>>  6.0      26
>>  8.0      24
>>  2.0      24
>>  1.0      20
>>  7.0      17
>>  Name: RAD, dtype: int64

It is also important to check the data visually. It is easy to visualize the counts of unique values for each column.

f['CHAS'].value_counts().plot.bar(title="CHAS")
plt.show()
f['RAD'].value_counts().plot.bar(title="RAD")
plt.show()

Correlation of Variables

Here, we confirm the correlation between variables. Correlation is important because, basically, a higher correlation with the target indicates a higher contribution. Conversely, you have the option of removing variables that contribute little; dropping such variables reduces the risk of overfitting.

We can easily calculate the correlation matrix between the variables (columns) with the "corr()" method in Pandas.

f.corr()


             CRIM        ZN     INDUS      CHAS       NOX        RM       AGE       DIS       RAD       TAX   PTRATIO         B     LSTAT    PRICES
   CRIM  1.000000 -0.200469  0.406583 -0.055892  0.420972 -0.219247  0.352734 -0.379670  0.625505  0.582764  0.289946 -0.385064  0.455621 -0.388305
     ZN -0.200469  1.000000 -0.533828 -0.042697 -0.516604  0.311991 -0.569537  0.664408 -0.311948 -0.314563 -0.391679  0.175520 -0.412995  0.360445
  INDUS  0.406583 -0.533828  1.000000  0.062938  0.763651 -0.391676  0.644779 -0.708027  0.595129  0.720760  0.383248 -0.356977  0.603800 -0.483725
   CHAS -0.055892 -0.042697  0.062938  1.000000  0.091203  0.091251  0.086518 -0.099176 -0.007368 -0.035587 -0.121515  0.048788 -0.053929  0.175260
    NOX  0.420972 -0.516604  0.763651  0.091203  1.000000 -0.302188  0.731470 -0.769230  0.611441  0.668023  0.188933 -0.380051  0.590879 -0.427321
     RM -0.219247  0.311991 -0.391676  0.091251 -0.302188  1.000000 -0.240265  0.205246 -0.209847 -0.292048 -0.355501  0.128069 -0.613808  0.695360
    AGE  0.352734 -0.569537  0.644779  0.086518  0.731470 -0.240265  1.000000 -0.747881  0.456022  0.506456  0.261515 -0.273534  0.602339 -0.376955
    DIS -0.379670  0.664408 -0.708027 -0.099176 -0.769230  0.205246 -0.747881  1.000000 -0.494588 -0.534432 -0.232471  0.291512 -0.496996  0.249929
    RAD  0.625505 -0.311948  0.595129 -0.007368  0.611441 -0.209847  0.456022 -0.494588  1.000000  0.910228  0.464741 -0.444413  0.488676 -0.381626
    TAX  0.582764 -0.314563  0.720760 -0.035587  0.668023 -0.292048  0.506456 -0.534432  0.910228  1.000000  0.460853 -0.441808  0.543993 -0.468536
PTRATIO  0.289946 -0.391679  0.383248 -0.121515  0.188933 -0.355501  0.261515 -0.232471  0.464741  0.460853  1.000000 -0.177383  0.374044 -0.507787
      B -0.385064  0.175520 -0.356977  0.048788 -0.380051  0.128069 -0.273534  0.291512 -0.444413 -0.441808 -0.177383  1.000000 -0.366087  0.333461
  LSTAT  0.455621 -0.412995  0.603800 -0.053929  0.590879 -0.613808  0.602339 -0.496996  0.488676  0.543993  0.374044 -0.366087  1.000000 -0.737663
 PRICES -0.388305  0.360445 -0.483725  0.175260 -0.427321  0.695360 -0.376955  0.249929 -0.381626 -0.468536 -0.507787  0.333461 -0.737663  1.000000

As shown above, it is easy to calculate the correlation matrix; however, it is difficult to grasp the tendency from the raw numbers alone.

In such a case, let's utilize the heat map function included in "seaborn", a library for plotting. Then, we can confirm the correlations visually.
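
A minimal sketch (reusing "f" and the imports above) might look like the following; the figure size, color map, and annotation options are arbitrary choices, not requirements.

plt.figure(figsize=(10, 8), dpi=100)
sns.heatmap(f.corr(), annot=True, fmt=".2f", cmap="coolwarm", square=True)
plt.show()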

Here, we focus on the relationship between "PRICES" and the other variables ("CRIM", "ZN", "INDUS", etc.). We can clearly see that "RM" and "LSTAT" are highly correlated with "PRICES".
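
To look at the same thing numerically, one option (a sketch) is to sort the "PRICES" column of the correlation matrix; "RM" appears at the positive end and "LSTAT" at the negative end.

f.corr()["PRICES"].sort_values(ascending=False)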

Summary

So far, we have seen how to perform a brief EDA. The purpose of EDA is to properly identify the nature of the dataset. Proper EDA makes it possible to take the next steps effectively, e.g., feature engineering and choosing a modeling method.

Lambda Function with Pandas

In the previous post, the basics of the lambda function were introduced. In this post, the author introduces a practical situation: lambda functions × Pandas.

What is Pandas?

Pandas, a library for data structures, is known as one of the essential libraries for data analysis, alongside NumPy, SciPy, and scikit-learn. Pandas is designed to handle spreadsheet-like data easily, so we can work with table data flexibly.

Pandas can read and write files in various formats and provides rich methods. These features make it possible to analyze table data efficiently. If you look at a data science competition (e.g., Kaggle), you can see that Pandas is an essential tool for data scientists.

Lambda Function × Pandas

Pandas is used for table data analysis. So, there will be situations where you would like to apply the same manipulation to each element of sequence data (e.g., one column of a table).

That’s exactly where the combination of Pandas and lambda functions comes into play.

Ex. Categorize the Age Group

We first prepare the age list: 18, 50, 28, 78, and 33. Second, we convert the list "age_list" into a Pandas DataFrame with the column name "Age".

import pandas as pd
age_list = [18, 50, 28, 78, 33]
age_list = pd.DataFrame(age_list, columns=["Age"])
print(age_list)

>>    Age
>> 0   18
>> 1   50
>> 2   28
>> 3   78
>> 4   33

Next, we categorize each element of the column "age_list["Age"]". Note that you must define the function for classification beforehand.

Here, we prepare a function to categorize ages into the groups "unknown", "Under 20", "20-40", "41-60", and "Over 60". Note that "unknown" is for invalid inputs such as negative ages.

def categorize_age(x):
  x = int(x)
  if x < 0:
    x = "unknown"
  elif x < 20:
    x = "Under 20"
  elif x <= 40:
    x = "20-40"
  elif x <= 60:
    x = "41-60"
  else:
    x = "Over 60"
  return x

Then, let's apply the above function "categorize_age()" to each element of the column "age_list["Age"]". The result is assigned to the newly created column "Generation".

Note that, to apply it, we use the "apply()" method together with a lambda function.

Syntax: DataFrame[column].apply( lambda x: function(x) )

age_list["Generation"] = age_list["Age"].apply( lambda x: categorize_age(x) )
print(age_list)

>>    Age Generation
>> 0   18   Under 20
>> 1   50      41-60
>> 2   28      20-40
>> 3   78    Over 60
>> 4   33      20-40
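
In this simple case, passing the function directly, "age_list["Age"].apply(categorize_age)", gives the same result. The lambda form becomes more convenient when the transformation is short enough to write inline; the column name below is made up just for illustration.

age_list["AgeNextYear"] = age_list["Age"].apply(lambda x: x + 1)   ##-- a trivial inline lambda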

Summary

In this article, we have seen that a lambda function becomes a powerful tool when used with Pandas. When analyzing table data, you will often need to apply arbitrary processing to each element of a column or row of a Pandas DataFrame.

It is such a time to use a lambda function!