Open Dataset for Regression Analysis

Sooner or later, you will run into a case where a new method has emerged and you want to try it out. Being able to try it quickly is very important.

At such times, an open dataset is very useful.

Of course, there are many open datasets for machine learning. In this post, I introduce one of them for regression analysis: the Boston house prices dataset.

Preparing a dataset is the first step of any data analysis. Let’s walk through the process.

Boston house prices dataset

This dataset is one of the best-known open datasets. It was published in the StatLib library, which is maintained at Carnegie Mellon University.

Fortunately, we can load this dataset very easily with scikit-learn: just import the loader and call it. (Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the code below requires an older version of the library.)

Load the dataset from scikit-learn

from sklearn.datasets import load_boston
dataset = load_boston()
data, target, feature_names = dataset.data, dataset.target, dataset.feature_names

The first line imports the loader function “load_boston” from the “sklearn.datasets” module. This module includes many other datasets; you can find detailed information in the scikit-learn reference.

The second line calls the “load_boston()” function and assigns the returned dataset to the variable “dataset”.

The variable “dataset” holds several kinds of information, so the third line unpacks each piece into its own variable.
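It helps to know what kind of object load_boston() actually returns: a scikit-learn “Bunch”, which is essentially a dictionary whose keys can also be read as attributes. The minimal stand-in class below is a hypothetical sketch of that behaviour (it is not the real scikit-learn class, and the values are toy numbers, not the real data):

```python
# A minimal stand-in for scikit-learn's Bunch: a dict whose keys
# are also readable as attributes (hypothetical, for illustration).
class Bunch(dict):
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError as err:
            raise AttributeError(key) from err

# Toy values standing in for what load_boston() returns.
dataset = Bunch(
    data=[[0.00632, 18.0], [0.02731, 0.0]],
    target=[24.0, 21.6],
    feature_names=["CRIM", "ZN"],
)

# Attribute access and key access give the same object,
# which is why dataset.data, dataset.target, etc. work.
print(dataset.data is dataset["data"])  # True
```

This is why the third line of the loading code can simply read the attributes off the returned object.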

A brief explanation of the dataset

As the code above shows, each piece of information is available as an attribute: “dataset.data”, “dataset.target”, and “dataset.feature_names”. Their meanings are as follows:

dataset.data: array of the explanatory variables
dataset.target: house prices
dataset.feature_names: names of the explanatory variables

You can also see a full description of the dataset as follows.

dataset.DESCR

Since the output of the above line is quite long, I will not post it here…
Please check it out for yourself.

Contents of the Variables

Here, let’s confirm the contents of the variables.

data (dataset.data)
The values of the explanatory variables are stored as a NumPy array.
Its shape is (506, 13): there are 506 samples, each with 13 feature columns.

data.shape

>>  (506, 13)
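To make this shape concrete: the first number counts rows (one per house) and the second counts columns (one per explanatory variable). The sketch below computes the same two numbers by hand from a plain nested list, used here as a toy stand-in for the real array:

```python
# Toy stand-in for dataset.data: 2 rows (samples) x 3 columns (features).
toy = [
    [0.00632, 18.0, 2.31],
    [0.02731, 0.0, 7.07],
]

# (number of rows, number of columns) -- what .shape reports for an array.
shape = (len(toy), len(toy[0]))
print(shape)  # (2, 3), analogous to (506, 13) for the real dataset
```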

If you check the contents, you will find the following.

data

>>  array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,


target (dataset.target)
The prices of the Boston houses, i.e., the target variable for supervised learning.
Since the number of samples is 506, as we saw above, its shape is (506,).

target.shape

>> (506,)
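In supervised learning, row i of “data” corresponds to entry i of “target”: the 13 feature values describe a house and the target is its price. A quick sketch of that pairing, again with toy values rather than the real dataset:

```python
data = [[0.00632, 18.0], [0.02731, 0.0]]  # toy explanatory variables
target = [24.0, 21.6]                     # toy house prices

# zip pairs row i of `data` with element i of `target`,
# which is exactly the (features, label) pairing a regressor learns from.
pairs = list(zip(data, target))
print(pairs[0])  # ([0.00632, 18.0], 24.0)
```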


feature_names (dataset.feature_names)
The name of each column of “data (dataset.data)”.

feature_names

>>  array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

The meaning of each feature name is explained in the scikit-learn user guide.

“CRIM”: per capita crime rate by town
“ZN”: proportion of residential land zoned for lots over 25,000 sq.ft.
“INDUS”: proportion of non-retail business acres per town
“CHAS“: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
“NOX”: nitric oxides concentration (parts per 10 million)
“RM”: average number of rooms per dwelling
“AGE”: proportion of owner-occupied units built prior to 1940
“DIS”: weighted distances to five Boston employment centres
“RAD”: index of accessibility to radial highways
“TAX”: full-value property-tax rate per $10,000
“PTRATIO”: pupil-teacher ratio by town
“B”: 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
“LSTAT”: % lower status of the population
“MEDV”: Median value of owner-occupied homes in $1000’s

Cited from the scikit-learn user guide. (Note that “MEDV” is the target variable, dataset.target; it is not one of the 13 feature columns.)
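With the feature names in hand, a common next step is to pull out one explanatory variable by name. On the real NumPy array this would look like data[:, list(feature_names).index("RM")]; the stdlib-only sketch below shows the same idea with a shortened name list and toy values:

```python
feature_names = ["CRIM", "ZN", "RM"]  # shortened list, for illustration
data = [
    [0.00632, 18.0, 6.575],           # toy rows standing in for the array
    [0.02731, 0.0, 6.421],
]

# Look up the column position of "RM", then collect that column.
col = feature_names.index("RM")
rm_values = [row[col] for row in data]  # one value per house
print(rm_values)  # [6.575, 6.421]
```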

Summary

We have seen how to import and load the Boston house prices dataset from the scikit-learn library, along with a description of each explanatory variable.

You can prepare a dataset in just three lines of code, so please consider using it!

Even one is enough: it is useful to have a dataset that you can reach for quickly.
I hope this dataset becomes part of your toolkit after reading this post.