Step-by-step to a Data Scientist

February 25, 2021May 4, 2021

Weekly Article News #1

Step-by-step to a Data Scientist > Blog > 2021 > February

The recommended articles the author has read this week.

Principal Component Analysis (PCA) from scratch in Python

About principal component analysis(PCA). Easy description and code for a beginner.

Feature selection in Python using the Filter method

This article helps us to select input features by filtering methods. By the step-by-step style, we investigate the correlation between variables and select the features.

3 Essential Ways to Calculate Feature Importance in Python

In this article, the ways to select features are introduced in different methods, based on logistic regression, a tree-based method, and principal component analysis(PCA).

Weighted Logistic Regression for Imbalanced Dataset

Most of the data collected in the real world always have a problem such as an imbalance of each category. That’s why dealing with imbalanced data is important. This article introduces the basics of how to treat imbalanced data.

February 21, 2021February 25, 2021

Installation — Homebrew on M1 Mac

Step-by-step to a Data Scientist > Blog > 2021 > February

Homebrew is a package manager for a Mac, making it possible to install or uninstall applications. Examples of the applications are Git, Python,.. etc.

In this post, we will see how to install homebrew on M1 Mac.

Install Homebrew

First, let’s see how to install Homebrew. It is easy.
It is just to do the following commands on your terminal.

sudo mkdir /opt/homebrew
sudo chown -R $(whoami) /opt/homebrew
curl -L https://github.com/Homebrew/brew/tarball/master | tar xz --strip 1 -C /opt/homebrew

The above commands instals Homebrew at “/opt/homebrew”. And, the execution file is set at “/opt/homebrew/bin/”.

When your installation was successfuly completed, you can see the following monitor after executing.

brew

>>  Example usage:
>>    brew search [TEXT|/REGEX/]
>>    brew info [FORMULA...]
>>    brew install FORMULA...
>>    brew update
>>    brew upgrade [FORMULA...]
>>    brew uninstall FORMULA...
>>    brew list [FORMULA...]
>>  ..
>>  ..
>>  ..

The PATH exists?

However, it is possible that the PATH of the brew executable file is not in the PATH. In that case, add the following sentence to the environment setting file of shell. The shell config file is “.bashrc” if you are using bash. If you are using zsh, the config file is “.zshrc” or “.zshenv”. Note that the default shell on M1 Mac is zsh.

# Homebrew
export PATH=/opt/homebrew/bin:$PATH

Here, your PC can recognize where the brew executable file exists in the directories.

Summary

We have seen how to install Homebrew on M1 Mac. With Homebrew, we can manage applications easily. For example, we can install Git, Python and so many utility applications.

February 7, 2021February 25, 2021

Python Tips # set data structure

Step-by-step to a Data Scientist > Blog > 2021 > February

A set data structure may be unfamiliar to a Python beginner. A set is used for sequence data structures. Therefore, you can have an image against a set like a list, tuple, and dictionary.

First, let’s look at a list as an example. The sample list “sample_list” has six elements, however, whose unique elements are three kinds. The elements are four “apple”, one “orange”, and one “grape”.

sample_list = [  
               "apple",
               "orange",
               "grape",
               "apple",
               "apple",
               "apple" 
              ]
print(sample_list)

>>  ['apple', 'orange', 'grape', 'apple', 'apple', 'apple']

You might have encountered the situation that you would like to know the unique elements of a list. Such a situation is the time to use the set() function in Python. Note that the set() is included in Python as a standard module, so you don’t need to import any external module.

It is easy. You just pass the list to the set() function as follows.

sample_list_unique = set(sample_list)
print(sample_list_unique)

>>  {'apple', 'orange', 'grape'}

You have found out the unique elements of the sample list.

Like the output of the above cell, a set is created by the curly braces {}.

set_sample_without_overlaps = {"apple", "orange", "grape"}
print(set_sample_without_overlaps)

>>  {'apple', 'orange', 'grape'}

Of course, if there are overlaps when we define the elements, these are ignored. Let’s put several “apple” elements in the set when defining. You will see that additional “apple” elements are ignored.

set_sample_with_overlaps = {"apple", "orange", "grape", "apple", "apple", "apple"}
print(set_sample_with_overlaps)

>>  {'apple', 'orange', 'grape'}

Application example

A set is so useful in data science. This is because we have many situations to confirm the overlaps between datasets.

Here is an example. We will check the overlaps of the id column between training and validation datasets. We can do this easily by the set() function and the “intersection” method. The intersection method returns the overlaps elements between two set-type data.

id_train = ["01", "02", "03", "04", "05"]
id_validation = ["04", "05", "06", "07", "08"]

# Into a set data structure
id_train_set = set(id_train)
id_validation_set = set(id_validation)

# Check an overlaps
id_overlap = id_train_set.intersection(id_validation_set)
print(id_overlap)

>>  {'05', '04'}

We have known that the elements of “05” and “04” coexist in the training and validation datasets. From this fact, we should perform a preprocessing, for example dropping overlap data.

Mixing the same information as the training data with the validation data is called data leakage, which leads to an overestimation of accuracy.

Summary

As a sequence data structure, a list, tuple, dictionary are famous. However, a set is practical when we treat unique elements.

You will surely come across a situation where you want to know the unique element of sequence-type data. Recall that there is a set data structure in the Python standard module s at that time.