Introduction to Data Science for Screen Reader Users

Introduction to Practical Data Science

Introduction

In this chapter, we will learn the basics of data science, following the content created by Japan’s Ministry of Education, Culture, Sports, Science and Technology (hereinafter referred to as “MEXT material”) for teacher training materials for the “Information II” course in high schools. Adhering to the “Terms of Use for the MEXT Website”, we have created this content with some modifications to what is covered in “High School Information Course Information II Teacher Training Materials (Main Text) Chapter 3 Information and Data Science” (MEXT).

In particular, we have arranged it so that you can practically learn the basics of data science by making the most of the Python programming and audio graph techniques introduced up to the previous chapter. Once you have mastered the basics and their practical methods, you will be able to continue learning on your own by making use of various learning resources available online.

The Significance of Learning Data Science

In data science, we clarify problems using a scientific approach to information, establish analytical strategies, and perform organization, formatting, and analysis of various data. Through activities of considering the results, you will come to understand that a variety of products and services utilizing machine learning, such as artificial intelligence-based image recognition and automatic translation, have been developed and new insights have been produced. Additionally, in order to discover and solve problems such as predicting uncertain events, you will acquire various knowledge and skills in each process such as data collection, organization, formatting, modeling, visualization, analysis, evaluation, execution, and effect verification, thereby cultivating the ability to scientifically grasp things based on data and tackle problem-solving.

Learning Objectives

Understand the existence of diverse and large amounts of data, the usefulness of utilizing data, and the role data science plays in society. Learn how to collect and organize appropriate data according to purpose, acquire skills to do so, and be able to collect, organize, and format appropriate data according to purpose.
Understand how to model phenomena based on data and how to process, interpret, and express data, acquire the skills to do so, and be able to model and process phenomena appropriately based on data, interpret and express them, predict future phenomena, and clarify the relationships between multiple phenomena.
Understand the significance and methods of evaluating models based on the results of data processing, acquire the skills to do so, and be able to evaluate models and the results of data processing, and improve modeling, processing, interpretation, and representation.

Overview

We will prioritize getting our hands on and practicing the entire process of data science, with a moderate amount of detailed explanation. Once you get a sense of how to use the audio graph library, you should be able to deepen your understanding by utilizing various contents available on the Internet.

By learning the following items in order, let’s grasp the overall picture of data science.

Importing Libraries
Preparing Data
Feature Engineering
Supervised Learning
Neural Networks
Unsupervised Learning

Importing Libraries

Firstly, let’s import the external libraries that will be called later. If you’re using Google Colab, most of the libraries are pre-installed, and you should be able to use them simply by importing them as shown below.

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from tensorflow import keras

As for the audio plot library, if you have not installed it yet, install it using the pip command before importing it.

!pip install -q audio-plot-lib
import audio_plot_lib as apl

In data science, random numbers are often used for things like initial values when machine learning. If left as is, the results will change every time an experiment is performed, which can often hinder problem solving. Therefore, it’s common to fix the seed of random numbers as follows.

np.random.seed(0)  # You can change 0 to any number you like

Data Preparation

As an example for this time, we will take up the Penguin dataset, which is becoming increasingly popular for data science learning. The amount and quality of data is appropriate, and it is a very easy-to-handle and popular dataset.

The dataset contains characteristic data such as the weight and beak length of each of the following three types of penguins, compiled into a CSV.

Chinstrap penguins
Gentoo penguins
Adelie penguins

These are popular penguins that can be seen in aquariums in Japan. The name of the dataset has “Palmer” in it, apparently because it is based on survey data from the Palmer Peninsula in Antarctica.

Let’s immediately read in the dataset as a pandas dataframe.

penguin_dataset_url = "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins_raw.csv"
df_penguin = pd.read_csv(penguin_dataset_url)

To make the results of the analysis easier to understand, each item of data should be clearly labeled.

df_penguin = df_penguin.rename(columns={
        "studyName": "Study Name",
        "Sample Number": "Sample Number",
        "Species": "Species",
        "Region": "Region",
        "Island": "Island",
        "Stage": "Developmental Stage",
        "Individual ID": "Individual ID",
        "Clutch Completion": "Matured",
        "Date Egg": "Date of Birth",
        "Culmen Length (mm)": "Upper Beak Length (mm)",
        "Culmen Depth (mm)": "Beak Vertical Width (mm)",
        "Flipper Length (mm)": "Flipper Length (mm)",
        "Body Mass (g)": "Body Weight (g)",
        "Sex": "Sex",
        "Delta 15 N (o/oo)": "Nitrogen Isotope Ratio",
        "Delta 13 C (o/oo)": "Stable Isotope Ratio",
        "Comments": "Comments",
    })

df_penguin = df_penguin.replace("Adelie Penguin (Pygoscelis adeliae)", "Adelie Penguin")
df_penguin = df_penguin.replace("Chinstrap penguin (Pygoscelis antarctica)", "Chinstrap Penguin")
df_penguin = df_penguin.replace("Gentoo penguin (Pygoscelis papua)", "Gentoo Penguin")

Checking the Data

Let’s check the contents of the data, starting with the first three entries.

print(df_penguin.head(3))

The result should look something like this:

	Study Name	Sample Number	Species	Region	Island	Developmental Stage	Individual ID	Matured	Date of Birth	Upper Beak Length (mm)	Beak Vertical Width (mm)	Flipper Length (mm)	Body Weight (g)	Sex	Nitrogen Isotope Ratio	Stable Isotope Ratio	Comments
0	PAL0708	1	Adélie Penguin	Anvers	Torgersen	Adult, 1 Egg Stage	N1A1	Yes	2007-11-11	39.1	18.7	181	3750	MALE	nan	nan	Not enough blood for isotopes.
1	PAL0708	2	Adélie Penguin	Anvers	Torgersen	Adult, 1 Egg Stage	N1A2	Yes	2007-11-11	39.5	17.4	186	3800	FEMALE	8.94956	-24.6945	nan
2	PAL0708	3	Adélie Penguin	Anvers	Torgersen	Adult, 1 Egg Stage	N2A1	Yes	2007-11-16	40.3	18	195	3250	FEMALE	8.36821	-25.333	nan

Let’s check the amount of data. The output of the shape function represents (number of data, number of features).

print(df_penguin.shape)

(344, 17)

Let’s also check how many data points are included for each type of penguin.

print(df_penguin["Species"].value_counts())

	Species
Adélie Penguin	152
Gentoo Penguin	124
Chinstrap Penguin	68

Dealing with Missing Data

Although the penguin dataset is a relatively organized and easy-to-use dataset, it still includes some unnecessary data. In the world of data science, the phrase “garbage in, garbage out” is commonly used, emphasizing the importance of ensuring the quality of the data to produce good analytical results. As you handle more and more data, expect the time and effort required for these types of preprocessing to increase, so please keep this in mind.

Let’s start by looking at missing values.

print(df_penguin.isnull().sum())

	0
Study Name	0
Sample Number	0
Species	0
Region	0
Island	0
Developmental Stage	0
Individual ID	0
Matured	0
Date of Birth	0
Upper Beak Length (mm)	2
Beak Vertical Width (mm)	2
Flipper Length (mm)	2
Body Weight (g)	2
Sex	11
Nitrogen Isotope Ratio	14
Stable Isotope Ratio	13
Comments	290

Features such as beak length and body weight seem crucial, so we should avoid using data that lacks these values. Let’s prepare a new dataframe that excludes these missing values.

df_penguin = df_penguin.dropna(subset=["BeakLength (mm)", "BeakDepth (mm)", "FlipperLength (mm)", "Sex"])

print(df_penguin.isnull().sum())
print(df_penguin["Species"].value_counts())

	0
Study Name	0
Sample Number	0
Species	0
Region	0
Island	0
Developmental Stage	0
Individual ID	0
Matured	0
Date of Birth	0
Upper Beak Length (mm)	0
Beak Vertical Width (mm)	0
Flipper Length (mm)	0
Body Weight (g)	0
Sex	0
Nitrogen Isotope Ratio	9
Stable Isotope Ratio	8
Comments	290

Preprocessing Features

Some features are represented by keywords, and it would be easier to process these values if they were converted into numbers. Let’s transform the names of genders and species into new features represented by numbers.

df_penguin.loc[:, "Sex_Num"] = LabelEncoder().fit_transform(df_penguin["Sex"])
print(df_penguin[["Sex", "Sex_Num"]].head(3))

	Sex	Sex_Num
0	MALE	1
1	FEMALE	0
2	FEMALE	0

df_penguin.loc[:, "Species_Num"] = LabelEncoder().fit_transform(df_penguin["Species"])
print(df_penguin[["Species", "Species_Num"]].head(3))

	Species	Species_Num
0	Adelie Penguin	0
1	Adelie Penguin	0
2	Adelie Penguin	0

Investigating Data Distribution

Let’s use an audio graph to check if there’s any correlation between each feature. First, let’s confirm the correlation between flipper length and body weight. One might guess that if the flipper is longer, the body weight might be heavier, but let’s see.

x = df_penguin["FlipperLength (mm)"].values
y = df_penguin["BodyWeight (g)"].values
apl.interactive.plot(y, x)

Example of execution result

Next, let’s check the data on beak length and depth to see if there are any characteristics in the beak shape. It doesn’t necessarily mean that if the beak is long, the depth is also wide. From the middle, the trend of the sound changes from around 40mm in beak length, and you can hear low sounds, that is, data with a narrow beak depth. It seems that there are data with long, narrow beaks and others, and these seem to be related to the species.

x = df_penguin["BeakLength (mm)"].values
y = df_penguin["BeakDepth (mm)"].values
apl.interactive.plot(y, x)

Example of execution result

Let’s label each data by the type of penguin and check the audio graph. If you can tell to some extent which penguin it is from the characteristics of the beak at this point, you can set the policy for analysis.

First, let’s check the label number for each penguin.

for label in range(3):
    shumei = df_penguin[df_penguin["Species_Num"] == label]["Species"].values[0]
    print(f"{label} is {shumei}")

0 is Adelie Penguin
1 is Gentoo Penguin
2 is Chinstrap Penguin

Remember the label numbers and check the audio graph.

x = df_penguin["BeakLength (mm)"].values
y = df_penguin["BeakDepth (mm)"].values
label  = df_penguin["Species_Num"].values
apl.interactive.plot(y, x, label)

Example of execution result

Checking Statistical Quantities

print(df_penguin[["FlipperLength (mm)", "BodyWeight (g)", "BeakLength (mm)", "BeakDepth (mm)", "FlipperLength (mm)"]].describe())

	FlipperLength (mm)	BodyWeight (g)	BeakLength (mm)	BeakDepth (mm)	FlipperLength (mm)
count	333	333	333	333	333
mean	200.967	4207.06	43.9928	17.1649	200.967
std	14.0158	805.216	5.46867	1.96924	14.0158
min	172	2700	32.1	13.1	172
25%	190	3550	39.5	15.6	190
50%	197	4050	44.5	17.3	197
75%	213	4775	48.6	18.7	213
max	231	6300	59.6	21.5	231

Feature Engineering

It’s impossible to closely examine the relationships for all combinations of features (also known as dimensions). Therefore, in this case, let’s use a dimensionality reduction technique called PCA. PCA (Principal Component Analysis) is a technique for integrating highly correlated features to create new synthetic features. The advantages of dimensionality reduction (reducing the number of features) include making data analysis easier and reducing the amount of data required for machine learning, as mentioned later.

Let’s import PCA from scikit learn, a commonly used machine learning library.

from sklearn.decomposition import PCA

Let’s combine some potentially useful features to create three synthetic features.

data = df_penguin[["FlipperLength (mm)", "BodyWeight (g)", "BeakLength (mm)",  "BeakDepth (mm)",  "FlipperLength (mm)", "Sex_Num"]].values
penguins_pca = PCA(n_components=3).fit_transform(data)
penguins_pca.shape

(333, 3)

Let’s check the synthetic feature data with an audio graph.

x = penguins_pca[:, 0]
y = penguins_pca[:, 1]
label  = df_penguin["Species_Num"].values
apl.interactive.plot(y, x, label)

Example of execution result

x = penguins_pca[:, 2]
y = penguins_pca[:, 1]
label  = df_penguin["Species_Num"].values
apl.interactive.plot(y, x, label)

Example of execution result

Supervised Learning

Supervised learning is a task that learns from pairs of input data and corresponding output data (labels) to predict the output for new input data. Here, we will learn a logistic regression model that uses the features of penguin data as inputs and identifies the species of penguins as outputs.

Logistic regression is a method of supervised learning commonly used in machine learning and statistics. As the name may suggest, logistic regression is a model specialized for classification problems, particularly binary classification problems.

A characteristic of this method is that it uses the logistic function (or sigmoid function) to express the relationship between the features (input data) and the target variable (output data). This function confines the output within the range of 0 to 1, which can be interpreted as a probability.

For example, in a problem predicting whether an email is spam or not, logistic regression extracts features from the content of the email and outputs the probability of them being spam.

However, logistic regression can only express linear relationships. Therefore, it is important to consider other methods when the relationships in the data are complex and non-linear.

First, store the dimensionally reduced data by PCA as the features in X, and the species number as the target variable in y. Then, split the dataset into training data and test data.

X = penguins_pca[:, :3]
y = df_penguin["Species Number"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)

Next, define the logistic regression model and learn it with the training data. Use the learned model to predict the test data.

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Finally, evaluate the performance of the model. Here, calculate accuracy, recall, and F1 score and display them for each type of penguin. These metrics indicate how accurately the model can predict.

evaluation = classification_report(y_test, y_pred, target_names=["Adelie Penguin", "Gentoo Penguin", "Chinstrap Penguin"], output_dict=True)
print(pd.DataFrame(evaluation))

	Adelie Penguin	Gentoo Penguin	Chinstrap Penguin	accuracy	macro avg	weighted avg
precision	0.925	1	1	0.964072	0.975	0.966766
recall	1	0.983333	0.848485	0.964072	0.943939	0.964072
f1-score	0.961039	0.991597	0.918033	0.964072	0.956889	0.96352
support	74	60	33	0.964072	167	167

Neural Networks

Let’s try using a deep neural network instead of logistic regression. A deep neural network is a type of machine learning that mimics the way the human brain processes information. It aims to reproduce the mechanism by which neurons in the brain transmit information on a computer. They are a collection of networks called “layers” that transmit and process data in succession to solve complex problems.

Unlike logistic regression, deep neural networks generally have a multilayer structure. Each layer plays its own role, collectively enabling more advanced recognition and understanding. This allows us to handle not only simple linear problems but also complex nonlinear problems.

However, training deep neural networks requires a lot of computational resources and time. Therefore, it’s important to choose the appropriate model considering the complexity of the problem and the amount of data.

Data Preparation

First, select the columns to use as features from the penguin dataset and store them in X. Then, one-hot encode the target variable (type of penguin) and store it in y. Then, split the dataset into training data and test data.

X = df_penguin[["Flipper Length (mm)", "Body Mass (g)", "Culmen Length (mm)",  "Culmen Depth (mm)",  "Flipper Length (mm)", "Sex Number"]].values
y = pd.get_dummies(df_penguin["Species Number"])  # one hot vector
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)

Data Preprocessing

Neural networks may not learn well if the scales of the features are different. Therefore, standardize the data (convert to mean 0 and standard deviation 1). Here, we are using StandardScaler.

scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Original body mass data mean: {X_train[:, 1].mean().round()}, standard deviation: {X_train[:, 1].std().round()}")
print(f"Standardized body mass data mean: {X_train_scaled[:, 1].mean().round()}, standard deviation: {X_train_scaled[:, 1].std().round()}")

Original body mass data mean: 4242.0, standard deviation: 810.0
Standardized body mass data mean: -0.0, standard deviation: 1.0

Neural Network Model Definition

Next, define the neural network model. Here, we are creating a model that consists of an input layer, two hidden layers, and an output layer. We use ReLU for the activation function of the hidden layers and softmax for the activation function of the output layer.

inputs = keras.Input(shape=X_train_scaled.shape[1])
hidden_layer = keras.layers.Dense(20, activation="relu")(inputs)
hidden_layer2 = keras.layers.Dense(10, activation="relu")(hidden_layer)
output_layer = keras.layers.Dense(3, activation="softmax")(hidden_layer2)
model = keras.Model(inputs=inputs, outputs=output_layer)

Training

The defined model is trained using the training data. Adam is used as the optimization algorithm and categorical cross entropy is used as the loss function.

optimizer = keras.optimizers.Adam()
loss = keras.losses.CategoricalCrossentropy()
model.compile(optimizer, loss)
history = model.fit(X_train_scaled, y_train, epochs=200, verbose=0, validation_data=(X_test_scaled, y_test))

To visualize the progress of the learning, the losses of the training data and validation data are plotted.

plot_x = np.arange(len(history.history["loss"])).tolist() * 2
plot_y = history.history["loss"] + history.history["val_loss"]
plot_label = [0] * len(history.history["loss"]) + [1] * len(history.history["val_loss"])
apl.interactive.plot(plot_y, plot_x, plot_label)

Example of Execution Result

Prediction

Predictions are made on the test data using the trained model.

model_output = model.predict(X_test_scaled)
y_pred = pd.DataFrame(model_output, columns=y_test.columns, index=y_test.index)

The prediction results are displayed along with the specific penguin data.

index = 10

print(f"Data number {index}")
print(f"Flipper length: {X_test[index, 0]:.1f}mm")
print(f"Body weight: {X_test[index, 1]:.1f}g")
print(f"Bill length: {X_test[index, 2]:.1f}mm")
print(f"Bill depth: {X_test[index, 3]:.1f}mm")
print(f"Flipper length: {X_test[index, 4]:.1f}mm")
print(f"Gender code: {X_test[index, 5]}")
print(f"Probability of being an Adelie penguin: {(y_pred.values[index, 0] * 100):.1f}%")
print(f"Probability of being a Gentoo penguin: {(y_pred.values[index, 1] * 100):.1f}%")
print(f"Probability of being a Chinstrap penguin: {(y_pred.values[index, 2] * 100):.1f}%")
print(f"Correct answer: {['Adelie penguin', 'Gentoo penguin', 'Chinstrap penguin'][y_test.idxmax(axis='columns').values[index]]}")

Data number 10
Flipper length: 230.0mm
Body weight: 5700.0g
Bill length: 50.0mm
Bill depth: 16.3mm
Flipper length: 230.0mm
Gender code: 1.0
Probability of being an Adelie penguin: 0.0%
Probability of being a Gentoo penguin: 100.0%
Probability of being a Chinstrap penguin: 0.0%
Correct answer: Gentoo penguin

Evaluation

Finally, the performance of the model is evaluated. Here, the accuracy, recall, and F1 scores are calculated and displayed for each type of penguin. In this example, all indicators are 1.0, indicating that the model can predict perfectly accurately. However, in real problems, perfect predictions like this are rare, and these indicators are generally used to evaluate the performance of the model and adjust the model as needed.

This completes the basic flow of the neural network explained in this article. Through the steps of

data preparation, preprocessing, model definition, training, prediction, and evaluation, you can understand the basic procedures of machine learning using neural networks.

y_label_pred = y_pred.idxmax(axis="columns").values   # one hot vector -> label
y_label_test = y_test.idxmax(axis="columns").values

evaluation = classification_report(y_label_test, y_label_pred, target_names=["Adelie penguin", "Gentoo penguin", "Chinstrap penguin"], output_dict=True)
print(pd.DataFrame(evaluation))

	Adelie penguin	Gentoo penguin	Chinstrap penguin	accuracy	macro avg	weighted avg
precision	1	1	1	1	1	1
recall	1	1	1	1	1	1
f1-score	1	1	1	1	1	1
support	74	59	34	1	167	167

Unsupervised Learning

Unsupervised learning involves learning from input data alone, discovering structures or patterns in the data. Because output data (labels) are not provided, it is used to find the possibility of new species or when it is impossible to label in advance.

Here, we are dividing penguin data into three clusters using k-means clustering. K-means clustering is a method of dividing data into k clusters, clustering to minimize the total distance between the centroid of each cluster and each data point.

from sklearn.cluster import KMeans
predicted_labels = KMeans(n_clusters=3).fit_predict(penguins_pca)

Finally, the data reduced by PCA is plotted and the clusters assigned by k-means clustering are displayed in color. This allows you to check how the data was clustered.

x = penguins_pca[:, 0]
y = penguins_pca[:, 1]
label = predicted_labels
apl.interactive.plot(y, x, label)

Example of Execution Result

Summary of This Chapter

We have been able to gain at least a glimpse of the entire process of data science. Although we have only covered a few examples, the ability to scientifically understand information and apply it to problem-solving is essential in the modern era.

Our goal is to grasp the big picture of data science and acquire the skills to apply each step of the process, from data preparation to feature engineering, supervised learning, neural networks, and unsupervised learning. By utilizing the knowledge we have learned here, let’s explore the vast sea of data and cultivate the ability to make new discoveries.

These learnings will become powerful tools for you to solve future challenges you may encounter. I wish you success in your journey through data science.

Home