In this chapter, we will learn the basics of data science, following the teacher training materials created by Japan’s Ministry of Education, Culture, Sports, Science and Technology (MEXT) for the high school course “Information II” (hereinafter referred to as the “MEXT material”). Adhering to the “Terms of Use for the MEXT Website”, we have created this content with some modifications to what is covered in “High School Information Course Information II Teacher Training Materials (Main Text), Chapter 3: Information and Data Science” (MEXT).
In particular, we have arranged it so that you can practically learn the basics of data science by making the most of the Python programming and audio graph techniques introduced up to the previous chapter. Once you have mastered the basics and their practical methods, you will be able to continue learning on your own by making use of various learning resources available online.
In data science, we clarify problems using a scientific approach to information, establish an analysis strategy, and organize, format, and analyze various kinds of data. By examining the results of these activities, you will come to understand how products and services that use machine learning, such as AI-based image recognition and automatic translation, have been developed and how they produce new insights. In addition, to discover and solve problems such as predicting uncertain events, you will acquire knowledge and skills for each step of the process, including data collection, organization, formatting, modeling, visualization, analysis, evaluation, execution, and verification of effects, thereby cultivating the ability to grasp things scientifically based on data and to tackle problem-solving.
We will prioritize hands-on practice of the entire data science process, keeping detailed explanations to a moderate amount. Once you get a feel for how to use the audio graph library, you should be able to deepen your understanding by drawing on the many resources available on the Internet.
By learning the following items in order, let’s grasp the overall picture of data science.
Firstly, let’s import the external libraries that will be called later. If you’re using Google Colab, most of the libraries are pre-installed, and you should be able to use them simply by importing them as shown below.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from tensorflow import keras
As for the audio plot library, if you have not installed it yet, install it using the pip command before importing it.
!pip install -q audio-plot-lib
import audio_plot_lib as apl
In data science, random numbers are often used, for example as initial values in machine learning. If left as is, the results will change every time an experiment is run, which can get in the way of problem solving. Therefore, it is common to fix the random seed as follows.
np.random.seed(0) # You can change 0 to any number you like
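Note that TensorFlow manages its own random state separately from NumPy. If your version of TensorFlow provides it (roughly 2.7 and later), the following optional call seeds Python, NumPy, and TensorFlow at once; skip it if the function is not available in your environment.
keras.utils.set_random_seed(0) # seeds Python, NumPy, and TensorFlow together (availability depends on your TensorFlow version)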
As our working example, we will use the Palmer Penguins dataset, which is becoming increasingly popular for learning data science. The amount and quality of its data are just right, making it a very easy dataset to handle.
The dataset is a CSV of measurements such as body weight and beak length for three penguin species: Adelie, Chinstrap, and Gentoo. These are popular penguins that can be seen in aquariums in Japan. The name of the dataset includes “Palmer” apparently because it is based on survey data collected around the Palmer Archipelago in Antarctica.
Let’s immediately read in the dataset as a pandas dataframe.
penguin_dataset_url = "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins_raw.csv"
df_penguin = pd.read_csv(penguin_dataset_url)
To make the results of the analysis easier to understand, let’s give each column of the data a clear label.
df_penguin = df_penguin.rename(columns={
    "studyName": "Study Name",
    "Sample Number": "Sample Number",
    "Species": "Species",
    "Region": "Region",
    "Island": "Island",
    "Stage": "Developmental Stage",
    "Individual ID": "Individual ID",
    "Clutch Completion": "Clutch Completion",
    "Date Egg": "Egg Date",
    "Culmen Length (mm)": "Upper Beak Length (mm)",
    "Culmen Depth (mm)": "Beak Vertical Width (mm)",
    "Flipper Length (mm)": "Flipper Length (mm)",
    "Body Mass (g)": "Body Weight (g)",
    "Sex": "Sex",
    "Delta 15 N (o/oo)": "Nitrogen Isotope Ratio",
    "Delta 13 C (o/oo)": "Carbon Isotope Ratio",
    "Comments": "Comments",
})
df_penguin = df_penguin.replace("Adelie Penguin (Pygoscelis adeliae)", "Adelie Penguin")
df_penguin = df_penguin.replace("Chinstrap penguin (Pygoscelis antarctica)", "Chinstrap Penguin")
df_penguin = df_penguin.replace("Gentoo penguin (Pygoscelis papua)", "Gentoo Penguin")
Let’s check the contents of the data, starting with the first three entries.
print(df_penguin.head(3))
The result should look something like this:
| Study Name | Sample Number | Species | Region | Island | Developmental Stage | Individual ID | Clutch Completion | Egg Date | Upper Beak Length (mm) | Beak Vertical Width (mm) | Flipper Length (mm) | Body Weight (g) | Sex | Nitrogen Isotope Ratio | Carbon Isotope Ratio | Comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | PAL0708 | 1 | Adelie Penguin | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 2007-11-11 | 39.1 | 18.7 | 181 | 3750 | MALE | nan | nan | Not enough blood for isotopes. |
1 | PAL0708 | 2 | Adelie Penguin | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 2007-11-11 | 39.5 | 17.4 | 186 | 3800 | FEMALE | 8.94956 | -24.6945 | nan |
2 | PAL0708 | 3 | Adelie Penguin | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 2007-11-16 | 40.3 | 18 | 195 | 3250 | FEMALE | 8.36821 | -25.333 | nan |
Let’s check the amount of data. The shape attribute returns (number of rows, number of columns).
print(df_penguin.shape)
(344, 17)
Let’s also check how many data points are included for each type of penguin.
print(df_penguin["Species"].value_counts())
Species | Count |
---|---|
Adelie Penguin | 152 |
Gentoo Penguin | 124 |
Chinstrap Penguin | 68 |
Although the penguin dataset is a relatively organized and easy-to-use dataset, it still includes some unnecessary data. In the world of data science, the phrase “garbage in, garbage out” is commonly used, emphasizing the importance of ensuring the quality of the data to produce good analytical results. As you handle more and more data, expect the time and effort required for these types of preprocessing to increase, so please keep this in mind.
Let’s start by looking at missing values.
print(df_penguin.isnull().sum())
Column | Missing Values |
---|---|
Study Name | 0 |
Sample Number | 0 |
Species | 0 |
Region | 0 |
Island | 0 |
Developmental Stage | 0 |
Individual ID | 0 |
Clutch Completion | 0 |
Egg Date | 0 |
Upper Beak Length (mm) | 2 |
Beak Vertical Width (mm) | 2 |
Flipper Length (mm) | 2 |
Body Weight (g) | 2 |
Sex | 11 |
Nitrogen Isotope Ratio | 14 |
Carbon Isotope Ratio | 13 |
Comments | 290 |
Features such as the beak measurements, flipper length, and sex seem essential, so we should avoid using rows that lack these values. Let’s prepare a new dataframe that excludes rows with these missing values.
df_penguin = df_penguin.dropna(subset=["Upper Beak Length (mm)", "Beak Vertical Width (mm)", "Flipper Length (mm)", "Sex"])
print(df_penguin.isnull().sum())
print(df_penguin["Species"].value_counts())
Column | Missing Values |
---|---|
Study Name | 0 |
Sample Number | 0 |
Species | 0 |
Region | 0 |
Island | 0 |
Developmental Stage | 0 |
Individual ID | 0 |
Clutch Completion | 0 |
Egg Date | 0 |
Upper Beak Length (mm) | 0 |
Beak Vertical Width (mm) | 0 |
Flipper Length (mm) | 0 |
Body Weight (g) | 0 |
Sex | 0 |
Nitrogen Isotope Ratio | 9 |
Carbon Isotope Ratio | 8 |
Comments | 290 |
Some features are represented by text labels, and they would be easier to process if converted into numbers. Let’s transform the sex and species names into new features represented by numbers.
df_penguin.loc[:, "Sex_Num"] = LabelEncoder().fit_transform(df_penguin["Sex"])
print(df_penguin[["Sex", "Sex_Num"]].head(3))
| Sex | Sex_Num |
---|---|---|
0 | MALE | 1 |
1 | FEMALE | 0 |
2 | FEMALE | 0 |
df_penguin.loc[:, "Species_Num"] = LabelEncoder().fit_transform(df_penguin["Species"])
print(df_penguin[["Species", "Species_Num"]].head(3))
| Species | Species_Num |
---|---|---|
0 | Adelie Penguin | 0 |
1 | Adelie Penguin | 0 |
2 | Adelie Penguin | 0 |
Let’s use an audio graph to check if there’s any correlation between each feature. First, let’s confirm the correlation between flipper length and body weight. One might guess that if the flipper is longer, the body weight might be heavier, but let’s see.
x = df_penguin["FlipperLength (mm)"].values
y = df_penguin["BodyWeight (g)"].values
apl.interactive.plot(y, x)
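Besides listening to the audio graph, you can quantify the relationship with the Pearson correlation coefficient, which ranges from -1 to 1; a value close to 1 would support the guess that penguins with longer flippers tend to be heavier (an optional check using the x and y arrays defined above).
print(np.corrcoef(x, y)[0, 1]) # Pearson correlation between flipper length and body weight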
Next, let’s check the beak length and depth data to see whether there is any pattern in beak shape. A longer beak does not necessarily mean a deeper one: around a beak length of 40 mm the trend of the sound changes, and you begin to hear low tones, that is, data points with a narrow beak depth. There seem to be penguins with long, narrow beaks and others without, and this appears to be related to the species.
x = df_penguin["BeakLength (mm)"].values
y = df_penguin["BeakDepth (mm)"].values
apl.interactive.plot(y, x)
Let’s label each data point by penguin species and check the audio graph again. If you can already tell, to some extent, which species a penguin is from its beak characteristics, that will help you set the direction of the analysis.
First, let’s check the label number for each penguin.
for label in range(3):
    shumei = df_penguin[df_penguin["Species_Num"] == label]["Species"].values[0]  # shumei = species name
    print(f"{label} is {shumei}")
0 is Adelie Penguin
1 is Chinstrap Penguin
2 is Gentoo Penguin
Remember the label numbers and check the audio graph.
x = df_penguin["BeakLength (mm)"].values
y = df_penguin["BeakDepth (mm)"].values
label = df_penguin["Species_Num"].values
apl.interactive.plot(y, x, label)
print(df_penguin[["FlipperLength (mm)", "BodyWeight (g)", "BeakLength (mm)", "BeakDepth (mm)", "FlipperLength (mm)"]].describe())
| Flipper Length (mm) | Body Weight (g) | Upper Beak Length (mm) | Beak Vertical Width (mm) | Flipper Length (mm) |
---|---|---|---|---|---|
count | 333 | 333 | 333 | 333 | 333 |
mean | 200.967 | 4207.06 | 43.9928 | 17.1649 | 200.967 |
std | 14.0158 | 805.216 | 5.46867 | 1.96924 | 14.0158 |
min | 172 | 2700 | 32.1 | 13.1 | 172 |
25% | 190 | 3550 | 39.5 | 15.6 | 190 |
50% | 197 | 4050 | 44.5 | 17.3 | 197 |
75% | 213 | 4775 | 48.6 | 18.7 | 213 |
max | 231 | 6300 | 59.6 | 21.5 | 231 |
It’s impossible to closely examine the relationships for all combinations of features (also known as dimensions). Therefore, in this case, let’s use a dimensionality reduction technique called PCA. PCA (Principal Component Analysis) is a technique for integrating highly correlated features to create new synthetic features. The advantages of dimensionality reduction (reducing the number of features) include making data analysis easier and reducing the amount of data required for machine learning, as mentioned later.
Let’s import PCA from scikit-learn, a commonly used machine learning library.
from sklearn.decomposition import PCA
Let’s combine some potentially useful features to create three synthetic features.
data = df_penguin[["FlipperLength (mm)", "BodyWeight (g)", "BeakLength (mm)", "BeakDepth (mm)", "FlipperLength (mm)", "Sex_Num"]].values
penguins_pca = PCA(n_components=3).fit_transform(data)
print(penguins_pca.shape)
(333, 3)
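If you are curious how much of the original variation each synthetic feature captures, you can keep the fitted PCA object and inspect its explained_variance_ratio_ attribute. This is an optional sketch; the variable name pca is introduced here only for illustration.
pca = PCA(n_components=3).fit(data) # fit on the same feature matrix as above
print(pca.explained_variance_ratio_) # fraction of the total variance carried by each component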
Let’s check the synthetic feature data with an audio graph.
x = penguins_pca[:, 0]
y = penguins_pca[:, 1]
label = df_penguin["Species_Num"].values
apl.interactive.plot(y, x, label)
x = penguins_pca[:, 2]
y = penguins_pca[:, 1]
label = df_penguin["Species_Num"].values
apl.interactive.plot(y, x, label)
Supervised learning is a task that learns from pairs of input data and corresponding output data (labels) to predict the output for new input data. Here, we will learn a logistic regression model that uses the features of penguin data as inputs and identifies the species of penguins as outputs.
Logistic regression is a method of supervised learning commonly used in machine learning and statistics. Despite the word “regression” in its name, logistic regression is a model specialized for classification problems, particularly binary classification.
A characteristic of this method is that it uses the logistic function (or sigmoid function) to express the relationship between the features (input data) and the target variable (output data). This function confines the output within the range of 0 to 1, which can be interpreted as a probability.
For example, in a problem predicting whether an email is spam or not, logistic regression extracts features from the content of the email and outputs the probability of them being spam.
However, logistic regression can only express linear decision boundaries. Therefore, it is important to consider other methods when the relationships in the data are complex and non-linear.
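To make the role of the logistic (sigmoid) function concrete, here is a minimal sketch: whatever real-valued score the model computes from the features, the sigmoid squashes it into the range 0 to 1 so that it can be read as a probability.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))  # maps any real number into the open interval (0, 1)

print(sigmoid(-5), sigmoid(0), sigmoid(5)) # roughly 0.007, 0.5, 0.993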
First, store the PCA-reduced data as the features in X, and the species number as the target variable in y. Then, split the dataset into training data and test data.
X = penguins_pca[:, :3]
y = df_penguin["Species Number"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
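If you like, you can confirm the sizes of the two splits; with train_size=0.5 the 333 rows are divided roughly in half.
print(X_train.shape, X_test.shape) # each split holds about half of the rows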
Next, define the logistic regression model and train it with the training data. Then use the trained model to make predictions on the test data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Finally, evaluate the performance of the model. Here, we calculate precision, recall, and F1 score for each penguin species and display them. These metrics indicate how accurately the model predicts each class.
evaluation = classification_report(y_test, y_pred, target_names=["Adelie Penguin", "Chinstrap Penguin", "Gentoo Penguin"], output_dict=True)
print(pd.DataFrame(evaluation))
| Adelie Penguin | Chinstrap Penguin | Gentoo Penguin | accuracy | macro avg | weighted avg |
---|---|---|---|---|---|---|
precision | 0.925 | 1 | 1 | 0.964072 | 0.975 | 0.966766 |
recall | 1 | 0.848485 | 0.983333 | 0.964072 | 0.943939 | 0.964072 |
f1-score | 0.961039 | 0.918033 | 0.991597 | 0.964072 | 0.956889 | 0.96352 |
support | 74 | 33 | 60 | 0.964072 | 167 | 167 |
Let’s try using a deep neural network instead of logistic regression. A deep neural network is a type of machine learning model inspired by the way the human brain processes information: it aims to reproduce on a computer the mechanism by which neurons transmit signals. The model is built from stacked components called “layers” that transmit and transform data in succession, allowing it to solve complex problems.
Unlike logistic regression, deep neural networks generally have a multilayer structure. Each layer plays its own role, collectively enabling more advanced recognition and understanding. This allows us to handle not only simple linear problems but also complex nonlinear problems.
However, training deep neural networks requires a lot of computational resources and time. Therefore, it’s important to choose the appropriate model considering the complexity of the problem and the amount of data.
First, select the columns to use as features from the penguin dataset and store them in X. Next, one-hot encode the target variable (penguin species) and store it in y. Finally, split the dataset into training data and test data.
X = df_penguin[["Flipper Length (mm)", "Body Mass (g)", "Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)", "Sex Number"]].values
y = pd.get_dummies(df_penguin["Species Number"]) # one hot vector
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
Neural networks may not learn well if the scales of the features are different. Therefore, standardize the data (convert to mean 0 and standard deviation 1). Here, we are using StandardScaler.
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"Original body mass data mean: {X_train[:, 1].mean().round()}, standard deviation: {X_train[:, 1].std().round()}")
print(f"Standardized body mass data mean: {X_train_scaled[:, 1].mean().round()}, standard deviation: {X_train_scaled[:, 1].std().round()}")
Original body mass data mean: 4242.0, standard deviation: 810.0
Standardized body mass data mean: -0.0, standard deviation: 1.0
Next, define the neural network model. Here, we are creating a model that consists of an input layer, two hidden layers, and an output layer. We use ReLU for the activation function of the hidden layers and softmax for the activation function of the output layer.
inputs = keras.Input(shape=(X_train_scaled.shape[1],))
hidden_layer = keras.layers.Dense(20, activation="relu")(inputs)
hidden_layer2 = keras.layers.Dense(10, activation="relu")(hidden_layer)
output_layer = keras.layers.Dense(3, activation="softmax")(hidden_layer2)
model = keras.Model(inputs=inputs, outputs=output_layer)
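If you want to inspect the structure you have just defined, model.summary() prints each layer along with its output shape and parameter count.
model.summary()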
The defined model is trained using the training data. Adam is used as the optimization algorithm and categorical cross entropy is used as the loss function.
optimizer = keras.optimizers.Adam()
loss = keras.losses.CategoricalCrossentropy()
model.compile(optimizer, loss)
history = model.fit(X_train_scaled, y_train, epochs=200, verbose=0, validation_data=(X_test_scaled, y_test))
To visualize the progress of the learning, the losses of the training data and validation data are plotted.
plot_x = np.arange(len(history.history["loss"])).tolist() * 2 # epoch numbers, repeated for the two curves
plot_y = history.history["loss"] + history.history["val_loss"] # training loss followed by validation loss
plot_label = [0] * len(history.history["loss"]) + [1] * len(history.history["val_loss"]) # 0 = training, 1 = validation
apl.interactive.plot(plot_y, plot_x, plot_label)
Predictions are made on the test data using the trained model.
model_output = model.predict(X_test_scaled)
y_pred = pd.DataFrame(model_output, columns=y_test.columns, index=y_test.index)
The prediction results are displayed along with the specific penguin data.
index = 10
print(f"Data number {index}")
print(f"Flipper length: {X_test[index, 0]:.1f}mm")
print(f"Body weight: {X_test[index, 1]:.1f}g")
print(f"Bill length: {X_test[index, 2]:.1f}mm")
print(f"Bill depth: {X_test[index, 3]:.1f}mm")
print(f"Flipper length: {X_test[index, 4]:.1f}mm")
print(f"Gender code: {X_test[index, 5]}")
print(f"Probability of being an Adelie penguin: {(y_pred.values[index, 0] * 100):.1f}%")
print(f"Probability of being a Gentoo penguin: {(y_pred.values[index, 1] * 100):.1f}%")
print(f"Probability of being a Chinstrap penguin: {(y_pred.values[index, 2] * 100):.1f}%")
print(f"Correct answer: {['Adelie penguin', 'Gentoo penguin', 'Chinstrap penguin'][y_test.idxmax(axis='columns').values[index]]}")
Data number 10
Flipper length: 230.0mm
Body weight: 5700.0g
Beak length: 50.0mm
Beak depth: 16.3mm
Flipper length: 230.0mm
Gender code: 1.0
Probability of being an Adelie penguin: 0.0%
Probability of being a Chinstrap penguin: 0.0%
Probability of being a Gentoo penguin: 100.0%
Correct answer: Gentoo penguin
Finally, the performance of the model is evaluated. Here, precision, recall, and F1 scores are calculated and displayed for each penguin species. In this example, all indicators are 1.0, meaning the model predicted this test set perfectly. In real problems, however, such perfect predictions are rare; these indicators are generally used to evaluate the model’s performance and to adjust the model as needed.
This completes the basic flow of the neural network covered here. Through the steps of data preparation, preprocessing, model definition, training, prediction, and evaluation, you can understand the basic procedure of machine learning with neural networks.
y_label_pred = y_pred.idxmax(axis="columns").values # one hot vector -> label
y_label_test = y_test.idxmax(axis="columns").values
evaluation = classification_report(y_label_test, y_label_pred, target_names=["Adelie penguin", "Chinstrap penguin", "Gentoo penguin"], output_dict=True)
print(pd.DataFrame(evaluation))
| Adelie penguin | Chinstrap penguin | Gentoo penguin | accuracy | macro avg | weighted avg |
---|---|---|---|---|---|---|
precision | 1 | 1 | 1 | 1 | 1 | 1 |
recall | 1 | 1 | 1 | 1 | 1 | 1 |
f1-score | 1 | 1 | 1 | 1 | 1 | 1 |
support | 74 | 34 | 59 | 1 | 167 | 167 |
Unsupervised learning learns from input data alone, discovering structures or patterns in the data. Because no output data (labels) are provided, it is used, for example, to explore whether the data might contain unknown groupings (such as a possible new species) or when labeling in advance is impractical.
Here, we divide the penguin data into three clusters using k-means clustering. k-means partitions the data into k clusters so that the total distance between each data point and the centroid of its assigned cluster is as small as possible.
from sklearn.cluster import KMeans
predicted_labels = KMeans(n_clusters=3).fit_predict(penguins_pca)
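Because the true species labels happen to be available in this dataset, one optional way to judge the clustering is to cross-tabulate the cluster assignments against the species column. Note that the cluster numbers produced by k-means are arbitrary, so they need not match the species numbers.
print(pd.crosstab(df_penguin["Species"], predicted_labels)) # rows: species, columns: assigned cluster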
Finally, the PCA-reduced data is plotted with the clusters assigned by k-means as labels. This lets you check how the data was grouped.
x = penguins_pca[:, 0]
y = penguins_pca[:, 1]
label = predicted_labels
apl.interactive.plot(y, x, label)
We have been able to gain at least a glimpse of the entire process of data science. Although we have only covered a few examples, the ability to scientifically understand information and apply it to problem-solving is essential in the modern era.
Our goal is to grasp the big picture of data science and acquire the skills to apply each step of the process, from data preparation to feature engineering, supervised learning, neural networks, and unsupervised learning. By utilizing the knowledge we have learned here, let’s explore the vast sea of data and cultivate the ability to make new discoveries.
These learnings will become powerful tools for you to solve future challenges you may encounter. I wish you success in your journey through data science.