
Introduction
In the field of machine learning, classification is a widely used technique that involves categorizing data into predefined classes or categories. This technique is commonly used in various applications such as image recognition, spam filtering, fraud detection, and sentiment analysis.
In a classification problem, the machine learning algorithm is trained to predict the class or category of a given input based on its features. For instance, a spam filtering algorithm might classify an email as spam or not spam based on the presence or absence of certain keywords.
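To make that concrete, here is a minimal toy sketch of such a classifier; the keyword counts and labels below are made up purely for illustration and are not part of any real dataset:
# Toy spam classifier: features are made-up counts of the words "free" and "winner"
from sklearn.linear_model import LogisticRegression
X = [[3, 2], [0, 0], [5, 1], [1, 0], [4, 3], [0, 1]]  # keyword counts per email
y = [1, 0, 1, 0, 1, 0]                                # 1 = spam, 0 = not spam
clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[2, 1]]))  # classify a new email containing "free" twice and "winner" once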
However, before we can start building a classification model, we need to first identify whether the problem we are trying to solve is a classification problem or not. In this blog, we will explore the characteristics of classification problems and learn how to detect them. We will also discuss some common techniques that are used to solve classification problems, and how they can be applied in various real-world scenarios.
Whether you are a beginner or an experienced data scientist, this blog will provide you with a solid understanding of classification problems and equip you with the skills to detect and solve them effectively. So, let's dive in!
Problem statement
For this tutorial, I will be using a dataset from The Johns Hopkins University, where the goal is to predict the likelihood of sepsis in ICU patients.
The dataset is described in the following table:
Column Name | Attribute/Target | Description |
---|---|---|
ID | N/A | Unique number to represent patient ID |
PRG | Attribute 1 | Plasma glucose |
PL | Attribute 2 | Blood Work Result-1 (mu U/ml) |
PR | Attribute 3 | Blood Pressure (mm Hg) |
SK | Attribute 4 | Blood Work Result-2 (mm) |
TS | Attribute 5 | Blood Work Result-3 (mu U/ml) |
M11 | Attribute 6 | Body mass index (weight in kg/(height in m)^2) |
BD2 | Attribute 7 | Blood Work Result-4 (mu U/ml) |
Age | Attribute 8 | Patient's age (years) |
Insurance | N/A | Whether the patient holds a valid insurance card |
Sepssis | Target | Positive if the patient in ICU will develop sepsis, Negative otherwise |
The target column I will predict is Sepssis, which has two possible values: Positive if the patient will develop sepsis, and Negative if they will not.
First things first, we need to load the data and then do some Exploratory Data Analysis (EDA). If you are not familiar with EDA, here is a definition:
- Exploratory Data Analysis (EDA) is the process of examining and analyzing data to extract insights, identify patterns, and understand the underlying structure of the data. EDA is typically the first step in any data analysis process and is performed before building predictive models or making any inferences from the data.
We do EDA to gain a deeper understanding of the data and its properties, which helps identify any issues or anomalies that may be present. EDA also helps reveal relationships between the different variables and the target variable, and can guide feature selection for predictive modeling.
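In practice, the EDA in this post boils down to a handful of pandas calls. Here is a rough sketch of the typical first steps (the file name is just a placeholder; we will run the real versions on our dataset below):
# typical first EDA calls on a freshly loaded dataframe
import pandas as pd
df = pd.read_csv("some_dataset.csv")  # placeholder path for illustration
df.head()          # peek at the first rows
df.info()          # column dtypes and non-null counts
df.describe().T    # summary statistics for numeric columns
df.isnull().sum()  # missing values per column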
Data retrieval
As a first step, we need to load the data; in this case, the data comes in .csv format.
# Load the data train file into a dataframe called "train_df"
train_df = pd.read_csv("Paitients_Files_Train.csv")
# Load the data test file into a dataframe called "test_df"
test_df = pd.read_csv("Paitients_Files_Test.csv")
Before running that, however, we need to install the necessary packages and import them into the notebook:
# pip3 install matplotlib pandas seaborn scikit-learn numpy
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pickle as pkl
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import ElasticNet
from sklearn.metrics import *
Let's go through them and see how they can be used in this notebook:
- os: Used to interact with the operating system, such as setting working directories or listing files in a directory.
- pandas: Used to work with data frames, which are a table-like data structure in Python. Pandas provides tools for data manipulation, cleaning, and analysis.
- matplotlib.pyplot: A plotting library used for creating data visualizations in Python. It provides various functions to create different types of plots, such as line plots, scatter plots, and histograms.
- numpy: A fundamental package for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, as well as a large collection of mathematical functions to operate on these arrays.
- seaborn: A data visualization library built on top of matplotlib. It provides a high-level interface for creating informative and attractive statistical graphics.
- pickle: A module used for object serialization and deserialization in Python. It allows for the efficient storage and retrieval of Python objects.
- LabelEncoder: A class from the scikit-learn library used to convert categorical variables into numerical labels.
- LogisticRegression: A class from scikit-learn used to fit logistic regression models for binary classification.
- LogisticRegressionCV: A class from scikit-learn used to perform logistic regression with cross-validation.
- DecisionTreeClassifier: A class from scikit-learn used to fit decision tree models for classification.
- RandomForestClassifier: A class from scikit-learn used to fit random forest models for classification.
- train_test_split: A function from scikit-learn used to split data into training and testing sets.
- GridSearchCV: A class from scikit-learn used for hyperparameter tuning by searching over specified parameter values for an estimator.
- ElasticNet: A class from scikit-learn used to fit linear regression models with both L1 and L2 regularization.
- metrics: A module from scikit-learn containing various metrics used to evaluate model performance, such as accuracy, precision, recall, and F1 score.
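To give a feel for how a few of these pieces fit together, here is a minimal sketch using the imports above (the column names come from this dataset, but the model choice and split are only illustrative, not the final pipeline of this series):
# illustrative only: encode the target, split the training data, fit a model, score it
le = LabelEncoder()
y = le.fit_transform(train_df["Sepssis"])         # Negative -> 0, Positive -> 1
X = train_df.drop(columns=["ID", "Sepssis"])      # keep only the numeric attributes
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(accuracy_score(y_val, model.predict(X_val)))  # accuracy_score comes from sklearn.metrics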
Then, let's take a look at the dataframe of each dataset:
# view first 5 rows of train dataframe
train_df.head()
Index | ID | PRG | PL | PR | SK | TS | M11 | BD2 | Age | Insurance | Sepssis |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | ICU200010 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 0 | Positive |
1 | ICU200011 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 | Negative |
2 | ICU200012 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 | Positive |
3 | ICU200013 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 1 | Negative |
4 | ICU200014 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 | Positive |
# view first 5 rows of test dataframe
test_df.head()
Index | ID | PRG | PL | PR | SK | TS | M11 | BD2 | Age | Insurance |
---|---|---|---|---|---|---|---|---|---|---|
0 | ICU200609 | 1 | 109 | 38 | 18 | 120 | 23.1 | 0.407 | 26 | 1 |
1 | ICU200610 | 1 | 108 | 88 | 19 | 0 | 27.1 | 0.400 | 24 | 1 |
2 | ICU200611 | 6 | 96 | 0 | 0 | 0 | 23.7 | 0.190 | 28 | 1 |
3 | ICU200612 | 1 | 124 | 74 | 36 | 0 | 27.8 | 0.100 | 30 | 1 |
4 | ICU200613 | 7 | 150 | 78 | 29 | 126 | 35.2 | 0.692 | 54 | 0 |
The test_df will be used to evaluate the performance of the model we train. Since we want to predict the "Sepssis" column from the other input variables, this column does not appear in test_df. Instead, we use the input variables in test_df to make predictions for "Sepssis", and then compare those predictions against the held-out actual values to evaluate the accuracy of our model.
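In code, that workflow looks roughly like this (a sketch only; the actual model training and evaluation come later in the series):
# train on the labelled training data, then predict for the unlabelled test file
features = ["PRG", "PL", "PR", "SK", "TS", "M11", "BD2", "Age", "Insurance"]
clf = RandomForestClassifier(random_state=42)
clf.fit(train_df[features], train_df["Sepssis"])
test_predictions = clf.predict(test_df[features])  # to be compared against the held-out labels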
# check the training dataset columns and their datatypes
train_df.info()
# | Column | Non-Null Count | Dtype |
---|---|---|---|
0 | ID | 599 non-null | object |
1 | PRG | 599 non-null | int64 |
2 | PL | 599 non-null | int64 |
3 | PR | 599 non-null | int64 |
4 | SK | 599 non-null | int64 |
5 | TS | 599 non-null | int64 |
6 | M11 | 599 non-null | float64 |
7 | BD2 | 599 non-null | float64 |
8 | Age | 599 non-null | int64 |
9 | Insurance | 599 non-null | int64 |
10 | Sepssis | 599 non-null | object |
# check the testing dataset columns and their datatypes
test_df.info()
# | Column | Non-Null Count | Dtype |
---|---|---|---|
0 | ID | 169 non-null | object |
1 | PRG | 169 non-null | int64 |
2 | PL | 169 non-null | int64 |
3 | PR | 169 non-null | int64 |
4 | SK | 169 non-null | int64 |
5 | TS | 169 non-null | int64 |
6 | M11 | 169 non-null | float64 |
7 | BD2 | 169 non-null | float64 |
8 | Age | 169 non-null | int64 |
9 | Insurance | 169 non-null | int64 |
As we can see, there are no columns in either the training or testing dataset that have missing values, which means we won't need to deal with any null values during data cleaning.
Additionally, the data types for the columns in both datasets appear to be correct, with integer columns being represented as int64 and floating-point columns as float64.
Two columns, ID and Sepssis, are represented as object data type since they contain string values. However, ID is unnecessary and will be removed, while Sepssis will be encoded later on.
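A minimal sketch of those two steps might look like the following; it is shown here for illustration only and works on copies, so the raw dataframes used in the rest of this part stay untouched:
# illustration only: drop ID and encode the target (applied for real in part 2)
train_clean = train_df.drop(columns=["ID"])     # ID carries no predictive information
test_clean = test_df.drop(columns=["ID"])
encoder = LabelEncoder()
train_clean["Sepssis"] = encoder.fit_transform(train_clean["Sepssis"])  # Negative -> 0, Positive -> 1
Now, back to exploring the raw data: let's look at the summary statistics of the training set.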
# summary statistics of the training set, transposed for readability
train_df.describe().T
Column | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
PRG | 599.0 | 3.824708 | 3.362839 | 0.000 | 1.000 | 3.000 | 6.000 | 17.00 |
PL | 599.0 | 120.153589 | 32.682364 | 0.000 | 99.000 | 116.000 | 140.000 | 198.00 |
PR | 599.0 | 68.732888 | 19.335675 | 0.000 | 64.000 | 70.000 | 80.000 | 122.00 |
SK | 599.0 | 20.562604 | 16.017622 | 0.000 | 0.000 | 23.000 | 32.000 | 99.00 |
TS | 599.0 | 79.460768 | 116.576176 | 0.000 | 0.000 | 36.000 | 123.500 | 846.00 |
M11 | 599.0 | 31.920033 | 8.008227 | 0.000 | 27.100 | 32.000 | 36.550 | 67.10 |
BD2 | 599.0 | 0.481187 | 0.337552 | 0.078 | 0.248 | 0.383 | 0.647 | 2.42 |
Age | 599.0 | 33.290484 | 11.828446 | 21.000 | 24.000 | 29.000 | 40.000 | 81.00 |
Insurance | 599.0 | 0.686144 | 0.464447 | 0.000 | 0.000 | 1.000 | 1.000 | 1.00 |
Based on the summary statistics above, we can draw some conclusions about the columns and their values:
- There are 599 observations in each column, which indicates that there are no missing values in the dataset.
- For most columns, the mean and median values are similar, suggesting that their distributions are roughly symmetric and close to a normal distribution.
- The Insurance column contains nominal (binary) data, while the other columns contain either discrete or continuous data.
- Blood Work Result-3 (TS) has a mean of 79.46 and a large standard deviation of 116.58. This attribute has a positively skewed distribution, with most patients having low results and a few patients having very high results (we can verify this numerically, as shown after this list).
- The patient ages range from 21 to 81 years, with a mean age of 33.29 and a standard deviation of 11.83. This indicates that the patient population is relatively young, with most patients being in their 20s or 30s.
- Most patients (68.6%) hold a valid insurance card, as indicated by the Insurance column.
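To back up the skewness observations numerically, we can ask pandas for the skewness of each numeric column (a quick check; values well above zero indicate a right-skewed distribution):
# skewness of the numeric attributes in the training set
print(train_df.drop(columns=["ID", "Sepssis"]).skew())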
# visualize the value distribution of the dataframe
train_df.hist(figsize=(20, 10))
plt.show()
The graph above confirms some observations made earlier about the distribution of the variables in the dataset. For instance, the histograms for the PL, PR, and M11 columns are roughly symmetric and resemble a normal distribution curve, suggesting that these variables are approximately normally distributed.
On the other hand, the histogram for the TS column shows a long right tail, indicating that the distribution is highly skewed to the right. This corresponds to the high skewness value observed in the statistical summary. Similarly, the PRG, SK, BD2, and Age columns also exhibit some degree of skewness.
It is important to note that skewed distributions can affect the performance of certain statistical analyses and machine learning models. Therefore, we may need to consider normalizing or transforming these variables before applying certain techniques or models.
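As one possible remedy (a sketch, not necessarily the transformation we will settle on in later parts), a log transform can compress the long right tail of a column such as TS:
# np.log1p computes log(1 + x), which is safe for the zero values present in TS
ts_log = np.log1p(train_df["TS"])
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
train_df["TS"].hist(ax=axes[0])
axes[0].set_title("TS (raw)")
ts_log.hist(ax=axes[1])
axes[1].set_title("TS (log1p transformed)")
plt.show()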
That's all for now
In part 1 of this series, we focused on retrieving the dataset and importing the necessary packages for data cleaning, exploratory data analysis (EDA), and machine learning. The dataset was described in terms of its data fields and attributes/targets, and then loaded into the notebook.
It is essential to note that while data retrieval is a crucial part of the machine learning workflow, it is only the beginning. The next steps are data cleaning and exploratory data analysis, which are crucial in preparing the dataset for machine learning.
Data cleaning is a process of identifying and addressing issues in the data that could affect the accuracy of the model. These issues include missing values, outliers, duplicates, and inconsistent data. In part 2 of this series, we will focus on data cleaning by identifying and addressing these issues.
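As a preview, those checks typically amount to a few one-liners in pandas (a sketch; part 2 will cover them properly):
# quick checks for the issues mentioned above
print(train_df.isnull().sum())                 # missing values per column
print(train_df.duplicated().sum())             # count of fully duplicated rows
print(train_df.describe().T[["min", "max"]])   # extreme values that may hint at outliers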