Titanic — Machine Learning from Disaster — Data Cleaning

Lamalharbi
4 min read · Jan 16, 2021

Done by: Lamalharbi, Monirah abdulaziz, Aisha Y Hakami


Introduction

In this article we will walk through the whole process of cleaning the data and developing a machine learning model on the Titanic dataset, one of the famous datasets that every Data Scientist explores at the start of their career — and here we are. The dataset provides information on the fate of the Titanic's passengers, summarized according to sex, age, economic status (class), and survival.

In addition, we joined the Titanic: Machine Learning from Disaster challenge on Kaggle and submitted our work to see the score.

Data Cleaning

# imports and data upload
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv('../datasets/train.csv')
test = pd.read_csv('../datasets/test.csv')

Check the null values:

Null values in train and Test

Observation:

  • There are missing values in train: Age, Cabin and Embarked.
  • In test: Age, Fare and Cabin.
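The per-column null counts summarized above can also be read off numerically with `isnull().sum()`; a minimal sketch on a toy frame (the values are illustrative, not the real dataset):

```python
import numpy as np
import pandas as pd

# toy stand-in for the Titanic train frame (illustrative values)
train = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", np.nan],
})

# per-column count of missing values
null_counts = train.isnull().sum()
print(null_counts)
```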
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(18, 6))
# train data
sns.heatmap(train.isnull(), yticklabels=False, ax=ax[0], cbar=False, cmap='viridis')
ax[0].set_title('Train data')
# test data
sns.heatmap(test.isnull(), yticklabels=False, ax=ax[1], cbar=False, cmap='viridis')
ax[1].set_title('Test data');

Heatmap to visualize the null values before cleaning

1. Embarked feature in Train

First, we need to check how many ports appear in the Embarked column:

train.Embarked.value_counts()

the result was:

S 644
C 168
Q 77

We could simply drop the two rows with missing Embarked values in train; instead, let's fill the nulls with the most common port of embarkation.

from collections import Counter
train.Embarked = train.Embarked.replace(np.nan, Counter(train.Embarked).most_common(1)[0][0])
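The `Counter` one-liner is equivalent to filling with the column's mode; a sketch of that alternative on a toy series (not the author's exact data):

```python
import numpy as np
import pandas as pd

embarked = pd.Series(["S", "C", "S", "Q", np.nan, "S"])

# fillna with the most frequent value -- same effect as
# Counter(...).most_common(1)[0][0]
filled = embarked.fillna(embarked.mode()[0])
print(filled.tolist())
```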

2. Fare feature in Test

To fill the missing Fare value we can use the Pclass column: the passenger with the missing Fare is in third class, so we fill it with the mean Fare of Pclass 3 in train.

class3_mean = train[train['Pclass']==3]['Fare'].mean()
test['Fare'] = test['Fare'].replace({np.nan:class3_mean})
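If fares were missing across several classes, the same idea generalizes with a group-wise transform; a hedged sketch on toy data (not the author's code):

```python
import numpy as np
import pandas as pd

# toy stand-in for the test frame (illustrative fares)
test = pd.DataFrame({
    "Pclass": [3, 1, 3, 2, 3],
    "Fare": [7.25, 71.28, np.nan, 13.0, 8.05],
})

# fill each missing Fare with the mean Fare of its own Pclass
test["Fare"] = test["Fare"].fillna(
    test.groupby("Pclass")["Fare"].transform("mean")
)
```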

3. Age feature in Train and Test

# defining a function 'impute_age'
def impute_age(age_pclass):  # age_pclass is a pair of ['Age', 'Pclass'] values

    # age_pclass[0] is 'Age'
    Age = age_pclass[0]

    # age_pclass[1] is 'Pclass'
    Pclass = age_pclass[1]

    # if Age is missing, fill it based on Pclass
    if pd.isnull(Age):
        if Pclass == 1:
            return 38
        elif Pclass == 2:
            return 30
        else:
            return 25
    else:
        return Age

Using the above function we can fill the missing Age values with the approximate mean age of each Pclass.

# (for train) grab age and apply the impute_age, our custom function
train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
# (for test) grab age and apply the impute_age, our custom function
test['Age'] = test[['Age','Pclass']].apply(impute_age,axis=1)
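The constants 38, 30 and 25 are (roughly) the mean age per class, so they can be derived rather than hard-coded. A sketch of that variant on toy data, using `groupby` plus `map`:

```python
import numpy as np
import pandas as pd

# toy stand-in for the train frame (illustrative ages)
train = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, 36.0, 31.0, np.nan, 24.0, 26.0, np.nan],
})

# mean age per class, then fill each missing Age from its class mean
class_means = train.groupby("Pclass")["Age"].mean()
train["Age"] = train["Age"].fillna(train["Pclass"].map(class_means))
```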

4. Cabin feature in Train and Test

Now we are halfway there. The last step is the Cabin column, where we need to apply some feature engineering:

  • If there was a value for Cabin — replace with 1
  • If the value is missing/null — replace with 0
# Train:
train.loc[train['Cabin'].notnull(), 'Cabin'] = 1
train['Cabin'] = train['Cabin'].replace({np.nan: 0})
train['Cabin'] = train['Cabin'].astype(int)
# Test:
test.loc[test['Cabin'].notnull(), 'Cabin'] = 1
test['Cabin'] = test['Cabin'].replace({np.nan: 0})
test['Cabin'] = test['Cabin'].astype(int)
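The three lines per frame collapse to a one-liner with `notnull().astype(int)`; a sketch on a toy column:

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan]})

# 1 if a cabin is recorded, 0 otherwise
train["Cabin"] = train["Cabin"].notnull().astype(int)
print(train["Cabin"].tolist())
```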

Now let’s take a look at the heatmap again:

Heatmap to visualize the null values after cleaning

Looks great! No more missing values :)

Feature Engineering

To include the categorical variables as predictors in statistical and machine learning models, we need to encode them as dummy variables.

Dummy the Sex and Embarked columns

# Train
sex = pd.get_dummies(train['Sex'],drop_first=True)
embark = pd.get_dummies(train['Embarked'],drop_first=True)
train = pd.concat([train, sex,embark],axis=1)
train=train.drop(['Sex','Embarked'], axis=1)
train.rename(columns={"male": "sex_male", "Q": "Embarked_Q","S": "Embarked_S"}, inplace=True)
# Same for the Test
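The test-set version mirrors the train code above; a sketch of it, on a toy stand-in frame and assuming `test` still has its raw `Sex` and `Embarked` columns:

```python
import pandas as pd

# toy stand-in for the raw test frame (illustrative rows)
test = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "Q", "C"],
})

sex = pd.get_dummies(test["Sex"], drop_first=True)          # drops 'female'
embark = pd.get_dummies(test["Embarked"], drop_first=True)  # drops 'C'
test = pd.concat([test, sex, embark], axis=1)
test = test.drop(["Sex", "Embarked"], axis=1)
test.rename(columns={"male": "sex_male", "Q": "Embarked_Q", "S": "Embarked_S"},
            inplace=True)
```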

Model Preparation

1. First we need to select our features, which will be the following:

[Pclass, Age, SibSp, Parch, Fare, Cabin, sex_male, Embarked_Q, Embarked_S]

And our target will be the feature: Survived

2. Then we write a list comprehension to grab the selected features.

3. Separate the selected features into X_train and the Survived column into y_train.

features_drop = ['PassengerId', 'Name', 'Ticket', 'Survived']
selected_features = [x for x in train.columns if x not in features_drop]
X_train = train[selected_features]
y_train = train['Survived']

Now the data is ready to train on the models.
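As a final sanity check, any scikit-learn classifier can now be fit on `X_train`/`y_train`; a minimal sketch with logistic regression on random toy data (the model choice and data here are ours, not stated in the original):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# toy stand-in for the cleaned training frame
rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    "Pclass": rng.integers(1, 4, 50),
    "Age": rng.uniform(1, 70, 50),
    "Fare": rng.uniform(5, 100, 50),
})
y_train = pd.Series(rng.integers(0, 2, 50))

# fit and predict on the same frame, just to confirm the pipeline runs
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
preds = model.predict(X_train)
```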
