Introduction: The Tragedy That Teaches Data Science
On April 15, 1912, the RMS Titanic sank after hitting an iceberg, claiming the lives of 1,502 passengers and crew. It remains one of the deadliest peacetime maritime disasters. But what if we could look back at the passenger manifests and ask: Who was likely to survive?
Beyond historical curiosity, this question is the perfect starting point for a data science journey. The Titanic survival prediction is a classic introductory problem: small enough to fit in memory, yet rich with real-world complexity. It forces us to handle missing data, encode categorical variables, engineer features, and choose among classification algorithms.
In this blog, we’ll walk through a complete machine learning workflow to predict whether a passenger survived. By the end, you’ll have a working model and, more importantly, an understanding of how data science turns raw records into actionable insights.
What You’ll Learn
- How to frame a binary classification problem.
- Exploratory Data Analysis (EDA) with Python.
- Handling missing values and outliers.
- Feature engineering from raw data.
- Building and tuning models (Logistic Regression, Random Forest, etc.).
- Evaluating model performance with accuracy, precision, recall, and ROC curves.
- Interpreting what the model learned about survival.
Prerequisites
You’ll need Python with the usual data science libraries. If you’re new to the stack, check out our earlier posts Build Your First Neural Network in Python and Build CNN Model in Python.
pip install pandas numpy matplotlib seaborn scikit-learn
We’ll also use Jupyter Notebook or any Python IDE. No prior machine learning experience is required, but familiarity with Python basics is helpful.
Step 1: Getting the Data
The Titanic dataset is widely available. We’ll use the version from Kaggle (or you can download it from various sources). For this tutorial, we’ll assume you have two CSV files: train.csv and test.csv. The training set contains both features and the survival outcome; the test set lacks survival labels.
Let’s load the data and take a first look.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

print("Training set shape:", train.shape)
print("Test set shape:", test.shape)
Output:
Training set shape: (891, 12)
Test set shape: (418, 11)
The training set has 891 passengers with 12 columns. The test set has 418 passengers and one fewer column: Survived is missing.
Column Descriptions
| Column | Description |
|---|---|
| PassengerId | Unique ID for each passenger |
| Survived | 0 = No, 1 = Yes (target variable) |
| Pclass | Ticket class: 1st, 2nd, 3rd |
| Name | Name of passenger |
| Sex | Male or Female |
| Age | Age in years |
| SibSp | Number of siblings/spouses aboard |
| Parch | Number of parents/children aboard |
| Ticket | Ticket number |
| Fare | Passenger fare |
| Cabin | Cabin number |
| Embarked | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |
Step 2: Exploratory Data Analysis (EDA)
EDA helps us understand the data and spot patterns that might influence survival.
2.1 Check for Missing Values
train.isnull().sum()
We see:
- Age has 177 missing values.
- Cabin has 687 missing values (a huge proportion).
- Embarked has 2 missing values.
We’ll need to handle these.
2.2 Survival Rate
Let’s compute the overall survival rate.
survival_rate = train['Survived'].mean()
print(f"Overall survival rate: {survival_rate:.2%}")
Output: Overall survival rate: 38.38%
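That 38.38% also sets the bar for evaluation: any model should beat the majority-class baseline of always predicting "did not survive". A quick sketch using the survivor counts from train.csv (342 survivors out of 891):

```python
import pandas as pd

# Majority-class baseline: with 342 survivors among 891 passengers (38.38%),
# always predicting "did not survive" is already right about 61.6% of the time.
survived = pd.Series([1] * 342 + [0] * 549)
baseline_acc = (survived == 0).mean()
print(f"Majority-class baseline accuracy: {baseline_acc:.2%}")
```

So "beating 50%" is not enough; a useful model has to beat roughly 62%.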
2.3 Categorical Analysis
Sex and Survival
Women had a much higher chance of survival, a well‑known fact from history.
sns.barplot(x='Sex', y='Survived', data=train)
plt.title('Survival Rate by Sex')
plt.show()
Pclass and Survival
Wealthier passengers (1st class) survived at a higher rate.
sns.barplot(x='Pclass', y='Survived', data=train)
plt.title('Survival Rate by Passenger Class')
plt.show()
Embarked and Survival
The port of embarkation also shows variation (though less pronounced).
sns.barplot(x='Embarked', y='Survived', data=train)
plt.title('Survival Rate by Embarkation Port')
plt.show()
2.4 Age Distribution
Let’s plot the age distribution of survivors vs. non‑survivors.
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
train[train['Survived'] == 0]['Age'].hist(bins=30, alpha=0.7, label='Died')
train[train['Survived'] == 1]['Age'].hist(bins=30, alpha=0.7, label='Survived')
plt.legend()
plt.title('Age Distribution by Survival')

plt.subplot(1, 2, 2)
sns.boxplot(x='Survived', y='Age', data=train)
plt.title('Age Boxplot by Survival')

plt.show()
Young children had higher survival rates; older adults had lower rates.
2.5 Correlation Matrix
We can compute correlations among numeric features.
numeric_cols = ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
sns.heatmap(train[numeric_cols].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
Key observations:
- Pclass is negatively correlated with Survived (higher class = better survival).
- Fare is positively correlated with Survived (richer passengers survived more).
- Age has a weak negative correlation.
Step 3: Data Cleaning and Preprocessing
Now we prepare the data for modeling.
3.1 Handle Missing Values
- Age: We’ll fill missing ages with the median age grouped by Pclass and Sex.
- Cabin: Too many missing values; we’ll extract a Has_Cabin feature and drop the original.
- Embarked: Only 2 missing; fill with the most frequent port (S).
# Fill Age using median grouped by Pclass and Sex
train['Age'] = train.groupby(['Pclass', 'Sex'])['Age'].transform(lambda x: x.fillna(x.median()))

# Fill Embarked with the mode
train['Embarked'] = train['Embarked'].fillna('S')

# Create a feature for "has cabin"
train['Has_Cabin'] = train['Cabin'].notna().astype(int)

# Drop Cabin and Ticket; we keep Name for now because we extract a Title feature from it in Step 3.3
train.drop(['Cabin', 'Ticket'], axis=1, inplace=True)
We’ll apply the same transformations to the test set later.
3.2 Outlier Detection
Check for extreme values in Fare and Age using boxplots. For now, we’ll keep them: tree‑based models are robust to outliers, but if we use logistic regression, we might cap them.
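If we did later decide to cap extreme fares for logistic regression, IQR-based clipping is one common approach. A minimal sketch (`cap_outliers_iqr` is an illustrative helper, not part of the tutorial pipeline):

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# A few illustrative fares, including the famous 512.33 outlier
fare = pd.Series([7.25, 8.05, 13.0, 26.0, 512.33])
capped = cap_outliers_iqr(fare)
print(capped.tolist())  # the 512.33 fare is pulled down to the upper fence
```

Clipping keeps every row (unlike dropping outliers) while limiting the leverage any single extreme value has on a linear model.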
3.3 Feature Engineering
We can create new features that might improve predictive power.
- FamilySize: SibSp + Parch + 1
- IsAlone: 1 if FamilySize == 1, else 0
- Title: extracted from Name (e.g., Mr, Mrs, Miss); this often captures social status.
Let’s add these:
# Family size
train['FamilySize'] = train['SibSp'] + train['Parch'] + 1
train['IsAlone'] = (train['FamilySize'] == 1).astype(int)

# Extract Title from Name
train['Title'] = train['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Group rare titles
train['Title'] = train['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                         'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
train['Title'] = train['Title'].replace('Mlle', 'Miss')
train['Title'] = train['Title'].replace('Ms', 'Miss')
train['Title'] = train['Title'].replace('Mme', 'Mrs')

# Name is no longer needed once Title is extracted
train.drop('Name', axis=1, inplace=True)
We’ll also encode categorical variables into numbers.
3.4 Encode Categorical Features
We have Sex, Embarked, and Title as categories. Use one‑hot encoding or label encoding. We’ll use pandas.get_dummies.
train = pd.get_dummies(train, columns=['Sex', 'Embarked', 'Title'], drop_first=True)
Now we have numeric columns ready.
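To see exactly what get_dummies with drop_first=True produces, here is a tiny illustration on made-up rows (the first category of each column becomes the implicit baseline):

```python
import pandas as pd

# Two toy categorical columns, mimicking Sex and Embarked in the Titanic data
df = pd.DataFrame({'Sex': ['male', 'female', 'male'],
                   'Embarked': ['S', 'C', 'Q']})

# drop_first=True drops the alphabetically first category of each column
# ('female' and 'C'), which becomes the implicit baseline
encoded = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)
print(list(encoded.columns))  # ['Sex_male', 'Embarked_Q', 'Embarked_S']
```

Dropping one level per column avoids redundant (perfectly collinear) dummies, which matters most for linear models like logistic regression.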
Step 4: Build the Model
We split the training data into training and validation sets (e.g., 80/20) to evaluate performance.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

# Define features (X) and target (y)
X = train.drop(['Survived', 'PassengerId'], axis=1)
y = train['Survived']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
We’ll try two models: Logistic Regression (interpretable) and Random Forest (often higher accuracy).
4.1 Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_val)

print("Logistic Regression Accuracy:", accuracy_score(y_val, y_pred_lr))
print(classification_report(y_val, y_pred_lr))
4.2 Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_val)

print("Random Forest Accuracy:", accuracy_score(y_val, y_pred_rf))
print(classification_report(y_val, y_pred_rf))
Typically, Random Forest performs better, often achieving 80–83% accuracy on the validation set.
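A single 80/20 split can be noisy, so a fairer comparison uses cross-validation. A sketch on synthetic stand-in data (on the real data you would pass X and y from Step 4 instead):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Titanic features (illustrative only)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)

for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42))]:
    # Average accuracy over 5 different train/validation splits
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the mean and spread across folds makes it clearer whether one model is genuinely better or just lucky on one split.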
4.3 Feature Importance (Random Forest)
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.sort_values(ascending=False).plot(kind='bar', figsize=(10, 6))
plt.title('Feature Importance')
plt.show()
You’ll likely see Sex_male, Title_Mr, Fare, and Age among the top features.
Step 5: Tune the Model
We can improve Random Forest by tuning hyperparameters like n_estimators, max_depth, and min_samples_split. Use GridSearchCV.
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10]
}

grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validation score:", grid.best_score_)

# Evaluate on the validation set
best_rf = grid.best_estimator_
y_pred_best = best_rf.predict(X_val)
print("Validation Accuracy:", accuracy_score(y_val, y_pred_best))
Step 6: Prepare Test Set and Make Predictions
We must apply the same preprocessing to the test set.
# Load test data
test = pd.read_csv('test.csv')
passenger_ids = test['PassengerId']

# Feature engineering (same as on train)
test['Age'] = test.groupby(['Pclass', 'Sex'])['Age'].transform(lambda x: x.fillna(x.median()))
test['Embarked'] = test['Embarked'].fillna('S')
test['Fare'] = test['Fare'].fillna(test['Fare'].median())  # test.csv has one missing Fare
test['Has_Cabin'] = test['Cabin'].notna().astype(int)
test['FamilySize'] = test['SibSp'] + test['Parch'] + 1
test['IsAlone'] = (test['FamilySize'] == 1).astype(int)

test['Title'] = test['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
test['Title'] = test['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                       'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
test['Title'] = test['Title'].replace('Mlle', 'Miss')
test['Title'] = test['Title'].replace('Ms', 'Miss')
test['Title'] = test['Title'].replace('Mme', 'Mrs')

# Drop columns not used as features
test.drop(['Cabin', 'Ticket', 'Name'], axis=1, inplace=True)

# Encode categorical variables (must match the training columns)
test = pd.get_dummies(test, columns=['Sex', 'Embarked', 'Title'], drop_first=True)

X_test = test.drop(['PassengerId'], axis=1)

# Ensure all training columns are present; add missing ones with zeros
for col in X.columns:
    if col not in X_test.columns:
        X_test[col] = 0
X_test = X_test[X.columns]  # reorder to match training

# Predict
test_predictions = best_rf.predict(X_test)

# Create submission file
submission = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': test_predictions})
submission.to_csv('titanic_submission.csv', index=False)
print("Submission saved.")
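The column-alignment step can also be written with pandas' DataFrame.reindex, which adds missing columns and fixes the order in one call. A small sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical mismatch: the test frame is missing a dummy column
# ('Title_Rare') that exists in the training features
train_cols = ['Age', 'Fare', 'Sex_male', 'Title_Rare']
X_test_demo = pd.DataFrame({'Age': [22.0], 'Fare': [7.25], 'Sex_male': [1]})

# reindex adds the missing columns (filled with 0) and enforces the training order
X_test_demo = X_test_demo.reindex(columns=train_cols, fill_value=0)
print(list(X_test_demo.columns))  # ['Age', 'Fare', 'Sex_male', 'Title_Rare']
```

This is equivalent in effect to the loop-plus-reorder approach, just more compact.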
Step 7: Model Evaluation and Interpretation
Beyond accuracy, we should examine:
- Confusion Matrix: shows true/false positives and negatives.
- Precision & Recall: important when the costs of false positives and false negatives differ.
- ROC AUC: measures the model’s ability to separate the classes.
from sklearn.metrics import roc_curve, auc

y_proba = best_rf.predict_proba(X_val)[:, 1]
fpr, tpr, _ = roc_curve(y_val, y_proba)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f'Random Forest (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
An AUC close to 0.9 indicates excellent discriminative power.
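The confusion matrix and precision/recall mentioned above can be computed the same way. A sketch on synthetic stand-in data (on the real data, pass y_val and y_pred_best instead):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, random_state=0)
Xtr, Xva, ytr, yva = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(Xtr, ytr)
pred = clf.predict(Xva)

cm = confusion_matrix(yva, pred)  # rows: actual 0/1, columns: predicted 0/1
print(cm)
print("Precision:", precision_score(yva, pred))
print("Recall:", recall_score(yva, pred))
```

On the Titanic problem, recall on the "survived" class tells you what fraction of actual survivors the model finds, while precision tells you how many predicted survivors really survived.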
What Did the Model Learn?
From the feature importance plot, we see that:
- Gender is the strongest predictor, consistent with "women and children first".
- Passenger class and fare reflect socio-economic advantage.
- Age (children) and family size also matter.
- Title (Mr, Mrs, Miss) captures social roles.
Conclusion: Can We Predict Titanic Survivors?
Yes, we can build a model that predicts survival with about 80–85% accuracy, well above both a coin flip and the roughly 62% majority-class baseline of always predicting "did not survive". While no model can be perfect (tragedy always has an element of randomness), machine learning reveals the systematic factors that influenced survival.
But more importantly, this project teaches the end‑to‑end process of data science:
- Question: What factors affect survival?
- Data: Acquire, explore, and clean.
- Model: Choose algorithms and tune.
- Interpret: Communicate findings.
At TuxAcademy, we believe every data scientist should start with Titanic. It’s a small dataset with a big story. And it’s just the beginning-the same skills apply to predicting customer churn, loan defaults, or medical outcomes.
What’s Next?
- Experiment with other algorithms like XGBoost or neural networks.
- Perform more advanced feature engineering (e.g., group families, extract the cabin deck).
- Try ensemble methods like stacking to boost accuracy.
- Deploy your model using Flask or Streamlit to make a simple web app.
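A first step toward deployment is persisting the trained model so a web app can load it without retraining. A minimal sketch using the standard library's pickle (with a stand-in model; in practice you would pickle your tuned best_rf):

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on toy data (illustrative only)
model = LogisticRegression().fit(np.array([[0.0], [1.0], [2.0], [3.0]]), [0, 0, 1, 1])

# Save the fitted model to disk
with open('titanic_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Later (e.g., inside a Flask or Streamlit app), load it back and predict
with open('titanic_model.pkl', 'rb') as f:
    loaded = pickle.load(f)

print(loaded.predict(np.array([[2.5]])))
```

The loaded model predicts identically to the original, so the app never needs the training data.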
Frequently Asked Questions
Q: Why is the survival rate so low?
A: Only about 38% survived. The Titanic did not have enough lifeboats, and many in third class had limited access to the deck.
Q: Why do we drop the Cabin column?
A: With 77% missing, it’s unreliable. However, we created a Has_Cabin feature, which does add value.
Q: Could we achieve higher accuracy?
A: Yes, with careful feature engineering and tuning, some Kaggle models reach 82–85% accuracy on the test set. But beyond that, the data may contain irreducible noise.
Q: How do I submit to Kaggle?
A: After creating titanic_submission.csv, you can upload it to the Titanic competition page to see your score.

