Introduction: The Tragedy That Teaches Data Science
On April 15, 1912, the RMS Titanic sank after hitting an iceberg, claiming the lives of 1,502 passengers and crew. It remains one of the deadliest peacetime maritime disasters. But what if we could look back at the passenger manifests and ask: Who was likely to survive?
Beyond historical curiosity, this question is the perfect starting point for a data science journey. The Titanic survival prediction is a classic introductory problem: small enough to fit in memory, yet rich with real-world complexity. It forces us to handle missing data, encode categorical variables, engineer features, and choose among classification algorithms.
In this blog, we’ll walk through a complete machine learning workflow to predict whether a passenger survived. By the end, you’ll have a working model and, more importantly, an understanding of how data science turns raw records into actionable insights.
What You’ll Learn
- How to frame a binary classification problem.
- Exploratory Data Analysis (EDA) with Python.
- Handling missing values and outliers.
- Feature engineering from raw data.
- Building and tuning models (Logistic Regression, Random Forest, etc.).
- Evaluating model performance with accuracy, precision, recall, and ROC curves.
- Interpreting what the model learned about survival.
Prerequisites
You’ll need Python with the usual data science libraries. If you’re new to the stack, check out our earlier posts Build Your First Neural Network in Python and Build CNN Model in Python.
pip install pandas numpy matplotlib seaborn scikit-learn
We’ll also use Jupyter Notebook or any Python IDE. No prior machine learning experience is required, but familiarity with Python basics is helpful.
Step 1: Getting the Data
The Titanic dataset is widely available. We’ll use the version from Kaggle (or you can download it from various sources). For this tutorial, we’ll assume you have two CSV files: train.csv and test.csv. The training set contains both features and the survival outcome; the test set lacks survival labels.
Let’s load the data and take a first look.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

print("Training set shape:", train.shape)
print("Test set shape:", test.shape)
Output:
Training set shape: (891, 12)
Test set shape: (418, 11)
The training set has 891 passengers with 12 columns. The test set has 418 passengers and one fewer column: Survived is missing.
Column Descriptions
| Column | Description |
|---|---|
| PassengerId | Unique ID for each passenger |
| Survived | 0 = No, 1 = Yes (target variable) |
| Pclass | Ticket class: 1st, 2nd, 3rd |
| Name | Name of passenger |
| Sex | Male or Female |
| Age | Age in years |
| SibSp | Number of siblings/spouses aboard |
| Parch | Number of parents/children aboard |
| Ticket | Ticket number |
| Fare | Passenger fare |
| Cabin | Cabin number |
| Embarked | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |
Step 2: Exploratory Data Analysis (EDA)
EDA helps us understand the data and spot patterns that might influence survival.
2.1 Check for Missing Values
train.isnull().sum()
We see:
- Age has 177 missing values.
- Cabin has 687 missing values (a huge proportion).
- Embarked has 2 missing values.
We’ll need to handle these.
2.2 Survival Rate
Let’s compute the overall survival rate.
survival_rate = train['Survived'].mean()
print(f"Overall survival rate: {survival_rate:.2%}")
Output: Overall survival rate: 38.38%
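That 38.38% also sets the bar for evaluation: any model should beat the majority-class baseline of always predicting "did not survive". A quick sketch using the survivor counts from train.csv (342 survivors out of 891):

```python
import pandas as pd

# Majority-class baseline: with 342 survivors among 891 passengers (38.38%),
# always predicting "did not survive" is already right about 61.6% of the time.
survived = pd.Series([1] * 342 + [0] * 549)
baseline_acc = (survived == 0).mean()
print(f"Majority-class baseline accuracy: {baseline_acc:.2%}")
```

So "beating 50%" is not enough; a useful model has to beat roughly 62%.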
2.3 Categorical Analysis
Sex and Survival
Women had a much higher chance of survival, a well‑known fact from history.
sns.barplot(x='Sex', y='Survived', data=train)
plt.title('Survival Rate by Sex')
plt.show()
Pclass and Survival
Wealthier passengers (1st class) survived at a higher rate.
sns.barplot(x='Pclass', y='Survived', data=train)
plt.title('Survival Rate by Passenger Class')
plt.show()
Embarked and Survival
The port of embarkation also shows variation (though less pronounced).
sns.barplot(x='Embarked', y='Survived', data=train)
plt.title('Survival Rate by Embarkation Port')
plt.show()
2.4 Age Distribution
Let’s plot the age distribution of survivors vs. non‑survivors.
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
train[train['Survived'] == 0]['Age'].hist(bins=30, alpha=0.7, label='Died')
train[train['Survived'] == 1]['Age'].hist(bins=30, alpha=0.7, label='Survived')
plt.legend()
plt.title('Age Distribution by Survival')

plt.subplot(1, 2, 2)
sns.boxplot(x='Survived', y='Age', data=train)
plt.title('Age Boxplot by Survival')

plt.show()
Young children had higher survival rates; older adults had lower rates.
2.5 Correlation Matrix
We can compute correlations among numeric features.
numeric_cols = ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
sns.heatmap(train[numeric_cols].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
Key observations:
- Pclass is negatively correlated with Survived (higher class = better survival).
- Fare is positively correlated with Survived (richer passengers survived more).
- Age has a weak negative correlation.
Step 3: Data Cleaning and Preprocessing
Now we prepare the data for modeling.
3.1 Handle Missing Values
- Age: We’ll fill missing ages with the median age grouped by Pclass and Sex.
- Cabin: Too many missing values; we’ll extract a Has_Cabin feature and drop the original.
- Embarked: Only 2 missing; fill with the most frequent port (S).
# Fill Age using median grouped by Pclass and Sex
train['Age'] = train.groupby(['Pclass', 'Sex'])['Age'].transform(lambda x: x.fillna(x.median()))

# Fill Embarked with the mode
train['Embarked'] = train['Embarked'].fillna('S')

# Create a feature for "has cabin"
train['Has_Cabin'] = train['Cabin'].notna().astype(int)

# Drop Cabin and Ticket; we keep Name for now because we extract a Title feature from it in Step 3.3
train.drop(['Cabin', 'Ticket'], axis=1, inplace=True)
We’ll apply the same transformations to the test set later.
3.2 Outlier Detection
Check for extreme values in Fare and Age using boxplots. For now, we’ll keep them: tree‑based models are robust to outliers, but if we use logistic regression, we might cap them.
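If we did later decide to cap extreme fares for logistic regression, IQR-based clipping is one common approach. A minimal sketch (`cap_outliers_iqr` is an illustrative helper, not part of the tutorial pipeline):

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# A few illustrative fares, including the famous 512.33 outlier
fare = pd.Series([7.25, 8.05, 13.0, 26.0, 512.33])
capped = cap_outliers_iqr(fare)
print(capped.tolist())  # the 512.33 fare is pulled down to the upper fence
```

Clipping keeps every row (unlike dropping outliers) while limiting the leverage any single extreme value has on a linear model.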
3.3 Feature Engineering
We can create new features that might improve predictive power.
- FamilySize: SibSp + Parch + 1
- IsAlone: 1 if FamilySize == 1, else 0
- Title: extracted from Name (e.g., Mr, Mrs, Miss); this often captures social status.
Let’s add these:
# Family size
train['FamilySize'] = train['SibSp'] + train['Parch'] + 1
train['IsAlone'] = (train['FamilySize'] == 1).astype(int)

# Extract Title from Name
train['Title'] = train['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Group rare titles
train['Title'] = train['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                         'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
train['Title'] = train['Title'].replace('Mlle', 'Miss')
train['Title'] = train['Title'].replace('Ms', 'Miss')
train['Title'] = train['Title'].replace('Mme', 'Mrs')

# Name is no longer needed once Title is extracted
train.drop('Name', axis=1, inplace=True)
We’ll also encode categorical variables into numbers.
3.4 Encode Categorical Features
We have Sex, Embarked, and Title as categories. Use one‑hot encoding or label encoding. We’ll use pandas.get_dummies.
train = pd.get_dummies(train, columns=['Sex', 'Embarked', 'Title'], drop_first=True)
Now we have numeric columns ready.
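To see exactly what get_dummies with drop_first=True produces, here is a tiny illustration on made-up rows (the first category of each column becomes the implicit baseline):

```python
import pandas as pd

# Two toy categorical columns, mimicking Sex and Embarked in the Titanic data
df = pd.DataFrame({'Sex': ['male', 'female', 'male'],
                   'Embarked': ['S', 'C', 'Q']})

# drop_first=True drops the alphabetically first category of each column
# ('female' and 'C'), which becomes the implicit baseline
encoded = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)
print(list(encoded.columns))  # ['Sex_male', 'Embarked_Q', 'Embarked_S']
```

Dropping one level per column avoids redundant (perfectly collinear) dummies, which matters most for linear models like logistic regression.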
Step 4: Build the Model
We split the training data into training and validation sets (e.g., 80/20) to evaluate performance.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

# Define features (X) and target (y)
X = train.drop(['Survived', 'PassengerId'], axis=1)
y = train['Survived']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
We’ll try two models: Logistic Regression (interpretable) and Random Forest (often higher accuracy).
4.1 Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_val)

print("Logistic Regression Accuracy:", accuracy_score(y_val, y_pred_lr))
print(classification_report(y_val, y_pred_lr))
4.2 Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_val)

print("Random Forest Accuracy:", accuracy_score(y_val, y_pred_rf))
print(classification_report(y_val, y_pred_rf))
Typically, Random Forest performs better, often achieving 80–83% accuracy on the validation set.
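A single 80/20 split can be noisy, so a fairer comparison uses cross-validation. A sketch on synthetic stand-in data (on the real data you would pass X and y from Step 4 instead):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Titanic features (illustrative only)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)

for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42))]:
    # Average accuracy over 5 different train/validation splits
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the mean and spread across folds makes it clearer whether one model is genuinely better or just lucky on one split.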
4.3 Feature Importance (Random Forest)
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.sort_values(ascending=False).plot(kind='bar', figsize=(10, 6))
plt.title('Feature Importance')
plt.show()
You’ll likely see Sex_male, Title_Mr, Fare, and Age among the top features.
Step 5: Tune the Model
We can improve Random Forest by tuning hyperparameters like n_estimators, max_depth, and min_samples_split. Use GridSearchCV.
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10]
}

grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validation score:", grid.best_score_)

# Evaluate on the validation set
best_rf = grid.best_estimator_
y_pred_best = best_rf.predict(X_val)
print("Validation Accuracy:", accuracy_score(y_val, y_pred_best))
Step 6: Prepare Test Set and Make Predictions
We must apply the same preprocessing to the test set.
# Load test data
test = pd.read_csv('test.csv')
passenger_ids = test['PassengerId']

# Feature engineering (same as on train)
test['Age'] = test.groupby(['Pclass', 'Sex'])['Age'].transform(lambda x: x.fillna(x.median()))
test['Embarked'] = test['Embarked'].fillna('S')
test['Fare'] = test['Fare'].fillna(test['Fare'].median())  # test.csv has one missing Fare
test['Has_Cabin'] = test['Cabin'].notna().astype(int)
test['FamilySize'] = test['SibSp'] + test['Parch'] + 1
test['IsAlone'] = (test['FamilySize'] == 1).astype(int)

test['Title'] = test['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
test['Title'] = test['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                       'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
test['Title'] = test['Title'].replace('Mlle', 'Miss')
test['Title'] = test['Title'].replace('Ms', 'Miss')
test['Title'] = test['Title'].replace('Mme', 'Mrs')

# Drop columns not used as features
test.drop(['Cabin', 'Ticket', 'Name'], axis=1, inplace=True)

# Encode categorical variables (must match the training columns)
test = pd.get_dummies(test, columns=['Sex', 'Embarked', 'Title'], drop_first=True)

X_test = test.drop(['PassengerId'], axis=1)

# Ensure all training columns are present; add missing ones with zeros
for col in X.columns:
    if col not in X_test.columns:
        X_test[col] = 0
X_test = X_test[X.columns]  # reorder to match training

# Predict
test_predictions = best_rf.predict(X_test)

# Create submission file
submission = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': test_predictions})
submission.to_csv('titanic_submission.csv', index=False)
print("Submission saved.")
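The column-alignment step can also be written with pandas' DataFrame.reindex, which adds missing columns and fixes the order in one call. A small sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical mismatch: the test frame is missing a dummy column
# ('Title_Rare') that exists in the training features
train_cols = ['Age', 'Fare', 'Sex_male', 'Title_Rare']
X_test_demo = pd.DataFrame({'Age': [22.0], 'Fare': [7.25], 'Sex_male': [1]})

# reindex adds the missing columns (filled with 0) and enforces the training order
X_test_demo = X_test_demo.reindex(columns=train_cols, fill_value=0)
print(list(X_test_demo.columns))  # ['Age', 'Fare', 'Sex_male', 'Title_Rare']
```

This is equivalent in effect to the loop-plus-reorder approach, just more compact.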
Step 7: Model Evaluation and Interpretation
Beyond accuracy, we should examine:
- Confusion Matrix: shows true/false positives and negatives.
- Precision & Recall: important when the costs of false positives and false negatives differ.
- ROC AUC: measures the model’s ability to separate the classes.
from sklearn.metrics import roc_curve, auc

y_proba = best_rf.predict_proba(X_val)[:, 1]
fpr, tpr, _ = roc_curve(y_val, y_proba)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f'Random Forest (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
An AUC close to 0.9 indicates excellent discriminative power.
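The confusion matrix and precision/recall mentioned above can be computed the same way. A sketch on synthetic stand-in data (on the real data, pass y_val and y_pred_best instead):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, random_state=0)
Xtr, Xva, ytr, yva = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(Xtr, ytr)
pred = clf.predict(Xva)

cm = confusion_matrix(yva, pred)  # rows: actual 0/1, columns: predicted 0/1
print(cm)
print("Precision:", precision_score(yva, pred))
print("Recall:", recall_score(yva, pred))
```

On the Titanic problem, recall on the "survived" class tells you what fraction of actual survivors the model finds, while precision tells you how many predicted survivors really survived.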
What Did the Model Learn?
From the feature importance plot, we see that:
- Gender is the strongest predictor, consistent with "women and children first".
- Passenger class and fare reflect socio-economic advantage.
- Age (children) and family size also matter.
- Title (Mr, Mrs, Miss) captures social roles.
Conclusion: Can We Predict Titanic Survivors?
Yes, we can build a model that predicts survival with about 80–85% accuracy, well above both a coin flip and the roughly 62% majority-class baseline of always predicting "did not survive". While no model can be perfect (tragedy always has an element of randomness), machine learning reveals the systematic factors that influenced survival.
But more importantly, this project teaches the end‑to‑end process of data science:
- Question: What factors affect survival?
- Data: Acquire, explore, and clean.
- Model: Choose algorithms and tune.
- Interpret: Communicate findings.
At TuxAcademy, we believe every data scientist should start with Titanic. It’s a small dataset with a big story. And it’s just the beginning-the same skills apply to predicting customer churn, loan defaults, or medical outcomes.
What’s Next?
- Experiment with other algorithms like XGBoost or neural networks.
- Perform more advanced feature engineering (e.g., group families, extract the cabin deck).
- Try ensemble methods like stacking to boost accuracy.
- Deploy your model using Flask or Streamlit to make a simple web app.
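A first step toward deployment is persisting the trained model so a web app can load it without retraining. A minimal sketch using the standard library's pickle (with a stand-in model; in practice you would pickle your tuned best_rf):

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on toy data (illustrative only)
model = LogisticRegression().fit(np.array([[0.0], [1.0], [2.0], [3.0]]), [0, 0, 1, 1])

# Save the fitted model to disk
with open('titanic_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Later (e.g., inside a Flask or Streamlit app), load it back and predict
with open('titanic_model.pkl', 'rb') as f:
    loaded = pickle.load(f)

print(loaded.predict(np.array([[2.5]])))
```

The loaded model predicts identically to the original, so the app never needs the training data.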
Frequently Asked Questions
Q: Why is the survival rate so low?
A: Only about 38% survived. The Titanic did not have enough lifeboats, and many in third class had limited access to the deck.
Q: Why do we drop the Cabin column?
A: With 77% missing, it’s unreliable. However, we created a Has_Cabin feature, which does add value.
Q: Could we achieve higher accuracy?
A: Yes, with careful feature engineering and tuning, some Kaggle models reach 82–85% accuracy on the test set. But beyond that, the data may contain irreducible noise.
Q: How do I submit to Kaggle?
A: After creating titanic_submission.csv, you can upload it to the Titanic competition page to see your score.

