Introduction: Why Convolutional Neural Networks?
In our previous tutorial, we built a simple neural network using dense (fully connected) layers to classify fashion items. That model worked decently, but it had a fundamental flaw: it treated every pixel independently, ignoring the spatial structure of the image. A shirt's sleeve is not just a random set of pixels; it is a pattern that appears in a specific region relative to the collar.
Convolutional Neural Networks (CNNs) were designed precisely to capture this spatial hierarchy. They are the backbone of modern computer vision, powering everything from facial recognition to self-driving cars. By using filters (kernels) that slide across the image, CNNs learn features like edges, textures, and shapes, gradually building up to high-level concepts.
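To make the sliding-filter idea concrete, here is a minimal NumPy sketch of the operation a convolutional layer performs (strictly speaking a cross-correlation, which is what deep-learning frameworks compute). The 8×8 image and the vertical-edge kernel are invented for illustration:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding) and collect dot products."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 8x8 image: dark left half, bright right half
image = np.zeros((8, 8))
image[:, 4:] = 1.0

# A vertical-edge detector: responds where brightness changes left-to-right
kernel = np.array([[-1., 0., 1.],
                   [-1., 0., 1.],
                   [-1., 0., 1.]])

response = conv2d_valid(image, kernel)
print(response.shape)  # (6, 6) -- a 3x3 kernel shrinks an 8x8 input to 6x6
print(response[0])     # strongest response at the edge between the halves
```

The response is zero over flat regions and large exactly where the brightness changes, which is why the first layers of a trained CNN tend to act as edge detectors.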
In this step-by-step guide, we will build a CNN using the classic MNIST dataset, a collection of 70,000 handwritten digits (0–9). It's the perfect starting point because it's small, well-understood, and lets you focus on the architecture rather than data cleaning.
By the end, you will have a model that achieves over 99% accuracy on handwritten digit recognition, and you will understand exactly how each component works.
What You’ll Learn
- How CNNs differ from dense networks.
- Preprocessing image data for convolution.
- Building a CNN with `Conv2D`, `MaxPooling2D`, and `Dropout` layers.
- Compiling and training the model.
- Evaluating and visualizing predictions.
- Improving performance with data augmentation.
Prerequisites
You should have Python installed (3.8 or later) and a basic understanding of neural networks. If you need a refresher, check out our earlier post, “Build Your First Neural Network in Python.”
Install the required libraries:
```shell
pip install numpy matplotlib tensorflow
```
We will use TensorFlow’s Keras API for simplicity and power.
Step 1: Load and Explore the MNIST Dataset
MNIST is built into Keras, so loading it is trivial.
```python
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras

# Load MNIST
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

print(f"Training data shape: {x_train.shape}")    # (60000, 28, 28)
print(f"Training labels shape: {y_train.shape}")  # (60000,)
print(f"Test data shape: {x_test.shape}")         # (10000, 28, 28)
```
The dataset contains:
- 60,000 training images and 10,000 test images.
- Each image is 28×28 grayscale pixels (values 0–255).
- Labels are integers 0–9.
Let's visualize a few samples to get a feel for the data:
```python
plt.figure(figsize=(10, 10))
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.imshow(x_train[i], cmap='gray')
    plt.xlabel(y_train[i])
plt.show()
```
Step 2: Preprocess the Data
CNNs expect a specific shape and scale of inputs.
2.1 Reshape for Convolution
Conv2D layers require input in the form (height, width, channels). Our images are 28×28 with a single channel (grayscale), so we need to add a dimension:
```python
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
print(x_train.shape)  # (60000, 28, 28, 1)
```
2.2 Normalize Pixel Values
Neural networks train faster and more stably when inputs are in a small range. Divide by 255.0 to scale to [0,1]:
```python
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
```
2.3 One-Hot Encode Labels
Our labels are integers (0–9). For classification, we want a probability distribution over the 10 classes. One-hot encoding converts each label into a vector of 10 elements, with a 1 in the position of the class.
```python
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)
print(y_train[0])  # e.g., [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.] if the digit is 5
```
Step 3: Build the CNN Architecture
Now we design the model. A typical CNN for MNIST consists of:
- Convolutional layers with ReLU activation to extract features.
- MaxPooling layers to reduce spatial dimensions and add translation invariance.
- Dropout to prevent overfitting.
- Flatten to convert 2D feature maps to a 1D vector.
- Dense (fully connected) layers to perform classification.
We’ll use the Sequential API.
```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # First convolutional block
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),

    # Second convolutional block
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),

    # Third convolutional block (optional, adds depth)
    layers.Conv2D(64, (3, 3), activation='relu'),

    # Flatten and dense layers
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),  # Regularization
    layers.Dense(10, activation='softmax')
])

model.summary()
```
Let’s break down the layers:
- `Conv2D(32, (3, 3), activation='relu')`: 32 filters of size 3×3. Each filter slides over the input, computing dot products. ReLU introduces non-linearity.
- `MaxPooling2D((2, 2))`: Downsamples by taking the maximum value in each 2×2 window. This halves the spatial size (here from 26×26 to 13×13, since the 3×3 convolution already trims the 28×28 input to 26×26) and makes the model robust to small shifts.
- Second `Conv2D(64, (3, 3))`: Now 64 filters, learning more complex features.
- Third `Conv2D(64, (3, 3))`: Further abstraction.
- `Flatten`: Converts the 3D output of the last conv layer into a 1D vector.
- `Dense(64)`: A fully connected layer with 64 neurons.
- `Dropout(0.5)`: Randomly turns off 50% of the neurons during training, reducing overfitting.
- `Dense(10, activation='softmax')`: Output layer with 10 neurons, each giving a probability for the corresponding digit class.
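To see exactly what the pooling layer computes, here is a hand-rolled NumPy sketch of 2×2 max pooling on a made-up 4×4 feature map (Keras's `MaxPooling2D` does the same thing independently per channel):

```python
import numpy as np

def max_pool_2x2(fmap):
    """Take the maximum of each non-overlapping 2x2 window (stride 2)."""
    h, w = fmap.shape
    # Reshape into 2x2 blocks, then reduce each block to its maximum
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 5.],
                 [0., 1., 3., 2.],
                 [2., 6., 1., 1.]])

pooled = max_pool_2x2(fmap)
print(pooled)
# [[4. 5.]
#  [6. 3.]]
```

Shifting the input by a single pixel often leaves the pooled output unchanged, which is the translation robustness mentioned above.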
The summary shows the number of trainable parameters: roughly 93,000, which is manageable even on a CPU.
Step 4: Compile the Model
We need to specify the optimizer, loss function, and metrics.
```python
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```
- `adam`: Adaptive moment estimation, a popular optimizer.
- `categorical_crossentropy`: Suitable for multi-class classification with one-hot labels.
- `accuracy`: We track accuracy during training.
Step 5: Train the Model
Now we feed the data. We’ll use 20% of the training set as validation to monitor overfitting.
```python
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2,
                    verbose=1)
```
Parameters explained:
- `epochs`: Number of passes over the entire dataset. 10 is usually enough for MNIST.
- `batch_size`: Number of samples per gradient update. Larger batches mean faster epochs but more memory.
- `validation_split`: Reserve 20% of the training data to evaluate after each epoch.
- `verbose`: Shows progress bars.
During training, you’ll see something like:
```
Epoch 1/10
375/375 [==============================] - 10s 26ms/step - loss: 0.2260 - accuracy: 0.9323 - val_loss: 0.0563 - val_accuracy: 0.9835
...
Epoch 10/10
375/375 [==============================] - 10s 27ms/step - loss: 0.0349 - accuracy: 0.9895 - val_loss: 0.0271 - val_accuracy: 0.9913
```
Notice that the validation accuracy quickly climbs past 99%, much better than our previous dense network!
Step 6: Evaluate on Test Data
After training, evaluate the model on the untouched test set.
```python
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
print(f"Test accuracy: {test_acc:.4f}")
```
You should see a test accuracy around 0.9900–0.9920 (99%+).
Step 7: Make Predictions and Visualize Results
Let’s pick some test images and see what the model predicts.
```python
# Get predictions for the entire test set
predictions = model.predict(x_test)

# Convert one-hot predictions back to digits
predicted_classes = np.argmax(predictions, axis=1)
true_classes = np.argmax(y_test, axis=1)

# Plot the first 25 test images with predictions
plt.figure(figsize=(10, 10))
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.imshow(x_test[i].reshape(28, 28), cmap='gray')
    color = 'green' if predicted_classes[i] == true_classes[i] else 'red'
    plt.xlabel(f"{predicted_classes[i]}", color=color)
plt.show()
```
This will show you where the model made mistakes (red labels). Common errors often occur with digits that look similar, like 4 and 9, or 7 and 2.
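A compact way to quantify those mistakes is a confusion matrix, where entry (i, j) counts how often true digit i was predicted as j. Here is a minimal NumPy sketch; the tiny `true_classes` and `predicted_classes` arrays below are stand-ins for the full ones computed above:

```python
import numpy as np

def confusion_matrix(true_classes, predicted_classes, num_classes=10):
    """cm[i, j] counts samples with true class i and predicted class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_classes, predicted_classes):
        cm[t, p] += 1
    return cm

# Stand-in data: one 4 misread as a 9, one 7 misread as a 2
true_classes      = np.array([4, 4, 9, 7, 2])
predicted_classes = np.array([4, 9, 9, 2, 2])

cm = confusion_matrix(true_classes, predicted_classes)
print(cm[4, 9])                               # 1 -- a 4 mistaken for a 9
print(cm.trace(), "of", cm.sum(), "correct")  # diagonal entries are hits
```

Run on the real arrays from this step, large off-diagonal entries pinpoint exactly which digit pairs the model confuses.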
Step 8: Plot Training History
Visualizing the loss and accuracy over epochs helps diagnose underfitting or overfitting.
```python
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Accuracy over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.show()
```
If the validation loss starts increasing while training loss continues to decrease, that’s a sign of overfitting. Our model shows stable validation performance, which is good.
Step 9: Save and Load the Model (Optional)
In real projects, you’ll want to save your trained model for later use.
```python
model.save('mnist_cnn.h5')
```
To load it again:
```python
loaded_model = keras.models.load_model('mnist_cnn.h5')
```
Step 10: Improving Further (Data Augmentation)
Even though our model already achieves 99% accuracy, we can push it even higher and make it more robust. Data augmentation artificially increases the diversity of the training set by applying random transformations like rotations, shifts, and zooms.
Here’s how to integrate data augmentation using ImageDataGenerator:
```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Define augmentation pipeline
datagen = ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    validation_split=0.2  # Keep validation separate
)

# Prepare generators
train_generator = datagen.flow(x_train, y_train, batch_size=128, subset='training')
val_generator = datagen.flow(x_train, y_train, batch_size=128, subset='validation')

# Train with augmentation
history_aug = model.fit(train_generator,
                        epochs=15,
                        validation_data=val_generator,
                        verbose=1)
```
Augmentation often improves generalization, especially when you have limited data (though MNIST is large enough). You might see validation accuracy rise to ~99.3%.
Understanding What the CNN Learns (Optional Deep Dive)
One of the beautiful things about CNNs is that we can visualize the filters and feature maps. For instance, the first convolutional layer learns simple edge detectors (vertical, horizontal, diagonal lines). Later layers learn more complex patterns like loops and curves. While we won’t implement visualization here, it’s a fascinating exploration for curious minds.
Common Pitfalls and Solutions
- Low accuracy (underfitting): Increase model complexity (more layers/filters) or train longer.
- Overfitting (training accuracy high, validation low): Add dropout, reduce model size, or use data augmentation.
- Slow training: Use a smaller batch size or reduce image dimensions (not needed for MNIST).
- Memory errors: Reduce batch size or use `tf.data` pipelines.
Conclusion: You’ve Built a State-of-the-Art Digit Recognizer
Congratulations! You’ve just built a convolutional neural network that rivals human performance on handwritten digit recognition. More importantly, you’ve learned the foundational concepts that power modern computer vision:
- Convolution to capture spatial hierarchies.
- Pooling to downsample and add invariance.
- Dropout to regularize and prevent overfitting.
- Data augmentation to improve generalization.
These techniques scale to far more complex problems, from medical image analysis to autonomous driving. With this solid foundation, you're ready to explore more advanced architectures like ResNet, Inception, and even transformers for vision.
What’s Next?
- Try CIFAR-10: A dataset of 60,000 color images across 10 classes. You'll need to handle RGB channels and more complex patterns.
- Implement a CNN from scratch using NumPy: For deep understanding, try implementing convolution and backpropagation manually.
- Explore Transfer Learning: Use pre-trained models like VGG16 or MobileNet on your own custom datasets.
- Deploy Your Model: Learn to serve your model via Flask, TensorFlow Serving, or convert it to TensorFlow Lite for mobile apps.
At TuxAcademy, we believe in learning by doing. This tutorial is part of our AI/ML track, designed to take you from novice to practitioner.
Frequently Asked Questions
Q: Why use categorical_crossentropy instead of sparse_categorical_crossentropy?
A: Because we one-hot encoded the labels. If you keep labels as integers, use sparse_categorical_crossentropy instead. Both compute the same loss; the sparse variant simply skips the one-hot step and saves a little memory.
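You can check the equivalence with plain NumPy; the `probs` vector below is a made-up softmax output over five classes:

```python
import numpy as np

probs = np.array([0.05, 0.05, 0.1, 0.6, 0.2])  # made-up softmax output
label = 3                                      # integer class label

# categorical_crossentropy: dot the one-hot vector against the log-probs
one_hot = np.eye(5)[label]
cat_loss = -np.sum(one_hot * np.log(probs))

# sparse_categorical_crossentropy: just index the true class directly
sparse_loss = -np.log(probs[label])

print(np.isclose(cat_loss, sparse_loss))  # True -- same number either way
```

Since every element of the one-hot vector except the true class is zero, the dot product reduces to indexing, which is all the sparse variant does.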
Q: How do I choose the number of filters and kernel size?
A: It’s empirical. Start with common values (32, 64, 128) and kernel size 3×3. Deeper networks often use 3×3 filters stacked.
Q: Can I run this on a CPU?
A: Yes! MNIST is small enough that training on CPU takes only a few minutes. For larger datasets, a GPU is recommended.
Q: Why is validation accuracy sometimes higher than test accuracy?
A: Validation data comes from the same distribution as training, while test data might have slight differences. Also, randomness in splitting can cause variance.
Q: How do I stop training early if the model stops improving?
A: Use EarlyStopping callback:
```python
callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)]
```

Then pass `callbacks=callbacks` to `model.fit()`.

