Lecture 4 - Deeplearning Tutorial Notebook: Convolutional Neural Networks with PyTorch using EuroSAT#

Attention

Students are encouraged to use the CSC Mahti platform.

In this notebook, we’ll classify EuroSAT, a dataset of 27,000 Sentinel-2 satellite tiles at 64×64 RGB across 10 classes (forest, river, highway, residential, …).

We’ll implement two models:

Model 0 is a flat linear baseline. No convolutions, no spatial reasoning. There to set a floor.
Model 1 is a small TinyVGG-style CNN with two conv blocks, channels going 16 → 32.

We’ll compare the two, look at a confusion matrix, and eyeball some random predictions.

What we’re going to cover#

In this tutorial, we will classify satellite imagery using PyTorch.

Topic	Contents
0. Setup and imports	Getting PyTorch and our plotting libraries ready.
1. Getting the data	Downloading the EuroSAT dataset and creating a reproducible 80/20 train/test split.
2. Prepare DataLoader	Wrapping our data in iterables for batching.
3. Model 0: Linear baseline	Building a simple flat linear model to set a performance floor, along with our training loops.
4. Model 1: TinyVGG-style CNN	Introducing a convolutional neural network with spatial reasoning.
5. Comparing our models	Evaluating the test loss and accuracy of both models side by side.
6. Making a confusion matrix	Building a confusion matrix by hand with PyTorch and matplotlib to see where our model gets confused.
7. Visualising predictions	Eyeballing random predictions to understand the visual ambiguity of satellite data.

0. Imports and setup#

First, let’s import the libraries we’ll need and set up some random seeds for reproducibility.

import random
import matplotlib.pyplot as plt
import torch
from torch import nn
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms
from tqdm.auto import tqdm

# Set random seeds for reproducibility
torch.manual_seed(42)
random.seed(42)

# Check for GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

A few hyperparameters up front so they’re easy to find and tweak later.

BATCH_SIZE: how many samples per batch of data?
EPOCHS: how many epochs to train for?
LR (learning rate): how quickly should our model update its parameters to find optimal values?

BATCH_SIZE = 64
EPOCHS = 5
LR = 1e-3

1. Getting the data#

EuroSAT ships with torchvision.datasets.EuroSAT.

The first time we run it, this triggers a download of ~90 MB, which is extracted automatically.

EuroSAT is a single folder of images with no train/test split provided, so we’ll do an 80/20 split ourselves. We will use a fixed seed to keep it reproducible across runs.

transform = transforms.Compose([
    transforms.ToTensor(),
    # Normalize with the ImageNet channel means and stds, a common default for RGB images.
    # These are not EuroSAT's own statistics; the same values are reused later to unnormalize for plotting.
    # https://docs.pytorch.org/vision/stable/models.html
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


# Load the EuroSAT dataset
full_data = datasets.EuroSAT(
    root="data",
    download=True,
    transform=transform,
)

# 80/20 split, reproducible
train_size = int(0.8 * len(full_data))
test_size = len(full_data) - train_size
train_data, test_data = random_split(
    full_data,
    [train_size, test_size],
    generator=torch.Generator().manual_seed(42),
)

# Get class names and number of classes
class_names = full_data.classes
num_classes = len(class_names)

print(f"Total samples: {len(full_data)}")
print(f"Train samples: {len(train_data)} | Test samples: {len(test_data)}")
print(f"Classes ({num_classes}): {class_names}")
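To make the idea of a reproducible split concrete, here is a minimal pure-Python sketch. It is an analogy for what a seeded random_split gives us, not how PyTorch implements it: the helper name seeded_split is hypothetical, and the shuffle uses Python's random module rather than a torch.Generator.

```python
import random


def seeded_split(n_items: int, train_fraction: float, seed: int):
    """Shuffle range(n_items) with a private seeded RNG and cut it in two."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # private RNG, global random state untouched
    n_train = int(train_fraction * n_items)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = seeded_split(27_000, 0.8, seed=42)
again_train, again_test = seeded_split(27_000, 0.8, seed=42)

print(len(train_idx), len(test_idx))        # 21600 5400
print(train_idx == again_train)             # True: same seed, same split
print(set(train_idx).isdisjoint(test_idx))  # True: no sample lands in both sets
```

The property that matters is the last line: a test sample never appears in the training set, and rerunning with the same seed reproduces the same partition.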

1.1 Input and output shapes#

Let’s look at one sample to see what we’re working with. The input shape is (3, 64, 64) (channels first) and the output is a single integer label between 0 and 9 corresponding to the land cover class.

image, label = full_data[0]
print(f"Image shape: {image.shape} -> [color_channels, height, width]")
print(f"Label: {label} ({class_names[label]})")

So each image is [3, 64, 64] — 3 colour channels (RGB), 64 pixels high, 64 pixels wide. And label is a single integer from 0–9 telling us which of the 10 land-cover classes it belongs to.

1.2 Visualising the data#

Before modelling anything, look at some samples. Satellite imagery is not always visually obvious: a “Pasture” tile and a “Permanent Crop” tile can look almost identical to the human eye.

Before we can visualize the data, we need to unnormalize it back to the original pixel value range.
We normalized with the ImageNet mean and std, so we undo the normalization with those same values.

# Undo transforms.Normalize: the inverse of (x - mean) / std is x * std + mean.
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)


def unnormalize(img):
    return (img * std + mean).clamp(0, 1)

torch.manual_seed(42)
fig = plt.figure(figsize=(9, 9))
rows, cols = 4, 4
for i in range(1, rows * cols + 1):
    random_idx = torch.randint(0, len(full_data), size=[1]).item()
    img, label = full_data[random_idx]
    fig.add_subplot(rows, cols, i)
    plt.imshow(unnormalize(img).permute(1, 2, 0))  # [C,H,W] -> [H,W,C] for matplotlib
    plt.title(class_names[label], fontsize=9)
    plt.axis(False)
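As a quick sanity check on the unnormalize step, the round trip can be traced for a single pixel in plain Python. This is a sketch only: the pixel values are made up, and the helper names normalize_pixel and unnormalize_pixel are hypothetical stand-ins for what transforms.Normalize and unnormalize do per channel.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Per-channel (x - mean) / std, as applied by transforms.Normalize."""
    return [(v - m) / s for v, m, s in zip(rgb, MEAN, STD)]


def unnormalize_pixel(rgb):
    """Per-channel x * std + mean, clamped to [0, 1] for plotting."""
    return [min(max(v * s + m, 0.0), 1.0) for v, m, s in zip(rgb, MEAN, STD)]


pixel = [0.20, 0.35, 0.30]  # hypothetical RGB values in [0, 1]
normed = normalize_pixel(pixel)
restored = unnormalize_pixel(normed)

print(normed)    # all three values are negative: darker than the mean in every channel
print(restored)  # back to the original values, up to floating-point error
```

Normalized values can be negative or larger than 1, which is why matplotlib needs the unnormalized version.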

2. Prepare DataLoader#

Now we wrap the train and test datasets in DataLoaders so we can iterate them in batches.

train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=False)

# Check what comes out of one batch
images, labels = next(iter(train_loader))
print(f"Batch images shape: {images.shape}")  # [64, 3, 64, 64]
print(f"Batch labels shape: {labels.shape}")  # [64]
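It helps to know how many batches to expect per epoch. The arithmetic below is a small sketch that assumes the 21,600 / 5,400 split from above and the DataLoader default of keeping the final partial batch (drop_last=False); the helper num_batches is hypothetical.

```python
import math


def num_batches(n_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a loader yields per epoch."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)


train_batches = num_batches(21_600, 64)
test_batches = num_batches(5_400, 64)
last_train_batch = 21_600 - (train_batches - 1) * 64
last_test_batch = 5_400 - (test_batches - 1) * 64

print(train_batches, last_train_batch)  # 338 32
print(test_batches, last_test_batch)    # 85 24
```

So the [64, 3, 64, 64] shape holds for every batch except the last one of each epoch, which is smaller.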

3. Model 0: a linear baseline#

Time to build the simplest possible model that can produce 10 numbers from a 64×64×3 image: flatten the image into one long vector, run it through one linear layer, done.

For now, no convolutions, no spatial awareness, no non-linearity.

This is the floor we want to beat.

class EuroSATModelV0(nn.Module):
    """Linear classifier. No hidden layer, no nonlinearity, no spatial reasoning."""

    def __init__(self, input_shape: int, output_shape: int):
        super().__init__()
        self.layer_stack = nn.Sequential(
            nn.Flatten(),  # [B, 3, 64, 64] -> [B, 12288]
            nn.Linear(in_features=input_shape, out_features=output_shape),
        )

    def forward(self, x):
        return self.layer_stack(x)


torch.manual_seed(42)
model_0 = EuroSATModelV0(
    input_shape=3 * 64 * 64,  # 12288
    output_shape=num_classes,  # 10
).to(device)
model_0

Let’s count parameters so we can compare fairly with Model 1 later.

def count_parameters(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


print(f"Model 0 trainable parameters: {count_parameters(model_0):,}")
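The printed count can be checked by hand. A minimal sketch, assuming the single nn.Linear layer with a bias term defined above:

```python
in_features = 3 * 64 * 64   # flattened image length
out_features = 10           # one logit per class

weights = in_features * out_features  # one weight per (input, class) pair
biases = out_features                 # one bias per class
total = weights + biases

print(f"{in_features:,} inputs -> {weights:,} weights + {biases} biases = {total:,}")
# 12,288 inputs -> 122,880 weights + 10 biases = 122,890
```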

3.1 Loss, optimizer, and a tiny accuracy helper#

Standard classification setup.

We set up the cross-entropy loss, the Adam optimizer, and a helper function to calculate accuracy.

def accuracy_fn(y_true: torch.Tensor, y_pred: torch.Tensor) -> float:
    return (torch.eq(y_true, y_pred).sum().item() / len(y_true)) * 100


# For multi-class classification, the standard loss function is CrossEntropyLoss
loss_fn = nn.CrossEntropyLoss()

# Adam optimizer, a popular choice for training neural networks
optimizer = torch.optim.Adam(params=model_0.parameters(), lr=LR)
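To see what the cross-entropy loss is measuring, here is a pure-Python sketch for a single sample. It mirrors the softmax followed by negative log-likelihood that nn.CrossEntropyLoss computes on raw logits; the helper names and the example logits are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log of the probability assigned to the true class."""
    return -math.log(softmax(logits)[target])


uniform_loss = cross_entropy([0.0] * 10, target=3)
confident_right = cross_entropy([5.0] + [0.0] * 9, target=0)
confident_wrong = cross_entropy([5.0] + [0.0] * 9, target=1)

print(round(uniform_loss, 4))  # 2.3026, i.e. ln(10)
print(confident_right < uniform_loss < confident_wrong)  # True
```

The ln(10) ≈ 2.30 figure is a handy reference point: a 10-class model that has learned nothing and predicts uniformly sits at that loss, so an epoch loss well below it means the model is learning something.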

3.2 Training and testing functions#

We’ll write these once and reuse them for both models. This is one of the highest-leverage refactors in any deep-learning notebook.

Once clean train_step / test_step functions are in place, swapping models becomes trivial.

def train_step(model, data_loader, loss_fn, optimizer, accuracy_fn, device):
    model.train()
    train_loss, train_acc = 0.0, 0.0  # Initialize accumulators for loss and accuracy
    for X, y in data_loader:
        X, y = X.to(device), y.to(device)
        y_logits = model(X)
        loss = loss_fn(y_logits, y)

        # Check if the loss is finite (not NaN or Inf)
        if not torch.isfinite(loss):
            raise RuntimeError(
                "Non-finite loss detected (nan/inf). Reduce LR or normalize inputs."
            )

        train_loss += loss.item()
        train_acc += accuracy_fn(y, y_logits.argmax(dim=1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return train_loss / len(data_loader), train_acc / len(data_loader)


def test_step(model, data_loader, loss_fn, accuracy_fn, device):
    model.eval()
    test_loss, test_acc = 0.0, 0.0
    with torch.inference_mode():
        for X, y in data_loader:
            X, y = X.to(device), y.to(device)
            y_logits = model(X)
            test_loss += loss_fn(y_logits, y).item()
            test_acc += accuracy_fn(y, y_logits.argmax(dim=1))
    return test_loss / len(data_loader), test_acc / len(data_loader)
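One detail worth knowing: both functions average per-batch metrics over len(data_loader), so a smaller final batch counts as much as a full one. The sketch below uses made-up batch sizes and correct-prediction counts to show how that can differ from the sample-weighted accuracy.

```python
batch_sizes = [64, 64, 32]    # hypothetical: last batch is partial
batch_correct = [48, 48, 32]  # hypothetical correct predictions per batch

per_batch_acc = [100 * c / n for c, n in zip(batch_correct, batch_sizes)]
mean_of_batches = sum(per_batch_acc) / len(per_batch_acc)
sample_weighted = 100 * sum(batch_correct) / sum(batch_sizes)

print(per_batch_acc)              # [75.0, 75.0, 100.0]
print(round(mean_of_batches, 2))  # 83.33
print(sample_weighted)            # 80.0
```

With only three batches the gap is exaggerated. With hundreds of batches and a single partial one, as in this notebook, the difference should be very small, but it is the reason reported accuracies may not match an exact count over the dataset.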

3.3 Training Model 0#

Let’s see how our flat linear model does on satellite imagery.

torch.manual_seed(42)

for epoch in tqdm(range(EPOCHS), desc="Training Model 0 (linear)"):
    train_loss, train_acc = train_step(
        model_0, train_loader, loss_fn, optimizer, accuracy_fn, device
    )
    test_loss, test_acc = test_step(model_0, test_loader, loss_fn, accuracy_fn, device)
    print(
        f"Epoch {epoch + 1}/{EPOCHS} | "
        f"Train loss: {train_loss:.4f}, Train acc: {train_acc:.2f}% | "
        f"Test loss: {test_loss:.4f}, Test acc: {test_acc:.2f}%"
    )

model_0_results = {
    "model_name": "EuroSATModelV0",
    "test_loss": test_loss,
    "test_acc": test_acc,
    "params": count_parameters(model_0),
}

# Get results for model 0 (the linear baseline)
model_0_results

Not too great, right? A flat linear model on RGB satellite imagery is throwing away every bit of spatial structure in the input. It has no idea that pixel (10, 10) and pixel (10, 11) are neighbours. That’s our floor.
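To make the lost-neighbourhood point concrete, here is a sketch of where pixels end up after flattening. It assumes the row-major layout that nn.Flatten produces for a contiguous [C, H, W] tensor; the helper flat_index is hypothetical.

```python
C, H, W = 3, 64, 64


def flat_index(c: int, h: int, w: int) -> int:
    """Position of element [c, h, w] in the flattened 12,288-long vector."""
    return c * H * W + h * W + w


here = flat_index(0, 10, 10)
right = flat_index(0, 10, 11)       # horizontal neighbour
below = flat_index(0, 11, 10)       # vertical neighbour
same_green = flat_index(1, 10, 10)  # same pixel, next colour channel

print(here, right, below, same_green)  # 650 651 714 4746
```

The vertical neighbour lands 64 positions away and the same pixel's green value 4,096 positions away. More importantly, the linear layer gives every position its own independent weight, so none of these distances mean anything to it.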

Now let’s actually use the structure of an image.

4. Model 1: a TinyVGG-style CNN#

Time for a real model.

This is inspired by TinyVGG from CNN Explainer, with two design choices worth calling out:

Two convolutions per block instead of one. Stacking two 3×3 convs gives an effective 5×5 receptive field with fewer parameters than a single 5×5, plus an extra ReLU in the middle. This is the pattern VGG-style networks are built on.
Channels grow from 16 in block 1 to 32 in block 2. Early layers learn low-level features (edges, colour patches); later layers learn higher-level combinations and need more channels to represent them.
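The receptive-field and parameter claim for stacked 3×3 convolutions can be checked with a few lines of arithmetic. This sketch assumes stride 1 and equal input and output channel counts (16, picked to match block 1's second conv); the helper names are hypothetical.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights plus biases of a k x k Conv2d layer."""
    return c_in * c_out * k * k + c_out


def receptive_field(kernel_sizes) -> int:
    """Receptive field of stacked stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf


channels = 16
two_3x3 = 2 * conv_params(channels, channels, 3)
one_5x5 = conv_params(channels, channels, 5)

print(receptive_field([3, 3]), receptive_field([5]))  # 5 5
print(two_3x3, one_5x5)                               # 4640 6416
```

Same 5×5 view of the input, roughly 28% fewer parameters, and one more non-linearity along the way.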

class EuroSATModelV1(nn.Module):
    """TinyVGG-style CNN: 2 convs per block, channels 16 -> 32."""

    def __init__(self, input_channels: int, hidden_units: int, output_shape: int):
        super().__init__()
        self.block_1 = nn.Sequential(
            nn.Conv2d(
                in_channels=input_channels,
                out_channels=hidden_units,
                kernel_size=3,
                stride=1,
                padding=1,
            ),
            nn.ReLU(),
            nn.Conv2d(
                in_channels=hidden_units,
                out_channels=hidden_units,
                kernel_size=3,
                stride=1,
                padding=1,
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),  # 64 -> 32
        )
        self.block_2 = nn.Sequential(
            nn.Conv2d(
                in_channels=hidden_units,
                out_channels=hidden_units * 2,
                kernel_size=3,
                stride=1,
                padding=1,
            ),
            nn.ReLU(),
            nn.Conv2d(
                in_channels=hidden_units * 2,
                out_channels=hidden_units * 2,
                kernel_size=3,
                stride=1,
                padding=1,
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),  # 32 -> 16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            # Where does 16*16 come from? Trace shapes for a 64x64 input:
            #   start         : [B, 3, 64, 64]
            #   after block_1 : [B, hidden,    32, 32]   (two 'same' convs + maxpool/2)
            #   after block_2 : [B, hidden*2,  16, 16]   (two 'same' convs + maxpool/2)
            # so flatten dim = (hidden*2) * 16 * 16
            nn.Linear(
                in_features=(hidden_units * 2) * 16 * 16, out_features=output_shape
            ),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.block_1(x)
        x = self.block_2(x)
        x = self.classifier(x)
        return x


torch.manual_seed(42)
model_1 = EuroSATModelV1(
    input_channels=3,  # RGB
    hidden_units=16,  # block_1 has 16 channels, block_2 has 32
    output_shape=num_classes,  # 10
).to(device)
model_1
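The shape trace in the classifier comment can be reproduced with the standard output-size formula for convolutions and pooling. A minimal sketch, assuming the kernel sizes, strides and padding used in EuroSATModelV1 and the MaxPool2d default of stride equal to kernel size; conv_out and pool_out are hypothetical helpers.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a Conv2d layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, kernel: int) -> int:
    """Spatial output size of MaxPool2d with stride equal to the kernel size."""
    return (size - kernel) // kernel + 1


size = 64
trace = []
for block_name, channels in (("block_1", 16), ("block_2", 32)):
    size = conv_out(size, kernel=3, stride=1, padding=1)  # 'same' conv
    size = conv_out(size, kernel=3, stride=1, padding=1)  # 'same' conv
    size = pool_out(size, kernel=2)                       # halves height and width
    trace.append((block_name, channels, size, size))

flatten_dim = trace[-1][1] * size * size

print(trace)        # [('block_1', 16, 32, 32), ('block_2', 32, 16, 16)]
print(flatten_dim)  # 8192
```

If the input resolution or the number of blocks changes, rerunning this trace gives the new in_features for the final linear layer.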

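How do the parameter counts compare?

print(f"Model 0 trainable parameters: {count_parameters(model_0):,}")
print(f"Model 1 trainable parameters: {count_parameters(model_1):,}")

Perhaps surprisingly, Model 1 is not bigger: it has 98,586 trainable parameters against 122,890 for Model 0.
The two max-pooling layers shrink the feature map to 32 × 16 × 16 = 8,192 values before the final linear layer, fewer than the 12,288 raw inputs Model 0 sees, and the four conv layers add only 16,656 parameters between them because each 3×3 kernel is shared across every position in the image.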
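A layer-by-layer hand count is a useful cross-check on the printed totals. This sketch assumes the architecture exactly as defined in EuroSATModelV1 with hidden_units=16; the helper functions and the layer labels in the dictionary are hypothetical names for illustration.

```python
def conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights plus biases of a k x k Conv2d layer."""
    return c_in * c_out * k * k + c_out


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a Linear layer."""
    return n_in * n_out + n_out


hidden = 16
model_1_layers = {
    "block_1 conv a": conv_params(3, hidden),
    "block_1 conv b": conv_params(hidden, hidden),
    "block_2 conv a": conv_params(hidden, hidden * 2),
    "block_2 conv b": conv_params(hidden * 2, hidden * 2),
    "classifier linear": linear_params(hidden * 2 * 16 * 16, 10),
}
model_1_total = sum(model_1_layers.values())
model_0_total = linear_params(3 * 64 * 64, 10)

for name, count in model_1_layers.items():
    print(f"{name:<18} {count:>7,}")
print(f"Model 1 total      {model_1_total:>7,}")  # 98,586
print(f"Model 0 total      {model_0_total:>7,}")  # 122,890
```

Most of Model 1's parameters still sit in its final linear layer. The convolutions themselves are cheap.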

4.1 Loss, optimizer, and training#

Same loss, same optimizer setup, fresh optimizer instance pointing at the new model’s parameters.

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params=model_1.parameters(), lr=LR)

torch.manual_seed(42)
for epoch in tqdm(range(EPOCHS), desc="Training Model 1 (CNN)"):
    train_loss, train_acc = train_step(
        model_1, train_loader, loss_fn, optimizer, accuracy_fn, device
    )
    test_loss, test_acc = test_step(model_1, test_loader, loss_fn, accuracy_fn, device)
    print(
        f"Epoch {epoch + 1}/{EPOCHS} | "
        f"Train loss: {train_loss:.4f}, Train acc: {train_acc:.2f}% | "
        f"Test loss: {test_loss:.4f}, Test acc: {test_acc:.2f}%"
    )

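model_1_results = {
    "model_name": "EuroSATModelV1",
    "test_loss": test_loss,
    "test_acc": test_acc,
    "params": count_parameters(model_1),  # same keys as model_0_results, so the comparison table has no gaps
}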
model_1_results

5. Compare the two models#

How much did convolutions buy us?

import pandas as pd

compare = pd.DataFrame([model_0_results, model_1_results])
compare

There should be a clear gap.
The CNN beats the linear baseline because it can actually use the 2D structure of the image: neighbouring pixels stay neighbours, edge detectors slide across the whole image, and so on. The linear model never had a chance.

If the gap is small or negative, try training for more epochs. Both models usually improve with longer runs, and the CNN typically benefits more.

6. Confusion matrix#

Test accuracy is one number. A confusion matrix shows which classes the model confuses with each other, which is far more informative, especially on EuroSAT where we already saw that some classes look similar.

def collect_predictions(model, data_loader, device):
    # We want to collect all predictions and targets across the whole dataset, so we can build a confusion matrix later.
    preds, targets = [], []
    model.eval()
    with torch.inference_mode():
        for X, y in data_loader:
            X = X.to(device)
            preds.append(model(X).argmax(dim=1).cpu())
            targets.append(y)
    return torch.cat(preds), torch.cat(targets)


def confusion_matrix_tensor(y_true, y_pred, num_classes):
    cm = torch.zeros((num_classes, num_classes), dtype=torch.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


y_pred, y_true = collect_predictions(model_1, test_loader, device)
cm = confusion_matrix_tensor(y_true=y_true, y_pred=y_pred, num_classes=num_classes)

Now plot the confusion matrix.

plt.figure(figsize=(9, 7))
plt.imshow(cm, cmap="Blues")
plt.title("Confusion Matrix — EuroSATModelV1")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.xticks(range(num_classes), class_names, rotation=45, ha="right")
plt.yticks(range(num_classes), class_names)
for i in range(num_classes):
    for j in range(num_classes):
        v = cm[i, j].item()
        color = "white" if v > cm.max().item() * 0.5 else "black"
        plt.text(j, i, str(v), ha="center", va="center", color=color, fontsize=8)
plt.tight_layout()
plt.show()

Look at the off-diagonal cells.
Where are the confusions clustered?
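One way to answer that without squinting at the plot is to rank the off-diagonal cells and compute per-class recall. The sketch below runs on a small made-up 3×3 matrix with three of the EuroSAT class names; the counts are hypothetical, not results from this notebook. Rows are true classes and columns are predictions, matching confusion_matrix_tensor.

```python
names = ["Pasture", "PermanentCrop", "River"]  # subset, for illustration only
toy_cm = [
    [50, 8, 2],   # true Pasture
    [12, 45, 3],  # true PermanentCrop
    [1, 0, 59],   # true River
]  # hypothetical counts

# Rank (count, true class, predicted class) over all off-diagonal cells.
confusions = sorted(
    (
        (toy_cm[i][j], names[i], names[j])
        for i in range(len(names))
        for j in range(len(names))
        if i != j
    ),
    reverse=True,
)

# Recall per class: correct predictions divided by the row total.
recall = {names[i]: toy_cm[i][i] / sum(toy_cm[i]) for i in range(len(names))}

for count, true_name, pred_name in confusions[:2]:
    print(f"{count:>3} x true {true_name} predicted as {pred_name}")
print({name: round(value, 3) for name, value in recall.items()})
```

On the real matrix, the same ranking can be applied to cm.tolist() with the full class_names list.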

7. Random predictions#

Let’s pull 9 random test images and see what the model thinks of them.

def predict_random_samples(model, dataset, n, device):
    model.eval()
    indices = random.sample(range(len(dataset)), k=n)
    images = torch.stack([dataset[i][0] for i in indices])
    true_labels = [dataset[i][1] for i in indices]
    with torch.inference_mode():
        logits = model(images.to(device))
        probs = torch.softmax(logits, dim=1)
        confs, preds = probs.max(dim=1)
    return images, true_labels, preds.cpu().tolist(), confs.cpu().tolist()


sample_images, sample_true, sample_pred, sample_conf = predict_random_samples(
    model=model_1, dataset=test_data, n=9, device=device
)

plt.figure(figsize=(9, 9))
for i, image in enumerate(sample_images):
    plt.subplot(3, 3, i + 1)
    plt.imshow(unnormalize(image).permute(1, 2, 0))
    pred_text = class_names[sample_pred[i]]
    true_text = class_names[sample_true[i]]
    color = "green" if sample_pred[i] == sample_true[i] else "red"
    plt.title(
        f"Pred: {pred_text} ({sample_conf[i] * 100:.0f}%)\nTrue: {true_text}",
        color=color,
        fontsize=9,
    )
    plt.axis("off")
plt.suptitle("EuroSAT random test predictions", y=1.02)
plt.tight_layout()
plt.show()

8. Discussion#

For the wrong predictions: do you agree they were hard?
On EuroSAT the answer is often …
