The “black box” reputation of neural networks comes from complexity, not mystery. Behind every prediction lies pure mathematics: matrix multiplications, derivatives and optimization algorithms working together to minimize error. Neural networks aren’t magical thinking machines but mathematical functions that learn patterns through systematic parameter adjustment. Understanding the underlying mechanics transforms these seemingly opaque systems into much more predictable tools. In this chapter, we will demystify the process by implementing each component from scratch, revealing how gradient descent guides networks to optimal solutions for our classification task.

⚠️ Disclaimer
This series is not a substitute for a full and rigorous deep learning course. Its goal is to introduce key concepts in an accessible and practical way, particularly for readers with a security background, and to provide a solid foundation for the more advanced blog articles that will follow.

If you’re looking for a deeper, textbook-level treatment of the subject, I highly recommend Dive into Deep Learning: a free, open-source book with hands-on examples and theoretical depth.


How Machines Find the Best Solution

At its core, machine learning is about finding optimal parameters for mathematical functions. Whether you’re building a simple linear model or a complex neural network, the challenge remains the same: how do you systematically adjust thousands or millions of parameters to minimize prediction errors? This is where optimization algorithms come into play, with gradient descent being the most fundamental approach.

Gradient Descent

Imagine you have a function f(x) = ax + b and you want to find the values of a and b that minimize (or maximize) some quantity, such as the error between the function’s outputs and a set of known values: this is an optimization problem. Gradient descent solves this by following a simple principle: if you’re standing on a hill and want to reach the bottom, walk in the direction of the steepest slope downward. Mathematically, the gradient tells us the direction of steepest increase, so we move in the opposite direction.

The algorithm works in iterations:

  1. Calculate the gradient (slope) at current position.
  2. Take a step in the opposite direction.
  3. Repeat until you reach the minimum.

The size of each step is controlled by the learning rate, a crucial parameter that determines how fast we move toward the minimum. Steps that are too large might overshoot the target, while steps that are too small make the process painfully slow.
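As a minimal sketch of the three steps above, the snippet below minimizes the toy function f(x) = (x - 3)^2, whose minimum is known to be at x = 3. The function, the starting point and the learning rate are illustrative assumptions, not values used elsewhere in this series.

```python
def f(x):
    # Toy function (assumption) with its minimum at x = 3
    return (x - 3) ** 2

def grad_f(x):
    # Analytical derivative of f
    return 2 * (x - 3)

x = 10.0             # arbitrary starting position (assumption)
learning_rate = 0.1  # step size (assumption)

for step in range(100):
    gradient = grad_f(x)           # 1. slope at the current position
    x -= learning_rate * gradient  # 2. step in the opposite direction
                                   # 3. repeat

print(f"x after 100 steps: {x:.4f}, f(x) = {f(x):.6f}")
```

Each iteration shrinks the distance to the minimum by a constant factor here, which is why the loop settles very close to x = 3.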

Gradient Descent

The image shows this process perfectly: starting from an initial random position, we calculate the gradient and take steps proportional to the learning rate until we reach the minimum. Each step moves us closer to the optimal solution. This visualization uses a simple 2D curve, but the same principle applies to 3D surfaces and even higher-dimensional spaces with thousands of parameters.

But what exactly is this gradient we keep mentioning? The gradient is a vector of all partial derivatives of our function. Remember derivatives from calculus? The ones you thought you’d never use in real life? Well, surprise! They’re the backbone of modern AI. The derivative tells us the slope (rate of change) at any point on our function. For multivariable functions, the gradient is this vector where each component tells us how much a specific variable impacts the function’s output. Together, these components point in the direction where the function increases most rapidly.
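To make the idea of a vector of partial derivatives concrete, here is a small sketch that compares the hand-derived gradient of a two-variable function with a numerical estimate. The function f(x, y) = x^2 + 3y^2 and the helper name numerical_gradient are assumptions chosen for illustration.

```python
import numpy as np

def f(p):
    # Toy function of two variables (assumption): f(x, y) = x^2 + 3*y^2
    x, y = p
    return x**2 + 3 * y**2

def numerical_gradient(func, p, eps=1e-6):
    # Approximate each partial derivative with a central difference
    grad = np.zeros_like(p, dtype=float)
    for i in range(len(p)):
        step = np.zeros_like(p, dtype=float)
        step[i] = eps
        grad[i] = (func(p + step) - func(p - step)) / (2 * eps)
    return grad

point = np.array([1.0, 2.0])
analytic = np.array([2 * point[0], 6 * point[1]])  # (df/dx, df/dy) by hand
numeric = numerical_gradient(f, point)

print("analytic gradient:", analytic)
print("numeric gradient: ", numeric)
```

At the point (1, 2) the second component is much larger than the first, which says that changing y affects the output far more than changing x does.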

Measuring Machine Mistakes

Great, but how does this apply to our classification problem? We need a way to measure how far we are from the optimal solution.

This is where the loss function comes in. It is the mathematical representation of how wrong our predictions are. It transforms the abstract concept of error into a concrete number we can minimize. For every prediction our model makes, the loss function calculates the difference between what we predicted and what actually happened. This reminds me of how humans learn: we often need to make mistakes to improve. Without error, there’s no signal for adjustment. The loss function gives our algorithm the feedback it needs to correct course.

Different problems require different loss functions. Mean Squared Error (MSE) works well for regression problems. It calculates the average of all squared differences between predicted and actual values. Why squared? Because it penalizes large errors more heavily than small ones. If you predict 10 when the actual value is 5, that’s much worse than predicting 5.1 when the actual is 5. The squaring operation makes sure big mistakes hurt more than small ones, forcing the model to focus on reducing major errors first.
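The two predictions mentioned above can be plugged into a few lines of NumPy. This is only a sketch; the helper name mse is an assumption.

```python
import numpy as np

def mse(y_true, y_pred):
    # Average of the squared differences between targets and predictions
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([5.0])
small_miss = mse(y_true, np.array([5.1]))   # off by 0.1 -> about 0.01
large_miss = mse(y_true, np.array([10.0]))  # off by 5.0 -> 25.0

print(f"small miss: {small_miss:.4f}, large miss: {large_miss:.4f}")
```

The second error is 50 times larger than the first, yet its squared loss is 2500 times larger.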

For classification problems, Cross-Entropy Loss is preferred because it handles probabilities better and provides stronger gradients when predictions are very wrong. Cross-entropy measures the “distance” between two probability distributions: your model’s predicted probabilities and the true distribution. The mathematical foundation relies on the logarithm of predicted probabilities. Why logarithm? Because it creates that crucial asymmetric penalty we need: when your model predicts a very small probability (close to 0) for the correct class, the logarithm becomes a large negative number and, since the loss is the negative of that logarithm, the loss is high. When the model predicts high probability (close to 1) for the correct class, the logarithm approaches zero, minimizing the loss. There are two main types of cross-entropy loss:

  • Binary Cross-Entropy (BCE) is used for binary classification problems. When your model predicts a URL is malicious with 90% confidence but it’s actually benign, BCE heavily penalizes this overconfident wrong prediction. Conversely, when the model correctly predicts a malicious URL with high confidence, the loss approaches zero. This asymmetric behavior pushes the model to be both accurate and confident in its predictions.
  • Categorical Cross-Entropy loss extends this concept to multi-class problems. Instead of just malicious or benign, imagine classifying URLs into categories like phishing, malware, spam or legitimate. The model outputs a probability distribution across all categories and categorical cross-entropy measures how far this distribution is from the true one-hot encoded label.

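The key insight is that cross-entropy loss grows without bound when the model is confident and wrong: the loss is the negative logarithm of the probability assigned to the correct class, so it climbs steeply as that probability approaches zero. This creates strong learning signals that force rapid corrections. When your model predicts 99% confidence for the wrong class, only about 1% is left for the correct one and the loss becomes large, generating big gradients that quickly adjust the parameters.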
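The following sketch computes both variants by hand with NumPy so the asymmetric penalty can be seen in numbers. The function names and the example probabilities are assumptions for illustration; the clipping constant eps only guards against log(0).

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # y_true is 0 or 1, p_pred is the predicted probability of class 1
    p = np.clip(p_pred, eps, 1 - eps)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    # Only the probability assigned to the true class contributes
    return -np.sum(one_hot * np.log(np.clip(probs, eps, 1.0)))

# The URL is truly malicious (label 1); vary the predicted probability
for p in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"P(malicious)={p:<4} -> BCE={binary_cross_entropy(1, p):.3f}")

# Four classes, true class is the last one (one-hot encoded)
one_hot = np.array([0, 0, 0, 1])
probs = np.array([0.1, 0.1, 0.1, 0.7])
cce = categorical_cross_entropy(one_hot, probs)
print(f"categorical cross-entropy: {cce:.3f}")
```

The binary loss goes from roughly 0.01 when the correct class gets 0.99 to roughly 4.6 when it gets only 0.01.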

The critical requirement for any loss function is differentiability. Without smooth gradients, we can’t determine which direction to move our parameters. This is why we can’t simply count wrong predictions (a step function) but need continuous, smooth functions that provide meaningful gradients at every point.

Balancing Speed and Stability

Now that we understand how gradient descent works, we face a practical challenge: how much data should we use for each parameter update? This might seem like a minor detail, but it’s actually one of the most critical decisions in training neural networks. Use too much data and training becomes painfully slow. Use too little and the learning process becomes chaotic and unstable.

Batch Gradient Descent uses the entire dataset to calculate gradients before updating parameters. This approach gives you the most accurate gradient direction because it considers every single data point. The mathematical precision is beautiful, but the computational cost is brutal. With large datasets, a single update can take a lot of time. Plus, you need enough RAM to load the entire dataset, which becomes difficult with millions of samples. However, batch gradient descent has a subtle problem: it’s too deterministic. In high-dimensional spaces, the algorithm is more likely to get stuck at saddle points: points where the gradient is zero but it’s not a true minimum.

3D Plane

Looking at the complex loss landscape in the image, you can see multiple valleys and peaks. A local minimum is a point where the gradient is zero and the loss is lower than all immediately surrounding points, but it’s not the deepest valley in the entire landscape (the global minimum). Once gradient descent reaches such a point, it stops moving because the gradient indicates no improvement in any direction, even though better solutions exist elsewhere in the parameter space.

Stochastic Gradient Descent (SGD) swings to the opposite extreme. It updates parameters after every single training example. Each update is very cheap, so the model starts learning immediately from each example. However, the learning path becomes erratic and noisy. One data point might push the weights in one direction, while the next pulls them back. But here’s the interesting twist: this noise is actually a feature, not a bug. The stochastic nature introduces random fluctuations that can push the algorithm out of local minima. When SGD encounters a local minimum, the noise from individual samples can provide enough of a push to escape the valley and potentially reach a better minimum that batch gradient descent would not find.

Mini-Batch Gradient Descent strikes the perfect balance. Instead of using one sample or all samples, it processes small groups (typically 32, 64, 128 or 256 samples) at a time. For our URL classifier, this means analyzing 64 URLs, calculating their average error, updating the weights and then moving to the next batch of 64. This approach combines the stability of batch learning with the speed of stochastic learning. Why does this work so well? Mini-batches provide enough samples to smooth out individual noise while remaining computationally manageable. The gradient estimates are more reliable than single samples but much faster to compute than full batches. Modern GPUs are optimized for these parallel operations, making mini-batch processing incredibly efficient. More importantly, mini-batches retain enough randomness to escape local minima while providing enough stability to converge reliably.
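The three variants differ only in how many samples feed each update, so a single training function with a batch_size argument can sketch all of them. The synthetic data (y = 2x + 1 plus noise), the learning rate and the number of epochs below are assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (assumption): y = 2*x + 1 plus a little noise
X = rng.uniform(-1, 1, size=(1000, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.05, size=1000)

def train(batch_size, epochs=20, learning_rate=0.1):
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)               # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = w * X[idx, 0] + b - y[idx]
            # Gradients of the mean squared error over the current batch
            grad_w = 2 * np.mean(error * X[idx, 0])
            grad_b = 2 * np.mean(error)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

w_sgd, b_sgd = train(batch_size=1)            # stochastic: 1000 updates per epoch
w_mini, b_mini = train(batch_size=64)         # mini-batch: 16 updates per epoch
w_batch, b_batch = train(batch_size=len(X))   # full batch: 1 update per epoch

print(f"SGD:        w={w_sgd:.3f}, b={b_sgd:.3f}")
print(f"mini-batch: w={w_mini:.3f}, b={b_mini:.3f}")
print(f"full batch: w={w_batch:.3f}, b={b_batch:.3f}")
```

With the same number of epochs, the full-batch run performs only 20 updates in total and therefore ends further from the true values (2 and 1) than the mini-batch run in this setup.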

The batch size becomes a crucial hyperparameter to tune. Hyperparameters are configuration settings of the model or learning process that are not learned from the data. Smaller batches (16-64) provide more frequent updates and can escape local minima more easily, but the learning path is noisier. Larger batches (128-512) provide more stable gradients but require more memory and might get stuck in sub-optimal solutions. Another critical hyperparameter is the learning rate, which controls the step size in gradient descent. A high learning rate makes the algorithm take large steps, potentially speeding up convergence but risking overshooting the minimum entirely. The algorithm might bounce around the optimal solution without ever settling into it or, worse, diverge completely. A low learning rate ensures stable, precise steps but can make training extremely slow. The algorithm might take thousands of epochs to reach convergence or get permanently stuck in the first local minimum it encounters because the steps are too small to escape.
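The effect of the learning rate is easy to reproduce on a toy problem. The sketch below runs plain gradient descent on f(x) = x^2 with three learning rates; the function and the specific values are assumptions picked to show the three regimes.

```python
def run(learning_rate, steps=50, start=1.0):
    # Gradient descent on f(x) = x^2, whose gradient is 2x
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x

too_small = run(0.001)   # barely moves away from the start
reasonable = run(0.1)    # converges toward the minimum at 0
too_large = run(1.1)     # every step overshoots further: divergence

print(f"lr=0.001 -> x={too_small:.4f}")
print(f"lr=0.1   -> x={reasonable:.6f}")
print(f"lr=1.1   -> x={too_large:.1f}")
```

With a learning rate of 1.1 each update multiplies x by -1.2, so the iterate flips sign and grows at every step instead of settling.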


Neural Network Architecture Fundamentals

Now that we understand how to optimize parameters efficiently, let’s examine in detail the category of functions that we want to optimize: the neural network architecture itself.

Artificial Neurons

At the heart of every neural network lies the artificial neuron, a mathematical abstraction inspired by biological neurons but designed for computational efficiency. The architecture of a digital neuron is simple but powerful. Each neuron receives multiple inputs, applies weights to each input, sums everything together, adds a bias term and passes the result through an activation function.

Think of weights as importance factors: they determine how much each input contributes to the final output. The bias acts as a base value that shifts the entire function, while the activation function introduces non-linearity, allowing the network to model complex relationships between inputs and outputs. A common example of an activation function is ReLU (Rectified Linear Unit), defined as the maximum of 0 and the input value. Besides the non-linearity mentioned above, ReLU also introduces sparsity, because many neurons output exactly zero.

So, to summarize again, the calculation performed by a neuron can be divided into three main steps:

  • Multiply each input by its corresponding weight.
  • Sum all the results and include the bias term.
  • Pass the result through a nonlinear function (the activation function).

We’ll start with the forward computation, the function used to generate predictions from input data. In essence, this is the part of the neuron responsible for predicting an output based on what it has learned so far. At this stage, we are not yet training the neuron. We are simply running a forward pass, taking input data and processing it through the neuron’s internal computation to produce an output.

The forward() function must be differentiable (at least almost everywhere), because during training we compute gradients using calculus. If the function isn’t differentiable, we can’t calculate these gradients properly and the learning algorithm breaks down. ReLU has a single kink at zero, which is handled by convention, as we’ll see in the backward pass.

Here’s how it works, step by step:

import numpy as np

class Neuron:
    def __init__(self, num_inputs):
        # Initialize weights and bias with small random values
        self.weights = np.random.randn(num_inputs) * 0.1
        self.bias = np.random.randn() * 0.1
        # Define the activation function (ReLU)
        self.activation = lambda x: np.maximum(0, x)
    
    def forward(self, inputs):
        # Ensure inputs are 2D (batch_size, num_inputs)
        if inputs.ndim == 1:
            # Single sample (num_inputs,) → (1, num_inputs)
            inputs = inputs.reshape(1, -1)
        # Store inputs for backward pass
        self.last_input = inputs.copy()  
        # Weighted sum + bias
        self.last_z = np.dot(inputs, self.weights) + self.bias
        # Apply activation function
        self.last_output = self.activation(self.last_z)
        # Return the activated output
        return self.last_output

The reshape call ensures that the input is always treated as a 2D array, even when processing a single sample.

Let’s break down what happens in the __init__ method:

  • Bias and Weight Initialization: We initialize bias and weights with small random values drawn from a normal distribution and scaled by 0.1. Why random? Because if all weights started at zero (or any identical value), every neuron in a layer would compute the same output and receive the same update, so they would all learn the same function. Random initialization breaks this symmetry, ensuring that each hidden neuron learns to detect different features in the data. The small scaling factor (0.1) prevents the initial outputs from being too large, which could lead to problems during training.
  • Activation Function Choice: We use ReLU (Rectified Linear Unit) because it’s computationally efficient and introduces sparsity (many neurons output zero), which can improve generalization.

The forward method implements the core neuron computation:

  • Linear Transformation: The dot product np.dot(inputs, self.weights) + self.bias represents the weighted sum of inputs plus bias. This is the fundamental linear operation that every neuron performs.
  • Non-linear Activation: The ReLU function introduces non-linearity.

We store intermediate values (last_input, last_z and last_output) because they will be needed later during the backward pass when we update the parameters.

Feedforward Neural Network (FFN)

A single neuron is powerful on its own, but the true strength of neural networks comes from combining many neurons into layers and stacking these layers to form deep networks. Each additional layer enables the network to learn progressively more abstract and complex representations of the data. This capability largely depends on the non-linearity of activation functions; without them, stacking layers would simply be equivalent to a single linear transformation, which would severely limit the network’s expressiveness and learning capacity.
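The claim about stacked linear layers can be checked numerically. In this sketch (random weights and shapes are arbitrary assumptions), two linear layers without an activation are shown to equal one linear layer, while inserting ReLU between them generally breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                        # 5 samples, 3 features

W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

# Two linear layers stacked without any activation in between
two_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping written as a single linear layer
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b
print("linear stack equals one layer:", np.allclose(two_layers, one_layer))

# With ReLU in between, the collapse into one linear layer no longer applies
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
print("ReLU stack equals one layer:  ", np.allclose(with_relu, one_layer))
```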

A Feedforward Neural Network (FFN) is one of the simplest and most fundamental neural architectures. Its defining characteristic is that information flows strictly in one direction, from input to output, without any loops or cycles. Within feedforward neural networks, one common type of layer is the fully connected layer, also known as a dense layer. In a fully connected layer, every neuron in one layer is connected to every neuron in the following layer. This creates a dense network of connections, allowing the model to capture complex and rich relationships between features by combining inputs in many different ways. However, not all feedforward networks use fully connected layers exclusively. Some architectures use sparse connectivity, where neurons only connect to a subset of neurons in the next layer.

Feedforward Neural Network (FFN)

The architecture consists of three types of layers:

  • Input Layer: Receives the raw features. It does not perform computations but simply passes the input data to the next layer.
  • Hidden Layer(s): Each hidden layer contains multiple neurons that learn to detect specific patterns or feature combinations. The first hidden layer might learn simple patterns like “URLs with many dots”, while deeper layers combine these simple patterns into complex rules like “URLs with many dots AND unusual characters AND suspicious TLDs are likely phishing.”
  • Output Layer: Produces the final prediction.

The power of feedforward networks lies in their ability to learn hierarchical representations: early layers detect simple features and deeper layers compose those into sophisticated, abstract concepts.

The universal approximation theorem proves that a feedforward network with a non-linear activation and enough hidden neurons can approximate any continuous function on a compact (closed and bounded) input domain to arbitrary accuracy. However, this theorem does not guarantee that training will actually find such weights, nor does it tell us how many neurons are needed.

Backward Pass

While the forward pass produces an output, we need a way for the neuron to learn from its mistakes. For a single neuron, this is straightforward gradient descent. The real challenge comes when we have multiple layers: how do we update weights in earlier layers when the error comes from the final output?

Single Neuron Gradient Descent

Let’s start with how a single neuron learns. The process follows these steps:

  • Calculate Error Signal: Determine how wrong the neuron’s output was.
  • Compute Activation Gradient: Find the derivative of the activation function.
  • Calculate Weight Gradients: Determine how much each weight contributed to the error.
  • Update Parameters: Adjust weights and bias based on gradients.

Since our neuron uses the ReLU function, defined as max(x, 0), the derivative is simple: it equals 1 when x is greater than 0, because in this region the function increases linearly. When x is less than 0, the derivative is 0, since the function is flat (zero slope). At exactly x = 0 the ReLU derivative is technically undefined; the convention is to set it to 0. This means the neuron only learns when it’s “active” (output > 0).

Here’s the implementation of the backward pass for the Neuron class that we previously defined:

    def backward(self, error, learning_rate=0.01):
        # Ensure error is 2D (batch_size, 1)
        if error.ndim == 1:
            error = error.reshape(-1, 1)
        # Compute the derivative of ReLU activation
        activation_grad = np.where(self.last_z > 0, 1, 0).reshape(-1, 1)
        # Compute the local error signal (delta)
        delta = error * activation_grad
        # Compute gradient for weights
        weight_gradients = (delta.T @ self.last_input / delta.shape[0]).squeeze()
        # Average bias gradient over batch (kept as a scalar so self.bias stays a float)
        bias_gradients = float(delta.mean())
        # Update parameters using gradient descent
        self.weights -= learning_rate * weight_gradients
        self.bias -= learning_rate * bias_gradients

The @ operator performs matrix multiplication. The reshape and squeeze functions ensure that the tensor dimensions align correctly with the expected shapes for calculations.
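One way to gain confidence in a hand-written backward pass is a gradient check: compare the analytical gradients with finite-difference estimates. The sketch below repeats the same steps as backward() in standalone form. It assumes that the error passed in is the derivative of a half squared error loss (output minus target), which the text above does not specify, and uses random data.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(8, 3))      # batch of 8 samples, 3 inputs
y = rng.normal(size=8)           # arbitrary targets (assumption)
w = rng.normal(size=3) * 0.1
b = 0.05

def loss(w, b):
    out = np.maximum(0, X @ w + b)         # forward pass with ReLU
    return np.mean(0.5 * (out - y) ** 2)   # half squared error (assumption)

# Analytical gradients, mirroring the steps of backward()
z = X @ w + b
error = np.maximum(0, z) - y               # dLoss/dOutput for each sample
delta = error * (z > 0)                    # multiply by the ReLU derivative
grad_w = X.T @ delta / len(X)              # average over the batch
grad_b = delta.mean()

# Numerical gradients via central differences
eps = 1e-6
num_grad_w = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = eps
    num_grad_w[i] = (loss(w + step, b) - loss(w - step, b)) / (2 * eps)
num_grad_b = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)

print("weights match:", np.allclose(grad_w, num_grad_w, atol=1e-5))
print("bias matches: ", abs(grad_b - num_grad_b) < 1e-5)
```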

The Backpropagation Algorithm

When we stack neurons in layers, we are faced with the following problem: If the final output is wrong, how much blame should each weight have in each layer? Backpropagation solves this using the chain rule from calculus. The algorithm works backward from the output layer to the input layer, calculating how much each parameter contributed to the final error.

Here’s how it works:

  • Forward Pass: Compute outputs for all layers, storing intermediate values
  • Output Layer Error: Calculate error between prediction and true label using a loss function
  • Backward Propagation: For each layer (from output to input):
    • Calculate how much this layer contributed to the error
    • Update the layer’s weights and biases
    • Pass the error signal to the previous layer

The key insight is that each layer’s error depends on two things:

  • How wrong the next layer was (error signal from above)
  • How much this layer’s output affected the next layer (weights connecting them)

The error signal propagates as:

error_current_layer = error_next_layer × weights_to_next_layer × activation_derivative

This is the chain rule in action. The error flows backward through the network, getting scaled by the weights and activation derivatives at each step.
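To see the propagation rule at work, here is a sketch of manual backpropagation through a tiny two-layer network in NumPy, checked against a finite-difference estimate for one weight. The layer sizes, the random data and the half squared error loss are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                      # 4 samples, 3 features
y = rng.normal(size=(4, 1))                      # arbitrary targets

W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)    # hidden layer (ReLU)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)    # linear output layer

def forward(W1, b1, W2, b2):
    z1 = x @ W1 + b1
    a1 = np.maximum(0, z1)
    out = a1 @ W2 + b2
    loss = 0.5 * np.mean(np.sum((out - y) ** 2, axis=1))
    return z1, a1, out, loss

z1, a1, out, loss = forward(W1, b1, W2, b2)

# Output layer: error between prediction and target
delta2 = (out - y) / len(x)
grad_W2 = a1.T @ delta2
grad_b2 = delta2.sum(axis=0)

# Hidden layer: error from the next layer, scaled by the connecting
# weights and by the ReLU derivative of this layer
delta1 = (delta2 @ W2.T) * (z1 > 0)
grad_W1 = x.T @ delta1
grad_b1 = delta1.sum(axis=0)

# Finite-difference check for a single first-layer weight
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (forward(W1_plus, b1, W2, b2)[3]
           - forward(W1_minus, b1, W2, b2)[3]) / (2 * eps)

print(f"backprop: {grad_W1[0, 0]:.6f}, numeric: {numeric:.6f}")
```

Note that delta2 is computed once and reused for both the output-layer gradients and the hidden-layer error, which is the reuse that makes the backward sweep cheap.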

Backpropagation is efficient because it reuses computations. Instead of calculating the gradient of each parameter independently (for example by perturbing one parameter at a time, which needs at least one extra forward pass per parameter), it calculates them all in one backward sweep through the network. The cost of obtaining every gradient is therefore comparable to that of a few forward passes, regardless of the number of parameters. The algorithm also provides the exact gradient for each parameter (up to floating-point precision), not a finite-difference approximation.

In practice, modern deep learning frameworks like PyTorch and TensorFlow handle backpropagation automatically. You define the forward pass and the framework builds a computational graph: a record of all mathematical operations performed. When you call loss.backward(), it automatically computes all gradients using backpropagation. However, understanding the underlying mechanism helps you debug training issues, choose appropriate architectures and optimize performance.
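A minimal sketch of this in PyTorch, using a single scalar parameter so the gradient can be verified by hand. The values of w, x and y are arbitrary assumptions.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)  # parameter tracked by autograd
x = torch.tensor(3.0)
y = torch.tensor(10.0)

loss = (w * x - y) ** 2   # the forward pass records the operations
loss.backward()           # backpropagation fills in w.grad

# By hand: d/dw (w*x - y)^2 = 2 * (w*x - y) * x = 2 * (6 - 10) * 3 = -24
print(w.grad.item())      # -24.0
```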


Implementation Details

Now that we understand the theory and have our features extracted, let’s implement a complete training pipeline for a URL classifier.

Data Loading and Preparation

We already have our features engineered from the previous article: the input tensor X contains 84 numerical features per URL and the target tensor y contains one-hot encoded labels for four classes (benign, defacement, malware, phishing).

Recall our discussion of Mini-Batch Gradient Descent (MBGD): instead of computing gradients over the entire dataset (slow) or single samples (noisy), we process small batches. This balances computational efficiency with stable gradient estimates. PyTorch provides abstractions to handle this batching automatically: Dataset and DataLoader.

A Dataset represents a collection of samples. It implements __len__() to return the number of samples and __getitem__() to retrieve a sample by index. For tensor data, PyTorch provides TensorDataset, which wraps your features and labels into a Dataset object.
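For illustration, this is roughly what a hand-written Dataset looks like next to TensorDataset. The class name URLFeatureDataset and the random stand-in data are hypothetical; only the two required methods matter.

```python
import torch
from torch.utils.data import Dataset, TensorDataset

class URLFeatureDataset(Dataset):  # hypothetical name, for illustration only
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        # Number of samples in the dataset
        return len(self.features)

    def __getitem__(self, index):
        # One (features, label) pair
        return self.features[index], self.labels[index]

# Random stand-in data: 10 samples, 84 features, 4 one-hot classes
features = torch.randn(10, 84)
labels = torch.eye(4)[torch.randint(0, 4, (10,))]

custom = URLFeatureDataset(features, labels)
builtin = TensorDataset(features, labels)

print(len(custom), len(builtin))
print(torch.equal(custom[3][0], builtin[3][0]))
```

For plain in-memory tensors the two behave the same; a custom class becomes useful when samples need to be loaded or transformed lazily.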

A DataLoader wraps a Dataset and handles the mini-batch processing. It provides features like:

  • Batching: Splits data into chunks of batch_size.
  • Shuffling: Randomizes sample order to prevent learning order-dependent patterns.
  • Multiprocessing: Loads data in parallel (useful for large datasets or heavy preprocessing).

from torch.utils.data import TensorDataset, DataLoader
import torch

print(f"Dataset size: {X.shape[0]} samples")
# Dataset size: 640181 samples
print(f"Number of features: {X.shape[1]}")
# Number of features: 84
print(f"Number of classes: {y.shape[1]}")
# Number of classes: 4

# Convert to tensors 
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32)

# Create TensorDataset
dataset = TensorDataset(X_tensor, y_tensor)

# Create DataLoader for batch training
batch_size = 256
train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

print(f"Number of batches per epoch: {len(train_loader)}")
# Number of batches per epoch: 2501

The DataLoader will automatically shuffle our 640,181 samples and yield batches of 256 samples (the final batch holds whatever remains). But shuffle when? This brings us to the concept of epochs. An epoch represents one complete pass through the entire training dataset. Training a neural network typically requires multiple epochs because a single pass through the data isn’t enough for the model to learn all the patterns. With each epoch, the model sees the same data again but in a different order (thanks to shuffling), which helps it generalize better and avoid memorizing the sequence.

The DataLoader handles this for us: at the start of each epoch, it automatically reshuffles the dataset and manages the batching process. You simply iterate over the DataLoader multiple times (once per epoch) and it takes care of randomizing the order, splitting data into mini-batches and feeding them to your model.
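The behaviour described above can be observed on a toy dataset of ten numbered samples (an assumption made so the order is easy to read). The last line reproduces the batch count printed earlier from the dataset size and batch size.

```python
import math
import torch
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)
toy = TensorDataset(torch.arange(10).float().unsqueeze(1), torch.zeros(10))
loader = DataLoader(toy, batch_size=4, shuffle=True)

epoch_orders = []
for epoch in range(2):
    seen, batch_sizes = [], []
    for xb, _ in loader:               # each full iteration is one epoch
        batch_sizes.append(len(xb))
        seen.extend(int(v) for v in xb.flatten())
    epoch_orders.append(seen)
    print(f"epoch {epoch}: batch sizes {batch_sizes}, order {seen}")

# Every sample appears exactly once per epoch; the last batch is smaller
print(len(loader))                     # 3 batches: 4 + 4 + 2
print(math.ceil(640181 / 256))         # 2501
```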

Using raw tensors instead? You could skip Dataset and DataLoader for small in-memory datasets.

# Shuffle once per epoch, then slice the tensors batch by batch
indices = torch.randperm(len(X_tensor))
for i in range(0, len(X_tensor), batch_size):
    batch_idx = indices[i:i+batch_size]
    x_batch, y_batch = X_tensor[batch_idx], y_tensor[batch_idx]
    # ... training ...

This works but requires reimplementing some of the DataLoader features.

Neural Network Architecture

Now let’s build our URL classifier using PyTorch. Every PyTorch model follows a consistent pattern: create a class that inherits from torch.nn.Module, define the network layers in the constructor (__init__) and implement the forward pass in the forward() method.

class URLClassifier(torch.nn.Module):
    """
    Simple feedforward neural network for URL classification.
    """
    def __init__(self, input_size, hidden_size=64, num_classes=4):
        super().__init__()
        
        # Define network layers
        self.fc1 = torch.nn.Linear(input_size, hidden_size)
        self.fc2 = torch.nn.Linear(hidden_size, hidden_size)
        self.fc3 = torch.nn.Linear(hidden_size, num_classes)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        """
        Define the forward pass.
        """
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Initialize the model
model = URLClassifier(input_size=84)
print(model)
# URLClassifier(
#   (fc1): Linear(in_features=84, out_features=64, bias=True)
#   (fc2): Linear(in_features=64, out_features=64, bias=True)
#   (fc3): Linear(in_features=64, out_features=4, bias=True)
#   (relu): ReLU()
# )

The __init__ method defines the components of our network architecture:

  • fc1, fc2, fc3: Linear (fully connected) layers that transform data through the network. Each Linear layer is a group of neurons; the number of neurons in each layer is defined by the second parameter (out_features). The “fully connected” name comes from the fact that every input connects to every output neuron.
  • relu: ReLU activation function for non-linearity, enabling the network to learn complex patterns beyond linear relationships.

The forward() method defines how data flows through the network: 84 input features → 64 neurons → 64 neurons → 4 output neurons (one per class). When you call model(x), PyTorch automatically invokes this method and builds a computational graph tracking all operations, enabling automatic gradient computation during backpropagation. Notice we don’t apply activation to the output layer. The network returns raw scores called logits (unnormalized prediction scores) which our loss function will use directly. For predictions, we can add a predict() method that converts these logits into probabilities using the softmax function, the standard choice for multi-class classification that transforms scores into a probability distribution summing to 1.
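A predict() method along those lines might look like the sketch below. To keep the snippet self-contained it uses a smaller stand-in class, TinyClassifier, which is hypothetical; the same method could be added to URLClassifier unchanged.

```python
import torch

class TinyClassifier(torch.nn.Module):  # hypothetical stand-in for URLClassifier
    def __init__(self, input_size=84, hidden_size=64, num_classes=4):
        super().__init__()
        self.fc1 = torch.nn.Linear(input_size, hidden_size)
        self.fc2 = torch.nn.Linear(hidden_size, num_classes)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        # Raw logits, no activation on the output layer
        return self.fc2(self.relu(self.fc1(x)))

    def predict(self, x):
        # Inference helper: probabilities and predicted class indices
        self.eval()                      # note: leaves the model in eval mode
        with torch.no_grad():            # no gradients needed for inference
            logits = self(x)
            probs = torch.softmax(logits, dim=-1)
        return probs, probs.argmax(dim=-1)

sketch_model = TinyClassifier()
probs, classes = sketch_model.predict(torch.randn(5, 84))
print(probs.sum(dim=-1))   # each row sums to 1
print(classes)
```

Keeping softmax out of forward() matters because the loss function used for training expects logits, not probabilities.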

Automatic Backpropagation: You only need to define the forward pass. PyTorch’s autograd system automatically computes gradients for all parameters as long as your operations are differentiable (which is the case for the layers and activations used here).

Training Procedure

The training loop ties everything together. For each epoch, we iterate through mini-batches, perform forward passes to get predictions, calculate loss using Cross-Entropy Loss and update parameters via backpropagation. PyTorch handles the gradient computation automatically through its autograd system.

# Define loss function and optimizer
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 20
train_losses = []
for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0.0
    num_batches = 0
    for batch_X, batch_y in train_loader:
        # Forward pass
        outputs = model(batch_X)
        # Convert one-hot labels to class indices for CrossEntropyLoss
        batch_y_indices = torch.argmax(batch_y, dim=1)
        # Calculate loss
        loss = criterion(outputs, batch_y_indices)
        # Backward pass and optimization
        optimizer.zero_grad()  # Clear previous gradients
        loss.backward()        # Compute gradients
        optimizer.step()       # Update parameters
        epoch_loss += loss.item()
        num_batches += 1
    
    # Calculate average loss for the epoch
    avg_loss = epoch_loss / num_batches
    train_losses.append(avg_loss)
    
    print(f"Epoch [{epoch+1:2}/{num_epochs}], Loss: {avg_loss:.4f}")
    # Epoch [ 1/20], Loss: 0.2931
    # Epoch [ 2/20], Loss: 0.2446
    # Epoch [ 3/20], Loss: 0.1987
    # ...
    # Epoch [20/20], Loss: 0.1292

Let’s break down the key components:

Loss Function: CrossEntropyLoss expects raw logits (unnormalized scores) from the model. It combines log-softmax with negative log-likelihood loss in a single operation, which is more numerically stable than applying softmax and a logarithm separately. As targets we pass class indices (0 for benign, 1 for defacement, 2 for malware, 3 for phishing), obtained from the one-hot labels with argmax and stored in batch_y_indices. Recent PyTorch versions also accept class probabilities as targets, but class indices are the common form for single-label classification.
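The relationship between CrossEntropyLoss, log-softmax and negative log-likelihood can be verified directly. The logits and labels below are made-up values for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(3, 4)                       # 3 samples, 4 classes
one_hot = torch.tensor([[1.0, 0.0, 0.0, 0.0],
                        [0.0, 0.0, 1.0, 0.0],
                        [0.0, 0.0, 0.0, 1.0]])
targets = one_hot.argmax(dim=1)                  # class indices: 0, 2, 3

# Built-in loss on raw logits and class indices
builtin = torch.nn.CrossEntropyLoss()(logits, targets)

# Same value from log-softmax followed by negative log-likelihood
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Same value again, written out with the one-hot labels
by_hand = -(one_hot * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

print(builtin.item(), manual.item(), by_hand.item())
```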

Optimizer: Adam (Adaptive Moment Estimation) is an optimization algorithm that adapts the step size for each parameter using running estimates of the mean and the uncentered variance of its gradients (the first and second moments). It’s more sophisticated than basic SGD and often converges faster. The optimizer needs access to all model parameters via model.parameters() so it knows what to update.

Training Steps:

  1. optimizer.zero_grad(): Clears gradients from the previous iteration. PyTorch accumulates gradients by default, so we must reset them before each backward pass.
  2. loss.backward(): Computes gradients for all parameters using backpropagation.
  3. optimizer.step(): Updates parameters using the computed gradients according to the Adam optimization rule.
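The accumulation behaviour behind step 1 is easy to demonstrate with a single parameter. The toy loss 3 * w is an assumption chosen so that its gradient is always 3.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

(3 * w).backward()
first = w.grad.item()          # 3.0

(3 * w).backward()             # no reset: the new gradient is added
accumulated = w.grad.item()    # 6.0

optimizer.zero_grad()          # clear the stored gradient
(3 * w).backward()
after_reset = w.grad.item()    # 3.0 again

print(first, accumulated, after_reset)
```

Forgetting zero_grad() in the training loop would therefore make each update use the sum of all gradients seen so far rather than the gradient of the current batch.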

Let’s visualize how the loss decreases during training:

import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(range(1, num_epochs + 1), train_losses, marker='o', linewidth=2, markersize=6)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Training Loss', fontsize=12)
plt.title('Training Loss Over Time', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Training Loss

The plot shows the loss steadily decreasing as the model learns to distinguish between URL classes. This downward trend indicates that gradient descent is successfully optimizing the network parameters. The steepest improvement happens in the first few epochs, where the model learns the most obvious patterns. Later epochs show slower but steady progress as the model refines its understanding of more subtle features.

Notice we only monitored training loss. In production systems, this is insufficient. The model might be memorizing specific training URLs rather than learning general threat patterns.
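One common way to detect memorization is to hold out part of the data and measure the loss on it after each epoch. The sketch below shows only the mechanics, using random stand-in data, an 80/20 split and an untrained stand-in model; all of these are assumptions and not the evaluation setup of this series.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

torch.manual_seed(0)

# Random stand-in data: 1000 samples, 84 features, 4 classes
full = TensorDataset(torch.randn(1000, 84), torch.randint(0, 4, (1000,)))
train_set, val_set = random_split(full, [800, 200])   # 80/20 split (assumption)
val_loader = DataLoader(val_set, batch_size=256, shuffle=False)

# Untrained stand-in model with the same input and output sizes
sketch_model = torch.nn.Sequential(
    torch.nn.Linear(84, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4)
)
criterion = torch.nn.CrossEntropyLoss()

def evaluate(model, loader):
    # Average loss over a loader, without updating any parameters
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for xb, yb in loader:
            total += criterion(model(xb), yb).item() * len(xb)
            count += len(xb)
    return total / count

val_loss = evaluate(sketch_model, val_loader)
print(f"validation loss: {val_loss:.4f}")
```

A training loss that keeps falling while the validation loss stalls or rises is the usual sign that the model is memorizing rather than generalizing.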

Making Predictions

After training, we can use our model to predict the class of new URLs. The model outputs raw scores (logits) for each class. To get the predicted class, we simply take the index of the highest score using argmax.

# Mapping from index to label (our mapping is based on sorted labels)
index_to_label = {idx: label for idx, label in enumerate(sorted(df["type"].unique()))}
model.eval()  # Switch to evaluation mode for inference
# Show some predictions
for i in range(3):
    example = df.iloc[np.random.randint(0, len(df))]
    print(f"URL {i}: {example['url']}")
    features = torch.tensor(example[feature_columns].values.astype(np.float32))
    with torch.no_grad():  # No gradients are needed for inference
        prediction = model(features).numpy()
    print(f"Prediction {i}: {prediction} => {index_to_label[np.argmax(prediction)]}")
    print(f"Real label {i}: {example['type']}\n")
    # URL 0: newbob.com.my/rssfeed.php
    # Prediction 0: [  1.9302212  -15.365302    -3.2674837    0.16405655] => benign
    # Real label 0: benign

    # URL 1: amazon.com/Love-Laughter-John-Ritter/dp/1416598413
    # Prediction 1: [  3.8739717 -38.427227  -11.131769   -3.1127045] => benign
    # Real label 1: benign

    # URL 2: http://center-translate.ru/index.php/osobennosti-armyanskogo-yazyka
    # Prediction 2: [-4.188846    4.7605495  -1.0622969  -0.37987226] => defacement
    # Real label 2: defacement

The model outputs four numbers (one per class). The highest value indicates the predicted class. For example, in the first prediction, the benign class has score 1.93 while all others are negative or near zero, so the model confidently predicts benign.
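To read those scores as probabilities, the logits of the first prediction can be passed through softmax. The label order follows the sorted mapping described earlier.

```python
import torch

# Logits copied from the first prediction shown above
logits = torch.tensor([1.9302212, -15.365302, -3.2674837, 0.16405655])
labels = ["benign", "defacement", "malware", "phishing"]

probs = torch.softmax(logits, dim=0)
for label, p in zip(labels, probs.tolist()):
    print(f"{label:<11} {p:.4f}")

predicted = labels[int(probs.argmax())]
print("predicted:", predicted)
```

Benign comes out at roughly 0.85 and phishing at roughly 0.15, so the prediction is confident but not certain.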

We can also store the model’s weights and metadata for later use:

# Save the model weights and metadata
torch.save({
    "model_state_dict": model.state_dict(),
    "metadata": {
        "input_size": 84,
        "hidden_size": 64,
        "num_classes": 4,
        "index_to_label": index_to_label
    }
}, "url_classifier.pth")
print("Model saved to url_classifier.pth")
# Model saved to url_classifier.pth

# Load the model
checkpoint = torch.load("url_classifier.pth")
loaded_model = URLClassifier(
    input_size=checkpoint["metadata"]["input_size"],
    hidden_size=checkpoint["metadata"]["hidden_size"],
    num_classes=checkpoint["metadata"]["num_classes"]
)
loaded_model.load_state_dict(checkpoint["model_state_dict"])
loaded_model.eval()  # Set to evaluation mode
print("Model loaded from url_classifier.pth\n")
# Model loaded from url_classifier.pth

This completes our first training run! We’ve successfully built and trained a neural network that can classify URLs into four categories. However, this is just the beginning. In the next article, we’ll explore how to properly evaluate model performance, optimize hyperparameters like learning rate and network architecture, handle class imbalance in our dataset and implement rigorous evaluation metrics. With these improvements, we’ll build a production-ready URL classifier that generalizes well to unseen data.

References