In the previous articles, we designed a neural network and watched it learn how to classify URLs. But here’s the uncomfortable truth: a model that “works” in a notebook isn’t necessarily ready for the real world. The gap between a functioning prototype and a production-ready system is filled with questions we haven’t answered yet. How do we know if our model will perform well on new, unseen data? Can we reproduce our results six months from now? Are we measuring the right things? Is our model as good as it could be?

⚠️ Disclaimer
This series is not a substitute for a full and rigorous deep learning course. Its goal is to introduce key concepts in an accessible and practical way, particularly for readers with a security background, and to provide a solid foundation for the more advanced blog articles that will follow.

If you’re looking for a deeper, textbook-level treatment of the subject, I highly recommend Dive into Deep Learning: a free, open-source book with hands-on examples and theoretical depth.


Evaluation Metrics That Matter

The previous article tracked only loss during training, but loss is just a proxy for overall performance. The real question is: How well does the model perform on the task we care about?

For classification tasks, the most common metric is accuracy: the percentage of correct predictions. However, accuracy alone can be misleading, especially with imbalanced datasets, which are very common in the real world.

The Confusion Matrix

A confusion matrix reveals exactly where your model succeeds and fails by showing how predictions align with true labels across all classes.

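import numpy as np
import torch
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Get predictions on test set (no gradients are needed for evaluation)
model.eval()
with torch.no_grad():
    y_pred = model(X_test).numpy()
y_pred_classes = np.argmax(y_pred, axis=1)
# Labels are one-hot encoded: convert them back to class indices
y_true_classes = torch.argmax(y_test, dim=1).numpy()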

# Create confusion matrix
cm = confusion_matrix(y_true_classes, y_pred_classes)
labels = ['benign', 'defacement', 'malware', 'phishing']

# Visualize
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot(cmap='Blues', values_format='d')
plt.title('Confusion Matrix')
plt.show()

Structure of the confusion matrix:

  • Rows (vertical axis): True labels (what the URL actually is).
  • Columns (horizontal axis): Predicted labels (what the model said).
  • Diagonal cells: Correct predictions.
  • Off-diagonal cells: Errors.

Confusion Matrix Example

From this visualization, you can spot critical patterns like “the model frequently confuses phishing with benign URLs” or “defacement detection is nearly perfect with few false positives”. These insights drive concrete improvements: if phishing URLs are misclassified as benign, you might collect more training examples, engineer features that capture brand impersonation patterns (like typosquatting) or adjust class weights to penalize phishing misses more heavily.

Beyond Accuracy

While accuracy gives a high-level overview, it doesn’t capture some important aspects of model performance, especially when classes are imbalanced. For example, if 95% of URLs are benign, a model that always predicts “benign” achieves 95% accuracy but is useless for detecting threats.

For a more rigorous evaluation, we can use the following per-class metrics for a classification task:

  • Precision: Of all URLs labeled as malicious, how many truly were malicious?
    High precision means few false alarms, critical when automated blocking is involved.
  • Recall: Of all actual malicious URLs, how many did we catch?
    High recall means few missed threats, critical for security monitoring.
  • F1-Score: The F1-score balances both concerns into a single metric. Unlike accuracy, it remains robust when classes are imbalanced: it won’t be artificially inflated by a model that simply predicts the majority class.

Precision and Recall
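To make the three definitions above concrete, here is a minimal sketch that derives them from a confusion matrix laid out like the one shown earlier (rows are true labels, columns are predictions). The helper name per_class_metrics and the toy numbers are assumptions made for illustration, not results from our classifier.

```python
def per_class_metrics(cm, labels):
    """Compute precision, recall and F1 per class from a confusion matrix.

    cm[i][j] is the number of samples with true class i predicted as class j.
    """
    metrics = {}
    n = len(labels)
    for k, name in enumerate(labels):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, but wrong
        fn = sum(cm[k]) - tp                        # true k, but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[name] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

# Toy 2-class example (made-up numbers): 95 benign and 5 phishing URLs,
# classified by a model that always answers "benign".
toy_labels = ["benign", "phishing"]
toy_cm = [[95, 0],
          [5, 0]]
toy_metrics = per_class_metrics(toy_cm, toy_labels)
accuracy = (toy_cm[0][0] + toy_cm[1][1]) / 100

print(accuracy)                 # 0.95
print(toy_metrics["phishing"])  # recall is 0.0: every phishing URL is missed
```

The toy model reaches 95% accuracy while its phishing recall and F1 are both zero, which is exactly the failure that accuracy alone hides. In practice, scikit-learn’s classification_report produces the same per-class numbers.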

Beyond statistical metrics, real-world models can require considering the operational impact of each error type:

  • False Positive (blocking a benign URL): User frustration, lost productivity, potential revenue loss if legitimate commerce is blocked.
  • False Negative (missing a phishing URL): Potential credential theft, data breach, financial fraud, regulatory penalties.

These costs vary by context. A banking application may prioritize recall (catch every threat) even at the expense of precision (more false alarms), while a public web filter may balance both to avoid user disruption.

Cost-sensitive evaluation incorporates these real-world impacts by assigning different weights to each error type. By defining a cost matrix, a table that quantifies the actual business impact of each misclassification (e.g. phishing URLs missed cost more than benign URLs blocked), you can create a custom cost function and evaluate (or even train) models that minimize total operational cost rather than just maximizing accuracy. This transforms model optimization from a purely statistical exercise into a business-aligned decision problem.

# Example cost matrix
cost_matrix = {
    'phishing_as_benign': 1000,  # High cost: missed threat
    'benign_as_phishing': 10,    # Lower cost: false alarm
    'malware_as_benign': 5000,   # Very high cost: critical threat missed
}

# Calculate weighted cost from confusion matrix (cm)
total_cost = sum(
    cm[i][j] * cost_matrix.get(f'{labels[i]}_as_{labels[j]}', 0)
    for i in range(len(labels))
    for j in range(len(labels)) if i != j
)

Dataset and Experiment Management

A critical step in moving from prototype to production is ensuring that our model’s performance is robust and generalizable. This starts with how we handle our dataset and design our experiments.

Dataset Splitting Techniques

Evaluating a model on the same data it was trained on is like studying with the exam questions already in hand. On paper the model will perform incredibly well, but in the real world it could fail miserably on new data it has never seen before. In short, this evaluation strategy creates the following problems:

  • Overestimation of the model’s performance, since we evaluate it on data it has already seen during the training phase.
  • Inability to detect overfitting, which occurs when the model learns noise and specific patterns from training data instead of generalizable rules, causing poor performance on new unseen data.

Overfitting and Underfitting

  • Overfitting occurs when the model memorizes training data rather than learning generalizable patterns, leading to poor performance on new data. For example, a model might learn that URLs containing bank are always phishing if not one legitimate banking site is in the dataset.
  • Underfitting happens when the model is too simple to capture meaningful patterns. A model with few neurons or limited input features may not be able to detect the complex relationships present in the data.
    Overfitting

One solution to avoid overestimating performance is to exclude the data used for testing from the training process, typically by splitting the data into separate sets, each with a specific purpose:

  1. Training Set (70-80%): Used to train the model, updating weights through backpropagation.
  2. Validation Set (10-15%): Used during training to choose the best hyperparameters and to monitor overfitting. The model is never trained on this data, but we use it to make decisions about the model.
  3. Test Set (10-15%): Held out completely until final evaluation. This simulates real-world performance on truly unseen data.

Here’s how to implement this split with a random shuffle, which approximately preserves the class distribution on a dataset of this size:

import numpy as np
from torch.utils.data import TensorDataset, DataLoader

def split(X, y, train_ratio=0.8):
    """
    Randomly split tensors X and y into two parts based on the given ratio.
    """
    # Shuffle and split indices
    indices = list(range(len(X)))
    split_at = int(train_ratio * len(X))  # Index where the first part ends
    np.random.shuffle(indices)

    # Split the data
    train_indices, val_indices = indices[:split_at], indices[split_at:]
    train_X, val_X = X[train_indices], X[val_indices]
    train_y, val_y = y[train_indices], y[val_indices]

    # Return the split datasets
    return train_X, train_y, val_X, val_y

X_train, y_train, X_tmp, y_tmp = split(X, y, train_ratio=0.8)
X_val, y_val, X_test, y_test = split(X_tmp, y_tmp, train_ratio=0.5)

print(f"Training samples: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
# Training samples: 512144 (80.0%)
print(f"Validation samples: {len(X_val)} ({len(X_val)/len(X)*100:.1f}%)")
# Validation samples: 64018 (10.0%)
print(f"Test samples: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")
# Test samples: 64019 (10.0%)

This split ensures that our model is trained, validated and tested on separate data, providing a more accurate estimate of its real-world performance. The class distribution should also be similar across all sets to avoid bias; random shuffling usually takes care of this for large datasets.
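When the dataset is small or a class is very rare, random shuffling may not be enough to keep the class distribution similar across sets. A stratified split handles this explicitly by splitting each class separately. The sketch below is one possible implementation, not the approach used in this series: the function name stratified_split and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(class_ids, train_ratio=0.8, seed=42):
    """Return two index lists so that every class is split by train_ratio."""
    rng = random.Random(seed)

    # Group sample indices by class
    by_class = defaultdict(list)
    for idx, class_id in enumerate(class_ids):
        by_class[class_id].append(idx)

    # Shuffle and cut each class separately
    first, second = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(train_ratio * len(idxs))
        first.extend(idxs[:cut])
        second.extend(idxs[cut:])

    # Mix the classes again so batches are not ordered by class
    rng.shuffle(first)
    rng.shuffle(second)
    return first, second

# Toy labels (made up): 90 samples of class 0 and 10 samples of class 1
toy_class_ids = [0] * 90 + [1] * 10
train_idx, rest_idx = stratified_split(toy_class_ids, train_ratio=0.8)

print(len(train_idx), len(rest_idx))             # 80 20
print(sum(toy_class_ids[i] for i in train_idx))  # 8 samples of class 1
```

With one-hot tensors like ours, the class ids could be obtained with torch.argmax(y, dim=1).tolist() and the returned index lists used to index X and y. scikit-learn’s train_test_split offers the same behavior through its stratify parameter.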

The validation set can also be used to monitor the training progress and identify overfitting. If the validation loss does not decrease while the training loss continues to decrease, it’s a sign that the model is memorizing the training data rather than learning general patterns.

Overfitting Loss Curve

There are more advanced splitting techniques, like K-fold cross-validation, that make our evaluation even more robust by training multiple models on different data splits. This involves dividing the dataset into K subsets (folds) and training the model K times, each time using a different fold as the validation set and the remaining K-1 folds as the training set. The final performance is averaged across all K runs. This method is especially useful to check that our model’s performance is consistent and not dependent on a particular train/validation split. However, it is computationally expensive for large datasets. For our URL classifier, with roughly 640,000 samples, a single train/val/test split is sufficient.
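As a rough illustration of the K-fold procedure described above, the following sketch generates the index splits only. The helper name k_fold_indices is an assumption, and the training and evaluation of each fold are left out.

```python
import random

def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Build k folds of (nearly) equal size
    folds = [indices[i::k] for i in range(k)]

    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

fold_sizes = [(len(train), len(val)) for train, val in k_fold_indices(10, k=5)]
print(fold_sizes)  # [(8, 2), (8, 2), (8, 2), (8, 2), (8, 2)]
```

Each of the K iterations would train a fresh model on train_idx and score it on val_idx; the K scores are then averaged.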

Avoiding Data Leakage

An important caveat when splitting datasets is to avoid data leakage. Data leakage occurs when information from the validation or test set inadvertently influences model training. Common mistakes:

  • Normalizing before splitting: If you calculate mean/std on the entire dataset before splitting, your validation and test sets leak information into the training process.
  • Feature selection on all data: Selecting features based on their correlation with labels across the entire dataset.

The golden rule: Treat validation and test sets as future data. Anything you compute (normalization statistics, feature importance, etc.) should be calculated on the training set only, then applied to validation and test sets.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# WRONG: Leaking information
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Learns from ALL data
X_train, X_test = train_test_split(X_scaled, ...)

# CORRECT: No leakage
X_train, X_test = train_test_split(X, ...)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Learns from training only
X_test_scaled = scaler.transform(X_test)  # Applies training statistics

Handling Class Imbalance

Real-world datasets are rarely balanced. In our URL dataset, benign URLs likely vastly outnumber malicious ones:

print("Train labels distribution:")
train_labels, train_counts = np.unique(torch.argmax(y_train, dim=1).numpy(), return_counts=True)
for label, count in zip(train_labels, train_counts):
    print(f"- {index_to_label[label]}: {count} ({count/len(y_train)*100:.1f}%)")
# Train labels distribution:
# - benign: 342263 (66.8%)
# - defacement: 75682 (14.8%)
# - malware: 18888 (3.7%)
# - phishing: 75311 (14.7%)

This imbalance creates two critical problems:

  • Accuracy paradox: The model achieves deceptively high accuracy by predicting the majority class (benign) most of the time. A naive classifier that always predicts “benign” would be correct 66.8% of the time without learning anything useful.
  • Minority class blindness: Rare but critical classes such as malware (3.7%) have few training examples. The model may struggle to learn their patterns, leading to inadequate detection of real threats, where false negatives are costly.

The next sections explore techniques to address class imbalance during training.

Weighted Loss Function

Instead of treating all errors equally, penalize mistakes on rare classes more heavily. For instance, PyTorch’s CrossEntropyLoss accepts per-class weights:

import pandas as pd
import torch
from torch.nn import CrossEntropyLoss

# Calculate inverse class frequencies as weights
class_counts = pd.Series(torch.argmax(y_train, dim=1).numpy()).value_counts().sort_index()
class_weights = 1.0 / torch.tensor(class_counts.values, dtype=torch.float32)
class_weights = class_weights / class_weights.sum() * len(class_weights)  # Normalize

print("Class weights:", class_weights)
# Class weights: tensor([0.1419, 0.6418, 2.5714, 0.6449])

# Use weighted loss
criterion = CrossEntropyLoss(weight=class_weights)

This forces the model to pay more attention to minority classes during training.
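The arithmetic behind these weights can be checked by hand. The sketch below recomputes them in plain Python from the class counts printed in the training distribution above; only the variable names are new.

```python
# Class counts taken from the training label distribution printed earlier
counts = {"benign": 342263, "defacement": 75682, "malware": 18888, "phishing": 75311}

# Inverse frequency, then rescale so the weights sum to the number of classes
inverse = {name: 1.0 / n for name, n in counts.items()}
total = sum(inverse.values())
weights = {name: inv / total * len(counts) for name, inv in inverse.items()}

for name, w in weights.items():
    print(f"{name}: {w:.4f}")

# How much more a malware mistake weighs compared to a benign one
ratio = weights["malware"] / weights["benign"]
print(f"malware/benign weight ratio: {ratio:.1f}")
```

The ratio between two weights is simply the inverse ratio of the class counts, so an error on a malware sample counts about 18 times as much as an error on a benign one.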

Oversampling and Undersampling

Two approaches to handle imbalance:

  • Oversampling duplicates examples from minority classes to increase their representation. Reduces the risk of ignoring rare but critical classes like malware.
  • Undersampling removes examples from majority classes to reduce their dominance. Faster training but discards potentially useful data.

A good technique is partial balancing: instead of forcing a perfect 1:1:1:1 balance across all classes, use moderate ratios, such as bringing minority classes to 50% of the majority class size. This reduces imbalance without extreme measures. For instance, oversampling malware from 3.7% to 25% of the training set (perfect balance) would require duplicating each sample roughly 7x even if the dataset size stayed the same, and about 18x if every class were raised to the size of the benign class. The model would then tend to memorize those few examples (because they appear so often) instead of learning generalizable malware patterns.

import numpy as np

# Get class distribution
labels = np.argmax(y_train, axis=1)
unique_labels, counts = np.unique(labels, return_counts=True)

# Oversampling: bring minority classes to 50% of majority
target_count = int(counts.max() * 0.5)

indices = []
for label in unique_labels:
    label_idx = np.where(labels == label)[0]
    if len(label_idx) < target_count:
        # Duplicate minority samples
        label_idx = np.random.choice(label_idx, target_count, replace=True)
    indices.append(label_idx)

# Create balanced dataset
balanced_idx = np.concatenate(indices)
np.random.shuffle(balanced_idx)
X_train_balanced = X_train[balanced_idx]
y_train_balanced = y_train[balanced_idx]

⚠️ Important: Prevent Data Leakage
Like normalization, oversampling must be applied only to the training set, after splitting. If you oversample before splitting, copies of the same sample can end up in both the training set and the validation or test sets, leading to overly optimistic performance estimates.

Experiment Reproducibility

Scientific experiments must be reproducible. If you run the same experiment twice and get different results, how can you trust either one? Yet deep learning models are full of randomness: random weight initialization, random batch sampling and random dropout (a regularization technique we’ll explore later). Without controlling these sources of randomness, your model might achieve 95% accuracy one day and 89% the next, making comparison between experiments difficult.

The solution is deterministic randomness: using pseudo-random number generators with fixed seeds. When you set a seed, the “random” sequence becomes predictable and repeatable. Here’s how to control all major sources of randomness in a PyTorch project:

import torch
import numpy as np
import random

def set_seed(seed=42):
    """
    Set random seeds for reproducibility across all libraries.
    """
    random.seed(seed)  # Python's built-in random module
    np.random.seed(seed)  # NumPy operations
    torch.manual_seed(seed)  # PyTorch CPU operations
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)  # PyTorch GPU operations (current device)
        torch.cuda.manual_seed_all(seed)  # PyTorch GPU operations (all devices)

set_seed()

Reproducibility isn’t just about random seeds: it’s about documenting everything, including hyperparameters, dataset versions, model architecture, training time and results. For serious projects, consider experiment tracking tools like MLflow. They provide dashboards to visualize metrics over time, compare runs and store artifacts.

For our purposes, we can expand the model saving approach from the previous article to include hyperparameters and training metadata:

import datetime
import hashlib

# Save complete experiment information
torch.save({
    "architecture": {
        "input_size": 84,
        "hidden_size": 64,
        "num_classes": 4
    },
    "dataset": {
        "training_date": datetime.datetime.now().isoformat(),
        "dataset_size": len(X),
        "dataset_hash": hashlib.sha256(X.numpy().tobytes()).hexdigest(),  # X is a tensor: hash its raw bytes
        "index_to_label": index_to_label
    },
    "evaluation": {
        "final_train_loss": train_losses[-1],
        "final_val_loss": val_losses[-1]
    },
    "training": {
        "learning_rate": 0.001,
        "batch_size": 128,
        "num_epochs": 20,
        "optimizer": "Adam"
    },
    "weights": model.state_dict(),
}, "url_classifier_optimized.pth")

In this way, anyone (including your future self) can load the model and know exactly how it was trained, on what data and the results achieved.


Model Optimizations

In the previous article, we built a working URL classifier with a simple architecture: 84 input features → 64 hidden neurons → 64 hidden neurons → 4 output classes. We trained it for 20 epochs with a fixed learning rate of 0.001 and batch size of 256.

While the input size (84 features) and output size (4 classes) came directly from our data, we arbitrarily chose the hidden layer configuration, batch size, learning rate and training duration. But how did we know these were good choices? What if 128 hidden neurons performed better than 64? What if a learning rate of 0.01 converged faster? Could we have stopped training at 15 epochs without losing performance? The truth is that we didn’t optimize these values: we just picked reasonable defaults and hoped for the best.

These configuration choices are called hyperparameters: unlike weights and biases, which the network learns through backpropagation, they are settings that we choose before training begins. Hyperparameters are not learned from the data, but must be specified in advance and fundamentally determine how the model learns and what it can learn.

The difference between a mediocre model and an excellent one often lies not in the architecture itself, but in finding the right hyperparameter configuration. A model with 64 neurons might underfit the data, failing to capture complex patterns. A model with 512 neurons might overfit, memorizing training examples instead of learning generalizable rules. A learning rate of 0.1 might cause training to diverge while 0.0001 might make it painfully slow.

Hyperparameter Tuning

Hyperparameter tuning is the process of searching for the best configuration of hyperparameters to improve model performance. It can be implemented in several ways, from simple manual adjustments to sophisticated automated searches.

The general workflow is:

  1. Choose a new set of hyperparameters (e.g. learning rate, hidden layer size, batch size).
  2. Train the model on the training set for a fixed number of epochs.
  3. Evaluate performance on the validation set.
  4. Repeat the process until satisfied with the results.

# Set of hyperparameter configurations to try
configs = [
    {'lr': 0.001, 'hidden_size': 64, 'batch_size': 128},  # Default
    {'lr': 0.01, 'hidden_size': 64, 'batch_size': 128},   # Higher LR
    {'lr': 0.0001, 'hidden_size': 64, 'batch_size': 128}, # Lower LR
    {'lr': 0.001, 'hidden_size': 128, 'batch_size': 128}, # Larger model
]

# Train and evaluate all the models
results = []
for config in configs:
    # Build the model first so the optimizer receives this model's parameters
    model = build_model(config['hidden_size'])
    val_loss, val_acc = train_and_evaluate(
        model=model,
        train_loader=train_loader,
        val_loader=val_loader,
        optimizer=torch.optim.Adam(model.parameters(), lr=config['lr']),
        batch_size=config['batch_size'],
        epochs=10
    )
    results.append({**config, 'val_loss': val_loss, 'val_acc': val_acc})

# Find best configuration
best_config = min(results, key=lambda x: x['val_loss'])
print("Best configuration:", best_config)

You can also do a more exhaustive search using Grid Search, which tests all possible combinations of parameters from specified ranges. It guarantees finding the best combination within the grid, but it’s computationally expensive because the number of trained models grows exponentially with each added parameter. A more efficient alternative is Random Search, which samples random combinations from the parameter space instead of testing all possibilities. Research shows it often finds good hyperparameters faster than grid search, especially when some parameters matter more than others. Random search explores the parameter space more broadly with fewer trials, making it practical for large hyperparameter spaces.

import numpy as np

def random_search(n_trials=50):
    results = []
    
    for trial in range(n_trials):
        # Sample parameters randomly (cast NumPy scalars to plain Python types)
        config = {
            'lr': float(10 ** np.random.uniform(-5, -2)),  # Log scale
            'hidden_size': int(np.random.choice([32, 64, 128, 256])),
            'batch_size': int(np.random.choice([64, 128, 256])),
        }
        # Build the model first so the optimizer receives this model's parameters
        model = build_model(config['hidden_size'])
        # Train the model and evaluate
        val_loss, val_acc = train_and_evaluate(
            model=model,
            train_loader=train_loader,
            val_loader=val_loader,
            optimizer=torch.optim.Adam(model.parameters(), lr=config['lr']),
            batch_size=config['batch_size'],
            epochs=10
        )
        results.append({**config, 'val_loss': val_loss, 'val_acc': val_acc})

    return min(results, key=lambda x: x['val_loss'])

best_params = random_search(n_trials=50)

The number of epochs for hyperparameter tuning can be lower than your final training, but not too low. Training for too few epochs can introduce bias by favoring configurations that learn quickly initially (like high learning rates) over those that perform better long-term. Techniques to speed up or stop training early are discussed in the next chapter.
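The tuning budget also depends on how many configurations are trained. To make the cost of an exhaustive grid search concrete, the sketch below enumerates every combination of a small search space. The grid values reuse the lists from the examples above, and grid_configs is a hypothetical helper introduced for illustration.

```python
from itertools import product

# Hypothetical search space, reusing the value lists from the examples above
grid = {
    "lr": [0.0001, 0.001, 0.01],
    "hidden_size": [32, 64, 128, 256],
    "batch_size": [64, 128, 256],
}

def grid_configs(space):
    """Yield every combination of the listed values as a config dict."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

all_configs = list(grid_configs(grid))
print(len(all_configs))  # 3 * 4 * 3 = 36 training runs
print(all_configs[0])    # {'lr': 0.0001, 'hidden_size': 32, 'batch_size': 64}
```

Adding a fourth hyperparameter with three candidate values would triple the count to 108 runs, which is why random search with a fixed number of trials is often the more practical choice.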

Bayesian Optimization is smarter than random search. Instead of sampling randomly, it builds a probabilistic model of how hyperparameters affect performance. After each trial, it updates this model and intelligently chooses the next configuration to test, focusing on promising regions while still exploring uncertain areas. This approach typically finds better hyperparameters with fewer trials.

Optuna is a popular library that implements Bayesian optimization:

import optuna

def objective(trial):
    # Suggest hyperparameters
    config = {
        'lr': trial.suggest_float('lr', 1e-5, 1e-2, log=True),
        'hidden_size': trial.suggest_categorical('hidden_size', [32, 64, 128, 256]),
        'batch_size': trial.suggest_categorical('batch_size', [64, 128, 256]),
    }
    # Build the model first so the optimizer receives this model's parameters
    model = build_model(config['hidden_size'])
    # Train the model and evaluate
    val_loss, val_acc = train_and_evaluate(
        model=model,
        train_loader=train_loader,
        val_loader=val_loader,
        optimizer=torch.optim.Adam(model.parameters(), lr=config['lr']),
        batch_size=config['batch_size'],
        epochs=10
    )
    return val_loss

# Create study and optimize
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)
print("Best trial:")
print(f"- Validation loss: {study.best_trial.value:.4f}")
print(f"- Parameters: {study.best_trial.params}")

Optuna intelligently focuses on promising regions of the hyperparameter space, often finding better configurations with fewer trials than random search.

For large-scale tuning, Optuna supports pruning: automatically stopping unpromising trials early (e.g. if validation loss is significantly worse than other trials at epoch 5, stop training). This dramatically reduces compute time.
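The idea behind pruning can be sketched without any library. The function below applies a simplified median rule: a trial is stopped when its validation loss at a given epoch is worse than the median of earlier trials at the same epoch. This is an illustration of the principle, not Optuna’s implementation; the function name, the min_trials parameter and the loss values are assumptions.

```python
from statistics import median

def should_prune(current_loss, earlier_losses_at_epoch, min_trials=3):
    """Simplified median rule for stopping an unpromising trial.

    Returns True when the current validation loss is higher (worse) than the
    median loss that earlier trials had reached at the same epoch.
    """
    if len(earlier_losses_at_epoch) < min_trials:
        return False  # Not enough history to judge
    return current_loss > median(earlier_losses_at_epoch)

# Made-up validation losses of four earlier trials at epoch 5
history_epoch_5 = [0.42, 0.38, 0.55, 0.40]

print(should_prune(0.61, history_epoch_5))  # True: worse than the median
print(should_prune(0.39, history_epoch_5))  # False: better than the median
```

In Optuna, the objective function reports intermediate values with trial.report(value, step) and checks trial.should_prune(), raising optuna.TrialPruned to abandon the trial. This requires a training loop that exposes the validation loss after each epoch.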

Training Enhancements

Beyond choosing the right hyperparameters, optimizing the training process itself can significantly improve results and efficiency.

Early Stopping solves a fundamental question: how many epochs should you train? Early stopping automatically halts training when validation performance stops improving, saving computation and preventing overfitting. The technique monitors validation loss and stops if it doesn’t improve for a specified number of epochs (patience).

# Early stopping variables
best_val_loss = float('inf')
patience = 10
patience_counter = 0

for epoch in range(max_epochs):
    train_loss = train_epoch(model, train_loader, optimizer, criterion)
    val_loss = validate(model, val_loader, criterion)
    
    # Check if validation loss improved
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        print(f"Epoch {epoch}: Validation loss improved to {val_loss:.4f}")
    else:
        patience_counter += 1
        print(f"Epoch {epoch}: No improvement ({patience_counter}/{patience})")
    
    # Stop if patience exceeded
    if patience_counter >= patience:
        print(f"Early stopping at epoch {epoch}")
        break

Notice a problem: at the end of training we have the model from the last epoch, not the best one. If validation loss increased in later epochs, we’re left with a worse model. Model Checkpointing solves this by saving the best model during training and restoring it afterward:

...
for epoch in range(max_epochs):
    ...
    # Save if validation loss improved
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'val_loss': val_loss,
        }, 'best_model.pth')
        print(f"Epoch {epoch}: Saved best model (val_loss: {val_loss:.4f})")
...
# Load best model for final evaluation
checkpoint = torch.load('best_model.pth')
model.load_state_dict(checkpoint['model_state_dict'])
print(f"Restored best model from epoch {checkpoint['epoch']} with val_loss: {checkpoint['val_loss']:.4f}")

Learning Rate Scheduling addresses another issue: a fixed learning rate is rarely optimal. Early in training you want large steps to quickly reach good regions of the loss landscape. Later you need small steps to fine-tune without overshooting the minimum. Learning rate schedules automatically adjust the rate during training.

PyTorch provides several scheduling strategies. Two common approaches are:

  • ReduceLROnPlateau: Adaptively reduces learning rate when validation loss stops improving, responding to your specific training dynamics.
  • StepLR: Reduces learning rate at fixed intervals, useful when you know roughly how long training should take.

from torch.optim.lr_scheduler import ReduceLROnPlateau, StepLR

# Choose one scheduler based on your needs
scheduler = ReduceLROnPlateau(
    optimizer, 
    mode='min',      # Minimize validation loss
    factor=0.5,      # Multiply LR by 0.5 when triggered
    patience=3       # Wait 3 epochs before reducing
)
# OR
scheduler = StepLR(
    optimizer, 
    step_size=10,    # Reduce every 10 epochs
    gamma=0.1        # Multiply LR by 0.1
)

for epoch in range(max_epochs):
    train_loss = train_epoch(model, train_loader, optimizer, criterion)
    val_loss = validate(model, val_loader, criterion)
    
    # Update learning rate
    if isinstance(scheduler, ReduceLROnPlateau):
        scheduler.step(val_loss)  # Needs validation loss
    else:
        scheduler.step()  # StepLR doesn't need metrics

The plot below shows training with ReduceLROnPlateau. Initially, both training and validation loss decrease rapidly with the default learning rate. Around epoch 18 (first green line), validation loss plateaus and the scheduler reduces the learning rate. This triggers another phase of improvement. At epoch 30 (second green line), the learning rate is reduced again. After this point, both losses stabilize with minimal improvement, indicating the model has converged.

LR Scheduling

Notice how the learning rate reductions correspond to periods where the validation loss had stopped decreasing. This adaptive behavior is why ReduceLROnPlateau often outperforms fixed schedules: it responds to actual training dynamics rather than following a predetermined plan.
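To see what the two schedules compute, here is a plain-Python sketch. step_lr is the closed form of a step schedule. plateau_schedule is a deliberately simplified version of the reduce-on-plateau logic: it ignores the improvement threshold, cooldown and minimum learning rate that PyTorch’s scheduler also supports. Both function names and the loss values are assumptions for illustration.

```python
def step_lr(base_lr, epoch, step_size=10, gamma=0.1):
    """Learning rate of a step schedule at a given epoch."""
    return base_lr * gamma ** (epoch // step_size)

def plateau_schedule(val_losses, base_lr, factor=0.5, patience=3):
    """Simplified reduce-on-plateau schedule.

    After more than `patience` consecutive epochs without a new best
    validation loss, the learning rate is multiplied by `factor`.
    Returns the learning rate in effect after each epoch.
    """
    lr, best, bad_epochs, history = base_lr, float("inf"), 0, []
    for loss in val_losses:
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history

# Step schedule: the rate drops by 10x at epochs 10, 20, ...
print([step_lr(0.001, epoch) for epoch in (0, 9, 10, 25)])

# Made-up validation losses: steady improvement followed by a plateau
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.63, 0.59]
lr_history = plateau_schedule(losses, base_lr=0.001)
print(lr_history)
```

In this toy run the best loss (0.6) is reached at the third epoch; the next four epochs bring no improvement, so the rate is halved at the seventh epoch.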

In conclusion, these techniques work best when combined: early stopping prevents wasted computation, learning rate scheduling improves convergence and checkpointing ensures you evaluate the best model version rather than the last one trained.


Production Considerations

Our model is now rigorously evaluated and optimized, but deploying it to production introduces an entirely new set of challenges that extend far beyond the model itself. While we’ve focused on building, training and optimizing the neural network, production deployment is a complex engineering discipline that involves:

  • Infrastructure Management: Containerization, load balancing and auto-scaling
  • Hardware Considerations: GPU management and computational efficiency
  • Monitoring and Observability: Real-time performance tracking, error logging and alerting systems
  • Security: API authentication, rate limiting and input sanitization
  • Reliability: Fault tolerance, backup systems and disaster recovery

These topics deserve dedicated coverage and are beyond the scope of this introductory series. Instead, we’ll focus on three critical aspects that directly relate to our model’s functionality: preprocessing consistency, batch processing and monitoring.

Preprocessing Consistency

One of the most common yet overlooked causes of production failures is train-serve skew: when the preprocessing applied during training differs from what happens during inference. This discrepancy leads to subtle but catastrophic failures where the model receives data in a different format than it expects, resulting in degraded performance or complete malfunction.

Consider our URL classifier. During training, we extracted features like TLD (Top-Level Domain) one-hot encoding. We identified frequent TLDs that appeared in at least 0.1% of our training data and created binary features for them: tld_com, tld_org, tld_net and so on. Our model learned to make predictions based on an 84-dimensional feature vector with specific meanings for each position. Now imagine deploying this model. A new URL arrives for classification. The preprocessing pipeline must:

  1. Extract the same 84 features in the same order that the model expects.
  2. Use the same TLD list discovered during training (not recompute it from new data).
  3. Apply the same normalization or scaling (if any was used during training).
  4. Handle edge cases identically to how training data was processed.

If the production preprocessing uses a different TLD list or computes features in a different order, the model receives garbage input. The solution is to serialize not just the model weights but the entire preprocessing state learned during training. This includes:

  • Feature names and order: The exact sequence of features the model expects.
  • Learned statistics: TLD lists, frequent keywords, normalization parameters (mean, stddev).
  • Encoding mappings: Label-to-index mappings for one-hot encoded features.
  • Configuration parameters: URL length limits, character set definitions, regex patterns.

When saving your model, you must save a complete “recipe” that allows the system to reconstruct the exact same feature extraction pipeline.
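One way to store such a recipe is a JSON file saved next to the model weights. The sketch below is only an illustration: the feature names, statistics and helper functions (save_state, load_state, to_vector) are hypothetical and much smaller than the real 84-feature pipeline.

```python
import json
import os
import tempfile

# Hypothetical preprocessing state; every name and value here is illustrative
preprocessing_state = {
    "feature_names": ["url_length", "num_dots", "tld_com", "tld_org"],
    "frequent_tlds": ["com", "org"],
    "normalization": {"url_length": {"mean": 54.2, "std": 31.7}},
    "index_to_label": {"0": "benign", "1": "defacement", "2": "malware", "3": "phishing"},
}

def save_state(state, path):
    """Write the preprocessing state to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)

def load_state(path):
    """Read the preprocessing state back at inference time."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def to_vector(features, state):
    """Order a feature dict exactly as at training time; fail loudly on mismatch."""
    missing = [name for name in state["feature_names"] if name not in features]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    return [float(features[name]) for name in state["feature_names"]]

path = os.path.join(tempfile.mkdtemp(), "preprocessing.json")
save_state(preprocessing_state, path)
restored = load_state(path)

# The input dict is deliberately in a different order than the training features
vector = to_vector({"tld_org": 0, "url_length": 23, "tld_com": 1, "num_dots": 2}, restored)
print(vector)  # [23.0, 2.0, 1.0, 0.0]
```

Raising an error on a missing feature is a design choice: a loud failure at inference time is easier to diagnose than a model silently fed a misaligned vector.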

Batch Processing

Production machine learning systems typically expose models through HTTP REST APIs, allowing clients to send requests and receive predictions. The simplest API design accepts one URL and returns one prediction:

POST /predict
{
  "url": "http://suspicious-site.com/download.exe"
}
→ Response:
{
  "prediction": "malware",
  "confidence": 0.9234,
  "probabilities": {
    "benign": 0.0123,
    "defacement": 0.0156,
    "malware": 0.9234,
    "phishing": 0.0487
  }
}

However, this design is inefficient when clients need to classify many URLs. Each request incurs HTTP overhead and the model processes URLs one at a time instead of leveraging batch parallelism.

Batch processing provides several advantages:

  1. Reduced HTTP Overhead: One network round-trip instead of N.
  2. GPU Efficiency: Neural networks process batches much faster than individual samples due to parallel matrix operations.
  3. Better Resource Utilization: The server can optimize memory allocation and GPU usage for batches.
  4. Throughput Optimization: Processing 100 URLs in one batch is significantly faster than 100 sequential requests.

POST /predict/batch
{
  "urls": [
    "http://site.com",
    "http://attacker.com",
    "http://newsite.com",
    ...
  ]
}

→ Response:
{
  "predictions": [
    {"url": "http://site.com", "prediction": "benign", ...},
    {"url": "http://attacker.com", "prediction": "malware", ...},
    {"url": "http://newsite.com", "prediction": "benign", ...}
  ]
}

Larger batches improve throughput but increase latency: a batch of 1000 URLs takes longer to process than a batch of 10. Production systems must choose a batch size that balances throughput against the latency the application can tolerate.
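
One way to keep this trade-off under control is to cap the number of samples per forward pass. The sketch below assumes the URLs have already been turned into feature tensors and that the model outputs raw logits; `predict_batch`, `max_batch_size` and the untrained stand-in model (shaped like the 84 → 64 → 64 → 4 example used in this article) are illustrative, not the repository's API.

```python
import torch
import torch.nn as nn

LABELS = ["benign", "defacement", "malware", "phishing"]

def predict_batch(model, features, max_batch_size=256):
    """Classify many feature vectors, at most max_batch_size per forward pass."""
    model.eval()
    results = []
    with torch.no_grad():  # inference only, no gradients needed
        for chunk in torch.split(features, max_batch_size):
            probs = torch.softmax(model(chunk), dim=1)  # assumes logits as output
            confidence, index = probs.max(dim=1)
            for c, i in zip(confidence.tolist(), index.tolist()):
                results.append({"prediction": LABELS[i], "confidence": c})
    return results

# Untrained stand-in model, used only to show the shapes involved
model = nn.Sequential(
    nn.Linear(84, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 4),
)
features = torch.rand(1000, 84)  # placeholder for 1000 preprocessed URLs
predictions = predict_batch(model, features, max_batch_size=256)
print(len(predictions))  # 1000
```

Here 1000 inputs are processed as four chunks (256, 256, 256 and 232 samples), so a single oversized request cannot exhaust memory, while each chunk still benefits from parallel matrix operations.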

Monitoring and Observability

Once deployed, models require continuous monitoring. Unlike traditional software where bugs cause immediate failures, ML models degrade silently. A model can return predictions with high confidence even when performing poorly due to data drift or pipeline issues.

Production systems rarely run a single static model. As you collect more data, discover better architectures or tune hyperparameters, you’ll deploy new model versions. Tracking performance across versions is critical to ensure improvements are real and detect regressions.

Consider a realistic scenario for our URL classifier:

Model v1.0 (deployed: 2025-01-15)
├── Training: 640K URLs from 2024 dataset
├── Architecture: 84 → 64 → 64 → 4
├── Test F1 Score: 0.92
├── 30d Production F1: 0.85 ⚠️ Drift detected
└── Avg Latency: 15ms

Model v2.0 (deployed: 2025-06-20)
├── Training: 1.2M URLs (2024 + fresh 2025 data)
├── Architecture: 88 → 128 → 128 → 4
├── Test F1 Score: 0.96
├── 30d Production F1: 0.95 ✓ Stable performance
└── Avg Latency: 22ms

This comparison reveals a trade-off: v2.0 achieves a higher F1 score but at higher latency. Notice also the gap between test and production F1 scores. For v1.0, test performance (0.92) is significantly higher than production performance (0.85), suggesting data drift: the statistical properties of the input data have changed over time, causing the model's performance to degrade.

For URL classification, drift manifests when the types of URLs your system encounters change. Perhaps your classifier was trained with 2024 data when most malicious URLs used .com domains. By 2025, attackers shifted to obscure TLDs like .xyz or .top to evade detection. Your model has limited experience with these domains from training, so when they become common in production, predictions become less reliable because the model operates outside its training distribution.

There are several monitoring strategies to detect drift:

  • Prediction Distribution Changes: If your model suddenly predicts 40% malware instead of the usual 5%, drift is a more likely explanation than a real increase in attacks.
  • Confidence Score Degradation: Declining average confidence suggests the model is encountering patterns it didn't see during training.
  • Ground Truth Performance: If possible, collect real-world feedback to measure actual performance.
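
The first strategy can be sketched in a few lines: compare the share of each predicted class in a recent window against a baseline window. The snippet uses total variation distance as the comparison, which is only one possible choice, and the 0.15 alert threshold is an arbitrary assumption that would need tuning on real traffic.

```python
from collections import Counter

LABELS = ["benign", "defacement", "malware", "phishing"]

def class_distribution(predictions):
    """Fraction of predictions that fall into each class."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    return {label: counts[label] / len(predictions) for label in LABELS}

def total_variation_distance(p, q):
    """0.0 means identical distributions, 1.0 means completely disjoint."""
    return 0.5 * sum(abs(p[label] - q[label]) for label in LABELS)

def drift_alert(baseline, recent, threshold=0.15):
    """True when the recent distribution moved further than the threshold."""
    return total_variation_distance(baseline, recent) > threshold

# Baseline window: 5% malware. Recent window: 40% malware.
baseline = class_distribution(
    ["benign"] * 90 + ["defacement"] * 3 + ["malware"] * 5 + ["phishing"] * 2
)
recent = class_distribution(
    ["benign"] * 55 + ["defacement"] * 3 + ["malware"] * 40 + ["phishing"] * 2
)
distance = total_variation_distance(baseline, recent)
print(round(distance, 2), drift_alert(baseline, recent))  # 0.35 True
```

An alert of this kind only says that the outputs changed; deciding whether the cause is drift, a pipeline bug or a genuine attack wave still requires looking at the underlying URLs.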

When drift is detected:

  1. Retrain the model with recent data reflecting current patterns.
  2. Update feature extraction to handle new attack techniques.
  3. Add new features capturing emerging patterns (e.g. Unicode homograph detection).

Finally, there are also several operational metrics to monitor to ensure the system’s health:

  • Error Rates: Failed preprocessing, timeouts, invalid inputs, etc.
  • Latency and Throughput: Time taken per request and requests processed per second.
  • Resource Usage: CPU/GPU utilization and memory consumption.
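
As a rough illustration of the first two metrics, the sketch below wraps a request handler to record latency and count failures. `RequestMetrics` and `fake_predict` are hypothetical names; a real deployment would more likely export these numbers to a dedicated monitoring system than keep them in memory.

```python
import time
from statistics import quantiles

class RequestMetrics:
    """In-memory latency and error tracking for a request handler."""

    def __init__(self):
        self.latencies_ms = []
        self.errors = 0

    def observe(self, handler, *args):
        """Run handler(*args), recording its latency and whether it failed."""
        start = time.perf_counter()
        try:
            return handler(*args)
        except Exception:
            self.errors += 1
            raise
        finally:
            self.latencies_ms.append((time.perf_counter() - start) * 1000)

    def summary(self):
        total = len(self.latencies_ms)
        # quantiles(n=20) yields 19 cut points; the last one is the 95th percentile
        p95 = quantiles(self.latencies_ms, n=20)[-1] if total >= 2 else None
        return {
            "requests": total,
            "error_rate": self.errors / total if total else 0.0,
            "p95_latency_ms": p95,
        }

def fake_predict(url):
    """Stand-in for the real prediction function."""
    if not url.startswith("http"):
        raise ValueError("invalid input")
    return "benign"

metrics = RequestMetrics()
for url in ["http://site.com", "not-a-url", "http://newsite.com"]:
    try:
        metrics.observe(fake_predict, url)
    except ValueError:
        pass  # a real service would return an error response here

print(metrics.summary())
```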

This article concludes this introductory series on deep learning. We’ve journeyed from understanding AI basics to feature engineering, from scratch implementations to production-ready models with proper evaluation and tuning. The next articles will explore advanced architectures and real-world security applications, building on these foundations.

💡 Code Implementation
This GitHub repository contains a complete implementation of a URL classification system using PyTorch. The codebase features a modular architecture with separate components for data loading, preprocessing, model serialization and an engine for training and prediction. As mentioned in the previous article, the dataset has data quality issues and should be used for educational purposes only. The repository includes detailed instructions to create a custom data loader to use your own datasets.

References