When we see a URL like http://45[.]145[.]7[.]134/ups/Snup.bat, our brains instantly spot the red flags: IP-based hosting, company spoofing and a suspicious file extension. We can easily hypothesize that this URL could be used to deliver malware.
But a deep learning model sees none of that: it only understands one language, numbers. This creates the fundamental challenge of abstraction, the art of capturing what matters while discarding what doesn’t. This process is universal in AI. Every system, from simple classifiers to advanced neural networks, must convert real-world complexity into a numerical representation. The question isn’t whether to abstract, but how to do it effectively.
⚠️ Disclaimer
This series is not a substitute for a full and rigorous deep learning course. Its goal is to introduce key concepts in an accessible and practical way, particularly for readers with a security background, and to provide a solid foundation for the more advanced blog articles that will follow. If you’re looking for a deeper, textbook-level treatment of the subject, I highly recommend Dive into Deep Learning: a free, open-source book with hands-on examples and theoretical depth.
Introduction
One of the key strengths of deep learning is its ability to automatically extract relevant features from raw data. Given enough labeled examples, a neural network can learn what matters most without explicit human guidance. This is especially powerful for text-based inputs like URLs, where advanced architectures such as Transformers can identify complex patterns from character sequences. But let’s clarify a fundamental principle: every AI model, regardless of architecture, ultimately processes numbers. Large Language Models (LLMs) do not directly process “raw” text: before sending it as input, the text is divided into tokens (subword units) which are then mapped onto numerical vectors. The difference is that they learn the optimal numerical representation of these tokens during training rather than using predefined features.
However, in this article, we won’t use automatic feature extraction. Instead, we’ll manually engineer features from the URLs. Why? Because:
- Fixed-size requirement: Most deep learning frameworks expect inputs with a fixed shape. URLs, however, are strings of varying length, and handling this mismatch requires more complex models that can work on sequences of data.
- Simplicity: Manual feature extraction is easier to implement and more accessible for readers new to deep learning.
- Interpretability: We gain full transparency into what each feature means and how it contributes to the model’s decisions.
- Efficiency: Training with simpler models and fixed-size inputs is faster and less resource-intensive.
Manually engineered features are not a drawback. A deep learning model trained on them can still learn to prioritize the most relevant inputs during training, an implicit form of feature selection (identifying which features contribute most to predictions). In future articles, we’ll explore how advanced architectures like Transformers can learn directly from raw text and extract meaningful patterns without manual intervention. But before we get there, we need a solid foundation. Our immediate goal is to transform raw URLs into structured, fixed-size numerical representations that traditional machine learning and deep learning models can understand and learn from effectively.
From Text to Tensors
Before we extract features, we need to understand the mathematical structures that hold our numerical representations.
Understanding Tensors
A tensor is a multi-dimensional array of numbers. It’s the core data structure in deep learning and generalizes common mathematical concepts:
- A scalar (single number) is a 0-dimensional tensor
- A vector (1D array) is a 1-dimensional tensor
- A matrix (2D table) is a 2-dimensional tensor
- And so on…
Let’s look at a few simple examples using NumPy, a popular Python library for numerical computing:
import numpy as np
scalar = 5 # 0D tensor: single value
vector = np.array([1, 2, 3, 4]) # 1D tensor: list of values
matrix = np.array([[1, 2], [3, 4]]) # 2D tensor: table of values
In deep learning, tensors are the universal data format. Every layer in a neural network receives tensors as input, performs numerical operations and outputs new tensors. As mentioned before, neural networks typically expect fixed-shaped tensors, which means the size of each input (and usually the output too) must be known in advance. This requirement influences how we prepare data:
- Every training example must have the same number of features.
- If the input is a sequence, it often needs to be truncated or padded to a fixed length.
- The output shape must also match the type of task.
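As a small illustration of the padding and truncation point above, here is a minimal sketch that turns strings of different lengths into a fixed-shape tensor. The helper name `encode_fixed_length`, the maximum length of 16 and the use of raw character codes are assumptions made for this example only; they are not the encoding used later in this article.

```python
import numpy as np

def encode_fixed_length(text: str, max_len: int = 16, pad_value: int = 0) -> np.ndarray:
    """Map a string to exactly max_len integers (hypothetical helper)."""
    # Truncate anything longer than max_len
    codes = [ord(ch) for ch in text[:max_len]]
    # Pad anything shorter with pad_value
    codes += [pad_value] * (max_len - len(codes))
    return np.array(codes, dtype=np.int64)

# Two inputs of very different lengths end up with the same shape
batch = np.stack([encode_fixed_length(u) for u in ["a.io", "example.com/some/long/path"]])
print(batch.shape)
```

Every row of `batch` has the same length, so the examples can be stacked into a single 2D tensor regardless of how long the original strings were.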
Real-World Example: A Photo as a Tensor
Consider a digital photo. At first glance, it seems like an unstructured blob of pixels but under the hood, it’s just a tensor. Here’s how:
- A color image is made up of pixels and each pixel contains three numbers: red, green and blue intensity values.
- If you have a 256 × 256 pixel image, that’s a 3D tensor with shape (256, 256, 3): height, width and color channels.
- If you’re feeding a batch of 100 images into a model, the shape becomes (100, 256, 256, 3): a 4D tensor.
But real-world images aren’t always the same size. They could be 500×300 or 1024×768, which a model cannot process directly. So we must:
- Resize or crop all images to a consistent size like 256×256.
- Optionally convert color to grayscale (if our model requires it).
- Normalize pixel values (e.g. rescale from 0–255 to 0–1).
Once preprocessed, all images have the same shape and can be passed through the network in batches efficiently and in parallel.
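The preprocessing steps above can be sketched with NumPy alone. This is only an illustration: the images are random arrays rather than real photos, `center_crop` is a hypothetical helper, and cropping stands in for a proper resize (which would normally be done with an imaging library).

```python
import numpy as np

def center_crop(image: np.ndarray, size: int) -> np.ndarray:
    """Crop a (height, width, channels) image to size x size around its center.
    Assumes both height and width are at least `size`."""
    height, width = image.shape[:2]
    top = (height - size) // 2
    left = (width - size) // 2
    return image[top:top + size, left:left + size]

# Two fake RGB images with different sizes (values in 0..255)
rng = np.random.default_rng(0)
images = [
    rng.integers(0, 256, size=(300, 500, 3), dtype=np.uint8),
    rng.integers(0, 256, size=(768, 1024, 3), dtype=np.uint8),
]

# Crop to a common size, stack into one batch and rescale to the 0..1 range
batch = np.stack([center_crop(img, 256) for img in images]).astype(np.float32) / 255.0
print(batch.shape)
```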
Encoding Labels into Tensors
Before we can extract features from URLs, we need to address a fundamental question: how do we represent the labels themselves numerically? This seemingly simple problem reveals a critical principle in deep learning: how you encode information directly shapes what your model can learn.
🛠️ Project Setup Tip
To avoid dependency issues, it’s a good idea to set up a Python virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
A requirements.txt file listing all dependencies is provided in the following GitHub repository, so you can install everything at once using:
pip install -r requirements.txt
Since deep learning projects often involve visualizations and interactive experimentation, I also recommend working inside a Jupyter Notebook:
pip install notebook ipykernel
python -m ipykernel install --user --name=.venv
Our mission is to build a model that can automatically identify malicious URLs and classify them into specific threat categories. We’ll use a Kaggle dataset containing over 650,000 labeled URLs, grouped into four categories:
- Phishing: URLs impersonating legitimate websites (like banks, social media or email services) to trick users into revealing sensitive information such as login credentials, credit card numbers or other personal information.
- Malware: URLs associated with the lifecycle of a malware, including those used to host and deliver malware payloads or those that serve as Command and Control (C2) infrastructure for communication with compromised devices.
- Defacement: URLs of a legitimate website that has been altered by an unauthorized actor, often to display political messages.
- Benign: Legitimate safe URLs.
To download the dataset you can use the kagglehub Python module:
# Import modules
import kagglehub
import os
import pandas as pd
import shutil
# Download the dataset
dataset_path = os.path.join("data", "malicious_urls.csv")
if not os.path.exists(dataset_path):
    # Make sure the target directory exists before moving the file into it
    os.makedirs(os.path.dirname(dataset_path), exist_ok=True)
    download_directory = kagglehub.dataset_download("sid321axn/malicious-urls-dataset", force_download=True)
    shutil.move(os.path.join(download_directory, "malicious_phish.csv"), dataset_path)
# Load and shuffle the dataset
df = pd.read_csv(dataset_path)
df = df.sample(frac=1)
print(f"Dataset shape: {df.shape}\n")
# Dataset shape: (651191, 2)
The label of the dataset is the type column, which contains strings like benign, malware, phishing and defacement. These are categorical values, meaning they represent different classes without any inherent numerical meaning or order. Unlike numeric features (like length or number of digits in a URL), we can’t feed strings directly into a neural network. And while binary values (e.g. is the URL based on an IP address?) can easily be represented as 1 or 0, categorical variables require a more careful approach.
Why Not Use Integer Encoding?
A possible solution is to assign an integer to each category: benign → 0, defacement → 1, malware → 2 and phishing → 3.
But this introduces a problem: the model might interpret the numerical order as meaningful, assuming that defacement (1) is somehow “closer” to malware (2) than to phishing (3). That would be incorrect because our categories are nominal, not ordinal. Unless the categories have a real order (like low, medium and high), assigning numbers directly can introduce bias into the model.
Bias in machine learning refers to systematic errors or incorrect assumptions that a model learns during training. It causes the model to consistently make predictable mistakes because it has learned flawed patterns from the data or how we represent it.
One-Hot Encoding
Instead, we can use one-hot encoding. This technique converts each category into a binary vector with one 1 and the rest 0’s. For our four labels, we get:
| Label | One-Hot Vector |
|---|---|
| benign | [1 0 0 0] |
| defacement | [0 1 0 0] |
| malware | [0 0 1 0] |
| phishing | [0 0 0 1] |
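To make the difference from integer encoding concrete, the sketch below compares pairwise distances between the four classes under both encodings. It is a toy comparison written for this explanation, not part of the pipeline.

```python
import numpy as np

# Integer encoding: benign=0, defacement=1, malware=2, phishing=3
int_codes = np.array([0, 1, 2, 3], dtype=np.float32)
# One-hot encoding: one row per class
one_hot = np.eye(4, dtype=np.float32)

# Pairwise distances between classes under each encoding
int_dist = np.abs(int_codes[:, None] - int_codes[None, :])
oh_dist = np.linalg.norm(one_hot[:, None, :] - one_hot[None, :, :], axis=-1)

print(int_dist)  # distances depend on the arbitrary numbering
print(oh_dist)   # every pair of distinct classes is equally far apart
```

With integer codes, defacement ends up at distance 1 from malware but distance 2 from phishing, purely because of the numbering. With one-hot vectors, all distinct classes are at the same distance, so no artificial ordering is implied.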
Here’s how we can transform our labels into one-hot encoded tensors using just pandas and numpy:
# Get the list of unique labels
labels = df["type"].unique()
print(f"Labels: {labels}")
# Labels: ['defacement' 'benign' 'phishing' 'malware']
# Create a mapping from label to index
label_to_index = {label: idx for idx, label in enumerate(sorted(labels))}
print(f"Label to index mapping: {label_to_index}\n")
# Label to index mapping: {'benign': 0, 'defacement': 1, 'malware': 2, 'phishing': 3}
# Create one-hot encoded matrix
label_indices = df["type"].map(label_to_index).values
type_one_hot = np.eye(len(labels))[label_indices]
# Show a few examples
print("Sample one-hot encodings:")
for i in range(4):
    print(f"- {df['type'].iloc[i]} → {type_one_hot[i]}")
# Sample one-hot encodings:
# - defacement → [0. 1. 0. 0.]
# - defacement → [0. 1. 0. 0.]
# - defacement → [0. 1. 0. 0.]
# - benign → [1. 0. 0. 0.]
What can we extract from a URL?
Now we dive into the art of feature extraction. Instead of blindly converting text to numbers, we’ll use domain knowledge from information security to extract meaningful patterns that distinguish malicious URLs from benign ones.
To get a better feel for the data, let’s examine a few sample URLs from each category.
# Sample URLs from each category
for category in ["benign", "phishing", "malware", "defacement"]:
    print(f"{category.upper()} examples:")
    for url in df[df["type"] == category]["url"].head(3):
        print(f"- {url.replace('.', '[.]')}")
# BENIGN examples:
# - vishvalankaeducation[.]com/photo-album
# - http://themeforest[.]net/item/builder-construction-architect-renovation-theme/full_screen_preview/11032581
# - askbiography[.]com/bio/WTCT[.]html
# PHISHING examples:
# - page[.]mi[.]fu-berlin[.]de/~prechelt/Biblio/jccpprt_computer2000[.]pdf
# - www[.]carlprothman[.]net/Default[.]aspx?tabid=97
# - flyglobalcard[.]com
# MALWARE examples:
# - http://chinacxyy[.]com/newscodejs[.]asp?lm2=185&list=9&icon=0&tj=0&font=10[.]5&hot=0&new=0&line=24&lmname=0&open=1&n=34&more=0&t=0&week=0&zzly=0&hit=0&pls=0&dot=0&tcolor=CCCCCC
# - http://9779[.]info/%E5%B9%BC%E5%84%BF%E7%AE%80%E5%8D%95%E6%89%8B%E5%B7%A5%E7%B2%98%E8%B4%B4%E7%94%BB/
# - http://37[.]49[.]226[.]237/deusbins/deus[.]x86
# DEFACEMENT examples:
# - http://www[.]twalogisticsltd[.]co[.]uk/consulting[.]pdf
# - http://www[.]diarco[.]com[.]pe/portal/index[.]php?option=com_content&view=section&layout=blog&id=1&Itemid=12
# - http://www[.]latarnik[.]eu/index[.]php?option=com_content&view=article&id=33&Itemid=47
Looking at these examples, we can observe some interesting patterns. For instance, the malware example http://37[.]49[.]226[.]237/deusbins/deus[.]x86 uses IP-based hosting and has an unusual file extension, while some malicious URLs contain long query strings or multiple subdomains. However, we should not jump to conclusions yet. These are just observations: we’ll let our models determine which features are actually discriminative and how they should be weighted.
Data Cleaning and Parsing
Before we dive into feature extraction, we need to address a critical but often overlooked aspect: data quality. Poor data quality is one of the main causes for the failure of machine learning projects in production. Ensuring high-quality training data is essential, especially in applications where false positives and false negatives can have serious consequences.
Real-world URL datasets suffer from several quality issues:
- Malformed URLs: Broken encoding, missing protocols or invalid characters.
- Label noise: Incorrect classifications due to human error or outdated threat intelligence.
- Temporal drift: URLs classified months ago may no longer be valid.
- Duplicate entries: Same URL appearing multiple times with different labels.
Let’s implement a simple data quality assessment to identify potential issues in our dataset:
def assess_data_quality(df: pd.DataFrame) -> dict:
    """
    Data quality assessment for URL dataset
    """
    return {
        "duplicates": df["url"].duplicated().sum(),
        "encoding_issues": df["url"].str.contains(r'[^\x00-\x7F]', na=False).sum(),
        "label_conflicts": df.groupby("url")["type"].nunique().gt(1).sum(),
        "no_dots_urls": df["url"].str.contains(r'^[^.]*$', na=False).sum(),
        "null_or_empty_urls": df["url"].isnull().sum() + df["url"].str.strip().eq('').sum(),
        "very_long_urls": (df["url"].str.len() > 2000).sum(),
        "very_short_urls": (df["url"].str.len() < 6).sum(),
    }
# Assess our dataset quality
quality_report = assess_data_quality(df)
print("Data Quality Assessment:")
for metric, value in quality_report.items():
    print(f"- {metric}: {value:,}")
# Data Quality Assessment:
# - duplicates: 10,072
# - encoding_issues: 921
# - label_conflicts: 6
# - no_dots_urls: 61
# - null_or_empty_urls: 0
# - very_long_urls: 2
# - very_short_urls: 8
Looking at the results, we can see that our dataset has several potential issues that need to be addressed before training our model. For example, the presence of duplicate URLs can lead to biased learning, while label conflicts may confuse the model about the true nature of certain URLs. Also the presence of possible encoding issues may hinder the model’s ability to process certain URLs.
🗒️ Semantic Quality Validation
In a production environment, it is good practice to perform semantic quality checks to verify that URL labels accurately reflect their current status. This could be done, for example, by using external threat intelligence platforms to validate classifications. However, these semantic validation steps are time-consuming and often require managing API usage limits, so we will skip them in this series.
It is worth noting that, in threat intelligence, what we have just identified as data quality issues could also represent potential attack vectors:
- Encoding issues: Attackers use UTF-8 homoglyphs (e.g. Cyrillic vs Latin a) to evade detection.
- Label conflicts: The same URL may be benign when registered but later compromised and used for C2.
- Very long URLs: Often indicate exploitation or exfiltration attempts.
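To illustrate the homoglyph point, the following sketch uses Python’s standard `unicodedata` module to name every non-ASCII character in a string. The helper `describe_non_ascii` and the two example hostnames are made up for this illustration.

```python
import unicodedata

def describe_non_ascii(text: str) -> list[tuple[str, str]]:
    """Return (character, Unicode name) for every non-ASCII character (hypothetical helper)."""
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in text if ord(ch) > 127]

latin = "paypal.com"
spoofed = "p\u0430ypal.com"  # second character is the Cyrillic letter, not the Latin "a"

print(latin == spoofed)             # the strings differ even though they look alike
print(describe_non_ascii(latin))
print(describe_non_ascii(spoofed))
```

The two strings render almost identically, yet only the second one contains a character outside the ASCII range, which is exactly what the encoding_issues check above counts.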
Malformed URLs
Our dataset, like many real-world datasets, contains URLs that can’t be parsed by standard libraries. These malformed entries can arise from various sources, including data collection errors or improper encoding during storage. To prevent these problematic entries from negatively impacting our feature engineering and model performance, we must identify and remove them. The custom parse_url function below attempts to extract various components of a URL (scheme, netloc, path, query, fragment, etc.) and gracefully handles errors by returning None for any unparseable entries.
def parse_url(url: str) -> dict | None:
    """
    Parses a URL into components
    """
    if not isinstance(url, str):
        return None
    try:
        # Add "http://" as a default scheme
        if "://" not in url:
            url = f"http://{url}"
        # Initialize a dictionary to hold the URL components
        components = {"scheme": "", "netloc": "", "path": "", "params": "", "port": None, "query": "", "fragment": ""}
        # Parsing
        if "#" in url:
            url, components["fragment"] = url.split("#", 1)
        if "://" in url:
            components["scheme"], url = url.split("://", 1)
        if "?" in url:
            url, components["query"] = url.split("?", 1)
        if ";" in url:
            url, components["params"] = url.split(";", 1)
        if "/" in url:
            netloc, path = url.split("/", 1)
            components["netloc"] = netloc
            components["path"] = f"/{path}"
        else:
            components["netloc"] = url
            components["path"] = ""
        if ":" in components["netloc"]:
            port_str = components["netloc"].split(":")[-1]
            if port_str.isdigit():
                components["port"] = int(port_str)
                components["netloc"] = components["netloc"].removesuffix(f":{port_str}")
        # Validate
        if len(components["netloc"].split('.')) < 2:
            raise ValueError("Missing TLD")
        # Return components
        return components
    except (TypeError, UnicodeError, ValueError):
        # Return None for any malformed URLs
        return None
df["parsed_url"] = df["url"].apply(parse_url)
# Print the corrupted URLs before removing them
print("Corrupted URLs in the dataset:", df[df["parsed_url"].isnull()].shape[0])
# Corrupted URLs in the dataset: 74
# Filter the DataFrame to keep only the clean URLs
df = df[~df["parsed_url"].isnull()].reset_index(drop=True)
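As a cross-check for the custom parse_url function above, the standard library’s `urllib.parse.urlsplit` can split most well-formed URLs into similar components. The wrapper `split_with_stdlib` below is a hypothetical helper written for comparison only; it is not the parser used in the rest of this article.

```python
from urllib.parse import urlsplit

def split_with_stdlib(url: str) -> dict:
    """Split a URL with the standard library (hypothetical comparison helper)."""
    # Same default as parse_url: assume http when no scheme is given
    if "://" not in url:
        url = f"http://{url}"
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "netloc": parts.hostname or "",  # hostname only, lowercased, without the port
        "port": parts.port,              # None when no port is given
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }

example = split_with_stdlib("example.com:8080/a/b?x=1#top")
print(example)
```

Note that `urlsplit` does not separate the params component and raises a ValueError for some invalid ports, so a production parser would still need its own error handling.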
Looking at some broken examples from the dataset, we can see they contain random binary data that was likely corrupted during collection or encoding:
2å~Îä³7G®Xúä4²kXâ`õ#hi@B÷Ã:%À{MOÁã8¯Ks
¸|nãÃÆ)UäY-îF¸Ú¸X4Ù/]ÑÙݪظø2i©ßËeÄ.üí7Ý[U3±Îq86PYåEá¤ÓÑYÉ ",phishing
"ø""!.)pþYsëËãß'}ר±SâC³EpÂd´Q*",phishing
wc6ËWdþ¨í',phishing
"Ü×éÌÉuTEq<¸«ÏnÓuÏBJIªWÙn""Çh4 +Qj""ä*´Jñ. T|Jö䯰!nqÄÞÿ2ẻsròNúZbܤÒySÕQµ",phishing
ÎVz39óåëíÁ(öuK=·ÑªÍîS¢T6^ØØòq8úyÞñ,phishing
Handling Label Conflicts
One critical issue we discovered is URLs that appear with different labels. This happens when:
- The same URL is classified differently over time.
- Different threat intelligence sources disagree.
- URLs transition between categories (e.g. legitimate site gets compromised).
def resolve_label_conflicts(df: pd.DataFrame, strategy: str = "majority_vote") -> pd.DataFrame:
    """
    Resolve URLs that appear with multiple labels
    """
    if strategy == "majority_vote":
        # Use the most common label (mode) for each URL.
        # In case of a tie, mode() returns all tied labels in sorted order, so we take the first one with .iloc[0].
        return df.groupby("url").agg({
            "type": lambda x: x.mode().iloc[0] if not x.mode().empty else x.iloc[0]
        }).reset_index()
    elif strategy == "most_recent":
        # Keep the last occurrence (assuming more recent data is more accurate)
        return df.drop_duplicates(subset=["url"], keep="last")
    else:
        raise ValueError("Unknown strategy")
# Apply conflict resolution
df_clean = resolve_label_conflicts(df)
print(f"Dataset size after conflict resolution: {len(df_clean)}")
In this case, we opted for the majority vote strategy, considering it the most robust approach to handle label conflicts in our dataset. A valid alternative for cyber-security data could have been the most recent strategy, considering the rapidly evolving nature of threats, but in this dataset we did not have the temporal information to support that.
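To see how majority voting behaves, including on ties, here is a toy example with made-up URLs and labels. It reimplements the same idea in a few lines so it can run on its own; the names `toy`, `majority_label` and `resolved` are introduced only for this sketch.

```python
import pandas as pd

# Made-up data: one URL with a clear majority, one without conflict, one tie
toy = pd.DataFrame({
    "url": ["a.example/login", "a.example/login", "a.example/login",
            "b.example", "c.example", "c.example"],
    "type": ["phishing", "phishing", "benign", "malware", "phishing", "benign"],
})

def majority_label(labels: pd.Series) -> str:
    # Series.mode() returns every tied value in sorted order; keep the first
    return labels.mode().iloc[0]

conflicts = toy.groupby("url")["type"].nunique().gt(1).sum()
resolved = toy.groupby("url")["type"].agg(majority_label).reset_index()
print(f"URLs with conflicting labels: {conflicts}")
print(resolved)
```

The tie on c.example is broken alphabetically, which here favors benign. That is an arbitrary choice, and in a security context you may prefer to break ties toward the malicious label or drop tied URLs altogether.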
⚠️ Dataset Quality Warning
This dataset has significant labeling issues beyond simple conflicts. As documented in this community discussion, an entire subset of phishing URLs was mislabeled in the original dataset. I corrected these labeling errors using the original Phishstorm dataset. This highlights why semantic quality validation is so important. In production, you should verify labels against current threat intelligence rather than blindly trusting dataset annotations. For the educational purpose of this series, we will use the corrected dataset as-is, but keep in mind that real-world data is often messy and requires careful validation.
Identifying Structural Biases
Even after cleaning malformed URLs and resolving label conflicts, datasets can contain structural biases that cause models to learn wrong correlations instead of real patterns. These biases are particularly dangerous because they don’t show up as obvious errors and can lead to models with high evaluation scores that fail in production.
I learned this the hard way: after implementing all the cleaning steps above, training a model and achieving a very high evaluation score, I deployed it only to discover that it flagged http://google.com as phishing while correctly classifying google.com as benign. The model wasn’t broken; the dataset was teaching it the wrong patterns.
Through post-mortem analysis, I discovered two critical structural biases in the dataset:
Bias #1: Scheme Presence
Different classes had drastically different rates of scheme inclusion:
# Analyze scheme distribution by class
for label in df['type'].unique():
    subset = df[df['type'] == label]
    with_scheme = subset['url'].str.contains(r'^https?://', na=False).sum()
    print(f"{label}: {with_scheme/len(subset)*100:.1f}% have http(s)://")
# Output:
# benign: 9.1% have http(s)://
# defacement: 100.0% have http(s)://
# malware: 88.7% have http(s)://
# phishing: 29.7% have http(s)://
The model learned that URLs with http:// or https:// are likely malicious, which is obviously wrong.
Bias #2: Path Structure
The second bias was even more subtle:
# Analyze path presence by class
for label in df['type'].unique():
    subset = df[df['type'] == label]
    # Count URLs that are just domains (no path after domain)
    bare_domains = subset['url'].str.match(r'^[^/]+$').sum()
    print(f"{label}: {bare_domains/len(subset)*100:.1f}% are bare domains")
# Output:
# benign: 0.0% are bare domains
# defacement: 0.0% are bare domains
# malware: 8.8% are bare domains
# phishing: 68.8% are bare domains
Benign URLs predominantly included paths (e.g. google.com/search), while phishing URLs were often just domains (e.g. fake-paypal.com). The model learned that bare domains are suspicious and having a path indicates legitimacy, another false pattern arising from how the dataset was collected rather than real threat characteristics.
To fix these biases, we implemented the following steps:
- For scheme bias, we removed the scheme from all URLs during preprocessing. This forces the model to focus on other features rather than relying on the presence of http:// or https://.
- For path bias, we added a data augmentation step that removes the paths from some benign URLs, making them bare domains. Data augmentation is a common technique to increase dataset diversity. This balances the distribution and prevents the model from associating path presence with legitimacy. This is not ideal, but it helps mitigate the bias in this educational context.
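The two mitigation steps above could look roughly like the sketch below. The helper names, the 30% default augmentation fraction and the decision to append bare-domain copies (rather than replace the original rows) are all assumptions made for illustration; the actual implementation in the repository may differ.

```python
import random
import re

def strip_scheme(url: str) -> str:
    """Remove a leading scheme such as http:// or https:// (hypothetical helper)."""
    return re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)

def to_bare_domain(url: str) -> str:
    """Keep only the part before the first '/', '?' or '#'."""
    return re.split(r"[/?#]", strip_scheme(url), maxsplit=1)[0]

def augment_benign(rows, fraction=0.3, seed=42):
    """Strip schemes everywhere and add bare-domain copies of some benign URLs."""
    rng = random.Random(seed)
    augmented = []
    for url, label in rows:
        url = strip_scheme(url)
        augmented.append((url, label))
        # Assumption: a random subset of benign URLs with a path gets a bare-domain copy
        if label == "benign" and "/" in url and rng.random() < fraction:
            augmented.append((to_bare_domain(url), label))
    return augmented

rows = [("http://google.com/search", "benign"), ("fake-paypal.example", "phishing")]
print(augment_benign(rows, fraction=1.0))
```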
⚠️ Lesson: You should explore your dataset for structural patterns before training, using simple statistical analysis and visualization. These methods can reveal problems before you waste time training a model that works on paper but fails in practice.
Final Cleaning Steps
After handling malformed URLs and label conflicts, we need to address the remaining data quality issues to ensure our dataset is ready for feature extraction. This includes removing URLs with encoding problems and any null or empty entries.
# Remove empty/short/long URLs
df = df[df["url"].notna() & df["url"].str.len().between(4, 2048)]
# Keep only ASCII URLs
df = df[df["url"].str.contains(r'^[\x00-\x7F]+$', na=False)]
Feature Extraction
With the dataset cleaned and preprocessed, we can start extracting features. The first category is structural features, which capture URL composition patterns. We will then move on to semantic features that analyze the content and context of URLs.
Structural Features
URL length often correlates with obfuscation attempts, while excessive subdomains may indicate domain generation algorithms used by malware. These structural patterns provide our first layer of abstraction from raw text to numerical features.
Let’s extract basic structural characteristics that capture URL composition patterns:
df["url_length"] = df["url"].apply(len)
df["domain_length"] = df["parsed_url"].apply(lambda url: len(url["netloc"]))
df["dot_count"] = df["url"].apply(lambda url: url.count("."))
The url_length feature captures the total character count, the domain_length focuses on the hostname portion and the dot_count measures subdomain complexity, which could indicate various hosting or structuring approaches.
Other relevant structural features include:
- Length metrics: path length, query length and fragment length provide additional granularity about URL structure.
- Character counts: slashes, hyphens, underscores, digits and special characters offer different views on URL composition.
- Component analysis: subdomain count, path depth and parameter count capture hierarchical complexity.
- Encoding indicators: URL encoding presence (%20, %3A) and unicode characters measure encoding patterns.
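A few of the additional features listed above can be sketched as a single function over the raw URL and a components dictionary with the same keys that parse_url returns. The function name and the exact definitions (for example, counting subdomains as host labels minus two) are simplifying assumptions for this example, not necessarily the definitions used in the repository.

```python
import re

def extra_structural_features(url: str, parts: dict) -> dict:
    """Compute a handful of structural features (hypothetical helper)."""
    path_segments = [s for s in parts["path"].split("/") if s]
    query_params = [p for p in parts["query"].split("&") if p]
    host_labels = parts["netloc"].split(".")
    return {
        "path_length": len(parts["path"]),
        "query_length": len(parts["query"]),
        "fragment_length": len(parts["fragment"]),
        "slash_count": url.count("/"),
        "hyphen_count": url.count("-"),
        "underscore_count": url.count("_"),
        "digit_count": sum(ch.isdigit() for ch in url),
        "path_depth": len(path_segments),
        "param_count": len(query_params),
        # Naive: treats the last two labels as the registered domain (wrong for e.g. co.uk)
        "subdomain_count": max(len(host_labels) - 2, 0),
        "has_url_encoding": int(bool(re.search(r"%[0-9A-Fa-f]{2}", url))),
        "has_unicode": int(not url.isascii()),
    }

example_url = "www.example.com/a/b/index.php?id=1&x=2"
example_parts = {"scheme": "http", "netloc": "www.example.com", "path": "/a/b/index.php",
                 "params": "", "port": None, "query": "id=1&x=2", "fragment": ""}
example_features = extra_structural_features(example_url, example_parts)
print(example_features)
```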
These structural features form the foundation, but we can extract much richer information by analyzing the semantic content and patterns within URLs.
Domain-Based Features
Domain characteristics offer another dimension of features that capture infrastructure and naming patterns.
Let’s extract key domain-related features:
import re
# Check if domain is IP-based
df["is_ip_domain"] = df["parsed_url"].apply(lambda url: bool(re.match(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$", url["netloc"])))
# Extract port (use default if not specified)
df["port"] = df["parsed_url"].apply(lambda url: url["port"] if url["port"] else 80)
The is_ip_domain feature creates a binary indicator for whether the URL uses an IP address directly rather than a domain name. This represents a clear structural difference in how the destination is specified. The port feature is the port number represented as an integer, since in this case the order is relevant: some C2 infrastructure consistently uses similar port numbers across different instances. The model will learn whether certain port ranges or specific ports correlate with particular URL categories.
We can also create one-hot encoded features for the most common Top Level Domains (TLDs). Rather than encoding all possible domains (which would create thousands of sparse features), we focus on those appearing in at least 0.1% of URLs:
# Extract top-level domain (TLD)
df["tld"] = df["parsed_url"].apply(lambda url: url["netloc"].split(".")[-1] if "." in url["netloc"] else "")
# Get TLD distribution and keep only frequent ones
frequent_tlds = [
    tld for tld, count in df["tld"].value_counts().items()
    if count >= len(df) * 0.001 and tld != ""
]
# Create one-hot encoding for frequent TLDs
for tld in frequent_tlds:
    # Double underscore matches the column names in the final feature list (e.g. tld__com)
    df[f"tld__{tld}"] = (df["tld"] == tld).astype(int)
This TLD analysis captures the distribution of top-level domains in our dataset without making assumptions about which TLDs are “good” or “bad”. By focusing only on TLDs that appear frequently, we avoid creating thousands of sparse features while still providing the model with information about domain patterns. The one-hot encoding ensures each TLD is treated as a distinct category rather than implying any ordinal relationship.
URLs also contain valuable information in their file extensions, which we can process using the same one-hot encoding approach as TLDs. These features capture technical characteristics like content type indicators. We can also check whether the URL, or one of its components, contains specific keywords that might indicate purpose or functionality. The challenge with keyword-based features is scalability: manually defined terms don’t generalize well to new patterns or languages. However, they provide interpretable features for known patterns and can be supplemented with more advanced techniques like word embeddings that capture semantic relationships automatically.
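A minimal sketch of these two ideas is shown below. The keyword list is purely hypothetical and chosen for illustration, and counting distinct keywords present (rather than total occurrences) is an assumption; the list and definition used in the repository may differ.

```python
import posixpath

# Hypothetical keyword list, for illustration only
SUSPICIOUS_KEYWORDS = ("login", "verify", "update", "secure", "account")

def file_extension(path: str) -> str:
    """Return the lowercase extension of the last path segment, or '' if none."""
    last_segment = path.rsplit("/", 1)[-1]
    _, extension = posixpath.splitext(last_segment)
    return extension.lstrip(".").lower()

def suspicious_keyword_count(url: str) -> int:
    """Count how many distinct keywords from the list appear in the URL."""
    lowered = url.lower()
    return sum(keyword in lowered for keyword in SUSPICIOUS_KEYWORDS)

print(file_extension("/deusbins/deus.x86"))
print(file_extension("/photo-album"))
print(suspicious_keyword_count("secure-login.example.com/verify"))
```

The extracted extension can then be one-hot encoded for the most frequent values, in the same way as the TLDs above.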
Best Practices
Effective feature engineering requires balancing domain expertise with statistical rigor. While deep learning models have some built-in capabilities to handle certain data quality issues, following established best practices ensures robust and reliable models, especially in production environments.
In a production-level environment, several statistical and methodological practices are commonly employed during feature engineering and data preprocessing to enhance model performance and reliability:
- Feature Distribution Analysis: Understanding how features are distributed helps identify potential issues that could affect model performance.
- Data Normalization and Scaling: Features operating on different scales (e.g. URL length vs port numbers) can bias models toward features with larger numerical ranges. Standardization or scaling ensures all features contribute equally.
- Class Imbalance Handling: Datasets often suffer from severe class imbalance (e.g. malicious samples are much rarer than benign ones). Techniques like oversampling (generating synthetic minority samples) or undersampling (reducing majority class size) help address this issue.
- Statistical Feature Selection: There are various methods to identify which features are most discriminative for the target classes, allowing removal of irrelevant or noisy features.
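Two of the practices above, standardization and random undersampling, can be sketched with NumPy on toy data. The arrays below are made up, and in a real project the mean and standard deviation would be computed on the training split only and then reused for validation and test data.

```python
import numpy as np

# Toy feature matrix: url_length and port live on very different scales
X_demo = np.array([[20.0, 80.0], [120.0, 8080.0], [60.0, 443.0]], dtype=np.float32)

# Standardization: zero mean and unit variance for each column
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)
X_scaled = (X_demo - mean) / np.where(std == 0, 1.0, std)  # avoid division by zero

# Random undersampling: keep as many samples per class as the rarest class has
labels_demo = np.array([0] * 6 + [1] * 2 + [2] * 1 + [3] * 3)
rng = np.random.default_rng(0)
min_count = np.bincount(labels_demo).min()
keep = np.concatenate([
    rng.choice(np.flatnonzero(labels_demo == c), size=min_count, replace=False)
    for c in np.unique(labels_demo)
])
balanced_counts = np.bincount(labels_demo[keep])
print(X_scaled)
print(balanced_counts)
```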
For this introductory series, we have intentionally omitted most of these advanced preprocessing techniques for the following reasons:
- Learning first, optimization later: The goal is to understand the fundamental mechanisms of deep learning; model optimization can be performed at a later stage.
- Modern models are surprisingly robust: Today’s deep learning architectures have developed sophisticated mechanisms to directly handle imperfect data.
Putting It All Together
At this point, our dataset has been transformed from raw URLs into a rich feature matrix. Through data cleaning, parsing, structural analysis and domain-based feature extraction, we’ve created a comprehensive set of numerical features that capture the essential characteristics of URLs. The complete list of features and their extraction methods can be found in the following GitHub repository.
Our feature engineering process has yielded several features covering multiple dimensions of URL analysis:
- Structural metrics: url_length, domain_length, dot_count, slash_count, hyphen_count, underscore_count, digit_count
- Content analysis: special_char_count, path_length, query_length, fragment_length, path_depth, param_count, subdomain_count, suspicious_keyword_count
- Indicators: has_url_encoding, has_unicode, is_ip_domain and the numerical port feature
- One-hot encoded features: Protocol schemes, TLD categories and file extensions in binary format
Now we need to prepare our data for machine learning by creating the input tensor X and the target tensor y. This final preprocessing step converts our DataFrame into the numerical arrays required by most machine learning frameworks. For the y tensor, we can use the same one-hot encoding we prepared earlier for the type field, stored as float32 values and aligned row by row with X.
# Create input tensor X (all columns except metadata and target)
feature_columns = df.columns.drop(["url", "parsed_url", "type"])
X = df[feature_columns].values.astype(np.float32)
print(f"Input tensor X shape: {X.shape}")
# Input tensor X shape: (640192, 84)
print(f"Feature names: {', '.join(feature_columns)}")
# Feature names: url_length, domain_length, dot_count, slash_count, hyphen_count, underscore_count, digit_count, special_char_count, path_length, query_length, fragment_length, path_depth, param_count, subdomain_count, has_url_encoding, has_unicode, is_ip_domain, port, suspicious_keyword_count, tld__com, tld__org, tld__net, tld__de, tld__uk, tld__ca, tld__edu, tld__br, tld__nl, tld__au, tld__it, tld__ru, tld__jp, tld__pl, tld__info, tld__gov, tld__fr, tld__eu, tld__vn, tld__cn, tld__gr, tld__ua, tld__es, tld__ro, tld__us, tld__mx, tld__in, tld__cz, tld__za, tld__ch, tld__at, tld__cl, tld__cc, tld__se, tld__dk, tld__hu, tld__asia, tld__tk, tld__ar, tld__be, tld__biz, tld__co, tld__id, tld__fm, tld__ir, tld__tr, tld__sk, file_extension__html, file_extension__php, file_extension__htm, file_extension__aspx, file_extension__asp, file_extension__exe, file_extension__m, file_extension__cfm, file_extension__shtml, file_extension__jpg, file_extension__torrent, file_extension__txt, file_extension__pdf, file_extension__jsp, file_extension__cgi, file_extension__do, file_extension__1, file_extension__js
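# Create target tensor y in one-hot
# (recomputed on the cleaned DataFrame so that its rows line up with X)
y = np.eye(len(label_to_index), dtype=np.float32)[df["type"].map(label_to_index).values]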
print(f"Target tensor y shape: {y.shape}")
# Target tensor y shape: (640192, 4)
print(f"Class labels: {', '.join(list(label_to_index.keys()))}")
# Class labels: benign, defacement, malware, phishing
The X tensor contains our feature matrix as floating-point values, while the y tensor contains one-hot encoded class labels. This representation is compatible with popular machine learning frameworks like scikit-learn, PyTorch and TensorFlow.
We’ve successfully transformed the messy, variable-length world of URLs into clean, fixed-size numerical tensors that machine learning models can process. Our approach combines the power of domain expertise with systematic feature engineering to create meaningful representations of information security data. In our next article, we’ll use these carefully crafted features to build and train our first deep learning models.