Evaluating spam filters and Stylometric Detection of AI-generated phishing emails

The Rise of AI-Generated Phishing

Large Language Models (LLMs) like GPT-4 are transforming how we communicate, but not always for the better. While these tools can streamline everything from writing emails to coding, they’re also being misused by cybercriminals to craft alarmingly convincing phishing emails. These AI-generated messages mirror natural human language so closely that many traditional spam filters, typically tuned to catch suspicious links or known malicious domains, are no longer enough.

Our recent study, “Evaluating spam filters and Stylometric Detection of AI-generated phishing emails”, published in Expert Systems with Applications, highlights this growing threat and sheds light on a potential game-changer: stylometric detection. By analysing the distinctive “writing fingerprint” of AI-generated versus human-written content, this technique offers a promising way forward in the fight against smarter phishing attacks.

Key Findings: How Email Providers Stack Up

In the paper, we tested 63 AI-generated phishing emails (created using GPT-4) across three major platforms:

  • Yahoo blocked 90% of phishing attempts, showcasing robust filtering.
  • Gmail allowed 86% of malicious emails to bypass its spam filters.
  • Outlook performed the weakest, letting 96% of phishing content through.

Even more alarming? When we sent legitimate AI-generated emails:

  • Yahoo falsely flagged 58–66% as spam.
  • Outlook allowed all legitimate emails through, but its permissiveness with phishing attempts raises red flags.

The takeaway: We observed that current filters prioritise minimising false positives (legit emails marked as spam) at the cost of security, a risky trade-off as AI phishing evolves.
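
To make the trade-off concrete, here is a minimal Python sketch of the two error rates involved: the share of phishing emails that reach the inbox (miss rate) and the share of legitimate emails sent to spam (false positive rate). The `FilterResult` class and all the counts are hypothetical and chosen only for illustration; they are not the study's raw data.

```python
from dataclasses import dataclass


@dataclass
class FilterResult:
    """Counts for one spam filter. All values used below are illustrative."""
    phishing_sent: int
    phishing_blocked: int
    legit_sent: int
    legit_flagged: int

    @property
    def miss_rate(self) -> float:
        # False negative rate: phishing emails that reached the inbox.
        return (self.phishing_sent - self.phishing_blocked) / self.phishing_sent

    @property
    def false_positive_rate(self) -> float:
        # Legitimate emails wrongly marked as spam.
        return self.legit_flagged / self.legit_sent


# Hypothetical counts showing the two extremes of the trade-off.
strict = FilterResult(phishing_sent=100, phishing_blocked=90,
                      legit_sent=100, legit_flagged=60)
permissive = FilterResult(phishing_sent=100, phishing_blocked=4,
                          legit_sent=100, legit_flagged=0)

for name, result in [("strict", strict), ("permissive", permissive)]:
    print(f"{name}: miss rate {result.miss_rate:.0%}, "
          f"false positive rate {result.false_positive_rate:.0%}")
```

A filter tuned for few false positives will tend to look like `permissive` here: almost nothing legitimate is lost, but most phishing gets through.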

The Stylometric Solution: Catching Phishing by Language Patterns

To combat AI-generated threats, the study introduced 47 new stylometric features, linguistic markers that analyse writing style, structure, and tone. These include:

  • Imperative verbs (e.g., “click,” “verify”) driving urgency.
  • Clause density as a measure of sentence complexity.
  • Pronoun usage (e.g., overuse of “we” or “you”) to mimic authority.
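
As a rough illustration of how features like these can be computed, the sketch below derives three simple ratios from an email body using only the Python standard library. The word lists, the `stylometric_features` function, and the keyword-based approximations are assumptions made for this example; they are not the paper's feature definitions, and a more faithful version would likely use a part-of-speech tagger or parser.

```python
import re

# Illustrative word lists (assumptions, not the paper's feature definitions).
IMPERATIVE_VERBS = {"click", "verify", "confirm", "update", "act"}
PRONOUNS = {"we", "you", "your", "our", "us"}
CLAUSE_MARKERS = {"and", "but", "because", "if", "when", "while", "that", "which", "or"}


def stylometric_features(text: str) -> dict[str, float]:
    """Return three rough style features for an email body."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)          # avoid division by zero
    n_sentences = max(len(sentences), 1)
    return {
        # Share of words that are known urgency-driving verbs.
        "imperative_ratio": sum(w in IMPERATIVE_VERBS for w in words) / n_words,
        # Share of words that are first- or second-person pronouns.
        "pronoun_ratio": sum(w in PRONOUNS for w in words) / n_words,
        # Crude clause density: one main clause plus counted connectives.
        "clauses_per_sentence": 1 + sum(w in CLAUSE_MARKERS for w in words) / n_sentences,
    }


email = "Verify your account now. Click the link because we detected a problem."
features = stylometric_features(email)
print(features)
```

Each email becomes a small numeric vector, which is the form a classifier needs as input.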

When tested on machine learning models:

  • XGBoost outperformed the other models with 96% accuracy and a near-perfect 99% AUC score.
  • Urgency markers and sentence complexity were critical in flagging phishing content.
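
For readers unfamiliar with the two metrics, the sketch below computes accuracy and AUC from scratch in Python. Accuracy is the share of correct predictions at a fixed threshold, while AUC is the probability that a randomly chosen phishing email receives a higher score than a randomly chosen legitimate one. The labels and scores are toy values invented for this example, not outputs of the study's models, and the 0.5 threshold is an assumption.

```python
def accuracy(labels: list[int], scores: list[float], threshold: float = 0.5) -> float:
    """Share of emails classified correctly at the given score threshold."""
    predictions = [int(s >= threshold) for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auc(labels: list[int], scores: list[float]) -> float:
    """Pairwise AUC: chance that a phishing email outscores a legitimate one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Each (phishing, legitimate) pair counts 1 for a win and 0.5 for a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = phishing, 0 = legitimate; scores are model confidence values.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]

print(f"accuracy = {accuracy(labels, scores):.2f}")
print(f"AUC      = {auc(labels, scores):.2f}")
```

Unlike accuracy, AUC does not depend on a particular threshold, which is why the two are usually reported together.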

Why Stylometrics Matter

Traditional phishing detection relies on external signals like suspicious links. But AI-generated emails often omit these, relying instead on psychological manipulation. Stylometrics offers a text-based defence:

  • Detects zero-day attacks lacking known malicious links.
  • Provides transparent insights (unlike “black-box” AI models).
  • Complements existing tools for multi-layered security.

Limitations and Future Directions

  • Small dataset: Only 63 phishing/legitimate emails were tested.
  • Provider bias: Results may not generalise to all email services.
  • Model focus: GPT-4 was the sole LLM used; future work could explore Claude, Gemini, or Llama.

Explore the Research:

Authors: Chidimma (Chi) Opara, Paolo Modesti and Lewis Golightly