Data Guardian

Data Privacy Guardian is a security solution that uses Isolation Forest to filter suspicious access events out of system logs, then analyzes those events with a LangGraph-based autonomous agent to detect insider threats.


#cybersecurity #agentic-ai #langgraph #machine-learning #fintech

Building a Data Privacy Guardian: From Anomaly Detection to Agentic Triage

Published on March 15, 2026 • 10 min read

In the financial sector, the most dangerous threat isn't always the hacker at the gate; it's the authorized user already inside. During the KKB Agentic AI Hackathon, we designed a "Data Privacy Guardian": an autonomous agent built to scan system logs and identify suspicious behavioral patterns that traditional rule-based systems often miss.

Stages

To move from raw logs to actionable security intelligence, we architected a system that follows three critical stages:

1. Strategic Threat Modeling
2. Behavioral Anomaly Isolation
3. Autonomous Agentic Triage

These stages ensure the system doesn't just flag "weird" data, but understands the context of a potential security breach.

Strategic Threat Modeling

The first step is defining what you are looking for. Instead of using synthetic data, we used the Los Alamos National Laboratory (LANL) dataset to model realistic corporate environments.

Focus: We specifically targeted Insider Threats, where valid credentials are used for lateral movement or data theft.

Data Sources: We focused on Windows Event IDs like 4624 (Successful Logon), 4625 (Failed Logon), and 4688 (Process Creation).

Logic: By tracking these IDs, we can identify patterns like a single user attempting to access 15 different servers in 3 minutes—a classic sign of lateral movement.
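The lateral-movement rule above can be sketched as a sliding-window count of distinct destination hosts per user. This is a minimal illustration, not the hackathon implementation: the `(timestamp, user, dest_host)` tuple layout, the function name, and the sample users and hosts are all assumptions made for the example.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=3)   # time span from the example in the text
MAX_DISTINCT_HOSTS = 15         # distinct servers that trigger a flag


def find_lateral_movement(events, window=WINDOW, threshold=MAX_DISTINCT_HOSTS):
    """Return {user: timestamp at which the user first hit the threshold}.

    `events` is an iterable of (timestamp, user, dest_host) tuples for
    successful logons (Event ID 4624), sorted by timestamp.
    The tuple layout is a hypothetical schema for this sketch.
    """
    recent = defaultdict(deque)  # user -> deque of (timestamp, host)
    flagged = {}
    for ts, user, host in events:
        queue = recent[user]
        queue.append((ts, host))
        # Drop logons that have fallen out of the sliding window.
        while queue and ts - queue[0][0] > window:
            queue.popleft()
        distinct_hosts = {h for _, h in queue}
        if user not in flagged and len(distinct_hosts) >= threshold:
            flagged[user] = ts
    return flagged


# Made-up sample data: U1 hops across 15 hosts in under 3 minutes,
# U2 touches the same number of hosts spread over several hours.
start = datetime(2026, 3, 1, 9, 0, 0)
events = [(start + timedelta(seconds=10 * i), "U1", f"C{i}") for i in range(15)]
events += [(start + timedelta(minutes=30 * i), "U2", f"C{i}") for i in range(15)]
events.sort(key=lambda e: e[0])

flagged = find_lateral_movement(events)
```

Only `U1` ends up in `flagged`, because `U2` never has more than one host inside any 3-minute window. A fixed rule like this is easy to reason about but only catches the pattern it encodes, which is why the next stage uses an unsupervised model.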

Behavioral Anomaly Isolation

Sending billions of log lines to a Large Language Model (LLM) is impractical because of cost and latency. We implemented a high-speed filtering layer using Isolation Forest from scikit-learn.
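A filtering layer of this kind might look roughly like the sketch below. It uses scikit-learn's `IsolationForest`, but the two-column feature layout, the synthetic Poisson data, and the `contamination` value are assumptions chosen only so the example is self-contained; they are not taken from the project.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data: each row is one user-window with two count
# features (hypothetically: failed logons, PowerShell launches).
rng = np.random.default_rng(seed=0)
normal = rng.poisson(lam=[1.0, 0.5], size=(500, 2)).astype(float)
burst = np.array([[40.0, 25.0]])          # one obviously unusual window
X = np.vstack([normal, burst])

# contamination is the expected share of anomalies; 1% is an assumed value.
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
model.fit(X)

labels = model.predict(X)            # -1 = anomaly, 1 = normal
scores = model.decision_function(X)  # lower = more anomalous
flagged_rows = np.flatnonzero(labels == -1)
```

Only the rows in `flagged_rows` would be passed on to the LLM stage, so the expensive analysis runs on a small fraction of the data.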

The Problem: Real-world security logs are "unlabeled"—you don't always have a list of past attacks to learn from.

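The Solution: Isolation Forest is an unsupervised model that "isolates" anomalies rather than profiling normal behavior. Outliers can be separated from the rest of the data with fewer random splits, so they end up with shorter average path lengths in the trees. This makes the model efficient on large volumes of high-dimensional log data.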

Feature Engineering: We transformed raw JSON logs into behavioral vectors, calculating metrics like failed_logon_count and powershell_launch_count over 10-minute windows.
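The feature-engineering step can be illustrated with a small aggregation over tumbling 10-minute windows. The JSON field names (`timestamp`, `user`, `event_id`, `process_name`) and the function name are hypothetical; the actual log schema is not given in this post.

```python
import json
from collections import Counter, defaultdict
from datetime import datetime

WINDOW_SECONDS = 600  # 10-minute windows, as described above


def build_feature_vectors(json_lines):
    """Aggregate raw JSON log lines into per-(user, window) count vectors.

    Returns {(user, window_start_epoch): [failed_logon_count,
                                          powershell_launch_count]}.
    """
    counts = defaultdict(Counter)
    for line in json_lines:
        event = json.loads(line)
        ts = datetime.fromisoformat(event["timestamp"])
        # Floor the epoch time to the start of its 10-minute window.
        window_start = int(ts.timestamp()) // WINDOW_SECONDS * WINDOW_SECONDS
        # Looking up the key creates a zeroed entry for every active window.
        vector = counts[(event["user"], window_start)]
        if event["event_id"] == 4625:  # failed logon
            vector["failed_logon_count"] += 1
        elif (event["event_id"] == 4688  # process creation
              and "powershell" in event.get("process_name", "").lower()):
            vector["powershell_launch_count"] += 1
    return {
        key: [c["failed_logon_count"], c["powershell_launch_count"]]
        for key, c in counts.items()
    }


# Made-up sample log lines with explicit UTC offsets.
raw_logs = [
    '{"timestamp": "2026-03-01T09:01:00+00:00", "user": "U1", "event_id": 4625}',
    '{"timestamp": "2026-03-01T09:04:30+00:00", "user": "U1", "event_id": 4625}',
    '{"timestamp": "2026-03-01T09:07:00+00:00", "user": "U1", "event_id": 4688, "process_name": "powershell.exe"}',
    '{"timestamp": "2026-03-01T09:12:00+00:00", "user": "U1", "event_id": 4624}',
    '{"timestamp": "2026-03-01T09:02:00+00:00", "user": "U2", "event_id": 4688, "process_name": "notepad.exe"}',
]

features = build_feature_vectors(raw_logs)
```

Each value in `features` is a fixed-length numeric vector, which is the shape an Isolation Forest expects as one input row.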

Autonomous Agentic Triage

Once the statistical model flags an anomaly, the "agentic" heart of the system takes over to interpret the "why".

The Setup: We chose LangGraph over CrewAI because security pipelines require a stateful, deterministic flow rather than unpredictable agent "negotiations".

The Process: If the detect_anomalies node produces a high anomaly score, a conditional edge routes the data to a Triage Agent.

The Result: The LLM analyzes the specific log batch and generates a natural language report for the security team, explaining the risk (e.g., "High-confidence Lateral Movement detected for user C1455").
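The routing decision and the triage step can be sketched in plain Python, without importing LangGraph. The state fields, the 0.7 threshold, the convention that a higher score means more anomalous, and the stubbed LLM are all assumptions for illustration; the real graph and prompt are not shown in this post.

```python
import json
from typing import Callable, List, TypedDict

ANOMALY_THRESHOLD = 0.7  # hypothetical cutoff; higher score = more anomalous


class GuardianState(TypedDict, total=False):
    user: str
    log_batch: List[dict]
    anomaly_score: float
    report: str


def route_after_detection(state: GuardianState) -> str:
    """Pick the name of the next node based on the anomaly score."""
    if state["anomaly_score"] >= ANOMALY_THRESHOLD:
        return "triage_agent"
    return "end"


def stub_llm(prompt: str) -> str:
    """Stand-in for a real LLM client; echoes the first prompt line."""
    return "TRIAGE REPORT (stub): " + prompt.splitlines()[0]


def triage_agent(state: GuardianState,
                 llm: Callable[[str], str] = stub_llm) -> GuardianState:
    """Build a prompt from the flagged batch and attach the LLM's report."""
    prompt = "\n".join([
        f"User {state['user']} scored {state['anomaly_score']:.2f}.",
        "Explain the most likely threat pattern in the events below.",
        *(json.dumps(event) for event in state["log_batch"]),
    ])
    return {**state, "report": llm(prompt)}


def run_pipeline(state: GuardianState) -> GuardianState:
    """Plain-Python imitation of the conditional edge after detection."""
    if route_after_detection(state) == "triage_agent":
        return triage_agent(state)
    return state


quiet = run_pipeline({"user": "U2", "anomaly_score": 0.10, "log_batch": []})
noisy = run_pipeline({
    "user": "U1",
    "anomaly_score": 0.93,
    "log_batch": [{"event_id": 4624, "dest": "C7"}],
})
```

In a LangGraph graph, a function shaped like `route_after_detection` would be registered with `StateGraph.add_conditional_edges`, with its return values mapped to the triage node or to the end of the graph. Keeping the decision in a small pure function makes the branch easy to unit test independently of the LLM.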

Best Practices

Prioritize Determinism: Use LangGraph for data-heavy pipelines where A-to-B logic is clear; save high-level agent "cooperation" for research-based tasks.

Feature Engineering is Key: Anomaly detection is only as good as the features you create. Aggregate logs by UserName or Source IP to capture true behavioral profiles.

Explain the "Why": A statistical outlier is just a number until an LLM maps it to a known threat vector like Brute-Force or Phishing.

Live Visualizations: In a security ops setting, being able to watch the graph flow pivot in real time when a threat is detected is a major advantage for human oversight.