
AI-Powered Vulnerability Scanning: How VibeGuard Works Under the Hood

When we started building VibeGuard eighteen months ago, the vulnerability scanning market was dominated by tools that relied almost exclusively on signature-based detection. Snyk, Semgrep, and SonarQube all do excellent work matching known patterns, but they share a fundamental limitation: they can only find vulnerability classes someone has already written a rule for. Our hypothesis was that large language models, combined with static analysis context, could detect vulnerability patterns that fall between the cracks of traditional rule sets. After shipping VibeGuard to production and scanning over 40,000 repositories, here is how the system actually works and what we learned building it.

Contents

1. The Architecture: Three-Stage Pipeline
2. The Model: Fine-Tuning for Security Analysis
3. Reducing False Positives: The Hardest Problem
4. Performance and Cost Engineering
5. What the LLM Catches That Rules Do Not
6. Limitations and Honest Assessment
HARBOR SOFTWARE · Engineering Insights

The Architecture: Three-Stage Pipeline

VibeGuard is not a monolithic LLM that reads your entire codebase and outputs a list of vulnerabilities. That approach fails for three reasons: context window limits, hallucination rates on large inputs, and cost. Instead, we use a three-stage pipeline that combines traditional static analysis with targeted LLM inference.

Stage 1: Static Pre-Scan. We run a fast, rule-based static analyzer that identifies candidate regions of code. These are not confirmed vulnerabilities; they are code patterns that statistically correlate with security issues. Think of it as a triage step. The pre-scanner flags things like: user input flowing into SQL queries (even through intermediary functions), cryptographic API calls with hardcoded parameters, HTTP handlers that deserialize untrusted data, and authentication checks that appear structurally incomplete.

The pre-scanner is built on tree-sitter for parsing and a custom taint analysis engine written in Rust. It processes approximately 50,000 lines of code per second on a single core. The taint analysis is interprocedural but not whole-program; it tracks data flow within a file and one level of function calls into imported modules. This is a deliberate tradeoff: whole-program taint analysis is slow and brittle (it breaks on dynamic dispatch, metaprogramming, and monkey-patching), while single-function analysis misses too many real vulnerabilities. One level of interprocedural tracking catches about 78% of the taint-flow vulnerabilities in our test corpus while keeping analysis time under 30 seconds for a typical 100K-line repository.

The interprocedural taint analysis works by first building a call graph from import statements and function definitions, then propagating taint annotations forward through the call chain. We annotate known source functions (HTTP request handlers, database query results that come from user-controlled tables, file reads from user-supplied paths) and known sink functions (SQL query executors, OS command executors, file writers, HTML template renderers). The taint propagation follows assignment chains, function return values, and callback arguments. When a tainted value reaches a sink without passing through a sanitizer, the candidate region is flagged.

// Simplified taint analysis pseudocode
fn analyze_function(func: &Function, module: &Module) -> Vec<TaintFlow> {
    let mut taint_sources: HashSet<VarId> = HashSet::new();
    let mut flows: Vec<TaintFlow> = Vec::new();

    // Mark parameters from HTTP handlers as tainted
    for param in &func.params {
        if is_http_handler_param(param, module) {
            taint_sources.insert(param.id);
        }
    }

    // Propagate taint through assignments and calls
    for stmt in &func.body {
        match stmt {
            Statement::Assign { target, value } => {
                if expr_uses_tainted(value, &taint_sources) {
                    taint_sources.insert(target.id);
                }
            }
            Statement::Call { func_name, args } => {
                if is_sensitive_sink(func_name) {
                    for arg in args {
                        if expr_uses_tainted(arg, &taint_sources) {
                            flows.push(TaintFlow {
                                source: find_source(arg, &taint_sources),
                                sink: func_name.clone(),
                                path: reconstruct_path(arg, &taint_sources),
                            });
                        }
                    }
                }
            }
            _ => {}
        }
    }
    flows
}

We maintain a library of approximately 340 source-sink pairs across JavaScript, TypeScript, Python, Java, and Go. Each pair is annotated with the CWE category it corresponds to and a priority score based on historical exploitation frequency. The library is versioned and updated monthly as new framework-specific patterns are discovered.
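One entry in such a source-sink library might look like the following sketch. The field names and example values are illustrative, not VibeGuard's actual schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSinkPair:
    """One entry in a hypothetical source-sink library (illustrative schema)."""
    language: str            # e.g. "python", "javascript"
    source_pattern: str      # qualified name of the taint source
    sink_pattern: str        # qualified name of the sensitive sink
    cwe_id: str              # CWE category this flow corresponds to
    priority: float          # 0.0-1.0, based on historical exploitation frequency
    sanitizers: tuple = ()   # calls that neutralize the taint along the path


# Example: user-controlled Flask request data reaching a raw SQL execute
FLASK_SQLI = SourceSinkPair(
    language="python",
    source_pattern="flask.request.args.get",
    sink_pattern="sqlite3.Cursor.execute",
    cwe_id="CWE-89",
    priority=0.95,
    sanitizers=("parameterized_query",),
)
```

Keeping each pair as immutable data makes the monthly library updates a pure data change rather than a code change.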

Stage 2: Context Extraction. For each candidate region flagged by the pre-scanner, we extract a context window. This is not just the flagged lines; it includes the function containing the flagged code, the function’s callers (up to two levels), relevant type definitions, and any configuration files that affect the code path (environment variable references, config file reads). The context extractor also resolves imports to include relevant utility functions and middleware definitions.

The context window is typically 200-800 lines of code, carefully selected to give the LLM enough information to reason about the vulnerability without overwhelming it with irrelevant code. We experimented extensively with context size and found a sharp quality cliff: below 150 lines, the model lacks enough context to distinguish intentional patterns from bugs; above 1,000 lines, the model starts ignoring relevant details and its false positive rate climbs from 12% to 34%. The sweet spot varies by vulnerability type: SQL injection analysis needs less context (the sink is usually close to the source), while authentication bypass analysis needs more context (the bypass often involves multiple middleware layers).

The context extraction algorithm uses a priority queue. It starts with the flagged code region (highest priority), adds the containing function, then adds callers sorted by their distance in the call graph. For each function added, it also adds any type definitions referenced in function signatures. The algorithm stops when the context window reaches the target size or there are no more relevant functions to add. We implemented a token budget allocator that reserves 40% of the context window for the immediate function, 30% for callers and callees, 20% for type definitions and configuration, and 10% for framework-specific boilerplate (middleware registration, route definitions).
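The budget split described above can be sketched as a small allocator. The 40/30/20/10 ratios come from the text; the function shape and region names are illustrative:

```python
def allocate_context_budget(total_tokens: int) -> dict:
    """Split a context-window token budget by region type.

    Ratios follow the 40/30/20/10 split described in the text;
    everything else about this function is an illustrative sketch.
    """
    split = {
        "immediate_function": 0.40,
        "callers_and_callees": 0.30,
        "types_and_config": 0.20,
        "framework_boilerplate": 0.10,
    }
    return {region: round(total_tokens * share) for region, share in split.items()}


budget = allocate_context_budget(4000)
# {'immediate_function': 1600, 'callers_and_callees': 1200,
#  'types_and_config': 800, 'framework_boilerplate': 400}
```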

Stage 3: LLM Analysis. Each context window is sent to a fine-tuned model with a structured prompt that asks specific questions: Is there a security vulnerability in this code? What is the attack vector? What is the severity? What is the specific fix? The model returns structured JSON that we parse, validate, and deduplicate against findings from other context windows.
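Deduplication across overlapping context windows can be as simple as keying findings on file, CWE, and line span, keeping the highest-confidence copy. The keying rule in this sketch is an assumption, not necessarily the production logic:

```python
def dedupe_findings(findings: list[dict]) -> list[dict]:
    """Collapse duplicate findings reported from overlapping context windows.

    Keying on (file, CWE, line span) is an illustrative heuristic, not
    necessarily the exact rule a production scanner would use.
    """
    seen: dict[tuple, dict] = {}
    for f in findings:
        key = (f["file"], f["cwe_id"], tuple(f["vulnerable_lines"]))
        # Keep the highest-confidence copy of each duplicate
        if key not in seen or f["confidence"] > seen[key]["confidence"]:
            seen[key] = f
    return list(seen.values())
```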

The Model: Fine-Tuning for Security Analysis

We fine-tuned a code-specialized model (based on CodeLlama 34B) on a dataset of 28,000 annotated vulnerability examples. The training data came from three sources:

  • CVE database entries with associated patches. We extracted 11,000 vulnerability-fix pairs from public CVE databases, focusing on web application vulnerabilities (CWE-79, CWE-89, CWE-287, CWE-502, CWE-918). For each CVE, we reconstructed the vulnerable code from the git history before the fix commit. This required building a pipeline that parses CVE references, locates the corresponding GitHub repositories, finds the fix commit, and extracts the parent commit’s version of the affected files.
  • Synthetic vulnerabilities injected into clean code. We took 8,000 functions from well-maintained open-source projects and programmatically injected specific vulnerability patterns: removing input validation, weakening authentication checks, introducing race conditions, adding unsafe deserialization. Each synthetic example was manually reviewed by a security engineer to confirm it was realistic. The injection rules are parameterized: for SQL injection, we have 12 different injection patterns ranging from simple string concatenation to template literal interpolation to ORM bypass via raw query methods.
  • Internal security audit findings. Over 9,000 examples from our own security consulting engagements (anonymized and with client permission). These are the most valuable because they represent real-world vulnerabilities that actually shipped to production, not textbook examples. The real-world examples include context that synthetic examples lack: surrounding business logic, framework configurations, deployment environment details.
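A synthetic-injection rule from the second source above can be sketched as a source-to-source rewrite. This toy version handles only the simplest SQL-injection pattern, turning one parameterized `cursor.execute` call into string concatenation; the regex and function shapes are illustrative:

```python
import re

# Matches cursor.execute("... %s ...", (arg,)) with one placeholder and one argument
_PARAM_CALL = re.compile(
    r"cursor\.execute\((?P<q>[\"'])(?P<sql>[^\"']*)%s(?P<rest>[^\"']*)(?P=q),\s*\((?P<arg>\w+),\)\)"
)


def inject_sqli(source: str) -> str:
    """Rewrite a parameterized query into vulnerable string concatenation.

    A toy version of a synthetic-vulnerability injection rule: it only
    handles a single %s placeholder, enough to illustrate the idea.
    """
    def rewrite(m: re.Match) -> str:
        q = m.group("q")
        parts = [f"{q}{m.group('sql')}{q}", f"str({m.group('arg')})"]
        if m.group("rest"):
            parts.append(f"{q}{m.group('rest')}{q}")
        return "cursor.execute(" + " + ".join(parts) + ")"

    return _PARAM_CALL.sub(rewrite, source)


clean = 'cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))'
print(inject_sqli(clean))
# cursor.execute("SELECT * FROM users WHERE id = " + str(user_id))
```

The real injection rules are parameterized far more richly, but the core mechanism is the same: a pattern match on safe code plus a template that emits its unsafe twin.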

The fine-tuning objective is not just classification (vulnerable / not vulnerable). We train the model to output a structured analysis:

{
  "vulnerability_detected": true,
  "cwe_id": "CWE-89",
  "severity": "high",
  "confidence": 0.92,
  "vulnerable_lines": [45, 46, 47],
  "attack_vector": "User-controlled input from request.query.search flows into
    SQL query on line 47 without parameterization. An attacker can inject
    arbitrary SQL by submitting a search parameter containing single quotes.",
  "suggested_fix": "Replace string concatenation with parameterized query.
    Use db.query('SELECT * FROM products WHERE name LIKE $1', ['%' + search + '%'])
    instead of string interpolation.",
  "false_positive_indicators": []
}

Training on structured output rather than binary classification improved our precision from 71% to 88%. The model learns not just to detect vulnerabilities but to explain them and propose fixes, which dramatically reduces the triage burden on developers reviewing scan results. We validated this improvement by measuring the time developers spent triaging VibeGuard findings: with binary output, average triage time per finding was 8.4 minutes; with structured output including attack vector descriptions and suggested fixes, it dropped to 2.1 minutes.

Reducing False Positives: The Hardest Problem

Every security scanning tool lives or dies by its false positive rate. A tool that generates 200 findings per scan, 180 of which are false positives, gets disabled within a week. Developers stop reading the results, security teams stop trusting the tool, and the entire investment is wasted. Our target was a false positive rate below 15%, which we hit after three major iterations.

The first iteration had a 38% false positive rate, which was unacceptable. The primary source of false positives was the model flagging code that looked vulnerable in isolation but was protected by validation logic elsewhere. For example, a SQL query built with string concatenation might be safe because the input was validated by a middleware function that the model could not see in its context window. Another common false positive source was the model flagging code that used framework features incorrectly according to general best practices but correctly according to the specific framework version in use (Django 4.x changed several security-related defaults compared to 3.x, and the model was trained on a mix of both).

We addressed this with three mechanisms:

  1. Expanded context extraction. We modified the context extractor to aggressively include middleware, decorators, and validation functions. If a function is wrapped in @require_auth or @validate_input, those decorator implementations are included in the context window. We also added framework version detection: the context extractor reads package.json, requirements.txt, or pom.xml to determine framework versions, and includes this information in the prompt so the model can reason about version-specific security behaviors.
  2. Framework-aware analysis. We built framework-specific knowledge into the pre-scanner. For example, Django’s ORM parameterizes queries by default, so a Django model query is not flagged as a SQL injection candidate unless it uses raw() or extra(). Express.js with helmet middleware is not flagged for missing security headers. Rails with strong parameters is not flagged for mass assignment unless permit! is used. We maintain a database of 87 framework-specific security rules covering the 15 most popular web frameworks.
  3. Confidence calibration. We added a calibration layer that adjusts the model’s confidence score based on the presence or absence of protective patterns. If the model detects a potential SQL injection but also sees an ORM layer in the context, the confidence is reduced. We trained a separate lightweight classifier (gradient-boosted trees with XGBoost, 47 features) on 5,000 examples of model outputs labeled as true positive or false positive. The features include: framework detected, middleware stack depth, presence of validation functions in the context, code complexity metrics for the flagged region, and the model’s raw confidence distribution.

The calibration layer alone reduced false positives from 24% to 14%. It works because the LLM’s raw confidence scores are poorly calibrated (a common problem with fine-tuned models), but the patterns that distinguish true positives from false positives are highly learnable: presence of validation middleware, use of ORM vs. raw queries, framework conventions, and test coverage signals.
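The production calibrator is a trained gradient-boosted classifier, but the idea can be sketched as a hand-written adjustment over the same kind of features. The feature names and weights below are made up for illustration, not the real 47-feature model:

```python
def calibrate_confidence(raw_confidence: float, features: dict) -> float:
    """Adjust an LLM's raw confidence using protective-pattern signals.

    A deliberately simplified stand-in for the trained calibration
    classifier described in the text; weights are illustrative only.
    """
    adjusted = raw_confidence
    if features.get("uses_orm_layer"):
        adjusted -= 0.25       # ORM usually parameterizes queries
    if features.get("validation_middleware_present"):
        adjusted -= 0.15       # input may be sanitized upstream
    if features.get("raw_query_api_used"):
        adjusted += 0.10       # raw() / extra() reopen the injection surface
    return max(0.0, min(1.0, adjusted))


calibrate_confidence(0.92, {"uses_orm_layer": True})   # ≈ 0.67
```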

Performance and Cost Engineering

Running LLM inference on every candidate region in a repository is expensive. A typical repository generates 20-80 candidate regions, and each LLM call costs approximately $0.003 with our current model. At 40,000 repositories scanned per month, that adds up to $2,400-$9,600/month in inference costs alone, before accounting for compute for the pre-scanner, context extraction, and result storage.

We optimized costs in four ways:

  • Aggressive pre-scanner filtering. The pre-scanner’s false negative rate is 6%, meaning it misses 6% of vulnerabilities. But it filters out 85% of code that does not contain vulnerabilities, reducing the number of LLM calls by 5x. This is an intentional tradeoff: we accept a small miss rate at the pre-scan stage to keep costs manageable.
  • Batched inference. We batch multiple context windows into a single LLM call when they come from the same repository. The model processes up to 4 context windows in a single prompt, reducing per-window overhead by approximately 40% due to shared repository context (imports, configuration, framework version). Batching required careful prompt engineering to prevent the model from confusing findings across different context windows; we use XML-style delimiters to separate each window within the prompt.
  • Result caching. If a file has not changed since the last scan (determined by content hash), we reuse the previous analysis. In practice, 60-70% of files in a repository are unchanged between scans, so caching reduces inference volume significantly. We use a content-addressable cache keyed on the SHA-256 of the context window, which means identical code patterns across different repositories also benefit from caching.
  • Tiered model selection. Simple, high-confidence findings from the pre-scanner (e.g., hardcoded passwords matching password\s*=\s*["'][^"']+["']) bypass the LLM entirely. They are reported directly from the rule-based scanner. Only ambiguous candidates go to the LLM. This routes about 30% of findings away from the LLM, and these rule-based findings have a 97% true positive rate because the patterns are highly specific.

What the LLM Catches That Rules Do Not

The whole point of adding an LLM to the pipeline is catching things that rule-based scanners miss. Here are three real examples (anonymized) from production scans:

Example 1: Subtle Authentication Bypass. A Node.js API had a middleware chain where authentication was applied to all routes except those matching a whitelist regex. The regex was /^\/(health|status|docs)/. The LLM identified that a request to /health/../admin/users would match the whitelist regex (starts with /health) but after URL normalization by Express, resolve to /admin/users, bypassing authentication. No signature-based rule covers this because it requires understanding the interaction between regex matching and URL normalization. This was a critical finding: the customer’s admin API was exposed to unauthenticated access.
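The bypass is easy to reproduce in isolation: the whitelist regex matches against the raw path, but normalization resolves the `..` segment afterward. A standalone illustration, using Python's `re` and `posixpath` to stand in for Express's behavior:

```python
import posixpath
import re

WHITELIST = re.compile(r"^/(health|status|docs)")


def is_public(path: str) -> bool:
    """Naive whitelist check on the raw, un-normalized path."""
    return WHITELIST.match(path) is not None


raw_path = "/health/../admin/users"
print(is_public(raw_path))            # True  -> auth is skipped
print(posixpath.normpath(raw_path))   # /admin/users -> what actually runs
```

The fix is the usual one: normalize the path first, or anchor the whitelist against exact, already-resolved routes.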

Example 2: Race Condition in Balance Check. A Python payment processing function checked a user’s balance, then debited the amount in a separate database call. The LLM identified that without a database lock or transaction isolation, two concurrent requests could both pass the balance check and double-spend. This is a TOCTOU (Time-of-Check-Time-of-Use) vulnerability that requires understanding the semantics of database operations, not just their syntax. The model specifically noted that the code used READ COMMITTED isolation level, which does not prevent this race, and suggested either SELECT ... FOR UPDATE or moving to SERIALIZABLE isolation.
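The same fix can also be expressed as a single atomic conditional UPDATE, which closes the check-then-debit window without explicit row locking. A sketch using SQLite (which lacks SELECT ... FOR UPDATE, hence the atomic-UPDATE form; the table layout is illustrative):

```python
import sqlite3


def debit(conn: sqlite3.Connection, user_id: int, amount: int) -> bool:
    """Atomically debit `amount` only if the balance covers it.

    The balance check and the debit happen in one UPDATE statement, so
    two concurrent requests cannot both pass the check (no TOCTOU gap).
    """
    cur = conn.execute(
        "UPDATE accounts SET balance = balance - ? "
        "WHERE id = ? AND balance >= ?",
        (amount, user_id, amount),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means insufficient funds


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
print(debit(conn, 1, 80))   # True  -> balance now 20
print(debit(conn, 1, 80))   # False -> second debit rejected
```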

Example 3: Insecure Deserialization via Indirect Path. A Java Spring application used Jackson for JSON deserialization with @JsonTypeInfo(use = JsonTypeInfo.Id.CLASS) on an internal DTO. The DTO was not directly exposed to user input, but the LLM traced the data flow through three intermediate services and identified that user-controlled data from a public API eventually reached this deserialization point, enabling a remote code execution attack via polymorphic type handling. The model recommended switching to JsonTypeInfo.Id.NAME with an explicit whitelist of allowed subtypes.

None of these three vulnerabilities would be caught by Semgrep, SonarQube, or Snyk’s default rule sets. They require semantic understanding of code behavior, not just syntactic pattern matching.

Limitations and Honest Assessment

VibeGuard is not a replacement for human security review. It has specific, measurable limitations:

  • Language coverage. The fine-tuned model performs well on JavaScript/TypeScript, Python, Java, and Go. Its performance degrades significantly on Rust (limited training data for Rust-specific vulnerability patterns), C/C++ (pointer arithmetic and memory safety require a different analysis approach), and PHP (framework diversity makes generalization difficult). We are actively expanding training data for these languages.
  • Business logic vulnerabilities. The model cannot detect vulnerabilities that depend on business requirements it does not know. If the business rule is “users should not be able to transfer more than $10,000 per day” and the code does not enforce this limit, the model has no way to flag it because it does not know the requirement exists.
  • Adversarial robustness. A motivated attacker who knows they are being scanned by an LLM-based tool could potentially write code that exploits the model’s blind spots. We have not yet done formal adversarial testing, and this is a concern we take seriously. Obfuscation techniques like variable renaming, control flow flattening, and indirect function calls can reduce the model’s detection rate by 15-25% in our preliminary tests.
  • Determinism. LLM outputs are stochastic. The same code scanned twice might produce slightly different findings. We mitigate this by running inference with temperature 0.1 and deduplicating across multiple runs, but we cannot guarantee perfectly deterministic results. In practice, our run-to-run agreement rate is 96.4% for high-confidence findings and 82.1% for medium-confidence findings.

Despite these limitations, VibeGuard catches an average of 2.3 vulnerabilities per repository that other tools miss, with a confirmed true positive rate of 86%. For teams that already use Snyk or Semgrep, it functions as a complementary layer that catches the long tail of vulnerabilities that rule-based systems are structurally unable to detect. The combination of rule-based and LLM-based scanning is strictly better than either approach alone, and we expect this hybrid architecture to become the standard for application security tooling within the next two to three years.
