
Archive Analysis

Analyzing web archives helps you find forgotten endpoints, old API documentation, and sensitive information left in previous versions of a site. As sites evolve, old files get unlinked but rarely deleted, so they remain accessible if you know the URL. Archives are your time machine for finding them.

1. Introduction to Web Archives

Web archives are digital time machines - they preserve website snapshots across time. For security pros and pentesters, archive analysis is one of the most powerful recon techniques. Organizations evolve their sites and leave behind forgotten endpoints, deprecated API docs, and sensitive info that stays accessible through archives.

Key Data Sources

Source | Description | Use Case
-------|-------------|---------
Wayback Machine | The most comprehensive web archive, storing petabytes of web history. | Finding old URLs, JS files, and API documentation.
Common Crawl | A massive, publicly available dataset of web crawl data; more raw and extensive than the Wayback Machine. | Deep analysis, finding patterns across a target's entire web presence.
AlienVault OTX | Open Threat Exchange, a threat intelligence community; its pulses include URLs related to domains. | Finding URLs associated with malicious activity or security research.
Google Cache | Provides Google's most recent cached version of a page. | Quickly checking a very recent, previous version of a live page.
Archive.today | Independent web archiving service. | Alternative archive source; snapshots of specific pages.
UK Web Archive | The British Library's web archive. | .uk domains, historical UK web content.

2. Core Methodology

A systematic archive analysis approach involves fetching URLs, filtering for interesting content, and then analyzing what you find. Here's how it works (a compressed pipeline sketch follows the list):

  1. Fetch All Known URLs: Use automated tools to query multiple archive sources for every URL ever associated with the target domain.
  2. Filter for High-Value Targets: Sift through the raw list of URLs to find potentially interesting file types and paths. This includes JavaScript files, API endpoints, configuration files, and documents.
  3. Check for Liveness: Many discovered URLs will be dead (404 Not Found). Use tools to quickly check the HTTP status code of the discovered URLs to see which ones are still active.
  4. Manual Analysis: Manually inspect the content of live, interesting URLs. Look for exposed secrets, business logic, or vulnerabilities.
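
Compressed into a single pipeline, steps 1 through 3 look roughly like this (a sketch that assumes gau and httpx are installed; the grep filter is only a starting point):

# Fetch, keep JS files and API paths, then keep only URLs that still respond
gau --subs example.com | grep -E '\.js$|/api/' | sort -u | httpx -silent -mc 200,403 > live-interesting-urls.txt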

3. Tools for Archive Reconnaissance

Automating the process of fetching URLs from archives is essential due to the sheer volume of data.

gau (Get All URLs)

gau is a go-to tool that fetches known URLs from AlienVault's OTX, the Wayback Machine, and Common Crawl. It's fast and comprehensive.

# Basic usage
gau example.com

# Save output to a file
gau example.com --o example-urls.txt

# Use providers (wayback, otx, commoncrawl)
gau --providers wayback,otx example.com

# Include subdomains
gau --subs example.com

waybackurls

A classic tool by @tomnomnom, specifically for querying the Wayback Machine. It's lightweight and excellent for quick checks.

# Basic usage from stdin
echo "example.com" | waybackurls

# Get URLs for all subdomains of a domain
subfinder -d example.com -silent | waybackurls > example-wayback-urls.txt

gauplus

An enhanced fork of gau with multi-threading, retries, and an extension blacklist; combine it with grep when you need parameter- or extension-specific filtering.

# Basic usage (domain as an argument or via stdin)
gauplus example.com

# Skip noisy static-asset extensions while fetching
gauplus -b woff,css,png,svg example.com

# Filter the output for .js files or an "id" parameter
gauplus example.com | grep '\.js$'
gauplus example.com | grep -E '[?&]id='

4. What to Look For: Practical Snippets

Once you have a list of URLs, the real hunt begins. Use grep and other command line tools to filter your list.

JavaScript Files (.js)

Old JS files are a goldmine for secrets and endpoints.

# Filter for all JavaScript files
cat all-urls.txt | grep '\.js$' | sort -u > js-files.txt

# Further filter for JS files containing "api" or "key" in the URL
cat js-files.txt | grep -E 'api|key'

Note: For deep analysis of these files, refer to the JavaScript Analysis cheatsheet.
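
Before a full review, a quick local pass over the live files can already surface hardcoded credentials. A minimal sketch (file names follow the snippets above; the secret regex is only a starting point):

# Download each live JS file and grep the lot for likely secrets
mkdir -p js-dump
cat js-files.txt | httpx -silent -mc 200 | while read -r url; do
  curl -sk "$url" -o "js-dump/$(echo "$url" | md5sum | cut -d' ' -f1).js"
done
grep -rEn '(api[_-]?key|secret|token|passwd|password)' js-dump/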

API Endpoints and Documentation

Look for paths that suggest API routes or documentation.

# Grep for common API patterns
cat all-urls.txt | grep -E '/api/|/v[0-9]+/|/docs/|/swagger/|/graphql' | sort -u
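
If an OpenAPI/Swagger spec turns up and is still live, it usually maps the whole API for you. A small sketch (the spec path is hypothetical; jq assumed installed):

# List every documented path in an exposed OpenAPI/Swagger spec
curl -s 'https://example.com/api/swagger.json' | jq -r '.paths | keys[]'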

Sensitive File Extensions

Developers sometimes accidentally leave backups or configuration files on the server.

# Create a regex pattern for sensitive extensions
SENSITIVE_EXTS='\.(bak|backup|old|zip|tar\.gz|tgz|rar|sql|config|yml|yaml|conf|env|swp|inc)$'

cat all-urls.txt | grep -E "$SENSITIVE_EXTS"
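
Backup and config hits only matter if they still resolve, so it is worth piping the matches straight into a liveness check. A sketch using httpx's content-type and content-length output:

# Keep only matches that still respond, and show what they serve
cat all-urls.txt | grep -E "$SENSITIVE_EXTS" | httpx -silent -mc 200 -content-type -content-length > live-sensitive-files.txt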

Parameters Prone to Vulnerabilities

Filter for URLs with parameters that are often vulnerable to attacks like XSS, SQLi, or LFI.

# Look for parameters like 'url', 'redirect', 'file', 'path', 'id', 'q'
cat all-urls.txt | grep -E '[?&](url|redirect|next|goto|file|path|document|id|q)='
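
If you have tomnomnom's gf with the community Gf-Patterns rules installed, the same triage becomes a one-liner per vulnerability class (pattern names depend on the rule set you use):

# Pattern-based parameter triage with gf
cat all-urls.txt | gf redirect > redirect-candidates.txt
cat all-urls.txt | gf sqli > sqli-candidates.txt
cat all-urls.txt | gf lfi > lfi-candidates.txt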

5. Advanced Techniques

Querying Common Crawl Manually

For more granular control, you can query the Common Crawl index directly; it exposes a CDX-style API over HTTP.

# List the available crawl indexes
curl -s 'https://index.commoncrawl.org/collinfo.json'

# Query one index for every URL recorded under example.com
curl -s 'https://index.commoncrawl.org/CC-MAIN-2023-50-index?url=example.com/*&output=json' > commoncrawl-urls.json

# Pull out just the URLs, e.g. all PHP files (jq assumed installed)
jq -r '.url' commoncrawl-urls.json | grep '\.php' | sort -u
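
The Wayback Machine exposes a similar CDX endpoint, which is handy when you want the raw index data rather than going through waybackurls (parameters shown are from the public CDX server API):

# Every archived URL for the domain, deduplicated by URL key
curl -s 'https://web.archive.org/cdx/search/cdx?url=example.com/*&fl=original&collapse=urlkey' > wayback-cdx-urls.txt

# Only snapshots that returned HTTP 200, within a date range
curl -s 'https://web.archive.org/cdx/search/cdx?url=example.com/*&filter=statuscode:200&from=2019&to=2021&fl=original,timestamp'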

Using Google Cache

Google can surface a very recent previous version of a page, although Google has been retiring cached results, so this operator may no longer return anything for many sites; archive.today is a useful fallback.

Google Dork:

cache:example.com/login.php

This is useful for observing recent changes that might have introduced or removed functionality.
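
When Google has nothing cached, the Wayback Machine can serve the raw bytes of an older snapshot, which makes it easy to diff an archived file against the live one and spot removed functionality. A small sketch (the timestamp and path are illustrative; appending id_ to the timestamp returns the original resource without the Wayback toolbar):

# Fetch a raw archived copy and compare it with the live version
curl -s 'https://web.archive.org/web/20210601000000id_/https://example.com/static/app.js' -o app-2021.js
curl -s 'https://example.com/static/app.js' -o app-live.js
diff app-2021.js app-live.js | head -50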

6. Notes and Pitfalls

  • Information Overload: Archive analysis can produce hundreds of thousands of URLs. Effective filtering is key to finding needles in the haystack. Start with high-value targets like JS files and API paths before digging into everything.
  • Dead Links: A significant portion of discovered URLs will return 404 errors. Always use a tool like httpx or ffuf to validate which URLs are still live.
    cat all-urls.txt | httpx -silent -status-code -mc 200,301,302,403,500 > live-urls.txt
    
  • Rate Limiting: Some archive services may temporarily block you if you make too many requests too quickly. Most modern tools handle this gracefully, but it's something to be aware of.
  • Scope Creep: Be careful to analyze only assets that are in scope for your engagement. Tools that pull from archives can sometimes find URLs on related but out-of-scope domains; a quick filtering sketch follows this list.
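
A quick way to enforce that last point is to filter the combined URL list against a scope file before any analysis (scope.txt is assumed to hold one in-scope domain per line):

# Keep only URLs whose host appears in scope.txt
grep -Ff scope.txt all-urls.txt > in-scope-urls.txt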

7. Quick Reference Table

Tool | Primary Use | Example Command
-----|-------------|----------------
gau | Fetching URLs from multiple sources. | gau --subs example.com
waybackurls | Fetching URLs specifically from the Wayback Machine. | echo "example.com" \| waybackurls
gauplus | Threaded URL fetching with an extension blacklist. | gauplus -b css,png,svg example.com
grep | Filtering a list of URLs for specific patterns. | cat urls.txt \| grep '/api/'
httpx | Checking the status of a list of URLs. | cat urls.txt \| httpx -mc 200
curl (CDX API) | Direct querying of the Common Crawl and Wayback indexes. | curl -s 'https://index.commoncrawl.org/CC-MAIN-2023-50-index?url=example.com/*&output=json'


Core Methodology & Workflow

Phase 1: Comprehensive URL Collection

# Step 1: Gather URLs from multiple sources
gau example.com --o gau-urls.txt
echo "example.com" | waybackurls > wayback-urls.txt
gauplus example.com | grep -E '\.(js|json|php|aspx?)$' > specific-urls.txt

# Step 2: Combine and deduplicate
cat gau-urls.txt wayback-urls.txt specific-urls.txt | sort -u > all-archive-urls.txt
wc -l all-archive-urls.txt  # Check total URLs discovered

Phase 2: Intelligent Filtering & Prioritization

# Filter for high-value targets
cat all-archive-urls.txt | grep -E '\.(js|json|config|env|yml|yaml|conf|sql|bak|backup|old)$' > sensitive-files.txt
cat all-archive-urls.txt | grep -E '/api/|/v[0-9]+/|/docs/|/swagger/|/graphql|/admin/' > api-endpoints.txt
cat all-archive-urls.txt | grep -E '[?&](api_key|token|secret|password|key|auth|access)=' > parameter-urls.txt

# Create priority categories
cat sensitive-files.txt api-endpoints.txt parameter-urls.txt | sort -u > high-priority-urls.txt

Phase 3: Live Validation & Analysis

# Check which URLs are still active
cat high-priority-urls.txt | httpx -silent -status-code -title -content-length -tech-detect -no-color -o live-urls-analysis.txt

# Keep responses worth a closer look (200, 301, 302, 403, 500)
cat live-urls-analysis.txt | grep -E '\[(200|301|302|403|500)\]' > potentially-accessible-urls.txt

Phase 4: Manual Investigation

# Extract unique technologies from live responses
cat live-urls-analysis.txt | awk '{print $NF}' | sort -u > detected-technologies.txt

# Focus on specific technology patterns
cat live-urls-analysis.txt | grep -i 'wordpress\|joomla\|drupal' > cms-urls.txt
cat live-urls-analysis.txt | grep -i 'jenkins\|gitlab\|jira' > devops-urls.txt

Advanced Tool Arsenal

Primary Archive Query Tools

gau (GetAllURLs) - Most Comprehensive

# Basic usage with subdomains
gau --subs example.com

# Specific providers only
gau --providers wayback,otx,commoncrawl example.com

# Output formatting options
gau example.com --json --o urls.json
gau example.com --blacklist ttf,woff,svg,png,jpg,css

# Tune concurrency and timeouts (keep threads modest to avoid provider rate limits)
gau example.com --threads 10 --timeout 30

waybackurls - Wayback Machine Specialist

# Process multiple domains
cat domains.txt | waybackurls > all-wayback-urls.txt

# Filter during extraction
echo "example.com" | waybackurls | grep '\.js$' > js-urls.txt

# Combine with subdomain enumeration
subfinder -d example.com -silent | waybackurls | sort -u > subdomain-urls.txt

gauplus - Threaded Fetching and Blacklisting

# Include subdomains and raise the thread count
gauplus -subs -t 10 -o gauplus-urls.txt example.com

# Skip static assets during the fetch
gauplus -b css,png,jpg,woff,svg -o no-assets.txt example.com

# Post-filter for specific extensions with grep
gauplus example.com | grep -E '\.(js|json|php)$' > specific-extensions.txt

# Post-filter for interesting parameters with grep
gauplus example.com | grep -E '[?&](id|user|api_key|token)=' > parameter-urls.txt

Specialized Analysis Tools

Common Crawl Index API - Direct Queries

# List the available crawl indexes
curl -s 'https://index.commoncrawl.org/collinfo.json' | jq -r '.[].id'

# Query a specific crawl index (the output is one JSON record per line)
curl -s 'https://index.commoncrawl.org/CC-MAIN-2023-50-index?url=example.com/*&output=json' > cc-2023-50.json

# Filter by MIME type with jq
jq -r 'select(.mime == "application/json") | .url' cc-2023-50.json

# Date scoping: pick the index for the crawl window you care about,
# since each CC-MAIN index covers a single crawl period
curl -s 'https://index.commoncrawl.org/CC-MAIN-2023-06-index?url=example.com/*&output=json'

urlhunter - Shortened-URL Archive Search

urlhunter searches ArchiveTeam's archived URL-shortener exports for keywords such as your target domain, surfacing links that were shared through shorteners and rarely appear in other sources.

# Install
go install github.com/utkusen/urlhunter@latest

# keywords.txt holds one keyword per line, e.g. "example.com"
urlhunter -keywords keywords.txt -date latest -o urlhunter-results.txt

# Search the archives from a specific day
urlhunter -keywords keywords.txt -date 2023-11-20 -o results-2023-11-20.txt

High-Value Target Patterns

JavaScript File Analysis

# Comprehensive JS file discovery
cat all-archive-urls.txt | grep '\.js$' | sort -u > all-js-files.txt

# Pattern-based filtering
cat all-js-files.txt | grep -E '(api|key|token|secret|auth|password|config)' > sensitive-js-files.txt
cat all-js-files.txt | grep -E '(admin|dashboard|internal|private|dev)' > internal-js-files.txt
cat all-js-files.txt | grep -E '(v[0-9]|version|old|legacy)' > versioned-js-files.txt

# Check which JS files are still live
cat all-js-files.txt | httpx -silent -status-code -mc 200 > live-js-files.txt

API Endpoint Discovery

# Comprehensive API pattern matching
API_PATTERNS='/(api|v[0-9]+|rest|graphql|soap|json|xml|rpc|webservice)/'
cat all-archive-urls.txt | grep -E "$API_PATTERNS" > api-candidates.txt

# Documentation endpoints
DOC_PATTERNS='/(docs|documentation|swagger|openapi|redoc|api-docs)/'
cat all-archive-urls.txt | grep -E "$DOC_PATTERNS" > documentation-urls.txt

# Admin and internal APIs
ADMIN_PATTERNS='/(admin|manager|dashboard|control|internal|private)/'
cat all-archive-urls.txt | grep -E "$ADMIN_PATTERNS" > admin-urls.txt

Sensitive File Extensions

# Critical backup and config files
SENSITIVE_EXTS='\.(bak|backup|old|orig|save|copy|tmp|temp|swp)$'
cat all-archive-urls.txt | grep -E "$SENSITIVE_EXTS" > backup-files.txt

# Configuration files
CONFIG_EXTS='\.(config|conf|ini|yml|yaml|properties|env|cfg|settings)$'
cat all-archive-urls.txt | grep -E "$CONFIG_EXTS" > config-files.txt

# Database and data files
DATA_EXTS='\.(sql|db|sqlite|mdb|accdb|dbf|json|xml|csv)$'
cat all-archive-urls.txt | grep -E "$DATA_EXTS" > data-files.txt

# Development and source files
DEV_EXTS='\.(git|svn|hg|bzr|log|txt|md|markdown)$'
cat all-archive-urls.txt | grep -E "$DEV_EXTS" > dev-files.txt
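
A .git hit from the archives often means the whole directory is still browsable on the live host. A quick check sketch (host extraction with awk is rough; httpx's -path flag requests the given path on each host):

# Derive unique hosts from the URL list and probe for an exposed .git/HEAD
cat all-archive-urls.txt | awk -F/ '{print $1 "//" $3}' | sort -u | httpx -silent -path /.git/HEAD -mc 200 -o exposed-git.txt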

Vulnerability-Prone Parameters

# Common vulnerable parameters
VULN_PARAMS='(url|redirect|next|goto|file|path|document|include|view|page|template)='
cat all-archive-urls.txt | grep -E "[?&]$VULN_PARAMS" > vuln-param-urls.txt

# Authentication parameters
AUTH_PARAMS='(user|username|pass|password|email|login|auth|token|key|secret|session)='
cat all-archive-urls.txt | grep -E "[?&]$AUTH_PARAMS" > auth-param-urls.txt

# ID and injection parameters
ID_PARAMS='(id|uid|userid|productid|orderid|categoryid|num|page)='
cat all-archive-urls.txt | grep -E "[?&]$ID_PARAMS" > id-param-urls.txt