Code Repositories OSINT¶
Public code repos on GitHub, GitLab, and Bitbucket are treasure troves for security researchers. Developers accidentally commit sensitive data all the time - API keys, passwords, internal docs, infrastructure details. This guide shows you how to extract valuable intel from code repositories.
Real story: On a recent pentest, code repo analysis led me to AWS credentials that gave access to 47 S3 buckets with customer data, internal databases, and proprietary source code. Estimated value? Over $2 million in potential damages.
1. Introduction to Code Repository OSINT¶
Code repository OSINT means systematically searching through public code, commit history, and developer activity to find security-relevant info. It's a highly effective passive recon technique that often yields the most valuable findings.
Why It's a Goldmine¶
- Secrets Exposure: The most common finding. Developers hardcode credentials for testing and forget to remove them. Recent studies show that 10% of all public repositories contain some form of sensitive information.
- Infrastructure Mapping: Configuration files (docker-compose.yml, .tf files, k8s manifests) can reveal internal network architecture, hostnames, and technologies used.
- Vulnerable Code: Finding outdated dependencies or custom code with obvious vulnerabilities that can be exploited.
- Developer Profiling: Identifying developers and pivoting to their other public activities can reveal more information about internal processes and technologies.
- Business Logic Understanding: Source code often reveals application workflows, authentication mechanisms, and data processing logic.
Statistical Insights¶
- 73% of developers have committed secrets to version control at least once
- AWS keys are the most commonly leaked secret (42% of findings)
- 80% of leaked secrets remain active for more than 30 days
- The average company has 12,000+ secrets exposed in public repositories
2. Core Methodology: The Hunt for Secrets¶
You need a structured approach or you'll get lost in all that code. Follow this systematic methodology to cover everything.
Phase 1: Target Identification¶
- Identify Target Repositories: Find the official GitHub/GitLab organization for your target company.
- Employee Repositories: Search for personal accounts of employees that may contain work-related code.
- Fork Analysis: Identify forks of the target's repositories that may contain older, more sensitive versions.
- Dependency Analysis: Find repositories that depend on the target's packages or libraries.
Phase 2: Broad Scanning¶
- Automated Scanning: Use specialized tools to scan repositories for high-entropy strings and patterns matching known secret formats.
- Keyword Searching: Perform broad searches for common secret patterns across all identified repositories.
- File Type Analysis: Focus on configuration files, environment files, and documentation.
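The "high-entropy strings" that automated scanners look for can be illustrated with a short sketch. It computes Shannon entropy per character and flags long tokens that look random. The function names, the 20-character minimum, and the 4.0-bit threshold are illustrative assumptions, not values taken from any particular tool.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def high_entropy_tokens(line: str, min_length: int = 20, threshold: float = 4.0) -> list[str]:
    """Return whitespace-separated tokens that are long and look random.

    min_length and threshold are assumed defaults for illustration only.
    """
    return [
        tok
        for tok in line.split()
        if len(tok) >= min_length and shannon_entropy(tok) >= threshold
    ]
```

Entropy alone produces false positives (hashes, minified code), which is why scanners typically pair it with pattern matching.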
Phase 3: Deep Analysis¶
- Commit History Analysis: Secrets are often removed in a later commit, but they remain in the repository's history. This is the most critical step.
- Branch Analysis: Check other branches (especially dev, test, staging) that may contain different code.
- Pull Request Review: Analyze pull requests for comments, discussions, and code changes that may reveal sensitive information.
Phase 4: Correlation and Verification¶
- Cross-Repository Analysis: Correlate findings across multiple repositories.
- Live Verification: Test discovered credentials and endpoints against live systems, but only when the engagement scope explicitly authorizes it.
- Impact Assessment: Evaluate the potential impact of each finding.
Phase 5: Reporting and Documentation¶
- Evidence Collection: Capture screenshots and code snippets for documentation.
- Risk Rating: Assign severity levels based on the sensitivity of the information.
- Recommendations: Provide actionable remediation advice.
3. GitHub Dorking: Advanced Manual Searching¶
GitHub's search is powerful but most people barely use it. Master these advanced techniques and you'll find way more.
Basic Dork Structure¶
Combine the organization/user filter with keywords and file filters.
# Basic structure
[scope]:[target] [keyword] [file-filter]
# Examples
org:exampleinc password filename:.env
user:johndoe "api_key" extension:json
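When many targets and keywords need to be combined, the structure above can be assembled programmatically. The helper below is a minimal sketch; `build_dork` is a hypothetical name, and it only joins the pieces into a query string without contacting GitHub.

```python
def build_dork(scope: str, target: str, keyword: str, file_filter: str = "") -> str:
    """Assemble a query of the form [scope]:[target] [keyword] [file-filter]."""
    if scope not in ("org", "user", "repo"):
        raise ValueError(f"unsupported scope: {scope}")
    # Quote multi-word keywords so they are searched as an exact phrase.
    term = f'"{keyword}"' if " " in keyword else keyword
    parts = [f"{scope}:{target}", term]
    if file_filter:
        parts.append(file_filter)
    return " ".join(parts)


if __name__ == "__main__":
    for kw in ("password", "api_key", "BEGIN RSA PRIVATE KEY"):
        print(build_dork("org", "exampleinc", kw, "filename:.env"))
```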
Organization and User Targeting¶
# Search within specific organizations
org:netflix
org:google
org:apple
# Search within user accounts
user:torvalds
user:defunkt
user:github
# Combine multiple targets
org:netflix OR org:netflix-cloud
# Exclude specific users or orgs
org:exampleinc -user:bot-account
File and Path Based Searching¶
# Specific filenames
filename:.env
filename:config.yml
filename:docker-compose.yml
filename:settings.py
filename:wp-config.php
# File extensions
extension:key
extension:pem
extension:ppk
extension:pfx
extension:sql
# Path based searching
path:/config/
path:/src/config/
path:/includes/
path:/secrets/
Content Based Searching¶
# Credentials and secrets
"password"
"secret"
"token"
"key"
"credential"
"auth"
"login"
# API related
"api_key"
"client_secret"
"access_token"
"bearer"
"oauth"
# Cloud providers
"aws_access_key"
"AZURE_CLIENT_SECRET"
"GCP_SERVICE_ACCOUNT"
"digitalocean_token"
# Database connections
"database_url"
"connection_string"
"DB_PASSWORD"
"mongodb://"
Advanced Search Filters Reference¶
| Filter | Description | Example |
|---|---|---|
| org:<name> | Search within organization | org:netflix password |
| user:<name> | Search within user | user:torvalds kernel |
| repo:<user/repo> | Search specific repo | repo:facebook/react |
| filename:<name> | Search filename | filename:.env |
| extension:<ext> | Search file extension | extension:json |
| path:<path> | Search in path | path:/config/ |
| language:<lang> | Search by language | language:python |
| size:<size> | Search by file size | size:>1000 |
| created:<date> | Search by creation date | created:>2023-01-01 |
| pushed:<date> | Search by last push | pushed:<2022-01-01 |
Combining Dorks for High-Impact Results¶
# Find AWS keys in Python files
org:exampleinc language:python "aws_access_key"
# Database credentials in config files
org:exampleinc path:/config/ "DB_PASSWORD"
# Private keys in any file
org:exampleinc "BEGIN RSA PRIVATE KEY"
# Recent commits with password mentions
org:exampleinc password pushed:>2023-06-01
# Large configuration files
org:exampleinc filename:config size:>5000
# Environment files with secrets
org:exampleinc filename:.env "API_KEY"
# Docker files with hardcoded secrets
org:exampleinc filename:docker-compose "password"
# Kubernetes configurations
org:exampleinc filename:deployment.yaml "secret"
# Terraform files with credentials
org:exampleinc extension:tf "password"
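Hits from database-credential dorks like those above often contain full connection URLs rather than separate user and password fields. The sketch below uses the standard library's `urlsplit` to pull embedded credentials out of such a URL. The function name and the sample URLs in the usage note are made up for illustration.

```python
from typing import Optional
from urllib.parse import urlsplit


def extract_embedded_credentials(url: str) -> Optional[dict]:
    """Return credential parts of a connection URL, or None if no password is embedded."""
    parts = urlsplit(url)
    if parts.password is None:
        return None
    return {
        "scheme": parts.scheme,
        "username": parts.username,
        # Note: percent-encoded characters are returned as-is, not decoded.
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
    }
```

For example, `extract_embedded_credentials("postgresql://app:s3cret@db.internal.example.com:5432/orders")` yields the user, password, host, and port, while a URL without credentials returns `None`.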
Language-Specific Dorking¶
# JavaScript/Node.js
org:exampleinc language:javascript "process.env"
org:exampleinc language:javascript "require('dotenv')"
# Python
org:exampleinc language:python "os.environ"
org:exampleinc language:python "from dotenv import"
# Java
org:exampleinc language:java "System.getenv"
org:exampleinc language:java "Properties.load"
# PHP
org:exampleinc language:php "getenv"
org:exampleinc language:php "$_ENV"
# Ruby
org:exampleinc language:ruby "ENV["
org:exampleinc language:ruby "Figaro.load"
Advanced Search Techniques¶
Boolean Operators:
# AND operator (default)
org:exampleinc password token
# OR operator
org:exampleinc password OR token
# NOT operator
org:exampleinc password -filename:test
# Grouping with parentheses
org:exampleinc (password OR token) filename:.env
# Complex combinations
org:exampleinc (aws_key OR azure_key) -extension:md
Range Searching:
# Date ranges
org:exampleinc password pushed:2023-01-01..2023-06-30
org:exampleinc created:>2022-01-01
# Size ranges
org:exampleinc filename:config size:1000..5000
org:exampleinc size:>10000
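Regular Expression Support: GitHub's legacy code search has no regex support, so bracket expressions such as key[0-9] are treated as literal text. The newer code search accepts regular expressions wrapped in forward slashes:
# Regex queries (newer GitHub code search only)
org:exampleinc /key[0-9]+/
org:exampleinc /secret_[a-z]+/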
Specialized Search Scenarios¶
CI/CD Configuration Files:
# GitHub Actions
org:exampleinc path:.github/workflows "secret"
org:exampleinc filename:github-actions.yml "AWS_"
# GitLab CI
org:exampleinc filename:.gitlab-ci.yml "variables"
org:exampleinc path:.gitlab "SECRET_"
# Jenkins
org:exampleinc filename:Jenkinsfile "withCredentials"
org:exampleinc filename:Jenkinsfile "usernamePassword"
# CircleCI
org:exampleinc filename:config.yml "environment"
org:exampleinc path:.circleci "AWS_"
# Travis CI
org:exampleinc filename:.travis.yml "secure"
org:exampleinc filename:.travis.yml "env"
# Azure DevOps
org:exampleinc filename:azure-pipelines.yml "secret"
org:exampleinc path:.azure "variables"
GitHub Search Limitations and Workarounds¶
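Rate Limiting:
- Unauthenticated: 10 requests per minute
- Authenticated: 30 requests per minute for most search endpoints; the code search endpoint requires authentication and allows 10 requests per minute
- Pace large-scale scans and back off when rate-limit responses arrive, rather than rotating accounts or proxies to get around the limits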
Search Result Limits:
- Only the first 1,000 results are available
- Use date ranges and other filters to narrow results
- Break large searches into smaller, targeted queries
API Access:
# Use GitHub API for programmatic access
curl -H "Authorization: token YOUR_TOKEN" \
"https://api.github.com/search/code?q=org:exampleinc+password"
# Use official GitHub CLI (-X GET sends the -f field as a query parameter)
gh api -X GET search/code -f q='org:exampleinc password' --jq '.items[].html_url'
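The same search call can be prepared from Python with only the standard library. This is a sketch under stated assumptions: `build_code_search_request` is a hypothetical helper, `YOUR_TOKEN` is a placeholder, and the request is only constructed here, not sent.

```python
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"


def build_code_search_request(query: str, token: str, page: int = 1) -> urllib.request.Request:
    """Build (but do not send) an authenticated code search request."""
    # urlencode handles the ':' and spaces in search qualifiers.
    params = urllib.parse.urlencode({"q": query, "per_page": 100, "page": page})
    return urllib.request.Request(
        f"{API_ROOT}/search/code?{params}",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
    )


if __name__ == "__main__":
    req = build_code_search_request("org:exampleinc password", "YOUR_TOKEN")
    print(req.full_url)
    # To execute it: urllib.request.urlopen(req), then json.load() the response.
```

Keeping construction separate from sending makes it easy to add pacing between requests so the scan stays within the search rate limit.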
4. Automated Secret Scanning Tools¶
Manual searching is good, but automated tools are essential for comprehensive coverage and deep commit history analysis.
truffleHog¶
Scans git repositories for secrets using both regex patterns and entropy analysis. Now maintained by Truffle Security.
# Basic scan of a public repository
trufflehog git https://github.com/example/repo.git
# Report only secrets that trufflehog verified as live
trufflehog git https://github.com/example/repo.git --only-verified
# Scan with custom detectors (v3 reads them from a YAML config file)
trufflehog git https://github.com/example/repo.git --config /path/to/config.yaml
# Output in JSON format
trufflehog git https://github.com/example/repo.git --json
# Skip live verification of findings (v3 has no --no-entropy flag)
trufflehog git https://github.com/example/repo.git --no-verification
# Scan specific branches
trufflehog git https://github.com/example/repo.git --branch develop
# Use multiple workers for faster scanning
trufflehog git https://github.com/example/repo.git --concurrency 4
gitleaks¶
A fast, reliable secret scanner that uses regex-based detection with comprehensive rule sets.
# Scan a local repository
gitleaks detect -s /path/to/repo -v
# Scan with specific config
gitleaks detect -s /path/to/repo -c gitleaks.toml
# Scan and output to file
gitleaks detect -s /path/to/repo --report-path findings.json
# gitleaks scans local clones only, so fetch a remote repository first
git clone https://github.com/example/repo.git && gitleaks detect -s repo -v
# Protect mode (pre-commit hook)
gitleaks protect -v
# Scan only recent commits
gitleaks detect -s /path/to/repo --log-opts="--since=2023-01-01"
git-secrets¶
AWS-developed tool that prevents committing secrets and can scan existing repositories.
# Install hooks in a repository
git secrets --install
# Add patterns to scan for
git secrets --add 'AWS_ACCESS_KEY_ID'
# Scan the entire history
git secrets --scan-history
# Scan specific files
git secrets --scan /path/to/file
# Register with AWS patterns
git secrets --register-aws
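AWS-specific patterns target identifiers with a fixed shape, such as the access key ID format `AKIA[0-9A-Z]{16}` listed in this guide's quick reference. The sketch below applies that one pattern in Python; it is an illustration of the idea, not the rule set that git-secrets actually registers. The sample key in the usage note is the placeholder used in AWS documentation.

```python
import re

# Long-term AWS access key IDs: the literal prefix "AKIA" plus 16 uppercase
# letters or digits. Word boundaries avoid matching inside longer tokens.
AWS_ACCESS_KEY_ID_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_aws_access_key_ids(text: str) -> list[str]:
    """Return all substrings of `text` shaped like an AWS access key ID."""
    return AWS_ACCESS_KEY_ID_RE.findall(text)
```

Calling `find_aws_access_key_ids("aws_access_key_id = AKIAIOSFODNN7EXAMPLE")` returns the key ID. A match only shows the string has the right shape, so it still needs verification within the authorized scope.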
Other Notable Tools¶
shhgit:
# Real-time scanning of GitHub
shhgit -t YOUR_TOKEN -q "org:exampleinc"
# Specific file types
shhgit -t YOUR_TOKEN -q "org:exampleinc filename:.env"
repo-supervisor:
# Scan multiple repositories
repo-supervisor --org exampleinc --token YOUR_TOKEN
# Custom rules
repo-supervisor --rules /path/to/rules.json
ggshield (GitGuardian):
# Scan local repository (recent ggshield releases group these under "secret scan")
ggshield secret scan repo /path/to/repo
# Scan pre-commit
ggshield secret scan pre-commit
# Scan CI environment
ggshield secret scan ci
Custom Rule Development¶
Create custom detection rules for organization-specific patterns:
// truffleHog rules example
{
"detectors": [
{
"name": "Custom API Key",
"keywords": ["custom_api_key", "internal_key"],
"regex": {"pattern": "[A-Z0-9]{32}"},
"verify": false
}
]
}
# gitleaks config example
[[rules]]
description = "Custom API Key Pattern"
id = "custom-api-key"
regex = '''[A-Z0-9]{32}'''
keywords = ["custom_api_key", "internal_key"]
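Before deploying a custom rule it helps to try the pattern against sample lines. The sketch below mimics, in simplified form, the keyword-plus-regex approach shown in the rule definitions: the regex is only consulted when one of the keywords appears on the line. The function name and the matching logic are assumptions for illustration and do not reproduce either tool's real engine.

```python
import re

# Values copied from the example rules above.
RULE_REGEX = re.compile(r"[A-Z0-9]{32}")
RULE_KEYWORDS = ("custom_api_key", "internal_key")


def match_rule(line: str) -> list[str]:
    """Return regex matches, but only if a rule keyword is also on the line."""
    lowered = line.lower()
    if not any(keyword in lowered for keyword in RULE_KEYWORDS):
        return []
    return RULE_REGEX.findall(line)


if __name__ == "__main__":
    print(match_rule('custom_api_key = "ABCDEFGHIJKLMNOPQRSTUVWXYZ012345"'))
    print(match_rule('checksum = "ABCDEFGHIJKLMNOPQRSTUVWXYZ012345"'))
```

The keyword gate matters here because a bare `[A-Z0-9]{32}` also matches many uppercase hex digests and other non-secret identifiers.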
5. Deep Commit History Analysis¶
This is where the most valuable secrets are found. Developers often commit secrets and then remove them, but the historical record remains.
Manual Git History Analysis¶
# Clone the target repository
git clone https://github.com/example/repo.git
cd repo
# View full commit history with diffs
git log -p
# Search for specific strings in commit history
git log -p -S"password"
# Search with regex patterns
git log -p -G"[A-Z0-9]{20}"
# Search in specific file types
git log -p -- "*.env"
# Limit to specific time range
git log -p --since="2023-01-01" --until="2023-06-30"
# Search in specific branches
git log -p origin/develop -S"secret"
# Show only commit messages containing keywords
git log --grep="password"
# Extract unique base64-like strings (20+ characters followed by "=") from the history
git log -p | grep -Eo '[A-Za-z0-9+/]{20,}=' | sort -u
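The grep approach loses track of which commit introduced a match. As a sketch of one way to keep that link, the function below walks the text output of `git log -p` and records the commit hash alongside each added line that looks like a credential assignment. The regular expression and the function name are illustrative assumptions, and the parser expects the default `git log` format (SHA-1 hashes, `commit <hash>` header lines).

```python
import re

COMMIT_RE = re.compile(r"^commit ([0-9a-f]{40})")
# Assumed, deliberately simple pattern: keyword, then ':' or '=', then a value.
SECRET_RE = re.compile(r"(password|secret|token|api_key)\s*[:=]\s*\S+", re.IGNORECASE)


def find_added_secrets(log_text: str) -> list[tuple[str, str]]:
    """Return (commit_hash, added_line) pairs for suspicious added lines."""
    findings = []
    current_commit = None
    for line in log_text.splitlines():
        header = COMMIT_RE.match(line)
        if header:
            current_commit = header.group(1)
            continue
        # Added lines start with a single '+'; '+++' marks the file header.
        if line.startswith("+") and not line.startswith("+++"):
            if current_commit and SECRET_RE.search(line):
                findings.append((current_commit, line[1:].strip()))
    return findings
```

Typical use would be to capture the log with `subprocess.run(["git", "log", "-p", "--all"], capture_output=True, text=True)` and pass its `stdout` to the function.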
Automated History Scanning¶
# Scan entire history with gitleaks
gitleaks detect -s /path/to/repo --log-opts="--all"
# Use git-all-secrets wrapper
git-all-secrets --repo https://github.com/example/repo
# Manual script for comprehensive scanning
#!/bin/bash
REPO_PATH="/path/to/repo"
OUTPUT_FILE="secrets_scan.txt"
echo "Scanning commit history for secrets..."
git -C "$REPO_PATH" log -p --all > all_commits.diff
# Extract base64-like strings as high-entropy candidates
grep -E '[A-Za-z0-9+/]{20,}=' all_commits.diff >> "$OUTPUT_FILE"
# Look for common patterns (case-insensitive)
grep -Ei '(password|secret|key|token|credential)' all_commits.diff >> "$OUTPUT_FILE"
echo "Scan complete. Results in $OUTPUT_FILE"
Branch Analysis¶
# List all branches
git branch -a
# Checkout and scan specific branches
git checkout develop
gitleaks detect -v
# Scan all branches
for branch in $(git branch -r | grep -v HEAD); do
echo "Scanning branch: $branch"
git checkout "$branch"
gitleaks detect -v --report-path "scan_${branch//\//_}.json"
done
Blame Analysis¶
# See who last modified each line containing a secret
git blame config/file.yml | grep -i password
# Annotate files with commit information
git annotate sensitive/file.txt
6. Advanced Techniques and Pivoting¶
Developer OSINT and Pivoting¶
# Get contributor information
git shortlog -sn --all
# Analyze specific developer's commits
git log --author="developer@example.com" -p
# Find developer email patterns
git log --pretty=format:"%ae" | sort -u
# Cross-reference with other platforms
# Search for developer usernames on:
# - LinkedIn
# - Twitter
# - Stack Overflow
# - Other code platforms
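The author emails collected above become more useful once grouped by domain, since corporate domains and personal accounts stand out immediately. A minimal sketch, assuming the emails have already been read into a list (for example from the `git log --pretty=format:"%ae"` output); the function name is hypothetical.

```python
from collections import Counter


def email_domains(emails: list[str]) -> Counter:
    """Count unique author addresses per domain, skipping GitHub noreply addresses."""
    unique = {e.strip().lower() for e in emails if "@" in e}
    domains = Counter()
    for email in unique:
        domain = email.rsplit("@", 1)[1]
        # GitHub's privacy addresses reveal nothing about the employer.
        if domain == "users.noreply.github.com":
            continue
        domains[domain] += 1
    return domains


if __name__ == "__main__":
    authors = ["alice@example.com", "bob@example.com", "carol@gmail.com"]
    for domain, count in email_domains(authors).most_common():
        print(f"{count:3d}  {domain}")
```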
Gist Analysis¶
# Search user's gists
curl -H "Authorization: token YOUR_TOKEN" \
"https://api.github.com/users/username/gists"
# Search gists by content
# Use GitHub search: https://gist.github.com/search?q=org:exampleinc
# Automated gist scanning
python3 gist_scanner.py --username target_developer
Exposed .git Directories¶
# Check for exposed .git directories
curl -s http://example.com/.git/HEAD
# Use git-dumper to download entire repository
git-dumper http://example.com/.git/ ./output_directory
# Alternative tools
# - GitHacker
# - DVCS-Pillage
# - git-dumper (Python)
# Reconstruct repository from exposed .git
./gitdumper.sh http://example.com/.git/ ./repo_output
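Many servers answer any path with a custom error page and a 200 status, so the body returned for `/.git/HEAD` should be checked before concluding that the directory is exposed. The sketch below validates the body only; fetching it (with curl or otherwise) is left to the caller. The function name is an assumption, and the check covers symbolic refs and 40-character SHA-1 hashes only.

```python
import re

# A genuine HEAD file is either a symbolic ref or a detached commit hash.
HEAD_RE = re.compile(r"^(ref: refs/\S+|[0-9a-f]{40})$")


def looks_like_git_head(body: str) -> bool:
    """Return True if `body` has the shape of a real .git/HEAD file."""
    return bool(HEAD_RE.match(body.strip()))


if __name__ == "__main__":
    print(looks_like_git_head("ref: refs/heads/main\n"))
    print(looks_like_git_head("<html><body>Not Found</body></html>"))
```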
Dependency Analysis¶
# Analyze package.json for dependencies
curl -s https://raw.githubusercontent.com/example/repo/main/package.json | jq '.dependencies'
# Check for vulnerable dependencies
npm audit --json
snyk test
# Analyze CI/CD dependencies
# Check Dockerfiles, requirements.txt, Gemfile, etc.
Infrastructure as Code Analysis¶
# Terraform files
find . -name "*.tf" -exec grep -l "password\|secret\|key" {} \;
# Kubernetes configurations
find . -name "*.yaml" -o -name "*.yml" | xargs grep -l "secret\|password"
# CloudFormation templates
find . -name "*.json" -o -name "*.yml" | xargs grep -l "AWS::"
# Scan for hardcoded credentials and misconfigurations
# (terraform validate only checks syntax and internal consistency)
terraform validate
checkov -d /path/to/terraform
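Infrastructure-as-code files found this way frequently embed internal addresses. Substring searches such as "10." match far too much, so the sketch below extracts IPv4 candidates and keeps only private, non-loopback ones using the standard `ipaddress` module. The function name and the sample text in the test data are illustrative.

```python
import ipaddress
import re

IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def private_ipv4_addresses(text: str) -> list[str]:
    """Return unique private (non-loopback) IPv4 addresses in order of appearance."""
    found = []
    for candidate in IPV4_RE.findall(text):
        try:
            addr = ipaddress.ip_address(candidate)
        except ValueError:
            # Looks like an address but has an out-of-range octet, e.g. 999.1.1.1
            continue
        if addr.is_private and not addr.is_loopback and candidate not in found:
            found.append(candidate)
    return found
```

Private ranges are a reasonable first filter for internal network mapping, though public addresses in the same files can be just as relevant to the target.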
7. Enterprise-Scale Scanning¶
Organization-Wide Scanning¶
# Script to scan all organization repositories
#!/bin/bash
ORG="exampleinc"
TOKEN="YOUR_TOKEN"
OUTPUT_DIR="scan_results"
mkdir -p "$OUTPUT_DIR"
# Get all repositories
curl -H "Authorization: token $TOKEN" \
"https://api.github.com/orgs/$ORG/repos?per_page=100" | \
jq -r '.[].clone_url' > repos.txt
# Scan each repository
while IFS= read -r repo; do
  repo_name=$(basename "$repo" .git)
  echo "Scanning: $repo_name"
  # Full clone (no --depth) so gitleaks can inspect the entire commit history
  git clone --quiet "$repo" "temp_$repo_name"
  gitleaks detect -s "temp_$repo_name" --report-path "$OUTPUT_DIR/${repo_name}_scan.json"
  # Cleanup
  rm -rf "temp_$repo_name"
done < repos.txt
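The repository listing is paginated, so a single `per_page=100` request returns at most 100 repositories. GitHub signals further pages through the `Link` response header. The sketch below parses that header to find the next page URL; the function name is hypothetical, and the parser assumes the URLs themselves contain no commas.

```python
import re
from typing import Optional

LINK_PART_RE = re.compile(r'\s*<([^>]+)>\s*;\s*rel="([^"]+)"')


def next_page_url(link_header: Optional[str]) -> Optional[str]:
    """Return the URL tagged rel="next" in a Link header, or None on the last page."""
    if not link_header:
        return None
    for part in link_header.split(","):
        match = LINK_PART_RE.match(part)
        if match and match.group(2) == "next":
            return match.group(1)
    return None
```

A scanner would request the first page, collect the `clone_url` values, and keep following `next_page_url(response.headers.get("Link"))` until it returns `None`.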
Continuous Monitoring¶
# Set up GitHub webhooks for real-time monitoring
# Monitor:
# - Push events
# - Repository creation
# - Fork events
# - Public status changes
# Use GitHub Actions for automated scanning
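name: Secret Scanning
on: [push, pull_request]
jobs:
  gitleaks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0  # full history, so older commits are scanned too
      - uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GITLEAKS_CONFIG: .gitleaks.toml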
Integration with CI/CD Pipelines¶
# GitHub Actions example
name: Security Scan
on: [push, pull_request]
jobs:
  secret-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0  # the base..head range needs full history
      - name: Run truffleHog
        uses: trufflesecurity/trufflehog@main
        with:
          path: ./
          base: ${{ github.base_ref }}
          head: ${{ github.sha }}
8. Defensive Techniques and Prevention¶
Git Hooks for Prevention¶
#!/bin/bash
# Pre-commit hook example (.git/hooks/pre-commit, must be executable)
# Run secret scanning
if ! gitleaks protect --staged; then
  echo "Secrets detected in staged files. Commit rejected."
  exit 1
fi
Repository Configuration¶
# .gitignore: keep environment files out of version control entirely
*.env
# Share hooks with the whole team: commit them under .githooks/ and point Git at that directory
git config core.hooksPath .githooks
Developer Education¶
- Regular security training
- Code review best practices
- Use of environment management tools
- Implementation of secret management systems
9. Real-World Case Studies¶
Case Study 1: Financial Institution Breach¶
Situation: A mid-sized bank had exposed AWS credentials in a public repository.
Discovery:
- Developer committed config.yml with hardcoded AWS keys
- Keys provided full administrative access to AWS account
- Found through automated GitHub scanning
Impact:
- Access to customer financial data
- Ability to modify banking transactions
- Potential for complete system compromise
Resolution:
- Immediate key rotation
- Implementation of pre-commit hooks
- Organization-wide secret scanning
Case Study 2: E-commerce Platform¶
Situation: Database credentials exposed in commit history.
Discovery:
- Developer removed credentials in later commit but history remained
- Found through deep commit history analysis
Impact:
- Access to 2.3 million customer records
- Exposure of payment information
- GDPR compliance violations
Resolution:
- Database credential rotation
- Implementation of git history rewriting
- Enhanced monitoring and alerting
10. Quick Reference: High-Value Keywords¶
Credentials and Secrets¶
"password", "passwd", "secret", "token", "key", "credential",
"auth", "authentication", "login", "cred", "certificate",
"private_key", "public_key", "ssh_key", "rsa_key", "pem",
"jwt", "bearer", "oauth", "api_key", "client_secret",
"access_token", "refresh_token", "session_token"
Cloud Providers¶
# AWS
"aws_access_key", "aws_secret_key", "AWS_ACCESS_KEY_ID",
"AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN", "AKIA[0-9A-Z]{16}",
"arn:aws:", "s3.amazonaws.com", "dynamodb.amazonaws.com"
# Azure
"AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET", "AZURE_TENANT_ID",
"azure.identity", "azure.keyvault", "azure.storage"
# Google Cloud
"GOOGLE_APPLICATION_CREDENTIALS", "GCP_SERVICE_ACCOUNT",
"google.cloud", "gcloud", "gs://", "storage.googleapis.com"
# DigitalOcean
"digitalocean_token", "DO_SPACES_KEY", "DO_SPACES_SECRET"
# Heroku
"HEROKU_API_KEY", "herokuapp.com"
# Other Cloud
"cloudflare", "fastly", "akamai", "linode", "vultr"
Database Connections¶
# Connection strings
"database_url", "connection_string", "DB_PASSWORD", "DB_USER",
"mongodb://", "mysql://", "postgresql://", "redis://",
"sqlserver://", "oracle://", "sqlite://"
# Specific databases
"MYSQL_ROOT_PASSWORD", "POSTGRES_PASSWORD", "REDIS_PASSWORD",
"MONGO_INITDB_ROOT_PASSWORD", "RDS_PASSWORD"
API Keys and Services¶
# Payment processors
"stripe", "paypal", "braintree", "square", "adyen"
# Email services
"sendgrid", "mailgun", "mailchimp", "postmark", "mandrill"
# SMS services
"twilio", "nexmo", "plivo", "messagebird"
# Social media
"facebook", "twitter", "instagram", "linkedin", "google_oauth"
# Mapping services
"google_maps", "mapbox", "here", "bing_maps"
# Analytics
"google_analytics", "mixpanel", "amplitude", "segment"
Internal Infrastructure¶
# Internal URLs and endpoints
"internal", "staging", "dev", "test", "qa", "preprod",
"corp.example.com", "vpn.example.com", "intranet"
# Network information
"192.168.", "10.", "172.16.", "localhost", "127.0.0.1"
# Service discovery
"consul", "etcd", "zookeeper", "eureka"
File Patterns and Extensions¶
# Configuration files
".env", ".config", "config.yml", "settings.py", "configuration.json"
# Certificate files
".pem", ".key", ".crt", ".pfx", ".der", ".csr", ".jks"
# Database files
".sql", ".db", ".mdb", ".accdb", ".dbf"
# Archive files
".zip", ".rar", ".7z", ".tar", ".gz", ".backup", ".dump"
11. Legal and Ethical Considerations¶
Responsible Disclosure¶
- Always follow responsible disclosure procedures
- Report findings to the organization through proper channels
- Do not access systems without explicit permission
- Document findings thoroughly for remediation
Scope and Boundaries¶
- Only search within publicly accessible repositories
- Respect rate limits and terms of service
- Avoid causing service disruptions
- Do not exploit discovered vulnerabilities without permission
Data Handling¶
- Handle sensitive information with care
- Do not share or publish discovered secrets
- Securely store any collected data
- Follow data protection regulations (GDPR, CCPA, etc.)
12. Tools and Resources¶
Essential Tools¶
- truffleHog: Comprehensive secret scanning
- gitleaks: Fast regex-based scanning
- git-secrets: AWS-focused prevention and detection
- shhgit: Real-time GitHub monitoring
- git-dumper: .git directory recovery
- ggshield: GitGuardian enterprise solution
Browser Extensions¶
- GitHub Secret Scanner: Real-time browser scanning
- GitGuardian: Enterprise browser integration
- TruffleHog Browser Extension: Chrome/Firefox scanning
Online Resources¶
- GitHub Advanced Search: https://github.com/search/advanced
- Gist Search: https://gist.github.com/search
- GitHub API Documentation: https://docs.github.com/en/rest
- Secret Scanning Patterns: https://github.com/trufflesecurity/trufflehog
Training and Education¶
- OWASP Secure Coding Practices
- GitHub Security Lab training
- SANS Secure Coding courses
- Platform-specific security documentation
13. Best Practices Summary¶
For Security Researchers¶
- Start with manual reconnaissance to understand the target
- Use automated tools for comprehensive coverage
- Focus on commit history for the most valuable findings
- Verify findings before reporting
- Follow responsible disclosure procedures
- Document everything for clear reporting
- Respect scope and boundaries at all times
For Developers and Organizations¶
- Implement pre-commit hooks to prevent secret commits
- Use secret management systems (Vault, AWS Secrets Manager, etc.)
- Regularly scan repositories for accidental exposures
- Educate developers on secure coding practices
- Monitor public repositories for organizational leaks
- Have incident response plans for when leaks occur
- Rotate credentials immediately upon discovery
Continuous Improvement¶
- Regularly update scanning rules and patterns
- Stay current with new tools and techniques
- Participate in security communities
- Share knowledge and lessons learned
- Contribute to open source security tools
14. Conclusion¶
Code repository OSINT is one of the most powerful techniques in modern security assessment. The accidental exposure of secrets in public code repositories represents a significant risk to organizations of all sizes. By following the systematic methodology outlined in this guide, security professionals can effectively identify and mitigate these risks.
Remember that the goal is not just to find vulnerabilities, but to help organizations improve their security posture. Always approach this work with professionalism, ethics, and a commitment to making the digital world safer for everyone.
Key Takeaways:
- Manual and automated approaches should be combined
- Commit history analysis is where the most valuable secrets are found
- Enterprise-scale scanning requires careful planning and execution
- Responsible disclosure and ethical conduct are paramount
- Continuous learning and tool improvement are essential
By mastering code repository OSINT, you join a community of security professionals dedicated to protecting digital infrastructure and preventing the accidental exposure of sensitive information.