Code Repositories OSINT

Public code repos on GitHub, GitLab, and Bitbucket are treasure troves for security researchers. Developers accidentally commit sensitive data all the time: API keys, passwords, internal docs, infrastructure details. This guide shows you how to extract valuable intel from code repositories.

Real story: On a recent pentest, code repo analysis led me to AWS credentials that gave access to 47 S3 buckets with customer data, internal databases, and proprietary source code. Estimated value? Over $2 million in potential damages.

1. Introduction to Code Repository OSINT

Code repository OSINT means systematically searching through public code, commit history, and developer activity to find security-relevant info. It's a highly effective passive recon technique that often yields the most valuable findings.

Why It's a Goldmine

  • Secrets Exposure: The most common finding. Developers hardcode credentials for testing and forget to remove them. Recent studies show that 10% of all public repositories contain some form of sensitive information.
  • Infrastructure Mapping: Configuration files (docker-compose.yml, .tf files, k8s manifests) can reveal internal network architecture, hostnames, and technologies used.
  • Vulnerable Code: Finding outdated dependencies or custom code with obvious vulnerabilities that can be exploited.
  • Developer Profiling: Identifying developers and pivoting to their other public activities can reveal more information about internal processes and technologies.
  • Business Logic Understanding: Source code often reveals application workflows, authentication mechanisms, and data processing logic.

Statistical Insights

  • 73% of developers have committed secrets to version control at least once
  • AWS keys are the most commonly leaked secret (42% of findings)
  • 80% of leaked secrets remain active for more than 30 days
  • The average company has 12,000+ secrets exposed in public repositories

2. Core Methodology: The Hunt for Secrets

You need a structured approach or you'll get lost in all that code. Follow this systematic methodology to cover everything.

Phase 1: Target Identification

  1. Identify Target Repositories: Find the official GitHub/GitLab organization for your target company.
  2. Employee Repositories: Search for personal accounts of employees that may contain work-related code.
  3. Fork Analysis: Identify forks of the target's repositories that may contain older, more sensitive versions.
  4. Dependency Analysis: Find repositories that depend on the target's packages or libraries.

Phase 2: Broad Scanning

  1. Automated Scanning: Use specialized tools to scan repositories for high-entropy strings and patterns matching known secret formats.
  2. Keyword Searching: Perform broad searches for common secret patterns across all identified repositories.
  3. File Type Analysis: Focus on configuration files, environment files, and documentation.
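
The "high-entropy strings" idea from the automated scanning step can be sketched in a few lines. This is a minimal illustration, not how any particular scanner implements it: the token pattern and the 4.0 bits-per-character threshold are assumptions chosen for the example, and real tools tune both.

```python
import math
import re
from collections import Counter

# Candidate tokens: runs of base64/hex-style characters, 20 or more long.
# (Assumed pattern for illustration.)
TOKEN_RE = re.compile(r"[A-Za-z0-9+/=_\-]{20,}")


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of text in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def high_entropy_tokens(line: str, threshold: float = 4.0) -> list[str]:
    """Return tokens in line whose entropy meets the assumed threshold."""
    return [t for t in TOKEN_RE.findall(line) if shannon_entropy(t) >= threshold]
```

Repetitive strings such as long runs of one character score near zero, while random-looking keys score high. Entropy alone produces false positives (hashes, minified code), which is why scanners pair it with known secret formats.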

Phase 3: Deep Analysis

  1. Commit History Analysis: Secrets are often removed in a later commit, but they remain in the repository's history. This is the most critical step.
  2. Branch Analysis: Check other branches (especially dev, test, staging) that may contain different code.
  3. Pull Request Review: Analyze pull requests for comments, discussions, and code changes that may reveal sensitive information.

Phase 4: Correlation and Verification

  1. Cross-Repository Analysis: Correlate findings across multiple repositories.
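  2. Live Verification: Test discovered credentials and endpoints against live systems, but only when the engagement scope explicitly authorizes it. This step is active testing, not passive recon.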
  3. Impact Assessment: Evaluate the potential impact of each finding.
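
Cross-repository correlation can be as simple as grouping findings by a fingerprint of the secret. The sketch below assumes findings arrive as (repository, secret) pairs, which is a hypothetical input shape, and it hashes each secret so the correlation table never holds raw values.

```python
import hashlib
from collections import defaultdict


def fingerprint(secret: str) -> str:
    """Hash a secret so findings can be compared without storing the raw value."""
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()[:16]


def correlate(findings: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each secret fingerprint to the sorted repositories it appears in.

    findings is a list of (repository, secret) pairs. Only secrets seen in
    more than one repository are returned, since those are the reused ones.
    """
    repos_by_secret: dict[str, set[str]] = defaultdict(set)
    for repo, secret in findings:
        repos_by_secret[fingerprint(secret)].add(repo)
    return {fp: sorted(repos) for fp, repos in repos_by_secret.items() if len(repos) > 1}
```

A credential that shows up in several repositories usually points to a shared config template or a copy-pasted deployment script, which is worth calling out in the report.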

Phase 5: Reporting and Documentation

  1. Evidence Collection: Capture screenshots and code snippets for documentation.
  2. Risk Rating: Assign severity levels based on the sensitivity of the information.
  3. Recommendations: Provide actionable remediation advice.

3. GitHub Dorking: Advanced Manual Searching

GitHub's search is powerful but most people barely use it. Master these advanced techniques and you'll find way more.

Basic Dork Structure

Combine the organization/user filter with keywords and file filters.

# Basic structure
[scope]:[target] [keyword] [file-filter]

# Examples
org:exampleinc password filename:.env
user:johndoe "api_key" extension:json
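
If you generate many dorks, composing them in code keeps them consistent. A minimal sketch of the structure above follows; the function name and the decision to quote multi-word keywords are choices made for this example only.

```python
def build_dork(scope: str, target: str, keyword: str = "", **filters: str) -> str:
    """Compose a search query as [scope]:[target] [keyword] [file-filter]."""
    if scope not in {"org", "user", "repo"}:
        raise ValueError(f"unsupported scope: {scope}")
    parts = [f"{scope}:{target}"]
    if keyword:
        # Quote keywords containing whitespace so they match as a phrase.
        parts.append(f'"{keyword}"' if " " in keyword else keyword)
    # Each keyword argument becomes a name:value filter, in call order.
    parts.extend(f"{name}:{value}" for name, value in filters.items())
    return " ".join(parts)
```

For example, build_dork("org", "exampleinc", "password", filename=".env") reproduces the first example query above.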

Organization and User Targeting

# Search within specific organizations
org:netflix 
org:google
org:apple

# Search within user accounts
user:torvalds
user:defunkt
user:github

# Combine multiple targets
org:netflix OR org:netflix-cloud

# Exclude specific users or orgs
org:exampleinc -user:bot-account

File and Path Based Searching

# Specific filenames
filename:.env
filename:config.yml
filename:docker-compose.yml
filename:settings.py
filename:wp-config.php

# File extensions
extension:key
extension:pem
extension:ppk
extension:pfx
extension:sql

# Path based searching
path:/config/
path:/src/config/
path:/includes/
path:/secrets/

Content Based Searching

# Credentials and secrets
"password"
"secret"
"token"
"key"
"credential"
"auth"
"login"

# API related
"api_key"
"client_secret"
"access_token"
"bearer"
"oauth"

# Cloud providers
"aws_access_key"
"AZURE_CLIENT_SECRET"
"GCP_SERVICE_ACCOUNT"
"digitalocean_token"

# Database connections
"database_url"
"connection_string"
"DB_PASSWORD"
"mongodb://"

Advanced Search Filters Reference

Filter              Description                               Example
org:<name>          Search within an organization             org:netflix password
user:<name>         Search within a user account              user:torvalds kernel
repo:<user/repo>    Search a specific repository              repo:facebook/react
filename:<name>     Match a filename                          filename:.env
extension:<ext>     Match a file extension                    extension:json
path:<path>         Match a path                              path:/config/
language:<lang>     Filter by language                        language:python
size:<size>         Filter by file size in bytes              size:>1000
created:<date>      Repository creation date (repo search)    created:>2023-01-01
pushed:<date>       Date of last push (repo search)           pushed:<2022-01-01

Note: filename: and extension: belong to GitHub's legacy code search syntax. The current code search expresses both through path:, for example path:*.pem.

Combining Dorks for High-Impact Results

# Find AWS keys in Python files
org:exampleinc language:python "aws_access_key"

# Database credentials in config files
org:exampleinc path:/config/ "DB_PASSWORD"

# Private keys in any file
org:exampleinc "BEGIN RSA PRIVATE KEY"

# Recently pushed repositories that mention "password"
# (pushed: is a repository-search qualifier; it does not filter code results)
org:exampleinc password pushed:>2023-06-01

# Large configuration files
org:exampleinc filename:config size:>5000

# Environment files with secrets
org:exampleinc filename:.env "API_KEY"

# Docker files with hardcoded secrets
org:exampleinc filename:docker-compose "password"

# Kubernetes configurations
org:exampleinc filename:deployment.yaml "secret"

# Terraform files with credentials
org:exampleinc extension:tf "password"

Language-Specific Dorking

# JavaScript/Node.js
org:exampleinc language:javascript "process.env"
org:exampleinc language:javascript "require('dotenv')"

# Python
org:exampleinc language:python "os.environ"
org:exampleinc language:python "from dotenv import"

# Java
org:exampleinc language:java "System.getenv"
org:exampleinc language:java "Properties.load"

# PHP
org:exampleinc language:php "getenv"
org:exampleinc language:php "$_ENV"

# Ruby
org:exampleinc language:ruby "ENV["
org:exampleinc language:ruby "Figaro.load"

Advanced Search Techniques

Boolean Operators:

# AND operator (default)
org:exampleinc password token

# OR operator
org:exampleinc password OR token

# NOT operator
org:exampleinc password -filename:test

# Grouping with parentheses
org:exampleinc (password OR token) filename:.env

# Complex combinations
org:exampleinc (aws_key OR azure_key) -extension:md

Range Searching (created: and pushed: are repository-search qualifiers; size: also applies to legacy code search):

# Date ranges
org:exampleinc password pushed:2023-01-01..2023-06-30
org:exampleinc created:>2022-01-01

# Size ranges
org:exampleinc filename:config size:1000..5000
org:exampleinc size:>10000

Regular Expression Support: GitHub's legacy code search had no regex support, but the current code search in the web UI accepts regular expressions wrapped in forward slashes:

# Regex matching (web UI code search)
org:exampleinc /key[0-9]/
org:exampleinc /secret_[a-z]+/

Specialized Search Scenarios

CI/CD Configuration Files:

# GitHub Actions
org:exampleinc path:.github/workflows "secret"
org:exampleinc filename:github-actions.yml "AWS_"

# GitLab CI
org:exampleinc filename:.gitlab-ci.yml "variables"
org:exampleinc path:.gitlab "SECRET_"

# Jenkins
org:exampleinc filename:Jenkinsfile "withCredentials"
org:exampleinc filename:Jenkinsfile "usernamePassword"

# CircleCI
org:exampleinc filename:config.yml "environment"
org:exampleinc path:.circleci "AWS_"

# Travis CI
org:exampleinc filename:.travis.yml "secure"
org:exampleinc filename:.travis.yml "env"

# Azure DevOps
org:exampleinc filename:azure-pipelines.yml "secret"
org:exampleinc path:.azure "variables"

GitHub Search Limitations and Workarounds

Rate Limiting (REST API search endpoints):

  • Unauthenticated: 10 requests per minute
  • Authenticated: 30 requests per minute (the code search endpoint requires authentication and allows 10 per minute)
  • Pace large scans and back off when the API returns a rate-limit response

Search Result Limits:

  • Only the first 1,000 results are available
  • Use date ranges and other filters to narrow results
  • Break large searches into smaller, targeted queries
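
One way to stay under a per-minute limit is to space requests evenly instead of bursting. The sketch below is an illustration under assumptions: the class name is made up, and the clock and sleep functions are injectable only so the behavior can be checked without actually waiting.

```python
import time
from typing import Callable


class Throttle:
    """Space calls evenly so no more than max_per_minute happen per minute."""

    def __init__(self, max_per_minute: int,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Reserve the next slot one interval after this call's slot.
        self._next_allowed = max(now, self._next_allowed) + self.interval
        return delay
```

Call wait() before each search request. With a limit of 10 per minute, calls end up at least six seconds apart.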

API Access:

# Use GitHub API for programmatic access
curl -H "Authorization: token YOUR_TOKEN" \
  "https://api.github.com/search/code?q=org:exampleinc+password"

# Use the official GitHub CLI (-f adds the query parameter; -X GET keeps it a GET request)
gh api -X GET search/code -f q='org:exampleinc password' --jq '.items[].html_url'
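
When you script the API, it helps to build URLs with proper encoding and to know up front how many pages are reachable. A minimal sketch, assuming the 1,000-result cap described above and the API's standard q, per_page, and page parameters; the helper names are hypothetical.

```python
import math
from urllib.parse import urlencode

API_ROOT = "https://api.github.com"
MAX_RESULTS = 1000  # the search API exposes only the first 1,000 results


def code_search_url(query: str, page: int = 1, per_page: int = 100) -> str:
    """Build a URL-encoded code search request URL."""
    if not 1 <= per_page <= 100:
        raise ValueError("per_page must be between 1 and 100")
    params = {"q": query, "per_page": per_page, "page": page}
    return f"{API_ROOT}/search/code?{urlencode(params)}"


def pages_needed(total_count: int, per_page: int = 100) -> int:
    """Return how many pages can actually be fetched for a result count."""
    return math.ceil(min(total_count, MAX_RESULTS) / per_page)
```

If pages_needed hits its ceiling of ten pages, the query is too broad and should be split with extra filters.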

4. Automated Secret Scanning Tools

Manual searching is good, but automated tools are essential for comprehensive coverage and deep commit history analysis.

truffleHog

Scans git repositories for secrets using both regex patterns and entropy analysis. Now maintained by Truffle Security.

# Basic scan of a public repository
trufflehog git https://github.com/example/repo.git

# Report only verified (live) secrets
trufflehog git https://github.com/example/repo.git --only-verified

# Scan with custom detectors (TruffleHog v3 reads them from a YAML config file)
trufflehog git https://github.com/example/repo.git --config /path/to/config.yaml

# Output in JSON format
trufflehog git https://github.com/example/repo.git --json

# Skip live verification of candidate secrets (v3 flag names; confirm with trufflehog git --help)
trufflehog git https://github.com/example/repo.git --no-verification

# Scan specific branches
trufflehog git https://github.com/example/repo.git --branch develop

# Use multiple workers for faster scanning
trufflehog git https://github.com/example/repo.git --concurrency 4

gitleaks

A fast, reliable secret scanner that uses regex-based detection with comprehensive rule sets.

# Scan a local repository
gitleaks detect -s /path/to/repo -v

# Scan with specific config
gitleaks detect -s /path/to/repo -c gitleaks.toml

# Scan and output to file
gitleaks detect -s /path/to/repo --report-path findings.json

# Scan a remote repository: clone it first (gitleaks v8 scans local paths)
git clone https://github.com/example/repo.git
gitleaks detect -s repo -v

# Protect mode (pre-commit hook)
gitleaks protect -v

# Scan only recent commits
gitleaks detect -s /path/to/repo --log-opts="--since=2023-01-01"
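
Once a report file exists, a short script can turn it into a triage list. This sketch assumes the gitleaks v8 JSON report layout, an array of findings with RuleID and Commit fields; check the field names against a report from your installed version before relying on them.

```python
import json
from collections import Counter


def summarize_report(report_json: str) -> Counter:
    """Count findings per rule in a gitleaks JSON report (assumed v8 layout)."""
    findings = json.loads(report_json)
    return Counter(f.get("RuleID", "unknown") for f in findings)


def commits_to_review(report_json: str) -> list[str]:
    """Return the unique commit hashes that contain at least one finding."""
    findings = json.loads(report_json)
    return sorted({f["Commit"] for f in findings if f.get("Commit")})
```

Pass it the contents of findings.json; the counter shows which rule fires most, and the commit list tells you where to start reading diffs.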

git-secrets

AWS-developed tool that prevents committing secrets and can scan existing repositories.

# Install hooks in a repository
git secrets --install

# Add patterns to scan for
git secrets --add 'AWS_ACCESS_KEY_ID'

# Scan the entire history
git secrets --scan-history

# Scan specific files
git secrets --scan /path/to/file

# Register with AWS patterns
git secrets --register-aws

Other Notable Tools

shhgit:

# Real-time scanning of GitHub
shhgit -t YOUR_TOKEN -q "org:exampleinc"

# Specific file types
shhgit -t YOUR_TOKEN -q "org:exampleinc filename:.env"

repo-supervisor:

# Scan multiple repositories
repo-supervisor --org exampleinc --token YOUR_TOKEN

# Custom rules
repo-supervisor --rules /path/to/rules.json

ggshield (GitGuardian):

# Scan local repository
ggshield secret scan repo /path/to/repo

# Scan pre-commit
ggshield secret scan pre-commit

# Scan CI environment
ggshield secret scan ci

Custom Rule Development

Create custom detection rules for organization-specific patterns:

# truffleHog v3 custom detector example (YAML, passed with --config; check the schema against your installed version)
detectors:
  - name: Custom API Key
    keywords:
      - custom_api_key
      - internal_key
    regex:
      key: '[A-Z0-9]{32}'

# gitleaks config example
[[rules]]
description = "Custom API Key Pattern"
id = "custom-api-key"
regex = '''[A-Z0-9]{32}'''
keywords = ["custom_api_key", "internal_key"]
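
Before deploying a custom rule, it is worth trying the pattern against sample lines. The sketch below mimics the keyword-plus-regex idea from the rules above in plain Python. It is not how either scanner evaluates rules internally, and the word boundaries are an added assumption to stop the pattern matching inside longer strings.

```python
import re

KEYWORDS = ("custom_api_key", "internal_key")
# Same character class as the rule above, with assumed word boundaries added.
PATTERN = re.compile(r"\b[A-Z0-9]{32}\b")


def match_custom_key(line: str) -> list[str]:
    """Return 32-character candidates, but only on lines that mention a keyword."""
    lowered = line.lower()
    if not any(keyword in lowered for keyword in KEYWORDS):
        return []
    return PATTERN.findall(line)
```

Requiring a keyword on the same line cuts false positives sharply, because a bare 32-character uppercase pattern also matches many hashes and identifiers.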

5. Deep Commit History Analysis

This is where the most valuable secrets are found. Developers often commit secrets and then remove them, but the historical record remains.

Manual Git History Analysis

# Clone the target repository
git clone https://github.com/example/repo.git
cd repo

# View full commit history with diffs
git log -p

# Search for specific strings in commit history
git log -p -S"password"

# Search with regex patterns
git log -p -G"[A-Z0-9]{20}"

# Search in specific file types
git log -p -- "*.env"

# Limit to specific time range
git log -p --since="2023-01-01" --until="2023-06-30"

# Search in specific branches
git log -p origin/develop -S"secret"

# Show only commit messages containing keywords
git log --grep="password"

# Extract unique base64-like strings (20+ characters ending in "=") from all branches
git log -p --all | grep -Eo '[A-Za-z0-9+/]{20,}=' | sort -u

Automated History Scanning

# Scan entire history with gitleaks
gitleaks detect -s /path/to/repo --log-opts="--all"

# Use git-all-secrets wrapper
git-all-secrets --repo https://github.com/example/repo

# Manual script for comprehensive scanning
#!/bin/bash
REPO_PATH="/path/to/repo"
OUTPUT_FILE="secrets_scan.txt"

echo "Scanning commit history for secrets..."
git -C "$REPO_PATH" log -p --all > all_commits.diff

# Start from an empty results file so reruns do not accumulate duplicates
: > "$OUTPUT_FILE"

# Extract base64-like strings (a rough proxy for high entropy)
grep -E '[A-Za-z0-9+/]{20,}=' all_commits.diff >> "$OUTPUT_FILE"

# Look for common patterns, ignoring case
grep -iE '(password|secret|key|token|credential)' all_commits.diff >> "$OUTPUT_FILE"

echo "Scan complete. Results in $OUTPUT_FILE"
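
Plain grep loses track of which commit and file a match came from. A small parser over the git log -p output keeps that context. This is a sketch under assumptions: the secret pattern is deliberately simple, and only added lines are inspected, since every secret that was ever committed appears as an added line in the commit that introduced it.

```python
import re

# Assumed, deliberately simple pattern: name, then ":" or "=", then a value.
SECRET_RE = re.compile(r"(password|secret|token|api[_-]?key)\s*[:=]\s*\S+", re.IGNORECASE)


def secrets_in_patch(patch_text: str) -> list[tuple[str, str, str]]:
    """Return (commit, file, added line) for added lines that look like secrets."""
    commit, path, hits = "", "", []
    for line in patch_text.splitlines():
        if line.startswith("commit "):
            commit = line.split()[1]
        elif line.startswith("+++ "):
            # "+++ b/config.yml" names the file the following hunks modify.
            path = line[4:]
            if path.startswith("b/"):
                path = path[2:]
        elif line.startswith("+") and SECRET_RE.search(line):
            hits.append((commit, path, line[1:].strip()))
    return hits
```

Feed it the contents of all_commits.diff. Each hit points at the commit that introduced the value, even when a later commit removed it.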

Branch Analysis

# List all branches
git branch -a

# Checkout and scan specific branches
git checkout develop
gitleaks detect -v

# Scan all branches
for branch in $(git branch -r | grep -v HEAD); do
    echo "Scanning branch: $branch"
    git checkout "$branch"
    gitleaks detect -v --report-path "scan_${branch//\//_}.json"
done

Blame Analysis

# See who last modified each line containing a secret
git blame config/file.yml | grep -i password

# Annotate files with commit information
git annotate sensitive/file.txt

6. Advanced Techniques and Pivoting

Developer OSINT and Pivoting

# Get contributor information
git shortlog -sn --all

# Analyze specific developer's commits
git log --author="developer@example.com" -p

# Find developer email patterns
git log --pretty=format:"%ae" | sort -u

# Cross-reference with other platforms
# Search for developer usernames on:
# - LinkedIn
# - Twitter
# - Stack Overflow
# - Other code platforms
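
The email list from git log becomes more useful once it is grouped by domain: corporate domains confirm employees, personal domains suggest accounts worth pivoting to. A minimal sketch that takes the command's output as text; the function name is made up for this example.

```python
from collections import Counter


def email_domains(log_output: str) -> Counter:
    """Count unique author addresses per domain.

    log_output is the text produced by: git log --pretty=format:"%ae"
    """
    emails = {line.strip().lower() for line in log_output.splitlines() if "@" in line}
    return Counter(email.rsplit("@", 1)[1] for email in emails)
```

Addresses under users.noreply.github.com are GitHub's privacy aliases and reveal a username rather than a real mailbox.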

Gist Analysis

# Search user's gists
curl -H "Authorization: token YOUR_TOKEN" \
  "https://api.github.com/users/username/gists"

# Search gists by content (gist search supports user: but has no org: qualifier)
# Use gist search: https://gist.github.com/search?q=user:username+password

# Automated gist scanning
python3 gist_scanner.py --username target_developer
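
The gist_scanner.py script above is not a published tool, so here is a rough idea of what its triage step might look like. The sketch works on the JSON that the gists API call returns, relying on each gist's html_url and files fields. The filename patterns are assumptions, and fetching the content of flagged files is left out.

```python
import fnmatch
import json

# Assumed filename patterns worth a closer look.
INTERESTING = ("*.env", ".env*", "*.pem", "*.key", "*config*", "*.sql")


def interesting_gists(api_response: str) -> list[tuple[str, str]]:
    """Return (gist URL, filename) pairs for files matching INTERESTING."""
    hits = []
    for gist in json.loads(api_response):
        for filename in gist.get("files", {}):
            name = filename.lower()
            if any(fnmatch.fnmatchcase(name, pattern) for pattern in INTERESTING):
                hits.append((gist.get("html_url", ""), filename))
    return hits
```

Save the curl output to a file and pass its contents in; the result is a short list of gists to open by hand.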

Exposed .git Directories

# Check for exposed .git directories
curl -s http://example.com/.git/HEAD

# Use git-dumper to download entire repository
git-dumper http://example.com/.git/ ./output_directory

# Alternative tools
# - GitHacker
# - DVCS-Pillage
# - GitTools (gitdumper.sh, shown below)

# Reconstruct repository with GitTools' gitdumper.sh
./gitdumper.sh http://example.com/.git/ ./repo_output
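
Many servers answer every path with a 200 status and an error page, so the response body of the .git/HEAD check needs validating. A real HEAD file holds either a symbolic ref or a commit hash. The sketch below checks for both; it assumes SHA-1 repositories with 40 hex characters.

```python
import re

# A HEAD file is "ref: refs/heads/<branch>" or, when detached, a commit hash.
HEAD_RE = re.compile(r"ref: refs/\S+|[0-9a-f]{40}")


def looks_like_git_head(body: str) -> bool:
    """True if body resembles a .git/HEAD file rather than an error page."""
    return bool(HEAD_RE.fullmatch(body.strip()))
```

Only run a dumper against the target once this check passes, and only within authorized scope.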

Dependency Analysis

# Analyze package.json for dependencies
curl -s https://raw.githubusercontent.com/example/repo/main/package.json | jq '.dependencies'

# Check for vulnerable dependencies
npm audit --json
snyk test

# Analyze CI/CD dependencies
# Check Dockerfiles, requirements.txt, Gemfile, etc.

Infrastructure as Code Analysis

# Terraform files
find . -name "*.tf" -exec grep -l "password\|secret\|key" {} \;

# Kubernetes configurations
find . -name "*.yaml" -o -name "*.yml" | xargs grep -l "secret\|password"

# CloudFormation templates
find . -name "*.json" -o -name "*.yml" | xargs grep -l "AWS::"

# Check that the Terraform configuration is valid, then scan it for misconfigurations and hardcoded secrets
terraform validate
checkov -d /path/to/terraform
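
The find and grep commands above list matching files but not the matching lines. A Python equivalent that reports file, line number, and content is sketched below; the suffix list and the keyword pattern are assumptions to adjust per target.

```python
import re
from pathlib import Path

IAC_SUFFIXES = {".tf", ".yaml", ".yml", ".json"}
HINT_RE = re.compile(r"password|secret|access_key", re.IGNORECASE)


def scan_iac(root: str) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line) for suspicious IaC lines."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in IAC_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if HINT_RE.search(line):
                hits.append((str(path.relative_to(root)), number, line.strip()))
    return hits
```

Point it at a cloned repository's root. The line numbers make it easy to jump to each hit or cite it in a report.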

7. Enterprise-Scale Scanning

Organization-Wide Scanning

# Script to scan all organization repositories
#!/bin/bash
ORG="exampleinc"
TOKEN="YOUR_TOKEN"
OUTPUT_DIR="scan_results"

mkdir -p "$OUTPUT_DIR"

# Get repositories (first page only; follow the Link header for organizations with more than 100)
curl -s -H "Authorization: token $TOKEN" \
  "https://api.github.com/orgs/$ORG/repos?per_page=100" | \
  jq -r '.[].clone_url' > repos.txt

# Scan each repository
while IFS= read -r repo; do
    repo_name=$(basename "$repo" .git)
    echo "Scanning: $repo_name"

    # Clone with full history (a --depth 1 clone would hide secrets removed in later commits)
    git clone --quiet "$repo" "temp_$repo_name"
    gitleaks detect -s "temp_$repo_name" --report-path "$OUTPUT_DIR/${repo_name}_scan.json"

    # Cleanup
    rm -rf "temp_$repo_name"
done < repos.txt
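
The repository listing is paginated, and GitHub signals further pages through the Link response header. A minimal sketch of extracting the next page's URL from that header follows; the function name is made up, and the HTTP request itself is left to whatever client you use.

```python
import re
from typing import Optional

# Each entry in a Link header looks like: <URL>; rel="name"
LINK_RE = re.compile(r'<([^>]+)>;\s*rel="([^"]+)"')


def next_page_url(link_header: Optional[str]) -> Optional[str]:
    """Return the rel="next" URL from a Link header, or None on the last page."""
    for url, rel in LINK_RE.findall(link_header or ""):
        if rel == "next":
            return url
    return None
```

Keep requesting the returned URL until the function yields None. That covers organizations with more than 100 repositories.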

Continuous Monitoring

# Set up GitHub webhooks for real-time monitoring
# Monitor:
# - Push events
# - Repository creation
# - Fork events
# - Public status changes

# Use GitHub Actions for automated scanning
name: Secret Scanning
on: [push, pull_request]
jobs:
  gitleaks:
    runs-on: ubuntu-latest
    steps:
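    - uses: actions/checkout@v3
      with:
        fetch-depth: 0  # full history, so earlier commits are scanned too
    - uses: gitleaks/gitleaks-action@v2
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        GITLEAKS_CONFIG: .gitleaks.toml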

Integration with CI/CD Pipelines

# GitHub Actions example
name: Security Scan
on: [push, pull_request]
jobs:
  secret-scan:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
      with:
        fetch-depth: 0  # full history, so the base..head range can be resolved
    - name: Run truffleHog
      uses: trufflesecurity/trufflehog@main
      with:
        path: ./
        base: ${{ github.base_ref }}
        head: ${{ github.sha }}

8. Defensive Techniques and Prevention

Git Hooks for Prevention

# Pre-commit hook example: save as .git/hooks/pre-commit and make it executable (chmod +x)
#!/bin/bash

# Run secret scanning
if ! gitleaks protect --staged; then
    echo "Secrets detected in staged files. Commit rejected."
    exit 1
fi

Repository Configuration

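# .gitignore entries that keep environment files and key material out of commits
# (a .gitattributes clean filter has to write the file's contents to stdout,
# so a scanner such as gitleaks cannot act as one)
.env
*.env
*.pem
*.key

# Share the scanning hook with the whole team from a tracked directory
git config core.hooksPath .githooks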

Developer Education

  • Regular security training
  • Code review best practices
  • Use of environment management tools
  • Implementation of secret management systems

9. Real-World Case Studies

Case Study 1: Financial Institution Breach

Situation: A mid-sized bank had exposed AWS credentials in a public repository.

Discovery:

  • Developer committed config.yml with hardcoded AWS keys
  • Keys provided full administrative access to the AWS account
  • Found through automated GitHub scanning

Impact:

  • Access to customer financial data
  • Ability to modify banking transactions
  • Potential for complete system compromise

Resolution:

  • Immediate key rotation
  • Implementation of pre-commit hooks
  • Organization-wide secret scanning

Case Study 2: E-commerce Platform

Situation: Database credentials exposed in commit history.

Discovery:

  • Developer removed credentials in a later commit, but the history remained
  • Found through deep commit history analysis

Impact:

  • Access to 2.3 million customer records
  • Exposure of payment information
  • GDPR compliance violations

Resolution:

  • Database credential rotation
  • Implementation of git history rewriting
  • Enhanced monitoring and alerting

10. Quick Reference: High-Value Keywords

Credentials and Secrets

"password", "passwd", "secret", "token", "key", "credential", 
"auth", "authentication", "login", "cred", "certificate", 
"private_key", "public_key", "ssh_key", "rsa_key", "pem", 
"jwt", "bearer", "oauth", "api_key", "client_secret", 
"access_token", "refresh_token", "session_token"

Cloud Providers

# AWS
"aws_access_key", "aws_secret_key", "AWS_ACCESS_KEY_ID", 
"AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN", "AKIA[0-9A-Z]{16}",
"arn:aws:", "s3.amazonaws.com", "dynamodb.amazonaws.com"

# Azure
"AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET", "AZURE_TENANT_ID",
"azure.identity", "azure.keyvault", "azure.storage"

# Google Cloud
"GOOGLE_APPLICATION_CREDENTIALS", "GCP_SERVICE_ACCOUNT",
"google.cloud", "gcloud", "gs://", "storage.googleapis.com"

# DigitalOcean
"digitalocean_token", "DO_SPACES_KEY", "DO_SPACES_SECRET"

# Heroku
"HEROKU_API_KEY", "herokuapp.com"

# Other Cloud
"cloudflare", "fastly", "akamai", "linode", "vultr"

Database Connections

# Connection strings
"database_url", "connection_string", "DB_PASSWORD", "DB_USER",
"mongodb://", "mysql://", "postgresql://", "redis://",
"sqlserver://", "oracle://", "sqlite://"

# Specific databases
"MYSQL_ROOT_PASSWORD", "POSTGRES_PASSWORD", "REDIS_PASSWORD",
"MONGO_INITDB_ROOT_PASSWORD", "RDS_PASSWORD"

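Connection strings matter most when they embed a password. The sketch below uses the standard library's URL parser to pull out embedded credentials and to produce a redacted form for reports. The function names are hypothetical, and it assumes URL-style connection strings like the ones listed above.

```python
from typing import Optional
from urllib.parse import urlsplit


def embedded_credentials(url: str) -> Optional[tuple[str, str]]:
    """Return (username, password) if the connection URL embeds both."""
    parts = urlsplit(url)
    if parts.username and parts.password:
        return parts.username, parts.password
    return None


def redact(url: str) -> str:
    """Return the URL with any embedded password replaced by ***."""
    parts = urlsplit(url)
    if not parts.password:
        return url
    # Split on the last "@" so multi-host strings keep their host list intact.
    userinfo, _, host = parts.netloc.rpartition("@")
    user = userinfo.split(":", 1)[0]
    return parts._replace(netloc=f"{user}:***@{host}").geturl()
```

Use redact() on anything that goes into a report, and keep the raw value only as long as verification requires.
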
API Keys and Services

# Payment processors
"stripe", "paypal", "braintree", "square", "adyen"

# Email services
"sendgrid", "mailgun", "mailchimp", "postmark", "mandrill"

# SMS services
"twilio", "nexmo", "plivo", "messagebird"

# Social media
"facebook", "twitter", "instagram", "linkedin", "google_oauth"

# Mapping services
"google_maps", "mapbox", "here", "bing_maps"

# Analytics
"google_analytics", "mixpanel", "amplitude", "segment"

Internal Infrastructure

# Internal URLs and endpoints
"internal", "staging", "dev", "test", "qa", "preprod",
"corp.example.com", "vpn.example.com", "intranet"

# Network information
"192.168.", "10.", "172.16.", "localhost", "127.0.0.1"

# Service discovery
"consul", "etcd", "zookeeper", "eureka"

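Prefix keywords such as "10." or "172.16." are noisy and incomplete: "10." matches version numbers, and "172.16." misses the rest of the 172.16.0.0/12 range. A more precise approach, sketched here with the standard ipaddress module, extracts IPv4 candidates and keeps only the private ones. Note that the module also counts loopback and link-local addresses as private.

```python
import ipaddress
import re

IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def private_addresses(text: str) -> list[str]:
    """Return the unique private IPv4 addresses in text, in order of appearance."""
    found = []
    for candidate in IPV4_RE.findall(text):
        try:
            address = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # not a valid address, e.g. 999.1.1.1
        if address.is_private and candidate not in found:
            found.append(candidate)
    return found
```

Run it over config files or the commit diff to collect internal addresses for the infrastructure map.
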
File Patterns and Extensions

# Configuration files
".env", ".config", "config.yml", "settings.py", "configuration.json"

# Certificate files
".pem", ".key", ".crt", ".pfx", ".der", ".csr", ".jks"

# Database files
".sql", ".db", ".mdb", ".accdb", ".dbf"

# Archive files
".zip", ".rar", ".7z", ".tar", ".gz", ".backup", ".dump"

11. Legal and Ethical Considerations

Responsible Disclosure

  • Always follow responsible disclosure procedures
  • Report findings to the organization through proper channels
  • Do not access systems without explicit permission
  • Document findings thoroughly for remediation

Scope and Boundaries

  • Only search within publicly accessible repositories
  • Respect rate limits and terms of service
  • Avoid causing service disruptions
  • Do not exploit discovered vulnerabilities without permission

Data Handling

  • Handle sensitive information with care
  • Do not share or publish discovered secrets
  • Securely store any collected data
  • Follow data protection regulations (GDPR, CCPA, etc.)
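
One practical way to avoid sharing discovered secrets is to mask them before they reach a report or ticket. A minimal sketch follows; keeping the first four characters visible is an assumption made so the owner can still identify which key is meant.

```python
def mask_secret(secret: str, visible: int = 4) -> str:
    """Keep the first few characters for identification and mask the rest."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return secret[:visible] + "*" * (len(secret) - visible)
```

The masked value keeps its original length, which helps the recipient match it against their own records without the report itself becoming a leak.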

12. Tools and Resources

Essential Tools

  • truffleHog: Comprehensive secret scanning
  • gitleaks: Fast regex-based scanning
  • git-secrets: AWS-focused prevention and detection
  • shhgit: Real-time GitHub monitoring
  • git-dumper: .git directory recovery
  • ggshield: GitGuardian enterprise solution

Browser Extensions

  • GitHub Secret Scanner: Real-time browser scanning
  • GitGuardian: Enterprise browser integration
  • TruffleHog Browser Extension: Chrome/Firefox scanning

Online Resources

  • GitHub Advanced Search: https://github.com/search/advanced
  • Gist Search: https://gist.github.com/search
  • GitHub API Documentation: https://docs.github.com/en/rest
  • Secret Scanning Patterns: https://github.com/trufflesecurity/trufflehog

Training and Education

  • OWASP Secure Coding Practices
  • GitHub Security Lab training
  • SANS Secure Coding courses
  • Platform-specific security documentation

13. Best Practices Summary

For Security Researchers

  1. Start with manual reconnaissance to understand the target
  2. Use automated tools for comprehensive coverage
  3. Focus on commit history for the most valuable findings
  4. Verify findings before reporting
  5. Follow responsible disclosure procedures
  6. Document everything for clear reporting
  7. Respect scope and boundaries at all times

For Developers and Organizations

  1. Implement pre-commit hooks to prevent secret commits
  2. Use secret management systems (Vault, AWS Secrets Manager, etc.)
  3. Regularly scan repositories for accidental exposures
  4. Educate developers on secure coding practices
  5. Monitor public repositories for organizational leaks
  6. Have incident response plans for when leaks occur
  7. Rotate credentials immediately upon discovery

Continuous Improvement

  • Regularly update scanning rules and patterns
  • Stay current with new tools and techniques
  • Participate in security communities
  • Share knowledge and lessons learned
  • Contribute to open source security tools

14. Conclusion

Code repository OSINT is one of the most powerful techniques in modern security assessment. The accidental exposure of secrets in public code repositories represents a significant risk to organizations of all sizes. By following the systematic methodology outlined in this guide, security professionals can effectively identify and mitigate these risks.

Remember that the goal is not just to find vulnerabilities, but to help organizations improve their security posture. Always approach this work with professionalism, ethics, and a commitment to making the digital world safer for everyone.

Key Takeaways:

  • Manual and automated approaches should be combined
  • Commit history analysis is where the most valuable secrets are found
  • Enterprise-scale scanning requires careful planning and execution
  • Responsible disclosure and ethical conduct are paramount
  • Continuous learning and tool improvement are essential

By mastering code repository OSINT, you join a community of security professionals dedicated to protecting digital infrastructure and preventing the accidental exposure of sensitive information.