Regex

Regex Diagram

Part 1: Regex Fundamentals

1. Why Regex? What It Is and Isn't

Regex finds patterns in text. Think of it as a scalpel, not a chainsaw.

Use it for: slicing logs, validating input, extracting data, quick filters. Don't use it for: parsing complex context-free or nested formats; use a proper parser for those.

Quick joke: I told my code a regex joke. It was too greedy and took the punchline.

2. Engines at a Glance (DFA vs NFA, Backtracking)

  • DFA (deterministic finite automaton): Typically linear time, no backreferences, no backtracking (e.g., RE2). Safe for performance-critical contexts.
  • NFA backtracking engines (PCRE, Perl, Oniguruma, PCRE2): Powerful features (lookbehind, backreferences, recursion) but risk of catastrophic backtracking.

3. Character Classes and Escapes

  • Dot: . matches any character except newline (engine dependent)
  • Sets: [abc], ranges: [a-z], negation: [^...]
  • Shortcuts: \d (digit), \D (non-digit), \w (word), \W (non-word), \s (space), \S (non-space)
  • Escape metacharacters when used literally: . ^ $ * + ? ( ) [ ] { } | \

Example:

\w+@\w+\.\w+
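
The example pattern can be tried out as follows. This is a minimal sketch using Python's re module; the sample text and the name EMAIL_LIKE are made up for illustration.

```python
import re

# The example pattern: word chars, "@", word chars, a literal dot, word chars.
EMAIL_LIKE = re.compile(r"\w+@\w+\.\w+")

# Hypothetical sample text.
sample = "Contact alice@example.com or bob@test.org, not carol@localhost."

# findall returns every non-overlapping match as a string.
matches = EMAIL_LIKE.findall(sample)
print(matches)
```

Because \w excludes "-" and ".", this pattern is only a rough filter; addresses with dots or hyphens in the local part or domain are matched partially at best.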

4. Literals, Metacharacters, and Escaping

  • Escape to match literally: e.g., match a dot in a domain: example\.com
  • Raw strings in some languages reduce double escaping (e.g., Python r"...", Rust r"..."), but check language docs.
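
A short sketch of why the escaped dot matters, again assuming Python's re module. The variable names are illustrative only.

```python
import re

# An unescaped dot matches any character, so "exampleXcom" slips through.
loose = re.compile(r"example.com")
# An escaped dot matches only a literal ".".
strict = re.compile(r"example\.com")

loose_hit = loose.fullmatch("exampleXcom") is not None
strict_hit = strict.fullmatch("exampleXcom") is not None

# re.escape turns arbitrary text into a pattern that matches it literally.
escaped = re.escape("a.b*c")
print(loose_hit, strict_hit, escaped)
```

The raw string prefix (r"...") keeps the backslash intact, so the regex engine receives \. rather than whatever the string parser would make of it.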

5. Flags/Modifiers Across Engines

  • Case-insensitive: i
  • Multiline ^ and $: m
  • Dotall (dot matches newline): s
  • Unicode: u
  • Global (find all): g (JS)
  • Extended/comments: x (ignore whitespace, allow comments)

Example inline modifiers (PCRE):

(?i)case insensitive
(?m)multiline anchors
(?s)dotall
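
The same modifiers exist in Python's re module, both as flag arguments and inline. The sketch below uses an invented three-line string to show their effect.

```python
import re

# Hypothetical multi-line input.
text = "Error: disk full\nerror: retry\nOK"

# A flag argument and an inline modifier are two ways to say the same thing.
by_flag = re.findall(r"^error", text, re.IGNORECASE | re.MULTILINE)
by_inline = re.findall(r"(?im)^error", text)

# Without dotall the dot stops at a newline; with (?s) it crosses lines.
no_dotall = re.search(r"Error.*retry", text)
with_dotall = re.search(r"(?s)Error.*retry", text)
print(by_flag, by_inline, no_dotall, with_dotall is not None)
```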

Part 2: Matching Anatomy

6. Anchors and Boundaries

  • ^ start of string/line (with m)
  • $ end of string/line (with m)
  • \b word boundary, \B non-word boundary
  • \A beginning of string, \Z end of string (some engines)
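
A small sketch of how these anchors differ in Python's re module (the input string is invented for the example):

```python
import re

# Hypothetical two-line input.
text = "cat one\ncat two concat"

# With the multiline flag, ^ matches at the start of every line.
line_starts = re.findall(r"^cat", text, re.MULTILINE)
# \A matches only at the very start of the string, regardless of flags.
string_start = re.findall(r"\Acat", text, re.MULTILINE)
# \b requires a word boundary on both sides, so "concat" is skipped.
whole_words = re.findall(r"\bcat\b", text)
print(line_starts, string_start, whole_words)
```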

7. Groups: Capturing, Non Capturing, Named

  • Capturing: ( ... )
  • Non capturing: (?: ... ) for performance/clarity when capture isn’t needed
  • Named capture (engine dependent): (?<name> ... ), (?P<name> ... ), or (?'name' ...)
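
The three group types can be combined in one pattern. The sketch below uses Python's (?P<name>...) spelling; the date format and sample string are assumptions chosen for illustration.

```python
import re

# Three named captures plus one non-capturing group for an optional time part.
DATE = re.compile(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})(?:T\d{2}:\d{2})?"
)

m = DATE.search("backup taken 2024-03-15T10:30 ok")
parts = m.groupdict()   # named groups as a dict
whole = m.group(0)      # the entire match, including the time part
count = len(m.groups()) # the non-capturing group adds no entry
print(parts, whole, count)
```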

8. Alternation

  • a|b|c, order matters in backtracking engines. Put more specific/likely alternatives first to reduce backtracking.

9. Quantifiers (Greedy, Lazy, Possessive)

  • Greedy: *, +, ?, {n}, {n,}, {n,m}
  • Lazy: *?, +?, ??, {n,m}?
  • Possessive (PCRE, Java, etc.): *+, ++, ?+, {n,m}+ (no backtracking once matched)
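
Greedy and lazy quantifiers can give very different results on the same input. A minimal Python sketch with an invented HTML-like string:

```python
import re

# Hypothetical input with two tagged spans.
html = "<b>one</b> and <b>two</b>"

# Greedy .* runs to the last </b>, swallowing both spans in one match.
greedy = re.findall(r"<b>.*</b>", html)
# Lazy .*? stops at the first </b> it can, giving one match per span.
lazy = re.findall(r"<b>.*?</b>", html)
print(greedy, lazy)
```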

10. Lookarounds (Lookahead/Lookbehind)

  • Positive lookahead: (?=...)
  • Negative lookahead: (?!...)
  • Positive lookbehind: (?<=...)
  • Negative lookbehind: (?<!...)

Use cases: assert context without consuming it (e.g., capture a token only when followed by a known suffix).
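
A sketch of three of the four assertions in Python. The key=value input is made up, and the patterns are illustrations rather than recommended parsers.

```python
import re

# Hypothetical input.
text = "id=42 cost=$99 qty=7"

# Positive lookahead: words that are followed by "=", without consuming it.
keys = re.findall(r"\w+(?==)", text)
# Positive lookbehind: digits that are preceded by a dollar sign.
dollars = re.findall(r"(?<=\$)\d+", text)
# Negative lookbehind: digit runs not preceded by "$" or another digit.
plain = re.findall(r"(?<![$\d])\d+", text)
print(keys, dollars, plain)
```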


Part 3: Unicode, Locales, and Character Properties

11. Unicode Basics and Case Folding

  • Use Unicode-aware engines where needed (u flag). Beware of case-insensitive matching across locales.
  • Case folding can change string length, so matches may be longer or shorter than expected (e.g., ß folds to ss).
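
The ß example can be checked directly. This sketch assumes Python, whose str.casefold applies full Unicode case folding while the re module's IGNORECASE flag does not expand ß to ss.

```python
import re

a, b = "Straße", "STRASSE"

# Lowercasing keeps ß, so the two strings still differ.
lower_equal = a.lower() == b.lower()
# Full case folding maps ß to "ss", so the strings compare equal.
folded_equal = a.casefold() == b.casefold()
# Python's re does simple case matching only, so this does not match.
regex_ci = re.fullmatch(b, a, re.IGNORECASE)
print(lower_equal, folded_equal, regex_ci)
```

Other engines behave differently here, so it is worth testing the specific engine in use rather than assuming either behavior.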

12. Property Classes (General Category, Scripts)

  • Example (PCRE/Oniguruma): \p{L} letters, \p{N} numbers, \p{Han} Han ideographs
  • Negation: \P{...}

13. Normalization and Security Implications

  • NFC/NFD: canonical equivalence; diacritics may be decomposed
  • Security: Mixed script identifiers, homoglyph attacks. Normalize and restrict scripts where appropriate.
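
A minimal sketch of why normalization should come before matching, using Python's unicodedata module. The two strings render identically but differ in code points.

```python
import re
import unicodedata

composed = "caf\u00e9"     # "é" as a single code point (NFC form)
decomposed = "cafe\u0301"  # "e" plus a combining acute accent (NFD form)

# The raw strings differ even though they look the same on screen.
raw_equal = composed == decomposed
# After normalizing to NFC they compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed

# In Python's re, \w does not match the combining mark, so the
# decomposed form fails a pattern that the composed form passes.
word_composed = re.fullmatch(r"\w+", composed) is not None
word_decomposed = re.fullmatch(r"\w+", decomposed) is not None
print(raw_equal, nfc_equal, word_composed, word_decomposed)
```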

Part 4: Practical Security Patterns

14. Networking: IP, MAC, CIDR, URLs

  • IPv4 (lenient):
    \b(?:\d{1,3}\.){3}\d{1,3}\b
    
  • IPv4 (0-255 strict):
    \b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b
    
  • IPv6 (compressed allowed) is complex; prefer libraries. Minimal presence check:
    \b[0-9A-Fa-f:]{2,}\b
    
  • MAC:
    \b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b
    
  • URL (simplified):
    https?:\/\/(?:[\w.-]+)(?::\d+)?(?:\/[\w\-./?%&=+#]*)?
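
The strict IPv4 pattern and the "prefer libraries" advice for IPv6 can be sketched together in Python. The helper names is_ipv4 and is_ipv6 are hypothetical; ipaddress is the standard-library module.

```python
import ipaddress
import re

# The strict 0-255 IPv4 pattern from the list above.
STRICT_IPV4 = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)


def is_ipv4(text: str) -> bool:
    """Return True when the whole string is a dotted quad in range."""
    return STRICT_IPV4.fullmatch(text) is not None


def is_ipv6(text: str) -> bool:
    """Let the standard library decide instead of a hand-written regex."""
    try:
        ipaddress.IPv6Address(text)
        return True
    except ValueError:
        return False


print(is_ipv4("192.168.1.1"), is_ipv4("256.1.1.1"))
print(is_ipv6("::1"), is_ipv6("dead:beef"))
```

Note that "dead:beef" passes the minimal IPv6 presence check but is not a valid address, which is the reason to confirm candidates with a library.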
    

Linux joke: There’s no place like 127.0.0.1. And nothing like ::1 when you’re feeling extra modern.

15. Files and Paths (Windows, POSIX)

  • POSIX-ish path:
    \/(?:[\w._-]+\/?)*
    
  • Windows drive path:
    [A-Za-z]:\\(?:[^\\/:*?"<>|\r\n]+\\?)*
    
  • Prevent path traversal (detecting ../):
    (?:^|\/|\\)\.\.(?:\/|\\|$)
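
A quick Python sketch of the traversal check. The helper name has_traversal and the sample paths are illustrative; encoded variants such as %2e%2e are not covered and need decoding first.

```python
import re

# ".." as a whole path segment, delimited by a slash, backslash,
# or the start or end of the string.
TRAVERSAL = re.compile(r"(?:^|/|\\)\.\.(?:/|\\|$)")


def has_traversal(path: str) -> bool:
    return TRAVERSAL.search(path) is not None


for p in ["../etc/passwd", "static/../../secret", "..\\windows",
          "dir/..", "notes..txt", "a/..b/c"]:
    print(p, has_traversal(p))
```

This only detects the pattern; resolving the path and checking it against an allowed base directory is generally a stronger defense.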
    

16. Emails, Domains, TLDs

  • Email (practical; not RFC-complete):
    [\w.!#$%&'*+\/=?^`{|}~-]+@[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?(?:\.[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?)+
    
  • Domain (basic):
    (?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,}
    

17. Credentials and Secrets: High Entropy, Key Prefixes

  • AWS Access Key ID:
    \bA(KIA|SIA)[A-Z0-9]{16}\b
    
  • GitHub token:
    \bghp_[A-Za-z0-9]{36}\b
    
  • Google API key:
    \bAIza[0-9A-Za-z\-_]{35}\b
    
  • Generic high-entropy token:
    \b[A-Za-z0-9_\-]{32,}\b
    

Warning: High false positives possible. Always verify and consider context.
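
One way to cut false positives from the generic token pattern is to score candidates by Shannon entropy. The sketch below is an illustration only: the helper names and the 4.0 bits-per-character threshold are assumptions to be tuned against real data.

```python
import math
import re
from collections import Counter

# The generic token pattern from the list above.
TOKEN = re.compile(r"\b[A-Za-z0-9_\-]{32,}\b")


def shannon_entropy(s: str) -> float:
    """Bits per character, based on character frequencies in s."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def likely_secrets(text: str, min_entropy: float = 4.0) -> list:
    """Keep only candidates whose entropy meets the (assumed) threshold."""
    return [t for t in TOKEN.findall(text) if shannon_entropy(t) >= min_entropy]


# Made-up token with 32 distinct characters, not a real credential.
fake_token = "q7Zp2LmX9aTfK4wRbV8cJ1nYdH6sGe3U"
print(likely_secrets("pad=" + "a" * 40 + " key=" + fake_token))
```

The run of forty "a" characters matches the regex but has zero entropy, so it is filtered out, while the varied token is kept.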

18. Logs and SIEM: Web, Auth, Cloud, Databases

  • Apache/Nginx common line parse (named groups; PCRE/Python style):
    (?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)
    
  • SSH auth failure:
    Failed password for (?:invalid user )?(?P<user>\S+) from (?P<ip>\S+) port (?P<port>\d+)
    
  • CloudTrail eventName:
    "eventName"\s*:\s*"(?P<event>[A-Za-z0-9_]+)"
    

19. Data Formats: JSON, XML, CSV, JWTs

  • JSON key-value (simple):
    "(?P<key>[^"]+)"\s*:\s*"(?P<val>[^"]*)"
    
  • XML tag content:
    <(?P<tag>[A-Za-z_:][\w:.-]*)[^>]*>(?P<body>.*?)<\/\1>
    
  • JWT (header.payload.signature):
    \b[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\b
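
The JWT pattern also matches ordinary dotted names such as hostnames, so a follow-up check helps. The sketch below assumes a candidate is plausible only when its first segment decodes to a JSON object with an "alg" key; decode_segment and find_jwts are hypothetical helper names.

```python
import base64
import json
import re

# The three-segment pattern from the list above.
JWT = re.compile(r"\b[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\b")


def decode_segment(segment: str):
    """Base64url-decode a segment (restoring padding) and parse it as JSON."""
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def find_jwts(text: str) -> list:
    """Return regex candidates whose header decodes to JSON with 'alg'."""
    found = []
    for candidate in JWT.findall(text):
        try:
            header = decode_segment(candidate.split(".")[0])
        except ValueError:  # bad base64, bad UTF-8, or bad JSON
            continue
        if isinstance(header, dict) and "alg" in header:
            found.append(candidate)
    return found
```

This confirms structure only; it does not verify the signature, which requires the key and a JWT library.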
    

Part 5: Validation Do’s and Don’ts

20. Input Validation Strategy

  • Prefer allowlists. Define what is valid, not what is forbidden.
  • Anchor patterns: ^...$ to ensure full string validation.
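
A short Python sketch of what anchoring changes. The username rule is an invented allowlist; note that in Python "$" also matches just before a trailing newline, so fullmatch (or \Z) is the stricter choice.

```python
import re

# Hypothetical allowlist: 3 to 16 lowercase letters, digits, or underscores.
USERNAME = re.compile(r"[a-z0-9_]{3,16}")

hostile = "bob; rm -rf /"

# search finds "bob" inside the hostile string, so it is not validation.
unanchored = USERNAME.search(hostile) is not None
# fullmatch requires the entire string to fit the pattern.
anchored = USERNAME.fullmatch(hostile) is not None

# "$" tolerates one trailing newline; fullmatch does not.
dollar_newline = re.match(r"^[a-z0-9_]{3,16}$", "bob\n") is not None
strict_newline = USERNAME.fullmatch("bob\n") is not None
print(unanchored, anchored, dollar_newline, strict_newline)
```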

21. Allowlists vs Denylists

  • Denylists are bypass-friendly; allowlists constrain inputs to expected forms.

22. Canonicalization and Double Decode Pitfalls

  • Normalize encodings first: URL decode, HTML decode, Unicode normalize as needed, then validate.
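
The double-decode pitfall can be sketched as follows. This assumes URL encoding only and uses urllib.parse.unquote; the helper fully_decode and its round limit are hypothetical choices.

```python
import re
from urllib.parse import unquote

# Traversal check: ".." as a whole path segment.
TRAVERSAL = re.compile(r"(?:^|/|\\)\.\.(?:/|\\|$)")


def fully_decode(value: str, max_rounds: int = 5) -> str:
    """URL-decode repeatedly until the value stops changing."""
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            return value
        value = decoded
    # Still changing after max_rounds: treat as suspicious input.
    raise ValueError("too many encoding layers")


raw = "%252e%252e%252fetc"  # "../etc" URL-encoded twice
print(TRAVERSAL.search(raw) is not None)                # check on raw input
print(TRAVERSAL.search(fully_decode(raw)) is not None)  # check after decoding
```

Whether to decode to a fixed point or reject multiply-encoded input outright is a policy decision; either way, validation should see the same form that downstream code will act on.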

Dev aside: If your regex looks like a firewall rule and your firewall rule looks like a regex, step away from the keyboard and write a parser.


Part 6: ReDoS and Performance

23. Catastrophic Backtracking

  • Vulnerable: (a+)+ against aaaa...ab can explode runtime.
  • Mitigate by using atomic groups ((?>...)), using possessive quantifiers (a++), or refactoring patterns to avoid nested, overlapping quantifiers.
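
The refactoring option can be illustrated in Python. The sketch compares a nested-quantifier pattern with a flat equivalent on short inputs only, since running the nested form on a long failing input is exactly the blow-up being described.

```python
import re

# Nested quantifiers: many ways to split a run of "a", all tried on failure.
NESTED = re.compile(r"(?:a+)+b")
# Same language with a single quantifier: one way to split, no blow-up.
FLAT = re.compile(r"a+b")

# Short samples only, so the nested pattern stays cheap here.
samples = ["ab", "aaab", "b", "aaa", "aaac", "xaab"]
agree = all(
    (NESTED.fullmatch(s) is None) == (FLAT.fullmatch(s) is None)
    for s in samples
)

# The flat pattern handles a long failing input with linear work.
long_fail = FLAT.fullmatch("a" * 5000 + "c")
print(agree, long_fail)
```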

24. Possessive Quantifiers and Atomic Groups

  • Possessive: *+, ++, ?+, {n,m}+
  • Atomic: (?>...)
  • Both reduce backtracking in engines that support them.

25. Linear Time Patterns and RE2

  • Consider RE2/Go’s regexp for user-controlled inputs; they reject constructs that cause exponential time.

26. Performance Budgeting and Profiling

  • Time-box regex runs if your language allows it.
  • Benchmark with realistic inputs and pathological cases.

Linux joke: My regex was so slow, the cron job finished before the first match.


Part 7: Tooling and Language Ecosystem

27. grep/egrep/fgrep/ripgrep/sed/awk

  • grep: basic regex; use -E for extended. ripgrep (rg) is fast and modern.
  • Examples:
    # Find IPs
    rg -o "\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b" logs/
    
    # Extract emails
    grep -Eo '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]+' -r src/
    
    # Obfuscate emails (sed)
    sed -E 's/([[:alnum:]._%+-]+)@([[:alnum:].-]+)/[REDACTED]@\2/g' input.txt > output.txt
    

28. Python, JavaScript, Go, Java, C#, PHP, Rust

  • Python (re):
    import re
    p = re.compile(r"(?P<ip>\S+) \[(?P<ts>.*?)\] \"(?P<m>\S+) (?P<path>\S+) \S+\" (?P<st>\d{3}) (?P<sz>\d+)")
    
  • JS:
    const re = /https?:\/\/[\w.-]+/gi;
    
  • Go (RE2):
    re := regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`)
    
  • Java (Pattern):
    Pattern p = Pattern.compile("(?i)user: (\\w+)");
    
  • C#:
    var re = new Regex(@"(?<=^|\s)user=(\w+)", RegexOptions.Compiled);
    
  • PHP:
    if (preg_match('/^USER:(\w+)$/i', $s, $m)) { $user = $m[1]; }
    
  • Rust (regex crate):
    let re = Regex::new(r"(?m)^ERROR: (.+)$").unwrap();
    

29. Editors and IDEs (VS Code, Vim/Neovim, JetBrains)

  • VS Code search: toggle regex mode to use a PCRE-like syntax. Beware escaping.
  • Vim: different flavor; \v for very magic to reduce escaping.

30. Testing and Visualization

  • regex101: PCRE/Python/JS flavors, explains tokens
  • regexper: visualizes a pattern as a railroad diagram
  • debuggex: diagram generator

Tip: Save test strings that triggered corner cases; they are your fuzz corpus.


Part 8: Cookbook: Copy-Paste Ready Patterns

31. Web and APIs

  • Query parameters (key=value pairs):
    ([?&])(?P<k>[A-Za-z0-9_]+)=(?P<v>[^&#]*)
    
  • Bearer token header:
    ^Authorization:\s*Bearer\s+([A-Za-z0-9._-]+)$
    

32. Authentication and Access Logs

  • Brute force detection (naive windowing done outside regex):
    Failed password for (?:invalid user )?\S+ from (?P<ip>\S+)
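
The counting that happens outside the regex might look like the sketch below. It tallies failures per source IP without any time window; the function name, the threshold of 3, and the sample log lines are assumptions for illustration.

```python
import re
from collections import Counter

# The failure pattern from above, capturing the source IP.
FAILED = re.compile(r"Failed password for (?:invalid user )?\S+ from (?P<ip>\S+)")


def suspicious_ips(lines, threshold=3):
    """Return {ip: failures} for IPs at or above the (assumed) threshold."""
    counts = Counter(
        m.group("ip") for line in lines if (m := FAILED.search(line))
    )
    return {ip: n for ip, n in counts.items() if n >= threshold}


# Invented log lines using documentation-range addresses.
log = [
    "Failed password for root from 203.0.113.5 port 22 ssh2",
    "Failed password for root from 203.0.113.5 port 22 ssh2",
    "Failed password for invalid user admin from 198.51.100.7 port 2222 ssh2",
    "Failed password for root from 203.0.113.5 port 22 ssh2",
    "Accepted password for bob from 203.0.113.5 port 22 ssh2",
]
print(suspicious_ips(log))
```

A real detector would also parse timestamps and count within a sliding window, which is left out here.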
    

33. Cloud and DevOps

  • AWS ARN:
    arn:aws:[A-Za-z0-9-]+:[a-z0-9-]*:\d{12}:.+
    
  • Kubernetes pod names:
    [a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*
    

34. Database and Data Leakage

  • SQL credentials strings (heuristic):
    \b(user(name)?|uid)\s*=\s*\w+;\s*pass(word)?\s*=\s*[^;\s]+\b
    

35. Misc Utilities

  • Version (semver-ish):
    \b\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?\b
    
  • UUID v4:
    \b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-4[0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b
    

Dev joke: I wrote a regex to detect my productivity. It matched zero or more tasks.


Part 9: WAF, DLP, and SIEM Considerations

36. WAF Rule Patterns

  • Detect ../../ traversal (encoded variants handled upstream):
    \.{2}\/(?:\.{2}\/)+
    
  • Script tag injection (simplified; avoid false positives):
    <\s*script\b[^>]*>.*?<\s*\/\s*script\s*>
    

37. DLP Policies

  • Look for private key blocks:
    -----BEGIN (?:RSA|OPENSSH|EC) PRIVATE KEY-----
    
  • Credit card (Luhn verification outside regex):
    \b(?:4\d{12}(?:\d{3})?|5[1-5]\d{14}|3[47]\d{13}|3(?:0[0-5]|[68]\d)\d{11}|6(?:011|5\d{2})\d{12}|(?:2131|1800|35\d{3})\d{11})\b
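
The Luhn verification mentioned above could be done along these lines. This is a sketch in Python; luhn_ok and find_cards are hypothetical names, and the sample numbers are a widely used test number and a deliberately broken variant.

```python
import re

# The card-number pattern from the list above.
CARD = re.compile(
    r"\b(?:4\d{12}(?:\d{3})?|5[1-5]\d{14}|3[47]\d{13}|3(?:0[0-5]|[68]\d)\d{11}"
    r"|6(?:011|5\d{2})\d{12}|(?:2131|1800|35\d{3})\d{11})\b"
)


def luhn_ok(number: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return bool(digits) and total % 10 == 0


def find_cards(text: str) -> list:
    """Regex finds candidates; Luhn removes most random digit runs."""
    return [c for c in CARD.findall(text) if luhn_ok(c)]


print(find_cards("card 4111111111111111 ref 4111111111111112"))
```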
    

38. SIEM Parsers

  • Key-value logs:
    \b(?P<k>[A-Za-z_][A-Za-z0-9_]*)=(?P<v>"[^"]*"|\S+)
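
Applied in Python, the key-value pattern can turn a log line into a dictionary. The helper name parse_kv and the sample line are invented for the example.

```python
import re

# Key, "=", then either a double-quoted value or a run of non-space chars.
KV = re.compile(r'\b(?P<k>[A-Za-z_][A-Za-z0-9_]*)=(?P<v>"[^"]*"|\S+)')


def parse_kv(line: str) -> dict:
    """Collect key=value pairs, stripping surrounding quotes from values."""
    return {m.group("k"): m.group("v").strip('"') for m in KV.finditer(line)}


print(parse_kv('user=alice action="file read" status=200'))
```

Values containing escaped quotes are not handled by this pattern; logs that use them need a more careful tokenizer.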
    

Part 10: Troubleshooting and FAQ

39. Common Pitfalls

  • Forgetting to anchor, causing partial matches
  • Over-escaping or under-escaping (language string vs regex engine)
  • Relying on regex for complex grammars

40. Debugging Checklist

  • Reproduce with smallest test string
  • Use online testers to see backtracking steps
  • Add anchors and explicit character classes
  • Replace nested quantifiers with atomic or possessive variants

41. Glossary

  • Anchor: Position matcher like ^ or $
  • Character class: Set or range of chars [A-Z]
  • Backreference: \1 refers to first capture
  • Lookaround: Zero-width context assertions

Appendix

A. Reference Tables

  • Shorthand classes: \d, \D, \w, \W, \s, \S
  • Word boundary: \b, \B
  • Anchors: ^, $, \A, \Z

B. Engine Differences

  • PCRE/PCRE2: rich features, backtracking
  • RE2/Go: linear time, no backreferences/lookbehind
  • Java: supports possessive quantifiers and atomic groups
  • .NET: similar to PCRE with some differences

C. Humor and Easter Eggs

  • Linux joke: chmod -x life; still executable.
  • Networking: I would tell you a UDP joke, but you might not get it.
  • Dev: There are two hard problems in computer science: cache invalidation, naming things, and off-by-one errors.

D. Further Reading

  • Regular-Expressions.info: https://www.regular-expressions.info/
  • RE2: https://github.com/google/re2
  • PCRE2: https://www.pcre.org/
  • OWASP Input Validation Cheat Sheet: https://cheatsheetseries.owasp.org/
  • RFC 3986 (URI): https://www.rfc-editor.org/rfc/rfc3986

Images: NFA diagram from Wikimedia Commons. Ensure your site permits external images or mirror locally if needed.