Regex¶

Table of Contents¶
- 1. Regex Fundamentals
  - 1. Why Regex? What It Is and Isn’t
  - 2. Engines at a Glance (DFA vs NFA, Backtracking)
  - 3. Character Classes and Escapes
  - 4. Literals, Metacharacters, and Escaping
  - 5. Flags/Modifiers Across Engines
- 2. Matching Anatomy
  - 6. Anchors and Boundaries
  - 7. Groups: Capturing, Non-Capturing, Named
  - 8. Alternation
  - 9. Quantifiers (Greedy, Lazy, Possessive)
  - 10. Lookarounds (Lookahead/Lookbehind)
- 3. Unicode, Locales, and Character Properties
  - 11. Unicode Basics and Case Folding
  - 12. Property Classes (General Category, Scripts)
  - 13. Normalization and Security Implications
- 4. Practical Security Patterns
  - 14. Networking: IP, MAC, CIDR, URLs
  - 15. Files and Paths (Windows, POSIX)
  - 16. Emails, Domains, TLDs
  - 17. Credentials and Secrets: High Entropy, Key Prefixes
  - 18. Logs and SIEM: Web, Auth, Cloud, Databases
  - 19. Data Formats: JSON, XML, CSV, JWTs
- 5. Validation Do’s and Don’ts
  - 20. Input Validation Strategy
  - 21. Allowlists vs Denylists
  - 22. Canonicalization and Double-Decode Pitfalls
- 6. ReDoS and Performance
  - 23. Catastrophic Backtracking
  - 24. Possessive Quantifiers and Atomic Groups
  - 25. Linear-Time Patterns and RE2
  - 26. Performance Budgeting and Profiling
- 7. Tooling and Language Ecosystem
  - 27. grep/egrep/fgrep/ripgrep/sed/awk
  - 28. Python, JavaScript, Go, Java, C#, PHP, Rust
  - 29. Editors and IDEs (VS Code, Vim/Neovim, JetBrains)
  - 30. Testing and Visualization
- 8. Cookbook: Copy-Paste Ready Patterns
  - 31. Web and APIs
  - 32. Authentication and Access Logs
  - 33. Cloud and DevOps
  - 34. Database and Data Leakage
  - 35. Misc Utilities
- 9. WAF, DLP, and SIEM Considerations
  - 36. WAF Rule Patterns
  - 37. DLP Policies
  - 38. SIEM Parsers
- 10. Troubleshooting and FAQ
  - 39. Common Pitfalls
  - 40. Debugging Checklist
  - 41. Glossary
- Appendix
  - A. Reference Tables
  - B. Engine Differences
  - C. Humor and Easter Eggs
  - D. Further Reading
Part 1: Regex Fundamentals¶
1. Why Regex? What It Is and Isn't¶
Regex finds patterns in text. Think of it as a scalpel, not a chainsaw.
Use it for: slicing logs, validating input, extracting data, quick filters. Don't use it for: parsing complex context-free or nested formats; use proper parsers for that.
Quick joke: I told my code a regex joke. It was too greedy and took the punchline.
2. Engines at a Glance (DFA vs NFA, Backtracking)¶
- DFA (deterministic finite automaton) and other automata-based engines: typically linear time in the input length, no backreferences, no backtracking (e.g., RE2). Safe for performance-critical contexts.
- NFA backtracking engines (PCRE, Perl, Oniguruma, PCRE2): powerful features (lookbehind, backreferences, recursion) but at risk of catastrophic backtracking.
3. Character Classes and Escapes¶
- Dot: . matches any character except newline (engine dependent)
- Sets: [abc], ranges: [a-z], negation: [^...]
- Shortcuts: \d (digit), \D (non-digit), \w (word), \W (non-word), \s (space), \S (non-space)
- Escape metacharacters when used literally: . ^ $ * + ? ( ) [ ] { } | \
Example:
\w+@\w+\.\w+
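As a quick illustration of the classes above, here is a minimal Python sketch using the standard `re` module. The sample strings are made up for the example.

```python
import re

text = "Order 66 shipped to bob@example.com on 2024-01-05."

# \d+ matches runs of digits.
digits = re.findall(r"\d+", text)

# [a-z]+ matches runs of lowercase ASCII letters only.
lower_words = re.findall(r"[a-z]+", "abc DEF ghi")

# [^0-9\s]+ is a negated set: anything that is neither a digit nor whitespace.
non_digit_runs = re.findall(r"[^0-9\s]+", "a1 b22 c")

# The example pattern from this section: word chars, @, word chars, literal dot, word chars.
emails = re.findall(r"\w+@\w+\.\w+", text)

print(digits, lower_words, non_digit_runs, emails)
```

Note that `\w+@\w+\.\w+` is only a demonstration of the syntax; it misses addresses with dots or hyphens in the local part or in multi-label domains.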
4. Literals, Metacharacters, and Escaping¶
- Escape to match literally: e.g., to match the dot in a domain, write example\.com (an unescaped . matches any character).
- Raw strings in some languages reduce double escaping (e.g., Python r"...", Rust r"..."), but check language docs.
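The difference between an escaped and an unescaped dot can be sketched in Python as follows. `re.escape` is the standard-library helper for turning arbitrary text into a literal pattern; the sample inputs are illustrative.

```python
import re

# Unescaped dot matches any character, so "exampleXcom" slips through.
loose = re.compile(r"example.com")

# Escaped dot matches only a literal ".".
strict = re.compile(r"example\.com")

# re.escape builds a literal-matching pattern from arbitrary text,
# which is handy when the text comes from configuration or user input.
literal = re.compile(re.escape("a+b (v1.0)"))

# A raw string keeps the backslash as-is; the non-raw form needs it doubled.
same_pattern = r"\d+" == "\\d+"

print(bool(loose.fullmatch("exampleXcom")), bool(strict.fullmatch("exampleXcom")))
```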
5. Flags/Modifiers Across Engines¶
- Case-insensitive: i
- Multiline ^ and $: m
- Dotall (dot matches newline): s
- Unicode: u
- Global (find all): g (JS)
- Extended/comments: x (ignore whitespace, allow comments)
Example inline modifiers (PCRE):
(?i) case-insensitive
(?m) multiline anchors
(?s) dotall
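For comparison, a minimal sketch of the same flags in Python's `re` module, where they are spelled `re.IGNORECASE`, `re.MULTILINE`, `re.DOTALL`, and `re.VERBOSE`. Python has no `g` flag; `findall` and `finditer` play that role. The sample text is made up.

```python
import re

text = "Error: disk full\nerror: retrying\nDone"

# i: case-insensitive matching.
ci = re.findall(r"error", text, flags=re.IGNORECASE)

# m: ^ and $ also match at line boundaries.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# s: dot also matches newline, so the match can span lines.
with_dotall = re.search(r"disk.*retrying", text, flags=re.DOTALL)
without_dotall = re.search(r"disk.*retrying", text)

# Inline form, equivalent to passing IGNORECASE | MULTILINE.
inline = re.findall(r"(?im)^error", text)

# x: whitespace and comments inside the pattern are ignored.
verbose = re.compile(
    r"""
    (\d{4})  # year
    -
    (\d{2})  # month
    """,
    re.VERBOSE,
)

print(ci, line_starts, inline, verbose.search("2024-05").groups())
```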
Part 2: Matching Anatomy¶
6. Anchors and Boundaries¶
- ^ start of string/line (with m)
- $ end of string/line (with m)
- \b word boundary, \B non-word boundary
- \A beginning of string, \Z end of string (some engines)
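A small Python sketch of how these anchors behave, with and without the multiline flag. The two-line sample string is invented for the example.

```python
import re

text = "cat\nconcat cat"

# Without MULTILINE, ^ matches only at the very start of the string.
start_only = re.findall(r"^c\w+", text)

# With MULTILINE, ^ also matches after each newline.
each_line_start = re.findall(r"^c\w+", text, re.MULTILINE)

# $ with MULTILINE matches at the end of every line.
line_ends = re.findall(r"cat$", text, re.MULTILINE)

# \b requires a word boundary, so the "cat" inside "concat" is skipped.
whole_words = re.findall(r"\bcat\b", text)

# \B requires the absence of a boundary, so only the "cat" inside "concat" matches.
inside_words = re.findall(r"\Bcat", text)

# \A is the start of the string regardless of MULTILINE.
absolute_start = re.findall(r"\Acat", text, re.MULTILINE)

print(start_only, each_line_start, line_ends, whole_words, inside_words, absolute_start)
```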
7. Groups: Capturing, Non-Capturing, Named¶
- Capturing: ( ... )
- Non-capturing: (?: ... ) for performance/clarity when capture isn’t needed
- Named capture (engine dependent): (?<name> ... ) or (?'name' ...)
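The three group kinds can be sketched in Python like this. Python spells named groups `(?P<name>...)`; the log line is a made-up sample.

```python
import re

line = "2024-01-05 ERROR disk full"

# Capturing groups are numbered from 1, in order of their opening parenthesis.
m = re.match(r"(\d{4})-(\d{2})-(\d{2})", line)

# A non-capturing group bundles the alternatives without creating a capture,
# so the only numbered group here is the message.
n = re.match(r"\S+ (?:ERROR|WARN) (.*)", line)

# Named groups make the result self-describing.
p = re.match(r"(?P<date>\S+) (?P<level>[A-Z]+) (?P<msg>.*)", line)

print(m.groups(), n.groups(), p.groupdict())
```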
8. Alternation¶
- a|b|c, order matters in backtracking engines. Put more specific/likely alternatives first to reduce backtracking.
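Why order matters can be seen in a short Python sketch: a backtracking engine takes the first alternative that lets the overall match succeed, not the longest one. The keyword list is illustrative.

```python
import re

# The first alternative that matches wins, even if a longer one would also fit.
short_first = re.match(r"a|ab", "ab").group()
long_first = re.match(r"ab|a", "ab").group()

# Unanchored, "in" wins at position 0 and the rest of the word is never matched.
unordered = re.findall(r"in|int|integer", "integer")

# Listing the longest alternative first avoids the truncated match.
longest_first = re.findall(r"integer|int|in", "integer")

# Word boundaries also work: the engine backtracks through the alternatives
# until one of them ends at a boundary.
bounded = re.findall(r"\b(?:in|int|integer)\b", "integer")

print(short_first, long_first, unordered, longest_first, bounded)
```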
9. Quantifiers (Greedy, Lazy, Possessive)¶
- Greedy: *, +, ?, {n}, {n,}, {n,m}
- Lazy: *?, +?, ??, {n,m}?
- Possessive (PCRE, Java, etc.): *+, ++, ?+, {n,m}+ (no backtracking once matched)
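A minimal Python sketch of greedy versus lazy matching. Possessive quantifiers are left out here because Python's `re` only accepts them from version 3.11 onward; the HTML-like sample is made up.

```python
import re

html = "<b>one</b><b>two</b>"

# Greedy: .* grabs as much as possible, so one match spans both elements.
greedy = re.findall(r"<b>.*</b>", html)

# Lazy: .*? grabs as little as possible, giving one match per element.
lazy = re.findall(r"<b>.*?</b>", html)

# Bounded repetition: {2,3} takes up to three digits and needs at least two.
bounded = re.findall(r"\d{2,3}", "1 22 4444")

print(greedy, lazy, bounded)
```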
10. Lookarounds (Lookahead/Lookbehind)¶
- Positive lookahead: (?=...)
- Negative lookahead: (?!...)
- Positive lookbehind: (?<=...)
- Negative lookbehind: (?<!...)
Use cases: assert context without consuming it (e.g., capture a token only when followed by a known suffix).
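The four assertions can be sketched in Python as follows; each one checks context without including it in the match. The sample strings are invented.

```python
import re

# Positive lookbehind: digits preceded by a dollar sign.
prices = re.findall(r"(?<=\$)\d+", "cost $40, 50 units, $7")

# Negative lookbehind: whole numbers NOT preceded by a dollar sign.
quantities = re.findall(r"(?<!\$)\b\d+", "cost $40, 50 units")

# Positive lookahead: file stems followed by the ".log" suffix.
log_stems = re.findall(r"\w+(?=\.log\b)", "app.log db.txt sys.log")

# Negative lookahead: words that do not start with "test_".
non_tests = re.findall(r"\b(?!test_)\w+", "test_a prod_b")

print(prices, quantities, log_stems, non_tests)
```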
Part 3: Unicode, Locales, and Character Properties¶
11. Unicode Basics and Case Folding¶
- Use Unicode-aware engines where needed (u flag). Beware of case-insensitive matching across locales.
- Case folding can change the length of a string (e.g., ß folds to ss), so case-insensitive matches may cover more or fewer characters than expected.
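The ß example can be checked in Python. This sketch assumes CPython's standard `re`, whose `IGNORECASE` compares one character at a time and therefore does not apply full case folding; other engines may behave differently.

```python
import re

# Full case folding maps the single character ß to the two characters "ss".
folded = "ß".casefold()

# Comparing casefolded strings treats the two spellings as equal.
equal_after_fold = "Straße".casefold() == "STRASSE".casefold()

# A 7-character literal pattern cannot fullmatch the 6-character "Straße",
# because re.IGNORECASE never expands one character into two.
regex_hit = re.fullmatch("strasse", "Straße", re.IGNORECASE)

# One workaround: casefold both sides before matching.
folded_hit = re.fullmatch("strasse", "Straße".casefold())

print(folded, equal_after_fold, regex_hit, folded_hit is not None)
```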
12. Property Classes (General Category, Scripts)¶
- Example (PCRE/Oniguruma): \p{L} letters, \p{N} numbers, \p{Han} Han ideographs
- Negation: \P{...}
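Python's built-in `re` does not accept `\p{...}` (the third-party `regex` package does). As a rough stand-in, general categories can be checked with the standard `unicodedata` module. The helper names below are hypothetical, and this is a different technique from a property class inside a pattern.

```python
import unicodedata

def is_letter(ch):
    """Approximates \\p{L}: general category starts with 'L'."""
    return unicodedata.category(ch).startswith("L")

def is_number(ch):
    """Approximates \\p{N}: general category starts with 'N'."""
    return unicodedata.category(ch).startswith("N")

def letters_only(text):
    """True when every character is a letter in some script."""
    return len(text) > 0 and all(is_letter(ch) for ch in text)

print(unicodedata.category("\u6f22"))  # category of a Han ideograph
print(letters_only("na\u00efve"), letters_only("abc1"))
```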
13. Normalization and Security Implications¶
- NFC/NFD: canonical equivalence; diacritics may be decomposed
- Security: Mixed script identifiers, homoglyph attacks. Normalize and restrict scripts where appropriate.
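A minimal Python sketch of why normalization matters before matching, using the standard `unicodedata` module. The strings are illustrative; the ASCII check at the end is just one possible restriction policy, not a general recommendation.

```python
import re
import unicodedata

composed = "caf\u00e9"     # é as a single code point (NFC form)
decomposed = "cafe\u0301"  # e followed by a combining acute accent (NFD form)

pattern = re.compile(re.escape(composed))

def nfc(s):
    return unicodedata.normalize("NFC", s)

# The two strings render identically but differ code point by code point.
raw_hit = pattern.fullmatch(decomposed)
normalized_hit = pattern.fullmatch(nfc(decomposed))

# Homoglyphs survive normalization: this "a" is Cyrillic U+0430.
spoofed = "p\u0430ypal"
still_different = nfc(spoofed) != "paypal"

# One possible policy for identifiers: allow ASCII only.
ascii_only = spoofed.isascii()

print(raw_hit, normalized_hit is not None, still_different, ascii_only)
```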
Part 4: Practical Security Patterns¶
14. Networking: IP, MAC, CIDR, URLs¶
- IPv4 (lenient):
\b(?:\d{1,3}\.){3}\d{1,3}\b
- IPv4 (0-255 strict):
\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b
- IPv6 (compressed allowed) is complex; prefer libraries. Minimal presence check:
\b[0-9A-Fa-f:]{2,}\b
- MAC:
\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b
- URL (simplified):
https?:\/\/(?:[\w.-]+)(?::\d+)?(?:\/[\w\-./?%&=+#]*)?
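One way to combine the lenient pattern with a library check, sketched in Python: the regex finds candidates and the standard `ipaddress` module decides validity. The function name and sample text are made up for the example.

```python
import ipaddress
import re

LENIENT = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
STRICT = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)

def extract_valid_ipv4(text):
    """Find candidates with the lenient regex, then validate with ipaddress."""
    valid = []
    for candidate in LENIENT.findall(text):
        try:
            valid.append(str(ipaddress.IPv4Address(candidate)))
        except ValueError:
            # Out-of-range octets such as 999 are rejected here.
            pass
    return valid

sample = "ok 10.0.0.1 bad 999.1.1.1 v6 ::1"
print(LENIENT.findall(sample), STRICT.findall(sample), extract_valid_ipv4(sample))
```

The regex-then-library split keeps the pattern short while leaving edge cases to code written for them.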
Linux joke: There’s no place like 127.0.0.1. And nothing like ::1 when you’re feeling extra modern.
15. Files and Paths (Windows, POSIX)¶
- POSIX-ish path:
\/(?:[\w._-]+\/?)*
- Windows drive path:
[A-Za-z]:\\(?:[^\\/:*?"<>|\r\n]+\\?)*
- Prevent path traversal (detecting ../):
(?:^|\/|\\)\.\.(?:\/|\\|$)
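A short Python sketch of the traversal check. The helper name is hypothetical, and the pattern only sees already-decoded input, so encoded variants such as %2e%2e must be decoded first.

```python
import re

# A ".." path segment delimited by a separator (or a string edge) on both sides.
TRAVERSAL = re.compile(r"(?:^|/|\\)\.\.(?:/|\\|$)")

def has_traversal(path):
    return TRAVERSAL.search(path) is not None

for candidate in ["../etc/passwd", "a/../b", "..\\windows", "a/..", "a/..b/c", "file..txt"]:
    print(candidate, has_traversal(candidate))
```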
16. Emails, Domains, TLDs¶
- Email (practical; not RFC-complete):
[\w.!#$%&'*+\/=?^`{|}~-]+@[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?(?:\.[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?)+
- Domain (basic):
(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,}
17. Credentials and Secrets: High Entropy, Key Prefixes¶
- AWS Access Key ID:
\bA(KIA|SIA)[A-Z0-9]{16}\b
- GitHub token:
\bghp_[A-Za-z0-9]{36}\b
- Google API key:
\bAIza[0-9A-Za-z\-_]{35}\b
- Generic high-entropy token:
\b[A-Za-z0-9_\-]{32,}\b
Warning: High false positives possible. Always verify and consider context.
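One common way to cut false positives from the generic token pattern is to filter candidates by Shannon entropy. The sketch below is an assumption-laden example: the 4.0 bits-per-character threshold and the function names are illustrative choices, not values from any standard.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\b[A-Za-z0-9_\-]{32,}\b")

def shannon_entropy(s):
    """Bits per character, based on the character frequencies in s."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def likely_secrets(text, threshold=4.0):
    # threshold is a hypothetical cutoff; tune it against real data.
    return [t for t in TOKEN.findall(text) if shannon_entropy(t) >= threshold]

varied = "abcdefghijklmnopqrstuvwxyzABCDEF0123456789"
repetitive = "a" * 40
print(likely_secrets(f"key={varied} pad={repetitive}"))
```

Both strings satisfy the regex, but only the varied one passes the entropy filter.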
18. Logs and SIEM: Web, Auth, Cloud, Databases¶
- Apache/Nginx common line parse (named groups; PCRE/Python style):
(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)
- SSH auth failure:
Failed password for (?:invalid user )?(?P<user>\S+) from (?P<ip>\S+) port (?P<port>\d+)
- CloudTrail eventName:
"eventName"\s*:\s*"(?P<event>[A-Za-z0-9_]+)"
19. Data Formats: JSON, XML, CSV, JWTs¶
- JSON key-value (simple):
"(?P<key>[^"]+)"\s*:\s*"(?P<val>[^"]*)"
- XML tag content:
<(?P<tag>[A-Za-z_:][\w:.-]*)[^>]*>(?P<body>.*?)<\/\1>
- JWT (header.payload.signature):
\b[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\b
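The JWT pattern only checks shape, so version strings such as 1.2.3 also match. A possible second step, sketched in Python, is to base64url-decode the first segment and check that it is a JSON object with an `alg` field. The helper names are hypothetical, and no signature verification is performed.

```python
import base64
import json
import re

JWT_SHAPE = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")

def b64url_decode(segment):
    # JWT segments omit padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))

def b64url_encode(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def looks_like_jwt(token):
    """Shape check plus a decodable JSON header; does NOT verify the signature."""
    if JWT_SHAPE.fullmatch(token) is None:
        return False
    try:
        header = json.loads(b64url_decode(token.split(".")[0]))
    except ValueError:
        # Covers invalid base64, invalid UTF-8, and invalid JSON.
        return False
    return isinstance(header, dict) and "alg" in header

# Build an unsigned sample token purely for demonstration.
sample = ".".join([
    b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode()),
    b64url_encode(json.dumps({"sub": "demo"}).encode()),
    "c2lnbmF0dXJl",
])
print(looks_like_jwt(sample), looks_like_jwt("1.2.3"))
```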
Part 5: Validation Do’s and Don’ts¶
20. Input Validation Strategy¶
- Prefer allowlists. Define what is valid, not what is forbidden.
- Anchor patterns: ^...$ to ensure full string validation.
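A minimal Python sketch of allowlist validation with full-string matching. The username rule is a made-up example policy. Note that in Python, `$` also matches just before a trailing newline, so `fullmatch` (or `\Z`) is the safer choice.

```python
import re

# Hypothetical policy: lowercase letter first, then 2-15 letters, digits, or underscores.
USERNAME = re.compile(r"[a-z][a-z0-9_]{2,15}")

def is_valid_username(value):
    # fullmatch requires the whole string to match, like ^...$ without the newline quirk.
    return USERNAME.fullmatch(value) is not None

# Unanchored search accepts input that merely contains a valid fragment.
partial = USERNAME.search("alice; DROP TABLE users") is not None

# ^...$ still accepts a trailing newline in Python.
dollar_accepts_newline = re.match(r"^\d+$", "123\n") is not None
fullmatch_rejects_newline = re.fullmatch(r"\d+", "123\n") is None

print(is_valid_username("alice_01"), is_valid_username("alice; DROP TABLE users"))
```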
21. Allowlists vs Denylists¶
- Denylists are bypass-friendly; allowlists constrain inputs to expected forms.
22. Canonicalization and Double-Decode Pitfalls¶
- Normalize encodings first: URL decode, HTML decode, Unicode normalize as needed, then validate.
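The double-decode pitfall can be sketched in Python with `urllib.parse.unquote`. Decoding until the value stops changing is one possible policy (rejecting multiply-encoded input outright is another); the `max_rounds` limit and function name are assumptions for the example.

```python
import re
from urllib.parse import unquote

TRAVERSAL = re.compile(r"(?:^|/|\\)\.\.(?:/|\\|$)")

def fully_decode(value, max_rounds=5):
    """URL-decode repeatedly until stable; give up on suspiciously deep nesting."""
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            return value
        value = decoded
    raise ValueError("too many nested encodings")

payload = "%252e%252e%252fetc"

# A single decode still looks harmless to the traversal check.
once = unquote(payload)
missed = TRAVERSAL.search(once) is None

# Decoding to a fixed point exposes the ../ before validation runs.
canonical = fully_decode(payload)
caught = TRAVERSAL.search(canonical) is not None

print(once, canonical, missed, caught)
```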
Dev aside: If your regex looks like a firewall rule and your firewall rule looks like a regex, step away from the keyboard and write a parser.
Part 6: ReDoS and Performance¶
23. Catastrophic Backtracking¶
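- Vulnerable: ^(a+)+$ against aaaa...ab can explode runtime, because the match must fail and the engine tries every way of splitting the run of a's.
- Mitigate by:
  - Using atomic groups: (?>...)
  - Possessive quantifiers: a++
  - Refactoring patterns to avoid nested, overlapping quantifiers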
24. Possessive Quantifiers and Atomic Groups¶
- Possessive:
*+, ++, ?+, {n,m}+
- Atomic:
(?>...)
- Both reduce backtracking in engines that support them.
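To make the blow-up and the mitigations concrete, here is a Python sketch. Native atomic groups and possessive quantifiers need Python 3.11+, so this example instead uses the well-known lookahead-plus-backreference idiom `(?=(...))\1`, which behaves atomically because the engine does not backtrack into a completed lookahead. The input sizes are kept small on purpose.

```python
import re
import time

VULNERABLE = re.compile(r"^(a+)+$")              # nested, overlapping quantifiers
REFACTORED = re.compile(r"^a+$")                 # accepts the same strings, no nesting
ATOMIC_LIKE = re.compile(r"^(?:(?=(a+))\1)+$")   # emulates ^(?>a+)+$

def timed(pattern, text):
    """Return (matched, seconds) for a single match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start

# a+a can succeed only if a+ gives one character back; the atomic form cannot.
plain = re.fullmatch(r"a+a", "aaa")
atomic = re.fullmatch(r"(?=(a+))\1a", "aaa")

if __name__ == "__main__":
    # Work for the vulnerable pattern roughly doubles with each extra "a".
    for n in (12, 15, 18):
        text = "a" * n + "b"
        for name, pat in (("vulnerable", VULNERABLE), ("refactored", REFACTORED),
                          ("atomic-like", ATOMIC_LIKE)):
            matched, seconds = timed(pat, text)
            print(f"n={n:2d} {name:11s} matched={matched} {seconds:.6f}s")
```

Refactoring is usually the clearest fix; the atomic idiom is mainly useful when the inner pattern cannot be simplified.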
25. Linear-Time Patterns and RE2¶
- Consider RE2/Go’s regexp for user-controlled inputs; they reject constructs that require backtracking (such as backreferences and lookarounds), so matching time stays linear in the input length.
26. Performance Budgeting and Profiling¶
- Time-box regex runs if your language allows it.
- Benchmark with realistic inputs and pathological cases.
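Python's standard `re` has no built-in match timeout, so one rough substitute is to cap input length and benchmark patterns ahead of time. The sketch below assumes a hypothetical 10,000-character budget; the function names are made up for the example.

```python
import re
import time

MAX_INPUT_LEN = 10_000  # hypothetical budget; choose one that fits your workload

def bounded_search(pattern, text, max_len=MAX_INPUT_LEN):
    """Refuse oversized inputs instead of letting the engine run unbounded."""
    if len(text) > max_len:
        raise ValueError("input exceeds regex budget")
    return pattern.search(text)

def benchmark(pattern, samples, repeat=5):
    """Best-of-N wall time in seconds for each labelled sample."""
    results = {}
    for label, text in samples.items():
        best = float("inf")
        for _ in range(repeat):
            start = time.perf_counter()
            pattern.search(text)
            best = min(best, time.perf_counter() - start)
        results[label] = best
    return results

if __name__ == "__main__":
    pat = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
    samples = {"realistic": "GET / from 10.0.0.1", "pathological": "1." * 2000}
    for label, seconds in benchmark(pat, samples).items():
        print(f"{label}: {seconds:.6f}s")
```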
Linux joke: My regex was so slow, the cron job finished before the first match.
Part 7: Tooling and Language Ecosystem¶
27. grep/egrep/fgrep/ripgrep/sed/awk¶
- grep: basic regex; use -E for extended. ripgrep (rg) is fast and modern.
- Examples:
# Find IPs
rg -o "\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b" logs/
# Extract emails
grep -Eo '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]+' -r src/
# Obfuscate emails (sed)
sed -E 's/([[:alnum:]._%+-]+)@([[:alnum:].-]+)/[REDACTED]@\2/g' input.txt > output.txt
28. Python, JavaScript, Go, Java, C#, PHP, Rust¶
- Python (re):
import re
p = re.compile(r"(?P<ip>\S+) \[(?P<ts>.*?)\] \"(?P<m>\S+) (?P<path>\S+) \S+\" (?P<st>\d{3}) (?P<sz>\d+)")
- JS:
const re = /https?:\/\/[\w.-]+/gi;
- Go (RE2):
re := regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`)
- Java (Pattern):
Pattern p = Pattern.compile("(?i)user: (\\w+)");
- C#:
var re = new Regex(@"(?<=^|\s)user=(\w+)", RegexOptions.Compiled);
- PHP:
if (preg_match('/^USER:(\w+)$/i', $s, $m)) { $user = $m[1]; }
- Rust (regex crate):
let re = Regex::new(r"(?m)^ERROR: (.+)$").unwrap();
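As a usage sketch for the Python pattern above, the compiled expression can be wrapped in a small parser that returns a dict. The function name and the sample log line are invented; the pattern is written with single quotes here so the double quotes need no escaping.

```python
import re

LOG = re.compile(
    r'(?P<ip>\S+) \[(?P<ts>.*?)\] "(?P<m>\S+) (?P<path>\S+) \S+" '
    r'(?P<st>\d{3}) (?P<sz>\d+)'
)

def parse_line(line):
    """Return the named fields as a dict, or None when the line does not match."""
    match = LOG.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["st"] = int(fields["st"])
    fields["sz"] = int(fields["sz"])
    return fields

sample = '203.0.113.7 [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
print(parse_line(sample))
```

Lines whose size field is a dash rather than a number do not match this pattern and come back as None.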
29. Editors and IDEs (VS Code, Vim/Neovim, JetBrains)¶
- VS Code search: PCRE-like syntax, enabled with the regex toggle in the search box. Beware escaping.
- Vim: different flavor; \v for very magic to reduce escaping.
30. Testing and Visualization¶
- regex101: PCRE/Python/JS flavors, explains tokens
- regexper: visualize a pattern as a railroad diagram
- debuggex: diagram generator
Tip: Save test strings that triggered corner cases; they are your fuzz corpus.
Part 8: Cookbook: Copy-Paste Ready Patterns¶
31. Web and APIs¶
- Query parameters (key=value pairs):
([?&])(?P<k>[A-Za-z0-9_]+)=(?P<v>[^&#]*)
- Bearer token header:
^Authorization:\s*Bearer\s+([A-Za-z0-9._-]+)$
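Both patterns can be exercised in Python as below. The URL and header are sample values; the comparison with `urllib.parse` is included because a real parser is usually the better choice when the full URL is available.

```python
import re
from urllib.parse import parse_qsl, urlsplit

PARAM = re.compile(r"([?&])(?P<k>[A-Za-z0-9_]+)=(?P<v>[^&#]*)")
BEARER = re.compile(r"^Authorization:\s*Bearer\s+([A-Za-z0-9._-]+)$")

url = "https://example.com/search?q=regex&page=2#top"

# Regex extraction of key=value pairs.
pairs = [(m.group("k"), m.group("v")) for m in PARAM.finditer(url)]

# Same result from the standard library parser.
parsed = parse_qsl(urlsplit(url).query)

header = "Authorization: Bearer abc.def-123"
token_match = BEARER.match(header)
token = token_match.group(1) if token_match else None

print(pairs, parsed, token)
```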
32. Authentication and Access Logs¶
- Brute-force detection (naive windowing done outside regex):
Failed password for (?:invalid user )?\S+ from (?P<ip>\S+)
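The "outside regex" part can be sketched in Python: the pattern extracts the source IP and a counter applies the threshold. The caller is assumed to pass only the lines that fall inside the time window; the threshold of 3 and the function name are illustrative.

```python
import re
from collections import Counter

FAILED = re.compile(r"Failed password for (?:invalid user )?\S+ from (?P<ip>\S+)")

def suspicious_ips(lines, threshold=3):
    """Count failed logins per source IP and keep those at or above threshold."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            counts[match.group("ip")] += 1
    return {ip: n for ip, n in counts.items() if n >= threshold}

window = [
    "Failed password for root from 198.51.100.4 port 22 ssh2",
    "Failed password for invalid user admin from 198.51.100.4 port 22 ssh2",
    "Failed password for root from 198.51.100.4 port 50022 ssh2",
    "Accepted password for alice from 192.0.2.10 port 22 ssh2",
    "Failed password for bob from 192.0.2.10 port 22 ssh2",
]
print(suspicious_ips(window))
```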
33. Cloud and DevOps¶
- AWS ARN:
arn:aws:[A-Za-z0-9-]+:[a-z0-9-]*:\d{12}:.+
- Kubernetes pod names:
[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*
34. Database and Data Leakage¶
- SQL credentials strings (heuristic):
\b(user(name)?|uid)\s*=\s*\w+;\s*pass(word)?\s*=\s*[^;\s]+\b
35. Misc Utilities¶
- Version (semver-ish):
\b\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?\b
- UUID v4:
\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-4[0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b
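A quick Python check of both patterns. The UUID is generated with the standard `uuid` module; the version string and the non-v4 UUID are sample values.

```python
import re
import uuid

SEMVER = re.compile(r"\b\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?\b")
UUID4 = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-4[0-9a-fA-F]{3}"
    r"-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b"
)

# uuid.uuid4() always sets the version nibble to 4 and the variant to 8, 9, a, or b.
generated = str(uuid.uuid4())

# A version-1 style UUID has "1" where the pattern requires "4".
not_v4 = "123e4567-e89b-12d3-a456-426614174000"

print(bool(UUID4.fullmatch(generated)), bool(UUID4.fullmatch(not_v4)))
print(bool(SEMVER.fullmatch("1.2.3-rc.1+build.5")), bool(SEMVER.fullmatch("1.2")))
```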
Dev joke: I wrote a regex to detect my productivity. It matched zero or more tasks.
Part 9: WAF, DLP, and SIEM Considerations¶
36. WAF Rule Patterns¶
- Detect ../../ traversal (encoded variants handled upstream):
\.{2}\/(?:\.{2}\/)+
- Script tag injection (simplified; avoid false positives):
<\s*script\b[^>]*>.*?<\s*\/\s*script\s*>
37. DLP Policies¶
- Look for private key blocks:
-----BEGIN (?:RSA|OPENSSH|EC) PRIVATE KEY-----
- Credit card (Luhn verification outside regex):
\b(?:4\d{12}(?:\d{3})?|5[1-5]\d{14}|3[47]\d{13}|3(?:0[0-5]|[68]\d)\d{11}|6(?:011|5\d{2})\d{12}|(?:2131|1800|35\d{3})\d{11})\b
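The Luhn step that follows the regex can be sketched in Python as below. The card numbers are standard test values, not real accounts, and the function names are made up for the example.

```python
import re

CARD = re.compile(
    r"\b(?:4\d{12}(?:\d{3})?|5[1-5]\d{14}|3[47]\d{13}|3(?:0[0-5]|[68]\d)\d{11}"
    r"|6(?:011|5\d{2})\d{12}|(?:2131|1800|35\d{3})\d{11})\b"
)

def luhn_ok(number):
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_valid_cards(text):
    # The regex narrows candidates; Luhn removes most random digit runs.
    return [c for c in CARD.findall(text) if luhn_ok(c)]

sample = "card 4111111111111111 and typo 4111111111111112"
print(CARD.findall(sample), find_valid_cards(sample))
```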
38. SIEM Parsers¶
- Key-value logs:
\b(?P<k>[A-Za-z_][A-Za-z0-9_]*)=(?P<v>"[^"]*"|\S+)
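A minimal Python sketch that turns a key-value log line into a dict with this pattern. The quote stripping and the sample line are additions for the example; escaped quotes inside values are not handled.

```python
import re

KV = re.compile(r'\b(?P<k>[A-Za-z_][A-Za-z0-9_]*)=(?P<v>"[^"]*"|\S+)')

def parse_kv(line):
    """Extract key=value pairs; surrounding double quotes are removed from values."""
    fields = {}
    for match in KV.finditer(line):
        value = match.group("v")
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[match.group("k")] = value
    return fields

line = 'user=alice action="login failed" src=10.0.0.5'
print(parse_kv(line))
```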
Part 10: Troubleshooting and FAQ¶
39. Common Pitfalls¶
- Forgetting to anchor, causing partial matches
- Over-escaping or under-escaping (language string vs regex engine)
- Relying on regex for complex grammars
40. Debugging Checklist¶
- Reproduce with smallest test string
- Use online testers to see backtracking steps
- Add anchors and explicit character classes
- Replace nested quantifiers with atomic or possessive variants
41. Glossary¶
- Anchor: Position matcher like ^ or $
- Character class: Set or range of chars [A-Z]
- Backreference: \1 refers to first capture
- Lookaround: Zero-width context assertions
Appendix¶
A. Reference Tables¶
- Shorthand classes: \d, \D, \w, \W, \s, \S
- Word boundary: \b, \B
- Anchors: ^, $, \A, \Z
B. Engine Differences¶
- PCRE/PCRE2: rich features, backtracking
- RE2/Go: linear time, no backreferences or lookarounds
- Java: supports possessive quantifiers and atomic groups
- .NET: similar to PCRE with some differences
C. Humor and Easter Eggs¶
- Linux joke: chmod -x life; still executable.
- Networking: I would tell you a UDP joke, but you might not get it.
- Dev: There are two hard problems in computer science: cache invalidation, naming things, and off-by-one errors.
D. Further Reading¶
- Regular-Expressions.info: https://www.regular-expressions.info/
- RE2: https://github.com/google/re2
- PCRE2: https://www.pcre.org/
- OWASP Input Validation Cheat Sheet: https://cheatsheetseries.owasp.org/
- RFC 3986 (URI): https://www.rfc-editor.org/rfc/rfc3986
Images: NFA diagram from Wikimedia Commons. Ensure your site permits external images or mirror locally if needed.