Regex¶

Regex Diagram

Table of Contents¶

1. Regex Fundamentals
1. Why Regex? What It Is and Isn’t
2. Engines at a Glance (DFA vs NFA, Backtracking)
3. Character Classes and Escapes
4. Literals, Metacharacters, and Escaping
5. Flags/Modifiers Across Engines
2. Matching Anatomy
6. Anchors and Boundaries
7. Groups: Capturing, Non-Capturing, Named
8. Alternation
9. Quantifiers (Greedy, Lazy, Possessive)
10. Lookarounds (Lookahead/Lookbehind)
3. Unicode, Locales, and Character Properties
11. Unicode Basics and Case Folding
12. Property Classes (General Category, Scripts)
13. Normalization and Security Implications
4. Practical Security Patterns
14. Networking: IP, MAC, CIDR, URLs
15. Files and Paths (Windows, POSIX)
16. Emails, Domains, TLDs
17. Credentials and Secrets: High Entropy, Key Prefixes
18. Logs and SIEM: Web, Auth, Cloud, Databases
19. Data Formats: JSON, XML, CSV, JWTs
5. Validation Do’s and Don’ts
20. Input Validation Strategy
21. Allowlists vs Denylists
22. Canonicalization and Double Decode Pitfalls
6. ReDoS and Performance
23. Catastrophic Backtracking
24. Possessive Quantifiers and Atomic Groups
25. Linear Time Patterns and RE2
26. Performance Budgeting and Profiling
7. Tooling and Language Ecosystem
27. grep/egrep/fgrep/ripgrep/sed/awk
28. Python, JavaScript, Go, Java, C#, PHP, Rust
29. Editors and IDEs (VS Code, Vim/Neovim, JetBrains)
30. Testing and Visualization
8. Cookbook: Copy-Paste Ready Patterns
31. Web and APIs
32. Authentication and Access Logs
33. Cloud and DevOps
34. Database and Data Leakage
35. Misc Utilities
9. WAF, DLP, and SIEM Considerations
36. WAF Rule Patterns
37. DLP Policies
38. SIEM Parsers
10. Troubleshooting and FAQ
39. Common Pitfalls
40. Debugging Checklist
41. Glossary
Appendix
A. Reference Tables
B. Engine Differences
C. Humor and Easter Eggs
D. Further Reading

Part 1: Regex Fundamentals¶

1. Why Regex? What It Is and Isn't¶

Regex finds patterns in text. Think of it as a scalpel, not a chainsaw.

Use it for: slicing logs, validating input, extracting data, quick filters. Don't use it for: parsing complex context-free or nested formats; use proper parsers for that.

Quick joke: I told my code a regex joke. It was too greedy and took the punchline.

2. Engines at a Glance (DFA vs NFA, Backtracking)¶

DFA (deterministic finite automaton): Typically linear time, no backreferences, no backtracking (e.g., RE2). Safe for performance-critical contexts.
NFA backtracking engines (PCRE, Perl, Oniguruma, PCRE2): Powerful features (lookbehind, backreferences, recursion) but risk of catastrophic backtracking.

3. Character Classes and Escapes¶

Dot: . matches any character except newline (engine-dependent)
Sets: [abc], ranges: [a-z], negation: [^...]
Shortcuts: \d (digit), \D (non-digit), \w (word), \W (non-word), \s (space), \S (non-space)
Escape metacharacters when used literally: . ^ $ * + ? ( ) [ ] { } | \

Example:

\w+@\w+\.\w+
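A minimal Python sketch of the example pattern above, assuming made-up sample strings. It shows that \w excludes "." and "-", so the pattern only captures the simplest address shapes.

```python
import re

# The example pattern: word chars, "@", word chars, a literal dot, word chars.
SIMPLE_EMAIL = re.compile(r"\w+@\w+\.\w+")

def first_match(text):
    """Return the first substring matching the pattern, or None."""
    m = SIMPLE_EMAIL.search(text)
    return m.group(0) if m else None

print(first_match("alice@example.com"))       # alice@example.com
print(first_match("first.last@example.com"))  # last@example.com ("." is not in \w)
print(first_match("bob@mail.example.org"))    # bob@mail.example (stops at the second dot)
print(first_match("no-at-sign.example.com"))  # None
```

The truncated results are why section 16 uses explicit character classes instead of \w.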

4. Literals, Metacharacters, and Escaping¶

Escape to match literally: e.g., to match the dot in a domain, write example\.com (an unescaped . matches any character).
Raw strings in some languages reduce double escaping (e.g., Python r"...", Rust r"..."), but check language docs.
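A short Python sketch of both points, assuming the literal comes from a variable: re.escape builds the escaped form, and a raw string keeps the backslash intact.

```python
import re

domain = "example.com"

unescaped = re.compile(domain)             # "." still means "any character"
escaped = re.compile(re.escape(domain))    # re.escape produces example\.com

print(bool(unescaped.fullmatch("exampleXcom")))  # True: the dot matched "X"
print(bool(escaped.fullmatch("exampleXcom")))    # False
print(bool(escaped.fullmatch("example.com")))    # True

# A raw string and a doubled backslash spell the same two characters.
print(r"\d" == "\\d")  # True
```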

5. Flags/Modifiers Across Engines¶

Case-insensitive: i
Multiline ^ and $: m
Dotall (dot matches newline): s
Unicode: u
Global (find all): g (JS)
Extended/comments: x (ignore whitespace, allow comments)

Example inline modifiers (PCRE):

(?i)case insensitive
(?m)multiline anchors
(?s)dotall
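In Python's re module the same options are available as flag constants or inline modifiers. A minimal sketch, with an invented two-line log as input:

```python
import re

text = "Error: disk full\nerror: retrying\n"

# Flag arguments and inline modifiers are two spellings of the same options.
with_flags = re.findall(r"^error: (.+)$", text, re.IGNORECASE | re.MULTILINE)
with_inline = re.findall(r"(?im)^error: (.+)$", text)
print(with_flags)   # ['disk full', 'retrying']
print(with_inline)  # ['disk full', 'retrying']

# Without dotall the dot stops at a newline; with (?s) it crosses lines.
no_dotall = re.search(r"disk.*retrying", text)
dotall = re.search(r"(?s)disk.*retrying", text)
print(no_dotall is None, dotall is not None)  # True True
```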

Part 2: Matching Anatomy¶

6. Anchors and Boundaries¶

^ start of string/line (with m)
$ end of string/line (with m)
\b word boundary, \B non-word boundary
\A beginning of string, \Z end of string (some engines; in PCRE, Java, and .NET \Z also matches before a final newline, and \z is the absolute end)
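A small Python sketch of how these anchors behave, using an invented sample string. Note that in Python \Z means the absolute end of the string, while $ also matches just before a trailing newline.

```python
import re

text = "cat concat\ncat"

start_only = re.findall(r"^cat", text)       # ^ matches at string start only
every_line = re.findall(r"(?m)^cat", text)   # with m, ^ matches at each line start
whole_word = re.findall(r"\bcat\b", text)    # skips the "cat" inside "concat"
inside_word = re.findall(r"\Bcat", text)     # only the "cat" inside "concat"

print(len(start_only), len(every_line), len(whole_word), len(inside_word))  # 1 2 2 1

# $ tolerates one trailing newline; \Z (in Python) does not.
print(bool(re.search(r"cat$", "cat\n")))    # True
print(bool(re.search(r"cat\Z", "cat\n")))   # False
```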

7. Groups: Capturing, Non-Capturing, Named¶

Capturing: ( ... )
Non-capturing: (?: ... ) for performance/clarity when capture isn’t needed
Named capture (engine-dependent): (?<name> ... ) or (?'name' ... ); Python and PCRE also accept (?P<name> ... )
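A minimal Python sketch combining the three group types. The pattern and sample text are illustrative only, not a recommended email matcher.

```python
import re

# Two named captures, plus a non-capturing group for an optional "mail." prefix.
pattern = re.compile(r"(?P<user>\w+)@(?:mail\.)?(?P<domain>\w+\.\w+)")

m = pattern.search("contact: alice@mail.example.com")
print(m.group("user"))    # alice
print(m.group("domain"))  # example.com
print(m.groups())         # ('alice', 'example.com'): the (?:...) group is not numbered
```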

8. Alternation¶

a|b|c: order matters in backtracking engines, which take the first alternative that lets the overall match succeed. Put more specific/likely alternatives first to reduce backtracking.
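A quick Python illustration of why order matters, using two made-up alternatives where one is a prefix of the other:

```python
import re

# The first alternative that works wins, even if a longer one exists.
short_first = re.match(r"get|getter", "getter").group(0)
long_first = re.match(r"getter|get", "getter").group(0)
print(short_first)  # get
print(long_first)   # getter

# Forcing a full match makes the engine backtrack into the second alternative.
full = re.fullmatch(r"get|getter", "getter").group(0)
print(full)  # getter
```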

9. Quantifiers (Greedy, Lazy, Possessive)¶

Greedy: *, +, ?, {n}, {n,}, {n,m}
Lazy: *?, +?, ??, {n,m}?
Possessive (PCRE, Java, etc.): *+, ++, ?+, {n,m}+ (no backtracking once matched)
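A small Python comparison of greedy and lazy matching on an invented HTML fragment. Possessive quantifiers are left out of this sketch because Python's re only accepts them from version 3.11.

```python
import re

html = "<b>one</b> and <b>two</b>"

greedy = re.findall(r"<b>.*</b>", html)   # .* runs to the last </b>
lazy = re.findall(r"<b>.*?</b>", html)    # .*? stops at the first </b>

print(greedy)  # ['<b>one</b> and <b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```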

10. Lookarounds (Lookahead/Lookbehind)¶

Positive lookahead: (?=...)
Negative lookahead: (?!...)
Positive lookbehind: (?<=...)
Negative lookbehind: (?<!...)

Use cases: assert context without consuming it (e.g., capture a token only when followed by a known suffix).
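A Python sketch of all four lookarounds. The sample strings are invented; each pattern returns only the consumed text, not the asserted context.

```python
import re

text = "price: $42, qty: 7, total: $294"

# Positive lookbehind: digits that directly follow a dollar sign.
dollars = re.findall(r"(?<=\$)\d+", text)
# Negative lookbehind: digit runs not preceded by "$" or another digit.
plain = re.findall(r"(?<![\$\d])\d+", text)
print(dollars, plain)  # ['42', '294'] ['7']

files = "app.log db.log.gz notes.txt"
# Positive lookahead: names followed by ".log", without consuming the suffix.
log_stems = re.findall(r"\w+(?=\.log\b)", files)
print(log_stems)  # ['app', 'db']

# Negative lookahead: file names whose extension is not "tmp".
kept = re.findall(r"\b\w+\.(?!tmp\b)\w+", "a.txt b.tmp c.log")
print(kept)  # ['a.txt', 'c.log']
```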

Part 3: Unicode, Locales, and Character Properties¶

11. Unicode Basics and Case Folding¶

Use Unicode-aware engines where needed (u flag). Beware of case-insensitive matching across locales.
Case folding can make matches larger than expected (e.g., ß vs SS).
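The ß example can be checked in Python. This sketch assumes you want caseless comparison of whole strings; str.casefold applies full case folding, while re.IGNORECASE compares character by character and never equates ß with SS.

```python
import re

print("Straße".lower())       # straße  (lower() keeps ß)
print("Straße".casefold())    # strasse (casefold() expands ß to ss)
print("Straße".casefold() == "STRASSE".casefold())  # True

# The regex flag does not perform this one-to-two expansion.
print(re.fullmatch("straße", "STRASSE", re.IGNORECASE))  # None
```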

12. Property Classes (General Category, Scripts)¶

Example (PCRE/Oniguruma): \p{L} letters, \p{N} numbers, \p{Han} Han ideographs
Negation: \P{...}

13. Normalization and Security Implications¶

NFC/NFD: canonical equivalence; diacritics may be decomposed
Security: Mixed script identifiers, homoglyph attacks. Normalize and restrict scripts where appropriate.
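A minimal Python sketch of both concerns. The helper scripts_hint is hypothetical and uses the first word of each character's Unicode name as a rough script label; that is a heuristic for illustration, not a real script-property lookup.

```python
import re
import unicodedata

composed = "caf\u00e9"       # é as a single code point (NFC form)
decomposed = "cafe\u0301"    # e followed by a combining acute accent (NFD form)

print(composed == decomposed)                                # False
print(re.fullmatch(composed, decomposed))                    # None
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

def scripts_hint(s):
    """Rough script labels for the letters in s (heuristic, assumption)."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in s if ch.isalpha()}

print(scripts_hint("paypal") == {"LATIN"})                   # True
# "p\u0430ypal" swaps in a Cyrillic "а" that looks like the Latin "a".
print(scripts_hint("p\u0430ypal") == {"LATIN", "CYRILLIC"})  # True
```

Normalizing before matching and rejecting identifiers that mix scripts are separate steps; the sketch only shows how to detect each condition.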

Part 4: Practical Security Patterns¶

14. Networking: IP, MAC, CIDR, URLs¶

IPv4 (lenient):
```
\b(?:\d{1,3}\.){3}\d{1,3}\b
```

IPv4 (0-255 strict):

\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b

IPv6 (compressed allowed) is complex; prefer libraries. Minimal presence check:
```
\b[0-9A-Fa-f:]{2,}\b
```
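Following the "prefer libraries" advice, one possible approach in Python is to use a loose regex only to find candidates and let the standard ipaddress module decide validity. The CANDIDATE pattern and the extract_ips helper are assumptions made for this sketch.

```python
import ipaddress
import re

# Deliberately loose: any run of hex digits, colons, and dots.
CANDIDATE = re.compile(r"[0-9A-Fa-f:.]{2,}")

def extract_ips(text):
    """Return the valid IPv4/IPv6 addresses found in text, in order."""
    found = []
    for token in CANDIDATE.findall(text):
        try:
            # rstrip drops a sentence-ending dot; ip_address does the real validation.
            found.append(str(ipaddress.ip_address(token.rstrip("."))))
        except ValueError:
            continue  # e.g. "bad" or "999.1.1.1"
    return found

print(extract_ips("src=10.0.0.1 dst=::1 bad=999.1.1.1 v6=2001:db8::ff"))
# ['10.0.0.1', '::1', '2001:db8::ff']
```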

MAC:

\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b

URL (simplified):

https?:\/\/(?:[\w.-]+)(?::\d+)?(?:\/[\w\-./?%&=+#]*)?

Linux joke: There’s no place like 127.0.0.1. And nothing like ::1 when you’re feeling extra modern.

15. Files and Paths (Windows, POSIX)¶

POSIX-ish path:
```
\/(?:[\w._-]+\/?)*
```
Windows drive path:
```
[A-Za-z]:\\(?:[^\\/:*?"<>|\r\n]+\\?)*
```
Prevent path traversal (detecting ../):
```
(?:^|\/|\\)\.\.(?:\/|\\|$)
```
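A quick Python check of the traversal pattern against a few invented paths. It assumes the input has already been decoded; encoded variants such as %2e%2e will not match.

```python
import re

# ".." as a whole path segment, delimited by "/", "\", or the string edges.
TRAVERSAL = re.compile(r"(?:^|/|\\)\.\.(?:/|\\|$)")

def has_traversal(path):
    return TRAVERSAL.search(path) is not None

print(has_traversal("../etc/passwd"))        # True
print(has_traversal("static/../../secret"))  # True
print(has_traversal("a\\..\\b"))             # True (Windows separators)
print(has_traversal("notes..txt"))           # False (dots inside a name)
print(has_traversal("files/..hidden"))       # False (segment is "..hidden")
```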

16. Emails, Domains, TLDs¶

Email (practical; not RFC-complete):

[\w.!#$%&'*+\/=?^`{|}~-]+@[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?(?:\.[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?)+

Domain (basic):

(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,}

17. Credentials and Secrets: High Entropy, Key Prefixes¶

AWS Access Key ID:
```
\bA(KIA|SIA)[A-Z0-9]{16}\b
```
GitHub token:
```
\bghp_[A-Za-z0-9]{36}\b
```
Google API key:
```
\bAIza[0-9A-Za-z\-_]{35}\b
```
Generic high entropy token:
```
\b[A-Za-z0-9_\-]{32,}\b
```

Warning: High false positives possible. Always verify and consider context.
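One common way to cut false positives from the generic token pattern is a Shannon entropy filter. The sketch below is an assumption-laden illustration: the likely_secrets helper and the 4.0 bits-per-character threshold are made up here and would need tuning on real data.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\b[A-Za-z0-9_\-]{32,}\b")

def shannon_entropy(s):
    """Entropy in bits per character, from the character frequencies of s."""
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def likely_secrets(text, min_bits=4.0):
    # min_bits is an assumed threshold, not a standard value.
    return [t for t in TOKEN.findall(text) if shannon_entropy(t) >= min_bits]

random_looking = "q7Zp2LxV9mTb4KcR8nWd3YhG6sJf1AeU"   # 32 distinct characters
repetitive = "a" * 40                                 # long, but zero entropy

print(likely_secrets(f"key={random_looking} pad={repetitive}"))
# ['q7Zp2LxV9mTb4KcR8nWd3YhG6sJf1AeU']
```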

18. Logs and SIEM: Web, Auth, Cloud, Databases¶

Apache/Nginx common line parse (named groups; PCRE/Python style):

(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)
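A minimal Python sketch applying the access-log pattern above to an invented log line. Real logs may write "-" for the size, which this pattern does not accept.

```python
import re

ACCESS = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

def parse_access_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = ACCESS.match(line)
    if m is None:
        return None
    record = m.groupdict()
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    return record

sample = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
print(parse_access_line(sample))
```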

SSH auth failure:

Failed password for (?:invalid user )?(?P<user>\S+) from (?P<ip>\S+) port (?P<port>\d+)

CloudTrail eventName:

"eventName"\s*:\s*"(?P<event>[A-Za-z0-9_]+)"

19. Data Formats: JSON, XML, CSV, JWTs¶

JSON key value (simple):

"(?P<key>[^"]+)"\s*:\s*"(?P<val>[^"]*)"

XML tag content:

<(?P<tag>[A-Za-z_:][\w:.-]*)[^>]*>(?P<body>.*?)<\/\1>

JWT (header.payload.signature):

\b[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\b

Part 5: Validation Do’s and Don’ts¶

20. Input Validation Strategy¶

Prefer allowlists. Define what is valid, not what is forbidden.
Anchor patterns: ^...$ to ensure full string validation.
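In Python, re.fullmatch is a convenient way to get full-string validation, and it sidesteps a subtlety of $, which also matches just before a trailing newline. The USERNAME rule below is a hypothetical allowlist chosen for the sketch.

```python
import re

USERNAME = re.compile(r"[a-z][a-z0-9_]{2,15}")  # hypothetical allowlist rule

def is_valid_username(value):
    # fullmatch requires the whole string to match, not just a substring.
    return USERNAME.fullmatch(value) is not None

print(is_valid_username("alice_01"))                 # True
print(is_valid_username("alice; rm -rf /"))          # False
print(bool(USERNAME.search("alice; rm -rf /")))      # True: unanchored search is fooled

# ^...$ accepts a trailing newline; fullmatch does not.
print(bool(re.match(r"^[a-z]+$", "alice\n")))        # True
print(is_valid_username("alice\n"))                  # False
```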

21. Allowlists vs Denylists¶

Denylists are bypass-friendly; allowlists constrain inputs to expected forms.

22. Canonicalization and Double Decode Pitfalls¶

Normalize encodings first: URL decode, HTML decode, Unicode normalize as needed, then validate.
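A sketch of one way to handle layered URL encoding in Python: decode until the value stops changing, with a cap on rounds. The fully_decode helper and the limit of 5 are assumptions for illustration; some applications prefer to decode once and reject anything still encoded.

```python
from urllib.parse import unquote

def fully_decode(value, max_rounds=5):
    """Percent-decode repeatedly until stable; fail if too many layers."""
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            return decoded
        value = decoded
    raise ValueError("too many encoding layers")

double_encoded = "%252e%252e%252fetc"   # "../etc" percent-encoded twice

print(unquote(double_encoded))        # %2e%2e%2fetc  (a single decode hides the "../")
print(fully_decode(double_encoded))   # ../etc        (validate this form)
```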

Dev aside: If your regex looks like a firewall rule and your firewall rule looks like a regex, step away from the keyboard and write a parser.

Part 6: ReDoS and Performance¶

23. Catastrophic Backtracking¶

Vulnerable: (a+)+ against aaaa...ab can explode runtime.
Mitigate by:
Using atomic groups: (?>...)
Using possessive quantifiers: a++
Refactoring patterns to avoid nested, overlapping quantifiers
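The refactoring option can be tried in Python, where (a+)+b and a+b accept exactly the same strings but backtrack very differently. This sketch keeps the input short (16 characters) so it finishes quickly; the timed_search helper is made up, and timings will vary by machine, so none are quoted.

```python
import re
import time

VULNERABLE = re.compile(r"(a+)+b")   # nested quantifiers over the same characters
REFACTORED = re.compile(r"a+b")      # same language, no nesting

def timed_search(pattern, text):
    """Return (matched, seconds) for a single search."""
    start = time.perf_counter()
    matched = pattern.search(text) is not None
    return matched, time.perf_counter() - start

subject = "a" * 16 + "c"   # almost matches, which forces backtracking

slow_found, slow_secs = timed_search(VULNERABLE, subject)
fast_found, fast_secs = timed_search(REFACTORED, subject)
print(slow_found, fast_found)   # False False
print(f"nested: {slow_secs:.4f}s  refactored: {fast_secs:.6f}s")
```

Each extra "a" roughly doubles the work for the nested pattern, so keep experiments small.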

24. Possessive Quantifiers and Atomic Groups¶

Possessive: *+, ++, ?+, {n,m}+
Atomic: (?>...)
Both reduce backtracking in engines that support them.

25. Linear Time Patterns and RE2¶

Consider RE2/Go’s regexp for user controlled inputs; they reject constructs that cause exponential time.

26. Performance Budgeting and Profiling¶

Time-box regex runs if your language allows it.
Benchmark with realistic inputs and pathological cases.

Linux joke: My regex was so slow, the cron job finished before the first match.

Part 7: Tooling and Language Ecosystem¶

27. grep/egrep/fgrep/ripgrep/sed/awk¶

grep: basic regex; use -E for extended. ripgrep (rg) is fast and modern.

Examples:

# Find IPs (single quotes keep the shell from touching the backslashes)
rg -o '\b(?:\d{1,3}\.){3}\d{1,3}\b' logs/

# Extract emails
grep -Eo '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]+' -r src/

# Obfuscate emails (sed)
sed -E 's/([[:alnum:]._%+-]+)@([[:alnum:].-]+)/[REDACTED]@\2/g' input.txt > output.txt

28. Python, JavaScript, Go, Java, C#, PHP, Rust¶

Python (re):

import re
# Common Log Format: ip, identd, user, [timestamp], "request", status, size
p = re.compile(r"(?P<ip>\S+) \S+ \S+ \[(?P<ts>.*?)\] \"(?P<m>\S+) (?P<path>\S+) \S+\" (?P<st>\d{3}) (?P<sz>\d+)")

JS:
```
const re = /https?:\/\/[\w.-]+/gi;
```

Go (RE2):

re := regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`)

Java (Pattern):

Pattern p = Pattern.compile("(?i)user: (\\w+)");

C#:

var re = new Regex(@"(?<=^|\s)user=(\w+)", RegexOptions.Compiled);

PHP:

if (preg_match('/^USER:(\w+)$/i', $s, $m)) { $user = $m[1]; }

Rust (regex crate):

let re = Regex::new(r"(?m)^ERROR: (.+)$").unwrap();

29. Editors and IDEs (VS Code, Vim/Neovim, JetBrains)¶

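VS Code search: toggle regex mode with the .* button. Search across files runs on ripgrep, while find within a single editor uses the JavaScript engine, so feature support differs slightly. Beware escaping.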
Vim: different flavor; \v for very magic to reduce escaping.

30. Testing and Visualization¶

regex101: PCRE/Python/JS flavors, explains tokens
regexper: railroad-diagram visualization of JavaScript-style patterns
debuggex: diagram generator

Tip: Save test strings that triggered corner cases; they are your fuzz corpus.

Part 8: Cookbook: Copy-Paste Ready Patterns¶

31. Web and APIs¶

Query parameters (key=value pairs):

([?&])(?P<k>[A-Za-z0-9_]+)=(?P<v>[^&#]*)

Bearer token header:

^Authorization:\s*Bearer\s+([A-Za-z0-9._-]+)$

32. Authentication and Access Logs¶

Brute force detection (naive windowing done outside regex):

Failed password for (?:invalid user )?\S+ from (?P<ip>\S+)

33. Cloud and DevOps¶

AWS ARN:

arn:aws:[A-Za-z0-9-]+:[a-z0-9-]*:\d{12}:.+

Kubernetes pod names:

[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*

34. Database and Data Leakage¶

SQL credentials strings (heuristic):

\b(user(name)?|uid)\s*=\s*\w+;\s*pass(word)?\s*=\s*[^;\s]+\b

35. Misc Utilities¶

Version (semver-ish):

\b\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?\b

UUID v4:

\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-4[0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b

Dev joke: I wrote a regex to detect my productivity. It matched zero or more tasks.

Part 9: WAF, DLP, and SIEM Considerations¶

36. WAF Rule Patterns¶

Detect ../../ traversal (encoded variants handled upstream):
```
\.{2}\/(?:\.{2}\/)+
```
Script tag injection (simplified; avoid false positives):
```
<\s*script\b[^>]*>.*?<\s*\/\s*script\s*>
```

37. DLP Policies¶

Look for private key blocks:

-----BEGIN (?:RSA|OPENSSH|EC) PRIVATE KEY-----

Credit card (Luhn verification outside regex):

\b(?:4\d{12}(?:\d{3})?|5[1-5]\d{14}|3[47]\d{13}|3(?:0[0-5]|[68]\d)\d{11}|6(?:011|5\d{2})\d{12}|(?:2131|1800|35\d{3})\d{11})\b
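The Luhn step mentioned above can be sketched in Python as a post-filter on regex candidates. The luhn_ok and find_card_numbers helpers are hypothetical names; 4111111111111111 is a widely used test number, not a real card.

```python
import re

CARD = re.compile(
    r"\b(?:4\d{12}(?:\d{3})?|5[1-5]\d{14}|3[47]\d{13}|3(?:0[0-5]|[68]\d)\d{11}"
    r"|6(?:011|5\d{2})\d{12}|(?:2131|1800|35\d{3})\d{11})\b"
)

def luhn_ok(number):
    """Luhn checksum: double every second digit from the right, sum, check mod 10."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return bool(digits) and total % 10 == 0

def find_card_numbers(text):
    # The regex proposes candidates; the checksum discards random digit runs.
    return [c for c in CARD.findall(text) if luhn_ok(c)]

print(find_card_numbers("ok 4111111111111111 bad 4111111111111112"))
# ['4111111111111111']
```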

38. SIEM Parsers¶

Key value logs:

\b(?P<k>[A-Za-z_][A-Za-z0-9_]*)=(?P<v>"[^"]*"|\S+)
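A minimal Python sketch that turns a key=value log line into a dict with the pattern above. The sample line and the parse_kv helper are invented for illustration; surrounding quotes are stripped from quoted values.

```python
import re

KV = re.compile(r'\b(?P<k>[A-Za-z_][A-Za-z0-9_]*)=(?P<v>"[^"]*"|\S+)')

def parse_kv(line):
    """Map each key to its value, dropping the quotes around quoted values."""
    return {m.group("k"): m.group("v").strip('"') for m in KV.finditer(line)}

print(parse_kv('user=alice action="file read" status=200'))
# {'user': 'alice', 'action': 'file read', 'status': '200'}
```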

Part 10: Troubleshooting and FAQ¶

39. Common Pitfalls¶

Forgetting to anchor, causing partial matches
Over-escaping or under-escaping (language string vs regex engine)
Relying on regex for complex grammars

40. Debugging Checklist¶

Reproduce with smallest test string
Use online testers to see backtracking steps
Add anchors and explicit character classes
Replace nested quantifiers with atomic or possessive variants

41. Glossary¶

Anchor: Position matcher like ^ or $
Character class: Set or range of chars [A-Z]
Backreference: \1 refers to first capture
Lookaround: Zero-width context assertions

Appendix¶

A. Reference Tables¶

Shorthand classes: \d, \D, \w, \W, \s, \S
Word boundary: \b, \B
Anchors: ^, $, \A, \Z

B. Engine Differences¶

PCRE/PCRE2: rich features, backtracking
RE2/Go: linear time, no backreferences or lookarounds
Java: supports possessive quantifiers and atomic groups
.NET: similar to PCRE with some differences

C. Humor and Easter Eggs¶

Linux joke: chmod -x life; still executable.
Networking: I would tell you a UDP joke, but you might not get it.
Dev: There are two hard problems in computer science: cache invalidation, naming things, and off-by-one errors.

D. Further Reading¶

Regular-Expressions.info: https://www.regular-expressions.info/
RE2: https://github.com/google/re2
PCRE2: https://www.pcre.org/
OWASP Input Validation Cheat Sheet: https://cheatsheetseries.owasp.org/
RFC 3986 (URI): https://www.rfc-editor.org/rfc/rfc3986

Images: NFA diagram from Wikimedia Commons. Ensure your site permits external images or mirror locally if needed.