19. Regular Expressions

🔍 Master the art of text manipulation with Regular Expressions! This comprehensive guide delves into Python's `re` module, fundamental patterns, quantifiers, grouping, and practical applications, empowering you to efficiently search and process data. ✨

What will we learn in this post?

  • 👉 Introduction to Regular Expressions
  • 👉 The re Module
  • 👉 Basic Regex Patterns
  • 👉 Quantifiers and Repetition
  • 👉 Groups and Capturing
  • 👉 Regex Flags and Options
  • 👉 Practical Regex Applications
  • 👉 Conclusion!

Regex: Your Text Superpower! ✨

Imagine needing to find specific information or check text rules in a big pile of words. That's where Regular Expressions, or regex, come in! They're a special, incredibly powerful language for describing and matching text patterns. Think of them as super-smart search and replace tools that understand complex sequences, not just exact words.

Why Use Regex? Common Magic! 🪄

Regex helps computers understand text in a structured way, unlocking many possibilities:

1. Validation ✅

Quickly check if an email (user@domain.com), phone number, or password meets specific format rules before accepting it.

2. Smart Searching 🕵️‍♀️

Find all URLs on a webpage, specific keywords, or even patterns like dates (DD-MM-YYYY) within large documents with incredible precision.

3. Data Extraction ✂️

Pull out just the names, prices, or product codes from raw, unstructured text, making data cleanup and analysis much easier.

Regex uses a pattern of characters (like \d+ for 'one or more digits') to tell the computer precisely what to look for. It's a concise way to communicate complex text needs.

graph TD
    A["📧 Text Input:<br/>My email is user@example.com"]:::pink --> B{"🔍 Regex Pattern:<br/>\\w+@\\w+\\.\\w+"}:::gold
    B --> C{"❓ Match Found?"}:::purple
    C -- "✅ Yes" --> D["📋 Extracted Data:<br/>user@example.com"]:::green
    C -- "❌ No" --> E["🚫 No Match"]:::orange

    classDef pink fill:#ff4f81,stroke:#c43e3e,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef gold fill:#ffd700,stroke:#d99120,color:#222,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef purple fill:#6b5bff,stroke:#4a3f6b,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef green fill:#43e97b,stroke:#38f9d7,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef orange fill:#ff9800,stroke:#f57c00,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;

    linkStyle 0,1,2,3 stroke:#e67e22,stroke-width:3px;

This simple flowchart shows how regex can process text to find specific patterns.
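
The same flow can be tried directly in Python. This is a minimal sketch that reuses the flowchart's input text and pattern; the variable name `extracted` is only for illustration.

```python
import re

text = "My email is user@example.com"

# Same pattern as in the flowchart: word characters, "@", word characters, ".", word characters.
match = re.search(r"\w+@\w+\.\w+", text)

if match:
    extracted = match.group()
    print("Extracted:", extracted)  # Extracted: user@example.com
else:
    extracted = None
    print("No match")
```

Note that this pattern is deliberately loose: it is enough to spot something email-shaped in a sentence, not to validate an address.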

Regex Fun with Python's re Module! 🔎

Python's built-in re module is your ultimate companion for working with Regular Expressions (regex), a robust tool for finding, matching, and manipulating text based on powerful patterns. Think of it as a super-smart search engine for your strings!


Spotting Patterns: re.match() vs re.search() 🎯

These functions help determine if a pattern exists within a string. They return a match object if successful, otherwise None.

  • re.match(pattern, string): Searches for the pattern only at the very beginning of the string.
    
    import re
    text = "Hello world"
    print(re.match(r"Hello", text)) # <re.Match object; span=(0, 5), match='Hello'>
    print(re.match(r"world", text))  # None (because 'world' isn't at the start)
    
  • re.search(pattern, string): Scans the entire string to find the first place the pattern matches.
    
    text = "Hello world"
    print(re.search(r"world", text)) # <re.Match object; span=(6, 11), match='world'>
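
Because both functions return None when nothing matches, calling .group() or .span() on the result without checking can raise an AttributeError. Here is a minimal sketch of a guard; `find_span` and its `anchored` parameter are hypothetical names used only for illustration.

```python
import re

def find_span(pattern, text, anchored=False):
    """Return (start, end) of the first match, or None when nothing matches."""
    finder = re.match if anchored else re.search
    m = finder(pattern, text)
    return m.span() if m is not None else None

print(find_span(r"world", "Hello world"))                 # (6, 11)
print(find_span(r"world", "Hello world", anchored=True))  # None
print(find_span(r"Hello", "Hello world", anchored=True))  # (0, 5)
```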
    

Here's a quick visual to understand the difference:

graph TD
    A["🚀 Start String Scan"]:::pink --> B{"🎯 Pattern at BEGINNING?"}:::gold
    B -- "✅ Yes" --> C["📦 re.match()<br/>returns Match Object"]:::green
    B -- "❌ No" --> D["🚫 re.match()<br/>returns None"]:::orange
    A --> E{"🔍 Pattern ANYWHERE?"}:::purple
    E -- "✅ Yes, first match" --> F["📦 re.search()<br/>returns Match Object"]:::teal
    E -- "❌ No" --> G["🚫 re.search()<br/>returns None"]:::orange

    classDef pink fill:#ff4f81,stroke:#c43e3e,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef gold fill:#ffd700,stroke:#d99120,color:#222,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef purple fill:#6b5bff,stroke:#4a3f6b,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef teal fill:#00bfae,stroke:#005f99,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef green fill:#43e97b,stroke:#38f9d7,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef orange fill:#ff9800,stroke:#f57c00,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;

    linkStyle 0,1,2,3,4,5 stroke:#e67e22,stroke-width:3px;

Finding All Occurrences: re.findall() & re.finditer() 🕵️‍♀️

Need to grab all instances of a specific pattern? These are your friends!

  • re.findall(pattern, string): Returns a list of all non-overlapping matches as strings (if the pattern contains capturing groups, it returns the group contents instead).
    
    text = "cat and dog and cat"
    print(re.findall(r"cat", text)) # ['cat', 'cat']
    
  • re.finditer(pattern, string): Returns an iterator yielding match objects for all matches. This is handy for getting more detailed information (like starting position) for each match.
    
    text = "cat and dog and cat"
    for m in re.finditer(r"cat", text):
        print(f"Found '{m.group()}' at index {m.start()}")
    # Found 'cat' at index 0
    # Found 'cat' at index 16
    

Changing & Splitting Text: re.sub() & re.split() ✍️

Regex isn't just for finding; it can transform and break apart text too!

  • re.sub(pattern, replacement, string): Substitutes (replaces) all occurrences of the pattern with the specified replacement string.
    
    text = "Call me at 123-456-7890 anytime."
    print(re.sub(r"\d{3}-\d{3}-\d{4}", "HIDDEN", text)) # Call me at HIDDEN anytime.
    
  • re.split(pattern, string): Splits the string by occurrences of the pattern, returning a list of substrings.
    
    text = "apple,banana;orange"
    print(re.split(r"[,;]", text)) # ['apple', 'banana', 'orange']
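
Both functions accept a little more than the basic form shown above. As a hedged sketch: re.sub() can take a function as the replacement, and re.split() accepts a maxsplit argument. The helper name `mask` is made up for this example.

```python
import re

text = "Call me at 123-456-7890 anytime."

def mask(match):
    # Keep the last four digits and hide the rest.
    number = match.group()
    return "XXX-XXX-" + number[-4:]

masked = re.sub(r"\d{3}-\d{3}-\d{4}", mask, text)
print(masked)  # Call me at XXX-XXX-7890 anytime.

# maxsplit=1 stops after the first separator.
parts = re.split(r"[,;]", "apple,banana;orange", maxsplit=1)
print(parts)  # ['apple', 'banana;orange']
```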
    

Unleash the Power of Regex! 🚀

Ever needed to find specific text patterns or validate inputs? Regular Expressions, or Regex, are incredibly powerful tools for searching, matching, and manipulating strings. Let's explore the fundamental building blocks in a friendly, easy-to-understand way!


1. Literal Characters: The Exact Match 🎯

Most characters in a regex pattern simply match themselves exactly. They're like plain text!

  • hello will literally match the word "hello".
hello
# Input: "hello world"
# Output: Match found: "hello"

2. Metacharacters: The Special Symbols ✨

These characters have special meanings, allowing you to create more flexible and dynamic patterns.

. (Dot): Any Single Character 📍

  • Matches any single character (except a newline).
  • Example: a.b matches axb, a b, acb.
a.b
# Input: "axb", "a b", "acb", "ab"
# Output: Match found: "axb", "a b", "acb" (No match for "ab")

^ and $ (Anchors): Start & End ⚓

  • ^: Matches the beginning of a string.
  • $: Matches the end of a string.
  • Example: ^start matches "start here" but not "let's start".
^start
# Input: "start here", "let's start"
# Output: Match found: "start" (from "start here")
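
The anchor behaviour can be checked with re.search(), which would otherwise match anywhere in the string. This is a small sketch; the `results` dictionary is just a convenient way to collect the outcomes.

```python
import re

samples = ["start here", "let's start"]

# For each sample: (matches ^start, matches start$)
results = {
    s: (bool(re.search(r"^start", s)), bool(re.search(r"start$", s)))
    for s in samples
}

for s, (at_start, at_end) in results.items():
    print(f"{s!r}: ^start -> {at_start}, start$ -> {at_end}")
# 'start here': ^start -> True, start$ -> False
# "let's start": ^start -> False, start$ -> True
```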

*, +, ? (Quantifiers): How Many? 🔢

These specify how many times the preceding element can repeat.

  • *: Zero or more times. ab*c matches ac, abc, abbc.
  • +: One or more times. ab+c matches abc, abbc but not ac.
  • ?: Zero or one time. ab?c matches ac, abc.
ab+c
# Input: "abc", "abbc", "ac"
# Output: Match found: "abc", "abbc" (No match for "ac")

{} (Quantifier): Specific Counts 📏

  • Matches a specific number of times. a{3}b matches aaab.
  • a{2,4}b matches aab, aaab, aaaab.
a{2,4}b
# Input: "aab", "aaab", "aaaab", "ab"
# Output: Match found: "aab", "aaab", "aaaab" (No match for "ab")

[] (Character Sets): Any of These 🍎

  • Matches any single character found inside the brackets.
  • [aeiou] matches any vowel. [0-9] matches any digit. [a-z] matches any lowercase letter.
[aeiou]
# Input: "apple", "banana"
# Output: Match found: "a", "e" (from apple), "a", "a", "a" (from banana)

\ (Escape Character): Take it Literally 🛡️

  • Removes the special meaning of a metacharacter. To match a literal . or *, use \. or \*.
  • Example: \.com matches ".com".
\.com
# Input: "example.com"
# Output: Match found: ".com"
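
A quick sketch of why escaping matters, using a made-up input string. It also shows re.escape(), a standard-library helper that escapes every special character in a piece of literal text for you.

```python
import re

text = "example.com welcome"

unescaped = re.findall(r".com", text)   # the dot matches any character
escaped = re.findall(r"\.com", text)    # the dot must be a literal "."
print(unescaped)  # ['.com', 'lcom']
print(escaped)    # ['.com']

# re.escape() turns arbitrary text into a pattern that matches it literally.
literal = re.escape("3.50 (USD)")
print(bool(re.search(literal, "Total: 3.50 (USD)")))  # True
print(bool(re.search(literal, "Total: 3x50 (USD)")))  # False
```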

3. Special Sequences: Handy Shortcuts! ⚡

These are pre-defined character classes, making common patterns easier to write.

\d, \w, \s: Common Patterns 🧩

  • \d: Matches any digit. For Python str patterns this also covers other Unicode decimal digits; it is the same as [0-9] only with the re.ASCII flag (or in bytes patterns).
  • \w: Matches any word character (letters, digits, and underscore; exactly a-zA-Z0-9_ with re.ASCII).
  • \s: Matches any whitespace character (space, tab, newline, etc.).
  • Example: \d{3}-\d{3}-\d{4} matches phone numbers like "123-456-7890".
\d{3}-\d{3}-\d{4}
# Input: "My number is 123-456-7890."
# Output: Match found: "123-456-7890"
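
Here is a small sketch of the three shortcuts in action, with made-up sample strings. The last two lines illustrate that, for Python str patterns, \d also matches non-ASCII decimal digits unless the re.ASCII flag is passed.

```python
import re

digits = re.findall(r"\d+", "My number is 123-456-7890.")
words = re.findall(r"\w+", "snake_case and CamelCase42")
pieces = re.split(r"\s+", "a  b\tc\nd")

print(digits)  # ['123', '456', '7890']
print(words)   # ['snake_case', 'and', 'CamelCase42']
print(pieces)  # ['a', 'b', 'c', 'd']

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
print(bool(re.fullmatch(r"\d", arabic_three)))            # True
print(bool(re.fullmatch(r"\d", arabic_three, re.ASCII)))  # False
```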

Regex Concepts Flow 🌊

graph TD
    A["🚀 Start Regex Pattern"]:::pink --> B{"📝 Literal Characters"}:::gold
    A --> C{"✨ Metacharacters"}:::purple
    A --> D{"🔢 Special Sequences"}:::teal

    C --> C1["🔢 Quantifiers:<br/>* + ? {}"]:::orange
    C --> C2["⚓ Anchors:<br/>^ $"]:::orange
    C --> C3["🎯 Character Sets:<br/>[]"]:::orange
    C --> C4["🛡️ Escape Character:<br/>\\"]:::orange

    D --> D1["🔢 \\d: Digit"]:::green
    D --> D2["🔤 \\w: Word Char"]:::green
    D --> D3["␣ \\s: Whitespace"]:::green

    B -- "or" --> E["✅ Match Text"]:::green
    C -- "or" --> E
    D -- "or" --> E

    classDef pink fill:#ff4f81,stroke:#c43e3e,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef gold fill:#ffd700,stroke:#d99120,color:#222,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef purple fill:#6b5bff,stroke:#4a3f6b,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef teal fill:#00bfae,stroke:#005f99,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef orange fill:#ff9800,stroke:#f57c00,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef green fill:#43e97b,stroke:#38f9d7,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;

    linkStyle default stroke:#e67e22,stroke-width:3px;

Regular expressions might look a bit like magic at first, but mastering these basics opens up a world of text manipulation possibilities! Keep practicing! ✨


Regex Quantifiers: The Power of Repetition! 🚀

Regular expressions (regex) use quantifiers to specify how many times a character, group, or character class can appear. They make your patterns flexible and incredibly powerful!

Meet the Common Quantifiers ✨

  • * (Asterisk): Matches the preceding element zero or more times. It's like saying "optional, and can repeat".
    • Example: a*b matches "b", "ab", "aaab".
  • + (Plus): Matches the preceding element one or more times. It must appear at least once.
    • Example: a+b matches "ab", "aaab", but not "b".
  • ? (Question Mark): Matches the preceding element zero or one time. It makes an element completely optional.
    • Example: colou?r matches "color" or "colour".
  • {n} (Exactly n): Matches the preceding element exactly n times.
    • Example: a{3} matches "aaa".
  • {n,m} (Between n and m): Matches the preceding element at least n and at most m times.
    • Example: a{2,4} matches "aa", "aaa", "aaaa".

Greedy vs. Non-Greedy Matching ⚖️

By default, the variable-length quantifiers (*, +, ?, {n,m}) are greedy. This means they match as much as possible while still allowing the overall regex to succeed.

  • Greedy Example: "<.*>" on <h1>Hello</h1> matches the entire <h1>Hello</h1>.

To make a quantifier non-greedy (or lazy), simply add a ? right after it (e.g., *?, +?, ??, {n,m}?). A non-greedy quantifier matches as little as possible while still allowing the overall regex to succeed.

  • Non-Greedy Example: "<.*?>" on <h1>Hello</h1> matches <h1> and </h1> as two separate matches.
flowchart TD
    A["🚀 Start Matching"]:::pink --> B{"🔢 Quantifier<br/>Encountered?"}:::gold
    B -- "🍪 Default: Greedy" --> C["📊 Match Longest<br/>Possible String"]:::teal
    B -- "❓ With '?': Non-Greedy" --> D["🎯 Match Shortest<br/>Possible String"]:::purple
    C --> E["➡️ Proceed with<br/>rest of Regex"]:::green
    D --> E

    classDef pink fill:#ff4f81,stroke:#c43e3e,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef gold fill:#ffd700,stroke:#d99120,color:#222,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef purple fill:#6b5bff,stroke:#4a3f6b,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef teal fill:#00bfae,stroke:#005f99,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef green fill:#43e97b,stroke:#38f9d7,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;

    linkStyle default stroke:#e67e22,stroke-width:3px;
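
The greedy and non-greedy examples above can be reproduced with re.findall(). This is a minimal sketch using the same `<h1>Hello</h1>` input.

```python
import re

html = "<h1>Hello</h1>"

greedy = re.findall(r"<.*>", html)   # .* grabs as much as it can
lazy = re.findall(r"<.*?>", html)    # .*? stops at the first possible ">"

print(greedy)  # ['<h1>Hello</h1>']
print(lazy)    # ['<h1>', '</h1>']
```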

Regex Grouping Magic! ✨

Regular expressions use parentheses () to group parts of a pattern, treating them as a single unit. This is super handy for applying quantifiers (+, *) to multiple characters or for capturing specific pieces of your match.

Capturing Groups () 📦

When you use (), you're not just grouping; you're also capturing the text that matches inside. These groups are automatically numbered from left to right, starting from 1.

import re
text = "My phone is 123-456-7890."
pattern = r"(\d{3})-(\d{3})-(\d{4})" # Three capturing groups for phone parts
match = re.search(pattern, text)
if match:
    print(match.group(0)) # The entire matched string: "123-456-7890"
    print(match.group(1)) # First captured group: "123"
    print(match.group(2)) # Second captured group: "456"
    print(match.group(3)) # Third captured group: "7890"

Non-Capturing Groups (?:) 👻

Need to group but don't want to capture the text? That's what (?:...) is for! It groups patterns together for things like applying quantifiers or alternation, but it doesn't create a backreference or consume a group number. Great for efficiency!

# Example: Match "colour" or "color"
# Pattern with capturing: (colou?r)  -> group 1 holds the whole word, "colour" or "color"
# Pattern with non-capturing: (?:colou?r) -> no group is captured; only the overall match (group 0) is available
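
The difference is easiest to see by inspecting the match objects. Here is a small sketch with made-up input strings; note how re.findall() returns the group contents when a capturing group is present.

```python
import re

text = "color colour"

capturing = re.search(r"(colou?r)", text)
non_capturing = re.search(r"(?:colou?r)", text)

print(capturing.group(0), capturing.groups())          # color ('color',)
print(non_capturing.group(0), non_capturing.groups())  # color ()

# Grouping lets a quantifier apply to several characters at once.
whole_runs = re.findall(r"(?:ab)+", "ab abab")  # no groups: full matches
last_group = re.findall(r"(ab)+", "ab abab")    # one group: last captured repetition
print(whole_runs)  # ['ab', 'abab']
print(last_group)  # ['ab', 'ab']
```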

Named Groups (?P<name>) 🏷️

Forget remembering group numbers! With (?P<your_name>pattern), you can give your capturing groups a name. This makes your regular expressions much clearer and easier to manage when accessing specific parts.

# Example for a date:
pattern_named = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
match_named = re.search(pattern_named, "Date: 2023-10-26")
if match_named:
    print(match_named.group("year"))  # Access by name: "2023"
    print(match_named.group("month")) # Access by name: "10"
    print(match_named.group("day"))   # Access by name: "26"

Accessing Captured Groups 🤝

After a successful match, you can retrieve the captured content using methods like match.group(). Access numbered groups by their index (e.g., match.group(1)) and named groups by their assigned name (e.g., match.group("year")).
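
Besides group(), match objects offer a few other accessors. This sketch reuses the date pattern from the named-groups example and shows groups(), groupdict(), and subscript access.

```python
import re

m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "Date: 2023-10-26")

all_groups = m.groups()    # every captured group as a tuple
by_name = m.groupdict()    # named groups as a dictionary

print(all_groups)  # ('2023', '10', '26')
print(by_name)     # {'year': '2023', 'month': '10', 'day': '26'}

# A named group is still numbered, and m["name"] is shorthand for m.group("name").
print(m.group(1) == m.group("year") == m["year"])  # True
```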


graph TD
    A["🚀 Start Grouping"]:::pink --> B{"🤔 Need to save<br/>this part?"}:::gold
    B -- "✅ Yes" --> C["📦 Use Capturing Group:<br/>(pattern)"]:::teal
    C --> D{"🏷️ Give memorable<br/>name?"}:::purple
    D -- "✅ Yes" --> E["📛 Use Named Group:<br/>(?P<name>pattern)"]:::green
    D -- "❌ No" --> F["🔢 Access by number:<br/>group(1), group(2)..."]:::orange
    B -- "❌ No" --> G["👻 Use Non-Capturing:<br/>(?:pattern)"]:::orange
    E --> H["🎯 Access by name:<br/>group('name')"]:::green
    F --> I["✅ End"]:::green
    G --> I
    H --> I

    classDef pink fill:#ff4f81,stroke:#c43e3e,color:#fff,font-size:13px,stroke-width:3px,rx:12,shadow:4px;
    classDef gold fill:#ffd700,stroke:#d99120,color:#222,font-size:13px,stroke-width:3px,rx:12,shadow:4px;
    classDef purple fill:#6b5bff,stroke:#4a3f6b,color:#fff,font-size:13px,stroke-width:3px,rx:12,shadow:4px;
    classDef teal fill:#00bfae,stroke:#005f99,color:#fff,font-size:13px,stroke-width:3px,rx:12,shadow:4px;
    classDef orange fill:#ff9800,stroke:#f57c00,color:#fff,font-size:13px,stroke-width:3px,rx:12,shadow:4px;
    classDef green fill:#43e97b,stroke:#38f9d7,color:#fff,font-size:13px,stroke-width:3px,rx:12,shadow:4px;

    linkStyle default stroke:#e67e22,stroke-width:3px;

Regex Flags: Powering Your Patterns! 🚀

Regex flags are like special switches that change how your regular expressions work. They offer extra control, making your pattern matching more flexible and powerful!

How Flags Modify Matching ✨

This simple chart shows how flags fit into the pattern matching process:

graph TD
    A["🚀 Start Regex Match"]:::pink --> B{"🏴 Are Flags<br/>Provided?"}:::gold
    B -- "✅ Yes" --> C["⚙️ Apply Flag Rules"]:::purple
    C --> D["🔍 Execute Pattern<br/>Matching"]:::teal
    B -- "❌ No" --> D
    D --> E["🎁 Return Result"]:::green

    classDef pink fill:#ff4f81,stroke:#c43e3e,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef gold fill:#ffd700,stroke:#d99120,color:#222,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef purple fill:#6b5bff,stroke:#4a3f6b,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef teal fill:#00bfae,stroke:#005f99,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef green fill:#43e97b,stroke:#38f9d7,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;

    linkStyle default stroke:#e67e22,stroke-width:3px;

1. re.IGNORECASE (or re.I) 🔡

This flag makes your pattern match both uppercase and lowercase letters. It's fantastic for case-insensitive searches.

import re
pattern = r"apple"
text = "Apple pie, apple crisp."
match = re.search(pattern, text, re.IGNORECASE)
print(match.group()) # Output: Apple

2. re.MULTILINE (or re.M) 📜

Normally, ^ matches the string's start and $ its end. re.MULTILINE makes ^ match the start of each line, and $ the end of each line within the string.

pattern = r"^Second"
text = "First Line\nSecond Line"
# Without re.MULTILINE, ^ only matches at the very start, so this search would return None.
match = re.search(pattern, text, re.MULTILINE)
print(match.group()) # Output: Second

3. re.DOTALL (or re.S) 🎯

By default, the dot (.) matches any character except a newline (\n). re.DOTALL makes . match all characters, including those pesky newlines.

pattern = r"hello.world"
text = "hello\nworld"
match = re.search(pattern, text, re.DOTALL)
print(repr(match.group())) # Output: 'hello\nworld'

4. re.VERBOSE (or re.X) 💡

This flag allows you to write more readable regex by ignoring whitespace in the pattern (outside character classes) and letting you add comments. It helps break down complex patterns.

pattern = r"""
    hello   # Matches the word "hello"
    \s+     # Matches one or more whitespace characters
    world   # Matches the word "world"
"""
text = "hello world"
match = re.search(pattern, text, re.VERBOSE)
print(match.group()) # Output: hello world
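
Flags can also be combined with the | operator, or written inside the pattern as inline flags such as (?im). A minimal sketch with a made-up log-style text:

```python
import re

text = "Error: disk full\nwarning: low memory\nERROR: timeout"

# IGNORECASE handles "Error"/"ERROR"; MULTILINE lets ^ and $ work on every line.
combined = re.findall(r"^error:\s*(.+)$", text, re.IGNORECASE | re.MULTILINE)

# The same two flags written inline at the start of the pattern.
inline = re.findall(r"(?im)^error:\s*(.+)$", text)

print(combined)  # ['disk full', 'timeout']
print(inline)    # ['disk full', 'timeout']
```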

Practical Regex Adventures! 🚀

Regular Expressions (Regex) are powerful tools for pattern matching in text. They help us find, validate, extract, or replace specific text strings efficiently. Let's dive into some practical examples!

Understanding the Regex Flow 💡

Here's a simple way to visualize how Regex works:

graph TD
    A["📝 Start with<br/>Text/Data"]:::pink --> B{"🎯 Define Regex<br/>Pattern"}:::gold
    B -- "➡️ Apply Pattern" --> C{"🔍 Search for<br/>Match"}:::purple
    C -- "❓ Match Found?" --> D["✅ Yes: Extract/<br/>Validate/Transform"]:::teal
    C -- "❌ No Match" --> E["🚫 No Action/False"]:::orange
    D --> F["🎁 Result/Output"]:::green
    E --> F

    classDef pink fill:#ff4f81,stroke:#c43e3e,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef gold fill:#ffd700,stroke:#d99120,color:#222,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef purple fill:#6b5bff,stroke:#4a3f6b,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef teal fill:#00bfae,stroke:#005f99,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef orange fill:#ff9800,stroke:#f57c00,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;
    classDef green fill:#43e97b,stroke:#38f9d7,color:#fff,font-size:14px,stroke-width:3px,rx:12,shadow:4px;

    linkStyle default stroke:#e67e22,stroke-width:3px;

Everyday Regex Use Cases ✨

Let's see Regex in action with Python's re module.

import re

# --- 📧 Email Validation ---
# Checks if a string looks like a valid email address.
email_pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
test_email = "user@example.com"
is_valid_email = bool(re.match(email_pattern, test_email))
print(f"'{test_email}' is valid: {is_valid_email}") # Output: 'user@example.com' is valid: True

# --- 📞 Phone Number Formatting ---
# Cleans and formats phone numbers into a standard (XXX) XXX-XXXX format.
phone_number = "123.456.7890"
cleaned_phone = re.sub(r"[^\d]", "", phone_number) # Removes non-digits
formatted_phone = re.sub(r"(\d{3})(\d{3})(\d{4})", r"(\1) \2-\3", cleaned_phone)
print(f"Formatted phone: {formatted_phone}") # Output: Formatted phone: (123) 456-7890

# --- 🔗 URL Extraction ---
# Finds all URLs (HTTP/HTTPS) within a given text.
text_with_urls = "Visit us at https://www.example.com or our blog http://blog.test.org for more info."
extracted_urls = re.findall(r"https?://[^\s]+", text_with_urls)
print(f"Extracted URLs: {extracted_urls}") # Output: Extracted URLs: ['https://www.example.com', 'http://blog.test.org']

# --- 🔒 Password Strength Checking ---
# A simple check: at least 8 chars, 1 uppercase, 1 lowercase, 1 digit.
# .{8,} accepts any characters, so symbols such as @ and ! are allowed.
password_pattern = r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$"
strong_password = "MyP@ssw0rd!"
is_strong = bool(re.match(password_pattern, strong_password))
print(f"Is '{strong_password}' strong: {is_strong}") # Output: Is 'MyP@ssw0rd!' strong: True

# --- 🧹 Data Cleaning ---
# Removes special characters, keeping only letters, numbers, and spaces.
dirty_data = "Hello, world! This is some data with @symbols & numbers 123."
cleaned_data = re.sub(r"[^a-zA-Z0-9\s]", "", dirty_data)
print(f"Cleaned data: {cleaned_data}") # Output: Cleaned data: Hello world This is some data with symbols  numbers 123

Regex is an indispensable skill for developers and data professionals, making text manipulation tasks much easier!
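
When the same pattern is applied to many inputs, it can be compiled once with re.compile() and reused. This is a hedged sketch that reuses the email pattern from the examples above; the names `EMAIL_RE`, `candidates`, and `valid` are illustrative, and the pattern remains a simple format check rather than full email validation.

```python
import re

# Compile once, reuse many times.
EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

candidates = [
    "user@example.com",
    "no-at-sign.example.com",
    "a@b",
    "first.last+tag@sub.domain.org",
]

valid = [c for c in candidates if EMAIL_RE.match(c)]
print(valid)  # ['user@example.com', 'first.last+tag@sub.domain.org']
```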


🎯 Hands-On Assignment

💡 Project: Log File Analyzer - Build a Production Log Parser

🚀 Your Challenge:

Create a comprehensive Log File Analyzer using Regular Expressions to parse, extract, and analyze data from server log files. Your system should handle common log formats, extract meaningful information, and generate reports. 📊✨

📋 Requirements:

Part 1: Log Entry Parser

  • Parse standard Apache/Nginx log format: IP - - [DateTime] "REQUEST" STATUS SIZE
  • Extract components using capturing groups:
    • IP address (validate IPv4 format)
    • Timestamp (parse date and time)
    • HTTP method (GET, POST, PUT, DELETE)
    • URL path
    • Status code (200, 404, 500, etc.)
    • Response size in bytes
  • Example log entry:
    192.168.1.100 - - [10/Dec/2025:13:55:36 +0000] "GET /api/users HTTP/1.1" 200 1234

Part 2: IP Address Analysis

  • Extract all unique IP addresses from logs
  • Count requests per IP address
  • Identify suspicious IPs (more than 100 requests in the sample)
  • Validate IPv4 format: ^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
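
As a starting point, the IPv4 pattern above can be written as a Python raw string with single backslashes and tried on a few inputs. This is only a sketch; `IPV4_RE` and `is_ipv4` are hypothetical names, and the pattern accepts leading zeros such as "01.2.3.4".

```python
import re

IPV4_RE = re.compile(
    r"^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
)

def is_ipv4(candidate):
    """Return True when the string looks like a dotted-quad IPv4 address."""
    return IPV4_RE.match(candidate) is not None

print(is_ipv4("192.168.1.100"))  # True
print(is_ipv4("256.1.1.1"))      # False (256 is out of range)
print(is_ipv4("10.0.0"))         # False (only three octets)
```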

Part 3: Error Detection

  • Find all 4xx client errors (404, 403, etc.)
  • Find all 5xx server errors (500, 502, 503, etc.)
  • Extract error URLs and count occurrences
  • Identify the most common error patterns

Part 4: URL Pattern Analysis

  • Extract API endpoints (e.g., /api/users, /api/products)
  • Identify resource IDs in URLs (e.g., /users/123, /products/456)
  • Count requests per endpoint
  • Find most accessed resources

Part 5: Time-Based Analysis

  • Extract timestamps and convert to Python datetime objects
  • Group requests by hour of day
  • Identify peak traffic hours
  • Calculate average response size per hour

Part 6: User Agent & Referrer Extraction (Advanced)

  • Parse extended log format including User-Agent and Referrer
  • Detect bot traffic (identify common bot user agents)
  • Extract browser types (Chrome, Firefox, Safari, etc.)
  • Identify mobile vs desktop traffic

💡 Implementation Hints:

  • Step 1: Start with basic pattern: r'^(\S+) .* \[([^\]]+)\] "(\w+) ([^"]+)" (\d{3}) (\d+)'
  • Step 2: Use re.findall() to extract all log entries, then process each match
  • Step 3: Use named groups for clarity: (?P<ip>\S+), (?P<method>\w+)
  • Step 4: Create a dictionary to store statistics (IP counts, error counts, etc.)
  • Step 5: Use re.compile() for patterns you'll reuse multiple times (better performance)
  • Step 6: For datetime parsing, combine regex with Python's datetime.strptime()
  • Bonus: Create visualization of results using simple text-based charts or export to CSV
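
To get you started, here is one possible sketch of how the hints might fit together: a compiled pattern with named groups, datetime.strptime() for the timestamp, and a Counter for statistics. The group names, `LOG_RE`, and `parse_line` are assumptions made for illustration, not a required design. The sketch also assumes the size field is always numeric and that month abbreviations are in English.

```python
import re
from collections import Counter
from datetime import datetime

# Named groups mirror the fields listed in Part 1 (names are illustrative).
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)$'
)

def parse_line(line):
    """Return a dict of parsed fields, or None if the line does not match."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    entry["status"] = int(entry["status"])
    entry["size"] = int(entry["size"])
    # %b expects English month abbreviations under the default C locale.
    entry["timestamp"] = datetime.strptime(entry["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

sample = [
    '192.168.1.100 - - [10/Dec/2025:13:55:36 +0000] "GET /api/users HTTP/1.1" 200 1234',
    '10.0.0.50 - - [10/Dec/2025:13:56:12 +0000] "POST /api/login HTTP/1.1" 401 89',
    '192.168.1.100 - - [10/Dec/2025:14:05:11 +0000] "DELETE /api/users/456 HTTP/1.1" 204 0',
    'this line is not a log entry',
]

entries = [e for e in (parse_line(line) for line in sample) if e is not None]
ip_counts = Counter(e["ip"] for e in entries)
errors = [(e["path"], e["status"]) for e in entries if e["status"] >= 400]

print(ip_counts.most_common(1))  # [('192.168.1.100', 2)]
print(errors)                    # [('/api/login', 401)]
```

Returning None for unparseable lines keeps the statistics code simple; a fuller solution might also count or report the lines that were skipped.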

📊 Sample Log Data to Test:

192.168.1.100 - - [10/Dec/2025:13:55:36 +0000] "GET /api/users HTTP/1.1" 200 1234
10.0.0.50 - - [10/Dec/2025:13:56:12 +0000] "POST /api/login HTTP/1.1" 401 89
192.168.1.101 - - [10/Dec/2025:13:57:22 +0000] "GET /products/123 HTTP/1.1" 200 5678
10.0.0.75 - - [10/Dec/2025:14:00:45 +0000] "GET /api/products HTTP/1.1" 500 234
192.168.1.100 - - [10/Dec/2025:14:05:11 +0000] "DELETE /api/users/456 HTTP/1.1" 204 0

Expected Output Example:

LOG FILE ANALYSIS REPORT
========================
Total Requests: 5
Unique IPs: 4
Error Rate: 40.0%

TOP IPs:
  192.168.1.100: 2 requests
  10.0.0.50: 1 request
  10.0.0.75: 1 request

STATUS CODES:
  2xx (Success): 3
  4xx (Client Error): 1
  5xx (Server Error): 1

TOP ENDPOINTS:
  /api/users: 2 requests
  /products/123: 1 request
  /api/login: 1 request

Share Your Solution! 💬

Built your log analyzer? Awesome! Share your approach in the comments below. Did you find interesting patterns in your log data? What regex tricks did you discover? Let's learn from each other! 🚀


Conclusion

Well, we've covered quite a bit today! 😊 I hope you found something inspiring or thought-provoking. Now it's your turn! I'm genuinely curious to hear what you think. Did this post spark any ideas for you? Do you have a different perspective, or perhaps some extra tips to share? Don't hold back! Pop your comments, feedback, or even just a quick hello down in the section below. 👇 Let's build on this conversation together! Thanks for reading! ✨

This post is licensed under CC BY 4.0 by the author.