21. Regular Expressions
Unleash the power of Regular Expressions! This post guides you through mastering pattern matching, capturing groups, and efficient text manipulation using the `regexp` package, all while considering performance.
What will we learn in this post?
- Regexp Package
- Matching Patterns
- Capturing Groups
- Replacing with Regex
- Performance Considerations
Go's regexp Package: Your Pattern-Matching Friend!
Go's regexp package is a fantastic tool for finding and manipulating text using regular expressions (regex). Think of regex as a super-powered search pattern language that helps you describe complex text sequences!
Getting Your Patterns Ready: Compile vs. MustCompile
Before you can use a regular expression, Go needs to "compile" it into an efficient internal representation.
regexp.Compile(pattern string): Use this when your pattern might come from an external source (like user input or a config file). It returns a *regexp.Regexp object and an error. Always check the error!

r, err := regexp.Compile("[0-9]+") // Matches one or more digits
if err != nil {
    // Handle the error gracefully
}

regexp.MustCompile(pattern string): This is perfect for patterns you know are fixed and correct when you write the code. If the pattern is invalid, it will panic. It's often used for global variables.

var validID = regexp.MustCompile(`^[a-z]+[0-9]*$`) // ID starts with letters, ends with optional digits
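To make the "always check the error" advice concrete, here is a minimal sketch of validating a pattern that arrives from outside the program. The helper name compileUserPattern is hypothetical and only for illustration; regexp.Compile itself is the real API.

```go
package main

import (
	"fmt"
	"regexp"
)

// compileUserPattern is a hypothetical helper: it wraps regexp.Compile and
// adds the offending pattern to the error so callers can report it.
func compileUserPattern(pattern string) (*regexp.Regexp, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("invalid pattern %q: %w", pattern, err)
	}
	return re, nil
}

func main() {
	// The second pattern is missing its closing bracket, so it is rejected.
	for _, p := range []string{`[0-9]+`, `[0-9`} {
		re, err := compileUserPattern(p)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", re.String())
	}
}
```

With MustCompile the second pattern would panic instead, which is why it is better reserved for patterns written directly in the source code.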
Basic Regex Magic: Simple Syntax
Here's a glimpse into some common regex characters:
- abc: Matches the literal string "abc".
- .: Matches any single character (except newline).
- *: Matches zero or more of the preceding item. E.g., a* matches "", "a", "aa".
- +: Matches one or more of the preceding item. E.g., a+ matches "a", "aa".
- ?: Matches zero or one of the preceding item (makes it optional).
- [abc]: Matches any one character listed inside the brackets. [0-9] matches any digit.
- ^: Matches the start of a string.
- $: Matches the end of a string.
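The syntax elements above can be tried out directly. The following is a small sketch; the matches helper is an illustrative name, not part of the regexp package, and it recompiles the pattern on each call only to keep the demo short.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether pattern matches anywhere in s.
// Illustrative helper only; real code should compile each pattern once.
func matches(pattern, s string) bool {
	return regexp.MustCompile(pattern).MatchString(s)
}

func main() {
	fmt.Println(matches(`abc`, "xxabcxx"))     // literal text: true
	fmt.Println(matches(`a.c`, "a-c"))         // . matches any single character: true
	fmt.Println(matches(`^a*$`, ""))           // * allows zero repetitions: true
	fmt.Println(matches(`^a+$`, ""))           // + needs at least one "a": false
	fmt.Println(matches(`^colou?r$`, "color")) // ? makes "u" optional: true
	fmt.Println(matches(`^[0-9]+$`, "2024"))   // character class of digits: true
	fmt.Println(matches(`^abc$`, "xabc"))      // anchors pin both ends: false
}
```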
How Compilation Works (Simplified)
graph TD
A["Start"]:::style1 --> B{"Pattern fixed & correct?"}:::style2
B -- "Yes" --> C["regexp.MustCompile()"]:::style3
B -- "No / User input" --> D["regexp.Compile()"]:::style4
C --> E["Get *Regexp object"]:::style5
D --> F{"Check error?"}:::style6
F -- "Yes, error!" --> G["Handle error"]:::style7
F -- "No error" --> E
E --> H["Use for matching/searching"]:::style8
classDef style1 fill:#00ADD8,stroke:#00758f,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style2 fill:#5dc9e2,stroke:#00ADD8,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style3 fill:#43e97b,stroke:#38f9d7,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style4 fill:#ffd700,stroke:#d99120,color:#222,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style5 fill:#6b5bff,stroke:#4a3f6b,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style6 fill:#ff9800,stroke:#f57c00,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style7 fill:#ff4f81,stroke:#c43e3e,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style8 fill:#00ADD8,stroke:#00758f,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
linkStyle default stroke:#e67e22,stroke-width:3px;
Regex Magic: Finding Patterns!
Pattern matching helps us find and extract specific pieces of text using Regular Expressions (regex). Go's regexp package offers powerful functions for this!
Checking & Finding Matches
Is It There? MatchString()
This method simply checks if any part of your text contains the pattern. It returns true or false. Example: regexp.MustCompile("world").MatchString("hello world") returns true.
Finding the First Match: FindString()
Want the actual first piece of text that matches? FindString() returns just that (or an empty string if nothing matches). Its cousin, FindStringIndex(), gives you where it starts and ends. Example: regexp.MustCompile("o.l").FindString("hello world") returns "orl".
Catching All Matches: FindAllString()
To collect every non-overlapping match, FindAllString() is your go-to. Pass -1 as the limit to find all possible matches. Example: regexp.MustCompile("a").FindAllString("banana", -1) returns ["a" "a" "a"].
graph TD
A["Check if pattern exists?"]:::style1 -->|"Yes"| B["MatchString()"]:::style2
A -->|"Need actual match?"| C["What to find?"]:::style3
C -->|"First one"| D["FindString()"]:::style4
C -->|"All of them"| E["FindAllString()"]:::style5
classDef style1 fill:#00ADD8,stroke:#00758f,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style2 fill:#43e97b,stroke:#38f9d7,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style3 fill:#5dc9e2,stroke:#00ADD8,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style4 fill:#ffd700,stroke:#d99120,color:#222,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style5 fill:#6b5bff,stroke:#4a3f6b,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
linkStyle default stroke:#e67e22,stroke-width:3px;
Behind the Scenes: Bytes & Indices
Similar methods like Match(), Find(), FindIndex() work with raw byte slices ([]byte). Index variants (e.g., FindStringIndex()) return the match's start and end positions. For finding matches with capturing groups, explore the Submatch variants like FindStringSubmatch()!
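As a quick tour of the method family described in this section, the sketch below runs the string, index, and byte-slice variants side by side. The variable names are illustrative; every method shown is part of the standard regexp package.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	oDotL    = regexp.MustCompile(`o.l`)    // "o", any character, then "l"
	digitsRe = regexp.MustCompile(`[0-9]+`) // one or more digits
)

func main() {
	s := "hello world"

	fmt.Println(oDotL.MatchString(s))         // true
	fmt.Println(oDotL.FindString(s))          // orl
	fmt.Println(oDotL.FindStringIndex(s))     // [7 10]
	fmt.Printf("%s\n", oDotL.Find([]byte(s))) // orl (byte-slice variant)

	// The second argument caps the number of matches; -1 means no limit.
	fmt.Println(digitsRe.FindAllString("a1 b22 c333", -1)) // [1 22 333]
	fmt.Println(digitsRe.FindAllString("a1 b22 c333", 2))  // [1 22]

	// FindString returns "" when nothing matches.
	fmt.Println(digitsRe.FindString("no digits") == "") // true
}
```

Note that the start index is inclusive and the end index is exclusive, so s[7:10] yields the matched text.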
Unleashing Data with Regex Capturing Groups!
Ever needed to pick out specific bits of text from a larger string? That's where capturing groups in regular expressions come in handy! They let you "capture" specific parts of a matched pattern using (). Think of them as special nets for the exact data you want to retrieve.
Extracting Data with FindStringSubmatch()!
Wrap any part of your regex in parentheses (...) to define a capturing group. The FindStringSubmatch() method of a compiled *regexp.Regexp is your go-to for extracting these captures. It returns a string slice: the first element ([0]) is always the full match, followed by your captured groups ([1], [2], etc.). If there is no match, it returns nil, so check the result before indexing into it.
- Example (Basic Capture):

package main

import (
    "fmt"
    "regexp"
)

func main() {
    re := regexp.MustCompile(`Hello (\w+)!`) // (\w+) captures a word
    match := re.FindStringSubmatch("Hello World!")
    // match[0]: "Hello World!" (full match)
    // match[1]: "World" (1st capture)
    fmt.Println("Full Match:", match[0], "\nCaptured:", match[1])
}
Naming Your Treasures with (?P<name>...)!
For clearer code, you can give your capturing groups a name using (?P<name>...). This makes accessing them by name, rather than just an index, much easier to read! After FindStringSubmatch(), use the SubexpIndex() method (available since Go 1.15) to find the slice index for a named group, or SubexpNames() to get the names of all groups in order.
- Example (Named Capture):

package main

import (
    "fmt"
    "regexp"
)

func main() {
    re := regexp.MustCompile(`User: (?P<username>\w+)`) // (?P<username>\w+) names the capture
    match := re.FindStringSubmatch("User: Alice")
    usernameIndex := re.SubexpIndex("username") // Get index for "username"
    // Output:
    // Captured Username: Alice
    fmt.Println("Captured Username:", match[usernameIndex])
}
Capture Flow Visualized!
graph TD
A["Input String"]:::style1 --> B["Regex Pattern<br/>with Captures ()"]:::style2
B --> C["FindStringSubmatch()"]:::style3
C -- "Returns" --> D["String Slice<br/>Full Match + Captures"]:::style4
D -- "Named Group" --> E["SubexpNames()<br/>for Index Mapping"]:::style5
E --> F["Access by<br/>Name/Index"]:::style6
classDef style1 fill:#00ADD8,stroke:#00758f,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style2 fill:#5dc9e2,stroke:#00ADD8,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style3 fill:#ffd700,stroke:#d99120,color:#222,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style4 fill:#6b5bff,stroke:#4a3f6b,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style5 fill:#ff9800,stroke:#f57c00,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style6 fill:#43e97b,stroke:#38f9d7,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
linkStyle default stroke:#e67e22,stroke-width:3px;
Master Text Replacement in Go!
Text replacement is a powerful tool for manipulating strings. Go's regexp package offers fantastic ways to do this, both for simple and complex scenarios. Let's dive in!
Simple Swaps with ReplaceAllString()
This function is your go-to for fixed replacements. It finds all matches of a regular expression and substitutes them with a specified string. You can even use backreferences (like $1, $2) to re-use parts of your matched text!
package main

import (
    "fmt"
    "regexp"
)

func main() {
    text := "Hello Mr. Smith and Ms. Jane!"
    // Regex: find (Mr|Ms). followed by a word
    re := regexp.MustCompile(`(Mr|Ms)\. (\w+)`)
    // Replace with "Prefix. Name (Esq.)" using $1 (prefix) and $2 (name)
    newText := re.ReplaceAllString(text, "$1. $2 (Esq.)")
    fmt.Println("Simple Replacement:", newText)
    // Output: Simple Replacement: Hello Mr. Smith (Esq.) and Ms. Jane (Esq.)!
}

Think of ReplaceAllString() as a straightforward "find and replace" operation.
graph TD
A["Original Text"]:::style1 --> B{"Regex Match?"}:::style2
B -- "Yes" --> C["Replace with<br/>Fixed String/$1 $2"]:::style3
B -- "No" --> A
C --> D["Resulting Text"]:::style4
A -- "All Processed" --> D
classDef style1 fill:#00ADD8,stroke:#00758f,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style2 fill:#5dc9e2,stroke:#00ADD8,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style3 fill:#ffd700,stroke:#d99120,color:#222,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style4 fill:#43e97b,stroke:#38f9d7,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
linkStyle default stroke:#e67e22,stroke-width:3px;
Dynamic Changes with ReplaceAllStringFunc()
Need more control? ReplaceAllStringFunc() lets you provide a function that determines the replacement string for each match. This is super handy for dynamic transformations! The function receives only the matched text as a single string, not the capture groups, so if you need the groups, run FindStringSubmatch() on the match inside the function.

package main

import (
    "fmt"
    "regexp"
)

func main() {
    dates := "Meeting on 2023-10-26 and 2024-01-15."
    // Regex: YYYY-MM-DD pattern
    re := regexp.MustCompile(`(\d{4})-(\d{2})-(\d{2})`)
    // Custom function to reformat each date
    reformattedDates := re.ReplaceAllStringFunc(dates, func(match string) string {
        // match is the full matched text (e.g., "2023-10-26").
        // Re-run the pattern on it to recover the capture groups.
        parts := re.FindStringSubmatch(match)
        // parts[1] = year, parts[2] = month, parts[3] = day
        return fmt.Sprintf("%s/%s/%s", parts[1], parts[2], parts[3]) // YYYY/MM/DD
    })
    fmt.Println("Dynamic Replacement:", reformattedDates)
    // Output: Dynamic Replacement: Meeting on 2023/10/26 and 2024/01/15.
}

Here, the function allows custom logic for each match.
graph TD
A["Original Text"]:::style1 --> B{"Regex Match?"}:::style2
B -- "Yes" --> C["Call Custom Func<br/>with matched text"]:::style3
C --> D["Function Returns<br/>Replacement String"]:::style4
D --> E["Resulting Text"]:::style5
B -- "No" --> A
A -- "All Processed" --> E
classDef style1 fill:#00ADD8,stroke:#00758f,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style2 fill:#5dc9e2,stroke:#00ADD8,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style3 fill:#ffd700,stroke:#d99120,color:#222,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style4 fill:#ff9800,stroke:#f57c00,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
classDef style5 fill:#43e97b,stroke:#38f9d7,color:#fff,font-size:16px,stroke-width:3px,rx:14,shadow:6px;
linkStyle default stroke:#e67e22,stroke-width:3px;
In a Nutshell
- ReplaceAllString(): Use for straightforward, fixed text substitutions, often with backreferences.
- ReplaceAllStringFunc(): Opt for this when you need complex, conditional, or dynamic replacements by executing custom logic for each match.
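One detail worth seeing in code is how the replacement template parses group references. The sketch below, with illustrative names throughout, shows that $2_at_ is read as a reference to a group named "2_at_" (which does not exist and expands to nothing), so braces are needed when a reference is followed by letters, digits, or underscores. It also shows ReplaceAllStringFunc with an ordinary string function as the callback.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	pairRe     = regexp.MustCompile(`(\w+)@(\w+)`)   // two words joined by "@"
	longWordRe = regexp.MustCompile(`\b[a-z]{4,}\b`) // lowercase words of 4+ letters
)

// swapBraced uses ${n} so the group number ends where intended.
func swapBraced(s string) string {
	return pairRe.ReplaceAllString(s, "${2}_at_${1}")
}

// swapUnbraced shows the pitfall: "$2_at_" is parsed as one (unknown) name.
func swapUnbraced(s string) string {
	return pairRe.ReplaceAllString(s, "$2_at_$1")
}

// shout upper-cases each long word by passing strings.ToUpper as the callback.
func shout(s string) string {
	return longWordRe.ReplaceAllStringFunc(s, strings.ToUpper)
}

func main() {
	fmt.Println(swapBraced("user@host"))        // host_at_user
	fmt.Println(swapUnbraced("user@host"))      // user
	fmt.Println(shout("go makes regex simple")) // go MAKES REGEX SIMPLE
}
```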
Regex Speed Secrets Unveiled!
Hey there! Let's chat about making your regex run super fast without breaking a sweat. Understanding how regex performs can save you a lot of processing time!
Compile Once, Run Many!
Think of a regex like a special instruction manual. If you use the same manual often, it's much faster to compile it once (e.g., with regexp.MustCompile() in a package-level variable) and reuse the resulting *regexp.Regexp. Compiling pre-processes the pattern into an efficient internal form. Re-compiling the same pattern repeatedly (like calling regexp.MustCompile() or the package-level regexp.MatchString() inside a loop) wastes time, as the program "reads the manual" from scratch each time.
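In Go, the compile-once idea usually looks like the sketch below: the pattern lives in a package-level variable and every call reuses it. The names digitsRe and countNumbers are illustrative assumptions, not from the original post.

```go
package main

import (
	"fmt"
	"regexp"
)

// digitsRe is compiled once when the program starts and reused everywhere.
// A *regexp.Regexp is safe for concurrent use by multiple goroutines.
var digitsRe = regexp.MustCompile(`[0-9]+`)

// countNumbers counts the digit runs across all lines without recompiling
// the pattern inside the loop.
func countNumbers(lines []string) int {
	total := 0
	for _, line := range lines {
		total += len(digitsRe.FindAllString(line, -1))
	}
	return total
}

func main() {
	lines := []string{"order 12 shipped", "items: 3, 4 and 5", "no numbers"}
	fmt.Println(countNumbers(lines)) // 4
}
```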
Is Regex Always Best?
Not always! For simple tasks like checking whether text starts with specific characters (strings.HasPrefix()) or contains something (strings.Contains(), strings.Index()), the standard strings functions are often much faster and easier to read than a regex. Regex shines for intricate pattern matching, not basic string checks.
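Measure Your Speed!
Unsure which method is faster? Benchmark it! Go's testing package has benchmarks built in: put functions like these in a _test.go file and run go test -bench=. to compare regex vs. string operation speeds. Get concrete data for informed decisions.

// Example of benchmarking (the file also needs a package clause
// and imports for "regexp", "strings", and "testing")
var helloRe = regexp.MustCompile("hello")

func BenchmarkContains(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = strings.Contains("hello world", "hello")
    }
}

func BenchmarkRegexp(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = helloRe.MatchString("hello world")
    }
}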
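Quick Optimization Tips
- Pre-compile: Always compile your regex once if reusing it.
- Be Specific: Make your patterns as precise as possible.
- Anchors: Use ^ (start) and $ (end) to limit the search scope.
- Non-Greedy: Use *? or +? when you want the shortest possible match. Go's regexp engine is guaranteed to run in time linear in the size of the input and does not backtrack, so non-greedy operators change what is matched rather than guarding against backtracking blowups.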
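A short sketch of what anchors and non-greedy operators actually change in Go, using illustrative variable names. The greedy and lazy patterns differ in which text they return, and the anchored pattern only accepts input that matches from start to end.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	greedyRe   = regexp.MustCompile(`<.+>`)     // as much as possible
	lazyRe     = regexp.MustCompile(`<.+?>`)    // as little as possible
	anchoredRe = regexp.MustCompile(`^[0-9]+$`) // the whole string must be digits
)

func main() {
	html := "<b>bold</b>"

	fmt.Println(greedyRe.FindString(html)) // <b>bold</b>
	fmt.Println(lazyRe.FindString(html))   // <b>

	fmt.Println(anchoredRe.MatchString("123"))    // true
	fmt.Println(anchoredRe.MatchString("123abc")) // false
}
```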
Real-World Examples: Regex in Production Go Systems
Example 1: Email Validation Service
Email validators often use a regex as a first-pass format check. The pattern below is a simplified approximation, not a full RFC 5322 implementation.
package main

import (
    "fmt"
    "regexp"
    "strings"
)

type EmailValidator struct {
    emailRegex  *regexp.Regexp
    domainRegex *regexp.Regexp
    maskRegex   *regexp.Regexp
}

func NewEmailValidator() *EmailValidator {
    // Simplified pattern inspired by RFC 5322 (not a full implementation).
    // All patterns are compiled once here and reused by every method call.
    return &EmailValidator{
        emailRegex:  regexp.MustCompile(`^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$`),
        domainRegex: regexp.MustCompile(`@([a-zA-Z0-9.\-]+)$`),
        maskRegex:   regexp.MustCompile(`^([a-zA-Z0-9])([a-zA-Z0-9._%+\-]*)@`),
    }
}

func (ev *EmailValidator) IsValid(email string) bool {
    return ev.emailRegex.MatchString(strings.TrimSpace(email))
}

func (ev *EmailValidator) ExtractDomain(email string) string {
    email = strings.TrimSpace(email)
    if !ev.IsValid(email) {
        return ""
    }
    matches := ev.domainRegex.FindStringSubmatch(email)
    if len(matches) > 1 {
        return matches[1]
    }
    return ""
}

func (ev *EmailValidator) MaskEmail(email string) string {
    if !ev.IsValid(email) {
        return email
    }
    // Keep the first character of the local part and hide the rest.
    return ev.maskRegex.ReplaceAllString(strings.TrimSpace(email), "$1***@")
}

func main() {
    validator := NewEmailValidator()
    emails := []string{
        "john.doe@example.com",
        "invalid-email",
        "alice.smith@company.co.uk",
        "test@domain",
    }
    fmt.Println("📧 Email Validation Service")
    fmt.Println("=" + strings.Repeat("=", 50))
    for _, email := range emails {
        isValid := validator.IsValid(email)
        status := "❌ Invalid"
        if isValid {
            status = "✅ Valid"
        }
        fmt.Printf("\n%s: %s\n", email, status)
        if isValid {
            domain := validator.ExtractDomain(email)
            masked := validator.MaskEmail(email)
            fmt.Printf(" Domain: %s\n", domain)
            fmt.Printf(" Masked: %s\n", masked)
        }
    }
}
// The same kind of task appears in services such as:
// - Mailgun email validation API
// - SendGrid recipient validation
// - AWS SES bounce handling
Example 2: Log Parser for Monitoring Systems
Production log parsers extract structured data from unstructured logs!
package main

import (
    "fmt"
    "regexp"
    "strings"
    "time"
)

type LogEntry struct {
    Timestamp time.Time
    Level     string
    Service   string
    Message   string
    RequestID string
}

type LogParser struct {
    logPattern       *regexp.Regexp
    requestIDPattern *regexp.Regexp
}

func NewLogParser() *LogParser {
    // Pattern: [2024-01-15 10:30:45] [INFO] [auth-service] [req-123abc] User login successful
    pattern := `^\[(?P<timestamp>[^\]]+)\] \[(?P<level>\w+)\] \[(?P<service>[^\]]+)\] \[(?P<requestid>[^\]]+)\] (?P<message>.+)$`
    return &LogParser{
        logPattern:       regexp.MustCompile(pattern),
        requestIDPattern: regexp.MustCompile(`\[req-([a-z0-9]+)\]`),
    }
}

func (lp *LogParser) Parse(logLine string) (*LogEntry, error) {
    matches := lp.logPattern.FindStringSubmatch(logLine)
    if matches == nil {
        return nil, fmt.Errorf("invalid log format")
    }
    // Map each named group to its captured text.
    names := lp.logPattern.SubexpNames()
    result := make(map[string]string)
    for i, name := range names {
        if i > 0 && name != "" {
            result[name] = matches[i]
        }
    }
    timestamp, err := time.Parse("2006-01-02 15:04:05", result["timestamp"])
    if err != nil {
        return nil, fmt.Errorf("invalid timestamp %q: %w", result["timestamp"], err)
    }
    return &LogEntry{
        Timestamp: timestamp,
        Level:     result["level"],
        Service:   result["service"],
        Message:   result["message"],
        RequestID: result["requestid"],
    }, nil
}

func (lp *LogParser) FilterByLevel(logs []string, level string) []*LogEntry {
    var filtered []*LogEntry
    for _, log := range logs {
        entry, err := lp.Parse(log)
        if err == nil && entry.Level == level {
            filtered = append(filtered, entry)
        }
    }
    return filtered
}

func (lp *LogParser) ExtractRequestIDs(logs []string) []string {
    var ids []string
    for _, log := range logs {
        // The pattern is compiled once in NewLogParser, not on every call.
        matches := lp.requestIDPattern.FindStringSubmatch(log)
        if len(matches) > 1 {
            ids = append(ids, matches[1])
        }
    }
    return ids
}

func main() {
    parser := NewLogParser()
    logs := []string{
        "[2024-01-15 10:30:45] [INFO] [auth-service] [req-123abc] User login successful",
        "[2024-01-15 10:31:22] [ERROR] [payment-service] [req-456def] Payment processing failed",
        "[2024-01-15 10:32:10] [INFO] [auth-service] [req-789ghi] Session refreshed",
        "[2024-01-15 10:33:05] [WARN] [api-gateway] [req-abc123] Rate limit approaching",
    }
    fmt.Println("Log Parser Analysis")
    fmt.Println("=" + strings.Repeat("=", 60))
    for _, log := range logs {
        entry, err := parser.Parse(log)
        if err != nil {
            fmt.Printf("❌ Failed to parse: %s\n", log)
            continue
        }
        var emoji string
        switch entry.Level {
        case "INFO":
            emoji = "ℹ️"
        case "ERROR":
            emoji = "🔴"
        case "WARN":
            emoji = "⚠️"
        }
        fmt.Printf("\n%s [%s] %s\n", emoji, entry.Level, entry.Service)
        fmt.Printf(" Time: %s\n", entry.Timestamp.Format("15:04:05"))
        fmt.Printf(" Request: %s\n", entry.RequestID)
        fmt.Printf(" Message: %s\n", entry.Message)
    }
    // Filter errors only
    fmt.Println("\nError Logs Only:")
    errorLogs := parser.FilterByLevel(logs, "ERROR")
    fmt.Printf("Found %d error(s)\n", len(errorLogs))
    // Extract all request IDs
    requestIDs := parser.ExtractRequestIDs(logs)
    fmt.Printf("\nRequest IDs: %v\n", requestIDs)
}
// The same kind of task appears in log pipelines such as:
// - Datadog log aggregation
// - Splunk log indexing
// - ELK Stack (Elasticsearch, Logstash, Kibana)
Example 3: URL Router with Dynamic Path Parameters
HTTP routers use regex for path matching and parameter extraction!
package main

import (
    "fmt"
    "regexp"
    "strings"
)

type Route struct {
    Pattern *regexp.Regexp
    Handler string
    Params  []string
}

type Router struct {
    routes []*Route
}

func NewRouter() *Router {
    return &Router{
        routes: make([]*Route, 0),
    }
}

func (r *Router) AddRoute(path string, handler string) {
    // Convert /users/:id/posts/:postId to regex with named groups
    paramRegex := regexp.MustCompile(`:(\w+)`)
    params := []string{}
    // Find all parameters
    for _, match := range paramRegex.FindAllStringSubmatch(path, -1) {
        params = append(params, match[1])
    }
    // Replace :param with named capture groups
    pattern := paramRegex.ReplaceAllString(path, `(?P<$1>[^/]+)`)
    pattern = "^" + pattern + "$"
    r.routes = append(r.routes, &Route{
        Pattern: regexp.MustCompile(pattern),
        Handler: handler,
        Params:  params,
    })
}

func (r *Router) Match(path string) (string, map[string]string, bool) {
    for _, route := range r.routes {
        matches := route.Pattern.FindStringSubmatch(path)
        if matches == nil {
            continue
        }
        // Extract parameters
        params := make(map[string]string)
        names := route.Pattern.SubexpNames()
        for i, name := range names {
            if i > 0 && name != "" {
                params[name] = matches[i]
            }
        }
        return route.Handler, params, true
    }
    return "", nil, false
}

func main() {
    router := NewRouter()
    // Register routes with dynamic parameters
    router.AddRoute("/users/:id", "GetUserHandler")
    router.AddRoute("/users/:id/posts/:postId", "GetPostHandler")
    router.AddRoute("/api/v:version/products/:sku", "GetProductHandler")
    // Note: the trailing + stays in the pattern as a regex quantifier. Because
    // [^/]+ never matches "/", this route does not match multi-segment paths
    // such as /files/images/logo.png.
    router.AddRoute("/files/:path+", "GetFileHandler")
    testPaths := []string{
        "/users/123",
        "/users/456/posts/789",
        "/api/v2/products/ABC-123",
        "/files/images/logo.png",
        "/unknown/path",
    }
    fmt.Println("🚦 URL Router Test")
    fmt.Println("=" + strings.Repeat("=", 70))
    for _, path := range testPaths {
        handler, params, matched := router.Match(path)
        if matched {
            fmt.Printf("\n✅ Path: %s\n", path)
            fmt.Printf(" Handler: %s\n", handler)
            fmt.Printf(" Params: %v\n", params)
        } else {
            fmt.Printf("\n❌ Path: %s (No match)\n", path)
        }
    }
}
// Regex-based path matching like this is used by routers such as Gorilla Mux.
// Other popular routers (for example Gin and Echo) rely on radix trees instead.
Hands-On Assignment: Build a Text Processing CLI Tool
Your Mission
Build a production-ready CLI tool that processes text files using regular expressions for pattern matching, data extraction, and text transformation!
Requirements
- Phone Number Extractor:
  - Find and extract phone numbers in formats: (123) 456-7890, 123-456-7890, +1-123-456-7890
  - Validate format using regex
  - Group by country code
  - Format output consistently
- URL Parser:
  - Extract URLs from text using regex
  - Parse protocol, domain, path, query parameters
  - Use named capturing groups
  - Validate URL structure
- Credit Card Masker:
  - Find credit card numbers (Visa, MasterCard, Amex formats)
  - Mask all but last 4 digits: **** **** **** 1234
  - Use ReplaceAllStringFunc() for dynamic masking
  - Preserve original spacing
- Date Normalizer:
  - Find dates in multiple formats (MM/DD/YYYY, DD-MM-YYYY, YYYY.MM.DD)
  - Convert all to ISO 8601 format (YYYY-MM-DD)
  - Use capturing groups to extract day, month, year
  - Handle invalid dates gracefully
- Markdown Link Converter:
  - Find markdown links: [text](url)
  - Convert to HTML: <a href="url">text</a>
  - Use backreferences in replacement
  - Handle nested brackets
- Performance Metrics:
  - Compile regex patterns once (pre-compilation)
  - Benchmark processing time for large files
  - Report matches found, replacements made
  - Memory-efficient streaming for large files
Starter Code
package main

import (
    "bufio"
    "flag"
    "fmt"
    "os"
    "regexp"
    "strings"
    "time"
)

type TextProcessor struct {
    phoneRegex      *regexp.Regexp
    urlRegex        *regexp.Regexp
    creditCardRegex *regexp.Regexp
    dateRegex       *regexp.Regexp
    markdownRegex   *regexp.Regexp
}

func NewTextProcessor() *TextProcessor {
    return &TextProcessor{
        // TODO: Compile all regex patterns
        // [-. ]? also allows the space in "(123) 456-7890"
        phoneRegex: regexp.MustCompile(`(\+?1[-.])?\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})`),
        // Add more regex patterns...
    }
}

func (tp *TextProcessor) ExtractPhoneNumbers(text string) []string {
    // TODO: Find and format all phone numbers
    matches := tp.phoneRegex.FindAllStringSubmatch(text, -1)
    var phones []string
    for _, match := range matches {
        // Format: (XXX) XXX-XXXX
        formatted := fmt.Sprintf("(%s) %s-%s", match[2], match[3], match[4])
        phones = append(phones, formatted)
    }
    return phones
}

func (tp *TextProcessor) MaskCreditCards(text string) string {
    // TODO: Mask credit card numbers
    // creditCardRegex is nil until you compile it in NewTextProcessor;
    // calling this method before then will panic.
    return tp.creditCardRegex.ReplaceAllStringFunc(text, func(match string) string {
        // Keep last 4 digits, mask rest
        if len(match) < 4 {
            return match
        }
        last4 := match[len(match)-4:]
        return "**** **** **** " + last4
    })
}

func (tp *TextProcessor) ProcessFile(filename string) error {
    file, err := os.Open(filename)
    if err != nil {
        return err
    }
    defer file.Close()
    scanner := bufio.NewScanner(file)
    start := time.Now()
    for scanner.Scan() {
        line := scanner.Text()
        // TODO: Process each line
        fmt.Println(line)
    }
    duration := time.Since(start)
    fmt.Printf("\n⏱️ Processing time: %v\n", duration)
    return scanner.Err()
}

func main() {
    filename := flag.String("file", "input.txt", "Input file to process")
    mode := flag.String("mode", "all", "Processing mode: phone|url|card|date|markdown|all")
    flag.Parse()
    processor := NewTextProcessor()
    fmt.Println("Text Processing CLI Tool")
    fmt.Println("Mode:", *mode)
    fmt.Println("File:", *filename)
    fmt.Println("=" + strings.Repeat("=", 50))
    if err := processor.ProcessFile(*filename); err != nil {
        fmt.Printf("❌ Error: %v\n", err)
        os.Exit(1)
    }
}
Bonus Challenges
- Level 2: Add email extraction and validation (RFC 5322 compliant)
- Level 3: Implement IP address finder (IPv4 and IPv6)
- Level 4: Add hashtag and @mention extractor for social media text
- Level 5: Create sensitive data redactor (SSN, passport numbers, API keys)
- Level 6: Build HTML tag stripper that preserves text content
- Level 7: Add SQL injection pattern detector for security scanning
Learning Goals
- Master regex pattern compilation and reuse
- Use capturing groups and named groups effectively
- Apply ReplaceAllStringFunc() for dynamic transformations
- Implement performance optimization techniques
- Build production-ready text processing tools
Pro Tip: The same building blocks (pre-compiled patterns, capture groups, and replacement functions) show up in real-world tasks such as code search, chat message parsing, and log analysis!
Share Your Solution!
Completed the project? Post your code in the comments below! Show us your regex mastery!
Conclusion: Master Regular Expressions in Go
Go's regexp package provides a powerful, efficient toolkit for pattern matching, text extraction, and data transformation in production systems. By mastering compilation strategies, capturing groups, replacement functions, and performance optimization techniques, you can build robust text processing applications, from log parsers and URL routers to data validators and content filters powering modern Go services and CLI tools.