Regex 1

Getting Started

Resources

What Is It?

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module.

How Does It Work?

Suppose you wanted to know if an email address matched a particular pattern.

We first compile a regular expression. # 1
We can then invoke various methods and attributes on the compiled regular expression object. # 2

The pattern is simply a string. The ‘r’ before the opening quotation mark tells Python that it’s a “raw” string, so backslashes reach the re module unchanged instead of being interpreted as string escape sequences.
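The effect of the raw-string prefix can be seen by comparing string lengths. This is a small illustrative sketch; the variable names are made up for the example.

```python
# Without the r prefix, Python converts "\n" into a single newline character
# before the re module ever sees the pattern.
plain = "\n"
raw = r"\n"

print(len(plain))  # 1: one newline character
print(len(raw))    # 2: a backslash followed by the letter n
```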

import re

p = re.compile(r"[a-zA-Z0-9]*@zipcodewilmington\.com")  # 1

m1 = p.findall("roberto@zipcodewilmington.com")  # 2
m1
# ['roberto@zipcodewilmington.com']

m2 = p.findall("billgates@microsoft.com")
m2
# []
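findall reports matches found anywhere in the text, so a string with extra characters around the address still produces a result. When the goal is to check that the whole string has the expected shape, re.fullmatch is one option. The sketch below assumes a hypothetical helper named is_zipcode_email; the dot in the pattern is escaped so that it matches only a literal ".".

```python
import re

# Hypothetical helper for illustration; not part of the re module.
EMAIL = re.compile(r"[a-zA-Z0-9]*@zipcodewilmington\.com")

def is_zipcode_email(text):
    """Return True only if the entire string matches the pattern."""
    return EMAIL.fullmatch(text) is not None

print(is_zipcode_email("roberto@zipcodewilmington.com"))     # True
print(is_zipcode_email("roberto@zipcodewilmington.com!!!"))  # False
print(is_zipcode_email("billgates@microsoft.com"))           # False
```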

-

Special Sequences

Sequence	Description	Equivalent class
\d	Any decimal digit	[0-9]
\D	Any non-digit character	[^0-9]
\s	Any whitespace character	[ \t\n\r\f\v]
\S	Any non-whitespace character	[^ \t\n\r\f\v]
\w	Any alphanumeric character or underscore	[a-zA-Z0-9_]
\W	Any character other than an alphanumeric or underscore	[^a-zA-Z0-9_]
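The equivalent classes in the table are exact for ASCII text. For str patterns in Python 3, \d, \s, and \w also match Unicode characters by default, and the re.ASCII flag restricts them to the classes shown. A short sketch, with sample text made up for illustration:

```python
import re

# "\u0663" is ARABIC-INDIC DIGIT THREE, a decimal digit outside ASCII.
text = "7 \u0663"

unicode_digits = re.findall(r"\d", text)
ascii_digits = re.findall(r"\d", text, flags=re.ASCII)

print(len(unicode_digits))  # 2: both digits match by default
print(ascii_digits)         # ['7']
```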

\d

Matches any decimal digit

p = re.compile(r"\d")
m = p.findall("1 over the 8.")
m
# ['1', '8']

\D

Any non-digit character

p = re.compile(r"\D")
m = p.findall("catch 22")
m
# ['c', 'a', 't', 'c', 'h', ' ']

\w

Matches any alphanumeric character or underscore

p = re.compile(r"\w")
m = p.findall("1a!2b@3c#4d$5e%6f^7g&8h*9i(0j)")
m
# ['1', 'a', '2', 'b', '3', 'c', '4', 'd', '5', 'e', '6', 'f', '7', 'g', '8', 'h', '9', 'i', '0', 'j']

\W

Matches any non-alphanumeric character (anything other than a letter, digit, or underscore).

p = re.compile(r"\W")
m = p.findall("1a!2b@3c#4d$5e%6f^7g&8h*9i(0j)")
m
# ['!', '@', '#', '$', '%', '^', '&', '*', '(', ')']
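The table also lists \s and \S, which have no example of their own. A minimal sketch in the same style, using made-up sample strings; re.split is used here only to show a common use of \s:

```python
import re

whitespace = re.findall(r"\s", "tab\there\nnew line")
non_whitespace = re.findall(r"\S", "a b")

# Splitting on runs of whitespace is a common use of \s.
words = re.split(r"\s+", "one  two\tthree\nfour")

print(whitespace)      # ['\t', '\n', ' ']
print(non_whitespace)  # ['a', 'b']
print(words)           # ['one', 'two', 'three', 'four']
```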

-

Metacharacters

Part 1

.
[]
\

. - Dot

Matches any character except a newline.

p = re.compile(r".uck")
m1 = p.match("duck")
m1 
# <re.Match object; span=(0, 4), match='duck'>

m2 = p.match("muck")
m2
# <re.Match object; span=(0, 4), match='muck'>

m3 = p.match("buck")
m3
# <re.Match object; span=(0, 4), match='buck'>
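The "except a newline" part can be checked directly. In this sketch the re.DOTALL flag, which is not covered in the slides above, is shown as the standard way to make the dot match newlines as well:

```python
import re

p = re.compile(r".uck")
no_match = p.match("\nuck")           # the dot refuses the newline

p_dotall = re.compile(r".uck", re.DOTALL)
with_flag = p_dotall.match("\nuck")   # with DOTALL the dot accepts it

print(no_match)               # None
print(with_flag is not None)  # True
```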

[] - Character Class

A character class is a set of characters that you wish to match.

Characters can be listed individually

[abcdefghijklmnopqrstuvwxyz] 

Characters can be listed as a range of characters

[a-z]  # a range is indicated by using a dash between two characters

[] - Character Class

p = re.compile(r"[cd][ao][gt]")
m1 = p.findall("dog")
m1
# ['dog']

m2 = p.findall("cat")
m2
# ['cat']

m3 = p.findall("bat")
m3
# []
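A listed class and the corresponding range behave the same way. A small sketch with a made-up sample string:

```python
import re

listed = re.compile(r"[abcdef]")
ranged = re.compile(r"[a-f]")

text = "badge xyz"
print(listed.findall(text))  # ['b', 'a', 'd', 'e']
print(ranged.findall(text))  # ['b', 'a', 'd', 'e']
```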

[] - Character Class

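Most metacharacters are not active inside character classes. The exceptions are \ (still an escape), ^ (negates the class when it comes first), - (forms a range between two characters), and ] (closes the class).

'.' (dot) is usually a metacharacter, but inside a character class it loses its special meaning.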

p = re.compile(r"[zc.]")  # will match any of the characters 'z', 'c', '.'

m1 = p.findall(".")
m1
# ['.']

m2 = p.findall("2")
m2
# []
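A few characters keep a special role inside a character class: the backslash, a leading ^, and a dash between two characters. The following sketch, with made-up inputs, shows each one:

```python
import re

# Backslash sequences still work inside a class; the dot is literal.
digits_and_dots = re.findall(r"[\d.]", "3.14")

# A leading ^ negates the class.
not_abc = re.findall(r"[^abc]", "abcd")

# An escaped dash is a literal dash rather than a range.
literal_dash = re.findall(r"[a\-z]", "a-z m")

print(digits_and_dots)  # ['3', '.', '1', '4']
print(not_abc)          # ['d']
print(literal_dash)     # ['a', '-', 'z']
```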

\ - Backslash

Signal various special sequences.

Can be followed by various characters to signal various special sequences.

p1 = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")
m1 = p1.findall("Zip code's phone number is: 302-256-5203")
m1
# ['302-256-5203']

\ - Backslash

Escape metacharacters

Can be used to escape all the metacharacters so you can still match them in patterns.

p1 = re.compile(r"\$\d\d,\d\d\d")
m1 = p1.findall("Average developer base pay is $76,526 dollars per year.")
m1
# ['$76,526']

p2 = re.compile(r"\[Brazil\]")
m2 = p2.findall("The current leading team [Brazil] is likely to win the world cup again.")
m2 
# ['[Brazil]']
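When the text to match is not known in advance, escaping by hand is error-prone. One option is re.escape, which adds the backslashes for you. A sketch, with the sample sentence shortened for illustration:

```python
import re

literal = "[Brazil]"

# Unescaped, the brackets form a character class of single letters.
as_class = re.findall(literal, "Brazil")

# re.escape produces a pattern that matches the text literally.
p = re.compile(re.escape(literal))
as_literal = p.findall("The current leading team [Brazil] is likely to win.")

print(as_class)    # ['B', 'r', 'a', 'z', 'i', 'l']
print(as_literal)  # ['[Brazil]']
```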

-

Metacharacters

Part 2 - Repeating Things

*
+
?
{m}
{m,n}

*

The * metacharacter is a greedy quantifier which specifies that the preceding item can be matched zero or more times.

p = re.compile(r"\w*@zipcodewilmington.com")

m1 = p.findall("kris@zipcodewilmington.com")
m1
# ['kris@zipcodewilmington.com']

m2 = p.findall("chris@zcw.com")
m2
# []
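Two consequences of "zero or more" and "greedy" are worth seeing in isolation. This is a minimal sketch with made-up inputs; the dot before "com" is escaped so it matches only a literal ".".

```python
import re

p = re.compile(r"\w*@zipcodewilmington\.com")

# Zero repetitions are allowed, so an empty local part still matches.
empty_local = p.findall("@zipcodewilmington.com")

# Greedy: * takes as many characters as it can.
greedy = re.match(r"a*", "aaab").group()

# With nothing to repeat, * matches the empty string rather than failing.
zero = re.match(r"a*", "bbb").group()

print(empty_local)   # ['@zipcodewilmington.com']
print(repr(greedy))  # 'aaa'
print(repr(zero))    # ''
```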

+

The + metacharacter is a greedy quantifier which specifies that the preceding item must be matched one or more times.

p = re.compile(r"\d+\s.*")

m1 = p.findall("123 Sesame Street")
m1
# ['123 Sesame Street']

m2 = p.findall("Fraggle Rock")
m2 
# []
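The difference between * and + shows up on empty input. A short sketch with made-up strings:

```python
import re

star = re.fullmatch(r"\d*", "")  # zero digits is acceptable for *
plus = re.fullmatch(r"\d+", "")  # + needs at least one digit

runs = re.findall(r"\d+", "room 101, floor 3")

print(star is not None)  # True
print(plus)              # None
print(runs)              # ['101', '3']
```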

?

The ? metacharacter matches the preceding item either once or zero times, which makes that item optional.

p = re.compile(r"P\.?O\.? Box \d+")

m1 = p.findall("P.O. Box 55 PO Box 679")
m1
# ['P.O. Box 55', 'PO Box 679']

m2 = p.findall("P..O. Box 55")
m2 
# []
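Another common use of ? is to accept spelling variants. A minimal sketch with a made-up sentence:

```python
import re

p = re.compile(r"colou?r")

# The "u" may appear once or not at all, but not twice.
found = p.findall("color colour colouur")

print(found)  # ['color', 'colour']
```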

{m}

Specifies that exactly m copies of the previous RE should be matched.

# Simplistic MM/DD/YYYY or MM-DD-YYYY 
# Does not account for days greater than 31, months greater than 12, min value for year, etc.
p = re.compile(r"\d{2}[/-]\d{2}[/-]\d{4}")
m = p.findall("02-29-2020 01/01/1970 12312021")
m
# ['02-29-2020', '01/01/1970']
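Since the pattern only checks the shape of a date, one possible follow-up is to pass each candidate to datetime.strptime, which rejects impossible dates. This is a sketch under that assumption; real_dates is a hypothetical helper name and the sample text is made up.

```python
import re
from datetime import datetime

DATE = re.compile(r"\d{2}[/-]\d{2}[/-]\d{4}")

def real_dates(text):
    """Return only the candidates that are actual calendar dates."""
    valid = []
    for candidate in DATE.findall(text):
        try:
            # Normalize the separator so a single format string suffices.
            datetime.strptime(candidate.replace("/", "-"), "%m-%d-%Y")
        except ValueError:
            continue  # e.g. month 13 or February 30
        valid.append(candidate)
    return valid

print(real_dates("02-29-2020 02-30-2020 13/01/1970 01/01/1970"))
# ['02-29-2020', '01/01/1970']
```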

{m,n}

This quantifier means that there must be at least m repetitions and at most n repetitions. Write it without a space after the comma; Python's re module treats {m, n} as literal text.

users = "jack77,ryu,ken*$,doesntknowhentostop"

p = re.compile(r"\w{5,13},")
m = p.findall(users)
m
# ['jack77,']
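The quantifier limits the length of the match, not the length of the surrounding word, so an overlong name followed by a comma would still match on its last 13 characters. One way around this, sketched here as an assumption rather than the only approach, is to split the list first and test each name with fullmatch:

```python
import re

users = "jack77,ryu,ken*$,doesntknowhentostop"

# An overlong run followed by a comma still matches on its tail.
tail = re.findall(r"\w{5,13},", "abcdefghijklmnopqrstuvwxyz,")

# Splitting first and using fullmatch checks each whole name.
p = re.compile(r"\w{5,13}")
valid = [name for name in users.split(",") if p.fullmatch(name)]

print(tail)   # ['nopqrstuvwxyz,']
print(valid)  # ['jack77']
```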

-

Metacharacters

Part 3 - Anchors, Logical or, & Groups

^
$
()
|

^

Matches the start of the string.

valid_entry = "John, Smith, 29, New York City, NY"
invalid_entry = "1bob, Taylor, 43, Atlantic City, NJ"

p = re.compile(r"^[A-Z][a-z]{1,15},\s?[A-Z][a-z]{1,15},\s?\d{1,3},[\w\s]*,\s?[A-Z]{2}")
m1 = p.findall(valid_entry)
m1
# ['John, Smith, 29, New York City, NY']

m2 = p.findall(invalid_entry)
m2
# []
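By default ^ matches only at the very start of the string. The re.MULTILINE flag, not covered in the slides above, makes it match at the start of every line as well. A short sketch with made-up text:

```python
import re

text = "one two\nthree four"

default = re.findall(r"^\w+", text)
multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

print(default)    # ['one']
print(multiline)  # ['one', 'three']
```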

$

Matches the end of the string or just before the newline at the end of the string.

valid_entry = "John, Smith, 29, New York City, NY"
invalid_entry = "John, Smith, 29, New York City, NY,"

p = re.compile(r"^[A-Z][a-z]{1,15},\s?[A-Z][a-z]{1,15},\s?\d{1,3},[\w\s]*,\s?[A-Z]{2}$")
m1 = p.findall(valid_entry)
m1
# ['John, Smith, 29, New York City, NY']

m2 = p.findall(invalid_entry)
m2
# []
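The "just before the newline at the end of the string" case can be checked with a shorter pattern. A minimal sketch with made-up inputs:

```python
import re

p = re.compile(r"NY$")

at_end = p.search("New York City, NY")
before_newline = p.search("New York City, NY\n")  # trailing newline is tolerated
trailing_comma = p.search("New York City, NY,")   # any other trailing text is not

print(at_end is not None)          # True
print(before_newline is not None)  # True
print(trailing_comma)              # None
```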

()

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group.

transaction = "03/24/2020 time: 135500 userid: drseuss amount: US$1500"
p = re.compile(r"^(\d{2}/\d{2}/\d{4}\s)(time:\s\d{6}\s)(userid:\s\w{1,10}\s)(amount:\sUS\$\d{1,9})$")

for match in p.finditer(transaction):
    print(f"first group  \t ----> {match.group(1)}")
    print(f"second group \t ----> {match.group(2)}")
    print(f"third group  \t ----> {match.group(3)}")
    print(f"fourth group \t ----> {match.group(4)}")

# first group  	 ----> 03/24/2020 
# second group 	 ----> time: 135500 
# third group  	 ----> userid: drseuss 
# fourth group 	 ----> amount: US$1500
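Groups can also be given names with the (?P<name>...) syntax, which avoids counting parentheses. The sketch below is a variation on the pattern above, not a replacement for it; the group names are chosen for illustration, and the labels and whitespace are left outside the groups so that only the values are captured.

```python
import re

transaction = "03/24/2020 time: 135500 userid: drseuss amount: US$1500"

p = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s"
    r"time:\s(?P<time>\d{6})\s"
    r"userid:\s(?P<userid>\w{1,10})\s"
    r"amount:\sUS\$(?P<amount>\d{1,9})$"
)

match = p.search(transaction)
fields = match.groupdict()  # maps each group name to its matched text

print(fields["date"])    # 03/24/2020
print(fields["userid"])  # drseuss
print(fields["amount"])  # 1500
```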

| - Alternation / “or” operator

Creates a regular expression that will match either A or B.

p = re.compile(r"DATA-A\d{4}|DATA-B\w{2}\d{2}")
data1 = "DATA-A1234"
data2 = "DATA-BCD56"

p.findall(data1)
# ['DATA-A1234']

p.findall(data2)
# ['DATA-BCD56']

p.findall("DATA-C4321")
# []
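Alternation has the lowest precedence of all the operators, so everything to the left of | is one alternative and everything to the right is the other. A group limits its scope. The sketch below uses a non-capturing group, (?:...), so that findall still returns the whole match; the sample strings are made up.

```python
import re

# Without a group, this means "DATA-A" or "B", not "DATA-" then A or B.
loose = re.findall(r"DATA-A|B", "B")

# Factoring out the shared prefix with a non-capturing group.
p = re.compile(r"DATA-(?:A\d{4}|B\w{2}\d{2})")
found = p.findall("DATA-A1234 DATA-BCD56 DATA-C4321")

print(loose)  # ['B']
print(found)  # ['DATA-A1234', 'DATA-BCD56']
```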

-

The End
