Regex 1
Getting Started
-
Resources
- Regex How To
- Regular Expression Syntax
-
What Is It?
Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module.
-
How Does It Work?
Suppose you wanted to know if an email address matched a particular pattern.
We first compile a regular expression. # 1
We can then invoke various methods and attributes on the compiled regular expression object. # 2
- The pattern is simply a string. The ‘r’ before the opening quotation mark tells Python that it’s a “raw” string, so backslashes reach the regex engine unchanged instead of being interpreted as string escapes.
import re
p = re.compile(r"[a-zA-Z0-9]*@zipcodewilmington\.com") # 1
m1 = p.findall("roberto@zipcodewilmington.com") # 2
m1
# ['roberto@zipcodewilmington.com']
m2 = p.findall("billgates@microsoft.com")
m2
# []
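findall returns every match found anywhere in the string, so it also succeeds when the address is buried inside other text. To check that a whole string fits the pattern, fullmatch is one option. The following is a minimal sketch; the helper name is_zipcode_email is hypothetical and not part of the original example.

```python
import re

# Hypothetical helper for illustration only.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9]+@zipcodewilmington\.com")

def is_zipcode_email(address):
    """Return True only if the entire string fits the pattern."""
    return EMAIL_PATTERN.fullmatch(address) is not None

print(is_zipcode_email("roberto@zipcodewilmington.com"))           # True
print(is_zipcode_email("billgates@microsoft.com"))                 # False
print(is_zipcode_email("mail roberto@zipcodewilmington.com now"))  # False
```

The sketch uses + rather than * so that an empty user name is rejected; that is a design choice for this illustration.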
-
Special Sequences
Sequence | Description | Equivalent class |
---|---|---|
\d | Any decimal digit | [0-9] |
\D | Any non-digit character | [^0-9] |
\s | Any whitespace character | [ \t\n\r\f\v] |
\S | Any non-whitespace character | [^ \t\n\r\f\v] |
\w | Any alphanumeric character or underscore | [a-zA-Z0-9_] |
\W | Any character that is not alphanumeric or an underscore | [^a-zA-Z0-9_] |
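The whitespace sequences can be tried the same way as the others. The sketch below also illustrates one caveat about the table: for str patterns Python 3 matches Unicode by default, so the "equivalent class" column holds exactly only when the re.ASCII flag is set. The sample strings are made up for illustration.

```python
import re

sample = "a b\tc\n"
whitespace = re.findall(r"\s", sample)
non_whitespace = re.findall(r"\S", sample)
print(whitespace)      # [' ', '\t', '\n']
print(non_whitespace)  # ['a', 'b', 'c']

# \d accepts digits from other scripts unless re.ASCII is given.
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
unicode_digits = re.findall(r"\d", "7" + arabic_three)
ascii_digits = re.findall(r"\d", "7" + arabic_three, re.ASCII)
print(len(unicode_digits))  # 2
print(ascii_digits)         # ['7']
```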
-
\d
Matches any decimal digit
p = re.compile(r"\d")
m = p.findall("1 over the 8.")
m
# ['1', '8']
-
\D
Any non-digit character
p = re.compile(r"\D")
m = p.findall("catch 22")
m
# ['c', 'a', 't', 'c', 'h', ' ']
-
\w
Matches any alphanumeric character or underscore
p = re.compile(r"\w")
m = p.findall("1a!2b@3c#4d$5e%6f^7g&8h*9i(0j)")
m
# ['1', 'a', '2', 'b', '3', 'c', '4', 'd', '5', 'e', '6', 'f', '7', 'g', '8', 'h', '9', 'i', '0', 'j']
-
\W
Matches any character that is not alphanumeric or an underscore.
p = re.compile(r"\W")
m = p.findall("1a!2b@3c#4d$5e%6f^7g&8h*9i(0j)")
m
# ['!', '@', '#', '$', '%', '^', '&', '*', '(', ')']
-
Metacharacters
Part 1
- .
- []
- \
-
. - Dot
Matches any character except a newline.
p = re.compile(r".uck")
m1 = p.match("duck")
m1
# <re.Match object; span=(0, 4), match='duck'>
m2 = p.match("muck")
m2
# <re.Match object; span=(0, 4), match='muck'>
m3 = p.match("buck")
m3
# <re.Match object; span=(0, 4), match='buck'>
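The "except a newline" part can be seen directly. As a small sketch, the re.DOTALL flag (not used elsewhere in these slides) changes the dot so that it matches newlines too.

```python
import re

text = "a\nb"
default_dot = re.findall(r".", text)
dotall_dot = re.findall(r".", text, re.DOTALL)
print(default_dot)  # ['a', 'b']
print(dotall_dot)   # ['a', '\n', 'b']
```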
-
[] - Character Class
A character class is a set of characters that you wish to match.
Characters can be listed individually
[abcdefghijklmnopqrstuvwxyz]
Characters can be listed as a range of characters
[a-z] # a range is indicated by using a dash between two characters
-
[] - Character Class
p = re.compile(r"[cd][ao][gt]")
m1 = p.findall("dog")
m1
# ['dog']
m2 = p.findall("cat")
m2
# ['cat']
m3 = p.findall("bat")
m3
# []
-
[] - Character Class
Most metacharacters are not active inside character classes.
’.’ (dot) is usually a metacharacter, but inside a character class it loses its special meaning and matches only a literal dot.
p = re.compile(r"[zc.]") # will match any of the characters 'z', 'c', '.'
m1 = p.findall(".")
m1
# ['.']
m2 = p.findall("2")
m2
# []
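A few characters do keep a special role inside a character class: a leading ^ negates the class, a - between two characters forms a range, and a backslash still escapes. A short sketch with made-up inputs:

```python
import re

negated = re.findall(r"[^abc]", "abcd.")       # ^ first: anything except a, b, c
literal_caret = re.findall(r"[a^]", "a^b")     # ^ not first: a literal caret
char_range = re.findall(r"[a-z]", "a-mz")      # - between characters: a range
literal_dash = re.findall(r"[a\-z]", "a-mz")   # escaped -: a literal dash

print(negated)        # ['d', '.']
print(literal_caret)  # ['a', '^']
print(char_range)     # ['a', 'm', 'z']
print(literal_dash)   # ['a', '-', 'z']
```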
-
\ - Backslash
Signals special sequences
Can be followed by certain characters to signal a special sequence, such as \d for a decimal digit.
p1 = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")
m1 = p1.findall("Zip code's phone number is: 302-256-5203")
m1
# ['302-256-5203']
-
\ - Backslash
Escape metacharacters
Can be used to escape all the metacharacters so you can still match them in patterns.
p1 = re.compile(r"\$\d\d,\d\d\d")
m1 = p1.findall("Average developer base pay is $76,526 dollars per year.")
m1
# ['$76,526']
p2 = re.compile(r"\[Brazil\]")
m2 = p2.findall("The current leading team [Brazil] is likely to win the world cup again.")
m2
# ['[Brazil]']
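When the text to match is only known at run time, escaping each metacharacter by hand is error-prone. The standard library's re.escape does it for you. A minimal sketch, reusing the bracketed team name from above (the variable names are illustrative):

```python
import re

literal = "[Brazil]"
escaped = re.escape(literal)  # backslash-escapes the regex metacharacters
p = re.compile(escaped)
found = p.findall("The current leading team [Brazil] is likely to win.")

print(escaped)  # \[Brazil\]
print(found)    # ['[Brazil]']
```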
-
Metacharacters
Part 2 - Repeating Things
- *
- +
- ?
- {m}
- {m,n}
-
*
The * metacharacter is a greedy quantifier which specifies that the previous character can be matched zero or more times.
p = re.compile(r"\w*@zipcodewilmington\.com")
m1 = p.findall("kris@zipcodewilmington.com")
m1
# ['kris@zipcodewilmington.com']
m2 = p.findall("chris@zcw.com")
m2
# []
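"Greedy" means the quantifier consumes as much text as it can while still letting the overall pattern match. The sketch below contrasts it with the non-greedy form *?, which these slides do not otherwise cover; the HTML-like sample string is made up.

```python
import re

html = "<b>bold</b>"
greedy = re.findall(r"<.*>", html)   # .* runs to the last '>'
lazy = re.findall(r"<.*?>", html)    # .*? stops at the first '>'

print(greedy)  # ['<b>bold</b>']
print(lazy)    # ['<b>', '</b>']
```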
-
+
The + metacharacter is a greedy quantifier which specifies that the previous character must be matched one or more times.
p = re.compile(r"\d+\s.*")
m1 = p.findall("123 Sesame Street")
m1
# ['123 Sesame Street']
m2 = p.findall("Fraggle Rock")
m2
# []
-
?
The ? metacharacter specifies that the previous character is optional: it matches either once or zero times.
p = re.compile(r"P\.?O\.? Box \d+")
m1 = p.findall("P.O. Box 55 PO Box 679")
m1
# ['P.O. Box 55', 'PO Box 679']
m2 = p.findall("P..O. Box 55")
m2
# []
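Running the three quantifiers against the same input makes the difference between zero-or-more, one-or-more, and zero-or-one easier to see. A small sketch with a made-up sample string:

```python
import re

sample = "a ab abbb"
star = re.findall(r"ab*", sample)      # 'a' followed by zero or more 'b'
plus = re.findall(r"ab+", sample)      # 'a' followed by one or more 'b'
question = re.findall(r"ab?", sample)  # 'a' followed by at most one 'b'

print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['ab', 'abbb']
print(question)  # ['a', 'ab', 'ab']
```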
-
{m}
Specifies that exactly m copies of the previous RE should be matched.
# Simplistic MM/DD/YYYY or MM-DD-YYYY
# Does not account for days greater than 31, months greater than 12, a minimum value for year, etc.
p = re.compile(r"\d{2}[/-]\d{2}[/-]\d{4}")
m = p.findall("02-29-2020 01/01/1970 12312021")
m
# ['02-29-2020', '01/01/1970']
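The pattern only checks the shape of a date. One way to cover the gaps listed in the comment is to let the regex check the shape and hand the value to datetime.strptime for the calendar rules. This is a sketch under that assumption; the helper name is_real_date is hypothetical.

```python
import re
from datetime import datetime

DATE_SHAPE = re.compile(r"\d{2}[/-]\d{2}[/-]\d{4}")

def is_real_date(text):
    """Check the MM/DD/YYYY or MM-DD-YYYY shape, then the calendar."""
    if not DATE_SHAPE.fullmatch(text):
        return False
    # Pick the format from the separator; mixed separators fail to parse.
    fmt = "%m/%d/%Y" if "/" in text else "%m-%d-%Y"
    try:
        datetime.strptime(text, fmt)
    except ValueError:
        return False
    return True

print(is_real_date("02-29-2020"))  # True  (2020 is a leap year)
print(is_real_date("02-30-2020"))  # False (no February 30)
print(is_real_date("13/01/1970"))  # False (no month 13)
print(is_real_date("12312021"))    # False (wrong shape)
```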
-
{m,n}
This quantifier means that there must be at least m repetitions and at most n repetitions. Write it without a space after the comma, as in the pattern below.
users = "jack77,ryu,ken*$,doesntknowhentostop"
p = re.compile(r"\w{5,13},")
m = p.findall(users)
m
# ['jack77,']
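One caveat: findall looks for the pattern anywhere in the string, so a name longer than 13 characters can still produce a match on its last 13 characters when a comma follows it. The sketch below shows that with a made-up user list, and one possible alternative that splits on commas and checks each name with fullmatch.

```python
import re

users = "jack77,ryu,ken*$,averyveryverylongusername,"

# The over-long name still yields a match on its tail.
loose = re.findall(r"\w{5,13},", users)
print(loose)  # ['jack77,', 'ylongusername,']

# Checking each name as a whole enforces the length limits.
name_pattern = re.compile(r"\w{5,13}")
valid = [name for name in users.split(",") if name_pattern.fullmatch(name)]
print(valid)  # ['jack77']
```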
-
Metacharacters
Part 3 - Anchors, Logical or, & Groups
- ^
- $
- ()
- |
-
^
Matches the start of the string.
valid_entry = "John, Smith, 29, New York City, NY"
invalid_entry = "1bob, Taylor, 43, Atlantic City, NJ"
p = re.compile(r"^[A-Z][a-z]{1,15},\s?[A-Z][a-z]{1,15},\s?\d{1,3},[\w\s]*,\s?[A-Z]{2}")
m1 = p.findall(valid_entry)
m1
# ['John, Smith, 29, New York City, NY']
m2 = p.findall(invalid_entry)
m2
# []
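By default ^ anchors only at the very start of the string. As a brief sketch, the re.MULTILINE flag (an addition not covered in these slides) makes it anchor at the start of every line as well. The two-line sample is made up.

```python
import re

text = "John\nSmith"
start_only = re.findall(r"^[A-Z][a-z]+", text)
every_line = re.findall(r"^[A-Z][a-z]+", text, re.MULTILINE)

print(start_only)  # ['John']
print(every_line)  # ['John', 'Smith']
```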
-
$
Matches the end of the string or just before the newline at the end of the string.
valid_entry = "John, Smith, 29, New York City, NY"
invalid_entry = "John, Smith, 29, New York City, NY,"
p = re.compile(r"^[A-Z][a-z]{1,15},\s?[A-Z][a-z]{1,15},\s?\d{1,3},[\w\s]*,\s?[A-Z]{2}$")
m1 = p.findall(valid_entry)
m1
# ['John, Smith, 29, New York City, NY']
m2 = p.findall(invalid_entry)
m2
# []
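The "just before the newline" clause matters when reading lines from a file, since they usually end in "\n". The sketch below compares $ with \Z, which matches only at the very end of the string; the sample strings are illustrative.

```python
import re

with_newline = "Albany, NY\n"
dollar = re.findall(r"NY$", with_newline)     # $ tolerates one trailing newline
strict_end = re.findall(r"NY\Z", with_newline)  # \Z does not
trailing_comma = re.findall(r"NY$", "Albany, NY,")

print(dollar)          # ['NY']
print(strict_end)      # []
print(trailing_comma)  # []
```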
-
()
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group.
transaction = "03/24/2020 time: 135500 userid: drseuss amount: US$1500"
p = re.compile(r"^(\d{2}/\d{2}/\d{4}\s)(time:\s\d{6}\s)(userid:\s\w{1,10}\s)(amount:\sUS\$\d{1,9})$")
for match in p.finditer(transaction):
    print(f"first group \t ----> {match.group(1)}")
    print(f"second group \t ----> {match.group(2)}")
    print(f"third group \t ----> {match.group(3)}")
    print(f"fourth group \t ----> {match.group(4)}")
# first group ----> 03/24/2020
# second group ----> time: 135500
# third group ----> userid: drseuss
# fourth group ----> amount: US$1500
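Numbered groups get hard to track as a pattern grows. Python also supports named groups with the (?P<name>...) syntax, which the slides do not cover. The sketch below is one possible rewrite of the transaction pattern; the group names are chosen for illustration, and the labels are moved outside the groups so that only the values are captured.

```python
import re

transaction = "03/24/2020 time: 135500 userid: drseuss amount: US$1500"
p = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s"
    r"time:\s(?P<time>\d{6})\s"
    r"userid:\s(?P<userid>\w{1,10})\s"
    r"amount:\sUS\$(?P<amount>\d{1,9})$"
)

m = p.match(transaction)
fields = m.groupdict() if m else {}  # empty dict when nothing matches
print(fields)
# {'date': '03/24/2020', 'time': '135500', 'userid': 'drseuss', 'amount': '1500'}
```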
-
| - Alternation / “or” operator
Creates a regular expression that will match either A or B.
p = re.compile(r"DATA-A\d{4}|DATA-B\w{2}\d{2}")
data1 = "DATA-A1234"
data2 = "DATA-BCD56"
p.findall(data1)
# ['DATA-A1234']
p.findall(data2)
# ['DATA-BCD56']
p.findall("DATA-C4321")
# []
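Because | has very low precedence, each side of it is a complete alternative, which is why "DATA-" is repeated in the pattern above. As a sketch, a non-capturing group (?:...) can limit the alternation to the part that differs; that syntax is an addition not covered in these slides.

```python
import re

# The shared prefix is written once; only the suffix alternates.
p = re.compile(r"DATA-(?:A\d{4}|B\w{2}\d{2})")
found = p.findall("DATA-A1234 DATA-BCD56 DATA-C4321")
print(found)  # ['DATA-A1234', 'DATA-BCD56']
```

A non-capturing group is used here because findall returns only the captured text when a pattern contains capturing groups.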