# Regular Expressions
-
What we’ll cover
- Regex basics
- Pattern Matching
- Using Regex in Java
-
Resources
- Regexr
- Regular-expressions.info
- Regex Crossword - Regex-based challenges
- Duck Duck Go Regex Cheat Sheet
- Pattern class documentation - includes extensive regex explanations.
-
Regex basics
-
Basic symbols
- `a`, `b`, `c` - match “a”, “b”, “c” respectively
  - Works with all numbers and letters
  - Case sensitive e.g. `a` does not match “A” and `B` does not match “b”
- `ab` matches “ab”
  - Works with any string
  - Invisible characters like spaces are also matched
  - `a b` does not match “ab”
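
The literal-matching rules above can be checked with `String.matches`, which succeeds only when the whole input matches the pattern. This is a minimal sketch; the class name `LiteralMatchDemo` and the helper `fullMatch` are made up for illustration.

```java
final class LiteralMatchDemo {
    // String.matches succeeds only if the ENTIRE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("a", "a"));     // true
        System.out.println(fullMatch("a", "A"));     // false: matching is case sensitive
        System.out.println(fullMatch("ab", "ab"));   // true
        System.out.println(fullMatch("a b", "ab"));  // false: the space in the pattern must be matched
        System.out.println(fullMatch("a b", "a b")); // true
    }
}
```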
-
Dot matching
- `.` - matches any one character
  - `.` matches “a” and “A” and “b” and “c” etc.
  - `a.c` matches “abc” and “acc” and “a c” etc.
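
A short sketch of dot matching in Java. The class name `DotDemo` is hypothetical. One Java detail not stated on the slide: by default `.` does not match line terminators such as `\n` unless the pattern is compiled with `Pattern.DOTALL`.

```java
final class DotDemo {
    // Whole-string match, so the dot must account for exactly one character.
    static boolean fullMatch(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("a.c", "abc"));  // true
        System.out.println(fullMatch("a.c", "a c"));  // true: a space is a character too
        System.out.println(fullMatch("a.c", "ac"));   // false: the dot needs one character
        System.out.println(fullMatch("a.c", "abbc")); // false: the dot matches only one character
        System.out.println(fullMatch("a.c", "a\nc")); // false: no line terminators by default
    }
}
```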
-
Repeated matches
- `+` - match 1 or more occurrences of the last symbol
  - `a+` matches “a” and “aa” and “aaa” etc.
  - `ab+` matches “ab” and “abb” but not “a” or “b”
- `*` - match 0 or more occurrences of the last symbol
  - `ab*` matches “ab” and “abb” and “abbb” etc.
  - `ab*` will also match “a”
-
Repeated matches (continued)
- `{n}`, `{n,m}` - repeat the previous match n times, or n to m times
  - `a{1}` matches “a”
  - `a{2}` does not match “a” but does match “aa”
  - `a{1,3}` matches all of “a” and “aa” and “aaa”
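
The quantifiers `+`, `*`, `{n}`, and `{n,m}` can be tried out with whole-string matches. This is an illustrative sketch; `QuantifierDemo` and `fullMatch` are names invented for this example.

```java
final class QuantifierDemo {
    static boolean fullMatch(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("a+", "aaa"));      // true
        System.out.println(fullMatch("a+", ""));         // false: + needs at least one "a"
        System.out.println(fullMatch("ab+", "a"));       // false: + applies to "b" only
        System.out.println(fullMatch("ab*", "a"));       // true: * allows zero "b"s
        System.out.println(fullMatch("ab*", "abbb"));    // true
        System.out.println(fullMatch("a{2}", "a"));      // false
        System.out.println(fullMatch("a{2}", "aa"));     // true
        System.out.println(fullMatch("a{1,3}", "aaa"));  // true
        System.out.println(fullMatch("a{1,3}", "aaaa")); // false: more than 3 repeats
    }
}
```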
-
Advanced Matching
- `()` - group characters together
  - `(abc){3}` matches “abcabcabc” and does not match “abccc”
- `|` - alternation, matches either the pattern on the left or the right
  - `a|b` will match both “a” and “b” but not “ab”
  - note: `a|b` will still find both “a” and “b” in “ab” but it will be as two matches and not one
- `?` - match 0 or 1 occurrences of the last symbol
  - `ab?` will match “a” and “ab”
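
The difference between "matches the whole string" and "finds matches inside the string" is easiest to see with alternation. The sketch below uses `Matcher.find` to count matches; `GroupAlternationDemo`, `fullMatch`, and `countMatches` are hypothetical names for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupAlternationDemo {
    static boolean fullMatch(String regex, String input) {
        return input.matches(regex);
    }

    // Counts the non-overlapping matches of regex found anywhere in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("(abc){3}", "abcabcabc")); // true
        System.out.println(fullMatch("(abc){3}", "abccc"));     // false: the whole group repeats
        System.out.println(fullMatch("a|b", "ab"));             // false: not a single match
        System.out.println(countMatches("a|b", "ab"));          // 2: "a" and "b" separately
        System.out.println(fullMatch("ab?", "a"));              // true
        System.out.println(fullMatch("ab?", "abb"));            // false: at most one "b"
    }
}
```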
-
Boundaries
- `^` - Beginning of line
  - `^a` will find the “a” in “ab” but not in “ba”
- `$` - End of line
  - `a$` will find the “a” in “ba” but not in “ab”
- `\b` - Word boundary
  - `\bman` will match the “man” in “The man” but not in “The human”
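
Boundaries are about where a match may sit inside a larger input, so this sketch searches with `Matcher.find` rather than `String.matches`. The names `BoundaryDemo` and `found` are invented here. Note that Java, by default, treats `^` and `$` as the start and end of the whole input; compiling with `Pattern.MULTILINE` makes them apply per line.

```java
import java.util.regex.Pattern;

final class BoundaryDemo {
    // True if regex matches somewhere inside input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^a", "ab"));            // true
        System.out.println(found("^a", "ba"));            // false
        System.out.println(found("a$", "ba"));            // true
        System.out.println(found("a$", "ab"));            // false
        System.out.println(found("\\bman", "The man"));   // true
        System.out.println(found("\\bman", "The human")); // false: "man" starts mid-word
    }
}
```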
-
Character classes
- Match any single character in the character class
- Groups of characters within `[]` brackets
  - `[abc]` matches “a”, “b” or “c”
- A hyphen denotes a range (eg: `[a-c]` == `[abc]`)
- Negate a character class with `^`
  - `[^13579]` matches any character other than an odd digit
- Note: Many special characters behave differently inside of character classes than they do outside of them.
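
A brief sketch of character classes, including one case of a special character changing meaning inside brackets: `[.]` matches only a literal dot. `CharClassDemo` and `fullMatch` are hypothetical names.

```java
final class CharClassDemo {
    static boolean fullMatch(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("[abc]", "b"));    // true
        System.out.println(fullMatch("[abc]", "d"));    // false
        System.out.println(fullMatch("[a-c]", "b"));    // true: same class written as a range
        System.out.println(fullMatch("[^13579]", "2")); // true
        System.out.println(fullMatch("[^13579]", "x")); // true: any character that is not an odd digit
        System.out.println(fullMatch("[^13579]", "3")); // false
        System.out.println(fullMatch("[.]", "."));      // true
        System.out.println(fullMatch("[.]", "a"));      // false: inside [] the dot is literal
    }
}
```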
-
Predefined character classes
Shortcuts for commonly used character classes
- `\s` - any whitespace character (`\\s` in Java Strings)
- `\S` - any non-whitespace character aka `[^\s]` (`\\S` in Java Strings)
- `\d` - any numeric digit aka `[0-9]` (`\\d` in Java)
- `\D` - any non-digit aka `[^0-9]` (`\\D` in Java)
- `\w` - a word character aka `[a-zA-Z_0-9]`
- `\W` - a non-word character aka `[^\w]`
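
The predefined classes can be exercised the same way; note the doubled backslashes required in Java source. This is only a sketch, and `PredefinedClassDemo` is a made-up class name.

```java
final class PredefinedClassDemo {
    static boolean fullMatch(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("\\d+", "2024"));        // true
        System.out.println(fullMatch("\\D+", "a1c"));         // false: contains a digit
        System.out.println(fullMatch("\\s+", " \t\n"));       // true: space, tab, newline
        System.out.println(fullMatch("\\S+", "a b"));         // false: contains whitespace
        System.out.println(fullMatch("\\w+", "snake_case1")); // true: letters, digits, underscore
        System.out.println(fullMatch("\\w+", "kebab-case"));  // false: "-" is not a word character
        System.out.println(fullMatch("\\W", "!"));            // true
    }
}
```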
-
Using Regex in Java
-
Quirks
- Regex patterns are written as Java Strings, so backslashes and escape sequences are first interpreted as what they represent in a String, before the regex engine sees the pattern
  - `\n` and `\t` (and some others) become the character they represent; `\\` becomes `\` and `\\\\` becomes `\\`
- As a result, Java regex escape sequences (such as `\d`, `\.`, `\\`) must be double-escaped (eg: `\\d`, `\\.`, `\\\\`)
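
The double-escaping rule can be seen by comparing an escaped and an unescaped dot, and by matching a single literal backslash. A minimal sketch follows; `EscapeDemo` and `fullMatch` are names assumed for illustration.

```java
final class EscapeDemo {
    static boolean fullMatch(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // Java source "\\d\\.\\d+" is the regex \d\.\d+ (digit, literal dot, digits).
        System.out.println(fullMatch("\\d\\.\\d+", "3.14")); // true
        System.out.println(fullMatch("\\d\\.\\d+", "3x14")); // false: \. matches only a dot
        // Without the escape, the dot matches any character.
        System.out.println(fullMatch("\\d.\\d+", "3x14"));   // true
        // Java source "\\\\" is the regex \\ which matches ONE backslash;
        // the input "\\" is a String holding one backslash character.
        System.out.println(fullMatch("\\\\", "\\"));         // true
        // "\n" (a real newline character) and "\\n" (the regex escape) both match a newline.
        System.out.println(fullMatch("a\nb", "a\nb"));       // true
        System.out.println(fullMatch("a\\nb", "a\nb"));      // true
    }
}
```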
-
Methods and Classes
- Used in String methods like `replaceAll`, `replaceFirst`, `split`, and `matches`
- `Pattern` objects represent compiled regular expressions (using `Pattern thePattern = Pattern.compile(regexStr);`)
- `Matcher` objects provide information about matching a regex against an input, created with `Matcher m = thePattern.matcher(inputStr);`
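
The String methods and the `Pattern`/`Matcher` pair listed above might be used together as follows. This is a sketch under assumptions: the class `RegexApiDemo` and its three helper methods are hypothetical, and only standard `java.lang.String` and `java.util.regex` calls are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexApiDemo {
    // replaceAll: replace every run of whitespace with one space.
    static String collapseWhitespace(String s) {
        return s.replaceAll("\\s+", " ");
    }

    // split: break on commas, ignoring whitespace around them.
    static String[] splitOnCommas(String s) {
        return s.split("\\s*,\\s*");
    }

    // Pattern + Matcher: collect every run of digits in the input.
    static List<String> findAllNumbers(String inputStr) {
        Pattern thePattern = Pattern.compile("\\d+");
        Matcher m = thePattern.matcher(inputStr);
        List<String> numbers = new ArrayList<>();
        while (m.find()) {
            numbers.add(m.group()); // text of the current match
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(collapseWhitespace("a   b\t\tc"));               // a b c
        System.out.println(String.join("|", splitOnCommas("a , b,c")));     // a|b|c
        System.out.println(findAllNumbers("10 apples, 25 pears"));          // [10, 25]
        System.out.println("10 apples, 25 pears".replaceFirst("\\d+", "#")); // # apples, 25 pears
    }
}
```

Compiling a `Pattern` once and reusing it is generally preferable when the same regex is applied to many inputs, since the String convenience methods take the regex as a String on every call.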
-
Examples
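// True only when str is exactly "na" repeated 16 times
public boolean batman(String str){
    return str.matches("(na){16}");
}
// True when str is exactly one of the three names
public boolean manOfSteel(String str){
    return str.matches("(Clark Kent)|(Superman)|(Kal-El)");
}
// True for "wibbly" or "wobbly"
public boolean time(String str){
    return str.matches("w[io]bbly");
}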
-
Backtracking
- The quantifiers covered so far are “greedy”: they match as many characters as possible while still allowing the overall pattern to match
- Greedy quantifiers start at their longest and backtrack one match at a time until the whole pattern matches.
- Beware of catastrophic backtracking
  - `.*` is very, very suspicious.
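
Greedy behaviour is visible when a pattern could stop early but does not. In this sketch, `GreedyDemo` and `firstMatch` are invented names, and the negated character class `[^>]*` is shown as one common way to avoid an over-reaching `.*`, not as something taken from the slides.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyDemo {
    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // .* first grabs everything, then backtracks only as far as the LAST ">".
        System.out.println(firstMatch("<.*>", "<b>bold</b>"));    // <b>bold</b>
        // A negated class cannot run past the first ">", so no over-matching.
        System.out.println(firstMatch("<[^>]*>", "<b>bold</b>")); // <b>
        // .* backs off one character at a time until "b" can follow it.
        System.out.println(firstMatch(".*b", "abcabc"));          // abcab
    }
}
```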
-
Lazy quantifiers
AKA Reluctant quantifiers
- Consume the smallest possible sequence that still achieves a match.
- Same syntax as greedy quantifiers, plus a `?` at the end
  - eg: `abc*?`, `[0-9]+?3`
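
Running the same inputs through a greedy quantifier and its lazy counterpart shows the difference. This is an illustrative sketch; `LazyDemo` and `firstMatch` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LazyDemo {
    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("abc*", "abccc"));         // abccc (greedy)
        System.out.println(firstMatch("abc*?", "abccc"));        // ab    (lazy: zero "c"s suffice)
        System.out.println(firstMatch("[0-9]+3", "1234563"));    // 1234563 (greedy: up to the last 3)
        System.out.println(firstMatch("[0-9]+?3", "1234563"));   // 123     (lazy: up to the first 3)
        System.out.println(firstMatch("<.*?>", "<b>bold</b>"));  // <b>
    }
}
```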
-
