Regular Expressions & Grep
-
What we’ll cover
- Regex basics
- Pattern Matching
- Grep
- Using Regex in Grep
-
Grep
grep searches files for a specified text or pattern. ggrep is the GNU version of the command, which we can use in our Mac environment.
We’ll be using ggrep with the flag -P (for Perl regex) to demonstrate pattern matching with Regular Expressions.
To install it on your Mac (Homebrew installs GNU grep under the name ggrep):
brew install grep
-
Regex basics
-
Basic symbols
- `a`, `b`, `c` - match “a”, “b”, “c” respectively
  - Works with all numbers and letters
  - Case sensitive, e.g. `a` does not match “A” and `B` does not match “b”
- `ab` matches “ab”
  - Works with any string
  - Invisible characters like spaces are also matched
  - `a b` does not match “ab”
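The literal-matching rules above can be tried out with a small sketch like the following. It assumes GNU grep is installed as ggrep and falls back to plain grep otherwise (the `GREP` variable and the sample lines are made up for illustration).

```shell
# Assumption: ggrep (Homebrew) if present, otherwise the system grep, which must support -P.
GREP=$(command -v ggrep || command -v grep)

# Three sample lines: "ab", "AB" and "a b".
literal=$(printf 'ab\nAB\na b\n' | "$GREP" -P 'ab')
spaced=$(printf 'ab\nAB\na b\n' | "$GREP" -P 'a b')

echo "$literal"   # ab   (matching is case sensitive, and the space matters)
echo "$spaced"    # a b
```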
-
Dot matching
- `.` - matches any one character
  - `.` matches “a” and “A” and “b” and “c” etc.
  - `a.c` matches “abc” and “acc” and “a c” etc.
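A quick way to check dot matching, sketched under the assumption that ggrep (or a GNU grep with -P support) is on the PATH. The -x flag, which makes grep require the whole line to match, is used here only to keep the results easy to read.

```shell
# Assumption: ggrep (Homebrew) if present, otherwise the system grep, which must support -P.
GREP=$(command -v ggrep || command -v grep)

# -x: the pattern must match the entire line, so "ac" (too short) and "abbc" (too long) are rejected.
dot=$(printf 'abc\nacc\na c\nac\nabbc\n' | "$GREP" -xP 'a.c')

echo "$dot"
# abc
# acc
# a c
```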
-
Repeated matches
- `+` - match 1 or more occurrences of the last symbol
  - `a+` matches “a” and “aa” and “aaa” etc.
  - `ab+` matches “ab” and “abb” but not “a” or “b”
- `*` - match 0 or more occurrences of the last symbol
  - `ab*` matches “ab” and “abb” and “abbb” etc.
  - `ab*` will also match “a”
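The difference between `+` and `*` shows up when the same sample lines are run through both patterns. This is a sketch that assumes ggrep or another grep with -P support, and the input lines are invented for the demonstration.

```shell
# Assumption: ggrep (Homebrew) if present, otherwise the system grep, which must support -P.
GREP=$(command -v ggrep || command -v grep)

# -x requires the whole line to match the pattern.
plus=$(printf 'a\nab\nabb\nb\n' | "$GREP" -xP 'ab+')
star=$(printf 'a\nab\nabb\nb\n' | "$GREP" -xP 'ab*')

echo "$plus"   # ab, abb      ("a" alone is rejected: + needs at least one "b")
echo "$star"   # a, ab, abb   (* also accepts zero "b"s)
```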
-
Repeated matches (continued)
- `{n}`, `{n,m}` - repeat the previous match n times, or n to m times
  - `a{1}` matches “a”
  - `a{2}` does not match “a” but does match “aa”
  - `a{1,3}` matches all of “a” and “aa” and “aaa”
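Counted repetition can be sketched the same way, assuming ggrep or a grep with -P support. The -x flag is used so that a longer run of “a” is not reported just because it contains a shorter one.

```shell
# Assumption: ggrep (Homebrew) if present, otherwise the system grep, which must support -P.
GREP=$(command -v ggrep || command -v grep)

# -x requires the whole line to match the pattern.
exact=$(printf 'a\naa\naaa\naaaa\n' | "$GREP" -xP 'a{2}')
range=$(printf 'a\naa\naaa\naaaa\n' | "$GREP" -xP 'a{1,3}')

echo "$exact"   # aa
echo "$range"   # a, aa, aaa
```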
-
Advanced Matching
- `()` - group characters together
  - `(abc){3}` matches “abcabcabc” and does not match “abccc”
- `|` - alternation, matches either the pattern on the left or the right
  - `a|b` will match both “a” and “b” but not “ab”
  - Note: `a|b` will still find both “a” and “b” in “ab”, but as two separate matches rather than one
- `?` - match 0 or 1 occurrences of the last symbol
  - `ab?` will match “a” and “ab”
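Grouping, alternation and the optional quantifier can be checked with a sketch like this one, assuming ggrep or a grep with -P support. The -o flag prints every match on its own line, which makes the "two matches, not one" behaviour of `a|b` visible.

```shell
# Assumption: ggrep (Homebrew) if present, otherwise the system grep, which must support -P.
GREP=$(command -v ggrep || command -v grep)

# The group is repeated as a unit, so "abccc" is rejected.
group=$(printf 'abcabcabc\nabccc\n' | "$GREP" -xP '(abc){3}')

# -o prints each match separately: "ab" yields one match for "a" and one for "b".
alt=$(printf 'ab\n' | "$GREP" -oP 'a|b')

# The "b" is optional, but at most one is allowed.
opt=$(printf 'a\nab\nabb\n' | "$GREP" -xP 'ab?')

echo "$group"   # abcabcabc
echo "$alt"     # a, b (two lines)
echo "$opt"     # a, ab
```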
-
Boundaries
- `^` - Beginning of line
  - `^a` will find the “a” in “ab” but not in “ba”
- `$` - End of line
  - `a$` will find the “a” in “ba” but not in “ab”
- `\b` - Word boundary
  - `\bman` will match the “man” in “The man” but not in “The human”
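The boundary examples above translate directly into commands. This sketch assumes ggrep or a grep with -P support. The patterns are single-quoted so the shell passes `$` and `\b` to grep untouched.

```shell
# Assumption: ggrep (Homebrew) if present, otherwise the system grep, which must support -P.
GREP=$(command -v ggrep || command -v grep)

start=$(printf 'ab\nba\n' | "$GREP" -P '^a')
end=$(printf 'ab\nba\n' | "$GREP" -P 'a$')
word=$(printf 'The man\nThe human\n' | "$GREP" -P '\bman')

echo "$start"   # ab
echo "$end"     # ba
echo "$word"    # The man
```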
-
Character classes
- Match any single character in the character class
- Groups of characters within `[]` brackets
  - `[abc]` matches “a”, “b” or “c”
- A hyphen denotes a range, e.g. `[a-c]` is equivalent to `[abc]`
- Negate a character class with `^`
  - `[^13579]` matches any character other than an odd digit
- Note: Many special characters behave differently inside of character classes than they do outside of them.
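A short sketch of ranges, negation and the "special characters behave differently" note, assuming ggrep or a grep with -P support. The last pattern relies on the dot being a literal dot inside brackets, which is one common instance of that note.

```shell
# Assumption: ggrep (Homebrew) if present, otherwise the system grep, which must support -P.
GREP=$(command -v ggrep || command -v grep)

# -x requires the whole line to match the pattern.
range=$(printf 'a\nb\nc\nd\n' | "$GREP" -xP '[a-c]')
not_odd=$(printf '1\n2\n3\n4\nx\n' | "$GREP" -xP '[^13579]')

# Inside [] the dot is literal, so "abc" is rejected here.
literal_dot=$(printf 'a.c\nabc\n' | "$GREP" -xP 'a[.]c')

echo "$range"         # a, b, c
echo "$not_odd"       # 2, 4, x
echo "$literal_dot"   # a.c
```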
-
Predefined character classes
Shortcuts for commonly used character classes
- `\s` - any whitespace character
- `\S` - any non-whitespace character, aka `[^\s]`
- `\d` - any numeric digit, aka `[0-9]`
- `\D` - any non-digit, aka `[^0-9]`
- `\w` - a word character, aka `[a-zA-Z_0-9]`
- `\W` - a non-word character, aka `[^\w]`
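The shortcuts are easiest to see with -o, which prints only the matched text. This is a sketch with invented sample strings, assuming ggrep or a grep with -P support.

```shell
# Assumption: ggrep (Homebrew) if present, otherwise the system grep, which must support -P.
GREP=$(command -v ggrep || command -v grep)

# \d+ picks out each run of digits.
digits=$(printf 'room 101, floor 3\n' | "$GREP" -oP '\d+')

# \w+ picks out each run of word characters; the underscore counts, the hyphen does not.
words=$(printf 'foo_bar baz-42\n' | "$GREP" -oP '\w+')

echo "$digits"   # 101, 3
echo "$words"    # foo_bar, baz, 42
```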
-
Backtracking
- The quantifiers above are “greedy”: they match as many characters as possible while still allowing the whole pattern to match
- Greedy quantifiers start with their longest possible match and backtrack, giving up one repetition at a time, until the whole pattern matches.
- Beware of catastrophic backtracking: nested quantifiers such as `(a+)+` can take exponentially long on input that almost matches
- `.*` is very, very suspicious.
-
Lazy quantifiers
AKA Reluctant quantifiers
- Consume the smallest possible sequence to achieve a match.
- Same syntax as greedy quantifiers, plus a `?` at the end
- e.g. `abc*?`, `[0-9]+?3`
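Greedy and lazy quantifiers can be compared side by side on the same input. The sketch below assumes ggrep or a grep with -P support (lazy quantifiers need Perl-style regex), and the sample strings are made up.

```shell
# Assumption: ggrep (Homebrew) if present, otherwise the system grep, which must support -P.
GREP=$(command -v ggrep || command -v grep)

text='<b>one</b> and <b>two</b>'

# Greedy: .* runs to the end of the line, then backtracks to the last "</b>".
greedy=$(printf '%s\n' "$text" | "$GREP" -oP '<b>.*</b>')

# Lazy: .*? stops at the first "</b>" it can, so each tag pair is its own match.
lazy=$(printf '%s\n' "$text" | "$GREP" -oP '<b>.*?</b>')

# The same contrast with the pattern from the slide.
greedy_digits=$(printf '13233\n' | "$GREP" -oP '[0-9]+3')
lazy_digits=$(printf '13233\n' | "$GREP" -oP '[0-9]+?3')

echo "$greedy"          # <b>one</b> and <b>two</b>
echo "$lazy"            # <b>one</b>, <b>two</b> (two lines)
echo "$greedy_digits"   # 13233
echo "$lazy_digits"     # 13, 23 (two lines)
```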
-
Flags
- `-r` - recursively search through a directory.
- `-c` - print the number of matching lines for each file (the file name is shown when more than one file is searched).
- `-n` (attached to our `-P` flag to become `-nP`) - display the line numbers where the matches occurred.
- `-i` (attached to our `-P` flag to become `-iP`) - ignore case.
- `-v` (attached to our `-P` flag to become `-vP`) - inverse match, as in return the lines that DO NOT match the pattern.
- `-o` - output only the matched text, each match on its own line.
- `-l` - print the names of files that have matches.
Remember, flags can be combined.
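Combined flags can be tried on a throwaway directory. In this sketch the file names and contents are hypothetical, and it assumes ggrep or a grep with -P support as well as the standard mktemp command.

```shell
# Assumption: ggrep (Homebrew) if present, otherwise the system grep, which must support -P.
GREP=$(command -v ggrep || command -v grep)

# Hypothetical sample files in a temporary directory.
dir=$(mktemp -d)
printf 'Apple pie\nbanana\napple tart\n' > "$dir/a.txt"
printf 'cherry\n' > "$dir/b.txt"

count=$("$GREP" -ciP 'apple' "$dir/a.txt")      # -c and -i: count lines, ignoring case
numbered=$("$GREP" -niP 'apple' "$dir/a.txt")   # -n and -i: prefix each hit with its line number
inverse=$("$GREP" -viP 'apple' "$dir/a.txt")    # -v and -i: lines that do NOT match
files=$("$GREP" -rlP 'cherry' "$dir")           # -r and -l: names of files with a match

rm -r "$dir"

echo "$count"      # 2
echo "$numbered"   # 1:Apple pie, 3:apple tart (two lines)
echo "$inverse"    # banana
echo "$files"      # path of b.txt inside the temporary directory
```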
-
Commands worth knowing: wc
The wc (word count) command reports the number of newlines, words, bytes and characters in the files given as arguments.
syntax:
# wc [options] filenames
The following options can be used to specify its actions:
- `wc -l` - number of lines in a file.
- `wc -w` - number of words in a file.
- `wc -c` - count of bytes in a file.
- `wc -m` - count of characters in a file.
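The four options can be compared on a small sample file. This is a sketch with made-up content. Input is read from stdin so wc prints only the number, and `tr -d ' '` is there because some wc implementations pad the count with spaces.

```shell
# Hypothetical two-line sample file.
sample=$(mktemp)
printf 'one two\nthree\n' > "$sample"

lines=$(wc -l < "$sample" | tr -d ' ')
words=$(wc -w < "$sample" | tr -d ' ')
bytes=$(wc -c < "$sample" | tr -d ' ')
chars=$(wc -m < "$sample" | tr -d ' ')

rm "$sample"

echo "$lines"   # 2
echo "$words"   # 3
echo "$bytes"   # 14
echo "$chars"   # 14 (same as bytes here because the text is plain ASCII)
```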
-
Other commands worth knowing
The results of searches can be sent to another command by using the pipe | symbol. Here, we are sending the results of one ggrep search to another ggrep execution:
ggrep "^foo.#bar$" file.txt | ggrep -v "baz"
(the same search as the first ggrep, but with the lines containing “baz” filtered out)
The results of searches can also be sent to a file:
ggrep -iP -r "[A]\Sl" poetry/ > search_results.txt
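Piping and redirection can be sketched end to end on a temporary file. The file name and contents below are hypothetical, and the sketch assumes ggrep or a grep with -P support. It also shows one common combination with wc: counting individual matches rather than matching lines.

```shell
# Assumption: ggrep (Homebrew) if present, otherwise the system grep, which must support -P.
GREP=$(command -v ggrep || command -v grep)

work=$(mktemp -d)
printf 'foo bar\nfoo baz\nqux\n' > "$work/file.txt"

# The first search keeps lines starting with "foo"; the second drops those containing "baz".
filtered=$("$GREP" -P '^foo' "$work/file.txt" | "$GREP" -vP 'baz')

# -o prints one match per line, so wc -l counts matches rather than matching lines.
matches=$("$GREP" -oP 'o' "$work/file.txt" | wc -l | tr -d ' ')

# > writes the search results to a file instead of the terminal.
"$GREP" -P 'qux' "$work/file.txt" > "$work/search_results.txt"
saved=$(cat "$work/search_results.txt")

rm -r "$work"

echo "$filtered"   # foo bar
echo "$matches"    # 4
echo "$saved"      # qux
```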
-
Resources
- Regexr
- Regular-expressions.info
- Regex Crossword - Regex-based challenges
- Grep Manual page (for full flag documentation)
-
