Regular Expressions & Grep
-
What we’ll cover
- Regex basics
- Pattern Matching
- Grep
- Using Regex in Grep
-
Grep
grep is used to search for a specified text or pattern within files. ggrep is the GNU implementation of the command, which we can use in our Mac environment.
We’ll be using ggrep with the -P flag (for Perl-compatible regex) to demonstrate pattern matching with regular expressions.
To install it on your Mac:
brew install grep
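As a quick sanity check after installing, a first search might look like the sketch below. The GREP variable and the sample input are assumptions made for illustration: the variable picks ggrep when it is installed and otherwise falls back to grep, which only works with -P if that grep is GNU grep.

```shell
# Prefer ggrep (Homebrew on macOS); fall back to grep, assumed to be GNU grep.
GREP=$(command -v ggrep || command -v grep)

# Search three made-up lines for the pattern "an"; only "banana" contains it.
result=$(printf 'apple\nbanana\ncherry\n' | "$GREP" -P 'an')
echo "$result"   # banana
```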
-
Regex basics
-
Basic symbols
- a, b, c - match “a”, “b”, “c” respectively
  - Works with all numbers and letters
  - Case sensitive, e.g. a does not match “A” and B does not match “b”
- ab matches “ab”
  - Works with any string
  - Invisible characters like spaces are also matched
  - a b does not match “ab”
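The literal-matching rules can be checked with a small sketch. The input lines are made up, and the GREP variable is an assumed convenience that falls back to grep (assumed to be GNU grep) when ggrep is not installed.

```shell
GREP=$(command -v ggrep || command -v grep)

# -c counts matching lines. "AB" fails on case and "a b" fails on the space,
# so only "ab" and "cab" contain the literal "ab".
count=$(printf 'ab\nAB\na b\ncab\n' | "$GREP" -cP 'ab')
echo "$count"   # 2
```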
-
Dot matching
- . - matches any one character
  - . matches “a” and “A” and “b” and “c” etc.
  - a.c matches “abc” and “acc” and “a c” etc.
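A minimal sketch of dot matching, using made-up input. The GREP variable is an assumption that falls back to grep (assumed to be GNU grep) when ggrep is missing.

```shell
GREP=$(command -v ggrep || command -v grep)

# The dot must consume exactly one character, so "ac" (nothing between)
# and "abbc" (two characters between) do not match.
matches=$(printf 'abc\nacc\na c\nac\nabbc\n' | "$GREP" -P 'a.c')
echo "$matches"   # prints abc, acc and "a c", one per line
```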
-
Repeated matches
- + - match 1 or more occurrences of the last symbol
  - a+ matches “a” and “aa” and “aaa” etc.
  - ab+ matches “ab” and “abb” but not “a” or “b”
- * - match 0 or more occurrences of the last symbol
  - ab* matches “ab” and “abb” and “abbb” etc.
  - ab* will also match “a”
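The difference between + and * shows up most clearly when the whole line has to match. The sketch below uses the -x flag (whole-line match), which is a standard grep flag not covered in these slides. The input and the GREP fallback variable are assumptions for illustration.

```shell
GREP=$(command -v ggrep || command -v grep)

# -x requires the pattern to match the entire line.
plus=$(printf 'a\nab\nabb\nb\n' | "$GREP" -xP 'ab+')
star=$(printf 'a\nab\nabb\nb\n' | "$GREP" -xP 'ab*')

echo "$plus"   # ab and abb: at least one "b" is required
echo "$star"   # a, ab and abb: zero "b"s is also allowed
```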
-
Repeated matches (continued)
- {n}, {n,m} - repeat the previous match n times, or n to m times
  - a{1} matches “a”
  - a{2} does not match “a” but does match “aa”
  - a{1,3} matches all of “a” and “aa” and “aaa”
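Counted repetition can be sketched the same way. The -x flag (whole-line match) is used so that longer runs of “a” are not accepted through a partial match. It is a standard grep flag not covered in these slides, and the input and GREP fallback variable are assumptions.

```shell
GREP=$(command -v ggrep || command -v grep)

exact=$(printf 'a\naa\naaa\naaaa\n' | "$GREP" -xP 'a{2}')
range=$(printf 'a\naa\naaa\naaaa\n' | "$GREP" -xP 'a{1,3}')

echo "$exact"   # aa
echo "$range"   # a, aa and aaa, but not aaaa
```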
-
Advanced Matching
- () - group characters together
  - (abc){3} matches “abcabcabc” and does not match “abccc”
- | - alternation, matches either the pattern on the left or the right
  - a|b will match both “a” and “b” but not “ab”
  - Note: a|b will still find both “a” and “b” in “ab”, but as two matches and not one
- ? - match 0 or 1 occurrences of the last symbol
  - ab? will match “a” and “ab”
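Here is a short sketch of grouping, alternation and the optional quantifier. It relies on -o (print each match on its own line) and -x (whole-line match, a standard flag not covered in these slides). The input and the GREP fallback variable are assumptions.

```shell
GREP=$(command -v ggrep || command -v grep)

# Grouping: the whole group "abc" must repeat three times.
grp=$(printf 'abcabcabc\nabccc\n' | "$GREP" -xP '(abc){3}')
echo "$grp"   # abcabcabc

# Alternation: "ab" yields two separate matches, "a" and then "b".
alt=$(printf 'ab\n' | "$GREP" -oP 'a|b')
echo "$alt"   # a and b on separate lines

# Optional: the "b" may appear zero or one time.
opt=$(printf 'a\nab\nabb\n' | "$GREP" -xP 'ab?')
echo "$opt"   # a and ab, but not abb
```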
-
Boundaries
- ^ - beginning of line
  - ^a will find the “a” in “ab” but not in “ba”
- $ - end of line
  - a$ will find the “a” in “ba” but not in “ab”
- \b - word boundary
  - \bman will match the “man” in “The man” but not in “The human”
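The boundary examples above can be reproduced directly. The patterns are single-quoted so the shell leaves $ and \b untouched. The GREP variable is an assumed fallback to grep (assumed to be GNU grep) when ggrep is not installed.

```shell
GREP=$(command -v ggrep || command -v grep)

start=$(printf 'ab\nba\n' | "$GREP" -P '^a')
end=$(printf 'ab\nba\n' | "$GREP" -P 'a$')
word=$(printf 'The man\nThe human\n' | "$GREP" -P '\bman')

echo "$start"   # ab
echo "$end"     # ba
echo "$word"    # The man
```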
-
Character classes
- Match any single character in the character class
- Groups of characters within [] brackets
  - [abc] matches “a”, “b” or “c”
- A hyphen denotes a range (e.g. [a-c] == [abc])
- Negate a character class with ^
  - [^13579] matches any character other than an odd digit
- Note: Many special characters behave differently inside character classes than they do outside of them.
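A small sketch of ranges and negation, using -o to print each matched character on its own line. The input strings and the GREP fallback variable are assumptions for illustration.

```shell
GREP=$(command -v ggrep || command -v grep)

# [a-c] picks out only the letters a, b and c.
range=$(printf 'a1b2c3d4\n' | "$GREP" -oP '[a-c]')
echo "$range"     # a, b and c on separate lines

# [^13579] matches every character that is not an odd digit.
negated=$(printf '1234\n' | "$GREP" -oP '[^13579]')
echo "$negated"   # 2 and 4 on separate lines
```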
-
Predefined character classes
Shortcuts for commonly used character classes
- \s - any whitespace character
- \S - any non-whitespace character, aka [^\s]
- \d - any numeric digit, aka [0-9]
- \D - any non-digit, aka [^0-9]
- \w - a word character, aka [a-zA-Z_0-9]
- \W - a non-word character, aka [^\w]
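The shortcuts combine naturally with quantifiers. The sketch below pulls runs of digits and runs of word characters out of a made-up line. The GREP variable is an assumed fallback to grep (assumed to be GNU grep) when ggrep is missing.

```shell
GREP=$(command -v ggrep || command -v grep)

digits=$(printf 'call 555-0199 now\n' | "$GREP" -oP '\d+')
words=$(printf 'call 555-0199 now\n' | "$GREP" -oP '\w+')

echo "$digits"   # 555 and 0199 (the hyphen is not a digit)
echo "$words"    # call, 555, 0199 and now
```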
-
Backtracking
- The regex quantifiers covered so far are “greedy”: they match as many characters as possible while still allowing the whole pattern to match.
- Greedy quantifiers start at their longest match and backtrack one step at a time until the whole pattern matches.
- Beware of catastrophic backtracking.
- .* is very, very suspicious.
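Greedy matching with backtracking can be seen in a small, harmless case. In this sketch (made-up input, assumed GREP fallback variable), .* first runs to the end of the line and then gives characters back until the closing > can match.

```shell
GREP=$(command -v ggrep || command -v grep)

# .* initially swallows "b>one</b> tail", then backtracks to the last ">".
greedy=$(printf '<b>one</b> tail\n' | "$GREP" -oP '<.*>')
echo "$greedy"   # <b>one</b>
```

The match spans both tags because the engine settles on the last > it can reach, not the first.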
-
Lazy quantifiers
AKA reluctant quantifiers
- Consume the smallest possible sequence to achieve a match.
- Same syntax as greedy quantifiers, plus a ? at the end, e.g. abc*?, [0-9]+?3
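To see the effect of the trailing ?, the sketch below runs a lazy pattern on made-up input and compares [0-9]+?3 with its greedy counterpart [0-9]+3. The GREP variable is an assumed fallback to grep (assumed to be GNU grep) when ggrep is not installed.

```shell
GREP=$(command -v ggrep || command -v grep)

# Lazy .*? stops at the first ">" it can reach, so each tag is its own match.
lazy_tags=$(printf '<b>one</b> tail\n' | "$GREP" -oP '<.*?>')
echo "$lazy_tags"     # <b> and </b> on separate lines

# Greedy: [0-9]+ takes as many digits as it can before the final 3.
greedy_digits=$(printf '1323\n' | "$GREP" -oP '[0-9]+3')
echo "$greedy_digits" # 1323

# Lazy: [0-9]+? takes as few digits as it can, giving two short matches.
lazy_digits=$(printf '1323\n' | "$GREP" -oP '[0-9]+?3')
echo "$lazy_digits"   # 13 and 23 on separate lines
```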
-
Flags
- -r Recursively search through a directory.
- -c Return the number of matching lines, reported per file when several files are searched.
- -n (attached to our -P flag to become -nP) Display the line numbers where the matches occurred.
- -i (attached to our -P flag to become -iP) Ignore case.
- -v (attached to our -P flag to become -vP) Inverse match, as in return the lines that DO NOT match the pattern.
- -o Output each match on its own line.
- -l Print the names of files that have matches.
Remember, flags can be combined.
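The sketch below combines several of these flags against throwaway files. The file names and contents are made up for illustration, and the GREP variable is an assumed fallback to grep (assumed to be GNU grep) when ggrep is not installed.

```shell
GREP=$(command -v ggrep || command -v grep)

# Build a scratch directory with two small, made-up files.
dir=$(mktemp -d)
printf 'Apple pie\nbanana\napple tart\n' > "$dir/menu.txt"
printf 'cherry\n' > "$dir/other.txt"

numbered=$("$GREP" -inP 'apple' "$dir/menu.txt")  # ignore case + line numbers
inverted=$("$GREP" -viP 'apple' "$dir/menu.txt")  # lines that do NOT match
files=$("$GREP" -rlP 'an' "$dir")                 # recursive, file names only
count=$("$GREP" -cP 'an' "$dir/menu.txt")         # number of matching lines

echo "$numbered"   # 1:Apple pie and 3:apple tart
echo "$inverted"   # banana
echo "$files"      # path of menu.txt inside the scratch directory
echo "$count"      # 1

rm -rf "$dir"
```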
-
Commands worth knowing: wc
The wc (word count) command reports the number of newlines, words, bytes and characters in the files given as arguments.
Syntax:
wc [options] filenames
The following options can be used to specify its actions:
- wc -l - number of lines in a file.
- wc -w - number of words in a file.
- wc -c - count of bytes in a file.
- wc -m - count of characters in a file.
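wc also pairs well with grep output. In the sketch below, piping -o output into wc -l counts individual matches, whereas grep's -c counts matching lines. The input is made up, the GREP variable is an assumed fallback, and tr -d ' ' strips the padding that some wc implementations add.

```shell
GREP=$(command -v ggrep || command -v grep)

lines=$(printf 'one two\nthree\n' | wc -l | tr -d ' ')
words=$(printf 'one two\nthree\n' | wc -w | tr -d ' ')
bytes=$(printf 'one two\nthree\n' | wc -c | tr -d ' ')
echo "$lines $words $bytes"   # 2 3 14

# Three occurrences of "a" spread over two matching lines.
match_count=$(printf 'aXa\nb\na\n' | "$GREP" -oP 'a' | wc -l | tr -d ' ')
line_count=$(printf 'aXa\nb\na\n' | "$GREP" -cP 'a')
echo "$match_count $line_count"   # 3 2
```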
-
Other commands worth knowing
The results of a search can be sent to another command by using the pipe symbol |. Here, we are sending the results of one ggrep search to another ggrep execution:
ggrep "^foo.#bar$" file.txt | ggrep -v "baz"
(the same search as before, but with the lines containing “baz” filtered out)
The results of a search can also be sent to a file:
ggrep -iP -r "[A]\Sl" poetry/ > search_results.txt
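Piping and redirection can be combined in one command. The sketch below uses a hypothetical pattern (^foo.*bar$) and made-up file contents, not the ones from the slides, and an assumed GREP fallback variable.

```shell
GREP=$(command -v ggrep || command -v grep)

# Scratch directory with a small, made-up input file.
dir=$(mktemp -d)
printf 'foo bar\nfoo baz bar\nqux\n' > "$dir/file.txt"

# First search keeps lines from "foo" to "bar"; the second drops any with "baz";
# the redirect writes what is left to a file instead of the terminal.
"$GREP" -P '^foo.*bar$' "$dir/file.txt" | "$GREP" -vP 'baz' > "$dir/search_results.txt"

saved=$(cat "$dir/search_results.txt")
echo "$saved"   # foo bar

rm -rf "$dir"
```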
-
Resources
- Regexr
- Regular-expressions.info
- Regex Crossword - Regex-based challenges
- Grep Manual page (for full flag documentation)