Regular Expressions & Grep

-

What we’ll cover

Regex basics
Pattern Matching
Grep
Using Regex in Grep

-

Grep

grep searches files for a specified text or pattern. ggrep is the GNU version of the command, which we can use in our Mac environment.

We’ll be using ggrep with the flag -P (for Perl regex) to demonstrate our pattern matching with Regular Expressions.

To install it on your Mac (Homebrew installs GNU grep under the name ggrep):

brew install grep

-

Regex basics

Basic symbols

a, b, c - match “a”, “b”, “c” respectively
- Works with all numbers and letters
- Case sensitive e.g. a does not match “A” and B does not match “b”
ab matches “ab”
- Works with any string
- Invisible characters like spaces are also matched
- a b does not match “ab”
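
To try these out, we can pipe a few sample lines into ggrep, which prints every line that contains a match. This is only a sketch: the sample lines are made up, and GREP is a helper variable of ours (not part of ggrep) that falls back to grep, assumed to be GNU grep with -P support, on machines without ggrep.

```shell
# Assumption: use ggrep if it is installed, otherwise fall back to a GNU grep.
GREP=$(command -v ggrep || command -v grep)

# "ab" occurs in "ab" and "cab", but not in "a b" (space) or "AB" (case).
literal=$(printf 'ab\na b\nAB\ncab\n' | "$GREP" -P 'ab')
echo "$literal"

# The space in the pattern is matched literally.
spaced=$(printf 'ab\na b\n' | "$GREP" -P 'a b')
echo "$spaced"
```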

Dot matching

. - matches any one character

. matches “a” and “A” and “b” and “c” etc.
a.c matches “abc” and “acc” and “a c” etc.
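
A small sketch of the dot in action, using made-up sample lines. GREP is a hypothetical helper variable that falls back to grep (assumed to be GNU grep with -P support) where ggrep is not installed.

```shell
# Assumption: use ggrep if it is installed, otherwise fall back to a GNU grep.
GREP=$(command -v ggrep || command -v grep)

# a.c needs exactly one character between "a" and "c",
# so "ac" (none) and "abbc" (two) are not printed.
dot=$(printf 'abc\na c\nac\nabbc\n' | "$GREP" -P 'a.c')
echo "$dot"
```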

Repeated matches

+ - match 1 or more occurrences of the last symbol
- a+ matches “a” and “aa” and “aaa” etc.
- ab+ matches “ab” and “abb” but not “a” or “b”
* - match 0 or more occurrences of the last symbol
- ab* matches “ab” and “abb” and “abbb” etc.
- ab* will also match “a”
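
The difference between + and * is easiest to see side by side. The sketch below runs both patterns over the same made-up lines; GREP is a hypothetical helper variable that falls back to grep (assumed to be GNU grep with -P support) where ggrep is not installed.

```shell
# Assumption: use ggrep if it is installed, otherwise fall back to a GNU grep.
GREP=$(command -v ggrep || command -v grep)

# ab+ needs at least one "b" after the "a".
plus=$(printf 'a\nb\nab\nabb\n' | "$GREP" -P 'ab+')
echo "$plus"

# ab* also accepts zero "b"s, so the line "a" is printed as well.
star=$(printf 'a\nb\nab\nabb\n' | "$GREP" -P 'ab*')
echo "$star"
```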

Repeated matches (continued)

{n}, {n,m} - Repeat previous match n times or n to m times
- a{1} matches “a”
- a{2} does not match “a” but does match “aa”
- a{1,3} matches all of “a” and “aa” and “aaa”
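
Because ggrep prints any line that contains a match, a{2} on its own would also print "aaa". The sketch below therefore wraps the repeated "a" between a "b" and a "c" so the count is visible. The sample lines are made up, and GREP is a hypothetical helper variable that falls back to grep (assumed to be GNU grep with -P support) where ggrep is not installed.

```shell
# Assumption: use ggrep if it is installed, otherwise fall back to a GNU grep.
GREP=$(command -v ggrep || command -v grep)

# Exactly two "a"s between "b" and "c".
exact=$(printf 'bac\nbaac\nbaaac\n' | "$GREP" -P 'ba{2}c')
echo "$exact"

# Between one and three "a"s; zero or four do not match.
range=$(printf 'bc\nbac\nbaaac\nbaaaac\n' | "$GREP" -P 'ba{1,3}c')
echo "$range"
```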

Advanced Matching

() - group characters together
- (abc){3} matches “abcabcabc” and does not match “abccc”
| - alternation, matches either the pattern on the left or the right
- a|b will match both “a” and “b” but not “ab”
- Note: a|b will still find both “a” and “b” in “ab”, but as two separate matches rather than one
? - match 0 or 1 occurrences of the last symbol
- ab? will match “a” and “ab”
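
A sketch of grouping, alternation and ? on made-up input. The -o flag prints each match on its own line, which makes the "two matches, not one" behaviour of a|b visible. GREP is a hypothetical helper variable that falls back to grep (assumed to be GNU grep with -P support) where ggrep is not installed.

```shell
# Assumption: use ggrep if it is installed, otherwise fall back to a GNU grep.
GREP=$(command -v ggrep || command -v grep)

# The group (abc) is repeated as a unit, so "abccc" does not match.
grouped=$(printf 'abcabcabc\nabccc\n' | "$GREP" -P '(abc){3}')
echo "$grouped"

# a|b finds "a" and "b" in "ab" as two separate matches.
alternation=$(printf 'ab\n' | "$GREP" -oP 'a|b')
echo "$alternation"

# b? allows zero or one "b" between "a" and "c".
optional=$(printf 'ac\nabc\nabbc\n' | "$GREP" -P 'ab?c')
echo "$optional"
```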

Boundaries

^ - Beginning of line
- ^a will find the “a” in “ab” but not in “ba”
$ - End of line
- a$ will find the “a” in “ba” but not in “ab”
\b - Word boundary
- \bman will match the “man” in “The man” but not in “The human”
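
The boundary examples above can be checked directly. This is a sketch with the same sample strings; GREP is a hypothetical helper variable that falls back to grep (assumed to be GNU grep with -P support) where ggrep is not installed. Single quotes keep the shell from touching $ and the backslash.

```shell
# Assumption: use ggrep if it is installed, otherwise fall back to a GNU grep.
GREP=$(command -v ggrep || command -v grep)

# ^a only matches an "a" at the beginning of the line.
at_start=$(printf 'ab\nba\n' | "$GREP" -P '^a')
echo "$at_start"

# a$ only matches an "a" at the end of the line.
at_end=$(printf 'ab\nba\n' | "$GREP" -P 'a$')
echo "$at_end"

# \bman needs "man" to begin a word, so "human" is skipped.
word_start=$(printf 'The man\nThe human\n' | "$GREP" -P '\bman')
echo "$word_start"
```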

-

Character classes

A character class is a group of characters within [] brackets; it matches any single character from that group
- [abc] matches “a”, “b” or “c”
A hyphen denotes a range (e.g. [a-c] is equivalent to [abc])
Negate a character class with ^
- [^13579] matches any single character that is not an odd digit (including letters and spaces)
Note: Many special characters behave differently inside of character classes than they do outside of them.
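
A sketch of ranges, negation and one special character inside a class, using made-up input. The -o flag prints each match on its own line. GREP is a hypothetical helper variable that falls back to grep (assumed to be GNU grep with -P support) where ggrep is not installed.

```shell
# Assumption: use ggrep if it is installed, otherwise fall back to a GNU grep.
GREP=$(command -v ggrep || command -v grep)

# [a-c] is a range, so "d" is not printed.
in_range=$(printf 'a\nb\nd\n' | "$GREP" -P '[a-c]')
echo "$in_range"

# [^13579] matches the even digits and also the letter.
negated=$(printf '12345a\n' | "$GREP" -oP '[^13579]')
echo "$negated"

# Inside a class the dot is a literal dot, not "any character".
literal_dot=$(printf 'a.c\nabc\n' | "$GREP" -P 'a[.]c')
echo "$literal_dot"
```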

Predefined character classes

Shortcuts for commonly used character classes

\s - any whitespace character
\S - non-whitespace characters aka [^\s]
\d - any numeric digit aka [0-9]
\D - any non-digit aka [^0-9]
\w - a word character aka [a-zA-Z_0-9]
\W - a non-word character aka [^\w]
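
A sketch of the shortcuts on made-up ASCII input, combined with + and the -o flag so that each match is printed on its own line. GREP is a hypothetical helper variable that falls back to grep (assumed to be GNU grep with -P support) where ggrep is not installed.

```shell
# Assumption: use ggrep if it is installed, otherwise fall back to a GNU grep.
GREP=$(command -v ggrep || command -v grep)

# \d+ picks out the run of digits.
digits=$(printf 'room 42\n' | "$GREP" -oP '\d+')
echo "$digits"

# \w+ includes the underscore but stops at the space and the hyphen.
words=$(printf 'foo_1 bar-2\n' | "$GREP" -oP '\w+')
echo "$words"

# \s only matches the line that contains whitespace.
with_space=$(printf 'no_spaces\nhas space\n' | "$GREP" -P '\s')
echo "$with_space"
```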

Backtracking

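The quantifiers covered so far are “greedy”: they match as many characters as possible while still allowing the overall pattern to match.
Greedy quantifiers start with the longest possible match and backtrack one character at a time until the whole pattern matches.
Beware of catastrophic backtracking: nested quantifiers such as (a+)+ can take exponential time on input that almost matches.
.* is very, very suspicious, because it will happily consume the rest of the line before backtracking.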
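
Greedy matching can be observed with the -o flag, which prints only the matched text. The input strings in this sketch are made up; GREP is a hypothetical helper variable that falls back to grep (assumed to be GNU grep with -P support) where ggrep is not installed.

```shell
# Assumption: use ggrep if it is installed, otherwise fall back to a GNU grep.
GREP=$(command -v ggrep || command -v grep)

# [0-9]+ first takes all of "12333", then gives back one character
# so that the final 3 in the pattern can match.
greedy=$(printf '12333\n' | "$GREP" -oP '[0-9]+3')
echo "$greedy"

# .* runs to the last "c" on the line, not the first.
dot_star=$(printf 'abc-abc\n' | "$GREP" -oP 'a.*c')
echo "$dot_star"
```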

Lazy quantifiers

AKA Reluctant quantifiers

Consume the smallest possible sequence that still achieves a match.
Same syntax as greedy quantifiers, plus a ? at the end
e.g. abc*?, [0-9]+?3
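
The two example patterns can be tried with the -o flag, which prints only the matched text, one match per line. The input strings in this sketch are made up; GREP is a hypothetical helper variable that falls back to grep (assumed to be GNU grep with -P support) where ggrep is not installed.

```shell
# Assumption: use ggrep if it is installed, otherwise fall back to a GNU grep.
GREP=$(command -v ggrep || command -v grep)

# [0-9]+? stops as soon as a 3 can follow, so "12333" yields two short matches.
lazy=$(printf '12333\n' | "$GREP" -oP '[0-9]+?3')
echo "$lazy"

# c*? is happy with zero "c"s, so only "ab" is matched.
minimal=$(printf 'abccc\n' | "$GREP" -oP 'abc*?')
echo "$minimal"
```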

Flags

-r Recursively search through a directory.
-c Return the number of matching lines (per file when more than one file is searched).
-n (attached to our -P flag to become -nP) Display the line number of each matching line.
-i (attached to our -P flag to become -iP) Ignore case.
-v (attached to our -P flag to become -vP) Invert the match, i.e. return the lines that DO NOT match the pattern.
-o Output only the matched text, with each match on its own line.
-l Print only the names of the files that contain matches.

Remember, flags can be combined.
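
A sketch of the flags in combination. The temporary directory and the files a.txt and b.txt are made up for the example, and GREP is a hypothetical helper variable that falls back to grep (assumed to be GNU grep with -P support) where ggrep is not installed.

```shell
# Assumption: use ggrep if it is installed, otherwise fall back to a GNU grep.
GREP=$(command -v ggrep || command -v grep)

# Two throwaway sample files.
dir=$(mktemp -d)
printf 'Foo one\nbar two\nfoo three\n' > "$dir/a.txt"
printf 'nothing here\n' > "$dir/b.txt"

numbered=$("$GREP" -niP 'foo' "$dir/a.txt")   # line numbers, ignoring case
count=$("$GREP" -ciP 'foo' "$dir/a.txt")      # number of matching lines
inverted=$("$GREP" -viP 'foo' "$dir/a.txt")   # lines that do NOT match
matched=$("$GREP" -oiP 'foo' "$dir/a.txt")    # only the matched text
files=$("$GREP" -rliP 'foo' "$dir")           # names of files with matches

printf '%s\n' "$numbered" "$count" "$inverted" "$matched" "$files"
rm -r "$dir"
```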

-

Commands worth knowing: wc

The wc (word count) command reports the newline, word, byte and character counts of the files given as arguments.

Syntax:

wc [options] filenames

The following options can be used to specify its actions:

wc -l number of lines in a file.
wc -w number of words in a file.
wc -c count of bytes in a file.
wc -m count of characters from a file.
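
A sketch of the options on a made-up two-line file. Reading the file through < keeps the file name out of the output, so only the number is printed (some systems pad it with leading spaces).

```shell
# A throwaway sample file: two lines, three words, 14 bytes.
tmp=$(mktemp)
printf 'one two\nthree\n' > "$tmp"

lines=$(wc -l < "$tmp")
words=$(wc -w < "$tmp")
bytes=$(wc -c < "$tmp")

echo "lines: $lines, words: $words, bytes: $bytes"
rm "$tmp"
```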

-

Other commands worth knowing

The results of searches can be sent to another command by using the pipe | symbol. Here, we are sending the results of one ggrep search to another ggrep execution:

ggrep "^foo.#bar$" file.txt | ggrep -v "baz"

(the same search as the first ggrep on its own, but with the lines containing “baz” filtered out)

The results of searches can also be sent to a file by using the > redirect:

ggrep -iP -r "[A]\Sl" poetry/ > search_results.txt
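
Pipes and redirects can be combined, and wc makes a handy last step for counting. The sketch below uses made-up input and a temporary file; GREP is a hypothetical helper variable that falls back to grep (assumed to be GNU grep with -P support) where ggrep is not installed.

```shell
# Assumption: use ggrep if it is installed, otherwise fall back to a GNU grep.
GREP=$(command -v ggrep || command -v grep)

# Search, filter out the "baz" lines, and send what is left to a file.
out=$(mktemp)
printf 'apple pie\napple baz\nbanana\n' | "$GREP" -P '^apple' | "$GREP" -v 'baz' > "$out"
kept=$(cat "$out")
rm "$out"
echo "$kept"

# -o prints one match per line, so wc -l counts the matches.
total=$(printf 'a1 b22 c333\n' | "$GREP" -oP '\d+' | wc -l)
echo "$total"
```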

-

Resources

Regexr
Regular-expressions.info
Regex Crossword - Regex-based challenges
DuckDuckGo Regex Cheat Sheet
Grep Manual page (for full flag documentation)
