Regular Expressions

A regular expression is a powerful tool for matching patterns in text. With it, you can validate text input, search and replace text within a file, batch rename files, test for patterns within strings, and so on.

There are two types of regular expressions: basic regular expressions (BRE) and extended regular expressions (ERE). Most utilities (including vi, sed, and grep) use basic regular expressions by default. awk and egrep (equivalent to grep -E) use extended regular expressions.

There are three parts to a regular expression: anchors, character sets, and modifiers. Anchors are used to specify the position of the pattern in relation to a line of text. Character sets match one or more characters in a single position. Modifiers specify how many times the previous character set is repeated.

Anchor characters ^ and $

The character ^ is the starting anchor, and the character $ is the ending anchor. The regular expression ^A will match all lines that start with a capital A. The expression A$ will match all lines that end with the capital A.

The anchor characters work only when they are in the proper position; otherwise they are treated as ordinary characters. The ^ is an anchor only if it is the first character in a regular expression, and the $ is an anchor only if it is the last. The expressions $1 and 1^ therefore contain no anchors. To match a literal ^ at the beginning of a line, or a literal $ at the end of a line, escape the special character with a backslash.

Pattern  Matches
^A       A at the beginning of a line
A$       A at the end of a line
A^       A^ anywhere on a line
$A       $A anywhere on a line
^^       ^ at the beginning of a line
$$       $ at the end of a line
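
As a rough illustration, the anchored patterns can be tried in Python. This is only a sketch under stated assumptions: Python's re module uses Perl-style syntax rather than POSIX BRE, so ^ and $ are always anchors there and a literal ^ or $ has to be escaped, and the sample lines are made up for the example.

```python
import re

lines = ["Apple", "BANANA", "A^B", "cost $A"]  # made-up sample lines

# ^A: lines that start with a capital A
starts_with_a = [s for s in lines if re.search(r"^A", s)]

# A$: lines that end with a capital A
ends_with_a = [s for s in lines if re.search(r"A$", s)]

# Python needs a backslash for a literal ^ or $,
# whereas BRE treats a misplaced anchor as an ordinary character.
has_a_caret = [s for s in lines if re.search(r"A\^", s)]
has_dollar_a = [s for s in lines if re.search(r"\$A", s)]

print(starts_with_a)  # ['Apple', 'A^B']
print(ends_with_a)    # ['BANANA', 'cost $A']
```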

Matching a character with character sets

The regular expression the has three characters: t, h, and e. It matches any line that contains the string "the", which includes lines containing words such as "there" or "them". To match only the word itself, put a space before and after the pattern, as in " the ". You can also combine a string with an anchor, such as ^HPCC, which matches lines that start with "HPCC".
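
The difference between the bare string and the space-padded version can be sketched in Python. The sample lines are invented, and Python's re module is used here only because literal strings and the ^ anchor behave the same way as in the tools described in this section.

```python
import re

lines = ["over the hill", "there it is", "ask them", "the end"]  # invented samples

# The bare pattern matches every line that contains the letters t, h, e in a row
loose = [s for s in lines if re.search(r"the", s)]

# Padding the pattern with spaces keeps only the standalone word
strict = [s for s in lines if re.search(r" the ", s)]

# A literal string combined with the starting anchor
anchored = [s for s in ["HPCC docs", "the HPCC"] if re.search(r"^HPCC", s)]

print(strict)    # ['over the hill']
print(anchored)  # ['HPCC docs']
```

Note that the space-padded pattern skips "the end", because a word at the very start of a line has no space in front of it.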

Specifying a range of characters with [ ]

If you want to match specific characters, you can use square brackets to list the exact characters you are searching for. The pattern that matches any line of text consisting of exactly one digit is ^[0123456789]$.

You can use the hyphen between two characters to specify a range. For example ^[0-9]$ is identical to ^[0123456789]$.

You can intermix explicit characters with character ranges. This pattern will match a single character that is a letter, number, or underscore: [A-Za-z0-9_].
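
A minimal Python sketch of these sets follows. The bracket syntax shown here is the same in Python's re module as in the POSIX tools; the variable names and test strings are made up for the example.

```python
import re

# A line consisting of exactly one digit
one_digit = re.compile(r"^[0-9]$")

# A single character that is a letter, digit, or underscore
word_char = re.compile(r"[A-Za-z0-9_]")

single = [s for s in ["7", "42", "x", ""] if one_digit.search(s)]

# First character in each string that belongs to the set
first_word_chars = [word_char.search(s).group() for s in ["--a", "#_", " 9"]]

print(single)            # ['7']
print(first_word_chars)  # ['a', '_', '9']
```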

Specifying exceptions in character sets with [^ ]

[^ ] matches a single character that is not listed between the brackets.

For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter from "a" to "z". As before, literal characters and ranges can be mixed. To match any character except a lowercase vowel, use [^aeiou].

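Regular expression  Matches
[]                  Not a complete set: a "]" placed first is literal, so the bracket is never closed (GNU grep reports an error)
[0]                 The character "0"
[0-9]               Any digit
[^0-9]              Any character other than a digit
[-0-9]              Any digit or a "-"
[0-9-]              Any digit or a "-"
[^-0-9]             Any character except a digit or a "-"
[]0-9]              Any digit or a "]"
[0-9]]              Any digit followed by a "]"
[0-9-z]             Ambiguous: POSIX leaves a range that begins where another range ends undefined, so avoid this form
[0-9\-a]]           A digit or a character in the range "\" to "a", followed by a "]" (in POSIX tools a backslash is not an escape inside brackets)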
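
Several of these forms can be checked with a short Python sketch. It covers only the forms that Python's re module treats the same way as the POSIX tools (negation, a leading or trailing hyphen, and a leading "]"); the input strings are made up.

```python
import re

# [^aeiou]: keep every character that is not a lowercase vowel
no_vowels = "".join(re.findall(r"[^aeiou]", "regular"))

# A hyphen placed first or last in the set is a literal "-"
leading_hyphen = "".join(re.findall(r"[-0-9]", "tel: 555-0199"))
trailing_hyphen = "".join(re.findall(r"[0-9-]", "tel: 555-0199"))

# Negated set with a literal hyphen
not_digit_or_hyphen = re.findall(r"[^-0-9]", "5-a")

# A "]" placed first in the set is a literal "]"
digit_or_bracket = re.findall(r"[]0-9]", "a]1")

# A "]" after the closing bracket is an ordinary character that follows the set
digit_then_bracket = re.findall(r"[0-9]]", "a1] 2")

print(no_vowels)           # rglr
print(leading_hyphen)      # 555-0199
print(digit_then_bracket)  # ['1]']
```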

Matching anything with the wildcard character .

A dot . is a special meta-character. It will match any character, except the end-of-line character.

For example, the pattern that matches a line with a single character is ^.$, and the pattern for a line with two characters is ^..$. You can use ...\. to match any three characters followed by a period: the first three dots are wildcards, and the escaped final dot matches a literal period.
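
The wildcard can be tried out in Python, where the dot behaves the same way by default (it matches any character except a newline). This is a small sketch with invented sample text.

```python
import re

samples = ["a", "ab", "abc"]

# ^.$ and ^..$: lines of exactly one and exactly two characters
one_char = [s for s in samples if re.search(r"^.$", s)]
two_chars = [s for s in samples if re.search(r"^..$", s)]

# ...\. : any three characters followed by a literal period
m = re.search(r"...\.", "see fig. 2")
three_then_dot = m.group() if m else None

# By default the dot does not match the newline character
dot_matches_newline = re.search(r"a.b", "a\nb") is not None

print(one_char, two_chars)  # ['a'] ['ab']
print(three_then_dot)       # fig.
```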

Repeating character sets with *

The * character matches the preceding element zero or more times.

For example, ab*c matches "ac", "abc", "abbbc", etc. [xyz]* matches "", "x", "y", "z", "zx", "zyx", "xyzzy", and so on.

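Grouping creates an element that can be repeated with *. In a basic regular expression the group is written with escaped parentheses, so \(ab\)* matches "", "ab", "abab", "ababab", and so on; in an extended regular expression the same pattern is written (ab)*.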
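
The repetition examples can be sketched in Python. Python's re module writes groups with unescaped parentheses, as extended regular expressions do, and the helper name full is made up for this example.

```python
import re

def full(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# b repeated zero or more times between a and c
ab_star_c = [s for s in ["ac", "abc", "abbbc", "abd"] if full(r"ab*c", s)]

# Any sequence of x, y, z, including the empty string
xyz_star = [s for s in ["", "x", "zyx", "xyzzy", "xyza"] if full(r"[xyz]*", s)]

# The group "ab" repeated zero or more times
grouped = [s for s in ["", "ab", "abab", "aba"] if full(r"(ab)*", s)]

print(ab_star_c)  # ['ac', 'abc', 'abbbc']
print(xyz_star)   # ['', 'x', 'zyx', 'xyzzy']
print(grouped)    # ['', 'ab', 'abab']
```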

Matching a specific number of sets with \{ and \}

You cannot specify a maximum number of sets with the * modifier. There is a special pattern you can use to specify the minimum and maximum number of repeats. This is done by putting those two numbers between \{ and \}.

\{m,n\} matches the preceding element at least m and not more than n times. Do not put a space after the comma.

For example, a\{3,5\} matches only "aaa", "aaaa", and "aaaaa". Another example is [a-z]\{4,8\} which matches 4, 5, 6, 7 or 8 lower case letters.
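
A short Python sketch of the same intervals follows. Python's re module writes them without backslashes, as {m,n}, and the sample strings are invented.

```python
import re

candidates = ["aa", "aaa", "aaaa", "aaaaa", "aaaaaa"]

# Between three and five repetitions of "a", and nothing else
a_3_to_5 = [s for s in candidates if re.fullmatch(r"a{3,5}", s)]

# The first run of 4 to 8 lowercase letters; at most 8 are taken
m = re.search(r"[a-z]{4,8}", "ID regularexpression 42")
run = m.group() if m else None

print(a_3_to_5)  # ['aaa', 'aaaa', 'aaaaa']
print(run)       # regulare
```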

More examples

Regular expression  Matches
.og                 any three-character string ending with "og", including "dog", "fog", and "hog".
[df]og              "dog" and "fog".
[^d]og              all strings matched by .og except "dog".
[^df]og             all strings matched by .og other than "dog" and "fog".
^[df]og             "dog" and "fog", but only at the beginning of the string or line.
[df]og$             "dog" and "fog", but only at the end of the string or line.
\[.\]               any single character surrounded by "[" and "]" since the brackets are escaped, for example: "[a]" and "[b]".
b.*                 b followed by zero or more characters, for example: "b" and "boy" and "bowl".
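
These rows use syntax that Python's re module shares with the POSIX tools, so they can be checked directly. The word list and sample strings below are made up for the sketch.

```python
import re

words = ["dog", "fog", "hog", "log"]

dot_og = [w for w in words if re.fullmatch(r".og", w)]
df_og = [w for w in words if re.fullmatch(r"[df]og", w)]
not_d_og = [w for w in words if re.fullmatch(r"[^d]og", w)]
not_df_og = [w for w in words if re.fullmatch(r"[^df]og", w)]

# Escaped brackets match literal "[" and "]" around one character
bracketed = re.findall(r"\[.\]", "x[a]y[b]z[cd]")

# b followed by everything up to the end of the line
b_rest = re.search(r"b.*", "a bowl of soup").group()

print(df_og)      # ['dog', 'fog']
print(not_df_og)  # ['hog', 'log']
print(bracketed)  # ['[a]', '[b]']
print(b_rest)     # bowl of soup
```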

Extended regular expressions

The command line utilities egrep and awk use extended regular expressions. In extended expressions, the backslash before some special characters is no longer required.

For example, \{...\} becomes {...} and \(...\) becomes (...).

Examples:

  • [hc]+at matches "hat", "cat", "hhat", "chat", "hcat", "ccchat", etc.
  • [hc]?at matches "hat", "cat" and "at"
  • ([cC]at)|([dD]og) matches "cat", "Cat", "dog" and "Dog"
  • The characters (, ), [, ], ., *, ?, +, |, ^, and $ are special symbols and have to be escaped with a backslash in order to be treated as literal characters. For example, a\.(\(|\)) matches the string "a.)" or "a.(".
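
The extended syntax in these examples is also accepted by Python's re module, so the claims can be sketched as follows. The helper name matches and the candidate strings are made up for the example.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# + requires at least one of h or c before "at"
plus = matches(r"[hc]+at", ["hat", "cat", "chat", "ccchat", "at"])

# ? makes the leading h or c optional
optional = matches(r"[hc]?at", ["hat", "cat", "at", "chat"])

# | chooses between two alternatives
either = matches(r"([cC]at)|([dD]og)", ["cat", "Cat", "dog", "Dog", "cog"])

# Escaped special characters are matched literally
escaped = matches(r"a\.(\(|\))", ["a.(", "a.)", "ab(", "a."])

print(plus)      # ['hat', 'cat', 'chat', 'ccchat']
print(optional)  # ['hat', 'cat', 'at']
print(escaped)   # ['a.(', 'a.)']
```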

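Many modern regular expression tools that use Perl-style syntax (for example Perl, Python, and grep -P) allow a quantifier to be made non-greedy (i.e., to match as few times as possible) by putting a question mark after it. For example, in the string "[a] [bb]", \[.*?\] matches "[a]", because the wildcard is repeated as few times as possible, whereas the greedy \[.*\] matches the whole string "[a] [bb]". POSIX BRE and ERE do not define non-greedy quantifiers.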
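
Python's re module is one of the tools that supports non-greedy quantifiers, so the difference can be shown with a minimal sketch on the same string.

```python
import re

text = "[a] [bb]"

# Greedy: .* takes as much as it can, so the match runs to the last "]"
greedy = re.search(r"\[.*\]", text).group()

# Non-greedy: .*? stops at the first "]" it can
lazy = re.search(r"\[.*?\]", text).group()

# With the non-greedy form, each bracketed item is found separately
all_lazy = re.findall(r"\[.*?\]", text)

print(greedy)    # [a] [bb]
print(lazy)      # [a]
print(all_lazy)  # ['[a]', '[bb]']
```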

Comparison

BRE      ERE    Matches
\( \)    ( )    a marked subexpression. The string matched within the parentheses can be recalled later.
\+       +      the preceding element one or more times.
\?       ?      the preceding element one or zero times.
\|       |      the preceding element or the following element.
\{m,n\}  {m,n}  the preceding element at least m and not more than n times.
\{m\}    {m}    the preceding element exactly m times.
\{m,\}   {m,}   the preceding element at least m times.
\{,n\}   {,n}   the preceding element not more than n times.

The BRE forms \+, \?, and \|, and the {,n} interval in either syntax, are GNU extensions rather than part of POSIX, so they may not work in other implementations.

Examples

BRE        ERE      Matched results
\(ab\)*    (ab)*    "", "ab", "abab", "ababab", etc.
ab\+c      ab+c     "abc", "abbbc", etc., but not "ac".
[xyz]\+    [xyz]+   "x", "y", "z", "zx", "zyx", "xyzzy", etc.
\(ab\)\+   (ab)+    "ab", "abab", "ababab", etc.
ab\?c      ab?c     "ac" or "abc".
\(ab\)\?   (ab)?    "" or "ab".
abc\|def   abc|def  "abc" or "def".
a\{3,5\}   a{3,5}   "aaa", "aaaa", and "aaaaa".
ba\{,2\}b  ba{,2}b  "bb", "bab", "baab".
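
The escaping difference between the two syntaxes is mechanical enough to sketch as a conversion. The function bre_to_ere below is a hypothetical helper written for this example, not part of any tool: it assumes GNU-style BRE, simply swaps the escaped and bare forms of ( ) { } + ? |, and ignores bracket expressions and back-references. The converted pattern is then checked with Python's re module, which accepts the ERE-style forms used here.

```python
import re

# Characters whose escaped and bare forms swap roles between GNU BRE and ERE
TOGGLED = set("(){}+?|")

def bre_to_ere(pattern):
    """Naively convert a GNU-style BRE to ERE (illustration only).

    Does not handle bracket expressions or back-references.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # \( becomes (, while other escapes such as \. are kept as they are
            out.append(nxt if nxt in TOGGLED else ch + nxt)
            i += 2
        elif ch in TOGGLED:
            # A bare + or ( is literal in BRE, so it needs a backslash in ERE
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

print(bre_to_ere(r"\(ab\)\+"))  # (ab)+
print(bre_to_ere(r"a\{3,5\}"))  # a{3,5}
print(bre_to_ere(r"abc\|def"))  # abc|def
```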

POSIX character sets

POSIX adds named character classes as a more convenient and portable way to specify character sets. For example, you can use [[:upper:]] instead of [A-Z]. In fact, [A-Z] can match different characters on different systems, depending on the LC_COLLATE value. On the HPCC at MSU, the default ordering is a, A, b, B, c, C, ..., y, Y, z, Z, which is the standard en_US collation, so a range such as [A-Z] can also include lowercase letters.

A class name such as [:upper:] is only valid inside a bracket expression, so the complete pattern is [[:upper:]], not [:upper:]. You can mix the old style and the POSIX style within one set, such as [1-9[:upper:]].

Listing

Expression  Matches
[:alnum:]   Alphanumeric characters
[:alpha:]   Alphabetic characters
[:blank:]   Space and tab
[:cntrl:]   Control characters
[:digit:]   Digits
[:graph:]   Visible characters (printable, excluding the space)
[:lower:]   Lowercase letters
[:print:]   Printable characters, including the space
[:punct:]   Punctuation characters
[:space:]   Whitespace characters (space, tab, newline, and similar)
[:upper:]   Uppercase letters
[:xdigit:]  Hexadecimal digits
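
Python's re module does not support the POSIX class names, so the sketch below expands a few of them into explicit ASCII ranges before compiling the pattern. Both the POSIX_ASCII table and the expand_posix_classes function are hypothetical names made up for this example, and the ranges are ASCII-only approximations that ignore the locale; only the classes listed in the table are handled.

```python
import re

# ASCII-only approximations, written as the inside of a bracket expression
POSIX_ASCII = {
    "alnum": "A-Za-z0-9",
    "alpha": "A-Za-z",
    "blank": r" \t",
    "digit": "0-9",
    "lower": "a-z",
    "space": r" \t\n\r\f\v",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
}

def expand_posix_classes(pattern):
    """Replace each [:name:] with its ASCII range (raises KeyError if unlisted)."""
    return re.sub(r"\[:([a-z]+):\]", lambda m: POSIX_ASCII[m.group(1)], pattern)

print(expand_posix_classes("[[:upper:]]"))     # [A-Z]
print(expand_posix_classes("[1-9[:upper:]]"))  # [1-9A-Z]

# The expanded pattern can then be used with Python's re module
is_hex = re.fullmatch(expand_posix_classes("[[:xdigit:]]+"), "1aF") is not None
print(is_hex)  # True
```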