Regular Expressions
A regular expression is a powerful tool to match patterns. With this tool, you can validate text input, search/replace text within a file, batch rename files, test for patterns within strings etc.
There are two types of regular expressions: the basic regular expressions (BRE), and the extended regular expressions (ERE). Most utilities (including `vi`, `sed`, and `grep`) use the basic regular expression. `awk` and `egrep` use the extended expression.
There are three parts to a regular expression: anchors, character sets, and modifiers. Anchors are used to specify the position of the pattern in relation to a line of text. Character sets match one or more characters in a single position. Modifiers specify how many times the previous character set is repeated.
Anchor characters `^` and `$`
The character `^` is the starting anchor, and the character `$` is the ending anchor. The regular expression `^A` will match all lines that start with a capital A. The expression `A$` will match all lines that end with a capital A.
The anchor characters work only if they are in the proper location. Otherwise, they no longer act as anchors. For example, `^` is only an anchor if it is the first character in a regular expression, and `$` is only an anchor if it is the last character. The expressions `$1` and `1^` do not contain an anchor. If you want to match a `^` at the beginning of a line, or a `$` at the end of a line, you must escape the special character with a backslash.
pattern | Matches |
---|---|
`^A` | A at the beginning of a line |
`A$` | A at the end of a line |
`A^` | A^ anywhere on a line |
`$A` | $A anywhere on a line |
`^^` | ^ at the beginning of a line |
`$$` | $ at the end of a line |
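The anchor rules above can be tried out with `grep`, which uses basic regular expressions by default. This is a minimal sketch, assuming a POSIX shell; the sample lines and variable names are made up for illustration.

```shell
# Hypothetical sample input: one entry per line.
sample=$(printf '%s\n' 'Apple' 'BANANA' 'A' 'cat^A')

# Lines that start with a capital A.
starts=$(printf '%s\n' "$sample" | grep '^A')

# Lines that end with a capital A.
ends=$(printf '%s\n' "$sample" | grep 'A$')

# A caret that is not the first character is an ordinary character.
literal=$(printf '%s\n' "$sample" | grep 't^A')

printf 'starts with A:\n%s\n' "$starts"
printf 'ends with A:\n%s\n' "$ends"
printf 'literal caret:\n%s\n' "$literal"
```

Single quotes keep the shell from interpreting `$` and `^` before `grep` sees the pattern.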
Matching a character with character sets
The regular expression `the` has three characters: `t`, `h`, and `e`. It will match any line with the string "the" inside it. However, it will also match lines that contain the word "there" or "them". To prevent this, put a space before and after the pattern, as in " the ". You can also combine the string with an anchor, such as `^HPCC`.
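To see the difference between the bare pattern and the space-delimited one, here is a small sketch, assuming a POSIX shell with `grep`; the three input lines are invented for the example.

```shell
# Hypothetical sample input.
text=$(printf '%s\n' 'see the cat' 'over there' 'ask them')

# The bare pattern also matches "there" and "them": count matching lines.
loose=$(printf '%s\n' "$text" | grep -c 'the')

# With a space on each side, only the standalone word matches.
strict=$(printf '%s\n' "$text" | grep ' the ')

echo "$loose"    # 3
echo "$strict"   # see the cat
```

Note that the space-delimited pattern will not match "the" at the very start or end of a line, because there is no space on that side.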
Specifying a range of characters with [ ]
If you want to match specific characters, you can use square brackets to identify the exact characters you are searching for. The pattern that will match any line of text that consists of exactly one digit is `^[0123456789]$`. You can use a hyphen between two characters to specify a range. For example, `^[0-9]$` is identical to `^[0123456789]$`.
You can intermix explicit characters with character ranges. This pattern will match a single character that is a letter, digit, or underscore: `[A-Za-z0-9_]`.
Specifying exceptions in character sets with [^ ]
`[^ ]` matches a single character that is not contained within the brackets. For example, `[^abc]` matches any character other than "a", "b", or "c". `[^a-z]` matches any single character that is not a lowercase letter from "a" to "z". Likewise, literal characters and ranges can be mixed. To match all characters except vowels use `[^aeiou]`.
Regular Expression | Matches |
---|---|
`[]` | The characters "[]" |
`[0]` | The character "0" |
`[0-9]` | Any number |
`[^0-9]` | Any character other than a number |
`[-0-9]` | Any number or a "-" |
`[0-9-]` | Any number or a "-" |
`[^-0-9]` | Any character except a number or a "-" |
`[]0-9]` | Any number or a "]" |
`[0-9]]` | Any number followed by a "]" |
`[0-9-z]` | Any number, or any character between "9" and "z". |
`[0-9\-a]]` | Any number, a "-", a "a", or a "]" |
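A few of the character sets above can be checked directly with `grep`. The following is a sketch under the assumption of a POSIX shell; the input lines are hypothetical.

```shell
# Hypothetical sample input: single characters and one two-digit line.
input=$(printf '%s\n' '7' '42' 'x' '-')

# Lines consisting of exactly one digit.
one_digit=$(printf '%s\n' "$input" | grep '^[0-9]$')

# Lines consisting of exactly one character that is not a digit.
one_other=$(printf '%s\n' "$input" | grep '^[^0-9]$')

# A hyphen placed first in the set is taken literally.
digit_or_dash=$(printf '%s\n' "$input" | grep '^[-0-9]$')

printf 'one digit: %s\n' "$one_digit"
printf 'one non-digit:\n%s\n' "$one_other"
printf 'digit or dash:\n%s\n' "$digit_or_dash"
```

The line `42` never matches here because the anchors allow only one character between the start and the end of the line.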
Matching anything with the wildcard character .
A dot `.` is a special meta-character. It will match any character except the end-of-line character. For example, the pattern that will match a line with a single character is `^.$`, and a line with two characters is `^..$`. You can use `...\.` to match any three characters followed by a literal period: the final dot is escaped so that it matches a period instead of acting as a wildcard.
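The wildcard patterns described above behave as follows in a quick sketch, assuming a POSIX shell with `grep`; the input lines are made up.

```shell
# Hypothetical sample input.
lines=$(printf '%s\n' 'a' 'ab' 'abc' 'end.' 'endx')

# Exactly one character on the line.
one_char=$(printf '%s\n' "$lines" | grep '^.$')

# Exactly two characters on the line.
two_chars=$(printf '%s\n' "$lines" | grep '^..$')

# Any three characters followed by a literal period.
with_period=$(printf '%s\n' "$lines" | grep '...\.')

echo "$one_char"     # a
echo "$two_chars"    # ab
echo "$with_period"  # end.
```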
Repeating character sets with *
The `*` character matches the preceding element zero or more times. For example, `ab*c` matches "ac", "abc", "abbbc", etc. `[xyz]*` matches "", "x", "y", "z", "zx", "zyx", "xyzzy", and so on.
Using parentheses will create an element that can be repeated with `*`. For example, `(ab)*` matches "", "ab", "abab", "ababab", and so on. In basic regular expressions the parentheses must be escaped, so the same pattern is written `\(ab\)*`.
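As an illustration of repetition with `*`, here is a minimal sketch, assuming a POSIX shell and `grep` in its default basic mode (so the group uses escaped parentheses). The input lines are hypothetical.

```shell
# Hypothetical sample input.
lines=$(printf '%s\n' 'ac' 'abc' 'abbbc' 'adc' 'abab' 'aba')

# "a", then zero or more "b", then "c", and nothing else on the line.
star=$(printf '%s\n' "$lines" | grep '^ab*c$')

# Zero or more repetitions of the group "ab" (BRE spelling of the group).
group=$(printf '%s\n' "$lines" | grep '^\(ab\)*$')

printf 'ab*c:\n%s\n' "$star"
printf 'repeated group: %s\n' "$group"
```

The anchors matter: without them, `ab*c` would also match any longer line that merely contains "ac" somewhere.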
Matching a specific number of sets with `\{` and `\}`
You cannot specify a maximum number of sets with the `*` modifier. There is a special pattern you can use to specify the minimum and maximum number of repeats. This is done by putting those two numbers between `\{` and `\}`.
`\{m,n\}` matches the preceding element at least m and not more than n times. For example, `a\{3,5\}` matches only "aaa", "aaaa", and "aaaaa". Another example is `[a-z]\{4,8\}`, which matches 4, 5, 6, 7, or 8 lowercase letters.
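The interval syntax can be tested with `grep`. This sketch assumes a POSIX shell; the input is invented, and it also shows why anchors are usually needed when an upper bound is meant to be enforced on the whole line.

```shell
# Hypothetical sample input: runs of 2, 3, 5, and 6 letters.
runs=$(printf '%s\n' 'aa' 'aaa' 'aaaaa' 'aaaaaa')

# Anchored: the whole line must be three to five "a" characters.
bounded=$(printf '%s\n' "$runs" | grep '^a\{3,5\}$')

# Unanchored: a line of six "a" characters still contains a run of three
# to five, so it is counted as a match too.
unanchored_count=$(printf '%s\n' "$runs" | grep -c 'a\{3,5\}')

printf 'anchored:\n%s\n' "$bounded"
echo "unanchored matches: $unanchored_count"   # 3
```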
More examples
Regular expression | Matches |
---|---|
`.og` | any three-character string ending with "og", including "dog", "fog", and "hog". |
`[df]og` | "dog" and "fog". |
`[^d]og` | all strings matched by `.og` except "dog". |
`[^df]og` | all strings matched by `.og` other than "dog" and "fog". |
`^[df]og` | "dog" and "fog", but only at the beginning of the string or line. |
`[df]og$` | "dog" and "fog", but only at the end of the string or line. |
`\[.\]` | any single character surrounded by "[" and "]" since the brackets are escaped, for example: "[a]" and "[b]". |
`b.*` | b followed by zero or more characters, for example: "b" and "boy" and "bowl". |
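Some rows of the table can be reproduced with a short sketch, assuming a POSIX shell with `grep`. The word list is hypothetical; note that `grep` reports a line when the pattern matches anywhere in it, so "frog" matches `[^df]og` through its last three letters.

```shell
# Hypothetical word list.
words=$(printf '%s\n' 'dog' 'fog' 'hog' 'frog')

# "d" or "f" followed by "og".
df=$(printf '%s\n' "$words" | grep '[df]og')

# Any character other than "d" or "f", followed by "og".
not_df=$(printf '%s\n' "$words" | grep '[^df]og')

printf '[df]og:\n%s\n' "$df"
printf '[^df]og:\n%s\n' "$not_df"
```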
Extended regular expressions
The command line utilities `egrep` and `awk` use extended regular expressions. In extended regular expressions, the backslash before some special characters is no longer required. For example, `\{...\}` becomes `{...}` and `\(...\)` becomes `(...)`.
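The two spellings can be compared side by side. In this sketch, `grep` stands in for a BRE tool and `grep -E` (the POSIX spelling of `egrep`) for an ERE tool; the input lines and variable names are assumptions made for the example.

```shell
# Hypothetical sample input.
lines=$(printf '%s\n' 'aa' 'aaa' 'abab' 'aba')

# Interval: escaped braces in BRE, plain braces in ERE.
bre_interval=$(printf '%s\n' "$lines" | grep '^a\{3,5\}$')
ere_interval=$(printf '%s\n' "$lines" | grep -E '^a{3,5}$')

# Group: escaped parentheses in BRE, plain parentheses in ERE.
bre_group=$(printf '%s\n' "$lines" | grep '^\(ab\)*$')
ere_group=$(printf '%s\n' "$lines" | grep -E '^(ab)*$')

echo "interval: BRE=$bre_interval ERE=$ere_interval"
echo "group:    BRE=$bre_group ERE=$ere_group"
```

Both members of each pair select the same lines; only the escaping differs.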
Examples:
- `[hc]+at` matches "hat", "cat", "hhat", "chat", "hcat", "ccchat", etc.
- `[hc]?at` matches "hat", "cat", and "at".
- `([cC]at)|([dD]og)` matches "cat", "Cat", "dog", and "Dog".
- The characters `(`, `)`, `[`, `]`, `.`, `*`, `?`, `+`, `|`, `^`, and `$` are special symbols and have to be escaped with a backslash symbol in order to be treated as literal characters. For example, `a\.(\(|\))` matches the string "a.)" or "a.(".
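The extended examples above can be run with `grep -E`. This is a sketch assuming a POSIX shell; the input lines are hypothetical, and anchors are added to the first two patterns so that only whole-line matches are reported.

```shell
# Hypothetical sample input.
lines=$(printf '%s\n' 'hat' 'chat' 'at' 'Dog' 'a.(' 'bat')

# One or more of "h" or "c", then "at".
plus=$(printf '%s\n' "$lines" | grep -E '^[hc]+at$')

# At most one "h" or "c", then "at".
optional=$(printf '%s\n' "$lines" | grep -E '^[hc]?at$')

# Alternation between two groups.
either=$(printf '%s\n' "$lines" | grep -E '([cC]at)|([dD]og)')

# Escaped special characters are matched literally.
escaped=$(printf '%s\n' "$lines" | grep -E 'a\.(\(|\))')

printf '[hc]+at:\n%s\n' "$plus"
printf '[hc]?at:\n%s\n' "$optional"
echo "alternation: $either"
echo "escaped: $escaped"
```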
Modern regular expression tools allow a quantifier to be specified as non-greedy (i.e., match the fewest number of times) by putting a question mark after the quantifier. For example, in the string "[a] [bb]", `\[.*?\]` will match "[a]" since it matches the wildcard the fewest number of times. This is a feature of Perl-style engines; it is not part of the POSIX basic or extended syntax.
Comparison
BRE | ERE | Matches |
---|---|---|
`\( \)` | `( )` | a marked subexpression. The string matched within the parentheses can be recalled later. |
`\+` | `+` | the preceding element one or more times. |
`\?` | `?` | the preceding element one or zero times. |
`\|` | `|` | the preceding element or the following element. |
`\{m,n\}` | `{m,n}` | the preceding element at least m and not more than n times. |
`\{m\}` | `{m}` | the preceding element exactly m times. |
`\{m,\}` | `{m,}` | the preceding element at least m times. |
`\{,n\}` | `{,n}` | the preceding element not more than n times. |
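The first row of the table says a marked subexpression can be recalled later. A minimal sketch of that idea follows, using `\1` and `\2` back-references in `sed` and `grep`. It assumes a POSIX shell; `sed -E` for the extended form is supported by GNU and BSD `sed`, and the input strings are made up.

```shell
# BRE: escaped parentheses mark two groups; \2 and \1 recall them swapped.
swapped_bre=$(printf '%s\n' 'hello world' | sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/')

# ERE: the same substitution with unescaped parentheses.
swapped_ere=$(printf '%s\n' 'hello world' | sed -E 's/([a-z]*) ([a-z]*)/\2 \1/')

# A back-reference inside the pattern: the same letter twice in a row.
doubled=$(printf '%s\n' 'book' 'bok' | grep '\([a-z]\)\1')

echo "$swapped_bre"   # world hello
echo "$swapped_ere"   # world hello
echo "$doubled"       # book
```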
Examples
BRE | ERE | Matched results |
---|---|---|
`\(ab\)*` | `(ab)*` | "", "ab", "abab", "ababab", etc. |
`ab\+c` | `ab+c` | "abc", "abbbc", etc., but not "ac". |
`[xyz]\+` | `[xyz]+` | "x", "y", "z", "zx", "zyx", "xyzzy", etc. |
`\(ab\)\+` | `(ab)+` | "ab", "abab", "ababab", etc. |
`ab\?c` | `ab?c` | "ac" or "abc". |
`\(ab\)\?` | `(ab)?` | "" or "ab". |
`abc\|def` | `abc|def` | "abc" or "def". |
`a\{3,5\}` | `a{3,5}` | "aaa", "aaaa", and "aaaaa". |
`ba\{,2\}b` | `ba{,2}b` | "bb", "bab", "baab". |
POSIX character sets
POSIX has added newer and more convenient ways to search for character sets. For example, you can use `[[:upper:]]` instead of `[A-Z]`. In fact, `[A-Z]` can be different on different systems based on the `LC_COLLATE` value. For further discussion, check here.
On the HPCC at MSU, the default ordering for `[A-Z]` is a, A, b, B, c, C, ..., y, Y, z, Z, which is the standard collation (en_US).
A class name such as `[:upper:]` is only valid inside a bracket expression, so write `[[:upper:]]` rather than `[:upper:]` on its own. You can mix the old style and the POSIX style, such as `[1-9[:upper:]]`.
Listing
Expression | matches |
---|---|
`[:alnum:]` | Alphanumeric characters |
`[:alpha:]` | Alphabetic characters |
`[:blank:]` | Space and tab |
`[:cntrl:]` | Control characters |
`[:digit:]` | Digits |
`[:graph:]` | Printable and visible characters (not the space) |
`[:lower:]` | Lowercase characters |
`[:print:]` | Printable characters, including the space |
`[:punct:]` | Punctuation |
`[:space:]` | Whitespace (space, tab, newline, and similar) |
`[:upper:]` | Uppercase characters |
`[:xdigit:]` | Hexadecimal digits |
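The classes in the listing are used inside a bracket expression. Here is a short sketch, assuming a POSIX shell with `grep`; `LC_ALL=C` is set on each call so the result does not depend on the local collation, and the input lines are hypothetical.

```shell
# Hypothetical sample input.
lines=$(printf '%s\n' 'Hello' 'world' '42' 'A1')

# Lines that begin with an uppercase letter.
upper_first=$(printf '%s\n' "$lines" | LC_ALL=C grep '^[[:upper:]]')

# Lines made up of digits only.
digits_only=$(printf '%s\n' "$lines" | LC_ALL=C grep '^[[:digit:]]*$')

# Old-style range mixed with a class: count lines containing 1-9 or an
# uppercase letter.
mixed_count=$(printf '%s\n' "$lines" | LC_ALL=C grep -c '[1-9[:upper:]]')

printf 'uppercase first:\n%s\n' "$upper_first"
echo "digits only: $digits_only"     # 42
echo "mixed count: $mixed_count"     # 3
```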