regular expressions

City College of San Francisco - CS160B
Unix/Linux Shell Scripting
Module: Linux Review

Regular expressions are a pattern-description language, where special characters (regular expression operators) in combination with literal characters are used to represent patterns.

You should already be acquainted with regular expressions, at least with the basic set. In this section we will review the basic set and call attention to issues with using regular expressions to achieve exact results. Then we will briefly discuss extended regular expressions.

Before we start

There is always confusion between regular expressions and wildcards (shell metacharacters), which the shell uses to match filenames. Unfortunately, if you use a wildcard as a regular expression it will be accepted - it just means something different! The traditional way to avoid confusion is to tell students to put wildcards out of their minds while they study regular expressions. If this is effective for you, great. However, most students find themselves referring to familiar wildcards while they learn regular expressions. For this reason, we will compare them as we go along. Also, unfortunately, many Linux command lines contain both wildcards and regular expressions at the same time. Here is an example:

grep '^[0-9]' [a-z]*.txt

This command contains a wildcard [a-z]*.txt, which matches each item in the current directory whose name starts with a lower-case letter and ends with .txt (although this "lowercase" restriction will produce unexpected results on Linux as discussed later under character sets). The regular expression ^[0-9] searches the contents of those files for lines that start with a digit.

As we get into regular expressions we will contrast the operators so that you can fit regular expressions into your pattern-matching psyche along with wildcards and learn when to use each. For now, the example above shows a very important difference:

A wildcard must specify what the entire pattern looks like! In the example above, the wildcard [a-z]*.txt says "starts with a lower-case letter". This is implied since [a-z] is the first character in the wildcard. Regular expressions, however, do not imply the entire pattern. Rather, the regular expression matches a line if any part of the line matches. In our example above we explicitly said "starts with" by using the beginning-of-line anchor ^. To contrast, the command

grep '[0-9]' [a-z]*.txt

says [in the same set of files], output each line that contains a digit! Very different.
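The difference is easy to see on a small sample. This is only a sketch: the file contents and the variable names are made up for illustration.

```shell
# Hypothetical sample data: three lines, two of which contain a digit.
sample=$(mktemp)
printf '%s\n' '1 apple' 'banana 2' 'cherry' > "$sample"

# Anchored: only lines that START with a digit.
starts_with_digit=$(grep '^[0-9]' "$sample")

# Unanchored: any line that CONTAINS a digit.
contains_digit=$(grep '[0-9]' "$sample")

echo "starts with a digit:"
echo "$starts_with_digit"
echo "contains a digit:"
echo "$contains_digit"
rm -f "$sample"
```

The anchored command selects a subset of what the unanchored command selects.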

You also should have noticed that our regular expression is quoted. Without quotes, regular expressions can be "seen" by the shell and may be interpreted as wildcards! Suppose we executed the command

grep [0-9] [a-z]*.txt

in the directory

$ ls
5 abc file1 resume.txt

which, unfortunately, has a file named 5. Obviously, the wildcard [a-z]*.txt only matches resume.txt. However, since we did not quote our regular expression, it will be interpreted as a wildcard, too, and it matches 5! The command that is executed, then, looks like this:

grep 5 resume.txt

This is not exactly what we wanted. Worse, since we don't see the command after the wildcards get expanded, we won't know the problem occurred - we will just get the misleading results!!
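One way to see what the shell hands to grep is to put echo in front of the command. The sketch below recreates the directory listing above in a scratch directory (the directory and the variable names are assumptions made for the demonstration) and compares the unquoted and quoted forms.

```shell
# Recreate the example directory in a temporary location.
demo_dir=$(mktemp -d)
touch "$demo_dir/5" "$demo_dir/abc" "$demo_dir/file1" "$demo_dir/resume.txt"

# echo prints the argument list after the shell has expanded wildcards,
# which is the argument list grep would receive.
unquoted=$(cd "$demo_dir" && echo grep [0-9] [a-z]*.txt)
quoted=$(cd "$demo_dir" && echo grep '[0-9]' [a-z]*.txt)

echo "unquoted: $unquoted"   # the RE has been replaced by the filename 5
echo "quoted:   $quoted"     # the RE reaches grep unchanged
rm -rf "$demo_dir"
```

Prefixing a doubtful command with echo is a cheap habit for checking wildcard expansion before running the real thing.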

Basic Regular Expressions

The base set of regular expressions, called basic regular expressions or BREs, is understood by every program that understands regular expressions: grep, sed, more, and vi, to name the most common ones. BREs have only four kinds of operators (the two forms of character set count as one):

operator	meaning
^$	anchors
.	matches any one character
[stuff]	matches one character. stuff describes the values that character may have.
[^stuff]	matches one character. stuff describes the values that the character may not have.
*	0 or more of the preceding character.

We will go through these operators one at a time.

The anchors ^ and $

Since a RE matches a line if the line contains the RE, anchors are needed to indicate when a match must occur at the beginning or end of a line.

grep 'abc' file matches any line in file that contains abc

grep '^abc' file matches any line in file that begins with abc

grep 'abc$' file matches any line in file that ends with abc.

Note that these anchors are not needed in wildcards - since the wildcard must describe the entire filename for a match to occur. Example: abc* matches a filename that begins with abc. Not so with regular expressions.

If you need to match an entire line, use both anchors:

grep '^abc$' file matches lines that contain only abc: i.e., when examining the line, you come to the beginning of line, then see 'abc', then, immediately see the end of line.

Anchors are simple and very useful. A common requirement, for example, is to examine the contents of a variable to see if it matches a pattern exactly (as opposed to "contains a pattern"). To do this use

echo "$var" | grep '^pattern$'
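In a script you usually want the result of this test rather than the matching line. The sketch below wraps the idea in a small function. The function name is hypothetical, the literal pattern abc stands in for your own, and the -q option (which makes grep report only through its exit status) is an addition not used above.

```shell
# Hypothetical helper: succeeds only when the whole value is exactly "abc".
is_exactly_abc() {
    printf '%s\n' "$1" | grep -q '^abc$'
}

# Try a value that matches exactly, one that merely contains abc,
# and one with a trailing space.
for var in 'abc' 'xabcx' 'abc '; do
    if is_exactly_abc "$var"; then
        echo "'$var' matches exactly"
    else
        echo "'$var' does not match exactly"
    fi
done
```

Without the two anchors, all three values would pass the test, since each of them contains abc.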

The RE "any one character" operator ( . )

If you need to "skip over" a specific number of characters and you don't care what they are, use . to match exactly one character. Example: You are matching permissions in the output of ls -l and you want to know which files (only) are writable by group. The output of ls -l begins like this

-rwxrwx---
drwx--x--x

We only care about two characters in this output - the first character must be a dash, and the character for write permission for group - the sixth character - must be a w. To match this pattern, we use the RE

grep '^-....w'

Note: Using a '-' as the first character of a RE causes problems, since grep then confuses the RE with a grep option string. We avoid that here with the anchor. In the general case, enclose the dash in a character set if it is the first character of the RE: [-] or precede it with a backslash \- (some newer versions of grep may warn about the unneeded backslash). You can also use the -e option to tell grep that the next argument is the RE.

Note: the RE operator . is the same as the wildcard operator ?
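To try the permissions pattern without depending on the contents of a real directory, you can feed grep some canned ls -l output. In the sketch below the owner, group, sizes, dates and filenames are all invented.

```shell
# Hypothetical ls -l output saved to a scratch file.
listing=$(mktemp)
cat > "$listing" <<'EOF'
-rwxrwx--- 1 user staff  120 Jan  1 12:00 script.sh
drwx--x--x 2 user staff 4096 Jan  1 12:00 private
-rw-r--r-- 1 user staff   15 Jan  1 12:00 notes.txt
drwxrwxr-x 2 user staff 4096 Jan  1 12:00 shared
EOF

# A dash, then any four characters, then w in the sixth position.
group_writable=$(grep '^-....w' "$listing")
echo "$group_writable"
rm -f "$listing"
```

The directory named shared is group-writable too, but it is rejected because its first character is d rather than a dash.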

The character set [ ]

If you need to match a character and that character may only have certain values, use a character set. Character sets in REs are the same as character sets in wildcards with one exception: in a wildcard you use ! as the first character to indicate a character that is not one of these. In a RE this is indicated by ^ as the first character.

Example: To indicate "one character that is not a digit":

Wildcard: [!0-9] RE: [^0-9]

The possibilities for character sets include:

[abc]	one character that is a, b, or c
[a-c]	one character whose value is between 'a' and 'c' inclusive.

The standard groups of characters (uppercase, lowercase, and digits) are in order, meaning the value of 'b' is the value of 'a' + 1 so that you can use

[a-z]	to match a lowercase letter
[A-Z]	to match an uppercase letter
[^A-Z]	to match a character that is not an uppercase letter

But resist the temptation to try to match any alphabetic character using [a-Z], as the upper- and lower-case characters are not "next to each other". Instead, you must use two separate ranges in the same set: [a-zA-Z]

Although you will see these character ranges used in existing code, their use in the modern environment is discouraged for two reasons:

they only work for English
they are only guaranteed to work in the POSIX locale. (A discussion of locales is beyond our scope - suffice it to say these character ranges may not work as expected on Linux, which typically uses a locale and character set that accommodates international characters.)

Today, character classes should be used to indicate types of characters. These classes are independent of language or character set (or locale). The syntax is a bit clumsy and difficult to type, unfortunately, since it was made to be backwards-compatible:

[:classname:] denotes a member of character class classname. classnames include

[:alpha:]	alphabetic character
[:upper:]	uppercase alphabetic character
[:lower:]	lowercase alphabetic character
[:digit:]	a digit
[:punct:]	punctuation character
[:space:]	whitespace character
[:alnum:]	alphabetic or numeric character

Then, to indicate a character of that class, you enclose the class member specification in the bracket of a character set. Thus

[[:alpha:]] means an alphabetic character ("a single character that is one of the class [:alpha:]")

Examples:

[[:upper:]] an uppercase character

[[:alpha:][:space:]] a character that is alphabetic or whitespace

[^[:punct:]] a character that is any except a punctuation character

You can use character classes in wildcards as well as REs. In fact, bash will even accept ^ instead of ! as "not" in a wildcard (but do not count on this in other shells).
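Here is a short sketch of character classes at work in grep. The sample lines and variable names are invented. The -c option makes grep print the number of matching lines instead of the lines themselves.

```shell
# Hypothetical sample data, one item per line.
sample=$(mktemp)
printf '%s\n' 'Hello' 'hello' '42' 'wait...' > "$sample"

# Lines whose first character is an uppercase letter.
upper_start=$(grep -c '^[[:upper:]]' "$sample")

# Lines that contain at least one punctuation character.
has_punct=$(grep -c '[[:punct:]]' "$sample")

# Lines whose first character is NOT alphabetic.
non_alpha_start=$(grep -c '^[^[:alpha:]]' "$sample")

echo "uppercase start: $upper_start"
echo "has punctuation: $has_punct"
echo "non-alphabetic start: $non_alpha_start"
rm -f "$sample"
```

Note the doubled brackets in every case: the outer pair is the character set, the inner pair belongs to the class name.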

The * repetition operator

* means 0 or more of the preceding character (or character set). This is the most common source of errors for students new to REs. * does not stand for itself. It operates on whatever is immediately to the left of it.

"Jose[[:space:]]*Valdez"

means there may be any amount of whitespace characters between Jose and Valdez (including no whitespace characters).

You may want to indicate that there must be at least one whitespace character, and there may be more. To do this you must add one more: (0 or more + 1 = 1 or more) like this:

"Jose[[:space:]][[:space:]]*Valdez"

This is such a common thing to want that extended REs introduce a special operator to indicate one or more.
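The difference between "0 or more" and "1 or more" shows up as soon as a line has no whitespace at all between the two names. The sketch below uses invented sample lines. The \t in the printf format produces a tab.

```shell
# Hypothetical sample data: no separator, one space, three spaces,
# a tab, and a line with an extra word in between.
sample=$(mktemp)
printf 'JoseValdez\nJose Valdez\nJose   Valdez\nJose\tValdez\nJose X Valdez\n' > "$sample"

# 0 or more whitespace characters between the names.
zero_or_more=$(grep -c 'Jose[[:space:]]*Valdez' "$sample")

# 1 or more: one required whitespace character, then 0 or more.
one_or_more=$(grep -c 'Jose[[:space:]][[:space:]]*Valdez' "$sample")

echo "0 or more: $zero_or_more matching lines"
echo "1 or more: $one_or_more matching lines"
rm -f "$sample"
```

The only line that separates the two counts is JoseValdez. The line with X in the middle matches neither pattern.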

If the only whitespace character permitted is a space, just specify it instead of [[:space:]] :

"Jose *Valdez"

Matching any amount of anything

We are used to the * wildcard operator meaning "anything or nothing". The equivalent regular expression operator is 0 or more of any character or .*

Examples:

1. Output any line in file1 that contains the strings City and College, in that order, separated by anything.

grep 'City.*College' file1

If the words can be in either order you would need two regular expressions. The -e option to grep can be used to indicate each regular expression:

grep -e 'City.*College' -e 'College.*City' file1

but this is more cumbersome than piping one grep into another:

grep 'City' file1 | grep 'College'
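The three commands can be compared side by side on a small file. This is a sketch with invented contents for file1. Here file1 is a shell variable holding the name of a scratch file.

```shell
# Hypothetical contents for file1.
file1=$(mktemp)
printf '%s\n' 'City College of San Francisco' \
              'the College serves the City' \
              'City Hall' \
              'Community College' > "$file1"

# City followed (anywhere later on the line) by College.
one_order=$(grep -c 'City.*College' "$file1")

# Either order, using two regular expressions.
either_order=$(grep -c -e 'City.*College' -e 'College.*City' "$file1")

# Either order, using two grep commands.
two_greps=$(grep 'City' "$file1" | grep -c 'College')

echo "one order: $one_order"
echo "either order, -e: $either_order"
echo "either order, pipe: $two_greps"
rm -f "$file1"
```

On this sample the last two commands select the same lines. The pipeline version also scales more easily if a third word is added.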

2. Given a user name, you want to output the line of the /etc/passwd file for that user.

The passwd file has this format:

uname:pwd:uid:gid:gecos:home:shell

Thus, you want to match the line whose uname field matches your username. Let's assume your username is gboyd. The command

grep 'gboyd' /etc/passwd

is not so good because this will look for gboyd anywhere on any line. You only want gboyd to appear in the first field. To add this restriction you must provide some context by indicating what may appear to the left and to the right of what you are looking for.

Immediately to the left of gboyd must be the beginning of line. We know how to specify this - our anchor ^
Immediately to the right of gboyd must be a colon.

Since this context can only appear around the first field, it is sufficient. Thus, our command is

grep '^gboyd:' /etc/passwd
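To see why the context matters, try both commands on a passwd-style file in which gboyd also appears in places other than the first field. The records below are invented for the sketch and are not taken from any real system.

```shell
# Hypothetical passwd-format records.
passwd_sample=$(mktemp)
cat > "$passwd_sample" <<'EOF'
gboyd:x:1001:100:Sample User:/home/gboyd:/bin/bash
agboyd:x:1002:100:Another User:/home/agboyd:/bin/bash
smith:x:1003:100:Office next to gboyd:/home/smith:/bin/sh
EOF

# No context: gboyd anywhere on the line.
loose=$(grep -c 'gboyd' "$passwd_sample")

# With context: beginning of line on the left, colon on the right.
exact=$(grep '^gboyd:' "$passwd_sample")

echo "lines containing gboyd: $loose"
echo "line for user gboyd: $exact"
rm -f "$passwd_sample"
```

The anchor rejects agboyd and the colon would reject a longer name such as gboyd2.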

3. Given a user id (a number), you want to output the single line of the /etc/passwd file for that user.

The passwd file has this format:

uname:pwd:uid:gid:gecos:home:shell

where the uid (user id) and gid (group id) fields are numeric. The rest of the fields are alphabetic, but may have digits, such as part of the uname and part of the gecos field. Thus, your job is to match the line whose uid field matches the user id you are given.

Here, if we try to look for context, we have problems. Several fields have colons on either side, so we must extend our search:

the pwd field (to the left) on hills always has one character in it - a lower-case x. (This indicates the password is stored elsewhere.) This is fortunate, as the uid is the only numeric field that can have x to the left of it, and it will help us with our regular expression.
the gid field (to the right) always is numeric.

The combination of these two pieces of context gives us an iron-clad RE that will work (on hills only!) to match the line with a specific uid.

Suppose our uid was 27. Then, we need a RE to describe the following

a colon followed by an x followed by a colon followed by 27 followed by a colon followed by a number, or

grep ':x:27:[[:digit:]]' /etc/passwd

In fact, you could probably get away with less than this, but this is certainly 'safe'.
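The sketch below shows how much each piece of context narrows the search. The records are invented and arranged so that 27 appears as a uid, as a gid, and as part of a larger number.

```shell
# Hypothetical passwd-format records.
passwd_sample=$(mktemp)
cat > "$passwd_sample" <<'EOF'
mysql:x:27:27:Sample Service:/var/lib/mysql:/bin/false
user27:x:1027:100:Sample User:/home/user27:/bin/bash
al:x:127:27:Another User:/home/al:/bin/bash
EOF

# No context at all: 27 anywhere.
anywhere=$(grep -c '27' "$passwd_sample")

# Colons on both sides: still matches a gid of 27.
colons_only=$(grep -c ':27:' "$passwd_sample")

# x field on the left, a digit (the gid) on the right.
uid_only=$(grep ':x:27:[[:digit:]]' "$passwd_sample")

echo "27 anywhere: $anywhere lines"
echo ":27: $colons_only lines"
echo "uid 27: $uid_only"
rm -f "$passwd_sample"
```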

4. Given a colon-delimited file with five fields, output the line(s) whose second field has the number 100.

To clarify this, let's look at what a sample record might be:

46:542:873:12:204

This is a bit more messy. In this case three fields look the same: the field contains a number with a colon on either side. But you might say, "Well, to specify the contents of the second field, you just have to skip over the first field". This is a good idea, and may result in the following RE:

grep '^.*:100:'

However, this will match 100 in the second, third, or fourth field! Why? Because the . in .* can match the field delimiter : as well. In other words, . is "too general". When we are skipping the first field we need to say "any number of characters that aren't colons". But wait, you can write this: [^:]* This results in the RE

grep '^[^:]*:100:'

which works fine. In fact, all you have to do is to repeat the pattern [^:]*: as many times as you need to skip fields, so

grep '^[^:]*:[^:]*:100:' matches 100 in the third field

grep '^[^:]*:[^:]*:[^:]*:100:' matches 100 in the fourth field

When you get past the middle field, it is shorter to work from the right:

grep ':100:[^:]*$' matches 100 in the fourth field as well

If you have a limited number of fields, it may be easier to just include all the fields. By specifying the delimiter and including every delimiter on the line, you can force the delimiters to match and just use .* to skip fields:

grep ':.*:.*:100:' matches 100 in the fourth field. (In order for the colons to match, the 100 must be in the fourth field, since there are only four colons on the line!)
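The patterns above can be checked against a file that has 100 in a different field on each line. The five records below are invented for this sketch.

```shell
# Hypothetical five-field records: 100 is in field 2, 3, 4, 1 and 5 in turn.
records=$(mktemp)
cat > "$records" <<'EOF'
46:100:873:12:204
46:542:100:12:204
46:542:873:100:204
100:542:873:12:204
46:542:873:12:100
EOF

too_general=$(grep -c '^.*:100:' "$records")          # how many lines .* lets through
second=$(grep '^[^:]*:100:' "$records")               # skip exactly one field
third=$(grep '^[^:]*:[^:]*:100:' "$records")          # skip exactly two fields
fourth_from_right=$(grep ':100:[^:]*$' "$records")    # one field remains on the right
fourth_all_colons=$(grep ':.*:.*:100:' "$records")    # every delimiter spelled out

echo "too general: $too_general lines"
echo "second field: $second"
echo "third field: $third"
echo "fourth field (from right): $fourth_from_right"
echo "fourth field (all colons): $fourth_all_colons"
rm -f "$records"
```

Keep in mind that the last pattern only pins down the fourth field because every record has exactly four colons.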

5. Match a dollar and cents figure, where the dollar portion must have at least one digit, the dollar sign is required, and the cents part is required:

grep '\$[[:digit:]][[:digit:]]*\.[[:digit:]][[:digit:]]'

Question: which of the following would your RE match? If it doesn't work the way you want, can you fix it?

$14.45
$1.21
$ 4.00
$56
$.55
$14.456
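Once you have written down your prediction, one way to check it is to put the figures in a file, one per line, and run the RE over them. The scratch file and variable names in this sketch are assumptions.

```shell
# The six figures from the question, one per line.
# The quoted 'EOF' keeps the shell from treating $ specially.
figures=$(mktemp)
cat > "$figures" <<'EOF'
$14.45
$1.21
$ 4.00
$56
$.55
$14.456
EOF

# The RE from the example, unchanged.
matched=$(grep '\$[[:digit:]][[:digit:]]*\.[[:digit:]][[:digit:]]' "$figures")

echo "$matched"
rm -f "$figures"
```

If a line is printed that you did not expect, remember that a RE matches when any part of the line matches.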

Basic regular expressions also support the capture operator, which can be used to capture the text that matches part of a regular expression and replay the captured text in the same expression. We will see this use in sed. Here is an example of this especially ugly syntax:

grep '^\(.\)\1' outputs lines whose first two characters are the same. (The first character . is 'captured' in a pair of backslashed parentheses. This is 'captured subexpression #1'. It can then be referred to as \1 (replay subexpression #1). Hence \(.\) captures a character and \1 plays it back.) Try the command on /etc/passwd on hills.
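If you are not on hills, you can try capture and replay on any list of words. The sketch below uses the BRE '^\(.\)\1' (capture the first character, then replay it) on an invented word list.

```shell
# Hypothetical word list.
words=$(mktemp)
printf '%s\n' 'aardvark' 'llama' 'eel' 'bee' 'zebra' > "$words"

# \(.\) captures the first character; \1 requires the same character again.
doubled=$(grep '^\(.\)\1' "$words")

echo "$doubled"
rm -f "$words"
```

The word bee is not selected: it has a doubled letter, but not in the first two positions.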

Extended Regular Expressions

We do not have time to cover extended regular expressions other than to list the operators. For work in shell scripting, basic regular expressions are usually sufficient. The one operator that you may see in an ERE is the {} operator.

operator	meaning
( )	group a new regular expression (NOT TO BE CONFUSED with the capture operator of BREs)
|	OR
+	one or more of the preceding regular expression.
?	0 or 1 of the preceding regular expression (used to indicate part of a RE can be optional)
{m,n}	at least m, not more than n of the preceding regular expression. also {m} for exactly m or {m,} for at least m

First, notice that these operators can be applied to a regular expression. The RE they are applied to can be either

a single character OR
a subexpression enclosed in parentheses

The most common place to use extended REs (EREs) is with grep -E (also called egrep). Note: forgetting to 'activate' the ERE operators by adding -E when using grep is the most common source of errors when using EREs.

Here are some quick examples:

grep -E '^([^:]*:){3}140:' matches 140 in the fourth field of a colon-delimited record

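grep -E '^\$ *[[:digit:]]*(\.[[:digit:]]{2})?$' matches all but the last example of the dollar and cents figures in our previous problem, assuming each figure is on a line by itself. (The anchors are required here: every part of the RE after \$ is optional, so without them any line that contains a dollar sign would match.)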

grep -E '^[[:digit:]]{3}[. -]?[[:digit:]]{4}$' matches lines that look like a local phone number, with either dash, period, or space (or nothing) between the third and fourth digit.
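The sketch below runs the phone number RE over a few invented candidates, first with -E and then without it, to show what the "forgot -E" mistake looks like in practice.

```shell
# Hypothetical candidate phone numbers, one per line.
numbers=$(mktemp)
printf '%s\n' '555-1234' '555.1234' '555 1234' '5551234' \
              '555--1234' '55-51234' '1-555-1234' > "$numbers"

# With -E, {3}, ? and {4} are operators.
with_E=$(grep -E '^[[:digit:]]{3}[. -]?[[:digit:]]{4}$' "$numbers")

# Without -E the braces and ? are ordinary characters, so nothing matches.
# grep exits non-zero when it selects no lines, hence the || true.
without_E=$(grep -c '^[[:digit:]]{3}[. -]?[[:digit:]]{4}$' "$numbers") || true

echo "with -E:"
echo "$with_E"
echo "without -E: $without_E matching lines"
rm -f "$numbers"
```

Notice that the command without -E fails silently: there is no error message, just no output.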

Using EREs in situations where only BREs are allowed

Several Linux tools allow the use of extended regular expression operators, but there is no option to 'turn them on', such as there is in grep. In these situations, you activate an ERE operator by, believe it or not, preceding it with a backslash. Most ERE operators are available in these tools using this capability. For example, in vi you can search forward to the next instance of 4 digits using

/[[:digit:]]\{4\}

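grep allows the use of \ to activate ERE operators as well. (The \{ \} form is standard in BREs. The \? form is an extension provided by GNU grep, the grep found on Linux.) In the example above, the grep command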

grep -E '^[[:digit:]]{3}[. -]?[[:digit:]]{4}$'

can also be written

grep '^[[:digit:]]\{3\}[. -]\?[[:digit:]]\{4\}$'
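You can confirm that the two spellings select the same lines by running both and comparing the results. This sketch assumes GNU grep, since \? in a BRE is a GNU extension. The sample numbers are invented.

```shell
# Hypothetical candidate phone numbers, one per line.
numbers=$(mktemp)
printf '%s\n' '555-1234' '5551234' '555--1234' '1-555-1234' > "$numbers"

# ERE spelling, activated with -E.
ere_result=$(grep -E '^[[:digit:]]{3}[. -]?[[:digit:]]{4}$' "$numbers")

# BRE spelling: the same operators, each activated with a backslash.
bre_result=$(grep '^[[:digit:]]\{3\}[. -]\?[[:digit:]]\{4\}$' "$numbers")

if [ "$ere_result" = "$bre_result" ]; then
    echo "both commands selected the same lines:"
    echo "$ere_result"
else
    echo "the two commands disagree"
fi
rm -f "$numbers"
```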
