Shell Metacharacters

Before we study regular expressions we should get familiar with the shell metacharacters, which have a similar syntax. Regular expressions are used by tools such as "grep", "sed" and "awk", while shell metacharacters are expanded by the shell itself before it runs commands such as "ls" and "echo". Even though the syntax looks similar, the meaning is different, and we have to make sure that we do not confuse the two. Regular expressions are a lot more sophisticated and powerful than what the shell metacharacters have to offer.

Star(*)

The star matches anything: a single character, multiple characters
or no characters at all. Say we have the following files in a folder:

a ab b  c  fb  file1 file2 file3 file4

and we want the "ls" command to list file1 through file4. Using the star
we can write:

ls file*

[amittal@hills w1]$ ls file*

file1  file2  file3  file4

The above lists all the files whose names begin with
the word "file", followed by anything.

ls * gives us all the files

Example

$ ls
a  ab  b  c  fb  file1  file2  file3  file4

$ ls *
a  ab  b  c  fb  file1  file2  file3  file4

Create a folder with the following files:

1.txt  2.txt  a1  a2  b

We can create the files using the "touch" command.

$ ls *.txt
1.txt  2.txt

$ ls a*
a1  a2

$ ls a*1
a1

The "*" is not just limited to listing the files.
It will work with the "echo" command also.

[amittal@hills w1]$ echo *
a ab b c fb file1 file2 file3 file4

The shell will substitute "*" with the list of file
names and then execute the command. Again the command
can be anything.
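
The substitution can be observed directly. The following is a minimal sketch, assuming a scratch folder created with "mktemp -d" (the variable names "demo" and "start_dir" are only for illustration). It shows that the command only ever sees the expanded list of names.

```shell
# Hypothetical scratch folder so the demo does not touch real files.
demo=$(mktemp -d)
start_dir=$PWD
cd "$demo" || exit 1
touch a ab b c fb file1 file2 file3 file4

# The shell replaces the pattern with file names before echo runs.
file_matches=$(echo file*)
all_matches=$(echo *)
echo "file* -> $file_matches"
echo "*     -> $all_matches"

# Count how many separate arguments the pattern turns into.
set -- file*
file_count=$#
echo "file* expanded to $file_count arguments"

# Clean up the scratch folder.
cd "$start_dir" && rm -rf "$demo"
```

The count shows that "file*" reaches the command as several separate arguments, not as one string containing a star.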

Ex:
"file1.txt"
Contents of file 1 .

"file2.txt"
Contents of file 2 .

$ cat *
Contents of file 1 .
Contents of file 2 .

In the above the star is substituted and,
assuming these are the only 2 files in the folder,
the shell creates the command:

cat file1.txt file2.txt

The above command prints the contents of the 2 files.

Question Mark

The question mark matches exactly one character.

[amittal@hills w1]$ ls

a  ab  b  c  fb  file1  file2  file3  file4

[amittal@hills w1]$ ls ?
a  b  c

[amittal@hills w1]$ ls ??
ab  fb

The above command will list all the files with 2 characters.

$ ls
1.txt  2.txt  a1  a2  b  file1.txt  file2.txt

$ ls ?*?
1.txt  2.txt  a1  a2  file1.txt  file2.txt

In the above the file "b" is not printed
because the 2 question marks mean that the file
name must be at least 2 characters long.
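
The three patterns can be compared side by side. This is a small sketch under the assumption of a scratch folder holding the same files as above; "demo" and "start_dir" are illustrative names.

```shell
# Hypothetical scratch folder with the files from the example above.
demo=$(mktemp -d)
start_dir=$PWD
cd "$demo" || exit 1
touch 1.txt 2.txt a1 a2 b file1.txt file2.txt

one_char=$(echo ?)        # names of exactly 1 character
two_chars=$(echo ??)      # names of exactly 2 characters
two_or_more=$(echo ?*?)   # names of at least 2 characters

echo "?   -> $one_char"
echo "??  -> $two_chars"
echo "?*? -> $two_or_more"

cd "$start_dir" && rm -rf "$demo"
```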

Square Brackets

Square brackets match a single character from a range or a set. Ex:
[amittal@hills w1]$ ls

a  ab  b  c  fb  file1  file2  file3  file4  notes

[amittal@hills w1]$ ls file[1-4]
file1  file2  file3  file4

This lists all the files starting
with the word "file" and ending in any single digit from 1 to 4.

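Instead of a range we can also list the characters in the set. Ex:

[amittal@hills w1]$ ls file[1,3]
file1  file3

The "[1,3]" matches a single character that is "1", "," or "3". The comma is not a separator; it is just another character in the set, so "[13]" lists the same two files here.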

The square brackets with a leading "!" represent a Not
condition. Let's say we wanted to list files beginning with
the word "file" but do not want "file3" listed.

[amittal@hills w1]$ ls
a  ab  b  c  fb  file1  file2  file3  file4  notes  temp1

[amittal@hills w1]$ ls file[!3]
file1  file2  file4
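
To see how sets, commas and negation behave, here is a minimal sketch. It assumes a scratch folder and adds a hypothetical file named "file," (with a trailing comma) purely to show that a comma inside the brackets is matched like any other character.

```shell
# Hypothetical scratch folder; the names below are for illustration only.
demo=$(mktemp -d)
start_dir=$PWD
cd "$demo" || exit 1
touch file1 file2 file3 file4

set_match=$(echo file[13])    # one character that is 1 or 3
negated=$(echo file[!3])      # one character that is anything except 3
echo "file[13] -> $set_match"
echo "file[!3] -> $negated"

# Add a file whose name ends in a comma and compare the two spellings.
touch 'file,'
set -- file[1,3]
comma_set_count=$#            # the comma in the set matches "file," too
set -- file[13]
plain_set_count=$#
echo "file[1,3] matched $comma_set_count names, file[13] matched $plain_set_count"

cd "$start_dir" && rm -rf "$demo"
```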

Example:
$ ls
1.txt  2.txt  a1  a2  b  file1.txt  file2.txt

$ ls [1-2].txt
1.txt  2.txt

$ ls [1-5].txt
1.txt  2.txt

There are no files named "3.txt", "4.txt" or
"5.txt", so only "1.txt" and "2.txt" are printed out.

$ ls [a-z].txt

ls: cannot access '[a-z].txt': No such file or directory

$ touch a.txt

$ ls [a-z].txt
a.txt
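
The error message above mentions the pattern itself because a pattern that matches nothing is handed to the command unchanged. The sketch below illustrates this, assuming the shell's default settings (options such as bash's "nullglob" change this behaviour) and a scratch folder with illustrative names.

```shell
# Hypothetical scratch folder with no single-letter ".txt" files yet.
demo=$(mktemp -d)
start_dir=$PWD
cd "$demo" || exit 1
touch 1.txt 2.txt

# Nothing matches, so echo receives the pattern text itself.
unmatched=$(echo [a-z].txt)
echo "before: $unmatched"

# Once a matching file exists, the pattern is replaced by its name.
touch a.txt
matched=$(echo [a-z].txt)
echo "after:  $matched"

cd "$start_dir" && rm -rf "$demo"
```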

$ ls
1.txt  2.txt  a.txt  a1  a2  b  file1.txt  file2.txt

$ ls [a-z]*.txt
a.txt  file1.txt  file2.txt

$ ls
1.txt  2.txt  a  a.txt  a1  a2  b  F3.txt  file1.txt  file2.txt

$ ls [a-z1-9]*.txt
1.txt  2.txt  a.txt  file1.txt  file2.txt

$ ls [a-zA-Z]*.txt
a.txt  F3.txt  file1.txt  file2.txt

We can have multiple ranges.

$ ls [1,a]*.txt
1.txt  a.txt

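Each character inside the brackets is an alternative. The comma is not needed as a separator; it is simply one more character in the set, so "[1a]" is the cleaner way to write this.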

$ ls  a[1,2,3]
a1  a2

$ touch a
$ ls

1.txt  2.txt  a  a.txt  a1  a2  b  F3.txt  file1.txt  file2.txt

$ ls  a[1,2,3]*
a1  a2

The above means a name that starts with "a" followed by
one character from the set "1", "2", "3" (or a comma), followed by anything.

Curly Brackets

This represents an or condition. The shell expands the braces into one word per alternative. Ex:

[amittal@hills w1]$ ls
1.txt  2.txt  a  a.txt  a1  a2  b  F3.txt  file1.txt  file2.txt

$ ls {?,??}
a  a1  a2  b

The above lists any files whose names are either 1 or 2 characters long.


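$ ls {?,?*}
1.txt  2.txt  a  a  a.txt  a1  a2  b  b  F3.txt  file1.txt  file2.txt

The names "a" and "b" appear twice because each alternative is expanded separately and a one-character name matches both "?" and "?*".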
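
Curly brackets differ from the other metacharacters in that they generate words whether or not matching files exist. The following is a small sketch of that difference. Brace expansion is a bash feature rather than part of plain POSIX sh, so the sketch calls "bash -c" explicitly; the folder and variable names are illustrative.

```shell
# Braces produce one word per alternative, even if no such files exist.
generated=$(bash -c 'echo file{1,2,9}')
echo "file{1,2,9} -> $generated"

# Each alternative is then expanded on its own, so a name can repeat.
demo=$(mktemp -d)
touch "$demo/a" "$demo/ab"
repeated=$(cd "$demo" && bash -c 'echo {?,?*}')
echo "{?,?*}      -> $repeated"

rm -rf "$demo"
```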

Backslash

The backslash is used to quote special characters.
Let's say we wanted to create a file named "*".

[amittal@hills temp1]$ touch \*
[amittal@hills temp1]$ ls
*
[amittal@hills temp1]$ rm \*
[amittal@hills temp1]$ ls
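
Besides the backslash, single and double quotes also stop the shell from expanding a star. This is a minimal sketch comparing the forms, assuming a scratch folder with two illustrative files.

```shell
# Hypothetical scratch folder with two files to expand against.
demo=$(mktemp -d)
start_dir=$PWD
cd "$demo" || exit 1
touch file1 file2

unquoted=$(echo *)     # expanded by the shell into the file names
escaped=$(echo \*)     # backslash: a literal star
single=$(echo '*')     # single quotes: a literal star
double=$(echo "*")     # double quotes: a literal star

echo "unquoted: $unquoted"
echo "escaped:  $escaped"
echo "single:   $single"
echo "double:   $double"

cd "$start_dir" && rm -rf "$demo"
```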

Regular Expressions

Recall that grep searches for a word or a pattern and prints the line if it finds it.

Star(*)

The star applies to the single character (or bracket expression) just before it and means 0 or more occurrences of it.

[amittal@hills w1]$ echo "The fox ate an orange" | grep "fo*x"
The fox ate an orange

[amittal@hills w1]$ echo "The foox ate an orange" | grep "fo*x"
The foox ate an orange

The pattern we are searching for is "fo*x". It matches an "f",
followed by any number of "o"s (including zero), followed by an "x".
The pattern is quoted so that the shell does not try to expand the
"*" as a file name pattern before "grep" sees it.

[amittal@hills w1]$ echo "The fx ate an orange" | grep "fo*x"
The fx ate an orange

The following do not produce a match because only "o"s are
allowed between the "f" and the "x".

$ echo "The fix ate an orange" | grep "fo*x"

$ echo "The fonx ate an orange" | grep "fo*x"
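
The same checks can be run in one go. The loop below is a small sketch that tries the pattern against several words and collects the ones that match; the word list and the variable names are chosen only for illustration.

```shell
pattern='fo*x'
matched=""

# Try each word and remember the ones grep accepts.
for word in fx fox foox fix fonx; do
    if echo "$word" | grep -q "$pattern"; then
        matched="$matched $word"
    fi
done

echo "matched:$matched"
```

The "-q" option makes "grep" stay silent and report the result only through its exit status, which is what the "if" tests.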

Dot(.)

The "." can be used to match any single character.

[amittal@hills w1]$ echo "The fx ate an orange" | grep f.x

[amittal@hills w1]$

[amittal@hills w1]$ echo "The fox ate an orange" | grep f.x

The fox ate an orange

Square Brackets re

The square brackets specify a set of characters that we can choose from.


$ echo  "cat"  | grep "c[a,b]t"
cat

$ echo  "cbt"  | grep "c[a,b]t"
cbt

$ echo  "cct"  | grep "c[a,b]t"

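Note that the comma inside the brackets is an ordinary character, so "c[a,b]t" also matches "c,t". Writing "c[ab]t" avoids this.

We can use the range syntax also.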


$ echo  "ct1"  | grep "ct[0-9]"
ct1

The above means any single digit
from 0 to 9.


$ echo  "ct1"  | grep "ct[0-4,6-9]"
ct1

$ echo  "ct5"  | grep "ct[0-4,6-9]"

No match in the above.


$ echo  "c13t"  | grep "c[0-4,6-9]t"
No match because the string
"c13t" has 2 digits between "c" and "t" while "c[0-4,6-9]t"
allows only 1 character there.
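
Because the comma is part of the set, the two spellings of the bracket expression do not accept exactly the same strings. The sketch below compares them on a few illustrative words, including the hypothetical input "c,t".

```shell
with_comma=""
without_comma=""

for word in cat cbt cct c,t; do
    # Set written with a comma: a, b and the comma itself are allowed.
    if echo "$word" | grep -q 'c[a,b]t'; then
        with_comma="$with_comma $word"
    fi
    # Set written without a comma: only a and b are allowed.
    if echo "$word" | grep -q 'c[ab]t'; then
        without_comma="$without_comma $word"
    fi
done

echo "c[a,b]t matched:$with_comma"
echo "c[ab]t  matched:$without_comma"
```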

+ re

We can use the "+" sign to indicate 1 or more occurrences of the preceding character.

$ echo  "c13t"  | grep "1+"

The above did not produce any match. Why not?
The "+" is only special in an extended regular
expression. In a basic regular expression, which is
what "grep" uses by default, it is a literal "+".
We tell "grep" to use extended regular expressions
with the "-E" option or by using "egrep", an older
spelling of "grep -E".

$ echo  "c13t"  | grep -E "1+"
c13t

$ echo  "c13t"  | egrep "1+"
c13t

$ echo  "ca+t"  | grep "c[a,b]+t"
ca+t

Without "-E" the "+" is a normal character and
"grep" finds a match.

$ echo  "ca+t"  | grep -E  "c[a,b]+t"

We do not find a match because with "-E" the "+" is
a repetition operator, not a regular character.
If we do want to match a literal "+" in an extended regular
expression then we need to "escape" it.

$ echo  "ca+t"  | grep -E  "c[a,b]\+t"
ca+t

$ echo  "cababat"  | grep -E  "c[a,b]+t"
cababat

We can have more than 1 occurrence of either "a"
or "b" as the above shows.

$ echo  "ct"  | grep -E  "c[a,b]+t"

We need at least 1 occurrence for a match.
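
The difference between the two modes is easiest to see when the same pattern is run over the same lines. This is a minimal sketch with an illustrative word list; it uses "c[ab]+t" without the comma.

```shell
# Five test lines, one word per line.
words='ct
cat
cbt
cababat
ca+t'

# Extended mode: "+" means one or more of the preceding set.
ere_matches=$(printf '%s\n' "$words" | grep -E 'c[ab]+t')
# Basic mode: "+" is an ordinary character.
bre_matches=$(printf '%s\n' "$words" | grep 'c[ab]+t')

echo "with -E:"
echo "$ere_matches"
echo "without -E:"
echo "$bre_matches"
```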

Question Mark re

The question mark applies to the character before it and means zero or 1 occurrence of that character.

[amittal@hills w1]$ echo "aab" | egrep "aab?"
aab

[amittal@hills w1]$ echo "aabb" | egrep "aab?"
aabb

[amittal@hills w1]$ echo "aa" | egrep "aab?"
aa

[amittal@hills w1]$ echo "ac" | egrep "aab?"

$ echo  "ct"  | grep -E  "c[a,b]?t"
ct

$ echo  "caat"  | grep -E  "c[a,b]?t"

$ echo  "cabt"  | grep -E  "c[a,b]?t"

$ echo  "cabt"  | grep -E  "c[a,b]?[a,b]?t"
cabt

Curly Bracket re

Repeat a pattern a certain number of times.

Check if 2 lower case characters
appear next to each other.

$ echo "aaa" | grep -E "[a-z]{2}"
aaa

$ echo "BaBc" | grep -E "[a-z]{2}"

No match found in the above because no 2 lower
case characters are adjacent.

$ echo "BaBc" | grep -E "a{1}"
BaBc

Match found.
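
The curly brackets also accept a minimum and a maximum, written "{min,max}", as used in the exercises later on. The following is a small sketch with illustrative words, checking which ones contain a "b", then 2 or 3 "a"s, then a "c".

```shell
matched=""

for word in bac baac baaac baaaac; do
    # a{2,3} means at least 2 and at most 3 occurrences of "a".
    if echo "$word" | grep -Eq 'ba{2,3}c'; then
        matched="$matched $word"
    fi
done

echo "matched:$matched"
```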

Round Bracket re

The round bracket can contain strings separated by a "|" and any one of those strings can produce a match.

$ echo "Open AI is the new thing." | grep -E "(the)"
Open AI is the new thing.

$ echo "Open AI is the new thing." | grep -E "(new|the)"
Open AI is the new thing.

$ echo "Open AI is the new thing." | grep -E "(food)"

No match


$ echo "Open AI is the new thing." | grep -E "([e]+|less)"
Open AI is the new thing.

We can place the square brackets inside the round brackets.

Caret and Dollar re

The caret symbol "^" means start of the line and the dollar symbol "$" means end of the line.

$ echo "Open AI" | grep -E "^Open"
Open AI

$ echo "aOpen AI" | grep -E "^Open"

$ echo "Open AI" | grep -E "^Open$"

No match in the above because "^Open$"
requires the whole line to be exactly "Open".

$ echo "Open" | grep -E "^Open$"
Open
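
When both anchors are used, the "-x" option of "grep" (match the whole line) gives the same result without writing them. This is a minimal sketch counting matching lines with "-c"; the three test lines are illustrative.

```shell
lines='Open
Open AI
aOpen AI'

# Lines that start with "Open".
starts=$(printf '%s\n' "$lines" | grep -c '^Open')
# Lines that are exactly "Open", written with both anchors.
anchored=$(printf '%s\n' "$lines" | grep -c '^Open$')
# The same check using -x instead of the anchors.
whole_line=$(printf '%s\n' "$lines" | grep -c -x 'Open')

echo "start with Open: $starts"
echo "exactly Open:    $anchored (anchors), $whole_line (-x)"
```

Single quotes are used around the patterns so that the shell leaves the "$" alone.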

Regular Expression Cheat Sheet

Character Classes re


In addition to the square brackets we can use
character classes to represent certain ranges.

[[:alnum:]]
Any of "[:digit:]" or "[:alpha:]".

[[:alpha:]]
Any letter: a b c d e f g h i j k l m n o p q r s t u v w x y z,
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z.

[[:blank:]]
Space or tab.

[[:digit:]]
Any one of 0 1 2 3 4 5 6 7 8 9.

[[:lower:]]
Any one of a b c d e f g h i j k l m n o p q r s t u v w x y z.

[[:punct:]]
Any one of ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.

[[:space:]]
Any one of CR FF HT NL VT SPACE.

[[:upper:]]
Any one of A B C D E F G H I J K L M N O P Q R S T U V W X Y Z.

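$ echo "a134" | grep "[[:alpha:]]"
a134

$ echo "134" | grep "[[:alpha:]]"

$ echo "134" | grep "[[:digit:]]"
134

The patterns are quoted so that the shell does not expand the square brackets as a file name pattern.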
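
Quoting matters here because the shell understands character classes in file name patterns too. The sketch below assumes a scratch folder containing a hypothetical file named "a" and shows what "grep" actually receives in each case.

```shell
# Hypothetical scratch folder with a single one-letter file.
demo=$(mktemp -d)
start_dir=$PWD
cd "$demo" || exit 1
touch a

# Unquoted, the shell replaces the class with the matching file name.
unquoted=$(echo [[:alpha:]])
quoted=$(echo '[[:alpha:]]')
echo "unquoted argument: $unquoted"
echo "quoted argument:   $quoted"

# So the unquoted search looks for the letter "a" and misses "b".
if echo "b" | grep -q [[:alpha:]]; then unquoted_hit=yes; else unquoted_hit=no; fi
if echo "b" | grep -q '[[:alpha:]]'; then quoted_hit=yes; else quoted_hit=no; fi
echo "unquoted finds b: $unquoted_hit, quoted finds b: $quoted_hit"

cd "$start_dir" && rm -rf "$demo"
```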



We explain a little bit more about extended
regular expressions.
In basic regular expressions the meta-characters
‘?’, ‘+’, ‘{’, ‘|’, ‘(’, and ‘)’ lose their
special meaning.

If we want the characters to take on special meanings
we need to escape them:
‘\?’, ‘\+’, ‘\{’, ‘\|’, ‘\(’, and ‘\)’.

In extended regular expressions it's the opposite.
We do not have to escape the meta-characters, and in order to
take off the special meaning we need to escape them.

For the "grep" utility we can use the "egrep" or "grep -E" command for
the extended usage.

For "sed" we can use "-r", or "-E" if we are on the Mac.
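
The same switch can be seen with "sed". This is a minimal sketch using the substitute command "s/pattern/replacement/", which is not covered in the text above and is used here only to make the difference visible; it assumes a "sed" that accepts "-E".

```shell
# Basic syntax: "t+" means a literal "t+", so "letter" is left alone.
basic=$(echo "letter" | sed 's/t+/T/')

# Extended syntax: "t+" means one or more "t"s, replaced by a single "T".
extended=$(echo "letter" | sed -E 's/t+/T/')

echo "basic:    $basic"
echo "extended: $extended"
```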

$  echo  "letter" | grep "et+e"


$  echo  "let+er" | grep "et+e"
let+er

In the first example even though "t" is repeated 2 times "grep"
only looks for the literal "+" . This is illustrated in the second
search. This time grep finds the word because it is looking for the
"+" character .

$ echo  "letter" | grep "et\+e"
letter

To tell grep to consider the "+" character as special we escape
it and this time it finds the match.

Now let's repeat the scenario with the extended grep.

$  echo  "letter" | egrep "et+e"
letter

$  echo  "let+er" | egrep "et+e"

$  echo  "let+er" | egrep "et\+e"
let+er

In the first line :

echo  "letter" | egrep "et+e"

the "+" is recognized as a special
character and it finds the word.

echo  "let+er" | egrep "et+e"

In the second line we are looking for the literal "+" but since the
"+" is a meta character we are not going to find it. In order to take
off the special meaning we need to escape the "+" and this is what happens
on the third line.

echo  "let+er" | egrep "et\+e"

The characters ‘?’, ‘+’, ‘{’, ‘|’, ‘(’, and ‘)’ were introduced after the
regular expressions had already been defined. This created a problem: if a
utility such as "grep" started treating these characters as special, old
scripts would break. Instead, "egrep" and the "-E" option were introduced
to deal with the new characters.
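
One related point: escaped forms such as "\+" in a basic regular expression are an extension offered by GNU grep rather than part of the POSIX basic syntax. The escaped curly brackets "\{min,max\}" are part of it, so "one or more" can also be written as "\{1,\}". The following sketch compares that form with the extended one.

```shell
# Basic syntax: escaped braces with an open upper bound mean "1 or more".
if echo "letter" | grep -q 'et\{1,\}e'; then basic_hit=yes; else basic_hit=no; fi

# Extended syntax: the same idea written with "+".
if echo "letter" | grep -Eq 'et+e'; then extended_hit=yes; else extended_hit=no; fi

echo "basic with \\{1,\\}: $basic_hit"
echo "extended with +:   $extended_hit"
```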



Exercises

Ex1:

Create some files in a folder. You can
use the "touch" command if you like.

file1.txt  file2.txt  file3  file4

List files "file1.txt" and "file2.txt"  without
using the word "txt" in your
search command.  Do not use the file names explicitly.

Ex2:

Create some files in a folder.
You can use the "touch" command if you like.

a ab b  c  fb  file1 file2 file3 file4

List files "fb" and "file1" only.

Do not use the file names explicitly.

Ex3:

Write a grep that will match a line if
it contains a CA license plate with
the following format:

Digit UpperCase UpperCase UpperCase Digit Digit Digit

The line should contain only the license number and nothing else.

Ex4:

Write a grep that matches the following name.
It has 3 words with the last
word being "Blvd" Ex:

Amador Valley Blvd

All the words start with an uppercase and there are exactly 2 spaces in the
phrase.

Ex5:

Suppose a folder has 10 files named
"file1", "file2" ... "file10" .

Complete the following command:

ls  "expression"

to list the files 5 through 10 but
you cannot use the numbers 5-10 .

Ex6:

Create 6 files with the following names:

file1, file2, file3 ... file6

Use the range square brackets [1-2] , [4-6]
together with the curly brackets
to output the files:

1 to 2 and 4 to 6 .

Ex7:

What kind of string matches the following expression:

egrep "^T{2}.*t$"

Use "echo" to test your answer out.

Ex8:

What kind of string matches the following expression:

egrep "[a,c]{1,3}"

If a string satisfies the above expression will it also satisfy  the
below expression ?

egrep "[a,c]{1,2}"

Ex9:

Sort the attached file "num.txt" so that the single digits are sorted
first and then the double digits. The sorted file should look like "data.txt" .

Solutions
Solution 1

ls *.*

Solution 2

ls f*[b1]

Solution 3

[amittal@hills ~]$ echo "3BPZ780" | egrep "[0-9][A-Z]{3}[0-9]{3}"

3BPZ780

This will find the license number. Now we need to make sure there are no other
words on that line.

 echo "3BPZ780" | egrep "^[0-9][A-Z]{3}[0-9]{3}$"

Using the caret and the dollar sign we specify that the pattern must start
at the beginning of the line and finish at its end.
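
The anchored expression can be checked against a few lines that should be rejected as well. The following is a small sketch; the test lines other than "3BPZ780" are made up for illustration.

```shell
plate_re='^[0-9][A-Z]{3}[0-9]{3}$'
accepted=""

# One valid plate, then: one digit too many, extra words, one letter too few.
for line in "3BPZ780" "3BPZ7801" "Plate 3BPZ780" "3BP780"; do
    if echo "$line" | grep -Eq "$plate_re"; then
        accepted="$accepted [$line]"
    fi
done

echo "accepted:$accepted"
```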

Solution 4

 echo "Amador Valley Blvd" | egrep "^[A-Z][a-zA-Z]* [A-Z][a-zA-Z]* Blvd$"

Solution 5

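$ ls file{[!1-4],??}

The "[!1-4]" alternative matches "file5" to "file9" and the "??" alternative matches "file10", the only name with 2 characters after "file".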

Solution 6

$ ls file{[1-2],[4-6]}

file1  file2  file4  file5  file6

Solution 7

There should be 2 occurrences of capital "T" at the beginning and a small
"t" at the end with any number of characters in the middle.

$ echo "TThis is a test" | egrep "^T{2}.*t$"

TThis is a test

Solution 9

File: "mysort.sh"

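# Single digits first: keep lines that are exactly one digit, then sort them.
grep -E "^[0-9]$" num.txt | sort > data1.txt

# Everything else (-v inverts the match), sorted separately.
grep -E -v "^[0-9]$" num.txt | sort > data2.txt

# Join the two sorted parts.
cat data1.txt data2.txt > data.txt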
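
For comparison, a numeric sort can give the same order in one step. The sketch below is run on a hypothetical "num.txt", since the attached file is not shown here. It assumes the file holds only one and two digit numbers without leading zeros; with other contents the two approaches can differ.

```shell
# Hypothetical sample data standing in for the attached num.txt.
demo=$(mktemp -d)
printf '%s\n' 12 3 45 7 10 1 > "$demo/num.txt"

# Two-pass approach from the solution above.
grep -E '^[0-9]$' "$demo/num.txt" | sort > "$demo/data1.txt"
grep -E -v '^[0-9]$' "$demo/num.txt" | sort > "$demo/data2.txt"
two_pass=$(cat "$demo/data1.txt" "$demo/data2.txt")

# One-pass alternative: sort -n compares the lines as numbers.
one_pass=$(sort -n "$demo/num.txt")

echo "two passes:"
echo "$two_pass"
echo "sort -n:"
echo "$one_pass"

rm -rf "$demo"
```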