Shell Metacharacters
Before we study regular expressions we should get familiar with the shell metacharacters, which have a similar syntax. Regular expressions are used by tools such as "grep", "sed" and "awk", while shell metacharacters are expanded by the shell itself when we run commands such as "ls" and "echo". Even though the syntax looks similar, the meaning is different, and we have to make sure that we do not confuse the two. Regular expressions are a lot more sophisticated and powerful than shell metacharacters.
Star(*)
The star matches anything: a single character, multiple characters, or no characters at all. Say we have the following files in a folder:
a ab b c fb file1 file2 file3 file4
and we want the "ls" command to list file1 to file4. Using the star we can write:
[amittal@hills w1]$ ls file*
file1 file2 file3 file4
The above lists all the files whose names begin with "file", followed by anything. "ls *" gives us all the files (except hidden files, whose names begin with a dot).
Example
$ ls
a ab b c fb file1 file2 file3 file4
$ ls *
a ab b c fb file1 file2 file3 file4
Create a folder with the following files:
1.txt 2.txt a1 a2 b
We can create the files using the "touch" command.
$ ls *.txt
1.txt 2.txt
$ ls a*
a1 a2
$ ls a*1
a1
The "*" is not limited to listing files. It works with the "echo" command also.
[amittal@hills w1]$ echo *
a ab b c fb file1 file2 file3 file4
The shell substitutes "*" with the list of matching file names and then executes the command. The command can be anything. Ex:
"file1.txt"
Contents of file 1 .
"file2.txt"
Contents of file 2 .
$ cat *
Contents of file 1 .
Contents of file 2 .
In the above the star is substituted and the shell creates the command:
cat file1.txt file2.txt
The above command prints the contents of the 2 files.
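The expansion described above can be tried safely in a scratch directory. The following is a minimal sketch, assuming a POSIX shell with "mktemp" available; the variable names (demo, star_all, star_file, star_b) are illustrative only, and the C locale is set just to make the order of the names predictable.

```shell
# Sketch: let the shell expand "*" and capture what each pattern becomes.
export LC_ALL=C                      # assumption: C locale for a predictable order
demo=$(mktemp -d) || exit 1          # throwaway directory, removed at the end
cd "$demo" || exit 1
touch a ab b c fb file1 file2 file3 file4

star_all=$(echo *)                   # every name in the directory
star_file=$(echo file*)              # names that begin with "file"
star_b=$(echo *b)                    # names that end in "b"

echo "*     -> $star_all"
echo "file* -> $star_file"
echo "*b    -> $star_b"

cd "$OLDPWD" && rm -rf "$demo"       # clean up the scratch directory
```

Because the shell does the substitution, "echo" never sees the star; it only receives the resulting list of names.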
Question Mark
The question mark represents exactly one character.
[amittal@hills w1]$ ls
a ab b c fb file1 file2 file3 file4
[amittal@hills w1]$ ls ?
a b c
[amittal@hills w1]$ ls ??
ab fb
The above command lists all the files whose names are exactly 2 characters long.
$ ls
1.txt 2.txt a1 a2 b file1.txt file2.txt
$ ls ?*?
1.txt 2.txt a1 a2 file1.txt file2.txt
In the above the file "b" is not printed because the 2 question marks mean that the file name must be at least 2 characters long.
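A small sketch of the same idea, assuming a POSIX shell and a scratch directory created with "mktemp"; the variable names are made up for the illustration.

```shell
# Sketch: "?" stands for exactly one character, "?*?" for two or more.
export LC_ALL=C                      # assumption: C locale for a predictable order
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1
touch a ab b c fb file1

one_char=$(echo ?)                   # names of length 1
two_chars=$(echo ??)                 # names of length 2
two_or_more=$(echo ?*?)              # names of length 2 or more

echo "?   -> $one_char"
echo "??  -> $two_chars"
echo "?*? -> $two_or_more"

cd "$OLDPWD" && rm -rf "$demo"
```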
Square Brackets
Square brackets match any one character from a set or a range. Ex:
[amittal@hills w1]$ ls
a ab b c fb file1 file2 file3 file4 notes
[amittal@hills w1]$ ls file[1-4]
file1 file2 file3 file4
This lists all the files whose names start with "file" and end in a single digit from 1 to 4. Instead of a range we can also list the characters of the set. Ex:
[amittal@hills w1]$ ls file[13]
file1 file3
The "[13]" matches either the character "1" or the character "3". The characters are written next to each other without a separator. A comma inside the brackets, as in "[1,3]", is just one more character in the set, so it would also match a file named "file,".
Square brackets with a "!" represent a Not condition. Let's say we want to list the files beginning with "file" but do not want "file3" listed.
[amittal@hills w1]$ ls
a ab b c fb file1 file2 file3 file4 notes temp1
[amittal@hills w1]$ ls file[!3]
file1 file2 file4
Example:
$ ls
1.txt 2.txt a1 a2 b file1.txt file2.txt
$ ls [1-2].txt
1.txt 2.txt
$ ls [1-5].txt
1.txt 2.txt
There are no files named "3.txt", "4.txt" or "5.txt", so only "1.txt" and "2.txt" are printed.
$ ls [a-z].txt
ls: cannot access '[a-z].txt': No such file or directory
No file matches here, so by default the shell passes the pattern to "ls" unchanged and "ls" reports an error.
$ touch a.txt
$ ls [a-z].txt
a.txt
$ ls
1.txt 2.txt a.txt a1 a2 b file1.txt file2.txt
$ ls [a-z]*.txt
a.txt file1.txt file2.txt
We can have multiple ranges in one set.
$ ls
1.txt 2.txt a a.txt a1 a2 b F3.txt file1.txt file2.txt
$ ls [a-z1-9]*.txt
1.txt 2.txt a.txt file1.txt file2.txt
$ ls [a-zA-Z]*.txt
a.txt F3.txt file1.txt file2.txt
A set can also list individual characters.
$ ls [1a]*.txt
1.txt a.txt
$ ls a[123]
a1 a2
$ touch a
$ ls
1.txt 2.txt a a.txt a1 a2 b F3.txt file1.txt file2.txt
$ ls a[123]*
a1 a2
The above means a name that starts with "a", followed by one of the digits 1, 2 or 3, followed by anything.
Curly Brackets
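Curly brackets represent an or condition: the shell generates one word for each comma-separated alternative. Unlike inside square brackets, the comma here is a real separator. Ex:
[amittal@hills w1]$ ls
1.txt 2.txt a a.txt a1 a2 b F3.txt file1.txt file2.txt
$ ls {?,??}
a a1 a2 b
The above lists the files whose names are either 1 or 2 characters long.
$ ls {?,?*}
1.txt 2.txt a a a.txt a1 a2 b b F3.txt file1.txt file2.txt
Here "a" and "b" are listed twice. The braces expand to the two patterns "?" and "?*" before any file names are matched, and the one-character names match both patterns.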
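The square bracket and curly bracket behaviour can be compared side by side. This is only a sketch: it assumes bash (brace expansion is not available in every "sh"), a scratch directory from "mktemp", and illustrative variable names. The extra file named "file," is an assumption added to show how a comma inside square brackets is treated.

```shell
# Sketch (bash assumed): bracket sets match file names, braces generate words.
export LC_ALL=C                      # assumption: C locale for a predictable order
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1
touch file1 file2 file3 file4 'file,' a a1 b

range=$(echo file[1-4])              # one digit from 1 to 4
set_plain=$(echo file[13])           # the character 1 or 3
set_comma=$(echo file[1,3])          # 1, 3 or a literal comma
negated=$(echo file[!3])             # any one character except 3
braces=$(echo {?,??})                # expands to "? ??" first, then each is matched
literal=$(echo {x,y})                # braces do not need matching files at all

echo "file[1-4] -> $range"
echo "file[13]  -> $set_plain"
echo "file[1,3] -> $set_comma"
echo "file[!3]  -> $negated"
echo "{?,??}    -> $braces"
echo "{x,y}     -> $literal"

cd "$OLDPWD" && rm -rf "$demo"
```

Note that "echo" prints the names in the order the patterns produced them, while "ls" sorts its arguments before listing them.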
Backslash
The backslash is used to quote special characters so that the shell treats them literally. Let's say we want to create a file named "*".
[amittal@hills temp1]$ touch \*
[amittal@hills temp1]$ ls
*
[amittal@hills temp1]$ rm \*
[amittal@hills temp1]$ ls
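The difference between a quoted and an unquoted star can be sketched as follows, assuming a POSIX shell and a scratch directory from "mktemp". The variable names are illustrative; the single-quoted form is shown only as another common way to quote.

```shell
# Sketch: a backslash (or single quotes) stops the shell from expanding "*".
export LC_ALL=C                      # assumption: C locale for a predictable order
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1
touch file1 file2
touch \*                             # creates a file literally named "*"

escaped=$(echo \*)                   # the star itself, no expansion
quoted=$(echo '*')                   # same effect with single quotes
expanded=$(echo *)                   # unquoted: every file name

rm \*                                # removes only the file named "*"
remaining=$(echo *)

echo "escaped:   $escaped"
echo "quoted:    $quoted"
echo "expanded:  $expanded"
echo "remaining: $remaining"

cd "$OLDPWD" && rm -rf "$demo"
```

Typing "rm *" instead of "rm \*" in the last step would have removed every file in the directory, which is why the quoting matters.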
Regular Expressions
Recall that grep searches each line of its input for a word or a pattern and prints the line if it finds a match.
Star(*)
The star is used after a character (or after a bracketed set or a group) and means 0 or more occurrences of it.
[amittal@hills w1]$ echo "The fox ate an orange" | grep "fo*x"
The fox ate an orange
[amittal@hills w1]$ echo "The foox ate an orange" | grep "fo*x"
The foox ate an orange
The pattern we are searching for is "fo*x": an "f", followed by any number of "o"s (including zero), followed by an "x". The pattern is quoted so that the shell does not try to expand the "*" as a file name pattern before "grep" sees it.
[amittal@hills w1]$ echo "The fx ate an orange" | grep "fo*x"
The fx ate an orange
The following do not produce a match because the pattern needs an "f" at the beginning, an "x" at the end and nothing but "o"s in between.
$ echo "The fix ate an orange" | grep "fo*x"
$ echo "The fonx ate an orange" | grep "fo*x"
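To try a pattern against several inputs at once, a small helper can filter a list of words. This is a sketch, not part of grep itself: the function name "bre" and the variable "star_hits" are hypothetical, and "paste" is used only to join the matching lines onto one line.

```shell
# bre PATTERN WORD...: print the words that match PATTERN (basic regex), space-separated.
bre() {
    pattern=$1
    shift
    printf '%s\n' "$@" | grep -e "$pattern" | paste -s -d ' ' -
}

# "fo*x": f, then zero or more o, then x.
star_hits=$(bre 'fo*x' fx fox foox fix fonx)
echo "fo*x matches: $star_hits"
```

The single quotes around the pattern keep the shell from treating the star as a file name wildcard.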
Dot(.)
The "." matches any single character.
[amittal@hills w1]$ echo "The fx ate an orange" | grep "f.x"
[amittal@hills w1]$ echo "The fox ate an orange" | grep "f.x"
The fox ate an orange
The first command prints nothing because there is no character between the "f" and the "x".
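A short sketch of the dot, feeding grep one word per line; the variable names "dot_hits" and "dot_count" are illustrative.

```shell
# "f.x": f, exactly one character of any kind, then x.
dot_hits=$(printf '%s\n' fx fox f-x foox | grep 'f.x' | paste -s -d ' ' -)
dot_count=$(printf '%s\n' fx fox f-x foox | grep -c 'f.x')   # -c counts matching lines

echo "f.x matches: $dot_hits ($dot_count lines)"
```

"fx" fails because nothing sits between the two letters, and "foox" fails because two characters do.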
Square Brackets re
The square brackets specify a set of characters, any one of which can match.
$ echo "cat" | grep "c[ab]t"
cat
$ echo "cbt" | grep "c[ab]t"
cbt
$ echo "cct" | grep "c[ab]t"
The characters in the set are written without separators. A comma inside the brackets, as in "[a,b]", is an ordinary member of the set and would also match "c,t".
We can use the range syntax also.
$ echo "ct1" | grep "ct[0-9]"
ct1
The above means any single digit from 0 to 9.
$ echo "ct1" | grep "ct[0-46-9]"
ct1
$ echo "ct5" | grep "ct[0-46-9]"
No match in the above because "5" is in neither range.
$ echo "c13t" | grep "c[0-46-9]t"
No match because "c13t" has 2 digits between "c" and "t" and "c[0-46-9]t" allows only 1.
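The following sketch filters a few words through bracket expressions, including one with a comma to show that the comma is matched literally. The helper name "bre" and the variable names are hypothetical; the C locale is assumed so that ranges mean plain ASCII ranges.

```shell
export LC_ALL=C                      # assumption: C locale so ranges are plain ASCII

# bre PATTERN WORD...: print the words that match PATTERN (basic regex), space-separated.
bre() {
    pattern=$1
    shift
    printf '%s\n' "$@" | grep -e "$pattern" | paste -s -d ' ' -
}

set_plain=$(bre 'c[ab]t' cat cbt cct c,t)       # a or b between c and t
set_comma=$(bre 'c[a,b]t' cat cbt cct c,t)      # a, b or a literal comma
two_ranges=$(bre 'ct[0-46-9]' ct1 ct5 ct9)      # any digit except 5

echo "c[ab]t     -> $set_plain"
echo "c[a,b]t    -> $set_comma"
echo "ct[0-46-9] -> $two_ranges"
```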
+ re
We can use the "+" sign to indicate 1 or more occurrences of the preceding character.
$ echo "c13t" | grep "1+"
The above did not produce any match. Why not? The "+" is special only in extended regular expressions, and by default "grep" uses basic regular expressions. We need to tell "grep" that the pattern is an extended regular expression, using either the "-E" option or "egrep".
$ echo "c13t" | grep -E "1+"
c13t
$ echo "c13t" | egrep "1+"
c13t
$ echo "ca+t" | grep "c[ab]+t"
ca+t
Without "-E" the "+" is a normal character and "grep" finds the literal "a+".
$ echo "ca+t" | grep -E "c[ab]+t"
We do not find a match because the "+" is now interpreted as a repetition and not as a regular character. If we do want to match a literal "+" in an extended regular expression then we need to "escape" it.
$ echo "ca+t" | grep -E "c[ab]\+t"
ca+t
$ echo "cababat" | grep -E "c[ab]+t"
cababat
We can have more than 1 occurrence of either "a" or "b" as the above shows.
$ echo "ct" | grep -E "c[ab]+t"
We need at least 1 occurrence for a match.
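The contrast between the basic and the extended reading of "+" can be sketched with two small helpers. The function names "bre" and "ere" and the variable names are hypothetical; the only difference between the helpers is the "-E" option.

```shell
# bre / ere PATTERN WORD...: print the words that match, space-separated.
bre() { pattern=$1; shift; printf '%s\n' "$@" | grep -e "$pattern" | paste -s -d ' ' -; }
ere() { pattern=$1; shift; printf '%s\n' "$@" | grep -E -e "$pattern" | paste -s -d ' ' -; }

basic_plus=$(bre '1+' c13t c1+3t)                  # "+" is a literal character
extended_plus=$(ere '1+' c13t c1+3t)               # "+" means one or more
one_or_more=$(ere 'c[ab]+t' ct cat cababat ca+t)   # at least one a or b
literal_plus=$(ere 'c[ab]\+t' cat ca+t)            # escaped "+" is literal again

echo "grep    '1+'       -> $basic_plus"
echo "grep -E '1+'       -> $extended_plus"
echo "grep -E 'c[ab]+t'  -> $one_or_more"
echo "grep -E 'c[ab]\+t' -> $literal_plus"
```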
Question Mark re
The question mark applies to the character before it and means zero or 1 occurrence of that character.
[amittal@hills w1]$ echo "aab" | egrep "aab?"
aab
[amittal@hills w1]$ echo "aabb" | egrep "aab?"
aabb
The line "aabb" is printed because it contains "aab"; grep only needs the pattern to occur somewhere in the line.
[amittal@hills w1]$ echo "aa" | egrep "aab?"
aa
[amittal@hills w1]$ echo "ac" | egrep "aab?"
$ echo "ct" | grep -E "c[ab]?t"
ct
$ echo "caat" | grep -E "c[ab]?t"
$ echo "cabt" | grep -E "c[ab]?t"
$ echo "cabt" | grep -E "c[ab]?[ab]?t"
cabt
"caat" and "cabt" have 2 characters between "c" and "t", so the pattern needs 2 optional sets to match them.
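A sketch of the optional character, using a hypothetical helper "ere" and illustrative variable names. The "-o" option, supported by GNU and BSD grep, prints only the part of the line that matched, which shows why "aabb" counts as a match.

```shell
# ere PATTERN WORD...: print the words that match PATTERN (extended regex), space-separated.
ere() { pattern=$1; shift; printf '%s\n' "$@" | grep -E -e "$pattern" | paste -s -d ' ' -; }

optional_b=$(ere 'aab?' aab aabb aa ac)                    # b may be present or absent
one_optional=$(ere 'c[ab]?t' ct cat caat cabt)             # at most one a or b
two_optional=$(ere 'c[ab]?[ab]?t' ct cat caat cabt)        # at most two
matched_part=$(printf 'aabb\n' | grep -E -o 'aab?')        # only the matched text

echo "aab?          -> $optional_b"
echo "c[ab]?t       -> $one_optional"
echo "c[ab]?[ab]?t  -> $two_optional"
echo "in aabb, aab? matched: $matched_part"
```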
Curly Bracket re
Curly brackets repeat the preceding item a given number of times.
The form "{m,n}" means at least m and at most n times.
Check whether 2 lower case characters appear in a row.
$ echo "aaa" | grep -E "[a-z]{2}"
aaa
$ echo "BaBc" | grep -E "[a-z]{2}"
No match found in the above because no 2 lower
case characters are next to each other.
$ echo "BaBc" | grep -E "a{1}"
BaBc
Match found.
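The repetition counts can be sketched as follows. The helper "ere" and the variable names are hypothetical, and the C locale is assumed so that "[a-z]" means the ASCII lower case letters. Note that "[a-z]{2}" accepts two different letters, not only the same letter twice.

```shell
export LC_ALL=C                      # assumption: C locale so [a-z] is ASCII lower case

# ere PATTERN WORD...: print the words that match PATTERN (extended regex), space-separated.
ere() { pattern=$1; shift; printf '%s\n' "$@" | grep -E -e "$pattern" | paste -s -d ' ' -; }

two_lower=$(ere '[a-z]{2}' aaa ab BaBc a1b)        # two lower case letters in a row
exactly_one=$(ere 'a{1}' BaBc)                     # a single "a" is enough
two_to_three=$(ere '^a{2,3}$' a aa aaa aaaa)       # whole line is 2 or 3 "a"s

echo "[a-z]{2}  -> $two_lower"
echo "a{1}      -> $exactly_one"
echo "^a{2,3}\$ -> $two_to_three"
```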
Round Bracket re
The round brackets can contain strings separated by a "|", and any one of those strings can produce a match.
$ echo "Open AI is the new thing." | grep -E "(the)"
Open AI is the new thing.
$ echo "Open AI is the new thing." | grep -E "(new|the)"
Open AI is the new thing.
$ echo "Open AI is the new thing." | grep -E "(food)"
No match
We can place square brackets inside the round brackets.
$ echo "Open AI is the new thing." | grep -E "([e]+|less)"
Open AI is the new thing.
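To see which alternative actually matched, the "-o" option of GNU and BSD grep prints each matched piece on its own line. A sketch with illustrative variable names:

```shell
sentence='Open AI is the new thing.'

# Each alternative that occurs in the sentence, in order of appearance.
alt_hits=$(printf '%s\n' "$sentence" | grep -E -o '(new|the)' | paste -s -d ' ' -)

# grep -c counts matching lines: 1 when some alternative occurs, 0 when none does.
food_count=$(printf '%s\n' "$sentence" | grep -E -c '(food)')
mixed_count=$(printf '%s\n' "$sentence" | grep -E -c '([e]+|less)')

echo "(new|the) matched: $alt_hits"
echo "(food) lines: $food_count"
echo "([e]+|less) lines: $mixed_count"
```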
Caret and Dollar re
The caret symbol "^" means start of the line and the dollar symbol "$" means end of the line.
$ echo "Open AI" | grep -E "^Open"
Open AI
$ echo "aOpen AI" | grep -E "^Open"
$ echo "Open AI" | grep -E "^Open$"
No match in the above line because "^Open$" requires the whole line to be exactly "Open".
$ echo "Open" | grep -E "^Open$"
Open
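The anchors work line by line, which the following sketch tries to show by feeding grep several lines and counting the ones that match. The variable names are illustrative.

```shell
# Three input lines: "Open AI", "aOpen AI" and "Open".
lines=$(printf '%s\n' 'Open AI' 'aOpen AI' 'Open')

starts_with=$(printf '%s\n' "$lines" | grep -E -c '^Open')    # lines beginning with Open
ends_with=$(printf '%s\n' "$lines" | grep -E -c 'AI$')        # lines ending in AI
whole_line=$(printf '%s\n' "$lines" | grep -E '^Open$')       # lines that are exactly Open

echo "^Open  : $starts_with lines"
echo "AI\$    : $ends_with lines"
echo "^Open\$ : $whole_line"
```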
Regular Expression Cheat Sheet
Character Classes re
In addition to listing characters inside square brackets we can use
character classes to represent certain sets. A class name such as [:alpha:]
is itself written inside square brackets, which gives [[:alpha:]].
[[:alnum:]]
Any of [:digit:] or [:alpha:].
[[:alpha:]]
Any letter: a b c d e f g h i j k l m n o p q r s t u v w x y z,
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z.
[[:blank:]]
Space or tab.
[[:digit:]]
Any one of 0 1 2 3 4 5 6 7 8 9.
[[:lower:]]
Any one of a b c d e f g h i j k l m n o p q r s t u v w x y z.
[[:punct:]]
Any one of ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
[[:space:]]
Any one of CR FF HT NL VT SPACE.
[[:upper:]]
Any one of A B C D E F G H I J K L M N O P Q R S T U V W X Y Z.
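The patterns are quoted so that the shell does not expand them as file name patterns before "grep" sees them.
$ echo "a134" | grep "[[:alpha:]]"
a134
$ echo "134" | grep "[[:alpha:]]"
$ echo "134" | grep "[[:digit:]]"
134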
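A few more class combinations, as a sketch. The helpers "bre" and "ere" and the variable names are hypothetical, and the C locale is assumed so that the classes cover exactly the ASCII characters listed above.

```shell
export LC_ALL=C                      # assumption: C locale, classes cover ASCII only

bre() { pattern=$1; shift; printf '%s\n' "$@" | grep -e "$pattern" | paste -s -d ' ' -; }
ere() { pattern=$1; shift; printf '%s\n' "$@" | grep -E -e "$pattern" | paste -s -d ' ' -; }

has_alpha=$(bre '[[:alpha:]]' a134 134)                  # contains a letter
has_digit=$(bre '[[:digit:]]' a134 134 abc)              # contains a digit
only_upper=$(ere '^[[:upper:]]+$' ABC AbC 123)           # upper case letters only
only_alnum=$(ere '^[[:alnum:]]+$' abc123 a-b)            # letters and digits only
upper_or_digit=$(bre '[[:upper:][:digit:]]' abc aBc ab1) # two classes in one set

echo "[[:alpha:]]            -> $has_alpha"
echo "[[:digit:]]            -> $has_digit"
echo "^[[:upper:]]+\$         -> $only_upper"
echo "^[[:alnum:]]+\$         -> $only_alnum"
echo "[[:upper:][:digit:]]   -> $upper_or_digit"
```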
Basic and Extended Regular Expressions
We explain a little bit more about extended
regular expressions.
In basic regular expressions the meta-characters
‘?’, ‘+’, ‘{’, ‘|’, ‘(’, and ‘)’ lose their
special meaning and match themselves.
If we want the characters to take on their special meaning
we need to escape
them: ‘\?’, ‘\+’, ‘\{’, ‘\|’, ‘\(’, and ‘\)’.
GNU grep accepts all six escaped forms; other implementations
may support only some of them.
In extended regular expressions it's the opposite.
We do not have to
escape the meta-characters, and in order to
take off the special meaning we need to escape them.
For the "grep" utility we can use the "egrep" or "grep -E" command for
the extended usage.
For "sed" we can use the "-r" option, or "-E" if we are on the Mac.
$ echo "letter" | grep "et+e"
$ echo "let+er" | grep "et+e"
let+er
In the first example, even though "t" is repeated 2 times, "grep"
looks for the literal "+" and finds nothing. The second search
illustrates this: grep finds the word because the text contains the
"+" character.
$ echo "letter" | grep "et\+e"
letter
To tell grep to consider the "+" character as special we escape
it and this time it finds the match.
Now let's repeat the scenario with the extended grep.
$ echo "letter" | egrep "et+e"
letter
$ echo "let+er" | egrep "et+e"
$ echo "let+er" | egrep "et\+e"
let+er
In the first line:
echo "letter" | egrep "et+e"
the "+" is recognized as a special
character and it finds the word.
In the second line:
echo "let+er" | egrep "et+e"
we are looking for the literal "+", but since the
"+" is a meta character we are not going to find it. In order to take
off the special meaning we need to escape the "+", and this is what happens
on the third line:
echo "let+er" | egrep "et\+e"
The characters ‘?’, ‘+’, ‘{’, ‘|’, ‘(’, and ‘)’ were given their special meaning after
basic regular expressions had already been defined. This created a problem:
if a utility such as "grep" started treating these characters as special by default,
old scripts would break. Instead, the utility introduced "egrep" and the "-E"
option to deal with these new characters.
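The two dialects can be compared on the same pair of words. This is a sketch with hypothetical helpers "bre" and "ere"; the escaped "\+" in the basic dialect is assumed to behave as in GNU grep, while the "\{1,\}" interval is the form defined for basic regular expressions in general.

```shell
bre() { pattern=$1; shift; printf '%s\n' "$@" | grep -e "$pattern" | paste -s -d ' ' -; }
ere() { pattern=$1; shift; printf '%s\n' "$@" | grep -E -e "$pattern" | paste -s -d ' ' -; }

basic_plain=$(bre 'et+e' letter let+er)            # "+" is literal
basic_escaped=$(bre 'et\+e' letter let+er)         # "\+" repeats (GNU grep assumed)
basic_interval=$(bre 'et\{1,\}e' letter let+er)    # "one or more" written as an interval
extended_plain=$(ere 'et+e' letter let+er)         # "+" repeats
extended_escaped=$(ere 'et\+e' letter let+er)      # "\+" is literal

echo "grep    et+e       -> $basic_plain"
echo "grep    et\\+e      -> $basic_escaped"
echo "grep    et\\{1,\\}e  -> $basic_interval"
echo "grep -E et+e       -> $extended_plain"
echo "grep -E et\\+e      -> $extended_escaped"
```

Each pattern selects exactly one of the two words, and adding or removing the backslash flips which one.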
Exercises
Ex1:
Create some files in a folder. You can
use the "touch" command if you like.
file1.txt file2.txt file3 file4
List files "file1.txt" and "file2.txt" without
using the word "txt" in your
search command. Do not use the file names explicitly.
Ex2:
Create some files in a folder.
You can use the "touch" command if you like.
a ab b c fb file1 file2 file3 file4
List the files "fb" and "file1" only.
Do not use the file names explicitly.
Ex3:
Write a grep that will match a line if
it contains a CA license plate with
the following format :
Digit UpperCase UpperCase UpperCase Digit Digit Digit
The line should contain only the license number and nothing else.
Ex4:
Write a grep that matches the following name.
It has 3 words with the last
word being "Blvd" Ex:
Amador Valley Blvd
All the words start with an uppercase and there are exactly 2 spaces in the
phrase.
Ex5:
Suppose a folder has 10 files named
"file1", "file2" ... "file10" .
Complete the following command:
ls "expression"
to list the files 5 through 10 but
you cannot use the numbers 5-10 .
Ex6:
Create 6 files with the following names:
file1, file2, file3 ... file6
Use the range square brackets [1-2] , [4-6]
together with the curly brackets
to output the files:
1 to 2 and 4 to 6 .
Ex7:
What kind of line matches the following expression:
egrep "^T{2}.*t$"
Use "echo" to test your answer out.
Ex8:
What kind of string matches the following expression:
egrep "[ac]{1,3}"
If a string matches the above expression, will it also match the
expression below?
egrep "[ac]{1,2}"
Ex9:
Sort the attached file "num.txt" so that the single digits are sorted
first and then the double digits. The sorted file should look like "data.txt" .
Solutions
Solution 1
ls *.*
Solution 2
ls f*[b1]
Solution 3
[amittal@hills ~]$ echo "3BPZ780" | egrep "[0-9][A-Z]{3}[0-9]{3}"
3BPZ780
This will find the license number. Now we need to make sure there are no other
words on that line.
echo "3BPZ780" | egrep "^[0-9][A-Z]{3}[0-9]{3}$"
Using the caret and the dollar sign we specify that the pattern must start
at the beginning of the line and finish at its end.
Solution 4
echo "Amador Valley Blvd" | egrep "^[A-Z][a-zA-Z]* [A-Z][a-zA-Z]* Blvd$"
Solution 5
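$ ls file{[!1-4],??}
The pattern "file[!1-4]" matches "file5" to "file9" (a single character that is not 1 to 4), and "file??" matches "file10", the only name with 2 characters after "file".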
Solution 6
$ ls file{[1-2],[4-6]}
file1 file2 file4 file5 file6
Solution 7
There should be 2 occurrences of capital "T" at the beginning and a small
"t" at the end, with any number of characters in the middle.
$ echo "TThis is a test" | egrep "^T{2}.*t$"
TThis is a test
Solution 9
File: "mysort.sh"
cat num.txt | egrep "^[0-9]$" | sort > data1.txt
cat num.txt | egrep -v "^[0-9]$" | sort > data2.txt
cat data1.txt data2.txt > data.txt