Basic regular expression

This post is largely copied from Chapter 4, ‘Texting and Driving’, of Sarath’s book ‘Linux Shell Scripting Cookbook’. I had meant to write up my everyday regular expression recipes myself, but Sarath has already done a good job of it in his book. I do not suffer from NIH (Not Invented Here) syndrome, so I simply copied it from Sarath’s book for my own reference.

Regular expressions are at the heart of pattern-matching text-processing techniques. To write text-processing tools fluently, one must have a basic understanding of regular expressions. A regular expression is a tiny, highly specialized programming language used to match text.

Getting ready

Regular expressions are the language used in most text-processing utilities, so you will use the techniques learnt in this recipe in many other recipes. For example, [a-z0-9_]+@[a-z0-9]+\.[a-z]+ is a regular expression that matches a simple e-mail address.
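To see what the sample pattern accepts and rejects, here is a minimal sketch in Python, assuming its standard re module. The helper name looks_like_email and the sample addresses are made up for illustration. Note that the pattern only covers lowercase letters, digits and underscores, so it is far from a complete e-mail validator.

```python
import re

# Pattern copied from the text; it only targets simple lowercase addresses.
EMAIL = re.compile(r"[a-z0-9_]+@[a-z0-9]+\.[a-z]+")

def looks_like_email(text):
    """Return True if the whole string matches the sample pattern."""
    return EMAIL.fullmatch(text) is not None

# Hypothetical sample strings, chosen only to exercise the pattern.
for sample in ["tux@example.com", "tux_99@example.org",
               "no-at-sign.com", "Tux@Example.com"]:
    print(sample, looks_like_email(sample))
```

The last sample is rejected because the character classes list only lowercase letters; adding A-Z to each class would be one way to accept it.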

How to do it…

In this section, we will go through the basic regex components, the POSIX character classes, and meta characters.

Let’s first go through the basic components of regular expressions (regex).

^ The start of the line marker. ^tux matches a string that starts the line with tux.
$ The end of the line marker. tux$ matches strings of a line that ends with tux.
. Matches any one character. Hack. matches Hack1 and Hacki but not Hack12 or Hackil; only one additional character matches.
[] Matches any one of the characters enclosed in [chars]. coo[kl] matches cook or cool.
[^] Matches any one of the characters EXCEPT those that are enclosed in [^chars]. 9[^01] matches 92, 93 but not 91 or 90.
[-] Matches any character within the range specified in []. [1-5] matches any digit from 1 to 5.
? The preceding item must match one or zero times. colou?r matches color or colour but not colouur.
+ The preceding item must match one or more times. Rollno-9+ matches Rollno-99, Rollno-9 but not Rollno-.
* The preceding item must match zero or more times. co*l matches cl, col, coool.
() Treats the enclosed items as a single group and captures the substring they match. ma(tri)?x matches max or matrix.
{n} The preceding item must match n times. [0-9]{3} matches any three-digit number. [0-9]{3} can be expanded as: [0-9][0-9][0-9].
{n,} Minimum number of times that the preceding item should match. [0-9]{2,} matches any number that has two or more digits.
{n,m} Specifies the minimum and maximum number of times the preceding item should match. [0-9]{2,5} matches any number that has two to five digits.
| Alternation: one of the items on either side of | should match. Oct (1st|2nd) matches Oct 1st or Oct 2nd.
\ The escape character for escaping any of the special characters mentioned above. a\.b matches a.b but not ajb. Prefixing . with \ removes its special meaning.
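The components listed above can be checked with a short script. The following is a minimal sketch in Python, assuming its standard re module, which accepts all of these notations. The helper names found and whole, and sample strings such as "tux rocks", are made up for illustration.

```python
import re

def found(pattern, text):
    """True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

def whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# Anchors are checked with a search: (pattern, text, expected result).
ANCHOR_CASES = [
    (r"^tux", "tux rocks", True), (r"^tux", "my tux", False),
    (r"tux$", "my tux", True),    (r"tux$", "tux rocks", False),
]

# Everything else is checked against the whole string.
WHOLE_CASES = [
    (r"Hack.", "Hack1", True),         (r"Hack.", "Hack12", False),
    (r"coo[kl]", "cool", True),        (r"coo[kl]", "coop", False),
    (r"9[^01]", "92", True),           (r"9[^01]", "91", False),
    (r"[1-5]", "3", True),             (r"[1-5]", "7", False),
    (r"colou?r", "colour", True),      (r"colou?r", "colouur", False),
    (r"Rollno-9+", "Rollno-99", True), (r"Rollno-9+", "Rollno-", False),
    (r"co*l", "cl", True),             (r"co*l", "coool", True),
    (r"ma(tri)?x", "max", True),       (r"ma(tri)?x", "matrix", True),
    (r"[0-9]{3}", "123", True),        (r"[0-9]{3}", "12", False),
    (r"[0-9]{2,}", "12345", True),     (r"[0-9]{2,}", "1", False),
    (r"[0-9]{2,5}", "1234", True),     (r"[0-9]{2,5}", "123456", False),
    # No spaces around |, because a space inside the group is a literal.
    (r"Oct (1st|2nd)", "Oct 2nd", True), (r"Oct (1st|2nd)", "Oct 3rd", False),
    (r"a\.b", "a.b", True),            (r"a\.b", "ajb", False),
]

for pattern, text, expected in ANCHOR_CASES:
    print(f"{pattern!r:>18}  {text!r:<12} -> {found(pattern, text)}")
for pattern, text, expected in WHOLE_CASES:
    print(f"{pattern!r:>18}  {text!r:<12} -> {whole(pattern, text)}")
```

Whole-string matching is used for most rows so that a case like Hack12 is really rejected; with a plain search, Hack. would still be found inside it.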

A POSIX character class is a special meta sequence of the form [:…:] that can be used to match a range of specified characters. The POSIX classes are as follows:

[:alnum:] Alphanumeric character [[:alnum:]]+
[:alpha:] Alphabet character (lowercase and uppercase) [[:alpha:]]{4}
[:blank:] Space and tab [[:blank:]]*
[:digit:] Digit [[:digit:]]?
[:lower:] Lowercase alphabet [[:lower:]]{5,}
[:upper:] Uppercase alphabet ([[:upper:]]+)?
[:punct:] Punctuation [[:punct:]]
[:space:] All whitespace characters, including newline, carriage return, and so on [[:space:]]+
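Python's re module does not understand the [:name:] notation, so the classes above cannot be tried there directly. As a rough illustration of what each class stands for, the sketch below rewrites them into explicit ASCII sets. The ASCII-only mapping is an assumption (real POSIX classes depend on the locale), and POSIX_CLASSES and expand_posix are hypothetical names, not part of any library.

```python
import re
import string

# Assumed ASCII equivalents of the POSIX classes (roughly the C locale).
POSIX_CLASSES = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": r" \t",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "punct": re.escape(string.punctuation),
    "space": r" \t\n\r\f\v",
}

def expand_posix(pattern):
    """Replace each [:name:] with its assumed ASCII set.

    The outer brackets of the bracket expression are left in place,
    so [[:alnum:]]+ becomes [a-zA-Z0-9]+.
    """
    return re.sub(r"\[:([a-z]+):\]",
                  lambda m: POSIX_CLASSES[m.group(1)],
                  pattern)

for posix in ["[[:alnum:]]+", "[[:alpha:]]{4}", "([[:upper:]]+)?"]:
    print(posix, "->", expand_posix(posix))
```

Tools that support POSIX classes natively do this lookup themselves; the translation is only needed where the notation is missing.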

Meta characters are Perl-style notations that only a subset of text-processing utilities support, so not every utility will accept the following. The character classes and regular expression components above are much more widely accepted, although some tools need an extended mode for part of them (grep, for example, needs -E for +, ?, |, () and {}).

\b Word boundary \bcool\b matches only cool not coolant.
\B Non-word boundary cool\B matches coolant and not cool.
\d Single digit character b\db matches b2b not bcb.
\D Single non-digit b\Db matches bcb not b2b.
\w Single word character (alnum and _) \w matches 1 or a not &.
\W Single non-word character \W matches & not 1 or a.
\n Newline \n matches a newline.
\s Single whitespace x\sx matches x x not xx.
\S Single non-space x\Sx matches xkx not x x.
\r Carriage return \r matches carriage return.
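Python's re module is one of the tools that accepts these Perl-style notations, so the rows above can be checked with a minimal sketch like the following. The helper name found and the layout of the case list are made up for illustration.

```python
import re

def found(pattern, text):
    """True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# (pattern, text, expected result), following the list above.
CASES = [
    (r"\bcool\b", "cool", True),   (r"\bcool\b", "coolant", False),
    (r"cool\B", "coolant", True),  (r"cool\B", "cool", False),
    (r"b\db", "b2b", True),        (r"b\db", "bcb", False),
    (r"b\Db", "bcb", True),        (r"b\Db", "b2b", False),
    (r"\w", "1", True),            (r"\w", "&", False),
    (r"\W", "&", True),            (r"\W", "a", False),
    (r"x\sx", "x x", True),        (r"x\sx", "xx", False),
    (r"x\Sx", "xkx", True),        (r"x\Sx", "x x", False),
    (r"\n", "line1\nline2", True), (r"\r", "line1\r\n", True),
]

for pattern, text, expected in CASES:
    print(f"{pattern!r:>12}  {text!r:<16} -> {found(pattern, text)}")
```

Raw strings (the r prefix) keep Python from interpreting the backslashes before the regex engine sees them.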