Regular Expressions

Quantifiers

A quantifier affects the item immediately before it in the regular expression (a single character or a character class such as [AG]) and determines how many times this item must or may occur.

If you want the quantifier to affect a sequence of characters, enclose those characters in parentheses.
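To see the difference that parentheses make, here is a minimal sketch. The helper name matches() is made up for this illustration and is not part of Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $string matches $pattern, 0 otherwise.
sub matches {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? 1 : 0;
}

# Without parentheses, {2} applies only to the "T" before it.
print matches("GTT",  qr/GT{2}/),   "\n";   # 1: "G" followed by "TT"
print matches("GTGT", qr/GT{2}/),   "\n";   # 0: no "GTT" in this string

# With parentheses, {2} applies to the whole group "GT".
print matches("GTGT", qr/(GT){2}/), "\n";   # 1: "GT" twice in a row
print matches("GTT",  qr/(GT){2}/), "\n";   # 0: "GT" occurs only once
```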

The quantifiers are:

{n}     Must occur exactly n times
{n,m}   Must occur at least n times but no more than m times
{n,}    Must occur at least n times
*       0 or more times (same as {0,})
+       1 or more times (same as {1,})
?       0 or 1 time (same as {0,1})
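The equivalences in the list can be checked with a short sketch like the one below. The function name forms_agree() and the test strings are made up for illustration. The anchors ^ and $ are not covered in this section; they are used here only so that the whole string has to match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each pair holds a shorthand pattern and its brace equivalent.
my @pairs = (
    [ qr/^AC*G$/, qr/^AC{0,}G$/  ],
    [ qr/^AC+G$/, qr/^AC{1,}G$/  ],
    [ qr/^AC?G$/, qr/^AC{0,1}G$/ ],
);

# Hypothetical helper: returns 1 if, for every pair, both forms give
# the same answer (match or no match) for $string; 0 otherwise.
sub forms_agree {
    my ($string) = @_;
    for my $pair (@pairs) {
        my ($short, $long) = @$pair;
        my $short_hit = $string =~ $short ? 1 : 0;
        my $long_hit  = $string =~ $long  ? 1 : 0;
        return 0 if $short_hit != $long_hit;
    }
    return 1;
}

# Strings with zero, one, two and three C's between A and G.
for my $string (qw(AG ACG ACCG ACCCG)) {
    printf "%-6s both forms agree: %d\n", $string, forms_agree($string);
}
```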

Example 1

We would like to find out whether the consensus sequence
ACCCC[AG][AG][AG]GTGT
is contained (somewhere) in a given sequence $a.

Without quantifiers:
if ($a =~ /ACCCC[AG][AG][AG]GTGT/) {...};
With quantifiers:
if ($a =~ /AC{4}[AG]{3}(GT){2}/) {...};
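As a quick check that the two forms behave alike, the sketch below runs both patterns over a few test sequences. The sequences and the function names are invented for this illustration. Note that in AC{4} only the C is repeated, so the pattern means A followed by four C's, not ACACACAC.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern written out in full.
sub match_plain {
    my ($seq) = @_;
    return $seq =~ /ACCCC[AG][AG][AG]GTGT/ ? 1 : 0;
}

# The same pattern written with quantifiers.
sub match_quantified {
    my ($seq) = @_;
    return $seq =~ /AC{4}[AG]{3}(GT){2}/ ? 1 : 0;
}

my @sequences = (
    "TTACCCCAGAGTGTTT",   # contains the consensus sequence
    "TTACCCAGAGTGTTT",    # only three C's after the A
    "ACACACACAGAGTGT",    # "AC" four times is not what AC{4} means
);

for my $seq (@sequences) {
    printf "%-18s plain: %d  quantified: %d\n",
        $seq, match_plain($seq), match_quantified($seq);
}
```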

Example 2

The date and time example from the previous slide will look much nicer if we use quantifiers:
#!/usr/local/bin/perl

print "Please enter date and time, as in \"08-OCT-1997  16:30\"\n";
my $entry = <STDIN>;
chomp($entry);   # remove the trailing newline, if there is one

if ($entry =~ /\d{2}-\w{3}-\d{4}  \d{2}:\d{2}/) {
   print "good!\n";
} else {
   print "wrong format!\n";
}
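One thing to keep in mind: the pattern above is satisfied if the date and time appear anywhere in the entry, so extra characters before or after them are accepted too. The sketch below compares it with a stricter variant that uses the anchors ^ and $, which are not covered in this section. Both function names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from Example 2, unchanged: it may match anywhere in the entry.
sub contains_date_time {
    my ($entry) = @_;
    return $entry =~ /\d{2}-\w{3}-\d{4}  \d{2}:\d{2}/ ? 1 : 0;
}

# Assumed stricter variant: ^ and $ require the whole entry to fit the format.
sub is_date_time {
    my ($entry) = @_;
    return $entry =~ /^\d{2}-\w{3}-\d{4}  \d{2}:\d{2}$/ ? 1 : 0;
}

my @entries = (
    "08-OCT-1997  16:30",       # exactly the requested format
    "xx08-OCT-1997  16:30yy",   # extra characters around it
    "8-OCT-97 16:30",           # too few digits, only one space
);

for my $entry (@entries) {
    printf "%-26s anywhere: %d  whole entry: %d\n",
        "\"$entry\"", contains_date_time($entry), is_date_time($entry);
}
```

Also note that \w{3} accepts any three word characters (letters, digits or underscores), so neither version checks that the month is a real month name.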

Example 3

To check whether a given sequence contains 2 or more consecutive repeats of the GATA tetranucleotide, write:
if ($seq =~ /(GATA){2,}/) {  }

# note that we enclosed the sequence to be repeated in parentheses
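The following sketch tries the pattern on a few invented sequences. It shows that the repeats have to follow each other directly: two GATA copies separated by other bases do not count. The function name has_gata_repeat() is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns 1 if $seq contains GATA at least twice in a row, 0 otherwise.
sub has_gata_repeat {
    my ($seq) = @_;
    return $seq =~ /(GATA){2,}/ ? 1 : 0;
}

print has_gata_repeat("CCGATAGATACC"), "\n";   # 1: two copies in a row
print has_gata_repeat("GATAGATAGATA"), "\n";   # 1: three copies in a row
print has_gata_repeat("CCGATACCGATA"), "\n";   # 0: copies are not adjacent
print has_gata_repeat("CCGATACC"),     "\n";   # 0: only one copy
```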

Example 4

The Genome Database accession IDs are composed of the characters GDB: followed by several digits (see example).
To check whether a Genome Database accession ID is entered correctly, use the following conditional:
if ($entry =~ /GDB:\d+/) {  }

# i.e. "GDB:" followed by one or more digits
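The choice of + rather than * matters here, as the sketch below tries to show: with *, an entry with no digits at all would also be accepted. The function names and the sample entries (including the ID number) are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from Example 4: at least one digit must follow "GDB:".
sub has_gdb_id {
    my ($entry) = @_;
    return $entry =~ /GDB:\d+/ ? 1 : 0;
}

# For comparison only: with * the digits become optional.
sub has_gdb_prefix {
    my ($entry) = @_;
    return $entry =~ /GDB:\d*/ ? 1 : 0;
}

for my $entry ("GDB:120231", "GDB:", "GDB:x42") {
    printf "%-12s with +: %d  with *: %d\n",
        $entry, has_gdb_id($entry), has_gdb_prefix($entry);
}
```

With +, only the first entry is accepted; with *, all three are, because zero digits after the colon is enough.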

Example 5

To check whether a sentence contains either the word "color" or "colour", write:
if ($sentence =~ /colou?r/) {  }

# the question mark here denotes an optional "u"

Example 6

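Suppose we want to tolerate extra whitespace inside tags, so that < TITLE    > and <\tTITLE> are treated the same as <TITLE>.
(Strictly speaking, the HTML specification permits whitespace before the closing > but not between < and the tag name.)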
To check whether an HTML text contains the TITLE tag, write:

if ($text =~ /<\s*TITLE\s*>/) {  }

# the word "TITLE" may optionally be surrounded by any number
# of spaces, tabs etc.
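A small sketch that tries the pattern on several variants of the tag follows. The function name has_title_tag() and the test strings are made up for illustration. As written, the pattern is case-sensitive and does not allow whitespace inside the word TITLE itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns 1 if $text contains a TITLE tag, allowing whitespace
# (spaces, tabs, newlines) on either side of the word TITLE.
sub has_title_tag {
    my ($text) = @_;
    return $text =~ /<\s*TITLE\s*>/ ? 1 : 0;
}

print has_title_tag("<TITLE>"),      "\n";   # 1: no extra whitespace
print has_title_tag("< TITLE    >"), "\n";   # 1: spaces on both sides
print has_title_tag("<\tTITLE>"),    "\n";   # 1: a tab before the word
print has_title_tag("<TI TLE>"),     "\n";   # 0: space inside the word
print has_title_tag("<title>"),      "\n";   # 0: lower case is not matched
```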

