- •Introduction
 - •Who This Book Is For
 - •What This Book Covers
 - •How This Book Is Structured
 - •What You Need to Use This Book
 - •Conventions
 - •Source Code
 - •Errata
 - •p2p.wrox.com
 - •What Are Regular Expressions?
 - •What Can Regular Expressions Be Used For?
 - •Finding Doubled Words
 - •Checking Input from Web Forms
 - •Changing Date Formats
 - •Finding Incorrect Case
 - •Adding Links to URLs
 - •Regular Expressions You Already Use
 - •Search and Replace in Word Processors
 - •Directory Listings
 - •Online Searching
 - •Why Regular Expressions Seem Intimidating
 - •Compact, Cryptic Syntax
 - •Whitespace Can Significantly Alter the Meaning
 - •No Standards Body
 - •Differences between Implementations
 - •Characters Change Meaning in Different Contexts
 - •Regular Expressions Can Be Case Sensitive
 - •Case-Sensitive and Case-Insensitive Matching
 - •Case and Metacharacters
 - •Continual Evolution in Techniques Supported
 - •Multiple Solutions for a Single Problem
 - •What You Want to Do with a Regular Expression
 - •Replacing Text in Quantity
 - •Regular Expression Tools
 - •findstr
 - •Microsoft Word
 - •StarOffice Writer/OpenOffice.org Writer
 - •Komodo Rx Package
 - •PowerGrep
 - •Microsoft Excel
 - •JavaScript and JScript
 - •VBScript
 - •Visual Basic.NET
 - •Java
 - •Perl
 - •MySQL
 - •SQL Server 2000
 - •W3C XML Schema
 - •An Analytical Approach to Using Regular Expressions
 - •Express and Document What You Want to Do in English
 - •Consider the Regular Expression Options Available
 - •Consider Sensitivity and Specificity
 - •Create Appropriate Regular Expressions
 - •Document All but Simple Regular Expressions
 - •Document What You Expect the Regular Expression to Do
 - •Document What You Want to Match
 - •Test the Results of a Regular Expression
 - •Matching Single Characters
 - •Matching Sequences of Characters That Each Occur Once
 - •Introducing Metacharacters
 - •Matching Sequences of Different Characters
 - •Matching Optional Characters
 - •Matching Multiple Optional Characters
 - •Other Cardinality Operators
 - •The * Quantifier
 - •The + Quantifier
 - •The Curly-Brace Syntax
 - •The {n} Syntax
 - •The {n,m} Syntax
 - •Exercises
 - •Regular Expression Metacharacters
 - •Thinking about Characters and Positions
 - •The Period (.) Metacharacter
 - •Matching Variably Structured Part Numbers
 - •Matching a Literal Period
 - •The \w Metacharacter
 - •The \W Metacharacter
 - •Digits and Nondigits
 - •The \d Metacharacter
 - •Canadian Postal Code Example
 - •The \D Metacharacter
 - •Alternatives to \d and \D
 - •The \s Metacharacter
 - •Handling Optional Whitespace
 - •The \S Metacharacter
 - •The \t Metacharacter
 - •The \n Metacharacter
 - •Escaped Characters
 - •Finding the Backslash
 - •Modifiers
 - •Global Search
 - •Case-Insensitive Search
 - •Exercises
 - •Introduction to Character Classes
 - •Choice between Two Characters
 - •Using Quantifiers with Character Classes
 - •Using the \b Metacharacter in Character Classes
 - •Selecting Literal Square Brackets
 - •Using Ranges in Character Classes
 - •Alphabetic Ranges
 - •Use [A-z] With Care
 - •Digit Ranges in Character Classes
 - •Hexadecimal Numbers
 - •IP Addresses
 - •Reverse Ranges in Character Classes
 - •A Potential Range Trap
 - •Finding HTML Heading Elements
 - •Metacharacter Meaning within Character Classes
 - •The ^ metacharacter
 - •How to Use the - Metacharacter
 - •Negated Character Classes
 - •Combining Positive and Negative Character Classes
 - •POSIX Character Classes
 - •The [:alnum:] Character Class
 - •Exercises
 - •String, Line, and Word Boundaries
 - •The ^ Metacharacter
 - •The ^ Metacharacter and Multiline Mode
 - •The $ Metacharacter
 - •The $ Metacharacter in Multiline Mode
 - •Using the ^ and $ Metacharacters Together
 - •Matching Blank Lines
 - •Working with Dollar Amounts
 - •Revisiting the IP Address Example
 - •What Is a Word?
 - •Identifying Word Boundaries
 - •The \< Syntax
 - •The \> Syntax
 - •The \b Syntax
 - •The \B Metacharacter
 - •Less-Common Word-Boundary Metacharacters
 - •Exercises
 - •Grouping Using Parentheses
 - •Parentheses and Quantifiers
 - •Matching Literal Parentheses
 - •U.S. Telephone Number Example
 - •Alternation
 - •Choosing among Multiple Options
 - •Unexpected Alternation Behavior
 - •Capturing Parentheses
 - •Numbering of Captured Groups
 - •Numbering When Using Nested Parentheses
 - •Named Groups
 - •Non-Capturing Parentheses
 - •Back References
 - •Exercises
 - •Why You Need Lookahead and Lookbehind
 - •The (? metacharacters
 - •Lookahead
 - •Positive Lookahead
 - •Negative Lookahead
 - •Positive Lookahead Examples
 - •Positive Lookahead in the Same Document
 - •Inserting an Apostrophe
 - •Lookbehind
 - •Positive Lookbehind
 - •Negative Lookbehind
 - •How to Match Positions
 - •Adding Commas to Large Numbers
 - •Exercises
 - •What Are Sensitivity and Specificity?
 - •Extreme Sensitivity, Awful Specificity
 - •Email Addresses Example
 - •Replacing Hyphens Example
 - •The Sensitivity/Specificity Trade-Off
 - •Sensitivity, Specificity, and Positional Characters
 - •Sensitivity, Specificity, and Modes
 - •Sensitivity, Specificity, and Lookahead and Lookbehind
 - •How Much Should the Regular Expressions Do?
 - •Abbreviations
 - •Characters from Other Languages
 - •Names
 - •Sensitivity and How to Achieve It
 - •Specificity and How to Maximize It
 - •Exercises
 - •Documenting Regular Expressions
 - •Document the Problem Definition
 - •Add Comments to Your Code
 - •Making Use of Extended Mode
 - •Know Your Data
 - •Abbreviations
 - •Proper Names
 - •Incorrect Spelling
 - •Creating Test Cases
 - •Debugging Regular Expressions
 - •Treacherous Whitespace
 - •Backslashes Causing Problems
 - •Considering Other Causes
 - •The User Interface
 - •Metacharacters Available
 - •Quantifiers
 - •The @ Quantifier
 - •The {n,m} Syntax
 - •Modes
 - •Character Classes
 - •Back References
 - •Lookahead and Lookbehind
 - •Lazy Matching versus Greedy Matching
 - •Examples
 - •Character Class Examples, Including Ranges
 - •Whole Word Searches
 - •Search-and-Replace Examples
 - •Changing Name Structure Using Back References
 - •Manipulating Dates
 - •The Star Training Company Example
 - •Regular Expressions in Visual Basic for Applications
 - •Exercises
 - •The User Interface
 - •Metacharacters Available
 - •Quantifiers
 - •Modes
 - •Character Classes
 - •Alternation
 - •Back References
 - •Lookahead and Lookbehind
 - •Search Example
 - •Search-and-Replace Example
 - •Online Chats
 - •POSIX Character Classes
 - •Matching Numeric Digits
 - •Exercises
 - •Introducing findstr
 - •Finding Literal Text
 - •Quantifiers
 - •Character Classes
 - •Command-Line Switch Examples
 - •The /v Switch
 - •The /a Switch
 - •Single File Examples
 - •Simple Character Class Example
 - •Find Protocols Example
 - •Multiple File Example
 - •A Filelist Example
 - •Exercises
 - •The PowerGREP Interface
 - •A Simple Find Example
 - •The Replace Tab
 - •The File Finder Tab
 - •Syntax Coloring
 - •Other Tabs
 - •Numeric Digits and Alphabetic Characters
 - •Quantifiers
 - •Back References
 - •Alternation
 - •Line Position Metacharacters
 - •Word-Boundary Metacharacters
 - •Lookahead and Lookbehind
 - •Longer Examples
 - •Finding HTML Horizontal Rule Elements
 - •Matching Time Example
 - •Exercises
 - •The Excel Find Interface
 - •Escaping Wildcard Characters
 - •Using Wildcards in Data Forms
 - •Using Wildcards in Filters
 - •Exercises
 - •Using LIKE with Regular Expressions
 - •The % Metacharacter
 - •The _ Metacharacter
 - •Character Classes
 - •Negated Character Classes
 - •Using Full-Text Search
 - •Using The CONTAINS Predicate
 - •Document Filters on Image Columns
 - •Exercises
 - •Using the _ and % Metacharacters
 - •Testing Matching of Literals: _ and % Metacharacters
 - •Using Positional Metacharacters
 - •Using Character Classes
 - •Quantifiers
 - •Social Security Number Example
 - •Exercises
 - •The Interface to Metacharacters in Microsoft Access
 - •Creating a Hard-Wired Query
 - •Creating a Parameter Query
 - •Using the ? Metacharacter
 - •Using the * Metacharacter
 - •Using the # Metacharacter
 - •Using the # Character with Date/Time Data
 - •Using Character Classes in Access
 - •Exercises
 - •The RegExp Object
 - •Attributes of the RegExp Object
 - •The Other Properties of the RegExp Object
 - •The test() Method of the RegExp Object
 - •The exec() Method of the RegExp Object
 - •The String Object
 - •Metacharacters in JavaScript and JScript
 - •SSN Validation Example
 - •Exercises
 - •The RegExp Object and How to Use It
 - •Quantifiers
 - •Positional Metacharacters
 - •Character Classes
 - •Word Boundaries
 - •Lookahead
 - •Grouping and Nongrouping Parentheses
 - •Exercises
 - •The System.Text.RegularExpressions namespace
 - •A Simple Visual Basic .NET Example
 - •The Classes of System.Text.RegularExpressions
 - •The Regex Object
 - •Using the Match Object and Matches Collection
 - •Using the Match.Success Property and Match.NextMatch Method
 - •The GroupCollection and Group Classes
 - •The CaptureCollection and Capture Class
 - •The RegexOptions Enumeration
 - •Case-Insensitive Matching: The IgnoreCase Option
 - •Multiline Matching: The Effect on the ^ and $ Metacharacters
 - •Right to Left Matching: The RightToLeft Option
 - •Lookahead and Lookbehind
 - •Exercises
 - •An Introductory Example
 - •The Classes of System.Text.RegularExpressions
 - •The Regex Class
 - •The Options Property of the Regex Class
 - •Regex Class Methods
 - •The CompileToAssembly() Method
 - •The GetGroupNames() Method
 - •The GetGroupNumbers() Method
 - •GroupNumberFromName() and GroupNameFromNumber() Methods
 - •The IsMatch() Method
 - •The Match() Method
 - •The Matches() Method
 - •The Replace() Method
 - •The Split() Method
 - •Using the Static Methods of the Regex Class
 - •The IsMatch() Method as a Static
 - •The Match() Method as a Static
 - •The Matches() Method as a Static
 - •The Replace() Method as a Static
 - •The Split() Method as a Static
 - •The Match and Matches Classes
 - •The Match Class
 - •The GroupCollection and Group Classes
 - •The RegexOptions Class
 - •The IgnorePatternWhitespace Option
 - •Metacharacters Supported in Visual C# .NET
 - •Using Named Groups
 - •Using Back References
 - •Exercise
 - •The ereg() Set of Functions
 - •The ereg() Function
 - •The ereg() Function with Three Arguments
 - •The eregi() Function
 - •The ereg_replace() Function
 - •The eregi_replace() Function
 - •The split() Function
 - •The spliti() Function
 - •The sql_regcase() Function
 - •Perl Compatible Regular Expressions
 - •Pattern Delimiters in PCRE
 - •Escaping Pattern Delimiters
 - •Matching Modifiers in PCRE
 - •Using the preg_match() Function
 - •Using the preg_match_all() Function
 - •Using the preg_grep() Function
 - •Using the preg_quote() Function
 - •Using the preg_replace() Function
 - •Using the preg_replace_callback() Function
 - •Using the preg_split() Function
 - •Supported Metacharacters with ereg()
 - •Using POSIX Character Classes with PHP
 - •Supported Metacharacters with PCRE
 - •Positional Metacharacters
 - •Character Classes in PHP
 - •Documenting PHP Regular Expressions
 - •Exercises
 - •W3C XML Schema Basics
 - •Tools for Using W3C XML Schema
 - •Comparing XML Schema and DTDs
 - •How Constraints Are Expressed in W3C XML Schema
 - •W3C XML Schema Datatypes
 - •Derivation by Restriction
 - •Unicode and W3C XML Schema
 - •Unicode Overview
 - •Using Unicode Character Classes
 - •Matching Decimal Numbers
 - •Mixing Unicode Character Classes with Other Metacharacters
 - •Unicode Character Blocks
 - •Using Unicode Character Blocks
 - •Metacharacters Supported in W3C XML Schema
 - •Positional Metacharacters
 - •Matching Numeric Digits
 - •Alternation
 - •Using the \w and \s Metacharacters
 - •Escaping Metacharacters
 - •Exercises
 - •Introduction to the java.util.regex Package
 - •Obtaining and Installing Java
 - •The Pattern Class
 - •Using the matches() Method Statically
 - •Two Simple Java Examples
 - •The Properties (Fields) of the Pattern Class
 - •The CASE_INSENSITIVE Flag
 - •Using the COMMENTS Flag
 - •The DOTALL Flag
 - •The MULTILINE Flag
 - •The UNICODE_CASE Flag
 - •The UNIX_LINES Flag
 - •The Methods of the Pattern Class
 - •The compile() Method
 - •The flags() Method
 - •The matcher() Method
 - •The matches() Method
 - •The pattern() Method
 - •The split() Method
 - •The Matcher Class
 - •The appendReplacement() Method
 - •The appendTail() Method
 - •The end() Method
 - •The find() Method
 - •The group() Method
 - •The groupCount() Method
 - •The lookingAt() Method
 - •The matches() Method
 - •The pattern() Method
 - •The replaceAll() Method
 - •The replaceFirst() Method
 - •The reset() Method
 - •The start() Method
 - •The PatternSyntaxException Class
 - •Using the \d Metacharacter
 - •Character Classes
 - •The POSIX Character Classes in the java.util.regex Package
 - •Unicode Character Classes and Character Blocks
 - •Using Escaped Characters
 - •Using Methods of the String Class
 - •Using the matches() Method
 - •Using the replaceFirst() Method
 - •Using the replaceAll() Method
 - •Using the split() Method
 - •Exercises
 - •Obtaining and Installing Perl
 - •Creating a Simple Perl Program
 - •Basics of Perl Regular Expression Usage
 - •Using the m// Operator
 - •Using Other Regular Expression Delimiters
 - •Matching Using Variable Substitution
 - •Using the s/// Operator
 - •Using s/// with the Global Modifier
 - •Using s/// with the Default Variable
 - •Using the split Operator
 - •Using Quantifiers in Perl
 - •Using Positional Metacharacters
 - •Captured Groups in Perl
 - •Using Back References in Perl
 - •Using Alternation
 - •Using Character Classes in Perl
 - •Using Lookahead
 - •Using Lookbehind
 - •Escaping Metacharacters
 - •A Simple Perl Regex Tester
 - •Exercises
 - •Index
 
Regular Expressions in Perl
Figure 26-17
How It Works
The test string is assigned to the variable $myString:
my $myString = "I attended a Star Training Company training course.";
The variable $oldString is used to hold the original value for later display:
my $oldString = $myString;
The first occurrence of the character sequence Star in the test string is replaced by the character sequence Moon:
$myString =~ s/Star/Moon/;
The user is informed of the original and replaced strings:
print "The original string was: \n'$oldString'\n\n";
print "After replacement the string is: \n'$myString'\n\n";
if ($oldString =~ m/Star/)
{
    print "The string 'Star' was matched and replaced in the old string";
}
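The listing above tests the saved copy with a separate m/Star/ match. As a minimal sketch of an alternative, s/// itself returns a true value when it makes a replacement, so the substitution can serve directly as the condition. The variable names follow the example; the restructuring and the else branch are illustrative additions, not part of the book's listing.

```perl
#!/usr/bin/perl -w
use strict;

my $myString = "I attended a Star Training Company training course.";
my $oldString = $myString;    # keep the original value for display

# s/// returns a true value if a replacement was made,
# so it can be used as the test of the if statement.
if ($myString =~ s/Star/Moon/)
{
    print "The original string was:\n'$oldString'\n\n";
    print "After replacement the string is:\n'$myString'\n\n";
}
else
{
    print "No match for 'Star' was found in '$oldString'.\n";
}
```

This form avoids matching twice, at the cost of reporting only after the replacement has already happened.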
Using s/// with the Global Modifier
Often, you will want to replace all occurrences of a character sequence in the test string. The example of the Star Training Company earlier in this book is a case in point. To specify that all occurrences of a pattern are replaced, the global modifier, g, is used.
To achieve global replacement, you write the following:
$myTestString =~ s/pattern/replacementString/g
The g modifier after the third forward slash indicates that global replacement is to take place.
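As a small sketch of what the g modifier changes, the following assumes the same test string as the example that follows and captures the return value of s///g, which in scalar context is the number of substitutions made. The variable $myCount is introduced here only for illustration.

```perl
#!/usr/bin/perl -w
use strict;

my $myTestString = "Star Training Company courses are great. Choose Star for your training needs.";

# With the g modifier, s/// returns the number of substitutions made
# (or a false value if there were none).
my $myCount = ($myTestString =~ s/Star/Moon/g);

print "Replacements made: $myCount\n";
print "Result: '$myTestString'\n";
```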
Try It Out: Using s/// with the Global Modifier
1. Type the following code in a text editor:
#!/usr/bin/perl -w
use strict;
Chapter 26
print "This example uses the global modifier, 'g'\n\n";
my $myTestString = "Star Training Company courses are great. Choose Star for your training needs.";
my $myOnceString = $myTestString;
my $myGlobalString = $myTestString;
my $myPattern = "Star";
my $myReplacementString = "Moon";
$myOnceString =~ s/$myPattern/$myReplacementString/;
$myGlobalString =~ s/$myPattern/$myReplacementString/g;
print "The original string was '$myTestString'.\n\n";
print "After a single replacement it became '$myOnceString'.\n\n";
print "After global replacement it became '$myGlobalString'.\n\n";
2. Save the code as GlobalReplace.pl.
3. Run the code and inspect the results, as shown in Figure 26-18. Notice that without the g modifier, only one occurrence of the character sequence Star has been replaced. With the g modifier present, all occurrences (in this case, there are two) are replaced.
Figure 26-18
How It Works
The test string is assigned to the variable $myTestString:
my $myTestString = "Star Training Company courses are great. Choose Star for your training needs.";
The value of the original test string is copied to the variables $myOnceString and $myGlobalString:
my $myOnceString = $myTestString;
my $myGlobalString = $myTestString;
The pattern Star is assigned to the variable $myPattern:
my $myPattern = "Star";
The replacement string, Moon, is assigned to the variable $myReplacementString:
my $myReplacementString = "Moon";
One match is replaced in $myOnceString:
$myOnceString =~ s/$myPattern/$myReplacementString/;
All matches (two, in this example) are replaced in $myGlobalString, because the g modifier is specified:
$myGlobalString =~ s/$myPattern/$myReplacementString/g;
Then the original string, the string after a single replacement, and the string after global replacement are displayed:
print "The original string was '$myTestString'.\n\n";
print "After a single replacement it became '$myOnceString'.\n\n";
print "After global replacement it became '$myGlobalString'.\n\n";
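Because $myPattern is interpolated into the pattern, any metacharacters it contains are treated as metacharacters. The following sketch, which uses hypothetical test data rather than the Star Training example, shows how Perl's \Q...\E escape sequence makes an interpolated value match literally.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical test data: the pattern text contains a regex metacharacter.
my $myTestString = "Version 1.5 replaces version 1x5.";
my $myPattern = "1.5";
my $myReplacementString = "2.0";

my $myUnquotedString = $myTestString;
my $myQuotedString = $myTestString;

# Without quoting, the period in $myPattern matches any character,
# so both "1.5" and "1x5" are replaced.
$myUnquotedString =~ s/$myPattern/$myReplacementString/g;

# \Q...\E escapes metacharacters in the interpolated value,
# so only the literal text "1.5" is replaced.
$myQuotedString =~ s/\Q$myPattern\E/$myReplacementString/g;

print "Unquoted: '$myUnquotedString'\n";
print "Quoted:   '$myQuotedString'\n";
```

For a pattern such as Star, which contains no metacharacters, the two forms behave identically.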
Using s/// with the Default Variable
The default variable, $_, can be used with s/// to search and replace the value held in the default variable.
Two forms of syntax can be used. You can use the normal s/// syntax, with the variable name, the =~ operator and the pattern and replacement text:
$_ =~ s/pattern/replacementText/modifiers;
The alternative, more succinct, syntax allows the name of the default variable and =~ operator to be omitted. So you can simply write the following:
s/pattern/replacementText/modifiers
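The abbreviated syntax is most useful where Perl sets $_ for you. As a minimal sketch, assuming a hypothetical array @myLines, a foreach loop without a loop variable aliases $_ to each element in turn, so the short form of s/// edits the array in place.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical list of strings to update.
my @myLines = (
    "Star Training Company courses are great.",
    "Choose Star for your training needs.",
    "No company name appears here.",
);

# foreach aliases $_ to each element, so the abbreviated
# s/// syntax modifies the array elements themselves.
foreach (@myLines)
{
    s/Star/Moon/g;
}

print "$_\n" foreach @myLines;
```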
Try It Out: Using s/// with the Default Variable
1. Type the following code in a text editor:
#!/usr/bin/perl -w
use strict;
$_ = "I went to a training course from Star Training Company.";
print "The default string, \$_, contains '$_'.\n\n";
if (s/Star/Moon/)
{
    print "A replacement has taken place using the default variable.\n";
    print "The replaced string in \$_ is now '$_'.";
}
2. Save the code as ReplaceDefaultVariable.pl.
3. Run the code, and inspect the displayed result, as shown in Figure 26-19.
Figure 26-19
How It Works
The test string is assigned to the default variable, $_:
$_ = "I went to a training course from Star Training Company.";
The value contained in the default variable is displayed:
print "The default string, \$_, contains '$_'.\n\n";
The test of the if statement uses the abbreviated syntax for carrying out a replacement on the default variable:
if (s/Star/Moon/)
You might prefer to use the full syntax:
if ($_ =~ s/Star/Moon/)
Whichever syntax you use, the user is then informed that a replacement operation has taken place and is informed of the value of the string after the replacement operation:
print "A replacement has taken place using the default variable.\n";
print "The replaced string in \$_ is now '$_'.";
Using the split Operator
The split operator splits a test string at each match of a regular expression and returns the pieces that lie between the matches.
The following example shows how you can separate a comma-separated sequence of values into its component parts.
Try It Out: Using the split Operator
1. Type the following code into a text editor:
#!/usr/bin/perl -w
use strict;
my $myTestString = "A, B, C, D";
print "The original string was '$myTestString'.\n";
my @myArray = split/,\s?/, $myTestString;
print "The string has been split into four array elements:\n";
print "$myArray[0]\n";
print "$myArray[1]\n";
print "$myArray[2]\n";
print "$myArray[3]\n";
print "Displaying array elements using the 'foreach' statement:\n";
foreach my $mySplit (split/,\s?/, $myTestString)
{
    print "$mySplit\n";
}
2. Save the code as SplitDemo.pl.
3. Run the code, and inspect the displayed results, as shown in Figure 26-20.
Figure 26-20
How It Works
A sequence of values separated by commas and a space character is assigned to the variable $myTestString:
my $myTestString = "A, B, C, D";
The value of the original string is displayed:
print "The original string was '$myTestString'.\n";
The @myArray array is assigned the result of using the split operator. The pattern that is matched against is a comma optionally followed by a whitespace character. The target of the split operator is the variable $myTestString:
my @myArray = split/,\s?/, $myTestString;
Then you can use array indices to display the components into which the string has been split:
print "The string has been split into four array elements:\n";
print "$myArray[0]\n";
print "$myArray[1]\n";
print "$myArray[2]\n";
print "$myArray[3]\n";
Or, more elegantly, you can use a foreach statement to display each result of splitting the $myTestString variable:
print "Displaying array elements using the 'foreach' statement:\n";
foreach my $mySplit (split/,\s?/, $myTestString)
{
    print "$mySplit\n";
}
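The pattern ,\s? allows for at most one whitespace character after each comma. As a sketch of a more tolerant variant, assuming hypothetical input with irregular spacing, the pattern \s*,\s* absorbs any whitespace on either side of the comma, and scalar(@myArray) reports the element count so that the number of pieces need not be hard-coded.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical input with irregular spacing around the commas.
my $myTestString = "A,B ,  C,D";

# \s*,\s* matches a comma together with any whitespace on either side.
my @myArray = split /\s*,\s*/, $myTestString;

# scalar(@myArray) gives the element count, so nothing is hard-coded.
print "The string has been split into ", scalar(@myArray), " elements:\n";
foreach my $mySplit (@myArray)
{
    print "'$mySplit'\n";
}
```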
The Metacharacters Supported in Perl
Perl supports a useful range of metacharacters, as summarized in the following table.
Metacharacter | Description
. (period character) | Matches any character (with the exception, according to mode, of the newline character).
\w | Matches a character that is alphabetic, numeric, or an underscore character. Sometimes called a "word character." Equivalent to the character class [A-Za-z0-9_].
\W | Matches a character that is not alphabetic, numeric, or an underscore character. Equivalent to the character class [^A-Za-z0-9_] or [^\w].
\s | Matches a whitespace character.
\S | Matches a character that is not a whitespace character.
\d | Matches a character that is a numeric digit. Equivalent to the character class [0-9].
\D | Matches a character that is not a numeric digit. Equivalent to the character class [^0-9].
? | Quantifier. Matches if the preceding character or group occurs zero or one time.
* | Quantifier. Matches if the preceding character or group occurs zero or more times.
+ | Quantifier. Matches if the preceding character or group occurs one or more times.
{n,m} | Quantifier. Matches if the preceding character or group occurs a minimum of n times and a maximum of m times.
(...) | Capturing parentheses.
$1 etc. | Variables that allow access to captured groups.
| (vertical bar) | Alternation character.
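To show several of these metacharacters working together, here is a minimal sketch that uses hypothetical test data. It combines \d with the {n,m} quantifier, \s and \w with the + quantifier, capturing parentheses, and alternation; the variables $myOrder and $myDay are introduced only for this illustration.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical test data for illustration.
my $myTestString = "Order 12345 shipped on Monday";
my ($myOrder, $myDay) = ("", "");

# (\d{3,5})        : three to five digits, captured as $1
# \s+              : one or more whitespace characters
# \w+              : one or more word characters
# (Monday|Tuesday) : alternation inside capturing parentheses, captured as $2
if ($myTestString =~ m/(\d{3,5})\s+\w+\s+\w+\s+(Monday|Tuesday)/)
{
    ($myOrder, $myDay) = ($1, $2);
    print "Order number: $myOrder\n";
    print "Day: $myDay\n";
}
```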
