
CHAPTER 8 ■ WORKING WITH STRINGS
period by itself carries a special meaning: an unescaped period matches any character in that position of the match. Lastly, you’ll see that it is much easier to use the verbatim string syntax when declaring regular expressions in order to avoid the gratuitous proliferation of backslashes. If you were to invoke the previous example, passing the following quoted string on the command line
"This is an IP address:123.123.1.123"
then the output would look like the following:
IP Address found at 22 with value of 123.123.1.123
The previous example creates a new Regex instance named regex and then, using the Match() method, applies the pattern to the given string. The results of the match are stored in the match variable. That match variable represents the first match within the searched string. You can use the Match.Success property to determine if the regex found anything at all. Next, you see the code using the Index and Value properties to find out more about the match. Lastly, you can go to the next match in the searched string by calling the Match.NextMatch method, and you can iterate through this chain until you find no more matches in the searched string.
Alternatively, instead of calling Match.NextMatch() in a loop, you can call the Regex.Matches method to retrieve a MatchCollection that gives you all of the matches at once rather than one at a time. Also, each of the examples using Regex in this chapter calls instance methods on a Regex instance. Many of the methods on Regex, such as Match() and Replace(), also offer static versions where you don’t have to create a Regex instance first and can just pass the regular-expression pattern in the method call.
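As a minimal sketch of those two alternatives, the following code collects every match through Regex.Matches and then uses the static Regex.Match method without creating an instance first. The class name MatchesSketch, its helper methods, and the crude digit pattern are assumptions made for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchesSketch
{
    // Crude pattern assumed for illustration: four groups of one to three
    // digits separated by escaped periods.
    public const string Pattern = @"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}";

    // Instance form: retrieve all matches at once as a MatchCollection.
    public static string[] FindAll( string input ) {
        Regex regex = new Regex( Pattern );
        MatchCollection matches = regex.Matches( input );
        string[] values = new string[ matches.Count ];
        for( int i = 0; i < matches.Count; ++i ) {
            values[i] = matches[i].Value;
        }
        return values;
    }

    // Static form: pass the pattern directly in the method call.
    public static bool ContainsAddress( string input ) {
        return Regex.Match( input, Pattern ).Success;
    }

    public static void Main() {
        string input = "Hosts: 10.0.0.1 and 192.168.1.20";
        foreach( string value in FindAll(input) ) {
            Console.WriteLine( value );
        }
        Console.WriteLine( ContainsAddress("no address here") );
    }
}
```

The static form is convenient for one-off searches, while holding on to a Regex instance makes more sense when the same pattern is applied repeatedly.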
Searching and Grouping
The previous pattern simply looks for a series of four groups of digits separated by periods, where each group can be from one to three digits in length. I call this a crude search because it will match an invalid IP address such as 999.888.777.666. A better search for the IP address would look like the following:
using System;
using System.Text.RegularExpressions;

public class EntryPoint
{
    static void Main( string[] args ) {
        if( args.Length < 1 ) {
            Console.WriteLine( "You must provide a string." );
            return;
        }

        // Create regex to search for IP address pattern.
        string pattern = @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
                         @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
                         @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
                         @"([01]?\d\d?|2[0-4]\d|25[0-5])";
        Regex regex = new Regex( pattern );
        Match match = regex.Match( args[0] );
        while( match.Success ) {
            Console.WriteLine( "IP Address found at {0} with " +
                               "value of {1}",
                               match.Index, match.Value );

            match = match.NextMatch();
        }
    }
}
Essentially, four groupings of the same search pattern [01]?\d\d?|2[0-4]\d|25[0-5] are separated by periods, which, of course, are escaped in the preceding regular expression. Each one of these subexpressions matches a number between 0 and 255. This expression for finding IP addresses is better, but still not perfect. For instance, because the pattern is not anchored, it still finds 234.1.1.1 inside the invalid input 1234.1.1.1. However, you can see that it’s getting closer, and with a little more fine-tuning, you can use it to validate the IP address given in a string. Thus, you can use regular expressions to effectively validate input from users to make sure that it matches a certain form. For example, you may have a web server that expects U.S. telephone numbers to be entered in a pattern such as (xxx) xxx-xxxx. Regular expressions allow you to easily validate that the user has input the number correctly.
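The telephone-number scenario can be sketched roughly as follows. The class name PhoneValidationSketch, the IsValid helper, and the exact pattern are assumptions chosen for illustration; a production validator would likely need to be stricter or more lenient depending on its requirements.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneValidationSketch
{
    // Assumed pattern for the (xxx) xxx-xxxx form. The ^ and $ anchors
    // require the whole input, not just part of it, to fit the form.
    public const string Pattern = @"^\(\d{3}\) \d{3}-\d{4}$";

    public static bool IsValid( string input ) {
        return Regex.IsMatch( input, Pattern );
    }

    public static void Main() {
        Console.WriteLine( IsValid("(425) 555-0100") );
        Console.WriteLine( IsValid("425-555-0100") );
    }
}
```

Unlike the search examples, validation anchors the pattern at both ends, so extra text before or after the number causes the check to fail.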
You may have noticed the addition of parentheses in the IP address search expression in the previous example. Parentheses define groups, which collect subexpressions within a regular expression into discrete chunks. Groups can contain other groups as well. The IP address regular-expression pattern in the previous example therefore forms a group around each part of the IP address. In addition, you can access each individual group within the match. Consider the following modified version of the previous example:
using System;
using System.Text.RegularExpressions;

public class EntryPoint
{
    static void Main( string[] args ) {
        if( args.Length < 1 ) {
            Console.WriteLine( "You must provide a string." );
            return;
        }

        // Create regex to search for IP address pattern.
        string pattern = @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
                         @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
                         @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
                         @"([01]?\d\d?|2[0-4]\d|25[0-5])";
        Regex regex = new Regex( pattern );
        Match match = regex.Match( args[0] );
        while( match.Success ) {
            Console.WriteLine( "IP Address found at {0} with " +
                               "value of {1}",
                               match.Index, match.Value );
            Console.WriteLine( "Groups are:" );
            foreach( Group g in match.Groups ) {
                Console.WriteLine( "\t{0} at {1}",
                                   g.Value, g.Index );
            }

            match = match.NextMatch();
        }
    }
}
Within each match, I’ve added a loop that iterates through the individual groups within the match. As you’d expect, the collection contains four groups, one for each portion of the IP address. In fact, it contains a fifth item as well: the first entry in the collection returned from Match.Groups, at index 0, is always the entire match itself. Given the following input to the previous example
"This is an IP address:123.123.1.123"
the result would look like the following:
IP Address found at 22 with value of 123.123.1.123
Groups are:
        123.123.1.123 at 22
        123 at 22
        123 at 26
        1 at 30
        123 at 32
Groups provide an excellent means of picking portions out of a given input string. For example, at the same time that you validate that a user has input a phone number of the required format, you could also capture the area code into a group for use later. Collecting substrings of a match into groups is handy. But what’s even handier is being able to give those groups a name. Check out the following modified example:
using System;
using System.Text.RegularExpressions;

public class EntryPoint
{
    static void Main( string[] args ) {
        if( args.Length < 1 ) {
            Console.WriteLine( "You must provide a string." );
            return;
        }

        // Create regex to search for IP address pattern.
        string pattern = @"(?<part1>[01]?\d\d?|2[0-4]\d|25[0-5])\." +
                         @"(?<part2>[01]?\d\d?|2[0-4]\d|25[0-5])\." +
                         @"(?<part3>[01]?\d\d?|2[0-4]\d|25[0-5])\." +
                         @"(?<part4>[01]?\d\d?|2[0-4]\d|25[0-5])";
        Regex regex = new Regex( pattern );
        Match match = regex.Match( args[0] );
        while( match.Success ) {
            Console.WriteLine( "IP Address found at {0} with " +
                               "value of {1}",
                               match.Index, match.Value );
            Console.WriteLine( "Groups are:" );
            Console.WriteLine( "\tPart 1: {0}",
                               match.Groups["part1"] );
            Console.WriteLine( "\tPart 2: {0}",
                               match.Groups["part2"] );
            Console.WriteLine( "\tPart 3: {0}",
                               match.Groups["part3"] );
            Console.WriteLine( "\tPart 4: {0}",
                               match.Groups["part4"] );

            match = match.NextMatch();
        }
    }
}
In this variation, I’ve captured each part into a group with a name, and when I send the result to the console, I access the group by name through an indexer on the GroupCollection returned by Match.Groups that accepts a string argument.
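The same named-group technique applies to the phone-number scenario mentioned earlier: a single pattern can validate the input and capture the area code at the same time. The following is only a sketch; the class name AreaCodeSketch, the group name area, and the pattern itself are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AreaCodeSketch
{
    // Assumed (xxx) xxx-xxxx pattern with the area code in a named group.
    public const string Pattern = @"^\((?<area>\d{3})\) \d{3}-\d{4}$";

    // Returns the area code, or null if the input does not fit the form.
    public static string GetAreaCode( string input ) {
        Match match = Regex.Match( input, Pattern );
        if( !match.Success ) {
            return null;
        }
        // The string indexer on GroupCollection looks the group up by name.
        return match.Groups["area"].Value;
    }

    public static void Main() {
        Console.WriteLine( GetAreaCode("(425) 555-0100") );
    }
}
```

Looking the group up by name keeps the calling code stable even if more groups are later added to the pattern, which would shift numeric group indexes.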
With the ability to name groups comes the ability to back-reference groups within searches. If you’re looking for an exact repeat of a previous match, you can reference a previous group by including \k<name> in the pattern, where name is the name of the group to back-reference. For example, the following code looks for IP addresses where all four parts are the same:
using System;
using System.Text.RegularExpressions;

public class EntryPoint
{
    static void Main( string[] args ) {
        if( args.Length < 1 ) {
            Console.WriteLine( "You must provide a string." );
            return;
        }

        // Create regex to search for IP address pattern.
        string pattern = @"(?<part1>[01]?\d\d?|2[0-4]\d|25[0-5])\." +
                         @"\k<part1>\." +
                         @"\k<part1>\." +
                         @"\k<part1>";
        Regex regex = new Regex( pattern );
        Match match = regex.Match( args[0] );
        while( match.Success ) {
            Console.WriteLine( "IP Address found at {0} with " +
                               "value of {1}",
                               match.Index, match.Value );

            match = match.NextMatch();
        }
    }
}
The following output shows the results of running this code on the string "My IP address is 123.123.123.123":
IP Address found at 17 with value of 123.123.123.123

Replacing Text with Regex
If you’ve ever used Perl to do any text processing, you know that its regular-expression engine is indispensable. One of Perl’s greatest strengths is its regular-expression text-substitution capability. You can do the same thing using .NET regular expressions via the Regex.Replace method overloads. Suppose that you want to display a string that a user entered, but, for security reasons, you want to replace any IP address in it with xxx.xxx.xxx.xxx. You could achieve this goal as in the following example:
using System;
using System.Text.RegularExpressions;

public class EntryPoint
{
    static void Main( string[] args ) {
        if( args.Length < 1 ) {
            Console.WriteLine( "You must provide a string." );
            return;
        }

        // Create regex to search for IP address pattern.
        string pattern = @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
                         @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
                         @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
                         @"([01]?\d\d?|2[0-4]\d|25[0-5])";
        Regex regex = new Regex( pattern );
        Console.WriteLine( "Input given --> {0}",
                           regex.Replace(args[0],
                                         "xxx.xxx.xxx.xxx") );
    }
}
Thus, given the following input
"This is an IP address:123.123.123.123"
the output would look like the following:
Input given --> This is an IP address:xxx.xxx.xxx.xxx
Of course, when you find a match within a string, you may want to replace it with something that depends on what the match is. The previous example simply replaces each match with a static string. In order to replace based on the match instance, you can create an instance of the MatchEvaluator delegate and pass it to the Regex.Replace method. Then, whenever it finds a match, it calls through to the MatchEvaluator delegate instance given while passing it the match. Thus, the delegate can create the replacement string based upon the actual match. The MatchEvaluator delegate has the following signature:
public delegate string MatchEvaluator( Match match );
Suppose you want to reverse the individual parts of an IP address. Then you could use a MatchEvaluator coupled with Regex.Replace() to get the job done, as in the following example:
using System;
using System.Text;
using System.Text.RegularExpressions;

public class EntryPoint
{
    static void Main( string[] args ) {
        if( args.Length < 1 ) {
            Console.WriteLine( "You must provide a string." );
            return;
        }

        // Create regex to search for IP address pattern.
        string pattern = @"(?<part1>[01]?\d\d?|2[0-4]\d|25[0-5])\." +
                         @"(?<part2>[01]?\d\d?|2[0-4]\d|25[0-5])\." +
                         @"(?<part3>[01]?\d\d?|2[0-4]\d|25[0-5])\." +
                         @"(?<part4>[01]?\d\d?|2[0-4]\d|25[0-5])";
        Regex regex = new Regex( pattern );
        Match match = regex.Match( args[0] );

        MatchEvaluator eval =
            new MatchEvaluator( EntryPoint.IPReverse );
        Console.WriteLine( regex.Replace(args[0], eval) );
    }

    static string IPReverse( Match match ) {
        StringBuilder sb = new StringBuilder();
        sb.Append( match.Groups["part4"] + "." );
        sb.Append( match.Groups["part3"] + "." );
        sb.Append( match.Groups["part2"] + "." );
        sb.Append( match.Groups["part1"] );
        return sb.ToString();
    }
}
In the previous case, whenever a match is found, the delegate is called to determine what the replacement string should be. However, since all you’re doing is changing the order, the job is simple enough for what are called regular-expression substitutions. If you use the overload of Replace() that takes a replacement string rather than a MatchEvaluator delegate, as the xxx.xxx.xxx.xxx example did, you can achieve the same result, since the regex engine lets you reference the groups in the replacement string. To reference one of the named groups, you can use the syntax shown in the following example:
using System;
using System.Text;
using System.Text.RegularExpressions;

public class EntryPoint
{
    static void Main( string[] args ) {
        if( args.Length < 1 ) {
            Console.WriteLine( "You must provide a string." );
            return;
        }

        // Create regex to search for IP address pattern.
        string pattern = @"(?<part1>[01]?\d\d?|2[0-4]\d|25[0-5])\." +
                         @"(?<part2>[01]?\d\d?|2[0-4]\d|25[0-5])\." +
                         @"(?<part3>[01]?\d\d?|2[0-4]\d|25[0-5])\." +
                         @"(?<part4>[01]?\d\d?|2[0-4]\d|25[0-5])";
        Regex regex = new Regex( pattern );
        Match match = regex.Match( args[0] );

        string replace = @"${part4}.${part3}.${part2}.${part1}" +
                         @" (the reverse of $&)";
        Console.WriteLine( regex.Replace(args[0], replace) );
    }
}
To include one of the named groups, simply use the ${name} syntax, where name is the name of the group. You can also see that I reference the full text of the match using $&. Other substitution strings are available, such as $`, which substitutes all of the input text before the match, and $', which substitutes all text after the match. Others are documented in the MSDN documentation.
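To make the substitution strings a little more concrete, here is a small sketch that applies $`, $&, and $' to a short input, and also uses the numbered-group form $1 and $2. The class name SubstitutionSketch, its helper methods, and the sample inputs are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubstitutionSketch
{
    // Replace each match with a note showing the text before the match,
    // the match itself, and the text after the match.
    public static string Annotate( string input, string pattern ) {
        return Regex.Replace( input, pattern,
                              "[before=$` match=$& after=$']" );
    }

    // Swap two unnamed groups by referring to them by number.
    public static string SwapPair( string input ) {
        return Regex.Replace( input, @"(\d+)-(\d+)", "$2/$1" );
    }

    public static void Main() {
        // Prints: a[before=a match=b after=c]c
        Console.WriteLine( Annotate("abc", "b") );
        // Prints: 06/2005
        Console.WriteLine( SwapPair("2005-06") );
    }
}
```

Note that only the matched text is replaced, so the characters surrounding the match still appear in the result in addition to whatever $` and $' insert.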
As you can imagine, you can craft complex string-replacement capabilities using the regular-expression implementation within the .NET Framework, just as you can using Perl.
Regex Creation Options
One of the constructor overloads of Regex allows you to pass various options of type RegexOptions when creating a Regex instance. Likewise, the static versions of methods such as Match() and Replace() have overloads that accept RegexOptions flags. I’ll discuss some of the more commonly used options in this section, but for a description of all of the options and their behavior, consult the RegexOptions documentation on MSDN.
By default, regular expressions are interpreted at run time. Complex regular expressions can chew up quite a bit of processor time while the regex engine is processing them. For times like these, consider using the Compiled option. This option causes the regular expression to be represented internally by IL code that is JIT-compiled. This increases the latency for the first use of the regular expression, but if it’s used often, it will pay off in the end. Also, don’t forget that JIT-compiled code increases the working set of the application.
Many times, you’ll find it useful to do case-insensitive searches. You could accommodate that in the regular-expression pattern, but it makes your pattern much more difficult to read. It’s much easier to pass the IgnoreCase flag when creating the Regex instance. When you use this flag, the Regex engine will also take into account any culture-specific case-sensitivity issues by referencing the CultureInfo attached to the current thread. If you want to do case-insensitive searches in a culture-invariant way, combine the IgnoreCase flag with the CultureInvariant flag.
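The options discussed so far can be combined with the bitwise OR operator. The following sketch builds one reusable Regex with Compiled, IgnoreCase, and CultureInvariant; the class name OptionsSketch, the MentionsIPv4 helper, and the keyword pattern are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionsSketch
{
    // Hypothetical keyword pattern, chosen only for illustration.
    // IgnoreCase + CultureInvariant: case-insensitive matching that does
    // not depend on the current thread's culture.
    // Compiled: accept extra latency on first use in exchange for faster
    // repeated matching.
    private static readonly Regex keyword = new Regex(
        @"\bipv4\b",
        RegexOptions.IgnoreCase |
        RegexOptions.CultureInvariant |
        RegexOptions.Compiled );

    public static bool MentionsIPv4( string input ) {
        return keyword.IsMatch( input );
    }

    public static void Main() {
        Console.WriteLine( MentionsIPv4("Using IPv4 addressing") );
        Console.WriteLine( MentionsIPv4("Using IPv6 addressing") );
    }
}
```

Storing the instance in a static field means the cost of the Compiled option is paid once rather than on every call.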
The IgnorePatternWhitespace flag is also useful for complex regular expressions. This flag tells the regex engine to ignore any unescaped white space within the pattern and to treat everything from a # character to the end of the line as a comment. This provides a nifty way to comment regular expressions that are really complex. For example, check out the IP address search from the previous example rewritten using IgnorePatternWhitespace:
using System;
using System.Text.RegularExpressions;

public class EntryPoint
{
    static void Main( string[] args ) {
        if( args.Length < 1 ) {
            Console.WriteLine( "You must provide a string." );
            return;
        }

        // Create regex to search for IP address pattern.
        string pattern = @"
# First part match
([01]?\d\d?      # At least one digit,
                 # possibly prepended by 0 or 1
                 # and possibly followed by another digit
# OR
 |2[0-4]\d       # Starts with a 2, followed by a digit from 0-4
                 # and then any digit
# OR
 |25[0-5])       # 25 followed by a digit from 0-5
\.               # The whole group is followed by a period.
# REPEAT
([01]?\d\d?|2[0-4]\d|25[0-5])\.
# REPEAT
([01]?\d\d?|2[0-4]\d|25[0-5])\.
# REPEAT
([01]?\d\d?|2[0-4]\d|25[0-5])
";
        Regex regex = new Regex( pattern,
                        RegexOptions.IgnorePatternWhitespace );
        Match match = regex.Match( args[0] );
        while( match.Success ) {
            Console.WriteLine( "IP Address found at {0} with " +
                               "value of {1}",
                               match.Index, match.Value );

            match = match.NextMatch();
        }
    }
}
Notice how expressive you can be in the comments of your regular expression. Given how complex regular expressions can become, this is never a bad thing.
Summary
In this chapter, I’ve touched the tip of the iceberg on the string-handling capabilities of the .NET Framework and C#. Since the string type is such a widely used type, rather than merely include it in the base class library, the CLR designers chose to annex it into the set of built-in types. This is a good thing considering how common string usage is. Furthermore, the library provides a thorough implementation of culture-specific patterns, via CultureInfo, that you typically need when creating global software that deals with strings heavily.

I showed how you can create your own cultures easily using the CultureAndRegionInfoBuilder class. Essentially, any software that interacts directly with the user and is meant to be used on a global basis needs to be prepared to service locale-specific needs. Finally, I gave a brief tour of the regular-expression capabilities of the .NET Framework, even though a full treatment of the regular-expression language is outside the scope of this book. I think you’ll agree that the string and text-handling facilities built into the CLR, the .NET Framework, and the C# language are well-designed and easy to use.
In Chapter 9, I cover arrays and other, more versatile, collection types in the .NET Framework. Also, I spend a fair amount of time covering the new support for iterators available in C# 2.0.

CHAPTER 9
Arrays, Collection Types, and Iterators
Collection types have been around in various forms since the dawn of programming. I’m sure you remember the linked list exercises when you were learning to write programs. In this chapter, I’ll give a brief overview of arrays, but won’t go into much detail, as arrays have not changed much between .NET 1.1 and .NET 2.0.
However, I’ll spend more time explaining the major generic collection interfaces and iterators, which are new to C# 2.0, along with what sorts of cool things you can do with them. Traditionally, creating enumerators for collection types has been mundane and annoying. Iterators make this task a breeze, while making your code a lot more readable in the process.
Introduction to Arrays
C# arrays, as well as arrays in the CLR, are highly evolved from C/C++ arrays. In C/C++, you typically access an array by offsetting a pointer that points to the beginning of a contiguous range of items in a memory block somewhere. C/C++ arrays have no built-in range checking, which is the root of more bugs than you can shake a stick at. C# and the CLR solve this problem elegantly by making the array type a built-in, implicit type to the runtime.
When you declare a type—whether it’s a class or struct—the runtime reserves the right to silently generate an array type based upon that new type. The array type that it generates is a reference type—thus, all array instances are of class type. The reference type that it generates is derived from System.Array, and ultimately from System.Object. Therefore, you can treat all C# arrays polymorphically through a reference to System.Array. Of course, that means that each array, no matter what concrete type of array it is, implements all of the methods and properties of System.Array.
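As a brief sketch of that polymorphic treatment, the following method accepts any array through a System.Array reference and reports properties that every array type inherits. The class name ArrayBaseSketch and the Describe helper are assumptions made for illustration.

```csharp
using System;

public static class ArrayBaseSketch
{
    // Accepts any array, whatever its element type, through System.Array.
    public static string Describe( Array array ) {
        return array.GetType().Name + " with " + array.Length +
               " element(s), rank " + array.Rank;
    }

    public static void Main() {
        int[] numbers = new int[] { 1, 2, 3 };
        string[] names = new string[] { "a", "b" };

        // Prints: Int32[] with 3 element(s), rank 1
        Console.WriteLine( Describe(numbers) );
        // Prints: String[] with 2 element(s), rank 1
        Console.WriteLine( Describe(names) );
    }
}
```

Length and Rank are defined on System.Array itself, which is why the same method works for both the int[] and the string[] instance.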
The way that you declare an array within C# is similar to C/C++, except the designers of the language took the liberty to make the syntax a tad more intuitive in their minds, in that the square brackets in the declaration follow the type and not the array variable name. The following example shows three ways to create an array of integers and print them to the console:
using System;

public class EntryPoint
{
    static void Main() {
        int[] array1 = new int[ 10 ];
        for( int i = 0; i < array1.Length; ++i ) {
            array1[i] = i*2;
        }

        int[] array2 = new int[] { 2, 4, 6, 8 };