Beginning Regular Expressions 2005.pdf

Regular Expressions in Java

Then the value of the foundOrNot variable is tested as the condition controlling an if statement. If it is not true, the message No match found. is displayed:

    if (!foundOrNot) {
        System.out.println("No match found.");
    }
}

Finally, the tidyUp() method tidies up.

The pattern used is defined in the file Pattern.txt:

\d\w

The pattern matches a numeric digit followed by a word character (meaning an alphabetic character of either case, an underscore character, or a numeric digit).

The test string is located in the file TestText.txt:

3D 2A 5R

There are three matches for the pattern \d\w: 3D, 2A, and 5R.
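The three matches can be reproduced without the pattern and test files. The following is a minimal sketch, not the chapter's own program: the class name DigitWordDemo and the helper findAll() are hypothetical, and the pattern and test string are hard-coded rather than read from Pattern.txt and TestText.txt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the chapter's example code.
public class DigitWordDemo {

    // Collects every non-overlapping match of regex found in input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // The backslashes are doubled because this is a Java string literal.
        System.out.println(findAll("\\d\\w", "3D 2A 5R")); // prints [3D, 2A, 5R]
    }
}
```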

The Properties (Fields) of the Pattern Class

The following table summarizes information about the properties (fields) of the Pattern class.

Property (Field)     Description
CANON_EQ             Enables canonical equivalence when matching.
CASE_INSENSITIVE     Enables case-insensitive matching.
COMMENTS             Enables whitespace and comments to be included in the pattern.
DOTALL               With this flag set, the . (period) metacharacter matches all characters.
MULTILINE            Alters the behavior of the ^ (caret) and $ (dollar) positional metacharacters.
UNICODE_CASE         In this mode, case-insensitive matching is applied to all Unicode alphabetic characters (as appropriate).
UNIX_LINES           In this mode, only the \n line terminator affects the behavior of the . (period), ^ (caret), and $ (dollar) metacharacters.

The CASE_INSENSITIVE Flag

The CASE_INSENSITIVE flag applies only to U.S. ASCII characters. If you need case-insensitive matching to apply to other characters, set the UNICODE_CASE flag together with the CASE_INSENSITIVE flag.

The CASE_INSENSITIVE flag can also be specified using the embedded flag expression (?i).
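The two ways of requesting case-insensitive matching can be compared side by side. This is a small sketch under stated assumptions: the class CaseFlagDemo and its helper methods are hypothetical names introduced only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the chapter's example code.
public class CaseFlagDemo {

    // Case-sensitive match of the whole input (no flags).
    static boolean matchesDefault(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Case-insensitive match requested through the compile() flag argument.
    static boolean matchesWithFlag(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input).matches();
    }

    // Case-insensitive match requested through the embedded flag expression (?i).
    static boolean matchesEmbedded(String regex, String input) {
        return Pattern.compile("(?i)" + regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesDefault("java", "JAVA"));  // prints false
        System.out.println(matchesWithFlag("java", "JAVA")); // prints true
        System.out.println(matchesEmbedded("java", "JaVa")); // prints true
    }
}
```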


Chapter 25

Using the COMMENTS Flag

When the COMMENTS flag is set, it is possible to include whitespace in a regular expression pattern that is not matched against the test character sequence. In other words, whitespace included in a pattern is ignored, enabling the pattern (and the comments describing the meaning of the pattern’s components) to be displayed in a way that assists a human reader in reading and understanding it.

The # character marks the beginning of a comment. All characters from the # character to the end of the line are ignored (as far as matching is concerned) by the regular expression engine.

Comments mode can also be enabled using the embedded flag expression (?x).

The following example shows how comments can be used when attempting to match a U.S. Zip code when the Pattern.COMMENTS flag is set.

Try It Out

Using the COMMENTS Flag

1. Type the following code into a text editor:

import java.util.regex.*;

public class MatchZipComments {
    public static void main(String[] args) throws Exception {
        String myTestString = "12345-1234 23456 45678 01234-1234";
        // Attempt to match US Zip codes.
        // The pattern matches five numeric digits followed by a hyphen followed by four numeric digits.
        String myRegex = "\\d{5} " +
            "# Matches five numeric digits" +
            "\n(-\\d{4})* " +
            "# Matches four numeric digits and a hyphen, all of which are optional";
        Pattern myPattern = Pattern.compile(myRegex, Pattern.COMMENTS);
        Matcher myMatcher = myPattern.matcher(myTestString);
        String myMatch = "";
        System.out.println("The test string was '" + myTestString + "'.");
        // Display the pattern actually used, without its whitespace and comments.
        System.out.println("The pattern was '\\d{5}(-\\d{4})*'.");
        while (myMatcher.find()) {
            myMatch = myMatcher.group();
            System.out.println("A match '" + myMatch + "' was found.");
        } // end while
        // Compare string content with isEmpty() rather than reference identity with ==.
        if (myMatch.isEmpty()) {
            System.out.println("There were no matches.");
        } // end if
    } // end main()
}


2. Save the code as MatchZipComments.java. To compile it at the command line, type javac MatchZipComments.java.

3. Run the code. At the command line, type java MatchZipComments, and inspect the results, as shown in Figure 25-4.

Figure 25-4

How It Works

The variable myTestString is assigned a string that contains four character sequences that could be U.S. Zip codes:

String myTestString = "12345-1234 23456 45678 01234-1234";

Conventional Java comments can be used to indicate the purpose of the regular expression:

// Attempt to match US Zip codes.

Similarly, conventional Java comments can be used to specify how the pattern is constructed:

// The pattern matches five numeric digits followed by a hyphen followed by four numeric digits.

The Pattern.COMMENTS flag is set in the following statement; therefore, the value of the myRegex variable can be written across several lines, with comments interwoven between the components of the regular expression pattern. Notice that the comments follow the # character:

String myRegex = "\\d{5} " +
    "# Matches five numeric digits" +
    "\n(-\\d{4})* " +
    "# Matches four numeric digits and a hyphen, all of which are optional";

When the myPattern variable is assigned the result of the Pattern class's compile() method, the second argument of the compile() method, Pattern.COMMENTS, sets the COMMENTS flag. When the COMMENTS flag is set, whitespace inside the pattern is ignored, and characters from the # character to the end of the line are treated as comments:

Pattern myPattern = Pattern.compile(myRegex, Pattern.COMMENTS);

Matching takes place against the myTestString variable using the myPattern object’s matcher() method:

Matcher myMatcher = myPattern.matcher(myTestString);


There are four matches in the myTestString variable. Character sequences 12345-1234 and 01234-1234 match when the optional part of the pattern, (-\d{4})*, matches once; and 23456 and 45678 match when (-\d{4})* matches zero occurrences of the pattern.
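Because the optional group is quantified with *, it can also match more than once, which matters for input that is not a well-formed Zip code. The sketch below contrasts the pattern used above with a variant that uses the ? quantifier. The variant, the class name ZipQuantifierDemo, and the sample input are assumptions made for illustration and do not appear in the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the chapter's example code.
public class ZipQuantifierDemo {

    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String input = "12345-1234-5678"; // deliberately not a valid Zip code

        // * allows the hyphen-plus-four-digits group to repeat.
        System.out.println(firstMatch("\\d{5}(-\\d{4})*", input)); // prints 12345-1234-5678

        // ? allows the group zero times or once.
        System.out.println(firstMatch("\\d{5}(-\\d{4})?", input)); // prints 12345-1234
    }
}
```

For the test string used in the example, both quantifiers produce the same four matches, because no sequence there has more than one hyphenated group.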

The DOTALL Flag

By default, the . (period) metacharacter matches any character except a line terminator. When the DOTALL flag is set, the . (period) metacharacter matches all characters, including line terminators. In Java regular expressions, the term line terminator refers to any of the following characters (or combinations of characters):

\n A newline (linefeed) character

\r\n A carriage-return character followed immediately by a newline character

\r A carriage return not followed by a newline character

\u0085 A next-line character

\u2028 A line-separator character

\u2029 A paragraph-separator character

The DOTALL mode can also be specified using the embedded flag expression (?s).
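The effect of DOTALL is easiest to see with a newline between two literal characters. This is a minimal sketch; the class name DotallDemo and the pattern a.b are hypothetical and chosen only to isolate the behavior of the period.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the chapter's example code.
public class DotallDemo {

    // Tests whether the whole input matches regex under the given flags (0 = none).
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "a\nb";

        // Default mode: the period does not match the newline.
        System.out.println(matches("a.b", 0, input));              // prints false

        // DOTALL mode: the period matches the newline as well.
        System.out.println(matches("a.b", Pattern.DOTALL, input)); // prints true

        // The embedded flag expression (?s) has the same effect.
        System.out.println(matches("(?s)a.b", 0, input));          // prints true
    }
}
```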

The MULTILINE Flag

By default, the positional metacharacters ^ and $, respectively, match the position just before the first character in the test character sequence and the position just after the last character in the character sequence. When MULTILINE mode is specified, the ^ metacharacter matches the position just before the first character on each line, and the $ metacharacter matches the position just after the final character (ignoring line terminators) on each line.

The MULTILINE flag can also be specified using the embedded flag expression (?m).
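The difference between the two modes shows up when a pattern anchored with ^ and $ is applied to text that contains several lines. The following sketch assumes a hypothetical class MultilineDemo with a helper findAll(); the three-line test string is also invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the chapter's example code.
public class MultilineDemo {

    // Collects every match of regex in input under the given flags (0 = none).
    static List<String> findAll(String regex, int flags, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex, flags).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "one\ntwo\nthree";

        // Default mode: no run of word characters spans the whole input, so nothing matches.
        System.out.println(findAll("^\\w+$", 0, input));                 // prints []

        // MULTILINE mode: ^ and $ match at the start and end of each line.
        System.out.println(findAll("^\\w+$", Pattern.MULTILINE, input)); // prints [one, two, three]

        // The embedded flag expression (?m) has the same effect.
        System.out.println(findAll("(?m)^\\w+$", 0, input));             // prints [one, two, three]
    }
}
```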

The UNICODE_CASE Flag

The CASE_INSENSITIVE flag causes matching of U.S. ASCII characters to be carried out in a case-insensitive way. To use case-insensitive matching with other characters, set the UNICODE_CASE flag in addition to the CASE_INSENSITIVE flag. Using the UNICODE_CASE flag is likely to impose a performance penalty, so use it only when it is essential to the purpose of the regular expression.

The UNICODE_CASE flag can also be specified using the embedded flag expression (?u).
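A non-ASCII letter makes the interaction between the two flags visible. The sketch below uses the lowercase and uppercase forms of e with an acute accent, written as Unicode escapes so that the source file encoding does not matter. The class name UnicodeCaseDemo and the choice of test characters are assumptions for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the chapter's example code.
public class UnicodeCaseDemo {

    // Tests whether the whole input matches regex under the given flags (0 = none).
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lower = "\u00e9"; // e with acute accent, lowercase
        String upper = "\u00c9"; // E with acute accent, uppercase

        // CASE_INSENSITIVE alone covers only U.S. ASCII letters.
        System.out.println(matches(lower, Pattern.CASE_INSENSITIVE, upper)); // prints false

        // UNICODE_CASE alone does not switch on case-insensitive matching.
        System.out.println(matches(lower, Pattern.UNICODE_CASE, upper));     // prints false

        // Both flags together give Unicode-aware case-insensitive matching.
        System.out.println(matches(lower,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, upper));    // prints true

        // The embedded flag expressions can be combined as (?iu).
        System.out.println(matches("(?iu)" + lower, 0, upper));              // prints true
    }
}
```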

The UNIX_LINES Flag

The UNIX_LINES flag is set when you are dealing with multiline text originating from a Unix or related operating system where only the \n line terminator is used. Only \n is recognized as affecting the behavior of the . (period), ^ (caret), and $ (dollar) metacharacters.

The UNIX_LINES flag can also be specified using the embedded flag expression (?d).
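A carriage return that is not followed by a newline shows what UNIX_LINES changes. The following sketch assumes a hypothetical class UnixLinesDemo; the test strings are invented to isolate the behavior of the period and the caret.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the chapter's example code.
public class UnixLinesDemo {

    // Tests whether the whole input matches regex under the given flags (0 = none).
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Collects every match of regex in input under the given flags.
    static List<String> findAll(String regex, int flags, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex, flags).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Default mode treats \r as a line terminator, so the period does not match it.
        System.out.println(matches("a.b", 0, "a\rb"));                  // prints false

        // UNIX_LINES mode recognizes only \n, so the period matches \r.
        System.out.println(matches("a.b", Pattern.UNIX_LINES, "a\rb")); // prints true

        // The embedded flag expression (?d) has the same effect.
        System.out.println(matches("(?d)a.b", 0, "a\rb"));              // prints true

        // With MULTILINE, ^ matches after \r by default but not in UNIX_LINES mode.
        System.out.println(findAll("^\\w+", Pattern.MULTILINE, "one\rtwo"));  // prints [one, two]
        System.out.println(findAll("^\\w+",
                Pattern.MULTILINE | Pattern.UNIX_LINES, "one\rtwo"));         // prints [one]
    }
}
```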
