
Chapter 24: Regular Expressions in W3C XML Schema

Unicode Character Class   Description
-----------------------   --------------------------
Pd                        Dashes
Pe                        Closing punctuation
Pf                        Final quotes
Pi                        Initial quotes
Po                        Other forms of punctuation
Ps                        Opening punctuation
S                         Symbols
Sc                        Currency symbols
Sk                        Modifier symbols
Sm                        Mathematical symbols
So                        Other symbols
Z                         Separators
Zl                        Line breaks
Zp                        Paragraph breaks
Zs                        Spaces
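
One way to get a feel for these category codes outside a schema processor is to look them up with Python's standard unicodedata module. The following is only a sketch: the sample characters and the helper name category_of are chosen here for illustration and do not come from the chapter.

```python
import unicodedata

# Sample characters picked for illustration (not from the chapter).
samples = {
    "-": "hyphen-minus",
    "(": "left parenthesis",
    ")": "right parenthesis",
    "\u00ab": "left-pointing double angle quotation mark",
    "\u00bb": "right-pointing double angle quotation mark",
    "!": "exclamation mark",
    "$": "dollar sign",
    "^": "circumflex accent",
    "+": "plus sign",
    "\u00a9": "copyright sign",
    " ": "space",
    "\u2028": "line separator",
    "\u2029": "paragraph separator",
}


def category_of(ch):
    """Return the two-letter Unicode general category of one character."""
    return unicodedata.category(ch)


# Print the code point, a label, and the category code for each sample.
for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {category_of(ch)}")
```

Each sample falls into one of the two-letter classes listed in the table; the single-letter classes S and Z cover every two-letter class that starts with that letter.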

 

 

The following sections briefly illustrate the use of several Unicode character classes.

Matching Decimal Numbers

The Nd character class matches decimal digits. Given a simple document such as the following DocumentUnicode.xml, you can use that Unicode character class to specify the allowed values of the Section element’s number attribute:

<?xml version="1.0" encoding="UTF-8"?>
<Document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:noNamespaceSchemaLocation="C:\BRegExp\Ch24\DocumentUnicode.xsd">
  <Section number="1">Content</Section>
  <Section number="2">Content</Section>
  <Section number="3">Content</Section>
</Document>

The corresponding schema document, DocumentUnicode.xsd, uses the Nd Unicode character class:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="Document">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Section" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="Section">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attribute name="number" use="required">
            <xs:simpleType>
              <xs:restriction base="xs:NMTOKEN">
                <xs:pattern value="\p{Nd}"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>

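Notice that the value of the xs:pattern element’s value attribute is \p{Nd}, which specifies that the value of the Section element’s number attribute is a single decimal digit. Because the pattern has no quantifier, a two-digit value such as 10 would not be valid. The Nd class also covers decimal digits from scripts other than Latin, not only the characters 0 through 9.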
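
The effect of \p{Nd} can be approximated outside a schema processor. The following minimal sketch uses Python's unicodedata module as a stand-in; the function name matches_single_nd is hypothetical, and the check only imitates the pattern facet, not the xs:NMTOKEN base type.

```python
import unicodedata


def matches_single_nd(value):
    r"""Imitate the XSD pattern \p{Nd}: exactly one character of category Nd."""
    return len(value) == 1 and unicodedata.category(value) == "Nd"


print(matches_single_nd("1"))       # ASCII digit
print(matches_single_nd("12"))      # two digits: too long for the pattern
print(matches_single_nd("\u0663"))  # ARABIC-INDIC DIGIT THREE, also category Nd
print(matches_single_nd("\u00b2"))  # SUPERSCRIPT TWO is category No, not Nd
```

The last two calls show why Nd is not the same thing as the range 0 through 9: membership depends on the Unicode category of the character, not on its appearance.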

Mixing Unicode Character Classes with Other Metacharacters

It is possible to mix Unicode character classes with other metacharacters in the same regular expression. The following example illustrates how this can be done (in a rather contrived way) to match a U.S. Social Security number. The XML instance file, PersonsSSNUnicode.xml, is shown here:

<?xml version="1.0" encoding="UTF-8"?>
<PersonsSSN xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:noNamespaceSchemaLocation="C:\BRegExp\Ch24\PersonsSSNUnicode.xsd">
  <Person>
    <Name>Peter Schmidt</Name>
    <SSN>123-45-6789</SSN>
  </Person>
  <Person>
    <Name>Yasmin Brown</Name>
    <SSN>987-65-4321</SSN>
  </Person>
</PersonsSSN>

The corresponding W3C XML Schema document, PersonsSSNUnicode.xsd, is shown here:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="Name" type="xs:string"/>
  <xs:element name="Person">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Name"/>
        <xs:element ref="SSN"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="PersonsSSN">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Person" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="SSN">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:pattern value="\p{Nd}{3}-[0-9]{2}-\d{4}"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>

Notice the pattern specified as the value of the xs:pattern element’s value attribute. It uses three different ways of expressing numeric digits: a Unicode character class, a regular expression character class, and the metacharacter \d. The \p{Nd}{3} matches three numeric digits, using a Unicode character class. It is followed by a literal hyphen. Then [0-9]{2} uses a range in a conventional character class to match two numeric digits. Again, it is followed by a literal hyphen. Finally, the \d{4} matches four numeric digits. The three forms are not strictly interchangeable: in W3C XML Schema, \d is equivalent to \p{Nd} and so matches decimal digits from any script, whereas [0-9] matches only the ten ASCII digits.
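
The pattern can be tried outside a schema processor with a rough Python translation. This is a sketch under stated assumptions: Python's re module has no \p{Nd}, but in string patterns its \d matches any character in category Nd, so \d stands in for both \p{Nd} and the schema's \d. The names SSN_PATTERN and is_valid_ssn are hypothetical.

```python
import re

# \d stands in for \p{Nd} (first group) and for the XSD \d (last group);
# [0-9] is carried over unchanged and matches ASCII digits only.
SSN_PATTERN = re.compile(r"\d{3}-[0-9]{2}-\d{4}")


def is_valid_ssn(text):
    """Return True if the whole string matches the translated pattern."""
    # An xs:pattern must match the entire value, hence fullmatch.
    return SSN_PATTERN.fullmatch(text) is not None


print(is_valid_ssn("123-45-6789"))                 # value from the instance file
print(is_valid_ssn("123-45-678"))                  # last group too short
print(is_valid_ssn("\u0661\u0662\u0663-45-6789"))  # Arabic-Indic digits, first group
print(is_valid_ssn("123-\u0664\u0665-6789"))       # Arabic-Indic digits, middle group
```

The third value is accepted and the fourth is rejected, which shows the practical difference between the category-based forms and the literal range [0-9].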

Unicode Character Blocks

Unicode character blocks refer to blocks of Unicode characters that are relevant to a particular use. A Unicode character block may refer to a language or group of languages, or may refer to a specialized use, such as box drawing or geometric elements.

The following table illustrates some of the many Unicode character blocks available for use.

Block Name                  Start Code   End Code
-------------------------   ----------   --------
BasicLatin                  #x0000       #x007F
Latin-1Supplement           #x0080       #x00FF
LatinExtended-A             #x0100       #x017F
Cyrillic                    #x0400       #x04FF
Hebrew                      #x0590       #x05FF
Arabic                      #x0600       #x06FF
Greek                       #x0370       #x03FF
Cherokee                    #x13A0       #x13FF
SuperscriptsAndSubscripts   #x2070       #x209F
MathematicalOperators       #x2200       #x22FF
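
Since a block is simply a contiguous range of code points, membership is a range check. The sketch below copies the ranges from the table; block names are written without spaces, as they would follow the Is prefix in a pattern, and the helper name block_of is hypothetical.

```python
# Block ranges copied from the table above.
BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Latin-1Supplement": (0x0080, 0x00FF),
    "LatinExtended-A": (0x0100, 0x017F),
    "Greek": (0x0370, 0x03FF),
    "Cyrillic": (0x0400, 0x04FF),
    "Hebrew": (0x0590, 0x05FF),
    "Arabic": (0x0600, 0x06FF),
    "Cherokee": (0x13A0, 0x13FF),
    "SuperscriptsAndSubscripts": (0x2070, 0x209F),
    "MathematicalOperators": (0x2200, 0x22FF),
}


def block_of(ch):
    """Return the name of the listed block that contains ch, or None."""
    code_point = ord(ch)
    for name, (start, end) in BLOCKS.items():
        if start <= code_point <= end:
            return name
    return None


print(block_of("u"))       # U+0075
print(block_of("\u00fc"))  # U+00FC, u with diaeresis
print(block_of("\u0416"))  # U+0416, a Cyrillic capital letter
```

A block says nothing about what kind of character is found in it: BasicLatin contains letters, digits, punctuation, and control characters alike.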

 

 

 


Using Unicode Character Blocks

This example illustrates the effect of combining a Unicode character block with a Unicode character class.

Try It Out

Using a Unicode Character Block

1. Type the following XML markup or open the file WordUnicode.xml in the code download:

<?xml version="1.0" encoding="UTF-8"?>
<Word xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="C:\BRegExp\Ch24\WordUnicode.xsd">Führer</Word>

2. Type the following W3C XML Schema document or open the file WordUnicode.xsd in the code download:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="Word" type="UnicodeType"/>
  <xs:simpleType name="UnicodeType">
    <xs:restriction base="xs:string">
      <xs:pattern value="\w+"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>

3. Validate WordUnicode.xml against WordUnicode.xsd. Figure 24-7 shows the result of validating in XMLSpy. As you can see in the lower part of the figure, the XML instance document is valid according to its associated schema document, because \w+ matches one or more word characters and ü counts as a word character.

4. Type the following XML markup or open the file WordUnicode2.xml in the code download:

<?xml version="1.0" encoding="UTF-8"?>
<Word xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="C:\BRegExp\Ch24\WordUnicode2.xsd">Führer</Word>

5. Type the following W3C XML Schema document or open the file WordUnicode2.xsd in the code download:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="Word" type="UnicodeLetterType"/>
  <xs:simpleType name="UnicodeLetterType">
    <xs:restriction base="xs:string">
      <xs:pattern value="\p{L}+"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>


Figure 24-7

6. Validate WordUnicode2.xml against WordUnicode2.xsd. Figure 24-8 shows the result. The processor attempts to match Führer against the pattern \p{L}+, which matches one or more Unicode letters. There is a match.

Next, attempt to match the word Führer against Basic Latin letters only. It won’t match, because the character ü is U+00FC, which is outside the range U+0000 to U+007F covered by the BasicLatin block.

7. Type the following XML markup or open the file WordUnicode3.xml in the code download:

<?xml version="1.0" encoding="UTF-8"?>
<Word xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="C:\BRegExp\Ch24\WordUnicode3.xsd">Führer</Word>


Figure 24-8

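8. Type the following W3C XML Schema document or open the file WordUnicode3.xsd in the code download:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="Word" type="UnicodeBasicLatinType"/>
  <xs:simpleType name="UnicodeBasicLatinType">
    <!-- The outer restriction derives from the nested anonymous type,
         so it carries no base attribute of its own. -->
    <xs:restriction>
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:pattern value="\p{IsBasicLatin}+"/>
        </xs:restriction>
      </xs:simpleType>
      <xs:pattern value="\p{L}+"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>

Notice how the schema specifies the intersection of the Unicode character class \p{L} and the Unicode character block \p{IsBasicLatin}. The inner restriction allows only characters from the BasicLatin block, and the outer restriction additionally requires every character to be a letter. A value must satisfy the patterns of both derivation steps, so Führer is not valid against this schema: ü is a letter, but it lies outside the BasicLatin block. An xs:restriction element may have either a base attribute or a nested xs:simpleType child, but not both, which is why the outer restriction has no base attribute.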
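
The three checks in this walkthrough can be imitated character by character. The following is a sketch, not what a schema processor does internally: Python's re module lacks \p{...}, so the helpers below (all hypothetical names) use unicodedata categories and a code point range instead. The definition of \w follows the W3C XML Schema rule that it matches every character except punctuation, separators, and the "other" categories.

```python
import unicodedata


def is_word_char(ch):
    """XSD \\w: any character whose category is not P*, Z*, or C*."""
    return unicodedata.category(ch)[0] not in "PZC"


def is_letter(ch):
    """\\p{L}: any character whose category starts with L."""
    return unicodedata.category(ch).startswith("L")


def is_basic_latin(ch):
    """\\p{IsBasicLatin}: code points U+0000 through U+007F."""
    return ord(ch) <= 0x007F


def matches_all(value, *checks):
    """True if value is non-empty and every character passes every check."""
    return bool(value) and all(check(ch) for ch in value for check in checks)


word = "F\u00fchrer"  # the element content used in the walkthrough
print(matches_all(word, is_word_char))               # like \w+
print(matches_all(word, is_letter))                  # like \p{L}+
print(matches_all(word, is_letter, is_basic_latin))  # intersection of both
```

The first two calls report a match and the third does not, mirroring the results described for the three schema documents. Replacing the ü with a plain u makes all three checks pass.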
