Beginning Regular Expressions 2005.pdf


Chapter 24: Regular Expressions in W3C XML Schema

Unicode Character Class	Description

Pd	Dashes
Pe	Closing punctuation
Pf	Final quotes
Pi	Initial quotes
Po	Other forms of punctuation
Ps	Opening punctuation
S	Symbols
Sc	Currency symbols
Sk	Modifier symbols
Sm	Mathematical symbols
So	Other symbols
Z	Separators
Zl	Line breaks
Zp	Paragraph breaks
Zs	Spaces

The following sections briefly illustrate the use of several Unicode character classes.

Matching Decimal Numbers

The Nd character class matches decimal digits. Given a simple document such as the following DocumentUnicode.xml, you can use that Unicode character class to specify the allowed values of the Section element’s number attribute:

<?xml version="1.0" encoding="UTF-8"?>
<Document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:noNamespaceSchemaLocation="C:\BRegExp\Ch24\DocumentUnicode.xsd">
  <Section number="1">Content</Section>
  <Section number="2">Content</Section>
  <Section number="3">Content</Section>
</Document>

The corresponding schema document, DocumentUnicode.xsd, uses the Nd Unicode character class:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="Document">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Section" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="Section">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attribute name="number" use="required">
            <xs:simpleType>
              <xs:restriction base="xs:NMTOKEN">
                <xs:pattern value="\p{Nd}"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>

Notice that the value of the xs:pattern element’s value attribute is \p{Nd}, which specifies that the value of the Section element’s number attribute must be a single decimal digit.
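Because \p{Nd} carries no quantifier, a section number such as 10 would be rejected. The following is a minimal sketch of a variant that accepts one or more decimal digits. The type name SectionNumberType is a hypothetical name introduced here for illustration; it is not part of the code download.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- SectionNumberType is a hypothetical named type, used only for illustration -->
  <xs:simpleType name="SectionNumberType">
    <xs:restriction base="xs:NMTOKEN">
      <!-- One or more characters from the Unicode category Nd (decimal digits) -->
      <xs:pattern value="\p{Nd}+"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

Under that assumption, the number attribute could reference the type with type="SectionNumberType" instead of declaring an anonymous xs:simpleType inline.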

Mixing Unicode Character Classes with Other Metacharacters

It is possible to mix Unicode character classes with other metacharacters in the same regular expression. The following example illustrates how this can be done (in a rather contrived way) to match a U.S. Social Security number. The XML instance file, PersonsSSNUnicode.xml, is shown here:

<?xml version="1.0" encoding="UTF-8"?>
<PersonsSSN>
  <Person>
    <Name>Peter Schmidt</Name>
    <SSN>123-45-6789</SSN>
  </Person>
  <Person>
    <Name>Yasmin Brown</Name>
    <SSN>987-65-4321</SSN>
  </Person>
</PersonsSSN>

The corresponding W3C XML Schema document, PersonsSSNUnicode.xsd, is shown here:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="Name" type="xs:string"/>
  <xs:element name="Person">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Name"/>
        <xs:element ref="SSN"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="PersonsSSN">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Person" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="SSN">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:pattern value="\p{Nd}{3}-[0-9]{2}-\d{4}"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>

Notice the pattern specified as the value of the xs:pattern element’s value attribute. It uses three different ways of expressing numeric digits: a Unicode character class, a regular expression character class, and the metacharacter \d. The \p{Nd}{3} matches three numeric digits, using a Unicode character class. It is followed by a literal hyphen. Then [0-9]{2} uses a range in a conventional character class to match two numeric digits. Again, it is followed by a literal hyphen. Finally, the \d{4} matches four numeric digits.
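The three forms are not strictly interchangeable. In W3C XML Schema, \d is defined as equivalent to \p{Nd}, so both accept decimal digits from any script, whereas [0-9] accepts only the ten ASCII digits. If only ASCII digits should be allowed, a sketch along the following lines could be used. The type name AsciiSSNType is hypothetical and not part of the code download.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- AsciiSSNType is a hypothetical named type, used only for illustration -->
  <xs:simpleType name="AsciiSSNType">
    <xs:restriction base="xs:string">
      <!-- Three, two, and four ASCII digits separated by literal hyphens -->
      <xs:pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

With this type, a value such as 123-45-6789 matches, while a value written with digits from another script (for example, Arabic-Indic digits) does not.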

Unicode Character Blocks

Unicode character blocks refer to blocks of Unicode characters that are relevant to a particular use. A Unicode character block may refer to a language or group of languages, or may refer to a specialized use, such as box drawing or geometric elements.

The following table illustrates some of the many Unicode character blocks available for use.

Block Name	Start Code	End Code

BasicLatin	#x0000	#x007F
Latin-1Supplement	#x0080	#x00FF
LatinExtended-A	#x0100	#x017F
Greek	#x0370	#x03FF
Cyrillic	#x0400	#x04FF
Hebrew	#x0590	#x05FF
Arabic	#x0600	#x06FF
Cherokee	#x13A0	#x13FF
SuperscriptsandSubscripts	#x2070	#x209F
MathematicalOperators	#x2200	#x22FF
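In a pattern, a block is referenced by prefixing its name with Is inside \p{...}, and \P{...} matches any character outside the block. The following sketch restricts a value to one or more characters from the Cyrillic block. The type name CyrillicWordType is a hypothetical name chosen for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- CyrillicWordType is a hypothetical named type, used only for illustration -->
  <xs:simpleType name="CyrillicWordType">
    <xs:restriction base="xs:string">
      <!-- One or more characters in the range #x0400 to #x04FF -->
      <xs:pattern value="\p{IsCyrillic}+"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

Note that a block escape tests only the code point range, so \p{IsCyrillic} accepts every assigned character in the block, not just letters.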


Using Unicode Character Blocks

This example illustrates the effect of combining a Unicode character block with a Unicode character class.

Try It Out

Using a Unicode Character Block

1. Type the following XML markup or open the file WordUnicode.xml in the code download:

<?xml version="1.0" encoding="UTF-8"?>
<Word xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="C:\BRegExp\Ch24\WordUnicode.xsd">Führer</Word>

2. Type the following W3C XML Schema document or open the file WordUnicode.xsd in the code download:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="Word" type="UnicodeType"/>
  <xs:simpleType name="UnicodeType">
    <xs:restriction base="xs:string">
      <xs:pattern value="\w+"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>

3. Attempt to validate WordUnicode.xml against WordUnicode.xsd. Figure 24-7 shows the result of validating in XMLSpy. As you can see in the lower part of the figure, the XML instance document is valid according to its associated schema document.

4. Type the following XML markup or open the file WordUnicode2.xml in the code download:

<?xml version="1.0" encoding="UTF-8"?>
<Word xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="C:\BRegExp\Ch24\WordUnicode2.xsd">Führer</Word>

5. Type the following W3C XML Schema document or open the file WordUnicode2.xsd in the code download:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="Word" type="UnicodeLetterType"/>
  <xs:simpleType name="UnicodeLetterType">
    <xs:restriction base="xs:string">
      <xs:pattern value="\p{L}+"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>


Figure 24-7

6. Attempt to validate WordUnicode2.xml against WordUnicode2.xsd. Figure 24-8 shows the screen’s appearance. This attempts to match Führer against the pattern \p{L}+, which matches one or more Unicode letters. There is a match.

Next, attempt to match the word Führer against Basic Latin letters. It won’t match, because the character ü is U+00FC, which is outside the range U+0000 to U+007F covered by the BasicLatin block.

7. Type the following XML markup or open the file WordUnicode3.xml in the code download:

<?xml version="1.0" encoding="UTF-8"?>
<Word xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="C:\BRegExp\Ch24\WordUnicode3.xsd">Führer</Word>


Figure 24-8

8. Type the following W3C XML Schema document or open the file WordUnicode3.xsd in the code download:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="Word" type="UnicodeBasicLatinType"/>
  <xs:simpleType name="UnicodeBasicLatinType">
    <!-- The outer restriction has no base attribute because its base type
         is the anonymous simple type nested inside it -->
    <xs:restriction>
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <!-- Every character must lie in the BasicLatin block -->
          <xs:pattern value="\p{IsBasicLatin}+"/>
        </xs:restriction>
      </xs:simpleType>
      <!-- Every character must also be a Unicode letter -->
      <xs:pattern value="\p{L}+"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>

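Notice how the schema specifies the intersection of the Unicode character class \p{L} and the Unicode character block \p{IsBasicLatin}. The inner restriction applies one pattern and the outer restriction applies the other, so a value is valid only if it satisfies both.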
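The same intersection can also be written as a single pattern by using character class subtraction, which the W3C XML Schema regular expression syntax defines. The following is a sketch only: the type name BasicLatinLetterType is hypothetical, and support for subtraction should be checked in the validator you use.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- BasicLatinLetterType is a hypothetical named type, used only for illustration -->
  <xs:simpleType name="BasicLatinLetterType">
    <xs:restriction base="xs:string">
      <!-- All Unicode letters, minus every character outside the BasicLatin block -->
      <xs:pattern value="[\p{L}-[\P{IsBasicLatin}]]+"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

Here \P{IsBasicLatin} (uppercase P) is the complement of the block, so subtracting it from \p{L} leaves only the letters that fall inside BasicLatin.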
