Добавил:
Upload Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:
Oracle Database 11g.pdf
Скачиваний:
78
Добавлен:
10.06.2015
Размер:
12.69 Mб
Скачать

Using Linguistic Sorts and Searches 

657

Using Linguistic Sorts and Searches

Different languages follow different rules when it comes to sorting text. Unfortunately, that means that there is no “one-size-fits-all” algorithm that can be used. Instead, Oracle’s global support functionality allows not only binary sorting methods but also linguistic sorting and searching methodologies to provide the flexibility to support the needs of many languages.

In the following sections, you will learn the methods that Oracle uses when it performs text-sorting operations. You will also learn about the different NLS parameters that impact linguistic sorting and searching. Next, you will learn about the different types of linguistic sorts supported by Oracle. Last, you will learn about linguistic text searches.

An Overview of Text Sorting

There are many ways in which text sorting can be accomplished. Sort order can be case sensitive or case can be ignored. Diacritics (accent marks) can be considered or ignored. Sorting can be done phonetically or based on the appearance of the character.

Some languages even consider groupings of characters to have a specific sort order. For example, traditional Spanish treats ch as a character that sorts after the letter C. Therefore, when sorting the words cat, dog, cow, and chinchilla, the correct sort sequence would be cat, cow, chinchilla, and dog.

To support these different sorting methods, Oracle offers two basic categories of sorting: binary and linguistic.

Binary Sorts

Binary sorts are the fastest and most efficient sorting method offered by Oracle. However, they are also the most limited. Binary sorts perform a numeric sort based on the encoded value (in the character set) for each character. As long as the character set encodes all of the characters in the proper sort order, this method works very well. The performance is also exceptional.

For example, in the US7ASCII character set, alphabetical characters are encoded as shown in Table 13.4. Note that this is just a subset of the character set.

Ta b l e 13 . 4   US7ASCII​ Alphabetical Characters

Char

Value

Char

Value

Char

Value

Char

Value

 

 

 

 

 

 

 

 

A

65

N

78

a

97

n

110

B

66

O

79

b

98

o

111

C

67

P

80

c

99

p

112

 

 

 

 

 

 

 

 

658

Chapter 13  n  Implementing Globalization Support

 

 

Ta b l e 13 . 4   US7ASCII​

Alphabetical Characters  (continued)

 

 

 

 

 

 

 

 

 

 

Char

Value

Char

Value

Char

Value

Char

Value

 

 

 

 

 

 

 

 

D

68

Q

81

d

100

q

113

E

69

R

82

e

101

r

114

F

70

S

83

f

102

s

115

G

71

T

84

g

103

t

116

H

72

U

85

h

104

u

117

I

73

V

86

i

105

v

118

J

74

W

87

j

106

w

119

K

75

X

88

k

107

x

120

L

76

Y

89

l

108

y

121

M

77

Z

90

m

109

z

122

 

 

 

 

 

 

 

 

 

Because the encoded values ascend in correlation with the characters, a binary sort will always perform a proper alphabetical sort. In addition, uppercase characters will always sort higher than lowercase characters.

However, different languages may share the same alphabet (and therefore, the same character set) yet utilize a sort order that deviates from the encoding order of the character set. In this situation, binary sorting will fail to produce an acceptable result.

Binary sorting is Oracle’s default sorting method.

Linguistic Sorts

Linguistic sorts, unlike binary sorts, operate independently of the underlying encoded values. Instead, they allow character data to be sorted based on the rules of specific languages. Linguistic sorts offer the flexibility to deal with the caveats imposed by different languages. Oracle provides a rich set of linguistic sort definitions that cover most of the languages

of the world. However, it also provides the ability to define new definitions or to modify existing definitions. The Oracle Locale Builder, a graphical tool that ships with Oracle 11g, can be used to view, create, and modify sort definitions and other locale definitions. However, as mentioned earlier, Locale Builder is not covered in this book.

Using Linguistic Sorts and Searches 

659

Linguistic sorts are defined using a variety of rules that govern the sorting process. The following elements are available and can be used to create a comprehensive linguistic sort:

Base letters    Base letters are the individual letters upon which other characters are based. The derived characters would map back to the base letter to determine the sorting value. For example, the character A is a base letter, while À, Á, Ã, Ä, a, à, á, and ä would all map to A as a base letter.

Ignorable characters    Ignorable characters, just as the name implies, can be defined as having no effect on sort order. Diacritics, such as the umlaut, can be classified as ignorable, as can certain punctuation characters, such as the hyphen. Therefore, a word such as e mail would be sorted the same as email.

Contracting characters    Contracting characters represent two or more characters that are treated linguistically as a single character. An example—the traditional Spanish ch string— was explained earlier in this section. Contracting characters require flexibility in the sorting algorithm to read ahead to the next character to determine if a contracting character has been found.

Expanding characters    With some locales, repeating or commonly occurring strings of characters are compressed into a single character for brevity’s sake. However, the character needs to sort as if all characters are present. These are referred to as expanding characters.

For example, the ö character should be treated as the string oe for sorting purposes.

Context-sensitive characters    With certain languages, characters are sorted differently based upon their relationship to other characters. These are generally characters that modify the preceding character. For example, a Japanese length mark is sorted according to the vowel that precedes it.

Canonical equivalence    When a Unicode character set is used, the character ö and the string can be considered equal from a sorting perspective. This is because the code points of the two-character string match the code point of the individual character. The two are said to have canonical equivalence.

In situations of canonical equivalence, the value of the CANONICAL_EQUIVALENCE linguistic flag (with a value of TRUE or FALSE) determines whether the rules of canonical equivalence should be followed.

Reverse secondary sorting    In some languages, strings containing diacritics will be sorted from left to right on the base characters and then from right to left on the diacritics. This is referred to as reverse secondary sorting. For example, resumé would sort before résume because of the position of the diacritic from left to right.

The REVERSE_SECONDARY=TRUE linguistic flag enables this functionality.

Character rearrangement    In certain languages (notably Thai and Laotian dialects), sorting rules declare that certain characters should switch places with the following character before sorting. This generally happens when a consonant is preceded by a vowel sound. In this case, the consonant is given priority, forcing the characters to be switched before the sort begins.

660  Chapter 13  n  Implementing Globalization Support

The SWAP_WITH_NEXT linguistic flag can determine whether character rearrangement will occur within a sort definition.

Don’t confuse linguistic flags with Oracle parameter settings. Linguistic flags are defined for specific sort-order definitions when they are created.

These different sorting methods represent the toolset available to sort and search text. Different languages may use one or more of the linguistic methods listed here. But as long as the rules of a language can be described using these methods, Oracle is able to perform linguistic sorts, either through an existing sort definition or through a custom sort definition.

Using Linguistic Sort Parameters

Linguistic sorts are generally applicable to a specific language or to a specific character set. And, as mentioned previously, there are many predefined linguistic sort definitions that may be utilized. Therefore, it is unlikely that you would ever need to define your own.

You can instruct Oracle to utilize specific linguistic sorts by setting the appropriate NLS sort parameters. In this section, you will learn about the NLS_SORT and NLS_COMP parameters and how they affect linguistic sorting operations.

NLS_SORT

The NLS_SORT parameter defines which type of sorting—binary or linguistic—should be performed for SQL sort operations. By default, the value for NLS_SORT is the default sort method defined for the language identified in the NLS_LANGUAGE parameter. For example, if the NLS_ LANGUAGE parameter is set to AMERICAN, the default value for the NLS_SORT parameter will be

BINARY.

To instruct Oracle to use linguistic sorting, this parameter can be set to the name of any valid linguistic sort definition, as shown here:

SQL> alter session set NLS_SORT = German;

Session altered.

A list of valid sort definition names is shown here. You could also query the V$NLS_ VALID_VALUES view (as shown earlier in this chapter).

ARABIC

GERMAN

SWISS

ARABIC_ABJ_MATCH

GERMAN_DIN

TCHINESE_RADICAL_M

ARABIC_ABJ_SORT

GREEK

TCHINESE_STROKE_M

ARABIC_MATCH

HEBREW

THAI_DICTIONARY

ASCII7

HKSCS

THAI_M

 

Using Linguistic Sorts and Searches 

661

AZERBAIJANI

HUNGARIAN

THAI_TELEPHONE

 

BENGALI

ICELANDIC

TURKISH

 

BIG5

INDONESIAN

UKRAINIAN

 

BINARY

ITALIAN

UNICODE_BINARY

 

BULGARIAN

JAPANESE

VIETNAMESE

 

CANADIAN FRENCH

JAPANESE_M

WEST_EUROPEAN

 

CANADIAN_M

KOREAN_M

XAZERBAIJANI

 

CATALAN

LATIN

XCATALAN

 

CROATIAN

LATVIAN

XCROATIAN

 

CZECH

LITHUANIAN

XCZECH

 

CZECH_PUNCTUATION

MALAY

XCZECH_PUNCTUATION

 

DANISH

NORWEGIAN

XDANISH

 

DANISH_M

POLISH

XDUTCH

 

DUTCH

PUNCTUATION

XFRENCH

 

EBCDIC

ROMANIAN

XGERMAN

 

EEC_EURO

RUSSIAN

XGERMAN_DIN

 

EEC_EUROPA3

SCHINESE_PINYIN_M

XHUNGARIAN

 

ESTONIAN

SCHINESE_RADICAL_M

XPUNCTUATION

 

FINNISH

SCHINESE_STROKE_M

XSLOVAK

 

FRENCH

SLOVAK

XSLOVENIAN

 

FRENCH_M

SLOVENIAN

XSPANISH

 

GBK

SPANISH

XSWISS

 

GENERIC_BASELETTER

SPANISH_M

XTURKISH

 

GENERIC_M

SWEDISH

XWEST_EUROPEAN

 

By using the NLS_SORT parameter, you can make the following changes to Oracle’s default functionality:

NN

Set the default sort method for all ORDER BY operations.

NN

Set the default sort value for the NLSSORT function.

662  Chapter 13  n  Implementing Globalization Support

It is important to note that not all SQL functionality supports linguistic sorting. Certain functions support only binary sorts. However, most of the commonly used methods are supported. Also, all NLS-specific SQL functions (for example, NLSSORT) will support linguistic sorts.

The methods listed here support linguistic sorting:

NN ORDER BY

NN BETWEEN

NN CASE WHEN

NN HAVING

NN IN/OUT

NN START WITH

NN WHERE

By default, all of these operations will perform binary sorts. When you set the NLS_SORT parameter, SQL statements using the WHERE operation will perform linguistic sorts by default. The following example demonstrates the use of the NLS_SORT parameter. Initially, you can see that your session has no value set for NLS_SORT. Therefore, it will inherit the default

setting from the NLS_LANGUAGE parameter (BINARY).

SQL> show parameters NLS_LANGUAGE

NAME

TYPE

VALUE

------------------------------------

-----------

----------

nls_language

string

AMERICAN

SQL> show parameters NLS_SORT

 

 

NAME

TYPE

VALUE

------------------------------------

-----------

----------

nls_sort

string

 

As you learned earlier, the default sort for the language AMERICAN is BINARY. The default setting for NLS_COMP is BINARY as well. Therefore, you can expect that, by default, Oracle will perform a binary sort unless otherwise specified, as shown here:

SQL> select * from sort_test order by name;

NAME

--------------------------------

Finsteraarhornhutte Grünhornlücke einschließlich

Using Linguistic Sorts and Searches 

663

finsteraarhornhütte

grünhornlücke

5 rows selected.

As expected, Oracle sorted the rows based on the encoded value of the characters in the US7ASCII character set. Therefore, all uppercase characters sort before lowercase characters.

Because the words in this table are of German origin, it is logical that you might want to sort them using a German sorting method. To change the sorting method for a GROUP BY clause, the NLS_SORT parameter can be set as shown here:

SQL> alter session set NLS_SORT=German_din;

Session altered.

SQL> select * from sort_test order by name;

NAME

--------------------------------------------

einschließlich Finsteraarhornhutte finsteraarhornhütte Grünhornlücke grünhornlücke

5 rows selected.

As you can see, setting the NLS_SORT parameter changed the default sorting method to a linguistic sort instead of a binary sort.

In the next step, another query is executed, this time using a WHERE condition rather than an ORDER BY clause:

SQL> select * from sort_test

2where name > ‘einschließlich’;

NAME

---------------------------------------

finsteraarhornhütte

grünhornlücke

2 rows selected.

664  Chapter 13  n  Implementing Globalization Support

The result of this query might not be what you expect. Instead of the expected four rows (which a linguistic sort would have returned), only two rows are returned, indicating that a binary sort took place instead.

Remember, the NLS_SORT parameter overrides the default sorting method for ORDER BY operations and for the NLSSORT function, but it has no effect on other sort operations, such as WHERE conditions. Therefore, this query ignored the parameter entirely.

To perform a linguistic sort, you call the NLSSORT function. Normally, the function would be called like this:

SQL> select * from sort_test

where nlssort(name, ‘NLS_SORT=German_din’) > nlssort(‘einschließlich’,’NLS_SORT=German_din’);

NAME

--------------------------------------------------

Finsteraarhornhutte finsteraarhornhütte Grünhornlücke grünhornlücke

4 rows selected.

However, because the NLS_SORT parameter defines the default sort for the NLSSORT function, specifying the sort inside the function is unnecessary. The following method works as well:

SQL> select * from sort_test

where nlssort(name) > nlssort(‘einschließlich’);

NAME

--------------------------------------------------

Finsteraarhornhutte finsteraarhornhütte Grünhornlücke grünhornlücke

4 rows selected.

As you can see, in both queries the sort was performed linguistically and returned the expected rows.

The NLS_SORT parameter is very limited in relation to linguistic sorting. The NLS_COMP parameter, on the other hand, makes linguistic sorting much easier.

Using Linguistic Sorts and Searches 

665

NLS_COMP

The NLS_COMP parameter works in conjunction with the NLS_SORT parameter to make linguistic sorts easier to use. When the NLS_COMP parameter is set to a value of ANSI, all of the following SQL operations will default to linguistic sorting (using the language specified in NLS_SORT parameter):

NN ORDER BY

NN BETWEEN

NN CASE WHEN

NN HAVING

NN IN/OUT

NN START WITH

NN WHERE

Setting the NLS_COMP parameter makes it unnecessary to call the NLSSORT function when using these sort operations. As you can guess, this makes linguistic sorting much easier to perform.

The NLS_COMP parameter can be set to either BINARY (the default) or ANSI. The following example shows the usage of the NLS_COMP parameter:

SQL> alter session set NLS_SORT=German_din;

Session altered.

SQL> alter session set NLS_COMP=ANSI;

Session altered.

SQL> select * from sort_test

2where name > ‘einschließlich’;

NAME

------------------------------------------

Finsteraarhornhutte finsteraarhornhütte Grünhornlücke grünhornlücke

4 rows selected.

As you can see, the query performed the linguistic sort and returned the expected results, even without using the NLSSORT function.

666  Chapter 13  n  Implementing Globalization Support

If the NLS_COMP parameter is set back to BINARY, binary sorting occurs once again:

SQL> alter session set NLS_COMP=BINARY;

Session altered.

SQL> select * from sort_test

2where name > ‘einschließlich’;

NAME

--------------------------------------------

finsteraarhornhütte

grünhornlücke

2 rows selected.

Linguistic Sort Types

When performing linguistic sorts, Oracle uses different methodologies, depending upon the number of languages involved in the sort. If character data in only one language is being sorted, this is classified as a monolingual linguistic sort. If more than one language is involved in the sort, it is classified as a multilingual linguistic sort.

In the following sections, you will learn the methodology that Oracle implements in performing both monolingual and multilingual linguistic sorts. You will also learn about accent-insensitive and case-insensitive linguistic sorts.

Monolingual Linguistic Sorts

When dealing with only a single language inside a sort, Oracle performs a two-step process to compare character strings. First, the major value of the strings is compared. Next, if necessary, the minor value of the strings is compared.

Major and minor values are sort values assigned to letters in the character set. A base letter and those derived from it will generally share a common major value, but they will have different minor values based on the desired sort order.

Here’s an example:

Letter

Major Value

Minor Value

A

30

10

A

30

20

Ä

30

30

Ä

30

40

B

40

10

Using Linguistic Sorts and Searches 

667

The example shows that all four variations of the letter A have identical major values, but differing minor values. When two letters share a major value, the minor value will determine the sort order.

The major-value numbers are assigned to a Unicode code point (a 16-bit binary value that defines a character in a Unicode character set).

Multilingual Linguistic Sorts

Multilingual sorts offer the ability to sort mixed languges within the same sort. For example, if you have a table that stores names in both English and Spanish, a multilingual sort should be used.

Multilingual sorts perform three levels of evaluation: primary, secondary, and tertiary. Primary sorts assign a primary sort value based on the base letter of each character (dia-

critics and case are ignored). If a character is defined as ignorable, it is assigned a primary value of zero.

Secondary-level sorts consider diacritics to differentiate accented letters from base letters in assigning a secondary sort level. For example, A and ä share the same base letter, so they have the same primary sort level but they will have different secondary levels.

Tertiary-level sorts consider character case to differentiate uppercase and lowercase letters. Tertiary sorts also handle special characters such as *, +, and -.

Consider the following words: Fahrvergnügen Fahrvergnugen farhrvergnugen fahrvergnügen

Because all of these words share the same base letters in the same order, all of them would be considered equivalent at the primary sort level. After a secondary-level sort, they would be ordered similarly to the following list:

Fahrvergnugen farhrvergnugen Fahrvergnügen fahrvergnügen

The secondary-level sort is concerned only with diacritics, so it will sort characters without diacritics above those with diacritics. After that, the words are displayed in their primary sort order. Because the words in this example have identical primary sort orders, there is no guarantee which word will appear before the other. The only guarantee is that those with diacritics will sort after those without.

After the tertiary-level sort is performed, the list should look exactly like this: farhrvergnugen

Fahrvergnugen fahrvergnügen Fahrvergnügen

668  Chapter 13  n  Implementing Globalization Support

The tertiary-level sort applies the case rule to the data, forcing lowercase letters to sort before uppercase letters. This is the final result of the sort after applying all three multilingual sorting levels.

In keeping with the ISO 14651 standard for multilingual sorting, Oracle appends an _M to the sort name to identify it as multilingual. For example, FRENCH_M identifies a French multilingual sort, whereas FRENCH identifies a French monolingual sort. Table 13.5 shows the multilingual sorts predefined in Oracle 11g.

Ta b l e 13 . 5   Multilingual​ Sorts Available in Oracle 11g

Multilingual Sort Name

Description

 

 

CANADIAN_M

Canadian French

DANISH_M

Danish

FRENCH_M

French

GENERIC_M

Generic based on ISO 14651

JAPANESE_M

Japanese

KOREAN_M

Korean

SPANISH_M

Traditional Spanish

THAI_M

Thai

SCHINESE_RADICAL_M

Simplified Chinese

SCHINESE_STROKE_M

Simplified Chinese

SCHINESE_PINYIN_M

Simplified Chinese

TCHINESE_RADICAL_M

Traditional Chinese

TCHINESE_STROKE_M

Traditional Chinese

 

 

Case-Insensitive and Accent-Insensitive Linguistic Sorts

Oracle, by default, will always consider both the case of the characters and any diacritics when performing sort operations. As you’ve seen in previous examples, linguistic sorts have rules to govern precedence between uppercase and lowercase words, as well as those words containing diacritics.

Using Linguistic Sorts and Searches 

669

However, you may wish to override this functionality from time to time and choose to ignore case and/or diacritics. Oracle 11g offers case-insensitive and accent-insensitive sorting options to allow for these cases.

To specify case-insensitive or accent-insensitive sorts, the NLS_SORT parameter is used, but with the following changes:

NN

For a case-insensitive linguistic sort, append the string _CI to the sort name.

 

NN

For an accent-insensitive linguistic sort, append the string _AI to the sort name.

Accent-insensitive sorts are also case insensitive by default.

For example, to specify a French, multilingual, accent-insensitive sort, use the following:

NLS_SORT = French_M_AI

Here is how to specify a German, monolingual, case-insensitive sort:

NLS_SORT = German_CI

Case-Insensitive and Accent-Insensitive Binary Sorts

Binary sorts can also be designated as case insensitive or accent insensitive. The NLS_SORT parameter can be set to BINARY_CI or BINARY_AI. Table 13.6 shows how the sort will be affected by these settings.

Ta b l e 13 .6   Binary​ Case and Accent-Insensitive Sort Options

Sort Name

Sort Type

Case Insensitive?

Accent Insensitive?

 

 

 

 

BINARY_CI

Binary

Yes

No

BINARY_AI

Binary

Yes

Yes

 

 

 

 

As you can see, using the BINARY_AI sort will result in both an accent-insensitive and case-insensitive sort.

Searching Linguistic Strings

Linguistic searches are closely related to linguistic sorts and are directly affected by the NLS_ SORT setting. To accomplish linguistically meaningful searches, Oracle must apply the same rules it applies for linguistic sorts.

670  Chapter 13  n  Implementing Globalization Support

Earlier in this section, you saw several examples of linguistic string searching, including the following:

SQL> select * from sort_test

2where name > ‘einschließlich’;

NAME

------------------------------------------

Finsteraarhornhutte finsteraarhornhütte Grünhornlücke grünhornlücke

4 rows selected.

When you set the NLS_COMP parameter to ANSI and the NLS_SORT parameter to the desired sort language, the WHERE operator (as well as several others) will perform linguistic searching by default.

And, just as you did with sorts, if the NLS_SORT is set to ignore case or accents, linguistic searches will follow suit, as in this example:

SQL> alter session set NLS_COMP=ANSI;

Session altered.

SQL> alter session set NLS_SORT=German_din_ci;

Session altered.

SQL> select * from sort_test

2where name = ‘Grünhornlücke’;

NAME

-----------------------------------------

Grünhornlücke

grünhornlücke

As you can see, the search ignored the case and returned both rows that matched. This is the expected functionality of the case-insensitive search.

In the next example, the NLS_SORT parameter will be set to define an accent-insensitive search:

SQL> alter session set NLS_SORT=German_din_ai;

Session altered.

Соседние файлы в предмете [НЕСОРТИРОВАННОЕ]