CHAPTER 13 SEARCHING AND INDEXING CONTENT

  // Build one result entry per matching alias, linking to the aliased path.
  foreach ($result as $alias) {
    $find[] = array(
      'title' => $alias->alias,
      'link' => url($alias->source, array('absolute' => TRUE)),
    );
  }
  return $find;
}

When the search API invokes hook_search_info(), it’s looking for the name the menu tab should display on the generic search page (see Figure 13-3). In our case, we’re returning “URL aliases.” By returning the name of the menu tab, the search API wires up the link of the menu tab to a new search form.

Figure 13-3. By returning the name of the menu tab from hook_search_info(), the search form becomes accessible.
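
As a rough illustration of the hook just described, a minimal implementation might look like the following sketch. The module name mymodule is a placeholder chosen for this example, and only the title key discussed here is returned.

```php
<?php
/**
 * Implements hook_search_info().
 *
 * "mymodule" is a hypothetical module name used only for illustration.
 */
function mymodule_search_info() {
  return array(
    // Label of the menu tab shown on the generic search page.
    'title' => 'URL aliases',
  );
}
```

Returning nothing more than the tab title is enough for the search API to wire the tab to a new search form.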

hook_search_execute() is the workhorse part of Drupal's search hooks. It is invoked when the search form is submitted, and its job is to collect and return the search results. In the preceding code, we query the url_alias table, using the search terms submitted from the form. We then collect the results of the query and send them back in an array. The results are formatted by the search module and displayed to the user, as shown in Figure 13-4.

Figure 13-4. Search results are formatted by the search module.

Using the Search HTML Indexer

So far, we’ve examined how to interact with the default search form by providing a simple implementation of hook_search_execute(). However, when we move from searching a simple VARCHAR database column with LIKE to seriously indexing web site content, it’s time to outsource the task to Drupal’s built-in HTML indexer.

The goal of the indexer is to make searching large chunks of HTML efficient. It does this by processing content when cron is called (via http://example.com/cron.php). As a result, new content is not searchable until the next cron run, so the lag depends on how often cron is scheduled. The indexer parses data and splits text into words (a process called tokenization), assigning a score to each token based on a rule set, which can be extended with the search API. It then stores this data in the database, and when a search is requested, Drupal queries these index tables instead of the node tables directly.

Note: If you have a busy Drupal site where hundreds of new nodes are added between cron runs, it might be time to move to a search solution that works alongside Drupal, such as Solr (see http://drupal.org/project/apachesolr).

When to Use the Indexer

Indexers are generally used when implementing search engines that evaluate more than the standard “most words matched” approach. Search relevancy refers to content passing through a (usually complex) rule set to determine ranking within an index.

You’ll want to harness the power of the indexer if you need to search a large bulk of HTML content. One of the greatest benefits in Drupal is that blogs, forums, pages, and so forth are all nodes. Their base data structures are identical, and this common bond means they also share basic functionality. One such common feature is that all nodes are automatically indexed if a search module is enabled; no extra programming is needed. Even if you create a custom node type, searching of that content is already built in, provided that the modifications you make show up in the node when it is rendered.

How the Indexer Works

The indexer has a preprocessing mode where text is filtered through a set of rules to assign scores. Such rules include dealing with acronyms, URLs, and numerical data. During the preprocessing phase, other modules have a chance to add logic to this process in order to perform their own data manipulations.

This comes in handy during language-specific tweaking, as shown here using the contributed PorterStemmer module:

resumé -> resume (accent removal)

skipping -> skip (stemming)

skips -> skip (stemming)

Another such language preprocessing example is word splitting for the Chinese, Japanese, and Korean languages to ensure the character text is correctly indexed.
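
Modules take part in this phase by implementing hook_search_preprocess(), which receives text and returns the processed version. The following is a minimal sketch of accent removal only. The module name mymodule and the tiny character map are assumptions made for illustration, and a real preprocessor would cover far more cases or perform actual stemming.

```php
<?php
/**
 * Implements hook_search_preprocess().
 *
 * "mymodule" and the short accent map below are illustrative assumptions.
 */
function mymodule_search_preprocess($text) {
  // Map a handful of accented characters to their unaccented forms.
  $accents = array(
    'é' => 'e',
    'è' => 'e',
    'á' => 'a',
    'ü' => 'u',
  );
  // With an array argument, strtr() replaces whole (multibyte) substrings.
  return strtr($text, $accents);
}
```

With this in place, the word "resumé" would be indexed as "resume", matching the accent-removal example above.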

Tip: The Porter-Stemmer module (http://drupal.org/project/porterstemmer) is an example of a module that provides word stemming to improve English language searching. Likewise, the Chinese Word Splitter module (http://drupal.org/project/csplitter) is an enhanced preprocessor for improving Chinese, Japanese, and Korean searching. A simplified Chinese word splitter is included with the search module and can be enabled on the search settings page.

After the preprocessing phase, the indexer uses HTML tags to find more important words (called tokens) and assigns them adjusted scores based on the default score of the HTML tags and the number of occurrences of each token. These scores will be used to determine the ultimate relevancy of the token. Here’s the full list of the default HTML tag scores (they are defined in search_index()):

'h1' => 25, 'h2' => 18, 'h3' => 15, 'h4' => 12, 'h5' => 9, 'h6' => 6,
'u' => 3, 'b' => 3, 'i' => 3, 'strong' => 3, 'em' => 3, 'a' => 10

Let’s grab a chunk of HTML and run it through the indexer to better understand how it works. Figure 13-5 shows an overview of the HTML indexer parsing content, assigning scores to tokens, and storing that information in the database.

Figure 13-5. Indexing a chunk of HTML and assigning token scores

When the indexer encounters numerical data separated by punctuation, the punctuation is removed and numbers alone are indexed. This makes elements such as dates, version numbers, and IP addresses easier to search for. The middle process in Figure 13-5 shows how a word token is processed when it’s not surrounded by HTML. These tokens have a weight of 1. The last row shows content that is wrapped in an emphasis (<em>) tag. The formula for determining the overall score of a token is as follows:

Number of matches × Weight of the HTML tag
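
The formula can be illustrated with a small standalone PHP sketch. This is not Drupal's search_index() code: the function name sketch_score_tokens() and the naive regular-expression tag handling are assumptions made for illustration, and core performs further processing that is omitted here. Words outside any weighted tag score 1 per occurrence, and words inside a weighted tag score that tag's weight per occurrence.

```php
<?php
// Default tag weights as listed in the text.
$tag_weights = array(
  'h1' => 25, 'h2' => 18, 'h3' => 15, 'h4' => 12, 'h5' => 9, 'h6' => 6,
  'u' => 3, 'b' => 3, 'i' => 3, 'strong' => 3, 'em' => 3, 'a' => 10,
);

/**
 * Hypothetical helper: scores each word as matches x tag weight.
 */
function sketch_score_tokens($html, array $tag_weights) {
  $scores = array();
  // Weights of the currently open weighted tags; the innermost one wins
  // (a simplification for this sketch).
  $stack = array();
  // Split into tags and text runs, keeping the tags in the result.
  $parts = preg_split('/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
  foreach ($parts as $part) {
    if ($part[0] === '<') {
      if (preg_match('/^<\s*(\/?)\s*([a-z0-9]+)/i', $part, $m)) {
        $tag = strtolower($m[2]);
        if (isset($tag_weights[$tag])) {
          if ($m[1] === '/') {
            array_pop($stack);
          }
          else {
            $stack[] = $tag_weights[$tag];
          }
        }
      }
      continue;
    }
    // Text outside any weighted tag has a weight of 1.
    $weight = $stack ? end($stack) : 1;
    $words = preg_split('/[^a-z0-9]+/', strtolower($part), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
      $scores[$word] = (isset($scores[$word]) ? $scores[$word] : 0) + $weight;
    }
  }
  return $scores;
}

$scores = sketch_score_tokens('<h1>Drupal search</h1><p>Drupal is <em>fast</em></p>', $tag_weights);
// drupal: 25 + 1 = 26, search: 25, is: 1, fast: 3
```

Adding the weight once per occurrence is equivalent to multiplying the number of matches by the tag weight, summed over each context in which the word appears.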

Note also that Drupal indexes the filtered output of nodes. For example, if an input filter automatically converts URLs to hyperlinks, or another filter converts line breaks to HTML break and paragraph tags, the indexer sees the content with all of that markup in place and can take it into account when assigning scores. The impact of indexing filtered output is greater still for a node that uses the PHP evaluator filter to generate dynamic content. Indexing dynamic content could be a real hassle, but because Drupal’s indexer sees only the output generated by the PHP code, dynamic content is automatically fully searchable.

Note: If dynamic content changes later, the index is not updated continuously. Instead, the index contains the dynamic content as it was displayed when the node was indexed during a cron run. It is then frozen in time and will not be indexed again unless specific steps are taken.

When the indexer encounters internal links, they too are handled in a special way. If a link points to another node, then the link’s words are added to the target node’s content, making answers to common questions and relevant information easier to find. There are two ways to hook into the indexer:

hook_node_update_index($node): You can add data to a node that is otherwise invisible in order to tweak search relevancy. You can see this in action in the way Drupal core handles comments, which technically aren’t part of the node object but should influence the search results. The Comment module also implements this hook. This is, however, sneaky. It uses the comment_update_index() function to set a limit on how many comments should be indexed. Thus it’s just a bit of a hack of the API.

hook_update_index(): You can use the indexer to index HTML content that is not part of a node using hook_update_index(). For a Drupal core implementation of hook_update_index(), see node_update_index() in modules/node/node.module.

Both of these hooks are called during cron runs in order to index new data. Figure 13-6 shows the order in which these hooks run.

Figure 13-6. Overview of HTML indexing hooks
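
To make the first of these hooks more concrete, here is a minimal sketch of a hook_node_update_index() implementation that contributes extra text to a node's index entry. The module name mymodule and the $node->mymodule_keywords property are hypothetical. In a real module you would likely escape with Drupal's check_plain(); htmlspecialchars() is used here only so the sketch runs on its own.

```php
<?php
/**
 * Implements hook_node_update_index().
 *
 * "mymodule" and $node->mymodule_keywords are hypothetical names.
 */
function mymodule_node_update_index($node) {
  $text = '';
  if (!empty($node->mymodule_keywords)) {
    foreach ($node->mymodule_keywords as $keyword) {
      // Wrapping each keyword in <h2> gives it a higher tag score than
      // plain text would receive.
      $text .= '<h2>' . htmlspecialchars($keyword) . '</h2>';
    }
  }
  // The returned string is indexed along with the node's rendered content.
  return $text;
}
```

Because the hook returns HTML, the tag scores described earlier apply to the added text just as they do to the node body.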
