Zeta Components - high quality PHP components

eZ Components - Search

Introduction

The search component allows you to index and search documents. A document consists of an object. A definition maps the object's properties to fields of a document, just like the PersistentObject component. The indexing process takes the document and indexes the fields depending on the data type. The searching part allows you to search for documents in the index with a rich query language.

Class overview

This section gives you an overview of the main classes in the Search component.

ezcSearchSession
This class provides access to the search index for both indexing and searching. All operations on the index go through this class, which is configured with a search handler and a definition manager.
ezcSearchSolrHandler
The handler that uses Solr for indexing and searching.
ezcSearchQueryBuilder
A class that allows you to build a search query from a search string.
ezcSearchQuery
An interface for building complex search queries in case ezcSearchQueryBuilder is not up to the task.

Search Handlers

Search handlers provide the link between the abstract query and document interfaces and the mechanism that actually stores the index and allows querying for documents. Not every handler supports all the query words or data types, but effort is put in to make as much use of each handler's functionality as possible.

Solr

This handler uses Apache's Solr as backend. It accesses Solr over TCP/IP as a web service. Solr is a very capable search provider, with many features.

Using the handler is straightforward: instantiate the handler class and pass the instance to the ezcSearchSession constructor:

    <?php
    require_once 'tutorial_autoload.php';

    // on localhost with the default port
    $handler = new ezcSearchSolrHandler;

    // on another host with a different port
    $handler = new ezcSearchSolrHandler( '10.0.2.184', 9123 );
    ?>

Solr requires a schema to work. This schema defines how data types work and allows for many more customizations. The default schema that comes with Solr requires a few minor changes to make it work with the Search component. The schema with those changes applied should be used as a basis for the Search component.

Zend_Search_Lucene

The component also provides a backend based on Zend's Lucene implementation. This backend has many limitations compared to Solr, such as missing multi-valued field support, no data-type support and much lower performance. In order to use this backend, you need a specific autoload function, and the Zend Framework has to be installed and available on the PHP include path. An example of how to use it follows:

    <?php
    // load the normal ezc autoload mechanism with some tricks
    require_once 'tutorial_autoload.php';

    // add the location of the zend framework to the include path, this can of
    // course also be done in php.ini
    ini_set( 'include_path', ini_get( 'include_path' ) . ':/home/derick/dev/ZendFramework-1.7.4-minimal/library' );

    // define the autoload function that can load Zend framework classes
    function zend_autoload( $className )
    {
        if ( strpos( $className, '_' ) !== false )
        {
            // Zend_Search_Lucene maps to Zend/Search/Lucene.php
            $file = str_replace( '_', '/', $className ) . '.php';
            // require_once returns a truthy value when the file was loaded
            $val = require_once( $file );
            return ( $val == true );
        }
    }

    // reset the autoload stack, register the zend_autoload mechanism and
    // re-register the original eZ Component's autoload() mechanism.
    spl_autoload_register( 'zend_autoload' );
    spl_autoload_register( '__autoload' );

    // open the handler
    $handler = new ezcSearchZendLuceneHandler( '/tmp/lucene' );
    ?>

Definition Managers

A definition manager maps properties of an object to fields in a search document. As most search handlers support fields with arbitrary names, you don't actually provide the name of the fields in the search index. Instead, the mapping configures several things for an object's property.

First of all, every document type needs an ID field. This ID uniquely identifies a document in the search index. There has to be exactly one ID field. For each field, you have to define the data type, and optionally you can configure its importance (boost), whether it is part of the result, whether it accepts multiple values and whether it is selected for highlighting.

Definitions can be supplied in two ways. The embedded manager retrieves the definitions from the document classes directly, whereas the XML manager reads the definitions from external XML files.

Embedded Manager

The ezcSearchEmbeddedManager retrieves the definition from a class that implements the ezcSearchDefinitionProvider interface. This interface specifies the getDefinition() method, which the class implements to return the document's definition mappings. An example implementation of the getDefinition() method is:

    static public function getDefinition()
    {
        $n = new ezcSearchDocumentDefinition( __CLASS__ );
        $n->idProperty = 'id';
        $n->fields['id']        = new ezcSearchDefinitionDocumentField( 'id', ezcSearchDocumentDefinition::TEXT );
        $n->fields['title']     = new ezcSearchDefinitionDocumentField( 'title', ezcSearchDocumentDefinition::TEXT, 2, true, false, true );
        $n->fields['body']      = new ezcSearchDefinitionDocumentField( 'body', ezcSearchDocumentDefinition::TEXT, 1, false, false, true );
        $n->fields['published'] = new ezcSearchDefinitionDocumentField( 'published', ezcSearchDocumentDefinition::DATE );
        $n->fields['url']       = new ezcSearchDefinitionDocumentField( 'url', ezcSearchDocumentDefinition::STRING );
        $n->fields['type']      = new ezcSearchDefinitionDocumentField( 'type', ezcSearchDocumentDefinition::STRING, 0, true, false, false );
        return $n;
    }

This method constructs an ezcSearchDocumentDefinition object containing all the field definitions. An ID property is required. Using the TEXT data type for it is recommended, but not required. See the section Data Types for the differences between data types.

Each field is then added to the fields property as an ezcSearchDefinitionDocumentField object. The array key should be the same as the first argument to the constructor of this class. The second argument is the data type, which defaults to ezcSearchDocumentDefinition::TEXT. The subsequent arguments control, in this order, the importance (boost) of a field, whether it should be part of the result, whether multiple values for this field are accepted and whether it should be selected for highlighting.

XML Manager

The ezcSearchXmlManager reads document definitions from XML files. The directory that contains the XML definition files is passed to the constructor:

    <?php
    $xm = new ezcSearchXmlManager( 'search-defs/' );
    ?>

The definition files have to be named name-of-class-in-lower-case.xml. This means that for the class Article the file article.xml is read. The file itself is a simple XML file. The file below describes the same fields as the example in Embedded Manager:

    <?xml version="1.0"?>
    <document>
        <field type="id">id</field>
        <field type="text" highLight="true" boost="2">title</field>
        <field inResult="false" type="html">body</field>
        <field type="date">published</field>
        <field type="string">url</field>
        <field highLight="true" type="string">type</field>
    </document>

The RelaxNG-Compressed schema is:

    default namespace = "http://components.ez.no/Search"

    start = element document { field+ }

    field = element field {
        attribute type { xsd:string },
        attribute highLight { 'true' | 'false' }?,
        attribute inResult { 'true' | 'false' }?,
        attribute multi { 'true' | 'false' }?,
        attribute boost { xsd:float }?,
        string
    }
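The naming rule and the file layout described above can be sketched in plain PHP. This is only an illustration and not how ezcSearchXmlManager is implemented: the helper names definitionFileName() and readDefinitionFields() are hypothetical, the sketch relies on PHP's SimpleXML extension, and it uses a shortened, namespace-free definition like the example file above.

```php
<?php
// Hypothetical helper (not part of the Search component): derive the
// definition file name from a class name, i.e. lower-cased name + ".xml".
function definitionFileName( $className )
{
    return strtolower( $className ) . '.xml';
}

// Hypothetical helper: return the fields of a definition document as
// an array of property name => attribute list, using SimpleXML.
function readDefinitionFields( $xml )
{
    $fields = array();
    $document = new SimpleXMLElement( $xml );
    foreach ( $document->field as $field )
    {
        $attributes = array();
        foreach ( $field->attributes() as $name => $value )
        {
            $attributes[$name] = (string) $value;
        }
        // the element text is the name of the object's property
        $fields[trim( (string) $field )] = $attributes;
    }
    return $fields;
}

// shortened version of the article.xml example
$xml = <<<ENDXML
<?xml version="1.0"?>
<document>
  <field type="id">id</field>
  <field type="text" highLight="true" boost="2">title</field>
  <field inResult="false" type="html">body</field>
</document>
ENDXML;

$fields = readDefinitionFields( $xml );

// print the file name for the Article class and the type of each field
echo definitionFileName( 'Article' ), "\n";
foreach ( $fields as $property => $attributes )
{
    echo $property, ': ', $attributes['type'], "\n";
}
```

Attributes that are absent from a field element, such as boost on body, are simply missing from the returned array. The defaults the component applies in that case are not covered by this sketch.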

Search Session

The search session is responsible for indexing documents and for searching them. The session object requires both a search handler and a definition manager. The handler is used for storing the index, while the definition manager is used to find the definition that maps an object's properties to search index fields. Creating a session is simple, as the following example demonstrates:

    <?php
    require_once 'tutorial_autoload.php';

    $handler = new ezcSearchSolrHandler;
    $manager = new ezcSearchEmbeddedManager;
    $session = new ezcSearchSession( $handler, $manager );
    ?>

Indexing

With the session created, it is time to index documents. Before we can index anything, we need a class for the objects we want to index and a definition for it. For this tutorial we'll reuse the definition from the Embedded Manager section and build a class around it. Each class that you want to index through the Search component needs to implement the ezcBasePersistable interface. This interface defines two methods, getState() and setState(), and requires that the constructor can be called without any arguments. These methods are used for fetching and re-creating the state of the object, similar to what PersistentObject requires.

To see everything in perspective, the full class follows here, including the definition method:

    <?php
    class Article implements ezcBasePersistable, ezcSearchDefinitionProvider
    {
        public  $id;
        public  $title;
        private $body;
        private $published;
        private $url;
        private $type;

        function __construct( $id = null, $title = null, $body = null, $published = null, $url = null, $type = null )
        {
            $this->id = $id;
            $this->title = $title;
            $this->body = $body;
            $this->published = $published;
            $this->url = $url;
            $this->type = $type;
        }

        function getState()
        {
            $state = array(
                'id' => $this->id,
                'title' => $this->title,
                'body' => $this->body,
                'published' => $this->published,
                'url' => $this->url,
                'type' => $this->type,
            );
            return $state;
        }

        function setState( $state )
        {
            foreach ( $state as $key => $value )
            {
                $this->$key = $value;
            }
        }

        static public function getDefinition()
        {
            $n = new ezcSearchDocumentDefinition( __CLASS__ );
            $n->idProperty = 'id';
            $n->fields['id']        = new ezcSearchDefinitionDocumentField( 'id', ezcSearchDocumentDefinition::TEXT );
            $n->fields['title']     = new ezcSearchDefinitionDocumentField( 'title', ezcSearchDocumentDefinition::TEXT, 2, true, false, true );
            $n->fields['body']      = new ezcSearchDefinitionDocumentField( 'body', ezcSearchDocumentDefinition::TEXT, 1, false, false, true );
            $n->fields['published'] = new ezcSearchDefinitionDocumentField( 'published', ezcSearchDocumentDefinition::DATE );
            $n->fields['url']       = new ezcSearchDefinitionDocumentField( 'url', ezcSearchDocumentDefinition::STRING );
            $n->fields['type']      = new ezcSearchDefinitionDocumentField( 'type', ezcSearchDocumentDefinition::STRING, 0, true, false, false );
            return $n;
        }
    }
    ?>

The ezcBasePersistable interface is also compatible with PersistentObject, although there the interface is not enforced.

After we've created the class and definition, indexing an object is relatively simple. After instantiation, indexing the document is done by calling the index() method of the session as you can see in the next example:

    <?php
    require_once 'tutorial_autoload.php';

    // setup
    $handler = new ezcSearchSolrHandler;
    $manager = new ezcSearchEmbeddedManager;
    $session = new ezcSearchSession( $handler, $manager );

    // instantiate article; body, published, url and type are private in the
    // Article class, so all values are passed through the constructor
    $body = <<<ENDBODY
    This is the body of the text, nothing interesting now
    as this is just an example.
    ENDBODY;
    $article = new Article(
        null,
        "A test article to show indexing.",
        $body,
        time(),
        "/article/1",
        "article"
    );

    // index
    $session->index( $article );
    ?>

If you are indexing a large number of documents, it's wise to wrap the indexing calls in an indexing transaction. For the handlers that support this, it optimizes the indexing process. See the ezcSearchSession->beginTransaction() documentation.
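A minimal sketch of such a wrapper follows. The helper name indexAll() is hypothetical. beginTransaction() and index() are the session methods mentioned in this tutorial, while commit() is assumed here to be the matching call that ends the transaction, so check it against the ezcSearchSession API documentation before relying on it.

```php
<?php
// Hypothetical helper: index a list of documents inside one transaction.
// $session is expected to be an ezcSearchSession (or any object offering
// the same three methods); commit() is an assumption, see the text above.
function indexAll( $session, array $documents )
{
    $session->beginTransaction();
    foreach ( $documents as $document )
    {
        $session->index( $document );
    }
    $session->commit();

    // number of documents handed to the session
    return count( $documents );
}
```

With the session from the previous example and an array of Article objects in $articles, the call would be indexAll( $session, $articles ). Error handling, for example what to do when indexing one document fails halfway through, is deliberately left out of this sketch.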

Data Types

The Search component understands many data types, but they might not always be representable by every handler. The table below explains the different data types that are available:

Constant    Description
BOOLEAN     Stores a true or false boolean value.
STRING      Untokenized text, useful for keywords or facets.
TEXT        Tokenized text, useful for summaries and large pieces of text.
HTML        Tokenized HTML documents, strips out all tags and attributes.
DATE        Stores Unix timestamps and DateTime objects.
INT         Stores integer numbers, which can be used in range searches.
FLOAT       Stores floating point numbers, which can be used in range searches.

Searching

After documents are indexed, they are searchable. Building a search query can be done in two ways. The Query Language approach is the most powerful one, but is more complex. Alternatively you can use the Query Builder approach which lets you feed it a string and it will build the query from that string.

Query Language

The ezcSearchQuery interface defines all the methods that handlers should implement to realize the query language for every handler. This interface defines methods such as where(), lOr() and between() - very similar to what the ezcQuerySelect and ezcQueryExpression classes provide. The following example shows how to use the query language:

    <?php
    require_once 'tutorial_autoload.php';

    // setup
    $handler = new ezcSearchSolrHandler;
    $manager = new ezcSearchEmbeddedManager;
    $session = new ezcSearchSession( $handler, $manager );

    // initialize a pre-configured query
    $q = $session->createFindQuery( 'Article' );
    $searchWord = 'test';

    // where either body or title contains the $searchWord
    $q->where(
        $q->lOr(
            $q->eq( 'body', $searchWord ),
            $q->eq( 'title', $searchWord )
        )
    );

    // limit the query and order
    $q->limit( 10 );
    $q->orderBy( 'title' );

    // add a facet on url (not very useful)
    $q->facet( 'url' );

    // run the query and show titles for found documents
    $r = $session->find( $q );
    foreach( $r->documents as $res )
    {
        echo $res->document->title, "\n";
    }
    ?>

The result of the query is returned in the form of an ezcSearchResult object. This contains the documents, but also information about facets and pagination. See the documentation of the ezcSearchResult class for more information.

Query Builder

The query builder approach lets you accept query strings written by users instead of building each query through the API. With this you can allow query strings such as foo -bar, while still searching in multiple fields. Be aware, however, that it depends on the handler whether the query actually returns the expected results. The query builder will most likely work best if you're searching in only one field. At the moment the query builder understands the + and - modifiers, grouping with '(' and ')', the AND and OR keywords, as well as phrases (enclosed in ").

An example that searches in two fields (body and title) follows:

    <?php
    require_once 'tutorial_autoload.php';

    // setup
    $handler = new ezcSearchSolrHandler;
    $manager = new ezcSearchEmbeddedManager;
    $session = new ezcSearchSession( $handler, $manager );

    // initialize a pre-configured query
    $q = $session->createFindQuery( 'Article' );

    // where either body or title contains test but not article
    $searchWord = 'test -article';

    // run the query builder to search for the $searchWord in body and title
    $qb = new ezcSearchQueryBuilder();
    $qb->parseSearchQuery( $q, $searchWord, array( 'body', 'title' ) );

    // run the query and show titles for found documents, and its score
    $r = $session->find( $q );
    foreach( $r->documents as $res )
    {
        echo $res->document->score, ", ", $res->document->title, "\n";
    }
    ?>
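To give a feel for the query string syntax, the following sketch splits a string such as test -article into plain, required (+) and excluded (-) terms and recognizes quoted phrases. It is an illustration only and not the parser used by ezcSearchQueryBuilder: the function name splitQueryTerms() is hypothetical, and grouping with parentheses and the AND and OR keywords are left out.

```php
<?php
// Illustration only, not the ezcSearchQueryBuilder parser: split a query
// string into plain, required (+) and excluded (-) terms. Quoted phrases
// are kept together; parentheses and AND/OR are not handled.
function splitQueryTerms( $query )
{
    $terms = array(
        'plain'    => array(),
        'required' => array(),
        'excluded' => array(),
    );

    // each match: optional +/- prefix, then a quoted phrase or a bare word
    preg_match_all( '/([+-]?)(?:"([^"]*)"|(\S+))/', $query, $matches, PREG_SET_ORDER );

    foreach ( $matches as $match )
    {
        // group 3 is only present when a bare word matched
        $text = isset( $match[3] ) ? $match[3] : $match[2];
        if ( $text === '' )
        {
            continue; // skip empty phrases such as ""
        }

        if ( $match[1] === '+' )
        {
            $terms['required'][] = $text;
        }
        elseif ( $match[1] === '-' )
        {
            $terms['excluded'][] = $text;
        }
        else
        {
            $terms['plain'][] = $text;
        }
    }
    return $terms;
}

// the search string from the example above
$terms = splitQueryTerms( 'test -article' );
echo 'plain: ', implode( ', ', $terms['plain'] ), "\n";
echo 'excluded: ', implode( ', ', $terms['excluded'] ), "\n";
```

How each handler treats plain terms, for example whether they are combined with AND or OR, is up to the handler and is not modelled here.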