eZ Components - Search
Introduction
The Search component allows you to index and search documents. Each document is represented by an object. A definition maps the object's properties to fields of a document, just like in the PersistentObject component. The indexing process takes the document and indexes each field according to its data type. The searching part allows you to search for documents in the index with a rich query language.
Class overview
This section gives you an overview of the main classes in the Search component.
- ezcSearchSession
- This class provides access to the search index for both indexing and searching. All operations on the index go through this class, which is configured with a search handler and a definition manager.
- ezcSearchSolrHandler
- The handler that uses Solr for indexing and searching.
- ezcSearchQueryBuilder
- A class that allows you to build a search query from a search string.
- ezcSearchQuery
- An interface for building complex search queries in case ezcSearchQueryBuilder is not up to the task.
Search Handlers
Search handlers provide the link between the abstract query and document interfaces and the mechanism that actually stores the index and allows querying for documents. Not all handlers support all the different query words or data types in the index, but effort is put into making as much use of each handler's functionality as possible.
Solr
This handler uses Apache's Solr as backend. It accesses Solr over TCP/IP as a web service. Solr is a very capable search provider, with many features.
Using the handler is relatively easy: instantiate the handler class and pass the resulting object to the ezcSearchSession constructor:
- <?php
- require_once 'tutorial_autoload.php';
- // on localhost with the default port
- $handler = new ezcSearchSolrHandler;
- // on another host with a different port
- $handler = new ezcSearchSolrHandler( '10.0.2.184', 9123 );
- ?>
Solr requires a schema to work. This schema defines how data types work and allows for many more customizations. The default schema that comes with Solr requires a few minor changes to make it work with the Search component. This schema should be used as a basis for the Search component.
Zend_Search_Lucene
The component also provides a backend based on Zend's Lucene implementation. This backend has many limitations compared to Solr, such as missing support for multi-valued fields, no data-type support and much lower performance. To use this backend, you need a specific autoload function, and the Zend Framework must be installed and available on the PHP include path. An example of how to use it follows:
- <?php
- // load the normal ezc autoload mechanism with some tricks
- require_once 'tutorial_autoload.php';
- // add the location of the zend framework to the include path, this can of
- // course also be done in php.ini
- ini_set( 'include_path', ini_get( 'include_path' ) . PATH_SEPARATOR . '/home/derick/dev/ZendFramework-1.7.4-minimal/library' );
- // define the autoload function that can load Zend framework classes
- function zend_autoload( $className )
- {
- if ( strpos( $className, '_' ) !== false )
- {
- $file = str_replace( '_', '/', $className ) . '.php';
- $val = require_once( $file );
- return ( $val == 0 );
- }
- }
- // spl_autoload_register() bypasses a plain __autoload() function, so register
- // zend_autoload first and then re-register the eZ Components __autoload().
- spl_autoload_register( 'zend_autoload' );
- spl_autoload_register( '__autoload' );
- // open the handler
- $handler = new ezcSearchZendLuceneHandler( '/tmp/lucene' );
- ?>
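The mapping that zend_autoload() performs can be tried in isolation. The following is a minimal sketch; zendClassToFile() is a hypothetical helper introduced here for illustration and is not part of either library:

```php
<?php
// Hypothetical helper mirroring the mapping inside zend_autoload():
// underscores in the class name become directory separators.
function zendClassToFile( $className )
{
    return str_replace( '_', '/', $className ) . '.php';
}

// prints "Zend/Search/Lucene.php"
echo zendClassToFile( 'Zend_Search_Lucene' ), "\n";
```

Because the resulting path is relative, the Zend Framework library directory has to be on the include path for require_once to find the file.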
Definition Managers
A definition manager maps properties of an object to fields in a search document. As most search handlers support fields with arbitrary names, you don't actually provide the name of the fields in the search index. Instead, the mapping configures several things for an object's property.
First of all, every document type needs an ID field. This ID uniquely identifies a document in the search index. There has to be exactly one ID field. For each field, you have to define the data type, and optionally you can configure:
The importance of that field (boost factor)
Whether the field should be a part of the resulting document
Whether the field supports multiple values
Whether highlighting should be performed for this field on result documents
Definitions can be supplied in two ways. The embedded manager retrieves the definitions from the document classes directly, whereas the XML manager reads the definitions from external XML files.
Embedded Manager
The ezcSearchEmbeddedManager retrieves the definition from a class that implements the ezcSearchDefinitionProvider interface. This interface specifies the getDefinition() method, which classes implement to return the document's definition mappings. An example of the implementation of the getDefinition() method is:
static public function getDefinition()
{
$n = new ezcSearchDocumentDefinition( __CLASS__ );
$n->idProperty = 'id';
$n->fields['id'] = new ezcSearchDefinitionDocumentField( 'id', ezcSearchDocumentDefinition::TEXT );
$n->fields['title'] = new ezcSearchDefinitionDocumentField( 'title', ezcSearchDocumentDefinition::TEXT, 2, true, false, true );
$n->fields['body'] = new ezcSearchDefinitionDocumentField( 'body', ezcSearchDocumentDefinition::TEXT, 1, false, false, true );
$n->fields['published'] = new ezcSearchDefinitionDocumentField( 'published', ezcSearchDocumentDefinition::DATE );
$n->fields['url'] = new ezcSearchDefinitionDocumentField( 'url', ezcSearchDocumentDefinition::STRING );
$n->fields['type'] = new ezcSearchDefinitionDocumentField( 'type', ezcSearchDocumentDefinition::STRING, 0, true, false, false );
return $n;
}
This method constructs an ezcSearchDocumentDefinition object containing all the field definitions. An ID property is required. Using the TEXT data type for it is recommended, although not required. See the section Data Types for the differences between data types.
Each field is then added to the fields property as an ezcSearchDefinitionDocumentField object. The array key should be the same as the first argument to the constructor of this class. The second argument is the data type, which defaults to ezcSearchDocumentDefinition::TEXT. Subsequent arguments control, in order, the importance (boost) of the field, whether it should be part of the result, whether multiple values for this field are accepted and whether it should be selected for highlighting.
XML Manager
The ezcSearchXmlManager reads document definitions from XML files. The directory that contains the XML definition files is passed to the constructor:
- <?php
- $xm = new ezcSearchXmlManager( 'search-defs/' );
- ?>
The names of the definition files are required to be name-of-class-in-lower-case.xml. This means that for the class Article the file article.xml is read. The file itself is a simple XML file. The file below demonstrates the same definition as the one in the example in Embedded Manager:
<?xml version="1.0"?>
<document>
<field type="id">id</field>
<field type="text" highLight="true" boost="2">title</field>
<field inResult="false" type="html">body</field>
<field type="date">published</field>
<field type="string">url</field>
<field highLight="true" type="string">type</field>
</document>
The RELAX NG Compact schema for definition files is:
default namespace = "http://components.ez.no/Search"
start =
element document {
field+
}
field =
element field {
attribute type { xsd:string },
attribute highLight { 'true' | 'false' }?,
attribute inResult { 'true' | 'false' }?,
attribute multi { 'true' | 'false' }?,
attribute boost { xsd:float }?,
string
}
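To make the layout of a definition file concrete, the sketch below derives the expected file name for a class and reads the field names and types with SimpleXML. The definitionFileName() helper and the shortened inline XML are assumptions made for illustration; this is not how ezcSearchXmlManager itself parses the file.

```php
<?php
// Hypothetical helper: the file name for a class is the class name in
// lower case followed by ".xml".
function definitionFileName( $className )
{
    return strtolower( $className ) . '.xml';
}

// A shortened copy of the article definition, inlined for the example.
$xml = <<<XML
<?xml version="1.0"?>
<document>
  <field type="id">id</field>
  <field type="text" highLight="true" boost="2">title</field>
  <field inResult="false" type="html">body</field>
</document>
XML;

// Collect "field name => type" pairs from the <field> elements.
$doc = simplexml_load_string( $xml );
$fields = array();
foreach ( $doc->field as $field )
{
    $fields[(string) $field] = (string) $field['type'];
}

echo definitionFileName( 'Article' ), "\n"; // prints "article.xml"
print_r( $fields );
```

The element content is the property name, and the attributes carry the per-field settings that the embedded manager receives as constructor arguments.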
Search Session
The search session is responsible for indexing documents and for searching them. The session object requires both a search handler and a definition manager. The handler is used for storing the index, while the definition manager is used to find the definition that maps an object's properties to search index fields. Creating a session is simple, as the following example demonstrates:
- <?php
- require_once 'tutorial_autoload.php';
- $handler = new ezcSearchSolrHandler;
- $manager = new ezcSearchEmbeddedManager;
- $session = new ezcSearchSession( $handler, $manager );
- ?>
Indexing
With the session created, it is time to index documents. Before we can index anything we need to create a class and its definition. For this tutorial we'll reuse the definition from the Embedded Manager section and build a class around it. Each class that you want to index through the Search component needs to implement the ezcBasePersistable interface. This interface defines two methods, getState() and setState(), and requires that the constructor can be called without any arguments. Those methods are used for fetching and re-creating the state of the object, similar to what PersistentObject requires.
To see everything in perspective, the full class follows here, including the definition method:
- <?php
- class Article implements ezcBasePersistable, ezcSearchDefinitionProvider
- {
- public $id;
- public $title;
- private $body;
- private $published;
- private $url;
- private $type;
- function __construct( $id = null, $title = null, $body = null, $published = null, $url = null, $type = null )
- {
- $this->id = $id;
- $this->title = $title;
- $this->body = $body;
- $this->published = $published;
- $this->url = $url;
- $this->type = $type;
- }
- function getState()
- {
- $state = array(
- 'id' => $this->id,
- 'title' => $this->title,
- 'body' => $this->body,
- 'published' => $this->published,
- 'url' => $this->url,
- 'type' => $this->type,
- );
- return $state;
- }
- function setState( $state )
- {
- foreach ( $state as $key => $value )
- {
- $this->$key = $value;
- }
- }
- static public function getDefinition()
- {
- $n = new ezcSearchDocumentDefinition( __CLASS__ );
- $n->idProperty = 'id';
- $n->fields['id'] = new ezcSearchDefinitionDocumentField( 'id', ezcSearchDocumentDefinition::TEXT );
- $n->fields['title'] = new ezcSearchDefinitionDocumentField( 'title', ezcSearchDocumentDefinition::TEXT, 2, true, false, true );
- $n->fields['body'] = new ezcSearchDefinitionDocumentField( 'body', ezcSearchDocumentDefinition::TEXT, 1, false, false, true );
- $n->fields['published'] = new ezcSearchDefinitionDocumentField( 'published', ezcSearchDocumentDefinition::DATE );
- $n->fields['url'] = new ezcSearchDefinitionDocumentField( 'url', ezcSearchDocumentDefinition::STRING );
- $n->fields['type'] = new ezcSearchDefinitionDocumentField( 'type', ezcSearchDocumentDefinition::STRING, 0, true, false, false );
- return $n;
- }
- }
- ?>
The ezcBasePersistable interface is also compatible with PersistentObject, although there the interface is not enforced.
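The role of getState() and setState() is easiest to see as a round trip: the state of one object is exported and then used to rebuild an equal object from a constructor call without arguments. The sketch below uses a stand-in Note class that is not part of the tutorial and leaves out the interface so that it runs without the library:

```php
<?php
// Stand-in class for illustration only; a real document class would also
// declare "implements ezcBasePersistable".
class Note
{
    public $id;
    private $text;

    public function __construct( $id = null, $text = null )
    {
        $this->id = $id;
        $this->text = $text;
    }

    // Export all properties, including private ones, as an array.
    public function getState()
    {
        return array( 'id' => $this->id, 'text' => $this->text );
    }

    // Restore properties from an array produced by getState().
    public function setState( $state )
    {
        foreach ( $state as $key => $value )
        {
            $this->$key = $value;
        }
    }
}

$original = new Note( 'n1', 'hello' );
$state = $original->getState();

// Re-create the object the way a consumer of the interface could:
// construct without arguments, then push the state back in.
$copy = new Note();
$copy->setState( $state );

var_dump( $copy->getState() === $state ); // prints bool(true)
```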
With the class and definition in place, indexing an object is relatively simple. Instantiate the object and pass it to the index() method of the session, as the next example shows:
- <?php
- require_once 'tutorial_autoload.php';
- // setup
- $handler = new ezcSearchSolrHandler;
- $manager = new ezcSearchEmbeddedManager;
- $session = new ezcSearchSession( $handler, $manager );
- // instantiate article; body, published, url and type are private
- // properties, so all values are passed through the constructor
- $body = <<<ENDBODY
- This is the body of the text, nothing interesting now
- as this is just an example.
- ENDBODY;
- $article = new Article( null, "A test article to show indexing.", $body, time(), "/article/1", "article" );
- // index
- $session->index( $article );
- ?>
If you are indexing a large number of documents, it's wise to wrap the calls in an indexing transaction. For the handlers that support this, it optimizes the indexing process. See the ezcSearchSession->beginTransaction() documentation.
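A transaction around a batch of index() calls might look like the sketch below. indexAll() is a hypothetical helper, and the use of commit() to end the transaction is an assumption that should be checked against the ezcSearchSession documentation:

```php
<?php
// Hypothetical helper; $session is expected to be an ezcSearchSession and
// $documents an array of objects that have a search definition.
function indexAll( $session, array $documents )
{
    $session->beginTransaction();
    foreach ( $documents as $document )
    {
        $session->index( $document );
    }
    // Assumption: commit() ends the transaction started above.
    $session->commit();
}
```

With the session from the previous example, a call such as indexAll( $session, array( $article1, $article2 ) ) would send both documents inside one transaction.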
Data Types
The Search component understands many data types, but they might not always be representable by every handler. The table below explains the different data types that are available:
Constant | Description |
---|---|
BOOLEAN | Stores a true or false boolean value |
STRING | Untokenized text, useful for keywords or facets. |
TEXT | Tokenized text, useful for summaries and large pieces of text. |
HTML | Tokenized HTML documents, strips out all tags and attributes. |
DATE | Stores Unix timestamps and DateTime objects. |
INT | Stores integer numbers, which can be used in range searches. |
FLOAT | Stores floating point numbers, which can be used in range searches. |
Searching
After documents are indexed, they are searchable. Building a search query can be done in two ways. The Query Language approach is the most powerful one, but is more complex. Alternatively you can use the Query Builder approach which lets you feed it a string and it will build the query from that string.
Query Language
The ezcSearchQuery interface defines all the methods that handlers should implement to realize the query language for every handler. This interface defines methods such as where(), lOr() and between() - very similar to what the ezcQuerySelect and ezcQueryExpression classes provide. The following example shows how to use the query language:
- <?php
- require_once 'tutorial_autoload.php';
- // setup
- $handler = new ezcSearchSolrHandler;
- $manager = new ezcSearchEmbeddedManager;
- $session = new ezcSearchSession( $handler, $manager );
- // initialize a pre-configured query
- $q = $session->createFindQuery( 'Article' );
- $searchWord = 'test';
- // where either body or title contains the $searchWord
- $q->where(
- $q->lOr(
- $q->eq( 'body', $searchWord ),
- $q->eq( 'title', $searchWord )
- )
- );
- // limit the query and order
- $q->limit( 10 );
- $q->orderBy( 'title' );
- // add a facet on url (not very useful)
- $q->facet( 'url' );
- // run the query and show titles for found documents
- $r = $session->find( $q );
- foreach( $r->documents as $res )
- {
- echo $res->document->title, "\n";
- }
- ?>
The result of the query is returned in the form of an ezcSearchResult object. This contains the documents, but also information about facets and pagination. See the documentation of the ezcSearchResult class for more information.
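As a rough illustration of what can be read from the result object, the sketch below collects a few values into printable lines. summarizeResult() is a hypothetical helper, and the property names resultCount and facets, as well as the layout of the facet data, are assumptions to verify against the ezcSearchResult documentation:

```php
<?php
// Hypothetical helper; $result is expected to be the ezcSearchResult
// returned by $session->find().
function summarizeResult( $result )
{
    $lines = array();
    // Assumption: resultCount is the total number of matches, which can be
    // larger than the number of documents returned when limit() is used.
    $lines[] = 'matches: ' . $result->resultCount;
    $lines[] = 'returned: ' . count( $result->documents );
    // Assumption: facets is keyed by field name, then by value => count.
    foreach ( $result->facets as $field => $values )
    {
        foreach ( $values as $value => $count )
        {
            $lines[] = "facet $field: $value ($count)";
        }
    }
    return $lines;
}
```

After the query in the example above, echo implode( "\n", summarizeResult( $r ) ) would print the summary.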
Query Builder
The query builder approach allows you to use more powerful query strings instead of having to use the API to create queries. With this you can allow query strings such as foo -bar, while still searching in multiple fields. Be aware, however, that whether the expected results are returned depends on the handler. The query builder will most likely work best if you search in a single field. At the moment the query builder understands the + and - modifiers, grouping with ( and ), the AND and OR operators, and phrases enclosed in double quotes.
An example that searches in two fields (body and title) follows:
- <?php
- require_once 'tutorial_autoload.php';
- // setup
- $handler = new ezcSearchSolrHandler;
- $manager = new ezcSearchEmbeddedManager;
- $session = new ezcSearchSession( $handler, $manager );
- // initialize a pre-configured query
- $q = $session->createFindQuery( 'Article' );
- // where either body or title contains test but not article
- $searchWord = 'test -article';
- // run the query builder to search for the $searchWord in body and title
- $qb = new ezcSearchQueryBuilder();
- $qb->parseSearchQuery( $q, $searchWord, array( 'body', 'title' ) );
- // run the query and show titles for found documents, and its score
- $r = $session->find( $q );
- foreach( $r->documents as $res )
- {
- echo $res->score, ", ", $res->document->title, "\n";
- }
- ?>
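The query string syntax is easier to judge from a few samples. The strings below are illustrative inputs for parseSearchQuery(); the descriptions state the intended meaning only, and the documents that actually match depend on the handler:

```php
<?php
// Sample query strings only. The descriptions are the intended meaning and
// are not guaranteed for every handler.
$examples = array(
    'foo -bar'          => 'foo, excluding documents that contain bar',
    '+foo bar'          => 'foo marked with the + modifier, plus bar',
    '(foo OR bar) -baz' => 'grouping with OR, excluding baz',
    '"foo bar" AND baz' => 'the phrase "foo bar" together with baz',
);

foreach ( $examples as $query => $meaning )
{
    printf( "%-20s %s\n", $query, $meaning );
}
```

Any of these strings can take the place of $searchWord in the example above.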