Description
The ConfigurableXMLContentExtractor is a content extractor that is more flexible than the standard XMLContentExtractor: it lets you control which parts of your documents are indexed and which are not. It can be configured to ignore specific elements, or to extract only the content in a specific language.
- Class: nl.hippo.slide.extractor.ConfigurableXMLContentExtractor
- Hippo Repository version: 1.2.12+
Configuration
Excluding content from the index
Note: this feature is supported from Hippo Repository version 1.2.12.
Sometimes you don't want certain parts of your documents to be indexed. For example, in documents with the following structure, we do not want the comment element to be indexed.
<document>
  <title>My Document</title>
  <date>2007-06-05</date>
  <author>John Doe</author>
  <comment>blah</comment><!-- element to be excluded from index -->
  <content> lorem ipsum ... </content>
</document>
In the repository configuration we specify a ConfigurableXMLContentExtractor, and tell it to exclude elements named "comment":
<extractor classname="nl.hippo.slide.extractor.ConfigurableXMLContentExtractor" uri="/files" content-type="text/xml">
  <configuration>
    <exclude-element name="comment"/>
  </configuration>
</extractor>
That's all! Any number of "exclude-element" elements can be specified, and the elements to be excluded from indexing can appear anywhere in your document structure.
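As an illustration of listing several exclusions, the sketch below excludes both the comment and the author elements of the example document. Excluding author is an assumption made purely for this example; everything else mirrors the configuration shown above.

```xml
<extractor classname="nl.hippo.slide.extractor.ConfigurableXMLContentExtractor" uri="/files" content-type="text/xml">
  <configuration>
    <!-- one exclude-element entry per element name to skip -->
    <exclude-element name="comment"/>
    <!-- hypothetical second exclusion, chosen only for illustration -->
    <exclude-element name="author"/>
  </configuration>
</extractor>
```

With a configuration along these lines, neither "blah" nor "John Doe" from the example document should end up in the index.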
Handling multilingual content
Note: this feature is supported from Hippo Repository version 1.2.13.
Sometimes content in different languages is stored in the same document, e.g.
<document>
  <content_en> a bad example </content_en>
  <content_nl> een slecht voorbeeld </content_nl>
</document>
By default, all content in one document is stored in one field in the index. This means that in our example, a search query for "bad", e.g.
... <d:contains>bad</d:contains> ...
will return not only documents about bad things, but possibly also documents about bathtubs because in Dutch, "bad" means "bath".
To prevent this, define a ConfigurableXMLContentExtractor for each language and specify which XML element wraps the content in that language. For our example document, we define the following in extractors.xml:
<extractor classname="nl.hippo.slide.extractor.ConfigurableXMLContentExtractor" uri="/files" content-type="text/xml">
  <configuration>
    <language-element locale="en" element="content_en"/>
  </configuration>
</extractor>
<extractor classname="nl.hippo.slide.extractor.ConfigurableXMLContentExtractor" uri="/files" content-type="text/xml">
  <configuration>
    <language-element locale="nl" element="content_nl"/>
  </configuration>
</extractor>
Now that we have a separate extractor for each language, we can also analyze each language's content with an analyzer appropriate for that language. In the indexer configuration (dasl-indexer.xml), specify the analyzer to use with a property element within the properties part of the configuration. The name attribute must match the locale specified in the extractor configuration (e.g. "en" or "nl"). In the analyzer attribute you can specify any Lucene analyzer.
<properties>
  ...
  <property type="text" name="en" analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
  <property type="text" name="nl" analyzer="org.apache.lucene.analysis.nl.DutchAnalyzer"/>
  ...
</properties>
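The pairing between the two files can be sketched for a hypothetical third language. The element name content_de and the locale "de" are assumptions that do not appear in the example document; GermanAnalyzer is one of Lucene's language-specific analyzers, and the sketch assumes that class is available on the repository's classpath.

```xml
<!-- extractors.xml: hypothetical extractor for German content -->
<extractor classname="nl.hippo.slide.extractor.ConfigurableXMLContentExtractor" uri="/files" content-type="text/xml">
  <configuration>
    <language-element locale="de" element="content_de"/>
  </configuration>
</extractor>

<!-- dasl-indexer.xml, inside the properties element: name="de" matches locale="de" above -->
<property type="text" name="de" analyzer="org.apache.lucene.analysis.de.GermanAnalyzer"/>
```

The only link between the two fragments is the shared value "de", so a mismatch between locale and name would leave the German content without its intended analyzer.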
Hippo Repository is now configured for language-specific indexing and searching. Restart the repository for the new configuration to take effect.
In your DASL queries you can now use a locale attribute on d:contains to limit a search query to a specific language, e.g.
... <d:contains locale="en">bad</d:contains> ...
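For context, a complete request around this fragment might look like the sketch below. The surrounding elements follow the standard DAV:basicsearch grammar; the scope path /files, the depth, and the use of d:allprop are assumptions chosen for illustration and are not taken from the text above.

```xml
<d:searchrequest xmlns:d="DAV:">
  <d:basicsearch>
    <!-- assumption: return all properties of each match -->
    <d:select>
      <d:allprop/>
    </d:select>
    <d:from>
      <d:scope>
        <!-- hypothetical scope; adjust to your repository layout -->
        <d:href>/files</d:href>
        <d:depth>infinity</d:depth>
      </d:scope>
    </d:from>
    <d:where>
      <!-- only the English content is searched for "bad" -->
      <d:contains locale="en">bad</d:contains>
    </d:where>
  </d:basicsearch>
</d:searchrequest>
```

Changing locale to "nl" in the same request would instead search the Dutch content, where "bad" means "bath".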