Sitecore 7 and Lucene Highlighter Examples

I have seen lots of discussion about getting Lucene Highlighter working with Sitecore 7, but not much in the way of code. Highlighted terms seem like a natural fit for your Sitecore 7 search functionality, but there is one important thing to note before getting started: this functionality is not currently built into Sitecore 7. That doesn’t mean that we can’t do it, just that we will need to fall back on the traditional Lucene API instead of a fancy Sitecore wrapper.

The single most important thing to note about using Lucene Highlighter with Sitecore 7 was posted by Alin Parjolea: laubplusco.net/sitecore-7-lucen-3-0-highlighted-results. Thank you Alin! Basically, he says that the Lucene.Net.dll that ships with Sitecore is not compatible with the Contrib libraries (which include Lucene.Net.Contrib.Highlighter.dll). To get a compatible Lucene.Net.dll file I downloaded the latest Lucene.Net NuGet package (v3.0.3) and took the DLL from there. Without the replacement Lucene.Net.dll file you will get this ugly error:

Method not found: 'System.Collections.Generic.ISet`1<!!0> Lucene.Net.Support.Compatibility.SetFactory.CreateHashSet()'.

Since it’s such a common client request to have contextually important words highlighted so they stand out in search results, let’s figure out how to do this in Sitecore 7. I have to admit that I had never used Lucene Highlighter before so I had a bit of learning to do. Let’s look at a simple example.

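// namespaces used by this snippet
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Search.Highlight;
using Sitecore;

// create analyzer and query (searchQuery is the text the user typed)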
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "text", analyzer);
Query query = parser.Parse(searchQuery);

// create highlighter - using strong tag to highlight in this case (change as needed)
IFormatter formatter = new SimpleHTMLFormatter("<strong>", "</strong>");

// excerpt set to 200 characters in length
var fragmenter = new SimpleFragmenter(200);
var scorer = new QueryScorer(query);
var highlighter = new Highlighter(formatter, scorer) { TextFragmenter = fragmenter };

// optional step to remove html tags from content
// (pageContent is assumed here to be the HTML or text of the result you want to excerpt)
string rawPageContent = StringUtil.RemoveTags(pageContent);

// get highlighted fragment
TokenStream stream = analyzer.TokenStream("", new StringReader(rawPageContent));
string highlightedFragment = highlighter.GetBestFragment(stream, rawPageContent);

That’s a good start. The example above will attempt to return the best (most relevant) text fragment, based on the length passed to SimpleFragmenter (200 characters here). Here is an example of what it will look like when the user performs a search.

Search query: Ipsum dolor
[Screenshot: search result excerpt with the matched terms highlighted]
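One case the example above does not cover: GetBestFragment returns null when none of the search terms occur in the text. Here is a minimal sketch of a fallback for that case. The class and method names (ExcerptFallback, Resolve) are made up for illustration, and it assumes that showing the start of the plain text is an acceptable substitute and that maxLength is not negative.

```csharp
public static class ExcerptFallback
{
    // Hypothetical helper: use the highlighted fragment when there is one,
    // otherwise fall back to the first maxLength characters of the plain text.
    public static string Resolve(string highlightedFragment, string rawPageContent, int maxLength)
    {
        if (!string.IsNullOrEmpty(highlightedFragment))
        {
            return highlightedFragment;
        }

        if (string.IsNullOrEmpty(rawPageContent))
        {
            return string.Empty;
        }

        // Short content is returned whole; longer content is cut and marked with an ellipsis.
        return rawPageContent.Length <= maxLength
            ? rawPageContent
            : rawPageContent.Substring(0, maxLength) + "...";
    }
}
```

You would then wrap the call from the example, something like: ExcerptFallback.Resolve(highlighter.GetBestFragment(stream, rawPageContent), rawPageContent, 200).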

This is just a simple example but let’s take a look at some of the options you have when writing the code.

Single or Multiple Fragments

The code above uses the GetBestFragment method to return a single text excerpt. You also have the option of using the GetBestFragments method to return more than 1 relevant excerpt. Multiple fragments is the type of excerpt you are probably used to seeing in Google’s search results. It grabs different excerpts and concatenates them together with some sort of delimiter like an ellipsis (…).

Search query: Ipsum dolor
[Screenshot: search result excerpt made up of 3 highlighted fragments joined together]

The example above shows that the highlighted fragment is now actually 3 separate fragments put together. Personally I like using the multiple fragments method when building a site-wide search results page because users are pretty used to this now (and I think it’s awesome!).

Here is an example of using the GetBestFragments method to grab the 3 most relevant excerpts and separate them with an ellipsis (…):

string highlightedFragment = highlighter.GetBestFragments(stream, rawPageContent, 3, "...");
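If you would rather wrap each excerpt in your own markup instead of joining them with a delimiter, GetBestFragments also has an overload without the separator argument that returns the fragments as a string array. The sketch below puts the earlier pieces into one method. The names ExcerptBuilder and GetExcerpts are hypothetical, and it assumes the compatible Lucene.Net.dll and the Contrib Highlighter are referenced as described at the top of this post.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Search.Highlight;

public static class ExcerptBuilder
{
    // Hypothetical helper: returns up to maxFragments highlighted excerpts as separate strings.
    public static string[] GetExcerpts(string searchQuery, string rawPageContent, int maxFragments)
    {
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "text", analyzer);
        Query query = parser.Parse(searchQuery);

        var highlighter = new Highlighter(
            new SimpleHTMLFormatter("<strong>", "</strong>"),
            new QueryScorer(query))
        {
            TextFragmenter = new SimpleFragmenter(200)
        };

        // A TokenStream is consumed as it is read, so create a fresh one for each highlighter call.
        TokenStream stream = analyzer.TokenStream("", new StringReader(rawPageContent));
        return highlighter.GetBestFragments(stream, rawPageContent, maxFragments);
    }
}
```

Each array entry can then go into its own element (a list item or paragraph, for example) when you render the results.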

Fragment Length

The length of the text excerpt is something you will probably want to adjust depending on your situation. The only thing to note is if you are using the GetBestFragments method then the fragment length you set actually sets the length for each individual fragment. So if you are grabbing the 3 most relevant excerpts then you will end up with a total fragment length that is 3 times the length of what you set.

var fragmenter = new SimpleFragmenter(200);
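If you think in terms of a total excerpt length instead, you can work backwards to the per-fragment size. This is just arithmetic, sketched with a hypothetical helper (FragmentBudget.PerFragmentSize); the delimiters and highlight tags add a little on top, so treat the result as approximate.

```csharp
using System;

public static class FragmentBudget
{
    // Hypothetical helper: split a total character budget evenly across the fragments.
    public static int PerFragmentSize(int totalCharacters, int fragmentCount)
    {
        if (fragmentCount < 1)
        {
            throw new ArgumentOutOfRangeException("fragmentCount", "At least one fragment is required.");
        }

        // Integer division; never return less than 1 character per fragment.
        return Math.Max(1, totalCharacters / fragmentCount);
    }
}
```

For example, a budget of 300 characters across 3 fragments gives new SimpleFragmenter(FragmentBudget.PerFragmentSize(300, 3)), which is a fragment size of 100.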

Query, WildcardQuery, and FuzzyQuery

You can use different types of queries to adjust which words get highlighted in your fragment. My simple examples above use a basic Query, which will highlight a word only if a direct match is found. For example, if a user searches for Lorem then only that exact word will be highlighted. You would not get highlights for partial words (e.g. Lore) or plurals (e.g. Lorems). For this functionality you will need to experiment with other query types like WildcardQuery and FuzzyQuery. The only tricky part is that in order to get WildcardQuery or FuzzyQuery to work for multiple words in a search query, you must also use BooleanQuery as shown below.

Here is an example using FuzzyQuery:

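// namespaces used by this snippet
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Highlight;
using Sitecore;

// create analyzer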
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

// create FuzzyQuery using the BooleanQuery for multiple words
var booleanQuery = new BooleanQuery();
var segments = searchQuery.ToLower().Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
foreach (var segment in segments)
{
    var fuzzyQuery = new FuzzyQuery(new Term("", segment), 0.7f, 3);
    booleanQuery.Add(new BooleanClause(fuzzyQuery, Occur.SHOULD));
}

// create highlighter - using strong tag to highlight in this case (change as needed)
IFormatter formatter = new SimpleHTMLFormatter("<strong>", "</strong>");

// excerpt set to 200 characters in length
var fragmenter = new SimpleFragmenter(200);
var scorer = new QueryScorer(booleanQuery);
var highlighter = new Highlighter(formatter, scorer) { TextFragmenter = fragmenter };

// optional step to remove html tags from content
string rawPageContent = StringUtil.RemoveTags(pageContent);

// get highlighted fragment
TokenStream stream = analyzer.TokenStream("", new StringReader(rawPageContent));
string highlightedFragment = highlighter.GetBestFragment(stream, rawPageContent);

Search query: Lorems dolo
[Screenshot: search result excerpt with the fuzzy matches highlighted]

Here is an example using WildcardQuery:

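// namespaces used by this snippet
using System;
using Lucene.Net.Index;
using Lucene.Net.Search;

// create WildcardQuery using the BooleanQuery for multiple words
var booleanQuery = new BooleanQuery();
var segments = searchQuery.ToLower().Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
foreach (var segment in segments)
{
    // wrap each word in wildcards so partial matches are highlighted too
    var wildcardQuery = new WildcardQuery(new Term("", "*" + segment + "*"));
    booleanQuery.Add(new BooleanClause(wildcardQuery, Occur.SHOULD));
}
// the rest is the same as the FuzzyQuery example: pass booleanQuery to QueryScorer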
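Since the FuzzyQuery and WildcardQuery versions only differ in how each word is turned into a clause, you could fold them into one method. This is only a sketch: the names HighlightQueryBuilder and Build and the useWildcard flag are my own invention, and the fuzzy settings (0.7f, 3) are simply copied from the example above.

```csharp
using System;
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class HighlightQueryBuilder
{
    // Hypothetical helper: builds one SHOULD clause per space-separated word.
    public static BooleanQuery Build(string searchQuery, bool useWildcard)
    {
        var booleanQuery = new BooleanQuery();
        var segments = searchQuery.ToLower().Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);

        foreach (var segment in segments)
        {
            // Wildcard: match the word anywhere inside a term. Fuzzy: match similar terms.
            Query clause = useWildcard
                ? (Query)new WildcardQuery(new Term("", "*" + segment + "*"))
                : new FuzzyQuery(new Term("", segment), 0.7f, 3);

            booleanQuery.Add(clause, Occur.SHOULD);
        }

        return booleanQuery;
    }
}
```

The result goes straight into the scorer, as in new QueryScorer(HighlightQueryBuilder.Build(searchQuery, true)).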

That’s all for now! I hope this will get you started in the right direction.

8 comments on “Sitecore 7 and Lucene Highlighter Examples”
  1. Wow that’s awesome! Thanks for sharing.

    Drazen

  2. The first approach returns only a result if the token is equal to or longer than 4 characters. Have you encountered the same? Where can this be adjusted?

  3. Good question. Can you confirm whether it is only stripping out stop words like “the”, or all words shorter than 4 characters? I am able to see highlighting for words like “can”, “all”, and “we”.

    The StandardAnalyzer used in the first example should strip out stop words. Instead you can replace the StandardAnalyzer with SimpleAnalyzer which retains stop words.

    Just replace the line that creates the StandardAnalyzer with:
    var analyzer = new SimpleAnalyzer();

  4. Josh Jenkins says:

    In the first example, line 15

    string rawPageContent = StringUtil.RemoveTags(pageContent);

    What is pageContent?

  5. SK says:

    Hello there,

    I have a question for you.

    I did everything the same way as you mentioned, however my highlightedFragment comes out as an empty string.

    Regards,
    Srikanth

  6. Barry Clark says:

    Thanks so much for the Highlighter code examples. Was the “last inch” of the project and I was worried about how best to display the results to the user, as I couldn’t make much sense of the highlighter documentation. 🙂

  7. Vincent says:

    Thanks.. this was still useful! The “pagecontent” bit was a bit confusing at first but decided to construct my own index field with all HTML content that needs highlighting (luckily not too much)

  8. Ryan Bailey says:

    It’s also worth noting that the GetBestFragment method will return null if no term matches are found.
