To stop or not to stop

The Canadiana Discovery Portal has been in operation for about a year now. Over the course of that year, we have changed the Solr index schema numerous times, experimenting to find what works best.

Solr supports a stopword filter, which initially seemed like a good idea, so we indexed the data with it enabled. However, the Discovery Portal contains both French and English content, as well as a significant amount of material in other languages. Many words that are stop words in one language are significant in another. Add “the” as a stopword and you remove a definite article in English, but if the index also folds accents, you make it harder to search for thé, the French word for tea.
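The collision can be sketched in a few lines of Python. This is not Solr's analysis chain, only a toy stand-in: the stopword list, the whitespace tokenizer, and the accent-folding step are all assumptions made for illustration.

```python
import unicodedata

# Hypothetical, deliberately tiny English stopword list (not Solr's default).
ENGLISH_STOPWORDS = frozenset({"the", "a", "of", "and"})


def fold_accents(token):
    """Strip diacritics, so that 'thé' becomes 'the'."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text, stopwords=frozenset()):
    """Lowercase, split on whitespace, fold accents, then drop stopwords."""
    tokens = [fold_accents(t) for t in text.lower().split()]
    return [t for t in tokens if t not in stopwords]


english = analyze("the history of tea", ENGLISH_STOPWORDS)
french = analyze("Le thé du Canada", ENGLISH_STOPWORDS)
```

In this sketch the English sentence loses only filler words, while the French one loses the word a reader would actually search for, because the folded form of thé is indistinguishable from the English article.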

A second problem is that removing stop words from the index makes it harder to search for phrases. This problem can be overcome by indexing each field twice, once with stop words removed and once without any removal, and then searching on the appropriate field. Doing this increases the index size, so if the purpose of removing stop words is to make the index smaller, it is counterproductive.
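A minimal sketch of the phrase problem, again in plain Python rather than Solr. The documents, the stopword list, and the helper names are hypothetical, and the matching is simplified: it ignores the position gaps that a real analyzer may keep where stop words were removed.

```python
# Hypothetical stopword list and documents, for illustration only.
STOPWORDS = frozenset({"to", "or", "not", "and"})

DOCS = {
    "a": "to stop or not to stop",
    "b": "stop and stop again",
}


def tokenize(text, stopwords=frozenset()):
    """Lowercase, split on whitespace, and drop any stopwords."""
    return [t for t in text.lower().split() if t not in stopwords]


def contains_phrase(tokens, phrase):
    """True if `phrase` occurs as a contiguous run inside `tokens`."""
    n = len(phrase)
    return n > 0 and any(
        tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1)
    )


def phrase_search(query, stopwords=frozenset()):
    """Return ids of documents whose analyzed text contains the analyzed query."""
    phrase = tokenize(query, stopwords)
    return sorted(
        doc_id
        for doc_id, text in DOCS.items()
        if contains_phrase(tokenize(text, stopwords), phrase)
    )


# Same query against the "stopped" field and the unmodified field.
stopped_hits = phrase_search("to stop or not to stop", STOPWORDS)
exact_hits = phrase_search("to stop or not to stop")
```

With stop words removed, the query collapses to "stop stop" and both documents match; against the unmodified tokens only the document containing the actual phrase does. Keeping both fields gives the choice of behaviour at query time, at the cost of storing every token stream twice.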

In the latest version of the schema, we don’t use stopword filters. There doesn’t seem to be any significant effect on query times, and dropping the filters avoids many of the phrase-search and multilingual problems described above. We will no doubt continue to experiment with our index schema over time. It still takes only a few hours to completely rebuild the index, making experimentation fairly cheap and easy.