Managing Thesaurus Files - SQL Server 2008

As you’ve seen in the previous sections, the FREETEXT predicate and FREETEXTTABLE function automatically perform word stemming for inflectional forms and thesaurus expansions and replacements. The CONTAINS predicate and CONTAINSTABLE function require you to explicitly state that you want inflectional forms and thesaurus expansions and replacements by using FORMSOF. While inflectional forms include verb conjugations and plural forms of words, thesaurus functionality is based on usermanaged XML files that define word replacement and expansion patterns.

Each language-specific thesaurus is located in an XML file in the MSSQLFTDATA directory of your SQL Server installation. The thesaurus files are named using the format tsnnn.xml, where nnn is a three-letter code representing the language. The tsenu.xml file, for instance, is the US English thesaurus.

To demonstrate the iFTS thesaurus capabilities, we’ll begin by creating a new full-text index on the Production.Product table using the following code:

CREATE FULLTEXT INDEX ON Production.Product
(
Name LANGUAGE 1033,
Color LANGUAGE 1033
)
KEY INDEX PK_Product_ProductID
ON (AdvFTCatalog)
WITH
(
CHANGE_TRACKING AUTO,
STOPLIST = SYSTEM
);
GO
ALTER FULLTEXT INDEX ON Production.Product
ENABLE;
GO

Editing Thesaurus Files

You can edit the thesaurus XML files with a simple text editor or a more advanced XML editor. For this example, we opened the tsenu.xml thesaurus file in Notepad, made the appropriate changes, and saved the file back to the MSSQLFTDATA directory. The contents of the tsenu.xml file, after our edits, are as follows:

<XML ID = "Microsoft Search Thesaurus">
<thesaurus xmlns = "x-schema:tsSchema.xml">
<diacritics_sensitive>0</diacritics_sensitive>
<expansion>
<sub>thin</sub>
<sub>flat</sub>
</expansion>
<replacement>
<pat>brick</pat>
<pat>cherry</pat>
<pat>magenta</pat>
<pat>maroon</pat>
<pat>rose</pat>
<pat>salmon</pat>
<pat>vermilion</pat>
<sub>red</sub>
</replacement>
</thesaurus>
</XML>

The diacritics_sensitive section of the thesaurus file indicates whether accent marks are replaced during expansion and replacement. For instance, if diacritics_sensitive is set to 1, the words resume and resumé are considered equivalent for purposes of the thesaurus. When diacritics_sensitive is set to 0, these two words are considered different.

The expansion section indicates substitutions that should be applied during the full-text query. The word being searched is expanded to match the other words in the expansion set. In the example, if the user queries for the word thin, the search is automatically expanded to include matches for the word flat and vice versa. An expansion set can include as many substitutions as you care to define, and the thesaurus can contain as many expansion sets as you need. The following FREETEXT query demonstrates the expansion sets in action, and Figure shows the results.

SELECT ProductID, Name
FROM Production.Product
WHERE FREETEXT(*, N'thin');

Partial results of full-text query with expansion sets

Partial results of full-text query with expansion sets

The replacement section of the thesaurus file indicates replacements for words that are used in a full-text query. In the example, we’ve defined several patterns, such as brick and cherry, which will be replaced with the word red. The result is that a full-text query for these replacement patterns will be converted internally to a search for red. The following is a FREETEXT query that uses the replacement patterns defined in the thesaurus.

SELECT ProductID, Name, Color
FROM Production.Product
WHERE FREETEXT(*, N'brick');

Figure shows the results. You can use any of the replacement patterns defined in the thesaurus file in the full-text query to get the same results.

Results of full-text query with replacement sets

Results of full-text query with replacement sets

Reloading a Thesaurus

In previous versions of SQL Server, the FTS thesaurus files were loaded once when the full-text engine service was started. There was no documented way to dynamically reload the FTS thesaurus file without stopping and restarting the full-text engine service. SQL Server 2008 provides a new system stored procedure, sys.sp_fulltext_load_thesaurus_file, which allows you to reload thesaurus files without the hassle of stopping and restarting services. Once you have the modified tsenu.xml file in the MSSQLFTDATA directory, you can load it using the following statement:

EXEC sys.sp_fulltext_load_thesaurus_file 1033;

The sys.sp_fulltext_load_thesaurus_file takes an LCID as a parameter. You can get a full list of supported LCIDs by querying sys.fulltext_languages.


All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

SQL Server 2008 Topics