
Noise Words in Sql Server 2005 Full Text Search

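I am trying to use full text search over a range of names in my database. This is my first attempt at using full text search. Currently I take the search string entered and put a NEAR condition between each term (i.e. the entered phrase "Kings of Leon" becomes "Kings NEAR of NEAR Leon").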
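For illustration, the query-building step described above might look like the following minimal sketch. The class and method names (`NearQueryBuilder`, `BuildNearCondition`) are hypothetical and do not come from the original post; the sketch only shows how the terms are joined, and assumes the resulting string is then passed to the full-text query as a parameter.

```csharp
using System;

public static class NearQueryBuilder
{
    // Hypothetical helper: joins the whitespace-separated terms of the
    // user's input with the full-text NEAR operator.
    public static string BuildNearCondition(string input)
    {
        string[] terms = input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" NEAR ", terms);
    }
}
```

With this sketch, `NearQueryBuilder.BuildNearCondition("Kings of Leon")` returns `Kings NEAR of NEAR Leon`, which still contains the noise word "of".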

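Unfortunately, I have found that this tactic produces false negatives, because SQL Server drops the word "of" when it builds the index, since "of" is a noise word. Thus "Kings Leon" will match correctly, but "Kings of Leon" will not.

My co-worker suggests taking all the noise words defined in MSSQL\FTData\noiseENG.txt and putting them in the .Net code, so that noise words can be stripped before the full text search is executed.

Is this the best solution? Is there some auto-magic setting I can change in SQL Server to do this for me? Or maybe just a better solution that doesn't feel as hacky?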

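Full Text will work off of the search criteria you supply to it. You can remove the noise words from the file, but you really risk bloating your index size by doing that. Robert Cain has a lot of good information on his blog regarding this: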

http://arcanecode.com/2008/05/29/creating-and-customizing-noise-words-in-sql-server-2005-full-text-search/

To save some time, you can look at how this method removes them and copy the code and words:

    public string PrepSearchString(string sOriginalQuery)
    {
        string strNoiseWords = @" 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0 | $ | ! | @ | # | $ | % | ^ | & | * | ( | ) | - | _ | + | = | [ | ] | { | } | about | after | all | also | an | and | another | any | are | as | at | be | because | been | before | being | between | both | but | by | came | can | come | could | did | do | does | each | else | for | from | get | got | has | had | he | have | her | here | him | himself | his | how | if | in | into | is | it | its | just | like | make | many | me | might | more | most | much | must | my | never | now | of | on | only | or | other | our | out | over | re | said | same | see | should | since | so | some | still | such | take | than | that | the | their | them | then | there | these | they | this | those | through | to | too | under | up | use | very | want | was | way | we | well | were | what | when | where | which | while | who | will | with | would | you | your | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z ";

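        // Requires: using System; using System.Collections.Generic;
        // Split the list on '|' and trim each entry so whole terms can be compared.
        var noiseWords = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (string noiseWord in strNoiseWords.Split('|'))
        {
            noiseWords.Add(noiseWord.Trim());
        }

        // Keep only the terms that are not noise words. Comparing whole terms
        // (instead of replacing " word " substrings) also catches noise words at
        // the start or end of the query and the same noise word repeated in a row.
        var kept = new List<string>();
        foreach (string term in sOriginalQuery.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (!noiseWords.Contains(term)) kept.Add(term);
        }
        return string.Join(" ", kept.ToArray());
    }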

However, I would probably go with a Regex.Replace for this, which should be much faster than looping. I just don't have a quick example to post.
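As a rough sketch of what the Regex.Replace idea could look like in C#: the class name `NoiseWordStripper`, the method name `Strip`, and the short word list are assumptions made for illustration only; a real list would come from the noise word file. No performance comparison with the loop is implied here.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NoiseWordStripper
{
    // Illustrative subset only (an assumption, not the full noise word file).
    private static readonly string[] NoiseWords =
        { "about", "after", "all", "an", "and", "of", "the" };

    // A single alternation of escaped words. The lookarounds require each
    // match to be a whole term delimited by whitespace or the string ends.
    private static readonly Regex NoiseRegex = new Regex(
        @"(?<!\S)(?:" + string.Join("|", NoiseWords.Select(w => Regex.Escape(w))) + @")(?!\S)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string Strip(string query)
    {
        // Replace every noise word with a space, then collapse runs of whitespace.
        string withoutNoise = NoiseRegex.Replace(query, " ");
        return Regex.Replace(withoutNoise, @"\s+", " ").Trim();
    }
}
```

With this sketch, `NoiseWordStripper.Strip("Kings of Leon")` returns `Kings Leon`.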

Here is a working function. The file noiseENU.txt is copied as-is from \Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData.

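    ' Requires: Imports System.Text.RegularExpressions
    ' ReadFile is a helper that is not shown here; it presumably returns the file contents as a string.
    Public Function StripNoiseWords(ByVal s As String) As String
        Dim NoiseWords As String = ReadFile("/Standard/Core/Config/noiseENU.txt").Trim
        Dim NoiseWordsRegex As String = Regex.Replace(NoiseWords, "\s+", "|") ' about|after|all|also etc.
        ' Match each noise word as a whole word, plus at most one whitespace character on either side
        NoiseWordsRegex = String.Format("\s?\b(?:{0})\b\s?", NoiseWordsRegex)
        Dim Result As String = Regex.Replace(s, NoiseWordsRegex, " ", RegexOptions.IgnoreCase) ' replace each noise word with a space
        Result = Regex.Replace(Result, "\s+", " ") ' eliminate any multiple spaces
        Return Result.Trim() ' drop the space left behind by a leading or trailing noise word
    End Function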
