Comparing list of strings with an available dictionary/thesaurus

I have a C# program that generates a list of strings (permutations of an original string). Most of the strings are random groupings of the original letters, as expected (e.g. etam, aemt, team). I want to programmatically find the one string in the list that is an actual English word. I need a thesaurus/dictionary to look up and compare each string against. Does anyone know of an available resource? I'm using VS2008 with C#.

You could download a list of words from the web (say, one of the files mentioned here: http://www.outpost9.com/files/WordLists.html ), then do a quick lookup like this:

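// Requires: using System; using System.Collections.Generic;

// Read words from file (ReadFromFile is a placeholder for your own loader).
string[] words = ReadFromFile();

// Map each word's sorted letters to every dictionary word made of those letters.
Dictionary<String, List<String>> permuteDict = new Dictionary<String, List<String>>(StringComparer.OrdinalIgnoreCase);

foreach (String word in words) {
    // Array.Sort sorts in place and returns void, so sort a char[] copy first.
    // Lowercasing keeps the letter order the same regardless of the word's case.
    char[] letters = word.ToLowerInvariant().ToCharArray();
    Array.Sort(letters);
    String sortedWord = new String(letters);
    if (!permuteDict.ContainsKey(sortedWord)) {
        permuteDict[sortedWord] = new List<String>();
    }
    permuteDict[sortedWord].Add(word);
}

// To do a lookup, sort the letters of the string you are checking (wordToLook) the same way.
char[] lookupLetters = wordToLook.ToLowerInvariant().ToCharArray();
Array.Sort(lookupLetters);
String sortedWordToLook = new String(lookupLetters);

List<String> outWords;
if (permuteDict.TryGetValue(sortedWordToLook, out outWords)) {
    foreach (String outWord in outWords) {
        Console.WriteLine(outWord);
    }
}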
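
Since you already have the list of permutations, a simpler variant is to load the word list into a HashSet and keep only the candidates it contains. Here is a minimal sketch of that idea, assuming the downloaded file is plain text with one word per line. The names WordListFilter, LoadWords and KeepDictionaryWords are made up for illustration, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class WordListFilter
{
    // Hypothetical loader: assumes a plain-text file with one word per line.
    public static HashSet<string> LoadWords(string path)
    {
        HashSet<string> words = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (string line in File.ReadAllLines(path))
        {
            string word = line.Trim();
            if (word.Length > 0)
            {
                words.Add(word);
            }
        }
        return words;
    }

    // Keeps only the candidates that appear in the word set, in their original order.
    public static List<string> KeepDictionaryWords(IEnumerable<string> candidates, HashSet<string> words)
    {
        List<string> matches = new List<string>();
        foreach (string candidate in candidates)
        {
            if (words.Contains(candidate))
            {
                matches.Add(candidate);
            }
        }
        return matches;
    }
}
```

You would call it with something like WordListFilter.KeepDictionaryWords(permutations, WordListFilter.LoadWords(path)), where path points at whichever word list you downloaded. The sorted-letters dictionary above is the better fit when you only have the original string and don't want to generate the permutations at all.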

You can also use Wiktionary. The MediaWiki API (Wiktionary uses MediaWiki) allows you to query for a list of article titles. In Wiktionary, article titles are (among other things) word entries in the dictionary. The only catch is that foreign words are also in the dictionary, so you might sometimes get "incorrect" matches. Your user will also need internet access, of course. You can get help and info on the API at: http://en.wiktionary.org/w/api.php

Here's an example of your query URL:

http://en.wiktionary.org/w/api.php?action=query&format=xml&titles=dog|god|ogd|odg|gdo

This returns the following XML:

<?xml version="1.0"?>
<api>
  <query>
    <pages>
      <page ns="0" title="ogd" missing=""/>
      <page ns="0" title="odg" missing=""/>
      <page ns="0" title="gdo" missing=""/>
      <page pageid="24" ns="0" title="dog"/>
      <page pageid="5015" ns="0" title="god"/>
    </pages>
  </query>
</api>

In C#, you can then use System.Xml.XPath to get the parts you need (page items with a pageid attribute). Those are the "real words".

I wrote an implementation and tested it (using the simple "dog" example from above). It returned just "dog" and "god". You should test it more extensively.

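// Requires: using System; using System.Collections.Generic; using System.IO;
// using System.Linq; using System.Net; using System.Text; using System.Xml.XPath;
public static IEnumerable<string> FilterRealWords(IEnumerable<string> testWords)
{
    string baseUrl = "http://en.wiktionary.org/w/api.php?action=query&format=xml&titles=";
    // Escape each title so spaces and non-ASCII letters survive in the URL.
    string queryUrl = baseUrl + string.Join("|", testWords.Select(w => Uri.EscapeDataString(w)).ToArray());

    string rawXml;
    using (WebClient client = new WebClient()) // dispose the client when done
    {
        client.Encoding = Encoding.UTF8; // this is very important or the text will be junk
        rawXml = client.DownloadString(queryUrl);
    }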

    TextReader reader = new StringReader(rawXml);
    XPathDocument doc = new XPathDocument(reader);
    XPathNavigator nav = doc.CreateNavigator();
    XPathNodeIterator iter = nav.Select(@"//page");

    List<string> realWords = new List<string>();
    while (iter.MoveNext())
    {
        // if the pageid attribute has a value
        // add the article title to the list.
        if (!string.IsNullOrEmpty(iter.Current.GetAttribute("pageid", "")))
        {
            realWords.Add(iter.Current.GetAttribute("title", ""));
        }
    }

    return realWords;
}

Call it like this:

IEnumerable<string> input = new string[] { "dog", "god", "ogd", "odg", "gdo" };
IEnumerable<string> output = FilterRealWords(input);

I tried using LINQ to XML, but I'm not that familiar with it, so it was a pain and I gave up on it.
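
For comparison, the same filtering step in LINQ to XML could look roughly like this. It is only a sketch: it parses an XML string shaped like the response shown earlier and leaves the download out. The names WiktionaryXml and ExistingTitles are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class WiktionaryXml
{
    // Returns the titles of all <page> elements that have a non-empty pageid attribute.
    // Pages flagged as missing carry no pageid, so they are skipped.
    public static List<string> ExistingTitles(string rawXml)
    {
        XDocument doc = XDocument.Parse(rawXml);
        return doc.Descendants("page")
                  .Where(page => !string.IsNullOrEmpty((string)page.Attribute("pageid")))
                  .Select(page => (string)page.Attribute("title"))
                  .ToList();
    }
}
```

Casting an XAttribute to string gives null when the attribute is absent, which is why the same IsNullOrEmpty check as in the XPath version works here.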
