
Keyword sorting algorithm

I have over 1000 surveys, many of which contain open-ended replies.

I would like to be able to parse all the words and get a ranking of the most used words (disregarding common words) to spot a trend.

How can I do this? Is there a program I can use?

EDIT: If a third-party solution is not available, it would be great if we could keep the discussion to Microsoft technologies only. Cheers.

Divide and conquer. Split up your problem into many smaller problems and solve each of them.

First problem: turn a paragraph into a list of words.

You are fortunate because you don't have to worry about being perfect. Actually parsing natural languages to determine exactly what "a word" is can be very difficult, but frankly you probably don't really care whether "lightbulb" has the same semantics as "light bulb". Since you are in particular looking for common words (for now, more on that later), the interesting ones are precisely those that are easy to identify because they come up a lot.

So, break this problem down further. You want a list of words. Start by getting a string with the text in it:

// File.ReadAllText (System.IO) opens the file, reads all of it, and closes it again.
string source = File.ReadAllText(@"c:\survey.txt");

Great, you've got a string somehow. Now turn that into an array of words. Because you probably want to count "Frog" and "frog" as the same word, make everything lowercase. How to do all that? Split the lowercase string up based on spaces, newlines, tabs and punctuation:

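char[] punctuation = new char[] {' ', '\n', '\r', '\t', '(', ')', '"'};
// RemoveEmptyEntries drops the empty strings produced by adjacent separators.
string[] tokens = source.ToLower().Split(punctuation, StringSplitOptions.RemoveEmptyEntries);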

Now examine the output. That was terrible. There are all kinds of things we forgot: periods and commas and colons and semicolons and so on. Figure out which punctuation you care about and add it to the list.

Is ToLower the right thing to do? What about ToLowerInvariant? There are times you want to stress about it; this isn't one of them. The fact that ToLower doesn't necessarily canonicalize the Turkish lowercase I in a manner that consistently round-trips is unlikely to throw off your summary statistics. We're not going for pinpoint accuracy here. If someone says "luxury-yacht", and someone says "luxury yacht", the former might be one word if you forget to break on hyphens. Who cares? Hyphenated words are unlikely to be in your top ten anyway.
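Rather than growing the separator list by hand, one alternative is to invert the test and keep only the characters that can be part of a word. The sketch below is an assumption about what counts as a word (letters, digits, and apostrophes inside a word); the Tokenizer class and its method names are hypothetical, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Tokenizer
{
    // Treats every run of letters, digits, or apostrophes as one word.
    // Anything else (spaces, periods, commas, hyphens, ...) separates words.
    public static List<string> Tokenize(string source)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        foreach (char c in source.ToLower())
        {
            if (char.IsLetterOrDigit(c) || c == '\'')
                current.Append(c);
            else
                Flush(current, tokens);
        }
        Flush(current, tokens); // the text may end in the middle of a word
        return tokens;
    }

    // Adds the pending word, if any, after trimming quote-like apostrophes
    // from its ends, then resets the buffer.
    private static void Flush(StringBuilder current, List<string> tokens)
    {
        string word = current.ToString().Trim('\'');
        if (word.Length > 0)
            tokens.Add(word);
        current.Clear();
    }
}
```

With this approach "luxury-yacht" becomes two tokens and "don't" stays one. Whether that is what you want depends on your data, so treat the character test as a starting point to adjust.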

Next problem: count all the occurrences of each word:

var firstPass = new Dictionary<string, int>();
foreach(string token in tokens)
{
    if (!firstPass.ContainsKey(token))
        firstPass[token] = 1;
    else
        ++firstPass[token];
} 

Great. We now have a dictionary that maps words to integers. Trouble is, that's backwards. What you want to know is what are all the words that have the same number of occurrences. A dictionary is a sequence of key/value pairs, so group it:

var groups = from pair in firstPass
             group pair.Key by pair.Value;

OK, now we have a sequence of groups of words, each one associated with its count of occurrences. Order it. Remember, the key of the group is the value of the dictionary, the count:

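// "group" is a contextual keyword inside a query, so use another name;
// descending puts the most frequent words first.
var sorted = from g in groups
             orderby g.Key descending
             select g;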

And you want the top hundred, let's say:

foreach(var g in sorted.Take(100))
{
  Console.WriteLine("Words with count {0}:", g.Key);
  foreach(var w in g)
    Console.WriteLine(w);
}

And you're done.
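The question also asks to disregard common words, which the steps above do not do yet. A minimal sketch, assuming the tokens are already lowercase: filter them against a stop list before counting. The StopWordFilter class and the short word list are hypothetical; a real stop list would be much longer and tuned to your surveys.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordFilter
{
    // A tiny illustrative stop list; extend it with whatever is "common" in your data.
    private static readonly HashSet<string> StopWords = new HashSet<string>
    {
        "a", "an", "and", "the", "of", "to", "in", "is", "it", "i", "was"
    };

    // Counts every token that is not on the stop list.
    // Assumes the tokens have already been lowercased.
    public static Dictionary<string, int> CountInterestingWords(IEnumerable<string> tokens)
    {
        return tokens
            .Where(token => !StopWords.Contains(token))
            .GroupBy(token => token)
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

The resulting dictionary has the same shape as firstPass, so the grouping and ordering steps work on it unchanged.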

Now, is this really what you're interested in? I think it might be more interesting to look for unusual words, or pairs of words. If the words "yacht" and "racing" show up together a lot, that's not a surprise. If "tomato" and "ketchup" show up a lot together, that's not surprising either. If "tomato" and "racing" start showing up together, then maybe something noteworthy is going on.

That requires much deeper analysis; read up on Bayes' Theorem if that's the sort of thing you're interested in.
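As a first step in that direction, here is a rough sketch of one simple measure, which may not be what the answer has in mind: the "lift" of a word pair, estimated as P(a and b) / (P(a) * P(b)) over survey responses. A value well above 1 means the two words appear in the same response more often than they would if they were independent. The CoOccurrence class is hypothetical, and it assumes each response has already been turned into a list of tokens.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CoOccurrence
{
    // For every unordered pair of words that share at least one response,
    // returns lift = P(a and b) / (P(a) * P(b)), with probabilities
    // estimated as the fraction of responses containing the word(s).
    public static Dictionary<Tuple<string, string>, double> Lift(
        IEnumerable<IEnumerable<string>> responses)
    {
        var wordCounts = new Dictionary<string, int>();
        var pairCounts = new Dictionary<Tuple<string, string>, int>();
        int total = 0;

        foreach (var response in responses)
        {
            total++;
            // Count each word once per response, in a fixed order so that
            // (a, b) and (b, a) map to the same key.
            var words = response.Distinct()
                                .OrderBy(w => w, StringComparer.Ordinal)
                                .ToList();
            for (int i = 0; i < words.Count; i++)
            {
                int seen;
                wordCounts.TryGetValue(words[i], out seen);
                wordCounts[words[i]] = seen + 1;

                for (int j = i + 1; j < words.Count; j++)
                {
                    var pair = Tuple.Create(words[i], words[j]);
                    int together;
                    pairCounts.TryGetValue(pair, out together);
                    pairCounts[pair] = together + 1;
                }
            }
        }

        return pairCounts.ToDictionary(
            kv => kv.Key,
            kv => (double)kv.Value * total
                  / ((double)wordCounts[kv.Key.Item1] * wordCounts[kv.Key.Item2]));
    }
}
```

Pairs seen only once or twice produce unreliable values, so in practice you would probably ignore pairs below some minimum count before ranking by lift.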

Also note that this tracks the raw count of words, not their frequency, that is, the number of times they appear per thousand words. That might also be an interesting metric to measure: not just how many times this word appeared in total, but how often it appeared as a percentage of the text.
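Converting the raw counts into a per-thousand-words figure is a small step. A minimal sketch, assuming a dictionary shaped like firstPass above (the WordFrequency class name is hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordFrequency
{
    // Converts raw counts into occurrences per thousand counted words.
    public static Dictionary<string, double> PerThousand(Dictionary<string, int> counts)
    {
        double total = counts.Values.Sum();
        return counts.ToDictionary(
            pair => pair.Key,
            pair => total == 0 ? 0.0 : 1000.0 * pair.Value / total);
    }
}
```

Note that the total here is the sum of the counts in the dictionary, so if you removed common words before counting, the rate is relative to the remaining words rather than to the full text.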

NLTK contains tons of useful tools for dealing with natural language.

Check out this article (linked from the NLTK site) for an example of building a persistent, network-accessible frequency distribution. Even if it's not exactly what you're looking for, it might help you get a feel for how to approach your problem.

UPDATE:

Re: MS technologies, you can run NLTK on .NET using IronPython. See this related SO question.

SharpNLP is a native .NET library for doing NLP. I don't know how it compares to NLTK, since I'd never heard of it until I went Googling.

You can create a Lucene index of the text with a custom stop word list, which will skip common words. Open the Lucene index with Luke and it will show you the top terms in the index.

You can enable stemming while indexing so that words are reduced to their root form. That will help you group together different forms of the same word (plurals, different tenses, etc.). That is, "questions", "question", "questioned" and so on will all show up as "question". A plain word count will not do this for you.
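To get a feel for what stemming buys you without setting up an index, the grouping idea can be illustrated with naive suffix stripping. This is not Lucene's stemmer or the Porter algorithm, only a hypothetical sketch (the CrudeStemmer class and its suffix list are assumptions), and it gets many English words wrong.

```csharp
using System;

public static class CrudeStemmer
{
    // Checked in order, so longer suffixes win over the bare "s".
    private static readonly string[] Suffixes = { "ing", "ed", "es", "s" };

    // Strips the first matching suffix, keeping at least three characters
    // so that short words such as "is" or "red" are left alone.
    public static string Stem(string word)
    {
        foreach (string suffix in Suffixes)
        {
            if (word.Length - suffix.Length >= 3 &&
                word.EndsWith(suffix, StringComparison.Ordinal))
            {
                return word.Substring(0, word.Length - suffix.Length);
            }
        }
        return word;
    }
}
```

Applying Stem to each token before counting merges "questions", "questioned" and "questioning" into "question". It also turns "races" into "rac" while leaving "race" alone, which is exactly the kind of mistake a real stemmer is designed to avoid.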
