
Use Ruby to create a list of commonly used words or phrases

I'm looking for advice on generating a list of commonly used words and phrases from a bunch of entries in a NoSQL database. Basically, we have a bunch of posts made by someone and we want to tell them "Hey there. You use these words / phrases a lot".

I'm a bit stumped on this one.

My application uses Ruby on Rails, Backbone.js, and Redis.

Since it's not clear how the posts are stored, I'll just assume you can get an array of all the posts.

A simple algorithm to find the most common uncommon words would be the following: Iterate over the array of all the posts, strip everything but the words from each post, and split it into words. Go over all the words in the post and add 1 to the number of times you've seen each word. Once that's done for all the words in all your posts, you'll have a hash mapping every word to its number of occurrences. Remove the most common words (here's an example of 100 common words; you should probably use more in your application). Sort what remains by number of occurrences, in descending order, and you'll have the most commonly occurring words.
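The steps above could be sketched roughly as follows. This is a minimal illustration, not the implementation linked from this answer: the `common_words` method name, the tiny `STOP_WORDS` list, and the sample posts are all made up for the example, and the regex assumes plain English text.

```ruby
# Hypothetical, deliberately tiny stop-word list; a real one should be much longer.
STOP_WORDS = %w[the a an and or of to in is it i you that this for on with].freeze

# Counts how often each non-stop word appears across all posts and
# returns [word, count] pairs, most frequent first (ties broken alphabetically).
def common_words(posts, limit = 10)
  counts = Hash.new(0)
  posts.each do |post|
    # Lowercase the post, then keep only runs of letters and apostrophes.
    post.downcase.scan(/[a-z']+/).each { |word| counts[word] += 1 }
  end
  counts.reject { |word, _| STOP_WORDS.include?(word) }
        .sort_by { |word, count| [-count, word] }
        .first(limit)
end

posts = [
  "I love Ruby and I love Redis.",
  "Ruby on Rails is the framework I use with Redis.",
  "Ruby!"
]
p common_words(posts, 3)
```

`Hash.new(0)` gives every unseen word a starting count of 0, so the counting step needs no existence check.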

It's implemented here. It doesn't handle cases such as treating "posts" and "post" as the same word, which you might want. You could look into how Rails implements String#singularize to get this behavior.

If you want to find commonly used phrases, it gets more interesting. You'd probably have to use some kind of natural language processing, as @sawa pointed out in a comment. I can't come up with a solution that is fast enough off the top of my head.
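As a rough baseline that stops short of real natural language processing, one could count n-grams (runs of n consecutive words) and keep the ones that repeat. This is only a sketch under assumptions: `common_phrases` and its parameters are hypothetical names, it ignores sentence boundaries and stop words, and it makes no claim about being fast enough for a large number of posts.

```ruby
# Counts every run of n consecutive words (an n-gram) within each post and
# returns [phrase, count] pairs seen at least min_count times, most frequent first.
def common_phrases(posts, n = 2, min_count = 2)
  counts = Hash.new(0)
  posts.each do |post|
    words = post.downcase.scan(/[a-z']+/)
    # each_cons yields every window of n adjacent words; phrases never span two posts.
    words.each_cons(n) { |gram| counts[gram.join(" ")] += 1 }
  end
  counts.select { |_, count| count >= min_count }
        .sort_by { |phrase, count| [-count, phrase] }
end

posts = [
  "At the end of the day it works.",
  "It works, at the end of the day."
]
p common_phrases(posts, 3)
```

Running it for a few values of `n` (say 2 through 4) and merging the results gives a crude list of repeated phrases, which could then be filtered further.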

