
Most common n words in a text

I am currently learning to work with NLP. One of the problems I am facing is finding the n words that most often occur together in a text. Consider the following:

text = ['Lion Monkey Elephant Weed', 'Tiger Elephant Lion Water Grass', 'Lion Weed Markov Elephant Monkey Fine', 'Guard Elephant Weed Fortune Wolf']

Suppose n = 2. I am not looking for the most common bigrams. I am looking for the pairs of words that occur together in the same string most often, whether or not they are adjacent. For example, the output for the above should be:

'Lion' & 'Elephant': 3
'Elephant' & 'Weed': 3
'Lion' & 'Monkey': 2
'Elephant' & 'Monkey': 2

and so on.

Could anyone suggest a suitable way to tackle this?

It was tricky, but I solved it for you. I count the spaces in each element to tell how many words it contains: an element with 3 words must contain 2 spaces, so an element with exactly 1 space has 2 words. With that check, only the elements with 2 words are printed.

l = ["hello world", "good night world", "good morning sunshine", "wassap babe"]
for elem in l:
    if elem.count(" ") == 1:
        print(elem)

Output:

hello world
wassap babe
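As a side note, counting spaces assumes the words are separated by exactly one space and that there is no leading or trailing whitespace. A minimal sketch of the same filter using str.split() instead, which splits on any run of whitespace (the helper name two_word_items and the sample list are made up for illustration):

```python
def two_word_items(items):
    """Return the items made of exactly two whitespace-separated words.

    Hypothetical helper, not part of the answer above.
    """
    # split() with no argument ignores leading, trailing and repeated whitespace
    return [item for item in items if len(item.split()) == 2]


phrases = ["hello world", "good night world", "good  morning", " wassap babe "]
print(two_word_items(phrases))
# ['hello world', 'good  morning', ' wassap babe ']
```

Note that this still only selects strings of two words; it does not count which pairs of words occur together.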
  

I would suggest using Counter and combinations as follows.

from collections import Counter
from itertools import chain, combinations

text = ['Lion Monkey Elephant Weed', 'Tiger Elephant Lion Water Grass', 'Lion Weed Markov Elephant Monkey Fine', 'Guard Elephant Weed Fortune Wolf']


def count_combinations(text, n_words, n_most_common=None):
    count = []
    for t in text:
        words = t.split()
        # Every group of n_words words taken from the same string
        combos = combinations(words, n_words)
        # Sort inside each group so that word order does not matter
        count.append([" & ".join(sorted(c)) for c in combos])
    # Flatten the per-string lists, then count identical groups
    return dict(Counter(sorted(chain(*count))).most_common(n_most_common))


print(count_combinations(text, 2))
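
One thing to watch for with this approach: if a word appears twice in the same string, combinations will pair it with itself and count its other pairs twice. Below is a rough variant, assuming each word should count at most once per string. The function name cooccurrence_counts is hypothetical, and it returns a Counter keyed by tuples rather than joined strings:

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(lines, n_words):
    """Count groups of n_words distinct words that share a line.

    Hypothetical helper; assumes a word repeated within one line
    should be counted only once for that line.
    """
    counts = Counter()
    for line in lines:
        # set() drops repeats within the line; sorted() gives every
        # group the same canonical (alphabetical) order.
        unique_words = sorted(set(line.split()))
        counts.update(combinations(unique_words, n_words))
    return counts


text = ['Lion Monkey Elephant Weed', 'Tiger Elephant Lion Water Grass',
        'Lion Weed Markov Elephant Monkey Fine', 'Guard Elephant Weed Fortune Wolf']

pair_counts = cooccurrence_counts(text, 2)
print(pair_counts[("Elephant", "Lion")])    # 3
print(pair_counts[("Elephant", "Weed")])    # 3
print(pair_counts[("Lion", "Monkey")])      # 2
print(pair_counts[("Elephant", "Monkey")])  # 2
```

For the sample text the counts match the ones asked for in the question. Calling pair_counts.most_common(k) then gives the k most frequent pairs.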

