
How to find most common phrases from a list in python?

I struggle with the following: I have an input list:

input_list = [
    "Beneficiile pozitive ale productName:",
    "Care sunt ingredientele product name?",
    "Ce este product name ddd?",
    "Ce face product name decât orice altă îngrijire a pielii?",
    "Cum funcționează?",
    "product name pret – Ia Piele Tineresc Natural! Pareri, Cumpăra",
    "Offer Nutra",
    "Offering Top Nutritional",
    "În cazul în care pentru a cumpara Crema product name?",
]

I need to count each list item and get the most frequent phrase or word from the whole list of items.

There are some answers here that count words, but in this case I need a two-word phrase to be returned.

Expected output:

In this case the returned output should be 'product name' because it occurs in 5 list items.

Again: I don't want to count words, but phrases which occur multiple times across list items.

This is my implementation; it's a bit tricky, but it works anyway:

from string import punctuation

input_list = [
    "Beneficiile pozitive ale productName:",
    "Care sunt ingredientele product name?",
    "Ce este product name ddd?",
    "Ce face product name decât orice altă îngrijire a pielii?",
    "Cum funcționează?",
    "product name pret – Ia Piele Tineresc Natural! Pareri, Cumpăra",
    "Offer Nutra",
    "Offering Top Nutritional",
    "În cazul în care pentru a cumpara Crema product name?"]

most_common_phrase = ''
duplicates_num = 0

f = lambda x: x.translate(str.maketrans('', '', punctuation)).lower()  # strip punctuation, lowercase
phrases = f(' 000 '.join(input_list))  # join items with a ' 000 ' divider so phrases cannot span two items

for i in input_list:
    phrase = f(i).split()
    for j in range(len(phrase)-1):
        for y in range(j+2,len(phrase)+1):
            phrase_comb = ' '.join(phrase[j:y])
            if (n:=phrases.count(phrase_comb)) > duplicates_num:
                duplicates_num = n
                most_common_phrase = phrase_comb
                
print(f'{most_common_phrase = }\n{duplicates_num = }')

Output:

most_common_phrase = 'product name'
duplicates_num = 5
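Note that the walrus operator (:=) and the f'{expr = }' debug formatting both require Python 3.8+.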

This is an ugly task: it boils down to starting with two-word phrases and counting them, then three-word phrases, and so on, until finally the whole input list element, considered as one phrase, is counted. (There may be additional criteria for what counts as a phrase, so some candidates may be skipped.) Per phrase, you have a runtime proportional to the square of the number of input words. It becomes even worse if you are free to ignore word boundaries, i.e. in the first element "productName" should be counted as "product name" as well, since you may then require a dictionary to identify valid substrings (which may easily produce huge numbers of false hits).
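For reference, here is a minimal sketch of that brute-force counting with collections.Counter, assuming a phrase is counted at most once per list item (the most_common_multiword helper is hypothetical, not from the answers here, and reuses input_list from the question):

from collections import Counter
from string import punctuation

def most_common_multiword(items):
    """Return the multi-word phrase shared by the most list items, with its count."""
    counts = Counter()
    for item in items:
        words = item.translate(str.maketrans('', '', punctuation)).lower().split()
        phrases = set()  # a set, so each phrase is counted at most once per item
        for n in range(2, len(words) + 1):       # phrase lengths: 2 words up to the whole item
            for i in range(len(words) - n + 1):
                phrases.add(' '.join(words[i:i + n]))
        counts.update(phrases)
    return counts.most_common(1)[0] if counts else ('', 0)

print(most_common_multiword(input_list))  # expected: ('product name', 5)

This enumerates O(m^2) phrases per item of m words, which matches the quadratic runtime described above.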

OK, I somehow figured out how to solve that. It's best to use some NLP library like nltk:

import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()


input_list = [
    "Beneficiile pozitive ale productName:",
    "Care sunt ingredientele product name?",
    "Ce este product name ddd?",
    "Ce face product name decât orice altă îngrijire a pielii?",
    "Cum funcționează?",
    "product name pret – Ia Piele Tineresc Natural! Pareri, Cumpăra",
    "Offer Nutra",
    "Offering Top Nutritional",
    "În cazul în care pentru a cumpara Crema product name?"]

def func(some_list):
    outer_list = []
    for i in some_list:
        tokens = nltk.wordpunct_tokenize(i)
        finder = BigramCollocationFinder.from_words(tokens)
        finder.apply_freq_filter(1)  # drop bigrams occurring fewer than once (i.e. keep all)
        outer_list.append(finder.nbest(bigram_measures.pmi, 10))  # top 10 bigrams per item by PMI

    flattened_list = [item for sublist in outer_list for item in sublist]
    frequency_distribution = nltk.FreqDist(flattened_list)  # count bigrams across all items
    most_common_element = frequency_distribution.max()

    return ' '.join(most_common_element)

print(func(input_list))
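A note on the two nltk calls: apply_freq_filter(1) removes bigrams occurring fewer than once, so here it effectively keeps everything, and nbest(bigram_measures.pmi, 10) ranks each item's bigrams by pointwise mutual information and keeps at most ten of them. Since the ('product', 'name') bigram shows up in five of the per-item lists, FreqDist.max() should return it, and the function should print product name for the sample input.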

