
Find longest sequence of common words from list of words in Python

I searched a lot for a solution and did find similar questions. This answer gives back the longest sequence of CHARACTERS that might NOT belong to all of the strings in the input list. This answer gives back the longest common sequence of WORDS that MUST belong to all of the strings in the input list.

I am looking for a combination of the above solutions. That is, I want the longest sequence of common WORDS that might NOT appear in all of the words/phrases of the input list.

Here are some examples of what is expected:

['exterior lighting', 'interior lighting'] --> 'lighting'

['ambient lighting', 'ambient light'] --> 'ambient'

['led turn signal lamp', 'turn signal lamp', 'signal and ambient lamp', 'turn signal light'] --> 'turn signal lamp'

['ambient lighting', 'infrared light'] --> ''

Thank you

This code sorts the words in your list by how often they occur. It counts every word across the list, then drops the words that appeared only once and sorts the rest, most common first.

lst = ['led turn signal lamp', 'turn signal lamp', 'signal and ambient lamp', 'turn signal light']
d = {}        # word -> number of occurrences across all phrases
d_words = {}  # only the words that occur more than once
for i in lst:
    for j in i.split():
        if j in d:
            d[j] = d[j] + 1
        else:
            d[j] = 1
for k, v in d.items():
    if v != 1:
        d_words[k] = v
# sort the repeated words by count, most common first
sorted_words = sorted(d_words, key=d_words.get, reverse=True)
print(sorted_words)
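The same counting could be sketched with `collections.Counter` from the standard library. This is a variation rather than the answer's own code: it assumes that a word repeated inside a single phrase should be counted once, so each phrase is reduced to a set of words first, and ties are broken alphabetically to keep the output stable. The function name `shared_words` is hypothetical.

```python
from collections import Counter

def shared_words(phrases):
    """Return the words found in more than one phrase, most frequent first."""
    counts = Counter()
    for phrase in phrases:
        # set() so that a word repeated inside one phrase is counted once
        # (an assumption; the loop-based code above counts every occurrence)
        counts.update(set(phrase.split()))
    repeated = [word for word, n in counts.items() if n > 1]
    # highest count first; alphabetical order breaks ties deterministically
    return sorted(repeated, key=lambda word: (-counts[word], word))

lst = ['led turn signal lamp', 'turn signal lamp', 'signal and ambient lamp', 'turn signal light']
print(shared_words(lst))
```

Note that word counts alone ignore adjacency, so on their own they do not identify the longest common sequence the question asks for.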

A rather crude solution, but I think it works:

from nltk.util import everygrams
import pandas as pd

def get_word_sequence(phrases):

    # collect every n-gram (as a tuple of words) from every phrase
    ngrams = []
    for phrase in phrases:
        ngrams.extend(everygrams(phrase.split()))

    # count how many times each n-gram occurs across all phrases
    counts_per_ngram_series = pd.Series(ngrams).value_counts()

    counts_per_ngram_df = pd.DataFrame({'ngram': counts_per_ngram_series.index, 'count': counts_per_ngram_series.values})

    # filter out the ngrams that appear only once
    # (.copy() so the column added below is not written to a slice)
    counts_per_ngram_df = counts_per_ngram_df[counts_per_ngram_df['count'] > 1].copy()

    if not counts_per_ngram_df.empty:
        # populate the ngramsize column (number of words in the ngram)
        counts_per_ngram_df['ngramsize'] = counts_per_ngram_df['ngram'].str.len()

        # sort by ngramsize and then by count, both descending
        counts_per_ngram_df.sort_values(['ngramsize', 'count'], inplace=True, ascending=[False, False])

        # get the top ngram and join its words back into a phrase
        top_ngram = " ".join(*counts_per_ngram_df.head(1).ngram.values)

        return top_ngram

    return ''
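For comparison, here is a minimal standard-library sketch of the same idea, without nltk or pandas. It is not the answer's code: the name `longest_shared_sequence` is hypothetical, and it assumes an n-gram should be counted once per phrase (so a sequence repeated inside a single phrase does not qualify as common) and that ties between equally long sequences go to the one found in more phrases, then alphabetically.

```python
from collections import Counter

def longest_shared_sequence(phrases):
    """Longest run of consecutive words found in at least two phrases."""
    counts = Counter()
    for phrase in phrases:
        words = phrase.split()
        # every contiguous word n-gram of this phrase, deduplicated by the set
        ngrams = {
            tuple(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)
        }
        counts.update(ngrams)

    # keep only the n-grams that occur in more than one phrase
    shared = [ngram for ngram, n in counts.items() if n > 1]
    if not shared:
        return ''

    # longest first, then found in most phrases, then alphabetical
    best = min(shared, key=lambda ngram: (-len(ngram), -counts[ngram], ngram))
    return ' '.join(best)

print(longest_shared_sequence(['led turn signal lamp', 'turn signal lamp',
                               'signal and ambient lamp', 'turn signal light']))
```

Enumerating every n-gram is quadratic in the number of words per phrase, which should be fine for short phrases like the ones in the question.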
