
Find longest sequence of common words from list of words in python

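I searched a lot for a solution and did find similar questions. One answer returns the longest sequence of characters, which need not occur in all of the strings in the input list. Another answer returns the longest common sequence of words, which must occur in all of the strings in the input list.

I am looking for a combination of the two. That is, I want the longest sequence of common words, where the sequence need not occur in all of the words/phrases of the input list.

Here are some examples of the expected results: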

['exterior lighting', 'interior lighting'] --> 'lighting'

['ambient lighting', 'ambient light'] --> 'ambient'

['led turn signal lamp', 'turn signal lamp', 'signal and ambient lamp', 'turn signal light'] --> 'turn signal lamp'

['ambient lighting', 'infrared light'] --> ''

Thanks

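This code sorts the words by how often they occur in the list. It counts the occurrences of each word across all phrases, drops the words that occur only once, and sorts the rest from most to least frequent.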

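lst = ['led turn signal lamp', 'turn signal lamp', 'signal and ambient lamp', 'turn signal light']

# count how many times each word occurs across all phrases
d = {}
for phrase in lst:
    for word in phrase.split():
        d[word] = d.get(word, 0) + 1

# keep only the words that occur more than once
d_words = {word: count for word, count in d.items() if count != 1}

# sort the remaining words from most to least frequent
sorted_words = sorted(d_words, key=d_words.get, reverse=True)
print(sorted_words)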
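
For comparison, the same word counting can be sketched with `collections.Counter` from the standard library. The function name `repeated_words` is hypothetical, chosen only for this illustration. Note that this ranks individual words by frequency; it does not look at whether the words are adjacent, so on its own it does not yield a contiguous word sequence.

```python
from collections import Counter


def repeated_words(phrases):
    """Return the words that occur more than once, most frequent first."""
    # Count every whitespace-separated word across all phrases.
    counts = Counter(word for phrase in phrases for word in phrase.split())
    # most_common() orders by count (descending); drop words seen only once.
    return [word for word, count in counts.most_common() if count > 1]


phrases = ['led turn signal lamp', 'turn signal lamp',
           'signal and ambient lamp', 'turn signal light']
print(repeated_words(phrases))
```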

A rather crude solution, but I think it works:

from nltk.util import everygrams
import pandas as pd

def get_word_sequence(phrases):

    # collect every n-gram (tuple of consecutive words) of every phrase
    ngrams = []
    for phrase in phrases:
        ngrams.extend(everygrams(phrase.split()))

    # count how many times each n-gram occurs
    counts_per_ngram_series = pd.Series(ngrams).value_counts()

    counts_per_ngram_df = pd.DataFrame({'ngram': counts_per_ngram_series.index, 'count': counts_per_ngram_series.values})

    # filter out the ngrams that appear only once
    # (.copy() avoids pandas' SettingWithCopyWarning when a column is added below)
    counts_per_ngram_df = counts_per_ngram_df[counts_per_ngram_df['count'] > 1].copy()

    if not counts_per_ngram_df.empty:
        # populate the ngramsize column (number of words in the n-gram)
        counts_per_ngram_df['ngramsize'] = counts_per_ngram_df['ngram'].str.len()

        # sort by ngramsize and then by count, both descending
        counts_per_ngram_df.sort_values(['ngramsize', 'count'], inplace=True, ascending=[False, False])

        # get the top ngram and join its words back into a phrase
        top_ngram = " ".join(*counts_per_ngram_df.head(1).ngram.values)

        return top_ngram

    return ''
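
If pulling in nltk and pandas feels heavy, the n-gram idea above can also be sketched with the standard library alone. This is a minimal sketch, not the answerer's code: the function name `longest_shared_word_sequence` is hypothetical, and it makes two assumptions of its own. First, each n-gram is counted at most once per phrase, so a word repeated inside a single phrase does not count as "common". Second, ties are broken by the number of phrases containing the n-gram and then alphabetically, which the question does not specify.

```python
from collections import Counter


def longest_shared_word_sequence(phrases):
    """Return the longest run of words that occurs in at least two phrases."""
    counts = Counter()
    for phrase in phrases:
        words = phrase.split()
        # All contiguous word runs of this phrase; the set ensures each
        # n-gram is counted at most once per phrase.
        seen = {
            tuple(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)
        }
        counts.update(seen)

    # Keep only the n-grams found in more than one phrase.
    shared = [(ngram, n) for ngram, n in counts.items() if n > 1]
    if not shared:
        return ''

    # Prefer more words, then more phrases, then alphabetical order
    # (the last key only makes ties deterministic).
    best, _ = max(shared, key=lambda item: (len(item[0]), item[1], item[0]))
    return ' '.join(best)


print(longest_shared_word_sequence(['exterior lighting', 'interior lighting']))
print(longest_shared_word_sequence(['ambient lighting', 'infrared light']))
```

The number of n-grams grows quadratically with the number of words in a phrase, which is fine for short phrases like the ones in the question but may be slow for long texts.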
