How to extract meaningful and frequent words from concatenated strings in Python?

I have a list of concatenated strings, given below, which I wish to split into meaningful and frequent words. The code I have written also gives me all sorts of infrequent words.

con_words = ["stainlesssteel", "screwhammerwing", "goldplated", "bearingball", "inchcountry"]

Expected output:

{"stainlesssteel": ["stainless", "steel"], 
"screwhammerwing": ["screw", "hammer", "wing"], 
"goldplated": ["gold", "plated"], 
"bearingball": ["bearing", "ball"], 
"inchcountry": ["inch", "country"]}

My code

from nltk.corpus import words
from nltk.corpus import stopwords

# All English dictionary words, stopwords and single letters,
# kept in a set for O(1) membership tests
english_words = set(words.words())
stops = set(stopwords.words('english'))
alphabets = {chr(x) for x in range(ord('a'), ord('z') + 1)}
cleaned_words = english_words | stops | alphabets

def extract_words(x):
    # Every substring of length >= 2 that is a known word
    subs = [x[i:j + 1] for i in range(len(x)) for j in range(i + 1, len(x))]
    res = [sub for sub in subs if sub in cleaned_words]
    # Longest matches first
    return sorted(res, key=len, reverse=True)

common_words_dict = {w: extract_words(w)[:5] for w in con_words}

Output:

{'stainlesssteel': ['stainless', 'stain', 'steel', 'tain', 'less'],
 'screwhammerwing': ['hammer', 'screw', 'ammer', 'crew', 'wham'],
 'goldplated': ['plated', 'plate', 'lated', 'gold', 'plat'],
 'bearingball': ['bearing', 'earing', 'bear', 'ring', 'ball'],
 'inchcountry': ['country', 'count', 'inch', 'try', 'in']}

Is there any other way of doing this?

Please help me understand how to get the expected output.

I think the simplest way to get your expected output (although it might use a lot of memory) would be to run the extract_words() function as you do now, and then post-process its output, removing any word that is contained in another of the candidates. This would prevent the fragmented words ('stainless'-'stain'; 'hammer'-'ammer') and would still allow multiple full words. I will update my answer with code once I make something that works.
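In the meantime, here is a minimal sketch of that filtering step, run against the top-5 lists from your output above; filter_fragments is just an illustrative helper name:

# The top-5 candidate lists from the question's output
common_words_dict = {
    'stainlesssteel': ['stainless', 'stain', 'steel', 'tain', 'less'],
    'screwhammerwing': ['hammer', 'screw', 'ammer', 'crew', 'wham'],
    'goldplated': ['plated', 'plate', 'lated', 'gold', 'plat'],
    'bearingball': ['bearing', 'earing', 'bear', 'ring', 'ball'],
    'inchcountry': ['country', 'count', 'inch', 'try', 'in'],
}

def filter_fragments(candidates):
    # Walk the candidates longest-first and drop any word that is
    # a substring of a word we have already kept.
    kept = []
    for word in sorted(candidates, key=len, reverse=True):
        if not any(word in longer for longer in kept):
            kept.append(word)
    return kept

filtered = {k: filter_fragments(v) for k, v in common_words_dict.items()}
print(filtered)
# {'stainlesssteel': ['stainless', 'steel'],
#  'screwhammerwing': ['hammer', 'screw', 'wham'],
#  'goldplated': ['plated', 'gold'],
#  'bearingball': ['bearing', 'ball'],
#  'inchcountry': ['country', 'inch']}

This reproduces the expected words for four of the five strings. 'screwhammerwing' still keeps 'wham' (and misses 'wing', which the [:5] slice already dropped), so the substring filter alone is not the whole answer.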
