
How to stem words in a Python list?

I have a Python list like the one below:

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

Now I need to stem each word in it and get another list of stemmed words. How do I do that?

from stemming.porter2 import stem

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

documents = [[stem(word) for word in sentence.split(" ")] for sentence in documents]

What we are doing here is using a list comprehension to loop through each string in the main list, splitting it into a list of words. Then we loop through that list, stemming each word as we go, and return the new list of stems.

Note that I haven't tried this with stemming installed - I have taken that from the comments and have never used it myself. This is, however, the basic concept for splitting the list into words. Note that this will produce a list of lists of words, keeping the original separation.

If you don't want that separation, you can do:

documents = [stem(word) for sentence in documents for word in sentence.split(" ")]

which will instead leave you with one continuous list.

If you want to join the words back together at the end, you can do:

documents = [" ".join(sentence) for sentence in documents]

or do it all in one line:

documents = [" ".join([stem(word) for word in sentence.split(" ")]) for sentence in documents]

to keep the sentence structure, or

documents = " ".join(documents)

to disregard it.
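To see what the result actually looks like, here is a minimal end-to-end sketch that swaps in NLTK's PorterStemmer (an assumption, since the stemming package may not be installed; exact stems differ slightly between stemmers):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Stem every word but keep one string per original sentence.
stemmed = [" ".join(stemmer.stem(word) for word in sentence.split(" "))
           for sentence in documents]

print(stemmed[0])
# roughly: 'human machin interfac for lab abc comput applic'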

You might want to take a look at NLTK (the Natural Language Toolkit). It has a module nltk.stem which contains various different stemmers.

See also this question.

OK. So, using the stemming package, you would have something like this:

from stemming.porter2 import stem
from itertools import chain

def flatten(listOfLists):
    "Flatten one level of nesting"
    return list(chain.from_iterable(listOfLists))

def stemall(documents):
    return flatten([ [ stem(word) for word in line.split(" ")] for line in documents ])
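Calling it on the documents list from the question is then a one-liner (a usage sketch):

# stemall returns a single flat list of stemmed words across all documents.
stemmed_words = stemall(documents)
print(len(stemmed_words), stemmed_words[:5])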

You can use NLTK:

from nltk.stem import PorterStemmer


ps = PorterStemmer()
final = [[ps.stem(token) for token in sentence.split(" ")] for sentence in documents]

NLTK provides many features for IR systems; it is worth a look.
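If you prefer the stemmed documents back as plain strings rather than token lists, you can join each inner list again, mirroring the earlier answer (a small follow-up sketch using the same final list):

# Re-join each stemmed token list into one string per document.
final_strings = [" ".join(tokens) for tokens in final]
print(final_strings[0])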

You can use Whoosh (http://whoosh.readthedocs.io/):

from whoosh.analysis import CharsetFilter, StemmingAnalyzer
from whoosh import fields
from whoosh.support.charset import accent_map

my_analyzer = StemmingAnalyzer() | CharsetFilter(accent_map)

tokens = my_analyzer("hello you, comment ça va ?")
words = [token.text for token in tokens]

print(' '.join(words))
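To apply the same analyzer to the documents list from the question, something along these lines should work (a sketch; note that the analyzer also lowercases tokens and drops common stop words by default):

# Run each document through the analyzer and collect the stemmed token texts.
stemmed_docs = [[token.text for token in my_analyzer(doc)] for doc in documents]
print(stemmed_docs[0])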

from nltk.stem import PorterStemmer

ps = PorterStemmer()
list_stem = [ps.stem(word) for word in word_list]  # word_list is your list of words

You can use PorterStemmer or LancasterStemmer for stemming.
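The two stemmers can give noticeably different results on the same word; Lancaster is generally the more aggressive one. A small comparison sketch:

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

for word in ["maximum", "running", "interface"]:
    # Lancaster tends to cut words down further than Porter does.
    print(word, porter.stem(word), lancaster.stem(word))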
