Removing stop words without tokenizing in Python

Question

I am trying to remove stop words from a list of strings for an exercise (ipython file). My solution is as follows:

import csv

# `stopwords` is the list of stop words supplied by the exercise
sentences = []
labels = []
with open("./bbc-text.csv", 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # skip the header row
    for row in reader:
        labels.append(row[0])
        # clean up the sentence
        sentence = row[1]
        for word in stopwords:
            if word in sentence:
                sentence = sentence.replace(" " + word + " ", "")
        sentences.append(sentence)

print(len(sentences))

But when I tokenize the words, the word index has 131530 entries, which is far more than expected:

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(len(word_index))
# Expected output
# 29714

This is the solution provided by the instructor:

sentences = []
labels = []
with open("./bbc-text.csv", 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)
    for row in reader:
        labels.append(row[0])
        sentence = row[1]
        for word in stopwords:
            token = " " + word + " "
            sentence = sentence.replace(token, " ")
            sentence = sentence.replace("  ", " ")
        sentences.append(sentence)

What am I doing wrong?

Thanks, CS

Answer 1

Comparing your solution with the instructor's, you have:

sentence = sentence.replace(" "+ word + " ", "")

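This replaces " to " with an empty string, turning the phrase "go to the store" into "gothe store". I suspect you "created" a lot of nonexistent words this way, which accounts for the difference. The instructor's solution replaces the stop word with a space, which avoids this problem.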
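A minimal sketch of the difference, using a made-up two-word stop list and the sentence from the example above (the helper name `strip_stopwords` is hypothetical, not part of the exercise):

```python
# Hypothetical stop-word list and sentence, chosen only for illustration.
stopwords = ["to", "the"]
sentence = "go to the store"

def strip_stopwords(text, replacement):
    """Replace every ' word ' occurrence with `replacement`."""
    for word in stopwords:
        text = text.replace(" " + word + " ", replacement)
    return text

merged = strip_stopwords(sentence, "")   # the asker's replacement string
spaced = strip_stopwords(sentence, " ")  # the instructor's replacement string

print(repr(merged))  # 'gothe store'
print(repr(spaced))  # 'go store'
```

With the empty replacement, "go" and "the" fuse into the new token "gothe", which a tokenizer then counts as a separate vocabulary entry.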

Answer 2

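Because you join two words together once you find a stop word, you miss other stop words in later iterations. Consider the following example:
"you are as i am"
Every word in the sentence is a stop word. Assume the stop word list is ["are", "as", "i", "am", "you"].
Iteration 1: remove "are"
youas i am
Iteration 2: remove "as": nothing is found to remove!
youas i am
Iteration 3: remove "i"
youasam
As you can see, the remaining words have been merged into a single word that does not exist.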
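The iterations above can be reproduced with a short trace. This is only a sketch: the function name `trace_removal` is hypothetical, and the sentence is lowercase so that the case-sensitive `str.replace` matches each stop word.

```python
# Stop-word list taken from the example above.
stopwords = ["are", "as", "i", "am", "you"]

def trace_removal(sentence, replacement):
    """Return (stop word, sentence after that iteration) for each stop word."""
    steps = []
    for word in stopwords:
        sentence = sentence.replace(" " + word + " ", replacement)
        steps.append((word, sentence))
    return steps

for word, result in trace_removal("you are as i am", ""):
    print(f"after {word!r}: {result!r}")
```

Calling `trace_removal("you are as i am", " ")` instead ends with "you am": the words no longer fuse, but the first and last words survive because they are not surrounded by spaces on both sides.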

Answer 3

Thanks, everyone. I think the fix is simply to add a space to the replacement string, as follows.

if word in sentence:
    sentence = sentence.replace(" " + word + " ", " ")

I now see that if a stop word is at the beginning or end of a sentence, I need to add two more lines, as follows:

for word in stopwords:
    if word in sentence:
        sentence = sentence.replace(" " + word + " ", " ")
        # Caution: the next two calls also match inside longer words,
        # e.g. " the" at the start of " theory".
        sentence = sentence.replace(" " + word, " ")
        sentence = sentence.replace(word + " ", " ")
sentences.append(sentence)

Now the word index I get is shorter.
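One caveat with the two extra `replace` calls is that they also match the beginning or end of longer words. A possible alternative, sketched here with a made-up stop list and a hypothetical helper `remove_stopwords` (this is not the course's solution), is to match whole words with a regular expression and then collapse the leftover whitespace:

```python
import re

# Hypothetical stop-word list, chosen only for illustration.
stopwords = ["the", "a", "to"]

# One pattern that matches any stop word as a whole word (\b = word boundary).
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, stopwords)) + r")\b")

def remove_stopwords(sentence):
    """Blank out whole-word stop words, then collapse repeated whitespace."""
    without = pattern.sub(" ", sentence)
    return " ".join(without.split())

# Plain substring replacement damages a longer word:
naive = "in theory".replace(" " + "the", " ")
print(repr(naive))  # 'in ory'

# Whole-word matching leaves "theory" and "store" intact:
cleaned = remove_stopwords("the theory goes to a store")
print(repr(cleaned))  # 'theory goes store'
```

Note that `\b` treats apostrophes and hyphens as word boundaries, so contractions and hyphenated words may need extra handling depending on the data.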

CS

3 answers

Answer 1
0 2019-09-14 04:32:38

Answer 2
0 2019-09-14 04:43:58

Answer 3
0 2019-09-14 14:14:25
