
Removing stop words without tokenizing in Python

I am trying to remove stop words from a list of strings for an exercise (ipython file), and my solution is as follows:

import csv

# `stopwords` is the stop-word list from the exercise (not shown here).
sentences = []
labels = []
with open("./bbc-text.csv", 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter =',')
    next(reader)
    for row in reader: 
        labels.append(row[0])
        # clean up the sentence
        sentence = row[1]
        for word in stopwords: 
            if word in sentence: 
                sentence = sentence.replace(" "+ word + " ", "")
        sentences.append(sentence)

print(len(sentences))

But when I tokenize the sentences, the word index has 131530 entries, which is much larger than expected:

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(len(word_index))
# Expected output
# 29714 

This is the solution provided by the instructor:

sentences = []
labels = []
with open("./bbc-text.csv", 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)
    for row in reader:
        labels.append(row[0])
        sentence = row[1]
        for word in stopwords:
            token = " " + word + " "
            sentence = sentence.replace(token, " ")
            sentence = sentence.replace("  ", " ")
        sentences.append(sentence)

What am I doing wrong?

Thanks, CS

Comparing your solution with the instructor's, you do:

sentence = sentence.replace(" "+ word + " ", "")

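This replaces " to " with an empty string, turning the phrase "go to the store" into "gothe store". I suspect you are "creating" a lot of non-existent words this way, which accounts for the difference. The instructor's solution replaces the stop word with a space, which avoids this problem.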
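To see the effect on the example phrase, both replacements can be run side by side. This is only a small illustrative sketch; the variable names are made up for the example.

```python
phrase = "go to the store"
stop_word = "to"

# Version from the question: the stop word and both surrounding spaces vanish.
glued = phrase.replace(" " + stop_word + " ", "")

# Instructor's version: one space is left where the stop word was.
spaced = phrase.replace(" " + stop_word + " ", " ")

print(repr(glued))   # 'gothe store'
print(repr(spaced))  # 'go the store'
```

Every glued pair such as "gothe" becomes a new, distinct entry in the tokenizer's word index, which is why the index grows instead of shrinking.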

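Because you join the two neighbouring words together whenever a stop word is found, later iterations can no longer match the other words. Consider the following example:
"you are as i am"
All the words in the sentence are stop words. Suppose the stop word list is ["are", "as", "i", "am", "you"].
Iteration 1: remove "are"
youas i am
Iteration 2: remove "as": nothing found to remove!
youas i am
Iteration 3: remove "i"
youasam
As you can see, the other words have been modified.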
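The walkthrough above can be reproduced with a few lines of Python. The helper name `remove_glued` is hypothetical; its loop body is copied from the question, with the intermediate states recorded so each iteration can be inspected.

```python
def remove_glued(sentence, stopwords):
    """Reproduce the loop from the question: replace ' word ' with ''."""
    steps = []
    for word in stopwords:
        if word in sentence:
            sentence = sentence.replace(" " + word + " ", "")
        # Record the sentence after handling this stop word.
        steps.append((word, sentence))
    return sentence, steps

stopwords = ["are", "as", "i", "am", "you"]
result, steps = remove_glued("you are as i am", stopwords)
for word, state in steps:
    print(f"after {word!r}: {state!r}")
```

The last two iterations ("am" and "you") change nothing either, because by then the string contains no spaces at all.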

Thanks, everyone. I think the solution is simply to use a space as the replacement string, as shown below.

if word in sentence:
    sentence = sentence.replace(" " + word + " ", " ")

Now I see that if a stop word is at the beginning or end of a sentence, I need to add two more lines, as shown below:

for word in stopwords:
    if word in sentence:
        sentence = sentence.replace(" " + word + " ", " ")
        sentence = sentence.replace(" " + word, " ")
        sentence = sentence.replace(word + " ", " ")
sentences.append(sentence)

Now the word index I get is shorter.

CS
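One caveat with the two extra `replace` calls: `" " + word` and `word + " "` also match inside longer words (for example, " the" at the start of " theory"), so they can truncate words that are not stop words. A possible alternative, sketched below under the assumption that words are separated by single spaces and already lowercased, is to pad the sentence with a space on each side so that the single `" word "` pattern also covers the start and end. The function name `strip_stopwords` and the sample stop word list are hypothetical.

```python
def strip_stopwords(sentence, stopwords):
    """Remove whole-word stop words using only str.replace (no tokenizing)."""
    # Pad with spaces so stop words at the start or end are also
    # surrounded by spaces and match the same " word " pattern.
    padded = " " + sentence + " "
    for word in stopwords:
        token = " " + word + " "
        # Repeat until stable: back-to-back occurrences share a space,
        # so a single replace() pass can leave one of them behind.
        while token in padded:
            padded = padded.replace(token, " ")
    return padded.strip()

stopwords = ["the", "to", "a"]  # hypothetical list for illustration
print(strip_stopwords("the theory goes to a the the store", stopwords))
# theory goes store
```

This sketch does not handle punctuation or mixed case; a stop word followed directly by a comma or full stop would be left in place.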

