Removing Stop words without tokenizing in Python
I am trying to remove stop words from a list of sentences in an exercise (ipython file). My solution is as follows:
import csv

sentences = []
labels = []
# stopwords is the list of stop words provided by the exercise
with open("./bbc-text.csv", 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)
    for row in reader:
        labels.append(row[0])
        # clean up the sentence
        sentence = row[1]
        for word in stopwords:
            if word in sentence:
                sentence = sentence.replace(" " + word + " ", "")
        sentences.append(sentence)
print(len(sentences))
But when I tokenize the sentences, the word index has length 131530, which is much larger than expected:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(len(word_index))
# Expected output
# 29714
Here is the solution provided by the instructor:
sentences = []
labels = []
with open("./bbc-text.csv", 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)
    for row in reader:
        labels.append(row[0])
        sentence = row[1]
        for word in stopwords:
            token = " " + word + " "
            sentence = sentence.replace(token, " ")
        sentence = sentence.replace("  ", " ")  # collapse double spaces
        sentences.append(sentence)
What am I doing wrong?

Thanks, CS
Comparing your solution with the instructor's, you have:

sentence = sentence.replace(" " + word + " ", "")

This replaces " to " (for example) with an empty string, turning the phrase "go to the store" into "gothe store". I suspect you are "creating" a lot of words that don't exist this way, which explains the difference. The instructor's solution replaces each stop word with a single space, which avoids the problem.
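The difference can be seen in a minimal standalone snippet (not from the original post):

```python
# Replacing " word " with "" glues the neighbouring words together,
# while replacing it with " " keeps them separated.
sentence = "go to the store"

print(sentence.replace(" to ", ""))   # -> "gothe store"
print(sentence.replace(" to ", " "))  # -> "go the store"
```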
Because you join the two words surrounding a stop word together when you remove it, you can miss other stop words on later iterations. Consider this example:

"you are as i am"

Every word in this sentence is a stop word. Suppose the stop-word list is ["are", "as", "i", "am", "you"]:

Iteration 1: remove "are":
youas i am
Iteration 2: remove "as": nothing found to remove!
youas i am
Iteration 3: remove "i":
youasam

As you can see, the other words get mangled along the way.
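The walkthrough above can be reproduced directly (a minimal sketch using the same example stop-word list):

```python
# Each removal with "" glues the neighbouring words together,
# hiding the next stop word from the " word " search.
sentence = "you are as i am"
stopwords = ["are", "as", "i", "am", "you"]

for word in stopwords:
    sentence = sentence.replace(" " + word + " ", "")

print(sentence)  # -> "youasam"
```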
Thanks, guys. I figured the fix is simply to use a space as the replacement string, like this:
if word in sentence:
    sentence = sentence.replace(" " + word + " ", " ")
Now I see that if a stop word appears at the beginning or end of a sentence, I need to add two more lines, like this:
for word in stopwords:
    if word in sentence:
        sentence = sentence.replace(" " + word + " ", " ")
        sentence = sentence.replace(" " + word, " ")
        sentence = sentence.replace(word + " ", " ")
sentences.append(sentence)
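An alternative to the extra replace calls is to pad the sentence with one space on each side, so that every word, including the first and last, matches the " word " pattern (a sketch of that idea, not part of the original post; the stop-word list here is just an example):

```python
# Pad the sentence so boundary words are also surrounded by spaces,
# then strip the padding afterwards.
stopwords = ["the", "to", "a"]
sentence = "the cat ran to a tree"

padded = " " + sentence + " "
for word in stopwords:
    padded = padded.replace(" " + word + " ", " ")

print(padded.strip())  # -> "cat ran tree"
```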
Now the word index I get is much smaller.

CS