
Deleting stop-words and punctuation signs

I parse information from a news website. Each news item is a dictionary stored in the translated_news variable, and each item has a title, url and country. I then iterate over each news title and try to delete the stop-words and punctuation signs. I've written this code:

for new in translated_news:
    tk = tokenize(new['title'])
    # delete punctuation signs & stop-words
    for t in tk:
        if (t in punkts) or (t + '\n' in stops):
            tk.remove(t)
    tokens.append(tk)

tokenize is a function that returns a list of tokens. Here's an example of its output:

['medium', ':', 'russian', 'athlete', 'will', 'be', 'admit', 'to', 'the', '2018', 'olympics', 'in', 'neutral', 'status']
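tokenize itself isn't shown in the question; a minimal lower-casing stand-in that produces output of this shape might look like the sketch below (an assumption — the real function apparently also lemmatizes, e.g. "athletes" → "athlete"):

```python
import re

def tokenize(text):
    # split text into word tokens and standalone punctuation marks,
    # lower-cased; a rough stand-in for the question's tokenize()
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(tokenize("Medium: Russian athlete in neutral status"))
# → ['medium', ':', 'russian', 'athlete', 'in', 'neutral', 'status']
```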

Here's the same output after my code has removed stop-words and punctuation:

['medium', 'russian', 'athlete', 'be', 'admit', 'the', 'olympics', 'neutral', 'status']

The problem is: even though the words 'the' and 'be' are in my stop-words list, they were not deleted from this news title. On other titles, however, it sometimes works correctly:

['wada', 'acknowledge', 'the', 'reliable', 'information', 'provide', 'to', 'rodchenkov']
['wada', 'acknowledge', 'reliable', 'information', 'provide', 'rodchenkov']

Here 'the' was deleted from the title. I don't understand what is wrong with the code, or why the output is sometimes perfect and sometimes not.

You have to iterate over tokenize(new['title']) while appending to a separate list, and you can use De Morgan's laws to simplify the if statement:

import string

stops = ['will', 'be', 'to', 'the', 'in']

tokens_in = ['medium', ':', 'russian', 'athlete', 'will', 'be', 'admit', 'to', 'the',
             '2018', 'olympics', 'in', 'neutral', 'status']

# delete punctuation signs & stop-words
tk = []
for t in tokens_in:  # in your code: for t in tokenize(new['title'])
    # if not ((t in string.punctuation) or (t in stops)):
    if (t not in string.punctuation) and (t not in stops):  # De Morgan's laws
        tk.append(t)
print(tk)

will print:

['medium', 'russian', 'athlete', 'admit', '2018', 'olympics', 'neutral', 'status']

You can get rid of the newlines in the stop words:

stops = ['will\n', 'be\n', 'to\n', 'the\n', 'in\n']
stops = [item.strip() for item in stops]
print(stops)

will print:

['will', 'be', 'to', 'the', 'in']
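The '\n' suffixes (visible in the question's t + '\n' in stops test) suggest the stop-words were read from a file without stripping line endings. A sketch of loading them already stripped — StringIO stands in for a hypothetical stop-words file:

```python
from io import StringIO

# stand-in for open('stopwords.txt'): one stop-word per line
f = StringIO("will\nbe\nto\nthe\nin\n")
stops = {line.strip() for line in f}  # drop the trailing '\n' while loading
print(sorted(stops))
# → ['be', 'in', 'the', 'to', 'will']
```

Using a set here also makes each t in stops membership test O(1) instead of scanning a list.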

The solution suggested by incanus86 does work:

tk = [x for x in tokenize(new['title']) if x not in stops and x not in string.punctuation]

but you wouldn't be asking on SO if you knew about list comprehensions.


I don't understand what is wrong with the code and why sometimes the output is perfect and sometimes not.

While iterating over the items of tk you miss 'be' and 'the' because you are removing items from tk as you go, as this trace shows:

import string

stops = ['will', 'be', 'to', 'the', 'in']

tk = [
    'medium',  # 0
    ':',  # 1
    'russian',  # 2
    'athlete',  # 3
    'will',  # 4
    'be',  # 5
    'admit',  # 6
    'to',  # 7
    'the',  # 8
    '2018',  # 9
    'olympics',  # 10
    'in',  # 11
    'neutral',  # 12
    'status'  # 13
]

# delete punctuation signs & stop-words
for t in tk:
    print(len(tk), t, tk.index(t))
    if (t in string.punctuation) or (t in stops):
        tk.remove(t)

print(tk)

will print:

14 medium 0
14 : 1
13 athlete 2
13 will 3
12 admit 4
12 to 5
11 2018 6
11 olympics 7
11 in 8
10 status 9
['medium', 'russian', 'athlete', 'be', 'admit', 'the', '2018', 'olympics', 'neutral', 'status']

The loop never examines "russian", "be", "the" or "neutral": each one is skipped when the element before it is removed.
The index of "athlete" is 2 and the index of "will" is 3 because you removed ":" from tk.
The index of "admit" is 4 and the index of "to" is 5 because you removed "will" from tk.
The index of "2018" is 6, of "olympics" is 7, of "in" is 8 and of "status" is 9, for the same reason.

You MUST NOT modify a list while iterating over it!
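If you do want in-place removal, one common workaround is to iterate over a shallow copy of the list, so the removals don't shift the positions the loop is reading (the list comprehension shown above is still the cleaner fix):

```python
import string

stops = ['will', 'be', 'to', 'the', 'in']
tk = ['medium', ':', 'russian', 'athlete', 'will', 'be', 'admit', 'to',
      'the', '2018', 'olympics', 'in', 'neutral', 'status']

# iterate over a copy (tk[:]) while removing from the original,
# so every element is visited exactly once
for t in tk[:]:
    if t in string.punctuation or t in stops:
        tk.remove(t)

print(tk)
# → ['medium', 'russian', 'athlete', 'admit', '2018', 'olympics', 'neutral', 'status']
```

Note that list.remove deletes the first matching occurrence, which is safe here because each removal target is the element currently being visited.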

Try getting rid of the newline character.

Something like this:

tk = [x for x in tokenize(new['title']) if x not in stops and x not in string.punctuation]
