How to strip punctuation from a string, except apostrophes, for NLP

Question

I am using the following "fastest" way of removing punctuation from a string:

import string
text = file_open.translate(str.maketrans("", "", string.punctuation))

However, it removes all punctuation, including apostrophes, so a token such as shouldn't becomes shouldnt.

The problem is that I am using the NLTK library for stopwords, and the standard stopword list does not include such apostrophe-free forms. Instead, it contains the tokens that NLTK would generate if I used the NLTK tokenizer to split my text. For example, instead of shouldnt, the stopwords included are shouldn, shouldn't and t.

I can either add the additional stopwords or remove the apostrophes from the NLTK stopwords. But neither solution seems "correct" to me, as I think the apostrophes should be kept when cleaning punctuation.

Is there a way I can keep the apostrophes when doing fast punctuation cleaning?

Answer 1

>>> from string import punctuation
>>> type(punctuation)
<class 'str'>
>>> my_punctuation = punctuation.replace("'", "")
>>> my_punctuation
'!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
>>> "It's right, isn't it?".translate(str.maketrans("", "", my_punctuation))
"It's right isn't it"

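The same idea can be wrapped in a small helper, which also shows how the cleaned text then lines up with apostrophe-bearing stopwords. This is only a minimal sketch: `strip_punctuation`, its `keep` parameter, and the three-word `stopwords` set are hypothetical names standing in for your own code and for NLTK's English list.

```python
import string


def strip_punctuation(text, keep="'"):
    """Delete ASCII punctuation from text, except the characters in keep."""
    # Characters to delete: everything in string.punctuation that is not kept.
    to_delete = "".join(ch for ch in string.punctuation if ch not in keep)
    # The third argument of str.maketrans lists characters to remove.
    return text.translate(str.maketrans("", "", to_delete))


cleaned = strip_punctuation("It's right, isn't it?")
print(cleaned)  # It's right isn't it

# Hypothetical stopword subset, standing in for NLTK's English list.
stopwords = {"it's", "isn't", "it"}
tokens = [t for t in cleaned.lower().split() if t not in stopwords]
print(tokens)  # ['right']
```

Note that `string.punctuation` only covers ASCII, so a typographic apostrophe (U+2019) is left untouched either way.
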
Answer 2

Edited from this answer.

import re

s = "This is a test string, with punctuation. This shouldn't fail...!"

text = re.sub(r'[^\w\d\s\']+', '', s)
print(text)

This returns:

This is a test string with punctuation This shouldn't fail

Regex explanation:

[^...] matches any single character that is not listed inside the square brackets
\w matches any word character (for ASCII text, equal to [a-zA-Z0-9_])
\d matches a digit (for ASCII text, equal to [0-9])
\s matches any whitespace character (for ASCII text, equal to [\r\n\t\f\v ])
\' matches the character ' literally
+ matches the preceding class one or more times, as many times as possible, giving back as needed

You can try it here.
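
As a hedged variation on the regex above: `\d` is already covered by `\w`, so the character class can be shortened, and compiling the pattern once avoids re-parsing it on every call. The name `NON_WORD` is an assumption made for illustration. One caveat worth seeing in action is that `\w` also matches the underscore, so `_` survives the cleaning.

```python
import re

# Anything that is not a word character, whitespace, or an apostrophe.
NON_WORD = re.compile(r"[^\w\s']+")

s = "This is a test string, with punctuation. This shouldn't fail...!"
print(NON_WORD.sub("", s))
# This is a test string with punctuation This shouldn't fail

# \w includes "_", so underscores are kept by the pattern above.
print(NON_WORD.sub("", "snake_case, isn't it?"))  # snake_case isn't it
```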

Answer 3

What about using

text = file_open.translate(str.maketrans(",.", "  "))

and adding any other characters you want to strip to the first string, with one extra space in the second string for each, since both strings must have the same length.
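
To illustrate why mapping to spaces can matter: deleting punctuation glues neighbouring words together when there is no whitespace around it, whereas replacing it with a space keeps them apart. The following is a minimal sketch, assuming you want every ASCII punctuation mark except the apostrophe replaced; `chars` and `to_space` are hypothetical names.

```python
import string

# Map every punctuation character except the apostrophe to a single space.
chars = string.punctuation.replace("'", "")
to_space = str.maketrans(chars, " " * len(chars))

text = "It shouldn't fail...right?Yes"
print(text.translate(to_space).split())
# ['It', "shouldn't", 'fail', 'right', 'Yes']

# Deleting instead of replacing merges the words.
print(text.translate(str.maketrans("", "", chars)))
# It shouldn't failrightYes
```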
