How to strip punctuation from a string, except apostrophes, for NLP
I am using the below "fastest" way of removing punctuation from a string:
text = file_open.translate(str.maketrans("", "", string.punctuation))
However, it removes all punctuation, including apostrophes, so a token such as shouldn't becomes shouldnt.
The problem is that I am using the NLTK library for stopwords, and the standard stopword list does not include such apostrophe-less forms. Instead it contains the tokens NLTK would generate if I used the NLTK tokenizer to split my text. For example, there is no entry for shouldnt; the stopwords included are shouldn, shouldn't, t.
I can either add the additional stopwords or remove the apostrophes from the NLTK stopwords. But neither solution seems "correct" to me, as I think the apostrophes should be left in place when cleaning punctuation.
Is there a way I can leave the apostrophes when doing fast punctuation cleaning?
>>> from string import punctuation
>>> type(punctuation)
<class 'str'>
>>> my_punctuation = punctuation.replace("'", "")
>>> my_punctuation
'!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
>>> "It's right, isn't it?".translate(str.maketrans("", "", my_punctuation))
"It's right isn't it"
Edited from this answer.
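As a minimal sketch of how this could fit the stopword use case from the question: the translation table is built once and reused, and the cleaned text is then filtered against a stopword set. The names `strip_punctuation` and `TABLE` are hypothetical, and the tiny `stopwords` set is only a stand-in for the NLTK list, which is assumed rather than loaded here.

```python
import string

# Every punctuation character except the apostrophe.
PUNCT_NO_APOSTROPHE = string.punctuation.replace("'", "")

# Build the translation table once and reuse it for every string.
TABLE = str.maketrans("", "", PUNCT_NO_APOSTROPHE)

def strip_punctuation(text):
    """Delete punctuation but keep apostrophes (hypothetical helper)."""
    return text.translate(TABLE)

# Hypothetical stand-in for the NLTK English stopword list.
stopwords = {"it", "shouldn't", "the"}

cleaned = strip_punctuation("It shouldn't fail, should it?")
tokens = [t for t in cleaned.lower().split() if t not in stopwords]
print(cleaned)
print(tokens)
```

Because the apostrophe survives, a token like shouldn't can be matched against the stopword list as-is, without adding apostrophe-less variants.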
import re
s = "This is a test string, with punctuation. This shouldn't fail...!"
text = re.sub(r'[^\w\d\s\']+', '', s)
print(text)
This returns:
This is a test string with punctuation This shouldn't fail
Regex explanation:
[^...] matches any single character that is not listed inside the square brackets
\w matches any word character (equal to [a-zA-Z0-9_] for ASCII text; in Python 3 it also matches Unicode letters and digits)
\d matches a digit (equal to [0-9] for ASCII text; already covered by \w)
\s matches any whitespace character (equal to [\r\n\t\f\v ] for ASCII text)
\' matches the character ' literally
+ matches one or more times, as many times as possible, giving back as needed (greedy)
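Two details of this pattern are worth seeing in action: `\w` already covers digits, so `\d` can be dropped, and `\w` also covers the underscore, so `_` is kept here even though it is part of `string.punctuation`. The sketch below precompiles an equivalent pattern; the names `NOT_KEPT` and `strip_punct_re` are hypothetical.

```python
import re

# Anything that is not a word character, whitespace, or an apostrophe.
# \w already includes digits and the underscore, so \d is omitted.
NOT_KEPT = re.compile(r"[^\w\s']+")

def strip_punct_re(text):
    """Delete punctuation with a regex, keeping apostrophes (hypothetical helper)."""
    return NOT_KEPT.sub("", text)

print(strip_punct_re("This shouldn't fail...! 3 tests, snake_case, café"))
```

If underscores should be removed as well, the `str.translate` approach with `string.punctuation` may be the simpler choice.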
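What about using
text = file_open.translate(str.maketrans(",.", "  "))
and adding other characters you want to ignore into the first string. Note that the two-argument form of str.maketrans requires both strings to have the same length, so the second string needs one space per character in the first.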
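Replacing punctuation with a space rather than deleting it matters when a punctuation mark sits between two words with no surrounding whitespace. Since the two-argument form of `str.maketrans` requires both strings to have the same length, one way to sidestep the bookkeeping is to pass a dict instead. This is only a sketch; `TO_SPACE` and `punct_to_spaces` are hypothetical names, and keeping the apostrophe is carried over from the question.

```python
import string

# Map every punctuation character except the apostrophe to a single space.
chars = string.punctuation.replace("'", "")
TO_SPACE = str.maketrans(dict.fromkeys(chars, " "))

def punct_to_spaces(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return " ".join(text.translate(TO_SPACE).split())

print(punct_to_spaces("end.Start isn't well-formed,really"))
```

With deletion, "end.Start" would collapse into the single token "endStart"; with replacement it splits into two tokens, at the cost of also splitting hyphenated words.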