How to strip punctuation from a string, except apostrophes, for NLP

I am using the "fastest" way of removing punctuation from a string, shown below:

text = file_open.translate(str.maketrans("", "", string.punctuation))

However, it removes all punctuation, including the apostrophes inside tokens such as shouldn't, which becomes shouldnt.

The problem is that I am using the NLTK library for stopwords, and the standard stopword list does not include such apostrophe-less forms. Instead, it contains the tokens that NLTK would generate if I used the NLTK tokenizer to split my text. For example, for shouldnt the stopwords included are shouldn, shouldn't, and t.

I can either add the additional stopwords or remove the apostrophes from the NLTK stopwords. But neither solution seems "correct" to me, as I think apostrophes should be kept when cleaning punctuation.

Is there a way to keep the apostrophes while still doing fast punctuation cleaning?

>>> from string import punctuation
>>> type(punctuation)
<class 'str'>
>>> my_punctuation = punctuation.replace("'", "")
>>> my_punctuation
'!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
>>> "It's right, isn't it?".translate(str.maketrans("", "", my_punctuation))
"It's right isn't it"

Edited from this answer.
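For repeated use, the same idea can be wrapped in a small helper that builds the translation table only once. This is a minimal sketch rather than part of the original answer: the names `KEEP`, `_TABLE`, and `strip_punctuation` are made up for illustration.

```python
import string

# Characters to preserve; extend this string if other marks (e.g. hyphens) should survive.
KEEP = "'"

# Build the deletion table once: every ASCII punctuation mark except those in KEEP.
_TABLE = str.maketrans("", "", "".join(c for c in string.punctuation if c not in KEEP))


def strip_punctuation(text: str) -> str:
    """Remove ASCII punctuation from text, leaving the characters in KEEP in place."""
    return text.translate(_TABLE)


print(strip_punctuation("It's right, isn't it?"))
```

Note that `string.punctuation` only covers ASCII, so typographic marks such as curly quotes are left untouched by this table.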

import re

s = "This is a test string, with punctuation. This shouldn't fail...!"

text = re.sub(r'[^\w\d\s\']+', '', s)
print(text)

This returns:

This is a test string with punctuation This shouldn't fail

Regex explanation:

[^...] matches any single character that is not listed inside the brackets
\w matches any word character ([a-zA-Z0-9_] for ASCII; with Python 3 str patterns it also matches Unicode letters and digits)
\d matches a digit ([0-9] for ASCII); it is redundant here because \w already covers digits
\s matches any whitespace character (for ASCII, [ \t\n\r\f\v])
\' matches the character ' literally
+ matches the preceding class between one and unlimited times, as many times as possible, giving back as needed

And you can try it here.
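As a variation on the regex above, the pattern can be compiled once and the redundant `\d` dropped. This is only a sketch: the names `NON_WORD` and `clean` are hypothetical. It also shows one side effect worth knowing, namely that `\w` includes the underscore, so underscores are kept.

```python
import re

# Anything that is not a word character, whitespace, or an apostrophe.
# \d is omitted because \w already matches digits.
NON_WORD = re.compile(r"[^\w\s']+")


def clean(text: str) -> str:
    """Delete runs of punctuation, keeping apostrophes (and underscores, via \\w)."""
    return NON_WORD.sub("", text)


print(clean("This is a test string, with punctuation. This shouldn't fail...!"))
print(clean("snake_case, ok?"))  # the underscore survives because \w matches it
```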

What about using

text = file_open.translate(str.maketrans(",.", "  "))

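and adding any other characters you want to ignore to the first string, with one extra space in the second string for each of them (the two arguments to str.maketrans must have the same length).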
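A possible way to generalise this, sketched under the assumption that every ASCII punctuation mark except the apostrophe should become a space (the variable names are illustrative):

```python
import string

# All ASCII punctuation except the apostrophe.
to_replace = string.punctuation.replace("'", "")

# Two-argument maketrans needs strings of equal length,
# so supply exactly one space per character being replaced.
table = str.maketrans(to_replace, " " * len(to_replace))

text = "Wait,what?It shouldn't fail."
tokens = text.translate(table).split()
print(tokens)
```

Replacing with spaces instead of deleting keeps words that were separated only by punctuation (such as "Wait,what") from being fused together, and `split()` then collapses any extra whitespace.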
