Keeping smileys/emoticons while removing special characters using regex in Python
I am using the following code for cleaning my text:
import re

def clean_str(s):
    """Clean sentence"""
    s = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", s)
    s = re.sub(r"\'s", " \'s", s)
    s = re.sub(r"\'ve", " \'ve", s)
    s = re.sub(r"n\'t", " n\'t", s)
    s = re.sub(r"\'re", " \'re", s)
    s = re.sub(r"\'d", " \'d", s)
    s = re.sub(r"\'ll", " \'ll", s)
    s = re.sub(r",", " , ", s)
    s = re.sub(r"!", " ! ", s)
    s = re.sub(r"\(", " ", s)
    s = re.sub(r"\)", " ", s)
    s = re.sub(r"\?", " ? ", s)
    s = re.sub(r"\s{2,}", " ", s)
    s = re.sub(r'\S*(x{2,}|X{2,})\S*', "xxx", s)
    s = re.sub(r'[^\x00-\x7F]+', "", s)
    return s.strip()
As you can see, I am removing parentheses and other special characters. Now I want to keep the following patterns intact in my text instead of removing them:
:), :-), :( and :-(
Could anyone help me with this, please?
Thanks,
You should ask yourself which patterns match any chars from the smilies you want to "protect". You can easily see that r"[^A-Za-z0-9(),!?'`]" , r"\(" and r"\)" match these chars: the first one matches : and - , and the other two match the parentheses.
So, you may fix those patterns:
s = re.sub(r":-?[()]|([^A-Za-z0-9(),!?'`])", lambda x: " " if x.group(1) else x.group(), s) # Match smilies and match and capture what you need to replace
s = re.sub(r"(?<!:)(?<!:-)\(", " ", s) # Prepend (?<!:)(?<!:-) lookbehinds
s = re.sub(r"(?<!:)(?<!:-)\)", " ", s) # Prepend (?<!:)(?<!:-) lookbehinds
The :-?[()]|([^A-Za-z0-9(),!?'`]) pattern either matches a smiley to protect ( :-?[()] matches a : , then an optional - , and then a ( or ) ) or matches and captures into Group 1 any char other than those defined in the negated character class. The lambda x: " " if x.group(1) else x.group() expression implements custom replacement logic depending on which alternative matched: if Group 1 matched, the char is replaced with a space; otherwise, the smiley is put back where it was.
The (?<!:)(?<!:-) negative lookbehinds make sure ( and ) are not matched if they are preceded by : or :- .
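To see the two fixes in isolation, here is a minimal sketch. The helper names strip_special_keep_smileys and drop_bare_parens are made up for illustration, and the two parenthesis substitutions are merged into a single [()] class, which is an assumption of this sketch rather than part of the code above.

```python
import re

# Smiley alternative first, then the "junk" char captured into Group 1.
SMILEY_OR_JUNK = re.compile(r":-?[()]|([^A-Za-z0-9(),!?'`])")

def strip_special_keep_smileys(s):
    # Group 1 is set only when the junk alternative matched; a smiley
    # matches the first alternative and is returned unchanged.
    return SMILEY_OR_JUNK.sub(lambda m: " " if m.group(1) else m.group(), s)

def drop_bare_parens(s):
    # Replace ( and ) with a space unless they directly follow ":" or ":-".
    return re.sub(r"(?<!:)(?<!:-)[()]", " ", s)

print(strip_special_keep_smileys("great :-) #win @you"))
print(drop_bare_parens("nice (really) :) and :-("))
```

The order of the alternatives matters: because :-?[()] is tried first, the : of a smiley never reaches the capturing branch, while a lone : (as in a:b) still falls through to Group 1 and is replaced.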
Note that r'\S*(x{2,}|X{2,})\S*' can also match the smilies if they are glued to the xx or XX . However, fixing this one is tricky since smilies like :( might be matched with \S* if they are not at the start of the non-whitespace chunk, so you may use
s = re.sub(r'(:-?[()])|(?:(?!:-?[()])\S)*(?:x{2,}|X{2,})(?:(?!:-?[()])\S)*', lambda x: x.group(1) if x.group(1) else "xxx", s) # Keep a captured smiley, mask any other match
The tactic is similar to the r":-?[()]|([^A-Za-z0-9(),!?'`])" pattern: we match and capture the smiley so it can be put back, but otherwise we only allow matching such non-whitespace chars ( \S ) that do not start a smiley substring ( (?:(?!:-?[()])\S)* ).
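Putting the pieces together, a cleaned-up version of the whole function could look roughly like the sketch below. The name clean_str_keep_smileys and the module-level constants are hypothetical, and folding the contraction and punctuation rules into grouped patterns is a shortcut taken here, not something the original code does. The final non-ASCII removal is left out because the first substitution already replaces every non-ASCII char with a space.

```python
import re

SMILEY = r":-?[()]"
# Capture a smiley (Group 1) or match an xx/XX chunk that contains no smiley.
MASK = re.compile(
    r"(:-?[()])|(?:(?!:-?[()])\S)*(?:x{2,}|X{2,})(?:(?!:-?[()])\S)*"
)

def clean_str_keep_smileys(s):
    """Clean a sentence but keep :) :-) :( and :-( intact."""
    # Replace unwanted chars with a space; smileys pass through untouched.
    s = re.sub(SMILEY + r"|([^A-Za-z0-9(),!?'`])",
               lambda m: " " if m.group(1) else m.group(), s)
    # Split off common English contractions.
    s = re.sub(r"('s|'ve|'re|'d|'ll)", r" \1", s)
    s = re.sub(r"n't", " n't", s)
    # Pad punctuation with spaces.
    s = re.sub(r"([,!?])", r" \1 ", s)
    # Drop parentheses that are not part of a smiley.
    s = re.sub(r"(?<!:)(?<!:-)[()]", " ", s)
    # Mask chunks containing xx/XX without swallowing a glued smiley.
    s = MASK.sub(lambda m: m.group(1) or "xxx", s)
    # Collapse runs of whitespace left behind by the steps above.
    s = re.sub(r"\s{2,}", " ", s)
    return s.strip()

print(clean_str_keep_smileys("Card 1234xxxx5678:) is (not) ok!! :-("))
```

In this sketch the whitespace collapse runs last rather than before the masking step, so that spaces introduced by any earlier substitution are normalized in one place.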