How to remove digits in a string except those in hashtags using regex
I'm processing some Twitter texts, and I want to remove all numbers in a tweet except those that appear in hashtags. For example,
'I wrote 16 scripts in #code100day challenge2019 in 10day'
should become
'I wrote scripts in #code100day challenge in day'
Note that numbers not separated from alphabetic characters should also be removed (i.e. 'challenge2019' --> 'challenge', '10day' --> 'day').
I tried:
text = 'I wrote 16 scripts in #code100day challenge2019 in 10day'
text = re.sub(r"^(?!#)\d+", "", text)
But it does not do anything to the input string.
I also tried a negative lookbehind, to remove all digits except those following the '#' symbol:
text = re.sub(r"(?<!#)\d+", "", text)
But now it removes all the numeric characters, whether they are in a hashtag or not:
'I wrote scripts in #codeday challenge in day'
Any suggestions?
One option is to match '#' followed by non-space characters (and, if matched, replace with the whole match, effectively leaving the hashtag alone), or match digit characters and remove them:
output = re.sub(
    r'#\S+|\d+',
    lambda match: match.group(0) if match.group(0).startswith('#') else '',
    text
)
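To make the difference from the lookbehind attempt concrete, here is a minimal sketch that wraps the alternation in a function. The function name `strip_digits_outside_hashtags` and the helper `keep_hashtag` are made up for illustration, and the sketch assumes a hashtag is simply '#' followed by any non-space characters.

```python
import re

def strip_digits_outside_hashtags(text):
    """Remove digit runs from text, leaving hashtags untouched."""
    def keep_hashtag(match):
        token = match.group(0)
        # A match starting with '#' is a hashtag: return it unchanged.
        # Anything else is a run of digits: replace it with nothing.
        return token if token.startswith('#') else ''
    return re.sub(r'#\S+|\d+', keep_hashtag, text)

tweet = 'I wrote 16 scripts in #code100day challenge2019 in 10day'

# The lookbehind only checks the single character before the digits,
# so the 100 inside the hashtag (preceded by 'e', not '#') is still removed.
print(re.sub(r'(?<!#)\d+', '', tweet))

# The alternation consumes the whole hashtag first, so its digits survive.
print(strip_digits_outside_hashtags(tweet))
```

Removing a standalone number such as 16 leaves the spaces on both sides of it in place, so the result contains a double space after 'wrote'. Collapse whitespace in a separate step if that matters.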
If you can use the third-party regex module (installed separately from PyPI, not the built-in re), you can use (*SKIP)(*FAIL) after matching hashtags instead, to effectively skip them if matched:
output = regex.sub(r'#\S+(*SKIP)(*FAIL)|\d+', '', text)
My guess is that using an alternation would likely be faster and simpler than lookarounds:
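import re
test_str = "10 I wrote 16 scripts in #code100day challenge2019 in 10day 100 "
cleaned = re.sub(r"(#\S+)|(\d+)", r"\1", test_str)  # keep hashtags (group 1), drop other digit runs
cleaned = re.sub(r"\s{2,}", " ", cleaned).strip()  # collapse repeated whitespace and trim both ends
print(cleaned)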
I wrote scripts in #code100day challenge in day
The expression is explained on the top right panel of regex101.com, if you wish to explore, simplify, or modify it; there you can also watch how it would match against some sample inputs.
Please try this. It only checks for digits with a space before or after them and replaces the match with a single space.
import re
text = 'I wrote 16 scripts in #code100day challenge2019 in 10day'
re.sub(r"\d+ | \d+", " ", text)
Output: 'I wrote  scripts in #code100day challenge in day' (the space after the removed 16 remains, so there are two spaces after 'wrote')
You can also write it like this, which gives the same result:
re.sub(r"\d+\s|\s\d+", " ", text)
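One caveat with the whitespace-based pattern: it never looks at '#', so it depends on where the digits sit relative to spaces. The sketch below compares it with a hashtag-aware alternation on a made-up input; the function names and the sample string are illustrative assumptions, not part of the original answers.

```python
import re

def strip_by_whitespace(text):
    # Whitespace-based pattern: digits that touch a space on either side.
    return re.sub(r"\d+\s|\s\d+", " ", text)

def strip_outside_hashtags(text):
    # Hashtag-aware alternative: keep '#...' tokens (group 1), drop other digit runs.
    return re.sub(r"(#\S+)|\d+", lambda m: m.group(1) or "", text)

sample = "#code100 has (16) scripts"

# Digits at the end of a hashtag are followed by a space, so they get removed,
# while digits wrapped in punctuation touch no space and are left in place.
print(strip_by_whitespace(sample))

# The alternation keeps the hashtag intact and removes the 16.
print(strip_outside_hashtags(sample))
```

For the tweet in the question both approaches give the same text, so the whitespace version may be enough when hashtags never end in digits and numbers are always delimited by spaces.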