How to remove digits in a string except those in hashtags using regex

I'm processing some Twitter texts, and I want to remove all numbers in a tweet except those that appear in hashtags. For example,

'I wrote 16 scripts in #code100day challenge2019 in 10day' 

should become

'I wrote scripts in #code100day challenge in day'

Note that numbers not separated from alphabetic characters should also be removed (i.e. 'challenge2019' --> 'challenge', '10day' --> 'day').

I tried:

text = 'I wrote 16 scripts in #code100day challenge2019 in 10day'
text = re.sub(r"^(?!#)\d+", "", text)

But it does not do anything to the input string.

I also tried a negative lookbehind, to remove all digits except those following the '#' symbol:

text = re.sub(r"(?<!#)\d+", "", text)

But now it removes all the numeric characters, whether they are in a hashtag or not:

'I wrote  scripts in #codeday challenge in day'

Any suggestions?
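For reference, a short check of why both attempts behave as described. In the first pattern, `^` anchors the match to the start of the string, so `\d+` can only match digits at position 0. In the second, `(?<!#)` inspects only the single character immediately before the digits, and in `#code100day` that character is `e`, not `#`. The variable names `first` and `second` are just for illustration.

```python
import re

text = 'I wrote 16 scripts in #code100day challenge2019 in 10day'

# Attempt 1: '^' pins the match to the start of the string, and the
# string starts with 'I', so nothing matches and nothing is replaced.
first = re.sub(r"^(?!#)\d+", "", text)
print(first == text)

# Attempt 2: the lookbehind checks only the one character before the
# digits, so '100' in '#code100day' (preceded by 'e') is removed too.
second = re.sub(r"(?<!#)\d+", "", text)
print(second)
```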

One option is to match # followed by non-space characters (and, if matched, replace with the whole match, effectively leaving the hashtag alone), or match digit characters and remove them:

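import re

txt = 'I wrote 16 scripts in #code100day challenge2019 in 10day'
output = re.sub(
    r'#\S+|\d+',
    # keep hashtags as they are; replace digit runs with the empty string
    lambda match: match.group(0) if match.group(0).startswith('#') else '',
    txt
)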
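As a sketch of how this could be packaged for reuse, the same pattern and callback can be wrapped in a function. The names `HASHTAG_OR_DIGITS` and `strip_digits_outside_hashtags` are hypothetical, chosen only for this example.

```python
import re

# Hypothetical names; the pattern and callback are the ones from the answer.
HASHTAG_OR_DIGITS = re.compile(r'#\S+|\d+')

def strip_digits_outside_hashtags(txt):
    # Keep a match that is a hashtag; drop a match that is a run of digits.
    return HASHTAG_OR_DIGITS.sub(
        lambda m: m.group(0) if m.group(0).startswith('#') else '',
        txt,
    )

result = strip_digits_outside_hashtags(
    'I wrote 16 scripts in #code100day challenge2019 in 10day'
)
print(result)
```

Note that removing a standalone number such as `16` leaves the spaces on both sides of it in place. If single spacing matters, one option is to normalize afterwards with `' '.join(result.split())`.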

If you can use the third-party regex module, you can use (*SKIP)(*FAIL) after matching hashtags instead, to effectively skip them if matched:

import regex  # third-party package, not part of the standard library
output = regex.sub(r'#\S+(*SKIP)(*FAIL)|\d+', '', txt)

My guess is that using an alternation would likely be faster and simpler than lookarounds:

import re

test_str = "10 I wrote 16 scripts in #code100day challenge2019 in 10day 100 "

print(re.sub(r"^\s+|\s+$","",re.sub(r"\s{2,}"," ",re.sub(r"(#\S+)|(\d+)", "\\1", test_str))))

Output

I wrote scripts in #code100day challenge in day

The expression is explained on the top right panel of regex101.com, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs, if you like.
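The nested call in this answer can be unrolled into named steps, which may be easier to read. This is a sketch, not the answerer's code: the function name `remove_digits_keep_hashtags` is hypothetical, the second capture group is dropped because it is never referenced, and it assumes Python 3.5 or later, where an unmatched group in the replacement template is substituted with an empty string.

```python
import re

def remove_digits_keep_hashtags(text):
    # Step 1: a hashtag is captured in group 1 and written back unchanged;
    # a digit run leaves group 1 unmatched, so it is replaced with ''.
    text = re.sub(r"(#\S+)|\d+", r"\1", text)
    # Step 2: collapse runs of whitespace left behind by removed numbers.
    text = re.sub(r"\s{2,}", " ", text)
    # Step 3: trim leading and trailing whitespace.
    return text.strip()

test_str = "10 I wrote 16 scripts in #code100day challenge2019 in 10day 100 "
cleaned = remove_digits_keep_hashtags(test_str)
print(cleaned)
```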

Please try this:

Just check for digits with a space before or after them, and replace the match with a space.

import re
text = 'I wrote 16 scripts in #code100day challenge2019 in 10day'
re.sub(r"\d+ | \d+", " ", text)

O/P: 'I wrote  scripts in #code100day challenge in day' (a doubled space remains where ' 16' was replaced)

You can also write it like this, which gives the same result:

re.sub(r"\d+\s|\s\d+", " ", text)
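One caveat worth checking before relying on the whitespace-based pattern: it only removes digits that touch whitespace, and it does not look at `#` at all. The sketch below uses two made-up inputs (not from the question) to show both effects; the variable names `kept` and `lost` are illustrative.

```python
import re

pattern = r"\d+\s|\s\d+"

# Digits with letters on both sides never touch whitespace, so they stay.
kept = re.sub(pattern, " ", "version a1b here")
print(kept)

# A hashtag ending in digits is followed by whitespace, so its digits
# are removed even though they belong to the hashtag.
lost = re.sub(pattern, " ", "day #100 of coding")
print(lost)
```

For inputs like these, the alternation-based answers above, which match the whole hashtag first, behave more predictably.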
