How to remove numbers from a string, except those in hashtags, using regular expressions

Question

I'm processing some Twitter text, and I want to remove all numbers from a tweet except those that appear in hashtags. For example,

'I wrote 16 scripts in #code100day challenge2019 in 10day'

should become

'I wrote scripts in #code100day challenge in day'

Note that numbers not separated from alphabetic characters should also be removed (i.e. 'challenge2019' --> 'challenge', '10day' --> 'day').

I tried:

import re

text = 'I wrote 16 scripts in #code100day challenge2019 in 10day'
text = re.sub(r"^(?!#)\d+", "", text)

But it does not do anything to the input string.
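A likely reason, sketched below: `^` anchors the pattern to the very start of the string, and the sample begins with `I` rather than a digit, so there is nothing to match. The second input (`"16 scripts"`) is a made-up string used only to show the pattern working when the text does start with digits.

```python
import re

text = 'I wrote 16 scripts in #code100day challenge2019 in 10day'

# ^ pins the match to position 0, where the character is 'I', not a digit.
anchored = re.search(r"^(?!#)\d+", text)
print(anchored)

# The same pattern only fires when the string itself begins with digits.
leading = re.sub(r"^(?!#)\d+", "", "16 scripts")
print(repr(leading))
```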

I also tried a negative lookbehind, to remove all digits except those following the '#' symbol:

text = re.sub(r"(?<!#)\d+", "", text)

But now it removes all the numeric characters, whether they are in a hashtag or not:

'I wrote  scripts in #codeday challenge in day'
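The lookbehind `(?<!#)` inspects only the single character immediately before the digits, which is why it cannot tell that `100` sits inside a hashtag. A small sketch that lists each matched digit run next to the character preceding it (the `#100days` input is a made-up extra case):

```python
import re

text = 'I wrote 16 scripts in #code100day challenge2019 in 10day'

# Pair every matched digit run with the character just before it.
runs = [(text[m.start() - 1], m.group())
        for m in re.finditer(r"(?<!#)\d+", text)]
print(runs)

# Even digits directly after '#' are only partly protected: the match
# cannot start at '1', but it can start at the following '0'.
partial = re.sub(r"(?<!#)\d+", "", "#100days")
print(partial)
```

None of the preceding characters is `#`, so every run is removed.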

Any suggestions?

Answer 1

One option is to match either # followed by non-space characters or a run of digits. If the match is a hashtag, replace it with itself (effectively leaving the hashtag alone); otherwise it is a digit run, so replace it with the empty string:

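import re

txt = 'I wrote 16 scripts in #code100day challenge2019 in 10day'

output = re.sub(
    r'#\S+|\d+',
    # Keep hashtag matches unchanged; replace digit-only matches with ''.
    lambda match: match.group(0) if match.group(0).startswith('#') else '',
    txt
)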
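As a quick check, the same pattern and callback can be wrapped in a function and run on the question's sample. The function name `remove_digits_outside_hashtags` is hypothetical, and the whitespace clean-up at the end is an assumed extra step, needed because removing `16` leaves the spaces on both sides of it.

```python
import re

def remove_digits_outside_hashtags(text):
    # Hypothetical helper; pattern and callback are the ones shown above.
    return re.sub(
        r'#\S+|\d+',
        lambda match: match.group(0) if match.group(0).startswith('#') else '',
        text,
    )

sample = 'I wrote 16 scripts in #code100day challenge2019 in 10day'
result = remove_digits_outside_hashtags(sample)
print(repr(result))

# Assumed extra step: collapse the double space left where "16" was.
tidy = ' '.join(result.split())
print(repr(tidy))
```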

If you can use the third-party regex module, you can instead put (*SKIP)(*FAIL) after the hashtag alternative, so that hashtags are skipped whenever they match:

import regex  # third-party package, not the standard-library re
output = regex.sub(r'#\S+(*SKIP)(*FAIL)|\d+', '', txt)
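Since `regex` is not part of the standard library, one possible arrangement is to use it when it is installed and fall back to a plain `re` callback otherwise. This is only a sketch: the function name `strip_digits` and the fallback strategy are assumptions, not part of the answer.

```python
import re

try:
    import regex  # third-party; treated as optional in this sketch
except ImportError:
    regex = None

def strip_digits(text):
    if regex is not None:
        # (*SKIP)(*FAIL) rejects the hashtag match and resumes scanning after it.
        return regex.sub(r'#\S+(*SKIP)(*FAIL)|\d+', '', text)
    # Standard-library fallback: capture the hashtag and put it back.
    return re.sub(r'(#\S+)|\d+', lambda m: m.group(1) or '', text)

sample = 'I wrote 16 scripts in #code100day challenge2019 in 10day'
stripped = strip_digits(sample)
print(repr(stripped))
```

Both branches leave the double space where `16` was removed.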

Answer 2

My guess is that using an alternation would likely be faster and simpler than lookarounds:

import re

test_str = "10 I wrote 16 scripts in #code100day challenge2019 in 10day 100 "

print(re.sub(r"^\s+|\s+$","",re.sub(r"\s{2,}"," ",re.sub(r"(#\S+)|(\d+)", "\\1", test_str))))

Output

I wrote scripts in #code100day challenge in day
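The nested one-liner can be hard to read, so here is the same sequence of substitutions split into named steps. The variable names `step1` to `step3` are introduced only for illustration.

```python
import re

test_str = "10 I wrote 16 scripts in #code100day challenge2019 in 10day 100 "

# Step 1: keep hashtags (group 1) and drop every other digit run.
# When the \d+ branch matches, group 1 is unset and \1 expands to ''
# (Python 3.5+; earlier versions raised an error for unmatched groups).
step1 = re.sub(r"(#\S+)|(\d+)", r"\1", test_str)

# Step 2: squeeze runs of two or more whitespace characters into one space.
step2 = re.sub(r"\s{2,}", " ", step1)

# Step 3: trim leading and trailing whitespace.
step3 = re.sub(r"^\s+|\s+$", "", step2)

print(repr(step1))
print(repr(step2))
print(repr(step3))
```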

The expression is explained in the top right panel of regex101.com, if you wish to explore, simplify, or modify it, and in this link you can watch how it matches against some sample inputs.

Answer 3

Please try this:

It just looks for digits with a space before or after them and replaces the match with a single space.

import re

text = 'I wrote 16 scripts in #code100day challenge2019 in 10day'
re.sub(r"\d+ | \d+", " ", text)

Output: 'I wrote  scripts in #code100day challenge in day' (a double space is left where 16 was removed)

You can also write it like this, which gives the same result:

re.sub(r"\d+\s|\s\d+", " ", text)
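One caveat worth checking before relying on this pattern: it never looks at `#` at all, it only removes digits that touch whitespace. The sketch below uses two made-up inputs to show the consequences.

```python
import re

pattern = r"\d+\s|\s\d+"

# Digits at the end of a hashtag are followed by a space, so they are removed.
hashtag_case = re.sub(pattern, " ", "see #top10 list")
print(repr(hashtag_case))

# Digits in the middle of a word touch no whitespace, so they stay.
embedded_case = re.sub(pattern, " ", "abc123def")
print(repr(embedded_case))
```

It works on the question's sample only because every hashtag there ends in letters and every other number sits next to a space.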
