
Remove digits from the string if they are concatenated using Regex

I am trying to remove digits from the text only when they are attached to letters or sit between the characters of a word. Dates should be left alone.

For example, "21st" should remain "21st", but "alphab24et" should become "alphabet". If the digits stand on their own, as in "26 alphabets", the text should remain "26 alphabets".

I am using the regex below:

newString = re.sub(r'[0-9]+', '', newString)

It removes digits in any position where they occur, so in the example above it removes 26 as well.

You can match digit runs that have no word boundary on either side, adding custom digit boundaries so that only whole runs are matched:

import re
newString = 'Like if "21st" then should remain "21st" But if  "alphab24et" should be  "alphabet" but if the digits come separately like  "26 alphabets" then it should remain  "26 alphabets" .'
print( re.sub(r'\B(?<!\d)[0-9]+\B(?!\d)', '', newString) )
# => Like if "21st" then should remain "21st" But if  "alphabet" should be  "alphabet" but if the digits come separately like  "26 alphabets" then it should remain  "26 alphabets" .

See the Python demo and the regex demo.

Details:

  • \B(?<!\d) - a non-word boundary position with no digit immediately on the left
  • [0-9]+ - one or more digits
  • \B(?!\d) - a non-word boundary position with no digit immediately on the right.
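To see the three parts of the pattern working together, here is a minimal sketch that compiles the same regex in verbose mode and runs it on a few short inputs. The helper name strip_embedded_digits and the sample list are assumptions made for illustration; only the pattern itself comes from the answer.

```python
import re

# Same pattern as in the answer, spelled out with re.VERBOSE comments.
EMBEDDED_DIGITS = re.compile(
    r"""
    \B(?<!\d)   # not at a word boundary, and no digit directly on the left
    [0-9]+      # the run of digits to drop
    \B(?!\d)    # not at a word boundary, and no digit directly on the right
    """,
    re.VERBOSE,
)

def strip_embedded_digits(text):
    """Hypothetical helper: remove digit runs that sit inside a word."""
    return EMBEDDED_DIGITS.sub("", text)

for sample in ["21st", "alphab24et", "26 alphabets", "x11", "a000b"]:
    print(repr(sample), "->", repr(strip_embedded_digits(sample)))
# '21st' -> '21st'
# 'alphab24et' -> 'alphabet'
# '26 alphabets' -> '26 alphabets'
# 'x11' -> 'x11'
# 'a000b' -> 'ab'
```

Digits at the start or end of a word touch a word boundary, so \B fails there and they are kept; only digits with word characters on both sides are removed.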

I find that one way to make my re.sub calls cleaner is to capture the things around my pattern in groups ((...) below) and put them back in the substitute pattern (\1 and \2 below).

In your case you want to catch digit sequences ([0-9]+) that are not surrounded by whitespace (\s, since you want to keep those) or by other digits ([0-9]; otherwise a digit could act as the surrounding character, and "11st" would lose one of its digits and become "1st"). The surrounding class is therefore [^\s0-9]. This gives:

In [1]: re.sub(r"([^\s0-9])[0-9]+([^\s0-9])", r"\1\2", "11 a000b 11 11st x11 11")
Out[1]: '11 ab 11 11st x11 11'
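One thing to keep in mind with the capture-group form is that the surrounding characters are consumed by the match, so two digit runs separated by a single character are not both removed in one pass. The sketch below compares it with a lookaround variant; that variant is an assumption added here for illustration and is not part of the answer, and the function names are hypothetical.

```python
import re

# Pattern from the answer: the neighbours are consumed, then restored.
GROUPS = re.compile(r"([^\s0-9])[0-9]+([^\s0-9])")

# Illustrative variant (not from the answer): lookarounds check the
# neighbours without consuming them, so nothing has to be put back.
LOOKAROUND = re.compile(r"(?<=[^\s0-9])[0-9]+(?=[^\s0-9])")

def with_groups(text):
    return GROUPS.sub(r"\1\2", text)

def with_lookarounds(text):
    return LOOKAROUND.sub("", text)

text = "11 a000b 11 11st x11 11"
print(with_groups(text))          # 11 ab 11 11st x11 11
print(with_lookarounds(text))     # 11 ab 11 11st x11 11

# The two differ when one character separates two digit runs:
print(with_groups("a1b2c"))       # ab2c  (the "b" was consumed by the first match)
print(with_lookarounds("a1b2c"))  # abc
```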

What you should do is add parentheses to define groups and specify that the digits need to be surrounded by non-space characters.

re.sub(r"([^\s\d])\d+([^\s\d])", r'\1\2', newString)

This matches only digits that sit between characters that are neither whitespace nor digits: that is the [^\s\d] part.
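Because [^\s\d] accepts any non-space, non-digit character, punctuation such as a quote counts as a surrounding character too. The following sketch, using a shortened version of the sample sentence from the question (the variable names are assumptions), compares this pattern with the word-boundary pattern from the first answer on quoted input:

```python
import re

sample = '"21st" "alphab24et" "26 alphabets"'

# Word-boundary pattern from the first answer.
boundary_based = re.sub(r"\B(?<!\d)[0-9]+\B(?!\d)", "", sample)

# Capture-group pattern from this answer.
group_based = re.sub(r"([^\s\d])\d+([^\s\d])", r"\1\2", sample)

print(boundary_based)  # "21st" "alphabet" "26 alphabets"
print(group_based)     # "st" "alphabet" "26 alphabets"
```

On unquoted text such as 21st the two patterns agree; the difference shows up only when a non-word character such as a quote directly precedes the digits.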
