How to delete all numbers from a string?

Below is an example of a test case:

inpoot = "A.p.p.l.e (45) Orange (5.11) Kiwi"  # WE HAVE
outpoot = "A.p.p.l.e () Orange () Kiwi" # WE WANT

The only reason I spelled inpoot incorrectly is that input is the name of a built-in function in Python. (It is not actually a reserved keyword, but shadowing it is best avoided.)

One might think that the following would work:

import string
def kill_numbers(text: str) -> str:
    text = str(text)
    return "".join(filter(lambda ch: ch not in string.digits, text))

However, the decimal point ( . ) in a decimal number will be preserved.

inpoot = "A.p.p.l.e (45) Orange T5.11T Kiwi 99 Apricot"

outpoot = kill_numbers(inpoot)

# outpoot is 'A.p.p.l.e () Orange T.T Kiwi  Apricot'
# We want `TT` not `T.T`
# the output contains a stray decimal point.

outpoot = kill_numbers("Strawberry 3.145 Plum")

# outpoot is 'Strawberry . Plum'
# fails to delete the `.` in `3.145`: the number is reduced to "." rather than "" (empty string)

So, how can we delete all numbers, including decimal numbers?

A substitution using regular expressions is theoretically possible.

import re
test_case = "(.4) A.p.p.l.e (44) Orange .... (4.44) Kiwi . . . . ."
# Raw string, so that \. reaches the regex engine unchanged.
result = re.sub(r"[0-9]+\.?[0-9]*|\.[0-9]+", "", test_case)
print(result)  # () A.p.p.l.e () Orange .... () Kiwi . . . . .

The regular expression shown above works for that one test case, but not all test cases.

The table below shows how various regular expressions perform on various test inputs.
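To see where it breaks, here is a minimal sketch that applies the same pattern to two inputs that come up later in this question. The helper name strip_naive and the word "Python" in the first input are only illustrative.

```python
import re

# Same pattern as above, written as a raw string.
NAIVE = re.compile(r"[0-9]+\.?[0-9]*|\.[0-9]+")

def strip_naive(text: str) -> str:
    """Delete everything the naive pattern matches."""
    return NAIVE.sub("", text)

# A version number is eaten piece by piece: "3.10" first, then ".4".
print(repr(strip_naive("Python 3.10.4")))                 # 'Python '
# The sentence-ending period is swallowed together with "44".
print(repr(strip_naive("The number of houses was 44.")))  # 'The number of houses was '
```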


  • - means that the regex matches the entire string
  • + means that the regex does NOT match the string at all
  • meh means that the regex matches a small part of the string, but not the whole thing.

                                ' 1  ' '2' '3' '365' '9.43' '-5000' '+10' '3.10.4' '0001' '.5'  '.' '591.' '' '0x77F' '3.456e11'
[0-9]+\.?[0-9]*|\.[0-9]+             -   -   -     -      -     meh   meh      meh      -    -    +      -  +     meh        meh
[+-]?[0-9]+\.?[0-9]*|\.[0-9]+        -   -   -     -      -       -     -      meh      -    -    +      -  +     meh        meh
[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)      -   -   -     -      -       -     -      meh      -    -    +      -  +     meh        meh
[0-9]*\.?[0-9]*                    meh   -   -     -      -     meh   meh      meh      -    -    -      -  -     meh        meh
[0-9]+\.?[0-9]+                      +   +   +     -      -     meh   meh      meh      -    +    +    meh  +     meh        meh
[0-9]+\.?[0-9]*                      -   -   -     -      -     meh   meh      meh      -  meh    +      -  +     meh        meh
[0-9]*\.?[0-9]+                      -   -   -     -      -     meh   meh      meh      -    -    +    meh  +     meh        meh
\d+                                  -   -   -     -    meh     meh   meh      meh      -  meh    +    meh  +     meh        meh
[0-9]                                -   -   -   meh    meh     meh   meh      meh    meh  meh    +    meh  +     meh        meh
\d                                   -   -   -   meh    meh     meh   meh      meh    meh  meh    +    meh  +     meh        meh
\d*                                meh   -   -     -    meh     meh   meh      meh      -  meh  meh    meh  -     meh        meh
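A table like this can be regenerated mechanically. The following is a rough sketch, not the script that produced the table above: the function name classify and the three labels are assumptions, with "partial" playing the role of meh.

```python
import re

def classify(pattern: str, text: str) -> str:
    """Return 'full', 'partial', or 'none' for how `pattern` matches `text`."""
    if re.fullmatch(pattern, text):
        return "full"      # the regex matches the entire string
    if re.search(pattern, text):
        return "partial"   # the regex matches somewhere, but not everything
    return "none"          # the regex does not match at all

patterns = [r"[0-9]+\.?[0-9]*|\.[0-9]+", r"\d+", r"\d*"]
inputs = ["2", "9.43", "-5000", "3.10.4", ".", "591.", ""]

for pattern in patterns:
    print(pattern, [classify(pattern, text) for text in inputs])
```

Note that a pattern such as \d* can match the empty string, so re.search succeeds on every input; that is why it never reports "none".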

In my humble opinion, regular expressions are a nightmare.

To digress, it took me a long time to realize that:

IMHO = In my humble opinion. I don't speak acronym very well.

Back to business...

I cannot find a regex which satisfies the following requirements:

  • the regex must not match the empty string ( "" )
  • the regex must not match any sub-string of a version number, such as "3.10.4". At most one decimal point is allowed to appear in what we call a "number".
  • the regex must not match free-floating decimal points ( "." ).

Desired behavior is as follows:

String       Match?         Notes
"1"          Yes            int
"2"          Yes            int
"365"        Yes            int
"365."       No             365. is a float equivalent to 365.0. However, I do not want to delete the ( . ) at the end of the string "The number of houses was 44."
"9.43"       Yes            one decimal point
"-5000"      Yes
"+10"        Yes
"0001"       Yes
".5"         Yes            .5 is equivalent to 0.5
"1"          Yes
"0x77F"      Yes
"3.456e11"   Yes            pseudo-scientific-notation
"3.10.4"     Not a number   two decimal points
"."          Not a number
""           Not a number   do not match the empty string


The following are defined to be seed numbers ...

( 1 , 365 , 9.43 , -5000 , +10 , 0001 , .5 , .5 , 0x77F , 3.456e11 )

A valid number is defined to be any seed number, or any string formed from a seed number by doing one of the following:

  1. Iteratively replacing any digit in a seed number with 9999.
  2. Replacing any digit in a valid number with a different digit.
  3. Replacing F in 0xF with 2F or F2 or A, B, C, D, or E.

For example, you could replace the 5 in -5000 with 9 to get -9000.

Also, you could replace the 5 in .5 with 99 to get .99.

The above defines language L.

My question could be re-worded as follows:

What algorithm A will return s′ from input string s such that:

  • s is any finite-length string of ASCII characters.
  • string s′ is like string s, except that all maximal substrings of s which are in language L have been replaced by empty strings.

A substring t of string s that is in language L is maximal if it is not possible to tack on one more character to the left or to the right of t to form t′, such that t′ is a string in language L and t′ is a substring of s.

In layman's terms, if you see "apple 12.345" you should go after "12.345" not "2.34".
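One way to read the layman's version is as a left-to-right scan that always takes the longest candidate at each position. The sketch below does that by brute force. The names remove_maximal and simple_number are hypothetical, and simple_number is only a simplified stand-in for membership in L: it knows nothing about hexadecimal, exponents, or version numbers.

```python
import re
from typing import Callable

def remove_maximal(s: str, in_language: Callable[[str], bool]) -> str:
    """Scan left to right, deleting the longest accepted substring at each position."""
    out = []
    i = 0
    while i < len(s):
        # Try the longest candidate first, so "12.345" wins over "12" or "1".
        for j in range(len(s), i, -1):
            if in_language(s[i:j]):
                i = j          # skip over the accepted substring
                break
        else:
            out.append(s[i])   # no accepted substring starts here; keep the character
            i += 1
    return "".join(out)

def simple_number(t: str) -> bool:
    """Stand-in predicate: optional sign, digits, at most one decimal point."""
    return re.fullmatch(r"[+-]?(?:\d+(?:\.\d+)?|\.\d+)", t) is not None

print(repr(remove_maximal("apple 12.345", simple_number)))  # 'apple '
```

This calls the predicate on a quadratic number of substrings, so it is meant to illustrate the definition rather than to be used on long inputs.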

Indices matter. Sometimes, it makes no sense to say that the letter "a" is a sub-string of "abracadabra". Which letter "a" is it? Is it the letter "a" third-from-the-left, or second-from-the-left?

We define a string to be a mathematical mapping M from a finite subset of the natural numbers to the ASCII character set such that the difference between the maximum and the minimum of the domain of M is one less than the cardinality of the domain of M. In other words, the domain is a contiguous block of indices.

For any string SML and any string LRG, we say that SML is a sub-string of LRG if and only if SML[k] = LRG[k] for all k in the domain of string SML.
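In Python terms, a substring with its indices attached can be represented by a range into the larger string. This small sketch (the function name occurrences is an assumption) lists every index range at which one string occurs in another:

```python
def occurrences(small: str, large: str) -> list:
    """Return the index ranges of `large` at which `small` occurs."""
    n = len(small)
    return [
        range(i, i + n)
        for i in range(len(large) - n + 1)
        if large[i:i + n] == small
    ]

# Every "a" in "abracadabra" is a different substring once indices are attached.
starts = [r.start for r in occurrences("a", "abracadabra")]
print(starts)  # [0, 3, 5, 7, 10]
```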


You can use negative lookarounds to avoid undesired corner cases. Use alternation patterns to include incompatible patterns such as hexadecimal numbers.


Demo: https://regex101.com/r/HXxct5/2
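The pattern from this answer is not reproduced here, so the following is not the answerer's regex. It is only a sketch of how negative lookarounds and alternation could be combined to cover the desired-behavior list from the question; the names NUMBER and strip_numbers are hypothetical.

```python
import re

NUMBER = re.compile(
    r"""
    (?<![\d.])                    # not glued to a preceding digit or dot
    [+-]?                         # optional sign
    (?:
        0[xX][0-9A-Fa-f]+         # hexadecimal, tried first so "0" is not taken alone
      |
        (?:\d+(?:\.\d+)?|\.\d+)   # 365, 9.43, .5
        (?:[eE][+-]?\d+)?         # optional exponent, as in 3.456e11
    )
    (?!\d|\.\d)                   # not followed by a digit or by ".digit"
    """,
    re.VERBOSE,
)

def strip_numbers(text: str) -> str:
    """Delete every substring matched by NUMBER."""
    return NUMBER.sub("", text)

print(strip_numbers("A.p.p.l.e (45) Orange T5.11T Kiwi"))  # A.p.p.l.e () Orange TT Kiwi
print(strip_numbers("The number of houses was 44."))       # The number of houses was .
print(strip_numbers("version 3.10.4"))                     # version 3.10.4
```

The lookbehind and lookahead are what keep the pieces of 3.10.4 from matching: every candidate inside it is either preceded by a digit or dot, or followed by a dot and a digit. Inputs outside the list in the question, such as a sign directly after a digit, have not been considered.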

There are quite a few requirements, so I could be missing something here, but it is still worth a try:

import re
import itertools

def filter_nums(text: str) -> str:

    def is_a_number(x: str) -> bool:
        # A token counts as a number if int(x, 16) (for 0x/0X-prefixed
        # tokens) or float(x) can parse it.
        try:
            if re.search(r'^0[xX]', x):
                int(x, 16)
            else:
                float(x)
            return True
        except ValueError:
            return False

    tokens = text.split(' ')
    # Pull out every run of characters that could belong to a number.
    suspect_tokens = [re.findall(r"[A-Fa-f0-9\-\.\+xX]+", elem) for elem in tokens]
    suspect_tokens = list(itertools.chain(*suspect_tokens))
    num_tokens = [elem for elem in suspect_tokens if is_a_number(elem)]

    # Replace the longest tokens first, so that "45" cannot trigger a
    # replacement of the 45 inside 3.456e11.
    for num_token in sorted(num_tokens, key=len, reverse=True):
        text = text.replace(num_token, '')
    return text

text = "A.p.p.l.e (45) Orange (5.11) Kiwi [0x77F] {0X77F +10 .,.,-5000!343£ ///3.456e11sd 3.10.4 000001"
print(repr(filter_nums(text)))
# 'A.p.p.l.e () Orange () Kiwi [] {  .,.,!£ ///sd 3.10.4 '

>>> import re
>>> inpoot = "A.p.p.l.e (45) Orange (5.11) Kiwi"  # WE HAVE
>>> pattern = re.compile(r"\d+\.?\d*")
>>> re.sub(pattern, "", inpoot)
'A.p.p.l.e () Orange () Kiwi'

Try this:

>>> import re
>>> inpoot = "A.p.p.l.e (45) Orange (5.11) Kiwi"
>>> re.sub(r'(\d+\.\d+)|(\d+)', '', inpoot)
'A.p.p.l.e () Orange () Kiwi'
  • The first part tries to find a decimal number with the pattern: digits, decimal point, digits. The decimal point is escaped ( \. ) because an unescaped . matches any character.

  • The second part looks for just a number without a decimal point.

The first part goes first because alternation tries the alternatives from left to right and takes the first one that matches, and we want the longer of the two.
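A quick way to see the effect of the ordering is to run both orders on the same input. This is a small sketch; the function names are only illustrative.

```python
import re

def decimal_first(text: str) -> str:
    # Decimal alternative first: "5.11" is consumed as one match.
    return re.sub(r"(\d+\.\d+)|(\d+)", "", text)

def integer_first(text: str) -> str:
    # Integer alternative first: "5" and "11" match separately, the "." survives.
    return re.sub(r"(\d+)|(\d+\.\d+)", "", text)

print(decimal_first("Orange (5.11)"))  # Orange ()
print(integer_first("Orange (5.11)"))  # Orange (.)
```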
