How to delete all numbers from a string?

Below is an example of a test case:

inpoot = "A.p.p.l.e (45) Orange (5.11) Kiwi"  # WE HAVE
outpoot = "A.p.p.l.e () Orange () Kiwi" # WE WANT

The only reason I misspelled inpoot is that input is already the name of a built-in function in Python.

One might think that the following would work:

import string
def kill_numbers(text: str) -> str:
    text = str(text)
    return "".join(filter(lambda ch: ch not in string.digits, text))

However, the decimal point ( . ) in a decimal number will be preserved.

inpoot = "A.p.p.l.e (45) Orange T5.11T Kiwi 99 Apricot"

outpoot = kill_numbers(inpoot)
print(repr(outpoot))

# prints 'A.p.p.l.e () Orange T.T Kiwi  Apricot'
# We want `TT` not `T.T`
# the output contains a stray decimal point. 

outpoot = kill_numbers("Strawberry 3.145 Plum")
print(repr(outpoot))

# fails to delete the `.` in `3.145`
INPUT     BAD OUTPUT     DESIRED OUTPUT
"3.14"    "."            "" (empty string)

So, how can we delete all numbers, including decimal numbers?

A substitution using regular expressions is theoretically possible.

import re
test_case =  "(.4) A.p.p.l.e (44) Orange .... (4.44) Kiwi . . . . ."
result = re.sub(r"[0-9]+\.?[0-9]*|\.[0-9]+", "", test_case)
print(result) # () A.p.p.l.e () Orange .... () Kiwi . . . . .

The regular expression shown above works for that one test case, but not all test cases.

The table below shows how various regular expressions perform on various test inputs.
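As a rough sketch of where that pattern goes wrong, the snippet below runs it on two inputs. Both inputs are illustrative choices and do not come from the original test case: a version number, and a sentence that ends in a number. The helper name strip_naive is hypothetical.

```python
import re

# Same pattern as above, written as a raw string.
PATTERN = re.compile(r"[0-9]+\.?[0-9]*|\.[0-9]+")

def strip_naive(text: str) -> str:
    """Remove every match of the naive pattern (hypothetical helper)."""
    return PATTERN.sub("", text)

# A version number is eaten in two pieces: "3.10" and then ".4".
print(repr(strip_naive("Python 3.10.4")))    # 'Python '
# The full stop that ends the sentence is swallowed together with the 44.
print(repr(strip_naive("There were 44.")))   # 'There were '
```

Both results conflict with the requirements listed further down: version numbers should be left alone, and a sentence-final period should survive.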

KEY FOR TABLE

  • - means that the regex does NOT match the string
  • + means that the regex matches the entire string
  • meh means that the regex matches a small part of the string, but not the whole thing.
The table itself, in plain ASCII form:

                                ' 1  ' '2' '3' '365' '9.43' '-5000' '+10' '3.10.4' '0001' '.5'  '.' '591.' '' '0x77F' '3.456e11'
[0-9]+\.?[0-9]*|\.[0-9]+             -   -   -     -      -     meh   meh      meh      -    -    +      -  +     meh        meh
[+-]?[0-9]+\.?[0-9]*|\.[0-9]+        -   -   -     -      -       -     -      meh      -    -    +      -  +     meh        meh
[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)      -   -   -     -      -       -     -      meh      -    -    +      -  +     meh        meh
[0-9]*\.?[0-9]*                    meh   -   -     -      -     meh   meh      meh      -    -    -      -  -     meh        meh
[0-9]+\.?[0-9]+                      +   +   +     -      -     meh   meh      meh      -    +    +    meh  +     meh        meh
[0-9]+\.?[0-9]*                      -   -   -     -      -     meh   meh      meh      -  meh    +      -  +     meh        meh
[0-9]*\.?[0-9]+                      -   -   -     -      -     meh   meh      meh      -    -    +    meh  +     meh        meh
\d+                                  -   -   -     -    meh     meh   meh      meh      -  meh    +    meh  +     meh        meh
[0-9]                                -   -   -   meh    meh     meh   meh      meh    meh  meh    +    meh  +     meh        meh
\d                                   -   -   -   meh    meh     meh   meh      meh    meh  meh    +    meh  +     meh        meh
\d*                                meh   -   -     -    meh     meh   meh      meh      -  meh  meh    meh  -     meh        meh
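One possible reading of the key, sketched in Python. The helper classify is hypothetical, and treating an empty match as "no match" is an assumption of this sketch. It follows the wording of the key only; I have not checked that it reproduces the rows of the table above.

```python
import re

def classify(pattern: str, text: str) -> str:
    """Return '+', 'meh' or '-' following the wording of the key."""
    if re.fullmatch(pattern, text):
        return "+"      # the regex matches the entire string
    # Assumption: an empty match does not count as matching "a small part".
    if any(found.group(0) for found in re.finditer(pattern, text)):
        return "meh"    # some non-empty part of the string matches
    return "-"          # no match at all, or only empty matches

for sample in ["365", "9.43", "."]:
    print(repr(sample), classify(r"\d+", sample))
# '365' +
# '9.43' meh
# '.' -
```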

In my humble opinion, regular expressions are a nightmare.

To digress, it took me a long time to realize that:

IMHO = In my humble opinion. I don't speak acronym very well.

Back to business...

I cannot find a regex which satisfies the following requirements:

  • the regex must not match the empty string ( "" )
  • the regex must not match any sub-string of a version number, such as "3.10.4". At most one decimal point is allowed to appear in what we call a "number".
  • the regex must not match free-floating decimal points ( "." ).

Desired behavior is as follows:

PSEUDO-NUMBER   IS_A_NUMBER()   NOTES
"1"             Yes             int
"2"             Yes             int
"365"           Yes             int
"365."          No              365. is a float equivalent to 365.0. However, I do not want to delete the ( . ) at the end of the string "The number of houses was 44."
"9.43"          Yes             one decimal point
"-5000"         Yes
"+10"           Yes
"0001"          Yes
".5"            Yes             .5 is equivalent to 0.5
"1"             Yes
"0x77F"         Yes
"3.456e11"      Yes             pseudo-scientific notation
"3.10.4"        Not a number    two decimal points
"."             Not a number
""              Not a number    do not match the empty string

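The desired behavior above can be turned into test data, so that any candidate regex is checked against every row at once. This is only a sketch: EXPECTED and disagreements are hypothetical names, and re.fullmatch tests each string in isolation, so it says nothing about how a pattern behaves inside a longer sentence.

```python
import re

# Expected verdicts copied from the table above (True = should count as a number).
EXPECTED = {
    "1": True, "2": True, "365": True, "365.": False, "9.43": True,
    "-5000": True, "+10": True, "0001": True, ".5": True,
    "0x77F": True, "3.456e11": True, "3.10.4": False, ".": False, "": False,
}

def disagreements(pattern: str) -> list:
    """List the inputs where a full match of `pattern` disagrees with the table."""
    return [
        text for text, wanted in EXPECTED.items()
        if (re.fullmatch(pattern, text) is not None) != wanted
    ]

print(disagreements(r"\d+"))
# ['9.43', '-5000', '+10', '.5', '0x77F', '3.456e11']
```

An empty list would mean the candidate agrees with every row when each string is matched on its own.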
EDIT:

The following are defined to be seed numbers:

(1, 365, 9.43, -5000, +10, 0001, .5, .5, 0x77F, 3.456e11)

A valid number is defined to be any seed number, or a string formed from a seed number by doing one of the following:

  1. Iteratively replacing any digit in a seed number with 9999
  2. Replacing any digit in a valid number with a different digit.
  3. Replacing F in 0xF with 2F, F2, A, B, C, D, or E.

For example, you could replace the 5 in -5000 with 9 to get -9000.

Also, you could replace the 5 in .5 with 99 to get .99.

The above defines language L.

My question could be re-worded as follows:

What algorithm A will return s′ from input string s such that:

  • s is any finite-length string of ASCII characters.
  • string s′ is like string s except that all maximal substrings of s which are in language L have been replaced by empty strings.

A substring t of string s is a maximal substring in language L if t is in L and it is not possible to tack on one more character to the left or to the right of t to form t′ such that t′ is a string in language L and t′ is a substring of s.

In layman's terms, if you see "apple 12.345" you should go after "12.345", not "2.34".
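To illustrate the "maximal" idea with indices, here is a small sketch. SIMPLE_NUMBER is a deliberately simplified stand-in for language L (digits with an optional fractional part); it ignores signs, hexadecimal and exponents, so it is not a solution to the question.

```python
import re

# Simplified stand-in for language L: digits with an optional fractional part.
SIMPLE_NUMBER = re.compile(r"\d+(?:\.\d+)?")

text = "apple 12.345"
for found in SIMPLE_NUMBER.finditer(text):
    # span() gives the (start, end) indices of the match inside text.
    print(found.span(), repr(found.group(0)))
# (6, 12) '12.345'
```

Because the scan runs left to right and the quantifiers are greedy, the match starts at the first digit and extends as far as it can, so "2.34" is never reported on its own here.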

Indices matter. Sometimes, it makes no sense to say that the letter "a" is a sub-string of "abracadabra". Which letter "a" is it? Is it the third "a" from the left, or the second?

We define a string to be a mapping M from a finite subset of the natural numbers to the ASCII character set such that the difference between the maximum and the minimum of the domain of M is one less than the cardinality of the domain of M. In other words, the domain is a contiguous range of indices.

For any string SML and any string LRG, we say that SML is a sub-string of LRG if and only if SML[k] = LRG[k] for all k in the domain of string SML.

END OF EDIT
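A minimal sketch of this index-based definition, using Python dicts as the mappings. The names as_mapping and is_substring are hypothetical, and requiring every index of SML to exist in LRG is an assumption made so that SML[k] = LRG[k] is well defined.

```python
def as_mapping(text: str, start: int = 0) -> dict:
    """Model a string as a mapping from consecutive indices to characters."""
    return {start + offset: ch for offset, ch in enumerate(text)}

def is_substring(sml: dict, lrg: dict) -> bool:
    """SML is a sub-string of LRG when they agree on every index of SML."""
    return all(k in lrg and lrg[k] == ch for k, ch in sml.items())

word = as_mapping("abracadabra")
print(is_substring(as_mapping("a", start=3), word))   # True: index 3 holds "a"
print(is_substring(as_mapping("a", start=1), word))   # False: index 1 holds "b"
```

Under this model, "the letter a" is ambiguous until a start index is attached to it.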

You can use negative lookarounds to avoid undesired corner cases. Use alternation patterns to include incompatible patterns such as hexadecimal numbers:

[+-]?(?:(?:\b(?<!\d\.)\d+(?:\.\d+)?|(?<!\d)\.\d+)(?!\.)(?:e\d+)?|\b0x[0-9A-F]+)\b

Demo: https://regex101.com/r/HXxct5/2
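For reference, a short sketch of how the pattern above could be applied from Python with re.sub. The wrapper name strip_numbers is hypothetical, and the sample inputs are taken from the question.

```python
import re

# The pattern from the answer above, compiled unchanged.
NUMBER = re.compile(
    r"[+-]?(?:(?:\b(?<!\d\.)\d+(?:\.\d+)?|(?<!\d)\.\d+)(?!\.)(?:e\d+)?|\b0x[0-9A-F]+)\b"
)

def strip_numbers(text: str) -> str:
    """Delete every match of NUMBER from text (hypothetical wrapper)."""
    return NUMBER.sub("", text)

print(strip_numbers("A.p.p.l.e (45) Orange (5.11) Kiwi"))  # A.p.p.l.e () Orange () Kiwi
print(strip_numbers("3.10.4"))                              # 3.10.4 (left alone)
```

Note that the hexadecimal branch only accepts uppercase digits A to F, as written in the answer.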

There are quite a few requirements, so I could be missing something here, but this is still worth a try:

import re
import itertools


def filter_nums(text: str) -> str:

    def is_a_number(x: str) -> bool:
        try:
            if re.search(r'^0[xX]', x):
                int(x, 16)  # hexadecimal literal such as 0x77F
            else:
                float(x)    # also accepts forms like +10, .5 and 3.456e11
            return True
        except ValueError:
            return False

    tokens = text.split(' ')
    # Pull out every run of characters that could belong to a number.
    suspect_tokens = [re.findall(r"[A-Fa-f0-9\-\.\+xX]+", elem) for elem in tokens]
    suspect_tokens = list(itertools.chain(*suspect_tokens))
    num_tokens = [elem for elem in suspect_tokens if is_a_number(elem)]

    # Replace the longest tokens first, so that "45" does not trigger a
    # replacement of the 45 inside 3.456e11.
    for num_token in sorted(num_tokens, key=len, reverse=True):
        text = text.replace(num_token, '')
    return text

text = "A.p.p.l.e (45) Orange (5.11) Kiwi [0x77F] {0X77F +10 .,.,-5000!343£ ///3.456e11sd 3.10.4 000001"
print(filter_nums(text))
# prints: A.p.p.l.e () Orange () Kiwi [] {  .,.,!£ ///sd 3.10.4
# (followed by one trailing space, left behind by the removed 000001)

>>> import re
>>> inpoot = "A.p.p.l.e (45) Orange (5.11) Kiwi"  # WE HAVE
>>> pattern = re.compile(r"\d+\.?\d*")
>>> re.sub(pattern, "", inpoot)
'A.p.p.l.e () Orange () Kiwi'

Try this:

>>> import re
>>> inpoot = "A.p.p.l.e (45) Orange (5.11) Kiwi"
>>> re.sub(r'(\d+\.\d+)|(\d+)', '', inpoot)
'A.p.p.l.e () Orange () Kiwi'
  • The first part tries to find a decimal number with the pattern: digits, decimal point, digits.

  • The second part looks for a plain number without a decimal point.

The first part goes first because alternation tries its branches from left to right and stops at the first one that matches, and we want the longer of the two.
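A quick sketch of why the order matters, assuming the decimal point is escaped as \. in both branches. The variable names and the sample text are illustrative only.

```python
import re

text = "(45) (5.11)"

# Decimal branch first: "5.11" is consumed as a single match.
decimal_first = re.sub(r"(\d+\.\d+)|(\d+)", "", text)
# Integer branch first: "5" and "11" match separately and the dot survives.
integer_first = re.sub(r"(\d+)|(\d+\.\d+)", "", text)

print(decimal_first)   # () ()
print(integer_first)   # () (.)
```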
