Validating integers in Python strings

We have a large number of strings containing substrings that are possibly integers, e.g.:

mystring = "123 345 456 567 678 789"

and need to verify that:

a. each substring is in fact an integer, e.g. mystring = "123 345 456 567 abc 789" fails when it reaches 'abc'

b. each integer is within the range 0 <= i <= 10000, e.g. mystring = "123 -345 456 567 678 789" fails when it reaches '-345'

One solution is:

mylist= [int(i) for i in mystring.split() if isinstance(int(i), int) and (0 <= int(i) <= 10000)]

Questions are:

i. In the list comprehension, for each i, does the int(i) get evaluated once or multiple times?

ii. Is there an alternative method that could be faster (as the volume of strings is large and each string could contain hundreds to thousands of integers)?

I think that I would probably use something like:

try:
    if not all((0 <= int(i) <= 10000) for i in mystring.split()):
        raise ValueError("arg!")
except ValueError:
    print("Oops, didn't pass")

This has the advantage that it short-circuits if something fails to convert to an int or if it doesn't fall in the correct range.

Here's a silly test:

def test_str(mystring):
    try:
        return all( (0 <= int(i) <= 10000) for i in mystring.split() )
    except ValueError:
        return False

print(test_str("123 345 456 567 abc 789"))   # False: 'abc' is not an integer
print(test_str("123 345 456 567 -300 789"))  # False: -300 is out of range
print(test_str("123 345 456 567 300 789"))   # True
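The short-circuit behaviour can be made visible with a small sketch. It assumes Python 3, and the logged_tokens helper is hypothetical, added here only to record which tokens were actually consumed before all() gave up.

```python
def logged_tokens(mystring, seen):
    """Yield tokens one by one, recording each in `seen` (illustrative helper)."""
    for token in mystring.split():
        seen.append(token)
        yield token

# Out-of-range value: all() stops as soon as one test is False.
seen_range = []
result = all(0 <= int(t) <= 10000 for t in logged_tokens("1 2 -3 4 5", seen_range))
print(result, seen_range)  # False ['1', '2', '-3']

# Non-integer token: int() raises, so nothing after 'abc' is consumed either.
seen_bad = []
try:
    all(0 <= int(t) <= 10000 for t in logged_tokens("1 abc 3", seen_bad))
except ValueError:
    pass
print(seen_bad)  # ['1', 'abc']
```

In both cases the tokens after the first failure are never converted, which is where the saving on long strings comes from.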

int(i) gets evaluated multiple times: twice in the condition (once for isinstance, once for the range check) and a third time to produce the list element for tokens that pass. Also, isinstance(int(i), int) is useless because int() will raise an exception on non-integer input, not silently return a non-int.
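To check the repeated evaluation rather than take it on faith, here is a minimal sketch assuming Python 3. counting_int and counter are hypothetical names introduced for illustration; counting_int simply wraps int() and counts its calls.

```python
counter = {"calls": 0}

def counting_int(token):
    """Stand-in for int() that records how often it is called (illustrative only)."""
    counter["calls"] += 1
    return int(token)

mystring = "123 345 456"

# Same shape as the comprehension in the question, with int() swapped for the counter.
mylist = [counting_int(i) for i in mystring.split()
          if isinstance(counting_int(i), int) and (0 <= counting_int(i) <= 10000)]
first_count = counter["calls"]
print(first_count)  # 9: three conversions for each of the three accepted tokens

# Converting once per token: map() applies the conversion, the filter reuses the value.
counter["calls"] = 0
once = [v for v in map(counting_int, mystring.split()) if 0 <= v <= 10000]
print(counter["calls"])  # 3
```

Note that the map() variant still raises ValueError on a token such as 'abc', so it only addresses the repeated conversion, not the error handling.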

There's nothing wrong with writing the code as an old-fashioned loop. It gives you the most flexibility regarding error handling. If you're worried about efficiency, remember that a list comprehension is essentially syntactic sugar for such a loop.

intlist = []
for part in mystring.split():
    try:
        val = int(part)
    except ValueError:
        continue  # or report the error
    if val < 0 or val > 10000:
        continue  # or report the error
    intlist.append(val)
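As one way to fill in the "or report the error" branches, the loop could collect the problems instead of skipping them silently. This is only a sketch assuming Python 3; parse_ints, its parameters, and the (position, token, reason) tuple layout are hypothetical choices, not part of the answer above.

```python
def parse_ints(mystring, low=0, high=10000):
    """Return (valid ints, list of (position, token, reason) problems)."""
    values, problems = [], []
    for pos, part in enumerate(mystring.split()):
        try:
            val = int(part)
        except ValueError:
            problems.append((pos, part, "not an integer"))
            continue
        if not low <= val <= high:
            problems.append((pos, part, "out of range"))
            continue
        values.append(val)
    return values, problems

values, problems = parse_ints("123 345 456 -123 567 abc 678 789")
print(values)    # [123, 345, 456, 567, 678, 789]
print(problems)  # [(3, '-123', 'out of range'), (5, 'abc', 'not an integer')]
```

A caller that only needs a pass/fail answer can test whether problems is empty.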

Your solution doesn't work if there are non-numeric strings:

ValueError: invalid literal for int() with base 10: 'abc'

I'd do something like this:

mystring = "123 345 456 -123 567 abc 678 789"

mylist = []
for i in mystring.split():
    try:
        ii = int(i)
    except ValueError:
        print("{} is bad".format(i))
        continue  # skip the range check; ii would still hold the previous value
    if 0 <= ii <= 10000:
        mylist.append(ii)
    else:
        print("{} is out of range".format(i))
print(mylist)

To answer your questions:

i. Yes, more than once.

ii. Yes, several examples have been provided.

My output looks like this:

-123 is out of range

abc is bad

[123, 345, 456, 567, 678, 789]

You may also use regex:

import re
mystring = "123 345 456 567 abc 789 -300 ndas"

# an optionally negative integer that forms a whole whitespace-delimited token,
# so pieces of floats ("1.5") or mixed tokens ("abc123") are not matched
re_integer = r'(?<!\S)(-?\d+)(?!\S)'

# match all integers
matches = re.finditer(re_integer, mystring)

# extract the str, convert to int for all matches
int_matches = [int(num.group(1)) for num in matches]

# filter based on criteria
in_range = [rnum for rnum in int_matches if 0 <= rnum <= 10000]

>>> in_range
[123, 345, 456, 567, 789]
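The regex above extracts the valid integers and silently skips everything else. If the goal is to reject the whole string as soon as one token is bad, as in the question, a regex can be applied per token instead. The following is a sketch assuming Python 3.4+ (for re.fullmatch); all_in_range and _TOKEN are hypothetical names, and the pattern deliberately accepts only plain ASCII digits.

```python
import re

# Plain non-negative decimal integer; no sign, no decimal point, ASCII digits only.
_TOKEN = re.compile(r'[0-9]+')

def all_in_range(text, low=0, high=10000):
    """Return True only if every whitespace-separated token is a plain
    non-negative decimal integer within [low, high]."""
    for token in text.split():
        if _TOKEN.fullmatch(token) is None:
            return False  # covers 'abc', '-300', '1.5'
        if not low <= int(token) <= high:
            return False
    return True

print(all_in_range("123 345 456 567 678 789"))  # True
print(all_in_range("123 345 456 567 abc 789"))  # False
print(all_in_range("123 -345 456 567 678 789")) # False
```

Because the pattern has no minus sign, negative numbers are rejected before any conversion, which only makes sense while the lower bound stays at 0 or above.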

It seems I missed the heat of the debate, but here's another, potentially faster, approach:

>>> f = lambda s: set(s) <= set('0123456789 ') and not any(int(n) > 10000 for n in s.split())

Tests:

>>> s1 = '123 345 456 567 678 789'
>>> s2 = '123 345 456 567 678 789 100001'
>>> s3 = '123 345 456 567 678 789 -3'
>>> s4 = '123 345 456 567 678 789 ba'
>>> f(s1)
True
>>> f(s2)
False
>>> f(s3)
False
>>> f(s4)
False

I did not time it, but I suspect it might be faster than the other proposed solutions, as the set comparison already takes care of both the x < 0 test and non-parsable strings like abc. Since the two tests (the set comparison and the numerical range check) are joined with a logical and, failure of the first prevents the second from running.

HTH!
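Since the speed claim is untested, one way to check it is with timeit. The sketch below assumes Python 3; by_exception and by_charset are hypothetical names for the try/all() approach and the set-based approach, and the sample string is synthetic, so real data may behave differently.

```python
import timeit

def by_exception(s):
    """Convert every token and rely on ValueError for bad input."""
    try:
        return all(0 <= int(i) <= 10000 for i in s.split())
    except ValueError:
        return False

def by_charset(s):
    """Reject any character outside digits and space, then check the upper bound."""
    return set(s) <= set('0123456789 ') and not any(int(n) > 10000 for n in s.split())

# Synthetic data (an assumption): 1000 valid integers in one string.
sample = ' '.join(str(n % 10001) for n in range(1000))

for func in (by_exception, by_charset):
    seconds = timeit.timeit(lambda: func(sample), number=200)
    print(func.__name__, round(seconds, 4))
```

The two functions are not strictly equivalent: int() accepts tokens such as '+5', and split() also separates on tabs and newlines, both of which the character-set test rejects.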
