
Validating integers in Python strings

We have a large number of strings containing substrings that may be integers, e.g.

mystring = "123 345 456 567 678 789"

and need to verify that:

a. each substring is in fact an integer, e.g. mystring = "123 345 456 567 abc 789" fails when it reaches 'abc'

b. each integer is within the range 0 <= i <= 10000, e.g. mystring = "123 -345 456 567 678 789" fails when it reaches '-345'

One solution is:

mylist= [int(i) for i in mystring.split() if isinstance(int(i), int) and (0 <= int(i) <= 10000)]

Questions are:

i. In the list comprehension, for each i, does the int(i) get evaluated once or multiple times?

ii. Is there an alternative method that could be faster (as the volume of strings is large and each string could contain hundreds to thousands of integers)?

I think that I would probably use something like:

try:
    if not all(0 <= int(i) <= 10000 for i in mystring.split()):
        raise ValueError("arg!")
except ValueError:
    print("Oops, didn't pass")

This has the advantage that it short circuits if something fails to convert to an int or if it doesn't fall in the correct range.
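The short-circuit behaviour can be made visible with a small sketch. The helper name checked and the list seen are hypothetical and exist only to record which substrings were actually converted:

```python
seen = []  # hypothetical log of the substrings that were actually examined

def checked(part):
    """Record the part, then apply the range test (int() may raise ValueError)."""
    seen.append(part)
    return 0 <= int(part) <= 10000

# all() stops pulling items from the generator at the first False result
result = all(checked(p) for p in "123 -345 456 567".split())

print(result)  # False
print(seen)    # ['123', '-345']: '456' and '567' were never converted
```

The same applies when int() raises ValueError: the generator is abandoned at that point, so the remaining substrings are not converted.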

Here's a silly test:

def test_str(mystring):
    try:
        return all( (0 <= int(i) <= 10000) for i in mystring.split() )
    except ValueError:
        return False

print(test_str("123 345 456 567 abc 789"))
print(test_str("123 345 456 567 -300 789"))
print(test_str("123 345 456 567 300 789"))

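int(i) gets evaluated multiple times: once inside isinstance(), once for the range test, and once more for the value that goes into the list, so up to three times per substring. Also, isinstance(int(i), int) is useless because int() raises an exception on non-integer input rather than silently returning a non-int.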
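One way to check the number of conversions is to swap int() for a wrapper that logs its calls. This is only an illustrative sketch; counting_int and calls are hypothetical names, and the input contains only valid integers so that no exception interferes with the count:

```python
calls = []  # one entry per conversion performed

def counting_int(text):
    """Stand-in for int() that records every call."""
    calls.append(text)
    return int(text)

mystring = "123 345 456"
mylist = [counting_int(i) for i in mystring.split()
          if isinstance(counting_int(i), int) and (0 <= counting_int(i) <= 10000)]

print(mylist)      # [123, 345, 456]
print(len(calls))  # 9: three conversions for each of the three substrings
```

Note that the chained comparison 0 <= x <= 10000 evaluates x only once, which is why the count is three per substring and not four.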

There's nothing wrong with writing the code as an old-fashioned loop. It gives you the most flexibility regarding error handling. If you're worried about efficiency, remember that a list comprehension is essentially syntactic sugar for such a loop.

intlist = []
for part in mystring.split():
    try:
        val = int(part)
    except ValueError:
        continue  # or report the error
    if val < 0 or val > 10000:
        continue  # or report the error
    intlist.append(val)
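If the errors should be reported instead of skipped, the same loop can collect them alongside the valid values. A minimal sketch, assuming a hypothetical helper parse_ints with hypothetical low and high parameters for the bounds:

```python
def parse_ints(text, low=0, high=10000):
    """Split text on whitespace and sort each part into values or errors."""
    values, errors = [], []
    for part in text.split():
        try:
            val = int(part)  # converted exactly once per part
        except ValueError:
            errors.append((part, "not an integer"))
            continue
        if not low <= val <= high:
            errors.append((part, "out of range"))
            continue
        values.append(val)
    return values, errors

values, errors = parse_ints("123 -345 456 abc 789")
print(values)  # [123, 456, 789]
print(errors)  # [('-345', 'out of range'), ('abc', 'not an integer')]
```

To make the whole string fail on the first problem, as the question asks, the continue statements could be replaced by an early return or a raise.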

Your solution doesn't work if there are non-numeric strings:

ValueError: invalid literal for int() with base 10: 'abc'

I'd do something like this:

mystring = "123 345 456 -123 567 abc 678 789"

mylist = []
for i in mystring.split():
    try:
        ii = int(i)
    except ValueError:
        print("{} is bad".format(i))
        continue  # skip the range check; ii would be stale or undefined here
    if 0 <= ii <= 10000:
        mylist.append(ii)
    else:
        print("{} is out of range".format(i))
print(mylist)

To answer your questions:

i. Yes, more than once.

ii. Yes, several examples have been provided.

My output looks like this:

-123 is out of range
abc is bad
[123, 345, 456, 567, 678, 789]

You may also use regex:

import re
mystring = "123 345 456 567 abc 789 -300 ndas"

# an optional minus sign followed by digits
re_integer = r'(-?\d+)'
# require whitespace or the string boundary on both sides, so that pieces
# of other tokens such as '1.5' or 'abc123' are not matched
re_token_start = r'(?<!\S)'
re_token_end = r'(?!\S)'

# match all integers
matches = re.finditer(re_token_start + re_integer + re_token_end, mystring)

# extract the str, convert to int for all matches
int_matches = [int(num.group(1)) for num in matches]

# filter based on criteria
in_range = [rnum for rnum in int_matches if 0 <= rnum <= 10000]

>>> in_range
[123, 345, 456, 567, 789]
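Note that the regex approach filters: tokens that don't look like integers are silently dropped, so 'abc' and 'ndas' never cause a failure. If the goal is to validate, one possible variation is to test every whitespace-separated token against the pattern. This is only a sketch; INTEGER_TOKEN and split_tokens are hypothetical names, and Pattern.fullmatch needs Python 3.4 or later:

```python
import re

INTEGER_TOKEN = re.compile(r'-?\d+')  # whole token: optional minus, then digits

def split_tokens(text):
    """Return (integers, rejected) for the whitespace-separated tokens of text."""
    integers, rejected = [], []
    for token in text.split():
        if INTEGER_TOKEN.fullmatch(token):
            integers.append(int(token))
        else:
            rejected.append(token)
    return integers, rejected

integers, rejected = split_tokens("123 345 456 567 abc 789 -300 ndas")
print(integers)  # [123, 345, 456, 567, 789, -300]
print(rejected)  # ['abc', 'ndas']
```

A non-empty rejected list means the string fails check (a); the range test for check (b) can then be applied to integers.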

It seems I missed the heat of the debate, but here's another - potentially faster - approach:

>>> f = lambda s: set(s) <= set('0123456789 ') and not any(int(n) > 10000 for n in s.split())

Tests:

>>> s1 = '123 345 456 567 678 789'
>>> s2 = '123 345 456 567 678 789 100001'
>>> s3 = '123 345 456 567 678 789 -3'
>>> s4 = '123 345 456 567 678 789 ba'
>>> f(s1)
True
>>> f(s2)
False
>>> f(s3)
False
>>> f(s4)
False

I did not time it, but I suspect it might be faster than the other proposed solutions, as the set comparison already takes care of both the x < 0 test and non-parsable strings like abc. Since the two tests (the set comparison and the numerical range check) are joined by a logical and, failure of the first prevents the second one from running.
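For anyone who wants to measure this instead of guessing, here is a rough timeit sketch. The function names check_all and check_set, the sample data, and the repeat count are all assumptions made for illustration; actual timings depend on the machine, the Python version, and how early a bad token appears in the real strings:

```python
import timeit

def check_all(s):
    """try/all() approach: convert each part and test the range."""
    try:
        return all(0 <= int(i) <= 10000 for i in s.split())
    except ValueError:
        return False

DIGITS_AND_SPACE = set('0123456789 ')

def check_set(s):
    """Set approach: reject stray characters first, then test the upper bound."""
    return set(s) <= DIGITS_AND_SPACE and not any(int(n) > 10000 for n in s.split())

# hypothetical sample: 2000 valid integers in the range 0..10000
sample = " ".join(str(n % 10001) for n in range(2000))

for func in (check_all, check_set):
    seconds = timeit.timeit(lambda: func(sample), number=200)
    print("{}: {:.4f} s for 200 calls".format(func.__name__, seconds))
```

The two checks are not strictly equivalent: int() also accepts forms such as '+5', and split() also separates on tabs and newlines, both of which the set test rejects.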

HTH!
