How to extract numbers from a list of strings?

How should I extract only the numbers from

a = ['1 2 3', '4 5 6', 'invalid']

I have tried:

mynewlist = [s for s in a if s.isdigit()]
print mynewlist

and

for strn in a:
    values = map(float, strn.split())
print values

Both failed because there is a space between the numbers.

Note: the output I am trying to achieve is:

[1, 2, 3, 4, 5, 6]

I think you need to split each item in the list on whitespace and then process the resulting pieces.

a = ['1 2 3', '4 5 6', 'invalid']
numbers = []
for item in a:
    for subitem in item.split():
        if subitem.isdigit():
            numbers.append(subitem)
print(numbers)

['1', '2', '3', '4', '5', '6']

Or as a neat and tidy comprehension:

[item for subitem in a for item in subitem.split() if item.isdigit()]
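The comprehension above returns strings, while the question asks for integers. A minimal sketch of the same idea with an `int` conversion follows; the function name `extract_ints` is only illustrative and does not come from the original answer. Because `str.isdigit()` is False for tokens such as '-3' or '1.5', this version silently skips negative numbers and decimals.

```python
def extract_ints(strings):
    """Return every whitespace-separated token made only of digits, as an int."""
    return [int(token) for s in strings for token in s.split() if token.isdigit()]


a = ['1 2 3', '4 5 6', 'invalid']
print(extract_ints(a))  # [1, 2, 3, 4, 5, 6]
```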

This works for your particular case, because each element of the list is itself a string, so the list has to be flattened. Note that iterating over a string yields single characters, so this version only handles single-digit numbers:

new_list = [int(item) for sublist in a for item in sublist if item.isdigit()]
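To illustrate the difference between iterating over characters and iterating over whitespace-separated tokens, here is a small comparison. The input containing '10 20' is an assumed example, not part of the question.

```python
a = ['10 20', '4 5 6', 'invalid']

# Iterating over the string itself yields one character at a time,
# so multi-digit numbers are broken into separate digits.
per_char = [int(ch) for s in a for ch in s if ch.isdigit()]

# Splitting on whitespace first keeps each number intact.
per_token = [int(tok) for s in a for tok in s.split() if tok.isdigit()]

print(per_char)   # [1, 0, 2, 0, 4, 5, 6]
print(per_token)  # [10, 20, 4, 5, 6]
```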

Assuming the list contains only strings:

[int(word) for sublist in map(str.split, a) for word in sublist if word.isdigit()]

With the help of sets you can do:

>>> a = ['1 2 3', '4 5 6', 'invalid']
>>> valid = set(" 0123456789")
>>> [int(y) for x in a if set(x) <= valid for y in x.split()]
[1, 2, 3, 4, 5, 6]

This includes the numbers from a string only if the string consists entirely of characters from the valid set.
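One consequence of the set check is that a string mixing numbers with any other character is dropped as a whole, rather than token by token. A short sketch, where the helper name `ints_from_clean_strings` and the input '7 x 8' are assumptions made for illustration:

```python
valid = set(" 0123456789")


def ints_from_clean_strings(strings):
    """Keep numbers only from strings made up entirely of digits and spaces."""
    return [int(tok) for s in strings if set(s) <= valid for tok in s.split()]


mixed = ['1 2 3', '7 x 8', 'invalid']
print(ints_from_clean_strings(mixed))  # [1, 2, 3]
```

Here '7 x 8' contributes nothing, because the character 'x' is not in `valid`.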

mynewlist = [s for s in a if s.isdigit()]
print mynewlist

doesn't work because you are iterating over the contents of the list, which is made up of three strings:

  1. '1 2 3'
  2. '4 5 6'
  3. 'invalid'

That means you have to iterate again over each of those strings.

You can try something like:

mynewlist = []
for s in a:
    mynewlist += [digit for digit in s if digit.isdigit()] 
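To see why the filter from the question returns an empty list, it can help to evaluate `isdigit()` on the whole strings and then on their whitespace-separated pieces. This is only a small demonstration; the variable names are illustrative.

```python
a = ['1 2 3', '4 5 6', 'invalid']

# A space is not a digit, so isdigit() is False for every whole string.
whole = [s.isdigit() for s in a]
print(whole)  # [False, False, False]

# After splitting on whitespace, each piece of '1 2 3' passes the test.
pieces = '1 2 3'.split()
print(pieces)                          # ['1', '2', '3']
print([p.isdigit() for p in pieces])   # [True, True, True]
```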

A one-liner solution:

new_list = [int(m) for n in a for m in n if m in '0123456789']

There are many ways to extract numbers from a list of strings.

Assume a general list of strings such as the following:

input_list = ['abc.123def45, ghi67 890 12, jk345', '123, 456 78, 90', 'abc def, ghi'] * 10000

If conversion to integers is not required:

import itertools
import re

def test_as_str(input_list):
    output_list = []
    
    for string in input_list:
        output_list += re.findall(r'\d+', string)
    
    return output_list

%timeit -n 10 -r 7 test_as_str(input_list)
> 37.6 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def test_as_str(input_list):
    output_list = []
    
    [output_list.extend(re.findall(r'\d+', string)) for string in input_list]
    
    return output_list

%timeit -n 10 -r 7 test_as_str(input_list)
> 39.5 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def test_as_str(input_list):
    return list(itertools.chain(*[re.findall(r'\d+', string) for string in input_list]))

%timeit -n 10 -r 7 test_as_str(input_list)
> 40.4 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def test_as_str(input_list):
    return list(filter(None, [item for string in input_list for item in re.split(r'[^\d]+', string)]))

%timeit -n 10 -r 7 test_as_str(input_list)
> 42.8 ms ± 372 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
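As a quick sanity check, the `re.findall` and `re.split` variants can be compared on a single copy of the example strings. This sketch uses `r'\D+'`, which matches the same characters as `[^\d]+`; the variable names are illustrative.

```python
import re

sample = ['abc.123def45, ghi67 890 12, jk345', '123, 456 78, 90', 'abc def, ghi']

# Collect every run of digits directly.
via_findall = [tok for s in sample for tok in re.findall(r'\d+', s)]

# Split on runs of non-digits and drop the empty strings that result.
via_split = [tok for s in sample for tok in re.split(r'\D+', s) if tok]

print(via_findall)
# ['123', '45', '67', '890', '12', '345', '123', '456', '78', '90']
print(via_findall == via_split)  # True
```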

Conversion to integers can also be included:

def test_as_int(input_list):
    output_list = []
    
    for string in input_list:
        output_list += re.findall(r'\d+', string)
    
    return list(map(int, output_list))

%timeit -n 10 -r 7 test_as_int(input_list)
> 44.7 ms ± 232 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def test_as_int(input_list):
    output_list = []
    
    for string in input_list:
        output_list += re.findall(r'\d+', string)
    
    return [int(item) for item in output_list]

%timeit -n 10 -r 7 test_as_int(input_list)
> 47.8 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def test_as_int(input_list):
    return [int(item) for string in input_list for item in re.findall(r'\d+', string)]

%timeit -n 10 -r 7 test_as_int(input_list)
> 48.3 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def test_as_int(input_list):
    return [int(item) for string in input_list for item in re.split(r'[^\d]+', string) if item]

%timeit -n 10 -r 7 test_as_int(input_list)
> 51.4 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def test_as_int(input_list):
    return [int(item) for string in input_list for item in re.split(r'[^\d]+', string) if item.isdigit()]

%timeit -n 10 -r 7 test_as_int(input_list)
> 54.9 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def test_as_int(input_list):
    return [int(item) for string in input_list for item in re.split(r'[^\d]+', string) if len(item)]

%timeit -n 10 -r 7 test_as_int(input_list)
> 55.5 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The performance tests, which show little difference between the variants, were run on Windows in a Python 3.8.8 virtual environment.
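The `%timeit` lines above are IPython magic commands and will not run in a plain Python script. A rough equivalent using the standard `timeit` module might look like the sketch below. The function name `findall_ints` and the smaller repetition factor (`* 1000` instead of `* 10000`) are assumptions chosen to keep the run short, and absolute timings will differ between machines.

```python
import re
import timeit

# Smaller than the list used above, purely to keep this sketch quick to run.
input_list = ['abc.123def45, ghi67 890 12, jk345', '123, 456 78, 90', 'abc def, ghi'] * 1000


def findall_ints(strings):
    """Extract every run of digits from every string and convert it to int."""
    return [int(tok) for s in strings for tok in re.findall(r'\d+', s)]


# repeat=7 and number=10 correspond to the "-r 7 -n 10" options of %timeit.
timings = timeit.repeat(lambda: findall_ints(input_list), repeat=7, number=10)

# Each entry is the total time for 10 calls; convert to mean milliseconds per call.
mean_ms = sum(timings) / len(timings) / 10 * 1000
print(f"{mean_ms:.2f} ms per loop (mean of 7 runs, 10 loops each)")
```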
