Numpy Convert String Representation of Boolean Array To Boolean Array

Is there a native numpy way to convert an array of string representations of booleans, e.g.:

['True','False','True','False']

to an actual boolean array that I can use for masking/indexing? I could do a for loop going through and rebuilding the array, but for large arrays this is slow.

You should be able to do a boolean comparison, IIUC, whether the dtype is a string or object:

>>> a = np.array(['True', 'False', 'True', 'False'])
>>> a
array(['True', 'False', 'True', 'False'], 
      dtype='|S5')
>>> a == "True"
array([ True, False,  True, False], dtype=bool)

or

>>> a = np.array(['True', 'False', 'True', 'False'], dtype=object)
>>> a
array(['True', 'False', 'True', 'False'], dtype=object)
>>> a == "True"
array([ True, False,  True, False], dtype=bool)
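The result of the comparison can be used directly as a mask, which is what the question asks for. A minimal sketch, where `values` is a made-up data array for illustration and not part of the original post:

```python
import numpy as np

# String representations of booleans, as in the question.
a = np.array(['True', 'False', 'True', 'False'])

# Hypothetical data array of the same length, for illustration only.
values = np.array([10, 20, 30, 40])

# Elementwise comparison yields a real boolean array.
mask = (a == 'True')

# Boolean indexing keeps the entries where mask is True.
selected = values[mask]
print(selected)  # [10 30]
```

Note that any string other than 'True' (including 'true' or ' True') compares unequal and therefore ends up as False in the mask.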

I've found a method that's even faster than DSM's, taking inspiration from Eric, though the improvement is best seen with smaller lists of values; at very large sizes, the cost of the iteration itself starts to outweigh the advantage of performing the truth testing during creation of the numpy array rather than after. I tested with both is and ==, because is only works when the strings are interned and would not work with non-interned strings ('True' is probably going to be a literal in the script, so it should be interned, though). While my version with == was slower than with is, it was still much faster than DSM's version.
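The difference between is and == shows up with a string that is built at runtime. The following is only a sketch, assuming CPython, where whether two equal strings are the same object is an implementation detail; the names `literal` and `built` are chosen for illustration:

```python
import numpy as np

literal = 'True'
# Built at runtime, so it is usually a distinct object from the literal.
built = ''.join(['Tr', 'ue'])

print(built == literal)  # equality compares contents
print(built is literal)  # identity compares objects; implementation-dependent

# Using == inside the generator does not depend on interning.
x = [built, 'False', built]
y = np.fromiter((e == literal for e in x), dtype=bool)
print(y)
```

Because the identity test can silently give False for equal strings read from a file or built by string operations, == is the safer default unless the input is known to be interned.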

Test setup:

import timeit
def timer(statement, count):
    return timeit.repeat(statement, "from random import choice;import numpy as np;x = [choice(['True', 'False']) for i in range(%i)]" % count)

>>> stateIs = "y = np.fromiter((e is 'True' for e in x), bool)"
>>> stateEq = "y = np.fromiter((e == 'True' for e in x), bool)"
>>> stateDSM = "y = np.array(x) == 'True'"
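Note that timeit.repeat runs each statement 1,000,000 times per measurement by default, so each figure in the timings is the total number of seconds for a million executions. For a quicker check, a sketch along the following lines passes repeat and number explicitly; the helper name `quick_timer` and the chosen counts are assumptions, not part of the original test:

```python
import timeit

SETUP = (
    "from random import choice; import numpy as np; "
    "x = [choice(['True', 'False']) for i in range(%i)]"
)

def quick_timer(statement, count, repeat=3, number=1000):
    # Returns `repeat` timings, each the total seconds for `number` runs.
    return timeit.repeat(statement, SETUP % count, repeat=repeat, number=number)

state_eq = "y = np.fromiter((e == 'True' for e in x), bool)"
state_dsm = "y = np.array(x) == 'True'"

# Compare the best (minimum) timing of each statement for 100 items.
for name, stmt in [("fromiter ==", state_eq), ("array ==", state_dsm)]:
    print(name, min(quick_timer(stmt, 100)))
```

Absolute numbers will differ by machine and by Python and numpy versions, so only the ratio between the statements is meaningful.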

With 1000 items, the faster statements take about 66% of the time of DSM's:

>>> timer(stateIs, 1000)
[101.77722641656146, 100.74985342340369, 101.47228618107965]
>>> timer(stateEq, 1000)
[112.26464996250706, 112.50754567379681, 112.76057346127709]
>>> timer(stateDSM, 1000)
[155.67689949529995, 155.96820504501557, 158.32394669279802]

For smaller string arrays (in the hundreds rather than thousands), the elapsed time is less than 50% of DSM's:

>>> timer(stateIs, 100)
[11.947757485669172, 11.927990253608186, 12.057855628259858]
>>> timer(stateEq, 100)
[13.064947253943501, 13.161545451986967, 13.30599035623618]
>>> timer(stateDSM, 100)
[31.270060799078237, 30.941749748808434, 31.253922641324607]

With 50 items per list, a bit over 25% of DSM's:

>>> timer(stateIs, 50)
[6.856538342483873, 6.741083326021908, 6.708402786859551]
>>> timer(stateEq, 50)
[7.346079345032194, 7.312723444475523, 7.309259899921017]
>>> timer(stateDSM, 50)
[24.154247576229864, 24.173593700599667, 23.946403452288905]

For 5 items, about 11% of DSM's:

>>> timer(stateIs, 5)
[1.8826215278058953, 1.850232652068371, 1.8559381315990322]
>>> timer(stateEq, 5)
[1.9252821868467436, 1.894011299061276, 1.894306935199893]
>>> timer(stateDSM, 5)
[18.060974208809057, 17.916322392367874, 17.8379771602049]

Is this good enough?

import numpy as np
my_list = ['True', 'False', 'True', 'False']
# Use a list comprehension: np.array does not consume a bare generator.
np.array([x == 'True' for x in my_list])

It's not native, but if you're starting with a non-native list anyway, it really shouldn't matter.
