Python/NumPy: Vectorizing the combining of row elements with conditions
Is there a way to vectorize the combining of row elements with certain conditions?
Conditions:
Note that (a) my array has a variable number of rows and columns and will grow quite large, hence my interest in vectorization here, and (b) each non-empty string element starts with a '$' character:
arr = np.array([
['', '', '$c'],
['', '$b', '' ],
['', '$b', '$c'],
['$a', '', '' ],
['$a', '', '$c'],
['$a', '$b', '' ],
['$a', '$b', '$c']
], dtype='U2')
Desired result:
res = [
['$c'], # <-- reduce to single char element
['$b'], # <-- reduce to single char element
['$b\n$c'], # <-- combine char elements with '\n' delimiter
['$a'], # <-- reduce to single char element
['$a\n$c'], # <-- combine char elements with '\n' delimiter
['$a\n$b'], # <-- combine char elements with '\n' delimiter
['$a\n$b\n$c'] # <-- combine char elements with '\n' delimiter
]
Any insight into a vectorized approach to achieve the desired end result would be much appreciated. Thank you in advance.
Update:
Due to differences in requirements, the suggested answer from Reduce multi-dimensional array of strings along axis in Numpy is not the best fit for my use case. See the accepted answer below.
Even under your updated circumstances, I would not recommend a NumPy-based solution for this, and would instead use:
arr = arr.tolist()
empty_removed = [[el for el in row if el != ''] for row in arr]
result = ["\n".join(row) for row in empty_removed]
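Put together as a minimal runnable sketch (reusing the arr from the question), the list-based approach produces exactly the desired result:

```python
import numpy as np

arr = np.array([
    ['', '', '$c'],
    ['', '$b', ''],
    ['', '$b', '$c'],
    ['$a', '', ''],
    ['$a', '', '$c'],
    ['$a', '$b', ''],
    ['$a', '$b', '$c'],
], dtype='U2')

# drop the empty strings in each row, then join the survivors with '\n'
rows = arr.tolist()
result = ['\n'.join(el for el in row if el) for row in rows]
# result == ['$c', '$b', '$b\n$c', '$a', '$a\n$c', '$a\n$b', '$a\n$b\n$c']
```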
Even for your small example, you can already see a significant speed difference compared to the solution in your comment:
# array solution
timeit.timeit("['\\n'.join(sub[sub != '']) for sub in arr]", "from __main__ import arr")
# time: 13.177253899999982
# list solution (with initial cast to list)
timeit.timeit("['\\n'.join(row) for row in [[el for el in row if el != ''] for row in arr.tolist()]]", "from __main__ import arr")
# time: 1.9387359000000117
# list solution (if you can avoid the array in the beginning)
timeit.timeit("['\\n'.join(row) for row in [[el for el in row if el != ''] for row in arr_list]]", "from __main__ import arr_list")
# time: 1.4084819999999922
If you want to convert the result back into a NumPy array afterwards to use np.tile and np.repeat, this can certainly be done. However, I would test whether that doesn't cause a similar slowdown in your pipeline.
Old answer, kept for reference
I suggest you do not use NumPy arrays and instead switch to a plain and simple list comprehension:
arr = arr.tolist()  # if you can avoid array creation, even better
result = ['\n'.join(el for el in sub if el) for sub in arr]
# or if you need a list wrapping the individual elements
result2 = [['\n'.join(el for el in sub if el)] for sub in arr]
The reason for this is a little more complicated. The gist of it is that NumPy cannot accelerate array operations on dtype=object in the same way as it can on dtype=np.number. You get the same convenience of fancy indexing (advanced indexing is the name now, I think) and tuple-based indexing, but actual performance will not compare. You can get some intuition here: https://stackoverflow.com/a/20942032/6753182