Python/Numpy: Vectorizing the combining of row elements with conditions

Is there a way to vectorize the combining of row elements with certain conditions?

Conditions:

  1. Empty elements get dropped
  2. In rows with more than one non-empty element, the elements are joined with a '\n' delimiter

Note that a) my array has a variable number of rows and columns and will grow quite large, hence my interest in vectorization here, and b) each non-empty string element starts with a '$' character.

import numpy as np

arr = np.array([
        ['',  '',  '$c'],
        ['',  '$b', '' ],
        ['',  '$b', '$c'],
        ['$a', '',  '' ],
        ['$a', '',  '$c'],
        ['$a', '$b', '' ],
        ['$a', '$b', '$c']
    ], dtype='U2')  # 'U1' would truncate '$c' to '$'

Desired result:

res = [       
        ['$c'],              # <-- reduce to single char element
        ['$b'],              # <-- reduce to single char element
        ['$b\n$c'],           # <-- combine char elements with '\n' delimiter
        ['$a'],              # <-- reduce to single char element
        ['$a\n$c'],           # <-- combine char elements with '\n' delimiter
        ['$a\n$b'],           # <-- combine char elements with '\n' delimiter
        ['$a\n$b\n$c']         # <-- combine char elements with '\n' delimiter
    ]

Any insight into a vectorized approach to achieve the desired end result would be much appreciated. Thank you in advance.

Update:

Due to the differences in requirements, the suggested answer from "Reduce multi-dimensional array of strings along axis in Numpy" is not the best fit for my use case. See the accepted answer below.

Even under your updated circumstances, I would not recommend a NumPy-based solution for this; I would instead use plain list comprehensions:

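arr_list = arr.tolist()  # plain nested Python lists; `arr` stays a NumPy array
empty_removed = [[el for el in row if el != ''] for row in arr_list]  # drop empty strings
result = ["\n".join(row) for row in empty_removed]  # join what is left in each row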
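The two comprehensions above can also be folded into a single helper. The following is only a minimal sketch: the name `combine_rows` and its `sep` parameter are hypothetical and not part of the original answer, and the sample rows are a shortened version of the data in the question.

```python
def combine_rows(rows, sep="\n"):
    """Join the non-empty strings of each row with `sep` (hypothetical helper)."""
    # Accept a NumPy array (or anything else offering .tolist()) as well as
    # a plain nested list; the actual work is done on Python lists.
    if hasattr(rows, "tolist"):
        rows = rows.tolist()
    # `if el` is false for the empty string, so empty elements are dropped.
    return [sep.join(el for el in row if el) for row in rows]


rows = [
    ['',   '',   '$c'],
    ['',   '$b', ''  ],
    ['$a', '$b', '$c'],
]

combined = combine_rows(rows)
# Wrap each string in its own list to match the nested shape asked for in the question.
wrapped = [[s] for s in combined]
```

Here `combined` is a flat list of strings, like `result` above, while `wrapped` has one single-element list per row.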

Even for your small example, you can already see a significant speed difference compared to your solution in the comment:

import timeit

# Times are in seconds for timeit's default of 1,000,000 runs.
# arr is the NumPy array from the question; arr_list is arr.tolist(), created once up front.

# array solution
timeit.timeit("['\\n'.join(sub[sub != '']) for sub in arr]", "from __main__ import arr")
# time: 13.177253899999982

# list solution (with initial cast to list)
timeit.timeit("['\\n'.join(row) for row in [[el for el in row if el != ''] for row in arr.tolist()]]", "from __main__ import arr")
# time: 1.9387359000000117

# list solution (if you can avoid the array in the beginning)
timeit.timeit("['\\n'.join(row) for row in [[el for el in row if el != ''] for row in arr_list]]", "from __main__ import arr_list")
# time: 1.4084819999999922

If you want to convert the result into a NumPy array afterwards in order to use np.tile and np.repeat, this can certainly be done. However, I would test whether that causes a similar slowdown in your pipeline.
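As a rough sketch of that conversion, assuming the joined strings are held in a flat list like `result` above (the sample values and the names `res_arr`, `repeated` and `tiled` are only illustrative):

```python
import numpy as np

# Joined strings as produced by the list comprehension (sample values).
result = ['$c', '$b', '$b\n$c', '$a\n$b\n$c']

# NumPy sizes the unicode dtype to the longest string, here 8 characters.
res_arr = np.array(result)

# np.repeat repeats each element consecutively: first, first, second, second, ...
repeated = np.repeat(res_arr, 2)

# np.tile repeats the whole sequence: first, second, ..., first, second, ...
tiled = np.tile(res_arr, 2)
```

Whether this step is cheap enough depends on the size of the data, so it is worth timing it on a realistic input.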


Old answer, kept for reference

I suggest you do not use NumPy arrays and instead switch to plain and simple list comprehensions:

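arr = arr.tolist() # if you can avoid array creation, even better
# Note: this relies on every non-empty element being exactly one character,
# because '\n'.join(...) iterates over the characters of the concatenated row.
result = ['\n'.join(sub) for sub in [''.join(sub) for sub in arr]]
# or if you need the list wrapping the individual elements
result2 = [['\n'.join(sub)] for sub in [''.join(sub) for sub in arr]]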

The reason for this is a little more complicated. The gist is that NumPy cannot accelerate array operations on dtype=object in the same way it can on numeric dtypes (np.number). You get the same convenience of fancy indexing (I think the current name is advanced indexing) and tuple-based indexing, but the actual performance does not compare. You can get some intuition here: https://stackoverflow.com/a/20942032/6753182
