Numpy String Partitioning: Perform Multiple Splits
I have an array of strings, each containing one or more words. I want to split / partition the array on a separator (a blank in my case), with as many splits as there are separators in the element containing the most separators.
numpy.char.partition, however, only performs a single split, regardless of how often the separator appears:
I've got:
>>> a = np.array(['word', 'two words', 'and three words'])
>>> np.char.partition(a, ' ')
array([['word', '', ''],
       ['two', ' ', 'words'],
       ['and', ' ', 'three words']], dtype='<U8')
I'd like to have:
array([['word', '', '', '', ''],
       ['two', ' ', 'words', '', ''],
       ['and', ' ', 'three', ' ', 'words']], dtype='<U8')
Approach #1
Those partition functions don't seem to partition on all the occurrences. To solve our case, we can use np.char.split to get the split strings, and then use masking and array-assignment, like so -
def partitions(a, sep):
    # Split based on sep
    s = np.char.split(a, sep)
    # Get concatenated split strings
    cs = np.concatenate(s)
    # Get params
    N = len(a)
    l = np.array(list(map(len, s)))
    el = 2*l - 1
    ncols = el.max()
    out = np.zeros((N, ncols), dtype=cs.dtype)
    # Setup valid mask that starts at first col until the end for each row
    mask = el[:, None] > np.arange(el.max())
    # Assign separator into valid ones
    out[mask] = sep
    # Setup valid mask that has True at positions where words are to be assigned
    mask[:, 1::2] = 0
    # Assign words
    out[mask] = cs
    return out
Sample runs -
In [32]: a = np.array(['word', 'two words', 'and three words'])

In [33]: partitions(a, sep=' ')
Out[33]:
array([['word', '', '', '', ''],
       ['two', ' ', 'words', '', ''],
       ['and', ' ', 'three', ' ', 'words']], dtype='<U5')

In [44]: partitions(a, sep='ord')
Out[44]:
array([['w', 'ord', ''],
       ['two w', 'ord', 's'],
       ['and three w', 'ord', 's']], dtype='<U11')
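The core of the vectorized approach is the broadcasted comparison that builds the per-row validity mask. A minimal sketch (names `l`, `el`, `word_mask` are illustrative, following the function above) of how the mask selects slots on the same sample array:

```python
import numpy as np

a = np.array(['word', 'two words', 'and three words'])
s = np.char.split(a, ' ')            # per-row word lists
l = np.array([len(x) for x in s])    # words per row: [1, 2, 3]
el = 2 * l - 1                       # slots per row (words + separators): [1, 3, 5]

# Broadcasted comparison: row i is True in its first el[i] columns
mask = el[:, None] > np.arange(el.max())
print(mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 1]]

# Zeroing the odd columns leaves True only where words belong
word_mask = mask.copy()
word_mask[:, 1::2] = False
print(word_mask.astype(int))
# [[1 0 0 0 0]
#  [1 0 1 0 0]
#  [1 0 1 0 1]]
```

Assigning `sep` through the first mask and the words through the second reproduces the interleaved layout.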
Approach #2
Here's another with a loop, to save on memory -
def partitions_loopy(a, sep):
    # Get params
    N = len(a)
    l = np.char.count(a, sep) + 1
    ncols = 2*l.max() - 1
    out = np.zeros((N, ncols), dtype=a.dtype)
    for i, (a_i, L) in enumerate(zip(a, l)):
        ss = a_i.split(sep)
        out[i, 1:2*L-1:2] = sep
        out[i, :2*L:2] = ss
    return out
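The interleaving inside the loop relies on strided slice assignment: separators go to the odd indices, words to the even ones. A small sketch of how a single row is filled (the names are illustrative):

```python
import numpy as np

sep = ' '
row = 'and three words'
L = row.count(sep) + 1                 # number of words: 3
out = np.zeros(2 * L - 1, dtype='<U15')
out[1:2*L-1:2] = sep                   # separators at odd positions 1, 3
out[0:2*L:2] = row.split(sep)          # words at even positions 0, 2, 4
print(out)  # ['and' ' ' 'three' ' ' 'words']
```

Rows with fewer words simply leave their trailing slots as '' (the zero value for string dtypes).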
I came up with my own recursive solution that uses np.char.partition. However, when timing it, it turns out to be less performant. The time is similar to @Divakar's solution for a single split, but then multiplies with the number of splits necessary.
def partitions(a, sep):
    if np.any(np.char.count(a, sep) >= 1):
        a2 = np.char.partition(a, sep)
        return np.concatenate([a2[:, 0:2], partitions(a2[:, 2], sep)], axis=1)
    return a.reshape(-1, 1)
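Each recursion level peels off one separator per row via np.char.partition and recurses on the remainder column, which is why the cost grows with the maximum number of splits. A sketch of what a single level sees:

```python
import numpy as np

a = np.array(['word', 'two words', 'and three words'])
a2 = np.char.partition(a, ' ')  # one split per row: [head, sep, tail]
print(a2[:, 0:2])               # columns kept at this level
print(a2[:, 2])                 # remainders passed to the next recursive call
```

The recursion stops when no remainder contains the separator anymore, and the kept column pairs are concatenated back together.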
The function-based approaches are great but seem too complex. You can solve this using just data structure transforms and re.split, in a single line of code.
import re
import numpy as np
import pandas as pd

a = np.array(['word', 'two words', 'and three words'])
# Use re.split to get partitions, then transform to dataframe, fillna, transform back!
np.array(pd.DataFrame([re.split('( )', i) for i in a]).fillna(''))
# You can change the '( )' to '(\W)' if you want it to separate on all non-word characters!
array([['word', '', '', '', ''],
       ['two', ' ', 'words', '', ''],
       ['and', ' ', 'three', ' ', 'words']], dtype=object)
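This works because re.split with a capturing group keeps the separators in the result list, and pandas' fillna pads the ragged rows up to the widest one:

```python
import re

print(re.split('( )', 'and three words'))  # ['and', ' ', 'three', ' ', 'words']
print(re.split('( )', 'word'))             # ['word'] (shorter rows get padded with '' by fillna)
```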