Numpy String Partitioning: Perform Multiple Splits
I have an array of strings, each containing one or more words. I want to split / partition the array on a separator (a blank in my case), with as many splits as there are separators in the element containing the most separators.
numpy.char.partition, however, only performs a single split, regardless of how often the separator appears:
I've got:
>>> a = np.array(['word', 'two words', 'and three words'])
>>> np.char.partition(a, ' ')
array([['word', '', ''],
       ['two', ' ', 'words'],
       ['and', ' ', 'three words']], dtype='<U8')
I'd like to have:
array([['word', '', '', '', ''],
       ['two', ' ', 'words', '', ''],
       ['and', ' ', 'three', ' ', 'words']], dtype='<U8')
Approach #1
Those partition functions don't seem to partition on all the occurrences. To solve our case, we can use np.char.split to get the split strings, and then use masking and array-assignment, like so -
def partitions(a, sep):
    # Split based on sep
    s = np.char.split(a, sep)
    # Get concatenated split strings
    cs = np.concatenate(s)
    # Get params
    N = len(a)
    l = np.array(list(map(len, s)))
    el = 2*l - 1
    ncols = el.max()
    out = np.zeros((N, ncols), dtype=cs.dtype)
    # Setup valid mask that starts at first col until the end for each row
    mask = el[:, None] > np.arange(el.max())
    # Assign separator into valid ones
    out[mask] = sep
    # Setup valid mask that has True at positions where words are to be assigned
    mask[:, 1::2] = 0
    # Assign words
    out[mask] = cs
    return out
Sample runs -
In [32]: a = np.array(['word', 'two words', 'and three words'])

In [33]: partitions(a, sep=' ')
Out[33]:
array([['word', '', '', '', ''],
       ['two', ' ', 'words', '', ''],
       ['and', ' ', 'three', ' ', 'words']], dtype='<U5')

In [44]: partitions(a, sep='ord')
Out[44]:
array([['w', 'ord', ''],
       ['two w', 'ord', 's'],
       ['and three w', 'ord', 's']], dtype='<U11')
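The core of the vectorized approach is the broadcasted comparison that builds the per-row validity mask. A minimal sketch (names `l`, `el`, `word_mask` are illustrative, following the function above) of how the mask selects slots on the same sample array:

```python
import numpy as np

a = np.array(['word', 'two words', 'and three words'])
s = np.char.split(a, ' ')            # per-row word lists
l = np.array([len(x) for x in s])    # words per row: [1, 2, 3]
el = 2 * l - 1                       # slots per row (words + separators): [1, 3, 5]

# Broadcasted comparison: row i is True in its first el[i] columns
mask = el[:, None] > np.arange(el.max())
print(mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 1]]

# Zeroing the odd columns leaves True only where words belong
word_mask = mask.copy()
word_mask[:, 1::2] = False
print(word_mask.astype(int))
# [[1 0 0 0 0]
#  [1 0 1 0 0]
#  [1 0 1 0 1]]
```

Assigning `sep` through the first mask and the words through the second reproduces the interleaved layout.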
Approach #2
Here's another with a loop, to save on memory -
def partitions_loopy(a, sep):
    # Get params
    N = len(a)
    l = np.char.count(a, sep) + 1
    ncols = 2*l.max() - 1
    out = np.zeros((N, ncols), dtype=a.dtype)
    for i, (a_i, L) in enumerate(zip(a, l)):
        ss = a_i.split(sep)
        out[i, 1:2*L-1:2] = sep
        out[i, :2*L:2] = ss
    return out
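The interleaving inside the loop relies on strided slice assignment: separators go to the odd indices, words to the even ones. A small sketch of how a single row is filled (the names are illustrative):

```python
import numpy as np

sep = ' '
row = 'and three words'
L = row.count(sep) + 1                 # number of words: 3
out = np.zeros(2 * L - 1, dtype='<U15')
out[1:2*L-1:2] = sep                   # separators at odd positions 1, 3
out[0:2*L:2] = row.split(sep)          # words at even positions 0, 2, 4
print(out)  # ['and' ' ' 'three' ' ' 'words']
```

Rows with fewer words simply leave their trailing slots as '' (the zero value for string dtypes).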
I came up with my own recursive solution that uses np.char.partition. However, when timing it, it turns out to be less performant. The time is similar to @Divakar's solution for a single split, but then multiplies with the number of splits necessary.
def partitions(a, sep):
    if np.any(np.char.count(a, sep) >= 1):
        a2 = np.char.partition(a, sep)
        return np.concatenate([a2[:, 0:2], partitions(a2[:, 2], sep)], axis=1)
    return a.reshape(-1, 1)
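Each recursion level peels off one separator per row via np.char.partition and recurses on the remainder column, which is why the cost grows with the maximum number of splits. A sketch of what a single level sees:

```python
import numpy as np

a = np.array(['word', 'two words', 'and three words'])
a2 = np.char.partition(a, ' ')  # one split per row: [head, sep, tail]
print(a2[:, 0:2])               # columns kept at this level
print(a2[:, 2])                 # remainders passed to the next recursive call
```

The recursion stops when no remainder contains the separator anymore, and the kept column pairs are concatenated back together.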
The function-based approaches are great but seem too complex. You can solve this using just data structure transforms and re.split, in a single line of code.
import re
import numpy as np
import pandas as pd

a = np.array(['word', 'two words', 'and three words'])
# Use re.split to get partitions, then transform to dataframe, fillna, transform back!
np.array(pd.DataFrame([re.split('( )', i) for i in a]).fillna(''))
# You can change the '( )' to '(\W)' if you want it to separate on all non-word characters!
array([['word', '', '', '', ''],
       ['two', ' ', 'words', '', ''],
       ['and', ' ', 'three', ' ', 'words']], dtype=object)
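This works because re.split with a capturing group keeps the separators in the result list, and pandas' fillna pads the ragged rows up to the widest one:

```python
import re

print(re.split('( )', 'and three words'))  # ['and', ' ', 'three', ' ', 'words']
print(re.split('( )', 'word'))             # ['word'] (shorter rows get padded with '' by fillna)
```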