I got a huge numpy array where elements are strings. I like to replace the strings with the first alphabet of the string. For example if
C[0] = 'A90CD'
I want to replace it with
C[0] = 'A'
IN nutshell, I was thinking of applying regex in a loop where I have a dictionary of regex expression like
'^A.+$' => 'A'
'^B.+$' => 'B' etc
How can I apply this regex over the numpy arrays ? Or is there any better method to achieve the same ?
There's no need for regex here. Just convert your array to a 1 byte string, using astype
-
v = np.array(['abc', 'def', 'ghi'])
>>> v.astype('<U1')
array(['a', 'd', 'g'],
dtype='<U1')
Alternatively, you change its view
and stride. Here's a slightly optimised version for equal sized strings. -
>>> v.view('<U1')[::len(v[0])]
array(['a', 'd', 'g'],
dtype='<U1')
And here's the more generalised version of .view
method, but this works for arrays of strings with differing length. Thanks to Paul Panzer for the suggestion -
>>> v.view('<U1').reshape(v.shape + (-1,))[:, 0]
array(['a', 'd', 'g'],
dtype='<U1')
Performance
y = np.array([x * 20 for x in v]).repeat(100000)
y.shape
(300000,)
len(y[0]) # they're all the same length - `abcabcabc...`
60
Now, the timings -
# `astype` conversion
%timeit y.astype('<U1')
100 loops, best of 3: 5.03 ms per loop
# `view` for equal sized string arrays
%timeit y.view('<U1')[::len(y[0])]
100000 loops, best of 3: 2.43 µs per loop
# Paul Panzer's version for differing length strings
%timeit y.view('<U1').reshape(y.shape + (-1,))[:, 0]
100000 loops, best of 3: 3.1 µs per loop
view
method is faster by a huge margin . However, use with caution, as the memory is shared.
If you're interested in a more general solution that finds you the first letter (regardless of where it may be), I'd say the fastest/easiest way would be using the re
module, compiling a pattern and searching inside a list comprehension.
>>> p = re.compile('[a-zA-Z]')
>>> [p.search(x).group() for x in v]
['a', 'd', 'g']
And, its performance on the same setup above -
%timeit [p.search(x).group() for x in y]
1 loop, best of 3: 320 ms per loop
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.