
Extract the first letter from each string in a numpy array

I have a huge NumPy array whose elements are strings. I would like to replace each string with its first letter. For example, if

C[0] = 'A90CD'

I want to replace it with

C[0] = 'A'

In a nutshell, I was thinking of applying regexes in a loop, with a dictionary of regex patterns like

'^A.+$' => 'A'

'^B.+$' => 'B' etc

How can I apply these regexes over a NumPy array? Or is there a better way to achieve the same result?

There's no need for regex here. Just convert your array to a one-character string dtype ('<U1'), using astype -

import numpy as np
v = np.array(['abc', 'def', 'ghi'])

>>> v.astype('<U1')
array(['a', 'd', 'g'],
      dtype='<U1')
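As a minimal sketch of what astype does here (the sample strings are made up for illustration, based on the 'A90CD' example in the question): the cast truncates every element to its first character, whatever the original lengths, and returns a new array rather than a view.

```python
import numpy as np

v = np.array(['A90CD', 'B1', 'C'])   # illustrative strings of different lengths
first = v.astype('<U1')              # truncate every element to one character

print(first)                         # ['A' 'B' 'C']
print(np.shares_memory(v, first))    # False: astype returned a new array
```

Because the result is a copy, modifying it leaves the original array untouched.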

Alternatively, you can change its view and stride. Here's a slightly optimised version for equal-sized strings -

>>> v.view('<U1')[::len(v[0])]
array(['a', 'd', 'g'],
      dtype='<U1')

And here's a more general version of the .view method, which also works for arrays of strings with differing lengths. Thanks to Paul Panzer for the suggestion -

>>> v.view('<U1').reshape(v.shape + (-1,))[:, 0]
array(['a', 'd', 'g'],
      dtype='<U1')
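To see why the reshape form copes with differing lengths, here is a small sketch with made-up strings. NumPy stores every element at the width of the longest string, so the character view splits evenly into one row per element. The name `width` is only an illustrative variable, not part of the original answer.

```python
import numpy as np

v = np.array(['ab', 'cdef', 'g'])   # dtype is '<U4': the width of the longest string
width = v.dtype.itemsize // np.dtype('<U1').itemsize   # characters stored per element

chars = v.view('<U1').reshape(v.shape + (-1,))  # one row of `width` characters per string
first = chars[:, 0]

print(width)   # 4
print(first)   # ['a' 'c' 'g']

# len(v[0]) is 2 here, not 4, so the fixed-stride form would step through
# the buffer at the wrong offsets and return 6 entries instead of 3.
wrong = v.view('<U1')[::len(v[0])]
```

The fixed-stride version is only safe when `len(v[0])` equals the stored width, which holds when all strings have the same length.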

Performance

y = np.array([x * 20 for x in v]).repeat(100000)

y.shape
(300000,)

len(y[0])   # they're all the same length - `abcabcabc...`
60

Now, the timings -

# `astype` conversion

%timeit y.astype('<U1')
100 loops, best of 3: 5.03 ms per loop

# `view` for equal sized string arrays 

%timeit y.view('<U1')[::len(y[0])]
100000 loops, best of 3: 2.43 µs per loop

# Paul Panzer's version for differing length strings

%timeit y.view('<U1').reshape(y.shape + (-1,))[:, 0]
100000 loops, best of 3: 3.1 µs per loop
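The %timeit lines above are IPython magics. Outside IPython, a rough equivalent can be sketched with the standard timeit module. The `candidates` dictionary and the repeat counts below are arbitrary choices for illustration, and absolute numbers will differ by machine and NumPy version.

```python
import timeit
import numpy as np

v = np.array(['abc', 'def', 'ghi'])
y = np.array([x * 20 for x in v]).repeat(100000)  # 300000 strings of length 60

# The three approaches being compared, wrapped as zero-argument callables.
candidates = {
    'astype': lambda: y.astype('<U1'),
    'view, fixed stride': lambda: y.view('<U1')[::len(y[0])],
    'view, reshape': lambda: y.view('<U1').reshape(y.shape + (-1,))[:, 0],
}

for name, fn in candidates.items():
    # Best of 3 runs of 10 calls each, reported per call.
    seconds = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f'{name}: {seconds * 1e6:.2f} µs per call')
```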

The view method is faster by a huge margin: about three orders of magnitude in this benchmark (microseconds versus milliseconds).

However, use it with caution: the result is a view that shares memory with the original array, so writing to one also changes the other.
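A short sketch of what shared memory means in practice, using the same three-element array as above. Calling .copy() on the view is one way to get an independent result, at the cost of the copy itself.

```python
import numpy as np

v = np.array(['abc', 'def', 'ghi'])
first = v.view('<U1')[::3]          # a view: no data is copied

print(np.shares_memory(v, first))   # True

first[0] = 'X'                      # writing through the view...
print(v[0])                         # Xbc  (...changes the original array)

safe = v.view('<U1')[::3].copy()    # an independent copy of the first characters
safe[1] = 'Y'
print(v[1])                         # def  (the original is unaffected)
```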


If you're interested in a more general solution that finds you the first letter (regardless of where it may be), I'd say the fastest/easiest way would be using the re module, compiling a pattern and searching inside a list comprehension.

>>> import re
>>> p = re.compile('[a-zA-Z]')
>>> [p.search(x).group() for x in v]
['a', 'd', 'g']
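Note that p.search returns None when a string contains no letter at all, so the comprehension above would raise an AttributeError on such input. A minimal sketch of one way to guard against that follows. The helper `first_letter` and its `default` parameter are hypothetical names introduced here, and the sample strings are made up.

```python
import re
import numpy as np

p = re.compile('[a-zA-Z]')

def first_letter(s, default=''):
    """Return the first ASCII letter in s, or `default` if there is none."""
    m = p.search(s)
    return m.group() if m else default

v = np.array(['90A1', '__b', '123'])   # the last string contains no letter
result = np.array([first_letter(x) for x in v])

print(result)   # ['A' 'b' '']
```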

And its performance on the same setup as above -

%timeit [p.search(x).group() for x in y]
1 loop, best of 3: 320 ms per loop
