slicing a numpy array with characters

Question

I have a text file made as:

0.01 1 0.1 1 10 100 a
0.02 3 0.2 2 20 200 b
0.03 2 0.3 3 30 300 c
0.04 1 0.4 4 40 400 d

I read it as a list A and then converted to a numpy array, that is:

>>> A
array([['0.01', '1', '0.1', '1', '10', '100', 'a'],
       ['0.02', '3', '0.2', '2', '20', '200', 'b'],
       ['0.03', '2', '0.3', '3', '30', '300', 'c'],
       ['0.04', '1', '0.4', '4', '40', '400', 'd']], 
      dtype='|S4')

I just want to extract a sub-array B, made of the rows of A whose entry at index 4 (the fifth column) is lower than 30, which should look something like:

B = array([['0.01', '1', '0.1', '1', '10', '100', 'a'],
           ['0.02', '3', '0.2', '2', '20', '200', 'b']])

When dealing with arrays, I usually just do B = A[A[:,4]<30], but in this case (maybe due to the presence of characters/strings, which I've never worked with) it doesn't work, giving me this:

>>> A[A[:,4]<30]
array(['0.01', '1', '0.1', '1', '10', '100', 'a'], 
      dtype='|S4')

and I can't figure out the reason. This is not my own code, and I don't think I can switch all of it to structures or dictionaries: any suggestions for doing this with numpy arrays? Thank you very much in advance!

Answer 1

You have to compare int to int

A[A[:,4].astype(int)<30]

or str to str

A[A[:,4]<'30']

Note, however, that the latter happens to work in your specific example but won't work in general, because it compares strings lexicographically (for example, '110' < '30' returns True, but 110 < 30 returns False).
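As a minimal sketch of the two options above (run on Python 3, where the strings come out as '<U4' rather than '|S4'), the following rebuilds A by hand and compares both masks. The extra array col and its values are made up purely to show where the string comparison goes wrong.

```python
import numpy as np

# The array from the question, rebuilt by hand.
A = np.array([
    ['0.01', '1', '0.1', '1', '10', '100', 'a'],
    ['0.02', '3', '0.2', '2', '20', '200', 'b'],
    ['0.03', '2', '0.3', '3', '30', '300', 'c'],
    ['0.04', '1', '0.4', '4', '40', '400', 'd'],
])

# Numeric comparison: convert the column to int, then use the boolean mask.
mask = A[:, 4].astype(int) < 30
B = A[mask]
print(B)

# String comparison is lexicographic, so it can disagree with the numeric one.
col = np.array(['10', '20', '110', '4'])  # hypothetical values, not from the question
string_mask = col < '30'
numeric_mask = col.astype(int) < 30
print(string_mask)
print(numeric_mask)
```

The two masks differ for '110' and '4', which is why converting with astype(int) is the safer choice whenever the column really holds numbers.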

numpy infers the element type from your data. In this case it assigned dtype='|S4' to your elements, meaning they are strings of up to 4 characters. This is probably a consequence of the underlying C code (which is what makes numpy fast), which requires all elements of an array to share one fixed type.

To illustrate this difference, check the following code:

>>> np.array([['0.01', '1', '0.1', '1', '10', '100', 'a']])
array([['0.01', '1', '0.1', '1', '10', '100', 'a']], dtype='|S4')

The inferred type is a string of length 4, which is the maximum length among your elements (the element '0.01'). Now, if you explicitly define the array to hold generic Python objects, it will do what you want

>>> np.array([[0.01, 1, 0.1, 1, 10, 100, 'a']], dtype=object)
array([[0.01, 1, 0.1, 1, 10, 100, 'a']], dtype=object)

and your code A[A[:,4]<30] would work properly.
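A small sketch of the object-dtype route, assuming the rows can be built with real numbers in the numeric columns (the names rows, A_obj and B_obj are just for illustration). Note that the values themselves have to be numbers: changing only the dtype of an existing string array leaves every element a string.

```python
import numpy as np

# Rows hold real Python numbers plus one string; dtype=object keeps them as they are.
rows = [
    [0.01, 1, 0.1, 1, 10, 100, 'a'],
    [0.02, 3, 0.2, 2, 20, 200, 'b'],
    [0.03, 2, 0.3, 3, 30, 300, 'c'],
    [0.04, 1, 0.4, 4, 40, 400, 'd'],
]
A_obj = np.array(rows, dtype=object)

# Column 4 now contains ints, so the usual mask works.
B_obj = A_obj[A_obj[:, 4] < 30]
print(B_obj)
```

Object arrays give up most of numpy's speed, so this is mainly worth it when the mixed columns have to stay in a single 2d array.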

For more information, this is a very complete guide

Answer 2

In [86]: txt='''0.01 1 0.1 1 10 100 a
    ...: 0.02 3 0.2 2 20 200 b
    ...: 0.03 2 0.3 3 30 300 c
    ...: 0.04 1 0.4 4 40 400 d'''
In [87]: A = np.genfromtxt(txt.splitlines(), dtype=str)
In [88]: A
Out[88]: 
array([['0.01', '1', '0.1', '1', '10', '100', 'a'],
       ['0.02', '3', '0.2', '2', '20', '200', 'b'],
       ['0.03', '2', '0.3', '3', '30', '300', 'c'],
       ['0.04', '1', '0.4', '4', '40', '400', 'd']], dtype='<U4')
In [89]: A[:,4]
Out[89]: array(['10', '20', '30', '40'], dtype='<U4')

By default genfromtxt tries to make floats, but in that case the character column would be nan. Instead, I specified the str dtype.

So a numeric test would require converting the column to numbers:

In [90]: A[:,4].astype(int)
Out[90]: array([10, 20, 30, 40])
In [91]: A[:,4].astype(int)<30
Out[91]: array([ True,  True, False, False])

In this case a string comparison also works:

In [99]: A[:,4]<'30'
Out[99]: array([ True,  True, False, False])

Or if we use dtype=None, it infers dtype by column and makes a structured array:

In [93]: A1 = np.genfromtxt(txt.splitlines(), dtype=None,encoding=None)
In [94]: A1
Out[94]: 
array([(0.01, 1, 0.1, 1, 10, 100, 'a'), (0.02, 3, 0.2, 2, 20, 200, 'b'),
       (0.03, 2, 0.3, 3, 30, 300, 'c'), (0.04, 1, 0.4, 4, 40, 400, 'd')],
      dtype=[('f0', '<f8'), ('f1', '<i8'), ('f2', '<f8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8'), ('f6', '<U1')])

Now we can select a field by name, and test it:

In [95]: A1['f4']
Out[95]: array([10, 20, 30, 40])

Either way we can select rows based on the True/False mask or the corresponding row indices:

In [96]: A[[0,1],:]
Out[96]: 
array([['0.01', '1', '0.1', '1', '10', '100', 'a'],
       ['0.02', '3', '0.2', '2', '20', '200', 'b']], dtype='<U4')

In [98]: A1[[0,1]]     # A1 is 1d
Out[98]: 
array([(0.01, 1, 0.1, 1, 10, 100, 'a'), (0.02, 3, 0.2, 2, 20, 200, 'b')],
      dtype=[('f0', '<f8'), ('f1', '<i8'), ('f2', '<f8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8'), ('f6', '<U1')])
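To tie this back to the question, here is a sketch that applies the True/False mask directly instead of typing the row indices by hand. It reuses the same txt and genfromtxt calls as above; the names mask, B, B1 and idx are only for illustration.

```python
import numpy as np

txt = '''0.01 1 0.1 1 10 100 a
0.02 3 0.2 2 20 200 b
0.03 2 0.3 3 30 300 c
0.04 1 0.4 4 40 400 d'''

# 2d string array: convert the column before testing it.
A = np.genfromtxt(txt.splitlines(), dtype=str)
mask = A[:, 4].astype(int) < 30
B = A[mask]

# Structured array: field 'f4' is already numeric.
A1 = np.genfromtxt(txt.splitlines(), dtype=None, encoding=None)
B1 = A1[A1['f4'] < 30]

# Row indices corresponding to the mask, if indices are preferred.
idx = np.nonzero(mask)[0]
print(B)
print(B1)
print(idx)
```

Both B and B1 keep the first two rows; which form is more convenient depends on whether the rest of the code expects a 2d array of strings or named fields.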

solution1
2 2018-04-29 18:59:43

solution2
1 2018-04-29 21:00:06
