
Numpy dtype invalid index

I'm trying to load a csv data file:
ACCEPT,organizer@t.net,t,p1@t.net,0,UK,3600000,3,1475917200000,1475920800000,MON,9,0,0,0

in the following way:

from numpy import genfromtxt

dataset = genfromtxt('./training_set.csv', delimiter=',', dtype='a20, a20, a20, a8, i8, a20, i8, i8, i8, i8, a3, i8, i8, i8, i8')
print(dataset)
target = [x[0] for x in dataset]
train = [x[1:] for x in dataset]

In the last line above I get an error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-66-5d58edf06039> in <module>()
      4 print(dataset)
      5 target = [x[0] for x in dataset]
----> 6 train = [x[1:] for x in dataset]
      7 
      8 #rf = RandomForestClassifier(n_estimators=100)

<ipython-input-66-5d58edf06039> in <listcomp>(.0)
      4 print(dataset)
      5 target = [x[0] for x in dataset]
----> 6 train = [x[1:] for x in dataset]
      7 
      8 #rf = RandomForestClassifier(n_estimators=100)

IndexError: invalid index

How to handle this?

In [42]: dataset = np.genfromtxt('./np_inf.txt', delimiter=',', dtype='a20, a20, a20, a8, i8, a20, i8, i8, i8, i8, a3, i8, i8, i8, i8')

In [43]: [x[0] for x in dataset]
Out[43]: ['ACCEPT', 'ACCEPT', 'ACCEPT']

The issue is that the entries of the dataset are of the not-very-useful type np.void. It does not support slicing, but you can iterate over it:

In [56]: type(dataset[0])
Out[56]: numpy.void

In [57]: len(dataset[0])
Out[57]: 15

In [58]: z = [[y for j, y in enumerate(x) if j > 0] for x in dataset]

In [59]: z[0]
Out[59]: 
['organizer@t.net',
 't',
 'p1@t.net',
 0,
 'UK',
 3600000,
 3,
 1475917200000,
 1475920800000,
 'MON',
 9,
 0,
 0,
 0]

However, you're probably better off converting the array to a structured dtype instead of using lists.

Better still, consider using pandas and its pd.read_csv.
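A minimal sketch of that approach, assuming the file has no header row (an in-memory StringIO stands in for './training_set.csv' here):

```python
import io

import pandas as pd

# hypothetical in-memory stand-in for './training_set.csv' (sample row repeated 3x)
row = ('ACCEPT,organizer@t.net,t,p1@t.net,0,UK,3600000,3,'
       '1475917200000,1475920800000,MON,9,0,0,0\n')
df = pd.read_csv(io.StringIO(row * 3), header=None)  # no header row in the file

target = df.iloc[:, 0]   # first column as a Series
train = df.iloc[:, 1:]   # remaining 14 columns; 2d slicing just works here
```

Unlike a structured array, a DataFrame supports positional column slicing directly, so no field-name gymnastics are needed.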

With that dtype you have created a structured array - it is 1d with a compound dtype.

I have a sample structured array from another problem:

In [26]: data
Out[26]: 
array([(b'1Q11', 252.0, 0.0166), (b'2Q11', 212.4, 0.0122),
       (b'3Q11', 425.9, 0.0286), (b'4Q11', 522.3, 0.0322),
       (b'1Q12', 263.2, 0.0185), (b'2Q12', 238.6, 0.0131),
       ...
       (b'1Q14', 264.5, 0.0179), (b'2Q14', 211.2, 0.0116)], 
      dtype=[('Qtrs', 'S4'), ('Y', '<f8'), ('X', '<f8')])

One record is:

In [27]: data[0]
Out[27]: (b'1Q11', 252.0, 0.0166)

While I can access elements within that record by number, it does not accept a slice:

In [36]: data[0][1]
Out[36]: 252.0
In [37]: data[0][1:]
....
IndexError: invalid index

The preferred way of accessing elements within a structured record is by field name:

In [38]: data[0]['X']
Out[38]: 0.0166

Such a name allows me to access that field across all records:

In [39]: data['X']
Out[39]: 
array([ 0.0166,  0.0122,  0.0286, ...  0.0116])

Fetching multiple fields requires a list of field names (and is more wordy than 2d slicing):

In [42]: data.dtype.names[1:]
Out[42]: ('Y', 'X')

In [44]: data[list(data.dtype.names[1:])]
Out[44]: 
array([(252.0, 0.0166), (212.4, 0.0122),... (211.2, 0.0116)], 
      dtype=[('Y', '<f8'), ('X', '<f8')])

===============

With your sample line (replicated 3 times) I can load:

In [53]: dataset=np.genfromtxt(txt,dtype=None,delimiter=',')
In [54]: dataset
Out[54]: 
array([ (b'ACCEPT', b'organizer@t.net', b't', b'p1@t.net', 0, b'UK', 3600000, 3, 1475917200000, 1475920800000, b'MON', 9, 0, 0, 0),
       (b'ACCEPT', b'organizer@t.net', b't', b'p1@t.net', 0, b'UK', 3600000, 3, 1475917200000, 1475920800000, b'MON', 9, 0, 0, 0),
       (b'ACCEPT', b'organizer@t.net', b't', b'p1@t.net', 0, b'UK', 3600000, 3, 1475917200000, 1475920800000, b'MON', 9, 0, 0, 0)], 
      dtype=[('f0', 'S6'), ('f1', 'S15'), ('f2', 'S1'), ('f3', 'S8'), ('f4', '<i4'), ('f5', 'S2'), ('f6', '<i4'), ('f7', '<i4'), ('f8', '<i8'), ('f9', '<i8'), ('f10', 'S3'), ('f11', '<i4'), ('f12', '<i4'), ('f13', '<i4'), ('f14', '<i4')])
In [55]: 

dtype=None produces something similar to your explicit dtype.

To get your desired output (as arrays, not lists):

target = dataset['f0']
names=dataset.dtype.names[1:]
train = dataset[list(names)]
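Put together, that extraction looks like this (a sketch; an in-memory StringIO replaces the file, with the sample row repeated):

```python
import io

import numpy as np

# hypothetical in-memory stand-in for './training_set.csv' (sample row repeated 3x)
row = ('ACCEPT,organizer@t.net,t,p1@t.net,0,UK,3600000,3,'
       '1475917200000,1475920800000,MON,9,0,0,0\n')
dataset = np.genfromtxt(io.StringIO(row * 3), dtype=None,
                        delimiter=',', encoding='utf-8')

target = dataset['f0']                          # first column, by field name
train = dataset[list(dataset.dtype.names[1:])]  # the other 14 fields
```

With encoding='utf-8' the string fields come out as str rather than the b'...' bytes shown in the transcript; otherwise the result is the same.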

=====================

You could also refine the dtype to make the task simpler. Define 2 fields, with the 2nd containing most of the csv columns. genfromtxt handles this sort of dtype nesting, as long as the total field count is correct.

In [106]: dt=[('target','a20'), 
       ('train','a20, a20, a8, i8, a20, i8, i8, i8, i8, a3, i8, i8, i8, i8')]
In [107]: dataset=np.genfromtxt(txt,dtype=dt,delimiter=',')
In [108]: dataset
Out[108]: 
array([ (b'ACCEPT', (b'organizer@t.net', b't', b'p1@t.net', 0, b'UK', 3600000, 3, 1475917200000, 1475920800000, b'MON', 9, 0, 0, 0)),
...], 
      dtype=[('target', 'S20'), ('train', [('f0', 'S20'), ('f1', 'S20'), ('f2', 'S8'), ('f3', '<i8'), ('f4', 'S20'), ('f5', '<i8'), ('f6', '<i8'), ('f7', '<i8'), ('f8', '<i8'), ('f9', 'S3'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', '<i8')])])

Now just select the two top-level fields:

In [109]: dataset['target']
Out[109]: 
array([b'ACCEPT', b'ACCEPT', b'ACCEPT'], 
      dtype='|S20')

In [110]: dataset['train']
Out[110]: 
array([ (b'organizer@t.net', b't', b'p1@t.net', 0, b'UK', 3600000, 3, 1475917200000, 1475920800000, b'MON', 9, 0, 0, 0),
...], 
      dtype=[('f0', 'S20'), ('f1', 'S20'), ...])

I could nest further, grouping the i8 columns into groups of 4:

dt=[('target','a20'), ('train','a20, a20, a8, i8, a20, (4,)i8, a3, (4,)i8')]
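A quick check of that grouped dtype (a sketch that only inspects the dtype, without loading a file; 'S20' is the modern spelling of 'a20'):

```python
import numpy as np

# the refined dtype from above: the four trailing i8 columns become (4,) subarrays
dt = np.dtype([('target', 'S20'),
               ('train', 'S20, S20, S8, i8, S20, (4,)i8, S3, (4,)i8')])

print(dt['train'].names)        # ('f0', 'f1', ..., 'f7')
print(dt['train']['f5'].shape)  # (4,) -- a subarray field
```

The 14 train columns collapse into 8 fields, two of which are 4-element integer subarrays, so the total column count still matches the csv.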
