2D Numpy Array of Strings WITH column access

Question

I am new to Python, NumPy, and Matplotlib.

I am trying to create a 2D Numpy array from a CSV of various data types, but I will treat them all as strings. The killer is that I need to be able to access them with tuple indices, like: [:,5] to get the 5th column, or [5] to get the 5th row.

Is there any way to do this?

It seems that this is a limitation of Numpy due to the memory-access calculations:

dataSet = np.loadtxt(open("adult.data.csv", "rb"), delimiter=" ,")
print dataSet[:, 4] <---results in IndexError: Invalid Index

I have also tried genfromtxt, dtype=str and dtype="a16", as well as dtype=object. Nothing works. Either I can load the data but it does not have column access, or I can't load the data at all.

Answer 1

Simulate your file using the sample line from the comment, replicated several times (i.e. one string per row of the file):

In [8]: txt = b" 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K"
In [9]: txt = [txt for _ in range(5)]

In [10]: txt
Out[10]: 
[b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K']

Load with genfromtxt, specifying the delimiter. Let it choose the best dtype per column:

In [12]: A=np.genfromtxt(txt, delimiter=',',dtype=None)
In [13]: A
Out[13]: 
array([ (39, b' State-gov', 77516, b' Bachelors', 13, b' Never-married', b' Adm-clerical', b' Not-in-family', b' White', b' Male', 2174, 0, 40, b' United-States', b' <=50K'),
       (39, b' State-gov', 77516, b' Bachelors', 13, b' Never-married', b' Adm-clerical', b' Not-in-family', b' White', b' Male', 2174, 0, 40, b' United-States', b' <=50K'),...], 
      dtype=[('f0', '<i4'), ('f1', 'S10'), ('f2', '<i4'), ('f3', 'S10'), ('f4', '<i4'), ('f5', 'S14'), ('f6', 'S13'), ('f7', 'S14'), ('f8', 'S6'), ('f9', 'S5'), ('f10', '<i4'), ('f11', '<i4'), ('f12', '<i4'), ('f13', 'S14'), ('f14', 'S6')])

The result is a 5-element 1D array with a compound (structured) dtype:

In [14]: A.shape
Out[14]: (5,)
In [15]: A.dtype
Out[15]: dtype([('f0', '<i4'), ('f1', 'S10'), ('f2', '<i4'),
    ('f3', 'S10'), ('f4', '<i4'), ....])

Access a 'column' by field name (not by column number):

In [16]: A['f4']
Out[16]: array([13, 13, 13, 13, 13])
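
If a positional index is still wanted with the structured array, one option is to look up the field name by position through A.dtype.names. This is only a sketch: the two sample rows are made up (shortened from the line above), and encoding="utf-8" is an assumption that requires NumPy 1.14 or later.

```python
import numpy as np

# Hypothetical sample rows, shortened from the line used above.
rows = ["39, State-gov, 77516, Bachelors, 13",
        "50, Private, 83311, HS-grad, 9"]

# dtype=None lets genfromtxt pick a dtype per column, giving a structured 1D array.
# encoding="utf-8" is an assumption; it yields str fields instead of bytes.
A = np.genfromtxt(rows, delimiter=",", dtype=None, encoding="utf-8")

print(A.shape)        # one entry per row; there is no column axis
print(A.dtype.names)  # default field names: f0, f1, ...

# Map a column number to its field name, then index by that name.
col = 4
field = A.dtype.names[col]
print(A[field])
```

Indexing with A[:, 4] would still fail here, because the array has only one dimension.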

Or load as dtype=str:

In [17]: A=np.genfromtxt(txt, delimiter=',',dtype=str)
In [18]: A
Out[18]: 
array([['39', ' State-gov', ' 77516', ' Bachelors', ' 13',
        ' Never-married', ' Adm-clerical', ' Not-in-family', ' White',
        ' Male', ' 2174', ' 0', ' 40', ' United-States', ' <=50K'],
        ...
        ' Male', ' 2174', ' 0', ' 40', ' United-States', ' <=50K']], 
      dtype='<U14')
In [19]: A.dtype
Out[19]: dtype('<U14')
In [20]: A.shape
Out[20]: (5, 15)
In [21]: A[:,4]
Out[21]: 
array([' 13', ' 13', ' 13', ' 13', ' 13'], 
      dtype='<U14')

Now it is a 15-column 2D array that can be indexed by column number.
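
Note that the fields above keep their leading spaces (' 13' rather than '13'). As a sketch of one way to avoid that, genfromtxt accepts autostrip=True; the sample rows below are made up, and encoding="utf-8" is again an assumption (NumPy 1.14 or later).

```python
import numpy as np

# Hypothetical sample rows, shortened from the line used above.
rows = ["39, State-gov, 77516, Bachelors, 13",
        "50, Private, 83311, HS-grad, 9"]

# dtype=str gives a plain 2D array of strings;
# autostrip=True removes the whitespace around each field.
A = np.genfromtxt(rows, delimiter=",", dtype=str,
                  autostrip=True, encoding="utf-8")

print(A.shape)   # (rows, columns)
print(A[:, 4])   # column with index 4
print(A[1])      # row with index 1
```

Both the row and column forms of indexing asked about in the question work on this array.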

With the wrong delimiter (' ,' as in the question), each row loads as a single column:

In [24]: A=np.genfromtxt(txt, delimiter=' ,',dtype=str)
In [25]: A
Out[25]: 
array([ '39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
      ...], 
      dtype='<U127')
In [26]: A.shape
Out[26]: (5,)

The result is a 1D array with a long string dtype, which is why indexing it with [:, 4] fails.

A CSV file might be loaded in various ways, some intentional, some not. You have to look at the results (shape and dtype) and try to understand them before blindly trying to index columns.
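
One way to make that inspection routine is to check the number of dimensions right after loading. The helper below is hypothetical (load_as_strings is not a NumPy function), and the sample rows are made up; treat it as a sketch rather than a general-purpose loader.

```python
import numpy as np

# Hypothetical sample rows, shortened from the line used above.
rows = ["39, State-gov, 77516, Bachelors, 13",
        "50, Private, 83311, HS-grad, 9"]

def load_as_strings(source, delimiter=","):
    """Hypothetical helper: return a 2D string array or fail loudly."""
    arr = np.genfromtxt(source, delimiter=delimiter, dtype=str,
                        encoding="utf-8")
    if arr.ndim != 2:
        raise ValueError(
            f"expected a 2D array, got shape {arr.shape}; check the delimiter"
        )
    return arr

good = load_as_strings(rows, delimiter=",")
print(good.shape)

# The delimiter from the question never matches, so every row stays one field.
error = None
try:
    load_as_strings(rows, delimiter=" ,")
except ValueError as exc:
    error = exc
    print("rejected:", exc)
```

A file with a single data row also loads as a 1D array, so this check would reject it as well; np.atleast_2d is one possible way to handle that case.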

solution1
1 ACCEPTED 2016-01-23 02:34:56
