I am new to Python, NumPy, and Matplotlib.
I am trying to create a 2D NumPy array from a CSV of various data types, but I will treat them all as strings. The killer is that I need to be able to access them with tuple indices, like [:,5] to get the 5th column, or [5] to get the 5th row.
Is there any way to do this?
It seems that this is a limitation of NumPy due to the memory-access calculations:
dataSet = np.loadtxt(open("adult.data.csv", "rb"), delimiter=" ,")
print dataSet[:, 4] <---results in IndexError: Invalid Index
I have also tried loadfromgen, dtype = str and dtype = "a16", as well as dtype = object. Nothing works. Either I can load the data but it does not have column access, or I can't load the data at all.
Simulate your file using the line from your comment, replicated several times (i.e. one string per row of the file):
In [8]: txt = b" 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K"
In [9]: txt = [txt for _ in range(5)]
In [10]: txt
Out[10]:
[b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K']
Load with genfromtxt, specifying the delimiter. Let it choose the best dtype per column:
In [12]: A=np.genfromtxt(txt, delimiter=',',dtype=None)
In [13]: A
Out[13]:
array([ (39, b' State-gov', 77516, b' Bachelors', 13, b' Never-married', b' Adm-clerical', b' Not-in-family', b' White', b' Male', 2174, 0, 40, b' United-States', b' <=50K'),
(39, b' State-gov', 77516, b' Bachelors', 13, b' Never-married', b' Adm-clerical', b' Not-in-family', b' White', b' Male', 2174, 0, 40, b' United-States', b' <=50K'),...],
dtype=[('f0', '<i4'), ('f1', 'S10'), ('f2', '<i4'), ('f3', 'S10'), ('f4', '<i4'), ('f5', 'S14'), ('f6', 'S13'), ('f7', 'S14'), ('f8', 'S6'), ('f9', 'S5'), ('f10', '<i4'), ('f11', '<i4'), ('f12', '<i4'), ('f13', 'S14'), ('f14', 'S6')])
The result is a 5-element 1D array with a compound (structured) dtype:
In [14]: A.shape
Out[14]: (5,)
In [15]: A.dtype
Out[15]: dtype([('f0', '<i4'), ('f1', 'S10'), ('f2', '<i4'),
('f3', 'S10'), ('f4', '<i4'), ....])
Access a 'column' with a field name (not a column number):
In [16]: A['f4']
Out[16]: array([13, 13, 13, 13, 13])
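If you still want to think in column numbers with a structured array, one option is to look the field name up by position. The following is a minimal sketch, not part of the original session: `field_by_position` is a hypothetical helper, and `encoding="utf-8"` is passed explicitly on the assumption that you want text fields as str rather than bytes.

```python
import numpy as np

# One sample row (taken from the example above), repeated to mimic a 5-line file.
row = (" 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
       "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
lines = [row] * 5

# dtype=None lets genfromtxt pick a dtype per column; the explicit encoding
# makes text columns come back as str instead of bytes.
A = np.genfromtxt(lines, delimiter=",", dtype=None, encoding="utf-8")

def field_by_position(arr, i):
    """Hypothetical helper: return the i-th field of a structured array."""
    return arr[arr.dtype.names[i]]

col4 = field_by_position(A, 4)  # same data as A['f4']
first_row = A[0]                # one record; its items can be read by position
```

Here `A[0]` still gives a whole row, but `A[:, 4]` does not work, because the array has only one dimension.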
Or load as dtype=str:
In [17]: A=np.genfromtxt(txt, delimiter=',',dtype=str)
In [18]: A
Out[18]:
array([['39', ' State-gov', ' 77516', ' Bachelors', ' 13',
' Never-married', ' Adm-clerical', ' Not-in-family', ' White',
' Male', ' 2174', ' 0', ' 40', ' United-States', ' <=50K'],
...
' Male', ' 2174', ' 0', ' 40', ' United-States', ' <=50K']],
dtype='<U14')
In [19]: A.dtype
Out[19]: dtype('<U14')
In [20]: A.shape
Out[20]: (5, 15)
In [21]: A[:,4]
Out[21]:
array([' 13', ' 13', ' 13', ' 13', ' 13'],
dtype='<U14')
Now it is a 15-column 2D array that can be indexed by column number.
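Note that the string values above keep the space that follows each comma. As a sketch of one way to avoid that, genfromtxt has an `autostrip` parameter that strips whitespace from each value. The sample row is the one from the question, and `encoding="utf-8"` is my own addition. Keep in mind that indices are zero-based, so `[:, 4]` is the fifth column and `[4]` is the fifth row.

```python
import numpy as np

# Same sample row as above, repeated to mimic a 5-line file.
row = (" 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
       "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
lines = [row] * 5

# dtype=str loads every value as a string; autostrip=True removes the
# whitespace around each value.
B = np.genfromtxt(lines, delimiter=",", dtype=str, autostrip=True,
                  encoding="utf-8")

fifth_column = B[:, 4]  # all rows, column index 4
fifth_row = B[4]        # row index 4, all columns
```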
With the wrong delimiter, it loads one column per row:
In [24]: A=np.genfromtxt(txt, delimiter=' ,',dtype=str)
In [25]: A
Out[25]:
array([ '39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
...],
dtype='<U127')
In [26]: A.shape
Out[26]: (5,)
This is a 1D array with a long string dtype.
A CSV file might be loaded in various ways, some intentional, some not. You have to look at the results (shape and dtype) and try to understand them before blindly trying to index columns.
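That check can be written down as a small guard. This is only a sketch under my own assumptions: `get_column` is a hypothetical helper, not a NumPy function. It handles the two successful layouts shown above and raises a descriptive error for the wrong-delimiter case.

```python
import numpy as np

def get_column(arr, i):
    """Hypothetical helper: column i of a 2D array, or field i of a
    structured 1D array; raise if the array has neither layout."""
    if arr.ndim == 2:
        return arr[:, i]
    if arr.dtype.names is not None:
        return arr[arr.dtype.names[i]]
    raise IndexError(
        f"shape {arr.shape}, dtype {arr.dtype}: no columns to index; "
        "check the delimiter"
    )

row = (" 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
       "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
lines = [row] * 5

good = np.genfromtxt(lines, delimiter=",", dtype=str, encoding="utf-8")
bad = np.genfromtxt(lines, delimiter=" ,", dtype=str, encoding="utf-8")

print(good.shape, bad.shape)  # inspect before indexing
column = get_column(good, 4)  # fine: 2D array
try:
    get_column(bad, 4)        # wrong delimiter gave a 1D array of long strings
except IndexError as exc:
    error_message = str(exc)
```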