2D NumPy array of strings with column access

Question

I am new to Python, NumPy, and Matplotlib.

I am trying to create a 2D NumPy array from a CSV file that contains various data types, but I will treat them all as strings. The catch is that I need to be able to access them with tuple indices, like [:,5] to get the 5th column, or [5] to get the 5th row.

Is there any way to do this?

It seems that this is a limitation of NumPy due to the memory-access calculations:

dataSet = np.loadtxt(open("adult.data.csv", "rb"), delimiter=" ,")
print dataSet[:, 4]  # <--- results in IndexError: Invalid Index

I have also tried loadfromgen, dtype = str and dtype = "a16", as well as dtype = object. Nothing works. Either I can load the data but it does not have column access, or I can't load the data at all.

Answer 1

Simulate your file using the sample line from your comment, replicated several times (i.e., one string per row of the file):

In [8]: txt = b" 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K"
In [9]: txt = [txt for _ in range(5)]

In [10]: txt
Out[10]: 
[b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K']

Load with genfromtxt, specifying the delimiter. With dtype=None, it chooses the best dtype for each column:

In [12]: A=np.genfromtxt(txt, delimiter=',',dtype=None)
In [13]: A
Out[13]: 
array([ (39, b' State-gov', 77516, b' Bachelors', 13, b' Never-married', b' Adm-clerical', b' Not-in-family', b' White', b' Male', 2174, 0, 40, b' United-States', b' <=50K'),
       (39, b' State-gov', 77516, b' Bachelors', 13, b' Never-married', b' Adm-clerical', b' Not-in-family', b' White', b' Male', 2174, 0, 40, b' United-States', b' <=50K'),...], 
      dtype=[('f0', '<i4'), ('f1', 'S10'), ('f2', '<i4'), ('f3', 'S10'), ('f4', '<i4'), ('f5', 'S14'), ('f6', 'S13'), ('f7', 'S14'), ('f8', 'S6'), ('f9', 'S5'), ('f10', '<i4'), ('f11', '<i4'), ('f12', '<i4'), ('f13', 'S14'), ('f14', 'S6')])

The result is a 5-element 1D array with a compound dtype:

In [14]: A.shape
Out[14]: (5,)
In [15]: A.dtype
Out[15]: dtype([('f0', '<i4'), ('f1', 'S10'), ('f2', '<i4'),
    ('f3', 'S10'), ('f4', '<i4'), ....])

Access a 'column' with a field name (not a column number):

In [16]: A['f4']
Out[16]: array([13, 13, 13, 13, 13])
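If a positional column number is still wanted with the compound dtype, one option is to look up the field name through dtype.names. This is only a minimal sketch: the two sample rows are shortened, made-up stand-ins for the real file, and encoding="utf-8" is an added assumption so that the string fields come back as str instead of bytes.

```python
import numpy as np

# Two shortened sample rows (hypothetical stand-ins for lines of the CSV file).
lines = ["39, State-gov, 77516, Bachelors",
         "50, Private, 83311, HS-grad"]

# dtype=None lets genfromtxt choose a dtype per column, giving a structured array.
A = np.genfromtxt(lines, delimiter=",", dtype=None, encoding="utf-8")

# Default field names are f0, f1, ...; dtype.names maps a position to a name.
col_index = 2
col = A[A.dtype.names[col_index]]   # the third 'column' as a plain 1D array
row = A[1]                          # the second record

print(A.dtype.names)
print(col)
print(row)
```

Rows are still indexed by number as usual; only the columns need the name lookup.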

Or load with dtype=str:

In [17]: A=np.genfromtxt(txt, delimiter=',',dtype=str)
In [18]: A
Out[18]: 
array([['39', ' State-gov', ' 77516', ' Bachelors', ' 13',
        ' Never-married', ' Adm-clerical', ' Not-in-family', ' White',
        ' Male', ' 2174', ' 0', ' 40', ' United-States', ' <=50K'],
        ...
        ' Male', ' 2174', ' 0', ' 40', ' United-States', ' <=50K']], 
      dtype='<U14')
In [19]: A.dtype
Out[19]: dtype('<U14')
In [20]: A.shape
Out[20]: (5, 15)
In [21]: A[:,4]
Out[21]: 
array([' 13', ' 13', ' 13', ' 13', ' 13'], 
      dtype='<U14')

Now it is a 15-column 2D array that can be indexed by column number.
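The strings above keep the space that follows each comma. As a sketch of one way to avoid that, genfromtxt's autostrip=True option trims whitespace from each field while loading. The sample rows here are shortened, made-up stand-ins for the real file, and encoding="utf-8" is an added assumption.

```python
import numpy as np

# Two shortened sample rows (hypothetical stand-ins for lines of the CSV file).
lines = [" 39, State-gov, 77516, Bachelors, 13",
         " 50, Private, 83311, HS-grad, 9"]

# dtype=str keeps every field as a string; autostrip=True trims surrounding spaces.
A = np.genfromtxt(lines, delimiter=",", dtype=str, autostrip=True, encoding="utf-8")

col4 = A[:, 4]   # column with index 4, across all rows
row1 = A[1]      # row with index 1

print(A.shape)
print(col4)
print(row1)
```

Both the [:, n] and [n] forms from the question work on this array because it is an ordinary 2D array with a single string dtype.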

With the wrong delimiter, it loads one column per row:

In [24]: A=np.genfromtxt(txt, delimiter=' ,',dtype=str)
In [25]: A
Out[25]: 
array([ '39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
      ...], 
      dtype='<U127')
In [26]: A.shape
Out[26]: (5,)

The result is a 1D array with a long string dtype.

A CSV file might be loaded in various ways, some intentional, some not. You have to look at the results and try to understand them before blindly trying to index columns.
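That check can be sketched as a small helper that reports which of the three layouts shown above a load produced. The function name describe_layout and the shortened sample rows are hypothetical, introduced only for illustration, and encoding="utf-8" is an added assumption.

```python
import numpy as np

def describe_layout(arr):
    """Report how a loaded array can be indexed (hypothetical helper)."""
    if arr.dtype.names is not None:
        return "structured"          # index columns by field name, e.g. arr['f4']
    return "2d" if arr.ndim == 2 else "1d"

# Two shortened sample rows (hypothetical stand-ins for lines of the CSV file).
lines = ["39, State-gov, 77516",
         "50, Private, 83311"]

per_column = np.genfromtxt(lines, delimiter=",", dtype=None, encoding="utf-8")
all_str = np.genfromtxt(lines, delimiter=",", dtype=str, encoding="utf-8")
bad_delim = np.genfromtxt(lines, delimiter=" ,", dtype=str, encoding="utf-8")

for name, arr in [("per_column", per_column),
                  ("all_str", all_str),
                  ("bad_delim", bad_delim)]:
    print(name, describe_layout(arr), arr.shape, arr.dtype)
```

Only the "2d" case supports arr[:, n]; the other two need either field names or a corrected delimiter.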
