2D NumPy array of strings with column access

Question

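Python, NumPy, and Matplotlib.

I am trying to create a 2D NumPy array from a CSV that contains various data types, but I want to treat all of them as strings. The catch is that I need to be able to access them with tuple indices, e.g. [:,5] to get the 5th column, or [5] to get the 5th row.

Is there any way to do this?

This seems to be a limitation of NumPy due to memory access calculations: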

dataSet = np.loadtxt(open("adult.data.csv", "rb"), delimiter=" ,")
print dataSet[:, 4] <---results in IndexError: Invalid Index

I have also tried loadfromgen, as well as dtype = str, dtype = "a16", and dtype = object. Nothing works. Either the data loads without column access, or it does not load at all.

Answer 1

Simulating your file from the comment line, replicated several times (i.e. one string for each line of the file):

In [8]: txt = b" 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K"
In [9]: txt = [txt for _ in range(5)]

In [10]: txt
Out[10]: 
[b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K']

Load it with genfromtxt and the correct delimiter, letting it choose the best dtype for each column:

In [12]: A=np.genfromtxt(txt, delimiter=',',dtype=None)
In [13]: A
Out[13]: 
array([ (39, b' State-gov', 77516, b' Bachelors', 13, b' Never-married', b' Adm-clerical', b' Not-in-family', b' White', b' Male', 2174, 0, 40, b' United-States', b' <=50K'),
       (39, b' State-gov', 77516, b' Bachelors', 13, b' Never-married', b' Adm-clerical', b' Not-in-family', b' White', b' Male', 2174, 0, 40, b' United-States', b' <=50K'),...], 
      dtype=[('f0', '<i4'), ('f1', 'S10'), ('f2', '<i4'), ('f3', 'S10'), ('f4', '<i4'), ('f5', 'S14'), ('f6', 'S13'), ('f7', 'S14'), ('f8', 'S6'), ('f9', 'S5'), ('f10', '<i4'), ('f11', '<i4'), ('f12', '<i4'), ('f13', 'S14'), ('f14', 'S6')])

The result is a 5-element array with a compound dtype:

In [14]: A.shape
Out[14]: (5,)
In [15]: A.dtype
Out[15]: dtype([('f0', '<i4'), ('f1', 'S10'), ('f2', '<i4'),
    ('f3', 'S10'), ('f4', '<i4'), ....])

Access the "columns" by field name (rather than by column number):

In [16]: A['f4']
Out[16]: array([13, 13, 13, 13, 13])
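If a positional index is still wanted with a compound dtype, the field name can be looked up by position through dtype.names. The following is a minimal sketch, not part of the original answer; the shortened sample row and the encoding="utf-8" argument (so string fields come back as str rather than bytes) are assumptions made for illustration.

```python
import numpy as np

# A shortened sample row, repeated to stand in for the real file (assumption).
row = "39, State-gov, 77516, Bachelors, 13, Never-married"
lines = [row for _ in range(3)]

# dtype=None lets genfromtxt pick a dtype per column, producing a
# 1D structured array with default field names f0, f1, ...
A = np.genfromtxt(lines, delimiter=",", dtype=None, encoding="utf-8")

# Translate a column position into its field name, then index by name.
name = A.dtype.names[4]
col = A[name]

print(A.shape)   # one entry per line; the columns live in the dtype
print(name)
print(col)
```

The array itself stays 1D here, so A[:, 4] would still fail; only the name-based lookup works.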

Or load it with dtype=str:

In [17]: A=np.genfromtxt(txt, delimiter=',',dtype=str)
In [18]: A
Out[18]: 
array([['39', ' State-gov', ' 77516', ' Bachelors', ' 13',
        ' Never-married', ' Adm-clerical', ' Not-in-family', ' White',
        ' Male', ' 2174', ' 0', ' 40', ' United-States', ' <=50K'],
        ...
        ' Male', ' 2174', ' 0', ' 40', ' United-States', ' <=50K']], 
      dtype='<U14')
In [19]: A.dtype
Out[19]: dtype('<U14')
In [20]: A.shape
Out[20]: (5, 15)
In [21]: A[:,4]
Out[21]: 
array([' 13', ' 13', ' 13', ' 13', ' 13'], 
      dtype='<U14')

This is now a 15-column 2D array that can be indexed by column number.
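Note that the string values above keep the space that follows each comma. One way to drop it, sketched here under the assumption that the padding is unwanted, is genfromtxt's autostrip option; the shortened sample row is again only illustrative.

```python
import numpy as np

# A shortened sample row, repeated to stand in for the real file (assumption).
row = "39, State-gov, 77516, Bachelors, 13, Never-married"
lines = [row for _ in range(3)]

# dtype=str keeps every field as a string, giving a regular 2D array;
# autostrip=True removes the whitespace surrounding each value.
A = np.genfromtxt(lines, delimiter=",", dtype=str,
                  autostrip=True, encoding="utf-8")

fifth_column = A[:, 4]   # column access with a tuple index
second_row = A[1]        # row access with a single index

print(A.shape)
print(fifth_column)
print(second_row)
```

Since everything is a string, numeric columns need an explicit conversion before arithmetic, e.g. A[:, 4].astype(int).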

With the wrong delimiter, each line is loaded as a single column:

In [24]: A=np.genfromtxt(txt, delimiter=' ,',dtype=str)
In [25]: A
Out[25]: 
array([ '39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
      ...], 
      dtype='<U127')
In [26]: A.shape
Out[26]: (5,)

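This is a 1D array with a long string dtype.

A CSV file can be loaded in various ways, some intentional, some not. You have to look at the results and try to understand them before blindly trying to index columns.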
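As a rough illustration of that advice, the sketch below loads the same lines with a correct and an incorrect delimiter and inspects each result before indexing. The describe helper is hypothetical, written only for this example, and the sample row is shortened.

```python
import numpy as np

def describe(arr):
    """Summarize how a loaded array is laid out (hypothetical helper)."""
    return {"ndim": arr.ndim, "shape": arr.shape, "fields": arr.dtype.names}

# A shortened sample row, repeated to stand in for the real file (assumption).
row = "39, State-gov, 77516, Bachelors, 13"
lines = [row for _ in range(3)]

# Correct delimiter: one array column per CSV field.
good = np.genfromtxt(lines, delimiter=",", dtype=str, encoding="utf-8")

# Wrong delimiter (" ," never occurs in the data): each line becomes one
# long string, so the result has a single dimension.
bad = np.genfromtxt(lines, delimiter=" ,", dtype=str, encoding="utf-8")

print(describe(good))
print(describe(bad))

# Only index by column once the array is known to be 2D.
for arr in (good, bad):
    if arr.ndim == 2:
        print(arr[:, 4])
    else:
        print("not 2D, column indexing would raise IndexError")
```

Checking ndim, shape, and dtype first makes the cause of an IndexError visible immediately: the array was never 2D to begin with.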
