PYTHON - Error when importing CSV data with multiple data types using numpy genfromtxt

Question

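I am taking part in a Kaggle competition to predict restaurant revenue from several predictor variables. I am a beginner in Python; I usually use RapidMiner for data analysis. I am using Python 3.4 in the Spyder 2.3 development environment.

I am using the code below to import the training CSV file.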

from sklearn import linear_model
from numpy import genfromtxt, savetxt

def main():
    # create the training & test sets, skipping the header row with [1:]
    dataset = genfromtxt(open('data/train.csv', 'rb'), delimiter=",", dtype=None)[1:]
    train = [x[1:41] for x in dataset]
    test = genfromtxt(open('data/test.csv', 'rb'), delimiter=",")[1:]

This is the error I get:

dataset = genfromtxt(open('data/train.csv','rb'), delimiter=",", dtype= None)[1:]

IndexError: too many indices for array

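I then checked the imported data types using print(dataset.dtype).

I noticed that a separate data type had been assigned to every single value in the CSV file. Also, the code does not work with the [1:] at the end; it gives me the same too many indices error. If I remove the [1:] and pass the skip_header=1 option instead, I get the following error: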

output = np.array(data, dtype=ddtype)

TypeError: Empty data-type

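It looks to me as if the whole dataset is being read as a single row with more than 5000 columns.

The dataset consists of 43 columns and 138 rows.

I am stuck at the moment and would appreciate any advice on how to proceed.

I am posting a sample of the raw CSV data below: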

Id,Open Date,City,City Group,Type,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,P11,P12,P13,P14,P15,P16,P17,P18,P19,P20,P21,P22,P23,P24,P25,P26,P27,P28,P29,P30,P31,P32,P33,P34,P35,P36,P37,revenue
0,7/17/99,Ä°stanbul,Big Cities,IL,4,5,4,4,2,2,5,4,5,5,3,5,5,1,2,2,2,4,5,4,1,3,3,1,1,1,4,2,3,5,3,4,5,5,4,3,4,5653753
1,2/14/08,Ankara,Big Cities,FC,4,5,4,4,1,2,5,5,5,5,1,5,5,0,0,0,0,0,3,2,1,3,2,0,0,0,0,3,3,0,0,0,0,0,0,0,0,6923131
2,3/9/13,DiyarbakÄr,Other,IL,2,4,2,5,2,3,5,5,5,5,2,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,2055379
3,2/2/12,Tokat,Other,IL,6,4.5,6,6,4,4,10,8,10,10,8,10,7.5,6,4,9,3,12,20,12,6,1,10,2,2,2.5,2.5,2.5,7.5,25,12,10,6,18,12,12,6,2675511
4,5/9/09,Gaziantep,Other,IL,3,4,3,4,2,2,5,5,5,5,2,5,5,2,1,2,1,4,2,2,1,2,1,2,3,3,5,1,3,5,1,3,2,3,4,3,3,4316715
5,2/12/10,Ankara,Big Cities,FC,6,6,4.5,7.5,8,10,10,8,8,8,10,8,6,0,0,0,0,0,5,6,3,1,5,0,0,0,0,7.5,5,0,0,0,0,0,0,0,0,5017319
6,10/11/10,Ä°stanbul,Big Cities,IL,2,3,4,4,1,5,5,5,5,5,2,5,5,3,4,4,3,4,2,4,1,2,1,5,4,4,5,1,3,4,5,2,2,3,5,4,4,5166635
7,6/21/11,Ä°stanbul,Big Cities,IL,4,5,4,5,2,3,5,4,4,4,4,3,4,0,0,0,0,0,3,5,2,4,2,0,0,0,0,3,2,0,0,0,0,0,0,0,0,4491607
8,8/28/10,Afyonkarahisar,Other,IL,1,1,4,4,1,2,1,5,5,5,1,5,5,1,1,2,1,4,1,1,1,1,1,4,4,4,2,2,3,4,5,5,3,4,5,4,5,4952497
9,11/16/11,Edirne,Other,IL,6,4.5,6,7.5,6,4,10,10,10,10,2,10,7.5,0,0,0,0,0,25,3,3,1,10,0,0,0,0,5,2.5,0,0,0,0,0,0,0,0,5444227

Answer 1

I think characters such as Ä° are causing the problem in genfromtxt. I found that the following reads in the data you posted here:

dtypes = "i8,S12,S12,S12,S12" + ",i8"*38
test = genfromtxt(open('data/test.csv','rb'),  delimiter="," , names = True, dtype=dtypes)

You can then access elements by name,

In [16]: test['P8']
Out[16]: array([ 4,  5,  5,  8,  5,  8,  5,  4,  5, 10])

The values of the City column,

test['City']

returns,

array(['\xc3\x84\xc2\xb0stanbul', 'Ankara', 'Diyarbak\xc3\x84r', 'Tokat',
   'Gaziantep', 'Ankara', '\xc3\x84\xc2\xb0stanbul',
   '\xc3\x84\xc2\xb0stanbul', 'Afyonkarahis', 'Edirne'], 
  dtype='|S12')
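
For anyone who wants to try the named-column approach without the competition files, here is a minimal self-contained sketch. The four-column sample and the variable names are made up for illustration, and I have assumed f8 for the P columns because some values in the posted data are fractional (4.5, 7.5); adjust the type string to your actual columns.

```python
import io
import numpy as np

# Hypothetical four-column excerpt, not the full 43-column file.
sample = io.StringIO(
    "Id,City,P1,P2\n"
    "0,Ankara,4,5\n"
    "3,Tokat,6,4.5\n"
)

# One type per column: integer id, 12-byte string, two floats.
# f8 is an assumption made here so that fractional values such as 4.5 parse.
dtypes = "i8,S12,f8,f8"

# names=True takes the field names from the header row.
data = np.genfromtxt(sample, delimiter=",", names=True, dtype=dtypes)

print(data.dtype.names)  # field names read from the header
print(data["P2"])        # column access by name
print(data.shape)        # one record per data row
```

The result is a 1-D structured array with one record per row, so columns are selected by field name rather than by a second index.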

In principle, you could try converting these to unicode in your Python script, e.g.,

In [17]: unicode(test['City'][0], 'utf8')
Out[17]: u'\xc4\xb0stanbul'

where \xc4\xb0 is the UTF-8 hex encoding of İ. To avoid this, you could also try cleaning up the CSV input file.
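
Since the question mentions Python 3.4, where the unicode() built-in does not exist, here is a rough Python 3 sketch of the same conversion. The second step is an assumption on my part: it only helps if the text was mis-decoded as Latin-1 somewhere before it reached the file.

```python
# Bytes as shown for the first City value in the output above.
raw = b"\xc3\x84\xc2\xb0stanbul"

# Step 1: decode the UTF-8 bytes (Python 3 counterpart of unicode(raw, 'utf8')).
text = raw.decode("utf-8")

# Step 2 (assumption): undo an earlier Latin-1 mis-decoding by re-encoding
# as Latin-1 and decoding the resulting bytes as UTF-8.
repaired = text.encode("latin-1").decode("utf-8")

print(repr(text))
print(repr(repaired))
```

This round trip will not work for every value in the sample: 'Diyarbak\xc3\x84r' appears to have lost a byte, so the final decode raises UnicodeDecodeError there, and cleaning the input file is probably the more reliable route.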

Answer 2

[Solved].

I simply gave up on numpy's genfromtxt and chose to use pandas' read_csv instead, since it offers the option to import text with 'utf-8' encoding.
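
A minimal sketch of what that can look like, assuming a small in-memory sample in place of data/train.csv (the excerpt and the variable names are hypothetical, not the code actually used):

```python
import io
import pandas as pd

# Hypothetical excerpt encoded as UTF-8; \u0130 is the letter İ.
# With a real file, pass the path instead of a buffer.
sample = (
    "Id,Open Date,City,P1,revenue\n"
    "0,7/17/99,\u0130stanbul,4,5653753\n"
    "3,2/2/12,Tokat,6,2675511\n"
).encode("utf-8")

# encoding="utf-8" tells pandas how to decode the bytes; column types
# are inferred per column, so text and numbers can be mixed.
df = pd.read_csv(io.BytesIO(sample), encoding="utf-8")

print(df.dtypes)
print(df["City"].tolist())
```

The header row becomes the column names, so no manual [1:] slicing or skip_header is needed.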

Answer 1
0 2015-03-30 18:12:57

Answer 2
0 2015-04-04 19:10:46