sklearn.preprocessing.LabelEncoder TypeError on a dataset

Question

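There are 14 columns and roughly 1,011,052 rows. When reading the CSV, about ten rows are skipped (the error is: Error tokenizing data. C error: Expected 14 fields in line <...>, saw 15). I am using data.apply(LabelEncoder().fit_transform) to convert the strings to numeric codes so the data can be passed to scikit-learn's fit(...). Using data.apply(LabelEncoder().fit_transform) is suggested here (https://stackoverflow.com/a/31939145/2178774). (Edit: note that 670 is the first value.)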

data = pd.read_csv('./dm.csv',error_bad_lines=False)

print(X.shape,y.shape)

(1011052, 13) (1011052, 1)

data.apply(LabelEncoder().fit_transform)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-9734848fb589> in <module>()
     19 # y is now: array([2, 0, 1, 3, 2, 0, 1, 3])
     20 
---> 21 data.apply(LabelEncoder().fit_transform)
     22 # TypeError: ("'>' not supported between instances of 'int' and 'str'", 'occurred at index 670')
     23 

/usr/lib64/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   4358                         f, axis,
   4359                         reduce=reduce,
-> 4360                         ignore_failures=ignore_failures)
   4361             else:
   4362                 return self._apply_broadcast(f, axis)

/usr/lib64/python3.6/site-packages/pandas/core/frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
   4454             try:
   4455                 for i, v in enumerate(series_gen):
-> 4456                     results[i] = func(v)
   4457                     keys.append(v.name)
   4458             except Exception as e:

/usr/lib64/python3.6/site-packages/sklearn/preprocessing/label.py in fit_transform(self, y)
    110         """
    111         y = column_or_1d(y, warn=True)
--> 112         self.classes_, y = np.unique(y, return_inverse=True)
    113         return y
    114 

/usr/lib64/python3.6/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts)
    209 
    210     if optional_indices:
--> 211         perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
    212         aux = ar[perm]
    213     else:

TypeError: ("'>' not supported between instances of 'int' and 'str'", 'occurred at index 670')

Edit: read_csv produces the following output: /usr/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2717: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
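One way to act on the warning's suggestion is to pin the dtype of the mixed column at read time. The sketch below is only an illustration under assumptions: the CSV text is a made-up three-column stand-in for dm.csv, and it uses on_bad_lines="skip", which assumes pandas 1.3 or later (it replaces the older error_bad_lines=False used in the question).

```python
import io

import pandas as pd

# Hypothetical miniature of dm.csv: 3 fields per row, one row with an extra
# field, and a first column that mixes integer-like and text values.
csv_text = "100,a,1.0\n101,b,2.0\n102,c,3.0,extra\nx103,d,4.0\n"

data = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    dtype={0: str},        # read column 0 as text so its values share one type
    on_bad_lines="skip",   # drop rows with too many fields instead of raising
)

print(data.shape)
print(data[0].tolist())
```

Reading the mixed column as text keeps every value comparable with every other value, which is what LabelEncoder needs when it sorts the distinct classes.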

Edit: I added dtype={...} to read_csv, which now leads to this type error: TypeError: ("'>' not supported between instances of 'str' and 'int'", 'occurred at index 0').

data = pd.read_csv('./dm.csv',error_bad_lines=False,header=None,dtype={
  0: np.dtype('u8'), # 64-bit unsigned integer
  1: np.dtype('u4'), # 32-bit unsigned integer
  2: np.dtype('U'),  # unicode
  3: np.dtype('U'),  # unicode
  4: np.dtype('U'),  # unicode
  5: np.dtype('U'),  # unicode
  6: np.dtype('u2'), # 16-bit unsigned integer
  7: np.dtype('U'),  # unicode
  8: np.dtype('U'),  # unicode
  9: np.dtype('f2'), # 16-bit floating point
  10:np.dtype('U'),  # unicode
  11:np.dtype('U'),  # unicode
  12:np.dtype('f4'), # 32-bit floating point
  13:np.dtype('U')   # unicode
})

Edit: The TypeError also occurs when using only two rows of data. It is raised on column 8. Row 1, column 8 is "GHI789". Row 2, column 8 is "NaN".

X = data.iloc[0:2,0:14]
print(X)
print('--------')
for col in X.columns:
    print(col)
    print(X.dtypes[col])
    if X.dtypes[col] == "object":
        le = LabelEncoder()
        le.fit_transform(X[col])
        X[col] = le.transform(X[col])

Output:

     0      1           2   \
0  100  138.0  2017-12-31   
1  101   13.0  2017-12-31   

        3         4   \
0  Title1    ABC123   
1  Title2    ABC123

       5    6        7   \
0  User1  0.0   DEF456
1  User2  0.0   DEF456

        8    9      10  \
0  GHI789  0.0  XYZ123   
1     NaN  0.0  XYZ123

        11   12   13  
0  Title11  0.0  NaN  
1  Title22  0.0  NaN  

--------

0
object
1
float64
2
object
3
object
4
object
5
object
6
float64
7
object
8
object

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-70-c94173863fd7> in <module>()
     29     if X.dtypes[col] == "object":
     30         le = LabelEncoder()
---> 31         le.fit_transform(X[col])
     32         X[col] = le.transform(X[col])

/usr/lib64/python3.6/site-packages/sklearn/preprocessing/label.py in fit_transform(self, y)
    110         """
    111         y = column_or_1d(y, warn=True)
--> 112         self.classes_, y = np.unique(y, return_inverse=True)
    113         return y
    114 

/usr/lib64/python3.6/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts)
    209 
    210     if optional_indices:
--> 211         perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
    212         aux = ar[perm]
    213     else:

TypeError: '>' not supported between instances of 'float' and 'str'

Edit: Solution?: NaN mixed in with strings is the problem. A fix is to replace NaN with an empty string, for example: data = data.replace(np.nan, '', regex=True).
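As a minimal sketch of that idea, the snippet below rebuilds a two-row column like column 8 above and fills the missing value with an empty string before encoding. The Series is invented for illustration, and fillna('') is used here in place of data.replace(np.nan, '', regex=True) as a simpler equivalent for this case.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for column 8: one string and one missing value (a float NaN).
col = pd.Series(["GHI789", np.nan], dtype=object)

# After filling, every entry is a str, so the values can be sorted together.
cleaned = col.fillna("")

codes = LabelEncoder().fit_transform(cleaned)
print(codes)  # [1 0]: the empty string sorts before "GHI789"
```

Note that the empty string becomes a category of its own, so missing values end up with their own label.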

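Edit: I just noticed that column 9 has two problems. One: about two hundred rows are empty strings, which causes a str/float comparison problem. Two: another large group is the string "0", which gets parsed as either int or str, again causing the str/float problem. For the second case, the fix is: data[9] = data[9].replace('^0$', 0.0, regex=True).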
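A possible alternative that handles both cases in one pass is pd.to_numeric with errors="coerce". This is a different technique from the regex replacement above, shown only as a sketch; the sample values are made up to mimic the two problems described.

```python
import pandas as pd

# Toy stand-in for column 9: real floats, the string "0", and an empty string.
col9 = pd.Series([0.0, "0", "", 1.5], dtype=object)

# Numeric-looking strings are parsed; anything unparseable becomes NaN.
numeric = pd.to_numeric(col9, errors="coerce")

print(numeric.dtype)
print(numeric.tolist())
```

The empty strings turn into NaN here, so they still need a decision afterwards (fill, drop, or keep as missing) before the column is passed to an estimator.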

Answer 1

    if train[col].dtype == 'object':
      train[col] = train[col].fillna(train[col].mode().iloc[0])

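You can fill such NaN values with the most frequent value (the mode) of the column, which is what the snippet above does; a mean is not defined for a string column. I think this will resolve the error.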
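Put together with the encoding step, that approach might look like the sketch below. The train frame and its column names are hypothetical, chosen only to make the example runnable.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training frame: one text column with a gap, one numeric column.
train = pd.DataFrame({
    "user": pd.Series(["User1", "User2", "User1", np.nan], dtype=object),
    "score": [0.0, 1.0, 2.0, 3.0],
})

for col in train.columns:
    if train[col].dtype == "object":
        # Fill gaps with the most frequent value, then encode the strings.
        train[col] = train[col].fillna(train[col].mode().iloc[0])
        train[col] = LabelEncoder().fit_transform(train[col])

print(train)
```

Here the missing entry is filled with "User1", the most frequent value, so it receives the same code as the other "User1" rows.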

Answer 2

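I had the same problem, but the given solution did not get rid of the error. The solution I found was to convert the column to str before applying LabelEncoder: train[col] = train[col].astype('str'). This makes everything the same type and removes the error. I don't even think you need to replace the NaN values.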
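A minimal sketch of that conversion, using an invented column that mixes integers and strings the way column 0 in the question appears to:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up column mixing int and str values, which cannot be sorted together.
col = pd.Series([670, "ABC123", "670", 5], dtype=object)

# Convert everything to text first so all values are comparable.
as_text = col.astype(str)

codes = LabelEncoder().fit_transform(as_text)
print(codes)  # [1 2 1 0]: classes sort as "5" < "670" < "ABC123"
```

One caveat: after the conversion the integer 670 and the string "670" share a label, and the ordering is lexicographic rather than numeric. In the pandas versions current when this answer was written, a missing value became the literal text 'nan' and therefore a category of its own, which may or may not be what you want.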
