
sklearn.preprocessing.LabelEncoder TypeError on data set

The data set has 14 columns and approximately 1,011,052 rows. About ten rows are skipped when reading the CSV (the error being: Error tokenizing data. C error: Expected 14 fields in line <...>, saw 15). I am using data.apply(LabelEncoder().fit_transform) to convert strings to numeric labels for use in scikit-learn's fit(...). This use of data.apply(LabelEncoder().fit_transform) is suggested here (https://stackoverflow.com/a/31939145/2178774). (Edit: Note that 670 is the first value.)

data = pd.read_csv('./dm.csv',error_bad_lines=False)
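
For readers on newer pandas: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0, where on_bad_lines="skip" plays the same role. The following is a minimal sketch of that call, not the asker's actual setup. The in-memory CSV and the choice of dtype=str are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical three-field CSV; the second line has one field too many,
# mimicking the "Expected 14 fields ... saw 15" situation on a small scale.
raw = "1,a,x\n2,b,y,EXTRA\n3,c,z\n"

# on_bad_lines="skip" (pandas 1.3+) drops lines that have too many fields.
# dtype=str reads every column as text, so no column ends up with mixed types.
data = pd.read_csv(io.StringIO(raw), header=None, on_bad_lines="skip", dtype=str)

print(data.shape)
```

Reading everything as text avoids mixed-type columns up front; numeric columns can then be converted explicitly once the bad values are known.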

print(X.shape,y.shape)

(1011052, 13) (1011052, 1)

data.apply(LabelEncoder().fit_transform)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-9734848fb589> in <module>()
     19 # y is now: array([2, 0, 1, 3, 2, 0, 1, 3])
     20 
---> 21 data.apply(LabelEncoder().fit_transform)
     22 # TypeError: ("'>' not supported between instances of 'int' and 'str'", 'occurred at index 670')
     23 

/usr/lib64/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   4358                         f, axis,
   4359                         reduce=reduce,
-> 4360                         ignore_failures=ignore_failures)
   4361             else:
   4362                 return self._apply_broadcast(f, axis)

/usr/lib64/python3.6/site-packages/pandas/core/frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
   4454             try:
   4455                 for i, v in enumerate(series_gen):
-> 4456                     results[i] = func(v)
   4457                     keys.append(v.name)
   4458             except Exception as e:

/usr/lib64/python3.6/site-packages/sklearn/preprocessing/label.py in fit_transform(self, y)
    110         """
    111         y = column_or_1d(y, warn=True)
--> 112         self.classes_, y = np.unique(y, return_inverse=True)
    113         return y
    114 

/usr/lib64/python3.6/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts)
    209 
    210     if optional_indices:
--> 211         perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
    212         aux = ar[perm]
    213     else:

TypeError: ("'>' not supported between instances of 'int' and 'str'", 'occurred at index 670')

Edit: read_csv produces the following output:

/usr/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2717: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
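
A mixed-type warning like this can be investigated by counting the Python types actually stored in the column. The sketch below uses a made-up stand-in for column 0, and type_counts is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Hypothetical stand-in for column 0: mostly ints, with one stray string.
col = pd.Series([670, 671, "672a", 673], dtype="object")


def type_counts(series):
    """Count how many values of each Python type a column holds."""
    return series.map(lambda v: type(v).__name__).value_counts().to_dict()


counts = type_counts(col)

# Select the offending values so they can be inspected or cleaned.
bad = col[col.map(lambda v: isinstance(v, str))]

print(counts)
print(bad)
```

Any column that reports more than one type here is a candidate for the sort failure inside LabelEncoder.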

Edit: Added dtype={...} to read_csv, which now results in the type error: TypeError: ("'>' not supported between instances of 'str' and 'int'", 'occurred at index 0').

data = pd.read_csv('./dm.csv',error_bad_lines=False,header=None,dtype={
  0: np.dtype('u8'), # 64-bit unsigned integer
  1: np.dtype('u4'), # 32-bit unsigned integer
  2: np.dtype('U'),  # unicode
  3: np.dtype('U'),  # unicode
  4: np.dtype('U'),  # unicode
  5: np.dtype('U'),  # unicode
  6: np.dtype('u2'), # 16-bit unsigned integer
  7: np.dtype('U'),  # unicode
  8: np.dtype('U'),  # unicode
  9: np.dtype('f2'), # 16-bit floating point
  10:np.dtype('U'),  # unicode
  11:np.dtype('U'),  # unicode
  12:np.dtype('f4'), # 32-bit floating point
  13:np.dtype('U')   # unicode
})

Edit: The type error occurs when using two rows of data. It occurs in the column labeled 8. Row 1, column 8 is "GHI789". Row 2, column 8 is "NaN".

X = data.iloc[0:2,0:14]
print(X)
print('--------')
for col in X.columns:
    print(col)
    print(X.dtypes[col])
    if X.dtypes[col] == "object":
        le = LabelEncoder()
        le.fit_transform(X[col])
        X[col] = le.transform(X[col])

Output:

     0      1           2   \
0  100  138.0  2017-12-31   
1  101   13.0  2017-12-31   

        3         4   \
0  Title1    ABC123   
1  Title2    ABC123

       5    6        7   \
0  User1  0.0   DEF456
1  User2  0.0   DEF456

        8    9      10  \
0  GHI789  0.0  XYZ123   
1     NaN  0.0  XYZ123

        11   12   13  
0  Title11  0.0  NaN  
1  Title22  0.0  NaN  

--------

0
object
1
float64
2
object
3
object
4
object
5
object
6
float64
7
object
8
object

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-70-c94173863fd7> in <module>()
     29     if X.dtypes[col] == "object":
     30         le = LabelEncoder()
---> 31         le.fit_transform(X[col])
     32         X[col] = le.transform(X[col])

/usr/lib64/python3.6/site-packages/sklearn/preprocessing/label.py in fit_transform(self, y)
    110         """
    111         y = column_or_1d(y, warn=True)
--> 112         self.classes_, y = np.unique(y, return_inverse=True)
    113         return y
    114 

/usr/lib64/python3.6/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts)
    209 
    210     if optional_indices:
--> 211         perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
    212         aux = ar[perm]
    213     else:

TypeError: '>' not supported between instances of 'float' and 'str'

Edit: Solution?: "NaN" mixed with strings is an issue. NaN is a float, which is why the sort ends up comparing 'float' and 'str'. The solution is then to replace "NaN" with an empty string, such as: data = data.replace(np.nan, '', regex=True).
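
As a minimal sketch of this idea, the two-row sample can be rebuilt by hand and encoded after filling the missing value. The frame below is a hypothetical reconstruction of columns 7 and 8 only, and fillna("") is used here as a stand-in for the replace call; both are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical reconstruction of columns 7 and 8 from the two-row sample.
X = pd.DataFrame({
    7: ["DEF456", "DEF456"],
    8: ["GHI789", np.nan],
})

encoded = X.copy()
for col in encoded.columns:
    # Only encode non-numeric columns.
    if not pd.api.types.is_numeric_dtype(encoded[col]):
        # Replace NaN (a float) with "" so every value is a str.
        filled = encoded[col].fillna("")
        # Use a fresh encoder per column; fit_transform alone is enough.
        encoded[col] = LabelEncoder().fit_transform(filled)

print(encoded)
```

The empty string becomes its own class, so missing values share one label rather than raising an error.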

Edit: Just noticed two issues with column 9. One: About two hundred rows were empty strings, causing the str-versus-float issue. Two: Another large set were the str "0", which was parsed as either an int or a str, again causing the str-versus-float issue. In the second case, a fix is to perform the following: data[9] = data[9].replace('^0$', 0.0, regex=True).
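
An alternative to the regex replacement, offered only as a sketch, is to convert the whole column with pd.to_numeric. This swaps in a different technique from the one above, and the sample values are hypothetical. With errors="coerce", anything that cannot be parsed as a number, including the empty string, becomes NaN.

```python
import pandas as pd

# Hypothetical sample of column 9: the string "0", an empty string,
# a real float, a numeric string, and an int.
col9 = pd.Series(["0", "", 0.0, "1.5", 2], dtype="object")

# Parse every value as a number; unparseable values become NaN.
cleaned = pd.to_numeric(col9, errors="coerce")

# Optionally decide what a missing value should mean, here 0.0.
filled = cleaned.fillna(0.0)

print(cleaned.dtype)
print(filled.tolist())
```

This handles both issues in one pass, at the cost of treating every unparseable value as missing.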

for col in train.columns:
    if train[col].dtype == 'object':
        # Fill missing values with the column's most frequent value (mode).
        train[col] = train[col].fillna(train[col].mode().iloc[0])

You can fill this type of NaN value by replacing it with the most frequent value (the mode) of the column, as the code above does. I think this will solve the error.
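
Here is a small self-contained sketch of that fill. The train frame and its column names are made up, and the non-numeric check via pd.api.types.is_numeric_dtype is an assumption chosen so the example also works when pandas stores text in a dedicated string dtype rather than object.

```python
import numpy as np
import pandas as pd

# Hypothetical training frame with one text column and one numeric column.
train = pd.DataFrame({
    "city": ["Oslo", np.nan, "Oslo", "Rome"],
    "n": [1.0, 2.0, np.nan, 4.0],
})

for col in train.columns:
    # Only text-like columns are filled; numeric columns are left alone.
    if not pd.api.types.is_numeric_dtype(train[col]):
        most_frequent = train[col].mode().iloc[0]
        train[col] = train[col].fillna(most_frequent)

print(train)
```

Note that mode-filling changes the class distribution, which may or may not be acceptable depending on how many values are missing.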

I had the same problem, but the solutions given did not get rid of the error. The solution I found was to convert the column to str before applying the LabelEncoder: train[col] = train[col].astype('str'). This makes everything the same type and removes the error. I don't even think you need to replace the NaNs.
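
A minimal sketch of this approach, using a hypothetical column that mixes ints and strings:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column mixing a Python int with strings.
train = pd.DataFrame({"code": ["GHI789", 0, "0", "ABC"]}, dtype="object")

# Cast to str first so the encoder only ever compares str with str.
as_text = train["code"].astype("str")

le = LabelEncoder()
train["code_encoded"] = le.fit_transform(as_text)

print(list(le.classes_))
print(train["code_encoded"].tolist())
```

After the cast, the int 0 and the string "0" collapse into the same class. Depending on the pandas version, astype('str') may also turn NaN into the literal string "nan", which would then receive its own label, so it is worth checking le.classes_ after fitting.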


