'KeyError:' when iterating over pandas data frame?
I have two lists, Y_train and Y_test. At the moment they hold categorical data. Each element is either Blue or Green. They are going to be the targets for a Random Forest classifier. I need them encoded as 1.0s and 0.0s.
Here is a print(Y_train) to show you what the data frame looks like. The random numbers down the side are because the data has been shuffled. (Y_test is the same, just smaller):
183 Blue
126 Blue
1 Blue
409 Blue
575 Green
...
396 Blue
192 Blue
578 Green
838 Green
222 Blue
Name: Colour, Length: 896, dtype: object
To encode this I was going to simply loop over them and change each element to its encoded value:
for i in range(len(Y_train)):
    if Y_train[i] == 'Blue':
        Y_train[i] = 0.0
    else:
        Y_train[i] = 1.0
However, when I do this, I get the following:
Traceback (most recent call last):
File "G:\Work\Colours.py", line 90, in <module>
Main()
File "G:\Work\Colours.py", line 34, in Main
RandForest(X_train, Y_train, X_test, Y_test)
File "G:\Work\Colours.py.py", line 77, in RandForest
if Y_train[i] == 'Blue':
File "C:\Users\Me\AppData\Roaming\Python\Python37\site-packages\pandas\core\series.py", line 1068, in __getitem__
result = self.index.get_value(self, key)
File "C:\Users\Me\AppData\Roaming\Python\Python37\site-packages\pandas\core\indexes\base.py", line 4730, in get_value
return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 88, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 992, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 998, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 6
The weird thing is that it produces this error at different times. I've used flags and prints to see how far it gets. Sometimes it will get quite a few iterations into the loop, and other times it will only do one or two iterations before breaking.
I'm assuming I just don't quite understand how you're supposed to iterate over data frames properly. If someone with more experience with this stuff could help me out that would be great.
Try:
Y_train[Y_train == 'Blue']=0.0
Y_train[Y_train == 'Green']=1.0
That should solve your issues.
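For context on why the loop fails: on a Series with an integer index, Y_train[i] looks up the index label i, not the i-th position, and after shuffling and splitting not every label ends up in Y_train. The sketch below uses a small made-up Series in place of the real data (the sample values and helper variable names are assumptions for illustration). It shows the label lookup failing, then two alternatives that do not depend on the labels: .iloc for positional access and Series.map for a loop-free encoding.

```python
import pandas as pd

# Hypothetical stand-in for the shuffled Y_train: the index keeps the
# original row labels, so there is no row labelled 0 here.
Y_train = pd.Series(["Blue", "Blue", "Green", "Blue"],
                    index=[183, 126, 575, 1], name="Colour")

# Label-based lookup (what Y_train[0] does for an integer index):
# label 0 does not exist, so pandas raises KeyError.
try:
    Y_train.loc[0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Position-based lookup works regardless of the index labels.
first_value = Y_train.iloc[0]

# Loop-free encoding: map each category to a float.
encoded = Y_train.map({"Blue": 0.0, "Green": 1.0})
print(encoded)
```

Unlike assigning through a boolean mask on the original object column, map returns a new Series with a float dtype and leaves Y_train unchanged.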
In cases where you have more labels than in your current example (Blue and Green in your case), sklearn provides a label encoder that allows you to do this very easily:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
# Transforms the 'column' in your dataframe df
df['column']= label_encoder.fit_transform(df['column'])
If you are using your own method for label encoding, it is better to create a separate encoded column rather than modifying the original column. After that you can assign the encoded column to your dataframe. As an example for your scenario:
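import numpy as np

# Start with every row encoded as 1 (Green), then set the Blue rows to 0.
encoded = np.ones((Y_train.shape[0], 1))
for i in range(Y_train.shape[0]):
    # .iloc indexes by position; Y_train[i] would look up the shuffled label i.
    if Y_train.iloc[i] == 'Blue':
        encoded[i] = 0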
Note that this will only work if you have two categories. For multiple categories, you can use sklearn or pandas methods.
For multiple categories
Another approach is using pandas cat.codes. You can convert a pandas Series to the category dtype and get the category codes.
Y_train = pd.Series(Y_train)
encoded = Y_train.astype("category").cat.codes
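As a rough illustration of what cat.codes returns, here is a minimal sketch on made-up data. The third category, Red, and the variable names are assumptions added only to show the multi-category case. String categories are ordered lexically by default, and the codes are positions in that ordering.

```python
import pandas as pd

# Hypothetical sample with a shuffled index and three categories.
Y_train = pd.Series(["Blue", "Green", "Blue", "Red"],
                    index=[183, 575, 1, 409])

as_category = Y_train.astype("category")

# Integer code for each row, aligned with the original index.
codes = as_category.cat.codes

# Lookup table from code back to the original label.
mapping = dict(enumerate(as_category.cat.categories))
print(codes.tolist(), mapping)
```

Keeping the mapping around makes it possible to translate predictions back into colour names later.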
You can use sklearn's LabelEncoder to encode categorical data as well.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
encoded = le.fit_transform(Y_train)
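Since the question has both Y_train and Y_test, one way to keep their encodings consistent is to fit the encoder once on the training labels and reuse it for the test labels. The following is a sketch under that assumption, with small made-up lists standing in for the real data.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for the real targets.
y_train = ["Blue", "Green", "Blue", "Green"]
y_test = ["Green", "Blue"]

le = LabelEncoder()

# Learn the label-to-integer mapping from the training labels only.
train_codes = le.fit_transform(y_train)

# Reuse the same mapping for the test labels.
test_codes = le.transform(y_test)

# Convert integer codes back to the original labels.
decoded = le.inverse_transform(test_codes)
print(list(le.classes_), list(train_codes), list(test_codes))
```

LabelEncoder produces integer codes; if floats such as 0.0 and 1.0 are required, the result can be converted with train_codes.astype(float).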