KeyError when trying to shuffle the columns of a DataFrame

Question

Minimal example:
Consider this DataFrame, temp:

>>> import numpy as np
>>> import pandas as pd
>>> temp = pd.DataFrame({"A":[1,2,3,4,5,6,7,8,9,10],"B":[2,3,4,5,6,7,8,9,10,11],"C":[3,4,5,6,7,8,9,10,11,12]})
>>> temp
    A   B   C
0   1   2   3
1   2   3   4
2   3   4   5
3   4   5   6
4   5   6   7
5   6   7   8
6   7   8   9
7   8   9  10
8   9  10  11
9  10  11  12

Now, try shuffling each column, one at a time, in a for loop.

>>> for i in temp.columns:
...     np.random.shuffle(temp.loc[:,i])
...     print(temp)
...
    A   B   C
0   8   2   3
1   3   3   4
2   9   4   5
3   6   5   6
4   4   6   7
5  10   7   8
6   7   8   9
7   1   9  10
8   2  10  11
9   5  11  12
    A   B   C
0   8   7   3
1   3   9   4
2   9   8   5
3   6  10   6
4   4   4   7
5  10  11   8
6   7   5   9
7   1   3  10
8   2   2  11
9   5   6  12
    A   B   C
0   8   7   6
1   3   9   8
2   9   8   4
3   6  10  10
4   4   4   7
5  10  11  11
6   7   5   5
7   1   3   3
8   2   2  12
9   5   6   9

This works perfectly.

Specific example:

Now, if I want part of this DataFrame for training and testing, I use the train_test_split function from sklearn.model_selection.

>>> from sklearn.model_selection import train_test_split
>>> temp = pd.DataFrame({"A":[1,2,3,4,5,6,7,8,9,10],"B":[2,3,4,5,6,7,8,9,10,11],"C":[3,4,5,6,7,8,9,10,11,12]})
>>> y = [i for i in range(16,26)]
>>> len(y)
10
>>> X_train,X_test,y_train,y_test = train_test_split(temp,y,test_size=0.2)
>>> X_train
    A   B   C
2   3   4   5
6   7   8   9
8   9  10  11
0   1   2   3
7   8   9  10
3   4   5   6
1   2   3   4
9  10  11  12

Now that we have X_train, let us shuffle each of its columns:

>>> for i in X_train.columns:
...     np.random.shuffle(X_train.loc[:,i])
...     print(X_train)
...

Unfortunately, this results in an error.

Error:

sys:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
    Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "mtrand.pyx", line 4852, in mtrand.RandomState.shuffle
  File "mtrand.pyx", line 4855, in mtrand.RandomState.shuffle
  File "C:\Users\H.P\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\series.py", line 623, in __getitem__
    result = self.index.get_value(self, key)
  File "C:\Users\H.P\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexes\base.py", line 2560, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas\_libs\index.pyx", line 83, in pandas._libs.index.IndexEngine.get_value
  File "pandas\_libs\index.pyx", line 91, in pandas._libs.index.IndexEngine.get_value
  File "pandas\_libs\index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 811, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 817, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 4

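Tracing the problem and its solution:

Looking up SettingWithCopyWarning, I found this question, whose first answer contains this line:

However, it could create a copy which updates a copy of data['amount'], which you wouldn't see. Then you would be wondering why it is not updating.

But if that is the case, why does the code work in the first case?

The answer also says:

Pandas returns a copy of an object in almost all method calls. The inplace operations are a convenience that works, but in general it is not clear that data is being modified, and they could potentially work on copies.

So, instead of np.random.shuffle, we can use np.random.permutation, as shown in this answer: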

>>> for i in X_train.columns:
...     X_train.loc[:,i] = np.random.permutation(X_train.loc[:,i])
...     print(X_train)
...

However, I get SettingWithCopyWarning again, this time along with the output.

C:\Users\H.P\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexing.py:621: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item_labels[indexer[info_axis]]] = value
    A   B   C
2  10   4   5
6   9   8   9
8   2  10  11
0   8   2   3
7   1   9  10
3   3   5   6
1   4   3   4
9   7  11  12
    A   B   C
2  10   5   5
6   9  11   9
8   2   4  11
0   8   9   3
7   1   3  10
3   3   8   6
1   4  10   4
9   7   2  12
    A   B   C
2  10   5  10
6   9  11   5
8   2   4  11
0   8   9   3
7   1   3   4
3   3   8   6
1   4  10  12
9   7   2   9

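This can serve as a workaround.

Questions:

1. Why does the code work in the first case but not in the second, where I use train_test_split?
2. Why do I still get SettingWithCopyWarning when I am not using the in-place shuffle np.random.shuffle?

Request for suggestions:

Is there a better (easier to use / error-free / faster) way to shuffle columns?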

Answer 1

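1. Why does the code work in the first case but not in the second, where I use train_test_split?

Because train_test_split shuffles the rows and holds some of them out for the test set. The index of X_train is therefore no longer a contiguous range but an arbitrary set of labels.

You can see this by inspecting the index of X_train and of temp: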

X_train.index
Int64Index([6, 8, 9, 5, 0, 2, 3, 4], dtype='int64')

temp.index
RangeIndex(start=0, stop=10, step=1)

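np.random.shuffle swaps elements using integer keys 0 to len(x) - 1, and on a Series with an integer index those keys are looked up as labels, not positions. In the first case the index is RangeIndex(0, 10), so labels and positions coincide and the column can safely be treated as an array. In the second case some of those labels are missing from X_train (label 4 in your traceback), hence KeyError: 4. If you change the code in the second case to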

for i in X_train.columns:
    np.random.shuffle(X_train.loc[:,i].values)
    print(X_train)

the error no longer occurs.
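The difference between label lookup and positional access can be reproduced without scikit-learn. The following is a minimal sketch, not the exact split from the question: it assumes a hand-picked row subset (the variable names `subset`, `col`, and `arr` are illustrative) in place of the random split performed by train_test_split.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({"A": range(1, 11), "B": range(2, 12), "C": range(3, 13)})

# Assumed stand-in for X_train: 8 of the 10 rows in scrambled order.
# Labels 4 and 5 are left out, matching the X_train printed in the question.
subset = temp.loc[[2, 6, 8, 0, 7, 3, 1, 9]]
col = subset["A"]

# np.random.shuffle indexes its argument with integers 0 .. len(x) - 1.
# Collect the integers in that range that are not labels of this Series.
missing = [i for i in range(len(col)) if i not in col.index]

# With an integer index, col[4] is a label lookup, so it fails here.
try:
    col[4]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# The NumPy array carries no labels, so every position 0 .. 7 is valid.
arr = col.to_numpy()
value_at_position_4 = arr[4]

print(missing, lookup_failed, value_at_position_4)
```

Whether shuffling such an array in place also changes the DataFrame depends on whether pandas hands back a view or a copy, which varies between versions and settings. Assigning a permuted array back to the column is the more predictable route.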

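Note that shuffling the way you do gives each column a different permutation, i.e. the values of a single data point (row) no longer stay together.

2. Why do I still get SettingWithCopyWarning when I am not using the in-place shuffle np.random.shuffle?

I do not get the warning with the latest version of pandas (0.22.0).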
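On versions that do emit the warning, a likely cause is that X_train was derived from temp by row selection, so pandas cannot tell whether an assignment would touch a view or a copy. A common remedy is to take an explicit copy before assigning. The sketch below assumes a hand-picked row subset in place of train_test_split and a seeded generator `rng`, both chosen only for illustration.

```python
import warnings

import numpy as np
import pandas as pd

temp = pd.DataFrame({"A": range(1, 11), "B": range(2, 12), "C": range(3, 13)})

# Assumed stand-in for X_train; .copy() makes it independent of temp.
X_train = temp.loc[[2, 6, 8, 0, 7, 3, 1, 9]].copy()

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    for name in X_train.columns:
        # permutation returns a new shuffled array; the assignment
        # replaces the column and leaves the index untouched.
        X_train[name] = rng.permutation(X_train[name].to_numpy())

# Keep only warnings of the kind discussed in the question.
copy_warnings = [w for w in caught
                 if w.category.__name__ == "SettingWithCopyWarning"]

print(X_train)
print(len(copy_warnings))
```

Each column is permuted independently, so the caveat about rows no longer staying together still applies.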

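Request for suggestions:

Is there a better (easier to use / error-free / faster) way to shuffle columns?

I suggest using sample with axis=1, which shuffles the columns. The number of samples should equal the number of columns, i.e. X_train.shape[1]. Note that this reorders the columns themselves rather than the values inside each column.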

X_train = X_train.sample(X_train.shape[1],axis=1)

In []: X_train.sample(X_train.shape[1],axis=1)
Out[]: 
    B   A   C
6   8   7   9
9  11  10  12
8  10   9  11
4   6   5   7
5   7   6   8
0   2   1   3
2   4   3   5
3   5   4   6
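To make the behaviour of sample concrete, here is a small sketch under the assumption that `frac=1` is used in place of an explicit column count and that `random_state=0` is set only for repeatability. It contrasts sampling along the column axis with sampling along the row axis. In both cases the values of each row stay together, unlike the per-column shuffle in the question.

```python
import pandas as pd

temp = pd.DataFrame({"A": range(1, 11), "B": range(2, 12), "C": range(3, 13)})

# frac=1 requests every column; axis=1 samples along the column axis,
# so only the order of the columns can change.
cols_shuffled = temp.sample(frac=1, axis=1, random_state=0)

# The same call along the default axis reorders the rows instead.
rows_shuffled = temp.sample(frac=1, random_state=0)

# Undoing the reordering recovers the original frame in both cases,
# which shows that no row had its (A, B, C) values mixed with another row.
same_cols = cols_shuffled[temp.columns].equals(temp)
same_rows = rows_shuffled.sort_index().equals(temp)

print(list(cols_shuffled.columns))
print(list(rows_shuffled.index))
print(same_cols, same_rows)
```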

Answer 2

I ran into this problem with train_test_split too. I used this instead:

np.random.shuffle(x.iloc[:, i].values)

I am not sure why it works, but it seems to solve the problem.
