
StratifiedKFold over heterogeneous DataFrame

I have a pandas DataFrame containing string and float columns, and I need to split it into balanced slices in order to train a sklearn pipeline.

Ideally I'd use StratifiedKFold over the DataFrame to get smaller chunks of data to cross-validate. But it complains that I have unorderable types, like this:

import pandas as pd
from sklearn.cross_validation import StratifiedKFold

dataset = pd.DataFrame(
    [
        {'title': 'Dábale arroz a la zorra el abad', 'size':1.2, 'target': 1},
        {'title': 'Ana lleva al oso la avellana', 'size':1.0, 'target': 1},
        {'title': 'No te enrollé yornetón', 'size':1.4, 'target': 0},
        {'title': 'Acá sólo tito lo saca', 'size':1.4, 'target': 0},
    ])
skfs = StratifiedKFold(dataset, n_folds=2)

>>>  TypeError: unorderable types: str() > float()

There are ways to get fold indices and slice the DataFrame with them, but I don't think that guarantees my classes will be balanced.

What's the best method to split my DataFrame?

StratifiedKFold takes the number of splits in its constructor, and its .split() method uses the class distribution of the labels to stratify the samples. Assuming your label column is target, you would do the following:

from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=2)
X=dataset.drop('target', axis=1)
y=dataset.target
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
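
To address the worry about class balance, here is a minimal sketch that counts the classes in each test fold. It reuses the four-row dataset from the question; the fold_counts list is a name introduced here purely for illustration.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Same toy data as in the question: two samples per class.
dataset = pd.DataFrame(
    [
        {'title': 'Dábale arroz a la zorra el abad', 'size': 1.2, 'target': 1},
        {'title': 'Ana lleva al oso la avellana', 'size': 1.0, 'target': 1},
        {'title': 'No te enrollé yornetón', 'size': 1.4, 'target': 0},
        {'title': 'Acá sólo tito lo saca', 'size': 1.4, 'target': 0},
    ])

X = dataset.drop('target', axis=1)
y = dataset['target']

skf = StratifiedKFold(n_splits=2)
fold_counts = []  # illustrative name, not part of the original answer
for train_index, test_index in skf.split(X, y):
    # Count how many samples of each class landed in this test fold.
    counts = y.iloc[test_index].value_counts().sort_index().to_dict()
    fold_counts.append(counts)
    print(counts)
```

With two samples per class and n_splits=2, each test fold ends up with one sample of each class, so the class proportions of the full dataset are preserved in every fold.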

sklearn.cross_validation.StratifiedKFold has been deprecated since version 0.18 and will be removed in 0.20. So here is an alternative approach:

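import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=2)
t = dataset.target
# split() only needs the number of samples from X, so a placeholder array works
for train_index, test_index in skf.split(np.zeros(len(t)), t):
    # split() yields positional indices, so select rows with .iloc
    train = dataset.iloc[train_index]
    test = dataset.iloc[test_index]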
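
Since the goal in the question is to train a sklearn pipeline, the splitter can also be passed directly as the cv argument instead of slicing by hand. The following is only a sketch under assumptions: the choice of ColumnTransformer, CountVectorizer and LogisticRegression is mine and does not come from the question, and with four rows the scores are not meaningful.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

dataset = pd.DataFrame(
    [
        {'title': 'Dábale arroz a la zorra el abad', 'size': 1.2, 'target': 1},
        {'title': 'Ana lleva al oso la avellana', 'size': 1.0, 'target': 1},
        {'title': 'No te enrollé yornetón', 'size': 1.4, 'target': 0},
        {'title': 'Acá sólo tito lo saca', 'size': 1.4, 'target': 0},
    ])
X = dataset.drop('target', axis=1)
y = dataset['target']

# Assumed preprocessing: bag-of-words for the string column,
# the float column passed through unchanged.
preprocess = ColumnTransformer([
    ('text', CountVectorizer(), 'title'),   # scalar name -> 1-D input
    ('num', 'passthrough', ['size']),       # list of names -> 2-D input
])
pipeline = Pipeline([
    ('prep', preprocess),
    ('clf', LogisticRegression()),          # assumed estimator
])

# The splitter stratifies on y for every fold it generates.
skf = StratifiedKFold(n_splits=2)
scores = cross_val_score(pipeline, X, y, cv=skf)
print(scores)
```

Handling the mixed column types inside the pipeline means the DataFrame can stay heterogeneous, and the preprocessing is refit on the training part of each fold.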
