Set a pandas dataframe equal to the values not in another dataframe

Question

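I have a pandas dataframe that I want to split into training and test dataframes for a data analysis exercise. It is an accidents database with three accident severity levels: 1, 2, and 3. I want to write the same number of samples for each level to the training dataframe, and then write every sample that was not added there to the test dataframe. The training dataframe works fine, but the test dataframe does not.

My code is below.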

import math

def split_df(dataframe, train_df, test_df, val_low, val_high, sample_size):
    # accidents is the full dataframe, defined outside this function
    for i in range(val_low, val_high):
        if i == val_low:
            dataframe = accidents.loc[accidents['Accident_Severity'] == i].sample(n=sample_size)
            train_df = accidents.loc[accidents['Accident_Severity'] == i].sample(n=math.trunc(sample_size * 0.7))
        else:
            dataframe = dataframe.append(accidents.loc[accidents['Accident_Severity'] == i].sample(n=sample_size))
            train_df = train_df.append(accidents.loc[accidents['Accident_Severity'] == i].sample(n=math.trunc(sample_size * 0.7)))

    test_df = accidents[~train_df]  # (This is the problem - how do I write the values not in the train_df dataframe to the test_df?)

    return dataframe, train_df, test_df

So test_df should be everything that is not in train_df.

Answer 1

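You have already found the rows you want to keep, so I will start from there. Every row in a pandas dataframe has an index label, which is unique by default, and rows sampled from accidents keep the labels they had there. The index attribute of a dataframe gives you the labels present in the training set. The following line finds the labels that are in accidents (the full dataframe) but missing from train_df.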

index = accidents.index.difference(train_df.index)

The next step is to select the rows with these labels from the accidents dataframe.
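Both steps can be sketched together as follows. This is only an illustration: the small `accidents` frame, the `Speed_limit` column, and the sample size of 2 are made up for the example, and it assumes the index labels of `accidents` are unique. It uses `.loc` for the selection, which is one way to do it.

```python
import pandas as pd

# Hypothetical stand-in for the accidents data: 12 rows, 4 per severity level.
accidents = pd.DataFrame({
    "Accident_Severity": [1, 2, 3] * 4,
    "Speed_limit": list(range(12)),  # made-up column, not from the question
})

# Draw 2 rows per severity level; sampled rows keep their original index labels.
train_df = pd.concat([
    accidents.loc[accidents["Accident_Severity"] == level].sample(n=2, random_state=0)
    for level in (1, 2, 3)
])

# Labels present in accidents but absent from train_df.
index = accidents.index.difference(train_df.index)

# Select the rows with those labels from the full dataframe.
test_df = accidents.loc[index]

print(len(train_df), len(test_df))
```

Together, train_df and test_df cover every row of accidents exactly once, because the split is made on index labels rather than on column values.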

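Note: pandas lets you reindex a dataframe (for example with reset_index). If you do that to either dataframe before comparing the indexes, do not be surprised when this approach stops working. It relies on train_df still carrying the index labels it inherited from accidents, so the two indexes are not independent of each other.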
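A small sketch of how this can go wrong, assuming a made-up six-row frame and taking `reset_index(drop=True)` as the reindexing step (the note itself does not name a specific method):

```python
import pandas as pd

accidents = pd.DataFrame({"Accident_Severity": [1, 2, 3, 1, 2, 3]})

# Use the last three rows as the training set; their labels are 3, 4, 5.
train_df = accidents.iloc[3:]

# Labels still match accidents, so rows 0, 1, 2 are left for testing.
good = accidents.index.difference(train_df.index)

# After reset_index the training labels become 0, 1, 2 and no longer
# identify the rows they came from.
renumbered = train_df.reset_index(drop=True)
bad = accidents.index.difference(renumbered.index)

print(list(good))  # [0, 1, 2]
print(list(bad))   # [3, 4, 5]
```

In the second case the "test" labels are exactly the training rows, so the two sets overlap completely without any error being raised.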
