
Subset data-frame extended/sliced to match original data-frame columns

The Problem:

I trained a classifier on a dataset that has more features than the test data. For example, my original dataset covers all 7 days of the week, Monday through Sunday, whereas in the test dataset every single observation happens to fall on a Thursday (so I have 6 fewer features). Hence, when I run predict(), I get an error saying that the number of features does not match. The missing features are the ones that get_dummies() did not create for the test data:

Day_of_the_week_is_monday, Day_of_the_week_is_tuesday, ... 

Ideally, I would like to perform data cleaning and do the following:

  • Automatically create the missing columns, filled with 0s. (Since is_Thursday will be all 1s, the rest should be all 0s.)
  • Remove any 'extra' columns in the subset dataframe that are not present in the original training data. For example, get_dummies() might create more levels in the subset dataframe, which I would like to remove.
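If the full set of levels is known before encoding, one possible way to get both behaviours at the get_dummies() step is to declare the categories explicitly. The sketch below is only an illustration: `ALL_DAYS` and `test` are hypothetical names, and the five weekdays stand in for whatever levels the training data actually contained.

```python
import pandas as pd

# Assumption: the levels seen in training are known ahead of time.
ALL_DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]

test = pd.DataFrame({
    "Result": [1, 2, 2, 10],
    "Day_of_the_week": ["Thursday", "Thursday", "Thursday", "Saturday"],
})

# Values outside ALL_DAYS (here "Saturday") become NaN in the categorical.
test["Day_of_the_week"] = pd.Categorical(test["Day_of_the_week"], categories=ALL_DAYS)

# get_dummies emits one column per declared category, observed or not;
# rows whose value is NaN get 0 in every dummy column.
test_dummies = pd.get_dummies(test, prefix=["Day_of_the_week_is"], dtype=int)
print(test_dummies)
```

With this approach no 'extra' column is ever created for an unseen level, but the row that held it ends up with 0 in every day column, so it may deserve separate handling.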

Reproducible Example


import numpy as np
import pandas as pd

### full (training) dataframe

dataframe = pd.DataFrame({
    'Result': np.array([1, 2, 2, 10, 100], dtype='int32'),
    'Day_of_the_week': pd.Categorical(["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]),
})

dataframe_dummies = pd.get_dummies(dataframe, prefix=['Day_of_the_week_is'])

### get subset dataframe

dataframe_subset = pd.DataFrame({
    'Result': np.array([1, 2, 2, 10], dtype='int32'),
    'Day_of_the_week': pd.Categorical(["Thursday", "Thursday", "Thursday", "Saturday"]),
})

dataframe_subset_dummies = pd.get_dummies(dataframe_subset, prefix=['Day_of_the_week_is'])

The main dataset looks like this:

   Result  Is_Friday  Is_Monday  Is_Thursday  Is_Tuesday  Is_Wednesday
0       1          0          1            0           0             0
1       2          0          0            0           1             0
2       2          0          0            0           0             1
3      10          0          0            1           0             0
4     100          1          0            0           0             0

Subset dataframe:

   Result  Day_is_Saturday  Day_is_Thursday
0       1                0                1
1       2                0                1
2       2                0                1
3      10                1                0
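To make the mismatch between the two frames explicit, the column sets can be compared directly. This is a minimal sketch that rebuilds the example data under the hypothetical names `train` and `test`; the labels it prints are the full ones get_dummies() produces with the 'Day_of_the_week_is' prefix, which are longer than the abbreviated headers in the tables above.

```python
import pandas as pd

train = pd.DataFrame({
    "Result": [1, 2, 2, 10, 100],
    "Day_of_the_week": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
})
test = pd.DataFrame({
    "Result": [1, 2, 2, 10],
    "Day_of_the_week": ["Thursday", "Thursday", "Thursday", "Saturday"],
})

train_dummies = pd.get_dummies(train, prefix=["Day_of_the_week_is"])
test_dummies = pd.get_dummies(test, prefix=["Day_of_the_week_is"])

# Columns the training frame has but the test frame lacks.
missing = train_dummies.columns.difference(test_dummies.columns)
# Columns that only exist in the test frame.
extra = test_dummies.columns.difference(train_dummies.columns)

print("missing:", list(missing))
print("extra:", list(extra))
```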

What has to be done:

1) Remove is_Saturday, because it is not in the original data.

2) Add the remaining columns, filled with 0s.

I can do it manually, but that seems very cumbersome. Is there a function that can do this for me, e.g. one that extends the subset dataframe to match the main dataset, or removes columns to match the main data?

A simple loop and check should do the trick, first adding the missing columns and then deleting the extra ones:

In [15]: import pandas as pd

In [16]: a = pd.DataFrame([[1,2,3],[2,3,4]], columns=['A', 'B', 'E'])

In [17]: b = pd.DataFrame([[3,4,5],[4,5,6]], columns=['A', 'B', 'C'])

In [18]: for col in b.columns:
    ...:     if col not in a:
    ...:         a[col] = 0
    ...:

In [19]: for col in a.columns:
    ...:     if col not in b:
    ...:         del a[col]
    ...:

In [20]: a
Out[20]:
   A  B  C
0  1  2  0
1  2  3  0
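As a possible alternative to the two loops, DataFrame.reindex can do both steps in one call. The sketch below reuses the same `a` and `b`; the name `aligned` is introduced here only for illustration.

```python
import pandas as pd

a = pd.DataFrame([[1, 2, 3], [2, 3, 4]], columns=["A", "B", "E"])
b = pd.DataFrame([[3, 4, 5], [4, 5, 6]], columns=["A", "B", "C"])

# Keep only b's columns, in b's order; columns absent from a are filled with 0.
aligned = a.reindex(columns=b.columns, fill_value=0)
print(aligned)
```

Unlike the loops, this returns a new frame and leaves `a` untouched. It also puts the columns in the same order as `b`, which matters if the model consumes features by position.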
