Subset data-frame extended/sliced to match original data-frame columns
The Problem:
I trained a classifier on a dataset that has more features than the test data. For example, my original dataset covers all 7 days of the week, Monday-Sunday, whereas in the test dataset every single observation happens to be on Thursday (thus I have 6 fewer features). Hence, when I run predict(), I get an error that the number of features does not match. The missing features are the ones that get_dummies() did not create:
Day_of_the_week_is_monday, Day_of_the_week_is_tuesday, ...
Ideally, I would like to perform data cleaning and do the following:
Reproducible Example
import numpy as np
import pandas as pd

# Full (training) dataset: several different weekdays
dataframe = pd.DataFrame({
    'Result': np.array([1, 2, 2, 10, 100], dtype='int32'),
    'Day_of_the_week': pd.Categorical(["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]),
})
dataframe_dummies = pd.get_dummies(dataframe, prefix=['Day_of_the_week_is'])

# Subset (test) dataset: mostly Thursday, plus a Saturday that the full data never saw
dataframe_subset = pd.DataFrame({
    'Result': np.array([1, 2, 2, 10], dtype='int32'),
    'Day_of_the_week': pd.Categorical(["Thursday", "Thursday", "Thursday", "Saturday"]),
})
dataframe_subset_dummies = pd.get_dummies(dataframe_subset, prefix=['Day_of_the_week_is'])
Main dataset looks like:
   Result  Is_Friday  Is_Monday  Is_Thursday  Is_Tuesday  Is_Wednesday
0       1          0          1            0           0             0
1       2          0          0            0           1             0
2       2          0          0            0           0             1
3      10          0          0            1           0             0
4     100          1          0            0           0             0
Subset Dataframe
   Result  Day_is_Saturday  Day_is_Thursday
0       1                0                1
1       2                0                1
2       2                0                1
3      10                1                0
What has to be done:
1) Remove is_Saturday because it's not in the original data.
2) Add the remaining columns, filled with 0s.
I can do it manually, but that seems very troublesome. Is there a function that can do this for me, e.g. one that extends the subset dataframe to match the main dataset, or removes columns to match the main data?
A simple loop and check should do the trick: add the columns that are missing, then delete the columns that do not belong:
In [16]: a = pd.DataFrame([[1,2,3],[2,3,4]], columns=['A', 'B', 'E'])

In [17]: b = pd.DataFrame([[3,4,5],[4,5,6]], columns=['A', 'B', 'C'])

In [18]: for col in b.columns:
    ...:     if col not in a:
    ...:         a[col] = 0
    ...:

In [19]: for col in a.columns:
    ...:     if col not in b:
    ...:         del a[col]
    ...:

In [20]: a
Out[20]:
   A  B  C
0  1  2  0
1  2  3  0
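The two loops can probably be collapsed into a single call to DataFrame.reindex, which keeps only the requested columns and fills any that are absent. The following is a minimal sketch applied to the data from the question, not a drop-in replacement. The helper name match_columns is hypothetical, and dtype=int is an assumption added so that the dummy columns and the filled-in zeros share an integer type.

```python
import numpy as np
import pandas as pd


def match_columns(subset: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    """Return `subset` with exactly the columns of `reference`, in the same order.

    Columns that `subset` lacks are added and filled with 0; columns that
    `reference` does not have are dropped. (Hypothetical helper name.)
    """
    return subset.reindex(columns=reference.columns, fill_value=0)


# Full dataset from the question
dataframe = pd.DataFrame({
    'Result': np.array([1, 2, 2, 10, 100], dtype='int32'),
    'Day_of_the_week': pd.Categorical(
        ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]),
})
# dtype=int is an assumption: it keeps the dummies as integers on every pandas version
dataframe_dummies = pd.get_dummies(
    dataframe, prefix=['Day_of_the_week_is'], dtype=int)

# Subset from the question: only Thursday, plus an unseen Saturday
dataframe_subset = pd.DataFrame({
    'Result': np.array([1, 2, 2, 10], dtype='int32'),
    'Day_of_the_week': pd.Categorical(
        ["Thursday", "Thursday", "Thursday", "Saturday"]),
})
dataframe_subset_dummies = pd.get_dummies(
    dataframe_subset, prefix=['Day_of_the_week_is'], dtype=int)

# Drop the Saturday column and add the missing weekday columns as zeros
aligned = match_columns(dataframe_subset_dummies, dataframe_dummies)
print(aligned)
```

Because the column order is taken from the reference frame, the result should line up with the feature order the classifier was trained on. Note that the Saturday row is kept with all-zero day indicators, since only the column is removed.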