
How to make features using featuretools for new data (on which we want to make predictions)

I have a single dataframe and want to use featuretools for the automated feature engineering step. I am able to do it with the normalize_entity function. The code snippet is below:

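import featuretools as ft

# Build an entity set from the training dataframe; "Id" is created as the index
es = ft.EntitySet(id='obs_data')
es = es.entity_from_dataframe(entity_id='obs', dataframe=X_train,
                              variable_types=variable_types, make_index=True, index='Id')
# Create one parent entity per interaction column (columns found using xgbfir)
for feat in interaction:
    es = es.normalize_entity(base_entity_id='obs', new_entity_id=feat, index=feat)
# Run deep feature synthesis; returns the feature matrix and the feature definitions
features, feature_names = ft.dfs(entityset=es,
                                 target_entity='obs',
                                 max_depth=2)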

This creates the features. Now I want to do the same thing for X_test. I read blogs on this, and they suggest combining X_train and X_test and then running the same process. Suppose there are 5 observations in X_test: if I combine it with X_train, the features of each observation from X_test will also be affected by the other 4 observations in X_test, which is not a good idea. Can anyone suggest how to do feature engineering with featuretools for new data?
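To make the concern concrete, here is a minimal sketch in plain pandas (not featuretools). The column names `group` and `value` and the toy numbers are made up for illustration. It computes a per-group mean, similar in spirit to the aggregations built after normalize_entity, once from the training rows only and once from X_train and X_test combined.

```python
import pandas as pd

# Toy data; column names and values are hypothetical.
X_train = pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 3.0, 10.0]})
X_test = pd.DataFrame({"group": ["a", "a"], "value": [100.0, 200.0]})

# Per-group mean computed from the training rows only.
train_means = X_train.groupby("group")["value"].mean()

# Per-group mean when train and test rows are stacked together.
combined = pd.concat([X_train, X_test], ignore_index=True)
combined_means = combined.groupby("group")["value"].mean()

# The resulting feature for the test rows under each approach.
feat_train_only = X_test["group"].map(train_means)
feat_combined = X_test["group"].map(combined_means)

print(feat_train_only.tolist())  # [2.0, 2.0]
print(feat_combined.tolist())    # [76.0, 76.0]
```

In the combined version, the feature of each test row changes depending on which other test rows happen to be present, which is the effect described above.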

You can try using cutoff times, which specify the last point in time at which data can be used for a feature calculation. The labels can be passed along with the cutoff times to ensure that they stay aligned with the feature matrix. Then, you can split the feature matrix into X_train and X_test.

With new data, the normalization should be repeatable so that the entity set has the same structure. Then, you can calculate features with cutoff times as usual. You may also want to look into Compose, which automatically generates the cutoff times based on how you define the prediction problem. If cutoff times don't work in your use case, I will need more details to better understand how each observation would affect the others. Let me know if this helps.
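The idea behind cutoff times can be sketched in plain pandas. This is only an illustration of the concept, not how featuretools implements it. The table, the column names (`customer_id`, `time`, `amount`, `cutoff`) and the values are hypothetical, and the sketch assumes that rows at or before the cutoff are allowed.

```python
import pandas as pd

# Toy event log; names and values are hypothetical.
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "time": pd.to_datetime(["2021-01-01", "2021-01-05", "2021-01-09",
                            "2021-01-02", "2021-01-08"]),
    "amount": [10.0, 20.0, 40.0, 5.0, 7.0],
})

# One cutoff time per instance we want features for.
cutoffs = pd.DataFrame({
    "customer_id": [1, 2],
    "cutoff": pd.to_datetime(["2021-01-06", "2021-01-08"]),
})

# Keep only events at or before each instance's cutoff, then aggregate.
merged = events.merge(cutoffs, on="customer_id")
visible = merged[merged["time"] <= merged["cutoff"]]
total_amount = visible.groupby("customer_id")["amount"].sum()

print(total_amount.tolist())  # [30.0, 12.0]
```

The event on 2021-01-09 falls after customer 1's cutoff, so it does not contribute to that customer's feature.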

This is possible with calculate_feature_matrix() in featuretools. You can find a detailed guide in its documentation: https://docs.featuretools.com/en/stable/guides/deployment.html#calculating-feature-matrix-for-new-data

Suppose the new data is X_test. If it is a dataframe, you should create an entity set for it.

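es_test = ft.EntitySet(id='obs_data_test')  # separate entity set, so the training entity set is not modified
es_test = es_test.entity_from_dataframe(entity_id='obs', dataframe=X_test)  # same entity_id as in training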

Otherwise, if it is already an entity set, you can skip the previous step. Suppose your test entity set is es_test and the list of features returned by ft.dfs for the training data is feature_names. By reusing the training data's features, you can create a new feature matrix for the test data.

test_feat_generated = ft.calculate_feature_matrix(feature_names, entityset=es_test)

To reuse feature_names later, you can look at the save_features() and load_features() functions.

Note: The train and test entities should have the same entity_id; otherwise you will get an error.

