FeatureTools: Dealing with many-to-many relationships
I have a dataframe of purchases with multiple columns, including the three below:
PURCHASE_ID (index of purchase)
WORKER_ID (index of worker)
ACCOUNT_ID (index of account)
A worker can have multiple accounts associated with them, and an account can have multiple workers.
If I create WORKER and ACCOUNT entities and add the relationships, I get an error:
KeyError: 'Variable: ACCOUNT_ID not found in entity'
Here is my code so far:
import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes
d = {'PURCHASE_ID': [1, 2],
     'WORKER_ID': [0, 0],
     'ACCOUNT_ID': [1, 2],
     'COST': [5, 10],
     'PURCHASE_TIME': ['2018-01-01 01:00:00', '2016-01-01 02:00:00']}
df = pd.DataFrame(data=d)
data_variable_types = {'PURCHASE_ID': vtypes.Id,
                       'WORKER_ID': vtypes.Id,
                       'ACCOUNT_ID': vtypes.Id,
                       'COST': vtypes.Numeric,
                       'PURCHASE_TIME': vtypes.Datetime}
es = ft.EntitySet('Purchase')
es = es.entity_from_dataframe(entity_id='purchases',
                              dataframe=df,
                              index='PURCHASE_ID',
                              time_index='PURCHASE_TIME',
                              variable_types=data_variable_types)
es.normalize_entity(base_entity_id='purchases',
                    new_entity_id='workers',
                    index='WORKER_ID',
                    additional_variables=['ACCOUNT_ID'],
                    make_time_index=False)
es.normalize_entity(base_entity_id='purchases',
                    new_entity_id='accounts',
                    index='ACCOUNT_ID',
                    additional_variables=['WORKER_ID'],
                    make_time_index=False)
fm, features = ft.dfs(entityset=es,
                      target_entity='purchases',
                      agg_primitives=['mean'],
                      trans_primitives=[],
                      verbose=True)
features
How do I separate the entities to include many-to-many relationships?
Your approach is correct; however, you don't need to use the additional_variables argument. If you omit it, your code will run without issues.
The purpose of the additional_variables argument to EntitySet.normalize_entity is to include other variables you want in the new parent entity you are creating. As I understand it, the variables listed there are moved out of the base entity, which is why the second normalize_entity call can no longer find ACCOUNT_ID in purchases. For example, say you had variables about a hire date, salary, location, etc. You would put those as additional variables because they are static with respect to a worker. In this case, I don't think you have any variables like that.
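One way to see why the original code fails is a toy model in plain Python. This is only a sketch: `normalize` below is a hypothetical stand-in, not the featuretools API, and it assumes that additional variables are moved (not copied) from the base table into the new parent table.

```python
def normalize(base, index, additional=()):
    """Toy stand-in for normalize_entity (hypothetical helper, not the real API).

    `base` maps column name -> list of values. Returns a parent table keyed by
    `index` and moves the `additional` columns out of `base`.
    """
    for col in (index, *additional):
        if col not in base:
            raise KeyError(f"Variable: {col} not found in entity")
    parent = {}
    for row, key in enumerate(base[index]):
        # Keep the first value seen for each parent key.
        parent.setdefault(key, {col: base[col][row] for col in additional})
    for col in additional:
        del base[col]  # moved to the parent, so gone from the base table
    return parent


purchases = {
    "PURCHASE_ID": [1, 2],
    "WORKER_ID": [0, 0],
    "ACCOUNT_ID": [1, 2],
    "COST": [5, 10],
}

# First call succeeds, but ACCOUNT_ID leaves the purchases table.
workers = normalize(purchases, "WORKER_ID", additional=["ACCOUNT_ID"])

# Second call fails because purchases no longer has ACCOUNT_ID.
try:
    normalize(purchases, "ACCOUNT_ID", additional=["WORKER_ID"])
    error = None
except KeyError as exc:
    error = exc.args[0]
print(error)
```

The toy model also shows a second problem: a worker row can hold only one ACCOUNT_ID, so worker 0 keeps account 1 and the link to account 2 is lost. Keeping both IDs on the purchases table avoids this, because each purchase row then acts as the link between one worker and one account.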
Here is the code and output I see:
import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes
d = {'PURCHASE_ID': [1, 2],
     'WORKER_ID': [0, 0],
     'ACCOUNT_ID': [1, 2],
     'COST': [5, 10],
     'PURCHASE_TIME': ['2018-01-01 01:00:00', '2016-01-01 02:00:00']}
df = pd.DataFrame(data=d)
data_variable_types = {'PURCHASE_ID': vtypes.Id,
                       'WORKER_ID': vtypes.Id,
                       'ACCOUNT_ID': vtypes.Id,
                       'COST': vtypes.Numeric,
                       'PURCHASE_TIME': vtypes.Datetime}
es = ft.EntitySet('Purchase')
es = es.entity_from_dataframe(entity_id='purchases',
                              dataframe=df,
                              index='PURCHASE_ID',
                              time_index='PURCHASE_TIME',
                              variable_types=data_variable_types)
es.normalize_entity(base_entity_id='purchases',
                    new_entity_id='workers',
                    index='WORKER_ID',
                    make_time_index=False)
es.normalize_entity(base_entity_id='purchases',
                    new_entity_id='accounts',
                    index='ACCOUNT_ID',
                    make_time_index=False)
fm, features = ft.dfs(entityset=es,
                      target_entity='purchases',
                      agg_primitives=['mean'],
                      trans_primitives=[],
                      verbose=True)
features
This outputs:
[<Feature: WORKER_ID>,
<Feature: ACCOUNT_ID>,
<Feature: COST>,
<Feature: workers.MEAN(purchases.COST)>,
<Feature: accounts.MEAN(purchases.COST)>]
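As a sanity check on what the two aggregated features mean, the same quantities can be computed by hand. The sketch below uses plain Python rather than featuretools, and the variable names are illustrative only.

```python
from collections import defaultdict
from statistics import mean

# Rows from the example data: (WORKER_ID, ACCOUNT_ID, COST).
rows = [(0, 1, 5), (0, 2, 10)]

cost_by_worker = defaultdict(list)
cost_by_account = defaultdict(list)
for worker, account, cost in rows:
    cost_by_worker[worker].append(cost)
    cost_by_account[account].append(cost)

# workers.MEAN(purchases.COST): average cost over each worker's purchases.
workers_mean_cost = {k: mean(v) for k, v in cost_by_worker.items()}
# accounts.MEAN(purchases.COST): average cost over each account's purchases.
accounts_mean_cost = {k: mean(v) for k, v in cost_by_account.items()}
print(workers_mean_cost, accounts_mean_cost)
```

Each purchase row then picks up the value belonging to its own worker and its own account.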
If we change the target entity and increase the depth:
fm, features = ft.dfs(entityset=es,
                      target_entity='workers',
                      agg_primitives=['mean', 'count'],
                      max_depth=3,
                      trans_primitives=[],
                      verbose=True)
features
the output is now features for the workers entity:
[<Feature: COUNT(purchases)>,
<Feature: MEAN(purchases.COST)>,
<Feature: MEAN(purchases.accounts.MEAN(purchases.COST))>,
<Feature: MEAN(purchases.accounts.COUNT(purchases))>]
Let's explain the feature named MEAN(purchases.accounts.COUNT(purchases)). Reading it from the inside out: for each purchase made by a worker, look up the account involved and count how many purchases that account has, then average those counts over all of the worker's purchases.
In other words, "what is the average number of purchases made by accounts related to purchases made by this worker".
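To make that reading concrete, here is a minimal plain-Python sketch that computes the same quantity by hand. It does not use featuretools, and the third purchase row is a hypothetical addition (not in the original data) so that the accounts have different purchase counts.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical purchases: (PURCHASE_ID, WORKER_ID, ACCOUNT_ID).
# The third row is added for illustration only.
purchases = [(1, 0, 1), (2, 0, 2), (3, 1, 2)]

# accounts.COUNT(purchases): number of purchases per account.
account_counts = Counter(account for _, _, account in purchases)

# For each worker, collect that count once per purchase the worker made.
per_worker = defaultdict(list)
for _, worker, account in purchases:
    per_worker[worker].append(account_counts[account])

# MEAN(purchases.accounts.COUNT(purchases)) for each worker.
worker_feature = {worker: mean(counts) for worker, counts in per_worker.items()}
print(worker_feature)
```

Worker 0 bought through account 1 (one purchase) and account 2 (two purchases), so the feature averages 1 and 2. Worker 1 only used account 2, so the feature is simply that account's count.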