
FeatureTools: Dealing with many-to-many relationships

I have a dataframe of purchases with multiple columns, including the three below:

 PURCHASE_ID (index of purchase)
 WORKER_ID (index of worker)
 ACCOUNT_ID (index of account)

A worker can have multiple accounts associated with them, and an account can have multiple workers.

If I create WORKER and ACCOUNT entities and add the relationships, I get an error:

KeyError: 'Variable: ACCOUNT_ID not found in entity'

Here is my code so far:

import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes

d = {'PURCHASE_ID': [1, 2], 
     'WORKER_ID': [0, 0], 
     'ACCOUNT_ID': [1, 2], 
     'COST': [5, 10], 
     'PURCHASE_TIME': ['2018-01-01 01:00:00', '2016-01-01 02:00:00']}
df = pd.DataFrame(data=d)

data_variable_types = {'PURCHASE_ID': vtypes.Id,
                       'WORKER_ID': vtypes.Id,
                       'ACCOUNT_ID': vtypes.Id,
                       'COST': vtypes.Numeric,
                       'PURCHASE_TIME': vtypes.Datetime}

es = ft.EntitySet('Purchase')
es = es.entity_from_dataframe(entity_id='purchases',
                               dataframe=df,
                               index='PURCHASE_ID',
                               time_index='PURCHASE_TIME',
                               variable_types=data_variable_types)

es.normalize_entity(base_entity_id='purchases',
                   new_entity_id='workers',
                   index='WORKER_ID',
                   additional_variables=['ACCOUNT_ID'],
                   make_time_index=False)

es.normalize_entity(base_entity_id='purchases',
                   new_entity_id='accounts',
                   index='ACCOUNT_ID',
                   additional_variables=['WORKER_ID'],
                   make_time_index=False)

fm, features = ft.dfs(entityset=es,
                     target_entity='purchases',
                     agg_primitives=['mean'],
                     trans_primitives=[],
                     verbose=True)
features

How do I separate the entities to include many-to-many relationships?

Your approach is correct; however, you don't need to use the additional_variables argument. If you omit it, your code will run without issues.

The purpose of the additional_variables argument to EntitySet.normalize_entity is to include other variables you want in the new parent entity you are creating. For example, say you had variables for a worker's hire date, salary, location, etc. You would pass those as additional variables because they are static with respect to a worker. In this case, I don't think you have any variables like that.
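One way to decide whether a column belongs in additional_variables is to check that it takes a single value per parent index. The pandas-only sketch below does that check; the SALARY column and the is_static_per helper are hypothetical, introduced here purely for illustration. As far as I understand it, additional_variables also moves the listed columns out of the base entity, which would explain the KeyError in the question: once ACCOUNT_ID has been moved into workers, the second normalize_entity call can no longer find it in purchases.

```python
import pandas as pd

# Hypothetical data: SALARY is invented for illustration and is not
# part of the original question.
df = pd.DataFrame({
    'PURCHASE_ID': [1, 2, 3],
    'WORKER_ID': [0, 0, 1],
    'ACCOUNT_ID': [1, 2, 2],
    'SALARY': [50000, 50000, 60000],
})


def is_static_per(frame, key, column):
    """Return True if `column` has at most one distinct value per `key`."""
    return bool((frame.groupby(key)[column].nunique() <= 1).all())


# SALARY never changes for a given worker, so it is a reasonable
# candidate for additional_variables on the workers entity.
salary_is_static = is_static_per(df, 'WORKER_ID', 'SALARY')

# ACCOUNT_ID varies within worker 0 (accounts 1 and 2), so it is not
# an attribute of a worker and should stay in purchases.
account_is_static = is_static_per(df, 'WORKER_ID', 'ACCOUNT_ID')

print(salary_is_static, account_is_static)
```

In a many-to-many setup like this one, neither ID column is static with respect to the other, so the purchases table itself acts as the link between workers and accounts.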

Here is the code and the output I see:

import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes

d = {'PURCHASE_ID': [1, 2], 
     'WORKER_ID': [0, 0], 
     'ACCOUNT_ID': [1, 2], 
     'COST': [5, 10], 
     'PURCHASE_TIME': ['2018-01-01 01:00:00', '2016-01-01 02:00:00']}
df = pd.DataFrame(data=d)

data_variable_types = {'PURCHASE_ID': vtypes.Id,
                       'WORKER_ID': vtypes.Id,
                       'ACCOUNT_ID': vtypes.Id,
                       'COST': vtypes.Numeric,
                       'PURCHASE_TIME': vtypes.Datetime}

es = ft.EntitySet('Purchase')
es = es.entity_from_dataframe(entity_id='purchases',
                               dataframe=df,
                               index='PURCHASE_ID',
                               time_index='PURCHASE_TIME',
                               variable_types=data_variable_types)

es.normalize_entity(base_entity_id='purchases',
                   new_entity_id='workers',
                   index='WORKER_ID',
                   make_time_index=False)

es.normalize_entity(base_entity_id='purchases',
                   new_entity_id='accounts',
                   index='ACCOUNT_ID',
                   make_time_index=False)

fm, features = ft.dfs(entityset=es,
                     target_entity='purchases',
                     agg_primitives=['mean'],
                     trans_primitives=[],
                     verbose=True)
features

This outputs:

[<Feature: WORKER_ID>,
 <Feature: ACCOUNT_ID>,
 <Feature: COST>,
 <Feature: workers.MEAN(purchases.COST)>,
 <Feature: accounts.MEAN(purchases.COST)>]

If we change the target entity and increase the depth:

fm, features = ft.dfs(entityset=es,
                     target_entity='workers',
                     agg_primitives=['mean', 'count'],
                     max_depth=3,
                     trans_primitives=[],
                     verbose=True)
features

the output is now features for the workers entity:

[<Feature: COUNT(purchases)>,
 <Feature: MEAN(purchases.COST)>,
 <Feature: MEAN(purchases.accounts.MEAN(purchases.COST))>,
 <Feature: MEAN(purchases.accounts.COUNT(purchases))>]

Let's explain the feature named MEAN(purchases.accounts.COUNT(purchases)):

  1. For a given worker, find each of the purchases related to that worker.
  2. For each of those purchases, calculate the total number of purchases made by the account involved in that particular purchase.
  3. Average this count across all of the given worker's purchases.

In other words, "what is the average number of purchases made by accounts related to purchases made by this worker".
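The three steps above can be reproduced by hand with plain pandas, which may help when checking what the feature means. This is only a sketch: the five-row dataset is made up so the numbers are less trivial than in the two-row example, ACCOUNT_PURCHASE_COUNT is a hypothetical column name, and it assumes no cutoff times are involved, so it is not a description of how Featuretools computes the feature internally.

```python
import pandas as pd

# Made-up purchases table (larger than the two-row example above).
purchases = pd.DataFrame({
    'PURCHASE_ID': [1, 2, 3, 4, 5],
    'WORKER_ID': [0, 0, 1, 1, 1],
    'ACCOUNT_ID': [1, 2, 2, 2, 3],
    'COST': [5, 10, 20, 30, 40],
})

# accounts.COUNT(purchases): total number of purchases per account.
account_counts = purchases.groupby('ACCOUNT_ID')['PURCHASE_ID'].count()

# Steps 1 and 2: attach to every purchase the count of its account.
purchases['ACCOUNT_PURCHASE_COUNT'] = purchases['ACCOUNT_ID'].map(account_counts)

# Step 3: average that count over each worker's purchases, i.e. a
# hand-rolled MEAN(purchases.accounts.COUNT(purchases)).
worker_feature = purchases.groupby('WORKER_ID')['ACCOUNT_PURCHASE_COUNT'].mean()

print(worker_feature)
```

Here worker 0 bought through accounts 1 and 2, which have 1 and 3 purchases respectively, so the feature is (1 + 3) / 2 = 2.0. Worker 1 gets (3 + 3 + 1) / 3.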

