Flatten Entity-Attribute-Value (EAV) Schema in Python

I've got a CSV file in something like an entity-attribute-value format (i.e., my event_id is non-unique and repeats k times for the k associated attributes):

    event_id, attribute_id, value
    1, 1, a
    1, 2, b
    1, 3, c
    2, 1, a
    2, 2, b
    2, 3, c
    2, 4, d

Are there any handy tricks for transforming a variable number of attributes (i.e., rows) into columns? The key here is that the output should be an m x n table of structured data, where m = max(k); filling in missing attributes with NULL would be ideal:

    event_id, 1, 2, 3, 4
    1, a, b, c, null
    2, a, b, c, d

My plan was to (1) convert the CSV to a JSON object that looks like this:

    data = [{'value': 'a', 'id': '1', 'event_id': '1', 'attribute_id': '1'},
     {'value': 'b', 'id': '2', 'event_id': '1', 'attribute_id': '2'},
     {'value': 'a', 'id': '3', 'event_id': '2', 'attribute_id': '1'},
     {'value': 'b', 'id': '4', 'event_id': '2', 'attribute_id': '2'},
     {'value': 'c', 'id': '5', 'event_id': '2', 'attribute_id': '3'},
     {'value': 'd', 'id': '6', 'event_id': '2', 'attribute_id': '4'}]

(2) extract the unique event ids:

    events = set()
    for item in data:
        events.add(item['event_id'])

(3) create a list of lists, where each inner list holds the attributes of the corresponding parent event:

    from itertools import groupby
    attributes = [[k['value'] for k in j] for i, j in groupby(data, key=lambda x: x['event_id'])]

(4) create a dictionary that brings events and attributes together:

    event_dict = dict(zip(events, attributes))

which looks like this:

    {'1': ['a', 'b'], '2': ['a', 'b', 'c', 'd']}
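Steps (1) through (4) can also be collapsed into a single pass over the file. The sketch below is one possible way to do that, not necessarily what was actually run: it assumes the sample CSV from the top of the question is held in a hypothetical `raw_csv` string, reads it with `csv.DictReader`, and groups values with `dict.setdefault`. Building the dictionary directly avoids zipping a `set` against the `groupby` output, which can pair events with the wrong attribute lists because set iteration order is not guaranteed to match the order of the groups.

```python
import csv
import io

# Hypothetical in-memory copy of the sample file from the question.
raw_csv = """event_id, attribute_id, value
1, 1, a
1, 2, b
1, 3, c
2, 1, a
2, 2, b
2, 3, c
2, 4, d
"""

# skipinitialspace=True drops the blank that follows each comma in the sample.
reader = csv.DictReader(io.StringIO(raw_csv), skipinitialspace=True)

event_dict = {}
for row in reader:
    # setdefault creates the list the first time an event_id is seen.
    event_dict.setdefault(row['event_id'], []).append(row['value'])

print(event_dict)
```

Because the values are appended in file order, each inner list follows the order in which the attribute rows appear for that event.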

I'm not sure how to make all the inner lists the same length, with NULL values filled in where necessary. It seems like something that needs to happen in step (3). I also considered creating n lists full of m NULL values, then iterating through each list and populating the values using attribute_id as the list position, but that seems janky.
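The pre-filled-lists idea mentioned above is workable and does not take much code. A minimal sketch, assuming attribute ids are positive integers that can serve as 1-based column positions, with `data` copied from step (1) and `None` standing in for NULL (the names `width` and `rows` are illustrative):

```python
# `data` as in step (1); the 'id' key is left out because it is not used here.
data = [
    {'value': 'a', 'event_id': '1', 'attribute_id': '1'},
    {'value': 'b', 'event_id': '1', 'attribute_id': '2'},
    {'value': 'a', 'event_id': '2', 'attribute_id': '1'},
    {'value': 'b', 'event_id': '2', 'attribute_id': '2'},
    {'value': 'c', 'event_id': '2', 'attribute_id': '3'},
    {'value': 'd', 'event_id': '2', 'attribute_id': '4'},
]

# Width of the table: the largest attribute id seen anywhere.
width = max(int(item['attribute_id']) for item in data)

rows = {}
for item in data:
    # One list of `width` placeholders per event, created on first sight.
    row = rows.setdefault(item['event_id'], [None] * width)
    # Assumption: attribute ids are 1-based column positions.
    row[int(item['attribute_id']) - 1] = item['value']

print(rows)
```

Placing each value by its attribute_id, rather than appending, keeps columns aligned even when an event skips an attribute in the middle.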

Your basic idea seems right, though I would implement it as follows:

    import csv

    events = {}  # maps each event id to the list of values read for it
    with open('path/to/input', newline='') as infile:
        reader = csv.reader(infile, skipinitialspace=True)
        next(reader)  # skip the header row shown in the question's sample
        for event, _att, val in reader:
            event = int(event)  # convert once so the dict key is consistent
            if event not in events:
                events[event] = []
            events[event].append(val)  # track all the values for this event

    maxAtts = max(len(v) for v in events.values())  # the maximum number of attributes for any event
    with open('path/to/output', 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["event_id"] + list(range(1, maxAtts + 1)))  # write out the header row
        for k in sorted(events):  # let's look at the events in sorted order
            # write the event id, all its values, and "null" padding for any attributes without values
            writer.writerow([k] + events[k] + ['null'] * (maxAtts - len(events[k])))
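The code above pads by count, so it relies on each event's attributes arriving in order with no gaps: an event that has only attributes 1 and 3 would have its second value shifted into column 2. If that assumption does not hold, one alternative is to key each value by its attribute_id and let `csv.DictWriter` fill the blanks through its `restval` parameter. A sketch, using in-memory strings (the hypothetical `raw_csv` and `out`) in place of the real input and output files:

```python
import csv
import io

# Hypothetical sample input; event 1 deliberately lacks attribute 2.
raw_csv = """event_id, attribute_id, value
1, 1, a
1, 3, c
2, 1, a
2, 2, b
"""

events = {}         # event_id -> {attribute_id: value}
attributes = set()  # every attribute_id seen in the input
reader = csv.reader(io.StringIO(raw_csv), skipinitialspace=True)
next(reader)        # skip the header row
for event, att, val in reader:
    events.setdefault(int(event), {})[int(att)] = val
    attributes.add(int(att))

fieldnames = ['event_id'] + sorted(attributes)
out = io.StringIO()
# restval is the value written for any column missing from a row dict.
writer = csv.DictWriter(out, fieldnames=fieldnames, restval='null')
writer.writeheader()
for event in sorted(events):
    writer.writerow({'event_id': event, **events[event]})

print(out.getvalue())
```

The trade-off is one extra dictionary per event in exchange for output columns that always line up with their attribute ids.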


 