
Flatten Entity-Attribute-Value (EAV) Schema in Python

I've got a CSV file in something of an entity-attribute-value format (i.e., my event_id is non-unique and repeats k times for the k associated attributes):

    event_id, attribute_id, value
    1, 1, a
    1, 2, b
    1, 3, c
    2, 1, a
    2, 2, b
    2, 3, c
    2, 4, d

Are there any handy tricks to transform a variable number of attributes (i.e., rows) into columns? The key here is that the output ought to be an m x n table of structured data with one column per distinct attribute_id, i.e. max(k) attribute columns; filling in missing attributes with NULL would be ideal:

    event_id, 1, 2, 3, 4
    1, a, b, c, null
    2, a, b, c, d
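
In case a dataframe library is on the table, this reshape is exactly what a pivot does. A minimal sketch with pandas, assuming the input file has the header row shown above (missing attributes come out as NaN rather than null):

    import pandas as pd

    # Sketch only: read the long-format CSV and pivot it wide,
    # one row per event_id and one column per attribute_id.
    df = pd.read_csv('path/to/input', skipinitialspace=True)
    wide = df.pivot(index='event_id', columns='attribute_id', values='value')
    print(wide.reset_index().to_string(index=False))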

My plan was to (1) convert the CSV to a list of dictionaries that looks like this:

    data = [{'value': 'a', 'id': '1', 'event_id': '1', 'attribute_id': '1'},
     {'value': 'b', 'id': '2', 'event_id': '1', 'attribute_id': '2'},
     {'value': 'a', 'id': '3', 'event_id': '2', 'attribute_id': '1'},
     {'value': 'b', 'id': '4', 'event_id': '2', 'attribute_id': '2'},
     {'value': 'c', 'id': '5', 'event_id': '2', 'attribute_id': '3'},
     {'value': 'd', 'id': '6', 'event_id': '2', 'attribute_id': '4'}]
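
A rough sketch of that conversion with csv.DictReader (the 'id' field is not in the file, so here it is just a generated row counter; the field names assume the header row shown above):

    import csv

    # Sketch of step (1): read the long-format CSV into a list of dicts.
    with open('path/to/input') as infile:
        reader = csv.DictReader(infile, skipinitialspace=True)
        data = [{'value': row['value'],
                 'id': str(i),  # not in the CSV; just a running row number
                 'event_id': row['event_id'],
                 'attribute_id': row['attribute_id']}
                for i, row in enumerate(reader, start=1)]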

(2) extract unique event ids:

    events = set()
    for item in data:
        events.add(item['event_id'])

(3) create a list of lists, where each inner list holds the attribute values for the corresponding parent event:

    attributes = [[k['value'] for k in j] for i, j in groupby(data, key=lambda x: x['event_id'])]
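
One caveat: itertools.groupby only groups consecutive items, so this relies on the data already being sorted by event_id (which the sample above happens to be). Sorting first makes it safe; a sketch:

    from itertools import groupby

    # groupby only merges *consecutive* rows with the same key,
    # so sort by event_id before grouping.
    data.sort(key=lambda x: x['event_id'])
    attributes = [[row['value'] for row in group]
                  for _, group in groupby(data, key=lambda x: x['event_id'])]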

(4) create a dictionary that brings events and attributes together:

    event_dict = dict(zip(events, attributes))

which looks like this:

    {'1': ['a', 'b'], '2': ['a', 'b', 'c', 'd']}
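
One wrinkle: since events is a set, zip-ping it against the groupby output does not guarantee that the ids and attribute lists stay aligned. Building the dict straight from the groupby sidesteps that; a sketch (assuming data is sorted by event_id as above):

    from itertools import groupby

    # Build the event -> values dict directly, so keys and value lists
    # cannot be paired out of order.
    event_dict = {event_id: [row['value'] for row in group]
                  for event_id, group in groupby(data, key=lambda x: x['event_id'])}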

I'm not sure how to get all the inner lists to the same length, with NULL values filled in where necessary. It seems like something that needs to happen in step (3). Creating n lists of m NULL values and then iterating through each list, using attribute_id as the index to populate, had also crossed my mind, but that seems janky.

Your basic idea seems right, though I would implement it as follows:

    import csv

    events = {}  # map each event_id to the list of values read for it
    with open('path/to/input') as infile:
        reader = csv.reader(infile, skipinitialspace=True)  # ignore the spaces after the commas
        next(reader)  # skip the header row
        for event, _att, val in reader:
            if event not in events:
                events[event] = []
            events[event].append(val)  # track all the values for this event

    maxAtts = max(len(v) for v in events.values())  # the maximum number of attributes for any event
    with open('path/to/output', 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["event_id"] + list(range(1, maxAtts + 1)))  # write out the header row
        for k in sorted(events, key=int):  # event ids were read as strings, so sort them numerically
            # write the event id, all its values, and pad with "null" for any attributes without values
            writer.writerow([k] + events[k] + ['null'] * (maxAtts - len(events[k])))
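
With the sample input above, the output file should look like:

    event_id,1,2,3,4
    1,a,b,c,null
    2,a,b,c,d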
