How to Efficiently Add a Dimension to a Pandas DataFrame Created from a Complex Dictionary

I think melt (as discussed here) may be useful for this, but I can't quite figure out how to use it to solve my problem.

I'm starting with a complex dictionary like this:

orders = [
    {
        "order_id" : 0,
        "lines" : [
            {
                "line_id" : 1,
                "line_amount" : 3.45,
                "line_description" : "first line"
            },
            {
                "line_id" : 2,
                "line_amount" : 6.66,
                "line_description" : "second line"
            },
            {
                "line_id" : 3,
                "line_amount" : 5.43,
                "line_description" : "third line"
            },
        ]
    },
    {
        "order_id" : 1,
        "lines" : [
        ...
    }
]

I want a DataFrame with one row per order line (not one row per order) that still includes the original order's attributes (in this example, just the order_id). The most efficient way I've come up with so far is:

import pandas

# Orders DataFrame
odf = pandas.DataFrame(orders)

line_dfs = []
# Pair each order_id with that order's list of line dicts
for oid, line_list in zip(odf["order_id"], odf["lines"]):
    line_df = pandas.DataFrame(line_list)
    line_df["order_id"] = oid
    line_dfs.append(line_df)

# Line DataFrame
ldf = pandas.concat(line_dfs, sort=False, ignore_index=True)

Is there a more efficient, "vectorized" way to .apply something to achieve this?

ldf = odf.lines.apply(...?...)

Thanks for any help, including just a link to an answer on SO or elsewhere that already addresses this and that I just haven't found yet.

Did you try read_json?

df = pd.read_json(orders)
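
Note that read_json expects JSON text, a path, or a file-like object rather than an already-parsed Python list. For nested records that are already in memory, pandas.json_normalize may be a closer fit. A minimal sketch, assuming pandas 1.0 or later and an abbreviated copy of the sample data from this thread (the name sample_orders is hypothetical):

```python
import pandas as pd

# Abbreviated, hypothetical copy of the thread's sample data
sample_orders = [
    {"order_id": 0, "lines": [
        {"line_id": 1, "line_amount": 3.45, "line_description": "first line"},
        {"line_id": 2, "line_amount": 6.66, "line_description": "second line"},
    ]},
    {"order_id": 1, "lines": [
        {"line_id": 1, "line_amount": 30.45, "line_description": "first line"},
        {"line_id": 2, "line_amount": 60.66, "line_description": "second line"},
    ]},
]

# record_path names the nested list to expand into rows;
# meta lists the order-level fields to repeat on every row.
ldf = pd.json_normalize(sample_orders, record_path="lines", meta=["order_id"])
print(ldf)
```

This leaves sample_orders unchanged and gives one row per order line, with order_id carried along as an extra column.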

Use a list comprehension with pop to extract the lines by key, merge each line's dict with its order's dict, and pass the resulting list of dictionaries to the DataFrame constructor:

orders = [
    {
        "order_id" : 0,
        "lines" : [
            {
                "line_id" : 1,
                "line_amount" : 3.45,
                "line_description" : "first line"
            },
            {
                "line_id" : 2,
                "line_amount" : 6.66,
                "line_description" : "second line"
            },
            {
                "line_id" : 3,
                "line_amount" : 5.43,
                "line_description" : "third line"
            },
        ]
    },
    {
        "order_id" : 1,
        "lines" : [
            {
                "line_id" : 1,
                "line_amount" : 30.45,
                "line_description" : "first line"
            },
            {
                "line_id" : 2,
                "line_amount" : 60.66,
                "line_description" : "second line"
            },
            {
                "line_id" : 3,
                "line_amount" : 50.43,
                "line_description" : "third line"
            },
        ]
    }
]

import pandas as pd

# pop() removes "lines" from each order dict, so orders is modified in place
L = [{**x, **y} for x in orders for y in x.pop('lines')]
odf = pd.DataFrame(L)
print(odf)
   line_amount line_description  line_id  order_id
0         3.45       first line        1         0
1         6.66      second line        2         0
2         5.43       third line        3         0
3        30.45       first line        1         1
4        60.66      second line        2         1
5        50.43       third line        3         1

Another solution with loops:

L = []
for x in orders:
    for y in x.pop('lines'):
        L.append({**x, **y})

odf = pd.DataFrame(L)
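
Both versions call pop, which removes the "lines" key from each dict in orders, so running either one a second time on the same list raises a KeyError. If the original data needs to stay intact, a non-mutating variant is possible. A minimal sketch, assuming an abbreviated copy of the sample data (the names flatten_orders and sample_orders are hypothetical, not part of the original answer):

```python
def flatten_orders(orders, lines_key="lines"):
    """Return one merged dict per order line without modifying the input."""
    rows = []
    for order in orders:
        # Order-level fields, excluding the nested list of lines
        header = {k: v for k, v in order.items() if k != lines_key}
        for line in order.get(lines_key, []):
            rows.append({**header, **line})
    return rows


# Abbreviated, hypothetical copy of the thread's sample data
sample_orders = [
    {"order_id": 0, "lines": [
        {"line_id": 1, "line_amount": 3.45, "line_description": "first line"},
        {"line_id": 2, "line_amount": 6.66, "line_description": "second line"},
    ]},
    {"order_id": 1, "lines": [
        {"line_id": 1, "line_amount": 30.45, "line_description": "first line"},
    ]},
]

rows = flatten_orders(sample_orders)
print(rows)
```

The resulting rows list can be passed to pd.DataFrame(rows) in the same way as L above, and the call can be repeated safely because sample_orders is left untouched.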
