
How can I optimize this piece of code?

My data is a list of dicts and looks like this:

wishlist_result[0] = {'userId': 19814, 'entityIds': [40, 45, 54, 322]}

I am converting it into:

user_id : 19814, entity_id : 40
user_id : 19814, entity_id : 45
user_id : 19814, entity_id : 54
user_id : 19814, entity_id : 322

wishlist_data = pd.DataFrame()
for i in wishlist_result:
    wishlist_from_dict = pd.DataFrame.from_dict(
        wishlist_result[wishlist_result.index(i)])
    wishlist_data = wishlist_data.append(
        wishlist_from_dict, ignore_index=True)

wishlist_data = wishlist_data.rename(
    index=str, columns={
        "userId": "user_id",
        "entityIds": "entity_id"
    })

This code is taking too long. I have around 60k records like the one shown above. Is there any way to do this conversion in less time?

Using dataframes for "everything" is often not the best solution. Code can become unreadable, and constructing many small dataframes can also be very slow. My solution uses plain Python containers to solve your issue:

import pandas as pd

wishlist_result = [
    {"userId": 19814, "entityIds": [40, 45, 54, 322]},
    {"userId": 19814, "entityIds": [12, 22]},
]

def flatten(data):
    flattened = []
    for entry in data:
        user_id = entry["userId"]
        entity_ids = entry["entityIds"]
        for entity_id in entity_ids:
            row = dict(user_id=user_id, entity_id=entity_id)
            flattened.append(row)

    return flattened


rows = flatten(wishlist_result)
df = pd.DataFrame(rows, columns=["user_id", "entity_id"])
print(df)

outputs

   user_id  entity_id
0    19814         40
1    19814         45
2    19814         54
3    19814        322
4    19814         12
5    19814         22

I benchmarked my approach with a list of 60,000 entries built by duplicating your wishlist_result examples. Runtime of the snippet is ~800 ms on my old Mac.
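For what it's worth, a benchmark like this could be reproduced roughly as follows (my own sketch, not the answerer's exact script; the 30,000x duplication factor is an assumption chosen to reach 60,000 dicts, and it times only the pure-Python flattening, not the DataFrame construction):

```python
import time

# Duplicate the two sample dicts until the list holds 60,000 entries.
base = [
    {"userId": 19814, "entityIds": [40, 45, 54, 322]},
    {"userId": 19814, "entityIds": [12, 22]},
]
data = base * 30_000  # 60,000 dicts -> 180,000 flattened rows

def flatten(data):
    flattened = []
    for entry in data:
        user_id = entry["userId"]
        for entity_id in entry["entityIds"]:
            flattened.append({"user_id": user_id, "entity_id": entity_id})
    return flattened

start = time.perf_counter()
rows = flatten(data)
elapsed = time.perf_counter() - start
print(f"{len(rows)} rows in {elapsed:.3f}s")
```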


In case you want it shorter, a nested list comprehension also works; the runtime does not change significantly:

rows = [
    {"user_id": entry["userId"], "entity_id": entity_id}
    for entry in wishlist_result
    for entity_id in entry["entityIds"]
]

I often avoid list comprehensions with nested for loops, as teammates who want to read or reuse my code might not know the order of execution. But here the order is quite clear from the variables involved.
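A small variant of the comprehension above (my sketch, not part of either answer): the comprehension can be written as a generator expression and fed straight to the DataFrame constructor, which avoids materializing the intermediate list of dicts yourself:

```python
import pandas as pd

wishlist_result = [
    {"userId": 19814, "entityIds": [40, 45, 54, 322]},
    {"userId": 19814, "entityIds": [12, 22]},
]

# Generator expression: rows are produced lazily and consumed once
# by the DataFrame constructor.
rows = (
    {"user_id": entry["userId"], "entity_id": entity_id}
    for entry in wishlist_result
    for entity_id in entry["entityIds"]
)

df = pd.DataFrame(rows, columns=["user_id", "entity_id"])
```

pandas still has to collect all rows internally, so don't expect a large speedup; the gain is mainly stylistic.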

If you're concatenating a lot of frames, it's quicker to use pd.concat than it is to append each time:

all_wishlists = []
for i in wishlist_result:
    all_wishlists.append(
        pd.DataFrame.from_dict(wishlist_result[wishlist_result.index(i)])
    )

wishlist_data = pd.concat(all_wishlists, ignore_index=True)\
                  .rename(index=str,
                          columns={"userId": "user_id",
                                   "entityIds": "entity_id"})

Even better, we can change this to a list comprehension and reduce the entire thing down to:

wishlist_data = pd.concat([pd.DataFrame.from_dict(wishlist_result[wishlist_result.index(i)])
                           for i in wishlist_result], ignore_index=True)\
                  .rename(index=str,
                          columns={"userId": "user_id",
                                   "entityIds": "entity_id"})

You also shouldn't need to do pd.DataFrame.from_dict(wishlist_result[wishlist_result.index(i)]) for i in wishlist_result - the loop already gives you each item, so searching for its index and looking it up again is redundant (and makes the loop quadratic). Instead you can just do:

wishlist_data = pd.concat([pd.DataFrame.from_dict(result)
                           for result in wishlist_result], ignore_index=True)\
                  .rename(index=str,
                          columns={"userId": "user_id",
                                   "entityIds": "entity_id"})
