
Building a DataFrame from a list of lists of objects takes too long

I am pulling a large amount of data. It comes as a list of lists of objects.

Example: [[objectA, objectB],[objectC],[],[ObjectD]...]

Each object has many attributes, but for my dataframe I need only name, value, timestamp, and description. I tried two things:

final_df = pd.DataFrame([])
for events in events_list:
    if len(events) > 0:
        for event in events:
            df = pd.DataFrame([])
            df['timestamp'] = [event.timestamp]
            df['value'] = [event.value]
            df['name'] = [event.name]
            df['desc'] = [event.desc]
            final_df = final_df.append(df)

This takes around 15 minutes to complete.

I changed the code to use a Python list:

df_list = list()
for events in events_list:
    if len(events) > 0:
        for event in events:
            df_list.append([event.timestamp, event.value, event.name, event.desc])
final_df = pd.DataFrame(df_list, columns=['timestamp', 'value', 'name', 'desc'])

With this change I managed to reduce the time to approximately 10-11 minutes.

I am still researching whether there is a faster way. Before switching to the Python list I tried a dictionary, but it was much slower than I expected. Currently I am reading about pandas vectorization, which seems really fast, but I am not sure whether I can use it for my purpose. I know that Python loops are slow and there is not much I can do about them, so I am also trying to figure out a way to push those loops into the dataframe.

My question is: has anyone tackled this problem before, and is there a better way to do it?

EDIT: There are questions about the data. It comes through an API, and it is structured this way because the objects are grouped by name. For example:

[[objectA, objectB (both have the same name)],[objectC],[EMPTY - There is no data for this name],[ObjectD]...]

Because I cannot change the way I get the data, I have to work with this data structure.

The computationally heavy operation in your initial method is append: each time you call final_df.append(df) you create an entirely new (and, with each iteration, larger!) dataframe. Instead, aggregate all the dataframes into a list and call pd.concat(df_list) once.
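A minimal sketch of that pattern, using a hypothetical Event class to stand in for the API objects described in the question:

```python
import pandas as pd

# Hypothetical stand-in for the API objects from the question.
class Event:
    def __init__(self, timestamp, value, name, desc):
        self.timestamp, self.value, self.name, self.desc = timestamp, value, name, desc

events_list = [[Event(1, 10, "a", "first")], [], [Event(2, 20, "b", "second")]]

# Build one small DataFrame per non-empty group, then concatenate once.
parts = [
    pd.DataFrame({
        "timestamp": [e.timestamp for e in events],
        "value": [e.value for e in events],
        "name": [e.name for e in events],
        "desc": [e.desc for e in events],
    })
    for events in events_list if events
]
final_df = pd.concat(parts, ignore_index=True)
```

The single pd.concat allocates the result once, instead of reallocating a growing dataframe on every iteration.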

To go faster than that, you may want to consider using multiprocessing to some extent, either through the standard Python multiprocessing library or through a framework - I recommend Dask.

Edit: PS If your data originally lives in a CSV/Excel/Parquet file or another format supported by pandas, you can use pandas to load all the data at once very efficiently. Even if your events include unnecessary columns, it will be much faster to load the entire data set and then drop the redundant columns.
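For example, pandas can skip the unneeded columns at parse time via usecols; the sketch below uses an in-memory CSV with made-up extra columns to illustrate:

```python
import io
import pandas as pd

# Hypothetical CSV with extra columns beyond the four we need.
csv_data = io.StringIO(
    "timestamp,value,name,desc,extra1,extra2\n"
    "2020-01-01,10,a,first,x,y\n"
    "2020-01-02,20,b,second,x,y\n"
)

# usecols makes pandas parse only the listed columns, which is faster
# and uses less memory than loading everything and dropping columns later.
df = pd.read_csv(csv_data, usecols=["timestamp", "value", "name", "desc"])
```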

How about something like this?

import datetime
import itertools as itt
import operator
import random
from dataclasses import dataclass

import pandas as pd


# DUMMY DATA SETUP

@dataclass
class Obj:
    name: str
    timestamp: datetime.datetime
    value: int
    desc: str


group_lens = [random.randint(0, 1000) for _ in range(20000)]
event_count = 1

events = []
for curr_group_len in group_lens:
    curr_group = []
    for _ in range(curr_group_len):
        curr_group.append(
            Obj(f"event_{event_count}", datetime.datetime.now(), random.randint(-100, 100), f"event_{event_count} desc"))
        event_count += 1
    events.append(curr_group)

# DATAFRAME CREATION

cust_getter = operator.attrgetter('name', 'timestamp', 'value', 'desc')

df = pd.DataFrame(
    data=(cust_getter(elem) for elem in itt.chain.from_iterable(events)),
    columns=['name', 'timestamp', 'value', 'desc'],
)

I tested it on a 2-dimensional list of 10,006,766 elements, and it only took 9 seconds.

I found an answer to my question using generators. Here is a link to another thread, created specifically to figure out how to build a dataframe from a list of Python generators; in it we work out a solution to the problem from this thread: Create Pandas Dataframe from List of Generators

To summarize it, I replaced this:

for events in events_list:
    if len(events) > 0:
        for event in events:
            for record in event:
                df_list.append([record.timestamp, record.value, record.name, record.desc])
final_df = pd.DataFrame(df_list, columns=['timestamp', 'value', 'name', 'desc'])

With this:

data = ((record.timestamp, record.value, record.name, record.desc)
        for events in events_list for event in events for record in event)

dataframe = pd.DataFrame(data, columns=["timestamp", "value", "name", "desc"])

Using a generator expression, I save a lot of time by building the data in a single pass instead of continuously appending to a list.
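A self-contained version of the approach above, with hypothetical namedtuple records standing in for the API objects:

```python
from collections import namedtuple
import pandas as pd

# Hypothetical stand-in for the nested API records.
Record = namedtuple("Record", ["timestamp", "value", "name", "desc"])

events_list = [
    [[Record(1, 10, "a", "d1"), Record(2, 20, "a", "d2")]],  # one event, two records
    [],                                                      # empty group
    [[Record(3, 30, "b", "d3")]],
]

# Generator expression: rows are produced lazily and consumed once by pandas,
# so no intermediate list is ever built or repeatedly appended to.
data = ((record.timestamp, record.value, record.name, record.desc)
        for events in events_list for event in events for record in event)

df = pd.DataFrame(data, columns=["timestamp", "value", "name", "desc"])
```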

Test with 15 million records (including creation of the DF):

list append with for-loop = 16 minutes

generator expression = 3 minutes

I will continue to test this for the next couple of days with different amount of data.
