
Building a DataFrame from a list of lists of objects takes too long

I am pulling a large amount of data. It comes as a list of lists of objects.

Example: [[objectA, objectB],[objectC],[],[ObjectD]...]

Each object has many attributes, but for my dataframe I need only name, value, timestamp, and description. I tried two things:

final_df = pd.DataFrame([])
for events in events_list:
    if len(events) > 0:
        for event in events:
            df = pd.DataFrame([])
            df['timestamp'] = [event.timestamp]
            df['value'] = [event.value]
            df['name'] = [event.name]
            df['desc'] = [event.desc]
            final_df = final_df.append(df)

This takes around 15 minutes to complete.

I changed the code to use a Python list:

df_list = list()
for events in events_list:
    if len(events) > 0:
        for event in events:
            df_list.append([event.timestamp, event.value, event.name, event.desc])
final_df = pd.DataFrame(df_list, columns=['timestamp', 'value', 'name', 'desc'])

With this change I managed to reduce the time to approximately 10-11 minutes.

I am still researching whether there is a faster way. Before switching to the Python list I tried a dictionary, but it was much slower than I expected. Currently I am reading about pandas vectorization, which seems really fast, but I am not sure whether I can use it for my purpose. I know that Python loops are slow and there is not much I can do about them, so I am also trying to figure out a way to push those loops into the dataframe.

My question is: has anyone tackled this problem before, and is there a better way to do it?

EDIT: There are questions about the data. It comes through an API, and it is structured this way because the objects are grouped by name. For example:

[[objectA, objectB (both have the same name)],[objectC],[EMPTY - There is no data for this name],[ObjectD]...]

Because I cannot change the way I get the data, I have to work with this data structure.

The computationally heavy operation in your initial method is append: each time you call final_df.append(df) you create an entirely new (and, with each iteration, larger!) dataframe. Instead, aggregate all the dataframes into a list and call pd.concat(df_list) once.
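A minimal sketch of that pattern, using a hypothetical Event class to stand in for the API objects described in the question:

```python
import pandas as pd

# Hypothetical stand-in for the API objects from the question.
class Event:
    def __init__(self, timestamp, value, name, desc):
        self.timestamp, self.value, self.name, self.desc = timestamp, value, name, desc

events_list = [[Event(1, 10, "a", "first")], [], [Event(2, 20, "b", "second")]]

# Build one small DataFrame per non-empty group, then concatenate once.
parts = [
    pd.DataFrame({
        "timestamp": [e.timestamp for e in events],
        "value": [e.value for e in events],
        "name": [e.name for e in events],
        "desc": [e.desc for e in events],
    })
    for events in events_list if events
]
final_df = pd.concat(parts, ignore_index=True)
```

The single pd.concat allocates the result once, instead of reallocating a growing dataframe on every iteration.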

To go faster than that, you may want to consider using multiprocessing to some extent, either through the standard Python multiprocessing library or through a framework - I recommend Dask.

Edit: PS If your data originally lives in a CSV/Excel/Parquet file or another format supported by pandas, you can use pandas to load all the data at once very efficiently. Even if your events include unnecessary columns, it will be much faster to load the entire data set and then drop the redundant columns.
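For example, pandas can skip the unneeded columns at parse time via usecols; the sketch below uses an in-memory CSV with made-up extra columns to illustrate:

```python
import io
import pandas as pd

# Hypothetical CSV with extra columns beyond the four we need.
csv_data = io.StringIO(
    "timestamp,value,name,desc,extra1,extra2\n"
    "2020-01-01,10,a,first,x,y\n"
    "2020-01-02,20,b,second,x,y\n"
)

# usecols makes pandas parse only the listed columns, which is faster
# and uses less memory than loading everything and dropping columns later.
df = pd.read_csv(csv_data, usecols=["timestamp", "value", "name", "desc"])
```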

How about something like this?

import datetime
import itertools as itt
import operator
import random
from dataclasses import dataclass

import pandas as pd


# DUMMY DATA SETUP

@dataclass
class Obj:
    name: str
    timestamp: datetime.datetime
    value: int
    desc: str


group_lens = [random.randint(0, 1000) for _ in range(20000)]
event_count = 1

events = []
for curr_group_len in group_lens:
    curr_group = []
    for _ in range(curr_group_len):
        curr_group.append(
            Obj(f"event_{event_count}", datetime.datetime.now(), random.randint(-100, 100), f"event_{event_count} desc"))
        event_count += 1
    events.append(curr_group)

# DATAFRAME CREATION

cust_getter = operator.attrgetter('name', 'timestamp', 'value', 'desc')

df = pd.DataFrame(
    data=(cust_getter(elem) for elem in itt.chain.from_iterable(events)),
    columns=['name', 'timestamp', 'value', 'desc'],
)

I tested it on a 2-dimensional list of 10,006,766 elements, and it only took 9 seconds.

I found an answer to my question using generators. Here is a link to another thread, created specifically to figure out how to build a dataframe from a list of Python generators; in it we work out a solution to the problem from this thread: Create Pandas Dataframe from List of Generators

To summarize it, I replaced this:

for events in events_list:
    if len(events) > 0:
        for event in events:
            for record in event:
                df_list.append([record.timestamp, record.value, record.name, record.desc])
final_df = pd.DataFrame(df_list, columns=['timestamp', 'value', 'name', 'desc'])

With this:

data = ((record.timestamp, record.value, record.name, record.desc)
        for events in events_list for event in events for record in event)

dataframe = pd.DataFrame(data, columns=["timestamp", "value", "name", "desc"])

Using a generator expression, I save a lot of time by building the data in a single pass instead of continuously appending to a list.
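A self-contained version of the approach above, with hypothetical namedtuple records standing in for the API objects:

```python
from collections import namedtuple
import pandas as pd

# Hypothetical stand-in for the nested API records.
Record = namedtuple("Record", ["timestamp", "value", "name", "desc"])

events_list = [
    [[Record(1, 10, "a", "d1"), Record(2, 20, "a", "d2")]],  # one event, two records
    [],                                                      # empty group
    [[Record(3, 30, "b", "d3")]],
]

# Generator expression: rows are produced lazily and consumed once by pandas,
# so no intermediate list is ever built or repeatedly appended to.
data = ((record.timestamp, record.value, record.name, record.desc)
        for events in events_list for event in events for record in event)

df = pd.DataFrame(data, columns=["timestamp", "value", "name", "desc"])
```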

Test with 15 million records (including creation of the DF):

list append with for-loop = 16 minutes

generator expression = 3 minutes

I will continue to test this for the next couple of days with different amount of data.
