
Pandas pd.Series() and pd.DataFrame() are very slow

I need some help to improve the performance of the following code.

        for object in dict_of_objects.values():
            test = pd.Series(object.properties)    #properties is a dict
            series_list.append(test)

        # List comprehension is not really faster than the loop since pd.Series() takes most time
        #series_list = [pd.Series(object.properties) for object in dict_of_objects.values()]

        # Also very slow
        df = pd.DataFrame(series_list)

After timing the code a bit, I found that pd.Series(object.properties) and pd.DataFrame(series_list) are very slow: both need around 9 s to complete, while append needs only 0.4 s. As a result, the list comprehension isn't really an improvement, since it calls pd.Series(object.properties) as well.

Do you have some suggestions on how to improve the performance of this?

Best, Julz

The same data can be obtained, for example, like below (note that here the objects end up as columns; drop the .T to keep one row per object, as in your original code):

properties_list = [o.properties for o in dict_of_objects.values()]
df = pd.DataFrame(properties_list).T

Or with a dict of properties, which requires fewer operations:

properties_dict = {k: o.properties for k, o in dict_of_objects.items()}
df = pd.DataFrame.from_dict(properties_dict)
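To make the orientation concrete, here is a minimal, self-contained sketch using hypothetical scalar properties instead of arrays: with from_dict, object names become the columns and property keys become the index.

```python
import pandas as pd

# Minimal stand-in for the objects above (hypothetical scalar properties).
class SimpleClass:
    def __init__(self, prop):
        self.properties = prop

dict_of_objects = {
    'obj0': SimpleClass({'alice': 1, 'bob': 2}),
    'obj1': SimpleClass({'alice': 3, 'bob': 4}),
}

properties_dict = {k: o.properties for k, o in dict_of_objects.items()}
df = pd.DataFrame.from_dict(properties_dict)

print(df.columns.tolist())  # object names become columns
print(df.index.tolist())    # property keys become the index
```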

Let's look at some code snippets:

import numpy as np
import pandas as pd
from copy import deepcopy as cp

N_objects = 10
N_samples = 10000

class SimpleClass:
    def __init__(self, prop):
        self.properties = prop

dict_of_objects = {'obj{}'.format(i): SimpleClass({'alice': np.random.rand(N_samples),
                                                   'bob':   np.random.rand(N_samples)})
                   for i in range(N_objects)}

def slow_update(dict_of_objects):
    series_list = []
    for obj in dict_of_objects.values():
        test = pd.Series(obj.properties)
        series_list.append(test)
    return pd.DataFrame(series_list)

def med_update(dict_of_objects):
    return pd.DataFrame([pd.Series(obj.properties) for obj in dict_of_objects.values()])

def fast_update(dict_of_objects):
    keys = next(iter(dict_of_objects.values())).properties.keys()
    return pd.DataFrame({k: [obj.properties[k] for obj in dict_of_objects.values()] for k in keys})

And with timings:

>>> %timeit slow_update(dict_of_objects)
2.88 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit med_update(dict_of_objects)
2.86 ms ± 23.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit fast_update(dict_of_objects)
344 µs ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The fast update does the following:

  1. Grab the property keys from the first object with a single use of the iterator protocol.
  2. Construct the fields using list comprehension.
  3. Construct the data structure using dictionary comprehension.

It's about 8 times faster than the other two methods.
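As a quick sanity check, here is a condensed, self-contained version (smaller sizes than in the benchmark, and slow_update collapsed to its list-comprehension form) confirming that both approaches produce a frame of the same shape and column order:

```python
import numpy as np
import pandas as pd

class SimpleClass:
    def __init__(self, prop):
        self.properties = prop

dict_of_objects = {'obj{}'.format(i): SimpleClass({'alice': np.random.rand(100),
                                                   'bob':   np.random.rand(100)})
                   for i in range(5)}

def slow_update(dict_of_objects):
    return pd.DataFrame([pd.Series(obj.properties) for obj in dict_of_objects.values()])

def fast_update(dict_of_objects):
    keys = next(iter(dict_of_objects.values())).properties.keys()
    return pd.DataFrame({k: [obj.properties[k] for obj in dict_of_objects.values()]
                         for k in keys})

slow_df = slow_update(dict_of_objects)
fast_df = fast_update(dict_of_objects)

# Both frames have one row per object and one column per property key.
assert slow_df.shape == fast_df.shape == (5, 2)
assert list(slow_df.columns) == list(fast_df.columns) == ['alice', 'bob']
```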

Edit: as correctly pointed out by @koPytok, fast_update will not work if each object's properties attribute has different keys. This is worth bearing in mind if you choose to implement this for something such as a NoSQL database grab -- in MongoDB, documents are not required to share the same fields (here swap "document" for "object" and "field" for "key").
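If the objects may carry different keys, one forgiving variant (a sketch, not part of the answer above) is to hand pandas the raw property dicts: the DataFrame constructor aligns a list of dicts on the union of their keys and fills missing entries with NaN.

```python
import pandas as pd

class SimpleClass:
    def __init__(self, prop):
        self.properties = prop

# Hypothetical objects whose property dicts do not share all keys.
dict_of_objects = {
    'obj0': SimpleClass({'alice': 1.0, 'bob': 2.0}),
    'obj1': SimpleClass({'alice': 3.0, 'carol': 4.0}),
}

# A list of dicts is aligned on the union of keys; absent keys become NaN.
df = pd.DataFrame([o.properties for o in dict_of_objects.values()],
                  index=dict_of_objects.keys())

print(df.columns.tolist())      # union of all keys, in first-seen order
print(df.loc['obj1', 'bob'])    # missing entry, filled with NaN
```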

Enjoy!
