
Pandas pd.Series() and pd.DataFrame() are very slow

I need some help to improve the performance of the following code.

        for object in dict_of_objects.values():
            test = pd.Series(object.properties)    #properties is a dict
            series_list.append(test)

        # List comprehension is not really faster than the loop since pd.Series() takes most time
        #series_list = [pd.Series(object.properties) for object in dict_of_objects.values()]

        # Also very slow
        df = pd.DataFrame(series_list)

After timing the code a bit, I found that pd.Series(object.properties) and pd.DataFrame(series_list) are very slow: both need around 9 s to complete, while append needs only 0.4 s. As a result, the list comprehension isn't really an improvement, since it calls pd.Series(object.properties) as well.

Do you have some suggestions on how to improve the performance of this?

Best, Julz

The same data can be obtained, for example, like below (note that here the objects end up as columns; drop the .T to keep one row per object, as in your original code):

properties_list = [o.properties for o in dict_of_objects.values()]
df = pd.DataFrame(properties_list).T

Or with a dict of properties, which requires fewer operations:

properties_dict = {k: o.properties for k, o in dict_of_objects.items()}
df = pd.DataFrame.from_dict(properties_dict)
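To make the orientation concrete, here is a minimal, self-contained sketch using hypothetical scalar properties instead of arrays: with from_dict, object names become the columns and property keys become the index.

```python
import pandas as pd

# Minimal stand-in for the objects above (hypothetical scalar properties).
class SimpleClass:
    def __init__(self, prop):
        self.properties = prop

dict_of_objects = {
    'obj0': SimpleClass({'alice': 1, 'bob': 2}),
    'obj1': SimpleClass({'alice': 3, 'bob': 4}),
}

properties_dict = {k: o.properties for k, o in dict_of_objects.items()}
df = pd.DataFrame.from_dict(properties_dict)

print(df.columns.tolist())  # object names become columns
print(df.index.tolist())    # property keys become the index
```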

Let's look at some code snippets:

import numpy as np
import pandas as pd
from copy import deepcopy as cp

N_objects = 10
N_samples = 10000

class SimpleClass:
    def __init__(self, prop):
        self.properties = prop

dict_of_objects = {'obj{}'.format(i): SimpleClass({'alice': np.random.rand(N_samples),
                                                   'bob':   np.random.rand(N_samples)})
                   for i in range(N_objects)}

def slow_update(dict_of_objects):
    series_list = []
    for obj in dict_of_objects.values():
        test = pd.Series(obj.properties)
        series_list.append(test)
    return pd.DataFrame(series_list)

def med_update(dict_of_objects):
    return pd.DataFrame([pd.Series(obj.properties) for obj in dict_of_objects.values()])

def fast_update(dict_of_objects):
    keys = next(iter(dict_of_objects.values())).properties.keys()
    return pd.DataFrame({k: [obj.properties[k] for obj in dict_of_objects.values()] for k in keys})

And with timings:

>>> %timeit slow_update(dict_of_objects)
2.88 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit med_update(dict_of_objects)
2.86 ms ± 23.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit fast_update(dict_of_objects)
344 µs ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The fast update does the following:

  1. Grab the property keys from the first object with a single use of the iterator protocol.
  2. Construct the fields using list comprehension.
  3. Construct the data structure using dictionary comprehension.

It's about 8 times faster than the other two methods.
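As a quick sanity check, here is a condensed, self-contained version (smaller sizes than in the benchmark, and slow_update collapsed to its list-comprehension form) confirming that both approaches produce a frame of the same shape and column order:

```python
import numpy as np
import pandas as pd

class SimpleClass:
    def __init__(self, prop):
        self.properties = prop

dict_of_objects = {'obj{}'.format(i): SimpleClass({'alice': np.random.rand(100),
                                                   'bob':   np.random.rand(100)})
                   for i in range(5)}

def slow_update(dict_of_objects):
    return pd.DataFrame([pd.Series(obj.properties) for obj in dict_of_objects.values()])

def fast_update(dict_of_objects):
    keys = next(iter(dict_of_objects.values())).properties.keys()
    return pd.DataFrame({k: [obj.properties[k] for obj in dict_of_objects.values()]
                         for k in keys})

slow_df = slow_update(dict_of_objects)
fast_df = fast_update(dict_of_objects)

# Both frames have one row per object and one column per property key.
assert slow_df.shape == fast_df.shape == (5, 2)
assert list(slow_df.columns) == list(fast_df.columns) == ['alice', 'bob']
```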

Edit: as correctly pointed out by @koPytok, fast_update will not work if each object's properties attribute has different keys. This is worth bearing in mind if you choose to implement this for something such as a NoSQL database grab -- in MongoDB, documents are not required to share the same fields (here swap "document" for "object" and "field" for "key").
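If the objects may carry different keys, one forgiving variant (a sketch, not part of the answer above) is to hand pandas the raw property dicts: the DataFrame constructor aligns a list of dicts on the union of their keys and fills missing entries with NaN.

```python
import pandas as pd

class SimpleClass:
    def __init__(self, prop):
        self.properties = prop

# Hypothetical objects whose property dicts do not share all keys.
dict_of_objects = {
    'obj0': SimpleClass({'alice': 1.0, 'bob': 2.0}),
    'obj1': SimpleClass({'alice': 3.0, 'carol': 4.0}),
}

# A list of dicts is aligned on the union of keys; absent keys become NaN.
df = pd.DataFrame([o.properties for o in dict_of_objects.values()],
                  index=dict_of_objects.keys())

print(df.columns.tolist())      # union of all keys, in first-seen order
print(df.loc['obj1', 'bob'])    # missing entry, filled with NaN
```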

Enjoy!
