I need some help to improve the performance of the following code.
series_list = []
for object in dict_of_objects.values():
    test = pd.Series(object.properties)  # properties is a dict
    series_list.append(test)
# A list comprehension is not really faster than the loop, since pd.Series() takes most of the time
# series_list = [pd.Series(object.properties) for object in dict_of_objects.values()]
# Also very slow
df = pd.DataFrame(series_list)
After timing the code a bit, I found that pd.Series(object.properties) and pd.DataFrame(series_list) are both very slow: each needs around 9 s to complete, while append needs only 0.4 s. As a result, the list comprehension isn't really an improvement, since it calls pd.Series(object.properties) as well.
Do you have some suggestions on how to improve the performance of this?
Best, Julz
The same result can be achieved, for example, like below:
properties_list = [o.properties for o in dict_of_objects.values()]
df = pd.DataFrame(properties_list).T
Or with a dict of properties, which requires fewer operations:
properties_dict = {k: o.properties for k, o in dict_of_objects.items()}
df = pd.DataFrame.from_dict(properties_dict)
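To make the layout of these results concrete, here is a minimal sketch on toy data. The `Obj` class and the sample values are assumptions made up for illustration; they are not from the question.

```python
import pandas as pd

class Obj:
    """Hypothetical stand-in for the question's objects."""
    def __init__(self, properties):
        self.properties = properties

dict_of_objects = {
    "a": Obj({"x": 1.0, "y": 2.0}),
    "b": Obj({"x": 3.0, "y": 4.0}),
    "c": Obj({"x": 5.0, "y": 6.0}),
}

# List of dicts: one row per object (before any transpose).
rows = pd.DataFrame([o.properties for o in dict_of_objects.values()])

# Dict of dicts: the outer keys (object names) become columns by default.
cols = pd.DataFrame.from_dict({k: o.properties for k, o in dict_of_objects.items()})

# orient="index" keeps one row per object, labelled by the dict key.
by_obj = pd.DataFrame.from_dict(
    {k: o.properties for k, o in dict_of_objects.items()}, orient="index"
)

print(rows.shape, cols.shape, by_obj.shape)
```

The question's loop produces one row per object, so if that layout matters, the untransposed list-of-dicts frame or `orient="index"` is the variant that matches it.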
Let's look at some code snippets:
import numpy as np
import pandas as pd

N_objects = 10
N_samples = 10000

class SimpleClass:
    def __init__(self, prop):
        self.properties = prop

# Ten objects, each holding two arrays of random samples
dict_of_objects = {
    'obj{}'.format(i): SimpleClass({
        'alice': np.random.rand(N_samples),
        'bob': np.random.rand(N_samples),
    })
    for i in range(N_objects)
}

def slow_update(dict_of_objects):
    # Build one Series per object in an explicit loop
    series_list = []
    for obj in dict_of_objects.values():
        test = pd.Series(obj.properties)
        series_list.append(test)
    return pd.DataFrame(series_list)

def med_update(dict_of_objects):
    # Same as slow_update, written as a list comprehension
    return pd.DataFrame([pd.Series(obj.properties) for obj in dict_of_objects.values()])

def fast_update(dict_of_objects):
    # Take the keys from the first object, then build one column per key
    keys = iter(dict_of_objects.values()).__next__().properties.keys()
    return pd.DataFrame({k: [obj.properties[k] for obj in dict_of_objects.values()] for k in keys})
And with timings:
>>> %timeit slow_update(dict_of_objects)
2.88 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit med_update(dict_of_objects)
2.86 ms ± 23.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit fast_update(dict_of_objects)
344 µs ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
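`%timeit` is an IPython magic. Outside IPython, a comparable measurement can be sketched with the standard-library `timeit` module. The record count, the scalar values, and the function names `row_wise` and `column_wise` below are assumptions for illustration, so the absolute numbers will differ from the timings above.

```python
import random
import timeit

import pandas as pd

# Hypothetical toy data: 200 records with two scalar properties each.
records = [{"alice": random.random(), "bob": random.random()} for _ in range(200)]

def row_wise(records):
    # One pd.Series per record, then one DataFrame from the list of Series.
    return pd.DataFrame([pd.Series(r) for r in records])

def column_wise(records):
    # One Python list per key, then a single DataFrame construction.
    keys = records[0].keys()
    return pd.DataFrame({k: [r[k] for r in records] for k in keys})

n = 20  # calls per timing run
# Take the best of three runs and divide by n to get seconds per call.
t_row = min(timeit.repeat(lambda: row_wise(records), number=n, repeat=3)) / n
t_col = min(timeit.repeat(lambda: column_wise(records), number=n, repeat=3)) / n
print(f"row-wise: {t_row * 1e3:.3f} ms, column-wise: {t_col * 1e3:.3f} ms")
```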
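The fast update does the following: it reads the property keys from the first object (using __next__ on an iterator over the dict's values), collects the values for each key across all objects into one list per key, and passes the resulting dict of columns to pd.DataFrame in a single call. In the timings above, it's about 8 times faster than the other two methods.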
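The column-wise construction can be traced step by step on toy data. This is only a sketch: `props_by_name` and its values are hypothetical, and each inner dict plays the role of one object's `properties`.

```python
import pandas as pd

# Hypothetical toy input: each value stands in for one object's `properties` dict.
props_by_name = {
    "obj0": {"alice": 0.1, "bob": 0.2},
    "obj1": {"alice": 0.3, "bob": 0.4},
}

# Step 1: take the keys from the first dict. next(iter(...)) is the
# conventional spelling of iter(...).__next__().
keys = next(iter(props_by_name.values())).keys()

# Step 2: gather one list of values per key, in object order.
columns = {k: [p[k] for p in props_by_name.values()] for k in keys}

# Step 3: build the DataFrame once from the dict of columns.
df = pd.DataFrame(columns)
print(df)
```

The saving likely comes from skipping the intermediate Series objects: pandas receives plain Python lists and constructs each column once.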
Edit: as correctly pointed out by @koPytok, fast_update will not work if the objects' properties attributes have different keys: it raises a KeyError when an object lacks one of the first object's keys, and it silently ignores keys that the first object does not have. This is worth bearing in mind if you choose to implement this for something such as a NoSQL database grab. In MongoDB, documents are not required to share the same fields (here, swap document for object and field for key).
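If the keys can differ, one possible workaround is to collect the union of keys first and fill the gaps. The sketch below is an assumption-laden illustration, not part of the original answer: the `records` list and its values are made up.

```python
import pandas as pd

# Hypothetical records with different keys, like documents in a schemaless store.
records = [
    {"alice": 1.0, "bob": 2.0},
    {"alice": 3.0},
    {"bob": 4.0, "carol": 5.0},
]

# Collect the union of keys, preserving first-seen order.
keys = list(dict.fromkeys(k for r in records for k in r))

# dict.get returns None for a missing key; pandas stores it as NaN in a float column.
df = pd.DataFrame({k: [r.get(k) for r in records] for k in keys})
print(df)
```

Passing the list of dicts straight to `pd.DataFrame(records)` also fills missing keys with NaN, so the explicit union is mainly useful when you want to control the column order or the fill value.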
Enjoy!