I have a list of dictionaries called list_of_dict in the following format:
[{'id': 123, 'date': '202001', 'variable_x': 3},
{'id': 345, 'date': '202101', 'variable_x': 4}, ... ]
To transform it to a pandas DataFrame, I simply do:
df = pd.DataFrame(list_of_dict)
It works, but when I try to do it with a list of 20 million dictionaries, it takes about an hour to run.
Does Python have a faster way to achieve this?
In several cases the fastest way to build a DataFrame is indeed from a list of dictionaries; the timings below show this.
Fundamentally, reading 20M rows into memory means heavy use of virtual memory and swapping. I would expect the primary optimisation to come from sharding the data so that it never all has to be in memory at once.
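A minimal sketch of that chunking idea (the helper name and chunk size are my own choices, not from the question). Note that pd.concat still copies everything at the end, so this mainly pays off if each chunk is written out, e.g. to Parquet, instead of concatenated:

```python
import pandas as pd

def df_from_dicts_chunked(records, chunk_size=1_000_000):
    # Build a DataFrame per slice of the list, then combine;
    # intermediate structures stay smaller than one giant pass.
    chunks = [
        pd.DataFrame(records[i:i + chunk_size])
        for i in range(0, len(records), chunk_size)
    ]
    return pd.concat(chunks, ignore_index=True)

records = [{'id': 123, 'date': '202001', 'variable_x': 3},
           {'id': 345, 'date': '202101', 'variable_x': 4}] * 3
df = df_from_dicts_chunked(records, chunk_size=2)
```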
import numpy as np
import pandas as pd

d = [{'id': 123, 'date': '202001', 'variable_x': 3},
     {'id': 345, 'date': '202101', 'variable_x': 4}]
c = list(d[0].keys())
r = 2*10**5  # repeat the two sample rows 200,000 times -> 400,000 rows
a = np.tile([list(row.values()) for row in d], (r, 1))  # 2D array of values
d = np.tile(d, r)  # 1D object array holding the dicts

%timeit pd.DataFrame(d)
%timeit pd.DataFrame(a, columns=c)
%timeit pd.DataFrame(a)
print(f"2D array size: {len(a):,}\ndict array size: {len(d):,}")
53.4 µs ± 238 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
90.6 ms ± 400 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
90.4 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2D array size: 400,000
dict array size: 400,000
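Outside IPython, where the %timeit magic is unavailable, a similar comparison can be run with the standard-library timeit module. This is a sketch with a smaller repeat factor than the answer's 2*10**5 so it finishes quickly; variable names are my own:

```python
import timeit
import numpy as np
import pandas as pd

base = [{'id': 123, 'date': '202001', 'variable_x': 3},
        {'id': 345, 'date': '202101', 'variable_x': 4}]
r = 10**4  # smaller repeat factor, purely to keep the demo fast
records = base * r                                       # list of dicts
array_2d = np.tile([list(row.values()) for row in base], (r, 1))
cols = list(base[0].keys())

# Time each construction path a few times and report totals.
t_dicts = timeit.timeit(lambda: pd.DataFrame(records), number=5)
t_array = timeit.timeit(lambda: pd.DataFrame(array_2d, columns=cols), number=5)
print(f"list of dicts: {t_dicts:.4f} s   2D array: {t_array:.4f} s")
```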