
Is there a faster way to convert a large list of dictionaries into a pandas DataFrame?

I have a list of dictionaries called list_of_dict in the following format:

[{'id': 123, 'date': '202001', 'variable_x': 3},
 {'id': 345, 'date': '202101', 'variable_x': 4}, ... ]

To transform it to a pandas DataFrame, I simply do:

df = pd.DataFrame(list_of_dict)

It works, but when I try to do it with a list of 20 million dictionaries, it takes about an hour to run.

Does Python have a faster way to achieve this?

There are multiple cases where the fastest way to build a DataFrame is from a list of dictionaries; the timings below show this.

Fundamentally, reading 20M rows into memory will mean heavy use of virtual memory and swapping. The primary optimisation I would expect to come from sharding the data so that it never all has to be in memory at once (a sketch follows the timings below).

import numpy as np
import pandas as pd

# Two sample rows, repeated to build ~400,000-row test inputs.
d = [{'id': 123, 'date': '202001', 'variable_x': 3},
     {'id': 345, 'date': '202101', 'variable_x': 4}]

c = d[0].keys()                                      # column names
r = 2 * 10**5                                        # repetition factor
a = np.tile([list(l.values()) for l in d], (r, 1))   # 2D array of row values
d = np.tile(d, r)                                    # 1D object array of dicts

%timeit pd.DataFrame(d)
%timeit pd.DataFrame(a, columns=c)
%timeit pd.DataFrame(a)
print(f"2D array size: {len(a):,}\ndict array size: {len(d):,}")

output

53.4 µs ± 238 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
90.6 ms ± 400 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
90.4 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2D array size: 400,000
dict array size: 400,000
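
One way to act on the sharding idea, if the 20M dicts are being produced from some source (files, a database, an API) rather than already sitting in a list, is to build the DataFrame in chunks so the full list of dicts never has to exist at once. The following is a minimal sketch under that assumption; frame_in_chunks and iter-style input are illustrative names, not part of the original answer:

import itertools
import pandas as pd

def frame_in_chunks(records, chunk_size=1_000_000):
    """Build per-chunk DataFrames from an iterable of dicts and concatenate them.

    `records` can be a generator, so the complete list of dicts never has to
    live in memory alongside the finished DataFrame.
    """
    it = iter(records)
    chunks = []
    while True:
        batch = list(itertools.islice(it, chunk_size))  # pull at most chunk_size dicts
        if not batch:
            break
        chunks.append(pd.DataFrame(batch))              # compact columnar storage per chunk
    return pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()

# Example usage with a small in-memory list standing in for the real source:
list_of_dict = [{'id': 123, 'date': '202001', 'variable_x': 3},
                {'id': 345, 'date': '202101', 'variable_x': 4}]
df = frame_in_chunks(list_of_dict, chunk_size=1)
print(df)

If the result does not have to end up as a single in-memory DataFrame, each chunk could instead be written out (for example with to_parquet) as it is built, so no single object ever holds all 20M rows.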
