I have a list of dictionaries called list_of_dict in the following format:
[{'id': 123, 'date': '202001', 'variable_x': 3},
{'id': 345, 'date': '202101', 'variable_x': 4}, ... ]
To transform it to a pandas DataFrame, I simply do:
df = pd.DataFrame(list_of_dict)
It works, but when I try to do it with a list of 20 million dictionaries, it takes about an hour to run.
Does Python have a faster way to achieve this?
In several cases the fastest way to build a DataFrame is indeed from a list of dictionaries; the timings below show this.
Fundamentally, reading 20M rows into memory means heavy use of virtual memory and swapping. I would expect the primary optimisation to come from sharding the data so that it never all has to be in memory at once.
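A minimal sketch of that chunking idea (the helper name and chunk size are my own choices, not from the question). Note that pd.concat still copies everything at the end, so this mainly pays off if each chunk is written out, e.g. to Parquet, instead of concatenated:

```python
import pandas as pd

def df_from_dicts_chunked(records, chunk_size=1_000_000):
    # Build a DataFrame per slice of the list, then combine;
    # intermediate structures stay smaller than one giant pass.
    chunks = [
        pd.DataFrame(records[i:i + chunk_size])
        for i in range(0, len(records), chunk_size)
    ]
    return pd.concat(chunks, ignore_index=True)

records = [{'id': 123, 'date': '202001', 'variable_x': 3},
           {'id': 345, 'date': '202101', 'variable_x': 4}] * 3
df = df_from_dicts_chunked(records, chunk_size=2)
```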
import numpy as np
import pandas as pd

d = [{'id': 123, 'date': '202001', 'variable_x': 3},
     {'id': 345, 'date': '202101', 'variable_x': 4}]
c = list(d[0].keys())
r = 2*10**5  # repeat the two sample rows 200,000 times -> 400,000 rows
a = np.tile([list(row.values()) for row in d], (r, 1))  # 2D array of values
d = np.tile(d, r)  # 1D object array holding the dicts

%timeit pd.DataFrame(d)
%timeit pd.DataFrame(a, columns=c)
%timeit pd.DataFrame(a)
print(f"2D array size: {len(a):,}\ndict array size: {len(d):,}")
53.4 µs ± 238 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
90.6 ms ± 400 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
90.4 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2D array size: 400,000
dict array size: 400,000
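Outside IPython, where the %timeit magic is unavailable, a similar comparison can be run with the standard-library timeit module. This is a sketch with a smaller repeat factor than the answer's 2*10**5 so it finishes quickly; variable names are my own:

```python
import timeit
import numpy as np
import pandas as pd

base = [{'id': 123, 'date': '202001', 'variable_x': 3},
        {'id': 345, 'date': '202101', 'variable_x': 4}]
r = 10**4  # smaller repeat factor, purely to keep the demo fast
records = base * r                                       # list of dicts
array_2d = np.tile([list(row.values()) for row in base], (r, 1))
cols = list(base[0].keys())

# Time each construction path a few times and report totals.
t_dicts = timeit.timeit(lambda: pd.DataFrame(records), number=5)
t_array = timeit.timeit(lambda: pd.DataFrame(array_2d, columns=cols), number=5)
print(f"list of dicts: {t_dicts:.4f} s   2D array: {t_array:.4f} s")
```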