Pandas apply function to each row of a dataframe to return one or more new rows per entry
I have a dataset like the following:
import pandas as pd
df = pd.DataFrame([
    [[{'name': 'Joe', 'age': 32, 'category': 'A'}, {'name': 'Jane', 'age': 35, 'category': 'A'}]],
    [[{'name': 'Beth', 'age': 33, 'category': 'B'}, {'name': 'Bob', 'age': 32, 'category': 'B'}]],
    [[{'name': 'John', 'age': 35, 'category': 'C'}]],
    [[{'name': 'Jill', 'age': 33, 'category': 'D'}]],
], columns=['Entries'])
The dataframe has a single column (named 'Entries'), where each row contains a list of one or more dictionaries.
I need to transform the dataframe so that each dictionary key becomes a column, with the values appearing in the corresponding columns, like so:
age category name
0 32.0 A Joe
1 35.0 A Jane
2 33.0 B Beth
3 32.0 B Bob
4 35.0 C John
5 33.0 D Jill
Currently I have the following code to do this:
df2 = pd.DataFrame()
for idx, row in df.iterrows():
    for entry in row.Entries:
        name = entry['name']
        age = entry['age']
        category = entry['category']
        single_entry = pd.Series({'name': name, 'age': age, 'category': category})
        df2 = df2.append(single_entry, ignore_index=True)
The code above works fine, but is very slow on my actual dataset, which has over 1,000,000 rows.
I considered using built-in Pandas functions such as apply to take advantage of their speed, but I don't know how to apply them to this particular problem.
What is a more efficient way to achieve the above result?
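(For reference: `DataFrame.append` was removed in pandas 2.0, and appending row by row copies the growing frame on every iteration, which is quadratic overall. A minimal rework of the same loop, sketched against the sample data above, collects plain dicts in a list and builds the frame once at the end:)

```python
import pandas as pd

df = pd.DataFrame([
    [[{'name': 'Joe', 'age': 32, 'category': 'A'}, {'name': 'Jane', 'age': 35, 'category': 'A'}]],
    [[{'name': 'Beth', 'age': 33, 'category': 'B'}, {'name': 'Bob', 'age': 32, 'category': 'B'}]],
    [[{'name': 'John', 'age': 35, 'category': 'C'}]],
    [[{'name': 'Jill', 'age': 33, 'category': 'D'}]],
], columns=['Entries'])

# Accumulate plain dicts; one DataFrame construction at the end
# avoids recopying the frame on every appended row.
rows = []
for entries in df['Entries']:
    for entry in entries:
        rows.append({'name': entry['name'],
                     'age': entry['age'],
                     'category': entry['category']})
df2 = pd.DataFrame(rows)
```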
I suggest using a list comprehension to flatten the values, which improves speed:
df = pd.DataFrame([y for x in df['Entries'] for y in x])
Another idea:
from itertools import chain
df = pd.DataFrame(chain.from_iterable(df['Entries'].tolist()))
print(df)
name age category
0 Joe 32 A
1 Jane 35 A
2 Beth 33 B
3 Bob 32 B
4 John 35 C
5 Jill 33 D
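Both variants hand `pd.DataFrame` a flat list of dicts, so rows whose dictionaries are missing a key are handled automatically: absent keys become NaN, and the NaN upcasts an integer column to float (which is why the question's expected output shows `32.0`). A small sketch with hypothetical entries (not in the original data) where one dict lacks `'age'`:

```python
import pandas as pd

# Hypothetical entries; the second dict has no 'age' key.
rows = [{'name': 'Joe', 'age': 32, 'category': 'A'},
        {'name': 'Jane', 'category': 'A'}]
out = pd.DataFrame(rows)

# Missing keys become NaN, and NaN upcasts 'age' to float64.
print(out)
```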
Performance with the sample data repeated 10,000 times (40k rows):
df = pd.concat([df] * 10000, ignore_index=True)
In [222]: %timeit pd.DataFrame([y for x in df['Entries'] for y in x])
66.1 ms ± 770 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [223]: %timeit pd.DataFrame(chain.from_iterable(df['Entries'].tolist()))
60.9 ms ± 691 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [224]: %timeit pd.DataFrame(itertools.chain(*df.Entries.tolist()))
60.8 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [225]: %timeit pd.DataFrame(sum(df.Entries.tolist(),[]))
3.94 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [226]: %timeit pd.DataFrame(df['Entries'].explode().tolist())
131 ms ± 3.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
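The timings above come from IPython's `%timeit`. A self-contained way to reproduce a comparison like this with the stdlib `timeit` module is sketched below; absolute numbers will of course vary by machine and pandas version:

```python
import timeit
from itertools import chain

import pandas as pd

base = pd.DataFrame([
    [[{'name': 'Joe', 'age': 32, 'category': 'A'}, {'name': 'Jane', 'age': 35, 'category': 'A'}]],
    [[{'name': 'Beth', 'age': 33, 'category': 'B'}, {'name': 'Bob', 'age': 32, 'category': 'B'}]],
    [[{'name': 'John', 'age': 35, 'category': 'C'}]],
    [[{'name': 'Jill', 'age': 33, 'category': 'D'}]],
], columns=['Entries'])
df = pd.concat([base] * 10000, ignore_index=True)  # 40k rows, 60k dicts total

candidates = {
    'list comprehension':  lambda: pd.DataFrame([y for x in df['Entries'] for y in x]),
    'chain.from_iterable': lambda: pd.DataFrame(chain.from_iterable(df['Entries'].tolist())),
}
for name, fn in candidates.items():
    # Best of 3 runs of 5 calls each, reported per call.
    best = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print(f'{name}: {best * 1000:.1f} ms per call')
```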
If you want to retain the index from the original rows, explode helps:
s = df['Entries'].explode()
pd.DataFrame(s.tolist(), index=s.index)
name age category
0 Joe 32 A
0 Jane 35 A
1 Beth 33 B
1 Bob 32 B
2 John 35 C
3 Jill 33 D
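Because the flattened frame shares the original index, other columns of the source frame can be joined back onto it directly. A small sketch assuming a hypothetical extra column `source` (not part of the original data):

```python
import pandas as pd

df = pd.DataFrame({
    'Entries': [
        [{'name': 'Joe', 'age': 32, 'category': 'A'},
         {'name': 'Jane', 'age': 35, 'category': 'A'}],
        [{'name': 'John', 'age': 35, 'category': 'C'}],
    ],
    'source': ['file1', 'file2'],  # hypothetical extra column
})

s = df['Entries'].explode()
flat = pd.DataFrame(s.tolist(), index=s.index)

# The shared index aligns each flattened row with its original row,
# so the extra column joins in without any explicit merge key.
result = flat.join(df['source'])
```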
IIUC (if I understand correctly):
pd.DataFrame(sum(df.Entries.tolist(),[]))
name age category
0 Joe 32 A
1 Jane 35 A
2 Beth 33 B
3 Bob 32 B
4 John 35 C
5 Jill 33 D
Or
import itertools
pd.DataFrame(itertools.chain(*df.Entries.tolist()))
name age category
0 Joe 32 A
1 Jane 35 A
2 Beth 33 B
3 Bob 32 B
4 John 35 C
5 Jill 33 D